# Introduction to Big Data Modern Technologies course

## TOPIC 3: Hadoop and MapReduce practice
### Part 1

### 1. Libraries

In [1]:
import os
import re
import json
import socket
import subprocess
import pandas as pd

In [2]:
# we need port only for Web UI
YARN_PORT = 8088

# working directory for default user `jovyan`
# `/jovyan/home` for the Jupyter 
# and `/jovyan` for the Hadoop environment
WORK_DIR = '/jovyan'

### 2. HDFS commands

Help is all you need!

In [3]:
!hdfs dfs -help

Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-v] [-x] <path> ...]
	[-expunge [-immediate]]
	[-find <path> ... <expression> ...]
	[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
	[-head <file>]
	[-help [cmd ...]]
	[-ls [-C] [-d]

#### 2.1. Navigation

Navigation through HDFS is available with `hdfs dfs` [commands](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html), which are quite similar to Unix shell navigation (`ls`, `cat`, etc.):

In [4]:
# list root directory
!hdfs dfs -ls /

Found 4 items
drwxr-xr-x   - hadoop supergroup           0 2023-05-02 15:25 /hbase
drwxr-xr-x   - jovyan hadoopusers          0 2023-05-02 15:24 /jovyan
drwxrwxrwx   - hadoop supergroup           0 2023-05-02 15:24 /tmp
drwxr-xr-x   - jovyan hadoopusers          0 2023-05-02 15:25 /user


In [5]:
# list directory
!hdfs dfs -ls /jovyan

...or with the `WORK_DIR` variable:

In [6]:
# list working directory '/jovyan'
# NOTE: variable WORK_DIR='/jovyan' used in braces
!hdfs dfs -ls {WORK_DIR}

#### 2.2. Put and get files

Put an arbitrary file to HDFS:

In [7]:
!ls -la ~/__DATA/IBDT_Spring_2023/topic_3

total 16004
drwxrwxrwx 2 root root        0 Mar  1 13:21 .
drwxrwxrwx 2 root root        0 Feb 10 12:43 ..
-rw-rw-rw- 1 root root     9161 Feb 22 09:29 big_data_trends_2021.txt
-rw-rw-rw- 1 root root    11836 Feb 22 09:28 big_data_trends_2023.txt
-rw-rw-rw- 1 root root    43234 Feb 28 15:52 coin_gecko_2022-03-17.csv
-rw-rw-rw- 1 root root 14966021 Feb 28 14:51 Hotel_Reviews.csv
drwxrwxrwx 2 root root        0 Feb 22 08:06 .ipynb_checkpoints
-rw-rw-rw- 1 root root       55 Feb 22 08:06 test_hdfs.txt
-rw-rw-rw- 1 root root  1355781 Mar  1 13:21 vgsales.csv


In [8]:
# put local file to HDFS
!hdfs dfs -put ~/__DATA/IBDT_Spring_2023/topic_3/test_hdfs.txt {WORK_DIR}

In [9]:
!hdfs dfs -ls {WORK_DIR}

Found 1 items
-rw-r--r--   1 jovyan hadoopusers         55 2023-05-02 15:31 /jovyan/test_hdfs.txt


In [10]:
# look at the file's content
!hdfs dfs -cat {WORK_DIR}/test_hdfs.txt

Hello, I am a file.
I am in the HDFS now and feel good.

Create folders and move files:

In [11]:
!hdfs dfs -mkdir {WORK_DIR}/texts

In [12]:
!hdfs dfs -ls {WORK_DIR}

Found 2 items
-rw-r--r--   1 jovyan hadoopusers         55 2023-05-02 15:31 /jovyan/test_hdfs.txt
drwxr-xr-x   - jovyan hadoopusers          0 2023-05-02 15:31 /jovyan/texts


In [13]:
!hdfs dfs -mv {WORK_DIR}/test_hdfs.txt {WORK_DIR}/texts

In [14]:
!hdfs dfs -ls {WORK_DIR}

Found 1 items
drwxr-xr-x   - jovyan hadoopusers          0 2023-05-02 15:31 /jovyan/texts


In [15]:
!hdfs dfs -ls {WORK_DIR}/texts

Found 1 items
-rw-r--r--   1 jovyan hadoopusers         55 2023-05-02 15:31 /jovyan/texts/test_hdfs.txt


In [16]:
!hdfs dfs -cat {WORK_DIR}/texts/test_hdfs.txt

Hello, I am a file.
I am in the HDFS now and feel good.

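Get files back from `HDFS`. The call below fails because a local copy of the file already exists; adding the `-f` flag overwrites it: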

In [17]:
!hdfs dfs -get {WORK_DIR}/texts/test_hdfs.txt .

get: `test_hdfs.txt': File exists


#### 2.3. Something useful

Useful functions:

In [18]:
def hdfs_dirs(path, filter_str=''):
    """
    Returns the paths of the entries in `path` as a list.
    Entries may be filtered by the `filter_str` parameter,
    e.g. `filter_str='csv'` will return only `csv` files.
    
    """
    process = subprocess.Popen(
        ['hdfs', 'dfs', '-ls', path], 
        stdout=subprocess.PIPE, 
        stderr=subprocess.PIPE
    )
    out, err = process.communicate()
    lines = out.decode('utf-8').split('\n')
    # skip the `Found N items` header and empty lines
    lines = [x for x in lines if x and not x.startswith('Found ')]
    # keep matching lines; the path is the last space-separated field
    return [x.split(' ')[-1] for x in lines if filter_str in x]

def file_content(path):
    """
    Returns content of the file.
    Similar to `cat` command.
    
    """
    process = subprocess.Popen(
        ['hdfs', 'dfs', '-cat', path], 
        stdout=subprocess.PIPE, 
        stderr=subprocess.PIPE
    )
    out, err = process.communicate()
    return out.decode('utf-8')

In [19]:
# use function defined above
hdfs_dirs(WORK_DIR, 'txt')

[]

In [20]:
hdfs_dirs(WORK_DIR + '/texts', 'txt')

['/jovyan/texts/test_hdfs.txt']

In [21]:
hdfs_dirs(WORK_DIR + '/texts', 'csv')

[]

In [22]:
# display the content of the 'test_hdfs.txt' file
content = file_content(f'{WORK_DIR}/texts/test_hdfs.txt')
content

'Hello, I am a file.\nI am in the HDFS now and feel good.'
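
The `hdfs_dirs` helper keeps only the path of each entry. When permissions or sizes are needed as well, a listing line can be split into its fields. The snippet below is a minimal sketch, assuming the column layout seen in the listings above (permissions, replication, owner, group, size, date, time, path). The name `parse_ls_line` is hypothetical, and the sample text is copied from an earlier output so that the code runs without a cluster.

```python
def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls` output into a dict, or return None."""
    parts = line.split(None, 7)  # at most 8 fields; the path may contain spaces
    if len(parts) < 8:
        return None  # e.g. the "Found N items" header or an empty line
    perms, repl, owner, group, size, date, time, path = parts
    return {
        'is_dir': perms.startswith('d'),
        'owner': owner,
        'group': group,
        'size': int(size),
        'modified': f'{date} {time}',
        'path': path,
    }

# sample listing copied from the output of `hdfs dfs -ls /jovyan` above
sample = """Found 2 items
-rw-r--r--   1 jovyan hadoopusers         55 2023-05-02 15:31 /jovyan/test_hdfs.txt
drwxr-xr-x   - jovyan hadoopusers          0 2023-05-02 15:31 /jovyan/texts"""

entries = [e for e in map(parse_ls_line, sample.splitlines()) if e is not None]
for entry in entries:
    print(entry['path'], entry['size'], entry['is_dir'])
```

With a live cluster, the same function could be applied to the decoded `stdout` of the `hdfs dfs -ls` call instead of the sample string.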

### 3. MapReduce intro

#### 3.1. WordCount with Java

`WordCount` is a simple application that counts the number of occurrences of each word in a given input set. This demo uses a prebuilt `jar` package that ships with Hadoop.
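
To make the expected result concrete, the same computation on a single machine can be sketched in a few lines of Python. This is an illustration only, not the code of the Java example. It assumes that words are split on whitespace and compared case-sensitively, which is what the job output further down suggests; `word_count` is a hypothetical helper name.

```python
from collections import Counter

def word_count(lines):
    """Count whitespace-separated tokens across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# the two lines of `test_hdfs.txt` used earlier in this notebook
text = ["Hello, I am a file.", "I am in the HDFS now and feel good."]
counts = word_count(text)
for word in sorted(counts):
    print(f'{word}\t{counts[word]}')
```

MapReduce produces the same kind of table, but splits the counting across many machines.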

First let's copy files to HDFS:

In [23]:
%%bash
work_dir=/jovyan

# create input directory on HDFS
hdfs dfs -mkdir -p ${work_dir}/input

# put files to HDFS
hdfs dfs -put ~/__DATA/IBDT_Spring_2023/topic_3/big_data_* ${work_dir}/input
hdfs dfs -ls ${work_dir}/input

Found 2 items
-rw-r--r--   1 jovyan hadoopusers       9161 2023-05-02 15:31 /jovyan/input/big_data_trends_2021.txt
-rw-r--r--   1 jovyan hadoopusers      11836 2023-05-02 15:31 /jovyan/input/big_data_trends_2023.txt


Run the MapReduce job and enjoy the long log output:

In [24]:
%%bash
work_dir=/jovyan

# delete directory if exists
#hdfs dfs -rm -r ${work_dir}/output

# run wordcount
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount \
    ${work_dir}/input ${work_dir}/output

2023-05-02 15:31:29,155 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2023-05-02 15:31:29,485 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/jovyan/.staging/job_1683041087371_0001
2023-05-02 15:31:29,691 INFO input.FileInputFormat: Total input files to process : 2
2023-05-02 15:31:29,758 INFO mapreduce.JobSubmitter: number of splits:2
2023-05-02 15:31:30,260 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1683041087371_0001
2023-05-02 15:31:30,262 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-05-02 15:31:30,394 INFO conf.Configuration: resource-types.xml not found
2023-05-02 15:31:30,394 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2023-05-02 15:31:30,607 INFO impl.YarnClientImpl: Submitted application application_1683041087371_0001
2023-05-02 15:31:30,658 INFO mapreduce.Job: The url to track the job: http://0.0.0.0:60000/proxy/application_1683041087371_0001/
2023-05-02

In [25]:
!hdfs dfs -ls {WORK_DIR}/output

Found 2 items
-rw-r--r--   1 jovyan hadoopusers          0 2023-05-02 15:31 /jovyan/output/_SUCCESS
-rw-r--r--   1 jovyan hadoopusers      11711 2023-05-02 15:31 /jovyan/output/part-r-00000


In [26]:
!hdfs dfs -cat {WORK_DIR}/output/_SUCCESS

In [27]:
%%bash
work_dir=/jovyan

# print the output of wordcount
echo -e "\nwordcount output:"
hdfs dfs -cat ${work_dir}/output/part-r-00000


wordcount output:
(09	1
(D&A)	1
(IoT)	1
(ML).	1
(balanced	1
(context,	1
(real-time)	1
(see	2
(small	1
(velocity)	1
(veracity),	1
(wide	1
-	2
--	8
05	1
1).	1
1.	2
10	2
16	1
193	1
2.	1
2020,	1
2020.	1
2021	1
2021)	1
2021:	2
2023	4
2023.	1
2025,	1
3.	1
360-degree	2
4	1
4.	1
6	1
63%	1
70%	1
AI	13
AI,	5
AI-enabled	2
AI.	1
AWS,	1
Actions:	1
Additional	1
Adoption	1
Advanced	1
Analysis	1
Analytics	6
Analytics,	1
Android	1
Apple	1
As	4
Assumption	1
Big	7
But	1
By	4
COVID-19	1
Changes	1
Choudhary,	1
Clougherty	1
Cognilytica	1
Collectively,	1
D&A	2
Data	12
DataOps	2
DataOps,	1
Dealing	1
Description:	1
Distributed	1
Due	1
Each	1
Edge	2
Emerging	1
Enrich	1
Enterprise	2
Enterprises	3
Evidence	1
Explore	1
Extend	1
Farhan	1
February	1
Figure	2
Fitbit,	1
For	2
Four	1
From	4
G00738992	1
Gartner	2
Google	1
Google,	1
Group	2
Hadoop	1
Hamer,	2
Hare	1
Hare,	1
Here	1
Here's	1
IBM	1
ID	1
IT	1
Implications:	1
In	9
Increasingly,	1
Indeed,	1
Initiatives:Data	1
Internet	1
IoT	2
It	1
Jan	1
January	1
Jim	2
Jones	1

#### 3.2. WordCount with Python

The next example uses [Hadoop streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html).

Two Python scripts are used, `mapper.py` and `reducer.py`. Let's look at them:

In [28]:
%%bash

echo -e "\n************** MAPPER.PY ****************\n"
cat ./utils/mapper.py
echo -e "\n************** REDUCER.PY ****************\n"
cat ./utils/reducer.py


************** MAPPER.PY ****************

#! /usr/bin/python

import sys

def do_map(doc): 
    for word in doc.split(): 
        yield word.lower(), 1 

for line in sys.stdin: 
    for key, value in do_map(line): 
        print(key + '\t' + str(value))

************** REDUCER.PY ****************

#! /usr/bin/python

import sys
 
def do_reduce(key, values):
    return key, sum(values)

prev_key = None
values = []

for line in sys.stdin:
    key, value = line.split('\t')
    if key != prev_key and prev_key is not None:
        result_key, result_value = do_reduce(prev_key, values)
        print(result_key + '\t' + str(result_value))
        values = []
    prev_key = key
    values.append(int(value))

if prev_key is not None:
    result_key, result_value = do_reduce(prev_key, values)
    print(result_key + '\t' + str(result_value))


##### How this Python code works outside Hadoop

First of all, a few words about bash `stdin` and `stdout`. Here is a [good article](https://medium.com/linuxstories/bash-pipes-and-redirections-4c267c13643b).

In [29]:
%%bash

cat test_hdfs.txt

Hello, I am a file.
I am in the HDFS now and feel good.

In [30]:
%%bash

# let's send our file to `stdin` of our mapper
# `cat` prints the content of the file
# pipe `|` is for sending that output to our `mapper.py` as input

cat test_hdfs.txt | python ./utils/mapper.py

hello,	1
i	1
am	1
a	1
file.	1
i	1
am	1
in	1
the	1
hdfs	1
now	1
and	1
feel	1
good.	1
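
Note that the tokens keep their punctuation, so `file.` and `file` would be counted as different words. One possible variant of `do_map` extracts words with a regular expression instead. The name `do_map_clean` and the pattern are illustrative assumptions and are not part of the course scripts.

```python
import re

# words made of letters or digits, optionally joined by "-" or "'"
TOKEN = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def do_map_clean(doc):
    """Emit (word, 1) for every lowercased word, without punctuation."""
    for word in TOKEN.findall(doc.lower()):
        yield word, 1

for key, value in do_map_clean("Hello, I am a file."):
    print(key + '\t' + str(value))
```

Whether this normalization is desirable depends on the task; the rest of the notebook keeps the original mapper.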


In [31]:
%%bash

# write result of mapper to the file

cat test_hdfs.txt | python ./utils/mapper.py > result.txt

In [32]:
%%bash

cat result.txt

hello,	1
i	1
am	1
a	1
file.	1
i	1
am	1
in	1
the	1
hdfs	1
now	1
and	1
feel	1
good.	1


In [33]:
%%bash

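# `sort` brings equal keys together, emulating Hadoop's shuffle-and-sort step
# (without `-k`, `sort` compares whole lines, so `-t 1` has no effect here)
cat result.txt | sort -t 1 | python ./utils/reducer.py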

a	1
am	2
and	1
feel	1
file.	1
good.	1
hdfs	1
hello,	1
i	2
in	1
now	1
the	1


In [34]:
%%bash

cat test_hdfs.txt | python ./utils/mapper.py | sort -t 1 | python ./utils/reducer.py

a	1
am	2
and	1
feel	1
file.	1
good.	1
hdfs	1
hello,	1
i	2
in	1
now	1
the	1
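
The shell pipeline mirrors the three MapReduce phases: map, shuffle-and-sort, and reduce. The same flow can be sketched in a single Python process, with `itertools.groupby` taking the place of the `prev_key` bookkeeping in `reducer.py`. The functions `do_map` and `do_reduce` are copied from the scripts above, while `run_local` is a hypothetical name used only for this illustration.

```python
from itertools import groupby
from operator import itemgetter

def do_map(doc):
    for word in doc.split():
        yield word.lower(), 1

def do_reduce(key, values):
    return key, sum(values)

def run_local(lines):
    # map: every input line produces (key, value) pairs
    pairs = [pair for line in lines for pair in do_map(line)]
    # shuffle and sort: bring equal keys next to each other
    pairs.sort(key=itemgetter(0))
    # reduce: one call per distinct key
    return [
        do_reduce(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]

result = run_local(["Hello, I am a file.", "I am in the HDFS now and feel good."])
for key, total in result:
    print(key + '\t' + str(total))
```

The sort step matters: `groupby`, like the streaming reducer, only merges keys that arrive next to each other.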


##### Python code within Hadoop (YARN)

Now let's run our Python MapReduce scripts in Hadoop.

In [35]:
!hdfs dfs -ls {WORK_DIR}/input

Found 2 items
-rw-r--r--   1 jovyan hadoopusers       9161 2023-05-02 15:31 /jovyan/input/big_data_trends_2021.txt
-rw-r--r--   1 jovyan hadoopusers      11836 2023-05-02 15:31 /jovyan/input/big_data_trends_2023.txt


Run the job and print the result:

In [36]:
%%bash
work_dir=/jovyan
out_dir=/output_py

# delete directory if exists
hdfs dfs -rm -r ${work_dir}${out_dir}

yarn jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar \
    -input ${work_dir}/input/*.txt -output ${work_dir}${out_dir} \
    -file ./utils/mapper.py -file ./utils/reducer.py \
    -mapper "python3 mapper.py" -reducer "python3 reducer.py"

packageJobJar: [./utils/mapper.py, ./utils/reducer.py, /tmp/hadoop-unjar5713892345926893084/] [] /tmp/streamjob8458008549747069715.jar tmpDir=null


rm: `/jovyan/output_py': No such file or directory
2023-05-02 15:31:55,927 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
2023-05-02 15:31:56,645 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2023-05-02 15:31:56,803 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2023-05-02 15:31:56,983 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/jovyan/.staging/job_1683041087371_0002
2023-05-02 15:31:58,060 INFO mapred.FileInputFormat: Total input files to process : 2
2023-05-02 15:31:58,916 INFO mapreduce.JobSubmitter: number of splits:3
2023-05-02 15:31:59,409 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1683041087371_0002
2023-05-02 15:31:59,411 INFO mapreduce.JobSubmitter: Executing with tokens: []
2023-05-02 15:31:59,546 INFO conf.Configuration: resource-types.xml not found
2023-05-02 15:31:59,546 INFO resource.ResourceUtils: Unable to find 

Options for Hadoop streaming:

| Option | Description |
| --- | --- |
| -files | A comma-separated list of files to be copied to the MapReduce cluster |
| -mapper | The command to be run as the mapper |
| -reducer | The command to be run as the reducer |
| -input | The DFS input path for the Map step |
| -output | The DFS output directory for the Reduce step |
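
The job log above warns that `-file` is deprecated in favour of the generic `-files` option. Generic options have to be placed before the streaming-specific ones. The sketch below only assembles such a command as an argument list and does not run it. `streaming_command` is a hypothetical helper, the jar location mirrors the one used in the cell above, and the fallback value for `HADOOP_HOME` is an assumption.

```python
import os

def streaming_command(input_path, output_path, mapper, reducer, files):
    """Build the argument list for a Hadoop streaming job."""
    jar = os.path.join(
        os.environ.get('HADOOP_HOME', '/opt/hadoop'),  # fallback path is an assumption
        'share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar',
    )
    return [
        'yarn', 'jar', jar,
        '-files', ','.join(files),  # generic option: goes before streaming options
        '-input', input_path,
        '-output', output_path,
        '-mapper', mapper,
        '-reducer', reducer,
    ]

cmd = streaming_command(
    '/jovyan/input', '/jovyan/output_py',
    'python3 mapper.py', 'python3 reducer.py',
    ['./utils/mapper.py', './utils/reducer.py'],
)
print(' '.join(cmd))
```

On a machine with Hadoop installed, the list could be passed to `subprocess.run(cmd, check=True)`.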

In [37]:
%%bash
work_dir=/jovyan
out_dir=/output_py

hdfs dfs -ls ${work_dir}${out_dir}

Found 2 items
-rw-r--r--   1 jovyan hadoopusers          0 2023-05-02 15:32 /jovyan/output_py/_SUCCESS
-rw-r--r--   1 jovyan hadoopusers      11244 2023-05-02 15:32 /jovyan/output_py/part-00000


In [38]:
%%bash
work_dir=/jovyan
out_dir=/output_py

hdfs dfs -cat ${work_dir}${out_dir}/part-00000

(09	1
(balanced	1
(context,	1
(d&a)	1
(iot)	1
(ml).	1
(real-time)	1
(see	2
(small	1
(velocity)	1
(veracity),	1
(wide	1
-	2
--	8
05	1
1).	1
1.	2
10	2
16	1
193	1
2.	1
2020,	1
2020.	1
2021	1
2021)	1
2021:	2
2023	4
2023.	1
2025,	1
3.	1
360-degree	2
4	1
4.	1
6	1
63%	1
70%	1
a	46
ability	1
able	2
about	6
accelerate.	1
accuracy	1
accurate	2
achieved	1
across	5
actions:	1
adaptive	1
adaptive,	2
addition	1
addition,	3
additional	1
address	1
address.	1
adoption	1
advanced	7
advances	1
advancing	1
advantages	1
aggregate	1
agile,	1
ai	13
ai,	5
ai-enabled	2
ai.	1
all	5
alleviated	2
allow	1
allowing	1
almost	1
also	7
alternative	1
among	1
amount	1
amounts	2
an	4
analysis	10
analysis.	1
analysis;	1
analyst	1
analytical	3
analytics	33
analytics,	5
analyze	2
analyzing	1
and	178
android	1
anomalies	1
apple	1
application	2
applications	2
applications,	1
applications.	1
applied	1
applies	1
apply	2
approach	7
approach.	2
approaches	12
approaches.	2
appropriately	1
apps	1
architecture	1
architectures	1
arch

### 4. YARN jobs monitoring

Hadoop also provides a web UI for the YARN ResourceManager. All jobs (submitted, running, or finished) can be tracked in the YARN Web UI:

In [39]:
print(
    'YARN Web UI available at:',
    'https://jhas01.gsom.spbu.ru{}proxy/{}/cluster'.format(
        os.environ['JUPYTERHUB_SERVICE_PREFIX'],
        YARN_PORT
    )
)

YARN Web UI available at: https://jhas01.gsom.spbu.ru/user/vgarshin/proxy/8088/cluster


### 5. More MapReduce

#### 5.1. Not only word count

We will count the number of reviews for each rating (1, 2, 3, 4, 5) in the [Kaggle Hotels Reviews dataset](https://www.kaggle.com/datasets/yash10kundu/hotel-reviews).

In [40]:
!tail ~/__DATA/IBDT_Spring_2023/topic_3/Hotel_Reviews.csv

"ok price look hotel ok little run average cleanliness chose price seattle quite expensive, did n't room/bed reserved staff unhelpful, best westerns used nice hotels does n't fit mold, choose different hotel probably expensive time visit seattle,  ",2
"great choice wife chose best western quite bit research, looking near downtown free parking air conditioning pregnant hot, location good close space needle seattle offers free bus rides downtown area free wireless nice continental breakfast pretty impressive, room spacious bed fairly comfortable good pillows shower hot water good pressure, complaints desk aside john c. n't help, desk staff consisted young people mean n't social skills confidence answer good tourist questions, negotiate free parking told offered phone, nice cooperative parking issue, great value going place offers needs good big city price, note golden singha thai restaurant cedar short walk central downtown needle excellent food 8 dinner,  ",5
"good bed clean convenien

In [41]:
%%bash

# test our scripts

cat ~/__DATA/IBDT_Spring_2023/topic_3/Hotel_Reviews.csv | \
    python ./utils/mapper_rr.py | \
    sort -t 1 | \
    python ./utils/reducer.py

1	1421
2	1793
3	2184
4	6039
5	9054
Rating	1
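
The `mapper_rr.py` script is not listed in this notebook. Judging by the output above, where the `Rating` header is counted once, it emits the last CSV field of every line as the key. A minimal sketch under that assumption follows. The name `do_map_rating`, the header text, and the shortened sample lines are made up for illustration, and the code assumes that every review occupies exactly one line.

```python
def do_map_rating(line):
    """Emit (rating, 1), taking the rating from the last comma-separated field."""
    line = line.strip()
    if not line:
        return
    rating = line.rsplit(',', 1)[-1].strip()
    yield rating, 1

# made-up lines in the same shape as the dataset: quoted review text, then rating
sample = [
    'Review,Rating',
    '"ok price look hotel ok little run, did n\'t fit mold,  ",2',
    '"great choice wife chose best western, great value,  ",5',
]
for line in sample:
    for key, value in do_map_rating(line):
        print(key + '\t' + str(value))
```

Splitting from the right with `rsplit(',', 1)` avoids parsing the commas inside the review text. Checking `rating.isdigit()` before emitting would drop the header line from the result.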


In [42]:
!hdfs dfs -mkdir {WORK_DIR}/input_csv

In [43]:
# now put local file to HDFS
!hdfs dfs -put ~/__DATA/IBDT_Spring_2023/topic_3/Hotel_Reviews.csv {WORK_DIR}/input_csv

In [44]:
# now put one more copy of the file to HDFS under a different name
# Hadoop jobs usually read a whole directory of input files,
# which may hold thousands of CSV files spread across many servers

!hdfs dfs -put ~/__DATA/IBDT_Spring_2023/topic_3/Hotel_Reviews.csv {WORK_DIR}/input_csv/Hotel_Reviews_more.csv 

In [45]:
!hdfs dfs -ls {WORK_DIR}/input_csv

Found 2 items
-rw-r--r--   1 jovyan hadoopusers   14966021 2023-05-02 15:32 /jovyan/input_csv/Hotel_Reviews.csv
-rw-r--r--   1 jovyan hadoopusers   14966021 2023-05-02 15:32 /jovyan/input_csv/Hotel_Reviews_more.csv


In [None]:
%%bash
work_dir=/jovyan
out_dir=/output_csv

# delete directory if exists
hdfs dfs -rm -r ${work_dir}${out_dir}

yarn jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar \
    -input ${work_dir}/input_csv/*.csv -output ${work_dir}${out_dir} \
    -file ./utils/mapper_rr.py -file ./utils/reducer.py \
    -mapper "python3 mapper_rr.py" -reducer "python3 reducer.py"

In [None]:
%%bash
work_dir=/jovyan
out_dir=/output_csv

hdfs dfs -ls ${work_dir}${out_dir}

In [None]:
%%bash
work_dir=/jovyan
out_dir=/output_csv

hdfs dfs -cat ${work_dir}${out_dir}/part-00000

#### 5.2. Meet MRJob

Module [mrjob](https://mrjob.readthedocs.io/en/latest/) helps to write MapReduce jobs in Python 2.7/3.4+ and run them on many platforms:
- Write multi-step MapReduce jobs in pure Python
- Test on your local machine
- Run on a Hadoop cluster
- Run in the cloud using Amazon Elastic MapReduce (EMR)
- Run in the cloud using Google Cloud Dataproc (Dataproc)
- Easily run Spark jobs on EMR or your own Hadoop cluster

In [None]:
!pip install mrjob

Again, we need to write a Python script:

In [None]:
!cat ./utils/mrjob_ratings.py

In [None]:
%%bash

# test the mrjob script locally
# it runs in plain Python, without YARN, Hadoop, or HDFS

python ./utils/mrjob_ratings.py \
    ~/__DATA/IBDT_Spring_2023/topic_3/Hotel_Reviews.csv

In [None]:
!hdfs dfs -ls /jovyan/input_csv/

In [None]:
# put local file to HDFS
!hdfs dfs -put ~/__DATA/IBDT_Spring_2023/topic_3/test_hdfs.txt /tmp

In [None]:
%%bash

# now let's run mrjob script within Hadoop 
# NOTE: python3 is used, it is a feature of mrjob

python3 ./utils/mrjob_ratings.py \
    --python-bin /opt/conda/bin/python3 \
    -r hadoop hdfs:///jovyan/input_csv/*.csv

In [None]:
# traces of mrjob in the HDFS (logs, outputs etc.)
!hdfs dfs -ls /user/jovyan/tmp/mrjob

#### 5.3. MRJob for cryptocurrency analysis

The [Cryptocurrency Price & Market Data dataset](https://www.kaggle.com/datasets/thedevastator/cryptocurrency-price-market-data) provides insights into the cryptocurrency markets. It includes data points such as:
- name of the cryptocurrency
- symbol
- price
- hourly and daily change trends
- 24 hour volume traded
- market capitalization

Our goal is to find the top 10 cryptocurrencies by `24 hour volume traded` with the help of `mrjob`.
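
The `mrjob_crypto.py` script is not listed here. Independently of how it is written, the core selection step can be sketched with `heapq.nlargest` from the standard library. The helper name `top_n` and the records below are made up for illustration and are not taken from the dataset.

```python
import heapq

def top_n(records, n=10):
    """Return the n (name, volume) pairs with the largest volume."""
    return heapq.nlargest(n, records, key=lambda rec: rec[1])

# made-up (name, 24h volume) pairs
records = [('coin_a', 120.0), ('coin_b', 950.5), ('coin_c', 30.0), ('coin_d', 400.0)]
for name, volume in top_n(records, n=2):
    print(name, volume)
```

In a MapReduce setting this selection is typically applied to each partition first and then once more to the merged candidates, because the overall top N is always contained in the union of the per-partition top N lists.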

In [None]:
!head ~/__DATA/IBDT_Spring_2023/topic_3/coin_gecko_2022-03-17.csv

In [None]:
!hdfs dfs -mkdir {WORK_DIR}/input_crypto

In [None]:
# put data to HDFS
!hdfs dfs -put ~/__DATA/IBDT_Spring_2023/topic_3/coin_gecko_2022-03-17.csv {WORK_DIR}/input_crypto

In [None]:
!hdfs dfs -ls {WORK_DIR}/input_crypto

In [None]:
%%bash

# test the mrjob script locally
# it runs in plain Python, without YARN, Hadoop, or HDFS

python ./utils/mrjob_crypto.py \
    ~/__DATA/IBDT_Spring_2023/topic_3/coin_gecko_2022-03-17.csv

In [None]:
%%bash

# now let's run mrjob script within Hadoop 
# NOTE: python3 is used, it is a feature of mrjob

python3 ./utils/mrjob_crypto.py \
    --python-bin /opt/conda/bin/python3 \
    -r hadoop hdfs:///jovyan/input_crypto/*.csv

### 6. Home assignment

We will use the [Video Game Sales dataset](https://www.kaggle.com/datasets/gregorut/videogamesales), which contains a list of video games with sales greater than 100,000 copies.

Fields of the dataset include:
- Rank - Ranking of overall sales
- Name - The game's name
- Platform - Platform of the game's release (e.g. PC, PS4)
- Year - Year of the game's release
- Genre - Genre of the game
- Publisher - Publisher of the game
- NA_Sales - Sales in North America (in millions)
- EU_Sales - Sales in Europe (in millions)
- JP_Sales - Sales in Japan (in millions)
- Other_Sales - Sales in the rest of the world (in millions)
- Global_Sales - Total worldwide sales.

In [None]:
# the Video Game Sales dataset is available here
!ls ~/__DATA/IBDT_Spring_2023/topic_3/vgsales.csv

Your home assignment for this part is:
1. Load `vgsales.csv` into HDFS
2. Count the number of video games per platform (field `Platform` in the file)
3. Find the top 5 video games by sales in Japan (field `JP_Sales`)

Please use the `mrjob` library for the tasks above.