# Introduction to Big Data Modern Technologies course

## TOPIC 3: Hadoop and MapReduce practice
### Part 1

### 1. Libraries

In [None]:
import os
import re
import json
import socket
import subprocess
import pandas as pd

In [None]:
# we need port only for Web UI
YARN_PORT = 8088

# working directory for the default user `jovyan`:
# `/jovyan/home` in Jupyter
# and `/jovyan` in the Hadoop environment
WORK_DIR = '/jovyan'

### 2. HDFS commands

Help is all you need!

In [None]:
!hdfs dfs -help

#### 2.1. Navigation

Navigation through HDFS is available with `hdfs dfs` [commands](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html), which are quite similar to Unix shell navigation (`ls`, `cat`, etc.):

In [None]:
# list root directory
!hdfs dfs -ls /

In [None]:
# list directory
!hdfs dfs -ls /jovyan

...or with the `WORK_DIR` variable:

In [None]:
# list working directory '/jovyan'
# NOTE: variable WORK_DIR='/jovyan' used in braces
!hdfs dfs -ls {WORK_DIR}

#### 2.2. Put and get files

Put an arbitrary file to HDFS:

In [None]:
!ls -la ~/__DATA/IBDT_Spring_2024/topic_3

In [None]:
# put local file to HDFS
!hdfs dfs -put ~/__DATA/IBDT_Spring_2024/topic_3/test_hdfs.txt {WORK_DIR}

In [None]:
!hdfs dfs -ls {WORK_DIR}

In [None]:
# look at the file's content
!hdfs dfs -cat {WORK_DIR}/test_hdfs.txt

Create folders and move files:

In [None]:
!hdfs dfs -mkdir {WORK_DIR}/texts

In [None]:
!hdfs dfs -ls {WORK_DIR}

In [None]:
!hdfs dfs -mv {WORK_DIR}/test_hdfs.txt {WORK_DIR}/texts

In [None]:
!hdfs dfs -ls {WORK_DIR}

In [None]:
!hdfs dfs -ls {WORK_DIR}/texts

In [None]:
!hdfs dfs -cat {WORK_DIR}/texts/test_hdfs.txt

Get files back from `HDFS`:

In [None]:
!hdfs dfs -get {WORK_DIR}/texts/test_hdfs.txt .

#### 2.3. Something useful

Useful functions:

In [None]:
def hdfs_dirs(path, filter_str=''):
    """
    Returns files in path provided as a list. 
    File names may be filtered by `filter_str` parameter,
    e.g. `filter_str='csv'` will display only `csv` files.
    
    """
    process = subprocess.Popen(
        ['hdfs', 'dfs', '-ls', path], 
        stdout=subprocess.PIPE, 
        stderr=subprocess.PIPE
    )
    out, err = process.communicate()
    dirs = out.decode('utf-8').split('\n')
    dirs = list(filter(lambda x: filter_str in x, dirs))
    dirs = list(map(lambda x: x.split(' ')[-1], dirs))
    return dirs

def file_content(path):
    """
    Returns content of the file.
    Similar to `cat` command.
    
    """
    process = subprocess.Popen(
        ['hdfs', 'dfs', '-cat', path], 
        stdout=subprocess.PIPE, 
        stderr=subprocess.PIPE
    )
    out, err = process.communicate()
    return out.decode('utf-8')

In [None]:
# use function defined above
hdfs_dirs(WORK_DIR, 'txt')

In [None]:
hdfs_dirs(WORK_DIR + '/texts', 'txt')

In [None]:
hdfs_dirs(WORK_DIR + '/texts', 'csv')

In [None]:
# display the content of the 'test_hdfs.txt' file
content = file_content(f'{WORK_DIR}/texts/test_hdfs.txt')
content
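
The `hdfs_dirs` helper keeps only the last space-separated token of each line. A slightly more structured variant is sketched below: it splits one line of `hdfs dfs -ls` output into the documented columns (permissions, replicas, owner, group, size, date, time, path). The function name `parse_ls_line` and the sample line are illustrative assumptions, not part of the course utilities.

```python
def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls` output into a dict, or return None.

    Assumes the documented column order: permissions, replicas, owner,
    group, size, date, time, path.
    """
    parts = line.split(None, 7)  # at most 8 fields, so a path keeps its spaces
    if len(parts) < 8:
        return None  # e.g. the "Found N items" header or an empty line
    perms, replicas, owner, group, size, date, time, path = parts
    return {
        'is_dir': perms.startswith('d'),
        'owner': owner,
        'group': group,
        'size': int(size),
        'modified': f'{date} {time}',
        'path': path,
    }

# hypothetical sample line, for illustration only
sample = '-rw-r--r--   1 jovyan supergroup         42 2024-03-01 10:15 /jovyan/texts/test_hdfs.txt'
entry = parse_ls_line(sample)
```

Applied to every line returned by `hdfs dfs -ls`, this yields a list of dictionaries that can be filtered by size or owner, or loaded into a `pandas` DataFrame.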

### 3. MapReduce intro

#### 3.1. WordCount with Java

`WordCount` is a simple application that counts the number of occurrences of each word in a given input set. This demo uses a prebuilt `jar` package shipped with Hadoop.
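
To make the goal concrete before running anything on the cluster, here is a minimal single-machine sketch of what a word count computes. Splitting on whitespace is a simplifying assumption, and the two input lines are made up for illustration.

```python
from collections import Counter

def word_count(lines):
    """Count how many times each whitespace-separated word occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

counts = word_count(['big data is big', 'data is everywhere'])
```

MapReduce produces the same kind of result, but spreads the counting over many machines so that the input does not have to fit on one of them.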

First let's copy files to HDFS:

In [None]:
%%bash
work_dir=/jovyan

# create input directory on HDFS
hdfs dfs -mkdir -p ${work_dir}/input

# put files to HDFS
hdfs dfs -put ~/__DATA/IBDT_Spring_2024/topic_3/big_data_* ${work_dir}/input
hdfs dfs -ls ${work_dir}/input

Run the MapReduce job and enjoy the long log output:

In [None]:
%%bash
work_dir=/jovyan

# delete directory if exists
#hdfs dfs -rm -r ${work_dir}/output

# run wordcount
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.3.jar wordcount \
    ${work_dir}/input ${work_dir}/output

In [None]:
!hdfs dfs -ls {WORK_DIR}/output

In [None]:
%%bash
work_dir=/jovyan

# print the output of wordcount
echo -e "\nwordcount output:"
hdfs dfs -cat ${work_dir}/output/part-r-00000

#### 3.2. WordCount with Python

The next example uses [Hadoop streaming](https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html).

Two Python scripts are used, `mapper.py` and `reducer.py`. Let's look at them:

In [None]:
%%bash

echo -e "\n************** MAPPER.PY ****************\n"
cat ./utils/mapper.py
echo -e "\n************** REDUCER.PY ****************\n"
cat ./utils/reducer.py
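
For readers without access to the `./utils` folder, the sketch below shows how a streaming word-count mapper and reducer are commonly written. It is not a copy of the course scripts: the function names `map_lines` and `reduce_sorted` are hypothetical, and the tab-separated `key<TAB>value` line format is the Hadoop streaming default.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f'{word}\t1'

def reduce_sorted(pairs):
    """Reducer: sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, which Hadoop guarantees
    between the map and reduce phases.
    """
    split_pairs = (pair.rstrip('\n').split('\t', 1) for pair in pairs)
    for word, group in groupby(split_pairs, key=lambda kv: kv[0]):
        yield f'{word}\t{sum(int(count) for _, count in group)}'

# sorted() plays the role of the shuffle-and-sort step
mapped = map_lines(['big data is big', 'data is everywhere'])
result = list(reduce_sorted(sorted(mapped)))
```

In real streaming scripts the same logic reads its lines from `sys.stdin` and writes each result with `print`.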

##### How this Python code works outside of Hadoop

First of all, a few words about bash `stdin` and `stdout`. Here is a [good article](https://linuxhandbook.com/redirection-linux/).
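
The same plumbing can be reproduced from Python. The sketch below feeds text to the standard input of a child process and captures its standard output, which is what the shell pipe `|` does between two commands. The child program is a throwaway one-liner chosen only for illustration.

```python
import subprocess
import sys

# a tiny child program: read stdin line by line, write upper-cased lines to stdout
child_code = 'import sys\nfor line in sys.stdin:\n    print(line.strip().upper())'

completed = subprocess.run(
    [sys.executable, '-c', child_code],
    input='hello\nhadoop\n',   # becomes the child's stdin
    capture_output=True,       # collect the child's stdout and stderr
    text=True,
    check=True,
)
piped_output = completed.stdout
```

Hadoop streaming talks to a mapper or reducer in the same way: it writes records to the script's `stdin` and reads results from its `stdout`.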

In [None]:
%%bash

cat test_hdfs.txt

In [None]:
%%bash

# let's send our file to `stdin` of our mapper
# `cat` prints the content of the file
# pipe `|` is for sending that output to our `mapper.py` as input

cat test_hdfs.txt | python ./utils/mapper.py

In [None]:
%%bash

# write result of mapper to the file

cat test_hdfs.txt | python ./utils/mapper.py > result.txt

In [None]:
%%bash

cat result.txt

In [None]:
%%bash

cat result.txt | sort -k1,1 | python ./utils/reducer.py

In [None]:
%%bash

cat test_hdfs.txt | python ./utils/mapper.py | sort -k1,1 | python ./utils/reducer.py

##### Python code within Hadoop (YARN)

Now let's run our Python MapReduce scripts in Hadoop.

In [None]:
!hdfs dfs -ls {WORK_DIR}/input

Run the job and print the result:

In [None]:
%%bash
work_dir=/jovyan
out_dir=/output_py

# delete directory if exists
hdfs dfs -rm -r ${work_dir}${out_dir}

yarn jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar \
    -input ${work_dir}/input/*.txt -output ${work_dir}${out_dir} \
    -file ./utils/mapper.py -file ./utils/reducer.py \
    -mapper "python3 mapper.py" -reducer "python3 reducer.py"

Options for Hadoop streaming:

| Option | Description |
| --- | --- |
| -files | A comma-separated list of files to be copied to the MapReduce cluster (a generic option that replaces the deprecated `-file` flag used in the command above) |
| -mapper | The command to be run as the mapper |
| -reducer | The command to be run as the reducer |
| -input | The DFS input path for the Map step |
| -output | The DFS output directory for the Reduce step |
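
The options in the table can also be assembled programmatically, which helps when the same job is launched with different inputs. The helper below is a sketch: `streaming_command` is a hypothetical name, and the shortened jar path is a placeholder for the full path used in the cell above. It places `-files` first because Hadoop expects generic options before the streaming options.

```python
def streaming_command(jar, inputs, output, mapper, reducer, files=()):
    """Build the argument list for a Hadoop streaming job."""
    cmd = ['yarn', 'jar', jar]
    if files:
        # generic options such as -files must precede the streaming options
        cmd += ['-files', ','.join(files)]
    cmd += [
        '-input', inputs,
        '-output', output,
        '-mapper', mapper,
        '-reducer', reducer,
    ]
    return cmd

cmd = streaming_command(
    jar='hadoop-streaming-3.2.3.jar',  # placeholder for the full jar path
    inputs='/jovyan/input/*.txt',
    output='/jovyan/output_py',
    mapper='python3 mapper.py',
    reducer='python3 reducer.py',
    files=['./utils/mapper.py', './utils/reducer.py'],
)
```

On a machine where `yarn` is on the `PATH`, the resulting list can be passed to `subprocess.run(cmd, check=True)`.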

In [None]:
%%bash
work_dir=/jovyan
out_dir=/output_py

hdfs dfs -ls ${work_dir}${out_dir}

In [None]:
%%bash
work_dir=/jovyan
out_dir=/output_py

hdfs dfs -cat ${work_dir}${out_dir}/part-00000

### 4. YARN jobs monitoring

Hadoop also provides a web UI for the YARN ResourceManager. All jobs (submitted, running, or finished) can be tracked there:

In [None]:
print(
    'YARN Web UI available at:',
    'https://jhas01.gsom.spbu.ru{}proxy/{}/cluster'.format(
        os.environ['JUPYTERHUB_SERVICE_PREFIX'],
        YARN_PORT
    )
)
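
Besides the web pages, the ResourceManager exposes the same information as JSON through its REST API under `/ws/v1/cluster/apps`. The sketch below only parses such a response. The function name `summarize_apps` and the sample payload are illustrative assumptions, and whether the JupyterHub proxy forwards REST calls on this installation has not been verified.

```python
import json

def summarize_apps(payload):
    """Return (id, name, state, finalStatus) tuples from a Cluster Applications response."""
    # the "apps" value is null when no application has been submitted yet
    apps = (payload.get('apps') or {}).get('app', [])
    return [(a['id'], a['name'], a['state'], a['finalStatus']) for a in apps]

# hypothetical response body, for illustration only
sample_json = '''
{"apps": {"app": [
    {"id": "application_0000000000000_0001", "name": "word count",
     "state": "FINISHED", "finalStatus": "SUCCEEDED"},
    {"id": "application_0000000000000_0002", "name": "streamjob.jar",
     "state": "RUNNING", "finalStatus": "UNDEFINED"}
]}}
'''
summary = summarize_apps(json.loads(sample_json))
```

With direct network access to the ResourceManager, the payload could be fetched with `urllib.request.urlopen` from `http://<resourcemanager-host>:8088/ws/v1/cluster/apps`, where the host name is a placeholder.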

### 5. More MapReduce

#### 5.1. Not only word count

We will count the number of reviews for each rating (1, 2, 3, 4, 5) in the [Kaggle Hotels Reviews dataset](https://www.kaggle.com/datasets/yash10kundu/hotel-reviews).

In [None]:
!tail ~/__DATA/IBDT_Spring_2024/topic_3/Hotel_Reviews.csv

In [None]:
!cat ./utils/mapper_rr.py
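
A mapper for this task only has to pick the rating out of each CSV row and emit it as the key. The sketch below is not the course's `mapper_rr.py`: the function name `map_ratings`, the column position, and the sample rows are assumptions made for illustration. It uses the `csv` module so that commas inside quoted review text do not break the parsing, and it assumes integer ratings.

```python
import csv

def map_ratings(lines, rating_column):
    """Emit 'rating<TAB>1' for every row whose rating column holds an integer."""
    for row in csv.reader(lines):
        if len(row) <= rating_column:
            continue  # malformed or empty row
        rating = row[rating_column].strip()
        if rating.isdigit():  # also skips the header row wherever it appears
            yield f'{rating}\t1'

# hypothetical two-column layout: review text, rating
sample = [
    'review,rating',
    '"Great stay, loved it",5',
    'Too noisy,2',
    '"Clean, central",5',
]
mapped = list(map_ratings(sample, rating_column=1))
```

Checking the value instead of skipping the first line matters on a cluster, because Hadoop splits large files and most mappers never see the header.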

In [None]:
%%bash

# test our scripts

cat ~/__DATA/IBDT_Spring_2024/topic_3/Hotel_Reviews.csv | \
    python ./utils/mapper_rr.py | \
    sort -k1,1 | \
    python ./utils/reducer.py

In [None]:
!hdfs dfs -mkdir {WORK_DIR}/input_csv

In [None]:
# now put local file to HDFS
!hdfs dfs -put ~/__DATA/IBDT_Spring_2024/topic_3/Hotel_Reviews.csv {WORK_DIR}/input_csv

In [None]:
# now put MORE local files to HDFS (because we can!)
# adding many files is exactly what Hadoop is built for:
# a real cluster may hold thousands of CSV files spread across many servers

!hdfs dfs -put ~/__DATA/IBDT_Spring_2024/topic_3/Hotel_Reviews.csv {WORK_DIR}/input_csv/Hotel_Reviews_more.csv 

In [None]:
!hdfs dfs -ls {WORK_DIR}/input_csv

In [None]:
%%bash
work_dir=/jovyan
out_dir=/output_csv

# delete directory if exists
hdfs dfs -rm -r ${work_dir}${out_dir}

yarn jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.3.jar \
    -input ${work_dir}/input_csv/*.csv -output ${work_dir}${out_dir} \
    -file ./utils/mapper_rr.py -file ./utils/reducer.py \
    -mapper "python3 mapper_rr.py" -reducer "python3 reducer.py"

In [None]:
%%bash
work_dir=/jovyan
out_dir=/output_csv

hdfs dfs -ls ${work_dir}${out_dir}

In [None]:
%%bash
work_dir=/jovyan
out_dir=/output_csv

hdfs dfs -cat ${work_dir}${out_dir}/part-00000

#### 5.2. Meet MRJob

The [mrjob](https://mrjob.readthedocs.io/en/latest/) module helps you write MapReduce jobs in Python 2.7/3.4+ and run them on several platforms:
- Write multi-step MapReduce jobs in pure Python
- Test on your local machine
- Run on a Hadoop cluster
- Run in the cloud using Amazon Elastic MapReduce (EMR)
- Run in the cloud using Google Cloud Dataproc (Dataproc)
- Easily run Spark jobs on EMR or your own Hadoop cluster

In [None]:
!pip install mrjob

Again, we need to write a Python script:

In [None]:
!cat ./utils/mrjob_ratings.py
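
For orientation, the sketch below shows the usual shape of an `mrjob` job: a subclass of `MRJob` with a `mapper` and a `reducer` method. It is not the course's `mrjob_ratings.py`. The class name `MRRatingCount` and the `RATING_COLUMN` index are hypothetical, and the `try`/`except` fallback exists only so the snippet can be imported where `mrjob` is not installed.

```python
import csv

try:
    from mrjob.job import MRJob
except ImportError:
    MRJob = object  # fallback: lets the sketch be imported without mrjob

class MRRatingCount(MRJob):
    RATING_COLUMN = 1  # hypothetical position of the rating in each CSV row

    def mapper(self, _, line):
        # mrjob passes each input line as the value; the key is unused
        row = next(csv.reader([line]), [])
        if len(row) > self.RATING_COLUMN:
            rating = row[self.RATING_COLUMN].strip()
            if rating.isdigit():  # skips the header and malformed rows
                yield rating, 1

    def reducer(self, rating, counts):
        # all counts emitted for one rating arrive together
        yield rating, sum(counts)
```

To turn this into a runnable script, end the file with `if __name__ == '__main__': MRRatingCount.run()`, the entry point described in the `mrjob` documentation.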

In [None]:
%%bash

# test the mrjob script locally
# it runs in plain Python: no YARN, Hadoop, or HDFS involved

python ./utils/mrjob_ratings.py \
    ~/__DATA/IBDT_Spring_2024/topic_3/Hotel_Reviews.csv

In [None]:
!hdfs dfs -ls /jovyan/input_csv/

In [None]:
%%bash

# now let's run mrjob script within Hadoop 
# NOTE: python3 is used, it is a feature of mrjob

python3 ./utils/mrjob_ratings.py \
    --python-bin /opt/conda/bin/python3 \
    -r hadoop hdfs:///jovyan/input_csv/*.csv

In [None]:
# traces of mrjob in the HDFS (logs, outputs etc.)
!hdfs dfs -ls /user/jovyan/tmp/mrjob

#### 5.3. MRJob for cryptocurrency analysis

The [Cryptocurrency Price & Market Data dataset](https://www.kaggle.com/datasets/thedevastator/cryptocurrency-price-market-data) provides insight into the cryptocurrency markets. It collects important data points such as:
- name of the cryptocurrency
- symbol
- price
- hourly and daily change trends
- 24 hour volume traded
- market capitalization

Our goal is to find the top 10 cryptocurrencies by `24 hour volume traded` with the help of `mrjob`.

In [None]:
!head ~/__DATA/IBDT_Spring_2024/topic_3/coin_gecko_2022-03-17.csv

In [None]:
!hdfs dfs -mkdir {WORK_DIR}/input_crypto

In [None]:
# put data to HDFS
!hdfs dfs -put ~/__DATA/IBDT_Spring_2024/topic_3/coin_gecko_2022-03-17.csv {WORK_DIR}/input_crypto

In [None]:
!hdfs dfs -ls {WORK_DIR}/input_crypto

In [None]:
!cat ./utils/mrjob_crypto.py
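
A common way to compute a top-N list in MapReduce is to send every record to a single key and let one reducer keep the N largest values. The sketch below shows that pattern with plain functions. It is not the course's `mrjob_crypto.py`: the function names, column positions, and coin names are invented for illustration.

```python
import heapq

def map_volume(rows, name_column, volume_column):
    """Emit (None, (volume, name)) so that every record shares one key."""
    for row in rows:
        try:
            volume = float(row[volume_column])
        except (ValueError, IndexError):
            continue  # header row or malformed record
        yield None, (volume, row[name_column])

def reduce_top_n(pairs, n=10):
    """Keep the n largest (volume, name) pairs, largest first."""
    return heapq.nlargest(n, (value for _, value in pairs))

# hypothetical rows: coin name, 24-hour volume
rows = [
    ['coin', 'volume_24h'],
    ['AlphaCoin', '1500.0'],
    ['BetaCoin', '98000.5'],
    ['GammaCoin', '320.0'],
]
top2 = reduce_top_n(map_volume(rows, name_column=0, volume_column=1), n=2)
```

Putting the volume first in each tuple lets `heapq.nlargest` compare records by volume without a custom key function.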

In [None]:
%%bash

# test the mrjob script locally
# it runs in plain Python: no YARN, Hadoop, or HDFS involved

python ./utils/mrjob_crypto.py \
    ~/__DATA/IBDT_Spring_2024/topic_3/coin_gecko_2022-03-17.csv

In [None]:
%%bash

# now let's run mrjob script within Hadoop 
# NOTE: python3 is used, it is a feature of mrjob

python3 ./utils/mrjob_crypto.py \
    --python-bin /opt/conda/bin/python3 \
    -r hadoop hdfs:///jovyan/input_crypto/*.csv

### 6. Home assignment

We will use the [Video Game Sales dataset](https://www.kaggle.com/datasets/gregorut/videogamesales), which contains a list of video games with sales greater than 100,000 copies.

Fields of the dataset include:
- Rank - Ranking of overall sales
- Name - The game's name
- Platform - Platform of the game's release (e.g. PC, PS4)
- Year - Year of the game's release
- Genre - Genre of the game
- Publisher - Publisher of the game
- NA_Sales - Sales in North America (in millions)
- EU_Sales - Sales in Europe (in millions)
- JP_Sales - Sales in Japan (in millions)
- Other_Sales - Sales in the rest of the world (in millions)
- Global_Sales - Total worldwide sales.

In [None]:
# the Video Game Sales dataset is already available here
!ls ~/__DATA/IBDT_Spring_2024/topic_3/vgsales.csv

Your home assignment for this part is:
1. Take the `vgsales.csv` and load it to HDFS
2. Count the number of video games by the platform (field `Platform` in the file)
3. Find top-5 video games by sales in Japan (field `JP_Sales`)

Please use the `mrjob` library for the tasks above.