<a href="https://colab.research.google.com/github/smduarte/spbd-2223/blob/main/lab1/SPBD_Labs_mapreduce1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python MapReduce Example

Word count implemented in pure Python.

This notebook demonstrates how to run a MapReduce program written in Python with Hadoop.
In this example, Hadoop runs in standalone mode and reads data from the local filesystem; in cluster mode, data is typically read from HDFS, Hadoop's distributed file system.


### Download the dataset 

In [None]:
!wget -q -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0

## WordCount Example
Read the words from input and count them.

The processing is split into two steps:

+ The mapper emits, for each input line, the number of words it contains;
+ The reducer sums all the tuples produced by the mapper stage.
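
The two steps can be sketched in-process, without Hadoop, as plain Python functions. This is only an illustration: the names `map_line` and `reduce_counts` are hypothetical and do not appear in the notebook's scripts, and the sample lines are made up.

```python
# Minimal in-process sketch of the two steps (hypothetical helper names).

def map_line(line):
    """Map step: emit one ('words', n) tuple for a line of text."""
    return ('words', len(line.split()))

def reduce_counts(pairs):
    """Reduce step: sum the counts of all tuples emitted by the map step."""
    return ('words', sum(count for _key, count in pairs))

# Made-up sample input, standing in for the lines of the dataset.
sample_lines = ["a casa que os Maias", "vieram habitar em Lisboa"]

mapped = [map_line(line) for line in sample_lines]
total = reduce_counts(mapped)
print(mapped)
print(total)
```

Because every tuple carries the same key, a single reducer call is enough to produce the grand total.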

### Mapper

Starting a cell with "%%file" followed by a file name tells the notebook to write the rest of the cell to that file on the local disk when the cell is run.

In [None]:
%%file mapper.py
#!/usr/bin/env python3
# str.maketrans, used below, requires Python 3

import sys
# string.punctuation lists the ASCII punctuation characters
import string

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # remove punctuation characters
    line = line.translate(str.maketrans('', '', string.punctuation+'«»'))
    # split the line into words
    words = line.split()
    print('words\t%s' % len(words))
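
The effect of the punctuation-removal step may be easier to see on a single line. The sketch below reuses the same `str.maketrans` table as `mapper.py`; the sample sentence is made up and the variable names are only illustrative.

```python
import string

# Same deletion table as in mapper.py: every ASCII punctuation character
# plus the guillemets « and » is removed.
table = str.maketrans('', '', string.punctuation + '«»')

# Made-up sample line containing a stand-alone dash.
line = '«Olá, mundo!» - disse-lhe ele...'

raw_tokens = line.split()
clean_tokens = line.translate(table).split()

print(raw_tokens)    # the lone '-' counts as a token here
print(clean_tokens)  # the lone '-' is gone; 'disse-lhe' becomes 'disselhe'
```

Without the translation step, punctuation that stands alone between spaces would be counted as a word. Note that hyphenated forms collapse into a single token, which leaves the word count unchanged.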

### Reducer

In [None]:
%%file reducer.py
#!/usr/bin/env python3

import sys

total_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    key, count = line.split('\t', 1)

    # convert count (currently a string) to int
    count = int(count)

    total_count += count

print('words\t%s' % (total_count))

## Local execution

The scripts can be tested with just the Unix shell, without Hadoop, as follows.

### Make the scripts executable

In [None]:
!chmod a+x mapper.py && chmod a+x reducer.py

### Execute

The execution workflow is as follows:

+ The input file is piped into the input of the mapper;
+ The output of the mapper is sorted;
+ The sorted output of the mapper is fed to the reducer stage.

In [None]:
!cat "os_maias.txt" | ./mapper.py | sort -k1,1 | ./reducer.py
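
The same workflow can be imitated inside Python to see why the sort matters. In this sketch, `list.sort` stands in for `sort -k1,1` and `itertools.groupby` stands in for handing each key's records to the reducer. The function name `shuffle_and_reduce` and the `lines` key are hypothetical; the notebook's mapper only ever emits the `words` key.

```python
from itertools import groupby

def shuffle_and_reduce(mapper_output):
    """Sort tab-separated 'key<TAB>count' records by key, then sum per key."""
    records = [line.split('\t', 1) for line in mapper_output]
    records.sort(key=lambda kv: kv[0])  # stands in for `sort -k1,1`
    totals = {}
    # groupby only merges adjacent records, hence the sort above.
    for key, group in groupby(records, key=lambda kv: kv[0]):
        totals[key] = sum(int(count) for _key, count in group)
    return totals

# Made-up mapper output; the 'lines' key is hypothetical and only there
# to show that records with equal keys end up adjacent after sorting.
mapper_output = ['words\t5', 'lines\t1', 'words\t4', 'lines\t1']
print(shuffle_and_reduce(mapper_output))
```

With a single key, as in this notebook, the sort does not change the result, but it becomes essential as soon as a mapper emits more than one key.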

## MapReduce with Hadoop

In [None]:
#@title Install Hadoop on Google Colab
!curl -s https://raw.githubusercontent.com/smduarte/spbd-2223/main/lab1/install_hadoop.sh | bash

## Hadoop standalone mode execution

To run on a Hadoop cluster, the input data must first be copied into an HDFS directory. In standalone mode, data can be read directly from the local filesystem.


Hadoop refuses to start a job whose output directory already exists, so any previous results must be removed first.

In [None]:
!rm -rf results

### Submitting the job

The _hadoop_ command submits the MapReduce job. Here it runs in standalone mode, through the Hadoop Streaming jar, with the two Python scripts acting as mapper and reducer.

In [None]:
!hadoop jar /usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input os_maias.txt -output results
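
Since the command above is long, the sketch below rebuilds it as an annotated Python list, one option per line. It only assembles and prints the command; it does not run Hadoop. The variable names are illustrative, and the paths and file names are copied from the command above.

```python
# Jar path copied from the command above; the glob is left for the
# shell to expand.
streaming_jar = '/usr/local/hadoop-3.3.4/share/hadoop/tools/lib/hadoop-*streaming*.jar'

command = [
    'hadoop', 'jar', streaming_jar,
    '-files', 'mapper.py,reducer.py',  # ship both scripts with the job
    '-mapper', 'mapper.py',            # executable run for the map stage
    '-reducer', 'reducer.py',          # executable run for the reduce stage
    '-input', 'os_maias.txt',          # input file (local in standalone mode)
    '-output', 'results',              # output directory; must not exist yet
]

print(' '.join(command))
```

The generic `-files` option is placed before the streaming-specific options (`-mapper`, `-reducer`, `-input`, `-output`), matching the order used in the original command.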

#### Checking the results
The result is stored in the _results_ directory, in one or more files named part-*.

In [None]:
!cat results/part-*
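
To use the result from Python rather than just print it, the part files can be parsed back into a dictionary. This is a minimal sketch assuming the tab-separated `key<TAB>count` format written by `reducer.py`; the helper name `read_counts` is hypothetical.

```python
import glob

def read_counts(output_dir='results'):
    """Parse every part-* file in output_dir into a {key: count} dict."""
    counts = {}
    for path in sorted(glob.glob(f'{output_dir}/part-*')):
        with open(path, encoding='utf-8') as part:
            for line in part:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                key, count = line.split('\t', 1)
                counts[key] = counts.get(key, 0) + int(count)
    return counts

# Returns an empty dict if the job has not been run yet.
print(read_counts())
```

Summing per key across files keeps the helper usable if a job is configured with more than one reducer and therefore writes several part files.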