# Exercise 3 (MapReduce in Practice) [4 points]
---

For this exercise, you are tasked with writing your own Hadoop MapReduce program in Python and
running it on the cluster with the provided datasets.   
See the exercise sheet for all information on the datasets and this task.


**Note:** *When accessing files in the HDFS, you need to prepend “hdfs://” to the address string. For
quick testing of solutions, there are smaller versions of large datasets (> 1 GB) on the local file-system,
ending with “_small.csv”. Make sure that your MapReduce job also works on the complete dataset
on the cluster.*
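
The naming convention from the note can be captured in a small helper. This is only a sketch: `small_variant` is a hypothetical name that is not part of the course material, and it assumes that the reduced file sits next to the full file and differs only in the `_small` suffix. Per the note, only the large datasets (> 1 GB) have such a variant.

```python
import os


def small_variant(path):
    """Return the path of the reduced dataset ending in "_small.csv".

    Hypothetical helper: assumes the small file is stored next to the
    full file and differs only in the "_small" suffix.
    """
    root, ext = os.path.splitext(path)
    return root + "_small" + ext


print(small_variant("/home/adbs23/adbs23_shared/hm/transactions.csv"))
```

The same helper works for paths that start with `hdfs://`, because only the file extension is inspected.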

In [7]:
# Saving variables to access the file locations
articles_hdfs='hdfs:///user/adbs23_shared/hm/articles.csv'
articles='/home/adbs23/adbs23_shared/hm/articles.csv'

customers_hdfs='hdfs:///user/adbs23_shared/hm/customers.csv'
customers='/home/adbs23/adbs23_shared/hm/customers.csv'

transactions_hdfs='hdfs:///user/adbs23_shared/hm/transactions.csv'
transactions='/home/adbs23/adbs23_shared/hm/transactions.csv'

transactions_small_hdfs='hdfs:///user/adbs23_shared/hm/transactions_small.csv'
transactions_small ='/home/adbs23/adbs23_shared/hm/transactions_small.csv'


- ### **a) Write a MapReduce job with “articles.csv” as input and the following output:**  

For each garment group, show its most frequent product, its second most frequent section and its most frequent department in the articles.csv file; make sure the output has the following schema:

            garment_group_name, prod_name, section_name,  department_name

The product names are stored in "prod_name", the department name in "department_name", the garment group in "garment_group_name" and the section in "section_name". If there are multiple departments, garment groups or sections with the same number of occurrences, you may resolve these conflicts randomly, i.e. pick one of them arbitrarily. If there is only one section, or all sections appear with the same frequency, just pick the most frequent one and resolve conflicts randomly. 
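
The selection rules can be tried out on plain Python lists before wiring them into a reducer. The sketch below is one possible reading of the tie rule, in which "second most frequent" means the second-highest distinct count; the function names are hypothetical and ties are broken by whatever order `Counter.most_common` returns.

```python
from collections import Counter


def most_frequent(values):
    """Return one value with the highest count (ties broken arbitrarily)."""
    return Counter(values).most_common(1)[0][0]


def second_most_frequent(values):
    """Return a value with the second-highest distinct count.

    Falls back to a most frequent value when there is only one value
    or when all values share the same count.
    """
    ranked = Counter(values).most_common()  # sorted by count, descending
    top_count = ranked[0][1]
    for value, count in ranked:
        if count < top_count:
            return value
    return ranked[0][0]


sections = ["Kids", "Kids", "Men", "Women", "Women", "Women"]
print(most_frequent(sections), second_most_frequent(sections))
```

Both helpers expect a non-empty list; an empty group cannot occur in a reducer, since a key is only created when at least one value exists.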

Make sure that your program correctly deals with the header, and possible sparse values.
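
One way to handle the header and sparse rows is to validate each line before emitting anything. The following is a minimal sketch, not a required design: `parse_article` is a hypothetical helper, the column positions (2, 15, 21, 23) are taken from the mapper in this notebook, and dropping rows with an empty field is an assumption about how sparse values should be treated.

```python
import csv


def parse_article(line):
    """Return (garment_group, prod_name, section, department) or None.

    None is returned for the header, for truncated lines and for rows
    in which one of the four needed fields is empty.
    """
    row = next(csv.reader([line]), None)
    if not row or len(row) < 24:
        return None  # empty or truncated line
    if row[2] == "prod_name":
        return None  # header row, recognised by its column name
    fields = (row[23], row[2], row[21], row[15])
    if any(field.strip() == "" for field in fields):
        return None  # sparse row: assumption is to drop it
    return fields
```

Using `csv.reader` instead of `str.split(",")` keeps quoted fields that contain commas in one piece, so the column positions stay stable.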

In [14]:
%%file mymrjob1.py

# This will create a local file to run your MapReduce program  

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.util import log_to_stream, log_to_null
from mr3px.csvprotocol import CsvProtocol
import csv 
import operator
import logging
from collections import defaultdict

log = logging.getLogger(__name__)
# 
#  Below is the skeleton for a MapReduce program in mrjob.
#  Write your own solution here. Be sure that it actually runs successfully.

class MyMRJob1(MRJob):
    
    
    OUTPUT_PROTOCOL = CsvProtocol  # write output as CSV
    
    def set_up_logging(cls, quiet=False, verbose=False, stream=None):  
        log_to_stream(name='mrjob', debug=verbose, stream=stream)
        log_to_stream(name='__main__', debug=verbose, stream=stream)

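    def mapper_prodcount(self, _, line):
        # Parse one CSV line; csv.reader keeps quoted fields with commas intact.
        row = next(csv.reader([line]), None)
        # Skip empty or truncated (sparse) lines that lack the needed columns.
        if not row or len(row) < 24:
            return
        # Skip the header row, identified by its column name.
        if row[2] == "prod_name":
            return
        garment_group = row[23]
        prod_name = row[2]
        section_name = row[21]
        department_name = row[15]
        yield garment_group, (prod_name, section_name, department_name)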
        
# use of a combiner is optional. It may speed up your job. Be sure that using the combiner preserves the correctness. 
#     def combiner_mrjob1(self,key,valuelist):
        #TODO
        
     # The reducer now creates a dict for all department_names, product_names and sections
     # and in the end returns the most or second most frequent values based on its contents
    def reducer_prodcount(self,key,pairs):
        # TODO
        product_counts = defaultdict(int)
        section_counts = defaultdict(int)
        department_counts = defaultdict(int)
        for prod_name, section_name, department_name in pairs:
            product_counts[prod_name] += 1
            section_counts[section_name] += 1
            department_counts[department_name] += 1
        # get most frequent product
        most_freq_product = max(product_counts, key=product_counts.get)
        # get second most frequent section
        freq_sections = sorted(section_counts, key=section_counts.get, reverse=True)
        if len(freq_sections) == 1:
            # only one section: fall back to the most frequent one
            second_most_freq_section = freq_sections[0]
        elif section_counts[freq_sections[0]] != section_counts[freq_sections[1]]:
            # clear winner: the runner-up is the second most frequent section
            second_most_freq_section = freq_sections[1]
        else:
            # the top two sections are tied: pick one of them arbitrarily
            second_most_freq_section = freq_sections[0] if hash(freq_sections[0]) < hash(freq_sections[1]) else freq_sections[1]
        # get most frequent department
        most_freq_department = max(department_counts, key=department_counts.get)
        # yield output as a list
        
        #log.warning(var)
        yield key, (key, most_freq_product, second_most_freq_section, most_freq_department)
        
    
    def steps(self):
        return [
            MRStep(mapper=self.mapper_prodcount,
                   reducer=self.reducer_prodcount)            
        ]

if __name__ == '__main__':
    MyMRJob1.run()


Overwriting mymrjob1.py


Running a local MRjob 

In [15]:
!python3 mymrjob1.py $articles > output1.csv

Using configs in /etc/mrjob.conf
No configs specified for inline runner
Creating temp directory /tmp/mymrjob1.e12141198.20230505.160205.901491
Running step 1 of 1...
job output is in /tmp/mymrjob1.e12141198.20230505.160205.901491/output
Streaming final output from /tmp/mymrjob1.e12141198.20230505.160205.901491/output...
Removing temp directory /tmp/mymrjob1.e12141198.20230505.160205.901491...


Running a Hadoop job

In [14]:
!python3 mymrjob1.py -r hadoop --hadoop-streaming-jar "/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.4.jar" $articles_hdfs > output.csv

Using configs in /etc/mrjob.conf
Looking for hadoop binary in $PATH...
Found hadoop binary: /bin/hadoop
STDOUT: #
STDOUT: # There is insufficient memory for the Java Runtime Environment to continue.
STDOUT: # Native memory allocation (mmap) failed to map 429391872 bytes for committing reserved memory.
STDOUT: # An error report file with more information is saved as:
STDOUT: # /home/adbs23/e12141198/hs_err_pid2143763.log
Traceback (most recent call last):
  File "/home/adbs23/e12141198/.local/lib/python3.6/site-packages/mrjob/fs/hadoop.py", line 310, in exists
    ok_stderr=[_HADOOP_LS_NO_SUCH_FILE])
  File "/home/adbs23/e12141198/.local/lib/python3.6/site-packages/mrjob/fs/hadoop.py", line 183, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/bin/hadoop', 'fs', '-ls', 'hdfs:///user/adbs23_shared/hm/articles.csv']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Tracebac

---

- ### **b) Write a MapReduce job with all three datasets as input and the following output:**  
For all customers older than 25 years, show the number of transaction items they were involved in with articles from a department named 'Jacket' or 'Woven'. 


Make sure to have the following format in your final output:

            customer_id,count_transactions
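
To check the MapReduce output on a small sample, it can help to compute the same result in memory first. The sketch below is a toy reference implementation under stated assumptions: the function name and the tuple layouts of the three inputs are hypothetical, and it is not meant to scale to the full datasets.

```python
from collections import Counter


def count_transactions(articles, transactions, customers):
    """Toy in-memory version of the three-way join.

    articles:     iterable of (article_id, department_name)
    transactions: iterable of (customer_id, article_id)
    customers:    iterable of (customer_id, age or None)
    """
    wanted_articles = {a for a, dept in articles if dept in ("Jacket", "Woven")}
    wanted_customers = {c for c, age in customers if age is not None and age > 25}
    counts = Counter(
        c for c, a in transactions
        if a in wanted_articles and c in wanted_customers
    )
    return dict(counts)


articles = [("a1", "Jacket"), ("a2", "Socks"), ("a3", "Woven")]
transactions = [("c1", "a1"), ("c1", "a3"), ("c1", "a2"), ("c2", "a1"), ("c3", "a3")]
customers = [("c1", 30), ("c2", 25), ("c3", None)]
print(count_transactions(articles, transactions, customers))
```

Customers without a matching transaction item do not appear in the result, which mirrors a job that only emits customers with at least one match.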


In [8]:
%%file mymrjob2.py
# This will create a local file to run your MapReduce program  

from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.util import log_to_stream, log_to_null
from mr3px.csvprotocol import CsvProtocol
import csv 
import logging

log = logging.getLogger(__name__)
# 
#  Below is the skeleton for a MapReduce program in mrjob.
#  Write your own solution here. Be sure that it actually runs successfully.
class MyMRJob2(MRJob):
    
    OUTPUT_PROTOCOL = CsvProtocol  # write output as CSV
    def set_up_logging(cls, quiet=False, verbose=False, stream=None):  
        log_to_stream(name='mrjob', debug=verbose, stream=stream)
        log_to_stream(name='__main__', debug=verbose, stream=stream)

#   Feel free to rename the functions
    def mapper_mrjob2(self, _, line):
        # Skip the header line of each of the three input files.
        if line.startswith(("custom", "article", "t_dat")):
            return
        # Parse with csv so that quoted fields containing commas stay in one column.
        column_entries = next(csv.reader([line]), None) or []
        if len(column_entries) >= 25:
            # articles.csv: keep only articles from the requested departments
            if column_entries[15] in ("Jacket", "Woven"):
                yield column_entries[0], "article"
        elif len(column_entries) == 5:
            # transactions.csv: column 1 is the customer id, column 2 the article id
            yield column_entries[2], ["transaction", column_entries[1]]
            yield column_entries[1], "transaction"
        elif len(column_entries) == 7:
            # customers.csv: the age is in column 5 and may be empty
            age = column_entries[5]
            if age != '' and int(age) > 25:
                yield column_entries[0], "customer"
        
# use of a combiner is optional. It may speed up your job. Be sure that using the combiner preserves the correctness. 
#     def combiner_mrjob2(self,key,valuelist):
        #TODO
        
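    def reducer1_mrjob2(self,key,valuelist):
        # Hadoop does not guarantee the order of values, so classify them by tag first.
        tags = set()
        buyers = []  # customer ids from ["transaction", customer_id] values (article keys)
        for value in valuelist:
            if isinstance(value, list):
                buyers.append(value[1])
            else:
                tags.add(value)
        # key is a 'Jacket'/'Woven' article: emit one record per buying customer
        if "article" in tags:
            for customer_id in buyers:
                yield customer_id, [1, "article"]
        # key is a customer older than 25 with at least one transaction
        if "customer" in tags and "transaction" in tags:
            yield key, [0, "customer"]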
                 
    def reducer2_mrjob2(self,key,valuelist):
        count_transactions = 0
        useable_customer = False
        for count, tag in valuelist:
            if tag == "customer":
                useable_customer = True
            elif tag == "article":
                # each "article" record stands for one matching transaction item
                count_transactions += count
        # output only customers older than 25 with at least one matching item
        if useable_customer and count_transactions > 0:
            yield key, [key, count_transactions]
        
        
    def steps(self):
        return [ 
            MRStep(
            mapper=self.mapper_mrjob2, 
#             combiner=self.combiner_mrjob2, 
            reducer=self.reducer1_mrjob2
            ),
            MRStep(reducer=self.reducer2_mrjob2
            )
        ]
if __name__ == '__main__':
    MyMRJob2.run()


Overwriting mymrjob2.py


In [9]:
!python3 mymrjob2.py  $articles $transactions_small $customers > output2.csv

Using configs in /etc/mrjob.conf
No configs specified for inline runner
Creating temp directory /tmp/mymrjob2.e12141198.20230505.155914.527368
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/mymrjob2.e12141198.20230505.155914.527368/output
Streaming final output from /tmp/mymrjob2.e12141198.20230505.155914.527368/output...
Removing temp directory /tmp/mymrjob2.e12141198.20230505.155914.527368...


In [8]:
!python3 mymrjob2.py -r hadoop --hadoop-streaming-jar "/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.4.jar" $articles_hdfs $transactions_hdfs $customers_hdfs > output2.csv

Using configs in /etc/mrjob.conf
Looking for hadoop binary in $PATH...
Found hadoop binary: /bin/hadoop
STDOUT: #
STDOUT: # There is insufficient memory for the Java Runtime Environment to continue.
STDOUT: # Native memory allocation (mmap) failed to map 429391872 bytes for committing reserved memory.
STDOUT: # An error report file with more information is saved as:
STDOUT: # /home/adbs23/e12141198/hs_err_pid2143411.log
Traceback (most recent call last):
  File "/home/adbs23/e12141198/.local/lib/python3.6/site-packages/mrjob/fs/hadoop.py", line 310, in exists
    ok_stderr=[_HADOOP_LS_NO_SUCH_FILE])
  File "/home/adbs23/e12141198/.local/lib/python3.6/site-packages/mrjob/fs/hadoop.py", line 183, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/bin/hadoop', 'fs', '-ls', 'hdfs:///user/adbs23_shared/hm/articles.csv']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Tracebac

---

- ### **c) Once your jobs have run successfully on the cluster, use the provided commands in the notebook to look up the counters of your MapReduce job(s).**  
Alternatively, you can also read the counters from the output cells above, after a job has successfully run on the cluster.
Use the counters to determine for each job what the replication rate was, as well as the input and output size. Note: for this you will need to determine the job ids. These are shown in the output when running a job.  
**Note:** _Be sure to replace the dummy job ID below with the real one you get after running it on the cluster!_
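
If you prefer to collect the counters in Python rather than reading them off the cell output, the same `mapred job -counter` command can be wrapped in a helper. This is a hedged sketch: `read_counter` and `parse_counter` are hypothetical names, and the parsing assumes that the counter value is the last non-empty line on standard output, which should be verified against a real run.

```python
import subprocess


def parse_counter(stdout_text):
    """Extract the counter value, assumed to be the last non-empty line."""
    lines = [ln.strip() for ln in stdout_text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no counter value found in command output")
    return int(lines[-1])


def read_counter(job_id, group, counter, run=subprocess.run):
    """Run `mapred job -counter` and return the counter as an int.

    `run` is injectable so that the helper can be tested without a cluster.
    """
    result = run(
        ["mapred", "job", "-counter", job_id, group, counter],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        check=True,
    )
    return parse_counter(result.stdout)
```

A call would then look like `read_counter(job_id, "org.apache.hadoop.mapreduce.TaskCounter", "MAP_INPUT_RECORDS")`, using the same group and counter names as the cells below.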

In [10]:
job_id = "job_1681716720238_0466"
!mapred job -counter $job_id org.apache.hadoop.mapreduce.TaskCounter MAP_INPUT_RECORDS

2023-05-05 18:01:31,286 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at captain01.os.hpc.tuwien.ac.at/192.168.88.133:8032
2023-05-05 18:01:32,474 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:32,533 INFO tools.CLI: Could not obtain job info after 1 attempt(s). Sleeping for 2 seconds and retrying.
2023-05-05 18:01:34,537 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:34,580 INFO tools.CLI: Could not obtain job info after 2 attempt(s). Sleeping for 2 seconds and retrying.
2023-05-05 18:01:36,584 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:36,627 INFO tools.CLI: Could not obtain job info after 3 attempt(s). Sleeping for 2 seconds and retrying.
2023-

In [11]:
!mapred job -counter $job_id org.apache.hadoop.mapreduce.TaskCounter MAP_OUTPUT_RECORDS

2023-05-05 18:01:40,193 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at captain01.os.hpc.tuwien.ac.at/192.168.88.133:8032
2023-05-05 18:01:41,326 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:41,379 INFO tools.CLI: Could not obtain job info after 1 attempt(s). Sleeping for 2 seconds and retrying.
2023-05-05 18:01:43,384 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:43,431 INFO tools.CLI: Could not obtain job info after 2 attempt(s). Sleeping for 2 seconds and retrying.
2023-05-05 18:01:45,435 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:45,479 INFO tools.CLI: Could not obtain job info after 3 attempt(s). Sleeping for 2 seconds and retrying.
2023-

In [12]:
!mapred job -counter $job_id org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter BYTES_READ

2023-05-05 18:01:49,046 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at captain01.os.hpc.tuwien.ac.at/192.168.88.133:8032
2023-05-05 18:01:50,195 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:50,246 INFO tools.CLI: Could not obtain job info after 1 attempt(s). Sleeping for 2 seconds and retrying.
2023-05-05 18:01:52,251 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:52,295 INFO tools.CLI: Could not obtain job info after 2 attempt(s). Sleeping for 2 seconds and retrying.
2023-05-05 18:01:54,298 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:54,348 INFO tools.CLI: Could not obtain job info after 3 attempt(s). Sleeping for 2 seconds and retrying.
2023-

In [13]:
!mapred job -counter $job_id org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter BYTES_WRITTEN

2023-05-05 18:01:57,912 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at captain01.os.hpc.tuwien.ac.at/192.168.88.133:8032
2023-05-05 18:01:58,998 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:01:59,066 INFO tools.CLI: Could not obtain job info after 1 attempt(s). Sleeping for 2 seconds and retrying.
2023-05-05 18:02:01,071 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:02:01,119 INFO tools.CLI: Could not obtain job info after 2 attempt(s). Sleeping for 2 seconds and retrying.
2023-05-05 18:02:03,123 INFO mapred.ClientServiceDelegate: Could not get Job info from RM for job job_1681716720238_0466. Redirecting to job history server.
2023-05-05 18:02:03,180 INFO tools.CLI: Could not obtain job info after 3 attempt(s). Sleeping for 2 seconds and retrying.
2023-

### Replication Rates and Input and Output Sizes:

* Replication Rate for Task 1. :   MAP_OUTPUT_RECORDS / MAP_INPUT_RECORDS
* Communication Cost:   MAP_OUTPUT_RECORDS  (or the size of the output in bytes, either works)
*  Input Size for Task 1. :  BYTES_READ
* Output Size for Task 1. : BYTES_WRITTEN

Note: if your job had multiple steps, just state the replication rates for each step. Make sure to compute the cumulative costs for the other measures, though.
* Replication Rate for Task 2. :  AS ABOVE (but for each round)
* Communication Cost:  MAP_OUTPUT_RECORDS (of each job) plus the BYTES_READ (or MAP_INPUT_RECORDS) of the second job
* Input Size for Task 2. :  AS ABOVE
* Output Size for Task 2. : AS ABOVE
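
Once the counters are known, the measures above reduce to simple arithmetic. The sketch below uses hypothetical function names and made-up counter values purely for illustration; the replication rate is taken as map output records divided by map input records, and the communication cost of a multi-step job follows the rule stated above (map output of every step plus the map input of each later step).

```python
def replication_rate(map_input_records, map_output_records):
    """Key-value pairs emitted by the mappers per input record."""
    if map_input_records <= 0:
        raise ValueError("MAP_INPUT_RECORDS must be positive")
    return map_output_records / map_input_records


def communication_cost(steps):
    """Cumulative cost over all steps of one job.

    steps: list of dicts with the keys MAP_INPUT_RECORDS and
    MAP_OUTPUT_RECORDS, one dict per step, in execution order.
    """
    cost = sum(step["MAP_OUTPUT_RECORDS"] for step in steps)
    cost += sum(step["MAP_INPUT_RECORDS"] for step in steps[1:])
    return cost


# Made-up counter values, for illustration only.
steps = [
    {"MAP_INPUT_RECORDS": 100, "MAP_OUTPUT_RECORDS": 250},
    {"MAP_INPUT_RECORDS": 40, "MAP_OUTPUT_RECORDS": 40},
]
for number, step in enumerate(steps, start=1):
    rate = replication_rate(step["MAP_INPUT_RECORDS"], step["MAP_OUTPUT_RECORDS"])
    print("step", number, "replication rate:", rate)
print("communication cost:", communication_cost(steps))
```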

For more counters, you can check the documentation, under https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskCounter and https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormatCounter.html

---
## **Your solution for Exercise 3 will consist of:**  
*  This notebook, filled with your solution, including the information on the replication rate, and the input and output sizes. 
