<h1>MapReduce Simulator - COMP5349 Homework Week 1</h1>

This week's self-test homework requires you to implement a small MapReduce simulation program that scans a given input file (data.csv), and calls two functions to filter and aggregate the data found there.

The data set (data.csv) we are going to use is stored in CSV format with a tab character (\t) as the delimiter. It contains 100,000 user ratings for films.

Make sure you have downloaded all the files for this week (data.csv, map_reduce_simulator.py, mapper.py, and reducer.py) from our GitHub repository (https://github.sydney.edu.au/COMP5349-Cloud-Computing-2022/python-resources/tree/master/week1), and saved them in the same directory as this notebook.

Execute the cell block below (click it with your mouse and either press the play button on the toolbar above or hit Ctrl+Enter) to see what the file looks like. 

In [1]:
import csv

header = ['user_id','film_id', 'rating', 'timestamp']
print(header)
num_lines = 12
with open('data.csv', 'r') as data_file:
    data_reader = csv.reader(data_file, delimiter='\t')
    for i in range(0, num_lines):
        print(next(data_reader))

['user_id', 'film_id', 'rating', 'timestamp']
['196', '242', '3', '881250949']
['186', '302', '3', '891717742']
['22', '377', '1', '878887116']
['244', '51', '2', '880606923']
['166', '346', '1', '886397596']
['298', '474', '4', '884182806']
['115', '265', '2', '881171488']
['253', '465', '5', '891628467']
['305', '451', '3', '886324817']
['6', '86', '3', '883603013']
['62', '257', '2', '879372434']
['286', '1014']


There are 4 fields: <i>user_id</i>, <i>film_id</i>, <i>rating</i>, and <i>timestamp</i>. 

<i>user_id</i>, <i>film_id</i>, and <i>rating</i> are integers. The value for <i>rating</i> ranges from 1 to 5.

<b>Note</b> that some lines are incomplete (notice how the last line above, for <i>user_id</i> 286 and <i>film_id</i> 1014, is missing its <i>rating</i> and <i>timestamp</i>).
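One way to guard against such incomplete lines is sketched below. This is only an illustration: the helper name `parse_rating_line` is hypothetical, and it assumes each line arrives as a raw tab-delimited string, which may not be how map_reduce_simulator.py actually passes lines to your code.

```python
def parse_rating_line(line):
    """Return (user_id, film_id, rating) as integers, or None if the line is unusable.

    Assumption: `line` is a raw tab-delimited string such as "196\t242\t3\t881250949".
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        # Incomplete record, e.g. "286\t1014" with no rating or timestamp.
        return None
    try:
        return int(fields[0]), int(fields[1]), int(fields[2])
    except ValueError:
        # A field that should be an integer is not.
        return None


sample_lines = ["196\t242\t3\t881250949\n", "286\t1014\n"]
for sample in sample_lines:
    print(parse_rating_line(sample))
```

The first sample line parses to a tuple of three integers, while the truncated second line yields None and can simply be skipped.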


Your homework task is to implement two missing classes, <b>RatingFilter</b> and <b>RatingReducer</b>, so that the final program determines the average film rating (as a float value rounded to 1 decimal place) for those films whose ID is in a given range.
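To make the overall flow concrete, here is a toy version of the filter, group, and aggregate steps. It is a sketch under stated assumptions and not the code in map_reduce_simulator.py: the names `toy_map_reduce`, `keep_films_1_to_10`, and `mean_rating` are made up for this illustration, and plain functions stand in for the two classes you will write.

```python
from collections import defaultdict


def toy_map_reduce(lines, map_fn, reduce_fn):
    """Apply map_fn to every line, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        pair = map_fn(line)
        if pair is not None:          # None means "this line was filtered out"
            key, value = pair
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def keep_films_1_to_10(line):
    """Hypothetical map step: emit (film_id, rating) for film IDs 1 to 10 only."""
    fields = line.split("\t")
    if len(fields) < 3:               # incomplete line, no rating available
        return None
    film_id, rating = int(fields[1]), int(fields[2])
    return (fields[1], rating) if 1 <= film_id <= 10 else None


def mean_rating(key, values):
    """Hypothetical reduce step: average of the ratings, rounded to 1 decimal place."""
    return round(sum(values) / len(values), 1)


lines = ["1\t4\t3\t100", "2\t4\t4\t101", "3\t7\t5\t102", "4\t242\t3\t103", "5\t9"]
result = toy_map_reduce(lines, keep_films_1_to_10, mean_rating)
print(result)  # {'4': 3.5, '7': 5.0}
```

Film 242 is outside the range and the last line has no rating, so only films 4 and 7 reach the reduce step.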

<h2>RatingFilter</h2>

The <i>RatingFilter</i> class is a specialisation of the generic Filter class. You have to implement:
* <code style="color:green">&#95;&#95;init&#95;&#95;(self, start_movieid, end_movieid)</code> - an initialiser that allows <i>RatingFilter</i> to store the search range of the film IDs (start_movieid and end_movieid)
* <code style="color:green">filter(self, line)</code> - a filter method that is called once for each line of the input file above (passed as the line parameter). If the film on that line has an ID within the requested range, the method must return a tuple (key, value) with the <i>film_id</i> as the key and the <i>rating</i> as the value. If the film is not within the given ID range, just return None.

The <i>RatingFilter</i> class is located at the bottom of the [mapper.py file](../edit/mapper.py). Scroll all the way to the bottom and replace the <code style="color:red">raise</code> <code style="color:blue">NotImplementedError</code> line with your implementation.
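The behaviour described above could look roughly like the following. Treat it as a hedged sketch rather than the reference solution: `RatingFilterSketch` is a hypothetical standalone class that does not subclass the real Filter in mapper.py (whose interface, including the `get_num_calls` and `get_num_records` counters used in the check cell, is not shown here). It also assumes that the range is inclusive at both ends, that the key is returned as a string, and that `line` is either a raw tab-delimited string or an already split list of fields.

```python
class RatingFilterSketch:
    """Standalone illustration of the filter logic; not a subclass of the real Filter."""

    def __init__(self, start_movieid, end_movieid):
        # Remember the inclusive film ID range to search for.
        self.start_movieid = start_movieid
        self.end_movieid = end_movieid

    def filter(self, line):
        # Accept either a raw string or a pre-split list (assumption, see lead-in).
        if isinstance(line, str):
            fields = line.rstrip("\n").split("\t")
        else:
            fields = list(line)
        if len(fields) < 3:
            return None               # incomplete line: no rating to report
        try:
            film_id = int(fields[1])
            rating = int(fields[2])
        except ValueError:
            return None               # malformed numbers
        if self.start_movieid <= film_id <= self.end_movieid:
            return (str(film_id), rating)
        return None


sketch = RatingFilterSketch(1, 10)
print(sketch.filter("308\t5\t4\t887737890"))   # ('5', 4)
print(sketch.filter("196\t242\t3\t881250949"))  # None
```

In your actual solution the same logic goes into the `RatingFilter` class in mapper.py, keeping whatever bookkeeping the Filter base class expects.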

<h2>RatingReducer</h2>

The <i>RatingReducer</i> class is a specialisation of the generic Reducer class. You have to implement:

* <code style="color:green">reduce(self, key, values)</code> - a method which computes the average rating (as Float) of all the given input ratings of the same film. 
This method is called by our map_reduce_simulator with the film_id as the <i>key</i> (String) and, as the <i>values</i>, a list of the Integer ratings that various users gave to this film. 
It should compute the average of all the ratings given, and return it as a Float value rounded to one decimal place (e.g. 3.5).

The <i>RatingReducer</i> class is located at the bottom of the [reducer.py file](../edit/reducer.py). Scroll all the way to the bottom and replace the <code style="color:red">raise</code> <code style="color:blue">NotImplementedError</code> line with your implementation.
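The averaging step might be sketched as follows. As before, this is an illustration under assumptions: `RatingReducerSketch` is a hypothetical standalone class rather than a subclass of the real Reducer in reducer.py, and returning None for an empty list is a choice made here, not something the homework specifies.

```python
class RatingReducerSketch:
    """Standalone illustration of the reduce logic; not a subclass of the real Reducer."""

    def reduce(self, key, values):
        # `key` is the film_id; `values` is the list of ratings for that film.
        ratings = [int(v) for v in values]
        if not ratings:
            return None               # assumption: nothing to average
        # Average as a float, rounded to one decimal place.
        return round(sum(ratings) / len(ratings), 1)


reducer_sketch = RatingReducerSketch()
print(reducer_sketch.reduce("5", [3, 4]))     # 3.5
print(reducer_sketch.reduce("4", [3, 4, 4]))  # 3.7
```

Python 3's `/` operator always performs float division, so no explicit conversion to float is needed before rounding.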

Once you have implemented these two classes, execute the cell directly below to set up some modules.

<b>Note</b>: you only need to execute the cell directly below once.

In [54]:
%load_ext autoreload

from mapper import RatingFilter
from reducer import RatingReducer
from map_reduce_simulator import map_reduce


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Then run the cell below to check your implementation.

In [74]:
%autoreload 2

filename = "data.csv"
start_movie_id = 1
end_movie_id = 10

rating_filter = RatingFilter(start_movie_id, end_movie_id)
rating_reducer = RatingReducer()
stats = map_reduce(filename, rating_filter, rating_reducer)

print("MapReduce Simulator using mapper:  " + repr(rating_filter))
print("MapReduce Simulator using reducer: " + repr(rating_reducer))
print("Lines in File:    {} records (should be: {})".format(rating_filter.get_num_calls(), stats[0]))
print("Filtered records: {} (should be: {})".format(rating_filter.get_num_records(), stats[1]))

4: 3.6
5: 3.3
2: 3.2
7: 3.8
3: 3.0
8: 4.0
9: 3.9
6: 3.5
MapReduce Simulator using mapper:  <mapper.RatingFilter object at 0x7fe9abb3a0d0>
MapReduce Simulator using reducer: <reducer.RatingReducer object at 0x7fe9abb3a0a0>
Lines in File:    100000 records (should be: 100000)
Filtered records: 1453 (should be: 1450)


If your code is correct, it should produce the following output:

* 4: 3.6
* 5: 3.3
* 2: 3.2
* 10: 3.8
* 7: 3.8
* 3: 3.0
* 1: 3.9
* 8: 4.0
* 9: 3.9
* 6: 3.5
* MapReduce Simulator using mapper:  <mapper.RatingFilter object at ...>
* MapReduce Simulator using reducer: <reducer.RatingReducer object at ...>
* Lines in File:    100000 records (should be: 100000)
* Filtered records: 1990 (should be: 1990)

If not, try again and run the cell directly above this to check the result again.

Feel free to play around with different <i>film_id</i> ranges.

Good Luck!