# Assignment 1


## PART 1:  Manipulating JSON Data from Food Fact Database to Extract Key Information

This assignment focuses on handling a large dataset from the [Open Food Facts database](https://world.openfoodfacts.org/), which describes itself as:

```
[...] a food products database made by everyone, for everyone. You can use it to make better food choices, and as it is open data, anyone can re-use it for any purpose.
```

This first assignment is not heavy on analytics. Instead, it is designed to give you a taste of the challenges that come with handling large data files — long download times, limited storage, and slow data processing. While these inconveniences are minor in this assignment, they are just the tip of the iceberg when it comes to real-world big data analytics tasks.


Our dataset, which can be obtained [here](https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz), is a JSONL file containing 3 million products, each with hundreds or even thousands of fields. For example, the data for one such product, available [here](https://world.openfoodfacts.org/api/v0/product/5060292302201.json), contains 1304 fields.

We are only interested in the top 5 ingredients used in each product. Therefore, the primary objective is to parse each JSONL object and extract specific fields of interest: the product code, its description, and the details of its five most prevalent ingredients. You will then save this filtered information in a TSV (Tab-Separated Values) file.
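Because the file is large, one option is to stream it line by line instead of decompressing and loading it all at once. The sketch below assumes the compressed file has already been downloaded locally; the helper name `iter_products` and its `limit` parameter are illustrative and not part of the assignment.

```python
import gzip
import json

def iter_products(path, limit=None):
    """Yield one parsed product (a dict) per line of a gzipped JSONL file."""
    # "rt" opens the gzip stream in text mode, so lines are decoded lazily.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for n, line in enumerate(f):
            if limit is not None and n >= limit:
                break
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Since this is a generator, only one line is held in memory at a time, which matters far more for a multi-gigabyte file than for the single sample product used below.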



### Relevance to Big Data Analytics

Though this assignment might not involve high-level data science techniques, it's fundamentally crucial for big data analytics. It teaches you how to handle, manipulate, and filter large datasets, which are essential skills you'll need before moving on to more complex operations in analytics.

## Detailed Requirements


## Working with a Sample Product: How to Extract Fields of Interest

To give you a hands-on example, let's look at a sample product and see how to extract the fields we care about.

1. **Downloading the Sample Product Data:**  
	Start by downloading the JSON file for a specific product using the `wget` command.  
	```bash
		!wget https://world.openfoodfacts.org/api/v0/product/5060292302201.json
	```

2. **Reading the Product Info:**  
	Next, use Python's `json` library to read this file.  
	```python
		import json
		# Use a context manager so the file is closed after reading.
		with open("./5060292302201.json") as f:
			product_info = json.load(f)
	```

3. **Getting the Product ID:**  
	You can find the product ID for the sample item like this:  
	```python
		product_id = product_info['code']
	```
	This will return `'5060292302201'`.

4. **Getting the Product Name:**  
	The product name can be accessed using the following code:  
	```python
		product_name = product_info['product']['product_name']
	```

5. **Listing the Ingredients:**  
	To get the list of ingredients for the product, use the code below:  
	```python
		ingredients_list = product_info['product']['ingredients']
	```
	This will return a list of dictionaries, each containing details about an ingredient. For example, the first two elements in the list look like this:

	```python
		[
			{
				'ciqual_food_code': '4003',
				'id': 'en:potato',
				'percent': 54,
				'percent_estimate': 54,
				'percent_max': 54,
				'percent_min': 54,
				'processing': 'en:dried',
				'rank': 1,
				'text': 'potatoes',
				'vegan': 'yes',
				'vegetarian': 'yes'
			},
			{
				'ciqual_food_code': '17440',
				'from_palm_oil': 'no',
				'id': 'en:sunflower-oil',
				'percent_estimate': 28.75,
				'percent_max': 46,
				'percent_min': 11.5,
				'rank': 2,
				'text': 'sunflower oil',
				'vegan': 'yes',
				'vegetarian': 'yes'
			}
		]
	```
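Note that the two ingredients above do not carry the same keys: the second one has no `percent` field, for instance. A minimal sketch of reading such optional fields safely with `dict.get` follows; the `sample_ingredients` list is a trimmed copy of the example above, used only for illustration.

```python
sample_ingredients = [
    {"id": "en:potato", "percent": 54, "percent_estimate": 54, "rank": 1},
    {"id": "en:sunflower-oil", "percent_estimate": 28.75, "rank": 2},
]

for ingredient in sample_ingredients:
    # dict.get returns None instead of raising KeyError when a key is absent.
    declared = ingredient.get("percent")
    estimate = ingredient.get("percent_estimate")
    print(ingredient["id"], declared, estimate)
```

With the full dataset, missing keys are likely to be common, so defensive access of this kind may save some debugging later.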


In [None]:
# Load the test product file with !wget

!wget https://world.openfoodfacts.org/api/v0/product/5060292302201.json

In [None]:
# Read file and create key file objects

import json
with open("./5060292302201.json") as f:
    product_info = json.load(f)
product_total = product_info['product']
product_ingredients = product_total['ingredients']

### Question 1.

In this part of the assignment, you are tasked with identifying the top 5 ingredients by their `percent_estimate` field. You will display these ingredients in a table, sorted from the highest `percent_estimate` to the lowest. Each ingredient also has a `rank` field (see above), so in practice we only need to consider the ingredients whose `rank` is 1, 2, 3, 4, or 5.

#### Background on Functional Programming:

We have discussed the importance of functional programming, as used in libraries such as Pandas, PyArrow, and, as we will see, pySpark. You can leverage this programming style in Python as well. Specifically, for this assignment, you will need to use the `filter` function to remove irrelevant ingredients from the list. 

The `filter` function takes a function and an iterable (such as a list). It applies the function to each element, keeping the elements for which the function returns `True` and discarding those for which it returns `False`.
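One detail worth keeping in mind is that `filter` returns a lazy iterator rather than a list. The short sketch below, using made-up rank values and a hypothetical `is_top_five` predicate, shows that the result matches a list comprehension and that the iterator can be consumed only once.

```python
ranks = [3, 7, 1, 12, 5]

def is_top_five(rank):
    return rank <= 5

lazy = filter(is_top_five, ranks)            # nothing is computed yet
kept = list(lazy)                            # consuming the iterator gives [3, 1, 5]
same = [r for r in ranks if is_top_five(r)]  # equivalent list comprehension
leftover = list(lazy)                        # [] because the iterator is exhausted
```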

##### Example:

For instance, let's say you want to filter out strings that have more than 4 characters. You could use the following code:


In [None]:
some_values = ["hi", "honolulu", "something", "by", "relevance", "ok"]

def filter_long_strs(x):
    # Keep strings of at most 4 characters; longer ones are dropped.
    return len(x) <= 4

output = filter(filter_long_strs, some_values)
list(output)  # Output will be ['hi', 'by', 'ok']

In [None]:
# Temp task 1

def rank_filter(product_ingredients):
    if type(product_ingredients.get('rank')) == int and product_ingredients.get('rank') <= 5:
        return True
    else:
        return False

output = filter(rank_filter, product_ingredients)
# list(output)

##### Task:

Write a function named `filter_irrelevant_ingredients` that takes a list of ingredient objects (dictionaries like the example provided earlier) as input. Your function should filter out any ingredients whose `rank` is not 1, 2, 3, 4, or 5.

Here's the function signature for reference. Please do not change the function name, as it is required for autograding. Please test your code thoroughly before submitting the assignment.


In [None]:
def filter_irrelevant_ingredients(product_ingredients):
    """Return only the ingredients whose integer `rank` is 1, 2, 3, 4, or 5."""
    def has_top_rank(ingredient):
        rank = ingredient.get('rank')
        return isinstance(rank, int) and 1 <= rank <= 5
    return list(filter(has_top_rank, product_ingredients))

top_ingredients = filter_irrelevant_ingredients(product_ingredients)
# top_ingredients

### Question 2. 

#### Objective:

In the second part of the assignment, your task is to remove the "en:" prefix from each ingredient's `id` field. This operation is another example of functional programming, similar to the first part of this assignment. Specifically, you will employ the `map` function.
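When stripping the prefix, a common pitfall is reaching for `str.lstrip`, which removes a set of characters rather than an exact substring. The sketch below contrasts it with `str.removeprefix` (available from Python 3.9); the ingredient id is made up for illustration.

```python
ingredient_id = "en:nectarine"

# str.removeprefix removes the exact leading substring, at most once.
clean = ingredient_id.removeprefix("en:")    # 'nectarine'

# str.lstrip treats "en:" as the character set {'e', 'n', ':'}, so it keeps
# stripping into the ingredient name itself.
wrong = ingredient_id.lstrip("en:")          # 'ctarine'

# Fallback for Python versions older than 3.9.
prefix = "en:"
fallback = ingredient_id[len(prefix):] if ingredient_id.startswith(prefix) else ingredient_id
```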

#### Background on Functional Programming:

The `map` function applies a specified function to each item in an iterable (like a list). For example, if you want to get the first letter of each string in a list, you define a function that takes a single string and returns its first letter. You can then apply, or map, this function over the elements of a list. Here is an example.




In [None]:
def get_first_letter(x):
    return x[0]

some_strings = ["hi", "bye", "ok"]
output = map(get_first_letter, some_strings)
list(output)  # Output will be ['h', 'b', 'o']


#### Task:

Write a function that removes the "en:" prefix from the `id` field of each ingredient in a list. Your function should take a list of ingredients (preferably, the top 5 filtered by the `filter_irrelevant_ingredients` function from question 1) and return a new list of objects where the `id` field of each ingredient is stripped of the "en:" prefix.

Here's the function signature for reference. Please maintain this function name for autograding purposes. Please test your code to ensure its accuracy before submitting this part of the assignment.


In [None]:
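def remove_en_prefix(top_ingredients):
    """Return new ingredient dicts whose `id` has the "en:" prefix removed."""
    def strip_prefix(ingredient):
        # Copy the dict so the input list is left unchanged (removeprefix needs Python 3.9+).
        return {**ingredient, 'id': ingredient['id'].removeprefix('en:')}
    return list(map(strip_prefix, top_ingredients))

top_ingredients_no_en = remove_en_prefix(top_ingredients)
# top_ingredients_no_en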

### Question 3.

#### Objective:

For the third part of the assignment, you will use functional programming constructs (`map` and `filter`) to create a function named `get_string_row`. This function will take a list of all ingredients (e.g., `product_info['product']['ingredients']`) and produce a string that can be considered a row in a table. This row should contain 12 fields, separated by tabs, in a specific order as detailed below.

#### Requirements:

`Product ID<tab>Product Name<tab>ingredient_1<tab>percentage_ingredient_1<tab> ...`

The objective is to combine the `map` and `filter` functions developed above to transform the given product information and ingredients list into a row for a table. This function will aid in the creation of a dataset where each row represents a different product. Note that, in addition to the map and filter steps, you will need to sort the ingredients by their rank or percentage.
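For the sorting step, one possible approach is `sorted` with a key function. The sketch below uses a small hand-made list (values borrowed from the sample product, with the `rank` of the third entry assumed for illustration) and shows ordering by `rank` and by `percent_estimate`.

```python
from operator import itemgetter

ingredients = [
    {"id": "coating", "rank": 3, "percent_estimate": 8.625},
    {"id": "potato", "rank": 1, "percent_estimate": 54},
    {"id": "sunflower-oil", "rank": 2, "percent_estimate": 28.75},
]

# Ascending rank: 1 is the most prevalent ingredient.
by_rank = sorted(ingredients, key=itemgetter("rank"))

# Descending percentage gives the same order for this sample.
by_percent = sorted(ingredients, key=itemgetter("percent_estimate"), reverse=True)

print([i["id"] for i in by_rank])  # ['potato', 'sunflower-oil', 'coating']
```

Sorting by `rank` is probably the safer choice when some ingredients share the same `percent_estimate`, since ties in the percentage leave the relative order to the input order.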

For example, after applying all transformations, the product sample might be presented in the table as:



In [None]:
print("5060292302201\tBarbeque Potato Chips\tpotato\t54\tsunflower-oil\t28.75\tcoating\t8.625\trice-flour\t4.3125\tpotato-starch\t4.3125")


In [None]:
# Temp line extraction

lines = []

lines.append(product_total['id'])
lines.append(product_total['product_name'])

for field in top_ingredients_no_en:
    line = '\t'.join([field['id'], str(field['percent_estimate'])])

    lines.append(line)

    row_t = '\t'.join(lines)

In [None]:
print(row_t)


#### Task:

Write a function called `get_string_row` that takes a list of all ingredients and turns it into a string that represents a row of data, following the requirements and example format above. Please maintain this function name for autograding purposes. Ensure you test your function thoroughly before submitting this portion of the assignment.





In [None]:
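def get_string_row(product_total):
    """Return one tab-separated row: product ID, product name, then the id and
    percent_estimate of each top-5 ingredient, ordered by rank."""
    ingredients = product_total.get('ingredients', [])
    # Keep only ranks 1 to 5, then sort so the most prevalent ingredient comes first.
    top = sorted(
        filter(lambda i: isinstance(i.get('rank'), int) and 1 <= i['rank'] <= 5, ingredients),
        key=lambda i: i['rank'],
    )
    fields = [str(product_total.get('id', '')), str(product_total.get('product_name', ''))]
    for ingredient in top:
        # Strip the "en:" prefix from the id and pair it with its percentage.
        fields.append(ingredient.get('id', '').removeprefix('en:'))
        fields.append(str(ingredient.get('percent_estimate', '')))
    return '\t'.join(fields)

row_f = get_string_row(product_total)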

In [None]:
print(row_f)

### Question 4

For the fourth part of the assignment, your task is to read all the product data available in the file `openfoodfacts-products.json`. For each product in this file, use the `get_string_row` function to generate a row of data as specified in the previous question. Write each of these rows to an output file. 

#### Requirements:

1. Read the `openfoodfacts-products.json` file.
2. Parse the 100k first products and generate a row for each, using the `get_string_row` function.
3. Write these rows to an output file called `openfoodfacts-products.tsv`. You will need to also push this file to your assignment repository.
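The three requirements above could be sketched roughly as follows, using `itertools.islice` to stop after the first 100k lines. The helper name `write_rows` and its parameters are illustrative; `make_row` stands in for `get_string_row`, under the assumption that it accepts one parsed product. Malformed lines and products with missing fields are not handled here.

```python
import json
from itertools import islice

def write_rows(jsonl_path, tsv_path, make_row, limit=100_000):
    """Write make_row(product) for the first `limit` lines of a JSONL file."""
    written = 0
    with open(jsonl_path, "r", encoding="utf-8") as src, \
         open(tsv_path, "w", encoding="utf-8") as dst:
        # islice stops reading after `limit` lines, so the rest of the file
        # is never loaded.
        for line in islice(src, limit):
            product = json.loads(line)
            dst.write(make_row(product) + "\n")
            written += 1
    return written
```

A call might look like `write_rows('openfoodfacts-products.json', 'openfoodfacts-products.tsv', get_string_row)`.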

#### Task:

You can use any and all Python libraries you may need, but for simplicity, you may find that pure Python and the `json` library already imported should suffice. An example snippet for reading each JSON object is provided below.



In [None]:
with open('openfoodfacts-products.json', 'r') as f:
    for line in f:
        # The file is JSONL: every line holds one JSON object (one product).
        product = json.loads(line)
        # complete code here


### Question 5. 

Aren't you curious to know which ingredients are the most common in these products? For the fifth part of the assignment, your task is to parse the table you've generated and count the occurrences of each ingredient.

#### Requirements:
* Read the output file generated in the previous question.
* Count the number of occurrences of each ingredient across all 100k products.
* Store the list of the 5 most prevalent ingredients in a variable called `most_prevalent_ingredients`. Note that the names of the ingredients are sufficient.
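One way to approach the counting, sketched here under the assumption that each row follows the layout from Question 3 (ID, name, then alternating ingredient and percentage), is `collections.Counter`. The function name `count_ingredients` and the sample rows are made up for illustration.

```python
from collections import Counter

def count_ingredients(rows):
    """Count how often each ingredient name appears across TSV rows."""
    counts = Counter()
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        # Fields 0 and 1 are the product ID and name; ingredient names sit at
        # indexes 2, 4, 6, ... with their percentages in between.
        counts.update(name for name in fields[2::2] if name)
    return counts

sample_rows = [
    "1\tChips\tpotato\t54\tsunflower-oil\t28.75\n",
    "2\tFries\tpotato\t90\tsalt\t1\n",
]
counts = count_ingredients(sample_rows)
top = [name for name, _ in counts.most_common(1)]  # ['potato']
```

Since an open file is itself an iterable of lines, the same function could be given the file object for the generated TSV directly.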

#### Task:
Again, you're free to use any Python libraries that can assist you in completing this task, but pure Python and the `json` library should suffice. Be sure to run the cell with the asserts to check that the variable is indeed a list and that it contains 5 elements.

In [None]:
# do analysis

most_prevalent_ingredients = ...

In [None]:
assert type(most_prevalent_ingredients) == list
assert len(most_prevalent_ingredients) == 5

## PART 2: Implementing a MapReduce Algorithm for Letter Counting

### Question 1

After introducing a basic MapReduce framework in the class, your task is to write a MapReduce application that converts input text to lower case and counts the occurrences of each letter of the alphabet within a list of strings. This problem is somewhat similar to the word count example, but instead of counting words, you will be counting individual letters.
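As a refresher, the word count pattern mentioned above can be sketched in plain Python without the framework. The function names below are illustrative and are not part of the `SimpleMapReduce` class; they only mirror its map, shuffle, and reduce phases.

```python
from collections import defaultdict

def map_words(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    # Reduce: collapse each key's list of values into a single total.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["the cat", "The dog"]
pairs = [pair for line in lines for pair in map_words(line)]
word_counts = reduce_counts(shuffle(pairs))
print(word_counts)  # {'the': 2, 'cat': 1, 'dog': 1}
```

The letter-counting task follows the same shape; only the map step changes, emitting one pair per letter instead of one per word.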

#### Requirements:

1. Utilize the SimpleMapReduce class to create a new application for counting letter occurrences. The class code is provided below.
2. Implement the `flatMapper` function to map each letter to a key-value pair.
3. Implement the `reducer` as needed.
4. Use the `post_process` function for any additional operations, like sorting the letters by their occurrences.



    

In [1]:
from collections import defaultdict
from typing import List, Tuple, Iterable, Any

class SimpleMapReduce:
    """
    SimpleMapReduce is a minimalistic framework designed to implement 
    the MapReduce programming model. This class provides the base 
    functionalities for a MapReduce job, including mapping, shuffling,
    and reducing phases.
    
    Methods:
        getData(input_data: List[Any]) -> Iterable[Tuple[int, Any]]:
            Enumerates the input data, which can be of any type.
        
        flatMapper(k1: int, v1: Any) -> Iterable[Tuple[Any, Any]]:
            Takes a key-value pair and returns an iterable of key-value pairs.
            Must be implemented by subclasses.
        
        reducer(k2: Any, v2s: Iterable[Any]) -> Tuple[Any, Any]:
            Reduces a list of values that share the same key to a single 
            key-value pair. Must be implemented by subclasses.
            
        _flatMap(mass_k1v1: Iterable[Tuple[int, Any]]) -> List[Tuple[Any, Any]]:
            Helper function that applies the flatMapper function to the input data.
        
        _shuffle(mapped: List[Tuple[Any, Any]]) -> Iterable[Tuple[Any, List[Any]]]:
            Helper function that groups the mapped data by keys.
            
        _reduce(mass_k2v2s: Iterable[Tuple[Any, List[Any]]]) -> List[Tuple[Any, Any]]:
            Helper function that applies the reducer function to the shuffled data.
        
        _run(input_data: List[Any]) -> List[Tuple[Any, Any]]:
            Executes the MapReduce job in the order of mapping, shuffling, and reducing.
            
    To use this class, you must subclass it and implement the flatMapper and 
    reducer methods.
    """

    
    @staticmethod
    def getData(input_data: List[Any]) -> Iterable[Tuple[int, Any]]:
        return enumerate(input_data)
    
    @staticmethod
    def flatMapper(k1: int, v1: Any) -> Iterable[Tuple[Any, Any]]:
        raise NotImplementedError()
        
    @staticmethod
    def reducer(k2: Any, v2s: Iterable[Any]) -> Tuple[Any, Any]:
        raise NotImplementedError()
    
    @classmethod
    def _flatMap(cls, mass_k1v1: Iterable[Tuple[int, Any]]) -> List[Tuple[Any, Any]]:
        output = []
        for k1, v1 in mass_k1v1:
            for k2, v2 in cls.flatMapper(k1, v1):
                output.append((k2, v2))
        return output
    
    @classmethod
    def _shuffle(cls, mapped: List[Tuple[Any, Any]]) -> Iterable[Tuple[Any, List[Any]]]:
        d = defaultdict(list)
        for k2, v2 in mapped:
            d[k2].append(v2)
        return d.items()
    
    @classmethod
    def _reduce(cls, mass_k2v2s: Iterable[Tuple[Any, List[Any]]]) -> List[Tuple[Any, Any]]:
        output = []
        for k2, v2s in mass_k2v2s:
            output.append(cls.reducer(k2, v2s))
        return output
    
    @classmethod
    def _run(cls, input_data: List[Any]) -> List[Tuple[Any, Any]]:
        # Load data
        data = cls.getData(input_data)
        
        # Map
        mapped = cls._flatMap(data)
        
        # Shuffle
        shuffled = cls._shuffle(mapped)
        
        # Reduce
        reduced = cls._reduce(shuffled)
        
        return reduced

#### Task:

- Implement your code in the scaffold below. 

- The reducer function should return a dictionary with the counts of each letter after converting the input to lower case. Ensure that you test your code using the asserts provided, which test that the type of the object returned is a dictionary and that the counts match what would be expected for a simple test case.


* Optional (not-graded): use the post_process method to sort the letters based on their occurrence counts. For example, if `_run`  output is 
`{'h': 2, 'e': 3, 'l': 2, 'o': 1, 't': 1, 'r': 1}`
`post_process` would return:
`[('e', 3), ('h', 2), ('l', 2), ('o', 1), ('r', 1), ('t', 1)]`
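The ordering in this example (highest count first, ties broken alphabetically) could be produced with a composite sort key, as in the sketch below. The standalone function name `sort_by_count` is illustrative; it accepts either a dict or a list of `(letter, count)` pairs.

```python
def sort_by_count(counts):
    # dict() accepts both a mapping and a list of (key, value) pairs.
    pairs = dict(counts).items()
    # Negating the count sorts it in descending order; the letter breaks ties.
    return sorted(pairs, key=lambda kv: (-kv[1], kv[0]))

example = {'h': 2, 'e': 3, 'l': 2, 'o': 1, 't': 1, 'r': 1}
print(sort_by_count(example))
# [('e', 3), ('h', 2), ('l', 2), ('o', 1), ('r', 1), ('t', 1)]
```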

In [None]:
# Temp
t1 = ['hello']

t2 = t1.pop()
list(t2)

In [2]:
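class LetterCountMapReduce(SimpleMapReduce):

    @staticmethod
    def flatMapper(k1: int, v1: str) -> Iterable[Tuple[str, int]]:
        # Emit (letter, 1) for every alphabetic character in the whole string,
        # lower-cased; spaces, digits, and punctuation are skipped.
        for c in v1.lower():
            if c.isalpha():
                yield (c, 1)

    @staticmethod
    def reducer(k2: str, v2s: List[int]) -> Tuple[str, int]:
        return k2, sum(v2s)

    @classmethod
    def post_process(cls, counts: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
        # Sort by count (descending), breaking ties alphabetically.
        return sorted(counts, key=lambda kv: (-kv[1], kv[0]))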

In [3]:
input_data = ["Hello", "There"]
my_sample_data = LetterCountMapReduce.getData(input_data)

In [4]:
list(my_sample_data)

[(0, 'Hello'), (1, 'There')]

In [5]:
my_sample_data = LetterCountMapReduce.getData(input_data)
my_sample_data_mapped = LetterCountMapReduce._flatMap(my_sample_data)
print(my_sample_data_mapped)

[('h', 1), ('e', 1), ('l', 1), ('l', 1), ('o', 1), ('t', 1), ('h', 1), ('e', 1), ('r', 1), ('e', 1)]


In [6]:
my_sample_data_mapped_shuffled =  LetterCountMapReduce._shuffle(my_sample_data_mapped)
my_sample_data_mapped_shuffled

dict_items([('h', [1, 1]), ('e', [1, 1, 1]), ('l', [1, 1]), ('o', [1]), ('t', [1]), ('r', [1])])

In [7]:
LetterCountMapReduce._reduce(my_sample_data_mapped_shuffled)

[('h', 2), ('e', 3), ('l', 2), ('o', 1), ('t', 1), ('r', 1)]

In [8]:
input_data = ["Hello", "There"]
LetterCountMapReduce._run(input_data)

[('h', 2), ('e', 3), ('l', 2), ('o', 1), ('t', 1), ('r', 1)]

In [9]:
input_data = ["Hello", "There"]
correct_output = [('h', 2), ('e', 3), ('l', 2), ('o', 1), ('t', 1), ('r', 1)]
output = LetterCountMapReduce._run(input_data)
assert output == correct_output
assert type(output) == list