# Assignment group 2: Network and exploratory data analysis

## Module D _(40 pts)_ An ingredient-based recommender system
In this module we're going to build a recommender system using recipe data and the Apriori algorithm. The data can be obtained from Kaggle:

- https://www.kaggle.com/kaggle/recipe-ingredients-dataset

and are packaged with the assignment in the following directory:

- `./data/train.json`

__D1.__ _(2 pts)_ To start, load the recipe data from `json` format and print the first 5 recipes.

In [66]:
## code here
import pandas as pd
import numpy as np
import json
# load the recipes; the context manager closes the file once it has been read
with open('./data/train.json') as json_data:
    data = json.load(json_data)
# print the first 5 recipes
recipes = data[0:5]
print(recipes)


[{'id': 10259, 'cuisine': 'greek', 'ingredients': ['romaine lettuce', 'black olives', 'grape tomatoes', 'garlic', 'pepper', 'purple onion', 'seasoning', 'garbanzo beans', 'feta cheese crumbles']}, {'id': 25693, 'cuisine': 'southern_us', 'ingredients': ['plain flour', 'ground pepper', 'salt', 'tomatoes', 'ground black pepper', 'thyme', 'eggs', 'green tomatoes', 'yellow corn meal', 'milk', 'vegetable oil']}, {'id': 20130, 'cuisine': 'filipino', 'ingredients': ['eggs', 'pepper', 'salt', 'mayonaise', 'cooking oil', 'green chilies', 'grilled chicken breasts', 'garlic powder', 'yellow onion', 'soy sauce', 'butter', 'chicken livers']}, {'id': 22213, 'cuisine': 'indian', 'ingredients': ['water', 'vegetable oil', 'wheat', 'salt']}, {'id': 13162, 'cuisine': 'indian', 'ingredients': ['black pepper', 'shallots', 'cornflour', 'cayenne pepper', 'onions', 'garlic paste', 'milk', 'butter', 'salt', 'lemon juice', 'water', 'chili powder', 'passata', 'oil', 'ground cumin', 'boneless chicken skinless thig

__D2.__ _(5 pts)_ Next, use `from collections import Counter` to write a function called `count_items(recipes)` that counts the number of recipes that include each `ingredient`, storing each in the counter as a single-element tuple (for downstream convenience), i.e., incrementing like `counts[tuple([ingredient])] += 1`.

When complete, exhibit this function's utility by applying it to the `recipes` loaded in __D1__ and print the number of 'candidates' in the output.

In [13]:
## code here
from collections import Counter
def count_items(recipes):
    counts = Counter()
    for recipe in recipes:
        ingredients = recipe['ingredients']
        for ingredient in ingredients:
            counts[tuple([ingredient])] += 1
    return counts

# test
count_items(recipes)

Counter({('romaine lettuce',): 1,
         ('black olives',): 1,
         ('grape tomatoes',): 1,
         ('garlic',): 1,
         ('pepper',): 2,
         ('purple onion',): 1,
         ('seasoning',): 1,
         ('garbanzo beans',): 1,
         ('feta cheese crumbles',): 1,
         ('plain flour',): 1,
         ('ground pepper',): 1,
         ('salt',): 4,
         ('tomatoes',): 1,
         ('ground black pepper',): 1,
         ('thyme',): 1,
         ('eggs',): 2,
         ('green tomatoes',): 1,
         ('yellow corn meal',): 1,
         ('milk',): 2,
         ('vegetable oil',): 2,
         ('mayonaise',): 1,
         ('cooking oil',): 1,
         ('green chilies',): 1,
         ('grilled chicken breasts',): 1,
         ('garlic powder',): 1,
         ('yellow onion',): 1,
         ('soy sauce',): 1,
         ('butter',): 2,
         ('chicken livers',): 1,
         ('water',): 2,
         ('wheat',): 1,
         ('black pepper',): 1,
         ('shallots',): 1,
         ('cor

__D3.__ _(5 pts)_ Now, write a function called `store_frequent(candidates, threshold = 25)`, which accepts a `Counter` of `candidates`, i.e., item or itemset counts, and stores only those with a count above the given `threshold` value in a separate counter called `frequent`, which is `return`ed at the end of the function. Apply this function to your output from __D2__ with the default `threshold` value of `25` to exhibit your function's utility, and then print the number of frequent items found.

In [80]:
## code here
def store_frequent(candidates, threshold=25):
    # keep only the itemsets whose count is above the threshold
    frequent = Counter()
    for itemset, count in candidates.items():
        if count > threshold:
            frequent[itemset] = count
    return frequent

# count single ingredients across all recipes, then keep the frequent ones
candidates = count_items(data)
frequent = store_frequent(candidates)
# number of frequent items found
print(len(frequent))


__D4.__ _(10 pts)_ Now, write a function called `get_next(recipes, frequent, threshold = 25)` that accepts the `frequent` items output from the `store_frequent()` function. With these inputs, your function should:

1. create a new `Counter` called `next_candidates`
2. compute the `size` of the itemsets for `next_candidates` from a single key in `frequent`
3. `for` any `recipe` with _at least_ as many ingredients as `size`:
    1. loop over all itemsets of size `size` (see combinations note below)
    2. utilize the apriori principle and subsets of itemsets to count up potentially-frequent candidate itemsets in `next_candidates`
4. `return(next_candidates)` 

__Important__: once your code runs, apply this function to the output of __D3__, report the resulting number of `next_candidates` found, and run `store_frequent` on these to report the number of 2-itemsets that were frequent. Repeat this process to build the 3-itemsets and record in the markdown box any observations on run time for these successive applications. In the response box below reply to the following questions:

- Are we generating more candidates or frequent itemsets as we look at larger sizes? 
- Why would this process become more and more computationally expensive as the size gets larger?
    
Note: to complete this part it is _extremely strongly_ encouraged that you import the `combinations()` function from the `itertools` module. With this, you can execute `combinations(items, k)` to find all combinations of size `k` from a list of `items`.
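
As a small illustration of the note above, the sketch below shows how `combinations()` enumerates itemsets and their subsets. The ingredient list is a hypothetical toy example rather than a recipe from the data, and sorting before combining is one assumed way to give each itemset a single canonical tuple.

```python
from itertools import combinations
from math import comb

# hypothetical toy recipe, not taken from the data set
ingredients = sorted(['salt', 'butter', 'flour', 'eggs'])

# all 2-itemsets; sorting first means each itemset appears as exactly one tuple
pairs = list(combinations(ingredients, 2))
print(pairs)

# the (k - 1)-subsets of a k-itemset, which the apriori check needs
triple = ('butter', 'eggs', 'flour')
subsets = list(combinations(triple, 2))
print(subsets)

# number of possible k-itemsets in a 20-ingredient recipe for k = 1, 2, 3, 4
sizes = [comb(20, k) for k in range(1, 5)]
print(sizes)
```

The last line gives a rough sense of how quickly the number of itemsets to inspect per recipe grows with `k`.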

_Response._

In [88]:
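## code here
from itertools import combinations
def get_next(recipes, frequent, threshold=25):
    next_candidates = Counter()
    size = len(next(iter(frequent))) + 1  # one larger than the itemsets in `frequent`
    for recipe in recipes:
        for itemset in combinations(sorted(set(recipe['ingredients'])), size):
            if all(s in frequent for s in combinations(itemset, size - 1)):  # apriori pruning
                next_candidates[itemset] += 1
    return next_candidates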

__D5.__ _(10 pts)_ Now that we have the pieces to run Apriori/collect frequent itemsets it's time to package the process together, collecting all frequent itemsets up to a particular `size`. To do this, write a function called `train(recipes, size = 4)`, which:

1. initializes two empty dictionaries, `candidates` and `frequent`;
2. runs the `count_items` and `store_frequent` functions, storing their output in the `candidates` and `frequent` dictionaries using the integer `1` as a key;
3. loops over sizes 2, 3, ..., `size` to compute and store each subsequent size's candidates and frequent itemsets in the same structure as (2), now using the `get_next` function instead of `count_items`; and
4. `return`s the `candidates` and `frequent` itemsets.
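
The steps above could be sketched roughly as follows. This is one possible arrangement, not a reference solution: the helper functions are compact stand-ins for those from __D2__ to __D4__, the `threshold` parameter on `train` is an added assumption so that the toy data can use a small cutoff, and the `toy` recipes are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_items(recipes):
    # stand-in for D2: number of recipes containing each ingredient
    counts = Counter()
    for recipe in recipes:
        for ingredient in set(recipe['ingredients']):
            counts[(ingredient,)] += 1
    return counts

def store_frequent(candidates, threshold=25):
    # stand-in for D3: keep itemsets with a count above the threshold
    return Counter({k: v for k, v in candidates.items() if v > threshold})

def get_next(recipes, frequent, threshold=25):
    # stand-in for D4: count itemsets whose smaller subsets are all frequent
    next_candidates = Counter()
    if not frequent:
        return next_candidates
    size = len(next(iter(frequent))) + 1
    for recipe in recipes:
        items = sorted(set(recipe['ingredients']))
        for itemset in combinations(items, size):
            if all(sub in frequent for sub in combinations(itemset, size - 1)):
                next_candidates[itemset] += 1
    return next_candidates

def train(recipes, size=4, threshold=25):
    # `threshold` is an assumed extra parameter, passed through to the helpers
    candidates, frequent = {}, {}
    candidates[1] = count_items(recipes)
    frequent[1] = store_frequent(candidates[1], threshold)
    for k in range(2, size + 1):
        candidates[k] = get_next(recipes, frequent[k - 1], threshold)
        frequent[k] = store_frequent(candidates[k], threshold)
    return candidates, frequent

# hypothetical toy recipes, only for illustration
toy = [
    {'ingredients': ['butter', 'flour', 'eggs']},
    {'ingredients': ['butter', 'flour', 'sugar']},
    {'ingredients': ['butter', 'flour', 'eggs', 'salt']},
    {'ingredients': ['soy sauce', 'garlic']},
]
candidates, frequent = train(toy, size=3, threshold=1)
print(frequent[2])
print(frequent[3])
```

Keying both dictionaries by itemset size keeps each level's counts available for later lookups by size.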

In [None]:
## code here

In [None]:
## code here

__D6.__ _(8 pts)_ Now that we have our `frequent` itemsets up to `size`, we can utilize them to recommend missing ingredients for ingredient 'baskets' of size at most `size - 1`. To do this, write a function called `recommend(basket, frequent)` that does the following:

1. initializes an empty `recommendations` list
2. loops over all frequent `itemset`s of size 1 greater than the `basket`
    - if there's one item left from the `itemset` when the `basket` is removed, append the remaining item to the `recommendations` list in a tuple, with the number of occurrences of the itemset in the second position
3. `return` `recommendations`, but sorted from high to low by itemset occurrence.

Once your code is complete, report the top 10 recommended items to buy for recipe flexibility in the following scenarios:

- `basket = tuple(['butter', 'flour'])`
- `basket = tuple(['soy sauce', 'green onions'])`
- `basket = tuple(['avocado', 'garlic', 'salt'])`

and in the response box below discuss the output and the types of recipes you think the recommender is pointing you to. Does this output seem appropriate? 

Note: your function should additionally respond appropriately if the user requests a recommendation for a basket of size at least as big as the `size` specified in the `train()` function, i.e., it should return an error message gracefully, alerting the user to not having trained on itemsets large enough.
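
A minimal sketch of such a function is shown below, under some assumptions: `frequent` is a dictionary keyed by itemset size, as returned by `train()`; the graceful failure is taken to mean printing a message and returning an empty list; and the counts in the example dictionary are hypothetical, not computed from the recipe data.

```python
from collections import Counter

def recommend(basket, frequent):
    size = len(basket) + 1
    # assumed behaviour: report the problem and return no recommendations
    if size not in frequent:
        print(f'No frequent itemsets of size {size}; train with a larger size.')
        return []
    recommendations = []
    for itemset, count in frequent[size].items():
        remaining = set(itemset) - set(basket)
        # exactly one item left means the basket is contained in the itemset
        if len(remaining) == 1:
            recommendations.append((remaining.pop(), count))
    # highest itemset count first
    return sorted(recommendations, key=lambda pair: pair[1], reverse=True)

# hypothetical frequent itemsets, keyed by size
frequent = {
    1: Counter({('butter',): 5, ('flour',): 4, ('eggs',): 3, ('sugar',): 3}),
    2: Counter({('butter', 'flour'): 4, ('butter', 'eggs'): 3, ('eggs', 'flour'): 2}),
    3: Counter({('butter', 'eggs', 'flour'): 3, ('butter', 'flour', 'sugar'): 2}),
}

top = recommend(tuple(['butter', 'flour']), frequent)[:10]
print(top)

# a basket as large as the trained size triggers the message instead
too_big = recommend(tuple(['butter', 'flour', 'eggs']), frequent)
```

Comparing sets rather than tuples means the order of items in the `basket` does not matter.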

_Response._

In [None]:
## code here

In [None]:
## code here

In [None]:
## code here

In [None]:
## code here