# Hungry with my fridge!
This kernel aims to solve the task described [here](https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions/tasks?taskId=164).

**Goals**

* Create an API that takes a list of ingredients as input and returns the top 3 recipes that maximise the use of the ingredients provided in the list, while minimising the additional ingredients needed and the preparation time.
* The output is in the format of an array of arrays of [recipeId, recipeName, prepTimeInMinutes, numberOfFridgeItemUsed, numberOfAdditionalItemsNeeded].
* The score of the recipes in output is given by $score = numberOfIngredientsUsed^{\frac{60}{prepTimeInMinutes}} - numberOfAdditionalIngredientsNeeded^{\frac{prepTimeInMinutes}{15}}$ 
* The output array is sorted by the score of each entry.
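
As a quick illustration of the scoring formula above, here is a minimal sketch using plain floats. The function name `recipe_score` and the example numbers are assumptions made for illustration only; the sketch assumes a preparation time greater than zero.

```python
def recipe_score(n_used, n_needed, prep_minutes):
    """Score from the task statement; assumes prep_minutes > 0."""
    used_term = n_used ** (60 / prep_minutes)        # reward for fridge items used
    needed_term = n_needed ** (prep_minutes / 15)    # penalty for extra items needed
    return used_term - needed_term

# 4 fridge items used, 1 extra item needed, 30 minutes: 4**2 - 1**2
print(recipe_score(4, 1, 30))   # 15.0
# 3 fridge items used, 2 extra items needed, 60 minutes: 3**1 - 2**4
print(recipe_score(3, 2, 60))   # -13.0
```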

**Initial Commentary**

The task is an interesting one. In the context of the lockdowns experienced by many across the world in 2020, such a tool seems even more apt. The idea behind the task is to find a recipe that uses as many of the ingredients at hand as possible, while minimising both the number of ingredients that must be purchased and the preparation time. On immediate inspection, it is clear that these three optimisation goals can conflict. For example, a recipe that uses all the available ingredients may require many more purchased ingredients than a recipe that uses only a few of them. Likewise, a recipe that uses all available ingredients and requires no additional ones may take an extraordinary amount of time to prepare.

Because these requirements can conflict, it is important to derive a metric that balances the three of them. The $score$ metric defined as part of the evaluation of the task attempts to do this. The $score$ metric increases when more ingredients are used, but the number of ingredients used is raised to the power $\frac{60}{prepTimeInMinutes}$. This means that, for the same number of ingredients used, the longer the recipe takes to prepare, the smaller the contribution of the ingredients used to the score. Conversely, the $score$ metric decreases when more additional ingredients are needed, and the number of additional ingredients needed is raised to the power $\frac{prepTimeInMinutes}{15}$. This means that, for the same number of additional ingredients needed, the longer the recipe takes to prepare, the larger the reduction of the score.
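
As a worked illustration (the numbers are chosen arbitrarily and are not taken from the data set), consider a recipe that uses 4 available ingredients and needs 2 additional ones. Evaluating the score for three preparation times shows how quickly the balance shifts from positive to negative:

$$
\begin{aligned}
prepTimeInMinutes = 15: &\quad 4^{\frac{60}{15}} - 2^{\frac{15}{15}} = 256 - 2 = 254 \\
prepTimeInMinutes = 30: &\quad 4^{\frac{60}{30}} - 2^{\frac{30}{15}} = 16 - 4 = 12 \\
prepTimeInMinutes = 60: &\quad 4^{\frac{60}{60}} - 2^{\frac{60}{15}} = 4 - 16 = -12
\end{aligned}
$$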

While the balance offered by the $score$ metric seems reasonable, a quick review of the data raises potential issues. Some recipes have a preparation time of $0$min, which makes the exponent $\frac{60}{prepTimeInMinutes}$, and therefore the score, undefined regardless of the ingredients available. A quick inspection of these recipes suggests that this is a data anomaly for recipes on food.com that do not have a time entry.

To handle this issue, the following 2 proposals are made:

1. Treat recipes with $0$min preparation time as having $0.5$min preparation time.
2. Ignore recipes with $0$min preparation time.

The latter proposal is used: during the calculation, recipes with a $0$min preparation time are assigned a score of $-maxint$, where $maxint$ is an arbitrarily large number (this kernel uses $2^{1000}$).
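
The two proposals can be compared with a small sketch. The function names and `SENTINEL_SCORE` are hypothetical and use plain floats, not the kernel's own code. One possible reason to prefer the second proposal is visible in the output: clamping to $0.5$min gives the exponent $\frac{60}{0.5} = 120$, so a recipe with a missing time entry would outrank almost everything else.

```python
SENTINEL_SCORE = float("-inf")  # hypothetical stand-in for the kernel's -maxint

def score_with_clamp(n_used, n_needed, minutes):
    """Proposal 1: treat a 0-minute entry as 0.5 minutes."""
    minutes = 0.5 if minutes == 0 else minutes
    return n_used ** (60 / minutes) - n_needed ** (minutes / 15)

def score_with_skip(n_used, n_needed, minutes):
    """Proposal 2: push 0-minute recipes to the bottom of the ranking."""
    if minutes == 0:
        return SENTINEL_SCORE
    return n_used ** (60 / minutes) - n_needed ** (minutes / 15)

# A recipe with a missing time entry, 3 fridge items used and 2 extra items needed
print(score_with_clamp(3, 2, 0))  # roughly 3**120, an enormous positive score
print(score_with_skip(3, 2, 0))   # -inf, so the recipe is never recommended
```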

Furthermore, preparation times that span days can produce scores so large in magnitude that standard floating-point arithmetic cannot handle them. This tool uses the decimal module to manage this issue, and the module may also be required during evaluation. Interestingly, there exists a single recipe with a preparation time of $2147483647$min, roughly 4000 years. This results in a score of such absurdly large magnitude that even the decimal module is unable to handle it. In situations like this, the score is fixed at $-maxint$.
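
The sketch below illustrates the overflow behaviour described above, assuming Python's default decimal context. The helper `needed_penalty` and the choice of 5 additional ingredients are hypothetical and serve only to show the three regimes: float overflow, a value decimal can hold, and a value decimal rejects.

```python
import decimal

def needed_penalty(n_needed, prep_minutes):
    """Return n_needed ** (prep_minutes / 15) as a Decimal,
    or None if decimal signals an error such as Overflow."""
    try:
        return decimal.Decimal(n_needed) ** decimal.Decimal(prep_minutes / 15)
    except decimal.DecimalException:
        return None

# One week of preparation (10080 minutes): too large for a float
try:
    5.0 ** (10080 / 15)
    float_overflowed = False
except OverflowError:
    float_overflowed = True

week_penalty = needed_penalty(5, 10080)          # 5**672, fine for decimal
extreme_penalty = needed_penalty(5, 2147483647)  # beyond decimal's default exponent limit

print(float_overflowed)         # True
print(week_penalty is None)     # False
print(extreme_penalty is None)  # True
```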



**Contents:**

1. Setup
2. Loading and cleaning the data
3. Creating the API to perform the search
4. Final Solution

## 1. Setup
The setup step imports all necessary modules.

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import os
import pickle
import ast # parses a list stored as a literal string into a Python list
from tqdm import tqdm # progress bar helpful in monitoring processes
import decimal

## 2. Loading and cleaning the data

### 2.1 Loading the data
A generic `loadfiles` function is used to parse all data provided in the input directory as pd.DataFrames. The DataFrames are stored as values in a dictionary, with the input file name (without its extension) as the corresponding key.

For this task, only 3 of the input DataFrames are required:
1. `ingr_map.pkl` provides a mapping of ingredients (i) to ingredient ids (i_id)
2. `RAW_recipes.csv` provides a mapping of recipes (r) to ingredients (i), time (t), and recipe ids (r_id)
3. `PP_recipes.csv` provides a mapping of recipe ids (r_id) to ingredient ids (i_id)

In [None]:
# Def Load Files func
def loadfiles(directory):
    files = {} # Initiate file dict, key = file name without extension
    for dirname, _, filenames in os.walk(directory): # walk the directory passed in
        for filename in filenames:
            fullpath = os.path.join(dirname, filename)
            name, extension = os.path.splitext(filename)
            if extension == ".csv": # load csv file
                files[name] = pd.read_csv(fullpath)
                print(f"Loaded file: {filename}")
            elif extension == ".pkl": # load pkl file
                with open(fullpath, 'rb') as f:
                    files[name] = pickle.load(f)
                print(f"Loaded file: {filename}")
    return files

# Load files
files = loadfiles('/kaggle/input')

In [None]:
ingredients = files['ingr_map']
recipes = files['RAW_recipes']
r2i_map_raw = files['PP_recipes']

### 2.2 Cleaning the data
The data is cleansed and processed to create mappers (dict) to be used in the API to retrieve appropriate recipes. 

* `r2i_map`: Key = recipe id, Value = set of ingredient ids required for this recipe
* `i2r_map`: Key = ingredient id, Value = set of recipe ids that use this ingredient 
* `id2r_map`: Key = recipe id, Value = recipe name
* `i2id_map`: Key = ingredient, Value = ingredient id
* `r2min_map`: Key = recipe id, Value = preparation time in minutes

To save time, the generated mappers (dict) are saved as pickle files. On later runs, the script that generates the mappers can be commented out and the mappers loaded with the loading script left inline.
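
To make the relationship between `r2i_map` and `i2r_map` concrete, here is a minimal sketch on made-up toy data. The ids, the helper `invert`, and the `toy_` names are assumptions for illustration and do not come from the data set. The second mapper is simply the inverted index of the first, which lets candidate recipes be looked up directly from the ingredients on hand.

```python
# Toy data (made up for illustration): recipe id -> set of ingredient ids
toy_r2i_map = {101: {1, 2, 3}, 102: {2, 4}, 103: {3}}

def invert(r2i):
    """Build ingredient id -> set of recipe ids from recipe id -> ingredient ids."""
    i2r = {}
    for recipe_id, ingredient_ids in r2i.items():
        for ingredient_id in ingredient_ids:
            i2r.setdefault(ingredient_id, set()).add(recipe_id)
    return i2r

toy_i2r_map = invert(toy_r2i_map)

# Recipes that use at least one of the ingredients on hand (ids 2 and 3)
candidates = toy_i2r_map[2] | toy_i2r_map[3]
print(sorted(candidates))  # [101, 102, 103]
```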

In [None]:
# def method to clean r2i_map_raw table
def generate_maps(r2i_map_raw):
    r2i_map = {} # key = recipe id, value = ingredient id set
    i2r_map = {} # key = ingredient id, value = recipe id set

    # parse each row once; iterating the columns directly avoids a full-table query per recipe
    rows = zip(r2i_map_raw['id'], r2i_map_raw['ingredient_ids'])
    for recipe_id, ingredient_ids_str in tqdm(rows, total=len(r2i_map_raw)):
        # ingredient_ids is stored as a string literal of a list, so parse it first
        ingredient_ids = set(ast.literal_eval(ingredient_ids_str))

        # add r2i entry
        r2i_map[recipe_id] = ingredient_ids

        # add i2r entry
        for ingredient_id in ingredient_ids:
            i2r_map.setdefault(ingredient_id, set()).add(recipe_id)

    return r2i_map, i2r_map

r2i_map, i2r_map = generate_maps(r2i_map_raw)

i2id_map_raw_replaced = ingredients[['id','replaced']].drop_duplicates(subset='replaced', keep="first")
i2id_map_raw_raw_ingr = ingredients[['id','raw_ingr']].drop_duplicates(subset='raw_ingr', keep="first")
i2id_map_raw_processed = ingredients[['id','processed']].drop_duplicates(subset='processed', keep="first")

i2id_map = {**dict(zip(list(i2id_map_raw_replaced['replaced']), list(i2id_map_raw_replaced['id']))),
            **dict(zip(list(i2id_map_raw_raw_ingr['raw_ingr']), list(i2id_map_raw_raw_ingr['id']))),
            **dict(zip(list(i2id_map_raw_processed['processed']), list(i2id_map_raw_processed['id'])))}

id2r_map_raw = recipes[['name','id']].drop_duplicates(subset='id', keep="first")
id2r_map = dict(zip(list(id2r_map_raw['id']),list(id2r_map_raw['name'])))

r2min_map_raw = recipes[['minutes','id']].drop_duplicates(subset='id', keep="first")
r2min_map = dict(zip(list(r2min_map_raw['id']),list(r2min_map_raw['minutes'])))

with open('/kaggle/working/i2r_map.pkl', 'wb') as handle:
    pickle.dump(i2r_map, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('/kaggle/working/r2i_map.pkl', 'wb') as handle:
    pickle.dump(r2i_map, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('/kaggle/working/i2id_map.pkl', 'wb') as handle:
    pickle.dump(i2id_map, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open('/kaggle/working/id2r_map.pkl', 'wb') as handle:
    pickle.dump(id2r_map, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('/kaggle/working/r2min_map.pkl', 'wb') as handle:
    pickle.dump(r2min_map, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Loading previously generated mappers
with open('/kaggle/working/i2r_map.pkl', 'rb') as f:
    i2r_map =  pickle.load(f)

with open('/kaggle/working/r2i_map.pkl', 'rb') as f:
    r2i_map =  pickle.load(f)

with open('/kaggle/working/i2id_map.pkl', 'rb') as f:
    i2id_map =  pickle.load(f)

with open('/kaggle/working/id2r_map.pkl', 'rb') as f:
    id2r_map =  pickle.load(f)
    
with open('/kaggle/working/r2min_map.pkl', 'rb') as f:
    r2min_map =  pickle.load(f)

## 3. Creating the API to perform the search

In [None]:
def getRecipes(ingredient_list_id):
    output_data = {} # key = recipe id, value = {'i_req': set(), 'i_avail': set(), 'i_needed': set(), 'time_req': minutes}
    
    for i in ingredient_list_id:
        # Retrieve recipes containing this ingredient (none if the ingredient appears in no recipe)
        for r in i2r_map.get(i, set()):
            if r in output_data:
                output_data[r]['i_avail'].add(i)
            else:
                output_data[r] = {'i_req': r2i_map[r], 'i_avail': {i}, 'time_req': r2min_map[r]}
    
    # Ingredients still to be bought = required ingredients minus available ones
    for r in output_data:
        output_data[r]['i_needed'] = output_data[r]['i_req'].difference(output_data[r]['i_avail'])
    
    return output_data

def parseIngredientList(ingredient_list_string):
    ingredient_list_id=[]
    for i in ingredient_list_string:
        ingredient_list_id.append(i2id_map[i])
    return ingredient_list_id

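# Sentinel score ("-maxint") for recipes whose score cannot be computed
MIN_SCORE = -decimal.Decimal(2)**decimal.Decimal(1000)

def score(recipe_data):
    try:
        minutes = recipe_data['time_req']
        if minutes == 0: # missing time entry: 60/0 is undefined
            return MIN_SCORE
        # reward for fridge items used, damped by long preparation times
        used_term = decimal.Decimal(len(recipe_data['i_avail']))**decimal.Decimal(60.0/float(minutes))
        # penalty for additional items needed, amplified by long preparation times
        needed_term = decimal.Decimal(len(recipe_data['i_needed']))**decimal.Decimal(float(minutes)/15)
        return used_term - needed_term
    except Exception: # e.g. decimal.Overflow for extreme preparation times
        return MIN_SCORE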

def sortByScore(output_data):
    return sorted(list(output_data.keys()), key=lambda recipe: score(output_data[recipe]), reverse=True)

def maxScoreRecipeId(output_data):
    return sortByScore(output_data)[0]

def getRecipeData(r_id,output_data):
    recipe_data_list = []
    recipe_data_list.append(r_id) # Append recipeId to list
    recipe_data_list.append(id2r_map[r_id]) # Append recipeName to list
    recipe_data_list.append(output_data[r_id]['time_req']) # Append prepTimeInMinutes to list
    recipe_data_list.append(len(output_data[r_id]['i_avail'])) # Append numberOfFridgeItemUsed to list
    recipe_data_list.append(len(output_data[r_id]['i_needed'])) # Append numberOfAdditionalItemsNeeded to list
    return recipe_data_list

def hungryWithMyFridgeAPI(arrayOfArrayOfIngredients):
    output_array = []
    for ingredientsArray in arrayOfArrayOfIngredients:
        recipes = getRecipes(parseIngredientList(ingredientsArray))
        output_array.append(getRecipeData(maxScoreRecipeId(recipes),recipes))
    return output_array

def scoreOutputArray(output_array):
    scores = []
    for output in output_array:
        scores.append((decimal.Decimal(output[3])**decimal.Decimal(60.0/output[2])) - (decimal.Decimal(output[4])**decimal.Decimal(output[2]/15)))
    return np.mean(scores)
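
The goals mention returning the top 3 recipes, whereas `maxScoreRecipeId` keeps only the single best recipe for each ingredient list. A minimal sketch of how the top N recipes might be selected is shown here, using `heapq.nlargest` on a dictionary of precomputed scores. The function `top_n_recipe_ids` and the `toy_scores` values are hypothetical and not part of the kernel.

```python
import heapq

def top_n_recipe_ids(scores_by_recipe, n=3):
    """Return the ids of the n highest-scoring recipes, best first."""
    return heapq.nlargest(n, scores_by_recipe, key=scores_by_recipe.get)

# Toy scores (made up for illustration): recipe id -> score
toy_scores = {101: 15.0, 102: -13.0, 103: 4.0, 104: 0.5}
print(top_n_recipe_ids(toy_scores))  # [101, 103, 104]
```

`heapq.nlargest` avoids sorting every candidate when only a few results are needed, which may matter when a common ingredient matches a large number of recipes.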

In [None]:
## Evaluation

# named fridge_inputs to avoid shadowing the built-in input()
fridge_inputs = [
    ['winter squash', 'mexican seasoning', 'mixed spice', 'honey', 'butter', 'olive oil', 'salt'],
    ['low sodium chicken broth', 'tomatoes', 'zucchini', 'potatoes', 'wax beans', 'green beans', 'carrots'],
    ['spinach',  'garlic powder', 'soft breadcrumbs', 'oregano', 'onion'] ]
output = hungryWithMyFridgeAPI(fridge_inputs)

for available, recommended in zip(fridge_inputs, output):
    print("For Available ingredients:")
    print(available)
    print("\nRecommended Recipe:")
    print(recommended)
    print(f"With Score: {scoreOutputArray([recommended])}\n")

print(f"\nOverall Mean Score: {scoreOutputArray(output)}")