<h1 style = "text-align: center">Collaborative Filtering Data Model</h1>

<h3 style = "text-align: center">Food.com Recipe Recommender - SOEN 471 (Big Data Analytics)</h3>

## Objective:
The objective of this notebook is to create a recommender system data model that recommend recipes based on user preferences using collaborative filtering.

In [1]:
import os
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# accessing directory
for dirname, _, filenames in os.walk('./clean_data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./clean_data/interactions_TRAIN.csv
./clean_data/interactions_TEST.csv
./clean_data/recipes.csv


## Reading files:

In [2]:
# read
training = pd.read_csv("./clean_data/interactions_TRAIN.csv")
testing = pd.read_csv("./clean_data/interactions_TEST.csv")
recipes = pd.read_csv("./clean_data/recipes.csv")

# print shapes of data
print("Shape of training model: ", training.shape)
print("Shape of testing model: ", testing.shape)
print("Shape of recipes model: ", recipes.shape)

Shape of training model:  (1019129, 6)
Shape of testing model:  (113237, 6)
Shape of recipes model:  (231636, 19)


## Since the data is big, we will take a small random sample:

In [3]:
interactions_sample = training.sample(40000)
interactions_sample.head(1)

Unnamed: 0.1,Unnamed: 0,user_id,recipe_id,date,rating,review
29907,121294,281399,321212,2009-02-02,5,"Oh yeah, this is good. First one I made I trie..."


In [4]:
recipes_sample = recipes.sample(40000)
recipes_sample.head(1)

Unnamed: 0.1,Unnamed: 0,name,recipe_id,minutes,contributor_id,submitted,tags,n_steps,steps,description,ingredients,n_ingredients,Calories,Total_fat_PDV,Sugar_PDV,Sodium_PDV,Protein_PDV,Saturated_fat_PDV,Carbohydrates_PDV
8243,8244,apricot tea loaf,221701,60,315565,2007-04-09,"['60-minutes-or-less', 'time-to-make', 'course...",9,['soak apricots in hot tea for 1 hour till tea...,easy loaf to make and yummy to eat - from aust...,"['dried apricots', 'brewed tea', 'butter', 'ca...",6,369.0,27.0,127.0,5.0,8.0,54.0,16.0


## Join both samples based on 

In [5]:
data = pd.merge(interactions_sample, recipes_sample, right_on='recipe_id', left_on='recipe_id')
print("The shape of the joind training data sample: ", data.shape)
data.head(2)

The shape of the joind training data sample:  (6995, 24)


Unnamed: 0,Unnamed: 0_x,user_id,recipe_id,date,rating,review,Unnamed: 0_y,name,minutes,contributor_id,...,description,ingredients,n_ingredients,Calories,Total_fat_PDV,Sugar_PDV,Sodium_PDV,Protein_PDV,Saturated_fat_PDV,Carbohydrates_PDV
0,121294,281399,321212,2009-02-02,5,"Oh yeah, this is good. First one I made I trie...",34149,california iced tea,3,351811,...,being from california i had to add what us cal...,"['citrus-infused vodka', 'gin', 'rum', 'amaret...",8,102.6,0.0,0.0,0.0,0.0,0.0,0.0
1,138268,424680,276589,2009-08-06,5,I pretty much cut this recipe in half when mak...,50021,warm rice pudding with seasonal fruit,40,336058,...,this is a michael chiarello recipe that i foun...,"['milk', 'sugar', 'lemon peel', 'vanilla bean'...",9,196.1,9.0,33.0,2.0,12.0,16.0,9.0


 ## Summary of the descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution:

In [6]:
data.describe()

Unnamed: 0,Unnamed: 0_x,user_id,recipe_id,rating,Unnamed: 0_y,minutes,contributor_id,n_steps,n_ingredients,Calories,Total_fat_PDV,Sugar_PDV,Sodium_PDV,Protein_PDV,Saturated_fat_PDV,Carbohydrates_PDV
count,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0,6995.0
mean,115208.351251,144323500.0,160049.125518,4.407434,28611.701787,98.861472,1577750.0,9.679199,8.992852,457.143074,33.499071,84.762116,29.023874,36.446605,42.905361,15.124803
std,65526.991023,511037300.0,130305.887967,1.270011,17231.438396,626.320634,51431600.0,5.745908,3.720766,629.094779,49.747936,358.207366,67.061731,45.026394,74.49555,35.158052
min,19.0,1533.0,92.0,0.0,36.0,0.0,1530.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,59389.5,133174.0,51501.0,4.0,13543.0,22.0,37305.0,6.0,6.0,180.2,9.0,8.0,6.0,7.0,8.0,4.0
50%,113492.0,329638.0,117424.0,5.0,28818.0,40.0,89831.0,9.0,9.0,309.7,21.0,23.0,16.0,19.0,23.0,9.0
75%,173260.0,800090.0,243784.0,5.0,43485.0,70.0,242729.0,12.0,11.0,518.4,40.0,68.0,34.0,55.0,52.0,16.0
max,227317.0,2002352000.0,536345.0,5.0,58887.0,43200.0,2001404000.0,82.0,26.0,17554.0,805.0,18127.0,3651.0,897.0,1595.0,1511.0


## Create a pivot table from the data and replace any missing value by 0:

In [7]:
pivot_table = data.pivot_table(index='recipe_id', columns='user_id', values='rating')
pivot_table.fillna(0, inplace=True)

## Define a function that generates similar recipes based on recipe_id provided

In [8]:
def similar_recipes(recipe_id):
    recipe = pivot_table.loc[recipe_id].values.reshape(1,-1)
    cosine_similarities = cosine_similarity(pivot_table, recipe).flatten()
    related_recipe_indices = cosine_similarities.argsort()[:-6:-1]
    related_recipes = recipes.loc[related_recipe_indices][['name', 'ingredients']]
    return related_recipes.values.tolist()

## Define a function that returns a recommended recipe based on minutes and calories provided 

In [9]:
def recommend_recipe(minutes, calories):
    # find the recipes that have similar minutes and calories as the input
    similar_minutes = data[(data['minutes'] >= minutes-10) & (data['minutes'] <= minutes+10)]
    similar_calories = similar_minutes[(similar_minutes['Calories'] >= calories-100) & (similar_minutes['Calories'] <= calories+100)]
    recipe_ids = similar_calories['recipe_id'].unique().tolist()

    # recommend similar recipes for each recipe in the filtered data
    recommended_recipes = []
    for recipe_id in recipe_ids:
        recommended_recipes.extend(similar_recipes(recipe_id))

    # remove duplicates
    recommended_recipes = list(set([tuple(recipe) for recipe in recommended_recipes]))

    return recommended_recipes[:10] # return top 10 recommended recipes

## Usage Example

In [10]:
recommended_recipes = recommend_recipe(30, 500)
print("Recommended Recipes that needs 30 minutes to prepare and contains 500 calories:")

for i, recipe in enumerate(recommended_recipes):
    print(f"{i+1}. Recipe Name: {recipe[0]}")
    print("Ingredients:")
    ingredients = recipe[1].split(', ')
    for ingredient in ingredients:
        print("- ", ingredient)
    print()

Recommended Recipes that needs 30 minutes to prepare and contains 500 calories:
1. Recipe Name: alabammy delight
Ingredients:
-  ['southern comfort'
-  'amaretto'
-  'sloe gin'
-  'lemon juice']

2. Recipe Name: all bran banana bread
Ingredients:
-  ['all-bran cereal'
-  'whole wheat flour'
-  'baking powder'
-  'bicarbonate of soda'
-  '1% low-fat milk'
-  'caster sugar'
-  'egg'
-  'water'
-  'banana']

3. Recipe Name: alaskan cranberry dumplings
Ingredients:
-  ['cranberries'
-  'flour'
-  'salt'
-  'double-acting baking powder'
-  'butter'
-  'milk'
-  'sugar'
-  'cinnamon']

4. Recipe Name: almond joy cake with creamy coconut butter frosting
Ingredients:
-  ['coconut pudding mix'
-  'water'
-  'white cake mix'
-  'vegetable oil'
-  'eggs'
-  'coconut rum'
-  "devil's food cake mix"
-  'instant pistachio pudding mix'
-  'almond liqueur'
-  'butter'
-  'salt'
-  'vanilla extract'
-  'coconut extract'
-  "confectioners' sugar"
-  'coconut milk'
-  'flaked coconut']

5. Recipe Name: 5