<h1 style = "text-align: center">Collaborative Filtering Data Model</h1>

<h3 style = "text-align: center">Food.com Recipe Recommender - SOEN 471 (Big Data Analytics)</h3>

## Objective:
The objective of this notebook is to build a recommender system data model that recommends recipes based on user preferences using collaborative filtering.

In [145]:
import os
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn
from sklearn.metrics.pairwise import cosine_similarity
import warnings

warnings.filterwarnings("ignore") # suppress all warnings (added to silence those raised by displot)

# accessing directory
for dirname, _, filenames in os.walk('./clean_data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./clean_data/interactions_TRAIN.csv
./clean_data/interactions_TEST.csv
./clean_data/recipes.csv


## Reading files:

In [146]:
# read
training = pd.read_csv("./clean_data/interactions_TRAIN.csv")
testing = pd.read_csv("./clean_data/interactions_TEST.csv")
recipes = pd.read_csv("./clean_data/recipes.csv")

# print shapes of data
print("Shape of training model: ", training.shape)
print("Shape of testing model: ", testing.shape)
print("Shape of recipes model: ", recipes.shape)

Shape of training model:  (1019129, 6)
Shape of testing model:  (113237, 6)
Shape of recipes model:  (231636, 19)


## Since the data is big, we will take a small random sample:

In [147]:
interactions_sample = training.sample(40000)
interactions_sample.head(1)

Unnamed: 0.1,Unnamed: 0,user_id,recipe_id,date,rating,review
836811,37715,226863,312378,2008-11-12,3,"I followed this recipe exactly, so I'm not sur..."
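
Because `sample` draws different rows on every run, the shapes and statistics shown in later cells will differ between executions. A minimal sketch, assuming a fixed seed is acceptable for this analysis, of making a sample reproducible with the `random_state` parameter (the toy frame and the seed value 42 are illustrative, not taken from the notebook):

```python
import pandas as pd

# Toy stand-in for the interactions table (values are made up for illustration)
toy = pd.DataFrame({"user_id": range(100), "rating": [i % 6 for i in range(100)]})

# Passing the same random_state yields the same rows on every run
sample_a = toy.sample(10, random_state=42)
sample_b = toy.sample(10, random_state=42)

print(sample_a.equals(sample_b))
```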


In [148]:
recipes_sample = recipes.sample(40000)
recipes_sample.head(1)

Unnamed: 0.1,Unnamed: 0,name,recipe_id,minutes,contributor_id,submitted,tags,n_steps,steps,description,ingredients,n_ingredients,Calories,Total_fat_PDV,Sugar_PDV,Sodium_PDV,Protein_PDV,Saturated_fat_PDV,Carbohydrates_PDV
171807,55160,red achiote mexican rice,67319,40,80138,2003-07-24,"['60-minutes-or-less', 'time-to-make', 'course...",13,"['rinse rice and drain well', 'set apart', 'pr...",a tomato-less mexican red rice. popular in the...,"['white rice', 'white onion', 'garlic', 'achio...",10,275.9,15.0,2.0,16.0,11.0,7.0,13.0


## Join both samples on `recipe_id`:

In [149]:
data = pd.merge(interactions_sample, recipes_sample, on='recipe_id')
print("The shape of the joined training data sample: ", data.shape)
data.head(2)

The shape of the joined training data sample:  (6902, 24)


Unnamed: 0,Unnamed: 0_x,user_id,recipe_id,date,rating,review,Unnamed: 0_y,name,minutes,contributor_id,...,description,ingredients,n_ingredients,Calories,Total_fat_PDV,Sugar_PDV,Sodium_PDV,Protein_PDV,Saturated_fat_PDV,Carbohydrates_PDV
0,119776,121690,206272,2007-03-19,5,I cooked the rice in my electric rice steamer ...,13638,denise s saffron vegetable fried rice,25,341142,...,an interesting meld of indian and chinese cook...,"['long grain rice', 'water', 'powdered saffron...",12,212.8,10.0,4.0,5.0,7.0,4.0,11.0
1,95472,858469,154636,2009-09-29,0,This was really good. If you are a banana brea...,50557,weight watcher 1 point banana bread flex points,70,7108,...,"moist and tasty! for those on ww, it's an eas...","['bananas', 'splenda sugar substitute', 'natur...",7,87.8,0.0,30.0,10.0,3.0,0.0,6.0
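
The join keeps only 6,902 of the 40,000 sampled interactions because the two tables were sampled independently: an interaction survives the inner merge only if its recipe also landed in `recipes_sample`, which holds 40,000 of the 231,636 recipes. A minimal sketch on toy data (all values are made up) of how an inner merge drops unmatched rows, and how merging against the full recipe table would keep them:

```python
import pandas as pd

# Toy interactions and recipes; names and values are illustrative only
interactions = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "recipe_id": [10, 20, 30, 40],
    "rating": [5, 4, 3, 5],
})
all_recipes = pd.DataFrame({"recipe_id": [10, 20, 30, 40], "name": ["a", "b", "c", "d"]})

# An independent "sample" of the recipes that contains only recipes 10 and 20
recipe_subset = all_recipes.iloc[:2]

# pd.merge performs an inner join by default, so unmatched interactions are dropped
joined_subset = pd.merge(interactions, recipe_subset, on="recipe_id")
joined_full = pd.merge(interactions, all_recipes, on="recipe_id")

print(len(joined_subset), len(joined_full))
```

Whether to merge against the full table is a trade-off between keeping more interactions and keeping the joined frame small.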


## Descriptive statistics summarizing the central tendency, dispersion, and shape of the dataset's distribution:

In [150]:
data.describe()

Unnamed: 0,Unnamed: 0_x,user_id,recipe_id,rating,Unnamed: 0_y,minutes,contributor_id,n_steps,n_ingredients,Calories,Total_fat_PDV,Sugar_PDV,Sodium_PDV,Protein_PDV,Saturated_fat_PDV,Carbohydrates_PDV
count,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0,6902.0
mean,114025.756013,135360100.0,162404.523327,4.399305,28473.455665,126.593306,5033479.0,9.787163,9.15865,455.150377,33.693567,77.478557,31.110693,36.587946,42.027673,14.774703
std,66067.211579,496264900.0,131522.569956,1.278239,16805.58011,1772.47022,94273150.0,5.811689,3.728647,574.326555,50.566635,233.306537,65.488555,41.695716,65.829974,27.298191
min,174.0,1533.0,39.0,0.0,32.0,0.0,1530.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,56957.0,135400.2,53914.0,4.0,14114.75,25.0,37449.0,6.0,6.0,186.65,9.0,9.0,6.0,7.0,8.0,4.0
50%,112315.0,336058.0,123864.5,5.0,28815.0,40.0,89831.0,8.0,9.0,318.85,20.0,24.0,16.0,21.0,24.0,9.0
75%,172183.25,815671.5,249890.75,5.0,42641.75,70.0,250332.2,13.0,11.0,517.4,39.0,68.0,36.0,56.0,51.0,16.0
max,227289.0,2002372000.0,536524.0,5.0,58760.0,129615.0,2001488000.0,70.0,26.0,9038.5,900.0,5174.0,2361.0,872.0,1241.0,484.0


## Create a recipe-by-user pivot table of ratings and replace missing values with 0:

In [151]:
pivot_table = data.pivot_table(index='recipe_id', columns='user_id', values='rating')
pivot_table.fillna(0, inplace=True)
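
To make the shape of this matrix concrete, here is a small sketch on made-up ratings: rows are recipes, columns are users, and unrated pairs become 0. Note that the dataset also contains genuine ratings of 0 (the `min` of `rating` in the statistics above is 0), and these become indistinguishable from missing entries after filling.

```python
import pandas as pd

# Made-up ratings in the same long format as the joined data
ratings = pd.DataFrame({
    "user_id":   [1, 1, 2, 3],
    "recipe_id": [10, 20, 10, 30],
    "rating":    [5, 3, 4, 2],
})

# One row per recipe, one column per user; missing ratings are filled with 0
matrix = ratings.pivot_table(index="recipe_id", columns="user_id", values="rating").fillna(0)
print(matrix)
```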

## Define a function that returns the recipes most similar to a given recipe_id:

In [152]:
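def similar_recipes(recipe_id):
    # rating vector of the query recipe, shaped (1, n_users)
    recipe = pivot_table.loc[recipe_id].values.reshape(1,-1)
    cosine_similarities = cosine_similarity(pivot_table, recipe).flatten()
    # row positions of the 5 highest similarities (the query recipe itself is normally among them)
    related_positions = cosine_similarities.argsort()[:-6:-1]
    # argsort yields positions in pivot_table, so map them back to recipe_id labels
    related_ids = pivot_table.index[related_positions]
    # look the recipes up by recipe_id rather than by the DataFrame's row index
    related_recipes = recipes.set_index('recipe_id').loc[related_ids, ['name', 'ingredients']]
    return related_recipes.values.tolist()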
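
When ranking by similarity, note that `argsort` returns row positions within the matrix, not recipe IDs, so the positions have to be mapped back through the matrix index before looking anything up. A small sketch on a made-up recipe-by-user matrix (the IDs 101, 202, 303 and all ratings are hypothetical):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up matrix: rows are recipes, columns are users
matrix = pd.DataFrame(
    [[5, 4, 0], [5, 5, 0], [0, 0, 3]],
    index=pd.Index([101, 202, 303], name="recipe_id"),
    columns=pd.Index([1, 2, 3], name="user_id"),
)

# Similarity of every recipe to recipe 101
query = matrix.loc[101].values.reshape(1, -1)
scores = cosine_similarity(matrix, query).flatten()

# Positions of the two highest scores, then the recipe IDs at those positions
top_positions = scores.argsort()[::-1][:2]
top_ids = matrix.index[top_positions].tolist()
print(top_ids)
```

Here recipe 101 matches itself most closely, followed by recipe 202, which was rated similarly by the same users.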

## Define a function that returns recommended recipes based on the minutes and calories provided:

In [153]:
def recommend_recipe(minutes, calories):
    # find the recipes that have similar minutes and calories as the input
    similar_minutes = data[(data['minutes'] >= minutes-10) & (data['minutes'] <= minutes+10)]
    similar_calories = similar_minutes[(similar_minutes['Calories'] >= calories-100) & (similar_minutes['Calories'] <= calories+100)]
    recipe_ids = similar_calories['recipe_id'].unique().tolist()

    # recommend similar recipes for each recipe in the filtered data
    recommended_recipes = []
    for recipe_id in recipe_ids:
        recommended_recipes.extend(similar_recipes(recipe_id))

    # remove duplicates while preserving the order in which recipes were found
    recommended_recipes = list(dict.fromkeys(tuple(recipe) for recipe in recommended_recipes))

    return recommended_recipes[:10] # return at most 10 recommended recipes
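
Deduplicating through `set` does not preserve order, so slicing the result afterwards picks an arbitrary subset rather than the first recipes found. A minimal sketch of an order-preserving alternative using `dict.fromkeys` (the recipe tuples are made up):

```python
# Made-up (name, ingredients) tuples with one duplicate
candidates = [("soup", "a"), ("salad", "b"), ("soup", "a"), ("stew", "c")]

# dict keys are unique and keep insertion order, so the first occurrence wins
unique_in_order = list(dict.fromkeys(candidates))
print(unique_in_order)
```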

## Usage Example

In [None]:
import ast

recommended_recipes = recommend_recipe(30, 500)
print("Recommended recipes that take about 30 minutes to prepare and contain about 500 calories:")

for i, recipe in enumerate(recommended_recipes):
    print(f"{i+1}. Recipe Name: {recipe[0]}")
    print("Ingredients:")
    # ingredients are stored as the string form of a Python list, so parse instead of splitting
    ingredients = ast.literal_eval(recipe[1])
    for ingredient in ingredients:
        print("- ", ingredient)
    print()
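
The `ingredients` column is stored as the string form of a Python list, as the sample rows earlier in the notebook show. A minimal sketch, using a shortened illustrative value, of why splitting on `', '` leaves brackets and quotes in the output while `ast.literal_eval` recovers the clean items:

```python
import ast

# Shortened illustrative value in the same format as the ingredients column
raw = "['white rice', 'white onion', 'garlic']"

# Splitting the raw string keeps the bracket and quote characters
naive = raw.split(', ')

# literal_eval safely parses the string into a real list of strings
ingredients = ast.literal_eval(raw)

print(naive[0])
print(ingredients[0])
```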