## Similar Dish Recommendation

We will use machine learning to recommend dishes that are similar to a given dish. The dataset comes from [Kaggle](https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions) and contains more than 230 thousand recipes.

In [1]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/FoodRecommendation/RAW_recipes_og.csv')
df.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",11
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",8


Now, we will start with preprocessing.

In [2]:
df.shape

(231637, 12)

Handling duplicates

In [3]:
df.nunique()

name              230185
id                231637
minutes              888
contributor_id     27926
submitted           5090
tags              209115
nutrition         229318
n_steps               94
steps             231074
description       222668
ingredients       230475
n_ingredients         41
dtype: int64

In [4]:
df=df.drop_duplicates(subset=['name'],keep='first')
df.shape

(230186, 12)
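
The row count after deduplication (230,186) is one more than the number of unique names (230,185). A likely explanation is that `nunique()` ignores missing values, while `drop_duplicates` keeps one row whose name is missing. A minimal sketch of that behaviour on a made-up toy frame (the data below is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame: one duplicated name and one missing name
toy = pd.DataFrame({
    "name": ["chili", "chili", "gumbo", None],
    "minutes": [30, 45, 60, 10],
})

n_unique_names = toy["name"].nunique()  # missing values are not counted
deduped = toy.drop_duplicates(subset=["name"], keep="first")

print(n_unique_names)  # 2
print(len(deduped))    # 3: 'chili', 'gumbo', and the row with the missing name
```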

Handling null values

In [5]:
df.isna().sum()

name                 1
id                   0
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description       4937
ingredients          0
n_ingredients        0
dtype: int64

In [6]:
df= df.dropna() #Removing rows with NaN entry
df.shape

(225248, 12)

Now, we will take a random sample of 30,000 rows to reduce the workload. A fixed `random_state` keeps the sample reproducible.

In [7]:
df=df.sample(n=30000, random_state=50)
df=df.reset_index(drop=True)

Removing the columns that we no longer need

In [8]:
df=df.drop(['contributor_id','submitted','n_ingredients','n_steps','id'],axis=1)
df.columns

Index(['name', 'minutes', 'tags', 'nutrition', 'steps', 'description',
       'ingredients'],
      dtype='object')

Converting the ingredients and tags columns from string representations of lists to comma-separated strings

In [9]:
import ast
df['ingredients'] = df['ingredients'].apply(ast.literal_eval)
df['ingredients'] = df['ingredients'].apply(lambda x: ', '.join(x))
df['tags'] = df['tags'].apply(ast.literal_eval)
df['tags'] = df['tags'].apply(lambda x: ', '.join(x))
df.head()

Unnamed: 0,name,minutes,tags,nutrition,steps,description,ingredients
0,coffee ice cream and cookie parfaits,10,"weeknight, 15-minutes-or-less, time-to-make, c...","[1.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]",['spoon 1 tablespoons cookie crumbs into each ...,how much easier can it get than this delicious...,"chocolate sandwich style cookies, coffee ice c..."
1,chicken vegetable soup,45,"60-minutes-or-less, time-to-make, course, main...","[325.4, 11.0, 14.0, 33.0, 56.0, 9.0, 13.0]","['in a saucepan , combine the first five ingre...",from a toh mag.,"chicken broth, frozen corn, celery rib, carrot..."
2,grilled halibut with pineapple lime salsa,36,"60-minutes-or-less, time-to-make, course, main...","[569.6, 16.0, 76.0, 18.0, 171.0, 7.0, 9.0]",['to make the salsa: mix together all the sals...,another excellent recipe from cooking light.,"vegetable oil, garlic cloves, halibut steaks, ..."
3,chicken hash,85,"time-to-make, course, main-ingredient, prepara...","[570.2, 42.0, 5.0, 21.0, 92.0, 78.0, 3.0]","['in a large saucepan , bring the wine and chi...","a lady i used to work with in watsonville, cal...","white wine, chicken broth, whole boneless skin..."
4,ultra creamy mashed potatoes,35,"60-minutes-or-less, time-to-make, preparation,...","[390.1, 16.0, 12.0, 56.0, 30.0, 28.0, 19.0]",['heat the broth and potatoes in a 3-quart sau...,"mashed potatoes are one of those ""staple"" reci...","swanson chicken broth, potatoes, light cream, ..."
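
The conversion above relies on `ast.literal_eval`, which parses a string containing a Python literal without executing arbitrary code. A small standalone sketch, using a made-up ingredient string rather than a row from the dataset:

```python
import ast

# Hypothetical raw cell: a Python-style list stored as text in the CSV
raw = "['flour', 'sugar', 'butter']"

parsed = ast.literal_eval(raw)  # accepts literals only, never runs code
joined = ", ".join(parsed)

print(parsed)  # ['flour', 'sugar', 'butter']
print(joined)  # flour, sugar, butter

# Anything that is not a plain literal is rejected instead of being executed
try:
    ast.literal_eval("__import__('os')")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```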


We will now decompose the nutrition column into individual values. According to the dataset description, every value except calories is a percentage of the daily value (PDV), so it can exceed 100.

In [10]:
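import ast

def decompose_list(row):
    # Parse the string-encoded list safely; literal_eval only accepts Python literals
    try:
        values = ast.literal_eval(row)
        return pd.Series(values)
    except (ValueError, SyntaxError):
        # Malformed entry: return 7 missing values, one per nutrition field
        return pd.Series([None]*7)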

decomposed_df = df['nutrition'].apply(decompose_list)
decomposed_df.columns = ['calories', 'total fat', 'sugar', 'sodium', 'protein', 'saturated fat', 'carbohydrates']
df_final=pd.concat([df, decomposed_df], axis=1)

In [11]:
df_final = df_final.rename(columns={'total fat': 'total_fat', 'saturated fat': 'saturated_fat'})
df_final=df_final.drop(['nutrition'],axis=1)
df_final.head()

Unnamed: 0,name,minutes,tags,steps,description,ingredients,calories,total_fat,sugar,sodium,protein,saturated_fat,carbohydrates
0,coffee ice cream and cookie parfaits,10,"weeknight, 15-minutes-or-less, time-to-make, c...",['spoon 1 tablespoons cookie crumbs into each ...,how much easier can it get than this delicious...,"chocolate sandwich style cookies, coffee ice c...",1.4,0.0,0.0,0.0,0.0,0.0,0.0
1,chicken vegetable soup,45,"60-minutes-or-less, time-to-make, course, main...","['in a saucepan , combine the first five ingre...",from a toh mag.,"chicken broth, frozen corn, celery rib, carrot...",325.4,11.0,14.0,33.0,56.0,9.0,13.0
2,grilled halibut with pineapple lime salsa,36,"60-minutes-or-less, time-to-make, course, main...",['to make the salsa: mix together all the sals...,another excellent recipe from cooking light.,"vegetable oil, garlic cloves, halibut steaks, ...",569.6,16.0,76.0,18.0,171.0,7.0,9.0
3,chicken hash,85,"time-to-make, course, main-ingredient, prepara...","['in a large saucepan , bring the wine and chi...","a lady i used to work with in watsonville, cal...","white wine, chicken broth, whole boneless skin...",570.2,42.0,5.0,21.0,92.0,78.0,3.0
4,ultra creamy mashed potatoes,35,"60-minutes-or-less, time-to-make, preparation,...",['heat the broth and potatoes in a 3-quart sau...,"mashed potatoes are one of those ""staple"" reci...","swanson chicken broth, potatoes, light cream, ...",390.1,16.0,12.0,56.0,30.0,28.0,19.0
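
As an alternative to building one `pd.Series` per row, the parsed lists can be expanded into columns in a single step, which is usually faster on large frames. This is only a sketch on a toy frame; the names `toy`, `toy_final`, and `NUTRITION_COLS` are illustrative, and the two nutrition strings are copied from the first rows shown earlier:

```python
import ast
import pandas as pd

NUTRITION_COLS = ["calories", "total_fat", "sugar", "sodium",
                  "protein", "saturated_fat", "carbohydrates"]

# Toy frame with the same string-encoded layout as the `nutrition` column
toy = pd.DataFrame({
    "name": ["dish a", "dish b"],
    "nutrition": ["[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",
                  "[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]"],
})

# Parse every cell once, then expand the lists into named columns
parsed = toy["nutrition"].apply(ast.literal_eval)
nutrition = pd.DataFrame(parsed.tolist(), columns=NUTRITION_COLS, index=toy.index)

toy_final = pd.concat([toy.drop(columns="nutrition"), nutrition], axis=1)
print(toy_final)
```

Unlike `decompose_list`, this version raises on a malformed entry instead of filling it with missing values, so it assumes the column is clean.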


Converting textual columns to string before vectorization

In [12]:
df_final.dtypes

name              object
minutes            int64
tags              object
steps             object
description       object
ingredients       object
calories         float64
total_fat        float64
sugar            float64
sodium           float64
protein          float64
saturated_fat    float64
carbohydrates    float64
dtype: object

In [13]:
df_final['description'] = df_final['description'].astype(str)
df_final['name'] = df_final['name'].astype(str)
df_final['ingredients'] = df_final['ingredients'].astype(str)
df_final['tags'] = df_final['tags'].astype(str)

Saving the final dataset for retrieval at the backend

In [14]:
df_final.to_csv('/content/dishes.csv')
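
By default, `to_csv` also writes the DataFrame index, which appears as an `Unnamed: 0` column when the file is read back. Whether the backend needs that column is not stated here, so the following is only a sketch of the difference, using an in-memory buffer and a made-up frame:

```python
import io
import pandas as pd

toy = pd.DataFrame({"name": ["dish a", "dish b"], "minutes": [10, 45]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'name', 'minutes']
print(cols_without_index)  # ['name', 'minutes']
```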

Finally, we fit a Nearest Neighbours model on the data to retrieve similar dishes based on their features

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack

# Vectorizing textual columns (the vectorizer is refit on each column, so it ends up holding only the `name` vocabulary)
vectorizer = TfidfVectorizer(stop_words='english')
description_vec = vectorizer.fit_transform(df_final['description'])
ingredients_vec = vectorizer.fit_transform(df_final['ingredients'])
tags_vec = vectorizer.fit_transform(df_final['tags'])
name_vec = vectorizer.fit_transform(df_final['name'])

# Scaling numerical columns
scaler = StandardScaler()
numerical_features = df_final[['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']]
numerical_features_scaled = scaler.fit_transform(numerical_features)

# Assigning weights to each feature of the dish
combined_features = hstack([description_vec * 0.1, tags_vec * 0.1, name_vec * 0.15, ingredients_vec * 0.4, numerical_features_scaled * 0.35])

# Fitting Nearest Neighbors (cosine distance) to retrieve the most similar dishes
neigh = NearestNeighbors(n_neighbors=10,metric='cosine')
neigh.fit(combined_features)
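
One point about the weights is easy to miss: with cosine distance, each block contributes its weight squared to the dot product, so a block weighted 0.4 counts 16 times as much as one weighted 0.1, not 4 times. The sketch below checks this on two hypothetical dishes with hand-made unit-norm blocks (the vectors are made up; only the two weights come from the cell above):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two hypothetical dishes: identical ingredients block, orthogonal tags block
a_ing, a_tag = np.array([1.0, 0.0]), np.array([1.0, 0.0])
b_ing, b_tag = np.array([1.0, 0.0]), np.array([0.0, 1.0])

w_ing, w_tag = 0.4, 0.1  # weights used for ingredients and tags above
a = np.concatenate([w_ing * a_ing, w_tag * a_tag])
b = np.concatenate([w_ing * b_ing, w_tag * b_tag])

sim = cosine(a, b)
# Each block contributes weight**2 times its own dot product
expected = (w_ing**2 * 1.0 + w_tag**2 * 0.0) / (w_ing**2 + w_tag**2)

print(round(sim, 4), round(expected, 4))  # 0.9412 0.9412
```

TF-IDF rows have unit length, but standardised numeric rows do not, so the numeric block may weigh more than its 0.35 suggests for dishes with extreme nutrition values.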

In [16]:
print(df_final.iloc[29999])
input_combined_features = hstack([description_vec[29999] * 0.1, tags_vec[29999] * 0.1, name_vec[29999] * 0.15, ingredients_vec[29999] * 0.4, numerical_features_scaled[29999].reshape(1, -1) * 0.35])
nearest_dishes_distances, nearest_dishes_indices = neigh.kneighbors(input_combined_features)
nearest_dishes_indices

name                                  perfect roast pork crackling
minutes                                                        255
tags             time-to-make, course, main-ingredient, cuisine...
steps            ['score pork skin , best to have your butcher ...
description      this is the only way to make true crackling on...
ingredients      pork legs, dried fennel seed, sea salt, black ...
calories                                                    1839.2
total_fat                                                    217.0
sugar                                                          0.0
sodium                                                        14.0
protein                                                      261.0
saturated_fat                                                245.0
carbohydrates                                                  0.0
Name: 29999, dtype: object


array([[29999, 28733, 10883, 20473,  6080, 28311, 20973,   151, 11813,
        11048]])

In [17]:
df_final.iloc[nearest_dishes_indices[0]][["name","ingredients"]]

Unnamed: 0,name,ingredients
29999,perfect roast pork crackling,"pork legs, dried fennel seed, sea salt, black ..."
28733,baked honey and garlic ribs,"pork ribs, honey, garlic cloves, ginger, tabas..."
10883,steph s pork riblets,"beef broth, soy sauce, brown sugar, pork rible..."
20473,tender crock pot roast beef,"roast, cream of mushroom soup, onion, sliced m..."
6080,grandma g s baked thanksgiving turkey,"turkey, italian salad dressing, butter, celery..."
28311,louisiana chicken and sausage gumbo the real s...,"whole chicken, sausage, water, butter, flour, ..."
20973,olive and lemon chicken,"black olives, extra virgin olive oil, butter, ..."
151,apple barbecued ribs,"spareribs, onion, garlic, vegetable oil, apple..."
11813,spareribs saucy,"spareribs, water, honey, soy sauce, cream sher..."
11048,chinese style crock pot spareribs,"soy sauce, orange marmalade, catsup, garlic cl..."
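
The first result is the query dish itself, because it is part of the fitted data and has distance 0 to itself. One way to get only other dishes is to ask for one extra neighbour and filter out the query row. A minimal sketch on made-up 2-D vectors, where `similar_rows` is a hypothetical helper name:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D feature vectors standing in for combined_features
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.3]])
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X)

def similar_rows(model, X, row, k=2):
    """Return the k nearest rows to X[row], excluding the row itself."""
    _, idx = model.kneighbors(X[row].reshape(1, -1), n_neighbors=k + 1)
    return [int(i) for i in idx[0] if i != row][:k]

print(similar_rows(nn, X, row=0))  # [1, 3]
```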


Saving the trained model as a pkl file to use in production

In [18]:
import pickle

# Saving the NearestNeighbors model to a file
with open('nearest_dishes_model.pkl', 'wb') as file:
    pickle.dump(neigh, file)
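
The pickled `NearestNeighbors` object can only answer queries that are already expressed as feature vectors. To featurise a dish that is not in the dataset, the backend would presumably also need one fitted vectorizer per text column and the fitted scaler. The sketch below shows one way to bundle them, assuming toy data and only a subset of the columns; `featurise`, `artifacts`, and the toy frames are hypothetical names, not part of the notebook:

```python
import pickle
import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a subset of the notebook's columns
toy = pd.DataFrame({
    "name": ["simple syrup", "tomato jam", "roast pork"],
    "ingredients": ["sugar, water", "tomatoes, sugar, water", "pork legs, sea salt"],
    "calories": [100.0, 150.0, 1800.0],
})
TEXT_WEIGHTS = {"name": 0.15, "ingredients": 0.4}  # weights taken from the notebook
NUM_COLS, NUM_WEIGHT = ["calories"], 0.35

# One vectorizer per text column, so each keeps its own vocabulary
vectorizers = {c: TfidfVectorizer(stop_words="english").fit(toy[c]) for c in TEXT_WEIGHTS}
scaler = StandardScaler().fit(toy[NUM_COLS])

def featurise(frame, vectorizers, scaler):
    """Build the weighted feature matrix for any frame with the same columns."""
    blocks = [vectorizers[c].transform(frame[c]) * w for c, w in TEXT_WEIGHTS.items()]
    blocks.append(csr_matrix(scaler.transform(frame[NUM_COLS]) * NUM_WEIGHT))
    return hstack(blocks).tocsr()

model = NearestNeighbors(n_neighbors=2, metric="cosine")
model.fit(featurise(toy, vectorizers, scaler))

# Persist everything needed to answer a query, not just the neighbour index
blob = pickle.dumps({"model": model, "vectorizers": vectorizers, "scaler": scaler})
artifacts = pickle.loads(blob)

# Query with a dish that was never part of the fitted data
new_dish = pd.DataFrame({"name": ["sugar syrup"],
                         "ingredients": ["water, sugar"],
                         "calories": [110.0]})
query = featurise(new_dish, artifacts["vectorizers"], artifacts["scaler"])
_, idx = artifacts["model"].kneighbors(query)
nearest_names = toy.iloc[idx[0]]["name"].tolist()
print(nearest_names)
```

In a real deployment the bundle would be written to a file with `pickle.dump`, as in the cell above, and loaded only from a trusted source.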

## Creating a model for recommendations based only on ingredients

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import hstack

vectorizer = TfidfVectorizer(stop_words='english')
ingredients_vec = vectorizer.fit_transform(df_final['ingredients'])

neigh2 = NearestNeighbors(n_neighbors=10, metric="cosine")
neigh2.fit(ingredients_vec)

In [22]:
input_ingredients='water, sugar'
input_vec=vectorizer.transform([input_ingredients])
nearest_dishes_distances, nearest_dishes_indices = neigh2.kneighbors(input_vec)
nearest_dishes_indices

array([[25265, 21328, 11131, 13744,   878, 13757, 24010, 25564,  1734,
        20990]])

In [23]:
print(df_final.iloc[nearest_dishes_indices[0]][["name","ingredients"]])

                                                    name  \
25265                          corn syrup   simple syrup   
21328                                      simple syrups   
11131                       dulce de tomate   tomato jam   
13744  sugar syrup for light or dark spirits   pete e...   
878              perfect lemonade  real lemons and sugar   
13757                                 doily sugar starch   
24010                           candied ginger and syrup   
25564        all purpose mild brine for poultry and pork   
1734                                   frozen jello pops   
20990                                clove lemonade base   

                                   ingredients  
25265                             sugar, water  
21328                             water, sugar  
11131                   tomatoes, sugar, water  
13744          water, white sugar, brown sugar  
878      sugar, water, lemon juice, cold water  
13757                  granulated sugar, wate
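
For the backend, the query steps can be wrapped in a single function that also reports a similarity score (1 minus the cosine distance). This is a sketch on a made-up mini-corpus; `recommend_by_ingredients`, `names`, and `ingredients` are illustrative names rather than part of the notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical mini-corpus of comma-separated ingredient strings
names = ["simple syrup", "tomato jam", "lemonade", "roast pork"]
ingredients = ["sugar, water", "tomatoes, sugar, water",
               "sugar, water, lemon juice", "pork legs, sea salt"]

vec = TfidfVectorizer(stop_words="english")
index = NearestNeighbors(n_neighbors=3, metric="cosine")
index.fit(vec.fit_transform(ingredients))

def recommend_by_ingredients(query, k=3):
    """Return (name, cosine similarity) pairs for the k closest dishes."""
    distances, indices = index.kneighbors(vec.transform([query]), n_neighbors=k)
    return [(names[i], round(float(1.0 - d), 3))
            for i, d in zip(indices[0], distances[0])]

results = recommend_by_ingredients("water, sugar")
print(results)
```

Because TF-IDF ignores word order, the query `water, sugar` matches the recipe `sugar, water` exactly, which is why the two simple-syrup recipes come first in the output above.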

In [24]:
import pickle

with open('nearest_ingredients_model.pkl', 'wb') as file:
    pickle.dump(neigh2, file)