# MyFoodGenie - Food Recommendation System

## Sarah Kim

Motivation:
People all around the world are becoming more concerned about their health and lifestyle. However, avoiding junk food and exercising are not enough on their own; we also need to eat a balanced diet. A balanced diet suited to our height, weight, and age helps us live a healthy life.

## Content Based Filtering & Collaborative Filtering





Content-based Filtering: Recommends food items based on their attributes (like C_Type, Veg_Non, Describe). If a user liked a particular type of food, I can recommend other foods with similar attributes.

Collaborative Filtering: Uses the ratings given by users to recommend food items based on other users' preferences. For instance, if User A and User B both liked a particular food item, and User A liked another food item, then that item might be recommended to User B.
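
The User A / User B idea can be sketched with a tiny ratings table. Everything below is an illustrative assumption: the users, the ratings, the "liked" threshold of 4, and the helper names `liked` and `recommend` are made up and do not come from ratings.csv.

```python
# Toy ratings: user -> {food: rating}. All values are made up for illustration.
ratings = {
    "A": {"tricolour salad": 5, "christmas cake": 4},
    "B": {"tricolour salad": 5},
    "C": {"spanish fish fry": 5},
}
LIKED = 4  # assumed threshold for "liked"

def liked(user):
    """Return the set of foods the user rated at or above the threshold."""
    return {food for food, r in ratings[user].items() if r >= LIKED}

def recommend(user):
    """Suggest foods liked by users who share at least one liked food."""
    mine = liked(user)
    suggestions = set()
    for other in ratings:
        if other != user and mine & liked(other):
            # Keep only foods the target user has not rated yet
            suggestions |= liked(other) - set(ratings[user])
    return sorted(suggestions)

print(recommend("B"))  # foods liked by A, who shares a liked food with B
```

This is only the intuition; the collaborative filtering section later in the notebook uses a matrix-factorization model (SVD) rather than this direct overlap rule.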


## Data Prep

In [214]:
import pandas as pd
import numpy as np

food_data = pd.read_csv('/dataset.csv')
ratings_data = pd.read_csv('/ratings.csv')

In [215]:
food_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Food_ID   400 non-null    int64 
 1   Name      400 non-null    object
 2   C_Type    400 non-null    object
 3   Veg_Non   400 non-null    object
 4   Describe  400 non-null    object
dtypes: int64(1), object(4)
memory usage: 15.8+ KB


### Food Data (dataset.csv):

Food_ID: Identifier for the food item.

Name: Name of the food item.

C_Type: Type or category of the food (e.g., Healthy Food, Snack, Dessert).

Veg_Non: Vegetarian or Non-Vegetarian classification.

Describe: Ingredients or a brief description of the food.

In [216]:
food_data.head()

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
0,1,summer squash salad,Healthy Food,veg,"white balsamic vinegar, lemon juice, lemon rin..."
1,2,chicken minced salad,Healthy Food,non-veg,"olive oil, chicken mince, garlic (minced), oni..."
2,3,sweet chilli almonds,Snack,veg,"almonds whole, egg white, curry leaves, salt, ..."
3,4,tricolour salad,Healthy Food,veg,"vinegar, honey/sugar, soy sauce, salt, garlic ..."
4,5,christmas cake,Dessert,veg,"christmas dry fruits (pre-soaked), orange zest..."


In [217]:
food_data.shape

(400, 5)

In [218]:
food_data.duplicated().sum() #Any duplicated data? - no

0

In [219]:
ratings_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 512 entries, 0 to 511
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   User_ID  511 non-null    float64
 1   Food_ID  511 non-null    float64
 2   Rating   511 non-null    float64
dtypes: float64(3)
memory usage: 12.1 KB


### Ratings Data (ratings.csv):

User_ID: Identifier for the user.

Food_ID: Identifier for the food item (links to the Food Data).

Rating: The rating given by the user for the food item.


In [220]:
ratings_data.head()

Unnamed: 0,User_ID,Food_ID,Rating
0,1.0,88.0,4.0
1,1.0,46.0,3.0
2,1.0,24.0,5.0
3,1.0,25.0,4.0
4,2.0,49.0,1.0


In [221]:
ratings_data.shape

(512, 3)

In [222]:
ratings_data.duplicated().sum() #Any duplicated data? - no

0

# Content-Based Filtering


## Recommendation Based on 'Describe'

### TfidfVectorizer

In [223]:
# Compute TF-IDF vectors for each food item's 'Describe' text
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

tfidf_vectorizer = TfidfVectorizer(stop_words='english') # remove common English stop words
ENGLISH_STOP_WORDS # the stop words that get removed

tfidf_matrix = tfidf_vectorizer.fit_transform(food_data['Describe'])
tfidf_matrix.shape # 400 descriptions, 1175 distinct terms

(400, 1175)

In [224]:
food_data['Describe'].isnull().values.any() # True if any description is missing; False means none are


False

In [225]:
# Compute the cosine similarity between food items

from sklearn.metrics.pairwise import linear_kernel

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim.shape

(400, 400)

The cosine similarity matrix has been computed successfully. It has a shape of 400×400, which means the similarity has been computed for each of the 400 food items against every other item.
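
`linear_kernel` computes plain dot products. It yields cosine similarities here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). A minimal check on three made-up descriptions (the strings are illustrative stand-ins, not rows from dataset.csv):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up ingredient lists, standing in for the 'Describe' column
docs = [
    "lemon juice, olive oil, salt",
    "olive oil, garlic, chicken",
    "sugar, butter, flour",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

dot = linear_kernel(tfidf, tfidf)      # dot products of the rows
cos = cosine_similarity(tfidf, tfidf)  # explicit cosine similarity

print(np.allclose(dot, cos))  # the two matrices agree
print(dot.shape)              # one row and one column per document
```

Documents with no shared terms (the first and third here) get a similarity of 0, and every document has a similarity of 1 with itself.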

### Creating a function to get recommendations based on a given food item

This function will provide a list of food items that are most similar to the given item based on the computed cosine similarities.

In [226]:
# Map each food name to its row index in food_data

index = pd.Series(food_data.index, index=food_data['Name']).drop_duplicates()
index # look up the row index of a food by name, e.g. index['chicken minced salad']

Name
summer squash salad                                          0
chicken minced salad                                         1
sweet chilli almonds                                         2
tricolour salad                                              3
christmas cake                                               4
                                                          ... 
Kimchi Toast                                               395
Tacos de Gobernador (Shrimp, Poblano, and Cheese Tacos)    396
Melted Broccoli Pasta With Capers and Anchovies            397
Lemon-Ginger Cake with Pistachios                          398
Rosemary Roasted Vegetables                                399
Length: 400, dtype: int64

In [227]:
index['chicken minced salad']

1

In [228]:
food_data.iloc[[1]]

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
1,2,chicken minced salad,Healthy Food,non-veg,"olive oil, chicken mince, garlic (minced), oni..."


In [229]:
# Return the 10 food items most similar to the given food name
def get_recommendations(Name, cosine_sim=cosine_sim):

  # Look up the row index of the given food name
  idx = index[Name]

  # Pair every item's index with its similarity to the given item: (index, score)
  sim_scores = list(enumerate(cosine_sim[idx]))

  # Sort by similarity score, highest first
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

  # Keep the next 10 items, skipping the first entry (normally the item itself)
  sim_scores = sim_scores[1:11]

  # Extract the row indices of those 10 items
  food_index = [i[0] for i in sim_scores]

  # Return the names of the 10 most similar food items
  return food_data['Name'].iloc[food_index]
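
The slice `sim_scores[1:11]` assumes the queried item sorts first. When several items tie at the top score (possible when many items share identical attributes), that may not hold. A hedged variant, shown on a made-up 4×4 similarity matrix, filters the query index out explicitly. `top_similar` is a hypothetical helper name, not part of the notebook.

```python
import numpy as np

def top_similar(idx, sim, n=10):
    """Return indices of the n items most similar to item idx, excluding idx."""
    scores = sim[idx]
    # Stable sort on negated scores: highest first, ties keep index order
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order if i != idx][:n]

# Made-up similarity matrix: items 0, 1 and 3 are identical to each other
sim = np.array([
    [1.0, 1.0, 0.5, 1.0],
    [1.0, 1.0, 0.5, 1.0],
    [0.5, 0.5, 1.0, 0.5],
    [1.0, 1.0, 0.5, 1.0],
])

print(top_similar(3, sim, n=2))  # item 3 itself is never returned
```

With a positional slice, querying item 3 on this matrix would skip item 0 and return item 3 as its own recommendation, because items 0 and 1 tie with it and sort ahead of it.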

### TEST

In [230]:
# Test the function with a sample food item
recommendations = get_recommendations("summer squash salad")
recommendations

163                                 green cucumber shots
69               shepherds salad (tamatar-kheera salaad)
220    amaranthus granola with lemon yogurt, berries ...
16               baked namakpara with roasted almond dip
143                            shrimp & cilantro ceviche
378         Grilled Chicken with Almond and Garlic Sauce
160                                     spanish fish fry
177                                  oats shallots pulao
131                        coffee marinated mutton chops
85                     roast turkey with cranberry sauce
Name: Name, dtype: object

The function has provided the top 10 recommended food items based on the given food item "summer squash salad". Here are the recommendations:

Green cucumber shots

Shepherds salad (Tamatar-Kheera Salaad)

Amaranthus granola with lemon yogurt, berries, and honey

Baked namakpara with roasted almond dip

Shrimp & cilantro ceviche

Grilled Chicken with Almond and Garlic Sauce

Spanish fish fry

Oats shallots pulao

Coffee marinated mutton chops

Roast turkey with cranberry sauce

## Recommendation Based on 'C_Type' and 'Veg_Non'

In [231]:
food_data.head(3)

Unnamed: 0,Food_ID,Name,C_Type,Veg_Non,Describe
0,1,summer squash salad,Healthy Food,veg,"white balsamic vinegar, lemon juice, lemon rin..."
1,2,chicken minced salad,Healthy Food,non-veg,"olive oil, chicken mince, garlic (minced), oni..."
2,3,sweet chilli almonds,Snack,veg,"almonds whole, egg white, curry leaves, salt, ..."


In [232]:
food_data.loc[0, 'C_Type']

'Healthy Food'

In [233]:
def data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(' ', '')) for i in x]
    else:
        if isinstance(x, str):
            return str.lower(x.replace(' ', ''))
        else:
            return ''

In [234]:
features = ['C_Type', 'Veg_Non']
for feature in features:
    food_data[feature] = food_data[feature].apply(data)

In [235]:
food_data[['Name', 'C_Type', 'Veg_Non']].head(3)

Unnamed: 0,Name,C_Type,Veg_Non
0,summer squash salad,healthyfood,veg
1,chicken minced salad,healthyfood,non-veg
2,sweet chilli almonds,snack,veg


In [236]:
def create_soup(x):
    return ''.join(x['C_Type']) + ' ' + ''.join(x['Veg_Non'])
food_data['soup'] = food_data.apply(create_soup, axis=1)
food_data['soup']

0          healthyfood veg
1      healthyfood non-veg
2                snack veg
3          healthyfood veg
4              dessert veg
              ...         
395             korean veg
396        mexican non-veg
397         french non-veg
398        dessert non-veg
399        healthyfood veg
Name: soup, Length: 400, dtype: object

In [237]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(food_data['soup'])
count_matrix

<400x17 sparse matrix of type '<class 'numpy.int64'>'
	with 962 stored elements in Compressed Sparse Row format>

In [238]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
cosine_sim2

array([[1.        , 0.81649658, 0.5       , ..., 0.40824829, 0.40824829,
        1.        ],
       [0.81649658, 1.        , 0.40824829, ..., 0.66666667, 0.66666667,
        0.81649658],
       [0.5       , 0.40824829, 1.        , ..., 0.40824829, 0.40824829,
        0.5       ],
       ...,
       [0.40824829, 0.66666667, 0.40824829, ..., 1.        , 0.66666667,
        0.40824829],
       [0.40824829, 0.66666667, 0.40824829, ..., 0.66666667, 1.        ,
        0.40824829],
       [1.        , 0.81649658, 0.5       , ..., 0.40824829, 0.40824829,
        1.        ]])
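
The value 0.8165 between the first two rows follows from how `CountVectorizer` tokenizes the soup: its default token pattern splits `non-veg` into `non` and `veg`, so `healthyfood veg` and `healthyfood non-veg` share two tokens. A small reproduction on three soup strings taken from the output above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

soups = ["healthyfood veg", "healthyfood non-veg", "snack veg"]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(soups)

print(sorted(vec.vocabulary_))  # tokens the vectorizer extracted
sim = cosine_similarity(counts)
print(np.round(sim, 4))

# Hand check for rows 0 and 1: 2 shared tokens / (sqrt(2) * sqrt(3))
print(round(2 / (np.sqrt(2) * np.sqrt(3)), 4))
```

One consequence is that a non-veg item still scores fairly high against a veg item of the same category. If `non-veg` should stay a single token, a custom `token_pattern` would be one option; the notebook does not make that choice.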

In [239]:
index['summer squash salad']

0

In [240]:
food_data = food_data.reset_index()
index = pd.Series(food_data.index, index=food_data['Name'])
index

Name
summer squash salad                                          0
chicken minced salad                                         1
sweet chilli almonds                                         2
tricolour salad                                              3
christmas cake                                               4
                                                          ... 
Kimchi Toast                                               395
Tacos de Gobernador (Shrimp, Poblano, and Cheese Tacos)    396
Melted Broccoli Pasta With Capers and Anchovies            397
Lemon-Ginger Cake with Pistachios                          398
Rosemary Roasted Vegetables                                399
Length: 400, dtype: int64

### LET'S TEST HOW IT WORKS

In [241]:
get_recommendations('summer squash salad', cosine_sim2)

3                        tricolour salad
8                   cream of almond soup
9               broccoli and almond soup
10             coconut lime quinoa salad
12    watermelon and strawberry smoothie
13    peach, raspberry and nuts smoothie
26                  hawaiin papaya salad
27               vegetable som tam salad
33         mixed berry & banana smoothie
34                banana walnut smoothie
Name: Name, dtype: object

In [242]:
food_data.loc[0] #data of summer squash salad

index                                                       0
Food_ID                                                     1
Name                                      summer squash salad
C_Type                                            healthyfood
Veg_Non                                                   veg
Describe    white balsamic vinegar, lemon juice, lemon rin...
soup                                          healthyfood veg
Name: 0, dtype: object

In [243]:
food_data.loc[3] #data of tricolour salad

index                                                       3
Food_ID                                                     4
Name                                          tricolour salad
C_Type                                            healthyfood
Veg_Non                                                   veg
Describe    vinegar, honey/sugar, soy sauce, salt, garlic ...
soup                                          healthyfood veg
Name: 3, dtype: object

Both summer squash salad and tricolour salad are healthyfood and veg, which is why tricolour salad appears at the top of the recommendations.

In [244]:
get_recommendations('christmas cake', cosine_sim2)

6                chocolate nero cookies
17                 grilled almond barfi
20                          apple rabdi
22                 dates and nuts ladoo
23           green lentil dessert fudge
24                   cashew nut cookies
31            almond and amaranth ladoo
50    christmas chocolate fudge cookies
65                   betel nut popsicle
68         banana and maple ice lollies
Name: Name, dtype: object

# Collaborative Filtering

## Data Prep

In [245]:
!pip install scikit-surprise



In [246]:
import surprise
surprise.__version__

'1.1.3'

In [247]:
import pandas as pd
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate, train_test_split

In [248]:
ratings_data.head()

Unnamed: 0,User_ID,Food_ID,Rating
0,1.0,88.0,4.0
1,1.0,46.0,3.0
2,1.0,24.0,5.0
3,1.0,25.0,4.0
4,2.0,49.0,1.0


In [249]:
#Checking the shape
ratings_data.shape

(512, 3)

In [250]:
# Checking for null values
ratings_data.isnull().sum()

User_ID    1
Food_ID    1
Rating     1
dtype: int64

In [251]:
ratings_data = ratings_data.dropna()

In [252]:
ratings_data.isnull().sum()

User_ID    0
Food_ID    0
Rating     0
dtype: int64

In [253]:
ratings_data['Rating'].min()

1.0

In [254]:
ratings_data['Rating'].max()

10.0

In [255]:
reader = Reader(rating_scale=(1, 10))

In [256]:
data = Dataset.load_from_df(ratings_data[['User_ID', 'Food_ID', 'Rating']], reader=reader)
data

<surprise.dataset.DatasetAutoFolds at 0x7de76306cc10>

## Cross Validate(RMSE, MAE)

### K-Fold

In [257]:
svd = SVD(random_state=0)

In [258]:
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.9561  2.9035  2.5945  2.9894  2.9562  2.8799  0.1454  
MAE (testset)     2.5117  2.5200  2.2200  2.5445  2.5536  2.4700  0.1259  
Fit time          0.01    0.01    0.01    0.01    0.01    0.01    0.00    
Test time         0.27    0.00    0.00    0.00    0.00    0.06    0.11    


{'test_rmse': array([2.9561382 , 2.90346085, 2.59446394, 2.9894466 , 2.95618132]),
 'test_mae': array([2.51170808, 2.5199686 , 2.21998911, 2.5445213 , 2.55357603]),
 'fit_time': (0.011940479278564453,
  0.010019779205322266,
  0.009433984756469727,
  0.009714841842651367,
  0.0101776123046875),
 'test_time': (0.2709922790527344,
  0.0009374618530273438,
  0.0008845329284667969,
  0.00128173828125,
  0.000997781753540039)}
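
RMSE and MAE both summarize the gap between predicted and actual ratings; RMSE penalizes large misses more heavily. A minimal sketch with made-up numbers (not taken from ratings.csv):

```python
import numpy as np

actual = np.array([4.0, 3.0, 5.0, 10.0])    # made-up true ratings
predicted = np.array([5.0, 5.0, 5.0, 6.0])  # made-up model estimates

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error

print(rmse, mae)
```

On a 1 to 10 scale, the mean RMSE of about 2.88 reported above means the predictions are typically off by almost three rating points.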

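cv = 5 divides the data into 5 folds; each fold serves once as the test set while the other four are used for training. For example, with 100 data points: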

A:1-20

B:21-40

C:41-60

D:61-80

E:81-100

ABCD (train set) E (test set)

ABCE (train set) D (test set)

ABDE (train set) C (test set)

ACDE (train set) B (test set)

BCDE (train set) A (test set)
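
The rotation above can be sketched in a few lines. This uses contiguous folds over the numbers 1 to 100 purely for illustration; `k_fold_splits` is a hypothetical helper, and Surprise's `cross_validate` builds its own folds internally (shuffled by default, as far as I know).

```python
def k_fold_splits(items, k):
    """Yield (train, test) lists, using each of k contiguous folds as test once.

    Assumes len(items) is divisible by k, as in the 100-item example.
    """
    fold_size = len(items) // k
    folds = [items[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training data is every fold except the current test fold
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data_points = list(range(1, 101))  # stand-in for 100 ratings
for train, test in k_fold_splits(data_points, 5):
    print(f"test {test[0]}-{test[-1]}: {len(train)} train, {len(test)} test")
```

Every data point lands in a test set exactly once, so the five scores together cover the whole dataset.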

In [259]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7de76306d5d0>

In [260]:
ratings_data[ratings_data['User_ID'] == 1] #how userid = 1 rated the food

Unnamed: 0,User_ID,Food_ID,Rating
0,1.0,88.0,4.0
1,1.0,46.0,3.0
2,1.0,24.0,5.0
3,1.0,25.0,4.0


In [261]:
svd.predict(1, 12) # predict how User_ID = 1 would rate Food_ID = 12; the predicted rating is the 'est' field

Prediction(uid=1, iid=12, r_ui=None, est=4.4544710848220825, details={'was_impossible': False})

In [262]:
svd.predict(1, 300, 3) # User_ID = 1's actual rating for Food_ID = 300 is 3 (passed as r_ui); compare it with the predicted score in 'est'

Prediction(uid=1, iid=300, r_ui=3, est=5.31926947929299, details={'was_impossible': False})

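The predicted rating (about 5.3) is far from the actual rating of 3, so the model is not very accurate here. I think this is because the dataset is small (511 ratings).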

In [263]:
ratings_data[ratings_data['User_ID'] == 100]

Unnamed: 0,User_ID,Food_ID,Rating
508,100.0,24.0,10.0
509,100.0,233.0,10.0
510,100.0,29.0,7.0


In [264]:
svd.predict(100, 300) # User_Id = 100, Food_Id = 300

Prediction(uid=100, iid=300, r_ui=None, est=6.826004026075378, details={'was_impossible': False})

The model predicts that User_ID 100 would rate Food_ID 300 about 6.8 out of 10.