Notes: Stage 1 is candidate generation. This step uses TF-IDF and cosine similarity to filter the top 100 foods.

## **Import Libraries**

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from autocorrect import Speller
import numpy as np

In [None]:
# Text Preprocessing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [None]:
df = pd.read_csv('./dataset/preprocessed_recipes.csv')

In [None]:
df.head(2)

Unnamed: 0,RecipeId,Calories,FatContent,SaturatedFatContent,CholesterolContent,SodiumContent,CarbohydrateContent,FiberContent,SugarContent,ProteinContent,NameClean,RecipeIngredientPartsClean,RecipeInstructionsClean,Combined
0,38,170.9,2.5,1.3,8.0,29.8,37.1,3.6,30.2,3.2,lowfat berry blue frozen dessert,blueberry granulated sugar vanilla yogurt lemo...,toss 2 ups berry sugar let stand 45 minute sti...,blueberry granulated sugar vanilla yogurt lemo...
1,41,536.1,24.0,3.8,0.0,1558.6,64.2,17.3,32.1,29.3,carinas tofuvegetable kebabs,extra firm tofu eggplant zuhini mushroom soy s...,drain tofu arefully squeezing ex water pat dry...,extra firm tofu eggplant zuhini mushroom soy s...


# **Stage 1: Candidate Generation**
## **Content-Based Filtering**

*Description:*

Similar items, such as recipes that share key ingredients, sit close together in the embedding space. The candidate generator works as follows:
- Given a user, the system looks for items that are close to that user in the embedding space.
- The notion of "closeness" is defined by a similarity measure.


*Vectorization:*

For a dietary food recommender system, TF-IDF is a good fit: it down-weights terms that appear in most recipes and up-weights distinctive ingredient and instruction words, so it captures what sets a recipe apart better than binary features or plain bag-of-words (BoW) counts.
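
To make the weighting concrete, here is a minimal hand-rolled sketch on a hypothetical three-recipe corpus. It uses the smoothed IDF formula that scikit-learn documents for `TfidfVectorizer` with default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation. The corpus and the helper names `idf` and `tfidf` are illustrative assumptions, not part of this notebook.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the "Combined" recipe text.
docs = [
    "sugar butter flour",
    "sugar tofu soy",
    "sugar beef onion",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed IDF (the smooth_idf=True variant): common terms get lower weight.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(tokens):
    # Raw term count times IDF, then L2-normalised to unit length.
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


weights = tfidf(tokenized[1])
print(weights)
```

In this toy corpus "sugar" occurs in every recipe, so it ends up with a smaller weight than "tofu" or "soy", which is the behaviour the paragraph above relies on.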

*Similarity:*

For a dietary food recommender system, cosine similarity is an appropriate choice because it handles high-dimensional sparse data, such as TF-IDF vectors, well. It compares the direction of the vectors rather than their magnitude, so a long recipe and a short one with the same term proportions score as similar.
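
The direction-versus-magnitude point can be checked with a few lines of plain Python. This is a sketch on made-up three-dimensional vectors; the `cosine` helper and the vector names are hypothetical and only mirror what `sklearn.metrics.pairwise.cosine_similarity` computes for a single pair.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention used here: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


short_recipe = [1.0, 2.0, 0.0]
long_recipe = [3.0, 6.0, 0.0]   # same proportions, three times the magnitude
unrelated = [0.0, 0.0, 5.0]     # shares no terms with the others

same_direction = cosine(short_recipe, long_recipe)
no_overlap = cosine(short_recipe, unrelated)
print(same_direction, no_overlap)
```

Scaling a vector leaves its score unchanged, while vectors with no shared terms score zero, which is why so many recipes receive a similarity of 0.0 for a one-word query.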

## User Input

In [None]:
user_favorite_foods = ['fish', 'beef']

# Generate combinations for similarity calculations
user_favorites = user_favorite_foods + [' '.join(user_favorite_foods)]

In [None]:
user_favorites

['fish', 'beef', 'fish beef']
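
A small helper can make the query list robust to casing, stray whitespace, and repeated entries before the combined query is appended. This is a sketch only: `build_queries` is a hypothetical name, and lowercasing with whitespace collapsing is an assumption about how the corpus text was cleaned, not something the dataset confirms.

```python
def build_queries(favorite_foods):
    """Return cleaned favourites plus one combined query string."""
    cleaned = []
    for food in favorite_foods:
        # Lowercase and collapse whitespace (assumed to match the corpus cleaning).
        normalized = " ".join(food.lower().split())
        if normalized and normalized not in cleaned:
            cleaned.append(normalized)
    # A combined query only adds information when there are several favourites.
    if len(cleaned) > 1:
        cleaned.append(" ".join(cleaned))
    return cleaned


queries = build_queries(["Fish", " beef ", "fish"])
print(queries)
```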

## Vectorization: TF-IDF & Cosine Similarity

TF-IDF: Converts text into numerical vectors by capturing term importance within a document and across the corpus.

In [None]:
df_combined = df["Combined"]
df_combined

0         blueberry granulated sugar vanilla yogurt lemo...
1         extra firm tofu eggplant zuhini mushroom soy s...
2         plain tomato juie abbage onion arrots elerycab...
3         sugar margarine egg flour salt buttermilk grah...
4         butter brown sugar granulated sugar vanilla ex...
                                ...                        
261440    selfrising flour shortening milk buttermilk sh...
261441    salted butter allpurpose flour iing sugar whit...
261442    hamburger onion elery water hestnut dried dill...
261443    allpurpose flour brown sugar butter ground inn...
261444    fresh ginger unsalted butter dark brown sugar ...
Name: Combined, Length: 261445, dtype: object

In [None]:
# Create a TfidfVectorizer and fit it on the combined recipe text to get a TF-IDF matrix
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(df_combined)

In [None]:
# Transform user's favorite foods to TF-IDF vector
user_favorite_vector = tfidf.transform(user_favorites)

Cosine Similarity: Measures the similarity between two vectors as the cosine of the angle between them, providing a metric to compare documents.

In [None]:
# Calculate the cosine similarity between each user query and every recipe
# (one row per query, one column per recipe)
cosine_similarities = cosine_similarity(user_favorite_vector, tfidf_matrix)

In [None]:
# Initialize a dictionary to store similarities
similarity_dict = {}

# Calculate similarities for each string
for favorite in user_favorites:
    # Transform the string to a TF-IDF vector
    favorite_vector = tfidf.transform([favorite])

    # Calculate cosine similarities
    similarities = cosine_similarity(favorite_vector, tfidf_matrix).flatten()

    # Store the similarities in the dictionary
    similarity_dict[favorite] = similarities
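
The per-query loop can also be expressed as one matrix product. `TfidfVectorizer` L2-normalises its rows by default, and for unit-length rows the cosine similarity equals the dot product. The sketch below demonstrates this with small dense NumPy arrays; the arrays and the `l2_normalize` helper are hypothetical stand-ins for the real sparse TF-IDF matrices.

```python
import numpy as np


def l2_normalize(matrix):
    """Scale each row to unit length, leaving all-zero rows untouched."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return matrix / norms


# Hypothetical stand-ins: 3 recipes and 2 queries over a 3-term vocabulary.
recipe_vectors = l2_normalize(np.array([[1.0, 0.0, 2.0],
                                        [0.0, 3.0, 0.0],
                                        [1.0, 1.0, 0.0]]))
query_vectors = l2_normalize(np.array([[1.0, 0.0, 0.0],
                                       [0.0, 1.0, 0.0]]))
query_names = ["fish", "beef"]

# One product gives every query-recipe similarity: shape (n_queries, n_recipes).
similarity_matrix = query_vectors @ recipe_vectors.T

# Row i holds the scores for query i, matching the dictionary built by the loop.
similarity_by_query = {name: similarity_matrix[i] for i, name in enumerate(query_names)}
print(similarity_matrix.shape)
```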

In [None]:
similarity_df = pd.DataFrame(similarity_dict, index=df_combined)
similarity_df.head(2)

Combined,fish,beef,fish beef
blueberry granulated sugar vanilla yogurt lemon juielowfat berry blue frozen desserttoss 2 ups berry sugar let stand 45 minute stirring oasionally transfer berrysugar mixture food proessor add yogurt proess smooth strain fine sieve pour baking pan transfer ie ream maker proess aording manufaturers diretions freeze unovered edge solid entre soft transfer proessor blend smooth return pan freeze edge solid transfer proessor blend smooth fold remaining 2 ups blueberry pour plasti mold freeze overnight let soften slightly serve,0.0,0.0,0.0
extra firm tofu eggplant zuhini mushroom soy saue low sodium soy saue olive oil maple syrup honey red wine vinegar lemon juie garli love mustard powder blak peppercarinas tofuvegetable kebabsdrain tofu arefully squeezing ex water pat dry paper towel cut tofu oneinh square set aside cut eggplant lengthwise half ut eah half approximately three strip cut strip rosswise oneinh ubes slie zuhini halfinh thik slies cut red pepper half removing stem seed ut eah half oneinh square wipe mushroom lean moist paper towel remove stem thread tofu vegetable barbeue skewer alternating olor ombinations example first piee eggplant slie tofu zuhini red pepper baby orn mushroom continue way skewer full make marinade putting ingredient blender blend high speed one minute mixed alternatively put ingredient glass jar tightly lid shake well mixed lay kebab long shallow baking pan nonmetal tray making sure lie flat evenly pour marinade kebab turning one tofu vegetable oated refrigerate kebab three eight hour oasionally spooning marinade broil grill kebab 450 f 1520 minute grill vegetable browned suggestion meal served ooked brown rie amount easily doubled make four serving,0.0,0.0,0.0


## Filter High Similarity

In [None]:
top_n_high = 50
high_similarity_candidates = []

for favorite, similarities in similarity_dict.items():
    # Get top n indices
    top_n_indices = similarities.argsort()[-top_n_high:][::-1]

    # Select the top n candidate recipes
    candidate_recipes = df.iloc[top_n_indices].copy()
    candidate_recipes['cosine_similarity'] = similarities[top_n_indices]

    # Append the candidate DataFrame to the list
    high_similarity_candidates.append(candidate_recipes)
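
Sorting all 261,445 scores just to keep 50 of them works, but a partial sort is enough. The sketch below uses `np.argpartition` to find the top entries and then orders only those; `top_n_indices` is a hypothetical helper name, and the scores are made up for illustration.

```python
import numpy as np


def top_n_indices(scores, n):
    """Indices of the n largest scores, ordered from highest to lowest."""
    n = min(n, len(scores))
    if n == 0:
        return np.array([], dtype=int)
    # Partition so the n largest values occupy the last n slots (in any order).
    candidates = np.argpartition(scores, -n)[-n:]
    # Sort only those n candidates by descending score.
    return candidates[np.argsort(scores[candidates])[::-1]]


scores = np.array([0.1, 0.9, 0.0, 0.4, 0.7])
best = top_n_indices(scores, 3)
print(best)
```

For distinct scores the result matches `scores.argsort()[-n:][::-1]`; with tied scores the order among the ties may differ.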

In [None]:
high_similarity_df = pd.concat(high_similarity_candidates).drop_duplicates().reset_index(drop=True)

## Filter Low Random Similarity

In [None]:
top_n_low = 5
low_similarity_candidates = []

for favorite, similarities in similarity_dict.items():
    # Get indices of foods with non-zero and low similarity
    non_zero_indices = np.where(similarities > 0)[0]
    low_similarity_indices = non_zero_indices[similarities[non_zero_indices].argsort()[:top_n_low]]

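    # Attach scores before shuffling so each score stays aligned with its recipe
    random_low_sim_candidates = df.iloc[low_similarity_indices].copy()
    random_low_sim_candidates['cosine_similarity'] = similarities[low_similarity_indices]

    # Select random foods from these low similarity candidates
    n_pick = min(top_n_low, len(random_low_sim_candidates))
    random_low_sim_candidates = random_low_sim_candidates.sample(n_pick, random_state=42)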

    # Append the low similarity DataFrame to the list
    low_similarity_candidates.append(random_low_sim_candidates)
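
Because the pool of lowest-scoring recipes has the same size as the sample, the sampling step only reorders the same five rows. If genuinely random picks are wanted, one option is to draw from a larger low-similarity pool. The sketch below assumes that design; `sample_low_similarity`, `pool_size`, and the example scores are hypothetical.

```python
import numpy as np


def sample_low_similarity(similarities, pool_size, n_samples, seed=42):
    """Randomly pick n_samples indices from the pool_size lowest non-zero scores."""
    non_zero = np.flatnonzero(similarities > 0)
    # Order the non-zero entries by score and keep the lowest pool_size of them.
    order = np.argsort(similarities[non_zero])
    pool = non_zero[order[:pool_size]]
    if len(pool) == 0:
        return pool
    n_samples = min(n_samples, len(pool))
    rng = np.random.default_rng(seed)
    return rng.choice(pool, size=n_samples, replace=False)


similarities = np.array([0.0, 0.5, 0.1, 0.0, 0.3, 0.9, 0.2])
picked = sample_low_similarity(similarities, pool_size=3, n_samples=2)
print(picked)
```

The returned positions can be passed to `df.iloc[...]` and used to index the similarity array, which keeps each recipe paired with its own score.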

In [None]:
low_similarity_df = pd.concat(low_similarity_candidates).drop_duplicates().reset_index(drop=True)

In [None]:
high_similarity_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   RecipeId                    150 non-null    int64  
 1   Calories                    150 non-null    float64
 2   FatContent                  150 non-null    float64
 3   SaturatedFatContent         150 non-null    float64
 4   CholesterolContent          150 non-null    float64
 5   SodiumContent               150 non-null    float64
 6   CarbohydrateContent         150 non-null    float64
 7   FiberContent                150 non-null    float64
 8   SugarContent                150 non-null    float64
 9   ProteinContent              150 non-null    float64
 10  NameClean                   150 non-null    object 
 11  RecipeIngredientPartsClean  150 non-null    object 
 12  RecipeInstructionsClean     150 non-null    object 
 13  Combined                    150 non

In [None]:
low_similarity_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   RecipeId                    15 non-null     int64  
 1   Calories                    15 non-null     float64
 2   FatContent                  15 non-null     float64
 3   SaturatedFatContent         15 non-null     float64
 4   CholesterolContent          15 non-null     float64
 5   SodiumContent               15 non-null     float64
 6   CarbohydrateContent         15 non-null     float64
 7   FiberContent                15 non-null     float64
 8   SugarContent                15 non-null     float64
 9   ProteinContent              15 non-null     float64
 10  NameClean                   15 non-null     object 
 11  RecipeIngredientPartsClean  15 non-null     object 
 12  RecipeInstructionsClean     15 non-null     object 
 13  Combined                    15 non-nu

## Final Candidate

In [None]:
final_candidates = pd.concat([high_similarity_df, low_similarity_df]).drop_duplicates().reset_index(drop=True)
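
Called without arguments, `drop_duplicates()` compares every column, including `cosine_similarity`. A recipe retrieved under two queries with different scores therefore counts as two distinct rows. If one row per recipe is wanted, a possible approach is to deduplicate on `RecipeId` and keep the highest score. The table below is a hypothetical miniature, not data from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the concatenated candidate table.
candidates = pd.DataFrame({
    "RecipeId": [1, 2, 1, 3],
    "NameClean": ["fish stew", "beef kebab", "fish stew", "fish pie"],
    "cosine_similarity": [0.42, 0.37, 0.55, 0.31],
})

# Rows 0 and 2 differ only in score, so a full-row comparison keeps both.
full_row_dedup = candidates.drop_duplicates()

# Keep the best-scoring row for each recipe instead.
best_per_recipe = (
    candidates.sort_values("cosine_similarity", ascending=False)
    .drop_duplicates(subset="RecipeId", keep="first")
    .reset_index(drop=True)
)
print(len(full_row_dedup), len(best_per_recipe))
```

Whether repeated recipes should be collapsed depends on how the next stage uses the scores, so treat this as one option rather than a required change.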

In [None]:
final_candidates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 15 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   RecipeId                    165 non-null    int64  
 1   Calories                    165 non-null    float64
 2   FatContent                  165 non-null    float64
 3   SaturatedFatContent         165 non-null    float64
 4   CholesterolContent          165 non-null    float64
 5   SodiumContent               165 non-null    float64
 6   CarbohydrateContent         165 non-null    float64
 7   FiberContent                165 non-null    float64
 8   SugarContent                165 non-null    float64
 9   ProteinContent              165 non-null    float64
 10  NameClean                   165 non-null    object 
 11  RecipeIngredientPartsClean  165 non-null    object 
 12  RecipeInstructionsClean     165 non-null    object 
 13  Combined                    165 non

# **Export Final Candidate**

In [None]:
final_candidates.to_csv('./dataset/final_candidates.csv', index=False)