# Recipe Recommendation System

Updated 3/27/2023

## Overview

Major system flow

Pre-processed:
* Create Database
* Store 1,000 recipes (for initial dev)
* Create Main Tables
  * Recipe Table (primary data for all recipes)
  * Ingredients Table (table of unique ingredients and embeddings)


System Flow
* User recipe query
* Return top 5 matching recipe titles
* Select 1 recipe
* Choose ingredient to substitute
* Perform embedding similarity search
* Return top 5 substitutions
* Return final recipe with substitutions made
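
The last step of the flow, returning the recipe with the chosen substitution applied, can be sketched as a whole-word replacement over the ingredient lines. This is a minimal sketch rather than the project's implementation: the helper name `apply_substitution`, the sample ingredient lines, and the chosen substitute are all hypothetical.

```python
import re


def apply_substitution(ingredient_lines, original, substitute):
    """Return a copy of the ingredient lines with `original` swapped for `substitute`.

    Hypothetical helper for illustration only. Matching is case-insensitive
    and limited to whole words so that "corn" does not match "cornstarch".
    """
    pattern = re.compile(rf"\b{re.escape(original)}\b", flags=re.IGNORECASE)
    return [pattern.sub(substitute, line) for line in ingredient_lines]


# Hypothetical sample data in the style of the recipe dataset
recipe_ingredients = ["1 c. firmly packed brown sugar", "1 c. peanut butter"]
updated_ingredients = apply_substitution(recipe_ingredients, "peanut butter", "almond butter")
print(updated_ingredients)
```

A plain string replacement keeps the measurement text ("1 c.") untouched, which is usually what a user expects when swapping one ingredient for another.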


# Package Installation

Install needed libraries

In [1]:
!pip install git+https://github.com/neuml/txtai

# Clear output for this cell
from IPython.display import clear_output
clear_output()

Import needed libraries

In [2]:
# Google Drive Imports
from google.colab import drive
from google.colab import auth

# System Imports
import os
import re

# Database Imports
import sqlite3
import csv
import pandas as pd

# Semantic Search Imports
from txtai.embeddings import Embeddings

# Clear output for this cell
from IPython.display import clear_output
clear_output()

Mount Google Drive

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


Authenticate the current user to make the file access dynamic

In [4]:
auth.authenticate_user()

In [5]:
current_user = !gcloud config get-value account
current_user = current_user[0]
print(current_user)

coryroyce@gmail.com


Change the file path based on the team member's name

In [6]:
# Dictionary of user names and file paths to the group coding folder
file_path_google_drive = {
    "coryroyce@gmail.com" : "/content/drive/MyDrive/Education/San Jose State University/Masters in AI/Classes/CMPE 295/sjsu-aiml-project/Code",
    "joecesena@gmail.com"  : "/content/drive/MyDrive/sjsu-aiml-project/Code",
}
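
A direct lookup on this dictionary raises a bare `KeyError` for any account that is not listed. One possible guard is sketched below; the helper name `resolve_project_path` and the example entries are hypothetical and only illustrate the idea.

```python
def resolve_project_path(user_email, paths_by_user):
    """Look up the project folder for a user, failing with a descriptive message."""
    try:
        return paths_by_user[user_email]
    except KeyError:
        known_users = ", ".join(sorted(paths_by_user))
        raise KeyError(
            f"No project path configured for {user_email!r}. Known users: {known_users}"
        ) from None


# Placeholder entries; in the notebook the real dictionary is `file_path_google_drive`
example_paths = {"member_a@example.com": "/content/drive/MyDrive/project/Code"}
print(resolve_project_path("member_a@example.com", example_paths))
```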

Change working directory


In [7]:
# Display current working directory
print(os.getcwd())

/content


In [8]:
# Change Working directory to Group Project Code folder
os.chdir(file_path_google_drive[current_user])
print(os.getcwd())

/content/drive/.shortcut-targets-by-id/1IW04c63EDOGFk1MypaTrNbfIVpIP0aXk/sjsu-aiml-project/Code


Show the files in this folder for reference

In [9]:
!ls

archive
Combined_System_Back_End.ipynb
database
dataset
df_ingredient_subtitutions_ground_truth.csv
df_ingredient_subtitutions_w_scores_230328.pkl
df_ingredient_subtitutions_w_scores_230401.csv
df_ingredient_subtitutions_w_scores_230401.pkl
df_ingredient_subtitutions_w_scores_230419_all-distilroberta-v1.pkl
df_ingredient_subtitutions_w_scores_230419_all-MiniLM-L12-v2.pkl
df_ingredient_subtitutions_w_scores_230419_all-MiniLM-L6-v2.pkl
df_ingredient_subtitutions_w_scores_230419_all-mpnet-base-v2.pkl
df_ingredient_subtitutions_w_scores_230419_distiluse-base-multilingual-cased-v1.pkl
df_ingredient_subtitutions_w_scores_230419_distiluse-base-multilingual-cased-v2.pkl
df_ingredient_subtitutions_w_scores_230419_multi-qa-distilbert-cos-v1.pkl
df_ingredient_subtitutions_w_scores_230419_multi-qa-MiniLM-L6-cos-v1.pkl
df_ingredient_subtitutions_w_scores_230419_multi-qa-mpnet-base-dot-v1.pkl
df_ingredient_subtitutions_w_scores_230419_paraphrase-albert-small-v2.pkl
df_ingredient_subtitutions_w_score

# Database Setup

Create the initial database and needed tables

In [10]:
class DatabaseSetup:
  """
  Database class for creating and reading SQLite data
  """
  def __init__(self):
    self.db_name :str = "recipe_database.db"
    self.connection = sqlite3.connect(self.db_name, check_same_thread=False)
    self.cursor = self.connection.cursor()
    self.use_sample_db :bool = True


  def __del__(self):
    self.connection.close()


  def create_recipe_data_table(self):
    """Load the recipe CSV into the recipes table, replacing the table if it already exists"""
    # Choose to load the sample data set or the full dataset
    if self.use_sample_db:
      file_name_recipe_csv = "recipe_data_1000.csv"
    else:
      file_name_recipe_csv = "recipe_data.csv" # This needs to be updated based on the final data we use
    
    # Read specific columns of csv file using Pandas
    df = pd.read_csv(file_name_recipe_csv, usecols = ["title","ingredients","directions", "NER"])

    # Create the table in SQLite database
    table_name = "recipes"
    # NOTE: this statement is never executed; to_sql below creates the table using the CSV column names
    query = f"CREATE TABLE IF NOT EXISTS {table_name} (title, ingredients_with_measurements, directions, ingredients)"
    df.to_sql(table_name, self.connection, if_exists="replace", index=True)
    self.connection.commit()

    return


  def read_data_as_df(self, table_name: str) -> pd.DataFrame:
    """Read in the table as a Pandas Dataframe"""
    df = pd.read_sql_query(f"SELECT * FROM {table_name}", self.connection)
    # Leave the connection open so the instance can be reused; __del__ closes it
    return df
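
Because the table is written with `DataFrame.to_sql`, its column names come straight from the CSV (`title`, `ingredients`, `directions`, `NER`). If more descriptive column names are wanted, one option is to create the table explicitly and insert rows with parameterized statements. The sketch below uses only the standard `sqlite3` module; the in-memory database, the `id` column, and the sample row are assumptions made for illustration, not part of the notebook.

```python
import sqlite3

# In-memory database so the example leaves no file behind
connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE IF NOT EXISTS recipes (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        ingredients_with_measurements TEXT,
        directions TEXT,
        ingredients TEXT
    )
    """
)

# Placeholder row; in the notebook these values would come from the recipe CSV
rows = [
    ("Example Cookies", '["1 c. sugar"]', '["Mix and chill."]', '["sugar"]'),
]
connection.executemany(
    "INSERT INTO recipes (title, ingredients_with_measurements, directions, ingredients) "
    "VALUES (?, ?, ?, ?)",
    rows,
)
connection.commit()

# Parameterized query: the search term is bound, never interpolated into the SQL string
stored_titles = [
    title
    for (title,) in connection.execute(
        "SELECT title FROM recipes WHERE title LIKE ?", ("%Cookies%",)
    )
]
print(stored_titles)
connection.close()
```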

Example of how to connect to the database and create new data for the recipe dataset

In [11]:
# Instantiate the database instance/connection
db = DatabaseSetup()

In [12]:
# Create the recipe table, replacing it if it already exists
db.create_recipe_data_table()

In [13]:
# Read a table by name from the database as a pandas dataframe 
df_recipe_sample = db.read_data_as_df(table_name = "recipes")
df_recipe_sample.head()

| | index | title | ingredients | directions | NER |
|---|---|---|---|---|---|
| 0 | 0 | No-Bake Nut Cookies | ["1 c. firmly packed brown sugar", "1/2 c. eva... | ["In a heavy 2-quart saucepan, mix brown sugar... | ["brown sugar", "milk", "vanilla", "nuts", "bu... |
| 1 | 1 | Jewell Ball'S Chicken | ["1 small jar chipped beef, cut up", "4 boned ... | ["Place chipped beef on bottom of baking dish.... | ["beef", "chicken breasts", "cream of mushroom... |
| 2 | 2 | Creamy Corn | ["2 (16 oz.) pkg. frozen corn", "1 (8 oz.) pkg... | ["In a slow cooker, combine all ingredients. C... | ["frozen corn", "cream cheese", "butter", "gar... |
| 3 | 3 | Chicken Funny | ["1 large whole chicken", "2 (10 1/2 oz.) cans... | ["Boil and debone chicken.", "Put bite size pi... | ["chicken", "chicken gravy", "cream of mushroo... |
| 4 | 4 | Reeses Cups(Candy) | ["1 c. peanut butter", "3/4 c. graham cracker ... | ["Combine first four ingredients and press in ... | ["peanut butter", "graham cracker crumbs", "bu... |


In [14]:
# Delete the database instance if it is not being used
del db

# Semantic Search

Create a class for the word embeddings method on ingredients

In [15]:
class SemanticSearch:
  """
  Manage all of the semantic search capabilities for ingredient and title word embeddings
  """
  def __init__(self, df_recipe: pd.DataFrame, use_sample_ingredient_index :bool = False):
    self.df_recipe :pd.DataFrame = df_recipe
    self.embeddings_ingredients = Embeddings({"path": "sentence-transformers/all-mpnet-base-v2"}) # "sentence-transformers/nli-mpnet-base-v2"
    self.embeddings_recipe_titles = Embeddings({"path": "sentence-transformers/all-mpnet-base-v2"})
    self.top_k_matches_ingredients :int = 5
    self.top_k_matches_recipe_titles :int = 5
    self.file_path_substitutions_ground_truth :str = file_path_google_drive[current_user] + "/df_ingredient_subtitutions_ground_truth.csv"
    self.df_ingredient_subtitutions_ground_truth : pd.DataFrame = None
    self.df_unique_ingredients :pd.DataFrame = pd.DataFrame()
    self.use_sample_ingredient_index :bool = use_sample_ingredient_index


  def run_prep_process(self):
    """Organize all of the main steps in the process so that the data can all be refreshed with one call"""
    # Update the txtai index for all of the recipe titles
    print(f"Generating semantic index for recipe titles...")
    self.create_semantic_search_index_recipe_titles()

    ### Create ingredients based semantic search ###
    # Load the ground truth ingredient substitution dataframe
    print(f"Gathering list of unique ingredients...")
    self.create_df_ingredient_subtitutions_ground_truth()

    # Create unique list of possible ingredient substitutes
    self.generate_unique_ingredients_df()

    # Update the txtai index for all of the ingredients
    print(f"Generating semantic index for ingredients...")
    self.create_semantic_search_index_ingredients()

    print(f"Preperation Process Complete!!!")

    return


  def create_semantic_search_index_recipe_titles(self):
    """Use the recipe dataframe to create a semantic search index"""
    # Create a list of recipe titles in the correct tuple format for txtai embeddings to create an index
    list_of_recipe_titles = [(index, row["title"], None) for index, row in self.df_recipe.iterrows()]

    # Create and update the index for the embedding based on the recipe titles
    self.embeddings_recipe_titles.index(list_of_recipe_titles)

    return


  def query_semantic_index_recipe_titles(self, query :str) -> list:
    """Run a query through the semantic search index and return top k responses"""
    # Matches shape is [(index,score), ...] e.g [(0, 0.4172574281692505), (3, 0.3305395245552063)]
    matches :list = self.embeddings_recipe_titles.search(query, self.top_k_matches_recipe_titles)

    # Get just the indices from the matches
    matching_indicies = [tup[0] for tup in matches]

    # Get the recipe title associated with each index
    top_k_matching_recipe_titles = self.df_recipe.iloc[matching_indicies]["title"].tolist()

    return top_k_matching_recipe_titles


  def create_df_ingredient_subtitutions_ground_truth(self):
    """Load in the ground truth data frame"""
    # Load in the ingredients ground truth from a csv
    df = pd.read_csv(self.file_path_substitutions_ground_truth)

    # Clean the ground truth dataframe
    # Split each string into a list
    df["Substitutes"] = df["Substitutes"].apply(lambda x: x.split(","))
    
    # Clean each ingredient in the list
    df["Substitutes"] = df["Substitutes"].apply(lambda x: [self.clean_ingredient_substitution(item) for item in x])

    # Update the dataframe
    self.df_ingredient_subtitutions_ground_truth = df.copy()

    return df


  @staticmethod
  def clean_ingredient_substitution(ingredient :str):
    """Clean the ingredient string in a consistent manner"""
    ingredient_clean = ingredient
    # Make the string lower case
    ingredient_clean = ingredient_clean.lower()
    # Replace the special symbols / + _ * with spaces
    ingredient_clean = re.sub(r"[/+_*]", " ", ingredient_clean)
    # Remove extra spaces
    ingredient_clean = re.sub(r"\s+", " ", ingredient_clean).strip()

    return ingredient_clean
    

  def generate_unique_ingredients_df(self):
    "Create unique list of possible substitutes to choose from"
    df = self.df_ingredient_subtitutions_ground_truth.copy()

    # Get all of the unique values from the Substitutes column
    unique_values = df["Substitutes"].explode().unique()
    df_unique_ingredients = pd.DataFrame({"ingredient_substitutes": unique_values})
    
    # Remove leading and trailing white space
    df_unique_ingredients["ingredient_substitutes"] = df_unique_ingredients["ingredient_substitutes"].str.strip()

    # Drop the row with a whitespace character
    df_unique_ingredients = df_unique_ingredients[~df_unique_ingredients["ingredient_substitutes"].str.isspace()]
    
    # Drop rows with null values
    df_unique_ingredients.replace({'': None, ' ': None}, inplace=True)
    df_unique_ingredients = df_unique_ingredients.dropna()

    # Drop rows with duplicates
    df_unique_ingredients.drop_duplicates(inplace=True)

    # Sort the ingredient substitutes
    df_unique_ingredients = df_unique_ingredients.sort_values(by="ingredient_substitutes").reset_index(drop=True)

    # Check if using a smaller sample index for dev (full index takes ~5 min)
    if self.use_sample_ingredient_index:
      df_unique_ingredients = df_unique_ingredients.head(20)

    # Update the class instance of this dataframe
    self.df_unique_ingredients = df_unique_ingredients

    return df_unique_ingredients


  def load_new_embedding_model(self, model_name: str):
    """Replace the default embedding model with a new one"""
    self.embeddings_ingredients = Embeddings({"path": f"sentence-transformers/{model_name}"})
   
    return


  def create_semantic_search_index_ingredients(self):
    """Use the dataframe of unique ingredients to create a semantic search index"""
    # Create a list of unique ingredients in the correct tuple format for txtai embeddings to create an index
    list_of_unique_ingredients = [(index, row["ingredient_substitutes"], None) for index, row in self.df_unique_ingredients.iterrows()]

    # Create and update the index for the embedding based on the unique ingredients
    self.embeddings_ingredients.index(list_of_unique_ingredients)

    return


  def query_semantic_index_ingredients(self, query :str, top_k :int = None) -> list:
    """Run a query through the semantic search index and return top k responses"""
    # Check if a top_k value was provided; if not, use the default specified in the class instance
    if top_k is None:
      top_k = self.top_k_matches_ingredients

    # Request one extra match so that top_k results remain if the query itself is returned
    # Matches shape is [(index,score), ...] e.g [(0, 0.4172574281692505), (3, 0.3305395245552063)]
    matches :list = self.embeddings_ingredients.search(query, top_k + 1)

    # Get just the indices from the matches
    matching_indicies = [tup[0] for tup in matches]

    # Get the ingredient associated with each index
    top_k_matching_ingredients = self.df_unique_ingredients.iloc[matching_indicies]["ingredient_substitutes"].tolist()

    # If the input is an exact match then remove it since that is not a substitution
    if query in top_k_matching_ingredients:
      top_k_matching_ingredients.remove(query)

    # Make sure that only the first top_k matches are returned (there may be top_k + 1 if there was no exact match)
    top_k_matching_ingredients = top_k_matching_ingredients[:top_k]

    return top_k_matching_ingredients


  @staticmethod
  def save_semantic_search_index(embedding : Embeddings, embedding_name_to_save :str):
    """Save a pre-computed semantic search index to avoid re-generation"""
    # Save the embedding to the current directory
    embedding.save(f"embeddings_indexed/{embedding_name_to_save}.tar.gz")

    return

  @staticmethod
  def load_semantic_search_index(embedding : Embeddings, embedding_name_to_load :str):
    """Load a pre-computed semantic search index to avoid re-generation"""
    # Load the embedding from the current directory
    temp_embedding = embedding.load(f"embeddings_indexed/{embedding_name_to_load}.tar.gz")

    return temp_embedding
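
The cleaning applied to each ground-truth substitute can be tried on its own, without loading any embedding model. The sketch below mirrors the steps of `clean_ingredient_substitution` (lowercase, replace `/`, `+`, `_`, `*` with spaces, collapse whitespace) in a standalone function. The function name `clean_ingredient` and the raw input string are hypothetical, and the character class `[/+_*]` is assumed to be equivalent to the `[/+/_/*]` written in the class, since repeated characters inside a class have no effect.

```python
import re


def clean_ingredient(ingredient):
    """Lowercase, replace / + _ * with spaces, and collapse whitespace."""
    cleaned = ingredient.lower()
    cleaned = re.sub(r"[/+_*]", " ", cleaned)
    return re.sub(r"\s+", " ", cleaned).strip()


# Hypothetical raw value for one row of the "Substitutes" column
raw_substitutes = "Barbecue_Sauce,  Ketchup , soy+sauce"
cleaned_substitutes = [clean_ingredient(item) for item in raw_substitutes.split(",")]
print(cleaned_substitutes)
```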


Example of how to apply semantic search and pre-process the data

In [16]:
# Instantiate the Semantic Search class
ingredent_embedding = SemanticSearch(df_recipe = df_recipe_sample.head(10),
                                     use_sample_ingredient_index = False)


In [17]:
# Run the prep process which creates the semantic embedding index from scratch
ingredent_embedding.run_prep_process()

Generating semantic index for recipe titles...
Gathering list of unique ingredients...
Generating semantic index for ingredients...
Preperation Process Complete!!!
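
For the ingredient index, the prep process comes down to flattening the ground-truth substitute lists into one sorted, de-duplicated vocabulary and pairing each entry with an integer position, mirroring the `(index, text, None)` tuples that the class passes to `Embeddings.index`. The pure-Python sketch below illustrates those steps with made-up rows; it does not call txtai or pandas.

```python
# Made-up ground-truth rows: each inner list holds the substitutes for one ingredient
ground_truth_substitutes = [
    ["barbecue sauce", "ketchup"],
    ["ketchup", " tomato sauce", ""],
]

# Flatten, strip whitespace, drop empty strings, de-duplicate, and sort
unique_ingredients = sorted(
    {item.strip() for row in ground_truth_substitutes for item in row if item.strip()}
)

# Pair each ingredient with its position, mirroring the (index, text, None) tuples
index_input = [(position, text, None) for position, text in enumerate(unique_ingredients)]
print(index_input)
```

Keeping the position in the sorted list as the identifier is what lets the search results, which come back as `(index, score)` pairs, be mapped to ingredient names with a positional lookup.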


Explore loading and saving embeddings

In [18]:
# ingredent_embedding.save_semantic_search_index(embedding = ingredent_embedding.embeddings_ingredients,
#                                                embedding_name_to_save = f"ingredent_embedding")

In [19]:
# ingredent_embedding.load_semantic_search_index(embedding = ingredent_embedding.embeddings_ingredients,
#                                                embedding_name_to_load = f"ingredent_embedding")

In [20]:
# Run semantic search on the ingredients
ingredent_embedding.query_semantic_index_ingredients(query = "pork")

['pork chop', 'ground pork', 'pork kidney', 'pork heart', 'pork liver']

In [21]:
# Run semantic search on the ingredients
ingredent_embedding.query_semantic_index_ingredients(query = "sour cream")

['coconut cream', 'heavy cream', 'cream', 'cream of coconut', 'light cream']
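
Note that "sour cream" itself is absent from the results: the query method fetches `top_k + 1` matches and drops an exact match before trimming to `top_k`. The sketch below isolates that over-fetch-then-filter logic. `difflib.SequenceMatcher` is used purely as a stand-in for embedding similarity so the example runs without a model, and the function name and candidate list are hypothetical.

```python
from difflib import SequenceMatcher


def top_k_substitutes(query, candidates, top_k=5):
    """Rank candidates by similarity to the query, excluding the query itself.

    SequenceMatcher is a stand-in for the embedding similarity used in the
    notebook; only the over-fetch-then-filter logic is the point here.
    """
    scored = sorted(
        candidates,
        key=lambda candidate: SequenceMatcher(None, query, candidate).ratio(),
        reverse=True,
    )
    # Fetch one extra result so that dropping an exact match still leaves top_k items
    shortlist = scored[: top_k + 1]
    shortlist = [candidate for candidate in shortlist if candidate != query]
    return shortlist[:top_k]


candidates = ["sour cream", "heavy cream", "light cream", "pork chop"]
matches = top_k_substitutes("sour cream", candidates, top_k=2)
print(matches)
```

The exact-match check is a plain string comparison, so a query that differs from the indexed entry only in case or spacing would not be filtered out; cleaning the query the same way as the indexed ingredients is one way to avoid that.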

In [22]:
# Run semantic search on the recipe titles
ingredent_embedding.query_semantic_index_recipe_titles(query = "Cookies")

['No-Bake Nut Cookies',
 'Reeses Cups(Candy)  ',
 'Millionaire Pie',
 'Rhubarb Coffee Cake',
 'Creamy Corn']

In [23]:
# Run semantic search on the recipe titles
ingredent_embedding.query_semantic_index_recipe_titles(query = "Chicken")

['Chicken Funny',
 "Jewell Ball'S Chicken",
 'Scalloped Corn',
 'Creamy Corn',
 "Nolan'S Pepper Steak"]

# Metrics for Ingredient Substitutions

Create metrics that measure how closely the semantic search substitutions match a ground truth list of ingredient substitutions

Note: Since this part will not be built into the app, all libraries and packages will be loaded separately here.

Import needed libraries

In [24]:
from sklearn.metrics import ndcg_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import label_binarize
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

In [25]:
class MetricsIngredientSubstitutions:
  """
  Metrics class for comparing the accuracy of the ingredient substitutions through
  semantic search vs a ground truth list of substitutions
  """
  def __init__(self, df_ingredient_subtitutions_ground_truth : pd.DataFrame = None, semantic_search_instance = None):
    self.df_ingredient_subtitutions_ground_truth : pd.DataFrame = df_ingredient_subtitutions_ground_truth
    self.semantic_search_instance = semantic_search_instance
    self.df_ingredient_subtitutions_w_semantic_substitutions : pd.DataFrame = None
    self.df_ingredient_subtitutions_w_scores : pd.DataFrame = None


  def run_prep_process(self):
    """Run all of the prep process"""
    # Generate semantic substitutions
    print(f"Generating semantic substitutions for every ingredient")
    self.generate_semanitic_substitutions_df()

    # Generate metrics for predicted substitutions
    print(f"Generating metrics to compare ground truth ingredients vs predicted substitutions")
    self.generate_substitution_metrics_df()

    # Run metrics on 

    return


  def generate_semanitic_substitutions_df(self):
    """Generate semantic substitutions for all ingredients in the ground truth dataframe"""
    # Make a copy of the dataframe
    # df = self.df_ingredient_subtitutions_ground_truth.head(20).copy()
    df = self.df_ingredient_subtitutions_ground_truth.copy()

    # Add a new column for semantic substitutions
    df["semantic_substitutions"] = ""

    # Iterate through each row and apply functions
    for index, row in df.iterrows():
      # Request as many semantic matches as there are ground truth substitutes
      cur_top_k = len(row["Substitutes"])
      # Get semantic predictions for each input ingredient
      df.at[index, "semantic_substitutions"] = self.generate_semantic_substitutions(query = row["Ingredient"], top_k = cur_top_k)

    # Update the df_ingredient_subtitutions_w_semantic_substitutions
    self.df_ingredient_subtitutions_w_semantic_substitutions = df

    return df


  def generate_substitution_metrics_df(self):
    """Generate metrics (precision, recall, f1 and accuracy) for all semantic 
    substitutions against the ground truth dataframe"""

    # Make a copy of the dataframe
    df = self.df_ingredient_subtitutions_w_semantic_substitutions.copy()

    # Add a new column for metrics
    df["precision"] = ""
    df["recall"] = ""
    df["f1"] = ""
    df["accuracy"] = ""

    # Iterate through each row and apply functions
    for index, row in df.iterrows():
      # Calculate the metric scores between the ground truth and the predictions
      current_row_scores = self.calculate_metric_scores(ground_truth = row["Substitutes"], predictions = row["semantic_substitutions"])
      df.at[index, "precision"] = current_row_scores["precision"]
      df.at[index, "recall"] = current_row_scores["recall"]
      df.at[index, "f1"] = current_row_scores["f1"]
      df.at[index, "accuracy"] = current_row_scores["accuracy"]

    # Update the df_ingredient_subtitutions_w_scores
    self.df_ingredient_subtitutions_w_scores = df

    return df

  def generate_semantic_substitutions(self, query :str, top_k :int = 5):
    """Return the top_k semantic matches for the query (callers set top_k to the number of ground truth substitutes)"""
    matches = self.semantic_search_instance.query_semantic_index_ingredients(query = query, top_k = top_k)

    return matches


  @staticmethod
  def calculate_metric_scores(ground_truth :list, predictions :list):
    """Simple metric to see what percent of items match from ground truth to predictions"""
    # Convert lists to sets
    ground_truth_set = set(ground_truth)
    predictions_set = set(predictions)
    intersection_set = ground_truth_set.intersection(predictions_set)

    # Calculate lengths of sets
    number_of_items_in_common = len(intersection_set)
    number_of_items_ground_truth = len(ground_truth_set)
    number_of_items_predictions = len(predictions_set)

    # Address base case
    if number_of_items_ground_truth == 0:
      scores = {
        "precision" : 1,
        "recall" : 1,
        "f1" : 1,
        "accuracy" : 1,
      }
      return scores

    # If there are no common items then all scores = 0
    if number_of_items_in_common == 0 or number_of_items_predictions == 0:
      scores = {
        "precision" : 0,
        "recall" : 0,
        "f1" : 0,
        "accuracy" : 0,
        }
      return scores

    # Calculate all metric formulas
    precision = number_of_items_in_common / number_of_items_predictions
    recall = number_of_items_in_common / number_of_items_ground_truth
    f1 = 2 * (precision * recall) / (precision + recall)
    # Note: accuracy is computed with the same formula as recall
    accuracy = number_of_items_in_common / number_of_items_ground_truth

    # Put scores into a dictionary
    scores = {
      "precision" : precision,
      "recall" : recall,
      "f1" : f1,
      "accuracy" : accuracy,
      }

    return scores

In [26]:
metrics_ingredient_substituions = MetricsIngredientSubstitutions(
    df_ingredient_subtitutions_ground_truth = ingredent_embedding.create_df_ingredient_subtitutions_ground_truth(),
    semantic_search_instance = ingredent_embedding,
    )
metrics_ingredient_substituions.run_prep_process()

Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions


In [27]:
metrics_ingredient_substituions.df_ingredient_subtitutions_w_scores.head()

| | Ingredient | Substitutes | semantic_substitutions | precision | recall | f1 | accuracy |
|---|---|---|---|---|---|---|---|
| 0 | a1 sauce | [barbecue sauce, ketchup] | [tomato sauce, adobo sauce] | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | adobo sauce | [tabasco sauce] | [fish sauce] | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | alphonso olives | [kalamata olives, gaeta olives] | [atalanta olives, amphissa olives] | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | aluminum foil | [plastic wrap, wax paper] | [plastic wrap, parchment paper] | 0.5 | 0.5 | 0.5 | 0.5 |
| 4 | amphissa olives | [kalamata olives, gaeta olives] | [atalanta olives, cerignola olives] | 0.0 | 0.0 | 0.0 | 0.0 |
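
The scores for row 3 ("aluminum foil") can be reproduced by hand with set arithmetic: one of the two predictions is in the ground truth, so precision, recall, and F1 are all 0.5. The sketch below is a condensed version of the overlap calculation. The function name `score_substitutions` is hypothetical, the empty-ground-truth base case from the class is omitted, and the `accuracy` column is left out because the class computes it with the same formula as recall.

```python
def score_substitutions(ground_truth, predictions):
    """Set-overlap precision, recall, and F1 between two ingredient lists."""
    truth, predicted = set(ground_truth), set(predictions)
    overlap = len(truth & predicted)
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / len(predicted)
    recall = overlap / len(truth)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Row 3 of the table above: "aluminum foil"
scores = score_substitutions(
    ground_truth=["plastic wrap", "wax paper"],
    predictions=["plastic wrap", "parchment paper"],
)
print(scores)
```

Because the number of requested predictions is set to the number of ground truth substitutes, precision and recall usually share the same denominator, which is why the columns in the table above carry identical values.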


In [28]:
# Save the DataFrame to a pickle file
metrics_ingredient_substituions.df_ingredient_subtitutions_w_scores.to_pickle('df_ingredient_subtitutions_w_scores_230401.pkl')
# metrics_ingredient_substituions.df_ingredient_subtitutions_w_scores.to_csv('df_ingredient_subtitutions_w_scores_230401.csv', index=False)

In [29]:
# Load the DataFrame from the pickle file
df = pd.read_pickle('df_ingredient_subtitutions_w_scores_230401.pkl')

In [30]:
df["accuracy"].mean()

0.16526665464165433

In [31]:
df.shape

(3480, 7)

# Create Experiments for Various Embedding Models

Use the classes above to evaluate several sentence-transformers embedding models on the ingredient substitution task.

In [None]:
models_to_test = ['all-mpnet-base-v2','multi-qa-mpnet-base-dot-v1','all-distilroberta-v1',
'all-MiniLM-L12-v2','multi-qa-distilbert-cos-v1','all-MiniLM-L6-v2','multi-qa-MiniLM-L6-cos-v1',
'paraphrase-multilingual-mpnet-base-v2','paraphrase-albert-small-v2',
'paraphrase-multilingual-MiniLM-L12-v2','paraphrase-MiniLM-L3-v2',
'distiluse-base-multilingual-cased-v1','distiluse-base-multilingual-cased-v2']

In [None]:
def test_individual_model(ingredient_embedding, model_name : str, model_scores: dict):
  # Update the ingredient embedding to work with the new model
  ingredient_embedding.load_new_embedding_model(model_name = model_name)

  # Rebuild the semantic index based on this new model
  ingredient_embedding.create_semantic_search_index_ingredients()

  # Apply the metrics class for the updated ingredient embedding
  metrics_ingredient_substituions = MetricsIngredientSubstitutions(
    df_ingredient_subtitutions_ground_truth = ingredient_embedding.create_df_ingredient_subtitutions_ground_truth(),
    semantic_search_instance = ingredient_embedding,
    )
  
  # Run the prep process to load everything up
  metrics_ingredient_substituions.run_prep_process()

  # Save the DataFrame to a pickle file
  metrics_ingredient_substituions.df_ingredient_subtitutions_w_scores.to_pickle(f'df_ingredient_subtitutions_w_scores_230419_{model_name}.pkl')

  # Print the overall accuracy score
  cur_model_score = metrics_ingredient_substituions.df_ingredient_subtitutions_w_scores["accuracy"].mean()
  model_scores[model_name] = cur_model_score
  print(f"Model Name: {model_name}")
  print(f"Model score: {cur_model_score}\n")

  return model_scores

In [None]:
def run_model_tests(models_to_test: list):
  # Instantiate the base class of the semantic search
  ingredient_embedding = SemanticSearch(df_recipe = df_recipe_sample.head(10),
                                     use_sample_ingredient_index = False)
  
  # Run the prep process which creates the semantic embedding index from scratch
  ingredient_embedding.run_prep_process()

  # Create an empty dictionary to hold model scores
  model_scores = {}

  # Iterate through each model and run the metrics
  print(f"Running tests for all models...")
  for model_name in models_to_test:
    # Run the test on an individual model
    model_scores = test_individual_model(ingredient_embedding = ingredient_embedding, model_name = model_name, model_scores = model_scores)
    print(model_name)


  return model_scores



In [None]:
model_scores = run_model_tests(models_to_test = models_to_test)

Generating semantic index for recipe titles...
Gathering list of unique ingredients...
Generating semantic index for ingredients...
Preperation Process Complete!!!
Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: all-mpnet-base-v2
Model score: 0.16526665464165433

all-mpnet-base-v2


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: multi-qa-mpnet-base-dot-v1
Model score: 0.14938710794314208

multi-qa-mpnet-base-dot-v1


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: all-distilroberta-v1
Model score: 0.14890224079879216

all-distilroberta-v1


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: all-MiniLM-L12-v2
Model score: 0.14776167797719486

all-MiniLM-L12-v2


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: multi-qa-distilbert-cos-v1
Model score: 0.14510263363711595

multi-qa-distilbert-cos-v1


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: all-MiniLM-L6-v2
Model score: 0.1466439674629326

all-MiniLM-L6-v2


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: multi-qa-MiniLM-L6-cos-v1
Model score: 0.1363354876285908

multi-qa-MiniLM-L6-cos-v1
Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: paraphrase-multilingual-mpnet-base-v2
Model score: 0.12927708944950309

paraphrase-multilingual-mpnet-base-v2


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: paraphrase-albert-small-v2
Model score: 0.1410527411604994

paraphrase-albert-small-v2


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: paraphrase-multilingual-MiniLM-L12-v2
Model score: 0.11587659227745417

paraphrase-multilingual-MiniLM-L12-v2


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: paraphrase-MiniLM-L3-v2
Model score: 0.12852638979794123

paraphrase-MiniLM-L3-v2


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: distiluse-base-multilingual-cased-v1
Model score: 0.10016115908357281

distiluse-base-multilingual-cased-v1


Generating semantic substitutions for every ingredient
Generating metrics to compare ground truth ingredients vs predicted substitutions
Model Name: distiluse-base-multilingual-cased-v2
Model score: 0.10036367368263917

distiluse-base-multilingual-cased-v2


In [None]:
model_scores

{'all-mpnet-base-v2': 0.16526665464165433,
 'multi-qa-mpnet-base-dot-v1': 0.14938710794314208,
 'all-distilroberta-v1': 0.14890224079879216,
 'all-MiniLM-L12-v2': 0.14776167797719486,
 'multi-qa-distilbert-cos-v1': 0.14510263363711595,
 'all-MiniLM-L6-v2': 0.1466439674629326,
 'multi-qa-MiniLM-L6-cos-v1': 0.1363354876285908,
 'paraphrase-multilingual-mpnet-base-v2': 0.12927708944950309,
 'paraphrase-albert-small-v2': 0.1410527411604994,
 'paraphrase-multilingual-MiniLM-L12-v2': 0.11587659227745417,
 'paraphrase-MiniLM-L3-v2': 0.12852638979794123,
 'distiluse-base-multilingual-cased-v1': 0.10016115908357281,
 'distiluse-base-multilingual-cased-v2': 0.10036367368263917}

Final results for reference, sorted from highest to lowest score

In [None]:
model_scores_summary = {'all-mpnet-base-v2': '16.53%',
 'multi-qa-mpnet-base-dot-v1': '14.94%',
 'all-distilroberta-v1': '14.89%',
 'all-MiniLM-L12-v2': '14.78%',
 'all-MiniLM-L6-v2': '14.66%',
 'multi-qa-distilbert-cos-v1': '14.51%',
 'paraphrase-albert-small-v2': '14.11%',
 'multi-qa-MiniLM-L6-cos-v1': '13.63%',
 'paraphrase-multilingual-mpnet-base-v2': '12.93%',
 'paraphrase-MiniLM-L3-v2': '12.85%',
 'paraphrase-multilingual-MiniLM-L12-v2': '11.59%',
 'distiluse-base-multilingual-cased-v2': '10.04%',
 'distiluse-base-multilingual-cased-v1': '10.02%'}
 
#  {'all-mpnet-base-v2': 0.16526665464165433,
#  'multi-qa-mpnet-base-dot-v1': 0.14938710794314208,
#  'all-distilroberta-v1': 0.14890224079879216,
#  'all-MiniLM-L12-v2': 0.14776167797719486,
#  'all-MiniLM-L6-v2': 0.1466439674629326,
#  'multi-qa-distilbert-cos-v1': 0.14510263363711595,
#  'paraphrase-albert-small-v2': 0.1410527411604994,
#  'multi-qa-MiniLM-L6-cos-v1': 0.1363354876285908,
#  'paraphrase-multilingual-mpnet-base-v2': 0.12927708944950309,
#  'paraphrase-MiniLM-L3-v2': 0.12852638979794123,
#  'paraphrase-multilingual-MiniLM-L12-v2': 0.11587659227745417,
#  'distiluse-base-multilingual-cased-v2': 0.10036367368263917,
#  'distiluse-base-multilingual-cased-v1': 0.10016115908357281}

In [None]:
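# Sort by the raw numeric scores first, then format each score as a percentage string
# model_scores_summary = dict(sorted(model_scores.items(), key=lambda item: item[1], reverse=True))
# for model, score in model_scores_summary.items():
#     model_scores_summary[model] = "{:.2%}".format(score)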

Convert the score summary into a table and export it as a CSV file

In [None]:
# Convert the dictionary to a pandas DataFrame
df_scores = pd.DataFrame(list(model_scores_summary.items()), columns=['Model', 'Score'])

# The scores are already percentage strings, so no conversion is applied here.
# For raw float scores, format them with: df_scores['Score'].apply(lambda x: '{:.2%}'.format(x))

# Print the DataFrame
print(df_scores)

# Export the DataFrame to a CSV file
df_scores.to_csv('model_scores_summary_230419.csv', index=False)

                                    Model   Score
0                       all-mpnet-base-v2  16.53%
1              multi-qa-mpnet-base-dot-v1  14.94%
2                    all-distilroberta-v1  14.89%
3                       all-MiniLM-L12-v2  14.78%
4              multi-qa-distilbert-cos-v1  14.51%
5                        all-MiniLM-L6-v2  14.66%
6               multi-qa-MiniLM-L6-cos-v1  13.63%
7   paraphrase-multilingual-mpnet-base-v2  12.93%
8              paraphrase-albert-small-v2  14.11%
9   paraphrase-multilingual-MiniLM-L12-v2  11.59%
10                paraphrase-MiniLM-L3-v2  12.85%
11   distiluse-base-multilingual-cased-v1  10.02%
12   distiluse-base-multilingual-cased-v2  10.04%


{'all-mpnet-base-v2': '16.53%',
 'multi-qa-mpnet-base-dot-v1': '14.94%',
 'all-distilroberta-v1': '14.89%',
 'all-MiniLM-L12-v2': '14.78%',
 'multi-qa-distilbert-cos-v1': '14.51%',
 'all-MiniLM-L6-v2': '14.66%',
 'multi-qa-MiniLM-L6-cos-v1': '13.63%',
 'paraphrase-multilingual-mpnet-base-v2': '12.93%',
 'paraphrase-albert-small-v2': '14.11%',
 'paraphrase-multilingual-MiniLM-L12-v2': '11.59%',
 'paraphrase-MiniLM-L3-v2': '12.85%',
 'distiluse-base-multilingual-cased-v1': '10.02%',
 'distiluse-base-multilingual-cased-v2': '10.04%'}
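
The printed table above follows the order in which the models were evaluated rather than their rank. As a minimal sketch of one way to produce a ranked table, the snippet below sorts the raw float scores first and only then formats them as percentages, because sorting the formatted strings would compare text rather than numbers. It uses only the standard library `csv` module. The helper names `summarize_scores` and `rows_to_csv` are hypothetical, and only four of the scores reported above are copied in for brevity.

```python
import csv
import io

# A subset of the raw scores reported in the model_scores output above
model_scores = {
    "all-mpnet-base-v2": 0.16526665464165433,
    "multi-qa-distilbert-cos-v1": 0.14510263363711595,
    "all-MiniLM-L6-v2": 0.1466439674629326,
    "distiluse-base-multilingual-cased-v1": 0.10016115908357281,
}


def summarize_scores(scores: dict) -> list:
    """Return (model, 'xx.xx%') rows sorted by numeric score, best first."""
    # Sort on the float values; sorting percentage strings would be lexicographic
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(model, "{:.2%}".format(score)) for model, score in ranked]


def rows_to_csv(rows: list) -> str:
    """Serialize the rows to CSV text with a Model,Score header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Model", "Score"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = summarize_scores(model_scores)
print(rows_to_csv(rows))
```

Writing to an in-memory buffer keeps the sketch free of side effects; replacing the buffer with an open file handle would write the same rows to disk.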

# Misc

Old code showing how to create the semantic index from the unique ingredients in the 1M recipe dataset rather than from the substitutes list

In [None]:
# class SemanticSearch:
#   """
#   Manage all of the ingredient word embedding work
#   """
#   def __init__(self, df_recipe: pd.DataFrame, use_sample_ingredient_index :bool = False):
#     self.df_recipe :pd.DataFrame = df_recipe
#     self.embeddings_ingredients = Embeddings({"path": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"}) # "sentence-transformers/nli-mpnet-base-v2"
#     self.embeddings_recipe_titles = Embeddings({"path": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"})
#     self.top_k_matches_ingredients :int = 5
#     self.top_k_matches_recipe_titles :int = 5
#     self.file_path_substitutions_ground_truth :str = file_path_google_drive[current_user] + "/df_ingredient_subtitutions_ground_truth.csv"
#     self.df_ingredient_subtitutions_ground_truth : pd.DataFrame = None
#     self.df_unique_ingredients :pd.DataFrame = pd.DataFrame()


  ### This code was used to generate a list of ingredient substitutions from the 1M recipe dataset rather than from the substitutes list ###
  # def generate_unique_ingredients_df(self):
  #   """Create the ingredients table of unique ingredients"""
  #   df = self.df_recipe.copy()

  #   # Convert values to lower case, replace "/", "+", "_", and "*" with spaces, collapse repeated whitespace, and strip leading and trailing whitespace
  #   df["ner_clean"] = df["NER"].str.lower().str.replace(r"[/+_*]", " ", regex=True).str.replace(r"\s+", " ", regex=True).str.strip()

  #   # Get all of the unique values from the NER column
  #   unique_values = df["ner_clean"].apply(lambda x: eval(x)).explode().unique()
  #   df_unique_ingredients = pd.DataFrame({"ingredients": unique_values})

  #   # Drop the row with a whitespace character
  #   df_unique_ingredients = df_unique_ingredients[~df_unique_ingredients["ingredients"].str.isspace()]
    
  #   # Sort the ingredients
  #   df_unique_ingredients = df_unique_ingredients.sort_values(by="ingredients").reset_index(drop=True)

  #   # Update the class instance of this dataframe
  #   self.df_unique_ingredients = df_unique_ingredients

  #   return df_unique_ingredients


  # def create_semantic_search_index_ingredients(self):
  #   """Use the dataframe of unique ingredients to create a semantic search index"""
  #   # Create a list of unique ingredients in the tuple format that txtai embeddings expect when building an index
  #   list_of_unique_ingredients = [(index, row["ingredients"], None) for index, row in self.df_unique_ingredients.iterrows()]

  #   # Create and update the embedding index based on the unique ingredients
  #   self.embeddings_ingredients.index(list_of_unique_ingredients)

  #   return


  # def query_semantic_index_ingredients(self, query :str) -> list:
  #   """Run a query through the semantic search index and return top k responses"""
  #   # Matches shape is [(index,score), ...] e.g [(0, 0.4172574281692505), (3, 0.3305395245552063)]
  #   matches :list = self.embeddings_ingredients.search(query, self.top_k_matches_ingredients + 1)

  #   # Get just the indices from the matches
  #   matching_indicies = [tup[0] for tup in matches]

  #   # Get the ingredient associated with each index
  #   top_k_matching_ingredients = self.df_unique_ingredients.iloc[matching_indicies]["ingredients"].tolist()

  #   # If the input is an exact match, remove it since that is not a substitution
  #   if query in top_k_matching_ingredients:
  #     top_k_matching_ingredients.remove(query)

  #   # Make sure that only the first 5 matches are returned (may have 6 if there is not an exact match)
  #   top_k_matching_ingredients = top_k_matching_ingredients[:self.top_k_matches_ingredients]

  #   return top_k_matching_ingredients
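
The two pure-Python steps in the old code above (building the unique ingredient list, then trimming the search results) can be sketched without pandas or txtai. This is a simplified illustration under stated assumptions, not the original implementation: `unique_ingredients` and `top_k_substitutions` are hypothetical names, each NER entry is assumed to be a string holding a Python list literal, `ast.literal_eval` is swapped in for `eval`, cleaning is applied to each parsed name rather than to the raw string, and `ranked` stands in for the ingredient names that a semantic search would return in score order.

```python
import ast
import re


def unique_ingredients(ner_column: list) -> list:
    """Collect sorted, unique, cleaned ingredient names.

    Each entry is assumed to be a string containing a Python list literal,
    for example '["brown sugar", "milk"]'.
    """
    unique = set()
    for raw in ner_column:
        # ast.literal_eval only parses literals, so it cannot run arbitrary code
        for name in ast.literal_eval(raw):
            # Lower-case, replace / + _ * with spaces, and collapse whitespace
            cleaned = " ".join(re.sub(r"[/+_*]", " ", name.lower()).split())
            if cleaned:
                unique.add(cleaned)
    return sorted(unique)


def top_k_substitutions(query: str, ranked: list, k: int = 5) -> list:
    """Drop the query itself from a ranked list and keep the first k names."""
    # Callers should request k + 1 matches so that k remain after the drop
    return [name for name in ranked if name != query][:k]


# Hypothetical sample data for illustration only
ner_sample = ['["Brown_Sugar", "milk"]', '["milk", " butter "]']
ingredients = unique_ingredients(ner_sample)
substitutions = top_k_substitutions("milk", ["milk", "cream", "soy milk", "water"], k=2)
print(ingredients)
print(substitutions)
```

Requesting one extra match before filtering mirrors the `top_k_matches_ingredients + 1` call in the old code: if the query is itself in the index, the exact match is dropped and the requested number of substitutions still remains.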



Old code from metrics

In [None]:

  ## @staticmethod
  # def calculate_ndcg(ground_truth :list, predictions :list):
  #     # Convert the lists of strings to binary arrays
  #     classes = list(set(ground_truth + predictions))
  #     mlb = MultiLabelBinarizer(classes=classes)
  #     # Fit and transform the classes based on the ground truth
  #     ground_truth_binary = mlb.fit_transform([ground_truth])
  #     predictions_binary = mlb.transform([predictions])

  #     # Calculate the NDCG score between the two arrays
  #     ndcg = ndcg_score(ground_truth_binary, predictions_binary)
    
  #     return ndcg

  # def calculate_ndcg(self, ground_truth :list, predictions :list):
  #   # Check whether the lists are exactly the same
  #   if ground_truth == predictions:
  #     return 1.0

  #   # Create an indexer to map terms to indices
  #   indexer = {term: i for i, term in enumerate(set(ground_truth + predictions))}

  #   # Convert the ground truth and prediction lists to vectors of term frequencies
  #   ground_truth_vector = [self.get_tf_vector(doc, indexer) for doc in ground_truth]
  #   predictions_vector = [self.get_tf_vector(doc, indexer) for doc in predictions]

  #   # Compute NDCG scores for each list
  #   # ground_truth_ndcg_score = ndcg_score(ground_truth_vector, ground_truth_vector)
  #   predictions_ndcg_score = ndcg_score(ground_truth_vector, predictions_vector)

  #   return predictions_ndcg_score


  ## @staticmethod
  # def get_tf_vector(doc, indexer):
  #   """Custom indexer that works on literal lists and exact matches"""
  #   # Initialize a vector of zeros with the same length as the indexer
  #   tf_vector = [0] * len(indexer)

  #   # Count the term frequencies in the document
  #   for term in doc.split():
  #       if term in indexer:
  #           tf_vector[indexer[term]] += 1

  #   # Normalize the vector by dividing by the sum of the term frequencies
  #   norm = sum(tf_vector)
  #   if norm > 0:
  #       tf_vector = [tf / norm for tf in tf_vector]

  #   return tf_vector
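
For comparison with the abandoned attempts above, here is a minimal sketch of a binary-relevance NDCG computed directly on lists of ingredient strings, with no binarizer or term-frequency vectors. It is one common formulation and is not necessarily the metric behind the model scores reported earlier. The function name `ndcg_binary` is hypothetical, and the predictions are assumed to contain no duplicates.

```python
import math


def ndcg_binary(ground_truth: list, predictions: list) -> float:
    """NDCG where a prediction is relevant (1) if it appears in ground_truth, else 0."""
    relevant = set(ground_truth)

    # DCG: each hit is discounted by log2 of its 1-based rank plus one
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(predictions)
        if item in relevant
    )

    # Ideal DCG: all achievable hits placed at the top of the ranking
    ideal_hits = min(len(relevant), len(predictions))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))

    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: one of two predictions matches, but at rank 2
score = ndcg_binary(["cream"], ["water", "cream"])
print(score)
```

A perfect ranking scores 1.0, a ranking with no overlap scores 0.0, and a correct substitution that appears lower in the list is penalized by the logarithmic discount.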