<img src="recipe_banner.png" alt="Banner" width="1100">

# Food.com Recipe Search Engine Project 

In this project, we build a search engine for [food.com](food.com). The site's existing search works well for simple queries, but it struggles with more complex or specific descriptions and with dietary restrictions. The main goal is a search algorithm that handles these cases better by combining TF-IDF with a negative scoring method.

Dataset: [Food.com Recipes with Search Terms and Tags (Kaggle)](https://www.kaggle.com/datasets/shuyangli94/foodcom-recipes-with-search-terms-and-tags) 

Demo: https://www.youtube.com/watch?v=q2XHtwxdoKw 

Main Libraries: pandas, numpy, re, glob, stopwords, TfidfVectorizer, cosine_similarity, widgets, display, interact_manual

In [29]:
import glob
import re
import numpy as np
import pandas as pd
import ipywidgets as widgets
from IPython.display import display
from ipywidgets import interact_manual
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [30]:
pd.set_option('display.max_colwidth', None) # show the entire string value in the df 
pd.set_option('display.max_rows', None) # show the entire DataFrame

### Data Preparation 

#### Clean Tokens

First, create a column that contains all of the tokens that will be passed to the TF-IDF vectorizer. We will use the `name`, `ingredients`, `tags`, and `search_terms` columns for our algorithm.

In [31]:
def simple_clean(text):
    """
    Do a simple clean on the input string.
    
    Args: 
        text (string): search query or recipe name

    Returns:
        cleaned text (string)
    """
    text = text.lower()
    text = re.sub(r'[^a-zA-Z]', ' ', text) # replace anything that is not a letter with a space 
    text = re.sub(r'\s+', ' ', text) # replace any multiple spaces with a single space 
    text = text.strip()
    return text

def replace_double_quotes_by_word_count(match):
    content = match.group(1) 
    count = len(content.split())
    
    # if 3 or fewer words are within the double quotes, return the words themselves without any quotes 
    if count <= 3:
        return content 
    else: # if more than 3 words, return them in single quotes; this may be a sentence in double quotes 
        return f"'{content}'" 

def extract_words_in_quotes(all_terms, replace_dash=True, lowercase=True, replace_double_quotes=False, remove_apostrophes=False, make_unique=True):
    """
    Returns a list of tokens extracted from a string of tokens in quotations. 

    Example: ['water', 'cheese']['pasta']{'sugar-free'} (str) -> [water, cheese, pasta, sugar-free] (list)

    Args: 
        all_terms (string): string of tokens in single quotations ('')
        replace_dash (boolean, optional): if True replaces dashes with a space 
        lowercase (boolean, optional): if True makes all characters lowercase 
        replace_double_quotes (boolean, optional): if True replace the double quotes with single quotes 
        remove_apostrophes (boolean, optional): if True remove any apostrophes used for possessive or contractions 
        make_unique (boolean, optional): if True only extract unique terms 

    Returns:
        list: list of tokens 
    """

    if replace_dash:
        all_terms = all_terms.replace('-', ' ')
    if lowercase:
        all_terms = all_terms.lower()
    if replace_double_quotes:
        all_terms = re.sub(r'(\d+)"', r'\1 inch', all_terms) # change double quotes (") that refer to inches to 'inch'
        all_terms = re.sub(r"(\d+)'(?=[FC])", r"\1", all_terms) # remove single quotes (') that refer to temperature (350'F)
        all_terms = re.sub(r'"([^"]+)"', replace_double_quotes_by_word_count, all_terms) # change double quotes (") that refer to "imitation"
    if remove_apostrophes:
        if '\'s' in all_terms:
            all_terms = re.sub(r"(\w+)'s\b", r"\1s", all_terms) # replace possessive 's
        if '\'t' in all_terms:
            all_terms = re.sub(r"(\w+)'t\b", r"\1t", all_terms)  # replace contraction n't
        if '\'re' in all_terms:
            all_terms = re.sub(r"(\w+)'re\b", r"\1re", all_terms)  # replace contraction 're
        if '\'m' in all_terms:
            all_terms = re.sub(r"(\w+)'m\b", r"\1m", all_terms)  # replace contraction 'm
        if '\'d' in all_terms:
            all_terms = re.sub(r"(\w+)'d\b", r"\1d", all_terms)  # replace contraction 'd
        if '\'ve' in all_terms:
            all_terms = re.sub(r"(\w+)'ve\b", r"\1ve", all_terms)  # replace contraction 've
        if '\'ll' in all_terms:
            all_terms = re.sub(r"(\w+)'ll\b", r"\1ll", all_terms)  # replace contraction 'll
    all_terms = re.findall(r"'(.*?)'", all_terms) # '(.*?)': This pattern matches anything inside single quotes.
    if make_unique:
        all_terms = list(set(all_terms))
    return all_terms

def preprocess_recipes(recipe):
    recipe['name_set'] = set(simple_clean(recipe['name']).split())
    recipe['steps'] = extract_words_in_quotes(recipe['steps'], replace_dash=False, lowercase=False, 
                            replace_double_quotes=True, remove_apostrophes=True, make_unique=False)

    return recipe
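
As a quick sanity check of the name cleaning, here is a minimal standalone sketch that mirrors what `simple_clean` followed by `split()` does for a single recipe name. The helper name `clean_name_to_token_set` is made up for this illustration and is not part of the notebook.

```python
import re

def clean_name_to_token_set(name):
    """Lowercase, keep letters only, collapse whitespace, and split into a set of tokens."""
    text = name.lower()
    text = re.sub(r'[^a-zA-Z]', ' ', text)    # every non-letter becomes a space
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
    return set(text.split())

tokens = clean_name_to_token_set("Big Uncle Mike's M &amp; M Cookies")
print(sorted(tokens))
```

Note that the HTML entity `&amp;` leaves a stray `amp` token behind, and the possessive `'s` becomes a lone `s`. Decoding entities first, for example with the standard library's `html.unescape`, would be one way to avoid the former; the notebook itself does not do this.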


In [32]:
def get_all_recipes():
    """
    Get info for ~500k recipes from Food.com, including:
        id: identifier (double)
        name: name of the recipe (string)
        description: description of the recipe (string)
        ingredients: list of ingredients (string) 
        ingredients_raw_str: list of portions of ingredients (string) 
        serving size: serving size in grams (string)
        servings: number of servings (double)
        steps: list of steps to follow (string)
        tags: list of tags for the recipe (string)
        search_terms: set of search terms for the recipe (string)

    Returns:
        df (DataFrame): info for ~500k recipes 
    """
    files = glob.glob('/Users/averylee/Desktop/DS/recipes/recipes_w_search_terms_*.csv')
    recipes = pd.concat([pd.read_csv(file) for file in files])

    recipes = recipes.apply(lambda x: preprocess_recipes(x), axis=1)

    return recipes 

In [33]:
all_recipes = get_all_recipes()

Let's look at the first 5 examples. The columns `name`, `description`, `ingredients`, `tags`, and `search_terms` seem the most descriptive and relevant for our use case. 

In [34]:
all_recipes.head(5)

Unnamed: 0.1,Unnamed: 0,id,name,description,ingredients,steps,tags,search_terms,name_set
0,296982,514890,Jambalaya,My favorite jambalaya recipe.,"['vegetable oil', 'boneless chicken breasts', 'boneless chicken thighs', 'onion', 'green pepper', 'celery rib', 'garlic clove', 'parsley', 'kielbasa', 'cajun spices', 'dried thyme', 'cayenne pepper', 'bay leaf', 'converted rice', 'chicken stock', 'tomato sauce', 'raw shrimp', 'parsley']","[In a large heavy dutch oven over medium heat, heat oil and brown the chicken breast and thighs for 5 minutes., Add onion, bell pepper, celery,garlic, and parsley and cook for 5 mins longer., add the sausage, cajun spice, thyme, cayenne, bay leaf, and season to taste with salt and pepper. Cook for 1 minute., Stir in the rice, chicken stock, and tomato sauce and bring to a boil., reduce heat to medium low, cover and cook for 30 mins to 35 minutes Gently nestle the shrimp into the rice 5 mins before the jambalaya is finished., When ready to serve, fluff the rice with a fork. Garnish each serving with chopped parsley.]","['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'preparation', 'main-dish', 'seafood', 'meat', 'pasta-rice-and-grains', 'from-scratch']",{'dinner'},{jambalaya}
1,296983,534634,Big Uncle Mike's M &amp; M Cookies,Delicious M & M cookies!,"['butter', 'eggs', 'vanilla extract', 'oil', 'dark brown sugar', 'flour', 'salt', 'baking soda', 'candy']","[blender in electric mixer butter, eggs, vanilla, oil., then add brown sugar, continue until fluffy., then add by hand flour, salt, baking soda, and m &amp; ms. mix w/ wooden spoon., spoon onto greased baking sheet 1-2 inches apart. bake at 350 degrees for 12-14 minutes rotate pan 1/2 way through.]","['60-minutes-or-less', 'time-to-make', 'course', 'preparation', 'for-large-groups', 'desserts', 'number-of-servings']","{'cookie', 'dessert'}","{cookies, amp, mike, uncle, m, big, s}"
2,296984,559,Ginger Fried Chicken,,"['whole chickens', 'fresh gingerroot', 'garlic', 'japanese soy sauce', 'salt', 'pepper', 'flour', 'eggs']","[Add ginger, garlic, soy sauce, salt, and pepper to chicken, mix well, and marinate overnight. Add eggs to chicken., Mix together well., Add flour to coat. Mix well., Fry until golden, about 5 minutes., Finish in oven, 350 degrees for 30-35 minutes.]","['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'preparation', 'occasion', 'main-dish', 'poultry', 'chicken', 'dietary', 'meat', 'whole-chicken', 'number-of-servings']","{'dinner', 'chicken'}","{fried, chicken, ginger}"
3,296985,384613,The Realtor's Creamy Cheese Tortellini With Asparagus,YUMMY! This is so creamy and delicious. It's quick to the table and very satisfying.,"['reduced-sodium chicken broth', 'garlic', 'thyme', 'lemon pepper', 'cheese tortellini', 'cornstarch', 'heavy cream', 'asparagus', 'parmigiano-reggiano cheese']","[Boil broth with garlic, thyme and lemon pepper in a large heavy skillet until reduced to about 1 cup, about 6 minutes or so., Meanwhile, cook tortellini in a pasta pot of boiling salted water (1 1/2 tablespoons salt for 4 quarts water) according to package directions. Drain., Stir cornstarch into cream, then whisk into broth. Bring to a simmer, whisking, then continue to simmer 1 minute. Add asparagus and simmer until crisp-tender, about 2 minutes. Stir in cheese and tortellini and cook, gently stirring, until heated through.]","['weeknight', '60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'north-american', 'main-dish', 'side-dishes', 'pasta', 'vegetables', 'easy', 'dinner-party', 'kid-friendly', 'dietary', 'one-dish-meal', 'comfort-food', 'pasta-rice-and-grains', 'ravioli-tortellini', 'asparagus', 'taste-mood', '3-steps-or-less']","{'side', 'dinner', 'pasta'}","{tortellini, cheese, creamy, with, realtor, the, asparagus, s}"
4,296986,444306,Bang Bang Shrimp,"Finally! I've been experimenting with different ""copycat"" recipes for Bonefish Grill's Bang Bang Shrimp for a couple of years. Finally, the Food Network folks have nailed it!! This recipe is in the April 2010 issue of Food Network Magazine. The only change I made was to add some Sriracha Hot Chili Sauce to add more zip. Serves 4 as an appetizer or 2 as a main dish.","['mayonnaise', 'asian chili sauce', 'honey', 'hot chili sauce', 'large shrimp', 'eggs', 'all-purpose flour', 'cornstarch', 'salt', 'pepper', 'scallion']","[Mix mayonnaise, chili sauce, sriracha and honey for sauce., Heat vegetable or peanut oil to 350 degrees., Whisk eggs in a pie plate., Whisk Flour, cornstarch, 1 t salt and 1 t pepper in another pie plate., Working in batches, dredge shrimp in flour mixture, shake off excess, dip in beaten eggs, then return to flour mixture. Fry shrimp in hot oil until golden. Transfer to paper towel lined plate with slotted spoon., Toss shrimp with prepared sauce. Top with sliced scallions.]","['30-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'for-1-or-2', 'appetizers', 'seafood', 'asian', 'dinner-party', 'number-of-servings']","{'appetizer', 'dinner', 'shrimp'}","{bang, shrimp}"


Compile a list of tokens using these useful columns. 

In [35]:
def create_all_tokens_col(df):
    """
    Creates a new column that combines all the relevant recipe info that can be used as possible search tokens. 
    
    Args: 
        df (DataFrame): info for recipes 

    Returns:
        df (DataFrame): contains the new col that combines all the cols with possible search tokens. 
    """
    df = df.fillna(' ') # fill missing values with a space; extra whitespace will be removed later 
    df['all_tokens'] = '\'' + df['name'] + '\'' + ' ' + df['ingredients'].apply(str) + ' ' + df['tags'].apply(str) + ' ' + df['search_terms'].apply(str)
    
    return df 

In [36]:
cols = ['id', 'name', 'name_set', 'description', 'steps', 'ingredients', 'tags', 'search_terms']
all_recipes = all_recipes[cols]
all_recipes = create_all_tokens_col(all_recipes)
all_recipes.head(1)

Unnamed: 0,id,name,name_set,description,steps,ingredients,tags,search_terms,all_tokens
0,514890,Jambalaya,{jambalaya},My favorite jambalaya recipe.,"[In a large heavy dutch oven over medium heat, heat oil and brown the chicken breast and thighs for 5 minutes., Add onion, bell pepper, celery,garlic, and parsley and cook for 5 mins longer., add the sausage, cajun spice, thyme, cayenne, bay leaf, and season to taste with salt and pepper. Cook for 1 minute., Stir in the rice, chicken stock, and tomato sauce and bring to a boil., reduce heat to medium low, cover and cook for 30 mins to 35 minutes Gently nestle the shrimp into the rice 5 mins before the jambalaya is finished., When ready to serve, fluff the rice with a fork. Garnish each serving with chopped parsley.]","['vegetable oil', 'boneless chicken breasts', 'boneless chicken thighs', 'onion', 'green pepper', 'celery rib', 'garlic clove', 'parsley', 'kielbasa', 'cajun spices', 'dried thyme', 'cayenne pepper', 'bay leaf', 'converted rice', 'chicken stock', 'tomato sauce', 'raw shrimp', 'parsley']","['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'preparation', 'main-dish', 'seafood', 'meat', 'pasta-rice-and-grains', 'from-scratch']",{'dinner'},"'Jambalaya' ['vegetable oil', 'boneless chicken breasts', 'boneless chicken thighs', 'onion', 'green pepper', 'celery rib', 'garlic clove', 'parsley', 'kielbasa', 'cajun spices', 'dried thyme', 'cayenne pepper', 'bay leaf', 'converted rice', 'chicken stock', 'tomato sauce', 'raw shrimp', 'parsley'] ['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'preparation', 'main-dish', 'seafood', 'meat', 'pasta-rice-and-grains', 'from-scratch'] {'dinner'}"


The combined tokens are not clean yet, but we can see that every term is wrapped in single quotes. For each recipe, we can therefore get a list of clean terms by extracting whatever sits between the quotes. This list will be used later to classify terms as 'positive' or 'negative' for a recipe.
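
The quote-based extraction can be sketched in isolation as follows. This is a simplified illustration, not the notebook's `extract_words_in_quotes`: the sample string and the helper name `extract_quoted_terms` are made up, and only the dash replacement, lowercasing, and de-duplication steps are reproduced.

```python
import re

# Hypothetical sample mimicking a concatenated name + ingredients + tags + search_terms string.
sample = "'Jambalaya' ['vegetable oil', 'parsley', 'parsley'] ['from-scratch', 'main-dish'] {'dinner'}"

def extract_quoted_terms(text):
    """Return the unique, lowercased terms found between single quotes."""
    text = text.replace('-', ' ').lower()   # 'main-dish' -> 'main dish'
    terms = re.findall(r"'(.*?)'", text)    # non-greedy: one match per quoted term
    return sorted(set(terms))               # sorted only to make the output deterministic

terms = extract_quoted_terms(sample)
print(terms)

# An apostrophe inside a name shifts the quote pairing, so later terms get lost or mangled.
broken = re.findall(r"'(.*?)'", "'Mike's Cookies' ['butter']")
print(broken)
```

The second call shows why apostrophes in recipe names need to be removed before the name is wrapped in quotes: with the stray apostrophe in place, `butter` is no longer extracted as a term.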

In [37]:
def remove_apostrophes(word):
    """
    Returns the input string with each apostrophe replaced by a space. 

    Args: 
        word (string)

    Returns:
        word (string)
    """
    word = word.replace('\'', ' ')
    return word 

def get_all_terms(df):
    """
    Gets all the main terms from name, ingredients, tags, and search terms. 

    Args: 
        df (DataFrame): must contain cols name, ingredients, tags, and search_terms

    Returns:
        df (DataFrame): contains new col that contains the combined tokens. 
    """
    df['name'] = df['name'].apply(remove_apostrophes)
    df['all_main_terms'] = '\'' + df['name'] + '\'' + df['ingredients'] + df['tags'] + df['search_terms']
    df['all_main_terms'] = df['all_main_terms'].apply(extract_words_in_quotes) # each resulting val is a list 

    return df 

In [38]:
all_recipes = get_all_terms(all_recipes)
all_recipes.head(1)

Unnamed: 0,id,name,name_set,description,steps,ingredients,tags,search_terms,all_tokens,all_main_terms
0,514890,Jambalaya,{jambalaya},My favorite jambalaya recipe.,"[In a large heavy dutch oven over medium heat, heat oil and brown the chicken breast and thighs for 5 minutes., Add onion, bell pepper, celery,garlic, and parsley and cook for 5 mins longer., add the sausage, cajun spice, thyme, cayenne, bay leaf, and season to taste with salt and pepper. Cook for 1 minute., Stir in the rice, chicken stock, and tomato sauce and bring to a boil., reduce heat to medium low, cover and cook for 30 mins to 35 minutes Gently nestle the shrimp into the rice 5 mins before the jambalaya is finished., When ready to serve, fluff the rice with a fork. Garnish each serving with chopped parsley.]","['vegetable oil', 'boneless chicken breasts', 'boneless chicken thighs', 'onion', 'green pepper', 'celery rib', 'garlic clove', 'parsley', 'kielbasa', 'cajun spices', 'dried thyme', 'cayenne pepper', 'bay leaf', 'converted rice', 'chicken stock', 'tomato sauce', 'raw shrimp', 'parsley']","['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'preparation', 'main-dish', 'seafood', 'meat', 'pasta-rice-and-grains', 'from-scratch']",{'dinner'},"'Jambalaya' ['vegetable oil', 'boneless chicken breasts', 'boneless chicken thighs', 'onion', 'green pepper', 'celery rib', 'garlic clove', 'parsley', 'kielbasa', 'cajun spices', 'dried thyme', 'cayenne pepper', 'bay leaf', 'converted rice', 'chicken stock', 'tomato sauce', 'raw shrimp', 'parsley'] ['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'preparation', 'main-dish', 'seafood', 'meat', 'pasta-rice-and-grains', 'from-scratch'] {'dinner'}","[60 minutes or less, boneless chicken thighs, vegetable oil, onion, meat, cajun spices, boneless chicken breasts, celery rib, pasta rice and grains, dinner, main ingredient, from scratch, tomato sauce, green pepper, preparation, kielbasa, garlic clove, chicken stock, raw shrimp, seafood, cayenne pepper, time to make, course, parsley, main dish, jambalaya, dried thyme, bay leaf, converted rice]"


#### Positive and Negative Tokens

Search queries can contain words that mark an ingredient as unwanted. For example, 'sugar free bread' means the searcher wants bread that does not contain sugar. A standard TF-IDF algorithm, however, has no notion of negation: it treats 'sugar' as just another query term and may recommend recipes that do contain sugar.
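
To make the problem concrete, here is a small sketch using a hand-rolled TF-IDF with cosine similarity in plain Python. It is not scikit-learn's `TfidfVectorizer`, the smoothed IDF formula is just one common variant, and the two-recipe corpus is invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real project works on ~500k recipes.
docs = [
    "whole wheat bread flour yeast",   # contains no sugar
    "sweet bread flour sugar butter",  # contains sugar
]
query = "sugar free bread"

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
# Number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (one common variant).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    # Out-of-vocabulary words such as 'free' are ignored.
    counts = Counter(t for t in tokens if t in doc_freq)
    return {t: c * idf(t) for t, c in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query_vec = tfidf_vector(query.split())
scores = [cosine(query_vec, tfidf_vector(tokens)) for tokens in tokenized]
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

In this toy setup the recipe that contains sugar scores higher for the query 'sugar free bread', because the word 'sugar' counts as a match rather than as something to avoid.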

So for each recipe, we need to classify which ingredients or terms are wanted (called positive), and which are unwanted (called negative). 
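
The idea can be sketched with two deliberately simplified regex patterns. These are illustrative stand-ins, not the patterns used in the notebook, which cover many more phrasings; the names `FREE`, `NEGATION`, and `split_pos_neg` are assumptions made for this example.

```python
import re

# Simplified patterns for illustration only.
FREE = re.compile(r'^(\w+)\s+free\s+(.+)$')                      # 'sugar free bread'
NEGATION = re.compile(r'^(.+?)\s+(?:with no|without)\s+(\w+)$')  # 'bread with no sugar'

def split_pos_neg(term):
    """Return (positive_words, negative_words) for a single term."""
    m = FREE.match(term)
    if m:
        # the word before 'free' is unwanted, the rest is wanted
        return set(m.group(2).split()), {m.group(1)}
    m = NEGATION.match(term)
    if m:
        # the word after the negation is unwanted, the part before it is wanted
        return set(m.group(1).split()), {m.group(2)}
    return set(term.split()), set()  # default: everything is positive

for term in ['sugar free bread', 'bread with no sugar', 'pizza without cheese', 'tomato sauce']:
    print(term, '->', split_pos_neg(term))
```

Terms without any negation marker fall through to the default branch and are treated as entirely positive.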

In [39]:
def get_pos_neg_patterns():
    """
    Return a dict of regex patterns, where the key indicates if it is of type low, free, or negation.
    The regexes in the lists are approximately ordered from most to least likely to appear. 
    Note: if a new negation term is added to the regex, it must also be added to the key in the dict 

    Example of regex patterns: 
        positives:
            positives after 'low': 
                'low sugar bread'           -> 'bread'
            positives after 'free':
                'sugar free bread'          -> 'bread'
            positives before 'free in' or 'free of': 
                'bread free in sugar'       -> 'bread'
            positives after negations:
                'no sugar bread'            -> 'bread'
            positives before negations:
                'bread with no sugar'       -> 'bread'
            positives after negation and added: 
                'no added sugar bread'      -> 'bread'
        negatives:
            negatives after 'low':
                'low sugar bread'           -> 'sugar'
            negatives before 'free':
                'sugar free bread'          -> 'sugar'
            negatives after 'free in' or 'free of':
                'bread free in sugar'       -> 'sugar'
            negatives after negations:
                'bread with no sugar'       -> 'sugar'
            negatives between negation and added: 
                'no added sugar bread'      -> 'sugar'
    Returns:
        dictionary: {
            type of regex (string) : {
                negative or positive (string) : list of regex (list)
            }
        }
    """
  
    # positives 
    pos_after_low_pattern = re.compile(r'\blow\s*(?:in\s*)?\s*\w+\s+(.+)\b')
    pos_after_free_pattern = re.compile(r'\bfree\b\s+(?!in|of)\s*(.+)')
    pos_before_freein_freeof_pattern = re.compile(r'(.+)\s+free\s+(in|of)\b')
    pos_after_negation_pattern = re.compile(r'\b(?:no added|no|with no|non|not|without|minimal|with minimal)\b\s+\w+\s+(\w.*)') 
    pos_before_negation_pattern = re.compile(r'(.+?)\s+(?=\b(?:no added|no|with no|non|not|without|minimal|with minimal)\b)') 
    pos_after_negation_added_pattern = re.compile(r'\b(?:no|with no|non|not|without|minimal|with minimal)\b\s+[\w\s]+?\s+added\s+(.+)') 

    # negatives 
    neg_after_low_pattern = re.compile(r'\blow\s*(?:in\s*)?\s*(\w+)')
    neg_before_free_pattern = re.compile(r'(.+)\s+free\b(?!\s+(in|of))') 
    neg_after_freein_freeof_pattern = re.compile(r'\b(?:free\s+in|free\s+of)\s*(.+)')
    neg_after_negation_pattern = re.compile(r'\b(?:no added|no|non|not|with no|without|minimal)\b(?:\s+added)?\s+(\w+)') 
    neg_between_negation_added_pattern = re.compile(r'\b(?:no|with no|non|not|without|minimal|with minimal)\b\s+([\w\s]+?)\s+added\b') 

    # in order of most common to least common 
    patterns = {
        'low': {
            'negative': [neg_after_low_pattern],
            'positive': [pos_after_low_pattern] 
        }, 
        'free': {
            'negative': [neg_before_free_pattern, neg_after_freein_freeof_pattern],
            'positive': [pos_after_free_pattern, pos_before_freein_freeof_pattern] 
        }, 
        'no_added no added with_no non not without minimal with_minimal': { # if there is a new negation term added to the regex, it must be added to this key 
            'negative': [neg_after_negation_pattern, neg_between_negation_added_pattern], 
            'positive': [pos_after_negation_pattern, pos_before_negation_pattern, pos_after_negation_added_pattern] 
        }
    }

    return patterns

def make_neg_multiword_into_singleword(term, multiword_neg_terms_list):
    """
    For a given term, replaces any negative multi-word term with its one-word version using underscores (_). 
    Returns the original string with any negative multi-word terms replaced, as well as True/False whether negative multi-word term existed or not.
    
    Args: 
        term (string): the entire term
        multiword_neg_terms_list (list): list of terms that are multiple words but should be considered as one term
    
    Returns:
        term (string): the original input term with the multiword term replaced with its one-word form using underscore (_)
        boolean: True if multiword term exists, False if not 

    Example: 
        term ('pizza without saturated fat'), multiword_neg_terms_list (['saturated fat']) -> 'pizza without saturated_fat', True 
        term ('pizza without fat'), multiword_neg_terms_list (['saturated fat']) -> 'pizza without fat', False 
    """
    for multiword in multiword_neg_terms_list:
        if multiword in term:
            term = term.replace(multiword, multiword.replace(' ', '_'))
            return term, True
    return term, False

def check_contains_neg_indicator(term, all_neg_indicators):
    """
    Returns whether the term contains a negative indicator. 
    
    Args: 
        term (string)
        all_neg_indicators (list): list of strings that indicate that negative exists in the term 
    
    Returns:
        boolean: True if there is a negative indicator, False if there is none 

    Example: 
        term ('no beef pizza'), all_neg_indicators (['no', 'no added', 'without']) -> True 
    """
    
    term = term.replace('no added', 'no_added')
    term = term.replace('with no', 'with_no')
    term = term.replace('with minimal', 'with_minimal')
    term_split = term.split()
    return any(word in term_split for word in all_neg_indicators)

def classify_pos_neg(all_terms, pos_neg_patterns, all_neg_indicators):
    """
    Classify all the positive and negative words for all the given terms. 
    If a token is not found to be positive or negative through the list of regex, it is marked as positive as that is the default. 
    
    Args: 
        all_terms (list): list of strings of all the terms
        pos_neg_patterns (dict): dict of all the positive and negative regex patterns 
        all_neg_indicators (list): list of strings that indicate that negative exists in the term 
    
    Returns:
        set of positive tokens (set)
        set of negative tokens (set)
    """

    pos_set, neg_set = set(), set()
    ignore_neg_words_list = set(['something']) # words to ignore even if determined to be 'negative' (for example, 'free of something')
    multiword_neg_terms_list = ['saturated fat', 'trans fat'] # terms that can be considered as one word for 'negative'

    for term in all_terms: 
        # check if any negative indicators in the term (ex: low, free, no, without, etc)
        # only do regex pattern check if there is a negative indicator, otherwise it is time costly as most terms are not negative 
        contains_negative_indicator = check_contains_neg_indicator(term, all_neg_indicators)

        if contains_negative_indicator:
            term, contains_multiword_neg_term_bool = make_neg_multiword_into_singleword(term, multiword_neg_terms_list)
            is_pattern_matched = False 

            for pattern_type, pattern_type_dict in pos_neg_patterns.items():
                neg_patterns_list = pattern_type_dict['negative']
                pos_patterns_list = pattern_type_dict['positive']

                for neg_pattern in neg_patterns_list:
                    match = neg_pattern.search(term)
                    if match:
                        is_pattern_matched = True 
                        neg_word = match.group(1)
                        if neg_word not in ignore_neg_words_list:
                            if contains_multiword_neg_term_bool: 
                                neg_word = neg_word.replace('_', ' ')
                            neg_words = neg_word.split()
                            neg_set.update(neg_words)
                            # only need to check positive pattern if its negative pattern matched 
                            for pos_pattern in pos_patterns_list:
                                match = pos_pattern.search(term)
                                if match:
                                    pos_word = match.group(1)
                                    if contains_multiword_neg_term_bool:
                                        pos_word = pos_word.replace('_', ' ')
                                    pos_words = pos_word.split()
                                    pos_set.update(pos_words)
                                    break 
                        break 
                
                if is_pattern_matched:
                    break 
                
            if not is_pattern_matched: # in case the term had negative indicator but did not contain negative word 
                pos_words = term.split()
                pos_set.update(pos_words)

        else: # no negative indicators 
            pos_words = term.split()
            pos_set.update(pos_words)

    return pos_set, neg_set

def get_pos_neg_terms(df):
    """
    Classify all the terms in all_terms into either positive or negative. 
    Will be used when matching positives and negatives in the search query to the recipes.

    Args: 
        df (DataFrame): must contain column 'all_main_terms' 

    Returns: 
        df (DataFrame): new cols positive_terms and negative_terms that indicate which words are positive and which are negative 
    """
    pos_neg_patterns = get_pos_neg_patterns()
    all_neg_indicators = ' '.join(pos_neg_patterns.keys()).split() # the keys split into a list of indicators 
    
    # (pos1, neg1), (pos2, neg2), (pos3, neg3) -> (pos1, pos2, pos3), (neg1, neg2, neg3)
    df['positive_terms'], df['negative_terms'] = zip(*df['all_main_terms'].apply(lambda x: classify_pos_neg(x, pos_neg_patterns, all_neg_indicators)))

    return df
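
The `zip(*...)` idiom used to fill the two new columns can be seen in isolation below. This is a minimal sketch on a plain list instead of a pandas Series, and the `(positive, negative)` pairs are invented sample values.

```python
# Each element is one recipe's (positive_terms, negative_terms) pair,
# shaped like the return value of classify_pos_neg.
pairs = [
    ({'bread'}, {'sugar'}),
    ({'pizza'}, set()),
    ({'vanilla', 'yogurt'}, {'fat'}),
]

# zip(*pairs) transposes the list of pairs into two parallel tuples:
# one holding every positive set, one holding every negative set.
positives, negatives = zip(*pairs)

print(positives)
print(negatives)
```

Assigning the two resulting tuples to two DataFrame columns in one statement is what lets a single `apply` pass produce both `positive_terms` and `negative_terms`.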

In [40]:
all_recipes = get_pos_neg_terms(all_recipes)

The positive and negative tokens have been split, as shown below. 

In [41]:
all_recipes[['name', 'all_main_terms', 'positive_terms', 'negative_terms']].head(10)

Unnamed: 0,name,all_main_terms,positive_terms,negative_terms
0,Jambalaya,"[60 minutes or less, boneless chicken thighs, vegetable oil, onion, meat, cajun spices, boneless chicken breasts, celery rib, pasta rice and grains, dinner, main ingredient, from scratch, tomato sauce, green pepper, preparation, kielbasa, garlic clove, chicken stock, raw shrimp, seafood, cayenne pepper, time to make, course, parsley, main dish, jambalaya, dried thyme, bay leaf, converted rice]","{dish, bay, garlic, rib, from, onion, chicken, meat, to, green, pasta, converted, less, make, leaf, cayenne, minutes, time, 60, dinner, cajun, preparation, breasts, tomato, kielbasa, thighs, and, ingredient, sauce, seafood, course, grains, parsley, shrimp, vegetable, jambalaya, oil, pepper, clove, stock, dried, thyme, boneless, main, scratch, raw, spices, or, rice, celery}",{}
1,Big Uncle Mike s M &amp; M Cookies,"[60 minutes or less, flour, baking soda, big uncle mike s m &amp; m cookies, preparation, candy, butter, time to make, eggs, course, vanilla extract, number of servings, salt, cookie, desserts, for large groups, oil, dessert, dark brown sugar]","{groups, flour, &amp;, soda, uncle, to, baking, cookies, sugar, less, make, brown, mike, minutes, time, 60, dark, extract, m, big, for, preparation, candy, of, butter, eggs, course, servings, salt, s, cookie, desserts, large, oil, number, dessert, vanilla, or}",{}
2,Ginger Fried Chicken,"[garlic, 60 minutes or less, occasion, flour, poultry, chicken, meat, dietary, dinner, main ingredient, preparation, whole chickens, time to make, eggs, course, main dish, number of servings, salt, ginger fried chicken, fresh gingerroot, pepper, whole chicken, japanese soy sauce]","{dish, garlic, occasion, poultry, flour, fresh, chicken, meat, to, less, make, fried, minutes, time, dietary, 60, dinner, preparation, of, ingredient, sauce, eggs, course, servings, ginger, salt, soy, pepper, whole, number, gingerroot, japanese, main, chickens, or}",{}
3,The Realtor s Creamy Cheese Tortellini With Asparagus,"[dinner party, garlic, 60 minutes or less, occasion, 3 steps or less, side, cornstarch, comfort food, pasta, lemon pepper, the realtor s creamy cheese tortellini with asparagus, north american, parmigiano reggiano cheese, pasta rice and grains, dietary, dinner, main ingredient, preparation, side dishes, one dish meal, heavy cream, taste mood, cheese tortellini, cuisine, kid friendly, time to make, course, weeknight, reduced sodium chicken broth, main dish, ravioli tortellini, vegetables, thyme, easy, asparagus]","{dish, garlic, mood, occasion, ravioli, steps, side, chicken, cornstarch, north, to, pasta, the, lemon, tortellini, less, make, 3, cream, creamy, dishes, comfort, minutes, time, dietary, 60, dinner, broth, party, preparation, and, meal, cuisine, sodium, ingredient, american, food, course, parmigiano, grains, weeknight, reduced, s, taste, reggiano, kid, cheese, pepper, vegetables, thyme, easy, with, realtor, main, heavy, or, rice, one, friendly, asparagus}",{}
4,Bang Bang Shrimp,"[dinner party, occasion, appetizers, cornstarch, scallion, dinner, all purpose flour, main ingredient, preparation, large shrimp, cuisine, for 1 or 2, bang bang shrimp, time to make, seafood, eggs, course, number of servings, salt, shrimp, hot chili sauce, asian, pepper, asian chili sauce, 30 minutes or less, appetizer, mayonnaise, honey]","{occasion, flour, appetizers, purpose, cornstarch, to, scallion, hot, make, chili, less, minutes, time, dinner, party, 2, for, preparation, bang, 1, of, cuisine, ingredient, sauce, seafood, eggs, course, servings, all, salt, shrimp, large, asian, pepper, number, main, appetizer, or, 30, mayonnaise, honey}",{}
5,Swissair Tarts,"[60 minutes or less, flour, appetizers, light cream, cornstarch, milk, swiss, gruyere cheese, ground walnuts, dietary, main ingredient, from scratch, swissair tarts, preparation, eggs dairy, heavy cream, cuisine, european, butter, time to make, eggs, course, number of servings, salt, for large groups, cheese, vegetarian, egg yolks, appetizer]","{groups, flour, appetizers, from, dairy, cornstarch, gruyere, to, milk, swiss, ground, less, make, cream, minutes, tarts, dietary, 60, time, walnuts, swissair, preparation, for, appetizer, of, cuisine, european, yolks, ingredient, butter, eggs, course, servings, salt, cheese, large, egg, number, vegetarian, main, light, or, scratch, heavy}",{}
6,Dehydrator Au Gratin Potato Chips,"[5 ingredients or less, high calcium, potatoes, vegetables, preparation, easy, high in something, 3 steps or less, snack, snacks, sharp cheddar cheese, parmesan cheese, main ingredient, dietary, course, salt, dehydrator au gratin potato chips]","{dehydrator, potatoes, steps, sharp, snack, in, potato, less, 3, something, ingredients, dietary, cheddar, preparation, calcium, ingredient, gratin, snacks, chips, course, salt, au, cheese, vegetables, easy, high, main, parmesan, or, 5}",{}
7,Bread Pudding Apple Pie,"[fall, dinner party, occasion, comfort food, apples, 9 inch pie shell, rolled oats, bread pudding apple pie, applesauce, white sugar, pies and tarts, thanksgiving, dietary, oven, all purpose flour, main ingredient, dinner, pie, preparation, holiday event, sweet, pudding, taste mood, bread, pies, butter, time to make, eggs, weeknight, course, 4 hours or less, desserts, seasonal, vegetarian, dessert, ground cinnamon, equipment, brown sugar, non fat vanilla yogurt, fruit]","{fall, apple, mood, occasion, shell, flour, purpose, rolled, apples, to, oats, hours, ground, white, sugar, applesauce, 9, make, less, brown, comfort, thanksgiving, tarts, time, dietary, event, dinner, oven, cinnamon, pie, holiday, party, preparation, yogurt, pudding, sweet, bread, pies, and, ingredient, butter, 4, food, eggs, weeknight, course, all, taste, desserts, seasonal, vegetarian, dessert, main, inch, equipment, vanilla, or, fruit}",{fat}
8,Marshmallow Fridge Tart,"[marshmallow fridge tart, condensed milk, graham cracker crumb crust, south african, glace cherries, cream, puddings and mousses, preparation, cuisine, time to make, course, vanilla extract, desserts, african, marshmallows, 15 minutes or less, dessert, crushed pineapple, lemon juice]","{juice, tart, to, milk, crumb, lemon, less, make, condensed, crushed, cream, mousses, pineapple, minutes, time, extract, puddings, preparation, graham, fridge, and, cuisine, 15, course, marshmallow, crust, cherries, desserts, african, south, marshmallows, dessert, vanilla, or, glace, cracker}",{}
9,Valentine s Day Vanilla Fiesta,"[nuts, occasion, indian, valentine s day vanilla fiesta, dietary, main ingredient, vanilla ice cream, preparation, fruit, holiday event, almonds, valentines day, cuisine, kid friendly, time to make, toddler friendly, course, desserts, asian, dessert, 30 minutes or less, cake, custard, equipment, pistachios, cashews, refrigerator]","{refrigerator, nuts, occasion, indian, to, day, less, make, cream, toddler, minutes, time, dietary, event, holiday, preparation, almonds, ice, cuisine, ingredient, course, s, kid, desserts, asian, valentines, dessert, valentine, main, cake, fiesta, vanilla, custard, equipment, or, pistachios, 30, fruit, friendly, cashews}",{}


### Set up the TFIDF Vectorizer 

Now let's set up the TF-IDF vectorizer. `TfidfVectorizer` takes raw strings as input, drops punctuation during tokenization, and removes the stopwords we pass via `stop_words`, so we can feed it the unclean version of the string (the `all_tokens` column).

The `feature_names` array lists every token that remains after this cleanup (removal of stopwords, punctuation, digits, etc.). This means words like 'and' or 'or' will not be included. 
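
To make the cleanup concrete, here is a rough standard-library sketch of the tokenization step only: lowercase the text, keep letter-only tokens with the same `token_pattern` used in the next cell, and drop stopwords. The `toy_stopwords` set and the `toy_tokenize` helper are illustrative assumptions, not part of scikit-learn or of this notebook, and the real vectorizer additionally computes the TF-IDF weights.

```python
import re

# Hypothetical stand-in for the NLTK English stopword list used in the notebook.
toy_stopwords = {"and", "or", "to", "with"}

def toy_tokenize(text):
    """Approximate the vectorizer's preprocessing: lowercase, keep letter-only tokens, drop stopwords."""
    tokens = re.findall(r'\b[a-zA-Z]{1,}\b', text.lower())
    return [token for token in tokens if token not in toy_stopwords]

print(toy_tokenize("60 minutes or less, pasta rice and grains"))
# ['minutes', 'less', 'pasta', 'rice', 'grains']
```

The digits in `60` and the stopwords `or` and `and` are gone, which is the behavior we expect to see reflected in `feature_names`.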

In [42]:
def get_tfidf_vectorizer_and_matrix(df_all_tokens):
    """
    Get tfidf vectorizer and tfidf matrix given the tokens. 

    Args: 
        df_all_tokens (Series): contains all the tokens for all recipes

    Returns:
        list ([vectorizer, matrix]): return a list of tfidf vectorizer and tfidf matrix
    """
    custom_stopwords = stopwords.words('english') 

    vectorizer = TfidfVectorizer(stop_words=custom_stopwords, token_pattern=r'\b[a-zA-Z]{1,}\b') # the default pattern keeps alphanumeric tokens of 2+ characters; this one keeps letter-only tokens, so all digits are dropped 

    matrix = vectorizer.fit_transform(df_all_tokens) # shape: number of recipes x number of unique terms in all of all_tokens

    return [vectorizer, matrix]

In [43]:
# create model 
tfidf_vectorizer, tfidf_matrix = get_tfidf_vectorizer_and_matrix(all_recipes['all_tokens'])

In [44]:
feature_names = tfidf_vectorizer.get_feature_names_out()

In [45]:
feature_names

array(['aaa', 'aab', 'aacute', ..., 'zwtiii', 'zydeco', 'zzzingers'],
      dtype=object)

### Get the Clean Tokens

Let's call the tokens that remain after the TF-IDF cleanup `clean_tokens`. We need them later, when we count how many of the search terms also appear in a recipe. 

For example, if we counted matches on the unclean tokens, stopwords like 'and' or 'or' would be given too much weight. 
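
A small illustration of that effect, using made-up token sets and the same set intersection that the matching step relies on later:

```python
# Toy token sets; the values are made up for illustration.
query_raw = {"chicken", "and", "rice"}
query_clean = {"chicken", "rice"}

recipe_raw = {"pies", "and", "tarts", "apple"}   # unrelated recipe that still contains 'and'
recipe_clean = {"pies", "tarts", "apple"}

raw_matches = len(query_raw & recipe_raw)        # 1, only because of 'and'
clean_matches = len(query_clean & recipe_clean)  # 0, as expected for an unrelated recipe
print(raw_matches, clean_matches)
```

With the unclean sets, an apple pie recipe would score a match for a chicken and rice query purely through the stopword.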

In [46]:
def get_clean_tokens(index, feature_names, matrix):
    """
    Get a set of the clean tokens for a certain recipe after tfidf.
    TFIDF transform removes stopwords and punctuation, so we want to get only the tokens that remain. 

    Args: 
        index (int): row position of the recipe in the tfidf matrix 
        feature_names (ndarray): array of the feature names resulting from the tfidf
        matrix: tfidf matrix (sparse)

    Returns:
        clean_tokens (set): set of all the clean tokens after tfidf
    """
    clean_indices = matrix[index].tocoo().col # column indices where the row is nonzero, aka the word (column name) exists in that row's text 
    clean_tokens = feature_names[clean_indices] # array format 

    return set(clean_tokens)

def get_clean_tokens_string(recipe_tokens):
    """
    Gets all the clean tokens after tfidf as a string, separated by a space ' '

    Args: 
        recipe_tokens (set): set of tokens for a recipe 

    Returns:
        string: string of the clean tokens separated by a space ' '
    """
    # recipe_tokens is a set of strings; join them into one space-separated string 
    return ' '.join(recipe_tokens)

In [47]:
def get_clean_vectorizer_tokens(row, feature_names, matrix):
    """
    Get the clean tokens for a recipe after tfidf transformation. 

    Args: 
        row (Series): row of a df; row.name (the row's index label) must match its position in the tfidf matrix 
        feature_names (ndarray): array of the resulting feature names after tfidf 
        matrix: tfidf matrix 

    Returns:
        clean_tokens (set): set of clean tokens resulting from tfidf 
    """
    index = row.name
    clean_tokens = get_clean_tokens(index, feature_names, matrix)

    return clean_tokens

In [48]:
# get the clean tokens for each recipe 
all_recipes['clean_tokens'] = all_recipes.apply(lambda x: get_clean_vectorizer_tokens(x, feature_names, tfidf_matrix), axis=1) # axis=1 will make row.name=index 
all_recipes['clean_tokens_str'] = all_recipes['clean_tokens'].apply(lambda x: get_clean_tokens_string(x))
all_recipes.head(1)

Unnamed: 0,id,name,name_set,description,steps,ingredients,tags,search_terms,all_tokens,all_main_terms,positive_terms,negative_terms,clean_tokens,clean_tokens_str
0,514890,Jambalaya,{jambalaya},My favorite jambalaya recipe.,"[In a large heavy dutch oven over medium heat, heat oil and brown the chicken breast and thighs for 5 minutes., Add onion, bell pepper, celery,garlic, and parsley and cook for 5 mins longer., add the sausage, cajun spice, thyme, cayenne, bay leaf, and season to taste with salt and pepper. Cook for 1 minute., Stir in the rice, chicken stock, and tomato sauce and bring to a boil., reduce heat to medium low, cover and cook for 30 mins to 35 minutes Gently nestle the shrimp into the rice 5 mins before the jambalaya is finished., When ready to serve, fluff the rice with a fork. Garnish each serving with chopped parsley.]","['vegetable oil', 'boneless chicken breasts', 'boneless chicken thighs', 'onion', 'green pepper', 'celery rib', 'garlic clove', 'parsley', 'kielbasa', 'cajun spices', 'dried thyme', 'cayenne pepper', 'bay leaf', 'converted rice', 'chicken stock', 'tomato sauce', 'raw shrimp', 'parsley']","['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'preparation', 'main-dish', 'seafood', 'meat', 'pasta-rice-and-grains', 'from-scratch']",{'dinner'},"'Jambalaya' ['vegetable oil', 'boneless chicken breasts', 'boneless chicken thighs', 'onion', 'green pepper', 'celery rib', 'garlic clove', 'parsley', 'kielbasa', 'cajun spices', 'dried thyme', 'cayenne pepper', 'bay leaf', 'converted rice', 'chicken stock', 'tomato sauce', 'raw shrimp', 'parsley'] ['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'preparation', 'main-dish', 'seafood', 'meat', 'pasta-rice-and-grains', 'from-scratch'] {'dinner'}","[60 minutes or less, boneless chicken thighs, vegetable oil, onion, meat, cajun spices, boneless chicken breasts, celery rib, pasta rice and grains, dinner, main ingredient, from scratch, tomato sauce, green pepper, preparation, kielbasa, garlic clove, chicken stock, raw shrimp, seafood, cayenne pepper, time to make, course, parsley, main dish, jambalaya, dried thyme, bay leaf, 
converted rice]","{dish, bay, garlic, rib, from, onion, chicken, meat, to, green, pasta, converted, less, make, leaf, cayenne, minutes, time, 60, dinner, cajun, preparation, breasts, tomato, kielbasa, thighs, and, ingredient, sauce, seafood, course, grains, parsley, shrimp, vegetable, jambalaya, oil, pepper, clove, stock, dried, thyme, boneless, main, scratch, raw, spices, or, rice, celery}",{},"{dish, bay, garlic, rib, onion, chicken, meat, green, pasta, converted, less, make, leaf, cayenne, minutes, time, dinner, cajun, preparation, breasts, kielbasa, tomato, thighs, ingredient, sauce, seafood, course, grains, parsley, shrimp, jambalaya, vegetable, oil, pepper, clove, thyme, dried, stock, boneless, main, scratch, raw, spices, rice, celery}",dish bay garlic rib onion chicken meat green pasta converted less make leaf cayenne minutes time dinner cajun preparation breasts kielbasa tomato thighs ingredient sauce seafood course grains parsley shrimp jambalaya vegetable oil pepper clove thyme dried stock boneless main scratch raw spices rice celery


Convert the ingredients into a list so they display cleanly, then keep only the columns needed by the recommender system or the final output.

In [49]:
def get_ingredients_list(df):
    df['ingredients'] = df['ingredients'].apply(lambda x: extract_words_in_quotes(x, replace_dash=False, lowercase=False, remove_apostrophes=False, make_unique=False))
    
    return df

In [50]:
all_recipes = get_ingredients_list(all_recipes)
final_cols = ['name', 'name_set', 'clean_tokens', 'positive_terms', 'description', 'ingredients', 'steps']
all_recipes = all_recipes[final_cols]

In [51]:
all_recipes.head(1)

Unnamed: 0,name,name_set,clean_tokens,positive_terms,description,ingredients,steps
0,Jambalaya,{jambalaya},"{dish, bay, garlic, rib, onion, chicken, meat, green, pasta, converted, less, make, leaf, cayenne, minutes, time, dinner, cajun, preparation, breasts, kielbasa, tomato, thighs, ingredient, sauce, seafood, course, grains, parsley, shrimp, jambalaya, vegetable, oil, pepper, clove, thyme, dried, stock, boneless, main, scratch, raw, spices, rice, celery}","{dish, bay, garlic, rib, from, onion, chicken, meat, to, green, pasta, converted, less, make, leaf, cayenne, minutes, time, 60, dinner, cajun, preparation, breasts, tomato, kielbasa, thighs, and, ingredient, sauce, seafood, course, grains, parsley, shrimp, vegetable, jambalaya, oil, pepper, clove, stock, dried, thyme, boneless, main, scratch, raw, spices, or, rice, celery}",My favorite jambalaya recipe.,"[vegetable oil, boneless chicken breasts, boneless chicken thighs, onion, green pepper, celery rib, garlic clove, parsley, kielbasa, cajun spices, dried thyme, cayenne pepper, bay leaf, converted rice, chicken stock, tomato sauce, raw shrimp, parsley]","[In a large heavy dutch oven over medium heat, heat oil and brown the chicken breast and thighs for 5 minutes., Add onion, bell pepper, celery,garlic, and parsley and cook for 5 mins longer., add the sausage, cajun spice, thyme, cayenne, bay leaf, and season to taste with salt and pepper. Cook for 1 minute., Stir in the rice, chicken stock, and tomato sauce and bring to a boil., reduce heat to medium low, cover and cook for 30 mins to 35 minutes Gently nestle the shrimp into the rice 5 mins before the jambalaya is finished., When ready to serve, fluff the rice with a fork. Garnish each serving with chopped parsley.]"


### Search Algorithm

We have now set up the data the algorithm needs. Given an input search query, it works as follows: 

For each recipe, multiply the scores below, then rank the recipes by final score from highest to lowest. 
- cosine similarity based on TF-IDF
- number of matching terms in all tokens in the recipe + 1 
- number of matching terms in the name of the recipe + 1
- negative multiplier (-1 if recipe includes something classified negatively in the search input, 1 otherwise)

Note: we add 1 to each count for smoothing, so that a count of 0 does not turn the entire score into 0.
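
As a worked example of how the four factors combine, here is a minimal sketch with made-up numbers. The `toy_score` helper is hypothetical and only restates the formula above for a single recipe.

```python
def toy_score(cosine_sim, token_matches, name_matches, has_negative_term):
    """Combine the four factors described above into one score (illustrative helper)."""
    negative_multiplier = -1 if has_negative_term else 1
    return cosine_sim * (token_matches + 1) * (name_matches + 1) * negative_multiplier

# Same similarity and token matches, but the second recipe also matches one query term in its name.
print(toy_score(0.25, 2, 0, False))  # 0.75
print(toy_score(0.25, 2, 1, False))  # 1.5
# A recipe containing an unwanted term gets a negative score and sinks to the bottom.
print(toy_score(0.25, 2, 1, True))   # -1.5
```

Because of the + 1 smoothing, a recipe with no name match keeps its score instead of dropping to 0.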

Here is the function that counts the common terms between a recipe and the input search query. 

In [52]:
def get_input_token_count(row, input):
    """
    Get the number of common terms between row (recipe) and the input query

    Args: 
        row (set): set of tokens in the recipe 
        input (set): set of tokens in the input query 

    Returns:
        int: number of common terms 
    """
    return len(row & input)

Here we classify the positive and negative terms from the search input query.

In [53]:
def get_pos_neg_terms_with_string(input):
    """
    Get all the positive and negative terms given a string input. 

    Args: 
        input (string): search query 

    Returns: 
        input_df (DataFrame): contains positive and negative terms for the input 
    """
    input_df = pd.DataFrame({'all_main_terms': [[input]]})
    input_df = get_pos_neg_terms(input_df)
    return input_df

This checks each recipe's positive terms for any of the query's negative terms, to make sure that none of the recommended recipes contain ingredients that the search query marked as unwanted. 

In [54]:
def get_neg_multiplier(recipe_pos_terms, input_neg_set):
    """
    If there is a negative term in the input that is a positive term in the recipe, 
    the negative multiplier is -1, otherwise it is 1

    Args: 
        recipe_pos_terms (set): set of all the positive terms in a recipe 
        input_neg_set (set): set of all the negative terms in a search input query 

    Returns: 
        int: -1 if there is an intersection between input negatives and recipe positives, 1 otherwise 
    """
    return -1 if input_neg_set & recipe_pos_terms else 1

def get_negative_scores(df, input_df): 
    """
    Get the negative multiplier for each recipe in the df. 

    Args: 
        df (DataFrame): contains all the recipes and their positive terms 
        input_df (DataFrame): contains negative terms for the input search query 

    Returns: 
        ndarray: negative multiplier (1 or -1) for each recipe 
    """
    input_neg_set = input_df.iloc[0]['negative_terms']

    negative_multipliers = df['positive_terms'].apply(lambda x: get_neg_multiplier(x, input_neg_set)).to_numpy() 

    return negative_multipliers
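
To see the multiplier in action without the full dataset, here is a small sketch on made-up sets. `toy_neg_multiplier` repeats the rule from `get_neg_multiplier` so the snippet runs on its own, and the negative set `{'sugar'}` is an assumed result for a query such as 'pie no sugar'.

```python
def toy_neg_multiplier(recipe_pos_terms, input_neg_set):
    # Same rule as get_neg_multiplier above, repeated so the snippet is self-contained.
    return -1 if input_neg_set & recipe_pos_terms else 1

# Made-up data: the query is assumed to yield {'sugar'} as its negative terms.
input_neg_set = {"sugar"}
recipes_pos_terms = [
    {"apple", "pie", "sugar", "cinnamon"},  # contains the unwanted term
    {"apple", "pie", "honey"},              # does not
]
multipliers = [toy_neg_multiplier(terms, input_neg_set) for terms in recipes_pos_terms]
print(multipliers)  # [-1, 1]

# With no negative terms in the query, every recipe keeps a multiplier of 1.
no_neg_multipliers = [toy_neg_multiplier(terms, set()) for terms in recipes_pos_terms]
print(no_neg_multipliers)  # [1, 1]
```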

This is the final algorithm. 

In [58]:
def get_recipe_recommendations(df, input, tfidf_vectorizer, tfidf_matrix, top_n=10):
    """
    Get the top_n recipes to recommend based on the search input query. 

    Algorithm: multiply the below to get scores for each recipe, then rank from highest to lowest 
        * cosine similarity based on tfidf
        * number of matching terms in all tokens in the recipe + 1 (+ 1 for smoothing)
        * number of matching terms in the name of the recipe + 1 (+ 1 for smoothing)
        * negative multiplier (-1 if recipe includes something classified negatively in the search input, 1 otherwise)

    Args: 
        df (DataFrame): contains all the recipes and their information 
        input (string): search query 
        tfidf_vectorizer (TfidfVectorizer): vectorizer fitted on all the recipes 
        tfidf_matrix: tfidf matrix of all the recipes 
        top_n (int): maximum number of recipes to return 

    Returns: 
        recommended_recipes (DataFrame): up to top_n most recommended recipes based on the search input 
    """
    ### clean 
    # clean the input with simple cleaner 
    input = simple_clean(input)
    
    ### cosine similarity 
    # TF-IDF transform on the search query 
    # automatically does the extra token cleaning defined in tfidf_vectorizer
    input_tfidf_vector = tfidf_vectorizer.transform([input]) 

    # get cosine similarity with every recipe 
    cosine_similarity_scores = cosine_similarity(input_tfidf_vector, tfidf_matrix)[0]

    ### negatives 
    # take negatives into account 
    # find the negative word in the search query if there is one 
    # for example 'no sugar' -> 'sugar' is the negative word, 'sugar free' -> 'sugar' is the negative word 
    input_df = get_pos_neg_terms_with_string(input)

    print('Try exploring these delicious recipes.')

    # check for negatives - if a negative term is in the recipe rec, multiply -1 to the score 
    # get a vector of 1s and -1s to multiply with the scores_vector, then multiply it 
    negative_scores_vector = get_negative_scores(df, input_df)

    ### term matching between input and recipes 
    # get how many terms in the input match with the recipes 
    feature_names = tfidf_vectorizer.get_feature_names_out()
    clean_input = get_clean_tokens(0, feature_names, input_tfidf_vector)
    input_term_counts = df['clean_tokens'].apply(lambda x: get_input_token_count(x, clean_input)).to_numpy()
    input_term_counts += 1 # make it a multiplier to the cosine similarity 

    # get how many terms in the input match with the recipe name specifically 
    # want to put heavier weight on these 
    clean_input_pos_set = input_df['positive_terms'][0]
    input_term_counts_in_name = df['name_set'].apply(lambda x: get_input_token_count(x, clean_input_pos_set)).to_numpy()
    input_term_counts_in_name += 1 # make it a multiplier to the cosine similarity 

    ### score calculation 
    scores_vector = cosine_similarity_scores * input_term_counts * input_term_counts_in_name * negative_scores_vector

    ### ranking: get the point where the scores plateau. then get the top min(top_n, number of recipes before plateau) recipes to return 
    ranked_all = np.argsort(scores_vector)[::-1] # recipe positions sorted by score in desc order 
    scores_vector_desc = scores_vector[ranked_all]
    gradient = np.gradient(scores_vector_desc) # get the gradient of the decrease 

    threshold = 0.0001 # manually defined 
    plateau_candidates = np.where(np.abs(gradient) < threshold)[0] # indices where the scores are flat 
    # if the scores never plateau, fall back to top_n instead of raising an IndexError 
    plateau_index = plateau_candidates[0] if plateau_candidates.size else top_n
    ranked_indices = ranked_all[:min(plateau_index, top_n)] # get the final recipes 
    recommended_recipes = df.iloc[ranked_indices]

    return recommended_recipes
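
The plateau cut-off in the ranking step can be sketched on a handful of made-up scores. This is a standard-library illustration, not the notebook's code: `toy_gradient` mirrors what `np.gradient` computes by default for unit spacing (central differences inside, one-sided differences at the ends), and the fallback used when no plateau exists is an assumption of this sketch.

```python
def toy_gradient(values):
    """Unit-spacing gradient: central differences inside, one-sided at the ends. Needs at least 2 values."""
    n = len(values)
    grad = []
    for i in range(n):
        if i == 0:
            grad.append(values[1] - values[0])
        elif i == n - 1:
            grad.append(values[-1] - values[-2])
        else:
            grad.append((values[i + 1] - values[i - 1]) / 2)
    return grad

def toy_cutoff(scores_desc, top_n, threshold=0.0001):
    """Return how many results to keep: min(first plateau index, top_n)."""
    gradient = toy_gradient(scores_desc)
    plateau = [i for i, g in enumerate(gradient) if abs(g) < threshold]
    # Assumption: if the scores never flatten out, do not cut before top_n.
    plateau_index = plateau[0] if plateau else len(scores_desc)
    return min(plateau_index, top_n)

scores_desc = [0.9, 0.5, 0.2, 0.2, 0.2, 0.0]  # already sorted from highest to lowest
print(toy_cutoff(scores_desc, top_n=10))  # 3
print(toy_cutoff(scores_desc, top_n=2))   # 2
```

The scores stop changing at position 3, so only the three recipes before the plateau are kept even though `top_n` allows ten. Note that if the very top scores are already flat, the cut-off is 0 and no recipes are returned.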

### Search Engine 

Now, let's test out the search engine with a simple Python widget. Searching returns up to 10 of the best-matching food.com recipes. 

In [59]:
def click_search_button(input):
    if not input:
        display('Search for a recipe')
    else: 
        print('Good choice!')

        top_n = 10
        recommendations = get_recipe_recommendations(all_recipes, input, tfidf_vectorizer, tfidf_matrix, top_n=top_n) 

        for _, row in recommendations.iterrows():
            display(widgets.HTML(f"""
                <h3>{row['name']}</h3> 
                <p>{row['description']}</p>
                <p><b>Ingredients:</b> {', '.join(row['ingredients'])}</p>
                <p><b>Steps</b></p>
                <ol>
                    {''.join([f"<li>{step}</li>" for step in row['steps']])}
                </ol>
                <hr>
            """))

user_input = widgets.Text(
    value='',
    placeholder='Type here',
    description='I want to make:',
    style={'description_width': 'initial'}
)

buffer_space = widgets.HTML(value="<div style='height: 150px;'></div>") 

title = widgets.HTML(value="<h2 style='font-weight: bold; text-align: left; font-size: 36px;'>Food.com</h2>") 

display(buffer_space, title)
# Link the button to the function
search_button = widgets.interact_manual(click_search_button, input=user_input)
search_button.widget.children[1].description = 'Search'
search_button.widget.children[1].style.button_color = '#ADD8E6'
