# Ingredient Measurement Insertion

[The main reference I used to refresh how bag of words works](https://www.analyticsvidhya.com/blog/2021/08/a-friendly-guide-to-nlp-bag-of-words-with-python-example/)

The general strategy is to find the "root" word in each ingredient phrase, taking a hand-waving naive Bayes approach. Out of the words in each ingredient phrase, which appears most often in the instructions? Assume this is the word with the highest MAP (maximum a posteriori) estimate, and select it as the root ingredient. From there, assume that the measurement is made up of all text in the ingredient phrase before the root word. Example: "1 cup of flour." Flour might appear twice in the instructions and would be chosen as the root word. The measurement would then be "1 cup of". I make tuples of (measurement, ingredient) like ("1 cup of", "flour"). Then I find wherever the root ingredient appears in the instructions and place the measurement before it.
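As a rough illustration of the strategy above, here is a minimal sketch on the flour example. It is a simplification, not the notebook's implementation: it uses plain word counts from `collections.Counter` instead of a vectorizer, does no stop-word removal, and the helper name `split_measurement` and the sample instructions are made up for this example.

```python
from collections import Counter
import re


def split_measurement(ingredient_phrase, instructions):
    """Return (measurement, root) for one ingredient phrase, or None if no word matches."""
    # Count every lowercase alphabetic word in the instructions.
    counts = Counter(re.findall(r"[a-z]+", instructions.lower()))
    words = ingredient_phrase.split()
    # Keep only the phrase words that actually occur in the instructions.
    candidates = [w for w in words if counts[w.lower().strip(".,")] > 0]
    if not candidates:
        return None
    # max() keeps the first word with the highest count, so ties go to the earliest word.
    root = max(candidates, key=lambda w: counts[w.lower().strip(".,")])
    # Everything before the root word is treated as the measurement.
    measurement = ingredient_phrase.split(root)[0]
    return measurement, root.strip(".,")


# Hypothetical instructions in which "flour" appears twice.
instructions = "Sift the flour into a bowl. Stir the flour and sugar together."
print(split_measurement("1 cup of flour.", instructions))
# ('1 cup of ', 'flour')
```

The measurement keeps its trailing space, which makes it easy to place directly in front of the root word later.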


In [296]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


In [297]:
data = pd.read_csv('scraped_recipe_data.csv')
data = data.drop(columns='Unnamed: 0')  # drop the unnamed index column that pandas adds when reading the CSV

In [348]:
ingre = data.ingredients[0]
instruc = data.instructions[0]
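Because the data comes from a CSV, each `ingredients` cell is the string form of a Python list rather than a list. One way to turn it back into a list is `ast.literal_eval`. The sketch below assumes every cell is a valid Python list literal; the `raw` string is a made-up example in that format, with a comma inside one item to show how this differs from slicing off the brackets and splitting on commas.

```python
import ast

# Hypothetical ingredient string; note the comma inside the second item.
raw = "['Chana Masala Ingredients:', '1 cup water (240 mL), adjusted as needed', '3 cloves garlic']"

# literal_eval parses the quoted items, so commas inside an item are preserved.
ingredient_list = ast.literal_eval(raw)
print(len(ingredient_list))  # 3

# Splitting on every comma breaks the second item into two pieces.
naive = raw[2:-2].replace(" '", '').replace("'", '').split(',')
print(len(naive))  # 4
```

`ast.literal_eval` only evaluates literals, so it does not execute arbitrary code from the file, but it raises an error on cells that are not valid literals.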

In [349]:
instruc

'First, make Chana Masala as follows:Heat avocado oil in a pan. To the pan, add bay leaves, dry chili pepper, cinnamon sticks and whole cumin seeds.Once cumin seeds start to sizzle, add diced onion and cook until slightly brown and translucent.Lower the heat and add the crushed or fresh tomatoes, ginger and garlic and mixwell.Add salt, turmeric, paprika, red chili powder, chana masala mix and water. Mix well.Let simmer for about 5 minutes with lid on.Add cooked chickpeas, mix well and let cook for 7-10 minutes.Then, add the taco seasoning to the chana masala you just made. Mash with masher and add water as needed for a loose but not runny consistency. Heat in a saucepan. Stir occasionally to prevent sticking.Serve mixture on taco shells with lettuce, tomatoes, avocado, optional cheese and salsa.'

In [350]:
ingre

"['Chana Masala Ingredients:', '4 cups chickpeas (800 g) canned or pre-cooked', '1 medium onion finely diced', '2 roma tomatoes diced or crushed tomatoes', '2 tablespoons ginger grated', '3 cloves garlic', '1 cup water (240 mL) (adjust as needed)', '2 tablespoons avocado oil', '2 whole bay leaves', '2 whole dry chili peppers', 'cinnamon stick piece', '½ teaspoon cumin seeds', '1 teaspoon salt', '½ teaspoon turmeric', '2 teaspoons paprika', 'red chili powder (optional to add spice)', '2 teaspoons chana masala powder', 'Taco Ingredients:', '2 tablespoons taco seasoning', '1 cup water (240 mL) (adjust as needed for consistency)']"

In [302]:
def edit_recipe(ingre, instruc):
    
    
    # Clean up the instructions so that punctuation and double spaces don't mess things up
    temp_instruc = instruc.replace(',', '').replace(':', ' ').replace('.', ' ').replace('  ', ' ').lower()
    
    # Create a TF-IDF vectorizer that removes English stop words and allows bigrams
    # (with a single document, the IDF term is constant, so this is normalized term frequency)
    tf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    # For this instruction, find all of the words and bigrams that aren't stop words
    tf.fit([temp_instruc])
    # Transform this instruction to TF
    temp_tf = tf.transform([temp_instruc])
    # Compile in a one-row DataFrame with one column per term
    # (get_feature_names_out() requires scikit-learn >= 1.0)
    tf_table = pd.DataFrame(temp_tf.toarray(), columns=tf.get_feature_names_out())

    # The ingredients are read in as a string from the .csv, so restructure them as a list again
    new_ingre = ingre[2:-2].replace(" '",'').replace("'",'').split(',')
    # Bag of words on the ingredients too
    bag_of_words_ingre = CountVectorizer(stop_words='english', ngram_range=(1, 2))

    # Loop through each line of the ingredients and find the actual ingredient
    # rather than its measurement by looking up TF values from the instructions.
    # Basically, the instructions tell us which words in the ingredient line are
    # important food items. The rest are measurements.
    root_ingredients = []
    for i in range(len(new_ingre)):
        bag_of_words_ingre.fit([new_ingre[i]])
        names = bag_of_words_ingre.get_feature_names_out()
        tf_vals = []
        for word in new_ingre[i].split():
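            try:
                tf_vals.append((tf_table[word], word))
            except KeyError:
                # word is not a column of tf_table (stop word, or absent from the instructions)
                pass
        # tf_vals[:][0][0] is the one-row column of the first matched word, so
        # np.argmax returns 0 and the first matched word becomes the root
        try:
            root_ingredient = tf_vals[np.argmax(tf_vals[:][0][0])][1]
        except IndexError:
            # no word of this ingredient line was found in the instructions
            root_ingredient = None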
        root_ingredients.append(root_ingredient)
        
    # Clean this list by removing Nones and any root that appears more than once
    # (both copies are dropped: two measurements for one ingredient is too hard to figure out!)
    cleaned_ingre = [i for i in root_ingredients if root_ingredients.count(i) < 2 and i is not None]

    # This creates tuples of (measurement, ingredient)
    temp = []
    for i in new_ingre:
        for j in cleaned_ingre:
            if ' ' + j in i:  # leading space because "red" is a substring of "ingredient"
                temp.append((i.split(str(j))[0], j))

    # Finally, update the instructions with the added measurements!
    for tup in temp:
        if ' ' + tup[1] in instruc:
            instruc = instruc.replace(' ' + tup[1], ' ' + tup[0] + tup[1])

    return instruc
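The leading-space check in the function guards against a root such as "red" matching inside "ingredient". A possible alternative, shown here only as a sketch and not as what the notebook does, is a regular expression with word boundaries. The helper name `insert_measurement` and the sample sentence are made up for illustration.

```python
import re


def insert_measurement(instructions, measurement, root):
    """Prefix each whole-word occurrence of `root` with `measurement`."""
    # \b anchors the match at word boundaries, so "red" does not match inside "ingredients".
    pattern = re.compile(r"\b" + re.escape(root) + r"\b")
    # A function replacement keeps backslashes in the measurement text from being interpreted.
    return pattern.sub(lambda match: measurement + match.group(0), instructions)


text = "Add the red pepper to the other ingredients."
print(insert_measurement(text, "1 large ", "red"))
# Add the 1 large red pepper to the other ingredients.
```

Unlike the leading-space check, this also matches a root at the very start of the text or right after punctuation.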

In [303]:
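updated_instructions = []
for i in range(len(data)):
    ingre = data.ingredients[i]
    instruc = data.instructions[i]
    try:
        new_instruc = edit_recipe(ingre, instruc)
        # Number each period-separated piece of the updated instructions as a step
        instructions = []
        for idx, step in enumerate(new_instruc.split('.')):
            instructions.append(str(idx + 1) + ': ' + step)
        # Drop the last entry when it is only a short leftover after the final period
        if len(instructions[-1]) > 5:
            updated_instructions.append(instructions)
        else:
            updated_instructions.append(instructions[:-1])
    except Exception:
        # Recipes that edit_recipe cannot process are recorded as None
        updated_instructions.append(None)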


In [351]:
updated_instructions[0:2]

[['1: First, make Chana Masala as follows:Heat 2 tablespoons avocado oil in a pan',
  '2:  To the pan, add 2 whole bay leaves, 2 whole dry chili pepper, cinnamon sticks and whole ½ teaspoon cumin seeds',
  '3: Once ½ teaspoon cumin seeds start to sizzle, add diced 1 medium onion and cook until slightly brown and translucent',
  '4: Lower the heat and add the crushed or fresh 2 roma tomatoes, 2 tablespoons ginger and 3 cloves garlic and mixwell',
  '5: Add 1 teaspoon salt, ½ teaspoon turmeric, 2 teaspoons paprika, red chili powder, 2 teaspoons chana masala mix and water',
  '6:  Mix well',
  '7: Let simmer for about 5 minutes with lid on',
  '8: Add cooked 4 cups chickpeas, mix well and let cook for 7-10 minutes',
  '9: Then, add the 2 tablespoons taco seasoning to the 2 teaspoons chana masala you just made',
  '10:  Mash with masher and add water as needed for a loose but not runny consistency',
  '11:  Heat in a saucepan',
  '12:  Stir occasionally to prevent sticking',
  '13: Serve m

In [328]:
# There are 9 recipes that fail. I will move on for the sake of time, but this is something to look into in the future.
# Collect the indices of the recipes that were processed successfully.
temp = [idx for idx, steps in enumerate(updated_instructions) if steps is not None]

In [336]:
temp

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 45,
 46,
 47,
 48,
 49,
 50,
 51,
 52,
 53,
 54,
 55,
 56,
 57,
 58,
 59,
 60,
 61,
 62,
 63,
 64,
 65,
 66,
 67,
 68,
 69,
 70,
 71,
 72,
 73,
 74,
 75,
 76,
 77,
 78,
 79,
 80,
 81,
 82,
 83,
 84,
 85,
 86,
 87,
 88,
 89,
 90,
 91,
 92,
 93,
 94,
 95,
 96,
 97,
 98,
 99,
 100,
 101,
 102,
 103,
 104,
 105,
 106,
 107,
 108,
 109,
 110,
 111,
 112,
 113,
 114,
 115,
 116,
 117,
 118,
 119,
 120,
 121,
 122,
 123,
 124,
 125,
 126,
 127,
 128,
 129,
 130,
 131,
 132,
 133,
 134,
 135,
 136,
 137,
 138,
 139,
 140,
 141,
 142,
 143,
 144,
 145,
 146,
 147,
 148,
 149,
 150,
 151,
 152,
 153,
 154,
 155,
 156,
 157,
 158,
 159,
 160,
 161,
 162,
 163,
 164,
 165,
 166,
 167,
 168,
 169,
 170,
 171,
 172,
 173,
 174,
 175,
 176,
 177,
 178,
 179,
 180,
 181,
 182,
 183,
 184,
 185,

In [318]:
updated_data = pd.DataFrame({'recipe_link':data.recipe_link, 'ingredients':data.ingredients, 'instructions':updated_instructions})

In [345]:
updated_data = updated_data.dropna()

In [346]:
updated_data.to_csv('updated_recipe_data.csv')