

---


# Set-up


---



In [1]:
from bs4 import BeautifulSoup #scraping
import requests #for HTTP requests
import re #for regular expression
import time 

import pandas as pd
import numpy as np
import spacy


import random




---


# Web Scraping Recipes 

---



The recipes are scraped from [allrecipes.com](https://www.allrecipes.com). The website offers a large number of recipes from various cuisines and appears to be well maintained and moderated, so I chose to work with it. 

However, on exploring the website further, I noticed that it serves two different page templates for recipes. This could be a remnant of an older template that has not yet been migrated to the newer one. I therefore built a function that handles both versions of the recipe page when scraping the ingredient list. 

### 1. Defining function to scrape the website "allrecipes.com"

---



In [0]:
def getRecipeLink(url, totalPage):
    recipes_all = []
    for page in range(1, totalPage):
        r = requests.get(url + str(page)) #uses the url argument; returns a variable which contains the html doc unstructured
        html = BeautifulSoup(r.content, "html.parser")
        k = 0 #to count recipes per page
        for i in html.findAll('article', {"class": "fixed-recipe-card"}):
            link = i.find("a", attrs={'href': re.compile("^https://www.allrecipes.com/recipe")})
            name = i.find('span', {"class":"fixed-recipe-card__title-link"}).text
            recipes_all.append({"name" : name, "url" : link['href']})
            k+=1
        print("Extracted {0} recipes".format(k))
    return recipes_all

In [0]:
def getIngredients(version, ingredientSoup):
    ingrdList = []
    if version == 0:
        attrs = {'class':'ingredients-item'}
    elif version == 1:
        attrs = {'class':"checkList__line"}
    for ul in ingredientSoup:
        for li in ul.findAll('li', attrs):
            ing = li.find("span").text.strip() #basic stripping to eliminate whitespaces
            if ing != "Add all ingredients to list": #sometimes this is read in too
                ingrdList.append(ing)
    return(ingrdList)

def getEachRecipe(recipesList):
    for ix in range(len(recipesList)):
        url = recipesList[ix]['url']
        r = requests.get(url) #returns a variable which contains the html doc unstructured
        currRecipe = BeautifulSoup(r.content, "html.parser")
        #there are two versions of the website, the old and new have different html structure
        #they require different parsing
        ver = 0
        if currRecipe.find_all('ul', {'class':'ingredients-section'}):
            ingredients  = currRecipe.find_all('ul', {'class':'ingredients-section'}) 
        else:
            ingredients = currRecipe.findAll('ul', {'id':re.compile('^lst_ingredients')})
            ver = 1
        currIngredient = getIngredients(ver,ingredients)
        recipesList[ix]['ingredients'] = currIngredient
        print("\r", "Adding {0} ingredients to {1} recipe".format(len(currIngredient), ix + 1), end=" ")

### 2. Scraping more than 100 recipes' name, link, and ingredients


---


In [0]:
base_url = "https://www.allrecipes.com/recipes/?page="

In [35]:
#getting the recipe link
recipes = getRecipeLink(base_url,7)
print("{0} recipes are scraped and stored.".format(len(recipes)))

Extracted 29 recipes
Extracted 20 recipes
Extracted 20 recipes
Extracted 20 recipes
Extracted 20 recipes
Extracted 20 recipes
129 recipes are scraped and stored.


In [430]:
#getting ingredients
getEachRecipe(recipes)

 Adding 12 ingredients to 129 recipe 

In [431]:
recipes[:2]

[{'ingredients': ['6 ounces ground beef',
   '4 drops Worcestershire sauce, or to taste',
   '3 pinches garlic powder',
   'freshly ground black pepper to taste',
   'kosher salt to taste',
   '1 pinch cayenne pepper',
   '1 slice Cheddar cheese',
   '1  hamburger bun, split',
   '1 slice tomato',
   '1 leaf lettuce'],
  'name': "Chef John's Juicy Lucy",
  'url': 'https://www.allrecipes.com/recipe/267875/chef-johns-juicy-lucy/'},
 {'ingredients': ['1\u2009½ pounds lean ground beef',
   '½  onion, finely chopped',
   '½ cup shredded Colby Jack or Cheddar cheese',
   '1 teaspoon soy sauce',
   '1 teaspoon Worcestershire sauce',
   '1  egg',
   '1 (1 ounce) envelope dry onion soup mix',
   '1 clove garlic, minced',
   '1 tablespoon garlic powder',
   '1 teaspoon dried parsley',
   '1 teaspoon dried basil',
   '1 teaspoon dried oregano',
   '½ teaspoon crushed dried rosemary',
   'salt and pepper to taste'],
  'name': 'Best Hamburger Ever',
  'url': 'https://www.allrecipes.com/recipe/72657



---
---

On investigating the 129 scraped recipes, I observed several formatting peculiarities specific to this data. Not all of them relate to ingredient names; they are listed below:

*   Measurements are represented as ½ (called vulgar fractions)
*   Several non-alphanumeric characters such as 
    *   copyright and trademark symbols used to identify ingredients
    *   comma as used in "½  onion, finely chopped"
    *   brackets as used in "1 (1 ounce) envelope dry onion soup mix"
    *   hyphens (-)  as used in "all-purpose flour"
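
The vulgar fractions in this list can be recognised by their Unicode names. The following is a minimal sketch using only the standard library; `ascii_fractions` is a hypothetical helper name, and writing the result with a plain ASCII slash is an assumption made for illustration rather than something the notebook does.

```python
import unicodedata

def ascii_fractions(text):
    """Replace vulgar fraction characters such as ½ with an ASCII form like 1/2."""
    out = []
    for ch in text:
        # name() takes a default so that unnamed characters do not raise ValueError
        if unicodedata.name(ch, "").startswith("VULGAR FRACTION"):
            # NFKC expands ½ to 1, FRACTION SLASH (U+2044), 2
            ch = unicodedata.normalize("NFKC", ch).replace("\u2044", "/")
        out.append(ch)
    return "".join(out)

print(ascii_fractions("½ cup shredded Colby Jack or Cheddar cheese"))
```

Lines without a vulgar fraction pass through unchanged, so the helper can be applied to every ingredient string.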


---
---


In [0]:
#convert to dataframe
base_df = pd.DataFrame(recipes)
ingredient = base_df.apply(lambda x: pd.Series(x['ingredients']), axis=1).stack().reset_index(level=1, drop=True)
ingredient.name = 'ingredient'
recipes_df= base_df.drop('ingredients', axis=1).join(ingredient)
recipes_df['ingredient'] = pd.Series(recipes_df['ingredient'], dtype=object)
#drop expects a boolean; drop=True discards the old index
recipes_df.reset_index(inplace = True, drop = True)
recipes_df = recipes_df[['url', 'name', 'ingredient']]

In [436]:
recipes_df.head()

| | url | name | ingredient |
|---|---|---|---|
| 0 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | 6 ounces ground beef |
| 1 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | 4 drops Worcestershire sauce, or to taste |
| 2 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | 3 pinches garlic powder |
| 3 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | freshly ground black pepper to taste |
| 4 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | kosher salt to taste |


In [0]:
recipes_df.to_csv("rawData.csv")

# Cleaning Data



---


---
The cleaning will take place in two phases:
1. **Primary cleaning:** The objective of the first phase is to ensure that the data is readable and accessible on all platforms by fixing encoding errors and eliminating symbols that do not translate well across platforms. This phase does not remove punctuation, stopwords, etc.

2. **Problem-specific cleaning:** The objective of the second phase is to prepare the data for our calculations; it is centered on the problem-set requirements. 

I believe it is good practice to keep the two separate: if the requirements change in the future, you can always start from the result of the first phase and perform a different analysis altogether.






---



---


### 1. Primary Cleaning 


---



In [0]:
#primary cleaning
import unicodedata
#quick look at how unicodedata describes a vulgar fraction
unicodedata.numeric(u'⅕')
unicodedata.name(u'⅕')

#convert vulgar fractions
for ix, row in recipes_df.iterrows():
    for char in row['ingredient']:
        #pass a default so characters without a Unicode name do not raise ValueError
        if unicodedata.name(char, '').startswith('VULGAR FRACTION'):  
            normalized = unicodedata.normalize('NFKC', char)
            recipes_df.iloc[ix, 2] = recipes_df.iloc[ix, 2].replace(char, normalized)

In [440]:
#sanity check for vulgar fractions removal
recipes_df.iloc[10:15, :]

| | url | name | ingredient |
|---|---|---|---|
| 10 | https://www.allrecipes.com/recipe/72657/best-h... | Best Hamburger Ever | 1 1⁄2 pounds lean ground beef |
| 11 | https://www.allrecipes.com/recipe/72657/best-h... | Best Hamburger Ever | 1⁄2 onion, finely chopped |
| 12 | https://www.allrecipes.com/recipe/72657/best-h... | Best Hamburger Ever | 1⁄2 cup shredded Colby Jack or Cheddar cheese |
| 13 | https://www.allrecipes.com/recipe/72657/best-h... | Best Hamburger Ever | 1 teaspoon soy sauce |
| 14 | https://www.allrecipes.com/recipe/72657/best-h... | Best Hamburger Ever | 1 teaspoon Worcestershire sauce |


### 2. Problem-specific Cleaning

---


The **objective is to extract the ingredient name** from sentences which contain additional information such as measurement, unit of measurement, ingredient state-specific information (chopped, minced, frozen etc). 








#### 2a. Data Exploration
---

The position of the additional information relative to the ingredient name helps in eliminating it. The lines follow a handful of recurring patterns, though not strictly. 

A few patterns and their examples are:


1.  Pattern: quantity measurement ingredient
>     Example: 1 teaspoon soy sauce

2.  Pattern: quantity ingredient
>     Example: 2 eggs

3.  Pattern: quantity ingredient, ingredient-specific information
>     Example: 1⁄2 onion, finely chopped
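
The three patterns above can be sketched as a single regular expression. This is only an illustration of the positional idea, not the method used later in the notebook; the unit list, the `LINE` pattern and the `split_line` helper are all assumptions introduced here, and lines such as "1 (1 ounce) envelope dry onion soup mix" are deliberately not handled.

```python
import re

SLASH = "[/\u2044]"  # ASCII slash or the Unicode fraction slash left by NFKC
UNITS = "cup|teaspoon|tablespoon|ounce|pound|pinch|drop|clove"  # illustrative, not exhaustive

LINE = re.compile(
    # optional quantity: "1 1/2", "1/2" or "2"
    rf"^\s*(?P<quantity>\d+\s+\d+{SLASH}\d+|\d+{SLASH}\d+|\d+)?\s*"
    # optional unit of measurement, singular or plural
    rf"(?:(?P<unit>(?:{UNITS})(?:es|s)?)\s+)?"
    # ingredient name: everything up to the first comma
    r"(?P<ingredient>[^,]+?)\s*"
    # optional ingredient-specific information after the comma
    r"(?:,\s*(?P<note>.*))?$"
)

def split_line(text):
    """Return the named parts of an ingredient line, or None if nothing matches."""
    match = LINE.match(text)
    return match.groupdict() if match else None

for line in ["1 teaspoon soy sauce", "2 eggs", "1\u20442 onion, finely chopped"]:
    print(split_line(line))
```

Because the patterns are not followed strictly, a rule-based splitter like this one needs a growing list of special cases, which is part of the motivation for the learned approach below.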


A few other patterns can be observed here:

In [441]:
recipes_df.ingredient

0                                  6 ounces ground beef
1             4 drops Worcestershire sauce, or to taste
2                               3 pinches garlic powder
3                  freshly ground black pepper to taste
4                                  kosher salt to taste
                             ...                       
1200        2 large green bell peppers, roughly chopped
1201    1 (14.5 ounce) can stewed tomatoes, with liquid
1202                            3 tablespoons soy sauce
1203                             1 teaspoon white sugar
1204                                    1 teaspoon salt
Name: ingredient, Length: 1205, dtype: object

Let's see how often identical ingredient lines repeat in this cleaned data.

In [442]:
recipes_df.ingredient.value_counts()

1⁄2 teaspoon salt                           19
1 teaspoon salt                             17
1 teaspoon vanilla extract                  16
2  eggs                                     14
1 teaspoon baking soda                      13
                                            ..
1 teaspoon chicken bouillon granules         1
4 eggs, beaten                               1
1/4 teaspoon hot pepper sauce (optional)     1
1/8 teaspoon ground black pepper             1
1 pinch ground black pepper to taste         1
Name: ingredient, Length: 806, dtype: int64

#### 2b. Ingredient Extraction Methodology via Named Entity Recognition

---

With this problem-specific intuition and data knowledge, there are a few approaches that we can use. 

**Approach 1:**

Since there is a dependency among the components of the sentence and we know that the ingredient name will be a noun, we can combine this information with a custom regular expression that eliminates measurement units to extract the ingredient name.

This has been explored in ***the appendix section***. 

**Approach 2:**

***Named Entity Recognition (NER)*** can be used to extract the ingredient name from the unstructured text. [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) is a subtask of information extraction that seeks to locate named entities mentioned in unstructured text and classify them into pre-defined categories such as people, organizations, etc. 

**To create a custom named entity recognition model**, we need training data annotated in the format specified by the spaCy documentation. To do this, I collected a subset of the [NY Times public dataset](https://raw.githubusercontent.com/nytimes/ingredient-phrase-tagger/master/nyt-ingredients-snapshot-2015.csv) of ingredients, so that the training data is representative of the data we have scraped. 




##### *Extract NYTimes Dataset*


In [443]:
#extract the dataset
NY_train_data = pd.read_csv("https://raw.githubusercontent.com/nytimes/ingredient-phrase-tagger/master/nyt-ingredients-snapshot-2015.csv")
print("There are {0} rows in the NY Times dataset".format(NY_train_data.shape[0]))

There are 179207 rows in the NY Times dataset


In [298]:
NY_train_data.head(6)

| | index | input | name | qty | range_end | unit | comment |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 1/4 cups cooked and pureed fresh butternut s... | butternut squash | 1.25 | 0.0 | cup | cooked and pureed fresh, or 1 10-ounce package... |
| 1 | 1 | 1 cup peeled and cooked fresh chestnuts (about... | chestnuts | 1.0 | 0.0 | cup | peeled and cooked fresh (about 20), or 1 cup c... |
| 2 | 2 | 1 medium-size onion, peeled and chopped | onion | 1.0 | 0.0 | | medium-size, peeled and chopped |
| 3 | 3 | 2 stalks celery, chopped coarse | celery | 2.0 | 0.0 | stalk | chopped coarse |
| 4 | 4 | 1 1/2 tablespoons vegetable oil | vegetable oil | 1.5 | 0.0 | tablespoon | |
| 5 | 5 | | water | 0.5 | 0.0 | cup | |


We observe from the snippet of the NY Times dataset that we need only the input and name columns to create our training dataset, so let's proceed with these two. The data for training the custom NER has to be of the form



>     TRAIN_DATA = [('3 tablespoons chopped fresh sage',
                    {'entities': [(28, 32, 'INGREDIENT')]}),
                    ('1/4 cup brown sugar',
                    {'entities': [(8, 19, 'INGREDIENT')]}),
                    ('1 1/2 cups heavy cream',
                    {'entities': [(11, 22, 'INGREDIENT')]}),
                    ('1 1/4 cups whole milk',
                    {'entities': [(11, 21, 'INGREDIENT')]})]

Here, the numbers are the start and end character positions of the ingredient name in the sentence (the end position is exclusive).
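
To make the offsets concrete, here is a minimal sketch that derives them with plain string search. The helper names `entity_span` and `make_example` are hypothetical and exist only for this illustration; the notebook's own conversion functions appear further down.

```python
def entity_span(sentence, ingredient, label="INGREDIENT"):
    """Return (start, end, label) for the first occurrence of ingredient, or None."""
    start = sentence.find(ingredient)
    if start == -1:
        return None  # the ingredient text does not occur in the sentence
    # the end offset is exclusive, so sentence[start:end] == ingredient
    return (start, start + len(ingredient), label)

def make_example(sentence, ingredient):
    """Build one training tuple in the (text, {'entities': [...]}) form shown above."""
    return (sentence, {"entities": [entity_span(sentence, ingredient)]})

print(make_example("3 tablespoons chopped fresh sage", "sage"))
print(make_example("1/4 cup brown sugar", "brown sugar"))
```

Slicing the sentence with the returned offsets gives back the ingredient text, which is a quick way to sanity-check annotated examples.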

However, the data extracted from the NY Times dataset is not in the format that we need to train the NER model, so I wrote a script to convert it into the specified format. Additionally, of the 179,207 rows in the dataset I will use only a small subset of around 600 rows, as spaCy's NER learns quickly from a small number of examples.



##### *Clean & Transform NYTimes Dataset*

For cleaning,

1.   The columns input and name are converted to strings 
2.   Words to be removed from the name column are identified, as the name column sometimes contains descriptive information
3.   Those words are removed from the name column wherever they appear as additional information
4.   Parentheses and the text between them are removed from the name column, as they contain additional information
5.   Rows with null values in the input or name column are dropped. Dropping rows is acceptable here as we use a very small subset of the data and there aren't many null values in the original dataset.  



In [0]:
#convert input and name to string
NY_train_data['input'] = NY_train_data['input'].astype(str)
NY_train_data['name'] = NY_train_data['name'].astype(str)

In [0]:
#clean the name column with these words
remove_words = ['ground','to','taste', 'and', 'or', 'powder','white','red','green','yellow', 'can', 'seed', 'into', 'cut', 'grated',\
                'leaf','package','finely','divided','a','piece','optional','inch','needed','more','drained','for','flake','juice','dry','breast',\
                'extract','yellow','thinly','boneless','skinless','cubed','bell','bunch','cube','slice','pod','beaten','seeded','broth','uncooked',\
                'root','plain','baking','heavy','halved','crumbled','sweet','with','hot','confectioner','room','temperature','trimmed',\
                'all-purpose','sauce','crumb','deveined','bulk','seasoning','jar','food','sundried','italianstyle','if','bag','mix','in',\
                'each','roll','instant','double','such','extra-virgin','frying','thawed','whipping','stock','rinsed','mild','sprig','brown',\
                'freshly','toasted','link','boiling','cooked','basmati','unsalted','container','split','cooking','thin','lengthwise','warm',\
                'softened','thick','quartered','juiced','pitted','chunk','melted','cold','coloring','puree','cored','stewed',\
                'floret','coarsely','the','clarified','blanched','zested','sweetened','powdered','longgrain','garnish','indian','dressing',\
                'soup','at','active','french','lean','chip','sour','condensed','long','smoked','ripe','skinned','fillet','from','stem','flaked',\
                'removed','zest','stalk','unsweetened','baby','cover','crust', 'extra', 'prepared', 'blend', 'of', 'ring','plus','firmly', 'packed',\
                'lightly','level','even','rounded','heaping','heaped','sifted','bushel','peck','stick','chopped','sliced','halves', 'shredded',\
                'slivered','sliced','whole','paste','whole',' fresh', 'peeled', 'diced','mashed','dried','frozen','fresh','peeled','candied',\
                'no', 'pulp','crystallized','canned','crushed','minced','julienned','clove','head', 'small','large','medium', 'good', 'quality', \
                'freshly']


In [0]:
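#drop null; astype(str) above has turned NaN into the string 'nan', so filter on that
NY_train_data = NY_train_data[(NY_train_data['input'] != 'nan') & (NY_train_data['name'] != 'nan')]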

In [0]:
def cleanNYData(df, col, size):
    #work on a copy of the first `size` rows to avoid chained-assignment warnings
    subset = df.iloc[:size].copy()
    cleaned_col = []
    for value in subset[col]:
        #remove text within parentheses/brackets along with the brackets themselves
        value = re.sub(r"[\(\[].*?[\)\]]", "", value)
        value = value.replace("-", "")
        words = value.split()
        if len(words) > 1:
            #drop descriptive words so that only the ingredient name remains
            value = ' '.join(word for word in words if word.lower() not in remove_words)
        #keep a blank placeholder when nothing is left
        cleaned_col.append(value if value != '' else " ")
    subset[col] = cleaned_col
    return subset

In [472]:
clean_ny_data = cleanNYData(NY_train_data, 'name', 600)



In [473]:
clean_ny_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 600 entries, 0 to 601
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   index      600 non-null    int64  
 1   input      600 non-null    object 
 2   name       600 non-null    object 
 3   qty        600 non-null    float64
 4   range_end  600 non-null    float64
 5   unit       434 non-null    object 
 6   comment    367 non-null    object 
dtypes: float64(2), int64(1), object(4)
memory usage: 37.5+ KB


For the transformation, the subset of the cleaned NY Times dataset is converted into the form required for training the custom NER model.

In [0]:
def generateEntity(line, ingredient_list, entity):
    #locate each ingredient word in the line and record its (start, end, label) span
    entities = []
    for ingredient in ingredient_list:
        #re.escape guards against regex metacharacters inside ingredient words
        entity_match = re.search(re.escape(ingredient), line)
        entities.append((entity_match.start(), entity_match.end(), entity))
    return entities


def generateTrainingData(df, inputCol, ingredientCol, entity):
    TRAIN_DATA = []
    subset = df[[inputCol, ingredientCol]]
    for ix in range(len(df)):
        line = subset.iloc[ix, 0]
        ingd_name = subset.iloc[ix, 1]
        ent_dict = {}
        ingd_list = ingd_name.split()
        flag = 0
        #for each token
        for ingredient in ingd_list:
            if line == 'nan' or ingredient == 'nan':
                flag = 1
                continue
            if ingredient not in line:
                flag = 1
                continue
        if flag == 0:
            ent_dict['entities'] = generateEntity(line, ingd_list, entity)
            TRAIN_DATA.append((line, ent_dict))
            print("\r", "Adding", (line, ent_dict), "to row {0}".format(ix + 1), end = " ")
        else:
            print("\r","Skipping {} row".format(ix + 1), end = " ")
    return(TRAIN_DATA)    

In [487]:
TRAIN_DATA = generateTrainingData(clean_ny_data, "input", 'name', 'INGREDIENT')

 Adding ('2 tablespoons freshly squeezed lemon juice, more to taste', {'entities': [(31, 36, 'INGREDIENT')]}) to row 600 

In [488]:
print('Our training set has {} observations'.format(len(TRAIN_DATA)))

Our training set has 572 observations


##### *Train Custom Named Entity Recognition Model*

In [0]:
def train_spacy(data,iterations):
    TRAIN_DATA = data
    nlp = spacy.load('en_core_web_sm')
    # fetch the existing NER pipe, or create one if the pipeline has none (spaCy v2 API)
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe("ner")
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # note: begin_training() re-initializes the weights; resume_training() would keep the pretrained ones
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)
    return nlp

In [491]:
ner_model = train_spacy(TRAIN_DATA, 25)

Starting iteration 0
{'ner': 1597.9703202758897}
Starting iteration 1
{'ner': 1412.4199845095231}
Starting iteration 2
{'ner': 1339.9057926157514}
Starting iteration 3
{'ner': 1368.157786536946}
Starting iteration 4
{'ner': 1264.5722895093263}
Starting iteration 5
{'ner': 1317.0224620388951}
Starting iteration 6
{'ner': 1345.752825314061}
Starting iteration 7
{'ner': 1315.3997490144498}
Starting iteration 8
{'ner': 1330.389740800146}
Starting iteration 9
{'ner': 1212.2507211363586}
Starting iteration 10
{'ner': 1271.488753699605}
Starting iteration 11
{'ner': 1204.4500160514683}
Starting iteration 12
{'ner': 1275.7939297056428}
Starting iteration 13
{'ner': 1232.0827359020443}
Starting iteration 14
{'ner': 1128.2279360688538}
Starting iteration 15
{'ner': 1304.1340880967266}
Starting iteration 16
{'ner': 1183.6373150339896}
Starting iteration 17
{'ner': 1292.8330503383104}
Starting iteration 18
{'ner': 1249.023666995607}
Starting iteration 19
{'ner': 1294.0348647360474}
Starting iterat

In [0]:
#save to model
ner_model.to_disk('ner_model')

The model took approximately 15 to 30 minutes to train.

##### *Generate Ingredients for Scraped Data*

In [0]:
extracted = []
for _,row in recipes_df.iterrows():
    nlp_ing = ner_model(row['ingredient'])
    ans = {}
    for ent in nlp_ing.ents:
        ans[ent.label_] = ent.text.lower()#once the ingredients are found convert to lower case
    extracted.append(ans)


#generate list; rows where no INGREDIENT entity was found get a blank placeholder
clean_ingredient = []
for curr in extracted:
    clean_ingredient.append(curr.get('INGREDIENT', " "))

To gain a comparative understanding of how well the model is performing, I have shown the before and after columns for 10 rows.

In [343]:
for el in zip(recipes_df.ingredient[:10], clean_ingredient[:10]):
    print(el)

('6 ounces ground beef', 'beef')
('4 drops Worcestershire sauce, or to taste', 'worcestershire')
('3 pinches garlic powder', 'garlic')
('freshly ground black pepper to taste', 'pepper')
('kosher salt to taste', 'salt')
('1 pinch cayenne pepper', 'pepper')
('1 slice Cheddar cheese', 'cheese')
('1  hamburger bun, split', 'hamburger')
('1 slice tomato', 'tomato')
('1 leaf lettuce', 'lettuce')


The final clean version is stored in the file cleanData.csv and in the DataFrame clean_recipe.

In [0]:
#convert to a new dataframe; copy() avoids assigning into a slice of recipes_df
clean_recipe = recipes_df[['url', 'name']].copy()
clean_recipe['ingredient'] = clean_ingredient

In [497]:
clean_recipe.head()

| | url | name | ingredient |
|---|---|---|---|
| 0 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | beef |
| 1 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | worcestershire |
| 2 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | garlic |
| 3 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | pepper |
| 4 | https://www.allrecipes.com/recipe/267875/chef-... | Chef John's Juicy Lucy | salt |


In [0]:
clean_recipe.to_csv('cleanData.csv')


---

# Analysis

---



The objective of the analysis is to generate the count and proportion of each ingredient to find the 10 most common ingredients.

---
### 1. Count Calculation


---



In [0]:
count_df = pd.DataFrame(clean_recipe.ingredient.value_counts().rename_axis('ingredient').reset_index(name='count'))

In [501]:
print("There are {} unique ingredients".format(count_df.shape[0]))

There are 167 unique ingredients


**The top 10 most common ingredients are as follows:**

In [502]:
count_df.head(10)

| | ingredient | count |
|---|---|---|
| 0 | sugar | 94 |
| 1 | salt | 76 |
| 2 | butter | 66 |
| 3 | flour | 62 |
| 4 | pepper | 56 |
| 5 | cheese | 42 |
| 6 | onion | 41 |
| 7 | garlic | 37 |
| 8 | eggs | 37 |
| 9 | vanilla | 36 |


---
### 2. Proportion Calculation
---

Let us check whether an ingredient appears more than once in a recipe. This matters because, if no ingredient appears more than once, the count divided by the number of recipes gives us the proportion. 

However, if an ingredient occurs more than once in a recipe, the count no longer reflects the number of recipes it occurs in, since it also includes the repeated occurrences within a recipe. 
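
The difference between the two quantities can be seen on a toy example. The sketch below uses made-up rows (the recipe and ingredient names are placeholders, not scraped data) and plain Python rather than the notebook's DataFrames.

```python
from collections import Counter

# (recipe name, extracted ingredient) pairs; "onion" is listed twice in Recipe A
rows = [
    ("Recipe A", "onion"), ("Recipe A", "onion"), ("Recipe A", "salt"),
    ("Recipe B", "salt"), ("Recipe B", "sugar"),
]

# raw count: every row counts, including repeats within a recipe
raw_counts = Counter(ingredient for _, ingredient in rows)

# per-recipe count: de-duplicate the (recipe, ingredient) pairs first
recipe_counts = Counter(ingredient for _, ingredient in set(rows))

# proportion: share of recipes that contain the ingredient at least once
n_recipes = len({name for name, _ in rows})
proportions = {ing: n / n_recipes for ing, n in recipe_counts.items()}

print(raw_counts["onion"], recipe_counts["onion"], proportions["onion"])
```

Here onion has a raw count of 2 but occurs in only one of the two recipes, so its proportion is 0.5 rather than 1.0.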

In [0]:
count_recipe_ingredient = clean_recipe.groupby(['name', 'ingredient']).count()

In [504]:
count_recipe_ingredient

| name | ingredient | url |
|---|---|---|
| Air Fryer Lemon Pepper Shrimp | garlic | 1 |
| Air Fryer Lemon Pepper Shrimp | lemon | 2 |
| Air Fryer Lemon Pepper Shrimp | oil | 1 |
| Air Fryer Lemon Pepper Shrimp | paprika | 1 |
| Air Fryer Lemon Pepper Shrimp | pepper | 1 |
| ... | ... | ... |
| Zesty Slow Cooker Chicken Barbecue | barbeque | 1 |
| Zesty Slow Cooker Chicken Barbecue | chicken | 1 |
| Zesty Slow Cooker Chicken Barbecue | salad | 1 |
| Zesty Slow Cooker Chicken Barbecue | sugar | 1 |


In [506]:
count_recipe_ingredient.url.value_counts()

1    1017
2      79
3       7
5       1
4       1
Name: url, dtype: int64

A few recipes contain the same ingredient multiple times. This could be because of variations of the ingredient, such as chopped onions and diced onions. 

Thus, I first group the recipes by name and find the set of ingredients associated with each, and then count each ingredient's occurrences to eventually calculate the proportion.

In [0]:
recipe_ingredient = clean_recipe.groupby('name')['ingredient'].apply(set)

In [0]:
ingd_count = {}
for el in count_df.ingredient:
    for r in recipe_ingredient.index:
        if el in recipe_ingredient[r]:
            if el not in ingd_count:
                ingd_count[el] = 1
            else:
                ingd_count[el] += 1

In [0]:
prop_df = pd.DataFrame(ingd_count.items(), columns = ['ingredient', 'proportion'])

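In [0]:
#note: len(recipes_df) is the number of ingredient rows (1205), not the number of recipes;
#dividing by len(recipe_ingredient) instead would give the share of recipes containing each ingredient
prop_df['proportion'] = prop_df['proportion'].div(len(recipes_df))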

In [515]:
prop_df.sort_values( by = 'proportion', ascending = False)

| | ingredient | proportion |
|---|---|---|
| 1 | salt | 0.060581 |
| 0 | sugar | 0.054772 |
| 3 | flour | 0.048963 |
| 2 | butter | 0.047303 |
| 4 | pepper | 0.036515 |
| ... | ... | ... |
| 127 | fillets | 0.000830 |
| 129 | raspberries | 0.000830 |
| 130 | chuck | 0.000830 |
| 131 | spinach | 0.000830 |


In [520]:
#join with count_df
results_df = pd.merge(count_df, prop_df, on = 'ingredient')
results_df.head(10)

| | ingredient | count | proportion |
|---|---|---|---|
| 0 | sugar | 94 | 0.054772 |
| 1 | salt | 76 | 0.060581 |
| 2 | butter | 66 | 0.047303 |
| 3 | flour | 62 | 0.048963 |
| 4 | pepper | 56 | 0.036515 |
| 5 | cheese | 42 | 0.025726 |
| 6 | onion | 41 | 0.032365 |
| 7 | garlic | 37 | 0.027386 |
| 8 | eggs | 37 | 0.030705 |
| 9 | vanilla | 36 | 0.025726 |


---
---

The top 10 ingredients are mainly condiments and dairy. The only vegetables here are onion and garlic, since they are used in almost all sauces, gravies, etc. Flour is present too, and together with eggs and butter it suggests a substantial number of baking recipes in the scraped dataset.

---
---



In [0]:
#save to file only top 10
results_df.iloc[:10, :].to_csv("results.csv")
#save all results
results_df.to_csv("resultsAll.csv")

---

# Appendix

---

### 1. Approach 1 

Approach 1 is explored here. It was used to understand whether a simpler method can produce high-accuracy results.

Here, the ingredient is found as follows:



```
PSEUDOCODE
> for each token do the following  
        1.   If the token's dependency label marks it as the sentence's
             subject or root, move to step 2
        2.   If the token is a noun, move to step 3.
        3.   Scan the token's children for modifiers or compounds that
             are not measurements, and return them together with the
             token as the ingredient name
```














In [0]:
#load the existing small model from spacy
base_model = spacy.load('en_core_web_sm')

In [96]:
measurements = re.compile(r'(bowl|bulb|cube|clove|cup|drop|ounce|oz|pinch|pound|teaspoon|tablespoon)s?')
extracted = []

for ix, row in recipes_df.iterrows():
    print('\r', "Extracting ingredient for row", ix, end='')
    tokens = base_model(row['ingredient']) #the column is named 'ingredient' in recipes_df
    extract = ''
    for token in tokens:
        if (token.dep_ in ['nsubj', 'ROOT']) and (token.pos_ in ['NOUN', 'PROPN']) and (not measurements.match(token.text)):
        #explore children
            for child in token.children:
                if (not measurements.match(child.text)) and (child.dep_ in ['amod', 'compound']):
                    extract += child.text + ' '
            extract += token.text + ' '
    extracted.append(extract)

 Extracting ingredient for row 1204

In [97]:
#convert to dataframe to view; copy() avoids assigning into a slice of recipes_df
dup_recipe = recipes_df[['name', 'url', 'ingredient']].copy()
dup_recipe['cleaned ingredients'] = extracted

In [99]:
dup_recipe.head(10)

| | name | url | ingredient | cleaned ingredients |
|---|---|---|---|---|
| 0 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | 6 ounces ground beef | ground beef |
| 1 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | 4 drops Worcestershire sauce, or to taste | |
| 2 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | 3 pinches garlic powder | garlic powder |
| 3 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | freshly ground black pepper to taste | black pepper |
| 4 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | kosher salt to taste | kosher salt |
| 5 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | 1 pinch cayenne pepper | cayenne pepper |
| 6 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | 1 slice Cheddar cheese | slice Cheddar cheese |
| 7 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | 1 hamburger bun, split | hamburger bun split |
| 8 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | 1 slice tomato | slice tomato |
| 9 | Chef John's Juicy Lucy | https://www.allrecipes.com/recipe/267875/chef-... | 1 leaf lettuce | leaf lettuce |


In [0]:
dup_recipe.to_csv("secondaryModelResults.csv") 

This method, though not as good as the trained model, does a fairly good job and is a good starting point for venturing into text mining.