In [155]:
import numpy as np
import pandas as pd
import json
import requests
import re
import bs4
from bs4 import BeautifulSoup as bs

# Web Scraping from Allrecipes

To compile a dataset of recipes for this project, we first scrape www.allrecipes.com: we request its listing pages, follow the link on each recipe card, parse the recipe page, and store the results in a JSON file.

In [156]:
# set how many pages to scrape recipes from
first_page = 1
last_page = 1

In [157]:
# create empty json file to store recipe data
data = []
with open('recipes.json','w') as out_file:
    json.dump(data, out_file, indent=4)

In [158]:
def save_to_json(title, picture, ingredients, method):
    """Append a recipe to recipes.json unless its title is already stored."""
    with open('recipes.json') as in_file:
        data = json.load(in_file)

    # if this recipe title already exists in the data, do not add it again
    already_exists = any(recipe['title'] == title for recipe in data)

    # if a new recipe, append it to the list and dump the list back to the file
    if not already_exists:
        data.append({
            'title': title,
            'picture': picture,
            'ingredients': ingredients,
            'method': method,
        })

        with open('recipes.json', 'w') as out_file:
            json.dump(data, out_file, indent=4)
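
As written, save_to_json rereads and rewrites the whole file for every recipe, and scans every stored recipe to check for a duplicate title. One possible alternative, sketched here under the assumption that all recipes fit comfortably in memory, keeps the recipes in a dict keyed by title and writes the file once at the end. The names `add_recipe` and `dump_recipes` are hypothetical and are not used elsewhere in this notebook.

```python
import json

def add_recipe(recipes_by_title, title, picture, ingredients, method):
    """Store a recipe under its title; return False if the title is already present."""
    if title in recipes_by_title:
        return False
    recipes_by_title[title] = {
        'title': title,
        'picture': picture,
        'ingredients': ingredients,
        'method': method,
    }
    return True

def dump_recipes(recipes_by_title, path):
    """Write all stored recipes to `path` as a JSON list, in insertion order."""
    with open(path, 'w') as out_file:
        json.dump(list(recipes_by_title.values()), out_file, indent=4)
```

The scraping loop would call `add_recipe` for each recipe and `dump_recipes` once after the last page. The trade-off is that nothing is saved if the run is interrupted before the final dump.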

Note that Allrecipes uses two different HTML layouts for its recipe pages: a regular layout, and a layout that supports shopping for ingredients directly from the recipe page.  The two layouts keep the information we need in different elements, so we have to tell them apart.  If the title selector for the regular layout returns None, we instead look for the elements where the second (shopper) layout places them.
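
The fallback described above generalizes to any number of layouts: try one title selector per layout, in order, and stop at the first that matches. The sketch below is one way to express that idea. `first_match` is a hypothetical helper, and it only assumes that the lookup function returns None when nothing matches, as BeautifulSoup's select_one does. A plain dict stands in for a parsed page so that the example runs without network access.

```python
def first_match(select_one, selectors):
    """Return (layout_number, element) for the first selector that matches.

    `select_one` is any callable that returns None when nothing matches,
    e.g. the `select_one` method of a parsed BeautifulSoup page.
    Returns (None, None) if no selector matches.
    """
    for layout, selector in enumerate(selectors, start=1):
        element = select_one(selector)
        if element is not None:
            return layout, element
    return None, None

# title selectors for the two layouts handled in this notebook
TITLE_SELECTORS = ['.recipe-summary__h1', 'h1.headline.heading-content']

# stand-in for a parsed shopper-layout page: only the second selector matches
fake_page = {'h1.headline.heading-content': 'Best Brownies'}
layout, title = first_match(fake_page.get, TITLE_SELECTORS)
print(layout, title)
```

With a real page, the call would be `first_match(recipe_main.select_one, TITLE_SELECTORS)`. A `(None, None)` result signals a page that matches neither layout.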

In [159]:
for page in range(first_page, last_page + 1):
    # request the main allrecipes page which lists the top recipes
    source = requests.get("https://www.allrecipes.com?page=" + str(page))
    print("PARSING PAGE {}".format(page))
    doc = bs(source.text,'html.parser')
    
    # find each recipe linked on the main page, and open their links one by one
    recipe_cards = doc.select('a.fixed-recipe-card__title-link')
    for card in recipe_cards:
        recipe_page_source = requests.get(card['href'])        
        recipe_main = bs(recipe_page_source.text,'html.parser')
        
        # search for the title, picture, ingredients, and method elements
        title = recipe_main.select_one('.recipe-summary__h1')
        if title is not None:
            # for the ordinary layout (1)
            layout = 1
            picture = recipe_main.select_one('.rec-photo')
            ingredients = recipe_main.select('.recipe-ingred_txt')
            method = recipe_main.select('.recipe-directions__list--item')
        # if the title is 'None', then the page must be in the second layout
        else: 
            # for shopper formatting layout (2)
            layout = 2
            title = recipe_main.select_one('h1.headline.heading-content')
            picture = recipe_main.select_one('.inner-container > img')
            ingredients = recipe_main.select('span.ingredients-item-name')
            method = recipe_main.select('div.paragraph > p')
        
        # compile a list of ingredients for the current recipe
        ingredients_list = []
        for ingredient in ingredients:
            if ingredient.text != 'Add all ingredients to list' and ingredient.text != '':
                ingredients_list.append(ingredient.text.strip())
            
        # compile a list of method instructions for the current recipe
        method_list = []
        for instruction in method:
            method_list.append(instruction.text.strip())

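        # skip pages that match neither layout instead of failing on None
        if title is None or picture is None:
            print("Skipped: {} (unrecognized layout)".format(card['href']))
            continue

        # save all data for the current recipe to the json file
        save_to_json(title.text, picture.attrs['src'], ingredients_list, method_list)
        print("Saved: {} (layout {})".format(title.text,layout))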
        
print("FINISHED PARSING")

PARSING PAGE 1
Saved: Curry Salmon with Mango (layout 1)
Saved: Cake Mix Cinnamon Rolls (layout 2)
Saved: Slow Cooker Creamy Chicken Taco Soup (layout 1)
Saved: Simple Tomato Soup (layout 1)
Saved: Two-Ingredient Pizza Dough (layout 1)
Saved: Best Chocolate Chip Cookies (layout 2)
Saved: Janet's Rich Banana Bread (layout 2)
Saved: Creamed Eggs on Toast (layout 1)
Saved: World's Best Lasagna (layout 2)
Saved: Good Old Fashioned Pancakes (layout 1)
Saved: To-Die-For Chicken Pot Pie (layout 2)
Saved: Pantry Chicken Casserole (layout 1)
Saved: Slow Cooker Barbecue Chicken Breast (layout 1)
Saved: Island Kielbasa in a Slow Cooker (layout 1)
Saved: Best Brownies (layout 2)
Saved: Banana Banana Bread (layout 2)
Saved: Shrimp and Sugar Snap Peas (layout 1)
Saved: Easy Korean Ground Beef Bowl (layout 2)
Saved: Dill Pickle Soup (layout 1)
Saved: Fluffy Pancakes (layout 1)
FINISHED PARSING


# Data Analysis

We now have a JSON file containing the recipe information scraped from the web.  We can load this data into pandas and run some analyses on it.

In [160]:
recipes = pd.read_json(r'recipes.json')
recipes.head()

Unnamed: 0,ingredients,method,picture,title
0,"[1 (1 pound) fillet salmon fillet, 1/4 cup avo...",[Preheat oven to 400 degrees F (200 degrees C)...,https://images.media-allrecipes.com/userphotos...,Curry Salmon with Mango
1,"[3 (.25 ounce) packages active dry yeast, 2 ½ ...","[In a small bowl, dissolve yeast in warm water...",https://imagesvc.meredithcorp.io/v3/mm/image?u...,Cake Mix Cinnamon Rolls
2,"[1 serving nonstick cooking spray, 1 cup diced...",[Spray a slow cooker with cooking spray. Add o...,https://images.media-allrecipes.com/userphotos...,Slow Cooker Creamy Chicken Taco Soup
3,"[1 tablespoon unsalted butter or margarine, 1 ...",[Heat butter and olive oil in a large saucepan...,https://images.media-allrecipes.com/userphotos...,Simple Tomato Soup
4,"[1 1/2 cups self-rising flour, plus more for k...",[Mix flour and Greek yogurt together in a bowl...,https://images.media-allrecipes.com/userphotos...,Two-Ingredient Pizza Dough


In [190]:
all_ingredients_list = []

for row in recipes['ingredients']:
    for ing in row:
        all_ingredients_list.append(ing)

In [191]:
ingredients = pd.DataFrame(all_ingredients_list, columns=['ingredients'])
ingredients['edited'] = ingredients['ingredients']
ingredients.head()

Unnamed: 0,ingredients,edited
0,1 (1 pound) fillet salmon fillet,1 (1 pound) fillet salmon fillet
1,1/4 cup avocado oil,1/4 cup avocado oil
2,1 teaspoon curry powder,1 teaspoon curry powder
3,salt to taste,salt to taste
4,"1 mango - peeled, seeded, and diced","1 mango - peeled, seeded, and diced"


Next, we clean the ingredient strings to strip out the 'noise', leaving bare ingredient names without units, numbers, or descriptors.

In [192]:
# lists of common words we want to remove
units = ['gallon','quart','pint','cup','teaspoon','tablespoon','ounce','pound','can','pinch','serving','slice','package','bottle']
descriptors = ['small','medium','large']

# remove common measuring and descriptive words
for word in units + descriptors:
    plural = word+"s"
    ingredients['edited'] = ingredients['edited'].str.replace(' '+plural+' ', ' ', regex=False)
    ingredients['edited'] = ingredients['edited'].str.replace(' '+word+' ', ' ', regex=False)

# remove parentheticals
ingredients['edited'] = ingredients['edited'].str.replace(r'\([^()]*\)', '', regex=True)
# keep only the text before the first comma (hyphens become spaces in the next step)
ingredients['edited'] = ingredients['edited'].str.partition(',')[0]
# replace non-alphabetical characters with spaces
ingredients['edited'] = ingredients['edited'].str.replace('[^a-zA-Z]', ' ', regex=True)

# collapse runs of spaces left by adjacent removals and trim the ends
ingredients['edited'] = ingredients['edited'].str.replace(r'\s+', ' ', regex=True).str.strip()

ingredients.head()

Unnamed: 0,ingredients,edited
0,1 (1 pound) fillet salmon fillet,fillet salmon fillet
1,1/4 cup avocado oil,avocado oil
2,1 teaspoon curry powder,curry powder
3,salt to taste,salt to taste
4,"1 mango - peeled, seeded, and diced",mango peeled
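
The same cleaning steps can also be applied to a single string with the standard `re` module, which makes them easy to test in isolation. The following is a minimal sketch rather than the notebook's method: `clean_ingredient`, `UNITS`, `DESCRIPTORS`, and `NOISE` are hypothetical names, and unlike the cell above it matches units on word boundaries (so a unit at the very start of a string is also removed) and accepts "es" plurals such as "pinches".

```python
import re

UNITS = ['gallon', 'quart', 'pint', 'cup', 'teaspoon', 'tablespoon', 'ounce',
         'pound', 'can', 'pinch', 'serving', 'slice', 'package', 'bottle']
DESCRIPTORS = ['small', 'medium', 'large']

# whole-word match for each unit or descriptor, with an optional "s"/"es" plural
NOISE = re.compile(r'\b(?:' + '|'.join(UNITS + DESCRIPTORS) + r')(?:es|s)?\b')

def clean_ingredient(text):
    """Reduce one raw ingredient line to a bare ingredient name."""
    text = re.sub(r'\([^()]*\)', ' ', text)   # drop parentheticals
    text = text.partition(',')[0]             # keep text before the first comma
    text = NOISE.sub(' ', text)               # drop unit and size words
    text = re.sub(r'[^a-zA-Z]', ' ', text)    # turn digits and punctuation into spaces
    return ' '.join(text.split())             # collapse whitespace and trim

for raw in ['1 (1 pound) fillet salmon fillet',
            '1/4 cup avocado oil',
            '1 mango - peeled, seeded, and diced']:
    print(repr(clean_ingredient(raw)))
```

Because the function works on plain strings, it could be applied to the DataFrame column with `ingredients['ingredients'].apply(clean_ingredient)`.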


From here we can investigate which ingredients are the most common.  First, we count how often each edited row appears with a call to .value_counts().

In [194]:
ingredients['edited'].value_counts().head(20)

butter                      12
salt                        11
all purpose flour            9
white sugar                  8
baking soda                  4
cloves garlic                4
ground black pepper          4
milk                         4
egg                          4
eggs                         4
brown sugar                  4
vanilla extract              4
chicken broth                3
water                        3
baking powder                3
cooking spray                3
olive oil                    2
lean ground beef             2
unsweetened cocoa powder     2
onion                        2
Name: edited, dtype: int64

This shows us the most common row values.  However, this data has a few flaws.  The main issue is that two rows are counted together only if they match exactly.  For instance, 'sugar' and 'white sugar' are counted separately in this analysis, as are 'egg' and 'eggs' in the output above.

Another approach might be to store the list of all ingredient 'words' in its own dataframe, and perform a value_counts on it to see what the most common non-unit, non-descriptive words are in our recipe ingredients.  This loses some specificity (we lose the distinction between 'white sugar' and 'brown sugar'), but is helpful in other contexts.

In [195]:
ingredient_words_list = []
for row in ingredients['edited']:
    for word in row.split():
        ingredient_words_list.append(word)

ingredient_words = pd.DataFrame(ingredient_words_list,columns=['words'])

In [196]:
ingredient_words['words'].value_counts().head(20)

sugar      17
butter     15
salt       14
white      14
pepper     13
flour      10
all         9
purpose     9
powder      9
ground      9
chicken     8
onion       8
chopped     8
brown       7
baking      7
garlic      7
green       7
diced       6
water       6
oil         6
Name: words, dtype: int64
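
The difference between the two counting strategies can be reproduced on a toy list with `collections.Counter` from the standard library. This is only an illustrative sketch: the sample strings are made up and are not taken from the scraped data.

```python
from collections import Counter

# a hand-made sample of cleaned ingredient strings (not from the scraped data)
edited = ['white sugar', 'sugar', 'brown sugar', 'butter', 'butter']

# count whole rows: only exact matches are grouped together
row_counts = Counter(edited)

# count individual words: every row containing a word contributes to its count
word_counts = Counter(word for row in edited for word in row.split())

print(row_counts.most_common(2))
print(word_counts.most_common(2))
```

Counted by row, 'sugar' appears once, because 'white sugar' and 'brown sugar' are different strings. Counted by word, it appears three times, at the cost of no longer distinguishing the kinds of sugar.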