# NY Times Food Best Recipes

This notebook walks through how I scraped and visualized data from the New York Times Cooking website. My goal was to find out which recipe authors are the most prolific and which recipes are the most popular on the site.

In [80]:
import re
from random import randint
from time import time, sleep

import pandas as pd
import plotly.express as px
import requests
from bs4 import BeautifulSoup
from IPython.display import clear_output, display

# Lists to store data in
recipe_name = []
recipe_author = []
recipe_rating = []
recipe_review_count = []

# List of recipe links
recipe_links = []

# Search result pages to scrape (pages 1 through 9)
pages_url = [str(i) for i in range(1, 10)]

# Preparing the monitoring of the loop
start_time = time()
page_requests = 0

# For every page in the search interval
for page in pages_url:
    response = requests.get('https://cooking.nytimes.com/search?q=&page=' + page)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')

    # Pause the loop
    #sleep(randint(1,2))
    
    for link in soup.find_all('a', attrs={'href': re.compile('^/recipes')}):
        recipe_links.append('https://cooking.nytimes.com' + link.get('href'))
    
    #print (recipe_links)

# Remove duplicate recipe links while preserving their order
recipe_links = list(dict.fromkeys(recipe_links))

for recipe in recipe_links:
    response_recipe = requests.get(recipe)
    html_recipe = response_recipe.text
    soup_recipe = BeautifulSoup(html_recipe, 'html.parser')
    
    # Pause the loop
    # sleep(randint(1,2))
    
    # Adds title of recipe (falls back to 'None' if the page has no <title>)
    try:
        recipe_name.append(soup_recipe.title.string)
    except AttributeError:
        recipe_name.append('None')
    
    # Adds author of recipe (falls back to 'None' if no byline is found)
    try:
        recipe_author.append(soup_recipe.find('span', {'class': 'byline-name', 'itemprop': 'author'}).text)
    except AttributeError:
        recipe_author.append('None')

    # Adds average rating and number of reviews of recipe.
    # The pattern captures each integer that follows "= " in the text node
    # that mentions bootstrap.recipe.avg_rating.
    pattern = r'=\s(\d+)'
    value = re.findall(pattern, str(soup_recipe.find(string=re.compile("bootstrap.recipe.avg_rating"))))
    try:
        recipe_rating.append(value[0])
    except IndexError:
        recipe_rating.append('0')
    try:
        recipe_review_count.append(value[1])
    except IndexError:
        recipe_review_count.append('0')
    
recipe_information = pd.DataFrame({
    'Recipe Name': recipe_name,
    'Recipe Author': recipe_author,
    'Recipe Rating': recipe_rating,
    'Recipe Review Count': recipe_review_count
})
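
To make the rating extraction easier to follow, here is a minimal sketch of the same regular expression applied to a made-up script string. The sample text, including the `num_ratings` name, and the helper name `extract_rating_fields` are assumptions for illustration; the real page markup may differ.

```python
import re

# Hypothetical script text; the actual markup on the site is an assumption.
sample_script = (
    "bootstrap.recipe.avg_rating = 4;\n"
    "bootstrap.recipe.num_ratings = 101;"
)

def extract_rating_fields(script_text):
    """Return (rating, review_count) as ints, using 0 for any missing value."""
    # Capture every integer that directly follows "= ".
    values = re.findall(r'=\s(\d+)', script_text)
    rating = int(values[0]) if len(values) > 0 else 0
    review_count = int(values[1]) if len(values) > 1 else 0
    return rating, review_count

rating, review_count = extract_rating_fields(sample_script)
print(rating, review_count)
```

Because the first match is treated as the rating and the second as the review count, this approach depends on the order in which the values appear in the page.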

In [82]:
recipe_information.to_csv('recipe_information.csv')
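
By default, `to_csv` also writes the DataFrame index, which is where the `Unnamed: 0` column in the later output comes from. The sketch below uses made-up data and in-memory buffers to compare the default with `index=False`; it is an illustration only, and the rest of this notebook keeps the default.

```python
import io

import pandas as pd

# Made-up data standing in for the scraped recipes.
df = pd.DataFrame({'Recipe Name': ['Parmesan Broth'], 'Recipe Rating': ['4']})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```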

In [1]:
recipe_information = pd.read_csv('recipe_information.csv')


The following is a snapshot of the DataFrame:

In [84]:
print(recipe_information.head())
recipe_information.info()

   Unnamed: 0                                        Recipe Name  \
0           0  Mushroom-Farro Soup With Parmesan Broth Recipe...   
1           1  Beans and Garlic Toast in Broth Recipe - NYT C...   
2           2           Easiest Lentil Soup Recipe - NYT Cooking   
3           3                Parmesan Broth Recipe - NYT Cooking   
4           4  Potato Gratin With Swiss Chard and Sumac Onion...   

      Recipe Author  Recipe Rating  Recipe Review Count  
0     Julia Sherman              4                  101  
1         Tejal Rao              4                  482  
2     Melissa Clark              4                  888  
3     Julia Sherman              4                   63  
4  Yotam Ottolenghi              4                   31  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432 entries, 0 to 431
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Unnamed: 0           432 non-null   

The rating and review count values were scraped as strings, so we make sure both columns are numeric. We also strip the trailing "Recipe - NYT Cooking" from the recipe titles to clean up the display.

After each step, I print the DataFrame to check that the changes were applied.

In [85]:
# Clean up recipe name: keep everything before the trailing " Recipe ..."
recipe_information['Recipe Name'] = recipe_information['Recipe Name'].str.extract(r'(.*)\sRecipe', expand=False)
print(recipe_information.head())

# Convert rating and review count columns to integers
recipe_information['Recipe Rating'] = pd.to_numeric(recipe_information['Recipe Rating'])
recipe_information['Recipe Review Count'] = pd.to_numeric(recipe_information['Recipe Review Count'])

print(recipe_information.head())
recipe_information.info()

   Unnamed: 0                                      Recipe Name  \
0           0          Mushroom-Farro Soup With Parmesan Broth   
1           1                  Beans and Garlic Toast in Broth   
2           2                              Easiest Lentil Soup   
3           3                                   Parmesan Broth   
4           4  Potato Gratin With Swiss Chard and Sumac Onions   

      Recipe Author  Recipe Rating  Recipe Review Count  
0     Julia Sherman              4                  101  
1         Tejal Rao              4                  482  
2     Melissa Clark              4                  888  
3     Julia Sherman              4                   63  
4  Yotam Ottolenghi              4                   31  
   Unnamed: 0                                      Recipe Name  \
0           0          Mushroom-Farro Soup With Parmesan Broth   
1           1                  Beans and Garlic Toast in Broth   
2           2                              Easiest Lentil
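
One thing to watch with the title cleanup: `str.extract` returns NaN for any title that does not match the pattern, such as the placeholder stored when a page had no title. The sketch below uses made-up titles to show that behaviour, along with one possible fallback that keeps the original value. The fallback is an assumption about what you might want, not something the notebook does.

```python
import pandas as pd

# Made-up titles: one with the usual suffix, one placeholder without it.
titles = pd.Series([
    'Easiest Lentil Soup Recipe - NYT Cooking',
    'none',
])

# Non-matching titles become NaN.
cleaned = titles.str.extract(r'(.*)\sRecipe', expand=False)

# Optional fallback: keep the original text where nothing matched.
cleaned_with_fallback = cleaned.fillna(titles)

print(cleaned.tolist())
print(cleaned_with_fallback.tolist())
```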

To visualize the data, we first build a DataFrame that holds the number of recipes by each author.

In [86]:
recipe_information_count = recipe_information.groupby('Recipe Author').size().reset_index(name='Counts')
print(recipe_information_count.head())

     Recipe Author  Counts
0     Alexa Weibel      32
1       Ali Slagle      38
2     Alison Roman      26
3  Angela Dimayuga      10
4     Becky Hughes       1


We use Plotly Express to create a pie chart showing the proportion of recipes written by each author.

In [87]:
# Group authors with fewer than 2 recipes under a single 'Other Authors' label
recipe_information_count.loc[recipe_information_count['Counts'] < 2, 'Recipe Author'] = 'Other Authors'
fig = px.pie(recipe_information_count, values='Counts', names='Recipe Author', title='Percentage of NYT Recipes Written by Each Author')
fig.show()
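
After the relabeling step, several rows share the `Other Authors` label. As far as I know, Plotly's pie trace sums the values of duplicated labels, so the chart still renders a single slice. If you want the DataFrame itself to contain one row per label, the rows can be collapsed explicitly. A minimal sketch, using made-up author names and counts:

```python
import pandas as pd

# Made-up counts standing in for recipe_information_count.
counts = pd.DataFrame({
    'Recipe Author': ['Author A', 'Author B', 'Author C', 'Author D'],
    'Counts': [32, 10, 1, 1],
})

# Relabel small authors, as in the cell above.
threshold = 2
counts.loc[counts['Counts'] < threshold, 'Recipe Author'] = 'Other Authors'

# Collapse the duplicated 'Other Authors' rows into a single row.
collapsed = counts.groupby('Recipe Author', as_index=False)['Counts'].sum()
print(collapsed)
```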

Great work, Tejal! I love your Indian food recipes. Now I want to see what the most popular recipes on the site are. I'm going to sort first by rating (i.e., 5-star recipes first) and then by the number of reviews.

In [88]:
sorted_recipe_information = recipe_information.sort_values(['Recipe Rating', 'Recipe Review Count'], ascending=False)

display(sorted_recipe_information.head(50))

| Unnamed: 0.1 | Unnamed: 0 | Recipe Name | Recipe Author | Recipe Rating | Recipe Review Count |
|---|---|---|---|---|---|
| 102 | 102 | Caramelized Shallot Pasta | Alison Roman | 5 | 3766 |
| 162 | 162 | Spicy White Bean Stew With Broccoli Rabe | Alison Roman | 5 | 3168 |
| 284 | 284 | Thai-Inspired Chicken Meatball Soup | Ali Slagle | 5 | 2598 |
| 151 | 151 | Red Curry Lentils With Sweet Potatoes and Spinach | Lidey Heuck | 5 | 2499 |
| 210 | 210 | Via Carota’s Insalata Verde | Samin Nosrat | 5 | 2375 |
| 414 | 414 | Coconut Curry Chickpeas With Pumpkin and Lime | Melissa Clark | 5 | 1483 |
| 112 | 112 | Lemony Shrimp and Bean Stew | Sue Li | 5 | 1311 |
| 128 | 128 | Cheesy Baked Pasta With Sausage and Ricotta | Melissa Clark | 5 | 1290 |
| 375 | 375 | Coconut Milk Chicken Adobo | Angela Dimayuga | 5 | 1175 |
| 159 | 159 | Indian Butter Chickpeas | Melissa Clark | 5 | 952 |


Great job, Alison! I can confirm that your Pork Noodle Soup is quite delicious!

Now let's look at which authors account for the largest share of 5-star recipes.

In [89]:
# Keep only 5-star recipes and drop the first column (the leftover 'Unnamed: 0' index)
recipe_information_five_star = recipe_information.loc[recipe_information['Recipe Rating'] == 5].iloc[:, 1:]
display(recipe_information_five_star)

| Unnamed: 0 | Recipe Name | Recipe Author | Recipe Rating | Recipe Review Count |
|---|---|---|---|---|
| 5 | Braised Fennel With White Bean Purée | Julia Sherman | 5 | 5 |
| 17 | Pork Noodle Soup With Ginger and Toasted Garlic | Alison Roman | 5 | 856 |
| 34 | Vegan Turkish Kebabs With Sumac Onions and Gar... | J. Kenji López-Alt | 5 | 35 |
| 35 | Steamed Clams With Garlic-Parsley Butter and L... | David Tanis | 5 | 9 |
| 49 | Chocolate-Chip Banana Bread | Erin Jeanne McDowell | 5 | 365 |
| 50 | Chicken and Rice Soup With Celery, Parsley and... | Ali Slagle | 5 | 533 |
| 58 | Gluten-Free Chocolate Chip Cookies | Erin Jeanne McDowell | 5 | 412 |
| 62 | Chicken Potpie With Cornbread Biscuits | Lidey Heuck | 5 | 45 |
| 72 | Sherry Margarita | Rebekah Peppler | 5 | 10 |
| 99 | Jalapeño Poppers | Alexa Weibel | 5 | 234 |


In [90]:
recipe_information_five_star_count = recipe_information_five_star.groupby('Recipe Author').size().reset_index(name='Counts')
# recipe_information_five_star_count.loc[recipe_information_five_star_count['Counts'] < 2, 'Recipe Author'] = 'Other Authors' # Represent only large authors
fig2 = px.pie(recipe_information_five_star_count, values='Counts', names='Recipe Author', title='Percentage of 5-Star NYT Recipes Written by Each Author')
fig2.show()
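
The pie chart above shows each author's share of all 5-star recipes, which favors authors with many recipes. A related but different question is what fraction of each author's own recipes earn 5 stars. Here is a minimal sketch of that calculation, using made-up authors and ratings rather than the scraped data:

```python
import pandas as pd

# Made-up ratings standing in for recipe_information.
recipes = pd.DataFrame({
    'Recipe Author': ['Author A', 'Author A', 'Author B',
                      'Author B', 'Author B', 'Author C'],
    'Recipe Rating': [5, 4, 5, 5, 4, 4],
})

# The mean of a boolean column is the fraction of True values.
five_star_share = (
    recipes.assign(is_five_star=recipes['Recipe Rating'] == 5)
    .groupby('Recipe Author')['is_five_star']
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(five_star_share)
```

With only a handful of recipes per author the percentages are noisy, so it may be worth filtering to authors above some minimum recipe count before comparing them.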