# NLP Recipe Web Scraping
## Created by Ryan Pindale on March 20th, 2022
### Purpose: gather text from recipe blogs for a supervised NLP project

In [2]:
import requests
from bs4 import BeautifulSoup

# The first recipe blog website is [Dinner at the Zoo](https://www.dinneratthezoo.com/). I scraped 2024 recipe links and saved the links in a csv. Text extraction from these links will take place on a seperate date once project strategy is ironed out.

In [12]:
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
url='https://www.dinneratthezoo.com/'
r = requests.get(url, headers=headers)
r

<Response [200]>

In [13]:
soup = BeautifulSoup(r.content, 'html.parser')

In [None]:
#These are the categories for recipes listed on teh website
cats = ['appetizers', 'asian-food', 'baking', 'breakfast', 'brunch', 
        'desserts', 'dinner', 'drinks', 'gluten-free', 'grilling', 
        'instant-pot', 'light-healthy', 'one-pot-meals', 'pasta', 
        'salads', 'side-dishes', 'slow-cooker', 'snakcs', 'soup']

cat_url = ['/category/'+ end +'/' for end in cats]
#cat_url

For each category, I extracted the links to all recipes on each page. Different categories have different number of pages. I could get this number of pages in a sloppy way by taking the 41st from the back element of a list of the text from 'a' tags. This worked for every category except snacks. So, I cheated and added a try/except rule instead of figuring out what the problem was for snacks.

In [103]:
recipes=[]
for i in range(0,len(cat_url)):
    headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
    url='https://www.dinneratthezoo.com/'
    r = requests.get(url+cat_url[i], headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    tags = soup.find_all('a')
    try:
        number_of_pages=int(tags[-41].get_text()) #This is the number of pages per category
    except:
        number_of_pages=6 #it broke only on "snacks" which ahs 6 pages
    page_num = ['page/'+str(i)+'/' for i in range(2,number_of_pages+1)]
    tags = soup.find_all(class_='more-link')
    for tag in tags:
        recipes.append(tag.get('href'))
    for page_urls in page_num:
        r = requests.get(url+cat_url[i]+page_urls, headers=headers)
        soup = BeautifulSoup(r.content, 'html.parser')
        tags = soup.find_all(class_='more-link')
        for tag in tags:
            recipes.append(tag.get('href'))
        
        

In [106]:
len(recipes)

2024

In [107]:
recipes[0]

'https://www.dinneratthezoo.com/potato-skins-recipe/'

In [112]:
# importing pandas as pd  
import pandas as pd  

df = pd.DataFrame({'recipe_links':recipes}) 

# saving the dataframe 
df.to_csv('dinner_at_the_zoo_recipe_links.csv') 

Here is some early text extraction. I will discuss with my team for better methods before continuing.

In [108]:
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
url='https://www.dinneratthezoo.com/'
r = requests.get(recipes[0], headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

In [109]:
try_this = soup.find_all('p')

In [110]:
for tags in try_this:
    print(tags.get_text())

Dinner at the Zoo
Home » Appetizers » Potato Skins Recipe
Published: March 15, 2022 Last Modified: March 15, 2022 By Sara 18 Comments 
This recipe for potato skins is Russet potato halves topped with cheese and bacon, then baked to crispy perfection and finished with sour cream and green onions. A classic appetizer that’s easy to make and always a crowd pleaser!
I love to serve a variety of hot appetizers when I’m entertaining, including baked buffalo wings, jalapeno poppers and these loaded potato skins.
Baked potato skins are popular for good reason – who can resist the combination of bacon, cheese and sour cream? Not me! These homemade potato snacks are SO much better than anything you’d buy in the freezer section, and they’re easy to make too!
For this recipe you will need potatoes, and olive oil, salt and pepper to season the potatoes. Next up are the toppings which include shredded cheese, bacon, sour cream and green onions. Put everything together and you’ll have the ultimate ap

# The next website will be [Spend with Pennies](https://www.spendwithpennies.com/)

In [115]:
#Establishing a connection
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
url='https://www.spendwithpennies.com/'
r = requests.get(url, headers=headers)
r
soup = BeautifulSoup(r.content, 'html.parser')

There seams to be a path where all the recipes live.

In [116]:
easy_url = 'https://www.spendwithpennies.com/category/recipes/'
r = requests.get(url, headers=headers)
r
soup = BeautifulSoup(r.content, 'html.parser')

In [117]:
tags = soup.find_all(class_='entry-image-link')
for tag in tags:
    print(tag.get('href'))

https://www.spendwithpennies.com/air-fryer-potato-and-sausage/
https://www.spendwithpennies.com/parmesan-baked-acorn-squash/
https://www.spendwithpennies.com/homemade-potato-bread/
https://www.spendwithpennies.com/frito-pie/
https://www.spendwithpennies.com/air-fryer-potato-and-sausage/
https://www.spendwithpennies.com/air-fryer-spaghetti-squash/
https://www.spendwithpennies.com/air-fryer-roasted-garlic/
https://www.spendwithpennies.com/air-fryer-stuffed-chicken-breasts/
https://www.spendwithpennies.com/weight-loss-vegetable-soup-recipe/
https://www.spendwithpennies.com/slow-cooker-minestrone-soup/
https://www.spendwithpennies.com/quick-cabbage-soup/
https://www.spendwithpennies.com/overnight-oats/
https://www.spendwithpennies.com/air-fryer-potato-and-sausage/
https://www.spendwithpennies.com/frito-pie/
https://www.spendwithpennies.com/beef-guinness-stew-recipe/
https://www.spendwithpennies.com/cottage-pie/
https://www.spendwithpennies.com/parmesan-baked-acorn-squash/
https://www.spend

In [118]:
spend_with_pennies_recipes=[]
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
easy_url = 'https://www.spendwithpennies.com/category/recipes/'
r = requests.get(easy_url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
tags = soup.find_all(class_='entry-image-link')
number_of_pages=148 
page_num = ['page/'+str(i)+'/' for i in range(2,number_of_pages+1)]
for tag in tags:
    spend_with_pennies_recipes.append(tag.get('href'))
for page_urls in page_num:
    r = requests.get(easy_url+page_urls, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    tags = soup.find_all(class_='entry-image-link')
    for tag in tags:
        spend_with_pennies_recipes.append(tag.get('href'))

I tried to check for duplicates by comparing the length of the lsit to the length of the set. There is a difference. The difference in the recipes list was more concerning. I am ignoring it for now. We will deal with it when we get to the text extraction.

In [121]:
print(len(spend_with_pennies_recipes))
print(len(set(spend_with_pennies_recipes)))

2364
2312


In [123]:
df = pd.DataFrame({'recipe_links':spend_with_pennies_recipes})

# saving the dataframe 
df.to_csv('spend_with_pennies_recipe_links.csv') 

# The next website will be [Cafe Delites](https://cafedelites.com/)

This one is also really easy. There is a path that lists all of the recipes in 94 pages.

In [125]:
cafe_delites_recipes=[]
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
easy_url = 'https://cafedelites.com/recipes/'
r = requests.get(easy_url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
tags = soup.find_all(class_='entry-image-link')
number_of_pages=94
page_num = ['page/'+str(i)+'/' for i in range(2,number_of_pages+1)]
for tag in tags:
    cafe_delites_recipes.append(tag.get('href'))
for page_urls in page_num:
    r = requests.get(easy_url+page_urls, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    tags = soup.find_all(class_='entry-image-link')
    for tag in tags:
        cafe_delites_recipes.append(tag.get('href'))

In [127]:
print(len(cafe_delites_recipes))
print(len(set(cafe_delites_recipes)))

839
839


In [128]:
df = pd.DataFrame({'recipe_links':cafe_delites_recipes})

# saving the dataframe 
df.to_csv('cafe_delites_recipe_links.csv') 

# The next one will be [Damn Delicious](https://damndelicious.net/)

In [162]:
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
url = 'https://damndelicious.net/recipe-index/'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

In [166]:
#Getting the categories!
cat_links = []
trying = soup.select('div[class*="archive-post"]')
for EachPart in trying:
    cat_links.append(EachPart.findNext().get('href'))
    
cat_links.remove(cat_links[6]) #this one is just her cookbooks

In [184]:
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
url = cat_links[4]
r = requests.get(url+'/page/2/', headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

In [185]:
trying = soup.select('div[class*="archive-post"]')
for EachPart in trying:
    print(EachPart.findNext().get('href'))

https://damndelicious.net/2020/01/14/meal-prep-chicken-3-ways/
https://damndelicious.net/2020/01/10/quick-chicken-taquitos/
https://damndelicious.net/2019/12/10/quick-chicken-ramen-noodle-stir-fry/
https://damndelicious.net/2019/10/21/honey-buffalo-wings-with-homemade-ranch/
https://damndelicious.net/2019/10/05/chicken-harvest-salad/
https://damndelicious.net/2019/10/01/cilantro-lime-chicken-wings/
https://damndelicious.net/2019/09/24/skillet-lemon-dill-chicken-thighs/
https://damndelicious.net/2019/09/09/grilled-honey-mustard-chicken-tenders/
https://damndelicious.net/2019/08/28/greek-chicken-kabobs/
https://damndelicious.net/2019/08/14/grilled-greek-chicken-salad/
https://damndelicious.net/2019/08/10/instant-pot-lemon-chicken-thighs/
https://damndelicious.net/2019/08/06/easy-chicken-tacos/
https://damndelicious.net/2019/08/02/honey-garlic-asian-chicken-kabobs/
https://damndelicious.net/2019/07/28/rosemary-chicken-and-peach-salad/
https://damndelicious.net/2019/07/24/sheet-pan-chicken

In [186]:
cat_links

['https://damndelicious.net/category/appetizer/',
 'https://damndelicious.net/category/asian-inspired/',
 'https://damndelicious.net/category/bread/',
 'https://damndelicious.net/category/breakfast/',
 'https://damndelicious.net/category/chicken-recipes/',
 'https://damndelicious.net/category/christmas/',
 'https://damndelicious.net/category/dessert/',
 'https://damndelicious.net/category/dog-food/',
 'https://damndelicious.net/category/drink/',
 'https://damndelicious.net/category/entree/',
 'https://damndelicious.net/category/fall/',
 'https://damndelicious.net/category/freezer-friendly/',
 'https://damndelicious.net/category/game-day/',
 'https://damndelicious.net/category/healthy/',
 'https://damndelicious.net/category/instant-pot-recipes/',
 'https://damndelicious.net/category/meal-prep/',
 'https://damndelicious.net/category/one-pot/',
 'https://damndelicious.net/category/pasta/',
 'https://damndelicious.net/category/entree/quick-easy/',
 'https://damndelicious.net/category/salad

In [190]:
damn_delicious_recipes=[]
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
for i in range(len(cat_links)):
    r = requests.get(cat_links[i], headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    pn_search = soup.find_all(class_='page-numbers') #trying to fins all of the page number info
    try:
        number_of_pages = int(pn_search[-2].get_text()) #turns out the number of pages lives here
    except:
        number_of_pages = 1 #sometimes there is only one page
    page_num = ['page/'+str(i)+'/' for i in range(2,number_of_pages+1)]
    tags = soup.select('div[class*="archive-post"]')
    for tag in tags:
        damn_delicious_recipes.append(tag.findNext().get('href'))
    for page_urls in page_num:
        r = requests.get(cat_links[i]+page_urls, headers=headers)
        soup = BeautifulSoup(r.content, 'html.parser')
        tags = soup.select('div[class*="archive-post"]')
        for tag in tags:
            damn_delicious_recipes.append(tag.findNext().get('href'))

In [195]:
print(len(damn_delicious_recipes))
print(len(set(damn_delicious_recipes))) #That is very concerning for repeats

3235
1357


In [194]:
df = pd.DataFrame({'recipe_links':damn_delicious_recipes})

# saving the dataframe 
df.to_csv('damn_delicious_recipe_links.csv') 

# Next up is [Gimma some Oven](https://www.gimmesomeoven.com/)

In [196]:
gimme_some_oven_recipes=[]
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
easy_url = 'https://www.gimmesomeoven.com/all-recipes/'
r = requests.get(easy_url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
tags = soup.find_all(class_='teaser-post-sm')
number_of_pages=149
page_num = ['?fwp_paged='+str(i) for i in range(2,number_of_pages+1)]
for tag in tags:
    gimme_some_oven_recipes.append(tag.findNext().get('href'))
for page_urls in page_num:
    r = requests.get(easy_url+page_urls, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    tags = soup.find_all(class_='teaser-post-sm')
    for tag in tags:
        gimme_some_oven_recipes.append(tag.findNext().get('href'))

In [198]:
print(len(gimme_some_oven_recipes))
print(len(set(gimme_some_oven_recipes)))

1782
1782


In [199]:
df = pd.DataFrame({'recipe_links':gimme_some_oven_recipes})

# saving the dataframe 
df.to_csv('gimme_some_oven_recipe_links.csv') 

# Next up is [Cooking Classy](https://www.cookingclassy.com/)

In [206]:
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
url = 'https://www.cookingclassy.com/recipes/'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')


#Getting the categories!
cat_links = []
trying = soup.find_all(class_='li-a')
for EachPart in trying:
    cat_links.append(EachPart.findNext().get('href'))
    
#cat_links.remove(cat_links[6]) #this one is just her cookbooks

cat_links

['https://www.cookingclassy.com/recipes/appetizer/',
 'https://www.cookingclassy.com/recipes/asian/',
 'https://www.cookingclassy.com/recipes/meat/',
 'https://www.cookingclassy.com/recipes/bread/',
 'https://www.cookingclassy.com/recipes/breakfast/',
 'https://www.cookingclassy.com/recipes/bars/',
 'https://www.cookingclassy.com/recipes/cake/',
 'https://www.cookingclassy.com/recipes/holidays/christmas/',
 'https://www.cookingclassy.com/recipes/cookies/',
 'https://www.cookingclassy.com/recipes/dessert/',
 'https://www.cookingclassy.com/recipes/drinks/',
 'https://www.cookingclassy.com/recipes/fall-faves/',
 'https://www.cookingclassy.com/recipes/holidays/halloween/',
 'https://www.cookingclassy.com/recipes/healthy/',
 'https://www.cookingclassy.com/recipes/holidays/',
 'https://www.cookingclassy.com/recipes/ice-cream/',
 'https://www.cookingclassy.com/recipes/instant-pot/',
 'https://www.cookingclassy.com/recipes/main-dish/',
 'https://www.cookingclassy.com/recipes/mexican/',
 'https

# [Sally's Baking Recipes](https://sallysbakingaddiction.com/)

In [202]:
sallys_baking_recipes=[]
headers = {'user-agent': 'Pindale UVA MSDS Project (mwp8zy@virginia.edu)'}
easy_url = 'https://sallysbakingaddiction.com/recipe-index/'
r = requests.get(easy_url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
tags = soup.find_all(class_='recipe-image')
number_of_pages=59
page_num = ['?fwp_paged='+str(i) for i in range(2,number_of_pages+1)]
for tag in tags:
    sallys_baking_recipes.append(tag.findNext().get('href'))
for page_urls in page_num:
    r = requests.get(easy_url+page_urls, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    tags = soup.find_all(class_='recipe-image')
    for tag in tags:
        sallys_baking_recipes.append(tag.findNext().get('href'))

In [204]:
print(len(sallys_baking_recipes))
print(len(set(sallys_baking_recipes)))

1167
1167


In [205]:
df = pd.DataFrame({'recipe_links':sallys_baking_recipes})

# saving the dataframe 
df.to_csv('sallys_baking_recipe_links.csv') 

# [Natahsa's Kitchen](https://natashaskitchen.com/)

# [Two Peas and Their Pod](https://www.twopeasandtheirpod.com/)

# [Smitten Kitchen](https://smittenkitchen.com/)