## Set up

We'll mainly be using two packages for scraping data off the internet:

1. [requests](https://2.python-requests.org//en/master/)
2. [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

So, let's install them if you haven't.

In [9]:
!pip install --user requests bs4



Import the necessary packages...

In [10]:
import sys  # needed for sys.exit() in the functions below

import pandas as pd
import requests
from bs4 import BeautifulSoup

Now, let's say we want to scrape https://www.seriouseats.com/recipes/topics/ingredient/meats-and-poultry/chicken?page=1

The landing page has a list of links to recipes, all related to chicken.

Before we start, a quick refresher on the URL's components:

1. Scheme: identifies the protocol to be used (HTTP or HTTPS)
2. Host: identifies the owner that holds the resource (www.foobar.com)
3. Port: a number that follows the host and identifies a specific service on that host (usually omitted, in which case the scheme's default is used: 80 for HTTP, 443 for HTTPS)
4. Path: identifies the specific resource in the host (/library/goodies/blah.html)
5. Query: key-value pairs that the resource can make use of (book=science). Also known as a payload.

The components of the URL can be primarily delimited like so:

**scheme://host:port/path?query**

Thus, **https** is our scheme, **www.seriouseats.com** is our host, **recipes/topics/ingredient/meats-and-poultry/chicken** is our path, **page=1** is our query.
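If you'd rather not split URLs by eye, Python's standard library can do it for you. Here is a minimal sketch using `urllib.parse` on our landing page URL (the variable names are just for illustration). Note that `urlsplit` keeps the leading slash on the path.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.seriouseats.com/recipes/topics/ingredient/meats-and-poultry/chicken?page=1"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.hostname)  # www.seriouseats.com
print(parts.port)      # None, since this URL omits the port
print(parts.path)      # /recipes/topics/ingredient/meats-and-poultry/chicken
print(parts.query)     # page=1

# The query string can be broken further into key-value pairs
query = parse_qs(parts.query)
print(query)           # {'page': ['1']}
```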

In [19]:
# declare our url components
scheme = "https"
host = "www.seriouseats.com"
path = "recipes/topics/ingredient/meats-and-poultry/chicken"

Now that we have our designated page to scrape, we will be using requests to get the necessary resource.

How requests works is pretty simple:

1. Requests sends an HTTP request (GET/POST etc.)
2. The server sends back a response (plaintext, JSON, etc.)
3. Do what you may with the response
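To peek at step 1 before anything goes over the wire, requests can build a request without sending it. This is a small sketch using `requests.Request(...).prepare()`; `base_url` and `prepared` are names made up for this illustration.

```python
import requests

base_url = "https://www.seriouseats.com/recipes/topics/ingredient/meats-and-poultry/chicken"

# Build the request object only; nothing is sent over the network here
prepared = requests.Request("GET", base_url, params={"page": 1}).prepare()

print(prepared.method)  # the HTTP verb that would be used
print(prepared.url)     # the base URL with the query string appended
```

The printed URL shows how the `params` dictionary ends up as the query component.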

For example, we send out a request like this.

In [21]:
# declare our url template
url = "{scheme}://{host}/{path}"
# create the payload (query)
payload = {"page": 1}
# send the request
r = requests.get(url.format(scheme=scheme, host=host, path=path), params=payload)

That's it! Now, we can see what the response is in plaintext.

In [22]:
r.text

'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    \n    \n\n\n\n\n\n  \n\n  <meta charset="utf-8">\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <!-- generated by chapp // cameo // proxy-2-wp // support@seriouseats.com -->\n  <title>Chicken Recipes | Serious Eats</title>\n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  <link href="https://www.seriouseats.com/recipes/topics/ingredient/meats-and-poultry/chicken" rel="canonical" />\n  <link href="https://www.seriouseats.com/recipes/topics/ingredient/meats-and-poultry/chicken?page=2" rel="next" />\n\n\n  <meta name="twitter:card" content="summary_large_image">\n  <meta name="twitter:site" content="@seriouseats">\n  <meta property="og:title" content="Chicken Recipes" />\n  <meta name="description" content="These recipes prove how many directions you can take a simple piece of chicken. Take full advantage of its tender meat with hot soups, cold salads, and easy oven-baked recipes. " />\n  <meta prope

## Scraping the recipe links

Notice our response is an html file. In fact, we can see this response already in our browser! 

If you're on Chrome, you can press Ctrl+Shift+I to open Developer Tools (Inspect mode) and you will see the exact HTML response in the Elements tab.

Now, this is where BeautifulSoup comes into play. The module pretty much makes it SUPER simple to parse through the entire html document.

From this page, we want to scrape the links that lead to the actual chicken recipes.

For example: https://www.seriouseats.com/recipes/2011/12/serious-eats-halal-cart-style-chicken-and-rice-white-sauce-recipe.html

After we get the links, we will then again scrape the recipe information and other relevant information.

Notice what happens if you hover over a certain element while you're in Inspect mode, in our case this element:

```<section class="block block-primary block-no-nav block-has-kicker" id="recipes">```

You'll see that this element pretty much acts like a container for all the recipe links. And it even has an "id" handy for us to identify it as "recipes"!

Let's grab this element and print it out:

In [23]:
# initialize BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")
# get the element with a certain id (find will get you the first element found)
recipes = soup.find(id="recipes")
print(recipes)

<section class="block block-primary block-no-nav block-has-kicker" id="recipes">
<header class="block__header">
<h2 class="block__title">Recipes</h2>
</header>
<div class="ui-toggle">
<span class="ui-toggle-prompt">Sort:</span>
<div class="ui-toggle-options self-clear">
<div class="ui-toggle-option active" data-toggle-option="popularity">
<span class="ui-toggle-option-text">Most Popular</span>
</div>
<div class="ui-toggle-option" data-toggle-option="latest">
<span class="ui-toggle-option-text">latest</span>
</div>
</div>
</div>
<div class="block__wrapper">
<div class="module" data-postid="40012">
<div class="module__wrapper">
<a class="module__image-container module__link" href="https://www.seriouseats.com/recipes/2011/12/serious-eats-halal-cart-style-chicken-and-rice-white-sauce-recipe.html" tabindex="-1">
<img class="module__image" data-ofi-src="https://www.seriouseats.com/recipes/images/2011/12/20111205-ctb-halal-chicken-rice-primary.jpg" data-pin-nopin="true" data-src="https://www.

If you look closely, you'll notice **all the links** are inside a certain HTML element. For example:

```<a class="module__image-container module__link" href="https://www.seriouseats.com/recipes/2011/12/serious-eats-halal-cart-style-chicken-and-rice-white-sauce-recipe.html" tabindex="-1">```

This element is an A-tag, with the classes "module__image-container" and "module__link". The information we want is the "href" inside said A-tag.
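Before running this on the real page, it can help to try the lookups on a tiny, made-up snippet. The sketch below uses invented HTML and `example.com` links (not the real page). It shows one subtlety: a class string containing a space is matched against the exact value of the class attribute, while a CSS selector via `select` matches any tag carrying both classes, in any order.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure of the recipe container
html = """
<section id="recipes">
  <a class="module__image-container module__link" href="https://example.com/1.html">img</a>
  <a class="module__link" href="https://example.com/1.html">title</a>
  <a class="module__image-container module__link" href="https://example.com/2.html">img</a>
  <a class="module__link module__image-container" href="https://example.com/3.html">img</a>
</section>
"""
soup = BeautifulSoup(html, "html.parser")
container = soup.find(id="recipes")  # first element with id="recipes"

# Exact match on the class attribute string: the reversed-order tag is skipped
exact = [a.get("href") for a in container.find_all(
    "a", {"class": "module__image-container module__link"})]

# CSS selector: any <a> that has both classes, regardless of order
css = [a.get("href") for a in container.select(
    "a.module__image-container.module__link")]

print(exact)
print(css)
```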

We can scrape the information like so:

In [24]:
# use a fresh loop variable so we don't overwrite the response object `r`
for link in recipes.find_all("a", {"class": "module__image-container module__link"}):
    print(link.get("href"))

https://www.seriouseats.com/recipes/2011/12/serious-eats-halal-cart-style-chicken-and-rice-white-sauce-recipe.html
https://www.seriouseats.com/recipes/2016/01/slow-cooker-thai-chicken-meatball-recipe.html
https://www.seriouseats.com/recipes/2015/07/the-food-lab-southern-fried-chicken-recipe.html
https://www.seriouseats.com/recipes/2011/12/gavthi-indian-village-chicken-curry-recipe-indian.html
https://www.seriouseats.com/recipes/2013/05/classic-chicken-adobo-adobo-road-cookbook.html
https://www.seriouseats.com/recipes/2016/08/oyakodon-japanese-chicken-and-egg-rice-bowl-recipe.html
https://www.seriouseats.com/recipes/2010/05/butterflied-roasted-chicken-with-quick-jus-recipe.html
https://www.seriouseats.com/recipes/2017/03/vietnamese-style-baked-chicken-recipe.html
https://www.seriouseats.com/recipes/2015/04/pressure-cooker-fast-and-easy-chicken-chile-verde-recipe.html
https://www.seriouseats.com/recipes/2014/10/traditional-french-cassoulet-recipe.html
https://www.seriouseats.com/recipes/

Yup :)

That easy. And now let's put it in a function.

In [26]:
def get_recipe_links(scheme, host, path, page_num):
    """
    Gets recipe links from the main index page with a pagination parameter
    
    Args:
        scheme: the scheme for the url
        host: the host for the url
        path: the path for the url
        page_num: the pagination number to scrape links from
    
    Returns:
        result: list of recipe links scraped from a given page
    
    """
    
    # declare our url template
    url = "{scheme}://{host}/{path}"
    # create the payload (query)
    payload = {"page": page_num}
    
    try:
        # send the request
        r = requests.get(url.format(scheme=scheme, host=host, path=path), params=payload)
    except requests.exceptions.RequestException as e:  # handle error gracefully if request fails
        print(e)
        sys.exit(1)

    # parse for relevant information
    soup = BeautifulSoup(r.text, "html.parser")
    recipes = soup.find(id="recipes")
    return [
        r.get("href") for r in recipes.find_all(
            "a", {"class": "module__image-container module__link"})
    ]

Test run:

In [30]:
# declare our url components
scheme = "https"
host = "www.seriouseats.com"
path = "recipes/topics/ingredient/meats-and-poultry/chicken"
# test call for page 1
get_recipe_links(scheme, host, path, 1)

['https://www.seriouseats.com/recipes/2011/12/serious-eats-halal-cart-style-chicken-and-rice-white-sauce-recipe.html',
 'https://www.seriouseats.com/recipes/2016/01/slow-cooker-thai-chicken-meatball-recipe.html',
 'https://www.seriouseats.com/recipes/2015/07/the-food-lab-southern-fried-chicken-recipe.html',
 'https://www.seriouseats.com/recipes/2011/12/gavthi-indian-village-chicken-curry-recipe-indian.html',
 'https://www.seriouseats.com/recipes/2013/05/classic-chicken-adobo-adobo-road-cookbook.html',
 'https://www.seriouseats.com/recipes/2016/08/oyakodon-japanese-chicken-and-egg-rice-bowl-recipe.html',
 'https://www.seriouseats.com/recipes/2010/05/butterflied-roasted-chicken-with-quick-jus-recipe.html',
 'https://www.seriouseats.com/recipes/2017/03/vietnamese-style-baked-chicken-recipe.html',
 'https://www.seriouseats.com/recipes/2015/04/pressure-cooker-fast-and-easy-chicken-chile-verde-recipe.html',
 'https://www.seriouseats.com/recipes/2014/10/traditional-french-cassoulet-recipe.htm
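One thing the function above does not guard against is a page without a `recipes` container: `soup.find` returns `None` when nothing matches, and calling `find_all` on `None` raises an `AttributeError`. As a rough sketch, the parsing step could be pulled into its own helper so it can be tried on plain strings without hitting the network. The name `extract_recipe_links` and the sample HTML are hypothetical, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def extract_recipe_links(html):
    """Return recipe hrefs from index-page HTML, or [] if the container is missing."""
    soup = BeautifulSoup(html, "html.parser")
    recipes = soup.find(id="recipes")
    if recipes is None:  # soup.find returns None when nothing matches
        return []
    anchors = recipes.find_all(
        "a", {"class": "module__image-container module__link"})
    return [a.get("href") for a in anchors]

# Made-up inputs: one page with a container, one without
sample = ('<section id="recipes">'
          '<a class="module__image-container module__link" '
          'href="https://example.com/a.html"></a></section>')
links_found = extract_recipe_links(sample)
links_missing = extract_recipe_links("<html><body>No recipes here</body></html>")

print(links_found)
print(links_missing)
```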

## Scraping the Actual Recipes

Now that we have the links, the next step is to scrape the **actual** recipes.

Our scraping process will be very similar to before:

1. Send out the request
2. Use the response text to initialize BeautifulSoup
3. Figure out which elements are relevant
4. Get the relevant information

Exercise: Try to figure this out on your own :P

In [35]:
def get_recipe(recipe_link):
    """
    Gets recipe title, ingredients, and directions
    
    Args:
        recipe_link: the link of the recipe to scrape from
    
    Returns (title, ing_cleaned, dir_cleaned):
        title: the title of the recipe
        ing_cleaned: the ingredients of the recipe separated by a " || " token
        dir_cleaned: the directions of the recipe
    """
    ing_result = []
    dir_result = []
    try:
        r = requests.get(recipe_link)
    except requests.exceptions.RequestException as e: # handle error gracefully if request fails
        print(e)
        sys.exit(1)
    
    # parse for relevant information
    soup = BeautifulSoup(r.text, "html.parser")
    title = soup.find("div", {"class": "content-main"}).get("data-title")

    ingredients = soup.find_all("li", {"class": "ingredient"})
    for i in ingredients:
        ing_result.extend(i.contents)

    directions = soup.find_all("div", {"class": "recipe-procedure-text"})
    for d in directions:
        dir_result.extend(d.contents)
    
    ing_cleaned = BeautifulSoup(" || ".join([str(r) for r in ing_result]), "html.parser").text.strip()
    dir_cleaned = BeautifulSoup(" ".join([str(r) for r in dir_result]), "html.parser").text.strip()
    return (title, ing_cleaned, dir_cleaned)
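To see what the join-and-reparse step does, here is a small sketch on a made-up ingredient list (the HTML is invented for illustration). Because `.contents` returns every child node, an ingredient containing a nested tag gets split into several pieces, each separated by the " || " token. Joining per `<li>` with `get_text()` is one possible alternative that keeps each ingredient in one piece.

```python
from bs4 import BeautifulSoup

# Invented HTML: the first ingredient contains a nested link
html = ('<ul><li class="ingredient">1 cup <a href="#">homemade</a> dashi</li>'
        '<li class="ingredient">2 eggs</li></ul>')
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("li", {"class": "ingredient"})

# Same approach as get_recipe: flatten child nodes, join, then strip the tags
nodes = []
for li in items:
    nodes.extend(li.contents)
token_style = BeautifulSoup(
    " || ".join(str(n) for n in nodes), "html.parser").text.strip()

# Alternative sketch: take the text of each <li> as a whole, then join
per_item = " || ".join(li.get_text() for li in items)

print(token_style)  # the nested link becomes its own " || "-separated piece
print(per_item)     # one piece per ingredient
```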

In [33]:
def create_recipe_dataframe(links):
    """
    Creates a recipe dataframe from the links provided
    
    Args:
        links: a list of recipe links
    Returns:
        result: a dataframe with 4 columns (TITLE, INGREDIENTS, DIRECTIONS, LINKS)
    
    """
    titles = []
    ingredients = []
    directions = []
    for l in links:
        title, ings, dirs = get_recipe(l)
        titles.append(title)
        ingredients.append(ings)
        directions.append(dirs)
        print("Finished scraping {}".format(l))
    return pd.DataFrame({"TITLE": titles, "INGREDIENTS": ingredients, "DIRECTIONS": directions, "LINKS": links})

Test run:

In [36]:
links = get_recipe_links(scheme, host, path, 1)
data = create_recipe_dataframe(links)

Finished scraping https://www.seriouseats.com/recipes/2011/12/serious-eats-halal-cart-style-chicken-and-rice-white-sauce-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2016/01/slow-cooker-thai-chicken-meatball-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2015/07/the-food-lab-southern-fried-chicken-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2011/12/gavthi-indian-village-chicken-curry-recipe-indian.html
Finished scraping https://www.seriouseats.com/recipes/2013/05/classic-chicken-adobo-adobo-road-cookbook.html
Finished scraping https://www.seriouseats.com/recipes/2016/08/oyakodon-japanese-chicken-and-egg-rice-bowl-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2010/05/butterflied-roasted-chicken-with-quick-jus-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2017/03/vietnamese-style-baked-chicken-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2015/04/pressure-cooker-fast-

Boom. Magic.

![](https://media.giphy.com/media/12NUbkX6p4xOO4/giphy.gif)

In [37]:
data

Unnamed: 0,TITLE,INGREDIENTS,DIRECTIONS,LINKS
0,Serious Eats' Halal Cart-Style Chicken and Ric...,For the chicken: || 2 tablespoons fresh lemon ...,"For the chicken: Combine the lemon juice, oreg...",https://www.seriouseats.com/recipes/2011/12/se...
1,Slow-Cooker Sticky Thai Meatballs Recipe,"4 stalks lemongrass, roughly chopped || 3 medi...","In the bowl of a food processor, combine lemon...",https://www.seriouseats.com/recipes/2016/01/sl...
2,The Best Buttermilk-Brined Southern Fried Chic...,2 tablespoons paprika || 2 tablespoons freshly...,"Combine the paprika, black pepper, garlic powd...",https://www.seriouseats.com/recipes/2015/07/th...
3,Indian Village (Gavthi) Chicken Curry Recipe,For the Chicken Marinade || : || 2 tablespoons...,To Marinate the Chicken: Mix ginger-garlic pas...,https://www.seriouseats.com/recipes/2011/12/ga...
4,Easy Filipino Chicken Adobo Recipe,1/4 cup (65 ml) soy sauce || 1/2 cup (125 ml) ...,"Place the soy sauce, vinegar, garlic, black pe...",https://www.seriouseats.com/recipes/2013/05/cl...
5,Oyakodon (Japanese Chicken and Egg Rice Bowl) ...,"1 cup (240ml) || homemade || dashi, or the e...","Combine dashi, soy sauce, sake, and sugar in a...",https://www.seriouseats.com/recipes/2016/08/oy...
6,Spatchcocked (Butterflied) Roast Chicken Recipe,"1 large chicken, about 4 to 5 pounds (1.8 to 2...",Place oven rack in upper-middle position and p...,https://www.seriouseats.com/recipes/2010/05/bu...
7,Vietnamese-Style Baked Chicken Recipe,"2 pounds (900g) bone-in, skin-on chicken thigh...",Place thighs in a large zipper-lock bag. In a ...,https://www.seriouseats.com/recipes/2017/03/vi...
8,Green Chili Chicken Pressure Cooker Recipe,3 pounds bone-in skin-on chicken thighs and dr...,"Combine chicken, tomatillos, poblano peppers, ...",https://www.seriouseats.com/recipes/2015/04/pr...
9,Traditional French Cassoulet Recipe,1 pound dried cannellini beans || Kosher salt ...,"In a large bowl, cover beans with 3 quarts wat...",https://www.seriouseats.com/recipes/2014/10/tr...


Now, let's try scraping all pages of links from page 1 to page 9...

In [38]:
# declare our url components
scheme = "https"
host = "www.seriouseats.com"
path = "recipes/topics/ingredient/meats-and-poultry/chicken"

links = []
for i in range(1, 10):
    links.extend(get_recipe_links(scheme, host, path, i))

data = create_recipe_dataframe(links)

Finished scraping https://www.seriouseats.com/recipes/2011/12/serious-eats-halal-cart-style-chicken-and-rice-white-sauce-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2016/01/slow-cooker-thai-chicken-meatball-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2015/07/the-food-lab-southern-fried-chicken-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2011/12/gavthi-indian-village-chicken-curry-recipe-indian.html
Finished scraping https://www.seriouseats.com/recipes/2013/05/classic-chicken-adobo-adobo-road-cookbook.html
Finished scraping https://www.seriouseats.com/recipes/2016/08/oyakodon-japanese-chicken-and-egg-rice-bowl-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2010/05/butterflied-roasted-chicken-with-quick-jus-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2017/03/vietnamese-style-baked-chicken-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2015/04/pressure-cooker-fast-

Finished scraping https://www.seriouseats.com/recipes/2014/04/chicken-tinga-tacos-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2014/03/kimchi-chicken-cabbage-stir-fry.html
Finished scraping https://www.seriouseats.com/recipes/2017/12/classic-chicken-soup.html
Finished scraping https://www.seriouseats.com/recipes/2012/10/sweet-soy-sauce-korean-fried-chicken-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2010/04/dinner-tonight-parisian-chicken-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2015/03/arroz-caldo-chicken-rice-soup-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2015/03/sticky-rice-wrapped-lotus-leaf-lo-mai-gai-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2012/05/colombian-chicken-stew-with-potatoes-tomato-onion-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2015/06/best-grilled-chicken-greek-style-lemon-garlic-olive-oil.html
Finished scraping https://www.serio

Finished scraping https://www.seriouseats.com/recipes/2019/04/samgyetang-korean-rice-stuffed-chicken-soup.html
Finished scraping https://www.seriouseats.com/recipes/2013/08/one-pot-wonders-saucy-ratatouille-with-chicken.html
Finished scraping https://www.seriouseats.com/recipes/2012/08/real-deal-tortilla-soup-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2014/07/smoky-pulled-barbecue-chicken-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2015/03/roasted-chicken-curry-soubise-onion-sauce-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2012/08/thomas-kellers-chicken-breasts-with-tarragon.html
Finished scraping https://www.seriouseats.com/recipes/2014/10/slow-cooker-chicken-tortilla-soup-all-the-fixings-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2008/05/beer-can-cola-can-dr-pepper-grilled-chicken-recipe.html
Finished scraping https://www.seriouseats.com/recipes/2012/03/thai-deep-fried-chicken-recipe.html
Finis
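The loop above hardcodes pages 1 to 9. If you don't know the page count up front, one option is to keep requesting pages until one comes back empty. The sketch below is only an illustration: `collect_links` and `fake_site` are hypothetical names, and the stand-in fetcher replaces the real scraper so the example runs offline.

```python
def collect_links(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page yields no links."""
    links = []
    for page_num in range(1, max_pages + 1):
        page_links = fetch_page(page_num)
        if not page_links:  # stop at the first empty page
            break
        links.extend(page_links)
    return links

# Stand-in for the real scraper: two pages of links, then nothing
fake_site = {1: ["a", "b"], 2: ["c"]}
all_links = collect_links(lambda n: fake_site.get(n, []))

print(all_links)  # ['a', 'b', 'c']
```

With the function from this tutorial, the call would look like `collect_links(lambda n: get_recipe_links(scheme, host, path, n))`. That assumes a page past the last one returns an empty list, which has not been verified against the site.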

In [40]:
data.tail()

Unnamed: 0,TITLE,INGREDIENTS,DIRECTIONS,LINKS
211,Tamales With Red Chili and Chicken Filling Recipe,"6 medium guajillo chiles, stemmed and seeded |...",Combine all the dried chiles in a large Dutch ...,https://www.seriouseats.com/recipes/2015/05/ta...
212,"Fried Chicken, Honey Butter, and Biscuit Sandw...",For the Biscuits || : || 3 cups (about 15 ounc...,For the Biscuits: Adjust oven rack to middle p...,https://www.seriouseats.com/recipes/2012/06/fr...
213,Broiled Tandoori-Style Chicken With Almonds an...,8 ounces (225g) yogurt || 2 ounces (60ml) fres...,"In a mixing bowl, stir together yogurt, 1 ounc...",https://www.seriouseats.com/recipes/2018/10/br...
214,Buffalo Wings Recipe | Grilling,"3 pounds chicken wings (18 wings), cut up || 1...","Mix together the cayenne, black pepper, and sa...",https://www.seriouseats.com/recipes/2009/07/gr...
215,Crispy Caramel Chicken Skewers Recipe,For the Marinated Chicken: || 2 1/2 pounds bon...,For the Marinated Chicken: In a large zipper-l...,https://www.seriouseats.com/recipes/2015/07/cr...


In [41]:
# save dataframe to a local csv file
data.to_csv("chicken.csv", index=False)

![](https://media.giphy.com/media/lD76yTC5zxZPG/giphy.gif)