# Scraping websites with BeautifulSoup

This is a basic walkthrough of harvesting data from websites that have information spread across multiple pages and behind links.

We'll be scraping the database of food recall warnings from the [Canadian Food Inspection Agency](http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/eng/1351519587174/1351519588221).

First, the set-up. This tutorial uses Python 2.7 and four packages that need to be installed:

* [requests](http://docs.python-requests.org/en/latest/) for HTTP
* [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing HTML
* [lxml](http://lxml.de/) for the `"lxml"` parser that BeautifulSoup is asked to use in the code below
* [unicodecsv](https://github.com/jdunck/python-unicodecsv) for saving the data to a CSV without worrying about Unicode errors

Install them from the command line with pip (package names are separated by spaces, not commas):

    $ pip install requests beautifulsoup4 lxml unicodecsv

## Study the site structure

First, we need to spend time on the site to understand how the website is built. What's the URL structure? What's the general DOM structure?

By poking around a bit and selecting a year in the top menu, we see that the site lists all warnings for that year. Good, less pagination work for us.

That year is also specified in the `ay=` parameter of the URL. If you change that parameter to another year, all the warnings for that year are loaded.

<img src="img/url.png">

This will help us cycle through the different years.
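As an aside, the same URL can be assembled from its query parameters rather than by gluing strings together. This is only a minimal sketch using the standard library. `BASE_URL` and `build_year_url` are names made up here for illustration, and the `fr`, `fc`, `fd` and `ft` values are copied as-is from the site's URL without knowing what they control.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

# Hypothetical constant: the listing URL without its query string
BASE_URL = ("http://www.inspection.gc.ca/about-the-cfia/newsroom/"
            "food-recall-warnings/complete-listing/eng/"
            "1351519587174/1351519588221")


def build_year_url(year):
    """Return the listing URL for one year (hypothetical helper)."""
    # A list of tuples keeps the parameters in the same order as on the site
    params = [("ay", year), ("fr", 0), ("fc", 0), ("fd", 0), ("ft", 1)]
    return BASE_URL + "?" + urlencode(params)


print(build_year_url(2015))
```

Only the `ay` value changes from year to year, which is what makes looping over a range of years possible.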

So let's set up our first task, which is cycling through the years.

In [3]:
import requests
from bs4 import BeautifulSoup

# Concatenate the URL up to 'ay=' with the year we want, then the tail end
url_head = "http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/eng/1351519587174/1351519588221?ay="
url_tail = "&fr=0&fc=0&fd=0&ft=1"

def cycle_years(start_year, end_year):

    for year in range(start_year, end_year + 1):
        r = requests.get(url_head + str(year) + url_tail)
        # Load the raw HTML into a BeautifulSoup object
        soup = BeautifulSoup(r.content, "lxml")

Let's leave this for now and come back to it later. 

We'll first look at how BeautifulSoup works at parsing a DOM. Let's load just the index page for 2015 recalls and explore it.

In [4]:
r = requests.get("http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/eng/1351519587174/1351519588221?ay=2015&fr=0&fc=0&fd=0&ft=1")
soup = BeautifulSoup(r.content, "lxml")

In [8]:
# The variable soup is a BeautifulSoup object containing the entire HTML document.

# Use prettify() to pretty-print the document
print soup.prettify()[:1000]

<!DOCTYPE html>
<!--[if lt IE 9]><html class="no-js lt-ie9" lang="en" dir="ltr"><![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" dir="ltr" lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Complete listing of all recalls and allergy alerts  - About the Canadian Food Inspection Agency - Canadian Food Inspection Agency
  </title>
  <meta content="Complete listing of all recalls and allergy alerts" name="dcterms.title"/>
  <meta content="Canadian Food Inspection Agency" name="description"/>
  <meta content="Canadian Food Inspection Agency" name="dcterms.description"/>
  <meta content="Canadian Food Inspection Agency" name="keywords"/>
  <meta content="inspection" name="dcterms.subject" title="gccore"/>
  <meta content="Government of Canada,Canadian Food Inspection Agency" name="dcterms.creator"/>
  <meta content="eng" name="dcterms.language" title="ISO639-2"/>
  <meta content="2012-10-

In [31]:
# It's a big mess, but we can start navigating and searching the DOM for the elements we want.
# You can access some nodes directly:

soup.head

<head>\n<meta charset="unicode-escape"/>\n<meta content="width=device-width, initial-scale=1" name="viewport"/>\n<title>Complete listing of all recalls and allergy alerts  - About the Canadian Food Inspection Agency - Canadian Food Inspection Agency</title>\n<meta content="Complete listing of all recalls and allergy alerts" name="dcterms.title"/>\n<meta content="Canadian Food Inspection Agency" name="description"/>\n<meta content="Canadian Food Inspection Agency" name="dcterms.description"/>\n<meta content="Canadian Food Inspection Agency" name="keywords"/>\n<meta content="inspection" name="dcterms.subject" title="gccore"/>\n<meta content="Government of Canada,Canadian Food Inspection Agency" name="dcterms.creator"/>\n<meta content="eng" name="dcterms.language" title="ISO639-2"/>\n<meta content="2012-10-29 10:06:27" name="dcterms.issued" title="W3CDTF"/>\n<meta content="2013-09-24" name="dcterms.modified" title="W3CDTF"/>\n<!--[if gte IE 9 | !IE ]><!-->\n<link href="/DAM/PresentationSe

In [32]:
# Another example:

soup.title

<title>Complete listing of all recalls and allergy alerts  - About the Canadian Food Inspection Agency - Canadian Food Inspection Agency</title>

In [55]:
# We can keep going deeper, and really target the node we want. 

soup.head.link

<link href="/DAM/PresentationService/WET/4.0.19/theme-gcwu-fegc/assets/favicon.ico" rel="icon" type="image/x-icon"/>

In [58]:
# There are several link nodes, but BeautifulSoup returns the first one it finds.
# What if we want the link to the CSS stylesheet?

soup.head.find("link", {"rel" : "stylesheet"})

<link href="/DAM/PresentationService/WET/4.0.19/theme-gcwu-fegc/css/theme.min.css" rel="stylesheet"/>

In [9]:
# Let's say we want to get all the hyperlinks into a list:

soup.find_all("a")[:20]

[<a class="wb-sl" href="#wb-cont">Skip to main content</a>,
 <a class="wb-sl" href="#wb-info">Skip to "About this site"</a>,
 <a class="wb-sl" href="#wb-sec">Skip to section menu</a>,
 <a href="http://www.canada.ca/en/index.html" rel="external">Canada.ca</a>,
 <a href="http://www.canada.ca/en/services/index.html" rel="external">Services</a>,
 <a href="http://www.canada.ca/en/government/dept/index.html" rel="external">Departments</a>,
 <a href="/au-sujet-de-l-acia/salle-de-nouvelles/avis-de-rappel-d-aliments/liste-complete/fra/1351519587174/1351519588221?ay=2015&amp;fr=0&amp;fc=0&amp;fd=0&amp;ft=1" lang="fr">Fran\xe7ais</a>,
 <a aria-controls="mb-pnl" class="overlay-lnk btn btn-sm btn-default" href="#mb-pnl" role="button" title="Search and menus"><span class="glyphicon glyphicon-search"><span class="glyphicon glyphicon-th-list"><span class="wb-inv">Search and menus</span></span></span></a>,
 <a href="/eng/1297964599443/1297965645317">Canadian Food Inspection Agency</a>,
 <a class="item"

Nice. But we don't want ALL the links, just the ones that go to the details of the recalls. By inspecting the page with Chrome's developer tools, we see that those links are inside a table with the classes "table table-striped table-hover", and then inside the `<tbody>` node.

<img src="img/link_table.png">

So let's get those.

In [21]:
links_table = soup.find("table", class_ = "table table-striped table-hover")

links_table.prettify()[:1000]



In [28]:
# links_table is a bs4 Tag object, which has the same search and navigation methods as soup.
# We can inspect its classes, for instance

links_table["class"]

['table', 'table-striped', 'table-hover']
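The list above hints at how BeautifulSoup treats `class`: it is a multi-valued attribute. Searching for a single class name matches any element that carries it, while searching for the full string only matches that exact attribute value, in that order. The sketch below illustrates this on a made-up HTML snippet (not the real CFIA page) and uses the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the same three classes as the CFIA table, in a different order
html = """
<table class="table-hover table table-striped">
  <tbody><tr><td><a href="/recall-1">Recall 1</a></td></tr></tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Full-string search: fails here because the class order differs
exact = soup.find("table", class_="table table-striped table-hover")

# Single-class search: matches regardless of the other classes
single = soup.find("table", class_="table-striped")

# CSS selector: requires all three classes, in any order
hrefs = [a["href"] for a in
         soup.select("table.table.table-striped.table-hover tbody a")]

print(exact)
print(single["class"])
print(hrefs)
```

If the site ever reorders its classes, the single-class search or the CSS selector would keep working where the full-string search would return `None`.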

In [29]:
# We can see how many children nodes it has

len(links_table.contents)

7

In [13]:
# Now that we isolated the table with the links we want, let's get just the links.
links = links_table.find_all("a")
links[:10]



Great, we can start accessing each recall page. We just want the `href` attribute of the link, so we'll use BeautifulSoup to isolate it and pass it to requests.

In [None]:
for link in links:
    href = link.get("href")
    r_details = requests.get("http://www.inspection.gc.ca/" + href)
    # Name the parser explicitly, as in the earlier cells
    soup_details = BeautifulSoup(r_details.content, "lxml")
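Gluing the base URL and the `href` together works as long as every link has the same shape. A slightly more forgiving approach, sketched here with the standard library's `urljoin`, resolves the link against the base whether or not it starts with a slash. `BASE` and `absolute_url` are hypothetical names, and the second and third sample links are assumptions used to show the behaviour, not links known to be in the recall table.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

BASE = "http://www.inspection.gc.ca/"


def absolute_url(href):
    """Resolve a link found on the site against the site root."""
    return urljoin(BASE, href)


# Site-relative link, as seen in the find_all("a") output earlier
print(absolute_url("/eng/1297964599443/1297965645317"))
# Same path without the leading slash (assumed case)
print(absolute_url("eng/1297964599443/1297965645317"))
# An already-absolute link is returned unchanged
print(absolute_url("http://www.canada.ca/en/index.html"))
```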

OK, we got all the links to all recall details for a given year.

Now we need to study the pages for each recall to see what data we want to extract from it.

For this exercise, all I want are the details on the top of the page: date, reason for recall, hazard class, the company name, and where the product was distributed.

Again, inspecting that element shows it's in a `<dl>` element with a class of "dl-horizontal". The details we want are in child `<dd>` nodes.

<img src="img/details.png">



Again, let's load a single recall page to see how it works.

In [12]:
r = requests.get("http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/2014-03-31/eng/1396319875632/1396319886479")
soup = BeautifulSoup(r.content, "lxml")

# Right away we can isolate the dl and find all the dd's
dl = soup.find("dl", class_="dl-horizontal")
details = dl.find_all("dd")
details

[<dd class="mrgn-bttm-sm">March 31, 2014</dd>,
 <dd class="mrgn-bttm-sm">Allergen - Milk</dd>,
 <dd class="mrgn-bttm-sm">Class 1</dd>,
 <dd class="mrgn-bttm-sm">Altra Foods <abbr title="Incorporated">Inc.</abbr>, Candy &amp; Chocolate Creations, Vancouver Judaica Group</dd>,
 <dd class="mrgn-bttm-sm">Possibly National</dd>,
 <dd class="mrgn-bttm-sm">Consumer</dd>,
 <dd class="mrgn-bttm-sm">8747, 8755, 8760</dd>]

Beautiful. Let's get the text content of each up to the line we care about (distribution). Look how easy it is:

In [14]:
for item in details[:5]:
    print item.text

# Again, item is a Tag object with all the same methods available

March 31, 2014
Allergen - Milk
Class 1
Altra Foods Inc., Candy & Chocolate Creations, Vancouver Judaica Group
Possibly National
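Slicing by position assumes every recall page lists its fields in the same order. A definition list normally pairs each `<dd>` with a `<dt>` label, so one alternative is to key the values by label. The sketch below assumes the page has such labels. The label text in this made-up snippet is invented for illustration, so check the real page before relying on it.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the <dt> labels are assumptions, the <dd> values come from
# the output above
html = """
<dl class="dl-horizontal">
  <dt>Recall date:</dt><dd class="mrgn-bttm-sm">March 31, 2014</dd>
  <dt>Reason for recall:</dt><dd class="mrgn-bttm-sm">Allergen - Milk</dd>
  <dt>Hazard classification:</dt><dd class="mrgn-bttm-sm">Class 1</dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")
dl = soup.find("dl", class_="dl-horizontal")

details_by_label = {}
for dt in dl.find_all("dt"):
    # The value is the first <dd> that follows this label
    dd = dt.find_next_sibling("dd")
    if dd is not None:
        label = dt.get_text().strip().rstrip(":")
        details_by_label[label] = dd.get_text().strip()

print(details_by_label["Reason for recall"])
```

The trade-off is that this depends on the label wording staying stable, whereas the positional approach depends on the field order staying stable.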


Now we can save each item to a variable and store it in the database of choice. I prefer CSVs. So putting everything together:

In [49]:
# I use unicodecsv instead of csv because it handles unicode 
# without having to worry about encoding and decoding.
# It's especially useful in Quebec where words have diacritics like é

import requests
from bs4 import BeautifulSoup
import unicodecsv
from datetime import datetime
import time

# Concatenate the URL up to 'ay=' with the year we want, then the tail end
url_head = "http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/eng/1351519587174/1351519588221?ay="
url_tail = "&fr=0&fc=0&fd=0&ft=1"

def cycle_years(start_year, end_year):
    for year in range(start_year, end_year + 1):
        print "Getting data for year {}".format(str(year))
        r = requests.get(url_head + str(year) + url_tail)
        # Load the raw HTML into a BeautifulSoup object
        soup = BeautifulSoup(r.content, "lxml")
        links_table = soup.find("table", class_ = "table table-striped table-hover")
        links = links_table.find_all("a")
        for link in links:
            href = link.get("href")
            scrape_details(href)
            time.sleep(1)

def scrape_details(href):
    r = requests.get("http://www.inspection.gc.ca/" + href)
    soup = BeautifulSoup(r.content, "lxml")
    dl = soup.find("dl", class_="dl-horizontal")
    details = dl.find_all("dd")
    # Convert text date into a Python datetime object
    date = datetime.strptime(details[0].text, "%B %d, %Y")
    reason = details[1].text
    recall_class = details[2].text
    company = details[3].text
    distribution = details[4].text
    print "   Recall date: {}".format(date.strftime("%d/%m/%Y"))
    writer.writerow([date, reason, recall_class, company, distribution])
    

# unicodecsv writes encoded bytes, so open the file in binary mode
f = open("recall_data.csv", "wb")
writer = unicodecsv.writer(f)
# Write the column headers in the CSV
writer.writerow(["date", "reason", "class", "company", "distribution"])
cycle_years(2013, 2016)
f.close()

Getting data for year 2013
   Recall date: 18/12/2013
   Recall date: 19/12/2013
   Recall date: 10/05/2013
   Recall date: 16/08/2013
   Recall date: 16/05/2013
   Recall date: 29/05/2013
   Recall date: 23/05/2013
   Recall date: 22/05/2013
   Recall date: 21/05/2013
   Recall date: 10/12/2013
   Recall date: 09/12/2013
   Recall date: 06/12/2013
   Recall date: 06/12/2013
   Recall date: 18/11/2013
   Recall date: 15/11/2013
   Recall date: 14/11/2013


KeyboardInterrupt: 
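A long scrape like this is easier to live with if one odd page can't stop the whole run. As a rough sketch, the parsing step can be split into a function that takes raw HTML and returns either a row or `None`. `parse_recall` is a hypothetical name, and the sample HTML is rebuilt from the `<dd>` output shown earlier rather than fetched from the site. It uses `html.parser` so it runs without lxml.

```python
from datetime import datetime
from bs4 import BeautifulSoup


def parse_recall(html):
    """Return [date, reason, class, company, distribution], or None."""
    soup = BeautifulSoup(html, "html.parser")
    dl = soup.find("dl", class_="dl-horizontal")
    if dl is None:
        return None  # page has no details block
    fields = [dd.get_text().strip() for dd in dl.find_all("dd")]
    if len(fields) < 5:
        return None  # fewer fields than expected
    try:
        date = datetime.strptime(fields[0], "%B %d, %Y").date()
    except ValueError:
        return None  # date is not in "Month day, year" form
    return [date.isoformat()] + fields[1:5]


sample = """
<dl class="dl-horizontal">
  <dd class="mrgn-bttm-sm">March 31, 2014</dd>
  <dd class="mrgn-bttm-sm">Allergen - Milk</dd>
  <dd class="mrgn-bttm-sm">Class 1</dd>
  <dd class="mrgn-bttm-sm">Altra Foods <abbr title="Incorporated">Inc.</abbr>, Candy &amp; Chocolate Creations, Vancouver Judaica Group</dd>
  <dd class="mrgn-bttm-sm">Possibly National</dd>
</dl>
"""

print(parse_recall(sample))
print(parse_recall("<p>Page not found</p>"))
```

A caller could skip any page for which this returns `None` and carry on with the next link. Note that `%B` matches English month names only when the process runs in an English or C locale.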