# Scraping websites with BeautifulSoup

This is a basic walkthrough of harvesting data from websites that have information spread across multiple pages and inside links.

We'll be scraping the database of food recall warnings from the [Canadian Food Inspection Agency](http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/eng/1351519587174/1351519588221).

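First, the set-up. This tutorial uses Python 2.7 and four modules that need to be installed:

* [requests](http://docs.python-requests.org/en/latest/) for HTTP
* [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) for parsing HTML
* [lxml](https://lxml.de/), the parser we hand to BeautifulSoup
* [pandas](https://pandas.pydata.org/) for storing the results in a data frame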


## Mind your ethics

Scraping websites carries a few ethical questions: does the site's terms of use forbid scraping? Is it copyright infringement to republish the data? Are you punishing the servers with too many requests?

Web scraping sits in a moral gray area. On one hand, you are harvesting data that is not private and is already available to you through the web page. On the other, you are creating traffic on the website's server without compensating the owner by going along with the site's business model (for example, viewing the ads it shows).
So please keep these points in mind, and don't be surprised if the site owner tries to block your scraper :)
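
One practical way to act on these points is to read a site's `robots.txt` before scraping it. Here is a minimal sketch using the standard library's `urllib.robotparser` (Python 3; in Python 2.7 the module is named `robotparser`). The robots.txt content, the `my-scraper` user agent and the `example.com` URLs are made up so the example runs offline; a real scraper would load the file from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is an assumed user-agent name for this sketch.
allowed = parser.can_fetch("my-scraper", "http://example.com/recalls/2015")
blocked = parser.can_fetch("my-scraper", "http://example.com/private/data")
delay = parser.crawl_delay("my-scraper")  # seconds to wait between requests, if stated

print(allowed, blocked, delay)
```

If a crawl delay is stated, it is a reasonable value to pass to `time.sleep()` between requests.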

## Study the site structure

First, we need to spend time on the site to understand how it is built. What's the URL structure? What's the general hierarchical structure of the web page?

By poking around a bit and selecting a year in the top menu, we see that the site lists all warnings for that year. Good, less pagination work for us.

That year is also specified in the '`ay=`' parameter in the URL. And if you change that parameter to another year, all the warnings for that year are loaded.

<img src="img/url.png">

This will help us cycle through the different years.
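
To make the role of the `ay=` parameter concrete, here is a small sketch that builds the listing URL for any year with the standard library (Python 3's `urllib.parse`). The other parameters (`fr`, `fc`, `fd`, `ft`) are simply copied from the URL as it appears in the browser; the helper name `year_url` is made up for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = ("http://www.inspection.gc.ca/about-the-cfia/newsroom/"
            "food-recall-warnings/complete-listing/eng/1351519587174/1351519588221")

def year_url(year):
    """Return the listing URL for one year, keeping the other parameters fixed."""
    params = [("ay", year), ("fr", 0), ("fc", 0), ("fd", 0), ("ft", 1)]
    return BASE_URL + "?" + urlencode(params)

# One URL per year; nothing is requested yet.
urls = [year_url(y) for y in range(2013, 2016)]

for url in urls:
    # Read the year back out of the query string to check the URL.
    print(parse_qs(urlsplit(url).query)["ay"][0], url)
```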

So let's set up our first task, which is cycling through the years.

In [1]:
import requests
from bs4 import BeautifulSoup

# Concatenate the URL up to 'ay=' with the year we want, then the tail end
url_head = "http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/eng/1351519587174/1351519588221?ay="
url_tail = "&fr=0&fc=0&fd=0&ft=1"

def cycle_years(start_year, end_year):

    for year in range(start_year, end_year + 1):
        r = requests.get(url_head + str(year) + url_tail)          

Let's leave this for now and come back to it later. 

We'll first look at how BeautifulSoup works at parsing a DOM. Let's load just the index page for 2015 recalls and explore it.

In [2]:
r = requests.get("http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/eng/1351519587174/1351519588221?ay=2015&fr=0&fc=0&fd=0&ft=1")
soup = BeautifulSoup(r.content, "lxml")

In [5]:
# The variable soup is a BeautifulSoup object containing the entire HTML document.

# Use prettify() to pretty-print the document
print(soup.prettify()[:1000])

<!DOCTYPE html>
<!--[if lt IE 9]><html class="no-js lt-ie9" lang="en" dir="ltr"><![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" dir="ltr" lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Complete listing of all recalls and allergy alerts  - Canadian Food Inspection Agency
  </title>
  <meta content="Complete listing of all recalls and allergy alerts" name="dcterms.title"/>
  <meta content="Canadian Food Inspection Agency" name="description"/>
  <meta content="Canadian Food Inspection Agency" name="dcterms.description"/>
  <meta content="Canadian Food Inspection Agency" name="keywords"/>
  <meta content="inspection" name="dcterms.subject" title="gccore"/>
  <meta content="Government of Canada,Canadian Food Inspection Agency" name="dcterms.creator"/>
  <meta content="eng" name="dcterms.language" title="ISO639-2"/>
  <meta content="2012-10-29 10:06:27" name="dcterms.issued" title="W3

In [32]:
# we can access some elements directly (for example page title)

soup.title

<title>Complete listing of all recalls and allergy alerts  - About the Canadian Food Inspection Agency - Canadian Food Inspection Agency</title>

In [55]:
# We can keep going deeper, and really target the node we want. 

soup.head.link

<link href="/DAM/PresentationService/WET/4.0.19/theme-gcwu-fegc/assets/favicon.ico" rel="icon" type="image/x-icon"/>

In [7]:
# Let's say we want to get all the hyperlinks into a list:

soup.find_all("a")[:20]

[<a class="wb-sl" href="#wb-cont">Skip to main content</a>,
 <a class="wb-sl" href="#wb-info">Skip to "About this site"</a>,
 <a class="wb-sl" href="#wb-sec">Skip to section menu</a>,
 <a href="https://www.canada.ca/en.html" rel="external">Canada.ca</a>,
 <a href="https://www.canada.ca/en/services.html" rel="external">Services</a>,
 <a href="https://www.canada.ca/en/government/dept.html" rel="external">Departments</a>,
 <a href="/au-sujet-de-l-acia/salle-de-nouvelles/avis-de-rappel-d-aliments/liste-complete/fra/1351519587174/1351519588221?ay=2015&amp;fr=0&amp;fc=0&amp;fd=0&amp;ft=1" lang="fr">Français</a>,
 <a aria-controls="mb-pnl" class="overlay-lnk btn btn-sm btn-default" href="#mb-pnl" role="button" title="Search and menus"><span class="glyphicon glyphicon-search"><span class="glyphicon glyphicon-th-list"><span class="wb-inv">Search and menus</span></span></span></a>,
 <a href="/eng/1297964599443/1297965645317">Canadian Food Inspection Agency</a>,
 <a class="item" href="#aboutthecf

Nice. But we don't want ALL the links, just the ones that go to the details of the recalls. By inspecting the page with Chrome's developer tools, we see that those links are inside a table with classes "table table-striped table-hover" and then inside the `<tbody>` node.

<img src="img/link_table.png">

So let's get those.

In [30]:
links_table = soup.find("table", class_ = "table table-striped table-hover")

print(links_table.prettify()[:500])

<table class="table table-striped table-hover">
 <caption class="text-left mrgn-bttm-sm">
 </caption>
 <thead>
  <tr>
   <th>
    Posted
   </th>
   <th>
    Recall
   </th>
   <th>
    Class
   </th>
   <th>
    Distribution
   </th>
  </tr>
 </thead>
 <tbody>
  <!-- WCMS:RECALL-LIST-ITEM BEGIN -->
  <tr>
   <td>
    2015-12-30
   </td>
   <td>


In [9]:
# links_table is a Tag object, and it supports the same search methods as soup.
# We can inspect its classes, for instance

links_table["class"]

['table', 'table-striped', 'table-hover']
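
The list above shows that `class` is a multi-valued attribute. Passing the full string to `class_` only matches when the attribute is written in exactly that order, so a CSS selector can be a more forgiving alternative. A minimal sketch on made-up markup (the HTML and the `/recall/...` links are hypothetical, and it uses the built-in `html.parser` so it runs without lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same three classes as the recall table, in a different order.
html = """
<table class="table-hover table table-striped">
  <tbody>
    <tr><td><a href="/recall/1">First recall</a></td></tr>
    <tr><td><a href="/recall/2">Second recall</a></td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The exact-string match depends on the order of the class attribute.
exact = soup.find("table", class_="table table-striped table-hover")

# A CSS selector matches any table carrying all three classes.
by_selector = soup.select_one("table.table.table-striped.table-hover")

# Collect only the links inside the table body.
hrefs = [a.get("href") for a in by_selector.select("tbody a")]

print(exact is None, hrefs)
```

On the real page the exact-string form works as shown in the cell above; the selector form is just less sensitive to changes in class order.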

In [10]:
# We can see how many child nodes it has

len(links_table.contents)

7

In [11]:
# Now that we isolated the table with the links we want, let's get just the links.
links = links_table.find_all("a")
links[:10]



Great, we can start accessing each recall page. We just want the `href` attribute of the link, so we'll use BeautifulSoup to isolate it and pass it to requests.

In [13]:
for link in links:
    href = link.get("href")
    r_details = requests.get("http://www.inspection.gc.ca/" + href)
    soup_details = BeautifulSoup(r_details.content, "lxml")

OK, we got all the links to all recall details for a given year.
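
One detail about the loop above: if the `href` values start with a slash, as the other site links in the earlier output do, plain concatenation produces a double slash after the domain. The standard library's `urljoin` (Python 3's `urllib.parse`) handles that case, and also leaves absolute links untouched. A short sketch, with hypothetical `href` values:

```python
from urllib.parse import urljoin

BASE = "http://www.inspection.gc.ca/"

# Hypothetical href values in the two shapes a page might use.
root_relative = "/about-the-cfia/newsroom/food-recall-warnings/example"
absolute = "https://www.canada.ca/en.html"

concatenated = BASE + root_relative        # keeps both slashes
joined = urljoin(BASE, root_relative)      # resolves against the site root
joined_absolute = urljoin(BASE, absolute)  # already absolute, returned as-is

print(concatenated)
print(joined)
print(joined_absolute)
```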

Now we need to study the pages for each recall to see what data we want to extract from it.

For this exercise, all I want are the details on the top of the page: date, reason for recall, hazard class, the company name, and where the product was distributed.

Again, inspecting that element shows it's in a `<dl>` element with a class of "dl-horizontal". The details we want are in child `<dd>` nodes.

<img src="img/details.png">



Again, let's load a single recall page to see how it works.

In [12]:
r = requests.get("http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/2014-03-31/eng/1396319875632/1396319886479")
soup = BeautifulSoup(r.content, "lxml")

# Right away we can isolate the dl and find all the dd's
dl = soup.find("dl", class_="dl-horizontal")
details = dl.find_all("dd")
details

[<dd class="mrgn-bttm-sm">March 31, 2014</dd>,
 <dd class="mrgn-bttm-sm">Allergen - Milk</dd>,
 <dd class="mrgn-bttm-sm">Class 1</dd>,
 <dd class="mrgn-bttm-sm">Altra Foods <abbr title="Incorporated">Inc.</abbr>, Candy &amp; Chocolate Creations, Vancouver Judaica Group</dd>,
 <dd class="mrgn-bttm-sm">Possibly National</dd>,
 <dd class="mrgn-bttm-sm">Consumer</dd>,
 <dd class="mrgn-bttm-sm">8747, 8755, 8760</dd>]

Beautiful. Let's get the text content of each up to the line we care about (distribution). Look how easy it is:

In [14]:
for item in details[:5]:
    print(item.text)

# Again, item is a Tag object with all the same methods available

March 31, 2014
Allergen - Milk
Class 1
Altra Foods Inc., Candy & Chocolate Creations, Vancouver Judaica Group
Possibly National
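
Picking values by position works as long as every recall page lists its details in the same order. If the `<dl>` also contains `<dt>` labels (an assumption here, since only the `<dd>` nodes are shown above), pairing each label with its value is a sturdier option. A sketch on made-up markup; the label texts are hypothetical and `html.parser` is used so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the <dd> values above; the <dt> labels are assumed.
html = """
<dl class="dl-horizontal">
  <dt>Recall date:</dt><dd class="mrgn-bttm-sm">March 31, 2014</dd>
  <dt>Reason for recall:</dt><dd class="mrgn-bttm-sm">Allergen - Milk</dd>
  <dt>Hazard classification:</dt><dd class="mrgn-bttm-sm">Class 1</dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")
dl = soup.find("dl", class_="dl-horizontal")

record = {}
for dt in dl.find_all("dt"):
    # The value for a label is the next <dd> sibling after it.
    dd = dt.find_next_sibling("dd")
    if dd is not None:
        label = dt.get_text(strip=True).rstrip(":")
        record[label] = dd.get_text(strip=True)

print(record)
```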


Now we can save each item to a variable and store it in the database of choice. I prefer CSVs. So putting everything together:

In [24]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import pandas as pd

# Concatenate the URL up to 'ay=' with the year we want, then the tail end
url_head = "http://www.inspection.gc.ca/about-the-cfia/newsroom/food-recall-warnings/complete-listing/eng/1351519587174/1351519588221?ay="
url_tail = "&fr=0&fc=0&fd=0&ft=1"

def cycle_years(start_year, end_year, doc):
    for year in range(start_year, end_year + 1):
        print("Getting data for year {}".format(str(year)))
        r = requests.get(url_head + str(year) + url_tail)
        # Load the raw HTML into a BeautifulSoup object
        soup = BeautifulSoup(r.content, "lxml")
        links_table = soup.find("table", class_ = "table table-striped table-hover")
        links = links_table.find_all("a")
        for link in links:
            href = link.get("href")
            scrape_details(href, doc)
            time.sleep(1)

def scrape_details(href, doc):
    r = requests.get("http://www.inspection.gc.ca/" + href)
    soup = BeautifulSoup(r.content, "lxml")
    dl = soup.find("dl", class_="dl-horizontal")
    details = dl.find_all("dd")
    # Convert text date into a Python datetime object
    date = datetime.strptime(details[0].text, "%B %d, %Y")
    doc['date'].append(date)
    doc['reason'].append(details[1].text)
    doc['recall_class'].append(details[2].text)
    doc['company'].append(details[3].text)
    doc['distribution'].append(details[4].text)
    print("   Recall date: {}".format(date.strftime("%d/%m/%Y")))
    

# initializing empty doc to collect data
doc = {
    'date':[],
    'reason':[],
    'recall_class':[],
    'company':[],
    'distribution':[]
}

cycle_years(2018, 2019, doc)



Getting data for year 2018
   Recall date: 31/12/2018
   Recall date: 24/12/2018
   Recall date: 20/12/2018
   Recall date: 15/12/2018
   Recall date: 14/12/2018
   Recall date: 10/12/2018
   Recall date: 05/12/2018
   Recall date: 04/12/2018
   Recall date: 29/11/2018
   Recall date: 28/11/2018
   Recall date: 21/11/2018
   Recall date: 21/11/2018
   Recall date: 16/11/2018
   Recall date: 16/11/2018
   Recall date: 14/11/2018
   Recall date: 09/11/2018
   Recall date: 08/11/2018
   Recall date: 02/11/2018
   Recall date: 31/10/2018
   Recall date: 26/10/2018
   Recall date: 24/10/2018
   Recall date: 22/10/2018
   Recall date: 22/10/2018
   Recall date: 19/10/2018
   Recall date: 18/10/2018
   Recall date: 17/10/2018
   Recall date: 15/10/2018
   Recall date: 12/10/2018
   Recall date: 11/10/2018
   Recall date: 09/10/2018
   Recall date: 04/10/2018
   Recall date: 03/10/2018
   Recall date: 02/10/2018
   Recall date: 28/09/2018
   Recall date: 28/09/2018
   Recall date: 27/09/2018
 

KeyboardInterrupt: 
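
The run above was stopped by hand, but the rows scraped before the interruption are not lost: `doc` is filled in place, one `append` at a time. Here is a toy sketch of that behaviour. The `collect` and `fake_fetch` functions and the `/recall/...` links are made up, and the fake fetch raises `KeyboardInterrupt` itself to stand in for pressing stop.

```python
def collect(hrefs, fetch, doc):
    """Call fetch(href) for each link and append the result to doc in place."""
    try:
        for href in hrefs:
            doc["reason"].append(fetch(href))
    except KeyboardInterrupt:
        # Everything appended before the interruption is still in doc.
        print("Interrupted after {} rows".format(len(doc["reason"])))
    return doc

def fake_fetch(href):
    """Stand-in for a real request; simulates an interruption on the third link."""
    if href == "/recall/3":
        raise KeyboardInterrupt
    return "reason for " + href

partial = {"reason": []}
collect(["/recall/1", "/recall/2", "/recall/3", "/recall/4"], fake_fetch, partial)
print(partial["reason"])
```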

| Unnamed: 0 | company | date | distribution | reason | recall_class |
|---|---|---|---|---|---|
| 0 | Cape Breton Fudge Co. | 2018-12-31 | Nova Scotia | Allergen - Tree Nut | Class 1 |
| 1 | Buy Low Foods Ltd., Loblaw Companies Limited, ... | 2018-12-24 | National | Microbiological - Listeria | Class 1 |
| 2 | Bulk Barn Foods Limited | 2018-12-20 | National | Allergen - Milk | Class 1 |
| 3 | Courchesne Larose Ltée, Fruits et Légumes Gaét... | 2018-12-15 | New Brunswick, Newfoundland and Labrador, Nova... | Microbiological - E. coli O157:H7 | Class 1 |
| 4 | Buy Low Foods Ltd., Loblaw Companies Limited, ... | 2018-12-14 | National | Microbiological - Listeria | Class 1 |


Let's save the scraped data to a pandas data frame:

In [None]:
df = pd.DataFrame(doc)

df.head()
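
Since CSV was mentioned earlier as the storage format of choice, here is one way the collected columns could be written out using only the standard library. It is a sketch: the small `sample_doc` dictionary stands in for the real `doc` (its values are copied from the rows shown above), `write_csv` is a made-up helper name, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io

# Small stand-in for the doc dictionary built by the scraper.
sample_doc = {
    "date": ["2018-12-31", "2018-12-20"],
    "reason": ["Allergen - Tree Nut", "Allergen - Milk"],
    "recall_class": ["Class 1", "Class 1"],
    "company": ["Cape Breton Fudge Co.", "Bulk Barn Foods Limited"],
    "distribution": ["Nova Scotia", "National"],
}

def write_csv(columns_dict, handle):
    """Write a dict of equal-length columns to handle as CSV rows."""
    columns = list(columns_dict)
    writer = csv.writer(handle)
    writer.writerow(columns)
    # zip(*) turns the column lists into one tuple per row.
    writer.writerows(zip(*(columns_dict[c] for c in columns)))

buffer = io.StringIO()
write_csv(sample_doc, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open("recalls.csv", "w", newline="")` (the filename is just an example). With the data frame already built, `df.to_csv("recalls.csv", index=False)` does the same in one call.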

We successfully scraped data from the web! Now we can get to the fun part!