## X02. Web Scraping with Python

Web scraping (also called web harvesting or web data extraction) is the process of extracting data from a website and transforming it into a format that can be analysed, saved and visualised.

There are some things you should bear in mind before performing any kind of web scraping activity...

* Check the site's terms and conditions to see if there's anything that prohibits you from scraping it
* Be careful not to overload the website's server via dodgy while loops etc...
* Web page formats change often - be prepared to reconfigure your scraper!
* Some basic knowledge of HTML is a big help!
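
The first two points can be partly automated. The sketch below is one possible approach, not a required recipe: it uses the standard library's urllib.robotparser to check a robots.txt policy and pauses between requests. The robots.txt text, the example.com URLs, the my-scraper user agent and the helper names are all made up for illustration, and robots.txt is not a substitute for reading the site's terms and conditions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def load_rules(robots_text):
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def allowed_urls(urls, parser, user_agent="my-scraper"):
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]

def fetch_politely(urls, fetch, delay):
    """Call fetch(url) for each URL, sleeping for `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)
        results.append(fetch(url))
    return results

rules = load_rules(ROBOTS_TXT)
candidates = [
    "https://example.com/food/recipes/a",
    "https://example.com/private/b",
]
ok_urls = allowed_urls(candidates, rules)
delay = rules.crawl_delay("my-scraper") or 1   # fall back to 1 second if unset
```

In a real scraper the fetch argument could be requests.get, and the robots.txt text would be downloaded from the site first.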


## Beautiful Soup

We'll be using two libraries to perform web scraping - the first is Requests which we've already met. The second is called <a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup</a> which is a library specifically for extracting data from HTML and XML files.

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

* Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.

* Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.

* Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility. 
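
As a small illustration of the encoding point, the sketch below feeds Beautiful Soup some bytes with no declared encoding and names the encoding explicitly through the from_encoding argument. The markup is invented, and Python's built-in html.parser is used here so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented example: Latin-1 bytes with no charset declaration to help detection
raw = "<p>café au lait</p>".encode("latin-1")

# Tell Beautiful Soup the original encoding explicitly
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.get_text()              # an ordinary Unicode string
utf8_bytes = soup.p.encode("utf-8")   # the same tag written back out as UTF-8 bytes
```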

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
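
Each of those four requests maps onto a line or two of code. The sketch below runs them against a tiny invented document (the markup, URLs and variable names are assumptions for illustration) using the built-in html.parser.

```python
import re
from bs4 import BeautifulSoup

# Invented markup, just big enough to exercise each query
html = """
<table><tr><th>Name</th><th><b>Price</b></th></tr></table>
<a class="externalLink" href="http://foo.com/a">A</a>
<a href="/local/b">B</a>
<a class="externalLink" href="http://bar.org/c">C</a>
"""
soup = BeautifulSoup(html, "html.parser")

# "Find all the links"
all_links = soup.find_all("a")

# "Find all the links of class externalLink"
external_links = soup.find_all("a", class_="externalLink")

# "Find all the links whose urls match foo.com"
foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))

# "Find the table heading that's got bold text, then give me that text"
bold_heading = next(th for th in soup.find_all("th") if th.find("b") is not None)
bold_text = bold_heading.b.get_text()
```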

Note that a simple definition of <i>parsing</i> is the process of analysing a string of symbols (usually code!) to work out its structure.

Beautiful Soup is not installed as part of Anaconda so we'll need to install it manually as follows:

    conda install bs4

## Understanding HTML

To best utilise web scraping, it's good to have a basic understanding of HTML pages and how they work.

A few excellent resources for this are:

* <a href = "http://www.w3schools.com/html/">W3 Schools HTML</a><br/>
* <a href = "http://www.w3schools.com/xml/">W3 Schools XML</a><br/>
* <a href = "https://www.codecademy.com/learn/web">Codecademy</a><br/>

HTML is a 'markup' language, which means it's for the processing and display of text. It does this via 'tags', which are enclosed in chevrons (angle brackets). There is a simple example of some HTML below.

When displayed on a web page this looks like:

<div class  = "container">
<h1> Heading</h1><br/>
<div class = "header5"><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed imperdiet sem velit, eget egestas erat imperdiet vitae. Vivamus diam nibh, malesuada non pellentesque vitae, ullamcorper eget quam. Nam sit amet sodales sapien. In iaculis viverra tortor eget semper. Donec accumsan consequat aliquam. Nulla facilisi. Duis non tellus condimentum, varius elit nec, tempor sapien.</p> 

<p>Sed facilisis aliquam tincidunt. Nulla magna nunc, rutrum id libero in, venenatis fermentum lacus. Suspendisse potenti. Morbi varius pharetra sapien, id fermentum nisl convallis non. Nulla quis sagittis sapien, quis posuere lorem. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.</p></div>
</div>
<ul>
<li>Lorem</li>
<li>ipsum</li>
<li>dolor</li>
<li>sit</li>
<li>amet</li>
</ul>

You'll see how the tags help give the text structure and formatting. However we can also use them to extract specific data from a webpage using Beautiful Soup.
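
As a first taste, the sketch below parses a shortened copy of the example markup above and pulls out the heading, the paragraphs and the list items by tag name. The variable names are illustrative, and the built-in html.parser is assumed so that nothing beyond Beautiful Soup needs installing.

```python
from bs4 import BeautifulSoup

# A shortened copy of the example markup above
html = """
<div class="container">
<h1> Heading</h1>
<div class="header5"><p>Lorem ipsum dolor sit amet.</p>
<p>Sed facilisis aliquam tincidunt.</p></div>
</div>
<ul>
<li>Lorem</li>
<li>ipsum</li>
<li>dolor</li>
<li>sit</li>
<li>amet</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1").get_text().strip()            # text of the first h1
paragraphs = [p.get_text() for p in soup.find_all("p")]  # text of every p tag
list_items = [li.get_text() for li in soup.find("ul").find_all("li")]
```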

## Understanding Tags

You only really need to understand maybe <a href = "http://www.99lime.com/_bak/topics/you-only-need-10-tags/">10 tags</a> to get started with web scraping.

<img src = "img/tags.png">

For now we'll only be using the div, h1-h6 and ul tags in our example.

## Classes

We'll also be meeting <a href = "http://www.w3schools.com/tags/att_global_class.asp">Classes</a> as part of web scraping. Classes are assigned in order to style various elements of the HTML document a certain way. For example you might want to style a section heading or list a certain way to set it apart from other items.

Since classes set these items apart, we can also use them to identify specific elements in the document, which we'll see more of below. Assigning a class is simple: add a class attribute to the element's opening tag, as with class = "container" on the first div in the example above.

## IDs

IDs are similar to classes, except that whilst a class can be applied to multiple elements in a web page, an ID is applied to a single element and is not repeated. Assigning an ID is just as simple: add an id attribute to the opening tag in the same way.
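
To make the difference concrete, here is a minimal sketch with invented markup in which two headings share a class while each has its own ID. Searching by class returns every matching element, whereas searching by ID returns a single one.

```python
from bs4 import BeautifulSoup

# Invented markup: one class shared by both headings, one unique id each
html = """
<h2 id="intro" class="section-heading">Introduction</h2>
<p class="note">First note</p>
<h2 id="method" class="section-heading">Method</h2>
<p class="note">Second note</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) is used because class is a Python keyword
headings = soup.find_all("h2", class_="section-heading")
heading_texts = [h.get_text() for h in headings]

# An id should be unique on the page, so find() is enough
intro = soup.find(id="intro")
```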

## Web Scraping

We'll start by importing the libraries we'll need:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

We'll see if we can build a scraper that extracts the recipe title and ingredients from the BBC recipes website.

We're going to use a recipe page for cookies to get started:

    url = 'http://www.bbc.co.uk/food/recipes/peanut_butter_cookies_02578'

    r = requests.get(url)                     # Making the request to the page
    r.status_code                             # Checking that the request has been successful

    200
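
Checking r.status_code by eye works well in a notebook, but inside a script it is easy to forget. One alternative, sketched below, lets Requests raise an error for 4xx and 5xx responses through raise_for_status() and sets a timeout so that a slow server cannot hang the scraper. The fetch_page name and its get parameter are hypothetical; get only exists so that the function can be exercised without a network connection.

```python
import requests

def fetch_page(url, get=requests.get, timeout=10):
    """Return the body of `url` as bytes, raising requests.HTTPError on a 4xx/5xx status."""
    response = get(url, timeout=timeout)   # give up if the server takes too long
    response.raise_for_status()            # does nothing for successful responses
    return response.content
```

Calling fetch_page(url) with the recipe URL above would return the same bytes as r.content, provided the request succeeds.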

Now that we know our request has been successful, we can create a Beautiful Soup object and start to see how we can extract data from the web page.

    soup = BeautifulSoup(r.content,'lxml')           # Creating the beautiful soup object
    print(soup.prettify()[0:1000])                   # Using the prettify method to make the output readable + slicing with a specified number of characters

    <!DOCTYPE html>
    <html class="no-touch" lang="en">
     <head>
      <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
      <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
      <meta content="100004154058350" property="fb:admins"/>
      <!--[if (gt IE 8) | (IEMobile)]><!-->
      <link href="http://static.bbci.co.uk/frameworks/barlesque/3.18.3/orb/4/style/orb.min.css" rel="stylesheet"/>
      <!--<![endif]-->
      <!--[if (lt IE 9) & (!IEMobile)]>
    <link rel="stylesheet" href="http://static.bbci.co.uk/frameworks/barlesque/3.18.3/orb/4/style/orb-ie.min.css">
    <![endif]-->
      <!--orb.ws.require.lib-->
      <script type="text/javascript">
       /*<![CDATA[*/ if (typeof window.define !== 'function' || typeof window.require !== 'function') { document.write('<script class="js-require-lib" src="http://static.bbci.co.uk/frameworks/requirejs/lib.js"><'+'/script>'); } /*]]>*/
      </script>
      <script type="text/javascript">
       bbcRequireMap = {"jquery-1":"http://static.bbci.co.uk/frameworks/jquery/0.4.1/sha

This is a LOT of text! Fortunately most of it is irrelevant to us and we can take some simple steps to separate what's relevant from what's not. For example, the head section of the document doesn't contain any of the page's visible content. Similarly, there will be script tags containing JavaScript, which is not of interest to us.

The simplest way to get at the document content is to create a new object just for the body of the document. This makes it easier to explore the content of the document.

    body = soup.find('body')                       # Creating a body object
    print(body.prettify())

    <body class="nojs">
     <!--<![endif]-->
     <!-- BBCDOTCOM bodyFirst -->
     <div class="bbccom_display_none" id="bbccom_interstitial_ad">
     </div>
     <div class="bbccom_display_none" id="bbccom_interstitial">
      <script type="text/javascript">
       /*<![CDATA[*/ (function() { if (window.bbcdotcom && bbcdotcom.config.isActive('ads')) { googletag.cmd.push(function() { googletag.display('bbccom_interstitial'); }); } }()); /*]]>*/
      </script>
     </div>
     <div class="bbccom_display_none" id="bbccom_wallpaper_ad">
     </div>
     <div class="bbccom_display_none" id="bbccom_wallpaper">
      <script type="text/javascript">
       /*<![CDATA[*/ (function() { var wallpaper; if (window.bbcdotcom && bbcdotcom.config.isActive('ads')) { if (bbcdotcom.config.isAsync()) { googletag.cmd.push(function() { googletag.display('bbccom_wallpaper'); }); } else { googletag.display("wallpaper"); } wallpaper = bbcdotcom.adverts.adRegister.getAd('wallpaper'); } }()); /*]]>*/
      </script>
     </div>
     <script type="text/javascript">
      /*<![CDATA[*/
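
Even the body is cluttered with script tags. An optional extra step, sketched below on a small invented document and not on the real page, is to delete those tags with decompose() before exploring what is left. Treating style tags the same way is an assumption about what counts as noise.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a noisy page body
html = """
<body class="nojs">
<div id="ad"><script type="text/javascript">googletag.display('ad');</script></div>
<h1>Peanut butter cookies</h1>
<style>h1 { color: red; }</style>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# Remove every script and style tag, together with its contents
for tag in body.find_all(["script", "style"]):
    tag.decompose()

remaining_tags = [tag.name for tag in body.find_all(True)]   # True matches any tag
```

Printing body.prettify() after this step gives a much shorter document to search through.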

We can now search the body using Ctrl+F for specific text, such as the recipe title:

#### Peanut butter cookies with banana ice cream
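
The same search can also be done in code instead of with Ctrl+F. The sketch below is an illustration on invented markup that imitates the three kinds of occurrence (a script, a button and a heading); it looks for strings containing the title and reports which tag each one sits in.

```python
import re
from bs4 import BeautifulSoup

# Invented markup imitating the places where the title might appear
html = """
<script>var recipe = "Peanut butter cookies with banana ice cream";</script>
<button>Save Peanut butter cookies with banana ice cream</button>
<h1 class="content-title__text">Peanut butter cookies with banana ice cream</h1>
"""
soup = BeautifulSoup(html, "html.parser")

# string= searches the text inside tags, not the tag names
matches = soup.find_all(string=re.compile("Peanut butter cookies"))

# .parent is the tag that directly contains each matching string
parent_tags = [match.parent.name for match in matches]
```

Looking at parent_tags (and at match.parent.attrs for the classes) is a quick way to decide which occurrence is worth targeting.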

We can see that it appears 3 times: once inside a script tag, once in relation to a button and once as the heading of the recipe. It's this last one we'll extract as the title as follows:

    title = soup.find("h1",class_="content-title__text").text     # Finds the H1 tag with the designated class and extracts the text from it
    print(title)

    Peanut butter cookies with banana ice cream 


This has successfully given us the title of the recipe! Now we'll need to extract the ingredients list. If we Ctrl+F and search for ingredient we can see that there's a div with the class recipe-ingredients, which we can find as follows:

    ingredients = soup.find('div',class_="recipe-ingredients")
    ingredients

    <div class="recipe-ingredients">
    <div class="recipe-ingredients-wrapper">
    <h2 class="recipe-ingredients__heading">Ingredients</h2>
    <h3 class="recipe-ingredients__sub-heading">For the banana ice cream</h3>
    <ul class="recipe-ingredients__list">
    <li class="recipe-ingredients__list-item" itemprop="ingredients"> 4 ripe <a class="recipe-ingredients__link" href="/food/banana">bananas</a> </li>
    <li class="recipe-ingredients__list-item" itemprop="ingredients"> 100ml/3½fl oz natural Greek <a class="recipe-ingredients__link" href="/food/yoghurt">yoghurt</a></li>
    <li class="recipe-ingredients__list-item" itemprop="ingredients"> splash <a class="recipe-ingredients__link" href="/food/milk">milk</a></li>
    <li class="recipe-ingredients__list-item" itemprop="ingredients"> 2 tbsp <a class="recipe-ingredients__link" href="/food/honey">honey</a></li>
    </ul>
    <h3 class="recipe-ingredients__sub-heading">For the peanut butter cookies</h3>
    <ul class="recipe-ingredients__list">
    <li class="recipe-ingredients__list

We can see that the ingredients are all in a list, marked up with li tags, and that each item has the class recipe-ingredients__list-item.

We'll use this as the source of our ingredients list, but as there's more than one ingredient we'll need to use the find_all method:

    ingredients_soup = soup.find_all('li',class_="recipe-ingredients__list-item")
    ingredients_soup

    [<li class="recipe-ingredients__list-item" itemprop="ingredients"> 4 ripe <a class="recipe-ingredients__link" href="/food/banana">bananas</a> </li>,
     <li class="recipe-ingredients__list-item" itemprop="ingredients"> 100ml/3½fl oz natural Greek <a class="recipe-ingredients__link" href="/food/yoghurt">yoghurt</a></li>,
     <li class="recipe-ingredients__list-item" itemprop="ingredients"> splash <a class="recipe-ingredients__link" href="/food/milk">milk</a></li>,
     <li class="recipe-ingredients__list-item" itemprop="ingredients"> 2 tbsp <a class="recipe-ingredients__link" href="/food/honey">honey</a></li>,
     <li class="recipe-ingredients__list-item" itemprop="ingredients"> 75g/2¾oz <a class="recipe-ingredients__link" href="/food/margarine">margarine</a></li>,
     <li class="recipe-ingredients__list-item" itemprop="ingredients"> 100g/3½oz golden <a class="recipe-ingredients__link" href="/food/caster_sugar">caster sugar</a></li>,
     <li class="recipe-ingredients__list-item" itemprop="ingredients"> 1 

This returns a list which we can iterate through to extract the text of each item:

    ingredients_list = []

    for item in ingredients_soup:
        out = item.text[1:]                # Dropping the leading space from each item
        ingredients_list.append(out)

    ingredients_list

    ['4 ripe bananas ',
     '100ml/3½fl oz natural Greek yoghurt',
     'splash milk',
     '2 tbsp honey',
     '75g/2¾oz margarine',
     '100g/3½oz golden caster sugar',
     '1 large free-range egg',
     '100g/3½oz plain flour',
     '1 tbsp golden syrup',
     '100g/3½oz crunchy peanut butter',
     '1 tsp bicarbonate of soda',
     '50g/1¾oz salted peanuts']
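
Notice that slicing with [1:] removes the leading space but leaves the trailing one on '4 ripe bananas '. A slightly more forgiving variant, sketched below on two list items copied from the output above, strips whitespace from both ends and builds the list with a list comprehension. The html.parser choice and the ingredients_clean name are assumptions made for the sketch.

```python
from bs4 import BeautifulSoup

# Two list items copied from the output above
html = """
<li class="recipe-ingredients__list-item" itemprop="ingredients"> 4 ripe <a class="recipe-ingredients__link" href="/food/banana">bananas</a> </li>
<li class="recipe-ingredients__list-item" itemprop="ingredients"> splash <a class="recipe-ingredients__link" href="/food/milk">milk</a></li>
"""
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("li", class_="recipe-ingredients__list-item")

# get_text() joins the text of the li and its nested a tag;
# strip() then removes whitespace from both ends of the result
ingredients_clean = [item.get_text().strip() for item in items]
```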

Let's store our data as a dictionary...

    recipe = {title:ingredients_list}
    recipe

    {'Peanut butter cookies with banana ice cream ': ['4 ripe bananas ',
      '100ml/3½fl oz natural Greek yoghurt',
      'splash milk',
      '2 tbsp honey',
      '75g/2¾oz margarine',
      '100g/3½oz golden caster sugar',
      '1 large free-range egg',
      '100g/3½oz plain flour',
      '1 tbsp golden syrup',
      '100g/3½oz crunchy peanut butter',
      '1 tsp bicarbonate of soda',
      '50g/1¾oz salted peanuts']}
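
A dictionary like this only lives in memory. If the data needs to be saved, one simple option (an assumption here, since other formats would work too) is to write it to a JSON file with the standard library. The save_recipes and load_recipes helpers and the recipes.json file name are hypothetical, and the recipe is shortened for the example.

```python
import json
import os
import tempfile

def save_recipes(recipes, path):
    """Write the recipes dictionary to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as ½ readable in the file
        json.dump(recipes, f, ensure_ascii=False, indent=2)

def load_recipes(path):
    """Read a recipes dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Shortened version of the dictionary above
recipe = {
    "Peanut butter cookies with banana ice cream": [
        "4 ripe bananas",
        "100ml/3½fl oz natural Greek yoghurt",
    ]
}

# A temporary folder is used so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "recipes.json")
    save_recipes(recipe, path)
    restored = load_recipes(path)
```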

And finally wrap everything up into a function:

    recipes = {}                                                                          # Blank recipes dictionary to which we add our data

    def bbc_rec(url):
        # Title and ingredients scraper for the BBC recipes website
        r = requests.get(url)                                                             # Making the request to the page
        if r.status_code == 200:                                                          # Only execute if the request is successful
            soup = BeautifulSoup(r.content,'lxml')                                        # Creating the Beautiful Soup object
            title = soup.find("h1",class_="content-title__text").text                     # Extracting the title of the recipe
            ingredients_soup = soup.find_all('li',class_="recipe-ingredients__list-item") # Extracting the ingredients of the recipe
            ingredients_list = []                                                         # Creating a blank ingredients list to which to append text
            for item in ingredients_soup:                                                 # Loop to extract the relevant text
                out = item.text[1:]                                                       # Extracting the text and removing the preceding space
                ingredients_list.append(out)                                              # Appending the text to the ingredients list
            recipes[title] = ingredients_list                                             # Adding the title and ingredients to the recipes dictionary
        else:
            print('Error code %s' % r.status_code)                                        # In case of error print the error message + code

    bbc_rec(url='http://www.bbc.co.uk/food/recipes/peanut_butter_cookies_02578')
    bbc_rec(url='http://www.bbc.co.uk/food/recipes/buckwheat_triple_32460')
    bbc_rec(url='http://www.bbc.co.uk/food/recipes/cake_lollipops_06507')
    recipes

    {'Cake pops': ['100g/3½oz dark chocolate',
      '125g/4½oz fruit cake ',
      '125g/4½oz Madeira cake ',
      '2 tbsp desiccated coconut',
      '2 tbsp chopped hazelnuts',
      '300g/10½oz white chocolate',
      'few drops food colouring',
      'multi-coloured sugar ball sprinkles'],
     'Peanut butter cookies with banana ice cream ': ['4 ripe bananas ',
      '100ml/3½fl oz natural Greek yoghurt',
      'splash milk',
      '2 tbsp honey',
      '75g/2¾oz margarine',
      '100g/3½oz golden caster sugar',
      '1 large free-range egg',
      '100g/3½oz plain flour',
      '1 tbsp golden syrup',
      '100g/3½oz crunchy peanut butter',
      '1 tsp bicarbonate of soda',
      '50g/1¾oz salted peanuts'],
     'Triple chocolate buckwheat cookies': ['150g/5½oz dark chocolate chips',
      '125g/4½oz dark chocolate (minimum 70% cocoa solids)',
      '125g/4½oz buckwheat flour',
      '25g/1oz cocoa powder, sieved',
      '½ tsp bicarbonate of soda',
      '½ tsp fine sea salt',
      '60g/2¼oz soft unsalted butter',
      '125g/4½oz soft dark brown sugar',
      '1 tsp vanilla paste or extr
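
Since page formats change often, a scraper like this can break without warning: if the h1 is renamed, soup.find returns None and calling .text on it raises an AttributeError. The sketch below is one possible variation, not a replacement for the function above. It separates parsing from downloading so the parsing can be tried on a saved or hand-written page, and it returns None when the expected title is missing. The parse_recipe name, the sample markup and the use of html.parser in place of lxml are all assumptions made for the sketch.

```python
from bs4 import BeautifulSoup

def parse_recipe(html):
    """Return (title, ingredients) from recipe page HTML, or None if no title is found."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1", class_="content-title__text")
    if title_tag is None:                      # the page layout is not what we expected
        return None
    items = soup.find_all("li", class_="recipe-ingredients__list-item")
    ingredients = [item.get_text().strip() for item in items]
    return title_tag.get_text().strip(), ingredients

# Hand-written sample using the same class names as the pages above
sample = """
<h1 class="content-title__text">Cake pops</h1>
<ul class="recipe-ingredients__list">
<li class="recipe-ingredients__list-item"> 100g/3½oz dark <a href="/food/chocolate">chocolate</a></li>
<li class="recipe-ingredients__list-item"> 2 tbsp desiccated <a href="/food/coconut">coconut</a></li>
</ul>
"""
parsed = parse_recipe(sample)
missing = parse_recipe("<html><body><h1>Page moved</h1></body></html>")
```

With this split, the download step only has to pass r.content to parse_recipe and decide what to do when it gets None back.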

## Further Reading

<a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup Documentation</a><br/>
<a href = "http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/">Beautiful Soup Tutorial</a><br/>
<a href = "https://www.youtube.com/watch?v=3xQTJi2tqgk">YouTube Tutorial on Beautiful Soup + Requests</a><br/>
<a href = "http://www.w3schools.com/html/">W3 Schools HTML Introduction</a><br/>
<a href = "http://www.w3schools.com/xml/">W3 Schools XML Introduction</a><br/>
<a href = "https://www.codecademy.com/learn/web">Codecademy HTML Course</a><br/>


