# Web Scraping

Web Scraping is a general term for techniques involving automating the gathering of data from a website.

**Guidelines/Cautions :**<br>
Before we begin, here are some important rules to follow and understand :

• Always be respectful and try to get permission to scrape. Do not bombard a website with scraping requests, otherwise your IP address may be blocked!<br>• Be aware that websites change often, meaning your code could go from working to totally broken from one day to the next.<br>• Pretty much every web scraping project of interest is a unique and custom job, so try your best to generalize the skills learned here.
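One common way to act on the first guideline is to consult a site's `robots.txt` before scraping. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt text and the `my-scraper` user agent name are made up for illustration so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a made-up user agent name.
allowed_public = parser.can_fetch("my-scraper", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "http://www.example.com/private/data.html")

# Number of seconds the site asks crawlers to wait between requests (None if unspecified).
delay = parser.crawl_delay("my-scraper")
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the real file instead of the sample text.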

OK, let's get started with the basics!

#### Web Scraping in Python

There are a few libraries you will need. You can install them from your command line with `conda install` (if you are using the Anaconda distribution) or `pip install` for other Python distributions.

    conda install requests
    conda install lxml
    conda install bs4

With `pip install`, for example :

    pip install requests
    pip install lxml
    pip install bs4

Now let's see what we can do with these libraries.

### Example Task 0 - Grabbing the title of a page

Let's start very simple: we will grab the title of a page. Remember that this is the HTML block with the title tag. For this task we will use www.example.com, a website made specifically to serve as an example domain. Let's go through the main steps :

In [1]:
import requests

In [25]:
# Step 1: Use the requests library to grab the page

# Note, this may fail if you have a firewall blocking Python/Jupyter

# Note, sometimes you need to run this twice if it fails the first time
res = requests.get("http://www.example.com")

This object is a `requests.models.Response` object and it actually contains the information from the website, for example :

In [26]:
type(res)

requests.models.Response

In [27]:
res.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

****
Now we can use `BeautifulSoup` to analyze the extracted page. Technically we could use our own custom script to look for items in the string of `res.text`, but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file). Using BeautifulSoup we can create a "soup" object that contains all the "ingredients" of the webpage.

`lxml` is the engine used to decode/parse a webpage's text content to generate a _soup_ object.
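The parser name is simply the second argument to `BeautifulSoup`. As a small sketch, assuming `lxml` is not available, Python's built-in `"html.parser"` can be passed instead; the HTML string below is hand-written for illustration, so no request is needed.

```python
import bs4

# A small hand-written HTML string (hypothetical), so no network request is needed.
html = "<html><head><title>Example Domain</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python; "lxml" (used in this notebook) is installed separately.
soup_builtin = bs4.BeautifulSoup(html, "html.parser")

# The soup exposes the first <title> tag as an attribute; .string is its text.
title_text = soup_builtin.title.string
```

For well-formed pages both parsers usually produce the same tree; they can differ in how they repair broken markup.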

In [28]:
import bs4

In [29]:
soup = bs4.BeautifulSoup(res.text, "lxml")

In [30]:
soup

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

Now let's use the `.select()` method to grab elements. We are looking for the **'title'** tag, so we will pass in 'title'.

In [10]:
soup.select('title')

[<title>Example Domain</title>]

In [12]:
type(soup.select('title'))

bs4.element.ResultSet

> Notice what is returned here: it's actually a list containing all the title elements (along with their tags). You can use indexing or even looping to grab the elements from the list. Since each element is still a specialized Tag object, we can use method calls to grab just the text.

In [13]:
title_tag = soup.select('title')

In [14]:
title_tag[0]

<title>Example Domain</title>

In [15]:
type(title_tag[0])     # the type displayed here is a special beautifulsoup object.

bs4.element.Tag

In [19]:
title_tag[0].getText()

'Example Domain'
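When only one element is expected, `select_one()` avoids the indexing step. The sketch below runs on a hand-written HTML string (an assumption made so it works offline) and uses `get_text()`, which is the same method as `getText()`.

```python
import bs4

# Hand-written HTML (hypothetical) standing in for a downloaded page.
html = """
<html><head><title>Example Domain</title></head>
<body><div><h1>Example Domain</h1><p>First paragraph.</p><p>Second paragraph.</p></div></body></html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# select() always returns a list-like ResultSet, even for a single match.
paragraph_texts = [p.get_text() for p in soup.select("p")]

# select_one() returns the first matching Tag, or None when nothing matches.
title = soup.select_one("title")
title_text = title.get_text() if title is not None else None
missing = soup.select_one("table")
```

Checking for `None` before calling a method guards against pages where the element is absent.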

In [31]:
soup.select('p')

[<p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>,
 <p><a href="https://www.iana.org/domains/example">More information...</a></p>]

### Example Task 1 - Grabbing all elements of a class

Let's try to grab all the section headings of the Wikipedia Article on Late Shri A.P.J. Abdul Kalam from this URL : https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam

In [66]:
# First get the request
res = requests.get('https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam')

In [67]:
# Create a soup from request
soup = bs4.BeautifulSoup(res.text,"lxml")

Now it's time to figure out what we are actually looking for. Inspect the element on the page to see that the section headers have the class `toctext`. Because this is a class and not a plain tag, we need to follow CSS selector syntax. In this case :

| Syntax to pass to the `.select()` method | Match Results |
| :-------: | :--------: |
| soup.select('div') | All elements with the \<div\> tag |
| soup.select('#some_id') | The HTML element whose `id` attribute is `some_id` |
| soup.select('.notice') | All the HTML elements with the CSS class named notice |
| soup.select('div span') | Any elements named \<span\> that are within an element named \<div\> |
| soup.select('div > span') | Any elements named \<span\> that are **directly** within an element named \<div\>, with no other element in between. |
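To see these selector forms side by side, including id (`#`) and class (`.`) selectors, here is a minimal sketch on a hand-written HTML fragment. The markup, the id `some_id` and the class `notice` are illustrative only.

```python
import bs4

# Hand-written fragment (hypothetical) with one div, one id and two "notice" spans.
html = """
<div id="some_id">
  <span class="notice">direct child</span>
  <p><span>nested deeper</span></p>
</div>
<span class="notice">outside any div</span>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

by_tag = soup.select("div")            # every <div> element
by_id = soup.select("#some_id")        # the element whose id is "some_id"
by_class = soup.select(".notice")      # every element with the class "notice"
descendants = soup.select("div span")  # <span> anywhere inside a <div>
children = soup.select("div > span")   # <span> directly inside a <div>
```

The difference between the last two shows up in the counts: the descendant selector also matches the `<span>` nested inside `<p>`, while the child selector does not.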

In [68]:
# note, depending on your IP Address this class may be called something different

soup.select(".toctext")

[<span class="toctext">Early life and education</span>,
 <span class="toctext">Career as a scientist</span>,
 <span class="toctext">Presidency</span>,
 <span class="toctext">Post-presidency</span>,
 <span class="toctext">Death</span>,
 <span class="toctext">Reactions</span>,
 <span class="toctext">Memorial</span>,
 <span class="toctext">Personal life</span>,
 <span class="toctext">Religious and spiritual views</span>,
 <span class="toctext">Islam</span>,
 <span class="toctext">Syncretism</span>,
 <span class="toctext">Pramukh Swami as Guru</span>,
 <span class="toctext">Writings</span>,
 <span class="toctext">Awards and honours</span>,
 <span class="toctext">Island</span>,
 <span class="toctext">Road</span>,
 <span class="toctext">Plant species</span>,
 <span class="toctext">Other awards and honours</span>,
 <span class="toctext">Legacy</span>,
 <span class="toctext">Books, documentaries and popular culture</span>,
 <span class="toctext">See also</span>,
 <span class="toctext">Referenc

In [65]:
for item in soup.select('.toctext'):
    print(item.text)

Early life and education
Career as a scientist
Presidency
Post-presidency
Death
Reactions
Memorial
Personal life
Religious and spiritual views
Islam
Syncretism
Pramukh Swami as Guru
Writings
Awards and honours
Island
Road
Plant species
Other awards and honours
Legacy
Books, documentaries and popular culture
See also
References
External links


### Example Task 2 - Getting an Image from a Website

Let's attempt to grab the image of a quantum computer from this Wikipedia article: https://en.wikipedia.org/wiki/Quantum_computing

In [47]:
res = requests.get("https://en.wikipedia.org/wiki/Quantum_computing")

In [48]:
soup = bs4.BeautifulSoup(res.text, 'lxml')

In [49]:
imageInfo = soup.select('.thumbimage')
imageInfo

[<img alt="" class="thumbimage" data-file-height="3584" data-file-width="5166" decoding="async" height="153" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/60/IBM_Q_system_%28Fraunhofer_2%29.jpg/220px-IBM_Q_system_%28Fraunhofer_2%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/60/IBM_Q_system_%28Fraunhofer_2%29.jpg/330px-IBM_Q_system_%28Fraunhofer_2%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/60/IBM_Q_system_%28Fraunhofer_2%29.jpg/440px-IBM_Q_system_%28Fraunhofer_2%29.jpg 2x" width="220"/>,
 <img alt="" class="thumbimage" data-file-height="185" data-file-width="163" decoding="async" height="250" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/f4/Bloch_Sphere.svg/220px-Bloch_Sphere.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/f4/Bloch_Sphere.svg/330px-Bloch_Sphere.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/f4/Bloch_Sphere.svg/440px-Bloch_Sphere.svg.png 2x" width="220"/>,
 <img alt="" class="thumbima

In [50]:
# this webpage contains 3 images
len(imageInfo)

3

In [53]:
quantumCompImage = imageInfo[0]
type(quantumCompImage)

bs4.element.Tag

You can make dictionary-like calls for parts of the Tag. In this case we are interested in the `src`, or "source", of the image, which should be its own .jpg or .png link :

In [54]:
quantumCompImage['src']

'//upload.wikimedia.org/wikipedia/commons/thumb/6/60/IBM_Q_system_%28Fraunhofer_2%29.jpg/220px-IBM_Q_system_%28Fraunhofer_2%29.jpg'

We can actually display it with a markdown cell with the following :

    <img src='https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/IBM_Q_system_%28Fraunhofer_2%29.jpg/220px-IBM_Q_system_%28Fraunhofer_2%29.jpg'>
    
<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/IBM_Q_system_%28Fraunhofer_2%29.jpg/220px-IBM_Q_system_%28Fraunhofer_2%29.jpg'>

Now that you have the actual src link, you can grab the image with requests and read its raw bytes from the `.content` attribute.<br>Note that we had to add `https:` in front of the link, because the `src` value starts with `//` and has no scheme; if you don't do this, requests will complain.
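Rather than pasting the scheme in by hand, the standard library's `urllib.parse.urljoin` can resolve a scheme-less `src` against the URL of the page it came from. A minimal sketch using the two URLs from this example:

```python
from urllib.parse import urljoin

# The page the image tag was found on, and the src value copied from the tag.
page_url = "https://en.wikipedia.org/wiki/Quantum_computing"
src = (
    "//upload.wikimedia.org/wikipedia/commons/thumb/6/60/"
    "IBM_Q_system_%28Fraunhofer_2%29.jpg/220px-IBM_Q_system_%28Fraunhofer_2%29.jpg"
)

# urljoin borrows the scheme (https) from the page URL when src starts with "//".
image_url = urljoin(page_url, src)
```

The same call also resolves ordinary relative links, so it is a reasonable default whenever an `href` or `src` is taken from a page.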

In [56]:
imageLink = requests.get('https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/IBM_Q_system_%28Fraunhofer_2%29.jpg/220px-IBM_Q_system_%28Fraunhofer_2%29.jpg')

In [57]:
imageLink.content

b'\xff\xd8\xff\xe2\x02@ICC_PROFILE\x00\x01\x01\x00\x00\x020ADBE\x02\x10\x00\x00mntrRGB XYZ \x07\xcf\x00\x06\x00\x03\x00\x00\x00\x00\x00\x00acspAPPL\x00\x00\x00\x00none\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf6\xd6\x00\x01\x00\x00\x00\x00\xd3-ADBE\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\ncprt\x00\x00\x00\xfc\x00\x00\x002desc\x00\x00\x010\x00\x00\x00kwtpt\x00\x00\x01\x9c\x00\x00\x00\x14bkpt\x00\x00\x01\xb0\x00\x00\x00\x14rTRC\x00\x00\x01\xc4\x00\x00\x00\x0egTRC\x00\x00\x01\xd4\x00\x00\x00\x0ebTRC\x00\x00\x01\xe4\x00\x00\x00\x0erXYZ\x00\x00\x01\xf4\x00\x00\x00\x14gXYZ\x00\x00\x02\x08\x00\x00\x00\x14bXYZ\x00\x00\x02\x1c\x00\x00\x00\x14text\x00\x00\x00\x00Copyright 1999 Adobe Systems Incorporated\x00\x00\x00desc\x00\x00\x00\x00\x00\x00\x00\x11Adobe RGB (1998)\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x

Above is the raw content of the actual image, i.e. binary data (bytes) rather than text, which an image viewer or browser can decode.

**Let's write this to a file.<br>Note the `'wb'` mode, which opens the file for binary writing.**<br>
`wb` = `write binary`

In [58]:
f = open('quantum-computer-image.jpg', 'wb')

In [59]:
f.write(imageLink.content)

6816

In [60]:
f.close()
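The open/write/close steps above can also be written with a `with` block, which closes the file even if the write fails. This is a sketch only: `save_bytes` is a hypothetical helper, and the bytes are a stand-in rather than a real download.

```python
import tempfile
from pathlib import Path


def save_bytes(data: bytes, path: Path) -> int:
    """Write raw bytes to `path` and return the number of bytes written."""
    # 'wb' opens the file for binary writing; the with-block closes it automatically.
    with open(path, "wb") as f:
        return f.write(data)


# Stand-in bytes that merely begin like a JPEG file, used instead of a real download.
fake_image = b"\xff\xd8\xff\xe0" + b"\x00" * 12

# Written to the system temp directory so the example does not clutter the notebook folder.
out_path = Path(tempfile.gettempdir()) / "quantum-computer-image-demo.jpg"
written = save_bytes(fake_image, out_path)
```

With the variables from this notebook, the equivalent call would be `save_bytes(imageLink.content, Path('quantum-computer-image.jpg'))`.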

Now we can display this file right here in the notebook as markdown using :

    <img src="quantum-computer-image.jpg">
    
Just write the above line in a new markdown cell and it will display the image we just downloaded!

<img src="quantum-computer-image.jpg">

### Example Project - Working with Multiple Pages and Items

Let's show a more realistic example of scraping a full site. The website http://books.toscrape.com/index.html (part of https://toscrape.com/) is specifically designed for people to scrape. Let's try to get the title of every book that has a 2-star rating and end up with a Python list of all their titles.

We will do the following:

• Figure out the URL structure to go through every page.<br>
• Scrape every page in the catalogue.<br>
• Figure out what tag/class represents the Star rating.<br>
• Filter by that star rating using an `if` statement.<br>
• Store the results to a list.<br>

We can see that the URL structure is the following:

    http://books.toscrape.com/catalogue/page-1.html

In [112]:
baseUrl = 'http://books.toscrape.com/catalogue/page-{}.html'

We can then fill in the page number with `.format()`
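As a quick illustration, `.format()` accepts integers as well as strings, so a `range` can generate the page URLs directly. Only the first three pages are built here to keep the output short.

```python
baseUrl = 'http://books.toscrape.com/catalogue/page-{}.html'

# Each integer from the range is substituted for the {} placeholder.
page_urls = [baseUrl.format(n) for n in range(1, 4)]

for url in page_urls:
    print(url)
```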

In [113]:
# Obtain request

res = requests.get(baseUrl.format('1'))
res

<Response [200]>

Now, let's grab the products (books) from the get request result :

In [114]:
# Turn into soup

soup = bs4.BeautifulSoup(res.text, 'lxml')

In [115]:
soup.select(".product_pod")

# this is gonna be a large part inside the body element.

[<article class="product_pod">
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="../media/cach

Now we can see that each book has the **product_pod** class. We can select any tag with this class, and then further reduce it by its rating.

In [116]:
products = soup.select(".product_pod")

In [117]:
example = products[0]

In [118]:
type(example)

bs4.element.Tag

In [119]:
example.attrs

{'class': ['product_pod']}

By inspecting the site we can see that the element we want has `class='star-rating Two'`. The space separates two class names, `star-rating` and `Two`. Browser developer tools display this as `.star-rating.Two`, which is the CSS selector for an element carrying both classes, so we want to search for `".star-rating.Two"`.
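A minimal sketch of why the dot matters, run on a hand-written fragment that imitates the site's markup (the HTML here is an assumption, not a copy of the real page):

```python
import bs4

# Hand-written fragment (hypothetical) imitating one rating element on the site.
html = '<article class="product_pod"><p class="star-rating Two"></p></article>'
soup = bs4.BeautifulSoup(html, "html.parser")
rating = soup.select_one("p")

# The class attribute is split on whitespace into a list of class names.
classes = rating["class"]

# Chaining the names with dots requires both classes on the same element.
both_classes = soup.select(".star-rating.Two")

# With a space, "Two" is read as a descendant tag named <Two>, which does not exist here.
with_space = soup.select(".star-rating Two")
```

Only the dotted form finds the element; the spaced form returns an empty result.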

In [126]:
list(example.children)

['\n',
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>,
 '\n',
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 '\n',
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 '\n',
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>,
 '\n']

In [129]:
example.select('.star-rating.Three')

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]

But we are looking for **2 stars**, so it looks like we can just check to see if something was returned.

In [123]:
example.select('.star-rating.Two')

# hence, there are no elements with ".star-rating Two" class

[]

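Alternatively, we can just quickly check the HTML text to see if "star-rating Two" is in it, i.e. check whether `"star-rating Two" in str(example)` is true or false. The `str()` call matters: `in` applied directly to a Tag looks through its child elements, not its markup.<br>Either approach is fine.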
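The text-based check can be sketched as follows on a hand-written product fragment (hypothetical markup, and `Some Book` is a made-up title). It compares the substring test on the markup with a test on the rating element's class list.

```python
import bs4

# Hand-written fragment (hypothetical) imitating a single product on the site.
html = """
<article class="product_pod"><p class="star-rating Two"></p>
<h3><a href="x/index.html" title="Some Book">Some ...</a></h3></article>
"""
example = bs4.BeautifulSoup(html, "html.parser").select_one(".product_pod")

# Substring check on the tag's markup; str() turns the Tag back into HTML text.
found_in_text = "star-rating Two" in str(example)

# Without str(), `in` compares the string against the Tag's children instead.
found_without_str = "star-rating Two" in example

# Checking the class list does not depend on how the attribute is written out.
rating = example.select_one(".star-rating")
found_in_classes = "Two" in rating["class"]
```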

Now let's see how we can get the title if we have a 2-star match :

In [124]:
example.select('a')

[<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
 <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

In [125]:
example.select('a')[1]

<a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

In [88]:
example.select('a')[1]['title']

'A Light in the Attic'

Okay, let's give it a shot by combining all the ideas we've talked about! This should take about 20-60 seconds to run. Be aware that a firewall may prevent this script from running. Also, if you are getting a no-response error, try adding a sleep step with `time.sleep(1)` (after `import time`).

In [90]:
two_star_titles = []

for n in range(1, 51):

    scrapeUrl = baseUrl.format(n)
    res = requests.get(scrapeUrl)
    
    soup = bs4.BeautifulSoup(res.text,"lxml")
    books = soup.select(".product_pod")
    
    for book in books:
        if len(book.select('.star-rating.Two')) != 0:
            two_star_titles.append(book.select('a')[1]['title'])

In [91]:
two_star_titles

['Starving Hearts (Triangular Trade Trilogy, #1)',
 'Libertarianism for Beginners',
 "It's Only the Himalayas",
 'How Music Works',
 'Maude (1883-1993):She Grew Up with the country',
 "You can't bury them all: Poems",
 'Reasons to Stay Alive',
 'Without Borders (Wanderlove #1)',
 'Soul Reader',
 'Security',
 'Saga, Volume 5 (Saga (Collected Editions) #5)',
 'Reskilling America: Learning to Labor in the Twenty-First Century',
 'Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics',
 'Obsidian (Lux #1)',
 'My Paris Kitchen: Recipes and Stories',
 'Masks and Shadows',
 'Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)',
 'Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)',
 'Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)',
 'I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)',
 'Giant Days, Vol. 2 (Giant Day
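One possible way to restructure the loop above, shown as a sketch rather than the notebook's method: parsing is separated from fetching, and a pause is added between requests. The names `two_star_titles_from_html`, `crawl` and `fetch` are hypothetical, and the sample page is hand-written so the example runs offline.

```python
import time
from typing import Callable, List

import bs4


def two_star_titles_from_html(html: str) -> List[str]:
    """Return the titles of all 2-star books found in one catalogue page."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    titles = []
    for book in soup.select(".product_pod"):
        if book.select_one(".star-rating.Two") is not None:
            # The second <a> in each product carries the full title attribute.
            titles.append(book.select("a")[1]["title"])
    return titles


def crawl(fetch: Callable[[str], str], base_url: str, pages: range,
          delay: float = 1.0) -> List[str]:
    """Fetch each page with `fetch`, pausing `delay` seconds after each request."""
    results = []
    for n in pages:
        results.extend(two_star_titles_from_html(fetch(base_url.format(n))))
        time.sleep(delay)
    return results


# Offline stand-in for a real page (hypothetical markup mimicking the site).
SAMPLE_PAGE = """
<article class="product_pod">
  <div><a href="a/index.html"><img alt="Book A"/></a></div>
  <p class="star-rating Two"></p>
  <h3><a href="a/index.html" title="Book A">Book A</a></h3>
</article>
<article class="product_pod">
  <div><a href="b/index.html"><img alt="Book B"/></a></div>
  <p class="star-rating Three"></p>
  <h3><a href="b/index.html" title="Book B">Book B</a></h3>
</article>
"""

# The fake fetch function returns the same sample page for every URL.
demo_titles = crawl(lambda url: SAMPLE_PAGE, "page-{}.html", range(1, 3), delay=0)
```

Against the real site, the call might look like `crawl(lambda url: requests.get(url, timeout=10).text, baseUrl, range(1, 51))`. Passing the fetch step in as a function keeps the parsing logic testable without a network connection.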

****