In [5]:
%pip install -Uq requests beautifulsoup4
from bs4 import BeautifulSoup
import requests


# Web Scraping

## What is web scraping?

**Web scraping is two sequential steps**
1. fetching a webpage's HTML
2. extracting data from the HTML

## Am I allowed to scrape?

*I'm not a lawyer, and don't play one on the internet*

Web scraping involves extracting data from HTML
- this HTML (& data) is publicly available through an HTTP request
- you only get back what they send you

`robots.txt` 
- a way for websites to tell crawlers & web scrapers what is and isn't allowed
- for example - https://www.theguardian.com/robots.txt
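A `robots.txt` file can be checked from Python with the standard library's `urllib.robotparser`. The sketch below parses a made-up set of rules written inline so that it runs offline; the rules, the `example.com` URLs and the `my-scraper` user agent name are all assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written inline so the example runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# "my-scraper" is a made-up user agent name for illustration.
allowed = parser.can_fetch("my-scraper", "https://example.com/articles/1")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/notes")
print(allowed, blocked)
```

For a real site you would instead call `parser.set_url(...)` with the address of its `robots.txt`, followed by `parser.read()`, before asking `can_fetch`.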

You should be polite
- tell the website who you are (`user-agent`)
- don't spam requests - add a `time.sleep` in between requests so you don't overload the server
- if they offer an API, use that instead
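The politeness points above can be sketched in a few lines. This is one possible approach, not a standard recipe: the `Throttle` class, the `HEADERS` value and the contact address are hypothetical names invented for this example, and the request itself is left as a comment so the code runs without network access.

```python
import time

# Hypothetical identity string: replace with your own project name and contact.
HEADERS = {"User-Agent": "my-scraper/0.1 (contact: me@example.com)"}


class Throttle:
    """Sleep as needed so calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.1)
start = time.monotonic()
for page in range(3):
    throttle.wait()
    # A real scraper would call requests.get(url, headers=HEADERS, timeout=10) here.
elapsed = time.monotonic() - start
print(f"3 simulated requests took {elapsed:.2f}s")
```

In practice an interval of a second or more is a kinder default; the short interval here only keeps the example quick to run.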

If you ever use data from web scraping commercially

- check for copyright
- e.g. you couldn't scrape videos from YouTube & repost them

## Fetching HTML

Web scraping is two steps
1. fetching a webpage's HTML
2. extracting data from the HTML

We will be scraping the Wikipedia page for [Yann LeCun](https://en.wikipedia.org/wiki/Yann_LeCun) - one of the three recipients of the 2018 Turing Award for work in Deep Learning - the other two being [Geoffrey Hinton](https://en.wikipedia.org/wiki/Geoffrey_Hinton) and [Yoshua Bengio](https://en.wikipedia.org/wiki/Yoshua_Bengio).

Let's do an HTTP request to the Wikipedia URL:

In [6]:
response = requests.get('https://en.wikipedia.org/wiki/Yann_LeCun')

We can look at the HTML content we get back using `.text`.

This is the same HTML that your browser uses to render a page:

In [7]:
response.text[:250]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Yann LeCun - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],'

## What is the web?

We think of the web as a collection of pages

- in fact, the web is a collection of users (usually web browsers, can also be servers) and servers

A better mental model for the web is a conversation between users & servers

## What is a server?

It's just a computer running a program

- e.g. a web app built with Flask, which is a Python program

Servers can also run & be accessed locally

- this is how we use Jupyter Lab :)

## HTTP - what happens when you visit a website?

This is the kind of conversation that happens when you access a page on the internet:

*CLIENT* - request to https://www.reddit.com

*SERVER* - I'm the server hosting reddit.com - what page would you like?

*CLIENT* - Please give me https://www.reddit.com/r/MachineLearning/

*SERVER* - Sure -> sends text files

*CLIENT* - Thanks! -> renders text files in browser

This kind of conversation is had every time you access a webpage

- **it's also the same thing that happens when we do `requests.get`!**
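The conversation above can be reproduced entirely on one machine using only the standard library. The sketch below starts a tiny local server in a background thread and then plays the client; the `HelloHandler` class and the page it returns are made up for illustration, and `http.client` stands in for `requests` so the example has no third-party dependencies.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """A tiny server: answers every GET request with one HTML page."""

    def do_GET(self):
        body = b"<html><body><p>Hello from the server</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port on this machine.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client side of the conversation: send a GET request, read the reply.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/")
reply = conn.getresponse()
status = reply.status
content_type = reply.getheader("Content-Type")
html = reply.read().decode("utf-8")
conn.close()

server.shutdown()
server.server_close()
print(status, content_type, html)
```

The status code and headers travel alongside the HTML in every reply; `requests` exposes the same pieces as `response.status_code`, `response.headers` and `response.text`.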

*Further reading*
[Interactive Data Visualization for the Web - Scott Murray](https://www.oreilly.com/library/view/interactive-data-visualization/9781449340223/) - in particular Chapter 3

## What text files are common on the internet?

What you can expect to get back when you send a request:
1. HTML (`.html`)
2. CSS (`.css`)
3. JavaScript (`.js`)

### HTML

HTML is a markup language used to format text.  

The fundamental primitive is an element.  Elements can have different tags, such as:
- `<p>` paragraph
- `<h1>` heading
- `<a>` link
- `<img>` image
- `<script>` JavaScript

These elements can be nested to create complex structures, with parent - child relationships between the elements.

Take a look at `example.html` to see a full HTML document.  You can also use HTML within notebooks (like this one).

HTML elements have optional attributes
- property `<a property="value">`
- class `<a class="myClass">`
- id `<a id="myID">`

Properties are usually things like color, and change how the HTML renders
- classes & IDs are used to identify elements
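Once HTML has been parsed with Beautiful Soup, attributes can be read from an element much like keys from a dictionary. The snippet below is a minimal sketch on a made-up `<a>` element (the URL, class names and id are invented for illustration):

```python
from bs4 import BeautifulSoup

# A made-up snippet, not taken from any real page.
html = '<a href="https://example.com" class="external text" id="home-link">Home</a>'
link = BeautifulSoup(html, "html.parser").find("a")

href = link["href"]           # attribute access works like a dictionary
classes = link.get("class")   # class can hold several values, so it is a list
element_id = link.get("id")
missing = link.get("title")   # .get returns None when the attribute is absent
print(href, classes, element_id, missing)
```

Using `.get` rather than square brackets avoids a `KeyError` when an element lacks the attribute, which is common on real pages.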

### CSS

Used to style HTML
- you don't need to know this for web scraping

### JavaScript

Dynamic, weakly typed language

- executes in the browser
- does fancy stuff like calling APIs, dynamically rendering HTML, responding to user input

While you don't need to know JavaScript for web scraping, it is useful to look out for JSON strings
- these can hold useful information
- example - always check for a `<script type="application/ld+json">`
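As a rough sketch of how such a JSON string can be pulled out, the example below finds a `<script type="application/ld+json">` element with Beautiful Soup and decodes it with the standard `json` module. The HTML document and the person it describes are invented for illustration; real pages may contain several such scripts, or none.

```python
import json

from bs4 import BeautifulSoup

# Made-up page content so the example runs offline.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Person", "name": "Ada Example", "jobTitle": "Researcher"}
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", attrs={"type": "application/ld+json"})

# Fall back to an empty dictionary when the page has no JSON-LD block.
record = json.loads(script.string) if script is not None else {}
print(record.get("name"))
```

The result is an ordinary Python dictionary, so no further HTML parsing is needed for the fields it contains.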

## HTML 101


Tags can have **attributes** - for example the `<a>` tag usually has an attribute of `href` that holds the link:

`<a href="https://adgefficiency.com/">My personal blog</a>`

This is rendered as:

<a href="https://adgefficiency.com/">My personal blog</a>

A common attribute for HTML elements to have is a **class** - this ties the element to a CSS class, which specifies how the element is styled.

## Parsing HTML

We need some way to parse this HTML text - to do this we will use **Beautiful Soup**.

We can use Beautiful Soup to search the HTML for specific tags.  First we create an instance of the `BeautifulSoup` class, passing in the HTML text we got using `requests`:

In [8]:
response = requests.get('https://en.wikipedia.org/wiki/Yann_LeCun')
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.text, 'html.parser')

# ul is an "unordered list"
len(soup.find_all('ul'))

54

The **title** tag is a special tag required in all HTML documents:

In [9]:
soup.title

<title>Yann LeCun - Wikipedia</title>

We can use Beautiful Soup to find all the `p` tags:

In [10]:
p = soup.find_all('p')

p[-1]

<p>In March 2019, LeCun won the Turing award, sharing it with <a href="/wiki/Yoshua_Bengio" title="Yoshua Bengio">Yoshua Bengio</a> and <a href="/wiki/Geoffrey_Hinton" title="Geoffrey Hinton">Geoffrey Hinton</a>.<sup class="reference" id="cite_ref-36"><a href="#cite_note-36">[36]</a></sup> In September 2019, he received the Golden Plate Award of the <a href="/wiki/Academy_of_Achievement" title="Academy of Achievement">American Academy of Achievement</a>.<sup class="reference" id="cite_ref-37"><a href="#cite_note-37">[37]</a></sup>
</p>

Or to find all the links (`a`) in a page:

In [11]:
anchors = soup.find_all('a')
anchors[-1]

<a href="https://www.mediawiki.org/"><img alt="Powered by MediaWiki" height="31" loading="lazy" src="/static/images/footer/poweredby_mediawiki_88x31.png" srcset="/static/images/footer/poweredby_mediawiki_132x47.png 1.5x, /static/images/footer/poweredby_mediawiki_176x62.png 2x" width="88"/></a>

## Developer tools

A useful set of tools in web development is the **Developer Tools** included in modern browsers.

The **Inspect elements** tool allows us to find the HTML block for the biography table.

Let's find the table:

In [14]:
table = soup.find('table', attrs={'class': 'infobox biography vcard'})

type(table)

bs4.element.Tag
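Matching on the full class string, as above, only works when the classes appear in exactly that order. The sketch below shows two alternatives that match on individual classes instead; it runs on a small made-up HTML snippet rather than the live Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up tables so the example runs offline.
html = """
<table class="infobox biography vcard"><tr><th>Born</th><td>1960</td></tr></table>
<table class="wikitable"><tr><td>other</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match on the whole class attribute string.
exact = soup.find("table", attrs={"class": "infobox biography vcard"})
# Match any table that has "infobox" among its classes.
by_one_class = soup.find("table", class_="infobox")
# CSS selector: a table with both classes, in any order.
by_selector = soup.select_one("table.infobox.vcard")

print(exact is by_one_class, exact is by_selector)
```

All three lookups return the same element here; the class-based forms tend to survive small changes to the page's markup better than the exact string.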

## Tables in HTML

`tr` = row

`th` = header cell

`td` = data cell

Let's take a look at the third row (**Born**):

In [15]:
rows = table.find_all('tr')
row = rows[2]
row

<tr><th class="infobox-label" scope="row">Born</th><td class="infobox-data"><span style="display:none"> (<span class="bday">1960-07-08</span>) </span>July 8, 1960<span class="noprint ForceAgeToShow"> (age 61)</span><br/><div class="birthplace" style="display:inline"><a href="/wiki/Soisy-sous-Montmorency" title="Soisy-sous-Montmorency">Soisy-sous-Montmorency</a>, <a href="/wiki/France" title="France">France</a></div></td></tr>

The header:

In [16]:
row.find('th')

<th class="infobox-label" scope="row">Born</th>

The data:

In [17]:
row.find('td')

<td class="infobox-data"><span style="display:none"> (<span class="bday">1960-07-08</span>) </span>July 8, 1960<span class="noprint ForceAgeToShow"> (age 61)</span><br/><div class="birthplace" style="display:inline"><a href="/wiki/Soisy-sous-Montmorency" title="Soisy-sous-Montmorency">Soisy-sous-Montmorency</a>, <a href="/wiki/France" title="France">France</a></div></td>

We can also get the text from these HTML elements:

In [18]:
row.find('td').text

' (1960-07-08) July 8, 1960 (age\xa061)Soisy-sous-Montmorency, France'
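The extracted text contains a leading space and a non-breaking space (`\xa0`). A minimal way to tidy it up, using only string methods on the value shown above:

```python
raw = ' (1960-07-08) July 8, 1960 (age\xa061)Soisy-sous-Montmorency, France'

# str.split() with no arguments splits on any Unicode whitespace, including the
# non-breaking space, and drops leading and trailing whitespace.
clean = " ".join(raw.split())
print(clean)
# (1960-07-08) July 8, 1960 (age 61)Soisy-sous-Montmorency, France
```

The missing space before "Soisy" comes from the `<br/>` in the HTML; calling `get_text(" ", strip=True)` on the element instead of `.text` inserts a separator between the pieces of text.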

We can store this data in a dictionary:

In [20]:
data = {}

data[row.find('th').text] = row.find('td').text
data

{'Born': ' (1960-07-08) July 8, 1960 (age\xa061)Soisy-sous-Montmorency, France'}

## Exercise - clean the biography table

Let's iterate over the rows in the biography table and store each row in a list of dictionaries:

In [21]:
#from answers import store_biography_table
#store_biography_table(rows)
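One possible approach to the exercise is sketched below. It is not the `store_biography_table` function from `answers`, whose implementation is not shown here; `rows_to_records` is a hypothetical helper, and the table is a small made-up stand-in for the Wikipedia infobox so that the code runs offline.

```python
from bs4 import BeautifulSoup

# Made-up table with the same tr/th/td layout as an infobox.
html = """
<table class="infobox biography vcard">
<tr><th colspan="2">Example Person</th></tr>
<tr><th scope="row">Born</th><td>July 8, 1960<br/>Somewhere, France</td></tr>
<tr><th scope="row">Known for</th><td>Deep learning</td></tr>
</table>
"""
rows = BeautifulSoup(html, "html.parser").find_all("tr")


def rows_to_records(rows):
    """Return one {header: value} dictionary per row that has both a th and a td."""
    records = []
    for row in rows:
        header, cell = row.find("th"), row.find("td")
        if header is None or cell is None:
            continue  # skip title or image rows with no header/data pair
        key = header.get_text(" ", strip=True)
        value = cell.get_text(" ", strip=True)
        records.append({key: value})
    return records


records = rows_to_records(rows)
print(records)
```

Skipping rows without both a `th` and a `td` matters on real infoboxes, where the first rows usually hold only a name or an image.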

## Finding links

Another common task when parsing HTML is to look for links - in HTML links have an `a` tag.  

Let's find all the links in the **References** section - which is a `div` element:

In [22]:
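references = soup.find('div', 'mw-references-wrap mw-references-columns')

In [23]:
# collect every link in the references section with a loop...
links = []
for link in references.find_all('a'):
    links.append(link)

# ...or, equivalently, with a list comprehension
links = [link for link in references.find_all('a')]
li = links[1]
li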

<a class="external text" href="https://www.legifrance.gouv.fr/jo_pdf.do?id=JORFTEXT000039726325" rel="nofollow">"Version électronique authentifiée publiée au JO n° 0001 du 01/01/2020 | Legifrance"</a>

In [24]:
li['href']

'https://www.legifrance.gouv.fr/jo_pdf.do?id=JORFTEXT000039726325'

In [25]:
li.text

'"Version électronique authentifiée publiée au JO n° 0001 du 01/01/2020 | Legifrance"'
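Many links on a page are relative (such as `/wiki/France`) or point at a fragment of the same page (such as `#cite_note-36`). The sketch below turns them into absolute URLs with `urllib.parse.urljoin`; the HTML snippet is made up, and `BASE_URL` is assumed to be the address the page was fetched from.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/Yann_LeCun"

# Made-up links covering the common cases.
html = """
<div>
<a href="/wiki/France">France</a>
<a href="#cite_note-36">[36]</a>
<a href="https://www.mediawiki.org/">MediaWiki</a>
<a>no href here</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute.
links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]
for link in links:
    print(link)
```

Filtering with `href=True` avoids a `KeyError` on anchors that have no `href`, which do appear on real pages.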

## Downloading images

Now that we are familiar with Beautiful Soup, we know we can easily find all the images in a page:

In [26]:
soup.find_all('img')

[<img alt="Yann LeCun - 2018 (cropped).jpg" data-file-height="1490" data-file-width="1297" decoding="async" height="253" src="//upload.wikimedia.org/wikipedia/commons/thumb/2/22/Yann_LeCun_-_2018_%28cropped%29.jpg/220px-Yann_LeCun_-_2018_%28cropped%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/2/22/Yann_LeCun_-_2018_%28cropped%29.jpg/330px-Yann_LeCun_-_2018_%28cropped%29.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/2/22/Yann_LeCun_-_2018_%28cropped%29.jpg/440px-Yann_LeCun_-_2018_%28cropped%29.jpg 2x" width="220"/>,
 <img alt="" class="thumbimage" data-file-height="2448" data-file-width="3264" decoding="async" height="225" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/84/Yann_LeCun_at_the_University_of_Minnesota.jpg/300px-Yann_LeCun_at_the_University_of_Minnesota.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/84/Yann_LeCun_at_the_University_of_Minnesota.jpg/450px-Yann_LeCun_at_the_University_of_Minnesota.jpg 1.5x, //upload.wikimedia.or

Let's download the first one - note that we use the `src` attribute, and have to prepend `'https:'` to the url:

In [27]:
img = soup.find_all('img')[0]
url = 'https:' + img['src']
url

'https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Yann_LeCun_-_2018_%28cropped%29.jpg/220px-Yann_LeCun_-_2018_%28cropped%29.jpg'

Now we can use `requests` again to get the bytes for this image:

In [28]:
res = requests.get(url)

In [29]:
res.content[:50]

b'\xff\xd8\xff\xe1\x00\x92Exif\x00\x00MM\x00*\x00\x00\x00\x08\x00\x06\x01\x1a\x00\x05\x00\x00\x00\x01\x00\x00\x00V\x01\x1b\x00\x05\x00\x00\x00\x01\x00\x00\x00^\x01(\x00\x03'
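To finish the download, the bytes need to be written to a file opened in binary mode. The sketch below is one way to do that; `save_image` is a hypothetical helper, and stand-in bytes replace `res.content` so the example runs without a network connection.

```python
import tempfile
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these bytes


def save_image(content, path):
    """Write raw image bytes to path and return the number of bytes written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.write_bytes(content)


# Stand-in bytes so the sketch runs offline; in the notebook this would be res.content.
fake_content = JPEG_MAGIC + b"\xe1\x00\x92Exif" + b"\x00" * 16
looks_like_jpeg = fake_content.startswith(JPEG_MAGIC)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "images" / "lecun.jpg"
    written = save_image(fake_content, target)
    round_trip = target.read_bytes()

print(looks_like_jpeg, written)
```

Checking the first few bytes is a cheap sanity check that the server returned an image rather than, say, an HTML error page.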

## Exercise: Web Scraping Poems

Scrape the poems of the poet Edgar Allan Poe from this website: https://poestories.com/poetry.php

Put them in the `./text_files` directory, and run a report on the poems afterwards.
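The saving and reporting half of the exercise could look something like the sketch below. The scraping step is left out, since it depends on the structure of the site's pages, which is not shown here. The helpers `save_poems` and `report` are hypothetical, the "report" is assumed to mean simple line and word counts, and the poems are placeholder text.

```python
import re
import tempfile
from pathlib import Path


def save_poems(poems, directory):
    """Write each poem to <directory>/<slug>.txt and return the paths written."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for title, text in poems.items():
        # Turn the title into a safe file name, e.g. "A Dream" -> "a-dream".
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        path = directory / f"{slug}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


def report(directory):
    """Return {filename: (line count, word count)} for every .txt file."""
    summary = {}
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        summary[path.name] = (len(text.splitlines()), len(text.split()))
    return summary


# Placeholder text, not real poems: the scraping step is left as the exercise.
poems = {"First Poem": "one two three\nfour five", "Second: Poem!": "six seven"}

with tempfile.TemporaryDirectory() as tmp:
    save_poems(poems, Path(tmp) / "text_files")
    summary = report(Path(tmp) / "text_files")

print(summary)
```

For the exercise itself, replace the temporary directory with `./text_files` and build the `poems` dictionary from the scraped pages.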