# 🌐 Web Scraping with Python

## Why is this useful?
There are many reasons why you would want to scrape data. Some examples are:
- Scrape newspaper pages to gather information about important historical events (e.g. elections, major reforms, armed conflicts)
- Compare prices of different products by scraping the pages
- Find the cheapest flight tickets for your dream holidays!

 Web scraping is about downloading structured data from the web, selecting some of that data, and passing along what you selected to another process.

## A little bit of theory

We first need to understand some basics of how the web works and how we can access data on the web.

### HTTP
HTTP, or Hypertext Transfer Protocol, is an application protocol used for communication between distributed systems on the web. The web as we know it today (the World Wide Web) uses HTTP as its main data communication protocol.

HTTP functions as a request-response protocol: one end issues a "request", and the other end receives it and replies with a "response". This is a high-level explanation, but keep in mind that the response can be pretty much anything that is parseable (a JSON document, XML, HTML, an integer, a URL... you name it).

The classic example of how this works is browsing the web. Typically, one or more servers host the website you are accessing; they receive your requests and respond with the website's pages (serving them to you). In this case your browser is known as the "client" and the server, well, it's known as the "server".

![](./assets/http.png)
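
To make the request-response cycle concrete, here is a minimal sketch that plays both roles on your own machine using only Python's standard library. The handler class `HelloHandler` and the page it serves are made up for illustration; a real website works the same way, just over the network.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers every GET request with a tiny HTML page."""

    def do_GET(self):
        page = b"<html><body><p>Hello from the server!</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Start the server on a free local port, in a background thread
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: issue a request and read the response
connection = HTTPConnection("127.0.0.1", port, timeout=5)
connection.request("GET", "/")
response = connection.getresponse()
status = response.status
content_type = response.getheader("Content-Type")
body = response.read().decode("utf-8")
connection.close()

server.shutdown()
server.server_close()

print(status, content_type)
print(body)
```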

### URLs
![](./assets/urls.png)
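
The parts of a URL can also be inspected from Python. The sketch below uses the standard library's `urllib.parse` on a made-up address (the host `www.example.com`, its path, query and fragment are placeholders, not taken from the figure) to show how a URL breaks down into its components.

```python
from urllib.parse import parse_qs, urlsplit

# Placeholder URL, used only to illustrate the components
url = "https://www.example.com:8080/recipes/soup.html?type=turtle&page=2#ingredients"

parts = urlsplit(url)
print(parts.scheme)    # protocol used to talk to the server
print(parts.hostname)  # machine that hosts the resource
print(parts.port)      # port on that machine
print(parts.path)      # location of the resource on the server
print(parts.query)     # extra parameters sent with the request
print(parts.fragment)  # position inside the page (handled by the browser)

# The query string can be decoded into a dictionary of lists
query = parse_qs(parts.query)
print(query)
```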


### The components of a webpage

When we visit a web page, our web browser makes a request to a web server. This is called a `GET` request, as we are getting files from the server. The server then sends back the files that tell our browser how to render the page for us. The files fall into a few types:
- HTML: contains the main content of the page
- CSS: the styling (makes the page look snazzy)
- JS: JavaScript files that add interactivity to the page
- Images

Once these files are received, the web browser renders them and displays the page to us.


### Verbs / Methods
These define the action that should be performed on the host:

- HTTP GET: Requests a representation of the specified resource (Document, HTML Page, Picture, JSON, XML...). Using the GET method should only retrieve data and should have no other effect.


- HTTP POST: Requests that the server accept the data enclosed in its request body as a new resource to be persisted on its end. The data POSTed might be, for instance, a new user who registered on your website, a new message in an instant messaging app, a comment on a thread, etc. Typically, it's something the server will store for later consumption, processing or usage.


- HTTP HEAD: The HEAD request is identical to the GET request, but instead of receiving the full payload, the client receives only the response headers (meta-information about the resource and the server). This is useful for understanding what is running on the server (or how it reacts to different requests) without having to transport the entire content of a standard response.


- HTTP PUT: Similar to the POST request, but this one supplies a URI (identifier) that the server should use to persist the object transported by the PUT request. The catch here is that if an object with the same URI already exists on the server side, it should be overwritten by the one received. This is also known as an UPSERT or merge operation: if the record does not exist it is inserted, otherwise it is updated.


- HTTP DELETE: Requests the deletion of the specified resource


- HTTP OPTIONS: Requests the HTTP methods and actions supported by the server for one specific URL


- HTTP TRACE: Bounces the issued request to the server and back again. This is useful for understanding whether any intermediate servers made any changes to the request you issued, before it reached the target.
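
As a rough illustration of how these verbs look from Python, the sketch below builds (but never sends) a few requests with the `requests` library, which is introduced later in this tutorial. The `https://example.com/...` addresses and the payload are placeholders; `Request(...).prepare()` only assembles the request so that we can inspect its method, URL and body.

```python
import requests

# GET: ask for a representation of a resource; parameters travel in the URL
get_request = requests.Request(
    "GET", "https://example.com/articles", params={"page": 2}
).prepare()

# POST: send data in the request body for the server to store
post_request = requests.Request(
    "POST", "https://example.com/comments", json={"text": "Nice soup!"}
).prepare()

# DELETE: ask the server to remove a resource
delete_request = requests.Request(
    "DELETE", "https://example.com/comments/42"
).prepare()

# Nothing has been sent; we are only looking at what would go over the wire
for prepared in (get_request, post_request, delete_request):
    print(prepared.method, prepared.url, prepared.body)
```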


### HTTP status codes

The status codes are the way the server tells the client what happened with the request it issued. Have you ever tried to access a site and seen the classic "404 - Not Found" screen? Well, it turns out that "404" is the status code that represents the Not Found status.

Each status is represented by its own integer and falls into one of five categories:

- `1XX` - Informational (e.g. 100 - Continue)

- `2XX` - Success (e.g. 200 - OK; 201 - Created; 204 - No Content)

- `3XX` - Redirection (e.g. 301 - Moved Permanently)

- `4XX` - Client Error (e.g. 400 - Bad Request; 401 - Unauthorized; 404 - Not Found)

- `5XX` - Server Error (e.g. 500 - Internal Server Error; 501 - Not Implemented)

For a full list of status codes you can try [this link](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes?oldformat=true), or if you are a cat lover you can try this [visual representation of status codes as cats](https://http.cat).
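
Python's standard library knows the standard status codes too. The sketch below combines `http.HTTPStatus` with the five categories listed above; the helper name `describe_status` and the `CATEGORIES` dictionary are introduced here purely for illustration.

```python
from http import HTTPStatus

# The first digit of a status code determines its category
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return (category, phrase) for a standard HTTP status code."""
    category = CATEGORIES[code // 100]
    phrase = HTTPStatus(code).phrase
    return category, phrase


for code in (200, 301, 404, 500):
    category, phrase = describe_status(code)
    print(f"{code}: {phrase} ({category})")
```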

### HTML

HyperText Markup Language (HTML) is the language that web pages are written in. HTML isn't a programming language like Python; instead, it's a markup language that tells a browser how to lay out content. HTML allows you to do things similar to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn't a programming language, it isn't nearly as complex as Python.

Let's take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the `<html>` tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:

```html
<html>
</html>
```

If you saved this as `simple_webpage.html` and opened the file in a browser, you would not see anything. Let's add more content:

```html
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>
```

Tags have commonly used names that depend on their position in relation to other tags:

- `child`: a child is a tag inside another tag. So the two `p` tags above are both children of the `body` tag.
- `parent`: a parent is the tag another tag is inside. Above, the `html` tag is the parent of the `body` tag.
- `sibling`: a sibling is a tag that is nested inside the same parent as another tag. For example, `head` and `body` are siblings, since they're both inside `html`. Both `p` tags are siblings, since they're both inside `body`.
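
To see this nesting the way a program sees it, here is a small sketch built on the standard library's `html.parser`. The class `TagTreePrinter` is a made-up helper for this example: it records each opening tag, indented by how deeply it is nested, so children appear one level below their parent and siblings share the same indentation.

```python
from html.parser import HTMLParser


class TagTreePrinter(HTMLParser):
    """Hypothetical helper that records each tag indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An opening tag is a child of whatever tag is currently open
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up to the parent level
        self.depth -= 1


html = """
<html>
    <head>
    </head>
    <body>
        <p>Here's a paragraph of text!</p>
        <p>Here's a second paragraph of text!</p>
    </body>
</html>
"""

printer = TagTreePrinter()
printer.feed(html)
print("\n".join(printer.lines))
```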


There are many tags that add functionality and behaviour to web pages. For a full list of them, [visit this link](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

Let's add another tag:


```html
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org">The Python website!!</a>
        </p>
    </body>
</html>
```

In the above example, the `<a>` tag tells the browser to render a link to another web page. We also added a `class` attribute to our paragraphs. Classes are optional; they give elements a name that CSS rules can use to apply special properties (such as styling) to them.

## Getting started with the scraping

We are going to use the Python library [requests](http://docs.python-requests.org/en/master/) to collect data from the web. Let's start with a basic web page:

In [1]:
import requests

url = 'https://goo.gl/FwemWV'

page = requests.get(url)

The object `page` is now a response object. This will contain information about our request such as the status, encoding, the content, and much more.

Let's check the status code using the `status_code` attribute and get the `text` from the response.

In [2]:
page.status_code

200
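
Rather than checking `status_code` by eye, a script can ask `requests` to raise an exception when the server reports an error. The helper below is a small sketch; the name `fetch_html` and the 10-second timeout are choices made for this example, not part of the library.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on 4XX/5XX responses."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code signals an error
    response.raise_for_status()
    return response.text
```

With this in place you could write `html = fetch_html(url)` instead of `requests.get(url).text`; a 4XX or 5XX response then stops the script with an error instead of quietly handing you an error page to parse.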

In [3]:
page.text

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n\n<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=us-ascii" />\n\n  <title>Turtle Soup</title>\n</head>\n\n<body>\n  <h1>Turtle Soup</h1>\n\n  <p class="verse" id="first">Beautiful Soup, so rich and green,<br />\n  Waiting in a hot tureen!<br />\n  Who for such dainties would not stoop?<br />\n  Soup of the evening, beautiful Soup!<br />\n  Soup of the evening, beautiful Soup!<br /></p>\n\n  <p class="chorus" id="second">Beau--ootiful Soo--oop!<br />\n  Beau--ootiful Soo--oop!<br />\n  Soo--oop of the e--e--evening,<br />\n  Beautiful, beautiful Soup!<br /></p>\n\n  <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br />\n  Game or any other dish?<br />\n  Who would not give all else for two<br />\n  Pennyworth only of Beautiful Soup?<br />\n  Pennyworth only of

Here you can see all the content, including the HTML tags. However, there is not much spacing, and this makes the content very difficult to read 🤨.
We can then use a library called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the content of our web page in a nicer way.

### Parsing using Beautiful Soup

The Beautiful Soup library creates a parse tree from parsed HTML and XML documents (including documents with non-closed tags or [tag soup](https://en.wikipedia.org/wiki/Tag_soup) and other malformed markup). This functionality will make the web page text more readable than what we saw coming from the Requests module.

Let's start by importing the library:

In [4]:
from bs4 import BeautifulSoup

Next, we'll run the `page.text` document through the module to give us a `BeautifulSoup` object, that is, a parse tree of the page that we get from running Python's built-in `html.parser` over the HTML. The constructed object represents the `mockturtle.html` document as a nested data structure. This is assigned to the variable `soup`.

We will then use the method `prettify` to turn the BeautifulSoup parse tree into a nicely formatted Unicode string. Doing this will render each tag on a separate line 🤩

In [5]:
soup = BeautifulSoup(page.text, 'html.parser')

print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title>
   Turtle Soup
  </title>
 </head>
 <body>
  <h1>
   Turtle Soup
  </h1>
  <p class="verse" id="first">
   Beautiful Soup, so rich and green,
   <br/>
   Waiting in a hot tureen!
   <br/>
   Who for such dainties would not stoop?
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
   Soup of the evening, beautiful Soup!
   <br/>
  </p>
  <p class="chorus" id="second">
   Beau--ootiful Soo--oop!
   <br/>
   Beau--ootiful Soo--oop!
   <br/>
   Soo--oop of the e--e--evening,
   <br/>
   Beautiful, beautiful Soup!
   <br/>
  </p>
  <p class="verse" id="third">
   Beautiful Soup! Who cares for fish,
   <br/>
   Game or any other dish?
   <br/>
   Who would not give all else for two
   <br/>
   Pennyworth only of 

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the `children` property of `soup`. Note that `children` returns an iterator rather than a list, so we need to call the function `list` on it:

In [6]:
list(soup.children)

['html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"',
 '\n',
 <html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
 <title>Turtle Soup</title>
 </head>
 <body>
 <h1>Turtle Soup</h1>
 <p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>
 <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>
 <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>
 <p class="choru
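
The same parent, child and sibling relationships described earlier can be followed from any tag. The sketch below uses a tiny made-up document (not the page downloaded above) so that it runs without a network connection. On real pages `children` also yields the `'\n'` strings between tags, as in the output above, which is why the example keeps only `Tag` objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Small stand-in document, used only for illustration
html = (
    "<html><head></head><body>"
    "<p>First paragraph</p><p>Second paragraph</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Children: keep only real tags, skipping any whitespace strings
body = soup.body
paragraphs = [child for child in body.children if isinstance(child, Tag)]
print([p.get_text() for p in paragraphs])

# Parent: the tag that directly contains this one
first = paragraphs[0]
print(first.parent.name)

# Siblings: tags that share the same parent
print(first.find_next_sibling("p").get_text())
print(soup.head.find_next_sibling("body").name)
```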

### 🕵 Finding tag instances

We can extract all the instances of a tag from a page using the `find_all` method. For example, searching for `<p>` returns every `<p>` element together with everything nested inside it, including the `<br>` line breaks.

In [7]:
soup.find_all('p')

[<p class="verse" id="first">Beautiful Soup, so rich and green,<br/>
   Waiting in a hot tureen!<br/>
   Who for such dainties would not stoop?<br/>
   Soup of the evening, beautiful Soup!<br/>
   Soup of the evening, beautiful Soup!<br/></p>,
 <p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]

Notice that the returned data is surrounded by `[ ]`, indicating that this is a Python list. So, as with any other list, we can select a particular item by index and use the method `get_text` to extract the text from inside the tag.

In [8]:
soup_p = soup.find_all('p')

soup_p[2].get_text()

'Beautiful Soup! Who cares for fish,\n  Game or any other dish?\n  Who would not give all else for two\n  Pennyworth only of Beautiful Soup?\n  Pennyworth only of beautiful Soup?'

<div class=warn>
Again, the `\n` line breaks are returned
</div>
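
If the line breaks get in the way, `get_text` accepts `separator` and `strip` arguments that tidy the text as it is extracted. The sketch below runs on a shortened copy of the first verse so that it works offline; it also shows `find`, which returns only the first match (or `None`) instead of a list.

```python
from bs4 import BeautifulSoup

# Shortened copy of the first verse from the page above
html = (
    '<p class="verse" id="first">Beautiful Soup, so rich and green,<br />\n'
    "  Waiting in a hot tureen!<br /></p>"
)
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches
verse = soup.find("p")
missing = soup.find("h1")

# Default behaviour: the text is returned exactly as it appears in the HTML
raw = verse.get_text()

# strip=True trims each piece of text; separator is placed between the pieces
clean = verse.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
print(missing)
```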


### Finding tags by class and ID

We saw the `class` attribute earlier; HTML elements can also carry an `id` attribute, which identifies a single element on the page. Both are used by CSS selectors, and both are helpful when working with web data using Beautiful Soup. We can target specific classes and IDs by using the `find_all()` method and passing the class and ID strings as arguments.

First, let's find all of the instances of the class `chorus`. In Beautiful Soup we assign the string for the class to the keyword argument `class_` (with a trailing underscore, because `class` is a reserved word in Python):

In [9]:
soup.find_all(class_='chorus')

[<p class="chorus" id="second">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beautiful Soup!<br/></p>,
 <p class="chorus" id="fourth">Beau--ootiful Soo--oop!<br/>
   Beau--ootiful Soo--oop!<br/>
   Soo--oop of the e--e--evening,<br/>
   Beautiful, beauti--FUL SOUP!<br/></p>]

Similarly, we can use `find_all` to target specific `id`s:

In [10]:
soup.find_all(id='third')

[<p class="verse" id="third">Beautiful Soup! Who cares for fish,<br/>
   Game or any other dish?<br/>
   Who would not give all else for two<br/>
   Pennyworth only of Beautiful Soup?<br/>
   Pennyworth only of beautiful Soup?<br/></p>]
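
Beautiful Soup also understands CSS selector syntax through its `select` and `select_one` methods, which some people find more compact than keyword arguments. The sketch below is one possible alternative, run on a shortened stand-in for the poem (one line per paragraph) so that it works offline.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page above: same classes and ids, one line each
html = """
<p class="verse" id="first">Beautiful Soup, so rich and green,</p>
<p class="chorus" id="second">Beau--ootiful Soo--oop!</p>
<p class="verse" id="third">Beautiful Soup! Who cares for fish,</p>
<p class="chorus" id="fourth">Beautiful, beauti--FUL SOUP!</p>
"""
soup = BeautifulSoup(html, "html.parser")

# "p.chorus" means: <p> tags whose class is "chorus"
chorus_ids = [p["id"] for p in soup.select("p.chorus")]

# "#third" means: the element whose id is "third"
third = soup.select_one("#third")

# The keyword-argument style can combine a tag name with a class
verse_ids = [p["id"] for p in soup.find_all("p", class_="verse")]

print(chorus_ids)
print(third.get_text())
print(verse_ids)
```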

## 📑 Illustrative example

On July 21, 2017, the New York Times updated an opinion article called [Trump's Lies](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html), detailing every public lie the President has told since taking office. Because this is a newspaper, the information was (of course) published as a block of text. This is a great format for human consumption, but it can't easily be understood by a computer.

We will use Python to analyse the published data. This is what the text looks like:


![](assets/article_1.png)


When converting this into a dataset, you can think of each lie as a "record" with four fields:

1. The date of the lie.
2. The lie itself (as a quotation).
3. The writer's brief explanation of why it was a lie.
4. The URL of an article that substantiates the claim that it was a lie.


Importantly, those fields have different formatting, which is consistent throughout the article: the date is bold red text, the lie is "regular" text, the explanation is gray italics text, and the URL is linked from the gray italics text.

Let's get started

In [11]:
url = 'https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html'

data = requests.get(url)

data.status_code

200

In [12]:
data.text



Now we need to parse the data using Beautiful Soup. If you inspect the website source you will see something like this:

![](assets/source_3.png)

You might have noticed that each record has the following format:

`<span class="short-desc"><strong> DATE </strong> LIE <span class="short-truth"><a href="URL"> EXPLANATION </a></span></span>`

There's an outer `<span>` tag, and then nested within it is a `<strong>` tag plus another `<span>` tag, which itself contains an `<a>` tag. All of these tags affect the formatting of the text. And because the New York Times wants each record to appear in a consistent way in your web browser, we know that each record will be tagged in a consistent way in the HTML. This is the pattern that allows us to build our dataset!

In [13]:
soup = BeautifulSoup(data.text, 'html.parser')
records = soup.find_all('span', attrs={'class':'short-desc'})

print(f'{len(records)} found')

180 found


To get a better idea of how the records look let's get the first three entries:

In [14]:
records[:3]

[<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 21 </strong>“A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” <span class="short-truth"><a href="http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/" target="_blank">(Trump was on the cover 11 times and Nixon appeared 55 times.)</a></span></span>,
 <span class="short-desc"><strong>Jan. 23 </strong>“Between 3 million and 5 million illegal votes caused me to lose the popular vote.” <span class="short-truth"><a href="https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrat

### Analysing one record

Web scraping is often an iterative process, in which you experiment with your code until it works exactly as you desire. To simplify the experimentation, we'll start by only working with the first record in the `records` object, and then later on we'll modify our code to use a loop. 

⭐️ The first thing we are going to extract is the date, which is enclosed in the `<strong>` tag. Since we are only interested in the date (and not the tags), we extract the text inside the tag:

In [15]:
first_result = records[0]
print(first_result)

<span class="short-desc"><strong>Jan. 21 </strong>“I wasn't a fan of Iraq. I didn't want to go into Iraq.” <span class="short-truth"><a href="https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">(He was for an invasion before he was against it.)</a></span></span>


In [16]:
first_result.find('strong').text

'Jan. 21\xa0'

🤔 But what is that `\xa0`? It is an escape sequence that represents a non-breaking space (the `&nbsp;` character in HTML). Since we don't need it, we can get rid of it by slicing off the end of the string.

In [17]:
first_result.find('strong').text[0:-1] + ', 2017'  

'Jan. 21, 2017'
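
Slicing works because the unwanted character is always last. As an alternative sketch, Python treats the non-breaking space as whitespace, so `str.strip()` removes it without us having to count characters. The example uses the string from the output above and only the standard library.

```python
import unicodedata

# The date text exactly as Beautiful Soup returned it
raw_date = "Jan. 21\xa0"

# Confirm what the last character actually is
print(unicodedata.name(raw_date[-1]))
print(raw_date[-1].isspace())

# strip() removes leading and trailing whitespace, including \xa0
clean_date = raw_date.strip() + ", 2017"
print(clean_date)
```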

⭐️ The next thing is to extract 'the lie'. The problem is that there are no surrounding tags, so we can use the `children` attribute that we used before and then index into the resulting list to extract the content.

In [18]:
list(first_result.children)[1]

"“I wasn't a fan of Iraq. I didn't want to go into Iraq.” "

Almost there... now we need to get rid of the silly quotation marks at the beginning and the end of the text. The slice below ends at `-2` because a trailing space follows the closing quotation mark.

In [19]:
list(first_result.children)[1][1:-2]

"I wasn't a fan of Iraq. I didn't want to go into Iraq."

⭐️ Next we are going to extract the explanation. Based on what we have done so far, you might have figured out that we have two options for doing this.

> #### Exercise 3.1
> a) Can you extract the explanation using a tag?
>
> b) Try using the `children` attribute to extract the content


⭐️ The last thing to extract is the link to the source. Beautiful Soup treats tag attributes and their values like key-value pairs in a dictionary: you put the attribute name in brackets (like a dictionary key), and you get back the attribute's value.

In [20]:
first_result.find('a')['href']  

'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'
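
The dictionary analogy extends to missing keys: bracket access raises a `KeyError` when the attribute is absent, while `.get()` returns `None`. The sketch below shows both on the link from the earlier HTML example, with a `target` attribute added for illustration.

```python
from bs4 import BeautifulSoup

html = '<a href="https://www.python.org" target="_blank">The Python website!!</a>'
link = BeautifulSoup(html, "html.parser").find("a")

# Bracket access, as used above
print(link["href"])

# All attributes of the tag as a dictionary
print(link.attrs)

# .get() returns None for a missing attribute instead of raising
print(link.get("title"))

# Bracket access on a missing attribute raises KeyError
try:
    link["title"]
except KeyError:
    print("This link has no title attribute")
```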

### Building a full data set

Now that we've figured out how to extract the four components of `first_result`, we can create a loop to repeat this process on all 180 results. We are going to store this in a list called `all_records`

In [21]:
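all_records = []

for record in records:
    # Date: drop the trailing non-breaking space and append the year
    date = record.find('strong').text[0:-1] + ', 2017'
    # Lie: second child of the record, without the surrounding quotation marks
    lie = list(record.children)[1][1:-2]
    # Explanation: link text without the surrounding parentheses
    explanation = record.find('a').text[1:-1]
    # URL: value of the link's href attribute
    url = record.find('a')['href']
    all_records.append((date, lie, explanation, url))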

In [22]:
len(all_records)

180
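
The slices in the loop rely on every record having exactly the same layout. As a more defensive sketch, the hypothetical helper `parse_record` below strips whitespace and punctuation by value rather than by position, and returns `None` when an expected tag is missing. It is run here on a copy of the first record so that it works without downloading the page; it is an illustration, not the method used in the rest of this tutorial.

```python
from bs4 import BeautifulSoup

# Copy of the first record shown earlier (\xa0 is the non-breaking space,
# \u201c and \u201d are the curly quotation marks)
sample_html = (
    '<span class="short-desc"><strong>Jan. 21\xa0</strong>'
    "\u201cI wasn't a fan of Iraq. I didn't want to go into Iraq.\u201d "
    '<span class="short-truth">'
    '<a href="https://www.buzzfeed.com/andrewkaczynski/'
    'in-2002-donald-trump-said-he-supported-invading-iraq-on-the" target="_blank">'
    "(He was for an invasion before he was against it.)</a></span></span>"
)


def parse_record(record, year="2017"):
    """Return (date, lie, explanation, url), or None if the record is incomplete."""
    date_tag = record.find("strong")
    link_tag = record.find("a")
    if date_tag is None or link_tag is None:
        return None

    # The lie is the text node that directly follows the <strong> tag
    lie_node = date_tag.next_sibling
    if lie_node is None:
        return None

    date = f"{date_tag.get_text().strip()}, {year}"
    lie = str(lie_node).strip().strip("\u201c\u201d")
    explanation = link_tag.get_text().strip().strip("()")
    url = link_tag.get("href")
    return date, lie, explanation, url


record = BeautifulSoup(sample_html, "html.parser").find("span", class_="short-desc")
parsed = parse_record(record)
print(parsed)
```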

In [23]:
all_records

[('Jan. 21, 2017',
  "I wasn't a fan of Iraq. I didn't want to go into Iraq.",
  'He was for an invasion before he was against it.',
  'https://www.buzzfeed.com/andrewkaczynski/in-2002-donald-trump-said-he-supported-invading-iraq-on-the'),
 ('Jan. 21, 2017',
  'A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.',
  'Trump was on the cover 11 times and Nixon appeared 55 times.',
  'http://nation.time.com/2013/11/06/10-things-you-didnt-know-about-time/'),
 ('Jan. 23, 2017',
  'Between 3 million and 5 million illegal votes caused me to lose the popular vote.',
  "There's no evidence of illegal voting.",
  'https://www.nytimes.com/2017/01/23/us/politics/donald-trump-congress-democrats.html'),
 ('Jan. 25, 2017',
  'Now, the audience was the biggest ever. But this crowd was massive. Look how far back it goes. This crowd was massive.',
  "Official aerial photos show Obama's 2009 inauguration was 

### Dumping into a pandas dataframe

It is helpful to have this data as a DataFrame, especially if you intend to do some sort of statistical analysis or data visualization.


In [24]:
import pandas as pd

df = pd.DataFrame(all_records, columns=['date', 'lie', 'explanation', 'url'])  


In [25]:
df.head()

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | Jan. 21, 2017 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | Jan. 21, 2017 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | Jan. 23, 2017 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | Jan. 25, 2017 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | Jan. 25, 2017 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


Also, the date is not in a very conventional format (it does not conform to the ISO date format, YYYY-MM-DD), so we are going to convert it to a datetime data type using pandas.

In [26]:
df['date'] = pd.to_datetime(df['date'])

The code above converts the "date" column to datetime format, and then overwrites the existing "date" column. (Notice that we did not have to tell pandas that the column was originally in "MONTH DAY, YEAR" format - pandas just figured it out!)
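
Letting pandas guess is convenient, but you can also state the format explicitly, which makes the intent clear and flags values that do not match. The sketch below is one way to do that on a few of the dates shown above; the format string `"%b. %d, %Y"` assumes an abbreviated English month name followed by a period, and the non-matching value is a made-up example, not taken from the article.

```python
import pandas as pd

# A few of the date strings from the DataFrame above
dates = pd.Series(["Jan. 21, 2017", "Jan. 23, 2017", "Jan. 25, 2017"])

# Explicit format: abbreviated month, period, day, comma, four-digit year
parsed = pd.to_datetime(dates, format="%b. %d, %Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())

# Hypothetical value that does not follow the format;
# errors="coerce" turns it into NaT instead of raising an exception
mixed = pd.Series(["Jan. 21, 2017", "sometime in 2017"])
coerced = pd.to_datetime(mixed, format="%b. %d, %Y", errors="coerce")
print(coerced.isna().tolist())
```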

In [27]:
df.head()

| | date | lie | explanation | url |
|---|---|---|---|---|
| 0 | 2017-01-21 | I wasn't a fan of Iraq. I didn't want to go in... | He was for an invasion before he was against it. | https://www.buzzfeed.com/andrewkaczynski/in-20... |
| 1 | 2017-01-21 | A reporter for Time magazine — and I have been... | Trump was on the cover 11 times and Nixon appe... | http://nation.time.com/2013/11/06/10-things-yo... |
| 2 | 2017-01-23 | Between 3 million and 5 million illegal votes ... | There's no evidence of illegal voting. | https://www.nytimes.com/2017/01/23/us/politics... |
| 3 | 2017-01-25 | Now, the audience was the biggest ever. But th... | Official aerial photos show Obama's 2009 inaug... | https://www.nytimes.com/2017/01/21/us/politics... |
| 4 | 2017-01-25 | Take a look at the Pew reports (which show vot... | The report never mentioned voter fraud. | https://www.nytimes.com/2017/01/24/us/politics... |


<div class=warn>
Do not modify below: this adds the style to the notebook
</div>

In [28]:
from IPython.display import HTML


def css_styling():
    # Read the custom stylesheet and close the file when done
    with open("styles/custom.css", "r") as css_file:
        styles = css_file.read()
    return HTML(styles)
css_styling()