# BLU03 - Learning Notebook - Part 3 of 3 - Web scraping


## 1. Introduction

In the context of data wrangling, we've already talked about three data sources: files, databases and public APIs.
Now it's time to delve into the Web!

As we all know, there is a huge amount of data on the Web. Whenever we search for something on Google, it shows us thousands of web pages full of answers.

However, there is a problem here: in most cases, web pages show us the data in a beautiful but unstructured way. This makes sense, since the purpose of a web page is to be read by a human and not to have its content analysed by some computer program.

So we are left with the boring task of copying and pasting the data we want into CSV files or Excel tables, possibly thousands of times, before feeding it to some data model...

But worry no more!

<img src="media/web_scraping_to_the_rescue.png" width=350/>

## 2. What is web scraping

[Web scraping](https://en.wikipedia.org/wiki/Web_scraping) is the name given to the process of extracting data from web pages in an automated way.
There are many [techniques](https://en.wikipedia.org/wiki/Web_scraping#Techniques) that can be used to do web scraping and the one we're going to explore here is HTML parsing.

A web page is an HTML document, so HTML parsing means splitting the contents of a web page into several small pieces and selecting the parts we find interesting. This technique is useful when we want to extract data from many web pages that share a common template.

## 3. Understanding the HTML code of a web page

Before jumping to the part where we actually do web scraping, let's first understand the structure and code of a web page.

Usually, a web page has 3 different types of code:
* **HTML**: defines the structure and content of the web page
* **CSS**: applies styles to the web page; it's what makes the page pretty
* **JavaScript**: makes the page dynamic, for example by triggering an action when a button is clicked

We'll focus now on the HTML part, since it's the one that concerns what we want, which is data.

The file **../web_pages/nationalmuseum.html** contains an example of an HTML document that represents a web page. Let's see the code.

In [1]:
# On Linux/macOS, use cat:
# ! cat web_pages/nationalmuseum.html
# On Windows, use type (with backslashes, or the full path if needed):
! type web_pages\nationalmuseum.html

<!DOCTYPE html>
<html>
  <body>
    <h1>Webpage about the Nationamuseum</h1>
    <h3>It's in Sweden.</h3>
    <p>For more informations:</p>
    <br>
    <p>Check wikipedia!</p>
  </body>
</html>


And this is how the page looks in a browser.

![title](media/nationalmuseum_page_2.png)

As you can see above, an HTML page is a collection of HTML elements, where an element has the form:
```<tagname> content </tagname>```.

HTML elements can be nested inside other HTML elements, meaning that the content between the start and end tags can be a set of elements.

An HTML element can also have no content. In that case, it consists of just a start tag, like this:
```<tagname>```.

Let's go through the elements in this page:
- the ```<!DOCTYPE html>``` says that this document is an HTML document
- the ```<html>``` element is the root element of an HTML page
- the ```<body>``` element has the page content
- the ```<h1>``` element is a large heading
- the ```<h3>``` element is a smaller heading
- the ```<p>``` element is a paragraph
- the ```<br>``` element is a line break, which is an example of an element without content
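
To make the nesting visible, here is a minimal sketch that walks through the sample page with `HTMLParser` from Python's standard library and prints each start tag indented by its depth. The `TagLister` class and its `VOID` set are made up for this illustration, and the set only lists a few common elements without content.

```python
from html.parser import HTMLParser

# The sample page shown above, stored as a string
html_doc = """<!DOCTYPE html>
<html>
  <body>
    <h1>Webpage about the Nationamuseum</h1>
    <h3>It's in Sweden.</h3>
    <p>For more informations:</p>
    <br>
    <p>Check wikipedia!</p>
  </body>
</html>"""


class TagLister(HTMLParser):
    """Record (depth, tag name) for every start tag in the document."""

    # A few elements that never have content (illustrative, not exhaustive)
    VOID = {'br', 'hr', 'img', 'input', 'link', 'meta'}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag))
        # Elements without content have no end tag, so they don't open a new level
        if tag not in self.VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.depth -= 1


lister = TagLister()
lister.feed(html_doc)

for depth, tag in lister.tags:
    print('  ' * depth + '<' + tag + '>')
```

The indentation in the printed result mirrors the nesting: `<body>` sits inside `<html>`, and the headings, paragraphs and the line break all sit inside `<body>`.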

## 4. How to web scrape

Now let's go to the fun part!

Going back to our movies database, you can see that there are some characters for which we're missing the character_name.
You can try to query the database to find out which characters these are, but in the meantime, we gathered them in file **../data/missing_character_names.csv**.

In [2]:
import pandas as pd
import requests
# Import some helper functions to print shorter outputs
import utils

from bs4 import BeautifulSoup

In [3]:
missing_character_names = pd.read_csv('data/missing_character_names.csv')
missing_character_names.head()

| Unnamed: 0 | id | movie_id | imdb_id | actor_id | name | character_name |
|---|---|---|---|---|---|---|
| 0 | 1073 | 718 | tt0116405 | 82957 | Dan Aykroyd | |
| 1 | 1218 | 17579 | tt0120240 | 105261 | Bonnie Hunt | |
| 2 | 1219 | 17579 | tt0120240 | 79974 | N'Bushe Wright | |
| 3 | 1220 | 17579 | tt0120240 | 55658 | Michael Rapaport | |
| 4 | 1221 | 17579 | tt0120240 | 57737 | Denis Leary | |


Can you think of a good way to get this missing data? [IMDb](https://www.imdb.com) seems a very good candidate!

The first thing to do is to open the web page that has the content we're interested in. The URLs of movie pages in IMDb follow a standard: they all start with https://www.imdb.com/title/ followed by the IMDb movie ID.

For instance, for the first movie with a missing character name, we can get the corresponding page using the URL https://www.imdb.com/title/tt0116405/. This is the page.

<img src="media/imdb_movie_page.png"/>
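
The URL pattern described above can be sketched as a small helper. The names `IMDB_TITLE_URL` and `build_movie_url` are made up for this illustration; the two IDs are taken from the `imdb_id` column shown earlier.

```python
# Common prefix of all IMDb movie pages, as described above
IMDB_TITLE_URL = 'https://www.imdb.com/title/'


def build_movie_url(imdb_id):
    """Return the movie page URL for an IMDb ID such as 'tt0116405'."""
    return IMDB_TITLE_URL + imdb_id + '/'


movie_urls = [build_movie_url(imdb_id) for imdb_id in ['tt0116405', 'tt0120240']]
for url in movie_urls:
    print(url)
```

With a helper like this, one URL can be built per distinct value in the `imdb_id` column, so each movie page only has to be requested once.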

Now, let's head to the cast section of the page, since this is what we'll be scraping.

<img src="media/imdb_cast.png"/>

In order to get the page's content, we'll use a GET request.

Then, we can get the content from the response, which will be... a bunch of incomprehensible HTML.

In [4]:
response = requests.get('https://www.imdb.com/title/tt0116405/')
# Printing a shortened output; to see everything, print response.content directly instead of calling utils.friendly_print_string
utils.friendly_print_string(response.content)

b'\n\n\n\n\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///title/tt0116405?src=mdot">\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      ue'


And here is where **Beautiful Soup** can help us. Beautiful Soup is a package for parsing HTML documents. You can check its documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

First, we need to create an instance of the BeautifulSoup class, passing it the HTML document to parse.

In [5]:
soup = BeautifulSoup(response.content, 'html.parser')

By calling the **prettify** method, we can see the HTML elements of the document in a pretty and indented way.

In [6]:
# Printing a shortened output; to see everything, print soup.prettify() directly instead of calling utils.friendly_print_string
utils.friendly_print_string(soup.prettify())

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="app-id=342792525, app-argument=imdb:///title/tt0116405?src=mdot" name="apple-itunes-app"/>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </s


By accessing the **children** property of the soup, we can break it into smaller elements. The idea is that each HTML element corresponds to a Beautiful Soup element.
Let's see an example.

This soup has 5 elements:
* 3 NavigableString elements, with a single \n
* a Doctype element, with the value 'html'
* a Tag element, with tag html

We're particularly interested in the Tag element, which is where the HTML content is.

In [16]:
# inspecting the elements in the soup
soup_children = list(soup.children)

# uncomment the next line to print the output, we didn't do it here because it's too long :/
# soup_children

In [19]:
# inspecting the types of the elements in the soup
[type(item) for item in soup_children]

[bs4.element.NavigableString,
 bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]
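
The same inspection can be tried on a much smaller document, where the full output fits on the screen. This is only a sketch: `small_doc` is a made-up page, and the exact number of whitespace strings depends on where the line breaks are in the document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A tiny made-up document: a doctype, a line break, and an html element
small_doc = "<!DOCTYPE html>\n<html><body><p>Hello</p></body></html>\n"
small_soup = BeautifulSoup(small_doc, 'html.parser')

# Names of the types of the top-level elements in the soup
child_types = [type(child).__name__ for child in small_soup.children]
print(child_types)

# Keep only the Tag elements, which is where the HTML content is
top_level_tags = [child for child in small_soup.children if isinstance(child, Tag)]
print([tag.name for tag in top_level_tags])
```

Filtering with `isinstance` is one way to skip the doctype and the whitespace strings without relying on their position in the list.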

To get the html tag element from the soup, we can just call it by its name.

In [9]:
type(soup.html)

# uncomment the next line to print the output, we didn't do it here because it's too long :/
#soup.html

bs4.element.Tag

In order to get to actual content, we can navigate through the tags until we reach a tag that has a value as content (instead of other elements).

In [23]:
soup.html.head.title

<title>Crime com Castigo (1996) - IMDb</title>

Finally, by calling the **get_text** method, we can get the content of the element as a string (here it's in Portuguese).

In [11]:
soup.html.head.title.get_text()

'Crime com Castigo (1996) - IMDb'
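
Here is a minimal sketch of the same navigation on a made-up document (`demo_doc` is not a real page). It also shows that asking for a tag that isn't there gives back `None` instead of raising an error, which is worth checking before calling `get_text`.

```python
from bs4 import BeautifulSoup

# A small made-up page with a title and one paragraph
demo_doc = "<html><head><title>Example page</title></head><body><p>Hi</p></body></html>"
demo_soup = BeautifulSoup(demo_doc, 'html.parser')

# Navigate through the tags by name, then extract the text
title_text = demo_soup.html.head.title.get_text()
print(title_text)

# There is no <h1> inside <head>, so this lookup returns None
missing_tag = demo_soup.html.head.h1
if missing_tag is None:
    print('No h1 element found in head')
```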

By now, you must be thinking that this is a somewhat complicated process, as it requires manually inspecting the HTML document and navigating through thousands of tags in order to find the interesting content in the middle of a big mess. And you're right :)

You'll now see a very easy way to access the interesting content directly.

First, you need to open the developer tools of your browser in the page to scrape.
In Google Chrome, you just have to right-click the page and select the "Inspect" option. 
For other browsers, google "How to open developer tools in *browser name*".

![title](media/dt_open.png)

The developer tools will open at the bottom of the window. Now, when you hover over something on the page with your mouse, you'll see the corresponding HTML element highlighted in the developer tools window.

![title](media/dt_actordiv.png)

Here we can see that all the actor and character names are inside an element with tag **div** and class **article**. The classes in HTML elements relate to the CSS styles of the web page, not to its content. However, they are useful for identifying the elements that we're trying to parse.

We can inspect even further and notice that one of this element's children has tag **table** and class **cast_list** - we're getting closer! 

![title](media/dt_castlist.png)

Now we can call the soup's **find_all** method to find this table (and make sure there's only one in this page).

In [12]:
cast_list = soup.find_all('table', class_="cast_list")
print("Number of elements found: ", len(cast_list))

cast_table = cast_list[0]
cast_table

Number of elements found:  1


<table class="cast_list">
<tr><td class="castlist_label" colspan="4">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000101/"><img alt="Dan Aykroyd" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BMTI2MDA3NTg0NF5BMl5BanBnXkFtZTYwMzM5ODgz._V1_UX32_CR0,0,32,44_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" title="Dan Aykroyd" width="32"/></a> </td>
<td>
<a href="/name/nm0000101/"> Dan Aykroyd
</a> </td>
<td class="ellipsis">
              ...
          </td>
<td class="character">
<a href="/title/tt0116405/characters/nm0000101">Jack Lambert</a>
</td>
</tr>
<tr class="even">
<td class="primary_photo">
<a href="/name/nm0005499/"><img alt="Lily Tomlin" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BMTYwNjc5NzA2NV5BMl5BanBnXkFtZTcwNjk1MTk3Mw@@._V1_UX32_CR0,0,32,44_AL_.jpg" src="https://m.med

Cool! Now we still need to clean up a bit more. Inside the table, we have several elements, but the ones that interest us are the ones with the **td** tag. By inspection, we can see that the actor names are inside the **td** elements that *don't* have a class!

![title](media/dt_select_actor.png)

We can select these elements in the following way:

In [13]:
actor_tags = cast_table.find_all('td', class_=False)

Finally, by calling the **get_text** method on each td element, we get the actor names.

In [25]:
actor_names = [actor.get_text().strip() for actor in actor_tags]
actor_names[:3]

['Dan Aykroyd', 'Lily Tomlin', 'Jack Lemmon']
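
To see what `class_=False` does in isolation, here is a small sketch on a single table row. The `sample_row` string is a simplified version of the first cast row printed above (the photo cell is replaced by a placeholder), so it is an assumption about the markup rather than the full page.

```python
from bs4 import BeautifulSoup

# One simplified cast row: three td elements with a class, one without
sample_row = """<table class="cast_list"><tr class="odd">
<td class="primary_photo">photo</td>
<td><a href="/name/nm0000101/"> Dan Aykroyd
</a> </td>
<td class="ellipsis">...</td>
<td class="character"><a href="/title/tt0116405/characters/nm0000101">Jack Lambert</a></td>
</tr></table>"""

row_soup = BeautifulSoup(sample_row, 'html.parser')

# class_=False keeps only the td elements that have no class attribute
tds_without_class = row_soup.find_all('td', class_=False)
# class_=True keeps only the td elements that have some class attribute
tds_with_class = row_soup.find_all('td', class_=True)

sample_actor_names = [td.get_text().strip() for td in tds_without_class]
print(sample_actor_names)
print(len(tds_with_class))
```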

Let's see another example, where we get the character names.

We need to open the developer tools again and inspect the element that contains the character name.

![title](media/dt_select_character.png)

This time it's simpler. We need to find the td tags with the **character** class, and get their content.

In [32]:
character_names = [c.get_text() for c in cast_table.find_all('td', class_='character')]

# this is to strip the \n and blank spaces in the strings
character_names = [' '.join(c.split()) for c in character_names]
character_names[:3]

['Jack Lambert', 'Inga Mueller', 'Max Mueller / Karl Luger']
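
The difference between `strip` and the `' '.join(c.split())` idiom used above can be seen on a made-up string (`raw_name` is hypothetical and only imitates the kind of whitespace found in the page):

```python
# A made-up string with line breaks and repeated spaces in the middle
raw_name = '\n Max Mueller /  \n            Karl Luger  \n'

# strip only removes whitespace at the two ends of the string
stripped_name = raw_name.strip()

# split breaks the string on any run of whitespace, join glues the words back with single spaces
normalized_name = ' '.join(raw_name.split())

print(repr(stripped_name))
print(repr(normalized_name))
```

This is why `strip` was enough for the actor names, while the character names, which can contain whitespace in the middle, need the `split` and `join` combination.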

Finally we have found Dan Aykroyd's character in movie tt0116405, which is Jack Lambert!
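
The two lists above are matched by position. A possible alternative, sketched below, is to go through the table row by row and pair the actor with the character found in the same row. The function name `cast_to_dict` is made up, and `sample_cast` is a simplified imitation of the cast table: the second row assumes the same structure as the first one (the printed output above is cut off before showing it), with placeholder links.

```python
from bs4 import BeautifulSoup

# Simplified imitation of the cast table; the second row's structure is an assumption
sample_cast = """<table class="cast_list">
<tr><td class="castlist_label" colspan="4">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">photo</td>
<td><a href="/name/nm0000101/"> Dan Aykroyd
</a> </td>
<td class="ellipsis">...</td>
<td class="character">
<a href="/title/tt0116405/characters/nm0000101">Jack Lambert</a>
</td>
</tr>
<tr class="even">
<td class="primary_photo">photo</td>
<td><a href="#"> Lily Tomlin
</a> </td>
<td class="ellipsis">...</td>
<td class="character">
<a href="#">Inga Mueller</a>
</td>
</tr>
</table>"""


def cast_to_dict(table):
    """Map each actor name to the character name found in the same table row."""
    cast = {}
    for row in table.find_all('tr'):
        actor_td = row.find('td', class_=False)
        character_td = row.find('td', class_='character')
        # Skip rows that don't describe a cast member, such as the label row
        if actor_td is None or character_td is None:
            continue
        actor = ' '.join(actor_td.get_text().split())
        character = ' '.join(character_td.get_text().split())
        cast[actor] = character
    return cast


sample_table = BeautifulSoup(sample_cast, 'html.parser').find('table', class_='cast_list')
sample_cast_dict = cast_to_dict(sample_table)
print(sample_cast_dict)
```

Pairing within a row avoids a mismatch if some row is missing one of the two cells, at the cost of a slightly longer function.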

And the best part is that it will only take a few minutes to get all the other missing character names. You're invited to do that as an exercise :)

## 5. Optional

### 5.1 Scraping and the Law

[This](https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/) is an interesting article about the subject, bottom line being: when scraping web pages, don't use a very high request rate, so that the owners of the website don't get angry.
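
One simple way to keep the request rate low is to pause between consecutive requests. The sketch below is only an illustration: `fetch_politely` and `fake_fetch` are made-up names, the URLs are placeholders, and the right delay depends on the website. In a real scraper, the `fetch` argument could be a function that calls `requests.get`.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, waiting delay_seconds between consecutive calls."""
    results = {}
    for position, url in enumerate(urls):
        # No need to wait before the very first request
        if position > 0:
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results


def fake_fetch(url):
    """Stand-in for a real request, so that the sketch runs without network access."""
    return 'content of ' + url


pages = fetch_politely(
    ['https://example.com/a', 'https://example.com/b'],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```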

### 5.2 Scraping and JavaScript

Sometimes, when scraping web pages, you'll need to navigate from one page to another, click buttons, or take other actions that enter the JavaScript domain. In such cases, Beautiful Soup is not enough to meet your needs. If you find yourself there, take a look at [Selenium](https://www.seleniumhq.org/).

### 5.3 Scraping tools

If you're really into the world of scraping, you can give [parsehub](https://www.parsehub.com/) or [portia](https://github.com/scrapinghub/portia) a try! For scraping using Python, [scrapy](https://scrapy.org/) is also a good choice.