# Web Scraping 1: BeautifulSoup

## Docs and installation

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

To install `BeautifulSoup` and the `requests` modules, run the following in your terminal:
```bash
pip install beautifulsoup4 requests
```

If you have errors, you may need to install the _parsers_:
```bash
pip install lxml html5lib
```

## What is it?

BeautifulSoup is an HTML parser, which means that it takes HTML (which is basically a plain text file) and interprets the structure of that file to help you navigate it easily.

It **doesn't** actually get the pages from the web. To do that, we usually use the requests library.

## HTML

HTML is the basic language used to create a webpage. It is a hierarchy of elements, each with properties. A typical format for an element is
```html
<tag-name attr1="value of attr1" attr2="value of attr2" .... attrN="value of attrN">
    Inner text of the tag
</tag-name>
```

We see several things here:
* `tag-name`: This tells us what sort of "thing" we are representing on a page. Common examples of tags:
  * `h1`, `h2`, ...., `h6`: headers
  * `a`: Anchors (i.e. links)
  * `p`: Paragraphs
  * `ol`: ordered lists
  * `ul`: unordered lists
  * `li`: list items
  * `div`: Division (or section) of a page
  * `img`: An image
  
  (Almost) every tag has a beginning (e.g. `<tag>`) and an end (`</tag>`).
* `attributes`: Special properties we want this tag to have. There are four really common examples:
  * `href`: Hyperlink reference. If you click on this element, where do you go?
  * `class`: Style information about an element. Many elements can have the same class
  * `id`: Unique identifier for this element
  * `style`: Extra styling information we want applied to just this element (doing this is bad practice, you should be using CSS instead)
  
For example, here is a heading tag
```html
<h3 style="color:red;" id='top_heading'>This is a heading</h3>
```
This element has:
* a tag `h3`
* two attributes, `style` and `id`
* inner text of `This is a heading`
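
As a quick check of this anatomy, the same heading can be handed to BeautifulSoup (introduced properly in the sections below) and asked for each part. This is only a minimal sketch; it assumes `beautifulsoup4` is installed and uses Python's built-in `html.parser` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

heading_html = '<h3 style="color:red;" id="top_heading">This is a heading</h3>'
heading = BeautifulSoup(heading_html, 'html.parser').find('h3')

print(heading.name)    # the tag name
print(heading.attrs)   # a dict holding every attribute
print(heading['id'])   # look up a single attribute by name
print(heading.text)    # the inner text
```

The three pieces listed above map onto `name`, `attrs`, and `text` respectively.
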

Markdown will actually render HTML as well. In the cell below, we create a Markdown cell and copy the HTML:

#### An example
(Copied HTML from above)
<h3 style="color:red;" id='top_heading'>This is a heading</h3>
(End HTML)

#### Self-closing
A few tags don't have inner HTML and are "self-closing". That is, there is no `</tag>`.

The most well-known one is the image tag:
```html
<img ....... />
```

For example, you can probably guess what this will do:
```html
<img width="300px" src="https://imgs.xkcd.com/comics/boyfriend.png" />
```

Let's put it in this cell and find out:
<img width="300px" src="https://imgs.xkcd.com/comics/boyfriend.png" />

## Example 1: Starting to scrape

Let's start scraping on some dummy HTML. By that, I mean we are just going to create some HTML as a string in this notebook.

In [None]:
my_html = """
<html>
<head>
</head>
<body>
<Div style="border: 1px solid">
There isn't much in this file, except a list of to-do items

<ul>
  <li>Feed the cat</li>
  <li>Wash the dished</li>
  <li>Make coffee</li>
  <li>Go to the store</li>
  <li>Write BeautifulSoup lecture</li>
</ul>
</div>
</body>
</html>
"""

In [None]:
# Let's see our "webpage" (i.e. the webpage our string would make)
from IPython.display import display, HTML
display(HTML(my_html))  # make sure Jupyter knows to display it as HTML

Let's say we wanted to get all five "todo" items into a list in Python so we could analyze them. This is a task for BeautifulSoup!

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(my_html, 'html5lib')

### Getting a list item: soup.find

If we just look at soup, we might be unimpressed:

In [None]:
# This doesn't look so impressive
soup

In [None]:
# But we can use soup.find to get a specific element:
soup.find('li')

In [None]:
# We can also just grab the inner text
soup.find('li').text

### Getting all the list items: soup.find_all

In [None]:
soup.find_all('li')

In [None]:
# Hey, that is pretty neat! What sort of things are in the list? Strings?
type(soup.find_all('li')[0])

In [None]:
# Let's make a list with just the to-do text (no <li> ...</li>)
todos = []
for element in soup.find_all('li'):
    todos.append(element.text)

print(todos)

In [None]:
# We did it! But we can use list comprehensions to be tidier:
todos = [element.text for element in soup.find_all('li')]
todos

## Example 2: A more complicated example

Have a look at the webpage in `first_webpage/page.html`. Can we grab all the links from the articles (i.e. the link about Starbucks and the link about Bitcoin)? We need a two-step process:
1. Get the HTML (i.e. read the file)
2. Parse it using soup

In [None]:
with open('first_webpage/page.html') as website:
    html2 = website.read()
soup = BeautifulSoup(html2, 'html5lib')

display(HTML(html2))

In [None]:
# Looking for links, so let's look for 'a' tags
soup.find_all('a')

Uh-oh! There are other links on the page as well: some are in the sidebar, and the last one is in the disclaimer. We want a way of getting just the article links.

Looking at the source in detail, we can see that all the articles are in a `div` with the class `article`. Let's start with that.

In [None]:
# Grab all the div with class "article" from the page
soup.find_all('div', class_='article')

Each element of this list is _also_ a soup object (a `Tag`), so it supports `find` and `find_all` too. So our plan will be:
- Grab the "article" divs
- For each article div, grab all the links

In [None]:
for article in soup.find_all('div', class_='article'):
    for link in article.find_all('a'):
        print(link)

We did it! 

We can do better. Let's get the text, and what it points to.
e.g.
> unicorn frappanccino --> http://starbucks.com/drinks/unicorn.html

In [None]:
for article in soup.find_all('div', class_='article'):
    for link in article.find_all('a'):
        print(f'{link.text:20s} --> {link.get("href")}')

## Example 3: Getting a list of all the presidents

Now we are going to get some information from the web, instead of just getting a local file.

The requests library is the standard way of grabbing information. The two most common types of requests are 
* `GET`: Doesn't change information, simply requests new info. When you put a URL into a browser, you are making a "get" request
* `POST`: Sends info to the website (e.g. when writing an email in Gmail)

A get request has the form:
`requests.get(url, params = {})`
where `params` are arguments placed at the end of the URL e.g. 
```
requests.get("http://google.com/search", {'search': 'metis'})
```
returns the same website as `http://google.com/search?search=metis`
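
To see what `params` adds to the URL without making a request, the query string can be built by hand with the standard library. This sketch only mirrors the encoding step; `requests` does the equivalent work internally when it is given `params`.

```python
from urllib.parse import urlencode

base_url = 'http://google.com/search'
params = {'search': 'metis'}

# Join the base URL and the encoded parameters with a "?"
full_url = f'{base_url}?{urlencode(params)}'
print(full_url)
```

With more than one parameter, `urlencode` joins the pairs with `&`.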

The object returned is typically called `response`, and has the following attributes and methods:
* `response.text`: The HTML (if any) that is returned.
* `response.json()`: The JSON returned (if any), parsed into Python objects. This is typically used by APIs.
* `response.status_code`: A code that tells you if your request was successful, or the type of error that occurred. A 200 means "success" (and 2XX is generally successful). A 404 means "not found". There are a lot of status codes to help you debug what went wrong.
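
Status codes can be turned into readable text with the standard library's `http.HTTPStatus`. The helper name `describe_status` below is hypothetical, and the sketch assumes the code is one that `HTTPStatus` knows about (an unknown code raises `ValueError`).

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # e.g. "OK" or "Not Found"
    outcome = 'success' if 200 <= code < 300 else 'not a success'
    return f'{code} {phrase} ({outcome})'

print(describe_status(200))
print(describe_status(404))
```

In a scraper, the same 2XX check on `response.status_code` is a reasonable guard before handing `response.text` to BeautifulSoup.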

In [None]:
import requests

response = requests.get('https://www.britannica.com/place/United-States/Presidents-of-the-United-States')

print(response.status_code)

In [None]:
# Let's look at the page
display(HTML(response.text))

In [None]:
soup = BeautifulSoup(response.text, 'html5lib')

table = soup.find('table')

president_info = []
for president_row in table.find_all('tr')[3:]:
    # ignore column 0 with the picture
    desired_info = [e.text for e in president_row.find_all('td')][1:]
    president_info.append(desired_info)
president_info

In [None]:
import pandas as pd
pd.DataFrame(president_info, columns=['Name', 'Birthplace', 'Party', 'Term'])

## Example 4: Getting info from the box office

We are going to use boxofficemojo in order to get information from the web.



In [None]:
# if needed: pip install requests

url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm'

response = requests.get(url)

For information on HTTP status codes, see:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [None]:
response.status_code

In [None]:
print(response.text)

In [None]:
page = response.text

In [None]:
# if needed: pip install beautifulsoup4
## pip install --upgrade bs4

from bs4 import BeautifulSoup

soup = BeautifulSoup(page,"lxml")

In [None]:
print(soup)

In [None]:
print(soup.prettify())

## `soup.find()`

`soup.find()` and its partner, `find_all`, are the most common functions we will use from this package. 

The syntax is
```
# Finds the FIRST tagname with attr1 equal to value1 AND attr2 equal to value2
soup.find('tagname', attr1='value1', attr2='value2', ...)
```

Sometimes attributes have names that are not legal Python (e.g. `data-value="23"`). We can use dictionary notation instead:
```
soup.find('tagname', {'attr1': 'value1', 'attr2':'value2', ....})
```
Because `class` is a reserved word in Python, it cannot be used as a keyword argument. It is so common that BeautifulSoup provides a special keyword for it, namely `class_`.
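
The three calling styles can be compared side by side on a small made-up document. This is a sketch under the assumption that `html.parser` is good enough for the fragment; the HTML and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

sample_html = """
<p class="note" id="first">plain note</p>
<p class="note" data-value="23">note with data</p>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Keyword arguments: the first <p> whose id is "first"
by_keyword = sample_soup.find('p', id='first')

# Dictionary notation: needed because data-value is not a valid Python name
by_dict = sample_soup.find('p', {'data-value': '23'})

# class_ stands in for the reserved word class
by_class = sample_soup.find_all('p', class_='note')

print(by_keyword.text, '|', by_dict.text, '|', len(by_class))
```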

Let's try out some common variations of `soup.find()`

In [None]:
# soup.find() returns the first matched tag it finds.
# It searches the entire tree.

# Search for a type of tag by using the tag as a string
# (like 'body','div','p','a') as an argument.

print(soup.find('a'))

In [None]:
# Equivalently:
print(soup.a)

In [None]:
# Prettier:
print(soup.a.prettify())

In [None]:
# soup.find_all() returns a list of all matches

for link in soup.find_all('a'): 
    print(link)

In [None]:
# retrieve the url from an anchor tag
soup.find('a')['href']

In [None]:
# You can match on an attribute like an id or class.
# Take a look at what the 'mp_box_content' classes
# look like on the webpage, with Inspect Element.

for element in soup.find_all(class_='mp_box_content'):
    print(element, '\n')

In [None]:
# We can find all the columns in the first mp_box_content table
# by "chaining" `find` and `find_all`.

print(soup.find(class_='mp_box_content').find_all('td'))

In [None]:
# To extract just the value of interest:

soup.find(class_='mp_box_content').find_all('td')[1].text

Be careful with non-printing characters!
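
A small sketch of what this warning means in practice: table cells often carry newlines, indentation, or a non-breaking space (`&nbsp;`) around the value. The cell below is made up, and its number is a placeholder rather than a value from the site.

```python
from bs4 import BeautifulSoup

cell_html = '<td>\n   $1,234,567&nbsp;\n</td>'  # made-up cell
cell = BeautifulSoup(cell_html, 'html.parser').find('td')

raw = cell.text                    # keeps the newlines and the non-breaking space
clean = cell.get_text(strip=True)  # strips whitespace from both ends

print(repr(raw))
print(repr(clean))
```

Printing with `repr` makes the hidden characters visible, which helps when a later `int(...)` conversion fails for no obvious reason.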

In [None]:
# find with an "id". (ID is unique.)

print(soup.find(id='hp_footer'))

## Consistency
Web scraping is made simple by the consistent format of information among like pages of a website. 

### Items to scrape for each movie:
* movie title
* total domestic gross
* release date
* runtime
* rating


In [None]:
# Movie Title

print(soup.find('title'))

In [None]:
title_string = soup.find('title').text
print(title_string)

In [None]:
print(title_string.split('('))

In [None]:
title = title_string.split('(')[0].strip()
print(title)

In [None]:
# Domestic Total Gross

# text= does an exact match search, so a partial string will not match!
print(soup.find(text="Domestic Total Gross"))

In [None]:
# You could find a perfect match:

print(soup.find(text="Domestic Total Gross: "))

In [None]:
# You could also use [regular expressions](https://xkcd.com/208/).

import re
domestic_total_regex = re.compile('Domestic Total')
soup.find(text=domestic_total_regex)

In [None]:
dtg_string = soup.find(text=re.compile('Domestic Total'))
print(dtg_string)

In [None]:
print(dtg_string.find_next_sibling())

In [None]:
dtg = dtg_string.find_next_sibling().text
dtg = dtg.replace('$','').replace(',','')
domestic_total_gross = int(dtg)
print(domestic_total_gross)

### We can actually do several of these using the text matching method, so let's make a function for that

In [None]:
def get_movie_value(soup, field_name):
    '''Grab a value from boxofficemojo HTML
    
    Takes a string attribute of a movie on the page and
    returns the string in the next sibling object
    (the value for that attribute)
    or None if nothing is found.
    '''
    obj = soup.find(text=re.compile(field_name))
    if not obj: 
        return None
    # this works for most of the values
    next_sibling = obj.find_next_sibling()
    if next_sibling:
        return next_sibling.text 
    else:
        return None
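
Since the live page may change, the same label-then-sibling idea can be exercised on a hypothetical fragment. Everything here is an assumption for illustration: the HTML layout (a text label followed by a `<b>` tag), the placeholder values, and the helper name `value_after_label`. It uses `string=`, the current name of the `text=` argument in recent BeautifulSoup versions.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with placeholder values: each label is followed
# by a <b> tag that holds the value
sample_html = """
<table><tr>
  <td>Domestic Total Gross: <b>$1,234,567</b></td>
  <td>Runtime: <b>1 hrs. 57 min.</b></td>
</tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

def value_after_label(soup, label):
    """Return the text of the tag right after the matching label, or None."""
    label_node = soup.find(string=re.compile(label))
    if label_node is None:
        return None
    value_tag = label_node.find_next_sibling()
    return value_tag.text if value_tag else None

print(value_after_label(sample_soup, 'Domestic Total'))
print(value_after_label(sample_soup, 'Runtime'))
print(value_after_label(sample_soup, 'Budget'))  # label not present
```

Returning `None` for a missing label lets the caller decide how to handle pages that lack a field.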

In [None]:
# domestic total gross
dtg = get_movie_value(soup,'Domestic Total')
print(dtg)

In [None]:
# runtime
runtime = get_movie_value(soup,'Runtime')
print(runtime)

In [None]:
# rating
rating = get_movie_value(soup,'MPAA Rating')
print(rating)

In [None]:
# release date (stored under its own name so it doesn't overwrite `rating`)
release_date = get_movie_value(soup,'Release Date')
print(release_date)

### We need a few helper methods to parse the strings we've gotten

In [None]:
import dateutil.parser

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

def runtime_to_minutes(runtimestring):
    runtime = runtimestring.split()
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (IndexError, ValueError):
        return None
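
A quick way to check the string helpers is to run them on sample strings. The two functions are repeated in compact form so the snippet runs on its own; the inputs are placeholders, and the runtime format `"<hours> hrs. <minutes> min."` is an assumption inferred from the indexing in the helper.

```python
def money_to_int(moneystring):
    # Same logic as the helper above, repeated so this snippet runs alone
    return int(moneystring.replace('$', '').replace(',', ''))

def runtime_to_minutes(runtimestring):
    # Expects the form "<hours> hrs. <minutes> min."
    parts = runtimestring.split()
    try:
        return int(parts[0]) * 60 + int(parts[2])
    except (IndexError, ValueError):
        return None

# Placeholder inputs, not values scraped from the site
print(money_to_int('$1,234,567'))
print(runtime_to_minutes('1 hrs. 57 min.'))
print(runtime_to_minutes('N/A'))  # unparseable input gives None
```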

In [None]:
# Let's get these again and format them all in one swoop

from pprint import pprint

raw_release_date = get_movie_value(soup,'Release Date')
release_date = to_date(raw_release_date)

raw_domestic_total_gross = get_movie_value(soup,'Domestic Total')
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Runtime')
runtime = runtime_to_minutes(raw_runtime)

headers = ['movie title', 'domestic total gross',
           'release date', 'runtime (mins)', 'rating']

movie_data = []
movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                release_date,
                                runtime,
                                rating]))
movie_data.append(movie_dict)

pprint(movie_data)

### What about scraping tables? 

In [None]:
url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

response = requests.get(url)
page = response.text

soup = BeautifulSoup(page, "lxml")


In [None]:
tables = soup.find_all("table")
rows = tables[3].find_all('tr')

# Skip the header row and look at just the next 19 rows for now
rows = rows[1:20]

movies = {}
for row in rows:
    items = row.find_all('td')
    # Key each movie by the link to its page
    link = items[1].find('a')['href']
    movies[link] = [i.text for i in items[1:]]

list(movies.items())[1]
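
The same row-by-row pattern can be practised offline on a made-up table. In this sketch the titles, numbers, and links are placeholders, and it assumes the first row holds the column names, which may not be true of every table on a real site.

```python
from bs4 import BeautifulSoup

# Made-up table; the titles and numbers are placeholders
sample_html = """
<table>
  <tr><th>Rank</th><th>Title</th><th>Gross</th></tr>
  <tr><td>1</td><td><a href="/movies/?id=a.htm">Movie A</a></td><td>$300</td></tr>
  <tr><td>2</td><td><a href="/movies/?id=b.htm">Movie B</a></td><td>$200</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_soup.find('table').find_all('tr')

# The first row holds the column names; the rest hold data
column_names = [th.text for th in sample_rows[0].find_all('th')]
records = []
for row in sample_rows[1:]:
    cells = row.find_all('td')
    record = dict(zip(column_names, [cell.text for cell in cells]))
    record['Link'] = cells[1].find('a')['href']  # keep the movie's URL too
    records.append(record)

print(records[0])
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame` if a table is wanted.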

## Parting thoughts (optional - self review rather than class)

BeautifulSoup is very powerful. There are a few other methods and hints that are useful.

One of the biggest hints is to use strings to make "fake documents" to test your scraping. 

One common problem you will encounter that we haven't discussed is how to get a neighboring element. We can use the "test document" method to practice extracting it. It is pretty common in tables to see things like
```html
...
<tr>
    <th class="num_sales">Number of Sales</th>
    <td>405</td>
</tr>
...
```
We can easily find "num_sales" using the class, but we really want the (generic) `td` tag containing the data (405). 

Some terminology:
* The 'th' and 'td' tags are contained inside the 'tr' tag. The 'th' and 'td' tags are referred to as "children" of the 'tr' tag (and the 'tr' tag is the parent of the 'td' and 'th' tags).
* Continuing the family tree analogy, 'th' and 'td' are called siblings.

Looking at the BeautifulSoup documentation, we see there are attributes `next_sibling` and `previous_sibling`. Let's write a test document and see if we can get the data we want.

In [None]:
from bs4 import BeautifulSoup

# We will add another row to make sure we are not just grabbing the first row in the table
test_html = '''
<table>
  <tr>
    <th class="product_name">Product name</th>
    <td>Peanut butter</td>
  </tr>
  <tr>
    <th class="num_sales">Number of Sales</th>
    <td>405</td>
  </tr>
</table>
'''

soup = BeautifulSoup(test_html, 'html5lib')

easy_to_find = soup.find('th', class_='num_sales')
easy_to_find

In [None]:
easy_to_find.next_sibling

In [None]:
# Weird: that was just the whitespace between the tags. Let's try the next, next sibling
easy_to_find.next_sibling.next_sibling

In [None]:
# found it!
easy_to_find.next_sibling.next_sibling.text
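
The first `next_sibling` is the whitespace (newline and indentation) between the two tags, which BeautifulSoup stores as a text node. As an alternative to chaining `next_sibling` twice, `find_next_sibling` skips text nodes and returns the next matching tag. A minimal sketch on the same kind of test document, using `html.parser` and made-up variable names:

```python
from bs4 import BeautifulSoup

row_html = """
<table>
  <tr>
    <th class="num_sales">Number of Sales</th>
    <td>405</td>
  </tr>
</table>
"""
row_soup = BeautifulSoup(row_html, 'html.parser')
label = row_soup.find('th', class_='num_sales')

print(repr(label.next_sibling))            # only whitespace between the tags
print(label.find_next_sibling('td').text)  # the data we wanted
```

This version keeps working if the amount of whitespace between the tags changes.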