In [1]:
from bs4 import BeautifulSoup
import requests

## Scrape the Web

So far we've used BeautifulSoup to parse our own HTML strings and files.  Now let's scrape Box Office Mojo.

First let's take a look at some source code for _The Big Lebowski_.

- Navigate to https://www.boxofficemojo.com/title/tt0118715/ in your browser, preferably Chrome
- Right click and select "Inspect"
- Alternatively, you can "View Page Source"

To retrieve the HTML for this webpage, we will use `requests`.

### `requests`

The `requests` library allows us to grab information from the web.  There are two common types of requests:
- `get` -- simply requests information, akin to putting a url in your browser
- `post` -- sends information to the website, for example, writing an email

We will be using `get` to retrieve a page's HTML.
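To make the difference between the two request types concrete, here is a minimal sketch that builds (but never sends) one request of each kind and inspects where the data ends up. The `example.com` URLs and the `q` and `subject` fields are placeholders invented for illustration, not real endpoints.

```python
import requests

# Build, but do not send, a GET request: params become the query string.
get_req = requests.Request(
    'GET',
    'https://example.com/search',      # placeholder URL
    params={'q': 'lebowski'},
).prepare()

# Build, but do not send, a POST request: data travels in the request body.
post_req = requests.Request(
    'POST',
    'https://example.com/send',        # placeholder URL
    data={'subject': 'hello'},
).prepare()

print(get_req.method, get_req.url, get_req.body)
print(post_req.method, post_req.url, post_req.body)
```

With `get`, everything you ask for is visible in the URL itself, which is why pasting a URL into a browser is a fair analogy.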

In [1]:
url = 'https://www.boxofficemojo.com/title/tt0118715/'

response = requests.get(url)

The response we got back is an object that gives us access to:
- `response.text` -- the returned HTML (if any)
- `response.json()` -- the returned JSON parsed into Python objects (if any), typical for APIs
- `response.status_code` -- a [code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) that tells you whether your request succeeded or an error occurred; 2XX indicates success, while 404 means not found
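As a rough illustration of how status codes group into classes, the sketch below uses the standard library's `http.HTTPStatus` instead of a live request. `describe_status` is a hypothetical helper name and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return (reason phrase, success flag) for an HTTP status code."""
    phrase = HTTPStatus(code).phrase   # human-readable name for the code
    is_success = 200 <= code < 300     # any 2XX code counts as success
    return phrase, is_success

for code in (200, 404, 500):
    print(code, describe_status(code))
```

A check like this is worth running before you parse `response.text`, since an error page is still HTML and will parse without complaint.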

In [None]:
response.status_code  #200 = success!

In [None]:
response.text[:1000]  #First 1000 characters of the HTML

In [None]:
page = response.text

### `BeautifulSoup` Basics

Now that we have the HTML, let's learn its structure by parsing with BeautifulSoup.

In [None]:
soup = BeautifulSoup(page, "lxml")

In [None]:
print(soup)

The `prettify` method turns the soup into a nicely formatted Unicode string with one tag on each line for readability.

In [None]:
print(soup.prettify())

**QUESTIONS**

> Select the first link on the page.

> Now select the LAST link on the page.  Can you get the text and the URL associated with this link?

Remember `find` gets only one match, but `find_all` retrieves all matches in a list.

In [None]:
for link in soup.find_all('a')[:5]:
    print(link, '\n')
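The difference between the two methods is easiest to see on a tiny document. The sketch below uses toy HTML invented for this illustration and the built-in `html.parser`, so it runs without a network connection; `soup_demo` is named differently to avoid overwriting the `soup` above.

```python
from bs4 import BeautifulSoup

# Toy HTML invented for this illustration
html = """
<ul>
  <li><a href="/one">First</a></li>
  <li><a href="/two">Second</a></li>
</ul>
"""
soup_demo = BeautifulSoup(html, "html.parser")

first = soup_demo.find('a')          # a single Tag, or None if nothing matches
all_links = soup_demo.find_all('a')  # a list-like collection of every match

print(first.text, first['href'])     # link text and its URL attribute
print(len(all_links))
print(soup_demo.find('table'))       # no <table> in the toy HTML
```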

You can also match only those elements with a specific `id` or `class`.  Here are all the elements labeled with the "mojo-navigation-tab" class.

In [None]:
for element in soup.find_all(class_='mojo-navigation-tab'):
    print(element.prettify())

It's important to remember `find` and `find_all` return BeautifulSoup elements. You can continue searching these elements, thus chaining commands together.
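Here is a minimal sketch of chaining on toy HTML (invented for this illustration, with made-up class names). The second search is scoped to the element returned by the first, so matching elements elsewhere in the document are ignored.

```python
from bs4 import BeautifulSoup

# Toy HTML invented for this illustration
html = """
<div class="summary">
  <span class="money">$10</span>
  <span class="money">$25</span>
</div>
<div class="other"><span class="money">$99</span></div>
"""
demo = BeautifulSoup(html, "html.parser")

summary = demo.find(class_='summary')                # first search returns a Tag
amounts = summary.find_all('span', class_='money')   # second search looks only inside it

print([a.text for a in amounts])
print(len(demo.find_all('span', class_='money')))    # unscoped search finds all three
```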

Basic earnings information can be found in the `div` with the "mojo-performance-summary-table" class.  Let's extract the domestic gross from this element.

<br>
<img src="images/biglebow_table.png" alt="Big Lebowski Table" style="width: 500px;"/>

In [None]:
print(soup.find(class_='mojo-performance-summary-table').prettify())

In [None]:
soup.find(class_='mojo-performance-summary-table').find_all('span', class_='money')

Text needs to be extracted from one element at a time.  To get the domestic gross:

In [None]:
soup.find(class_='mojo-performance-summary-table').find_all('span', class_='money')[0].text

You can also find using an `id`; remember id should be unique to just one element.

In [None]:
print(soup.find(id='tabs').prettify())

### Web Scraping Pipeline

Now that we have the basics, let's practice web scraping.  **The main goal of web scraping is to extract data by taking advantage of a site's consistent format.**  That is, the code you write for one page on a website can hopefully be used on multiple pages to gather more information automatically.
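One way to picture this is a loop that applies the same fetch-then-parse steps to every page. The sketch below is only an outline under stated assumptions: `title_url` and `scrape_titles` are hypothetical helpers, and the fetch and parse steps are stand-ins so the example runs offline (a real fetch would likely wrap `requests.get`).

```python
def title_url(title_id):
    """Build a title-page URL following the pattern of the page we inspected."""
    return f'https://www.boxofficemojo.com/title/{title_id}/'

def scrape_titles(title_ids, fetch, parse):
    """Apply the same fetch-then-parse steps to every title id."""
    results = []
    for title_id in title_ids:
        html = fetch(title_url(title_id))
        results.append(parse(html))
    return results

# Stand-ins so the sketch runs offline; the HTML below is made up.
fake_pages = {title_url('tt0118715'): '<title>Some Movie - Example Site</title>'}
fake_fetch = fake_pages.get
fake_parse = lambda html: html.split('<title>')[1].split(' - ')[0]

print(scrape_titles(['tt0118715'], fake_fetch, fake_parse))
```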

Let's create code to get the following information for the movies on Box Office Mojo:
- Movie title
- Domestic gross
- Runtime
- MPAA rating
- Release date

#### Movie Title

In [None]:
soup.find('title')

In [None]:
title_string = soup.find('title').text

title_string

In [None]:
title_string.split('-')

In [None]:
title = title_string.split('-')[0].strip()

title

#### Domestic Gross

As we saw previously, the domestic gross can be found in a `span` within the "mojo-performance-summary-table" `div`.

In [None]:
dtg = soup.find(class_='mojo-performance-summary-table').find_all('span', class_='money')[0].text
dtg

The remainder of the information lives in this neighboring `div`.

<img src="images/biglebow_info.png" alt="Big Lebowski Information" style="width: 500px;"/>

#### Runtime: `.findNext()`

Sometimes you can find the information you are looking for by using text matching.  But note this must be an exact match!

In [None]:
soup.find(text='Run')  #does not match

In [None]:
soup.find(text='Running Time')  

Alternatively, we could use [regular expressions](https://docs.python.org/3/library/re.html).

In [None]:
import re
runtime_regex = re.compile('Run')
soup.find(text=runtime_regex)

In [None]:
rt_string = soup.find(text=re.compile('Run'))
print(rt_string)

In [None]:
type(rt_string)

The string we found is a `NavigableString`, which is still a Beautiful Soup element. This means we can use it to navigate to the next element in the HTML, which is a `span` containing the actual runtime.

In [None]:
rt_string.findNext()

The `.findNext()` method can be incredibly useful when the information you want to find doesn't have an obvious tag, class, id, etc.
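The whole pattern (match a label by text, then step to the element after it) can be sketched on toy HTML. The markup below is invented for this illustration. The sketch uses `string=` and `find_next()`, which are the current spellings of the `text=` and `findNext()` names used in this notebook.

```python
import re
from bs4 import BeautifulSoup

# Toy HTML invented for this illustration
html = '<div><span>Running Time</span><span>1 hr 57 min</span></div>'
demo = BeautifulSoup(html, "html.parser")

print(demo.find(string='Run'))                # exact match fails on a partial string
label = demo.find(string=re.compile('Run'))   # a regex can match part of the string
print(repr(label))
print(label.find_next().text)                 # the tag that follows the label
```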

Let's clean this value up into usable data.

In [None]:
rt = rt_string.findNext().text
rt = rt.split()
minutes = int(rt[0])*60 + int(rt[2])
print(minutes)

#### MPAA Rating, Release Date

_**STEP 1:** Create function to grab values_ 

The text matching method can also help us get runtime, rating, and release date, so let's make a reusable function.

In [None]:
def get_movie_value(soup, field_name):
    
    '''Grab a value from Box Office Mojo HTML
    
    Takes a string attribute of a movie on the page and returns the text of
    the next element (the value for that attribute) or None if nothing is found.
    '''
    
    obj = soup.find(text=re.compile(field_name))
    
    if not obj: 
        return None
    
    # this works for most of the values
    next_element = obj.findNext()
    
    if next_element:
        return next_element.text 
    else:
        return None

In [None]:
# runtime
runtime = get_movie_value(soup,'Run')
print(runtime)

In [None]:
# rating
rating = get_movie_value(soup,'MPAA')
print(rating)

In [None]:
release_date = get_movie_value(soup,'Release Date')
print(release_date)

In [None]:
release_date = release_date.split('\n')[0]  # Select only the date
print(release_date)

_**STEP 2:** Create helper functions to parse strings into appropriate data types_

The returned values all need a bit of formatting before we can work with this data.  Here are a few helper functions.

In [None]:
import dateutil.parser

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

def runtime_to_minutes(runtimestring):
    runtime = runtimestring.split()
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (IndexError, ValueError):  # missing pieces or non-numeric text
        return None

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date
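The positional approach in `runtime_to_minutes` assumes the string always looks like `'1 hr 57 min'`. As an alternative sketch, a regular expression can tolerate a missing hours or minutes part. `parse_runtime` is a hypothetical helper, and the shorter formats (`'2 hr'`, `'45 min'`) are assumptions about what the site might return, not something verified here.

```python
import re

def parse_runtime(runtime_string):
    """Convert strings like '1 hr 57 min' to total minutes, or None if no match."""
    pattern = r'\s*(?:(\d+)\s*hr)?\s*(?:(\d+)\s*min)?\s*'
    match = re.fullmatch(pattern, runtime_string)
    # Reject strings that don't fit the pattern or contain neither part
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes

for example in ['1 hr 57 min', '2 hr', '45 min', 'unknown']:
    print(repr(example), parse_runtime(example))
```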

_**STEP 3:** Apply these conversions_

Let's get these values again and format them all in one pass. (Note: Rating is already correct as a string.)

In [None]:
raw_domestic_total_gross = dtg
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Running')
runtime = runtime_to_minutes(raw_runtime)

raw_release_date = get_movie_value(soup,'Release Date').split('\n')[0]
release_date = to_date(raw_release_date)

#### Put Results in Dictionary

Now that we have results for all five quantities, we can store them in a dictionary.

In [None]:
headers = ['movie title', 'domestic total gross',
           'runtime (mins)', 'rating', 'release date']

movie_data = []
movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                runtime,
                                rating, 
                                release_date]))

movie_data.append(movie_dict)
movie_data

**QUESTION**

> Why might we want to store these data in a dictionary?  Why did we put the dictionary in a list?

### Scraping Tables

Let's take a look at the [top G-rated movies](https://www.boxofficemojo.com/chart/mpaa_title_lifetime_gross/?by_mpaa=G) on Box Office Mojo.  How could we pull all the data from this main page?

First request the HTML and parse it with Beautiful Soup.

In [None]:
url = 'https://www.boxofficemojo.com/chart/mpaa_title_lifetime_gross/?by_mpaa=G'

response = requests.get(url)
page = response.text

soup = BeautifulSoup(page,"lxml")

Now find the main table; it's the only `table` on the page.

In [None]:
table = soup.find('table')
table

In [None]:
rows = table.find_all('tr')  # tr tag is for rows

Each row contains the information we want but requires more parsing.

In [None]:
rows[1]

Remember: you can chain methods together to look for information!

In [None]:
rows[1].find_all('td')[0].find('a')['href']

Now grab data for the first 5 movies with a loop.

In [None]:
movies = {}

for row in rows[1:6]:
    items = row.find_all('td')
    link = items[0].find('a')
    title, url = link.text, link['href']
    movies[title] = [url] + [i.text for i in items]
    
movies
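A variation on the loop above is to use the header row for column names and build one dictionary per movie, matching the list-of-dictionaries layout from earlier. The sketch below runs on a toy table invented for this illustration; the real chart has more columns, and its header cells are assumed (not verified here) to be `th` tags.

```python
from bs4 import BeautifulSoup

# Toy table invented for this illustration
html = """
<table>
  <tr><th>Title</th><th>Gross</th></tr>
  <tr><td><a href="/title/a/">Movie A</a></td><td>$100</td></tr>
  <tr><td><a href="/title/b/">Movie B</a></td><td>$50</td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")
demo_rows = demo.find('table').find_all('tr')

# The header row supplies the dictionary keys
column_names = [th.text for th in demo_rows[0].find_all('th')]

records = []
for row in demo_rows[1:]:
    cells = row.find_all('td')
    record = dict(zip(column_names, (cell.text for cell in cells)))
    record['url'] = cells[0].find('a')['href']   # link lives in the first cell
    records.append(record)

print(records)
```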

## Recap

- Beautiful Soup is a powerful HTML parser
- You can locate one element with `.find()` or all matching elements with `.find_all()`
- To select specific elements, you can filter by attributes like `class` or `id`
- You can also find items using text matching and `.findNext()`, `.findNextSibling()`, `.findChild()`, etc.

### Limitations
Beautiful Soup has its limitations though.  For example, we can't use Beautiful Soup if a page:
- Requires us to input a password
- Reveals information we want only when we interact with it
- Is generated dynamically (with JavaScript) rather than served as static HTML

For these situations we need a different tool, like **Selenium** -- coming soon!