## Web Scraping 101: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

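Web scraping means extracting data from web pages with code.

Web scraping is tractable because pages on a given site tend to present information in a **consistent format**, so the same extraction code works across many pages.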

## HTML Refresher

### Overview
* HTML is the basic language used to create a web page. 
* It tells the web browser what text/media to display, where to display it, and how to display it (style).
* HTML is very **structured/hierarchical**.
* Every page is made up of discrete "elements."

### Tags

* Elements are labeled with "tags."

* For example:

    ```html
    <p>You are beginning to learn HTML.</p>
    ```

### Attributes

* A start tag also often contains "attributes" with info about the element.

* Attributes usually have a name and value.

* Example:

```html
<p class="my_red_sentences">You are beginning to learn HTML.</p>
```

### Structure

A full HTML document has a structure more like this:

```html
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="https://www.google.com"> Some link </a>
  </body>
</html>
```
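As a minimal sketch of how this nesting becomes a tree, the snippet below parses a trimmed copy of the document with Beautiful Soup (introduced later in this section). It assumes `beautifulsoup4` is installed and uses Python's built-in `html.parser`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head></head>
  <body>
    <p class="red">You are beginning to learn HTML.</p>
    <h1>This is a header</h1>
    <a href="https://www.google.com">Some link</a>
  </body>
</html>
"""

soup_demo = BeautifulSoup(html_doc, "html.parser")

# Names of the tags that are direct children of <body>, in document order
child_tags = [child.name for child in soup_demo.body.find_all(recursive=False)]
print(child_tags)

# A tag's attributes behave like a dictionary; 'class' can hold several values,
# so Beautiful Soup stores it as a list
paragraph = soup_demo.body.p
print(paragraph.attrs)

# Every tag knows its parent, which is what makes the document a tree
print(paragraph.parent.name)
```

The `<body>` element is the parent of `<p>`, `<h1>`, and `<a>`, which mirrors the indentation in the HTML above.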

### Explore in Browser

* Let's explore some live HTML!
* Go to http://boxofficemojo.com/movies/?id=biglebowski.htm in your browser, preferably Chrome.
* Right-click the page and choose Inspect Element, then also try View Page Source.

## HTML to BeautifulSoup

### Request data for The Big Lebowski

Scrape some information about [The Big Lebowski](http://boxofficemojo.com/movies/?id=biglebowski.htm).

In [1]:
from __future__ import print_function, division

In [2]:
# if needed: pip install requests or conda install requests
import requests

requests.__path__

['C:\\Users\\xianj\\Anaconda3\\lib\\site-packages\\requests']

In [3]:
url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm'

response = requests.get(url)

### Check the Status

For information on HTTP status codes, see:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [7]:
response.status_code # status code = 200 => OK

200
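Rather than eyeballing the status code, a script can fail fast when a request does not succeed. The sketch below is one way to do that with `raise_for_status()`; the helper name `ensure_ok` is hypothetical, and the `Response` objects are built by hand (an assumption made so the example runs without network access) instead of coming from `requests.get`.

```python
import requests


def ensure_ok(response):
    """Return the response unchanged, or raise requests.HTTPError on a 4xx/5xx status."""
    response.raise_for_status()
    return response


# Hand-built responses stand in for real ones so the example runs offline
fake_ok = requests.Response()
fake_ok.status_code = 200

fake_not_found = requests.Response()
fake_not_found.status_code = 404

print(ensure_ok(fake_ok).status_code)

try:
    ensure_ok(fake_not_found)
    outcome = "ok"
except requests.HTTPError:
    outcome = "failed"
print(outcome)
```

In a real scraper the call would look like `ensure_ok(requests.get(url))`, so a missing page stops the run instead of being parsed as if it were valid data.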

### Look at the Text

In [9]:
# .text is an attribute (not a method) that holds the response body as a string
print(response.text) # This is the raw HTML as plain text; nothing has parsed its structure yet.

# Raw HTML like this is hard to work with directly,
# so the next step is to PROCESS (parse) the html.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1">
<title>The Big Lebowski (1998) - Box Office Mojo</title>

<style type="text/css">
table.chart-wide { width: 100%; }
</style>
<META name="keywords" content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, release summary, similar movies, box office mojo">
<META name="description" content="The Big Lebowski summary of box office results, charts and release information and related links.">

<link

### Soupify the Text

In [11]:
# We take out the text and save it into a variable called page
page = response.text

lxml is a library for processing XML and HTML in Python. Here it serves as the parser that Beautiful Soup uses to **parse the raw text** into a navigable tree of tags.

In [13]:
# if needed: pip install beautifulsoup4 lxml or conda install beautifulsoup4 lxml
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "lxml") 
# BeautifulSoup parses the text with lxml. The resulting soup object knows the
# structure of the page (tags, attributes, nesting), which the raw string did not.

In [14]:
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
<title>The Big Lebowski (1998) - Box Office Mojo</title>
<style type="text/css">
table.chart-wide { width: 100%; }
</style>
<meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, release summary, similar movies, box office mojo" name="keywords"/>
<meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="description"/>
<link cha

### Prettify the Soup

A webpage can be thought of as a tree of elements: there is the 'body', which contains a few 'divs', and each of those 'divs' can in turn contain more 'divs' and other elements. A Soup object contains this tree. The **prettify() method will turn a Beautiful Soup tree into a nicely formatted Unicode string**, with each HTML/XML tag on its own line.

In [15]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
  <title>
   The Big Lebowski (1998) - Box Office Mojo
  </title>
  <style type="text/css">
   table.chart-wide { width: 100%; }
  </style>
  <meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, release summary, similar movies, box office mojo" name="keywords"/>
  <meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="d
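The real page is long, so the effect of parsing and prettifying is easier to see on a tiny input. This is a small sketch on a made-up one-line HTML string; it uses the built-in `html.parser` instead of `lxml` (an assumption made to avoid the extra dependency), which is enough for a fragment like this.

```python
from bs4 import BeautifulSoup

raw = '<div id="outer"><p>Hello</p></div>'   # everything on one line
tiny_soup = BeautifulSoup(raw, "html.parser")

pretty = tiny_soup.prettify()
print(pretty)   # one tag or string per line, indented by nesting depth

# Collect the non-empty lines without their indentation
lines = [line.strip() for line in pretty.splitlines() if line.strip()]
print(lines)
```

The indentation in the printed output reflects the nesting depth of each element in the tree.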

## Beautiful Soup - Find & Find_All

### `soup.find()`

* `soup.find()` is the **most common method** we will use from this package.  
* Let's try out some common variations of `soup.find()`

* `soup.find()` returns the first matched tag it finds.
* It searches the entire tree.

* Search for a type of tag by using the tag as a string argument ('body','div','p','a')

In [17]:
soup.find('a') # the "a" (anchor) tag defines a hyperlink

# This returns the FIRST 'a' tag found in the HTML

<a href="/daily/chart/">Daily Box Office (Sun.)</a>

In [18]:
# Equivalently: (another way to find)
soup.a

<a href="/daily/chart/">Daily Box Office (Sun.)</a>

In [20]:
# Prettier:
print(soup.a.prettify())

# If you prettify it, you can also see there is some hierarchy in this as well

<a href="/daily/chart/">
 Daily Box Office (Sun.)
</a>



**Here's how you can find the next one.**

In [21]:
soup.find('a').find_next_sibling()  # next sibling tag of the first 'a' (here it happens to be another 'a'); findNextSibling is the older spelling

<a href="/weekend/chart/">Weekend Box Office (Oct. 11–13)</a>
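One detail worth seeing on a small example: `find_next_sibling()` returns the next sibling tag of any type, not necessarily another `a`. The sketch below uses a made-up fragment (the `span` is an assumption added to show the difference) and passes a tag name to restrict the search.

```python
from bs4 import BeautifulSoup

nav_html = (
    '<div id="top_links">'
    '<a href="/daily/chart/">Daily</a> | '
    '<span>separator</span>'
    '<a href="/weekend/chart/">Weekend</a>'
    '</div>'
)
nav = BeautifulSoup(nav_html, "html.parser")

first_link = nav.find("a")          # same tag as nav.a
next_any = first_link.find_next_sibling()      # next tag of any type
next_a = first_link.find_next_sibling("a")     # next sibling that is an <a>

print(next_any.name)
print(next_a["href"])
```

Text between tags (the `|` here) is skipped, because `find_next_sibling()` only considers tags unless a `string` filter is given.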

### `soup.find_all()` 

`soup.find_all()` returns a list of all matches. (This is usually the one you want.)

In [29]:
print(soup.find_all('a')[:10]) # soup.find_all('element') returns you a list

[<a href="/daily/chart/">Daily Box Office (Sun.)</a>, <a href="/weekend/chart/">Weekend Box Office (Oct. 11–13)</a>, <a href="/movies/?id=joker2019.htm">#1 Movie: 'Joker (2019)'</a>, <a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a>, <a href="//bs.serving-sys.com/Serving/adServer.bs?cn=brd&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" target="_blank">
<img border="0" height="90" src="//bs.serving-sys.com/Serving/adServer.bs?c=8&amp;cn=display&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" width="728"/>
</a>, <a href="/"><img alt="Box Office Mojo" height="56" src="/img/misc/bom_logo1.png" width="245"/></a>, <a href="http://pro.imdb.com/signup/index.html?rf=mojo_nb_hm&amp;ref_=mojo_nb_hm" target="_blank">
<img alt="Get industry info at IMDbPro" height="20" src="/images/IMDbPro.png"/>
</a>, <a href="http://twitter.com/boxofficemojo" target="_blank">
<img alt="Follow us on Twitter" height="18" src="/images/glyphicons-social-32-twitter@2x.png"/>
</a>, <a href="http://faceb

In [22]:
len(soup.find_all('a'))

100

In [23]:
for link in soup.find_all('a'): 
    print(link)

<a href="/daily/chart/">Daily Box Office (Sun.)</a>
<a href="/weekend/chart/">Weekend Box Office (Oct. 11–13)</a>
<a href="/movies/?id=joker2019.htm">#1 Movie: 'Joker (2019)'</a>
<a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a>
<a href="//bs.serving-sys.com/Serving/adServer.bs?cn=brd&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" target="_blank">
<img border="0" height="90" src="//bs.serving-sys.com/Serving/adServer.bs?c=8&amp;cn=display&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" width="728"/>
</a>
<a href="/"><img alt="Box Office Mojo" height="56" src="/img/misc/bom_logo1.png" width="245"/></a>
<a href="http://pro.imdb.com/signup/index.html?rf=mojo_nb_hm&amp;ref_=mojo_nb_hm" target="_blank">
<img alt="Get industry info at IMDbPro" height="20" src="/images/IMDbPro.png"/>
</a>
<a href="http://twitter.com/boxofficemojo" target="_blank">
<img alt="Follow us on Twitter" height="18" src="/images/glyphicons-social-32-twitter@2x.png"/>
</a>
<a href="http://facebook.com/b

In [24]:
# A list comprehension that keeps only the links whose HTML contains 'joelcoen'

[link for link in soup.find_all('a') if 'joelcoen' in str(link)]   

[<a href="/people/chart/?view=Director&amp;id=joelcoen.htm">Joel Coen</a>,
 <a href="/people/chart/?view=Writer&amp;id=joelcoen.htm">Joel Coen</a>]
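Filtering on `str(link)` works, but it searches the whole tag, including its text. A slightly stricter variation, sketched here on a made-up fragment, filters on the `href` attribute itself; passing `href=True` to `find_all` keeps only anchors that actually have that attribute. The Jeff Bridges link and the name-only anchor are illustrative assumptions.

```python
from bs4 import BeautifulSoup

links_html = """
<a href="/people/chart/?view=Director&amp;id=joelcoen.htm">Joel Coen</a>
<a href="/people/chart/?view=Actor&amp;id=jeffbridges.htm">Jeff Bridges</a>
<a name="no-href">Anchor without a link</a>
"""
links_soup = BeautifulSoup(links_html, "html.parser")

# href=True matches only <a> tags that carry an href attribute
anchors = links_soup.find_all("a", href=True)

hrefs = [a["href"] for a in anchors]
coen_names = [a.text for a in anchors if "joelcoen" in a["href"]]

print(hrefs)
print(coen_names)
```

Without `href=True`, indexing `a["href"]` on the third anchor would raise a `KeyError`.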

## Beautiful Soup - More on Find

### `href` Example

In [30]:
# retrieve the url from an anchor tag
soup.find('a')['href'] # index a tag like a dictionary to read one of its attributes

'/daily/chart/'

### `id` and `class` examples

* An attribute like id or class can be matched
* Example: 'mp_box_content' classes

In [32]:
# Finding a specific id, as in <div id="something"> </div>
soup.find_all(id='top_links')

[<div id="top_links">
 <div style="float: left"><a href="/daily/chart/">Daily Box Office (Sun.)</a> | <a href="/weekend/chart/">Weekend Box Office (Oct. 11–13)</a> | <a href="/movies/?id=joker2019.htm">#1 Movie: 'Joker (2019)'</a> | <a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a></div>
 <div style="float: right">Updated 10/13/2019 8:31 A.M. Pacific Time</div>
 <div style="clear:both; height: 0px"></div>
 </div>]

In [33]:
# Finding a specific class
# Take note it's class_  (with an underscore)

# This is how you extract everything from a class
for element in soup.find_all(class_='mp_box_content'):
    print(element, '\n')

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$18,034,458</b></td>
<td align="right" width="25%">   <b>38.6%</b></td>
</tr>
<tr>
<td width="40%">+ Foreign:</td>
<td align="right" width="35%"> $28,690,764</td>
<td align="right" width="25%">   61.4%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$46,725,222</b></td>
<td width="25%"> </td>
</tr>
</table>
</div> 

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td align="center"><a href="/weekend/chart/?yr=1998&amp;wknd=10&amp;p=.htm">Opening Weekend:</a></td><td> $5,533,844</td></tr>
<tr>
<td align="center" colspan="2"><font size="2">(#6 rank, 1,207 theaters, $4,585 average)</font></td></tr>
<tr>
<td align="right">% of Total Gross:</td><td> 31.7%</td></tr>
<tr><td align="right" colspan="2"><font fac
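The same `id` and `class` lookups can be tried on a tiny fragment, which makes the matching rules easier to see. This is a sketch on made-up HTML (the `highlight` class is an assumption added to show multi-class matching); it also shows the `attrs` dictionary form as an alternative to `class_`.

```python
from bs4 import BeautifulSoup

boxes_html = """
<div id="top_links">links</div>
<div class="mp_box_content highlight">first box</div>
<div class="mp_box_content">second box</div>
"""
boxes = BeautifulSoup(boxes_html, "html.parser")

by_id = boxes.find(id="top_links")

# class_ has a trailing underscore because 'class' is a reserved word in Python.
# A tag matches if ANY of its classes equals the given name.
by_class = boxes.find_all(class_="mp_box_content")

# Equivalent lookup using an attrs dictionary
by_attrs = boxes.find_all("div", attrs={"class": "mp_box_content"})

print(by_id.text)
print([div.text for div in by_class])
```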

You can also chain commands together

## Beautiful Soup - Chaining Finds

All the fields in mp_box_content can be found by "chaining" a few `find_all` functions.

In [34]:
# 'td' is for a cell in an HTML table
chain = [x.find_all('td') for x in soup.find_all(class_='mp_box_content')]

# In this whole mp_box_content, I want to find all the td

In [35]:
# for the first mp_box_content find all td's
chain[0]    # Now we are closer to getting what we want

[<td width="40%"><b>Domestic:</b></td>,
 <td align="right" width="35%"> <b>$18,034,458</b></td>,
 <td align="right" width="25%">   <b>38.6%</b></td>,
 <td width="40%">+ Foreign:</td>,
 <td align="right" width="35%"> $28,690,764</td>,
 <td align="right" width="25%">   61.4%</td>,
 <td colspan="3" width="100%"><hr/></td>,
 <td width="40%">= <b>Worldwide:</b></td>,
 <td align="right" width="35%"> <b>$46,725,222</b></td>,
 <td width="25%"> </td>]

**To extract just the value of interest:**

In [36]:
# Find the domestic gross. '\xa0' is a non-breaking space (U+00A0)

soup.find(class_='mp_box_content').find_all('td')[1].text

'\xa0$18,034,458'

In [38]:
# The second td (index 1) holds the domestic gross, $18,034,458; slice off the leading non-breaking space
soup.find(class_='mp_box_content').find_all('td')[1].text[1:]  

# '\xa0' counts as one character.
# Therefore you can slice from index 1 onwards to get the domestic value

'$18,034,458'
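Slicing with `[1:]` assumes there is exactly one leading character to drop. A more forgiving variation is `str.strip()`, which in Python 3 also removes non-breaking spaces. The sketch below rebuilds a trimmed version of the first `mp_box_content` table by hand, so it runs without fetching the page.

```python
from bs4 import BeautifulSoup

box_html = (
    '<div class="mp_box_content"><table>'
    '<tr><td><b>Domestic:</b></td><td>\xa0<b>$18,034,458</b></td></tr>'
    '<tr><td>+ Foreign:</td><td>\xa0$28,690,764</td></tr>'
    '</table></div>'
)
box = BeautifulSoup(box_html, "html.parser")

# Chain: first narrow down to the box, then collect its cells
cells = box.find(class_="mp_box_content").find_all("td")

raw_value = cells[1].text          # still has the leading '\xa0'
clean_value = raw_value.strip()    # strip() treats '\xa0' as whitespace

print(repr(raw_value))
print(repr(clean_value))
```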

## Let's Practice Web Scraping!

### Items to scrape for each movie:

* Movie Title
* Domestic Total Gross
* Runtime
* MPAA Rating
* Release Date

### Movie Title

In [39]:
# Extracting the movie title
soup.find('title')

<title>The Big Lebowski (1998) - Box Office Mojo</title>

In [40]:
soup.find('title').text

'The Big Lebowski (1998) - Box Office Mojo'

In [41]:
# Saving our title into a string
title_string = soup.find('title').text
title_string

'The Big Lebowski (1998) - Box Office Mojo'

In [43]:
# Look carefully at the text and decide how to extract values

title_string.split('(')

['The Big Lebowski ', '1998) - Box Office Mojo']

In [48]:
# Now if we want to get the year

title_string.split('(')[1].split(')')[0]

'1998'

In [45]:
# .strip() removes the white spaces at the beginning and end of the string
title = title_string.split('(')[0].strip()  # index 0 holds the title; strip() drops its trailing space
title

'The Big Lebowski'
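The chained `split` calls work for this title, and a regular expression can pull out both pieces in one step. This is a sketch under the assumption that the page title always looks like `Name (YYYY) - Box Office Mojo`; the function name `parse_title` is hypothetical.

```python
import re

# Lazily capture everything before the first "(YYYY)", then the four-digit year
TITLE_PATTERN = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)")


def parse_title(title_string):
    """Return (title, year) from a page title, or None if the format differs."""
    match = TITLE_PATTERN.match(title_string)
    if match is None:
        return None
    return match.group("title"), int(match.group("year"))


print(parse_title("The Big Lebowski (1998) - Box Office Mojo"))
print(parse_title("A page title without a year"))
```

Returning `None` for unexpected formats lets the caller decide how to handle pages that do not follow the pattern.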

### Domestic Total Gross

Let's try to find the exact text.

In [50]:
print(soup.find(text="Domestic Total Gross"))  # We can't search by id or class because this value has no dedicated tag or attribute on the page.

# So as an alternative, we search for the label text that sits next to the value we want

None


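The `text` argument does an exact-match search (Beautiful Soup 4.4.0 and later also accept it under the name `string`), so we have to be careful.

We have to go back to the source code and look at exactly how the string is written, including any trailing spaces.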

In [51]:
print(soup.find(text="Domestic Total Gross: "))

Domestic Total Gross: 


What if we don't want to be careful? [Regular expressions](https://xkcd.com/208/) to the rescue!

We are going to talk a lot more about regular expressions in the next week or two, but they are **a really powerful way to search for patterns in text**. Today, we're going to use a very simple case, basically doing a "contains" instead of an "exact match".

In [53]:
import re
domestic_total_regex = re.compile('Domestic Total')  # Matches any string that contains "Domestic Total"; no wildcard is needed.
domestic_total_regex # Beautiful Soup will return the first string in which this pattern is found

# re.compile returns a pattern object that describes the text you are looking for

re.compile(r'Domestic Total', re.UNICODE)

In [55]:
dtg_string = soup.find(text=domestic_total_regex) 
dtg_string

'Domestic Total Gross: '

In [56]:
dtg_string.find_next_sibling()

<b>$17,451,873</b>
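The difference between exact and pattern matching is easy to reproduce on a one-line fragment. The sketch below assumes the label and value sit side by side as in the output above (the surrounding `font` tag is an illustrative assumption) and uses the `string` argument, the newer name for `text`.

```python
import re
from bs4 import BeautifulSoup

gross_html = '<font size="4">Domestic Total Gross: <b>$17,451,873</b></font>'
gross_soup = BeautifulSoup(gross_html, "html.parser")

# Exact match fails when the trailing ": " is left out
exact_miss = gross_soup.find(string="Domestic Total Gross")
exact_hit = gross_soup.find(string="Domestic Total Gross: ")

# A compiled pattern matches any string that contains it
regex_hit = gross_soup.find(string=re.compile("Domestic Total"))

# The value lives in the tag right after the label
value = regex_hit.find_next_sibling().text

print(exact_miss)
print(repr(exact_hit))
print(value)
```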

We found the domestic total gross! Now let's strip it down and convert it to an integer.

In [57]:
dtg = dtg_string.find_next_sibling().text 
print(dtg, type(dtg))

$17,451,873 <class 'str'>


In [58]:
# Remove the $ sign and the thousands separators (commas)
dtg = dtg.replace('$','').replace(',','')
print(dtg, type(dtg))

17451873 <class 'str'>


In [61]:
domestic_total_gross = int(dtg)
print(domestic_total_gross, type(domestic_total_gross)) # Now we finally get a number value by applying int to it

17451873 <class 'int'>


### Runtime, MPAA Rating & Release Date

#### Step 1: Create Function to Identify Values

Let's make a function to scrape multiple things, assuming the value will always follow the field name.

In [63]:
def get_movie_value(soup, field_name):
    
    '''Grab a value from boxofficemojo HTML
    
    Takes a string attribute of a movie on the page and returns the string in
    the next sibling object (the value for that attribute) or None if nothing is found.
    '''
    
    obj = soup.find(text=re.compile(field_name))   # find the first string that contains field_name
    
    if not obj:          # if nothing matches, return None
        return None
    
    # this works for most of the values
    next_sibling = obj.find_next_sibling()  # the value sits in the tag right after the label, e.g. <b>$17,451,873</b>
    
    if next_sibling:
        return next_sibling.text 
    else:
        return None

In [64]:
# domestic total gross
dtg = get_movie_value(soup,'Domestic Total')
print(dtg)

$17,451,873


In [71]:
# runtime
runtime = get_movie_value(soup,'Runtime')
print(runtime)
print(runtime.split())

1 hrs. 57 min.
['1', 'hrs.', '57', 'min.']


In [66]:
# rating
rating = get_movie_value(soup,'MPAA Rating')
print(rating)

R


In [67]:
release_date = get_movie_value(soup,'Release Date')
print(release_date)

March 6, 1998
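To see both the success path and the `None` paths without hitting the website, the helper can be exercised on a small hand-written fragment. This sketch restates the function in condensed form; the `re.escape` call is an added assumption (it keeps characters such as `(` in a field name from being read as regex syntax), and the `Genre` cell is made up to show a label with no sibling tag.

```python
import re
from bs4 import BeautifulSoup


def get_movie_value(soup, field_name):
    """Return the text of the tag that follows the label, or None if absent."""
    label = soup.find(string=re.compile(re.escape(field_name)))
    if label is None:
        return None
    sibling = label.find_next_sibling()
    return sibling.text if sibling else None


facts_html = (
    "<table><tr>"
    "<td>Runtime: <b>1 hrs. 57 min.</b></td>"
    "<td>MPAA Rating: <b>R</b></td>"
    "<td>Genre: Comedy</td>"
    "</tr></table>"
)
facts = BeautifulSoup(facts_html, "html.parser")

print(get_movie_value(facts, "Runtime"))
print(get_movie_value(facts, "MPAA Rating"))
print(get_movie_value(facts, "Genre"))    # label found, but no tag follows it
print(get_movie_value(facts, "Budget"))   # label not on the page
```

The `Genre` case shows why the function is only a heuristic: when the value is plain text inside the same string as the label, there is no sibling tag to return.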


#### Step 2: Convert Values to Appropriate Data Types

In [73]:
# This is a bunch of code to help you treat your parsed text

import dateutil.parser  # This helps you parse dates into a datetime format

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

def runtime_to_minutes(runtimestring):
    runtime = runtimestring.split()  # by default will split by space. Look above for what the split string looks like
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (IndexError, ValueError):  # unexpected format, e.g. missing parts or non-numeric values
        return None

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date
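A quick way to gain confidence in converters like these is to run them on the strings scraped above, plus one malformed input. The sketch below restates the helpers in a standard-library-only form: `datetime.strptime` stands in for `dateutil` (an assumption that only holds for dates written like `March 6, 1998` under an English locale), and the argument names are illustrative.

```python
from datetime import datetime


def money_to_int(money_string):
    """'$17,451,873' -> 17451873"""
    return int(money_string.replace("$", "").replace(",", ""))


def runtime_to_minutes(runtime_string):
    """'1 hrs. 57 min.' -> 117, or None if the format is unexpected."""
    parts = runtime_string.split()
    try:
        return int(parts[0]) * 60 + int(parts[2])
    except (IndexError, ValueError):
        return None


def to_date(date_string):
    """'March 6, 1998' -> datetime(1998, 3, 6)"""
    return datetime.strptime(date_string, "%B %d, %Y")


print(money_to_int("$17,451,873"))
print(runtime_to_minutes("1 hrs. 57 min."))
print(runtime_to_minutes("N/A"))   # malformed input falls back to None
print(to_date("March 6, 1998"))
```

`dateutil.parser.parse` is more flexible about date formats, which is why the notebook uses it for real pages.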

#### Step 3: Apply the Conversions

In [74]:
# Let's get these again and format them all in one swoop

raw_domestic_total_gross = get_movie_value(soup,'Domestic Total')
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Runtime')
runtime = runtime_to_minutes(raw_runtime)

raw_release_date = get_movie_value(soup,'Release Date')
release_date = to_date(raw_release_date)

print(domestic_total_gross, runtime, release_date)
print(type(domestic_total_gross), type(runtime), type(release_date))

17451873 117 1998-03-06 00:00:00
<class 'int'> <class 'int'> <class 'datetime.datetime'>


#### Step 4: Print It All Out

In [77]:
from pprint import pprint # pretty print

# Converting the results into a dictionary

headers = ['movie title', 'domestic total gross',
           'runtime (mins)', 'rating', 'release date']  # we define the headers ourselves

movie_data = []
movie_dict = dict(zip(headers, [title,  # zip pairs up the items of two lists, position by position
                                domestic_total_gross,
                                runtime,
                                rating, 
                                release_date])) # dict() turns those pairs into key/value entries

movie_data.append(movie_dict)
pprint(movie_data)

[{'domestic total gross': 17451873,
  'movie title': 'The Big Lebowski',
  'rating': 'R',
  'release date': datetime.datetime(1998, 3, 6, 0, 0),
  'runtime (mins)': 117}]


In [78]:
# headers comes from the cell above, so run that cell first
for i in zip(headers, [title, domestic_total_gross, runtime,rating,release_date]):
    print(i)

('movie title', 'The Big Lebowski')
('domestic total gross', 17451873)
('runtime (mins)', 117)
('rating', 'R')
('release date', datetime.datetime(1998, 3, 6, 0, 0))


## Table Scraping Example

### Step 1: Soupify the Website

Let's take a look at the foreign language page of Box Office Mojo. Let's say we wanted to pull all of the data from the **main table on the page.**

In [82]:
url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

response = requests.get(url)
page = response.text

soup = BeautifulSoup(page,"lxml")  # parse the raw HTML text into a navigable tree of tags


# Shorter way (note the .text: BeautifulSoup needs the HTML string, not the Response object):
# soup = BeautifulSoup(requests.get(url).text, 'lxml')

### Step 2: Find the Tables

In [81]:
tables = soup.find_all("table")
tables

[<table border="0" cellpadding="0" cellspacing="0">
 <tr>
 <form action="/adjuster.php" method="POST" name="adjuster">
 <input name="returnURL" type="hidden" value="/genres/chart/?id=foreign.htm"/>
 <td valign="center">
 <font face="Verdana" size="2"><a href="/about/adjuster.htm"><b>Adjuster:</b></a></font>
 <select name="ticketyr" size="1" style="font-family: Verdana; font-size: 10pt">
 <option selected="" value="0">Actuals</option>
 <option value="1">Est. Tckts</option>
 <script language="javascript">
   for(i=2019; i>=1933; i--) {
   	document.write('<option value="' + i + '"');
 	if(i=='0') document.write(' selected');
 	document.write('>' + i );
 	if(i=='0') document.write(', $' + '0.00');
 	document.write('</option>');
   }
 </script>
 <option value="1929">1929</option>
 <option value="1924">1924</option>
 <option value="1910">1910</option>
 </select><input name="Go" style="font-size: 10pt; height: 22" type="submit" value="Go"/>
 </td></form></tr></table>,
 <table border="0" cell

In [84]:
tables[3].find_all('tr')

[<tr bgcolor="#dcdcdc"><td align="center"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=rank&amp;order=ASC&amp;p=.htm">Rank</a></font></td><td align="center"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=title&amp;order=ASC&amp;p=.htm">Title (click to view)</a></font></td><td align="center"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=studio&amp;order=ASC&amp;p=.htm">Studio</a></font></td><td align="center" colspan="2"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=gross&amp;order=ASC&amp;p=.htm"><b>Lifetime Gross</b></a> / </font><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=maxtheaters&amp;order=DESC&amp;p=.htm">Theaters</a></font></td><td align="center" colspan="2"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=opengross&amp;order=DESC&amp;p=.htm">Opening</a> / </font><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=opentheaters&amp;order=DESC&amp;p=.htm">Theaters</a></

### Step 3: Pull Just the Rows

In [85]:
rows = tables[3].find_all('tr') # each tr tag is one table row

In [86]:
# let's take a look at one row
rows[0]

<tr bgcolor="#dcdcdc"><td align="center"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=rank&amp;order=ASC&amp;p=.htm">Rank</a></font></td><td align="center"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=title&amp;order=ASC&amp;p=.htm">Title (click to view)</a></font></td><td align="center"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=studio&amp;order=ASC&amp;p=.htm">Studio</a></font></td><td align="center" colspan="2"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=gross&amp;order=ASC&amp;p=.htm"><b>Lifetime Gross</b></a> / </font><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=maxtheaters&amp;order=DESC&amp;p=.htm">Theaters</a></font></td><td align="center" colspan="2"><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=opengross&amp;order=DESC&amp;p=.htm">Opening</a> / </font><font size="2"><a href="/genres/chart/?id=foreign.htm&amp;sort=opentheaters&amp;order=DESC&amp;p=.htm">Theaters</a></f

In [87]:
# let's take a look at one value in the row
rows[0].find_all('td')[1].find('a')['href']

'/genres/chart/?id=foreign.htm&sort=title&order=ASC&p=.htm'

### Step 4: Pull All Values

In [88]:
rows[1].find_all('td')[1].find('a')['href']

'/movies/?id=crouchingtigerhiddendragon.htm'

In [89]:
rows = rows[1:21] # skip the header row and keep the first 20 data rows for now
movies = {}

for row in rows:
    items = row.find_all('td')
    link = items[1].find('a')['href']  # the movie page URL serves as a unique key
    movies[link] = [i.text for i in items[1:]]
    
list(movies.items())[0]

('/movies/?id=crouchingtigerhiddendragon.htm',
 ['Crouching Tiger, Hidden Dragon(Taiwan)',
  'SPC',
  '$128,078,872',
  '2,027',
  '$663,205',
  '16',
  '12/8/00'])
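The row-by-row loop can be tried on a small hand-written table. In this sketch the first data row reuses values from the output above, while the second row (`Example Film`) is a made-up placeholder, not data from the site.

```python
from bs4 import BeautifulSoup

table_html = """
<table>
  <tr><td>Rank</td><td><a href="/sort/title">Title</a></td><td>Studio</td><td>Lifetime Gross</td></tr>
  <tr><td>1</td><td><a href="/movies/?id=crouchingtigerhiddendragon.htm">Crouching Tiger, Hidden Dragon(Taiwan)</a></td><td>SPC</td><td>$128,078,872</td></tr>
  <tr><td>2</td><td><a href="/movies/?id=example.htm">Example Film</a></td><td>XYZ</td><td>$1</td></tr>
</table>
"""
table = BeautifulSoup(table_html, "html.parser").find("table")

data_rows = table.find_all("tr")[1:]   # drop the header row

movies_demo = {}
for row in data_rows:
    cells = row.find_all("td")
    link = cells[1].find("a")["href"]                 # key: link to the movie page
    movies_demo[link] = [cell.text for cell in cells[1:]]  # values: every cell after the rank

print(movies_demo["/movies/?id=crouchingtigerhiddendragon.htm"])
```

Using the link as the dictionary key avoids collisions between movies that share a title.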

### Step 5: Pandas Alternative

In [90]:
# you can also use pandas to read tables
import pandas as pd

url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

In [91]:
tables = pd.read_html(url)  # returns a list of DataFrames, one per <table> on the page; it only works for pages that contain tables

In [93]:
tables

[                                                   0  \
 0  Foreign Language 1980-PresentOnly overseas-pro...   
 
                                                    1  
 0  googletag.cmd.push(function() {  googletag.def...  ,
                                 0                                       1  \
 0                            Rank                   Title (click to view)   
 1                               1  Crouching Tiger, Hidden Dragon(Taiwan)   
 2                               2                Life Is Beautiful(Italy)   
 3                               3                             Hero(China)   
 4                               4               Instructions Not Included   
 5                               5                 Pan's Labyrinth(Mexico)   
 6                               6                          Amelie(France)   
 7                               7                Jet Li's Fearless(China)   
 8                               8                       Il Postino(I

In [94]:
tables[2]
# how can you fix the header?

Unnamed: 0,0,1,2
0,Title (click to view),Studio,Release Date
1,Tampopo (2016 re-release),Jan.,10/21/16
2,"Cyrano, My Love",RAtt.,10/18/19
3,By The Grace of God,MBox,10/18/19
4,Housefull 4,FIP,10/25/19
5,Portrait of a Lady on Fire,Neon,12/6/19
6,The Traitor,SPC,1/31/20
7,Las Pildoras de Mi Novio (My Boyfriend's Meds),PNT,2/21/20
8,Johanna(Hungary),Tar.,TBD
9,Abel,LGF,TBD


In [95]:
tables[2][0:5]

Unnamed: 0,0,1,2
0,Title (click to view),Studio,Release Date
1,Tampopo (2016 re-release),Jan.,10/21/16
2,"Cyrano, My Love",RAtt.,10/18/19
3,By The Grace of God,MBox,10/18/19
4,Housefull 4,FIP,10/25/19
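One possible answer to the header question: promote the first row to column names and drop it from the data. The sketch below rebuilds the first few rows of the table by hand (so it runs without fetching the page) and is only one of several ways to do this; passing `header=0` to `pd.read_html` is another option worth trying on the live page.

```python
import pandas as pd

# Hand-built stand-in for tables[2]: the real header is stuck in row 0
raw_table = pd.DataFrame([
    ["Title (click to view)", "Studio", "Release Date"],
    ["Tampopo (2016 re-release)", "Jan.", "10/21/16"],
    ["Cyrano, My Love", "RAtt.", "10/18/19"],
])

# Drop the header row from the data, renumber the index from 0,
# then use that first row as the column names
fixed = raw_table.iloc[1:].reset_index(drop=True)
fixed.columns = raw_table.iloc[0].tolist()

print(fixed)
```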


Conclusion: Beautiful Soup is powerful, but it has limitations. It only parses the HTML it is given: if a page requires interaction (such as entering a password) or generates its content dynamically instead of serving static HTML, Beautiful Soup alone is not enough. We will explore other tools for that.

One such tool for tomorrow: *Selenium*