## Web Scraping 1: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

This notebook shows how to scrape data from web pages with the `requests` and `BeautifulSoup` libraries.

### First, an HTML refresher

HTML is the basic language used to create a web page. 

It tells the web browser what text and media to display, where to display them, and how to style them.

HTML is highly structured and hierarchical.

Every page is made up of discrete "elements."

Elements are labeled with "tags": usually a start tag such as `<p>` and a matching end tag such as `</p>`.

For example:

    <p>You are beginning to learn HTML.</p>

A start tag often also contains "attributes" that carry information about the element.

Attributes usually have a name and value.

Example:

    <p class="my_red_sentences">You are beginning to learn HTML.</p>

A full HTML document has a structure more like this:

```
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
```
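
To connect these terms to code, here is a minimal sketch that parses a small document modeled on the one above and reads back a tag name, an attribute, and the text of an element. It is written for Python 3 (unlike the Python 2 cells in this notebook), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="http://www.google.com"> Some link </a>
  </body>
</html>
"""

# "html.parser" is the parser bundled with Python, so nothing extra is installed.
doc = BeautifulSoup(html, "html.parser")

paragraph = doc.find("p")
tag_name = paragraph.name            # the tag that labels the element
css_classes = paragraph["class"]     # class can hold several values, so this is a list
paragraph_text = paragraph.text      # the text between the start and end tags
link_target = doc.find("a")["href"]  # attributes are read like dictionary keys

print(tag_name, css_classes, paragraph_text, link_target)
```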

Let's explore some live HTML!

Go to http://boxofficemojo.com/movies/?id=biglebowski.htm in your browser,
then try both Inspect Element and View Page Source.

### Get the HTML from a page and convert to a BeautifulSoup object

We'll start by scraping some of that information about [The Big Lebowski](http://boxofficemojo.com/movies/?id=biglebowski.htm).

In [1]:
# if needed: pip install requests
import requests

url = 'http://boxofficemojo.com/movies/?id=biglebowski.htm'

response = requests.get(url)

For information on HTTP status codes, see:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [2]:
response.status_code

200
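
A status code of 200 means the request succeeded. As a rough sketch (Python 3, standard library only), the helpers below turn a numeric code into its standard phrase and check for the 2xx success range. The names `describe_status` and `is_success` are made up for this illustration and are not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK' for a numeric HTTP status code."""
    try:
        return "{} {}".format(code, HTTPStatus(code).phrase)
    except ValueError:
        # HTTPStatus only knows the registered status codes.
        return "{} (unrecognized status code)".format(code)


def is_success(code):
    """True for the 2xx family of status codes."""
    return 200 <= code < 300


print(describe_status(200))
print(describe_status(404))
```

In a scraper you might call `is_success(response.status_code)` before parsing. `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.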

In [4]:
print response.text

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-type" content="text/html;charset=iso-8859-1">
<title>The Big Lebowski (1998) - Box Office Mojo</title>

<style type="text/css">
table.chart-wide { width: 100%; }
</style>
<META name="keywords" content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, similar movies, box office mojo">
<META name="description" content="The Big Lebowski summary of box office results, charts and release information and related links.">

<link rel="stylesheet"

In [4]:
page = response.text

In [5]:
# if needed: pip install beautifulsoup4
from bs4 import BeautifulSoup

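# bs4 picks the best parser installed; you can also name one,
# e.g. BeautifulSoup(page, 'html.parser')
soup = BeautifulSoup(page)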

In [6]:
print soup

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
<title>The Big Lebowski (1998) - Box Office Mojo</title>
<style type="text/css">
table.chart-wide { width: 100%; }
</style>
<meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, similar movies, box office mojo" name="keywords"/>
<meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="description"/>
<link charset="utf-8" href

In [7]:
print soup.prettify()

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
  <title>
   The Big Lebowski (1998) - Box Office Mojo
  </title>
  <style type="text/css">
   table.chart-wide { width: 100%; }
  </style>
  <meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, similar movies, box office mojo" name="keywords"/>
  <meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="description"/>
  <

## `soup.find()`

`soup.find()` is the most common function we will use from this package.  

Let's try out some common variations of `soup.find()`

In [6]:
# soup.find() returns the first matched tag it finds.
# It searches the entire tree.

# Search for a type of tag by using the tag as a string
# (like 'body','div','p','a') as an argument.

print soup.find('a')

<a href="/daily/chart/">Daily Box Office (Mon.)</a>


In [9]:
# Equivalently:
print soup.a

<a href="/daily/chart/">Daily Box Office (Sun.)</a>


In [10]:
# Prettier:
print soup.a.prettify()

<a href="/daily/chart/">
 Daily Box Office (Sun.)
</a>



In [11]:
# soup.find_all() returns a list of all matches

for link in soup.find_all('a'): 
    print link

<a href="/daily/chart/">Daily Box Office (Sun.)</a>
<a href="/weekend/chart/">Weekend Box Office (Jul. 3–5)</a>
<a href="/movies/?id=pixar2014.htm">#1 Movie: 'Inside Out'</a>
<a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a>
<a href="http://ad.doubleclick.net/N4215/jump/imdb2.bom.movie/;p=t;tile=1;sz=728x90;oe=ISO-8859-1;ord=1902523258?" target="_blank"><img alt="" border="0" src="http://ad.doubleclick.net/N4215/ad/imdb2.bom.movie/;p=t;tile=1;sz=728x90;oe=ISO-8859-1;ord=1902523258?"/></a>
<a href="/"><img alt="Box Office Mojo" height="56" src="/img/misc/bom_logo1.png" width="245"/></a>
<a href="http://facebook.com/boxofficemojo" style="vertical-align:middle;"><img alt="Facebook Logo" border="0" src="/images/FaceBook_16x16.png"/>Facebook</a>
<a href="http://twitter.com/boxofficemojo" style="vertical-align:middle;"><img alt="Twitter Logo" border="0" src="/images/Twitter_16x16.png"/>Twitter</a>
<a href="/news/">News</a>
<a href="/schedule/">Release Sched.</a>
<a href="http:

In [12]:
# retrieve the url from an anchor tag
soup.find('a')['href']

'/daily/chart/'
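
Note that `'/daily/chart/'` is a relative URL, so it cannot be requested as is. One way to resolve it against the page address is `urljoin` from the Python 3 standard library, sketched below on a small hand-written snippet (the snippet stands in for the real page, which this example does not download).

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://boxofficemojo.com/movies/?id=biglebowski.htm"

# A stand-in for the downloaded page: one relative link, one absolute link.
snippet = (
    '<a href="/daily/chart/">Daily Box Office</a>'
    '<a href="http://twitter.com/boxofficemojo">Twitter</a>'
)
doc = BeautifulSoup(snippet, "html.parser")

# href=True skips anchors that have no href attribute at all.
absolute_links = [urljoin(base_url, a["href"]) for a in doc.find_all("a", href=True)]

print(absolute_links)
```

Links that are already absolute pass through `urljoin` unchanged, so the same call can be applied to every link on a page.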

In [7]:
# You can match on an attribute like an id or class.
# Take a look at what the 'mp_box_content' classes
# look like on the webpage, with Inspect Element.

for element in soup.find_all(class_='mp_box_content'):
   print element, '\n'

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$17,451,873</b></td>
</tr>
</table>
</div> 

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td align="center"><a href="/weekend/chart/?yr=1998&amp;wknd=10&amp;p=.htm">Opening Weekend:</a></td><td> $5,533,844</td></tr>
<tr>
<td align="center" colspan="2"><font size="2">(#6 rank, 1,207 theaters, $4,585 average)</font></td></tr>
<tr>
<td align="right">% of Total Gross:</td><td> 31.7%</td></tr>
<tr><td align="right" colspan="2"><font face="Helvetica, Arial, Sans-Serif" size="1"><a href="/movies/?page=weekend&amp;id=biglebowski.htm"><b>&gt; View All 4 Weekends</b></a></font></td></tr>
</table>
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td>Widest Release:</td>
<td> 1,235 theaters</td>
</tr>
</table>
</div> 

<div class="mp_box_content">
<table>
<tr><td align="right" valign="top"><font 

In [14]:
# We can find all the columns in the first mp_box_content table
# by "chaining" `find` and `find_all`.

print soup.find(class_='mp_box_content').find_all('td')

[<td width="40%"><b>Domestic:</b></td>, <td align="right" width="35%"> <b>$17,451,873</b></td>]


In [15]:
# To extract just the value of interest:

soup.find(class_='mp_box_content').find_all('td')[1].text

u'\xa0$17,451,873'

Be careful with non-printing characters! The leading `\xa0` here is a non-breaking space, not an ordinary space.
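
A small sketch of one way to deal with this in Python 3: replace non-breaking spaces with ordinary ones and trim the ends. The helper name `clean_text` is an assumption made for this example.

```python
raw_value = "\xa0$17,451,873"


def clean_text(text):
    """Replace non-breaking spaces with ordinary spaces and trim both ends."""
    return text.replace("\xa0", " ").strip()


cleaned = clean_text(raw_value)
print(repr(cleaned))
```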

In [8]:
# Find by "id". (An id should be unique within a page.)

print soup.find(id='hp_footer')

<div id="hp_footer">
<div style="padding-bottom: 20px;">
<div style="margin: 0px 121px; vertical-align: top;">
<div id="footer_links">
<ul class="footer_link_list">
<li><strong>Latest Updates</strong></li>
<li><a href="/news/?ref=ft">Movie News</a>
</li><li><a href="/daily/chart/?ref=ft">Daily Chart</a></li>
<li><a href="/weekend/chart/?ref=ft">Weekend Chart</a></li>
<li><a href="/alltime/?ref=ft">All Time Charts</a></li>
<li><a href="/intl/?ref=ft">International Charts</a></li>
</ul>
<!--
					<ul class="footer_link_list">
						<li><strong>Popular Movies</strong></li>
											</ul>
					-->
<ul class="footer_link_list">
<li><strong>Indices</strong></li>
<li><a href="/movies/?ref=ft">Movies A-Z</a></li>
<li><a href="/people/?ref=ft">People</a></li>
<li><a href="/genres/?ref=ft">Genres</a></li>
<li><a href="/franchises/?ref=ft">Franchises</a></li>
<li><a href="/showdowns/?ref=ft">Showdowns</a></li>
</ul>
<ul class="footer_link_list">
<li><strong>Other</strong></li>
<li><a href="/abo
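
The `class_=` and `id=` lookups above can be tried without the live page. The sketch below (Python 3) runs them on a small hand-written snippet that only imitates the footer markup, and shows the equivalent CSS selectors through `select()` and `select_one()`.

```python
from bs4 import BeautifulSoup

# A hand-written imitation of the footer, not the real page source.
snippet = """
<div id="hp_footer">
  <ul class="footer_link_list"><li><a href="/news/?ref=ft">Movie News</a></li></ul>
  <ul class="footer_link_list"><li><a href="/movies/?ref=ft">Movies A-Z</a></li></ul>
</div>
"""
doc = BeautifulSoup(snippet, "html.parser")

footer = doc.find(id="hp_footer")
lists_by_keyword = doc.find_all(class_="footer_link_list")
# The same search spelled with an attrs dictionary.
lists_by_attrs = doc.find_all("ul", attrs={"class": "footer_link_list"})

# CSS selectors: '#' matches an id, '.' matches a class.
footer_by_css = doc.select_one("#hp_footer")
link_texts = [a.text for a in doc.select("ul.footer_link_list a")]

print(link_texts)
```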

## Consistency

Web scraping is much simpler when similar pages of a website present their information in a consistent format.

### Items to scrape for each movie:
* movie title
* total domestic gross
* release date
* runtime
* rating


In [9]:
# Movie Title

print soup.find('title')

<title>The Big Lebowski (1998) - Box Office Mojo</title>


In [10]:
title_string = soup.find('title').text
print title_string

The Big Lebowski (1998) - Box Office Mojo


In [11]:
print title_string.split('(')

[u'The Big Lebowski ', u'1998) - Box Office Mojo']


In [12]:
title = title_string.split('(')[0].strip()
print title

The Big Lebowski


In [13]:
# Domestic Total Gross

# text= does an exact-match search, so this finds nothing:
print soup.find(text="Domestic Total Gross")

None


In [14]:
# An exact match needs the full string, trailing space included:

print soup.find(text="Domestic Total Gross: ")

Domestic Total Gross: 


In [17]:
# You could also use [regular expressions](https://xkcd.com/208/).

import re
domestic_total_regex = re.compile('Domestic Total')
soup.find(text=domestic_total_regex)

u'Domestic Total Gross: '

In [18]:
dtg_string = soup.find(text=re.compile('Domestic Total'))
print dtg_string

Domestic Total Gross: 


In [19]:
print dtg_string.find_next_sibling()

<b>$17,451,873</b>


In [26]:
dtg = dtg_string.find_next_sibling().text
dtg = dtg.replace('$','').replace(',','')
domestic_total_gross = int(dtg)
print domestic_total_gross

17451873
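
The same label-then-sibling pattern can be reproduced offline. The sketch below (Python 3) uses a hypothetical one-line snippet shaped like the relevant part of the page. It also uses `string=`, which recent versions of Beautiful Soup prefer over the older `text=` spelling used in the cells above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: a text label followed by a <b> element holding the value.
snippet = '<font size="4">Domestic Total Gross: <b>$17,451,873</b></font>'
doc = BeautifulSoup(snippet, "html.parser")

# A plain string must match the whole text node, so this returns None.
exact_miss = doc.find(string="Domestic Total Gross")

# A regular expression only needs to match part of the text node.
label = doc.find(string=re.compile("Domestic Total"))

# The next sibling element of the label is the tag that holds the value.
value_tag = label.find_next_sibling()
gross = int(value_tag.text.replace("$", "").replace(",", ""))

print(repr(label), value_tag, gross)
```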


### We can actually do several of these using the text matching method, so let's make a function for that

In [21]:
def get_movie_value(soup, field_name):
    '''Grab a value from boxofficemojo HTML

    Takes the label of a movie attribute on the page
    (a string, used as a regular expression) and returns
    the text of the next sibling element (the value for
    that attribute), or None if nothing is found.
    '''
    obj = soup.find(text=re.compile(field_name))
    if not obj:
        return None
    # this works for most of the values
    next_sibling = obj.find_next_sibling()
    if next_sibling:
        return next_sibling.text
    else:
        return None

In [28]:
# domestic total gross
dtg = get_movie_value(soup,'Domestic Total')
print dtg

$17,451,873


In [29]:
# runtime
runtime = get_movie_value(soup,'Runtime')
print runtime

1 hrs. 57 min.


In [30]:
# rating
rating = get_movie_value(soup,'MPAA Rating')
print rating 

R


In [22]:
# release date
release_date = get_movie_value(soup,'Release Date')
print release_date

March 6, 1998


### We need a few helper methods to parse the strings we've gotten

In [31]:
import dateutil.parser

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

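def runtime_to_minutes(runtimestring):
    # expects a string like '1 hrs. 57 min.'
    runtime = runtimestring.split()
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (IndexError, ValueError):
        # the string did not have the expected shape
        return None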
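
As an alternative sketch (not the notebook's own helpers), the same parsing can be done in Python 3 with only the standard library. The function names `parse_release_date` and `parse_runtime` are invented for this example, and the date format is assumed to always look like `March 6, 1998`.

```python
import re
from datetime import datetime


def parse_release_date(datestring):
    """Parse a date written like 'March 6, 1998'; return None if it does not fit."""
    try:
        # %B is the full month name (English names under the default locale).
        return datetime.strptime(datestring.strip(), "%B %d, %Y")
    except ValueError:
        return None


def parse_runtime(runtimestring):
    """Convert '1 hrs. 57 min.' (or just '57 min.') to minutes, else None."""
    hours = re.search(r"(\d+)\s*hr", runtimestring)
    minutes = re.search(r"(\d+)\s*min", runtimestring)
    if not hours and not minutes:
        return None
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total


print(parse_release_date("March 6, 1998"), parse_runtime("1 hrs. 57 min."))
```

Compared with splitting on whitespace, the regular expressions tolerate a missing hours part, at the cost of being slightly harder to read.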

In [32]:
# Let's get these again and format them all in one fell swoop

from pprint import pprint

raw_release_date = get_movie_value(soup,'Release Date')
release_date = to_date(raw_release_date)

raw_domestic_total_gross = get_movie_value(soup,'Domestic Total')
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Runtime')
runtime = runtime_to_minutes(raw_runtime)

rating = get_movie_value(soup,'MPAA Rating')

headers = ['movie title', 'domestic total gross',
           'release date', 'runtime (mins)', 'rating']

movie_data = []
movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                release_date,
                                runtime,
                                rating]))
movie_data.append(movie_dict)

pprint(movie_data)

[{'domestic total gross': 17451873,
  'movie title': u'The Big Lebowski',
  'rating': u'R',
  'release date': datetime.datetime(1998, 3, 6, 0, 0),
  'runtime (mins)': 117}]


### What about scraping tables? 

In [33]:
url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

response=requests.get(url)
page=response.text

soup=BeautifulSoup(page)


In [38]:
tables = soup.find_all("table")
rows = tables[3].find_all('tr')

# Skip the first row (presumably the header) and keep the next 19 for now
rows = rows[1:20]

movies = {}
for row in rows:
    items = row.find_all('td')
    # key each movie by the link in its second column
    link = items[1].find('a')['href']
    movies[link] = [i.text for i in items[1:]]

movies.items()[1]

('/movies/?id=monsoonwedding.htm',
 [u'Monsoon Wedding(India)',
  u'USA',
  u'$13,885,966',
  u'254',
  u'$68,546',
  u'2',
  u'2/22/02'])
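
The row-by-row approach can be tried on a table small enough to read in full. The sketch below (Python 3) uses a made-up table with placeholder movies and column names, so none of its values come from the real chart. It builds one dictionary per row, keyed by the header cells.

```python
from bs4 import BeautifulSoup

# A made-up table: placeholder titles, links, and figures for illustration only.
snippet = """
<table>
  <tr><td>Rank</td><td>Title</td><td>Gross</td></tr>
  <tr><td>1</td><td><a href="/movies/?id=movie_a.htm">Movie A</a></td><td>$1,000</td></tr>
  <tr><td>2</td><td><a href="/movies/?id=movie_b.htm">Movie B</a></td><td>$500</td></tr>
</table>
"""
doc = BeautifulSoup(snippet, "html.parser")

rows = doc.find("table").find_all("tr")

# The first row supplies the column names.
header = [cell.text.strip() for cell in rows[0].find_all("td")]

records = []
for row in rows[1:]:
    cells = row.find_all("td")
    record = dict(zip(header, [cell.text.strip() for cell in cells]))
    # Keep the link from the title column alongside the visible text.
    record["link"] = cells[1].find("a")["href"]
    records.append(record)

print(records)
```

A list of dictionaries like this keeps the row order and is straightforward to pass on to other tools later.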