# Week 11: Web scraping, and other forms of data "munging."

There are lots of ways to get data off the web. Ideally, you find a complete dataset already prepared that meets your needs. Click and download.

A little less ideally, you find an API that can be queried, and that returns structured data that is easy to interpret. We've already been doing this with the Google Maps API. Here, you often start with a list of queries for the API.

But in some cases, you're going to need to get data from the open web, where it's formatted more for beautiful display than for ease of analysis.

The obvious challenge here (opening a URL and getting data from the web into Python) is actually very easy to solve.

In [4]:
import requests
test_url = "http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html"
page = requests.get(test_url)
print(page.text)

<html>

<head>
<title>A very simple webpage</title>
<basefont size=4>
</head>

<body bgcolor=FFFFFF>

<h1>A very simple webpage. This is an "h1" level header.</h1>

<h2>This is a level h2 header.</h2>

<h6>This is a level h6 header.  Pretty small!</h6>

<p>This is a standard paragraph.</p>

<p align=center>Now I've aligned it in the center of the screen.</p>

<p align=right>Now aligned to the right</p>

<p><b>Bold text</b></p>

<p><strong>Strongly emphasized text</strong>  Can you tell the difference vs. bold?</p>

<p><i>Italics</i></p>

<p><em>Emphasized text</em>  Just like Italics!</p>

<p>Here is a pretty picture: <img src=example/prettypicture.jpg alt="Pretty Picture"></p>

<p>Same thing, aligned differently to the paragraph: <img align=top src=example/prettypicture.jpg alt="Pretty Picture"></p>

<hr>

<h2>How about a nice ordered list!</h2>
<ol>
  <li>This little piggy went to market
  <li>This little piggy went to SB228 class
  <li>This little piggy went to an expensive restaura

Well, that was incredibly easy. You can open [http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html](http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html) to see what this looks like in a browser. Then select "Show Page Source" (or the equivalent in your browser).

### The hard parts

The hard parts of web scraping are 1) finding the right URL in the first place and 2) extracting what you need from the hierarchical tangle of html tags.

1) I don't have a general solution to the first problem. You'll need to exercise some cleverness. Sometimes you can look at the ways URLs are formed and infer how to generate other valid URLs. For instance, if you search "Stranger Things" on IMDb, you get this URL:

[http://www.imdb.com/find?ref_=nv_sr_fn&q=stranger+things&s=all](http://www.imdb.com/find?ref_=nv_sr_fn&q=stranger+things&s=all)

It's not too hard to see what you would have to replace there to get the search page for *Star Wars.*
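As a small sketch of that idea, the search URL can be assembled from a title with the standard library. The function name `build_search_url` is made up for illustration, and the parameter names (`ref_`, `q`, `s`) are simply copied from the URL above; IMDb may change that format at any time.

```python
from urllib.parse import urlencode

def build_search_url(title):
    """Return an IMDb-style search URL for a title (format copied from the example above)."""
    # urlencode turns spaces into '+' and escapes characters such as '&'
    # that would otherwise be read as part of the URL syntax
    query = urlencode({"ref_": "nv_sr_fn", "q": title.lower(), "s": "all"})
    return "http://www.imdb.com/find?" + query

print(build_search_url("Star Wars"))
print(build_search_url("Bill & Ted's Excellent Adventure"))
```

Escaping matters for the second title: an unescaped `&` would cut the query short at "bill".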

Sometimes you need to find URLs by extracting links from other web pages. Which brings us to problem two.

2) To get what you need from a web page, you can use an html-parsing module like Beautiful Soup. For instance, let's get the level h6 header from the Very Simple Web Page.

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, "lxml")
soup.find('h6')

<h6>This is a level h6 header.  Pretty small!</h6>

Then, if we want to extract the text, it's as simple as

In [6]:
soup.find('h6').text

'This is a level h6 header.  Pretty small!'

Okay, that's simple enough! Things get a little harder if we want to get the link to "another page." We could try ```.find('a')```, but that would give us the link to Yahoo instead, because it will return the *first* example of a specified tag. We could use ```.find_all()```, but that will return *all* the examples.

To find a section of html that includes a given chunk of text, we have to use something called a regular expression, which matches a given pattern. A lot can be said about regular expressions, but the special characters used to match patterns are not easy to memorize; in practice when I need a complex one I go [to Regex 101](https://regex101.com), and play around. In this case, we don't have to. It's dead easy, because we just want to literally match a particular chunk of text.

In [7]:
import re
regex = re.compile('another')
# recent versions of Beautiful Soup prefer string= over the older text= argument
soup.find('a', string = regex)

<a href="../../index.html">another page on this server</a>

That's lovely, but we actually want the *link*.

In [8]:
htmlsection = soup.find('a', string = regex)
htmlsection.get('href')

'../../index.html'
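That link is relative to the page it came from, so it can't be opened as it stands. One way to turn it into a full URL, sketched here with the standard library's `urljoin` (the variable names are only illustrative):

```python
from urllib.parse import urljoin

page_url = "http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html"
relative_link = "../../index.html"

# each '..' climbs one directory up from the folder that contains the page
absolute_link = urljoin(page_url, relative_link)
print(absolute_link)
```

The result should be a URL you can pass straight to `requests.get`.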

### A real problem.

Say we want to scrape IMDb data for a series of movies. First, we're going to need to find the right pages. That's not easy. Here, for instance, is the URL for "Casablanca":

    http://www.imdb.com/title/tt0034583/?ref_=fn_al_tt_1

That's not going to be easy to infer from the title. We're going to have to do what we actually do as human beings: use the search function. As noted above, the URLs for searches are pretty transparent, e.g.,

    http://www.imdb.com/find?ref_=nv_sr_fn&q=stranger+things&s=all

So we can build a function that takes a movie title and returns the IMDb page. While we're doing that, we'll implement some very basic error handling. The internet is a messy place and tends to return errors for no clear reason, so patience is sometimes needed.

In [73]:
import time

def trythreetimes(url):
    ''' Sometimes when you initially attempt to load a page, you get an error:
    a page with status 404 or 503, etc.
    
    There are sophisticated ways to handle that, but let's try a simple way:
    Keep trying until it works. We'll stop after three tries, so as not to hang
    up in an infinite loop.
    '''
    found = False
    tries = 0
    
    while not found and tries < 3:

        if tries > 0:
            time.sleep(1)
            # maybe the web needs a brief rest?
        
        tries += 1
        
        try:
            page = requests.get(url)
            if page.status_code == 200:
                # success!
                found = True
            # otherwise found will stay False
            else:
                print(page.status_code)
            
        except Exception:
            # something really went wrong; let's quit
            tries = 3
            page = ''
    
    # we'll return the found flag so whoever called this
    # function knows whether it worked
    return page, found

def getsearchpage(atitle):
    ''' Takes a title string, breaks it into words, escapes any
    characters that have a special meaning in a url (such as '&'),
    joins the words with a plus character, and pours the result
    into a url of the correct form for IMDb. Then calls trythreetimes.
    '''
    from urllib.parse import quote_plus
    
    words = atitle.split()
    url = "http://www.imdb.com/find?ref_=nv_sr_fn&q=" + '+'.join(quote_plus(w) for w in words) + '&s=all'
    page, found = trythreetimes(url)
    return page, found
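For comparison, here is a hedged sketch of one of the slightly more sophisticated approaches: wait longer after each failure, and retry on exceptions too (unlike `trythreetimes`, which gives up on the first exception). The names `retry` and `flaky_fetch` are invented for this example, and the "web request" is a stand-in so the cell runs without a network connection.

```python
import time

def retry(fetch, attempts=3, first_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that raises an exception on failure.
    Returns (result, True) on success and (None, False) otherwise.
    """
    delay = first_delay
    for attempt in range(attempts):
        try:
            return fetch(), True
        except Exception as err:
            print("attempt", attempt + 1, "failed:", err)
            if attempt < attempts - 1:
                sleep(delay)   # wait before the next try
                delay *= 2     # double the wait each time
    return None, False

# A stand-in for a flaky web request: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network hiccup")
    return "<html>ok</html>"

result, found = retry(flaky_fetch, first_delay=0.01)
print(result, found)
```

To use something like this with `requests`, the callable would need to raise on a bad status, for example by calling `raise_for_status()` on the response before returning it.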

In [53]:
jaws_url = "http://www.imdb.com/find?ref_=nv_sr_fn&q=jaws&s=all"
page = requests.get(jaws_url)
# print(page.text)

jaws_soup = BeautifulSoup(page.text, "lxml")
# print(jaws_soup)
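# find() expects a tag name such as 'td', not a chunk of raw html,
# so this search matches nothing and returns None
print(jaws_soup.find('<td class="result_text">'))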

None


In [76]:
page, found = getsearchpage('Casablanca')
found

True

In [77]:
getsearchpage('Jaws')

(<Response [200]>, True)

Great, that part is working. We can get the html for a search page! Now, you have the fun job of figuring out how to use Beautiful Soup to find the part of that page that contains the link to the full IMDb page, and then extract the link.

Mainly, this requires opening the page source in your browser, and using "find" (command-F or control-F) to locate the part of that vast mess that contains the title of the movie. (You may often get multiple title results for a search. Let's assume for now that we're interested in the first one; that will usually be true.) Then you need to figure out a series of Beautiful Soup commands that will return the link for that title. Build this as a function, so we can reuse it further down the page. The function should accept a movie title, and return the IMDb link to the main page for that movie.

Here's a little hint involving a feature we didn't try above.

If you see an html tag like ```<td class="a_specific_class">```

You can get that section of html using Beautiful Soup like so:

    soup.find('td', 'a_specific_class')

In [78]:
# Code for the function link2movie goes here

def link2movie(movie_title): # string as input
    ''' Takes a movie title, gets the URL for the
    search page for that movie, and then crawls the
    search page to get the link to the movie page
    itself.
    '''
    from bs4 import BeautifulSoup
    import re
    search_page, found = getsearchpage(movie_title)
    # print(type(search_page))
    # print(search_page)
    soup = BeautifulSoup(search_page.text, 'lxml')
    # print(type(soup))

    result_text = soup.find('td', 'result_text')
    # print(result_text)
    page_block = result_text.find('a')
    raw_pagelink = page_block.get('href')
    pagelink = 'http://www.imdb.com' + raw_pagelink
    
    return pagelink

def getmoviepage(movie_title):
    url = link2movie(movie_title)
    page, found = trythreetimes(url)
    
    return page, found # url

In [33]:
link2movie('Jaws')

link2movie('A New Hope')

link2movie('Goodfellas')

link2movie('Community')

<class 'requests.models.Response'>
<td class="result_text"> <a href="/title/tt0073195/?ref_=fn_al_tt_1">Jaws</a> (1975) </td>
<class 'requests.models.Response'>
<td class="result_text"> <a href="/title/tt0076759/?ref_=fn_al_tt_1">Star Wars: Episode IV - A New Hope</a> (1977) </td>
<class 'requests.models.Response'>
<td class="result_text"> <a href="/title/tt0099685/?ref_=fn_al_tt_1">Goodfellas</a> (1990) </td>
<class 'requests.models.Response'>
<td class="result_text"> <a href="/title/tt1439629/?ref_=fn_al_tt_1">Community</a> (2009) (TV Series) </td>


'http://www.imdb.com/title/tt1439629/?ref_=fn_al_tt_1'

### A harder version of web scraping, for homework.

Now may be the time to admit that IMDb itself actually makes things easier for us. Because they get a lot of use, they've created [a couple of APIs](http://stackoverflow.com/questions/1966503/does-imdb-provide-an-api/7744369#7744369), and even [a bunch of plain text files as static downloads](http://www.imdb.com/interfaces).

So we don't actually need to be artful web crawlers to get data out of IMDb. However, many other sites you'll encounter have not been so generous with data, so web crawling is still a useful skill, and IMDb continues to be a fun example.

Suppose we want to get data for a series of movies, including (for each movie), the title, the budget and gross, the genres, and the storyline summary. We also have to be prepared for the possibility that some of this data will not be provided for every movie!

First, write a function that extracts those fields from an IMDb page.

Then, write a loop that cycles through a list of movie titles in order to create a pandas DataFrame. For each title it should
1. retrieve the search page for the movie, and get the link to the title page
2. load the title page and extract the fields described above, and then
3. add those fields to ever-growing lists that can become the columns of a DataFrame
4. create a dictionary where the keys are column names and the values are the lists you created in (3); then say ```pd.DataFrame(your_dictionary)``` to create a DataFrame.
 
Try this on a list of five or six movies: ```["Casablanca", "Jaws", "Bill & Ted's Excellent Adventure"]```, plus others of your choice.
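Steps 3 and 4 can be sketched without touching the web at all. The records below are hypothetical placeholders standing in for whatever your extraction function returns; they are not real IMDb data. The point is only to show how missing fields can be handled so that every column ends up the same length.

```python
# Hypothetical records, as a field-extraction function might return them.
# The values are placeholders, not real IMDb data.
records = [
    {"title": "Movie A", "budget": "placeholder", "genres": ["Drama"]},
    {"title": "Movie B", "genres": ["Comedy", "Romance"], "storyline": "placeholder"},
]

column_names = ["title", "budget", "gross", "genres", "storyline"]
columns = {name: [] for name in column_names}

for record in records:
    for name in column_names:
        # .get() returns None when a movie lacks a field, so every
        # list grows by exactly one entry per movie
        columns[name].append(record.get(name))

print(columns)
# with pandas installed: import pandas as pd; df = pd.DataFrame(columns)
```

Because each list gets one entry per movie, present or not, the dictionary can be handed to `pd.DataFrame` without a length mismatch.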

In [89]:
from bs4 import BeautifulSoup
  
def getmoviepage(movie_title):
    url = link2movie(movie_title)
    page, found = trythreetimes(url)
    
    return page, found # url

# <span class="itemprop" itemprop="genre">Adventure</span></a>
    
moviepage, found = getmoviepage('Jaws')
type(moviepage)
soup = BeautifulSoup(moviepage.text, 'lxml')
# print(soup)
page_text = soup.find('div')
# print(page_text)
# genre_section = page_text.find('h4')
# print(genre_section)
# genre = genre_section.get('href')
# pagelink = 'http://www.imdb.com' + raw_pagelink

In [173]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
    
url = link2movie('Jaws')
# print(url)
page = urlopen(url)
soup = BeautifulSoup(page.read(), 'lxml')


# genre = soup.find('div', {'itemprop': 'genre'}).contents

for a in soup.find_all('a'):
    if "Genre:" in a:
        print (a.next_sibling.strip())

# print(type(genre))
# print(genre)

# xlist = genre.find('a')
# print(xlist)

# genre_list = []

# for item in genre:
#     import re
#     # print(item)
#     match = re.match('\>(.*?)<\/a>', str(item))
#     if match is not None:
#         genre_list.append(match)

# genre_list

# genres = genre.a.contents
# print(genres)

# genre_list = genres.find('a').contents
# genre_list

In [28]:
# First, write a function that extracts these fields from an IMDb page:
#    1. Title
#    2. Budget
#    3. Gross
#    4. Genres
#    5. Storyline summary

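def getmovieinfo(movie_title):
    from bs4 import BeautifulSoup
    from urllib.request import urlopen   # urllib2 exists only in Python 2
    
    # link2movie returns the URL itself; getmoviepage returns a (page, found) pair
    url = link2movie(movie_title)
    page = urlopen(url)
    soup = BeautifulSoup(page.read(), 'lxml')
    
    # assumes the genre block is a div with itemprop="genre"; adjust if the page differs
    genre = soup.find('div', {'itemprop': 'genre'}).h4.contents
    return genre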
    
    # <span class="itemprop" itemprop="genre">Adventure</span></a>
    
#     movielink = link2movie(movie_title_str)
#     soup = BeautifulSoup(movielink, 'lxml')
#     result_text = soup.find('itemprop="genre"', 'result_text')
#     print(result_text)
    

In [15]:
# Then, write a loop that cycles through a list of movie titles in order to create a pandas DataFrame, and for each title:
#    1. retrieve the search page for the movie, and get the link to the title page
#    2. load the title page and extract the fields described above, and then
#    3. add those fields to ever-growing lists that can become the columns of a DataFrame
#    4. create a dictionary where the keys are column names and the values are the lists you created in (3);
#       then say pd.DataFrame(your_dictionary) to create a DataFrame.


### Another form of data cleaning.

We've learned to import files as pandas DataFrames. In the real world, however, files are not created by pandas, but by primates who do not always structure them neatly as DataFrames.

Take a look, for instance, at the "genres.list.gz" dataset available from the [IMDb static download page](http://www.imdb.com/interfaces). This is one of the static plain text files I described above as an easier option than web scraping.

It *is* easier. But. Um. It's not really a table.

Suppose we want to chart the rise and fall of genres (as defined by IMDb) over historical time, from 1930 to the present. The y axis will be the number of films (or TV movies, whatever) in a genre. The x axis will be the date.

We can get the data we need from this file. Each film is listed multiple times, once for each genre that has been assigned to it. The tricky part is "date." Instead of breaking that out as a separate column, the authors have left it attached to the title (sometimes between two different versions of the title!).

Let's tackle this. We're going to need to step through the file line by line, ignoring all lines until we reach the list of titles. Then, for each line, we need to

a. break it in two parts at the tab

b. somehow extract a date from the title (help us, [Regex 101](https://regex101.com)!)

c. keep a running count of films per year *for each genre*. (This may call for a dictionary where the keys are genres, and the values are Counters or lists).

d. produce a pandas dataframe where each genre is a column and each row is a year

e. finally, make a visualization for a couple of genres.
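Steps a through c can be tried out on a handful of made-up lines before opening the real file. This is only a sketch: the sample lines assume a "title (year), tabs, genre" layout, and the titles are invented.

```python
import re
from collections import Counter, defaultdict

# Made-up lines in the assumed "title (year)<tabs>genre" layout
sample_lines = [
    "Example Film (1968)\t\t\tSci-Fi",
    "Example Film (1968)\t\t\tAdventure",
    "Another Film (1975) (TV)\t\tThriller",
    "Third Film (1975)\t\t\tThriller",
    "Undated Film (????)\t\t\tDrama",
]

year_pattern = re.compile(r"\((\d{4})")   # '(' followed by four digits
genre_counts = defaultdict(Counter)        # genre -> Counter of year -> films

for line in sample_lines:
    fields = line.rstrip("\n").split("\t")
    title, genre = fields[0], fields[-1]
    match = year_pattern.search(title)
    if match is None:
        continue  # no usable year on this line
    year = int(match.group(1))
    if year >= 1930:
        genre_counts[genre][year] += 1

for genre, counts in genre_counts.items():
    print(genre, dict(counts))
```

For step d, passing `dict(genre_counts)` to `pd.DataFrame` should give one column per genre and one row per year, with empty cells where a genre has no films in a year; filling those with zero before plotting is probably what you want.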

In [174]:
## Have at it!

import re
from collections import Counter

text = 'The film "2001: A Space Odyssey" (1968)'
match = re.search(r'\([0-9]{4}\)', text)
stringdate = match.group(0)
date = stringdate[1:5]
date

date_int = int(date)
date_int

action_list = Counter()
genre_dict = {}

still_intro = True   # we start out in the file's introductory text
stop_ctr = 0         # counts dated lines seen, so this test run can stop early

with open('../data/genres.list', encoding ='latin-1') as f:
    for line in f:
        if still_intro:
            if line.startswith('"!Next?"'):
                still_intro = False
            else:
                continue
        
        fields = line.split('\t')
        if len(fields[0]) < 1:
            continue
            # that line was badly formed for some reason
        else:
            print(fields[0])
            match = re.search(r'\([0-9]{4}\)', fields[0])
            
            if match is None:
                # maybe there's no date in this line?
                intdate = -1
            else:
                # strip the parentheses and convert the year to an integer
                intdate = int(match.group(0)[1:5])
            
            if intdate > 1930:
                print(intdate)
                stop_ctr += 1
            
            if stop_ctr > 10:
                break
                # because this is just a test
                # otherwise we would keep going
            
            # Here is where you would get the genre from
            # fields[-1] (after stripping the newline) and test
            # to see whether it's already in your dict of genrecounters.

# can create a data frame from a dictionary where the values for the keys are lists,
#  but all lists have to be the same length!

# Example: {'action': action_list, 'romance': romance_list}

In [None]:
test_genre = 'apple'

### Fuzzy matching, deduplication, etc.

On the syllabus, I describe an absurdly ambitious list of things to cover today. We can't really practice them all. But just for future reference: fuzzy matching is what you need to do, in the real world, when you're working with data created by primates who might create entries for

    Shakespeare, William
    Shakspear, William
    Shakespeare, W.
    Shakespeare, W
    Shakespeare, W (1564-1616)

and you need to recognize those as the same person. How can you do that? Well, we're not going to have time to cover it completely, but here's a useful lead that can take you where you need to go. Watch this:

In [None]:
from difflib import SequenceMatcher

match = SequenceMatcher(None, 'Shakspear', 'Shakespeare')
print("Badly spelled: ", match.ratio())
match2 = SequenceMatcher(None, 'Shakespeare, W', 'Shakespeare, W.')
print("Missing a period: ", match2.ratio())
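Building on that lead, here is a minimal sketch of how similarity scores might be used to map messy entries onto a list of known names. It uses `difflib.get_close_matches`, which ranks candidates by the same ratio; the helper `best_match`, the list of canonical names, and the cutoff of 0.75 are all assumptions chosen for illustration, and a real dataset would need the cutoff tuned by hand.

```python
from difflib import get_close_matches

canonical_names = ["Shakespeare, William", "Marlowe, Christopher"]

def best_match(name, choices, cutoff=0.75):
    """Return the closest entry in choices, or None if nothing clears the cutoff."""
    matches = get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for entry in ["Shakspear, William", "Shakespeare, W.", "Jonson, Ben"]:
    print(entry, "->", best_match(entry, canonical_names))
```

Entries that match nothing come back as `None`, which is a useful signal that a human should take a look.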



I'll also try to say a few words about git and rsync.
