In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

# Scraping

Today we'll talk about "scraping": getting unstructured data and turning it into something usable.  The tools available through Python are mature and easy to use.  We'll focus primarily on _web scraping_, where the source data comes in HTML form.

The basic workflow is:

1.  Find the data you want on the web.
2.  Inspect the page you're dealing with, to figure out how to zoom in on the content you want.  This will involve some combination of
    - Looking at the source code of the page (especially if it is simple), and
    - Figuring out the structure of the HTML parse tree.  This step is much easier with something like __Chrome Developer Tools__.
3.  Write code to get out what you want:
    - If the page is very simple, treat it as a bunch of text => __string manipulation / regular expressions__ in Python.
    - If the page is more complicated (and/or written in good style), we want to use the HTML parse tree => __BeautifulSoup__ in Python.
4.  Make sure it worked!
5.  If your crawling problem is at all non-trivial, you will now have to go back to Step 2 to zoom in further -- or you'll have parsed the URL of a link you want to follow, in which case you'll go back to Step 1 to figure out how to parse what you want from the new target page.
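
For the simplest case in Step 3, plain string tools are often enough.  The sketch below pulls link targets and link text out of a tiny HTML fragment with a regular expression.  The fragment, its paths, and the pattern are made up for illustration and are not taken from any real page.

```python
import re

# A tiny, made-up HTML fragment standing in for a very simple page.
simple_page = """
<ul>
  <li><a href="/technologies/1">First technology</a></li>
  <li><a href="/technologies/2">Second technology</a></li>
</ul>
"""

# Capture the href value and the link text of each <a> tag.
# This only works because the markup is perfectly regular; real pages rarely are.
link_pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
links = link_pattern.findall(simple_page)

for href, text in links:
    print(href + " -> " + text)
```

As soon as attributes appear in a different order or tags span several lines, a pattern like this breaks, which is why the parse-tree approach is preferred for anything more complicated.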

**Example**

As an example, suppose we want to crawl the list of "Available Technologies" being licensed by MIT at http://technology.mit.edu and store their basic info; their associated patents; and the reference counts on their associated patents.

**Step 1**: Okay, let's go to that URL.

- _First try_:  Aha, a list of links on the right.  Let's click on a few -- what do we see?  Many are empty, the categories are not obviously mutually exclusive, okay.  Maybe there's a better way.
- _Second try_: Let's just search for all technologies at http://technology.mit.edu/technologies.  Okay, better, but it only gives us 50 at a time.  We could just combine the four pages, that's fine.  Let's just click on page 2 to see what happens.
- _Third try_: Aha, the URL for page 2 is http://technology.mit.edu/technologies?limit=50&offset=50&query=.  That looks like we can just specify a higher limit and offset 0 and get the whole thing.
- _Final answer_: Indeed, http://technology.mit.edu/technologies?limit=1000 has a giant list.
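
The URL pattern discovered above can also be built programmatically rather than by hand.  This is a minimal sketch: the parameter names `limit`, `offset`, and `query` come from the page-2 URL, while the helper name `listing_url` and the variable `base_url` are hypothetical.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

base_url = "http://technology.mit.edu/technologies"

def listing_url(limit, offset=0, query=""):
    """Build a listing URL from the three parameters seen in the page-2 link."""
    params = [("limit", limit), ("offset", offset), ("query", query)]
    return base_url + "?" + urlencode(params)

page_two = listing_url(50, offset=50)
everything = listing_url(1000)
print(page_two)
print(everything)
```

Using `urlencode` instead of string concatenation also takes care of escaping if the query ever contains spaces or other special characters.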

In [None]:
import urllib2

url = "http://technology.mit.edu/technologies?limit=1000"
raw_page = urllib2.urlopen(url).read()
print(raw_page)

**A quick introduction to HTML and the DOM**

To get started:

- Pull up http://technology.mit.edu/technologies?limit=1000 in Chrome.  
- Open __View->Developer->Developer Tools__.  
- Right click on one of the technology titles, and choose __"Inspect Element"__.

What are we looking at?  This is the structure of the webpage: nested _tags_ of different _types_, each with a variety of _attributes_.

**Step 2**: What we learned above:

  - All of the technologies are underneath ("_descendants of_")   `<div class="search" id="nouvant-portfolio-content">`
  - In fact, each of them is in its own `<div class="technology" data-images="true" id="technology_XXXX">`
  
Now we're ready to move on to **Step 3**: We'll use BeautifulSoup to leverage the above to zoom in on the individual technologies and to get links to the pages with detailed info.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(raw_page)
print soup.prettify()

In [None]:
parent_div = soup.find('div', attrs={'id': 'nouvant-portfolio-content'}) #Find (at most) *one*
tech_divs = parent_div.find_all('div', attrs={'class':'technology'})  #Find *all*
print len(tech_divs)

**Introduction to CSS selectors**

This pattern -- where you have nested finds, each given by conditions on tag type, id, and class -- is very common.  It's so common that there is a special convenience language for such traversals: [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp).

BeautifulSoup supports a form of CSS selectors, and this will let us write the above in a more concise and expressive way:
    >    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')

All selectors work like a `find_all`.  Some basic building blocks of selectors are:

 - _'mytag'_ picks out all tags of type _mytag_.
 - _'#myid'_ picks out all tags whose _id_ is equal to _myid_
 - _'.myclass'_ picks out all tags whose _class_ includes _myclass_
 - _'mytag#myid'_ will pick all tags of type _mytag_ **and** _id_ equal to _myid_ (analogously for _'mytag.myclass'_)
 - If _'selector1'_ and _'selector2'_ are two selectors, then there is another selector _'selector1 selector2'_.  It picks out all tags satisfying _selector2_ that are __descendants__(*) of something satisfying _selector1_, i.e., it's like our nested find.
 
 (*) It doesn't have to be a _direct_ descendant.  I.e., it can be a grand-grand-..-grand-child of something satisfying _selector1_.  For direct descendants (children) we'd instead write _'selector1 > selector2'_
 
Let's just explain how this applies to our example:

1.  Let's start with the first half
        >    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')
        >                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This picks out all 'div' tags with id 'nouvant-portfolio-content'.
2.  Then the second half
        >    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')
        >                                                            ^^^^^^^^^^^^^^
This picks out all 'div' tags with class 'technology'.
3.  Finally the whole thing
        >    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')
        >                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
does exactly the same as our nested find above!
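
To see the difference between the descendant selector, the child selector, and the nested find side by side, here is a small sketch on made-up markup that mimics the structure described above.  The ids, links, and the extra `wrapper` div are invented for illustration, and the parser is named explicitly (`"html.parser"`) as an assumption to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Made-up markup: two technologies inside the container (one nested deeper),
# and one outside of it.
html = """
<div id="nouvant-portfolio-content" class="search">
  <div class="technology" id="technology_1"><h2><a href="/t/1">One</a></h2></div>
  <div class="wrapper">
    <div class="technology" id="technology_2"><h2><a href="/t/2">Two</a></h2></div>
  </div>
</div>
<div class="technology" id="technology_3"><h2><a href="/t/3">Outside</a></h2></div>
"""
toy_soup = BeautifulSoup(html, "html.parser")

# Descendant selector: matches at any depth below the container.
descendants = toy_soup.select("div#nouvant-portfolio-content div.technology")
# Child selector: matches direct children of the container only.
children = toy_soup.select("div#nouvant-portfolio-content > div.technology")
# Nested find, for comparison with the descendant selector.
parent = toy_soup.find("div", attrs={"id": "nouvant-portfolio-content"})
nested = parent.find_all("div", attrs={"class": "technology"})

descendant_ids = [d["id"] for d in descendants]
child_ids = [d["id"] for d in children]
nested_ids = [d["id"] for d in nested]
print(descendant_ids)
print(child_ids)
print(nested_ids)
```

The descendant selector and the nested find agree, while the child selector skips the technology that sits one level deeper; the technology outside the container is never matched.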

In [None]:
tech_divs = soup.select('div#nouvant-portfolio-content div.technology')
print len(tech_divs)

Let's check out what we've zoomed in to.

In [None]:
print tech_divs[0].prettify()

Now we're ready to pull out some key pieces of info:

- The technology's "title" (the text in the `<a>` element)
- The link to follow for more info on the technology (the _href_ attribute of the `<a>`)
- And a short blurb about the technology (in the `<span>`)

Let's write some code to extract this.  But before we do, let's discuss what _form_ the output should take: It is often convenient to store data in _key-value_ form (e.g., as a hashtable), in other words to name the bits of data you are collecting.  One big advantage is that this makes it easier to add in extra fields progressively.

Let's see what the code looks like:

In [None]:
firsta = tech_divs[0].find('a')
print firsta.text
print firsta['href']

In [None]:
## 
# We're going to use a "named tuple" to store our key-value data.
# We could also have used a dictionary, with strings as keys.
# Named tuples have some advantages
#  - Better notation, x.field_name instead of x['field_name']
#  - If you change your object structure later and fail to update your
#    code to include the new fields, this will make it easier to find.
#  - They are immutable, preventing certain sorts of bugs.
# .. and some disadvantages:
#  - If you want to augment object structure you need a new type
#    (or to go back and update your code)
#  - They are immutable.
##
from collections import namedtuple
TechBasic = namedtuple('TechBasic', 'title, url, short')

def td_info(td):
    la = td.select('h2 > a')
    ls = td.select('span')
    if len(la)!=1 or len(ls)!=1:
        print "Uh oh! We did something wrong"
        return None
    return TechBasic (
            title = la[0].text,
            url   = la[0]['href'],
            short = ls[0].text
            )
tech_links=[td_info(td) for td in tech_divs]

print tech_links[0].url
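
The trade-offs listed in the comments of the cell above can be seen in a few lines.  This is a small sketch with made-up field values; `TechBasic` is defined exactly as in the cell, and `_asdict` and `_replace` are standard `namedtuple` methods.

```python
from collections import namedtuple

TechBasic = namedtuple('TechBasic', 'title, url, short')
example = TechBasic(title="A widget", url="/technologies/1", short="A short blurb")

# Field access by name, and conversion to a dictionary when needed.
print(example.title)
as_dict = dict(example._asdict())

# Immutability: assigning to a field raises AttributeError.
try:
    example.title = "Another widget"
    mutated = True
except AttributeError:
    mutated = False

# _replace returns a modified copy and leaves the original untouched.
renamed = example._replace(title="Another widget")

# Constructing with a missing field fails loudly with TypeError.
try:
    TechBasic(title="Incomplete", url="/technologies/2")
    incomplete_ok = True
except TypeError:
    incomplete_ok = False

print(mutated)
print(incomplete_ok)
```

The last check is the behaviour that makes it easy to find stale code after adding a field: every construction site that was not updated raises an error instead of silently producing a partial record.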

In [None]:
Patent = namedtuple('Patent', 'name url')
TechDetailed = namedtuple('TechDetailed', 'tech_basic, patents')
def get_tech_details(tech_basic):
    url_base="http://technology.mit.edu/"
    soup = BeautifulSoup( urllib2.urlopen(url_base + tech_basic.url) )
    def patent_info(a):
       return Patent ( 
                name = a.text, 
                url = a['href'] 
                )
    patents = [patent_info(a) for a in soup.select('dd.us_patent_issued a')]
    return TechDetailed ( 
            tech_basic = tech_basic, 
            patents = patents 
            )

# Fetch details for the first two technologies only (map takes a function and a list)
tech_details_sample = map(get_tech_details, tech_links[0:2])
print tech_details_sample

**Note**: 
In the last code segment, we only fetched details for the first two technologies.  If we try to get them all this way, it'll take a while.  Run the next cell for as long (or not) as you wish, and when you get bored use _Kernel->Interrupt_ to stop it.

The problem is of course that it takes a while to connect to the remote server and fetch the page.  Fortunately, though it takes a long time, it is not actually _computationally expensive_: your computer would be perfectly happy doing this for 20 pages at a time.  The **multiprocessing** package in Python makes it easy to do this kind of (easy) parallelization.
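
To see why waiting on the network parallelizes so well, here is a self-contained sketch that needs no internet connection.  It replaces the real request with a short `time.sleep`, and it uses `ThreadPool` (the thread-based class in `multiprocessing.pool`, which offers the same `map` interface) rather than the process-based `Pool` used in this notebook.  The function `fake_fetch` and the latency value are made up.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_fetch(page_number):
    """Stand-in for a network request: wait briefly, then return a result."""
    time.sleep(0.1)   # simulated network latency (made-up value)
    return page_number * 10

pages = list(range(10))

# One request after another: total time is roughly the sum of the waits.
start = time.time()
serial_results = [fake_fetch(p) for p in pages]
serial_seconds = time.time() - start

# Ten workers waiting at the same time: total time is roughly one wait.
pool = ThreadPool(10)
start = time.time()
pooled_results = pool.map(fake_fetch, pages)
pooled_seconds = time.time() - start
pool.close()
pool.join()

print("serial: %.2fs, pooled: %.2fs" % (serial_seconds, pooled_seconds))
```

`map` returns results in the same order as the inputs, so the pooled version is a drop-in replacement for the serial one.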

In [None]:
# Slow version -- when I wrote this sheet, it took about 2 minutes to complete
# Uncomment and run it to see
# import time

# start_time = time.time()
# tech_details = map(get_tech_details, tech_links)  #This takes a list
# end_time = time.time()

# print "Done!", end_time-start_time

In [None]:
# Multi-processor version -- when I wrote this sheet, it took about 8 seconds to complete
import time
from multiprocessing import Pool
workers = Pool(30)  # 30 worker processes

start_time = time.time()
tech_details = workers.map(get_tech_details, tech_links)
end_time = time.time()

print "Done!", end_time-start_time

**Exercise**:

Let's put all of that together.  Write a function 
```python
def get_tech_basics(url):
    ...
```

that returns a `TechBasic` for each technology on the page.  Combine this with the pooled requests to `get_tech_details` to obtain a list of `TechDetailed`.

**Fin.**
That's it, we now have a basic not-entirely-trivial example.  Along the way we took some detours; the consolidated code without those detours is in the **Spoilers** section at the end.

**Exercises:**

1. Modify "get_tech_details" to get other interesting information on the technology, like a long form description and/or the authors' names.  (You'll also want to modify TechDetailed.  Do that first and note that now the code breaks when it tries to construct a TechDetailed with the wrong number of fields.)

2. Modify "get_tech_details" to try to follow the link and to get more information on the patent -- for instance when it was filed and granted, or how many other patents reference it.  (Warning: The patent web site is much less regular than MIT's!)

**Coda / More complicated example**:
Suppose we had picked Stanford instead of MIT.  Let's try to do the same thing (it's a bit harder to get a good listing URL, so I just downloaded one).

In [None]:
import urllib2
from bs4 import BeautifulSoup, Comment
from collections import namedtuple
from multiprocessing import Pool

raw_page=open('../small_data/Stanford-Tech-Listing.html', 'r')
soup = BeautifulSoup(raw_page.read())
print soup.prettify()

In [None]:
#BeautifulSoup doesn't seem to support 'or' selectors, so:
tech_rows = soup.find_all(lambda x: x.has_attr('id') and x['id'].startswith('output_row'))[1:]
# Alternate -- showing how to go up and down the tree
#tech_rows = soup.find('tr', attrs={'id':'output_row_1'}).parent.findAll('tr')[1:]
print len(tech_rows)
print tech_rows[0].prettify()
print tech_rows[-1].prettify()


**Details**: Let's quickly break down the `find_all` line above, which uses two bits of Python syntax that we haven't explicitly talked about
    >    soup.find_all(lambda x: x.has_attr('id') and x['id'].startswith('output_row'))[1:]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^
                                                     (*)                               (**)
    (*) This is a "lambda expression" -- a short, inline, single-line, unnamed function.  In Python it has to be an *expression* (i.e., there's an implicit return out front) -- for anything more complicated you have to define a named function.
    (**) This is list slice notation (we already used this above with [0:2]!).  In this case, we're taking all but the zero-th entry (which is a list header)
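
Both pieces of syntax can be tried on a plain list of strings.  The row names below are made up to echo the `output_row` ids; nothing here touches the real page.

```python
rows = ["header", "output_row_1", "output_row_2", "footer_row"]

# A lambda is an unnamed function whose body is a single expression...
starts_with_output = lambda s: s.startswith("output_row")

# ...and is equivalent to this named function.
def starts_with_output_named(s):
    return s.startswith("output_row")

matching = [r for r in rows if starts_with_output(r)]
matching_named = [r for r in rows if starts_with_output_named(r)]

# Slices: [1:] drops the zero-th entry, [0:2] keeps the first two.
without_first = rows[1:]
first_two = rows[0:2]

print(matching)
print(without_first)
print(first_two)
```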

**UH OH**: 
When originally preparing this, I was using Anaconda.  The same code only showed about _254_ of the _1727_ entries -- BeautifulSoup was incorrectly parsing the file.  These sorts of things are not entirely uncommon, so sometimes it helps to double-check.

In [None]:
# Warning: This is hacky code!
TechBlurb = namedtuple('TechBlurb', 'docket techid url title')
def parse_tr(tr):
    return TechBlurb(
        docket = tr.select("td.output_data a")[0].text,
        techid = tr.select("td.output_data a")[0]['href'].split("=")[1],
        url    = tr.select("td.output_data a")[0]['href'],
        title  = tr.select("td.output_data")[2].text
        )
tech_blurbs=map(parse_tr, tech_rows)

In [None]:
import traceback

# And this isn't much better!
def find_comment_by_text_in(soup, comment_text):
    return soup.find(text=lambda text: isinstance(text, Comment) and comment_text in text)

TechDetailed = namedtuple('TechDetailed', 'blurb, abstract, similar')
SimilarHint = namedtuple('SimilarHint', 'techid, docket, title')
def get_tech_details(blurb):
    # We're doing a lot of chaining with implicit assumptions here --
    #   it might fail in all sorts of ways, in which case we give up.
    try:
        url_base="http://techfinder.stanford.edu/"
        soup = BeautifulSoup( urllib2.urlopen(url_base + blurb.url) )
        contents = soup.find_all('form')[1]
        abstract = find_comment_by_text_in(contents, 'Abstract').find_next_sibling('hr').find('div').text
        similar = None
        
        def parse_similar_tr(r):
                tds = r.find_all('td')
                if len(tds) < 3:
                    return None
                return SimilarHint (
                    techid = tds[0].find('a')['href'].split('=')[1], 
                    docket = "S"+tds[0].text.strip(), 
                    title  = tds[2].text.strip()
                )
        try:
            similar_trs = find_comment_by_text_in(soup.find_all('form')[1], 'Similar Tech').find_next_sibling('table').find('div').find('table').find('table').find_all('tr')
            similar = filter(None,[parse_similar_tr(tr) for tr in similar_trs])
        except:
            pass
        return TechDetailed (
            blurb    = blurb,
            abstract = abstract,
            similar  = similar
        )
    except:
        return TechDetailed (
            blurb   = blurb,
            abstract = None,
            similar = None
        )

In [None]:
## Since the point is to show that something goes wrong, let's not wait until the end!
# imap_unordered lets you use the results of the map as they are produced (rather than storing them)
# and with no guarantee on order.

## This takes a while, so don't actually run it.
#workers = Pool(30)  # 30 worker processes
#tech_detailed = []
#for r in workers.imap_unordered(get_tech_details, tech_blurbs):
#    if r.similar is None:
#        print "Hmm, something is wrong with ", r.blurb
#    tech_detailed.append(r)

**Remark**
When we run the above code, it tells us that [this technology](http://techfinder.stanford.edu/technology_detail.php?ID=30261) did not have a list of similar technologies.  But going to the web page shows that it does!  What went wrong?

In [None]:
url='http://techfinder.stanford.edu/technology_detail.php?ID=30261'
soup = BeautifulSoup( urllib2.urlopen(url) )
contents =soup.find_all('form')[1]
print contents

If we go and look at the same part of the **raw** HTML, we find that there is no `</form>` there:

    >    <!--- Applications --->
    >    <h3>Applications</h3><br/>
    >    <ul><li>Imaging apoptosis<ul type="circle" style="margin-bottom:0in"></li><li>Research</li><li>Clinical<ul type="circle" style="margin-bottom:0in"></li><li>Monitor therapeutic efficacy in cancer patients</li><li>Anti-cancer drug selection</ul></ul></li></ul><br/>
    >    
    >    <!--- Advantages --->
    >    <h3>Advantages</h3><br/>
    >    <ul><li>High specificity for caspase-3 and -7</li><li>High sensitivity</li><li>Non-invasive</li><li>Biocompatible</li><li>Small size of probe allows:<ul type="circle" style="margin-bottom:0in"></li><li>Deep tissue penetration</li><li>More extensive biodistribution</ul></li><li>PET probes:<ul type="circle" style="margin-bottom:0in"></li><li>High tumor/muscle ratio in apoptotic tumors</li><li>High uptake value in apoptotic tumors</ul></li><li>Fluorescent probe:<ul type="circle" style="margin-bottom:0in"></li><li>Possess NIR spectral properties</ul></li><li>May help promote personalized cancer medicine</li><li>Potential for probe design strategy to be applied to other enzyme targets</li></ul><br/>

What there **is** is _malformed HTML_ that is bad enough to confuse BeautifulSoup.  (Note, however, that it's not nearly bad enough to confuse a web browser.)  If you look at more examples, you will find even worse ones -- a stray `</html>` in the middle of a document is not unheard of.  
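
The nesting problem in the markup quoted above can be made visible without any third-party library.  The sketch below is a rough check, not a validator: it keeps a stack of open tags using the standard-library HTML parser and records every closing tag that does not match the most recently opened one.  The class name `NestingChecker` is hypothetical, and the fragment is an abbreviated version of the "Applications" list.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class NestingChecker(HTMLParser):
    """Record closing tags that do not match the innermost open tag.

    Rough sketch only: void elements such as <br> written without a
    trailing slash would need special handling in real use.
    """
    def __init__(self):
        HTMLParser.__init__(self)
        self.stack = []        # currently open tags, innermost last
        self.mismatches = []   # (closing tag seen, tag that was actually open)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            open_tag = self.stack[-1] if self.stack else None
            self.mismatches.append((tag, open_tag))

# Abbreviated from the list quoted above: </li> arrives while <ul> is still open.
fragment = ('<ul><li>Imaging apoptosis<ul type="circle"></li>'
            '<li>Research</li><li>Clinical</ul></li></ul>')

checker = NestingChecker()
checker.feed(fragment)
checker.close()

print(checker.mismatches)
print(checker.stack)
```

Note that simply counting opening and closing tags would not catch this fragment, since it has two of each `<ul>` and three of each `<li>`; it is the order of the closing tags that is wrong.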

To fix this, we can pre-"tidy" the page before feeding it to BeautifulSoup using **pytidylib**.

In [None]:
from tidylib import tidy_document
url='http://techfinder.stanford.edu/technology_detail.php?ID=30261'

tidy_page, _ = tidy_document(urllib2.urlopen(url).read())
soup = BeautifulSoup(tidy_page)
contents =soup.find_all('form')[1]
print contents

**Exercises**:

1. Go back and modify `get_tech_details` to use this 'tidy' approach.

2. Sometimes web servers are slow and/or unreliable, and sometimes your connection is.  If we were to run the above test twice, we'd probably find that some of the failures were just due to a connection error.  We didn't notice this because the _outer_ `try` / `except` is also catching these.  So: Modify `get_tech_details` to allow up to 3 retries. <br/>Bonus points if you actually look at what exceptions `urllib2` throws in those cases instead of using a general catch-all mechanism.  Alternate type of bonus points if you figure out how to do it using the `retrying` package.  You can test these by switching your internet connection on and off to simulate an unreliable connection.

# Spoilers

In [None]:
import urllib2
from bs4 import BeautifulSoup
from collections import namedtuple
from multiprocessing import Pool

# Getting the list of short 'blurbs' about the techs
TechBasic = namedtuple('TechBasic', 'title, url, short')
def get_tech_basics(url):
    # Use the url passed in by the caller (it was previously overwritten by a hard-coded value)
    soup = BeautifulSoup(urllib2.urlopen(url))

    ## Get the list of tech blurbs
    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')

    ## Parse a single 'td' on the index page
    def td_info(td):
        la = td.select('h2 > a')
        ls = td.select('span')
        if len(la)!=1 or len(ls)!=1:
            print "Uh oh! We did something wrong"
            return None
        return TechBasic (
                title = la[0].text,
                url   = la[0]['href'],
                short = ls[0].text
                )
    
    return [td_info(td) for td in tech_divs]


# Adding in some details (just patent info, for now)
Patent = namedtuple('Patent', 'name url')
TechDetailed = namedtuple('TechDetailed', 'tech_basic, patents')
def get_tech_details(tech_basic):
    url_base="http://technology.mit.edu/"
    soup = BeautifulSoup( urllib2.urlopen(url_base + tech_basic.url) )
    def patent_info(a):
       return Patent ( 
                name = a.text, 
                url = a['href'] 
                )
    patents = [patent_info(a) for a in soup.select('dd.us_patent_issued a')]
    return TechDetailed ( 
            tech_basic = tech_basic, 
            patents = patents 
            )

## The main driver code:
tech_basics = get_tech_basics("http://technology.mit.edu/technologies?limit=1000")

workers = Pool(30)  # 30 worker processes
tech_details = workers.map(get_tech_details, tech_basics)

print tech_details[73]

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*