In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

# Scraping

Today we'll talk about "scraping": how to get unstructured data and turn it into something usable. We'll primarily focus on _web scraping_. Python has mature tools that make this pretty easy.

The basic workflow is:

1. Find the data you want on the web.
2. Inspect the webpage and figure out how to select the content you want. This usually involves some combination of
    - Viewing the source code of the page (especially if it is simple), and
    - Figuring out the structure of the HTML parse tree.  This step is much easier with something like __Chrome Developer Tools__.
3.  Write code to get out what you want:
    - If the page is very simple, treat it as a bunch of text => __string manipulation / [regular expressions](https://docs.python.org/2/howto/regex.html)__ in Python.
    - If the page is more complicated (and/or written in good style), we want to use the HTML parse tree => __[BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) / [lxml](http://lxml.de/lxmlhtml.html)__ in Python.
4.  Make sure it worked!
5.  If your crawling problem is at all non-trivial, you will now have to go back to Step 2 to zoom in further -- or you'll have parsed the URL of a link you want to follow, in which case you'll go back to Step 1 to figure out how to parse what you want from the new target page.

As an example, suppose we want to crawl the list of "Available Technologies" being licensed by MIT at http://technology.mit.edu and store their basic info, their associated patents, and the reference counts on their associated patents.

## Understanding URLs

Let's try to find the correct URL to use.

- _First try_:  Aha, a list of links on the right.  Let's click on a few -- what do we see?  Many are empty, the categories are not obviously mutually exclusive, okay.  Maybe there's a better way.
- _Second try_: Let's just search for all technologies at http://technology.mit.edu/technologies.  Okay, better but it only gives us 50 at a time.  We could just combine the four pages, that's fine.  Let's just click on page 2 to see what happens
- _Third try_: Aha, the URL for page 2 is http://technology.mit.edu/technologies?limit=50&offset=50&query=.  That looks like we can just specify a higher limit and offset 0 and get the whole thing.
- _Final answer_: Indeed, http://technology.mit.edu/technologies?limit=1000 has a giant list.

In [2]:
import requests

page_nums= list(range(2,27))#Get pages from 2,26
url = "http://www.newyorksocialdiary.com/party-pictures?"
responses = [requests.get(url, params={"page": num}) for num in page_nums]
print responses[0].url
print responses[0].text[:1000] + "..."
print len(responses)

http://www.newyorksocialdiary.com/party-pictures?page=2
<!DOCTYPE html>
  <!--[if IEMobile 7]><html class="no-js ie iem7" lang="en" dir="ltr"><![endif]-->
  <!--[if lte IE 6]><html class="no-js ie lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"><![endif]-->
  <!--[if (IE 7)&(!IEMobile)]><html class="no-js ie lt-ie9 lt-ie8" lang="en" dir="ltr"><![endif]-->
  <!--[if IE 8]><html class="no-js ie lt-ie9" lang="en" dir="ltr"><![endif]-->
  <!--[if (gte IE 9)|(gt IEMobile 7)]><html class="no-js ie" lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><![endif]-->
  <!--[if !IE]><!--><html class="no-js" lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: 

## HTML and the DOM

To get started:

- Pull up http://technology.mit.edu/technologies?limit=1000 in Chrome.  
- Open __View->Developer->Developer Tools__.  
- Right click on one of the technology titles, and choose __"Inspect Element"__.

What are we looking at?  Well... this is the structure of the webpage.  Nested _tags_ of different _types_ and having a variety of _attributes_.

What we learned above:

  - All of the technologies are underneath ("_descendents of_")   `<div class="search" id="nouvant-portfolio-content">`
  - In fact, each of them is in its own `<div class="technology" data-images="true" id="technology_XXXX">`
  
Now we're ready to move on:

## Parsing HTML
Now, we need to parse the raw HTML and actually grab the links of detailed info. The two main parser libraries in Python are `BeautifulSoup` and `lxml`. `lxml` is much faster (it leverages several C libraries), but it's also worse at dealing with malformed, crummy HTML. Because parsing speed isn't our bottleneck here, we'll use `BeautifulSoup`.

In [3]:
from bs4 import BeautifulSoup
soups = [BeautifulSoup(response.text) for response in responses]
print soups[24].head()

[<title>Party Pictures Archive | New York Social Diary</title>, <meta charset="utf-8"/>, <link href="http://www.newyorksocialdiary.com/sites/all/themes/omega_nysd/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>, <link href="http://www.newyorksocialdiary.com/sites/all/themes/omega_nysd/apple-touch-icon-precomposed-114x114.png" rel="apple-touch-icon-precomposed" sizes="114x114"/>, <link href="http://www.w3.org/1999/xhtml/vocab" rel="profile"/>, <link href="http://www.newyorksocialdiary.com/sites/all/themes/omega_nysd/apple-touch-icon-precomposed.png" rel="apple-touch-icon-precomposed"/>, <meta content="width" name="MobileOptimized"/>, <meta content="on" http-equiv="cleartype"/>, <link href="http://www.newyorksocialdiary.com/sites/all/themes/omega_nysd/apple-touch-icon-precomposed-144x144.png" rel="apple-touch-icon-precomposed" sizes="144x144"/>, <meta content="true" name="HandheldFriendly"/>, <link href="http://www.newyorksocialdiary.com/sites/all/themes/omega_nysd/app

In [11]:
# get page content, and then select all div tags that begin with "views-row"
# http://stackoverflow.com/questions/27070690/does-beautifulsoup-select-method-support-use-of-regex
contents = [soup.select('div.view-content') for soup in soups]  ### All content on each of 25 pages


print len(contents)
for i in range(len(contents)):
    print i, len(contents[i]) 
   

25
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 0
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
20 1
21 1
22 1
23 1
24 1


In [12]:
soups[10]

<html>
<head>
<title>Page Unavailable</title>
<style>
    body { background: #303030; text-align: center; color: white; }
    #page { border: 1px solid #CCC; width: 500px; margin: 100px auto 0; padding: 30px; background: #323232; }
    a, a:link, a:visited { color: #CCC; }
    .error { color: #222; }
  </style>
</head>
<body onload="setTimeout(function() { window.location = '/' }, 5000)">
<div id="page">
<h1 class="title">Page Unavailable</h1>
<p>The page you requested is temporarily unavailable.</p>
<p>We're redirecting you to the <a href="/">homepage</a> in 5 seconds.</p>
<div class="error">(Error 503 Service Unavailable)</div>
</div>
</body>
</html>

In [13]:
events = [content[0].select('div[class^=views-row]') for content in contents if content]

print len(events)
#print section

24


In [15]:
import time
import datetime
##print events[24]

links = []
dates = []

for event in events:
    for i in range(len(event)):
        links.append(event[i].select('a')[0]['href'])
        dates.append(event[i].select('span[class^=field-content]')[1].text)
##print re.findall('content">(.*?)<',str(events[0][1]))[0]


dates = [time.strptime(str(date), "%A, %B %d, %Y") for date in dates] # date`

base_url = 'http://www.newyorksocialdiary.com'


cutt_off = time.strptime('12-1-2014', "%m-%d-%Y")  
print cutt_off
search = base_url+ str(links[0])
print search
print len(links)
print len(dates)

time.struct_time(tm_year=2014, tm_mon=12, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=335, tm_isdst=-1)
http://www.newyorksocialdiary.com/party-pictures/2015/philanthropic-endeavors
1167
1167


In [16]:
links_set = []   ### Set of links with date prior to cutoff date
for d,l in zip(dates,links):  
    if d < cutt_off:
        link = base_url + l
        links_set.append(link)

In [69]:
print len(links_set)
print links_set[1141]


1142
http://www.newyorksocialdiary.com/party-pictures/2007/orchids-growing-wild


In [18]:
responses = [requests.get(link) for link in links_set]

In [20]:
picture_content_pages = [BeautifulSoup(response.text) for response in responses]
print picture_content_pages[0].prettify()

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="no-js ie iem7" lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="no-js ie lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="no-js ie lt-ie9 lt-ie8" lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="no-js ie lt-ie9" lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><html class="no-js ie" lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><![endif]-->
<!--[if !IE]><!-->
<html class="no-js" dir="ltr" lang="en" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#">
 <!--<![endif]-->
 <head>
  <title>
   

In [21]:
len(picture_content_pages)

1142

In [22]:
picture_content_pages[1].select('div[class=photocaption]')

[<div align="center" class="photocaption">Jon Batiste and Marcus Miller at The NAACP Legal Defense and Educational Fund's  28th annual National Equal Justice Award Dinner.</div>,
 <div align="center" class="photocaption">Sherrilyn Ifill and Geoffrey Canada</div>,
 <div align="center" class="photocaption">Sherrilyn Ifill and Debra Lee</div>,
 <div align="center" class="photocaption">Deborah Roberts, Bernard Tyson, and Gerald Adolph</div>,
 <div align="center" class="photocaption">Yvonne and Geoffrey Canada</div>,
 <div align="center" class="photocaption">Denise Tyson and Judith Byrd</div>,
 <div align="center" class="photocaption">Philip Wells, Tonya Lewis Lee, Ted Wells, and Byron Pitts</div>,
 <div align="center" class="photocaption">Angela Vallot and Deborah Roberts</div>,
 <div align="center" class="photocaption">Star Jones and Frank Ahimaz</div>,
 <div align="center" class="photocaption">Debra Lee with Denise and Bernard Tyson</div>,
 <div align="center" class="photocaption">Debra 

In [62]:
##(re.findall('sans-serif">(.*?)<',str(picture_content_pages[1141]) ))
##(face="Verdana, Arial, Helvetica, sans-serif")
 
##picture_content_pages[1141].find_all(face=re.compile('^sans-serif">'))
picture_content_pages[1141].select('font[face= ˆVerdana, Arial, Helvetica, sans-serif]')
###find_all(href=re.compile('^http://www.newyorksocialdiary.com/partypictures/\d+.jpg'))

ValueError: Unsupported or invalid CSS selector: "font[face="

In [52]:
picture_content_pages[1141]

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="no-js ie iem7" lang="en" dir="ltr"><![endif]--><!--[if lte IE 6]><html class="no-js ie lt-ie9 lt-ie8 lt-ie7" lang="en" dir="ltr"><![endif]--><!--[if (IE 7)&(!IEMobile)]><html class="no-js ie lt-ie9 lt-ie8" lang="en" dir="ltr"><![endif]--><!--[if IE 8]><html class="no-js ie lt-ie9" lang="en" dir="ltr"><![endif]--><!--[if (gte IE 9)|(gt IEMobile 7)]><html class="no-js ie" lang="en" dir="ltr" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><![endif]--><!--[if !IE]><!--><html class="no-js" dir="ltr" lang="en" prefix="fb: http://ogp.me/ns/fb# og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product#"><!--<![endif]-->
<head>
<title>Orchids growing

In [75]:
import re
re.findall('sans-serif">(.*?)<', picture_content_pages[1141].text)

picture_content_pages[1141].select('td > div > font')
##re.findall('sans-serif">(.*?)<',test)


[<font face="Verdana, Arial, Helvetica, sans-serif" size="1">Oscar
                     Mora for Valentino</font>,
 <font face="Verdana, Arial, Helvetica, sans-serif" size="1">Eric
                     Cohler Design &amp; L'Olivier </font>,
 <font face="Verdana, Arial, Helvetica, sans-serif" size="1">Thomas
                     M. Burak Interiors, Ltd.</font>,
 <font face="Verdana, Arial, Helvetica, sans-serif" size="1">Whitney
                     and James Fairchild</font>,
 <font face="Verdana, Arial, Helvetica, sans-serif" size="1">Dominique
                     Browning, Robert Rufino, and Naz Tesfit</font>,
 <font face="Verdana, Arial, Helvetica, sans-serif" size="1">Gretchen
                     and Gene Grisanti</font>,
 <font face="Verdana, Arial, Helvetica, sans-serif" size="1">Philip
                     and Lisa Gorrivan</font>,
 <font face="Verdana, Arial, Helvetica, sans-serif" size="1">Elizabeth
                     Stribling and Guy Robinson</font>,
 <font face="Verdana

In [79]:
i=0
sum_captions = 0
full_list_captions = []
for page in picture_content_pages:
    captions1 = page.select('div[class=photocaption]')
    captions2 = page.select('td[class=photocaption]')
    captions3 = page.select('td > div > font')
    captions = captions1+ captions2+captions3
    full_list_captions.append(captions)
    print i, len(captions1), len(captions2),len(captions3), len(captions)
    sum_captions+= len(captions)
    i+=1
print sum_captions 

0 73 1 0 74
1 82 1 0 83
2 60 4 0 64
3 66 1 0 67
4 107 1 0 108
5 52 1 0 53
6 106 1 0 107
7 81 1 0 82
8 125 1 0 126
9 94 1 0 95
10 103 5 0 108
11 114 4 0 118
12 97 1 0 98
13 64 10 0 74
14 100 14 0 114
15 74 1 0 75
16 71 5 0 76
17 72 8 0 80
18 118 1 1 120
19 59 5 0 64
20 98 1 0 99
21 84 1 0 85
22 52 1 0 53
23 81 1 0 82
24 98 1 0 99
25 80 1 0 81
26 89 6 0 95
27 88 10 0 98
28 50 1 0 51
29 116 1 0 117
30 131 1 0 132
31 122 5 0 127
32 0 0 0 0
33 89 1 0 90
34 59 1 0 60
35 53 5 0 58
36 102 1 0 103
37 90 1 0 91
38 79 1 0 80
39 109 2 0 111
40 81 1 0 82
41 0 0 0 0
42 106 4 0 110
43 56 1 0 57
44 114 1 0 115
45 68 1 0 69
46 67 1 0 68
47 122 1 0 123
48 93 1 0 94
49 103 5 0 108
50 94 92 0 186
51 111 1 0 112
52 95 1 0 96
53 114 1 0 115
54 153 1 0 154
55 58 1 0 59
56 96 1 0 97
57 85 1 0 86
58 1 208 0 209
59 96 2 0 98
60 90 1 0 91
61 99 1 0 100
62 101 1 0 102
63 91 1 0 92
64 87 1 0 88
65 59 1 0 60
66 108 1 0 109
67 122 1 0 123
68 96 1 0 97
69 81 1 0 82
70 109 1 0 110
71 97 1 0 98
72 85 1 0 86
73 64 41 0 

In [90]:
count = 0
caption_texts = []
for page in full_list_captions:
    for caption in page:
        if caption.text:
            caption_texts.append(caption.text)
            count+=1
            
print count

104852


In [91]:
len(caption_texts)

104852

In [92]:
from collections import Counter
def eliminate_long_text(captions):
    new_list = []
    for caption in captions:
        if sum(Counter(caption).values()) < 250:
            new_list.append(caption)
    return new_list
    

In [372]:
caption_texts = eliminate_long_text(caption_texts)
print len(caption_texts)
print caption_texts[104669]

104670
Charlotte
                    Frieze, Eric and Patti Fast, Gregory Long, Julie Graham,
                Robin Graham, Fernanda Kellogg, Dominique Browning, and Joan
                Khoury


In [308]:
def eliminate_prefix(caption):
    prefix=['Mayor','Mr','Dr','Miss','Mrs','Ofc','Governor','Gov',
        'Professor','Prof','Hon','President','Pres','Captain',
        'Cap','Major','Maj','Sir','Senator','Congressman',
        'New York','Dean','PhD','Committee','Hospital','Dean','Dinner','OSL',
        'Benefit','Board','Art','NYC','Friend','Ambassador','Consul','School','MD',
        'GALA','CEO','Museum','Gala','friend','friends','Jr.','M.D.', 'friend', 'friends', '\n']
        
    for pre in prefix:
        caption=caption.replace(pre,' ')
    return caption

In [299]:
import re
def name_item_list(caption):
    words = re.split('(\W+)',caption)
    return words

In [465]:
example = name_item_list(caption_texts[104669])
print example

[u'Charlotte', u'\n                    ', u'Frieze', u', ', u'Eric', u' ', u'and', u' ', u'Patti', u' ', u'Fast', u', ', u'Gregory', u' ', u'Long', u', ', u'Julie', u' ', u'Graham', u',\n                ', u'Robin', u' ', u'Graham', u', ', u'Fernanda', u' ', u'Kellogg', u', ', u'Dominique', u' ', u'Browning', u', ', u'and', u' ', u'Joan', u'\n                ', u'Khoury']


In [99]:
def extract_names(caption):
    new_name_list = []
    new_name = str('')
    for name in caption:
        if ',' in name:
            new_name_list.append(new_name)
            new_name = str('')
        elif len(name) > 2:
            try:
                new_name += ' '+ str(name) 
            except ValueError:
                continue
    new_name_list.append(new_name)
    return new_name_list

In [466]:
example = extract_names(example)
print example

[' Charlotte \n                     Frieze', ' Eric and Patti Fast', ' Gregory Long', ' Julie Graham', ' Robin Graham', ' Fernanda Kellogg', ' Dominique Browning', ' and Joan \n                 Khoury']


In [464]:
def clean_and(caption):
    new_name_list= []
    for name in caption:
        if 'and' in name:
            if len(name.split(' ')) > 4:
                new_name = re.sub(r'\sAND\s', ', ', name , flags=re.IGNORECASE)
                new_name_list.append(new_name)
            elif len(name.split(' ')) > 3:
                if name.split(' ')[1] == 'and':
                    name1 = str(name.split(' ')[0]) +' '+ str(name.split(' ')[3])
                    name2 = str(name.split(' ')[2]) +' '+ str(name.split(' ')[3])
                    new_name_list.append(name1)
                    new_name_list.append(name2)
                else:
                    new_name = re.sub(r'\sAND\s', '', name , flags=re.IGNORECASE)
                    new_name_list.append(new_name) 
            elif len(name.split(' ')) > 2:
                    new_name = re.sub(r'AND\s', '', name , flags=re.IGNORECASE)
                    new_name_list.append(new_name) 
            else:
                new_name_list.append(name) 
           
        else:
            new_name_list.append(name)

    return new_name_list


In [467]:
example = clean_spaces(example)
print example

['Charlotte Frieze', 'Eric and Patti Fast', 'Gregory Long', 'Julie Graham', 'Robin Graham', 'Fernanda Kellogg', 'Dominique Browning', 'and Joan Khoury']


In [468]:
example = clean_and(example)
print example

['Charlotte Frieze', 'Eric Fast', 'Patti Fast', 'Gregory Long', 'Julie Graham', 'Robin Graham', 'Fernanda Kellogg', 'Dominique Browning', 'Joan Khoury']


In [469]:
def clean_spaces(caption):
    new_name_list = []
    for name in caption:
        new_name=" ".join(name.split())
        new_name_list.append(new_name)
    return new_name_list



In [470]:
def process_pictures(captions):
    new_list = []
    for caption in captions:
        caption = eliminate_prefix(caption)
        caption = name_item_list(caption)
        caption = extract_names(caption)
        caption = clean_spaces(caption)
        caption = clean_and(caption)
        new_list.append(caption)
    return new_list



In [471]:
pictures_processed = process_pictures(caption_texts)
print pictures_processed[335]
print pictures_processed[335][0]
print len(pictures_processed[335][0].split(' '))

['Michelle Fizer Peterson', 'Charles Grodin', 'Julio Peterson']
Michelle Fizer Peterson
3


In [472]:
len(pictures_processed)

104670

In [473]:
names = []
for picture in pictures_processed:
    for name in picture:
        names.append(name)
names = sorted(names)

In [474]:
len(names)

191471

In [475]:
import networkx as nx

G2=nx.Graph()



In [476]:
count = 0
for picture in pictures_processed:
    count+=1
    for name1 in picture:
        for name2 in picture:
            if len(name1.split(' '))>1 and len(name1.split(' '))>1:
                if G2.has_edge(name1,name2):
                    G2.edge[name1][name2]['weight'] += 1
                else:
                    G2.add_edge(name1, name2, weight=1 )
            
print count



104670


In [477]:

degrees = [v[0] for v in sorted(nx.degree(G2).iteritems(), key=lambda(k, v): (-v, k))]


In [478]:
nx.degree(G2)['Jean Shafiroff']

341

In [288]:

degree_list = []
for i in range(200):
    name = degrees[i]
    count = nx.degree(G2)[name]
    if len(name.split(' ')) > 1:
        degree_list.append((name,count))






In [158]:
degree_list [:100]

[('and and', 561),
 ('Jean Shafiroff', 494),
 ('Mark Gilbertson', 406),
 ('Gillian Miniter', 296),
 ('Geoffrey Bradfield', 262),
 ('Alexandra Lebenthal', 248),
 ('Michael Bloomberg', 241),
 ('Somers Farkas', 232),
 ('Debbie Bancroft', 220),
 ('Andrew Saffir', 218),
 ('Yaz Hernandez', 218),
 ('Mario Buatta', 216),
 ('Sharon Bush', 216),
 ('Alina Cho', 214),
 ('Kamie Lightburn', 211),
 ('Patrick McMullan', 199),
 ('Eleanora Kennedy', 190),
 ('Lucia Hwong Gordon', 190),
 ('Bettina Zilkha', 180),
 ('Allison Aston', 176),
 ('Jamee Gregory', 168),
 ('Muffie Potter Aston', 167),
 ('Martha Stewart', 163),
 ('Barbara Tober', 162),
 ('Diana Taylor', 160),
 ('Deborah Norville', 158),
 ('Margo Langenberg', 156),
 ('Michele Herbert', 153),
 ('Leonard Lauder', 152),
 ('Lydia Fenet', 152),
 ('Audrey Gruss', 151),
 ('Kipton Cronkite', 150),
 ('Amy Fine Collins', 148),
 ('Paula Zahn', 148),
 ('Karen Klopp', 146),
 ('Nicole Miller', 146),
 ('Evelyn Lauder', 145),
 ('Karen LeFrak', 144),
 ('Liliana Caven

In [775]:
pagerank = nx.pagerank(G2, alpha=0.85, personalization=None, max_iter=100, tol=1e-06, nstart=None, weight=2 , dangling=None)

In [776]:
pagerank_list = [v[0] for v in sorted(nx.degree(G2).iteritems(), key=lambda(k, v): (-v, k))]

In [777]:
pagerank_results = []
for i in range(200):
    name = pagerank_list[i]
    count = pagerank[name]
    if len(name.split(' ')) > 1:
        pagerank_results.append((name,count))


In [778]:
pagerank_results[:100]

[('Jean Shafiroff', 0.0003712512044332871),
 ('Mark Gilbertson', 0.0003441654523210229),
 ('Alexandra Lebenthal', 0.0002271074358133392),
 ('Geoffrey Bradfield', 0.0002175076842215615),
 ('Gillian Miniter', 0.00020139403925847316),
 ('Alina Cho', 0.00018710992223556824),
 ('Kamie Lightburn', 0.00017839180813006812),
 ('Mario Buatta', 0.00020547578879683264),
 ('Somers Farkas', 0.00017402273615527936),
 ('Patrick McMullan', 0.00019115669252859775),
 ('Debbie Bancroft', 0.00016832551300976486),
 ('Andrew Saffir', 0.00020468409313214428),
 ('Sharon Bush', 0.00017874799855173976),
 ('Allison Aston', 0.00013613155081609383),
 ('Eleanora Kennedy', 0.00013585071376869567),
 ('Yaz Hernandez', 0.0001537304802936618),
 ('Lydia Fenet', 0.00015562076412397804),
 ('Bettina Zilkha', 0.00013717764746293745),
 ('Lucia Hwong Gordon', 0.00013930448629402776),
 ('Martha Stewart', 0.00015387522603186563),
 ('Karen LeFrak', 0.0001451500820920034),
 ('Liz Peek', 0.0001404986953173415),
 ('Kipton Cronkite', 

In [479]:
G2.edges()[5001]

('Janine Luke', 'Janine Luke')

In [480]:
len(G2.edges())

261298

In [481]:
weights = G2.edges(data='weight')

In [482]:
weights[:10]

[('', 'Ellen Corwin, Steven Corwin', 1),
 ('', 'and Lucy ummond the 67th Bal des Berceaux', 1),
 ('', 'WFUV station manager Ralph Jennings', 1),
 ('', 'Leslie Balthazar, her son', 1),
 ('', 'Kellie McLaughlin, Paul Colarusso From Aperture', 1),
 ('', 'Kent Clark', 1),
 ('', 'Carlisle Campbell', 1),
 ('', 'Patricia Patterson, Henry Darlington', 1),
 ('', 'Herb Pardes', 1),
 ('', 'Sharon Bush', 1)]

In [483]:
sorted_weights = sorted(weights, key=lambda x : x[2], reverse=True)

In [484]:
sorted_weights[:100]

[('Click here for NYSD Contents', 'Click here for NYSD Contents', 1545),
 ('Photographs PatrickMcMullan com', 'Photographs PatrickMcMullan com', 320),
 ('Jean Shafiroff', 'Jean Shafiroff', 262),
 ('Gillian Miniter', 'Gillian Miniter', 250),
 ('Gillian Miniter', 'Sylvester Miniter', 210),
 ('Mark Gilbertson', 'Mark Gilbertson', 155),
 ('Jamee Gregory', 'Jamee Gregory', 144),
 ('Debbie Bancroft', 'Debbie Bancroft', 135),
 ('Alexandra Lebenthal', 'Alexandra Lebenthal', 134),
 ('Somers Farkas', 'Somers Farkas', 134),
 ('Sylvester Miniter', 'Sylvester Miniter', 116),
 ('Peter Gregory', 'Jamee Gregory', 116),
 ('Yaz Hernandez', 'Yaz Hernandez', 112),
 ('Geoffrey Bradfield', 'Geoffrey Bradfield', 107),
 ('Bonnie Comley', 'Stewart Lane', 98),
 ('Barbara Tober', 'Barbara Tober', 95),
 ('Bettina Zilkha', 'Bettina Zilkha', 94),
 ('Eleanora Kennedy', 'Eleanora Kennedy', 94),
 ('Alina Cho', 'Alina Cho', 93),
 ('Donald Tober', 'Barbara Tober', 92),
 ('Somers Farkas', 'Jonathan Farkas', 92),
 ('Sharo

In [485]:
best_friends_results = []
index = 0
for item in sorted_weights:
    if index <100:
        if item[0] != item[1] and len(item[0].split(' '))>1 and len(item[1].split(' '))>1:
            best_friends_results.append(((item[0],item[1]),item[2]/2))
            index +=1


In [486]:
best_friends_results

[(('Gillian Miniter', 'Sylvester Miniter'), 105),
 (('Peter Gregory', 'Jamee Gregory'), 58),
 (('Bonnie Comley', 'Stewart Lane'), 49),
 (('Donald Tober', 'Barbara Tober'), 46),
 (('Somers Farkas', 'Jonathan Farkas'), 46),
 (('Campion Platt', 'Tatiana Platt'), 41),
 (('Geoffrey Bradfield', 'Roric Tobin'), 41),
 (('Jean Shafiroff', 'Martin Shafiroff'), 41),
 (('Margo Catsimatidis', 'John Catsimatidis'), 41),
 (('Peter Regna', 'Barbara Regna'), 40),
 (('Yaz Hernandez', 'Valentin Hernandez'), 35),
 (('Melissa Morris', 'Chappy Morris'), 32),
 (('Michael Kennedy', 'Eleanora Kennedy'), 32),
 (('Coco Kopelman', 'Arie Kopelman'), 31),
 (('David Koch', 'Julia Koch'), 31),
 (('Grace Meigher', 'Chris Meigher'), 30),
 (('Jonathan Tisch', 'Lizzie Tisch'), 30),
 (('Andrew Saffir', 'Daniel Benedict'), 27),
 (('Michael Cominotto', 'Dennis Basso'), 22),
 (('Dan Lufkin', 'Cynthia Lufkin'), 22),
 (('Frederick Anderson', 'Douglas Hannant'), 20),
 (('Clo Cohen', 'Charles Cohen'), 19),
 (('Richard Steinberg'

In [168]:
G2.number_of_edges('Lorraine Gallard', 'Richard Levy')

1

In [506]:
len(G.edges()[5000][0].split(' '))

2

In [604]:
len(G.edges())

75632

In [569]:
import nltk

sample="Our pod people: Paul Paczuski, Alejandro Rojas, Mark Fontana and Ashkan Balouchi."
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []

for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))


# Print unique entity names
print "Text:", sample
print "List of names:", entity_names

Text: Our pod people: Paul Paczuski, Alejandro Rojas, Mark Fontana and Ashkan Balouchi.
List of names: ['Paul Paczuski', 'Alejandro Rojas', 'Mark Fontana', 'Ashkan Balouchi']


In [554]:
len(edges_count)

77264

In [558]:
edges_count[('Lorraine Gallard', 'Richard Levy')]

1

In [562]:
edge_results = [v for v in sorted(edges_count.iteritems(), key=lambda(k, v): (-v, k))]

In [565]:
edge_results[:-100]

[(('', ''), 1),
 (('', '1967 Oil canvas'), 1),
 (('', '?'), 1),
 (('', 'Agnes Tang'), 1),
 (('', 'Alan Graves'), 1),
 (('', 'Alan Patricof'), 1),
 (('', 'Aldo Papone'), 1),
 (('', 'Alex Abizaid'), 1),
 (('', 'Alexander Teague'), 1),
 (('', 'Alexandra Scott'), 1),
 (('', 'Alicia Bouzan Cordon'), 1),
 (('', 'Alma Salky'), 1),
 (('', 'Amos Kaminski'), 1),
 (('', 'Andrea Abizaid'), 1),
 (('', 'Andrea Charney'), 1),
 (('', 'Andrea Vazzana'), 1),
 (('', 'Andrew Kolodny'), 1),
 (('', 'Angela Mellon'), 1),
 (('', 'Anita Gotto'), 1),
 (('', 'Anja von Schondorf Gleicher'), 1),
 (('', 'Ann Roberts'), 1),
 (('', 'Anne Altchek'), 1),
 (('', 'Anne Joel Ehrenkranz Dean'), 1),
 (('', 'Annie Pressman'), 1),
 (('', 'Antonio Gotto'), 1),
 (('', 'Armelle Vienne'), 1),
 (('', 'Aso Tavitian'), 1),
 (('', 'Assemblyman Michael Mrs Benjamin'), 1),
 (('',
   'Award presenters Theresa Lana Eric Brinker flank Compassionate Care Award honoree Patrick Borgen'),
  1),
 (('', 'Barbara Loughlin'), 1),
 (('', 'Barry Sa

In [532]:
sorted_edge_results= sorted(edge_results, key=lambda x: x[1], reverse=True)

In [528]:
sorted_edge_results[:100]

[(('Ann Tenenbaum', 'Fiona Rudin'), 1),
 (('Eileen Berdam', 'Eileen Berdam'), 1),
 (('Harriet Levine', 'Ania Poinvil'), 1),
 (('Demarco Majors', 'Demarco Majors'), 1),
 (('Allison Derusha', 'Stephanie Williams'), 1),
 (('Cey Adams', 'Eldin Villafane'), 1),
 (('Hilary Knight signing his book', 'Hilary Knight signing his book'), 1),
 (('Robin Smith', 'Anthony Edwards'), 1),
 (('Stephanie Brag', 'Scott Blumenkranz'), 1),
 (('Jolyne Caruso FitzGerald', 'Taline Aynilian'), 1),
 (('Bill Koeningsberg', 'Mary Levine'), 1),
 (('NYRP Development Manager', 'Christina Vescovo'), 1),
 (('Melania Lonchyna', 'Melania Lonchyna'), 1),
 (('Cindy Allen', 'Cindy Allen'), 1),
 (('Beth Dannhauser', 'Mary Ellen Gilgan'), 1),
 (('Karen Mastrandrea', 'Lynn Levy'), 1),
 (('Dinorah Delfin', 'Ethan Sklar'), 1),
 (('Valentina HadlShinnecock Animal Hospital Hampton Bays', 'David Hadland'),
  1),
 (('Elle Klymer', 'Elle Klymer'), 1),
 (('Pier Guerci', 'Pier Guerci'), 1),
 (('Patricia Kennedy', 'Amanda Major Jake Mil

## CSS selectors

This pattern of nested finds, based on tag type, ID, and class, is very common. It's so common that there are two special convenience languages for such traversals: [CSS selectors](http://www.w3schools.com/cssref/css_selectors.asp) and [XPath](http://www.w3schools.com/xsl/xpath_intro.asp) (which works for all XML, not just HTML). We'll be using CSS selectors, which are more common for HTML and easier to learn.

With CSS selectors, we can write the above in a more concise and expressive way:

```python
tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')
```

All selectors work like 'find_all'.  Some basic building examples of selectors are:

 - `'mytag'` picks out all tags of type `mytag`.
 - `'#myid'` picks out all tags whose _id_ is equal to `myid`
 - `'.myclass'` picks out all tags whose _class_ is equal to `myclass`
 - `'mytag#myid'` will pick all tags of type `mytag` **and** `id` equal to `myid` (analogously for `'mytag.myclass'`)
 - If `'selector1'` and `'selector2'` are two selectors, then there is another selector `'selector1 selector2'`.  It picks out all tags satisfying `selector2` that are __descendents__(*) of something satisfying `selector1`, i.e. it's like our nested find.
 
 (*) It doesn't have to be a _direct_ descendent.  I.e. it can be a grand-grand-...-grand-child of something satisfying `selector1`.  For direct descendents we'd instead write `'selector1 > selector2'`
 
Let's just see this in action:

In [472]:
edges['Lorraine Gallard', 'Richard Levy']

1

In [4]:
print soup.select('div#nouvant-portfolio-content')[0].prettify()[:400]
print
print soup.select('div#nouvant-portfolio-content div.technology')[0].prettify()[:400]
tech_divs = soup.select('div#nouvant-portfolio-content div.technology')
print
print len(tech_divs)

<div class="search" id="nouvant-portfolio-content">
 <div class="response">
  <em>
   1-200 of 476 technologies
  </em>
  <a href="http://technology.mit.edu/technologies.rss">
   RSS
  </a>
  <a href="http://technology.mit.edu/technologies.atom">
   ATOM
  </a>
 </div>
 <div class="technology" data-images="false" id="technology_9330">
  <h2>
   <a href="/technologies/17480_superconducting-magnetic

<div class="technology" data-images="false" id="technology_9330">
 <h2>
  <a href="/technologies/17480_superconducting-magnetic-memory-cell-based-on-critical-current-suppression">
   Superconducting-Magnetic Memory Cell Based on Critical-Current Suppression
  </a>
 </h2>
 <p>
  <strong>
   17480
  </strong>
  –
  <span>
   Nano-scale
superconducting memories are a potential method for creating sma

200


Now we're ready to pull out some key pieces of info:

- The technology's "title" (the text in the `<a>` element)
- The link to follow for more info on the technology (the _href_ attribute of the `<a>`)
- And a short blurb about the text (in the `<span>`)

Let's write some code to extract this.  But before we do, let's discuss what _form_ the output should take: It is often convenient to store data in a dictionary (i.e. as a _key-value_ hashtable) - in other words, to name the bits of data you are collecting.  One big advantage is that this makes it easier to add in extra fields progressively.

Let's see what the code looks like:

In [None]:
firsta = tech_divs[0].select('a')[0]
print firsta.text
print firsta['href']

In [None]:
## 
# We're going to use a "named tuple" to store our key-value data.
# We could also have used a dictionary, with strings as keys.
# Named tuples have some advantages:
#  - Better notation with autocomplete, x.field_name instead of x['field_name']
#  - If you change your object structure later and fail to update your
#    code to include the new fields, this will make it easier to find.
#  - They are immutable, preventing certain sorts of bugs.
# ... and some disadvantages:
#  - If you want to augment object structure you need a new type
#    (or to go back and fill your code)
#  - They are immutable.
##
from collections import namedtuple
TechBasic = namedtuple('TechBasic', 'title, url, short')

def td_info(td):
    la = td.select('h2 > a')
    ls = td.select('span')
    if len(la) != 1 or len(ls) != 1:
        print "Uh oh! We did something wrong for:"
        print "\n".join(">>> " + line for line in td.prettify().split("\n"))
        return
    return TechBasic(title=la[0].text, url=la[0]['href'], short=ls[0].text)

tech_links = [td_info(td) for td in tech_divs if td_info(td) is not None]

print tech_links[0].title
print
print tech_links[0]

## Fetching subsequent pages

We'll often want to scrape subsequent pages for more detailed data.  If there are many such pages, this can be slow.

In [None]:
from urlparse import urljoin

Patent = namedtuple('Patent', 'name url')
TechDetailed = namedtuple('TechDetailed', 'tech_basic, patents')

url_base="http://technology.mit.edu/"

def get_tech_details(response, tech_basic):
    soup = BeautifulSoup(response.text)
    patents = [Patent(name=a.text, url=a["href"])
               for a in soup.select('dd.us_patent_issued a')]
    return TechDetailed(tech_basic=response.url, patents=patents)

tech_details = [get_tech_details(requests.get(urljoin(url_base, tech_basic.url)),
                                 tech_basic)
               for tech_basic in tech_links[:2]]
print tech_details

**Note:**
In the last code segment, we only did the first one.  If we try to get them all this way, it'll take a while.  Run the next cell for as long (or not) as you wish, and when you get bored use _Kernel->Interrupt_ to stop it.

The problem is that connecting to a remote server and fetching the pages takes a while. Scraping web pages is usually _IO-bound_ and not CPU-bound (that is, we spent most of our time waiting for data and not processing it). Fortunately, Python gives us lots of different ways to deal with this problem.

We'll be using [requests-futures](https://github.com/ross/requests-futures), a nice wrapper combining `requests` with `concurrent.futures`. For a faster, though harder to debug, alternative, you can look at [grequests](https://github.com/kennethreitz/grequests).

In [None]:
LIMIT = 10
urls = [urljoin(url_base, tech_basic.url) for tech_basic in tech_links]
urls[0:5]

**Solution 1:** The first solution is to run the requests serially.  This is very slow.

In [None]:
%%timeit -n1
# Slow version

tech_details = [get_tech_details(requests.get(url), tech_basic)
                for url, tech_basic in zip(urls, tech_links)[:LIMIT]]

**Solution 2:** We can use Python's [multiprocessing](https://docs.python.org/2/library/multiprocessing.html) interface, which can easily parallelize a map.  This is a very straightforward API to use.  The drawback of this is that it spins up independent processes, which have a potentially significant download time.

In [None]:
%%timeit -n1
# faster version -- using multiprocessing

from multiprocessing import Pool
p = Pool(3)
responses = p.map(requests.get, urls[:LIMIT])
tech_details = [get_tech_details(response, tech_basic)
                for response, tech_basic in zip(responses, tech_links[:LIMIT])]

**Solution 3:** For reqeusts, there is a special library called `requests_futures` which returns a placeholder object that holds a promise to return the webpage sometime later (in the "future").  This allows us to continue making other fetching requests while waiting for the first result to return.

![Synchronous vs. Asynchronous](images/async.png)

In [None]:
%%timeit -n1
# faster version using requests-futures
from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=5)
futures = [session.get(url) for url in urls[:LIMIT]]

tech_details = [get_tech_details(future.result(), tech_basic)
                for future, tech_basic in zip(futures, tech_links)[:LIMIT]]

**Exercise:**

Let's put all of that together.  Write a function 
```python
def get_tech_basics(url):
    ...
```

that returns `TechBasic` for each technology on the page.  Combine this with the pooled requests to get_tech_details to obtain a list of TechDetails.

**Fin.**
That's it, we now have a basic not-entirely-trivial example. We took some detours along the way, so let's just take a look at what our code looks like without those detours:

**Exercises:**

1. Modify "get_tech_details" to get other interesting information on the technology, like a long form description and/or the authors' names.  (You'll also want to modify TechDetailed.  Do that first and note that now the code breaks when it tries to construct a TechDetailed with the wrong number of fields.)

2. Modify "get_tech_details" to try to follow the link and to get more information on the patent -- for instance when it was filed and granted, or how many other patents reference it.  (Warning: The patent web site is much less regular than MIT's!)

## Scrapy in Python

If you are really interested in crawling, consider using `scrapy`.  [Scrapy](http://scrapy.org/) is a specialized python package for scraping websites.  In particular, it has a few features:
1. The HTML is parsed and accessed through a `response` object in a `parse` method which in turn supports `response.xpath` and `response.css` methods, allowing one to use `xpath` and `css` selectors on the response dom, respectively.
1. Data is stored in `scrapy.Item` objects (which are similar to `namedtuple`s) or as python dictionaries.
1. Scrapy is object-oriented and calls it's own `parse` method (a generator) that `yield`s values.
1. You can limit your crawls through specifying the class property `allowed_domains` and definte the starting point of your crawl using the class property `start_urls`.
1. You can also build pipelines of crawling and data extraction steps to make sure crawling code more usable.
1. Additional scraping steps (e.g. scraping entries in a directory like in the example above) can be accessed via `scrapy.Request`.
1. It has command lines arguements to allow you to interactively play with the the `response` object from a webpage (`scrapy shell`) or view a page as the library renders it, which may be different from how your browser renders it (`scrapy view`).

The following is a canonical `scrapy` example:

In [None]:
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

### More complicated example
Suppose we had picked Stanford instead of MIT.  Let's try to do the same thing (it's a bit harder to get a good listing URL, so I just downloaded one).

In [None]:
from collections import namedtuple
from urlparse import urljoin

from bs4 import BeautifulSoup, Comment
import requests

with open("small_data/Stanford-Tech-Listing.html", "r") as fin:
    soup = BeautifulSoup(fin)

In [None]:
#BeautifulSoup doesn't seem to support 'or' selectors, so:
selector = lambda x: x.has_attr("id") and x["id"].startswith("output_row")
tech_rows = soup.find_all(selector)[1:]
# Alternate -- showing how to go up and down the tree
#tech_rows = soup.find('tr', attrs={'id':'output_row_1'}).parent.findAll('tr')[1:]
print len(tech_rows)
print tech_rows[0].prettify()
print tech_rows[-1].prettify()

**Details:** Let's quickly break down that last line for two bits of Python syntax that we haven't explicitly talked about
    >    selector = lambda x: x.has_attr('id') and x['id'].startswith('output_row')
This is a "lambda expresion" -- a short, inline, unnamed function. Lambdas are pretty limited, so you should define a named function for anything complicated  

    >    tech_rows = soup.find_all(selector)[1:]      
                                            ^^^^
This is list slice notation (we already used this above with [:2]!).  In this case, we're taking all but the zero-th entry (which is a list header)

**UH OH:** 
When originally preparing this, I was using Anaconda.  The same code only showed about _254_ of the _1727_ entries -- BeautifulSoup was incorrectly parsing the file.  These sorts of things are not entirely uncommon, so sometimes it helps to double-check.

In [None]:
# Warning: This is hacky code!
TechBlurb = namedtuple('TechBlurb', 'docket techid url title')
def parse_tr(tr):
    link = tr.select("td.output_data a")[0]
    
    docket = link.text
    url = link["href"]
    techid = url.split("=")[1]
    title = tr.select("td.output_data")[2].text
    return TechBlurb(docket=docket, techid=techid, url=url, title=title)
tech_blurbs=map(parse_tr, tech_rows)

In [None]:
# And this isn't much better!
def find_comment_by_text_in(soup, comment_text):
    return soup.find(text=lambda text: isinstance(text, Comment) and comment_text in text)

TechDetailed = namedtuple('TechDetailed', 'blurb, abstract, similar')
SimilarHint = namedtuple('SimilarHint', 'techid, docket, title')
def get_tech_details(response):
    # We're doing a lot of chaining with implicit assumptions here -- 
    #   it might fail in all sorts of way, in which case we give up.
    soup = BeautifulSoup(response.text)
    contents = soup.find_all('form')[1]
    abstract = (find_comment_by_text_in(contents, 'Abstract')
        .find_next_sibling('hr')
        .find('div')
        .text)
        
    def parse_similar_tr(r):
        tds = r.find_all('td')
        if len(tds) < 3:
            return None
        return SimilarHint (
            techid = tds[0].find('a')['href'].split('=')[1], 
            docket = "S"+tds[0].text.strip(), 
            title  = tds[2].text.strip()
        )

    similar_trs = (find_comment_by_text_in(soup.find_all('form')[1], 'Similar Tech')
                      .find_next_sibling('table')
                      .find('div')
                      .find('table')
                      .find('table')
                      .find_all('tr'))
    similar = filter(None, [parse_similar_tr(tr) for tr in similar_trs])
    
    return TechDetailed(blurb=blurb, abstract=abstract, similar=similar)

In [None]:
## Since the point is to show that something goes wrong, let's not wait until the end!
# imap_unordered lets you use the results of the map as they are produced (rather than storing them)
# and with no guarantee on order.

url_base="http://techfinder.stanford.edu/"

for blurb in tech_blurbs:
    response = requests.get(urljoin(url_base, blurb.url))
    try:
        get_tech_details(response)
    except:
        print "Something went wrong!"
        break

####Remark:
When we run the above code, it tells us that [this technology](http://techfinder.stanford.edu/technology_detail.php?ID=30261) did not have a list of similar technologies.  But going to the web page shows that it does!  What went wrong?

In [None]:
url = 'http://techfinder.stanford.edu/technology_detail.php'
soup = BeautifulSoup(requests.get(url, params={"ID": 30261}).text)
contents = soup.find_all('form')[1]
print contents

If we go and look at the same part of the **raw** HTML, we find that there is no `</form>` there:

    >    <!--- Applications --->
    >    <h3>Applications</h3><br/>
    >    <ul><li>Imaging apoptosis<ul type="circle" style="margin-bottom:0in"></li><li>Research</li><li>Clinical<ul type="circle" style="margin-bottom:0in"></li><li>Monitor therapeutic efficacy in cancer patients</li><li>Anti-cancer drug selection</ul></ul></li></ul><br/>
    >    
    >    <!--- Advantages --->
    >    <h3>Advantages</h3><br/>
    >    <ul><li>High specificity for caspase-3 and -7</li><li>High sensitivity</li><li>Non-invasive</li><li>Biocompatible</li><li>Small size of probe allows:<ul type="circle" style="margin-bottom:0in"></li><li>Deep tissue penetration</li><li>More extensive biodistribution</ul></li><li>PET probes:<ul type="circle" style="margin-bottom:0in"></li><li>High tumor/muscle ratio in apoptotic tumors</li><li>High uptake value in apoptotic tumors</ul></li><li>Fluorescent probe:<ul type="circle" style="margin-bottom:0in"></li><li>Possess NIR spectral properties</ul></li><li>May help promote personalized cancer medicine</li><li>Potential for probe design strategy to be applied to other enzyme targets</li></ul><br/>

What there **is** is _mal-formed HTML_ that is bad enough to confuse BeautifulSoup.  (Note that it's not nearly bad enough to confuse a web browser however).  If you look at more examples, you will find even worse ones -- a stray `</html>` in the middle of a document is not unheard of.  

To fix this, we can pre-"tidy" the page before feeding it to BeautifulSoup using **pytidylib**.

In [None]:
from tidylib import tidy_document
url='http://techfinder.stanford.edu/technology_detail.php'

tidy_page, __ = tidy_document(requests.get(url, params={"ID": 30261}).text)
soup = BeautifulSoup(tidy_page)
contents = soup.find_all('form')[1]
print contents

### Exercises

1. Go back and modify `get_tech_details` to use this 'tidy' approach.

2. Sometimes web servers are slow and/or unreliable, and sometimes your connection is.  If we were to run the above test twice, we'd probably find that some of the failures were just due to a connection error.  We didn't notice this because the _outer_ `try` / `except` is also catching these.  So: Modify `get_tech_details` to allow up to 3 retries. <br/>Bonus points if you actually look at what exceptions `urllib` throws in those cases instead of a general catch-all mechanism.  Alternate type of bonus points if you figure out how to do it using the `retrying` package.  You can test these by throttling your internet on and off to simulate an unreliable connection.

### Exit Tickets
1. Write a regex to parse numerical furniture prices from a string of descriptive text which contains other numbers.
1. How would you design a web scraping app such that the user interface remained responsive? One that is robust to poor internet connections?
1. How would you deal with messy/malformatted HTML/XML?

### Spoilers

In [1070]:
from spacy.en import English

ImportError: cannot import name util

In [None]:
from collections import namedtuple

from bs4 import BeautifulSoup
from requests_futures.sessions import FuturesSession
import requests

# Getting the list of short 'blurbs' about the techs
TechBasic = namedtuple('TechBasic', 'title, url, short')
def get_tech_basics(url):
    url = "http://technology.mit.edu/technologies?limit=1000"
    soup = BeautifulSoup(requests.get(url).text)

    ## Get the list of tech blurbs
    tech_divs = soup.select('div#nouvant-portfolio-content  div.technology')

    ## Parse a single 'td' on the index page
    def td_info(td):
        la = td.select('h2 > a')
        ls = td.select('span')
        if len(la) != 1 or len(ls) != 1:
            print "Uh oh! We did something wrong"
            return
        return TechBasic(title=la[0].text, url=la[0]['href'], short=ls[0].text)
    
    return [td_info(td) for td in tech_divs]


# Adding in some details (just patent info, for now)
Patent = namedtuple('Patent', 'name url')
TechDetailed = namedtuple('TechDetailed', 'tech_basic, patents')

def get_tech_details(response):
    soup = BeautifulSoup(response.text)

    def patent_info(a):
        return Patent(name=a.text, url=a['href'])
    patents = [patent_info(a) for a in soup.select('dd.us_patent_issued a')]

    return TechDetailed(tech_basic=tech_basic, patents=patents)

## The main driver code:
tech_basics = get_tech_basics("http://technology.mit.edu/technologies?limit=1000")
url_base="http://technology.mit.edu/"
session = FuturesSession(max_workers=15)
futures = [session.get(url_base + tech_basic.url)
       for tech_basic in tech_basics if tech_basic is not None]
tech_details = [get_tech_details(future.result()) for future in futures]

In [None]:
print len(tech_details)
print tech_details[24]

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*