# Webscraping - Lecture Code

## The Task:

Gather information about Illinois' elected state legislators

## The Tools:

1. [Requests](http://docs.python-requests.org/en/latest/user/quickstart/)
2. [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [6]:
# import required modules
import requests
from bs4 import BeautifulSoup

We use the `requests` library to:
1. make a GET request to the page
2. read in the HTML of the page

In [166]:
# import html text
req = requests.get('http://www.ilga.gov/senate/default.asp') # make a GET request
src = req.text # read the content of the server’s response
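
The cell above assumes the request succeeds. As a minimal sketch of a more defensive fetch, the helper below adds a timeout and raises on HTTP error codes. The name `fetch_html` and the 10-second default are assumptions made for illustration; they are not part of the lecture code.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and its default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

With this helper, the cell above could be written as `src = fetch_html('http://www.ilga.gov/senate/default.asp')`.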

Now we use the `BeautifulSoup` library to parse the response into an HTML tree.

In [5]:
# parse the response into an HTML tree
soup = BeautifulSoup(src, 'html.parser') # name the parser explicitly so bs4 does not have to guess

BeautifulSoup has a number of functions to find things on a page. Let's search first for HTML tags.

In [168]:
# find all elements in a certain tag
# these two lines of code are equivalent
soup.find_all("a")
soup("a")

[<a href="/default.asp"><img alt="Illinois General Assembly" border="0" height="49" src="/images/logo_sm.gif" width="462"/></a>,
 <a class="mainmenu" href="/">Home</a>,
 <a class="mainmenu" href="/legislation/" onblur="HM_f_PopDown('elMenu1')" onfocus="HM_f_PopUp('elMenu1',event)" onmouseout="HM_f_PopDown('elMenu1')" onmouseover="HM_f_PopUp('elMenu1',event)">Legislation &amp; Laws</a>,
 <a class="mainmenu" href="/senate/" onblur="HM_f_PopDown('elMenu3')" onfocus="HM_f_PopUp('elMenu3',event)" onmouseout="HM_f_PopDown('elMenu3')" onmouseover="HM_f_PopUp('elMenu3',event)">Senate</a>,
 <a class="mainmenu" href="/house/" onblur="HM_f_PopDown('elMenu2')" onfocus="HM_f_PopUp('elMenu2',event)" onmouseout="HM_f_PopDown('elMenu2')" onmouseover="HM_f_PopUp('elMenu2',event)">House</a>,
 <a class="mainmenu" href="/mylegislation/" onblur="HM_f_PopDown('elMenu4')" onfocus="HM_f_PopUp('elMenu4',event)" onmouseout="HM_f_PopDown('elMenu4')" onmouseover="HM_f_PopUp('elMenu4',event)">My Legislation</a>,
 

We can also search for HTML tags with certain attributes, like particular CSS classes.

In [169]:
# Get only the links in 'sidemenu' class
soup("a", attrs={"class": "sidemenu"}) # equivalent in bs4: soup("a", class_="sidemenu")

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>,
 <a class="sidemenu" href="/senate/rules.asp">  Rules  </a>,
 <a class="sidemenu" href="/senate/audvid.asp">  Live Audio/Video  </a>]
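
To make the equivalence concrete, here is a small sketch that runs three ways of filtering by class against a hand-written HTML snippet. The snippet is an assumption modeled on the links in the output above; it is not fetched from the site.

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on the menu links above (not fetched from the site).
html = """
<a class="mainmenu" href="/">Home</a>
<a class="sidemenu" href="/senate/default.asp">  Members  </a>
<a class="sidemenu" href="/senate/rules.asp">  Rules  </a>
"""
soup = BeautifulSoup(html, "html.parser")

# Three ways to ask for <a> tags whose class is 'sidemenu'.
by_attrs = soup.find_all("a", attrs={"class": "sidemenu"})
by_keyword = soup.find_all("a", class_="sidemenu")  # trailing underscore: 'class' is a Python keyword
by_selector = soup.select("a.sidemenu")  # CSS selector form

hrefs = [link["href"] for link in by_keyword]
print(hrefs)
```

All three calls return the same two links. The `class_` keyword is usually the shortest to type.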

In [170]:
# Get just the href (url) attribute
for link in soup('a'):
    print(link['href'])

/default.asp
/
/legislation/
/senate/
/house/
/mylegislation/
/sitemap.asp
/senate/default.asp
/senate/committees/default.asp
/senate/schedules/default.asp
/senate/journals/default.asp
/senate/transcripts/default.asp
/senate/rules.asp
/senate/audvid.asp
99GA_Senate_Leadership.pdf
99GA_Senate_Officers.pdf
99GA_Senate_Seating_Chart.pdf
javascript:Sort('LastName','',99);
javascript:Sort('DistrictNumber','',99);
javascript:Sort('Party','',99);
/senate/Senator.asp?GA=99&MemberID=2130
SenatorBills.asp?MemberID=2130
SenCommittees.asp?MemberID=2130
/senate/Senator.asp?GA=99&MemberID=2275
SenatorBills.asp?MemberID=2275
SenCommittees.asp?MemberID=2275
/senate/Senator.asp?GA=99&MemberID=2208
SenatorBills.asp?MemberID=2208
SenCommittees.asp?MemberID=2208
/senate/Senator.asp?GA=99&MemberID=2276
SenatorBills.asp?MemberID=2276
SenCommittees.asp?MemberID=2276
/senate/Senator.asp?GA=99&MemberID=2210
SenatorBills.asp?MemberID=2210
SenCommittees.asp?MemberID=2210
/senate/Senator.asp?GA=99&MemberID=2209
S
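
Many of the hrefs printed above are relative, so they cannot be passed to `requests.get` as they are. One way to resolve them against the page URL, sketched here in Python 3, is `urljoin` from the standard library. The two hrefs are copied from the output above; the variable names are illustrative.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from.
base_url = "http://www.ilga.gov/senate/default.asp"

# Two relative hrefs taken from the output above.
hrefs = [
    "/senate/Senator.asp?GA=99&MemberID=2130",  # relative to the site root
    "SenatorBills.asp?MemberID=2130",           # relative to the current directory
]

# Resolve each href against the page URL.
absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```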

Many elements on a page share the same HTML tag. For instance, if you search for everything with the `p` tag, you're likely to get many elements, most of which you don't want.

CSS selectors are often a more precise way to find elements on a page.

In [171]:
# Search Tree using CSS Selectors
rows = soup.select('tr') # returns a list of every element matching the 'tr' selector
rows = soup.select('tr tr tr') # keeps only 'tr' elements nested inside two other 'tr' elements
rows[2]

<tr><td bgcolor="white" class="detail" width="40%"><a href="/senate/Senator.asp?GA=99&amp;MemberID=2130">Pamela J. Althoff</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=2130">Bills</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=2130">Committees</a></td><td align="center" bgcolor="white" class="detail" width="15%">32</td><td align="center" bgcolor="white" class="detail" width="15%">R</td></tr>
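
The selector `tr tr tr` is a descendant selector: it matches a `tr` that sits inside a `tr` that itself sits inside another `tr`. The sketch below shows the effect on a hand-written nested table. The markup and names are made up for illustration and only loosely resemble the real page.

```python
from bs4 import BeautifulSoup

# Hand-written nested tables; names and values are placeholders.
html = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td class="detail">Alice</td><td class="detail">1</td></tr>
      <tr><td class="detail">Bob</td><td class="detail">2</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

all_rows = soup.select("tr")          # every row at every nesting level
inner_rows = soup.select("tr tr tr")  # only rows with two 'tr' ancestors

print(len(all_rows), len(inner_rows))
```

Here `all_rows` holds four rows (outer, middle, and two inner), while `inner_rows` holds only the two innermost rows that carry the data.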

We can call the `select` method on any element, not just the whole document. Let's say we want to find everything matching the CSS selector `td.detail` within one item of the list we created above.

In [172]:
# Get the table row items 
row = rows[2]
row.select('td.detail') # select only those 'td' tags with class 'detail'

[<td bgcolor="white" class="detail" width="40%"><a href="/senate/Senator.asp?GA=99&amp;MemberID=2130">Pamela J. Althoff</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=2130">Bills</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=2130">Committees</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%">32</td>,
 <td align="center" bgcolor="white" class="detail" width="15%">R</td>]

Most of the time, we're interested in the actual **text** of a website, not its tags. To get the text of an HTML element, use its `text` attribute.

In [175]:
# Keep only the text in each of those cells
detailCells = row.select('td.detail')
for cell in detailCells:
    print(cell.text)

Pamela J. Althoff
Bills
Committees
32
R
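
The `text` attribute returns the text exactly as it appears in the page, including surrounding whitespace. That matters for elements like the `sidemenu` links shown earlier, whose labels are padded with spaces. A minimal sketch of the difference, using one of those links as a hand-copied sample:

```python
from bs4 import BeautifulSoup

# One link copied from the 'sidemenu' output earlier in the notebook.
html = '<a class="sidemenu" href="/senate/default.asp">  Members  </a>'
link = BeautifulSoup(html, "html.parser").a

raw = link.text                      # keeps the padding around the label
clean = link.get_text(strip=True)    # strips leading and trailing whitespace

print(repr(raw))
print(repr(clean))
```

Calling `.strip()` on `link.text` gives the same result for a single text node. `get_text(strip=True)` also strips each piece of text when an element has nested tags.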


Now we can combine the BeautifulSoup tools with our basic Python skills to scrape an entire web page.

In [174]:
# Check em out
rowData = [cell.text for cell in detailCells]
print(rowData[0]) # Name
print(rowData[3]) # district
print(rowData[4]) # party

Pamela J. Althoff
32
R


In [115]:
# Use a for loop to get 'em all
members = []
for row in rows:
    # see if it's a 'detail' row
    detailCells = row.select('td.detail')
    if len(detailCells) != 5: # compare values with !=, not identity with 'is not'
        continue
    # get the data
    rowData = [cell.text for cell in detailCells]
    name = rowData[0]
    district = int(rowData[3])
    party = rowData[4]
    members.append((name,district,party))

In [116]:
# Putting it all together in a function
def get_members(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src, 'html.parser')
    rows = soup.select('tr')
    members = []
    for row in rows:
        # see if it's a 'detail' row
        detailCells = row.select('td.detail')
        if len(detailCells) != 5:
            continue
        # get the data
        rowData = [cell.text for cell in detailCells]
        name = rowData[0]
        district = int(rowData[3])
        party = rowData[4]
        members.append((name,district,party))
    return members
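
One possible refinement, sketched here as an assumption rather than as part of the lecture, is to separate parsing from fetching so the row logic can be checked without a network connection. The function name `parse_members`, the `header` class, and the sample markup are hypothetical. The member row is modeled on the Pamela J. Althoff row shown earlier.

```python
from bs4 import BeautifulSoup


def parse_members(html):
    """Return (name, district, party) tuples found in member-table HTML."""
    soup = BeautifulSoup(html, "html.parser")
    members = []
    for row in soup.select("tr"):
        cells = row.select("td.detail")
        if len(cells) != 5:  # skip header rows and anything else that is not a member row
            continue
        texts = [cell.get_text(strip=True) for cell in cells]
        members.append((texts[0], int(texts[3]), texts[4]))
    return members


# Hand-written sample: one hypothetical header row plus the row shown earlier.
sample = """
<table>
  <tr><td class="header">Senator</td><td class="header">District</td></tr>
  <tr>
    <td class="detail"><a href="/senate/Senator.asp?GA=99&amp;MemberID=2130">Pamela J. Althoff</a></td>
    <td class="detail"><a href="SenatorBills.asp?MemberID=2130">Bills</a></td>
    <td class="detail"><a href="SenCommittees.asp?MemberID=2130">Committees</a></td>
    <td class="detail">32</td>
    <td class="detail">R</td>
  </tr>
</table>
"""
result = parse_members(sample)
print(result)
```

With this split, `get_members(url)` would reduce to fetching the page text and passing it to `parse_members`.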

In [117]:
get_members('http://www.ilga.gov/senate/default.asp')

[(u'Pamela J. Althoff', 32, u'R'),
 (u'Neil Anderson', 36, u'R'),
 (u'Jason A. Barickman', 53, u'R'),
 (u'Scott M. Bennett', 52, u'D'),
 (u'Jennifer Bertino-Tarrant', 49, u'D'),
 (u'Daniel Biss', 9, u'D'),
 (u'Tim Bivins', 45, u'R'),
 (u'William E. Brady', 44, u'R'),
 (u'Melinda Bush', 31, u'D'),
 (u'James F. Clayborne, Jr.', 57, u'D'),
 (u'Jacqueline Y. Collins', 16, u'D'),
 (u'Michael Connelly', 21, u'R'),
 (u'John J. Cullerton', 6, u'D'),
 (u'Thomas Cullerton', 23, u'D'),
 (u'Bill Cunningham', 18, u'D'),
 (u'William Delgado', 2, u'D'),
 (u'Dan Duffy', 26, u'R'),
 (u'Gary Forby', 59, u'D'),
 (u'William R. Haine', 56, u'D'),
 (u'Don Harmon', 39, u'D'),
 (u'Napoleon Harris, III', 15, u'D'),
 (u'Michael E. Hastings', 19, u'D'),
 (u'Linda Holmes', 42, u'D'),
 (u'Mattie Hunter', 3, u'D'),
 (u'Toi W. Hutchinson', 40, u'D'),
 (u'Emil Jones, III', 14, u'D'),
 (u'David Koehler', 46, u'D'),
 (u'Dan Kotowski', 28, u'D'),
 (u'Darin M. LaHood', 37, u'R'),
 (u'Steven M. Landek', 12, u'D'),
 (u'Kim