# Webscraping - Lecture Code

## The Task:

Gather information about Illinois' elected state legislators from this page: http://www.ilga.gov/senate/default.asp

## The Tools:

1. [Requests](http://docs.python-requests.org/en/latest/user/quickstart/)
2. [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [37]:
# import required modules
import requests
from bs4 import BeautifulSoup

## Step 1: Make a Get Request and Read in HTML

We use the `requests` library to:
1. make a GET request to the page
2. read in the html of the page

In [22]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# read the content of the server’s response
src = req.text
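A request can fail (a 404 or a server error, for example), in which case `req.text` would hold an error page rather than the page we want. As a minimal sketch, the hypothetical helper `fetch_html` below checks the status before returning the HTML. Its name, the `get` parameter, and the 10-second timeout are assumptions made for illustration and are not part of the lecture code.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the HTML text at `url`, raising an error on a bad HTTP status.

    `get` defaults to requests.get; it is a parameter only so the function
    can be tried out without network access (an illustrative assumption).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `src = fetch_html('http://www.ilga.gov/senate/default.asp')` could stand in for the two lines in the cell above.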

## Step 2: Soup it

Now we use the `BeautifulSoup` function to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

In [38]:
# parse the response into an HTML tree
soup = BeautifulSoup(src)
# take a look
# soup
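To see what a soup object offers without printing the whole senate page, here is a small sketch on a made-up HTML snippet (the snippet and the name `demo` are illustrative assumptions). It passes `"html.parser"`, Python's built-in parser, explicitly so that the result does not depend on which parsers happen to be installed.

```python
from bs4 import BeautifulSoup

# a tiny, made-up HTML document for illustration
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p class="intro">Hello, <b>soup</b>!</p>
  </body>
</html>
"""

demo = BeautifulSoup(html, "html.parser")

# dotted access returns the first tag with that name
print(demo.title)          # the whole <title> tag
print(demo.title.string)   # just the text inside it
print(demo.p.b)            # tags nest, so we can walk down the tree
```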

## Step 3: Find Elements

BeautifulSoup has a number of functions to find things on a page. Like other webscraping tools, Beautiful Soup lets you find elements by their:

1. HTML tags
2. Attributes
3. Strings


Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree for all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

In [42]:
# find all elements with a certain tag

# soup.find_all("a")

**NB**: Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. These two lines of code are equivalent:

In [None]:
# soup.find_all("a")
# soup("a")
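The equivalence is easy to check on a small example. The sketch below uses a made-up snippet with three links (the snippet and variable names are assumptions for illustration) and compares the two spellings.

```python
from bs4 import BeautifulSoup

# made-up snippet with three links and one paragraph
html = """
<a href="/one">One</a>
<p>No link here</p>
<a href="/two">Two</a>
<a href="/three">Three</a>
"""
demo = BeautifulSoup(html, "html.parser")

links_long = demo.find_all("a")   # explicit method call
links_short = demo("a")           # shortcut: call the soup object directly

print(len(links_long))
print(links_long == links_short)
```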

Searching the senate page for every `a` tag returns a lot! Many elements on a page will have the same HTML tag, so a search by tag alone is likely to return a lot of stuff, much of which you don't want. What if we wanted to search for HTML tags ONLY with certain attributes, like particular CSS classes?

We can do this by passing an additional argument to `find_all`.

In the example below, we find all the `a` tags and keep only those with `class="sidemenu"`. The argument is spelled `class_` because `class` is a reserved word in Python.

In [43]:
# Get only the 'a' tags in 'sidemenu' class
soup("a", class_="sidemenu")

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>,
 <a class="sidemenu" href="/senate/rules.asp">  Rules  </a>,
 <a class="sidemenu" href="/senate/audvid.asp">  Live Audio/Video  </a>]
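As a hedged sketch of the same filter on a smaller input, the snippet below mixes `sidemenu` links with other links. The `footer` class and the snippet itself are made up for illustration. It also shows the equivalent `attrs` dictionary form, which is handy for attribute names that are not valid Python keywords.

```python
from bs4 import BeautifulSoup

# made-up snippet: two sidemenu links, one footer link, one link with no class
html = """
<a class="sidemenu" href="/senate/default.asp">Members</a>
<a class="sidemenu" href="/senate/rules.asp">Rules</a>
<a class="footer" href="/sitemap.asp">Site map</a>
<a href="/house/">House</a>
"""
demo = BeautifulSoup(html, "html.parser")

# keyword form: class_ avoids clashing with Python's `class` keyword
by_keyword = demo.find_all("a", class_="sidemenu")

# the same filter written as an attribute dictionary
by_attrs = demo.find_all("a", attrs={"class": "sidemenu"})

for link in by_keyword:
    print(link["href"])
```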

Oftentimes a more efficient way to search and find things on a website is by **CSS selector.** For this we have to use a different method, `select()`. Just pass a CSS selector as a string to `.select()` to get all the elements that match it.

In the example above, we can use "a.sidemenu" as a CSS selector, which returns all `a` tags with class `sidemenu`.

In [46]:
# get elements with "a.sidemenu" CSS Selector.
soup.select("a.sidemenu")

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>,
 <a class="sidemenu" href="/senate/rules.asp">  Rules  </a>,
 <a class="sidemenu" href="/senate/audvid.asp">  Live Audio/Video  </a>]
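CSS selectors can express more than a tag plus a class. The sketch below, on a made-up snippet (the `nav` id and the table are illustrative assumptions), tries a class selector, a descendant selector, a direct-child selector, and `select_one`, which returns only the first match.

```python
from bs4 import BeautifulSoup

# made-up snippet: a navigation block and a one-cell table
html = """
<div id="nav">
  <a class="sidemenu" href="/senate/default.asp">Members</a>
  <a class="sidemenu" href="/senate/rules.asp">Rules</a>
</div>
<table>
  <tr><td class="detail"><a href="/x">In a table</a></td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

by_class = demo.select("a.sidemenu")        # <a> tags with class "sidemenu"
by_id = demo.select("#nav a")               # <a> tags anywhere inside id="nav"
by_child = demo.select("td.detail > a")     # <a> tags directly inside td.detail
first_only = demo.select_one("a.sidemenu")  # first match only (None if no match)

print(len(by_class))
print(by_child[0].text)
print(first_only["href"])
```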

## Step 4. Get Attributes and Text of Elements

Once we identify elements, we want to access the information in those elements. Oftentimes this means two things:

1. Attributes, and 
2. Text

Getting the text inside an element is easy. All we have to do is use the `text` member of a `Tag` object:

In [53]:
# this is a list
soup.select("a.sidemenu")

# we first want to get an individual tag object
first_link = soup.select("a.sidemenu")[0]

# check out its class
type(first_link)

bs4.element.Tag

It's a tag! Which means it has a `text` member:

In [58]:
print(first_link.text)

  Members  
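Notice the spaces around `Members` in the output: `text` returns the string exactly as it appears in the HTML. A minimal sketch, rebuilding the first sidemenu link from the output shown earlier, compares `text` with `get_text(strip=True)`, which trims the surrounding whitespace. The variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# the first sidemenu link, copied from the output shown earlier
html = '<a class="sidemenu" href="/senate/default.asp">  Members  </a>'
link = BeautifulSoup(html, "html.parser").a

raw = link.text                      # text exactly as it appears in the HTML
clean = link.get_text(strip=True)    # same text with surrounding whitespace removed

print(repr(raw))
print(repr(clean))
```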


In [26]:
# Get just the href (url) attribute from the first 10 links
for link in soup('a')[:10]:
    print(link['href'])

/default.asp
/
/legislation/
/senate/
/house/
/mylegislation/
/sitemap.asp
/senate/default.asp
/senate/committees/default.asp
/senate/schedules/default.asp
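Two practical points about attributes, sketched below under stated assumptions. First, `link['href']` raises a `KeyError` when a tag has no `href`, while `link.get('href')` returns `None` instead. Second, the links above are relative, so they need to be joined with the page URL before they can be requested. The snippet is made up, and the `urllib.parse` import assumes Python 3.

```python
from urllib.parse import urljoin  # Python 3 location of urljoin

from bs4 import BeautifulSoup

# the page the links were scraped from
base_url = "http://www.ilga.gov/senate/default.asp"

# made-up snippet: one root-relative link, one page-relative link, one anchor with no href
html = """
<a href="/senate/committees/default.asp">Committees</a>
<a href="SenatorBills.asp?MemberID=2130">Bills</a>
<a name="top">An anchor without an href</a>
"""
demo = BeautifulSoup(html, "html.parser")

absolute_urls = []
for link in demo("a"):
    href = link.get("href")   # None when the attribute is missing
    if href is None:
        continue
    # resolve the relative link against the page it came from
    absolute_urls.append(urljoin(base_url, href))

for url in absolute_urls:
    print(url)
```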


Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `p` tag, you're likely to get a lot of stuff, much of which you don't want.

Oftentimes a more efficient way to search and find things on a website is by CSS selector.

In [27]:
# Search Tree using CSS Selectors
rows = soup.select('tr') # returns a list of every 'tr' element in the page
rows = soup.select('tr tr tr') # returns every 'tr' nested inside two other 'tr' elements
rows[2]

<tr><td bgcolor="white" class="detail" width="40%"><a href="/senate/Senator.asp?GA=99&amp;MemberID=2130">Pamela J. Althoff</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=2130">Bills</a></td><td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=2130">Committees</a></td><td align="center" bgcolor="white" class="detail" width="15%">32</td><td align="center" bgcolor="white" class="detail" width="15%">R</td></tr>
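The selector `'tr tr tr'` relies on the descendant combinator: a space means "somewhere inside". A hedged sketch on a made-up set of nested tables (the cell text is a placeholder, not real data) shows how each extra `tr` narrows the match to more deeply nested rows.

```python
from bs4 import BeautifulSoup

# made-up layout: three tables nested inside one another
html = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td class="detail">First senator</td></tr>
      <tr><td class="detail">Second senator</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""
demo = BeautifulSoup(html, "html.parser")

all_rows = demo.select("tr")            # every row, at any depth
nested_once = demo.select("tr tr")      # rows inside at least one other row
nested_twice = demo.select("tr tr tr")  # rows inside at least two other rows

print(len(all_rows))
print(len(nested_once))
print(len(nested_twice))
```

On a page laid out with nested tables, this is one way to skip the outer layout rows and keep only the innermost data rows.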

We can use the `select` method on any tag, not just on the whole soup. Let's say we want to find everything matching the CSS selector `td.detail` inside an item of the list we created above.

In [28]:
# Get the table row items 
row = rows[2]
row.select('td.detail') # select only those 'td' tags with class 'detail'

[<td bgcolor="white" class="detail" width="40%"><a href="/senate/Senator.asp?GA=99&amp;MemberID=2130">Pamela J. Althoff</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=2130">Bills</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=2130">Committees</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%">32</td>,
 <td align="center" bgcolor="white" class="detail" width="15%">R</td>]

Most of the time, we're interested in the actual **text** of a website, not its tags. To get the text of an HTML element, use the `text` attribute.

In [29]:
# Keep only the text in each of those cells
detailCells = row.select('td.detail')
for cell in detailCells:
    print(cell.text)

Pamela J. Althoff
Bills
Committees
32
R


Now we can combine the Beautiful Soup tools with our basic Python skills to scrape an entire web page.

In [30]:
# Check em out
rowData = [cell.text for cell in detailCells]
print(rowData[0]) # name
print(rowData[3]) # district
print(rowData[4]) # party

Pamela J. Althoff
32
R


In [31]:
# Use a for loop to get 'em all
members = []
for row in rows:
    # see if it's a 'detail' row
    detailCells = row.select('td.detail')
    if len(detailCells) != 5:
        continue
    # get the data
    rowData = [cell.text for cell in detailCells]
    name = rowData[0]
    district = int(rowData[3])
    party = rowData[4]
    members.append((name, district, party))

In [34]:
# Putting it all together in a function
def get_members(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    members = []
    for row in rows:
        # see if it's a 'detail' row
        detailCells = row.select('td.detail')
        # only get real rows
        if len(detailCells) != 5:
            continue
        # get the text data in those cells
        rowData = [cell.text for cell in detailCells]
        # parse into relevant variables
        name = rowData[0]
        district = int(rowData[3])
        party = rowData[4]
        # combine into tuple
        tup = (name, district, party)
        # append to list
        members.append(tup)
    return members
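One possible refinement, offered as a sketch and not as the lecture's method, is to separate parsing from downloading so that the parsing logic can be tried on a saved or hand-written HTML string. The function name `parse_members`, the `heading` class, and the sample table below are assumptions for illustration. The data row reuses the cells from the output shown earlier.

```python
from bs4 import BeautifulSoup


def parse_members(html):
    """Extract (name, district, party) tuples from senate-page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    members = []
    for row in soup.select("tr"):
        cells = row.select("td.detail")
        # only rows with exactly five 'detail' cells describe a senator
        if len(cells) != 5:
            continue
        texts = [cell.get_text(strip=True) for cell in cells]
        members.append((texts[0], int(texts[3]), texts[4]))
    return members


# a hypothetical header row plus one data row based on the earlier output
sample = """
<table>
  <tr><td class="heading">Senator</td></tr>
  <tr>
    <td class="detail"><a href="/senate/Senator.asp?GA=99&amp;MemberID=2130">Pamela J. Althoff</a></td>
    <td class="detail"><a href="SenatorBills.asp?MemberID=2130">Bills</a></td>
    <td class="detail"><a href="SenCommittees.asp?MemberID=2130">Committees</a></td>
    <td class="detail">32</td>
    <td class="detail">R</td>
  </tr>
</table>
"""
print(parse_members(sample))
```

A `get_members(url)` wrapper could then simply download the page and hand the text to `parse_members`.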

In [36]:
get_members('http://www.ilga.gov/senate/default.asp?GA=98')

[(u'Pamela J. Althoff', 32, u'R'),
 (u'Jason A. Barickman', 53, u'R'),
 (u'Scott M Bennett', 52, u'D'),
 (u'Jennifer Bertino-Tarrant', 49, u'D'),
 (u'Daniel Biss', 9, u'D'),
 (u'Tim Bivins', 45, u'R'),
 (u'William E. Brady', 44, u'R'),
 (u'Melinda Bush', 31, u'D'),
 (u'James F. Clayborne, Jr.', 57, u'D'),
 (u'Jacqueline Y. Collins', 16, u'D'),
 (u'Michael Connelly', 21, u'R'),
 (u'John J. Cullerton', 6, u'D'),
 (u'Thomas Cullerton', 23, u'D'),
 (u'Bill Cunningham', 18, u'D'),
 (u'William Delgado', 2, u'D'),
 (u'Kirk W. Dillard', 24, u'R'),
 (u'Dan Duffy', 26, u'R'),
 (u'Gary Forby', 59, u'D'),
 (u'Michael W. Frerichs', 52, u'D'),
 (u'William R. Haine', 56, u'D'),
 (u'Don Harmon', 39, u'D'),
 (u'Napoleon Harris, III', 15, u'D'),
 (u'Michael E. Hastings', 19, u'D'),
 (u'Linda Holmes', 42, u'D'),
 (u'Mattie Hunter', 3, u'D'),
 (u'Toi W. Hutchinson', 40, u'D'),
 (u'Mike Jacobs', 36, u'D'),
 (u'Emil Jones, III', 14, u'D'),
 (u'David Koehler', 46, u'D'),
 (u'Dan Kotowski', 28, u'D'),
 (u'Dar