# Webscraping Problem Set

First, run the code below to import the `requests` and `BeautifulSoup` libraries, as well as some other libraries we will be using.

In [1]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import re
import sys

## Intro

In this week's lecture, we introduced a function `get_members()` which returns tuples of state senators (upper chamber) and house reps (lower chamber) from Illinois. **For this assignment, we will only be looking at the state senators and not the members of the state house.** I've provided a similar function below that returns a list of tuples, each containing:

  - the senator's name
  - their district number
  - and their party

Our goals in this problem set are:
  - To scrape the relative URL (e.g. "`/senate/SenatorBills.asp?MemberID=1911&GA=98`") for each senator's list of bills
  - To modify the relative URL to be a full URL _which only lists primary-sponsored bills_ (e.g. http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True — note the "`&Primary=True`")
  - To automatically scrape that list of bills primarily sponsored by each senator

(Also note the "GA=98" part of the URLs: we're going to be sticking to the previous term's senators, the "98th" assembly.)

Run the code below and make sure it works, then we'll get started.

In [2]:
# modified code from Tuesday's lecture
def get_members(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    members = []
    for row in rows:
        detailCells = row.select('td.detail')
        if len(detailCells) != 5:
            continue
        rowData = [cell.text for cell in detailCells]
        name = rowData[0]
        district = int(rowData[3])
        party = rowData[4]
        tup = (name,district,party)
        members.append(tup)
    return(members)

senateMembers = get_members('http://www.ilga.gov/senate/default.asp?GA=98')
senateMembers[0:5]

[(u'Pamela J. Althoff', 32, u'R'),
 (u'Jason A. Barickman', 53, u'R'),
 (u'Scott M Bennett', 52, u'D'),
 (u'Jennifer Bertino-Tarrant', 49, u'D'),
 (u'Daniel Biss', 9, u'D')]

## 1. Explain the function `get_members`

Describe, in English, what is going on when this function executes. Pay close attention to the uses of the `select()` function. Feel free to pepper the code with `print` commands and `type()` function calls so that you know what the types of different objects are.
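If you want to experiment with `select()` away from the live site, a small sketch like the one below may help. The HTML string is a made-up miniature table (an assumption for illustration, not the real page), and `html.parser` is just one of the parsers BeautifulSoup can use.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table; the real members page has more columns.
html = """
<table>
  <tr><td class="header">Senator</td><td class="header">Party</td></tr>
  <tr><td class="detail">Jane Doe</td><td class="detail">D</td></tr>
  <tr><td class="detail">John Roe</td><td class="detail">R</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = soup.select("tr")              # every <tr> element in the document
detail_rows = []
for row in rows:
    cells = row.select("td.detail")   # only <td class="detail"> inside this row
    if len(cells) != 2:               # the header row has no td.detail cells
        continue
    detail_rows.append([cell.text for cell in cells])

print(type(rows))      # the container returned by select()
print(type(rows[0]))   # the type of one row
print(detail_rows)
```

Calling `select()` on the whole soup searches the entire document, while calling it on a single `row` searches only inside that row.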

**Your markdown answer here**

## 2. Get HREF element pointing to members' bills.

The `get_members()` function above uses the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/">`BeautifulSoup`</a> library to scrape the Illinois General Assembly website and pull out the information about the senators and representatives.

As you can see from the example bill URLs provided in the intro above, the format for the list of bills in 2014 for a given senator is:

http://www.ilga.gov/senate/SenatorBills.asp + ? + GA=98 + &MemberID=**_memberID_** + &Primary=True

to get something like:

http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True

You should be able to see that, unfortunately, _memberID_ is not currently something pulled out in our scraping function `get_members()`. 

Your initial task is to modify `get_members()` so that we also **retrieve the relative URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

I've pasted `get_members()` code below. Modify it to return the bills URL as a 4th element of the tuples returned by the function.

Tips: 

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. (See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details).
* NOTE: There are a _lot_ of different ways to use BeautifulSoup to get things done; whatever you need to do to pull that HREF out is fine. Posting on the etherpad is recommended for discussing different strategies.
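As a rough illustration of the tips above, here is a sketch of two ways to pull an `href` out of a row. The row HTML, the link order, and the `href` values are placeholders invented for this example, so check the real page source before relying on either approach.

```python
from bs4 import BeautifulSoup

# Hypothetical row; hrefs and link order are placeholders, not copied from the site.
row_html = """
<table><tr>
  <td class="detail"><a href="profile.asp?id=1">Jane Doe</a></td>
  <td class="detail"><a href="bills.asp?id=1">Bills</a></td>
  <td class="detail"><a href="committees.asp?id=1">Committees</a></td>
</tr></table>
"""
row = BeautifulSoup(row_html, "html.parser").select("tr")[0]

anchors = row.select("a")   # all <a> tags inside this row

# Option 1: pick the anchor by position and read its attribute like a dict key.
by_position = anchors[1]["href"]

# Option 2: pick the anchor by its visible link text.
by_text = [a["href"] for a in anchors if a.text.strip() == "Bills"][0]

print(by_position)
print(by_text)
```

Selecting by position is shorter but depends on the column order staying fixed; selecting by link text is a little more robust if the layout changes.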

In [4]:
# FILL ME OUT
def get_members(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    members = []
    for row in rows:
        detailCells = row.select('td.detail')
        if len(detailCells) != 5:
            continue
        rowData = [cell.text for cell in detailCells]
        name = rowData[0]
        district = int(rowData[3])
        party = rowData[4]
        
        # YOUR CODE HERE
        
        tup = (name,district,party) # might want to modify this line too
        members.append(tup)
    return(members)

In [None]:
# SOLUTION
def get_members(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    members = []
    for row in rows:
        detailCells = row.select('td.detail')
        if len(detailCells) != 5:
            continue
        rowData = [cell.text for cell in detailCells]
        name = rowData[0]
        district = int(rowData[3])
        party = rowData[4]
        href = row.select('a')[1]['href']
        tup = (name,district,party,href)
        members.append(tup)
    return(members)

In [5]:
# Uncomment to test

senateMembers = get_members('http://www.ilga.gov/senate/default.asp?GA=98')
senateMembers[:5]

[(u'Pamela J. Althoff', 32, u'R', 'SenatorBills.asp?GA=98&MemberID=1911'),
 (u'Jason A. Barickman', 53, u'R', 'SenatorBills.asp?GA=98&MemberID=2018'),
 (u'Scott M Bennett', 52, u'D', 'SenatorBills.asp?GA=98&MemberID=2272'),
 (u'Jennifer Bertino-Tarrant',
  49,
  u'D',
  'SenatorBills.asp?GA=98&MemberID=2022'),
 (u'Daniel Biss', 9, u'D', 'SenatorBills.asp?GA=98&MemberID=2020')]

## 3. Modify to Full Path

The HREF you got above is what's called a _relative_ URL: i.e., it looks like this:
  
  `SenatorBills.asp?GA=98&MemberID=2018`
  
  as opposed to having a full path, like:
  
  `http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2018`
  
Paste the function you wrote above, and modify the function again to get the **full path** of the members' bills.

**Hint**: You will have to add the appropriate prefix for the senate to the returned string (and thus transform the HREF into a full-fledged URL), i.e., "`http://www.ilga.gov/senate/`".

You'll also have to add a suffix, "&Primary=True" because we only want the bills where the senator is the primary author.
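String concatenation is all this problem needs, but as an alternative sketch, the standard library's `urljoin` can resolve a relative link against the URL of the page it was scraped from. The variable names below are illustrative only.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

# URL of the page the link was scraped from, and the relative HREF found on it.
page_url = "http://www.ilga.gov/senate/default.asp?GA=98"
relative = "SenatorBills.asp?GA=98&MemberID=2018"

# urljoin resolves the relative link against the page's directory (/senate/).
full_url = urljoin(page_url, relative) + "&Primary=True"
print(full_url)
```

This avoids hard-coding the prefix, at the cost of an extra import.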

In [None]:
# FILL ME OUT
def get_members(url):
    pass

In [6]:
# SOLUTION
def get_members(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    members = []
    for row in rows:
        detailCells = row.select('td.detail')
        if len(detailCells) != 5:
            continue
        rowData = [cell.text for cell in detailCells]
        name = rowData[0]
        district = int(rowData[3])
        party = rowData[4]
        # get relative path
        href = row.select('a')[1]['href']
        # make full path
        full_path = "http://www.ilga.gov/senate/" + href + "&Primary=True"
        tup = (name,district,party,full_path)
        members.append(tup)
    return(members)

In [7]:
# Uncomment to test
senateMembers = get_members('http://www.ilga.gov/senate/default.asp?GA=98')
senateMembers[0:5]

[(u'Pamela J. Althoff',
  32,
  u'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True'),
 (u'Jason A. Barickman',
  53,
  u'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2018&Primary=True'),
 (u'Scott M Bennett',
  52,
  u'D',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2272&Primary=True'),
 (u'Jennifer Bertino-Tarrant',
  49,
  u'D',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2022&Primary=True'),
 (u'Daniel Biss',
  9,
  u'D',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2020&Primary=True')]

## 4. Create dictionary of senate members.

Given the `senateMembers` list, we want to create a new dictionary `members_dict` which has as its keys the district number (e.g. `32`) and as its values the entire tuple as returned by `get_members()` (name, district number, party, url). We can do this because the district number is a unique identifier for each senator.

Calling `members_dict[32]`, for example, should return the 4-tuple:

```
(u'Pamela J. Althoff',
 32,
 u'R',
 'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True')
```

(We'll use this later to look up the URL.)
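For reference, the same idea of keying records by one of their fields can be sketched on made-up data with a dictionary comprehension. The sample tuples and the `example.com` URLs are placeholders, not real senators.

```python
# Hypothetical records in the same (name, district, party, url) shape.
sample_members = [
    ("Jane Doe", 1, "D", "http://example.com/bills?member=1"),
    ("John Roe", 2, "R", "http://example.com/bills?member=2"),
]

# Key each tuple by its district number (index 1).
by_district = {member[1]: member for member in sample_members}

# If two records shared a key, the later one would silently overwrite the earlier one,
# so it is worth checking that no entries were lost.
assert len(by_district) == len(sample_members)

print(by_district[2])
```

The length check is a cheap way to confirm that the chosen key really is unique.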

I've started the code for you. Fill out the rest:

In [8]:
# FILL ME OUT
members_dict = {}
for member in senateMembers:
    # YOUR CODE HERE
    pass

In [25]:
# SOLUTION
members_dict = {}
for member in senateMembers:
    members_dict[member[1]] = member

In [26]:
# Uncomment to test  
members_dict[32]

(u'Pamela J. Althoff',
 32,
 u'R',
 'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True')

## 5. Scrape Bills

Now we want to scrape the webpages corresponding to bills.

Write a function called `get_bills(url)` to parse a given Bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class 'billlist'
  - returning a _list_ of tuples, each with:
      - the bill id (1st column) 
          - **NOTE:** we only want you to accept Senate bills with valid names; these start with 'SB' and end with a number. We created a function `is_valid_senatebill_id()` to help you decide if it is valid or not. Otherwise, don't add it to the list!
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
      
I've started the function for you. Fill in the rest.

In [20]:
# DO NOT EDIT THIS.
# This function returns True if the bill in question is a Senate Bill ('SB') followed by a number. 
def is_valid_senatebill_id(bill_id_str):
    match = re.match('SB[0-9]+$',bill_id_str)
    if match: 
        return True
    else:
        return False
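To see which strings the pattern accepts, the sketch below applies the same regular expression to a few sample IDs. The helper name `looks_like_senate_bill` is hypothetical and only used here so that the provided function is left untouched.

```python
import re

# Same pattern as in is_valid_senatebill_id(); the helper name is hypothetical.
SENATE_BILL = re.compile(r'SB[0-9]+$')

def looks_like_senate_bill(text):
    # re.match anchors at the start of the string; '$' anchors at the end.
    return SENATE_BILL.match(text) is not None

samples = ['SB27', 'SB104', 'HB27', 'SB', 'SB27A', ' SB27']
results = {s: looks_like_senate_bill(s) for s in samples}
for s in samples:
    print(repr(s), results[s])
```

Note that a leading space makes the match fail, so if scraped cell text ever carries surrounding whitespace, calling `.strip()` on it first may be necessary.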

In [21]:
# COMPLETE THIS FUNCTION
def get_bills(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    bills = []
    for row in rows:
        
        # YOUR CODE HERE
               
        tup = (bill_id,description,chamber,last_action,last_action_date)
        bills.append(tup)
    return(bills)

In [11]:
# SOLUTION
def get_bills(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    bills = []
    for row in rows:
        detailCells = row.select('td.billlist')
        if len(detailCells) != 5:
            continue
        # text of every child of the row (not only the td.billlist cells)
        rowData = [cell.text for cell in row]
        if not is_valid_senatebill_id(rowData[0]):
            continue
        bill_id = rowData[0]
        description = rowData[2]
        chamber = rowData[3]
        last_action = rowData[4]
        last_action_date = rowData[5]
        tup = (bill_id,description,chamber,last_action,last_action_date)
        bills.append(tup)
    return(bills)

In [12]:
# uncomment to test your code:
get_bills('http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True')[0:5]

[(u'SB27',
  u'MEDICAID BUDGET NOTE ACT',
  u'S',
  u'Session Sine Die',
  u'1/13/2015'),
 (u'SB28',
  u'HOMELESS VETERANS SHELTER ACT',
  u'S',
  u'Session Sine Die',
  u'1/13/2015'),
 (u'SB29', u'ROAD FUND-NO TRANSFERS', u'S', u'Session Sine Die', u'1/13/2015'),
 (u'SB33',
  u'EPA-RULES-DOCUMENT SUBMISSION',
  u'S',
  u'Public Act . . . . . . . . . 98-0072',
  u'7/15/2013'),
 (u'SB104',
  u'MIN WAGE-OVERTIME-ALTERN SHIFT',
  u'S',
  u'Session Sine Die',
  u'1/13/2015')]

## 6. Get all the bills

Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) emanating from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.

NOTE: please call the function `time.sleep(0.5)` for each iteration of the loop, so that we don't destroy the state's web site.
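The general pattern of pausing between requests can be sketched without touching the network. In the example below, `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` simply stands in for a real scraping function such as `get_bills()`.

```python
import time

def fetch_all(urls_by_key, fetch, delay=0.5):
    """Call fetch(url) for every entry, sleeping between calls to be polite."""
    results = {}
    for key, url in urls_by_key.items():
        results[key] = fetch(url)
        time.sleep(delay)   # pause so the server is not hit with rapid-fire requests
    return results

# Stand-in for a real scraper; it just echoes the URL in upper case.
def fake_fetch(url):
    return [url.upper()]

demo = fetch_all({1: "a", 2: "b"}, fake_fetch, delay=0.01)
print(demo)
```

With 59 districts and a half-second delay, the full loop should take roughly half a minute plus the time for the requests themselves.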

In [None]:
# YOUR CODE HERE

In [13]:
# SOLUTION
bills_dict = {}
for member in members_dict:
    bills_dict[member] = get_bills(members_dict[member][3])
    time.sleep(0.5)


In [16]:
# uncomment to test
bills_dict[32][:5]

[(u'SB27',
  u'MEDICAID BUDGET NOTE ACT',
  u'S',
  u'Session Sine Die',
  u'1/13/2015'),
 (u'SB28',
  u'HOMELESS VETERANS SHELTER ACT',
  u'S',
  u'Session Sine Die',
  u'1/13/2015'),
 (u'SB29', u'ROAD FUND-NO TRANSFERS', u'S', u'Session Sine Die', u'1/13/2015'),
 (u'SB33',
  u'EPA-RULES-DOCUMENT SUBMISSION',
  u'S',
  u'Public Act . . . . . . . . . 98-0072',
  u'7/15/2013'),
 (u'SB104',
  u'MIN WAGE-OVERTIME-ALTERN SHIFT',
  u'S',
  u'Session Sine Die',
  u'1/13/2015')]