# Webscraping Problem Set

First, run the code below to import the `requests` and `BeautifulSoup` libraries, as well as some other libraries we will be using.

In [1]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import re
import sys

`1`. In this week's lecture, we introduced a function `get_members()` which returns tuples of state senators (upper chamber) and house reps (lower chamber) from Illinois. **For this assignment, we will only be looking at the state senators and not the members of the state house.** We've provided a similar function below that returns a list of tuples, each containing:

  - the senator's name
  - their district number
  - and their party

Our goals in this problem set are:
  - To scrape the relative URL (e.g. "`/senate/SenatorBills.asp?MemberID=1911&GA=98`") for each senator's list of bills
  - To modify the relative URL to be a full URL _which only lists primary-sponsored bills_ (e.g. http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True — note the "`&Primary=True`")
  - To automatically scrape that list of bills primarily sponsored by each senator

(Also note the "GA=98" part of the URLs: we're going to be sticking to the previous term's senators, the "98th" assembly.)

Run the code below and make sure it works, then we'll get started.

In [2]:
def get_members(url):
    # modified code from Tuesday Jan 20 lecture
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    members = []
    for row in rows:
        # see if it's a 'detail' row
        detailCells = row.select('td.detail')
        if len(detailCells) != 5:
            continue
        # get the data 
        rowData = [cell.text for cell in detailCells]
        name = rowData[0]
        district = int(rowData[3])
        party = rowData[4]
        members.append((name,district,party))
    return(members)

senateMembers = get_members('http://www.ilga.gov/senate/default.asp?GA=98')

print "First 5 senate members:", senateMembers[0:5]

First 5 senate members: [(u'Pamela J. Althoff', 32, u'R'), (u'Jason A. Barickman', 53, u'R'), (u'Scott M Bennett', 52, u'D'), (u'Jennifer Bertino-Tarrant', 49, u'D'), (u'Daniel Biss', 9, u'D')]


`2`. Describe, in English, what is going on when this function executes. Pay close attention to the uses of the `select()` function. Feel free to pepper the code with `print` commands and `type()` function calls so that you know what the types of different objects are.

`3`. The `get_members()` function above uses the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/">`BeautifulSoup`</a> library to scrape the Illinois General Assembly website and pull out the information about the senators and representatives.

As you can see from the example bill URLs provided in question 1 above, the format for the list of bills in 2014 for a given senator is:

http://www.ilga.gov/senate/SenatorBills.asp + ? + GA=98 + &MemberID=_memberID_ + &Primary=True

You should be able to see that, unfortunately, _memberID_ is not currently something pulled out in our scraping function `get_members()`. Your initial task is to modify `get_members()` so that we also retrieve the URL which points to the corresponding page of primary-sponsored bills, for each member, and return it along with their name, district, and party.

Paste `get_members()` below; then modify it to return the bills URL as a 4th element of the tuples returned by the function.

  - To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
  - The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the relative link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a value in a Python dictionary: `anchor['attributeName']`. (See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details).
      - (NOTE: There are a _lot_ of different ways to use BeautifulSoup to get things done; whatever you need to do to pull that HREF out is fine. Posting on Piazza is recommended for discussing different strategies.)
  
  - The HREF is what's called a _relative_ URL: i.e., it looks like this:
  
  `SenatorBills.asp?GA=98&MemberID=2018`
  
  as opposed to having a full path, like:
  
  `http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2018`
  
  You will have to add the appropriate prefix for the senate to the returned string (and thus transform the HREF into a full-fledged URL), i.e., "`http://www.ilga.gov/senate/`".
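
As an alternative to gluing strings together by hand, here is a minimal sketch that resolves the HREF against the page it was scraped from using the standard library's `urljoin`, then appends the `&Primary=True` flag. The helper name `to_primary_bills_url` and the constant `BASE_URL` are made up for this illustration, and the sketch assumes the HREF already carries a query string (so the flag can be added with `&`).

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# The page the senator table was scraped from (same URL used with get_members)
BASE_URL = 'http://www.ilga.gov/senate/default.asp?GA=98'

def to_primary_bills_url(href, base_url=BASE_URL):
    """Hypothetical helper: turn a relative bills HREF into a full URL
    that lists only primary-sponsored bills."""
    # urljoin resolves the relative link the same way a browser would
    full_url = urljoin(base_url, href)
    # Assumes href already has a '?...' query string, so '&' is the right separator
    if 'Primary=True' not in full_url:
        full_url += '&Primary=True'
    return full_url

example = to_primary_bills_url('SenatorBills.asp?GA=98&MemberID=2018')
```

Because `urljoin` follows the usual URL resolution rules, the same helper should cope with an HREF written as `SenatorBills.asp?...` or as `/senate/SenatorBills.asp?...`.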



In [3]:
def get_members(url):
    # def: your code here
    pass

# uncomment to test your code:
#senateMembers = get_members('http://www.ilga.gov/senate/default.asp?GA=98')
#print "First 5 senate members:", senateMembers[0:5]

'''Answer

def get_members(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    members = []
    for row in rows:
        # see if it's a 'detail' row
        detailCells = row.select('td.detail')
        if len(detailCells) != 5:
            continue
        # get the data
        rowData = [cell.text for cell in detailCells]
        name = rowData[0]
        district = int(rowData[3])
        party = rowData[4]
        # assumes the second link in the row is the "Bills" link; its href is relative
        href = row.select('a')[1]['href']
        # build the full URL and restrict it to primary-sponsored bills
        bills_url = "http://www.ilga.gov/senate/" + href + "&Primary=True"
        members.append((name,district,party,bills_url))
    return(members)
'''

`4`. Given the `senateMembers` list, create a new dictionary `members_dict` which has as its keys the district number (e.g. `32`) and as its values the entire tuple as returned by `get_members()` (name, district number, party, url). We can do this because the district number is a unique identifier for each senator.

Calling `members_dict[32]`, for example, should return the 4-tuple:

```(u'Pamela J. Althoff',
32,
u'R',
'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True')```.

(We'll use this later to look up the URL.)
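
To see the idea on a small scale first, here is a sketch that builds the same kind of lookup from the first three senators in the sample output above. The names `sample_members` and `sample_dict` are made up for this illustration, and the tuples are 3-tuples only because the sample output predates the URL field; a 4th element would simply be carried along.

```python
# First three senators from the sample output (name, district, party)
sample_members = [
    ('Pamela J. Althoff', 32, 'R'),
    ('Jason A. Barickman', 53, 'R'),
    ('Scott M Bennett', 52, 'D'),
]

# key: district number (index 1 of each tuple); value: the whole tuple
sample_dict = {member[1]: member for member in sample_members}

# A repeated district would silently overwrite an earlier entry,
# so confirm that no senator was lost along the way
assert len(sample_dict) == len(sample_members)

althoff = sample_dict[32]
```

The length check is cheap insurance: it only passes if the key really is unique for every member.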

In [None]:
'''Answer
members_dict = {}
for member in senateMembers:
    members_dict[member[1]] = member
print members_dict[32]
'''

`5`. Write a function called `get_bills(url)` to parse a given Bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library (see p. 14 in the slides from Tuesday Jan 20)
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class 'billlist'
  - returning a _list_ of tuples, each with:
      - the bill id (1st column) **NOTE: we only want you to accept Senate bills with valid names; these start with 'SB' and end with a number. We created a function `is_valid_senatebill_id()` to help you decide if it is valid or not. Otherwise, don't add it to the list!**
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
  
  We are going to want to use the `datetime` library to represent the date. You can use the function `datetime.strptime()` to convert from the date string (e.g. `'8/1/2014'`) to an object which understands the meaning of that string (e.g. as an actual date), like so:
  - `my_date = datetime.strptime('8/1/2014', '%m/%d/%Y')`
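
A short sketch of why the conversion is worth the trouble: once parsed, dates compare chronologically, whereas the raw strings compare character by character. The second date below is made up purely for the comparison, and `DATE_FORMAT` is just a convenience name.

```python
from datetime import datetime

DATE_FORMAT = '%m/%d/%Y'  # month/day/four-digit year, as on the bills page

first_str = '8/1/2014'
second_str = '12/15/2014'  # made-up date, used only for comparison

first = datetime.strptime(first_str, DATE_FORMAT)
second = datetime.strptime(second_str, DATE_FORMAT)

# As datetime objects, August correctly comes before December...
dates_in_order = first < second
# ...but as plain strings, '8' sorts after '1', so the order is wrong
strings_in_order = first_str < second_str

# The parsed object exposes the individual parts of the date
parts = (first.year, first.month, first.day)
```

Storing the parsed object in each bill tuple means later steps, such as sorting bills by last action date, need no extra parsing.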




In [4]:
# This function returns True if the bill in question is a Senate Bill ('SB') followed by a number
def is_valid_senatebill_id(bill_id_str):
    match = re.match('SB[0-9]+$',bill_id_str)
    if match: 
        return True
    else:
        return False

def get_bills(url):
    # your code here
    pass

# uncomment to test your code:
#print get_bills('http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2018')[0:5]

'''

def get_bills(url):
    src = requests.get(url).text
    soup = BeautifulSoup(src)
    rows = soup.select('tr')
    bills = []
    for row in rows:
        # keep only rows made up of five 'billlist' cells
        detailCells = row.select('td.billlist')
        if len(detailCells) != 5:
            continue
        rowData = [cell.text for cell in detailCells]
        # skip anything whose id is not 'SB' followed by a number
        if not is_valid_senatebill_id(rowData[0]):
            continue
        bill_id = rowData[0]
        description = rowData[1]
        chamber = rowData[2]
        last_action = rowData[3]
        last_action_date = datetime.strptime(rowData[4],'%m/%d/%Y')
        bills.append((bill_id,description,chamber,last_action,last_action_date))
    return(bills)
'''

`6`. Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value). You can do this by looping over all of the senate members and calling `get_bills()` for each of their associated bill URLs.

NOTE: please call the function `time.sleep(0.5)` for each iteration of the loop, so that we don't destroy the state's web site.
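
One possible shape for that loop, as a sketch rather than the required solution: the helper name `build_bills_dict` and its `fetch_bills` and `delay` parameters are made up for this illustration. Passing the fetching function in as an argument lets you try the loop with a stand-in before pointing it at the real site.

```python
import time

def build_bills_dict(members, fetch_bills, delay=0.5):
    """Hypothetical helper: map district number -> list of bills.

    members     -- 4-tuples of (name, district, party, bills_url)
    fetch_bills -- function taking a URL and returning a list of bills
    delay       -- seconds to pause after each request
    """
    bills_dict = {}
    for name, district, party, bills_url in members:
        bills_dict[district] = fetch_bills(bills_url)
        time.sleep(delay)  # pause so we don't hammer the server
    return bills_dict

# Dry run with a stand-in fetcher and no delay; nothing is downloaded here
def fake_fetch(url):
    return ['bills from ' + url]

demo = build_bills_dict([('Senator A', 1, 'D', 'url-1'),
                         ('Senator B', 2, 'R', 'url-2')],
                        fake_fetch, delay=0)
```

With the functions from the earlier questions in place, the real call would look like `bills_dict = build_bills_dict(senateMembers, get_bills)`.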

`7`. Extra credit / Challenge: Make a similar list of tuples for the information on California state senators, [here](http://senate.ca.gov/senators)

In [None]:
'''
req = requests.get('http://senate.ca.gov/senators') # make a GET request
src = req.text # read the content of the server's response
soup = BeautifulSoup(src) # parse the HTML so we can use select()

rows = soup.select(".views-row")

members = []
for row in rows:
    detailCells = row.select('.views-field')
    # get the data
    rowData = [cell.text for cell in detailCells]
    name = rowData[1]
    district = rowData[2]
    if '(D)' in name:
        party = "D"
    else:
        party = "R"
    members.append((name,district,party))