# Accessing ProPublica's Congressional API

ProPublica provides API access to legislative data from the House, Senate and Library of Congress. This notebook explains the major functionality needed for our group project; for more information, see __[ProPublica's API documentation](https://projects.propublica.org/api-docs/congress-api/)__. Note that you'll need an API key to access the data, and each key is limited to 5,000 requests per day. You can sign up at __[ProPublica's Data Store](https://www.propublica.org/datastore/api/propublica-congress-api)__.

The Congressional API includes:
- Roll-call vote data (1991 onward for House; 1989 onward for Senate)
- Member data
- Bill data (since 1995)
- Floor actions
- Committee data (including committee membership)
- Personal explanation (related to missed votes)
- Nomination data (2001 onward)
- Other information

For our purposes we are most interested in the roll-call vote, member, bill and committee data. This document will include a section explaining how to access each of these API endpoints and a description of the underlying data structure returned.

## Base URL and Error Overview

The ProPublica API uses a common base URL for its different API endpoints. The common URL is: <font color = blue>https://api.propublica.org/congress/v1/</font>. Each of the sections below describes the additional path components needed to query a given API endpoint.

The API also uses the following error codes that are universal across the API endpoints and can be used to check whether a request failed and determine the reason.

- 400 : Bad request (improperly formed)
- 403 : Forbidden (request doesn't have authorization header)
- 404 : Not found (specified record can't be found)
- 406 : Not acceptable (you requested a format that is not JSON or XML)
- 500 : Internal server error (ProPublica had server issue; try again later)
- 503 : Service unavailable (service is currently down; try again later)
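
A small lookup can turn these status codes into readable messages when checking a response. The sketch below is illustrative only: `ERROR_MEANINGS`, `RETRYABLE` and `describe_status` are hypothetical names, not part of the ProPublica API or the `requests` library.

```python
# Hypothetical helper: map the error codes listed above to readable messages.
ERROR_MEANINGS = {
    400: "Bad request (improperly formed)",
    403: "Forbidden (request doesn't have authorization header)",
    404: "Not found (specified record can't be found)",
    406: "Not acceptable (requested format is not JSON or XML)",
    500: "Internal server error (try again later)",
    503: "Service unavailable (try again later)",
}

# Codes for which the list above suggests trying again later.
RETRYABLE = {500, 503}

def describe_status(status_code):
    """Return (ok, message, retryable) for an HTTP status code."""
    if status_code == 200:
        return True, "OK", False
    message = ERROR_MEANINGS.get(status_code, "Unexpected status code")
    return False, message, status_code in RETRYABLE

print(describe_status(403))
```

With a `requests` response object, this could be called as `describe_status(resp.status_code)` before trying to parse the JSON body.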


## Congressional Members 

To start acquiring data on Congressional members it is first useful to retrieve a list of all the members. This is accomplished by adding the Congressional session and chamber to the base API URL and requesting that members.json be returned. 

**The format is: Base URL + {congress}/{chamber}/members.json**

For example, the base URL with <font color=green>115/senate/members.json</font> would request the list of senators in the 115th Congress, while <font color=green>115/house/members.json</font> would request members of the House of Representatives from the 115th Congress. 
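
The URL format above can be assembled with a small helper. This is a minimal sketch; `members_url` is a hypothetical name introduced here for illustration, and the chamber check simply reflects the two values used in this notebook.

```python
BASE_URL = "https://api.propublica.org/congress/v1/"

def members_url(congress, chamber):
    """Build the member-list URL: Base URL + {congress}/{chamber}/members.json."""
    chamber = chamber.lower()
    if chamber not in ("house", "senate"):
        raise ValueError("chamber must be 'house' or 'senate'")
    return f"{BASE_URL}{congress}/{chamber}/members.json"

print(members_url(115, "senate"))
# https://api.propublica.org/congress/v1/115/senate/members.json
```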

The code below illustrates how one could return this information easily via Python. In practice, it is necessary to loop through multiple Congressional sessions and to retrieve data for both the House and the Senate. Since legislators can serve in multiple sessions and hold different roles in those sessions, we'll need to look for potential duplicates in the returned data we plan to use for our project.

There is also an API endpoint to get additional information for a specific member, including the member's roles on committees. However, the committee information can probably be pulled more easily via the committee API, and beyond that it doesn't seem like we'll need the individual member query much.
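
One way to handle the duplicates mentioned above is to keep a single record per member, for example the one from the most recent Congress. The sketch below assumes each member record is a dict with `id` and `congress` keys (as in the list built in the next cell); `latest_record_per_member` is a hypothetical helper and the sample IDs are made up.

```python
def latest_record_per_member(members):
    """Keep one record per member id, preferring the highest congress number."""
    latest = {}
    for member in members:
        key = member["id"]
        # congress arrives as a string, so compare numerically
        if key not in latest or int(member["congress"]) > int(latest[key]["congress"]):
            latest[key] = member
    return list(latest.values())

# Made-up sample records for illustration only
sample = [
    {"id": "A000001", "congress": "113", "chamber": "House"},
    {"id": "A000001", "congress": "114", "chamber": "Senate"},
    {"id": "B000002", "congress": "113", "chamber": "House"},
]
unique_members = latest_record_per_member(sample)
print(len(unique_members))  # 2
```

Whether to collapse records this way depends on the analysis: for session-level questions it may be better to keep one row per member per Congress and only drop exact repeats.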

In [1]:
import itertools
import requests
import numpy as np
import pandas as pd
import json

outPath = "INSERT OUTPUT PATH / FILENAME.CSV HERE"
apiKey = "INSERT API KEY HERE"
baseUrl = "https://api.propublica.org/congress/v1/"

# Populate required API options
firstCongress = 113
lastCongress = 115
congress = [str(c) for c in range(firstCongress, lastCongress+1)]
chamber = ["house", "senate"]
endPoint = "members.json"

# Make request for each congress and chamber combination using list comprehension
# This essentially gets all the member data -- although we have to clean it up some
apiRequestList = [requests.get(baseUrl+combo[0]+"/"+combo[1]+"/"+endPoint, headers = {'X-API-Key': apiKey})
                  for combo in itertools.product(congress, chamber)]

members = []

# For each request we'll check the status and then use results to extend list of member JSON
for resp in apiRequestList:
    if resp.status_code == requests.codes.ok:
        respJSON = resp.json()
        #if respJSON['status'] == 'OK':
        # Use new names here so the congress and chamber lists above are not overwritten
        respCongress = respJSON['results'][0]['congress']
        respChamber = respJSON['results'][0]['chamber']
        members.extend( [dict(member, congress = respCongress, chamber = respChamber)
                           for member in respJSON['results'][0]['members'] ] )
        
allMemberDF = pd.DataFrame(members)

print(allMemberDF.dtypes)
display(allMemberDF.head())
# We'll write out the 
## The at_large, district, and geoid columns are present only for house members
## The Senate has lis_id, senate_class and state_rank in addition to house columns

api_uri                  object
at_large                 object
chamber                  object
congress                 object
contact_form             object
crp_id                   object
cspan_id                 object
date_of_birth            object
district                 object
dw_nominate             float64
facebook_account         object
fax                      object
fec_candidate_id         object
first_name               object
gender                   object
geoid                    object
google_entity_id         object
govtrack_id              object
icpsr_id                 object
id                       object
ideal_point             float64
in_office                  bool
last_name                object
last_updated             object
leadership_role          object
lis_id                   object
middle_name              object
missed_votes            float64
missed_votes_pct        float64
next_election            object
ocd_id                   object
office  

Unnamed: 0,api_uri,at_large,chamber,congress,contact_form,crp_id,cspan_id,date_of_birth,district,dw_nominate,...,state_rank,suffix,title,total_present,total_votes,twitter_account,url,votes_with_party_pct,votesmart_id,youtube_account
0,https://api.propublica.org/congress/v1/members...,False,House,113,,N00035451,76386,1946-05-27,12,-0.468,...,,,Representative,0.0,48.0,RepAdams,https://adams.house.gov,97.83,5935.0,
1,https://api.propublica.org/congress/v1/members...,False,House,113,,N00003028,45516,1965-07-22,4,0.361,...,,,Representative,0.0,1192.0,Robert_Aderholt,https://aderholt.house.gov,93.99,441.0,RobertAderholt
2,https://api.propublica.org/congress/v1/members...,False,House,113,,,1004256,1946-12-05,5,0.331,...,,,Representative,0.0,490.0,USRepAlexander,,92.47,,RepRodneyMAlexander
3,https://api.propublica.org/congress/v1/members...,False,House,113,,N00031938,1033767,1980-04-18,3,0.649,...,,,Representative,2.0,1192.0,,https://amash.house.gov,77.01,105566.0,repjustinamash
4,https://api.propublica.org/congress/v1/members...,False,House,113,,N00031177,62817,1958-06-12,2,0.376,...,,,Representative,0.0,1192.0,MarkAmodeiNV2,https://amodei.house.gov,94.17,12537.0,markamodeinv2


Of course there needs to be cleanup of some of the data. One item of particular interest to us will be the ability to connect the member data with the OpenSecrets campaign finance data. The ProPublica Congressional API provides the fec_candidate_id, which should enable linking to the OpenSecrets field that contains the same information.

In [111]:
print('There are', allMemberDF[allMemberDF.fec_candidate_id == ''].shape[0], 'members without FEC Candidate Ids!')
print('There are', allMemberDF[allMemberDF.fec_candidate_id != ''].shape[0], 'members with FEC Candidate Ids!')
## We see that 570 records have missing FEC_Candidate_Id numbers!
## 1,094 records have FEC_Candidate_Id numbers. We'll have to try to match things better

# But everyone has a GovTrack Id! 
print('==========================================================')
allMemberDF.loc[allMemberDF.govtrack_id.isna(), "govtrack_id"].head()
allMemberDF.loc[(allMemberDF.last_name == 'Chiesa') & (allMemberDF.first_name == 'Jeffrey'), "govtrack_id"] = "412597"
allMemberDF.loc[(allMemberDF.last_name == 'Jones') & (allMemberDF.first_name == 'Brenda'), "govtrack_id"] = "412752"
print('There are', allMemberDF[allMemberDF.govtrack_id == ''].shape[0], 'members without GovTrack Ids!')
print('There are', allMemberDF[allMemberDF.govtrack_id != ''].shape[0], 'members with GovTrack Ids!')

# Lucky for us we can join in another dataset to get a legislator's open secret id looked up from their govtrack one!
usAPIBaseUrl = ["https://theunitedstates.io/congress-legislators/"]
usAPIEndpoints = ["legislators-current.json", "legislators-historical.json"]
# We can read in some JSON info from https://theunitedstates.io/congress-legislators/legislators-current.json
usAPILeg = [requests.get(combo[0]+combo[1]) for combo in itertools.product(usAPIBaseUrl, usAPIEndpoints)]
usAPIMembers = [member for request in usAPILeg for member in request.json() ]

usAPIMemberDF = pd.DataFrame(usAPIMembers)

MemberIdDF = usAPIMemberDF.id.apply(pd.Series)
MemberIdDF = MemberIdDF.loc[:,['govtrack', 'opensecrets']].rename(columns = {'govtrack':'govtrack_id'})

MemberIdDF.loc[:, 'govtrack_id2'] = MemberIdDF.govtrack_id.astype(int)
MemberIdDF.drop({'govtrack_id'}, axis = 1, inplace = True)

# Create different typed column to join on 
allMemberDF.loc[:, 'govtrack_id2'] = allMemberDF.govtrack_id.astype(int)

# Join in the opensecrets Ids
allMemberDF = pd.merge(allMemberDF, MemberIdDF[MemberIdDF.govtrack_id2.notna()], on = 'govtrack_id2', how = 'left')
#allMemberDF.drop({'govtrack_id2_x', 'govtrack_id2_y'}, axis = 1, inplace = True)

# We can verify that we have open secret ids for everyone now
print('==========================================================')
print('There are', allMemberDF[allMemberDF.opensecrets.isna()].shape[0], 'members without open secret Ids!')
print('There are', allMemberDF[allMemberDF.opensecrets.notna()].shape[0], 'members with open secret Ids!')

print('==========================================================')
print('We need to see if these people have opensecrets Ids')
display(allMemberDF.loc[allMemberDF.opensecrets.isna(), 
                        ["first_name", "last_name", "govtrack_id", "chamber", "congress"]])
# Output the member data
allMemberDF.to_csv(outPath)

There are 570 members without FEC Candidate Ids!
There are 1094 members with FEC Candidate Ids!
There are 0 members without GovTrack Ids!
There are 1664 members with GovTrack Ids!
There are 5 members without open secret Ids!
There are 1659 members with open secret Ids!


Unnamed: 0,first_name,last_name,govtrack_id,chamber,congress
469,Jeffrey,Chiesa,412597,Senate,113
1282,Kevin,Hern,412748,House,115
1393,Joe,Morelle,412749,House,115
1474,Mary,Scanlon,412750,House,115
1545,Susan,Wild,412751,House,115


In addition to the general information on each legislator, we are interested in understanding their ideology. We may want to estimate a custom version of these scores ourselves as we get into the final stages of the project (for instance, to estimate ideology for each topic); in the interim, we can use information from __[VoteView](https://voteview.com/data)__ to gather each legislator's Nominate score.

There are several flavors of Nominate scores, but each is a low-dimensional representation of a legislator's ideology estimated from their roll-call voting history by maximum likelihood. The most recent version, DW-Nominate, updates the original Nominate and W-Nominate scores so that each legislator's ideology estimate can change gradually over time (the earlier versions assume a legislator's ideology is constant).

Most often, a single dimension is used to summarize a legislator's voting ideology. This first dimension is intuitively described as a liberal-to-conservative economic scale and has been shown to explain most of the variance in roll-call votes. A second dimension, often described as capturing the social issues of the day, can also be used.

The VoteView data read in below includes two dimensions, allowing legislator ideology to be described along economic and social dimensions. The dimensions are labeled nominate_dim1 and nominate_dim2. The icpsr codes included in the file should make it possible to join this data to the legislator lookup table collected previously.

In [6]:
votviewMemberInfo = "https://voteview.com/static/data/out/members/HSall_members.csv"
voteviewMemberDF = pd.read_csv(votviewMemberInfo)
display(voteviewMemberDF.head())

Unnamed: 0,congress,chamber,icpsr,state_icpsr,district_code,state_abbrev,party_code,occupancy,last_means,bioname,...,died,nominate_dim1,nominate_dim2,nominate_log_likelihood,nominate_geo_mean_probability,nominate_number_of_votes,nominate_number_of_errors,conditional,nokken_poole_dim1,nokken_poole_dim2
0,1,President,99869,99,0,USA,5000,,,"WASHINGTON, George",...,,,,,,,,,,
1,1,House,4766,1,98,CT,5000,0.0,1.0,"HUNTINGTON, Benjamin",...,1800.0,0.639,0.304,-29.0467,0.708,84.0,12.0,,0.649,0.229
2,1,House,8457,1,98,CT,5000,0.0,1.0,"SHERMAN, Roger",...,1793.0,0.589,0.307,-40.5958,0.684,107.0,18.0,,0.614,0.298
3,1,House,9062,1,98,CT,5000,0.0,1.0,"STURGES, Jonathan",...,1819.0,0.531,0.448,-25.87361,0.724,80.0,13.0,,0.573,0.529
4,1,House,9489,1,98,CT,5000,0.0,1.0,"TRUMBULL, Jonathan, Jr.",...,1809.0,0.692,0.246,-30.47113,0.75,106.0,11.0,,0.749,0.166
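
The join on ICPSR codes described above could be sketched as follows. This is only an illustration under stated assumptions: the two small frames are made-up stand-ins for `allMemberDF` and `voteviewMemberDF`, and it assumes ProPublica's `icpsr_id` and `congress` are stored as strings (both show as `object` in the dtypes output earlier) while VoteView's `icpsr` and `congress` are integers.

```python
import pandas as pd

# Made-up stand-ins for allMemberDF and voteviewMemberDF (values are not real)
propublica = pd.DataFrame({
    "id": ["X000001", "X000002"],
    "icpsr_id": ["11111", None],
    "congress": ["113", "113"],
})
voteview = pd.DataFrame({
    "icpsr": [11111, 22222],
    "congress": [113, 113],
    "nominate_dim1": [0.25, -0.40],
    "nominate_dim2": [0.10, 0.05],
})

# Convert the VoteView keys to strings so both frames share key types
voteview_keys = voteview.assign(
    icpsr_id=voteview["icpsr"].astype(str),
    congress=voteview["congress"].astype(str),
)[["icpsr_id", "congress", "nominate_dim1", "nominate_dim2"]]

# Left join keeps every ProPublica record, matched or not
merged = propublica.merge(voteview_keys, on=["icpsr_id", "congress"], how="left")
print(merged[["id", "congress", "nominate_dim1", "nominate_dim2"]])
```

After a join like this, it is worth comparing the row count with the original frame and counting missing `nominate_dim1` values to see how many legislators failed to match.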


## Bill Information

This section of the notebook explains how to retrieve legislative bill data. One option is to query ProPublica's Congressional API, which includes bill text and other useful information. Unfortunately, the basic requests only return the 20 most recent bills. The API does let you look up a specific bill, but that requires a list of all the bills in a given Congress (i.e., the list of all Senate and House bills in the 114th Congress). There are several ways around this, but the easiest is probably to use the .CSV file of historical roll calls that __[VoteView](https://voteview.com/data)__ makes available to build a list of bills that can be looped through to request data.

The code below uses this approach to get the information on all of these bills. I believe ProPublica has a 5,000 request limit per day, so care should be used in requesting all of these multiple times for testing.

In [114]:
# Note we use the meta data on Bills from VoteView so that we can loop through them to get data from ProPublica
votviewBillInfo = "https://voteview.com/static/data/out/rollcalls/HSall_rollcalls.csv"
voteviewBillDF = pd.read_csv(voteviewBillInfo)
# Take a copy so that adding a column below does not trigger a SettingWithCopyWarning
voteviewBillDF = voteviewBillDF.loc[(voteviewBillDF.congress >= 113) & (voteviewBillDF.bill_number.notna()), :].copy()
voteviewBillDF['congress_bill_number'] = voteviewBillDF.apply(lambda x: (str(x.congress), x.bill_number), axis = 1)

# We get list of unique bill numbers
bills = voteviewBillDF.congress_bill_number.unique()
print('There are', len(bills), 'bills to request from the ProPublica API')

There are 2196 bills to request from the ProPublica API


With a list of unique bill numbers from the 113th congress onward, it is possible to loop through the ProPublica API to request the bill information. The form of the API request is: https://api.propublica.org/congress/v1/{congress}/bills/{bill-id}.json.
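
Given the request form above and the daily request limit mentioned earlier, one option is to cap how many bills are requested in a single run and carry the rest over. The sketch below is an assumption-laden illustration: `bill_url` and `fetch_within_budget` are hypothetical helpers, and `fetch` stands for any callable that takes a URL and returns parsed JSON (or `None` on failure). A fake `fetch` is used here so the example runs without a network call or API key.

```python
BASE_URL = "https://api.propublica.org/congress/v1/"

def bill_url(congress, bill_number):
    """Build the bill URL: {congress}/bills/{bill-id}.json on the base URL."""
    return f"{BASE_URL}{congress}/bills/{bill_number}.json"

def fetch_within_budget(bills, fetch, max_requests=5000):
    """Request bills until the budget is used up.

    Returns (results, failed, remaining), where remaining holds the
    bills that were not requested because the budget ran out.
    """
    results, failed = [], []
    todo = list(bills)
    for congress, bill_number in todo[:max_requests]:
        payload = fetch(bill_url(congress, bill_number))
        if payload and "results" in payload:
            results.append(payload["results"][0])
        else:
            failed.append((congress, bill_number))
    return results, failed, todo[max_requests:]

# Fake fetch for illustration: only one bill "exists"
def fake_fetch(url):
    if url.endswith("/HR41.json"):
        return {"results": [{"bill": "HR41"}]}
    return None

sample_bills = [("113", "HR41"), ("113", "S1"), ("114", "HR2")]
results, failed, remaining = fetch_within_budget(sample_bills, fake_fetch, max_requests=2)
print(len(results), failed, remaining)
```

In the notebook, `fetch` could wrap `requests.get(url, headers={'X-API-Key': apiKey})` and return `resp.json()` when the status code is OK; the `remaining` list can then be saved and requested on a later day.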

In [113]:
# Request form: https://api.propublica.org/congress/v1/{congress}/bills/{bill-id}.json
# Example: https://api.propublica.org/congress/v1/113/bills/HR41.json
apiKey = "INSERT API KEY HERE"
baseUrl = "https://api.propublica.org/congress/v1/"
# Make a request for each (congress, bill number) pair using a list comprehension
# Remember that each bill in the list bills is a tuple where bill[0] is congressional session and bill[1] is bill number
apiRequestListBillz = [requests.get(baseUrl+bill[0]+"/bills/"+bill[1]+".json", headers = {'X-API-Key': apiKey}) 
                      for bill in bills]


billz = []
for i,resp in enumerate(apiRequestListBillz):
    if (resp.status_code == requests.codes.ok):
        respJSON = resp.json()
        if 'results' in respJSON:
            respJSON = respJSON['results'][0]
            billz.append(respJSON)
        else: pass #print('Results not returned for request', i)
    else:
        print('Request', i, 'failed')

In [136]:
billz = []
failed_bill_requests = []
for i,resp in enumerate(apiRequestListBillz):
    if (resp.status_code == requests.codes.ok):
        respJSON = resp.json()
        if 'results' in respJSON:
            respJSON = respJSON['results'][0]
            billz.append(respJSON)
        else: failed_bill_requests.append(i) #print('Results not returned for request', i)
    else:
        print('Request', i, 'failed')
        failed_bill_requests.append(i)
        
failed_bill_requests = bills[failed_bill_requests]
failed_bill_df = pd.DataFrame(failed_bill_requests)
failed_bill_df.to_csv("OUTPUT LOCATION / FILENAME.CSV HERE")

In [102]:
# NOTE: THIS DOESN'T NEED TO BE RUN -- THIS WAS JUST TO GET THE UNIQUE BILL SUBJECTS
outBillSubjectPath = "INSERT OUTPUT PATH / FILENAME.CSV HERE"

bill_subjects = []
# Loop over the bill responses (not the member responses) since primary_subject is a bill field
for i,resp in enumerate(apiRequestListBillz):
    if (resp.status_code == requests.codes.ok):
        respJSON = resp.json()
        if 'results' in respJSON:
            respJSON = respJSON['results'][0]
            bill_subjects.append(respJSON['primary_subject'])
        else: pass #print('Results not returned for request', i)
    else:
        print('Request', i, 'failed')
            
bill_subjects = list(set(bill_subjects))
bill_subjects = pd.DataFrame(bill_subjects)
bill_subjects.to_csv(outBillSubjectPath)
 
