# WIM Workshop: API-Webscraping with Python

* Date: Nov 3, 2023
* Instructors: Eehyun Kim (eehkim@iu.edu), Anne Kavalerchik (akavaler@iu.edu)

## Workflow
1. Read API documentation - Check the API limit
2. Import packages
3. Build get request
4. Send get request – check server response
<br><font color=green>200 – OK</font>
<br><font color=orange>404 – data not found</font>
<br><font color=red>401 – unauthorized</font>
<br><font color=red>429 – too many requests</font>
5. Explore data structures
<br> lists, dictionaries
6. Save data
<br> e.g. csv
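
The status codes listed under step 4 can be kept in a small lookup table so that a script reports what went wrong instead of a bare number. This is only a minimal sketch: `STATUS_MEANINGS` and `describe_status` are illustrative names made up for this example, not part of `requests` or of the API.

```python
# Meanings of the status codes listed in the workflow above.
STATUS_MEANINGS = {
    200: "OK",
    401: "unauthorized (check your API key)",
    404: "data not found (check the URL)",
    429: "too many requests (slow down)",
}


def describe_status(status_code):
    """Return a readable description of an HTTP status code."""
    return STATUS_MEANINGS.get(status_code, "unexpected status")


# Print a description for the codes above, plus one that is not in the table.
for code in (200, 401, 404, 429, 500):
    print(code, "-", describe_status(code))
```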

## 1. Read API documentation
https://www.propublica.org/datastore/api/propublica-congress-api  <br>
"Usage is limited to 5000 requests per day (rate limits are subject to change)."

See below for: <br>
2. Import packages <br>
3. Authentication key

* [Request a key from ProPublica here](https://www.propublica.org/datastore/api/propublica-congress-api)
    * I saved my API key in a .txt document
    * do not share your API key with anyone (i.e., treat it like a password)
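
One way to keep the key out of the notebook itself is to read it from that text file at run time. The sketch below assumes the key is the only content of the file; the file name `api_key.txt` and the helper names `load_api_key` and `make_credentials` are hypothetical choices for this example.

```python
from pathlib import Path


def load_api_key(path):
    """Read an API key from a text file, dropping surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


def make_credentials(api_key):
    """Build the header dictionary that carries the key."""
    return {"X-API-Key": api_key}


# Usage (assumes a file named api_key.txt sits next to the notebook):
# credentials = make_credentials(load_api_key("api_key.txt"))
```

If the notebook lives in a shared or version-controlled folder, keep the key file out of anything you share.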

## 2. Import packages

#### What you need:
* Python
    * required: `requests`, `json`
    * optional: `pandas`, `pickle`

In [None]:
# packages you need to install
import requests
import json

import pandas as pd

# packages that come with Python
from time import sleep
import pickle
from pprint import pprint

In [None]:
credentials = {'X-API-Key':'kjfneawlfunwaedklmalkmdwql;kdmwelkcm'}  # <- not my real API key (I just slammed my keyboard)

## 3. Build get request

[Return to the documentation here for specifically working with the `members` data set](https://projects.propublica.org/api-docs/congress-api/members/).

This tells us that the structure of a request should look like:

`GET https://api.propublica.org/congress/v1/{congress}/{chamber}/members.json`

We use version 1 (v1) of the API, since that is the only one available. We are requesting the __117th__ Congress (2021-2023), and specifically the __House__.

In [None]:
base = "https://api.propublica.org/congress/v1"  # no trailing slash: each part below starts with "/"
congress_num = "/117"
chamber = "/house"
data_section = "/members.json"
print(base + congress_num + chamber + data_section)

We can also write the request URL as a template with `{}` placeholders and fill them in with `.format()`. This is more readable and flexible.

In [None]:
request_url = "https://api.propublica.org/congress/v1/{congress}/{chamber}/members.json"
request_url.format(congress='117', chamber='house')

If we click that link, we get an "Unauthorized" message because we have not yet sent our API key. We will send it in the next step.

Calling `requests.get()` on this URL without a key returns a <font color=red>401</font> response. The API key must be passed in the request headers.

In [None]:
print(request_url.format(congress='117', chamber='house'))
requests.get(request_url.format(congress='117', chamber='house'))
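
Wrapping the URL template in a function makes it harder to mistype a chamber name. This is a sketch only: `build_members_url` is a hypothetical helper, and the check assumes that `house` and `senate` are the only valid chamber values, which should be confirmed against the API documentation.

```python
MEMBERS_URL = "https://api.propublica.org/congress/v1/{congress}/{chamber}/members.json"


def build_members_url(congress, chamber):
    """Fill in the members URL template after checking the chamber name."""
    chamber = chamber.lower()
    # Assumption: these are the only two chambers the endpoint accepts.
    if chamber not in ("house", "senate"):
        raise ValueError(f"unknown chamber: {chamber!r}")
    return MEMBERS_URL.format(congress=congress, chamber=chamber)


print(build_members_url(117, "House"))
```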

## 4. Send `get` request (with headers) – check server response

In [None]:
response = requests.get(request_url.format(congress='117', chamber='house'), headers=credentials)
print(response)

We can also print out the status code like this:

In [None]:
response.status_code

A `200` response is what we want! Response objects from `requests` have a built-in `.json()` method that decodes a JSON body. Since the status code shows that the request succeeded, we decode the response and explore the data structure in the next step.

We save it in the variable `members`.

In [None]:
if response.status_code==200:
    members = response.json()
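
With a bare `if`, a failed request leaves `members` undefined (or holding stale data from an earlier run) without any warning. One possible way to make failures visible is to raise an error for anything other than `200`. In this sketch, `parse_response` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real responses so the example runs without a network call or an API key.

```python
from types import SimpleNamespace


def parse_response(response):
    """Return decoded JSON for a 200 response; raise an error otherwise."""
    if response.status_code == 200:
        return response.json()
    raise RuntimeError(f"Request failed with status {response.status_code}")


# Stand-in objects that mimic the two attributes used above.
ok = SimpleNamespace(status_code=200, json=lambda: {"status": "OK", "results": []})
denied = SimpleNamespace(status_code=401, json=lambda: {})

members_demo = parse_response(ok)
print(members_demo["status"])

try:
    parse_response(denied)
except RuntimeError as err:
    print(err)
```

With a real `requests` response, `response.raise_for_status()` offers similar behavior for 4xx and 5xx codes.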

## 5. Explore data structures

We can use a range of tactics to explore this data structure. Like most `json` files, it is rather nested. There may also be documentation that tells us about the structure of returned requests.

In [None]:
members

In [None]:
print(len(members))
print(type(members))
print(members.keys())

You can use `pprint` ("pretty print") for a more nicely formatted json object

In [None]:
pprint(members['results'][0])

In [None]:
members['results'][0].keys()

In [None]:
print(len(members['results']))
print(len(members['results'][0].keys()))
print(members['results'][0]['congress'])

In [None]:
type(members['results'][0]['members'])

In [None]:
len(members['results'][0]['members'])

In [None]:
pprint(members['results'][0]['members'])

In [None]:
pprint(members['results'][0]['members'][0])
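
The cell-by-cell exploration above can also be condensed into one recursive helper that reports the type and size at each level. This is a rough sketch: `summarize` is a hypothetical function, it looks only at the first item of each list, and `toy` is made-up data that imitates the shape of `members`, not real API output.

```python
def summarize(obj, depth=0):
    """Return lines describing the types and sizes inside nested JSON data."""
    indent = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{indent}dict with {len(obj)} keys"]
        for key, value in obj.items():
            lines.append(f"{indent}  {key!r}:")
            lines.extend(summarize(value, depth + 2))
        return lines
    if isinstance(obj, list):
        lines = [f"{indent}list of {len(obj)} items"]
        if obj:
            # Describe only the first item, assuming the rest look alike.
            lines.extend(summarize(obj[0], depth + 1))
        return lines
    return [f"{indent}{type(obj).__name__}"]


# Made-up data with the same nesting as the real response.
toy = {
    "status": "OK",
    "results": [
        {"congress": "117", "members": [{"id": "X000001", "first_name": "Ada"}]}
    ],
}

lines = summarize(toy)
print("\n".join(lines))
```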

## 6. Save data

In [None]:
df = pd.DataFrame(members['results'][0]['members'])
df.shape

In [None]:
df

In [None]:
df.to_csv("congress_house_117.csv")  # file name matches the 117th Congress requested above

## Health bills

We can look at a different dataset - `bills` instead of `members`. See documentation [here](https://projects.propublica.org/api-docs/congress-api/bills/)


Their syntax for a request to search for bills that include a particular keyword (query) is:

`https://api.propublica.org/congress/v1/bills/search.json?query={query}`

Their syntax for a request to search for bills in a particular subject is:

`https://api.propublica.org/congress/v1/bills/subjects/{subject}.json`

Let's call and see what this yields:

In [None]:
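# URL template for the subject search shown above
bill_request_url = 'https://api.propublica.org/congress/v1/bills/subjects/{subject}.json'
r = requests.get(bill_request_url.format(subject='health'),
                 headers=credentials)
bills_20 = r.json()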

In [None]:
pprint(bills_20)

That was only the first $20$ results. We need to use __pagination__ to go through successive pages.

Per the API documentation, we will use the `offset` term to "offset" the _first 20_ results, to get the _next 20_...

`https://api.propublica.org/congress/v1/bills/subjects/{subject}.json?offset={offset}`

In [None]:
bill_request_url = 'https://api.propublica.org/congress/v1/bills/subjects/{subject}.json?offset={offset}'

In [None]:
r = requests.get(bill_request_url.format(subject='health', offset='20'), headers=credentials)
bills_40 = r.json()

In [None]:
pprint(bills_40)

Here we'll use a `for` loop. `range(0, 100, 20)` yields the offsets $0$, $20$, $40$, $60$, and $80$: it counts up from $0$ in steps of $20$ and stops before $100$, which covers the first $100$ results.

In [None]:
for i, offset in enumerate(range(0, 100, 20)):
    print(i, offset)

We'll save everything in a dictionary

In [None]:
bills_100 = dict()

Go through this for loop and make a request, increasing the offset each time:

In [None]:
for i, offset in enumerate(range(0, 100, 20)):
    response = requests.get(bill_request_url.format(subject='health',
                                                    offset=str(offset)), headers=credentials)
    if response.status_code==200:  # it would be a good idea to pad this with some exceptions!
        bills = response.json()
        bills_100[i] = bills
        
    sleep(1)  # this makes it wait 1 second after each request!
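
The loop above silently skips any page that does not return `200`. One possible way to add the error handling that the comment suggests is to retry after a `429` and keep a list of offsets that failed. In this sketch, `collect_pages` and `fake_fetch` are hypothetical names, and the retry policy (wait longer after each `429`, give up after three tries) is an assumption rather than anything the API documentation prescribes. The fake fetch function stands in for `requests.get` so the example runs without a network call.

```python
from time import sleep


def collect_pages(fetch_page, offsets, pause=1.0, max_retries=3):
    """Call fetch_page(offset) for each offset and collect successful pages.

    fetch_page must return a (status_code, data) tuple.
    Returns a dict of pages keyed by offset and a list of failed offsets.
    """
    pages = {}
    failed = []
    for offset in offsets:
        for attempt in range(max_retries):
            status, data = fetch_page(offset)
            if status == 200:
                pages[offset] = data
                break
            if status == 429:
                sleep(pause * (attempt + 1))  # wait longer after each 429
                continue
            break  # other errors (401, 404, ...) are not retried
        if offset not in pages:
            failed.append(offset)
        sleep(pause)  # pause between offsets, as in the loop above
    return pages, failed


# Fake fetch: offset 20 is rate-limited once, offset 60 is never found.
calls = []


def fake_fetch(offset):
    calls.append(offset)
    if offset == 20 and calls.count(20) == 1:
        return 429, None
    if offset == 60:
        return 404, None
    return 200, {"offset": offset}


pages, failed = collect_pages(fake_fetch, range(0, 100, 20), pause=0)
print(sorted(pages), failed)  # prints [0, 20, 40, 80] [60]
```

To use it with the real API, `fetch_page` could wrap the `requests.get(...)` call from the loop above and return `(response.status_code, response.json())` on success.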

In [None]:
pd.DataFrame(bills_100)

In [None]:
with open("bills_100.json", "w") as outfile:
    json.dump(bills_100, outfile, indent=4)

To see how to clean up this `json`, refer to the materials from the Intro to Python workshop!
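
As a starting point for that cleanup, the pages can be merged into one flat list of records. This sketch assumes that each page stores its records in a list under the key `results`; check that against the `pprint` output above before relying on it. `flatten_pages` is a hypothetical helper, and `toy_pages` is made-up data, not real API output.

```python
def flatten_pages(pages):
    """Combine the 'results' lists from every page into one list of records."""
    records = []
    for key in sorted(pages):
        # Assumption: each page keeps its records under the key "results".
        records.extend(pages[key].get("results", []))
    return records


# Made-up pages shaped like the dictionary built in the loop above.
toy_pages = {
    0: {"status": "OK", "results": [{"bill_id": "a"}, {"bill_id": "b"}]},
    1: {"status": "OK", "results": [{"bill_id": "c"}]},
    2: {"status": "OK"},  # a page without results is skipped
}

all_bills = flatten_pages(toy_pages)
print(len(all_bills), [bill["bill_id"] for bill in all_bills])
```

A flat list of dictionaries like `all_bills` can be passed directly to `pd.DataFrame`, giving one row per record. Note that `json.dump` writes the integer page keys as strings, so a dictionary reloaded from `bills_100.json` will have keys such as `"0"` rather than `0`.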