# WIM Workshop: API-Webscraping with Python

* Date: Nov 3, 2023
* Instructors: Eehyun Kim (eehkim@iu.edu), Anne Kavalerchik (akavaler@iu.edu)

## Workflow
1. Read API documentation - Check the API limit
2. Import packages
3. Build get request
4. Send get request – check server response
<br><font color=green>200 – OK</font>
<br><font color=orange>404 – data not found</font>
<br><font color=red>401 – unauthorized</font>
<br><font color=red>429 – too many requests</font>
5. Explore data structures
<br> lists, dictionaries
6. Save data
<br> e.g. csv

### 1. Read API documentation
https://www.propublica.org/datastore/api/propublica-congress-api  <br>
"Usage is limited to 5000 requests per day (rate limits are subject to change)."

See below for the package imports and the authentication key.

* [Request a key from ProPublica here](https://www.propublica.org/datastore/api/propublica-congress-api)
    * I saved my API key in a .txt document
    * do not share your API key with anyone (i.e., treat it like a password)
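
One way to keep the key out of the notebook itself is to read it from that text file at run time. The following is a minimal sketch, assuming the file contains only the key; the file name `propublica_key.txt` and the helper names `load_api_key` and `build_headers` are made up for illustration.

```python
from pathlib import Path


def load_api_key(path):
    """Return the key stored in a plain-text file, without surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


def build_headers(api_key):
    """Put the key in the X-API-Key header, as in the credentials dictionary."""
    return {"X-API-Key": api_key}


key_file = Path("propublica_key.txt")  # hypothetical file name
if key_file.exists():
    credentials = build_headers(load_api_key(key_file))
    print("Loaded a key of length", len(credentials["X-API-Key"]))
else:
    print("No key file found; save your key in", key_file, "first")
```

Stripping whitespace matters because a trailing newline in the file would otherwise become part of the key.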

### 2. Import packages

#### What you need:
* Python
    * required: `requests`, `json`
    * optional: `pandas`, `pickle`

In [None]:
# packages you need to install
import requests
import pandas as pd

# packages that come with Python
import json
from time import sleep
from pprint import pprint

In [None]:
credentials = {'X-API-Key':'asdlkmasdlkmsdl;kmw'}  # <- not my real API key (I just slammed my keyboard)

### 3. Build get request

[Return to the documentation here for specifically working with the `members` data set](https://projects.propublica.org/api-docs/congress-api/members/).

This tells us that the structure of a request should look like:

`GET https://api.propublica.org/congress/v1/{congress}/{chamber}/members.json`

We use version 1 (v1) of the API, since it is the only one available. We are requesting the __117th__ Congress, and specifically the __House__.

If we tried to use `requests.get()` on that link alone, we would get a <font color=red>401</font> response. We need to pass our API key in the request headers.
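
The URL pattern above can be filled in with an f-string. This is a small sketch; `members_url` is a hypothetical helper name, and the check on `chamber` assumes the only valid values are `house` and `senate`.

```python
BASE_URL = "https://api.propublica.org/congress/v1"


def members_url(congress, chamber):
    """Fill in the {congress} and {chamber} slots of the members endpoint."""
    chamber = chamber.lower()
    if chamber not in ("house", "senate"):
        raise ValueError("chamber must be 'house' or 'senate'")
    return f"{BASE_URL}/{congress}/{chamber}/members.json"


url = members_url(117, "House")
print(url)
```

Building the URL in one place makes it easy to switch to another Congress or to the Senate later.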

### 4. Send `get` request (with headers) – check server response
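
A minimal sketch of this step, assuming a `credentials` dictionary like the one defined in step 2. The helper name `send_get` and the 10-second timeout are illustrative choices, not part of the API.

```python
def send_get(url, headers, timeout=10):
    """Send a GET request with the API key in the headers and return the response."""
    # Imported here so that defining the helper does not require the package.
    import requests  # third-party; install with `pip install requests`

    return requests.get(url, headers=headers, timeout=timeout)


url = "https://api.propublica.org/congress/v1/117/house/members.json"
credentials = {"X-API-Key": "your-key-here"}  # placeholder, not a real key

# Uncomment to contact the API (needs a valid key and network access):
# response = send_get(url, credentials)
```

In the notebook, the result of the call would be stored in `response`, which the following cells inspect.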

We can also print out the status code like this:

In [None]:
response.status_code

Response `200` is what we want! `requests` comes with a built-in JSON decoder. Since the status code indicates that the request succeeded, we decode the response body with `response.json()` and explore the resulting data structure in the next step.

We save the decoded data in the variable `members`.

In [None]:
if response.status_code==200:
    members = response.json()
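
The other status codes from the workflow list can be handled explicitly rather than silently skipped. This is a sketch under one assumption: only a 429 response is worth retrying after a pause. The names `describe_status` and `should_retry` are hypothetical.

```python
STATUS_MEANINGS = {
    200: "OK",
    401: "unauthorized (check the X-API-Key header)",
    404: "data not found (check the URL)",
    429: "too many requests (slow down)",
}


def describe_status(status_code):
    """Translate a status code into a short human-readable message."""
    return STATUS_MEANINGS.get(status_code, "unexpected status")


def should_retry(status_code):
    """Only a rate-limit response is worth retrying after a pause."""
    return status_code == 429


for code in (200, 401, 404, 429, 500):
    print(code, "-", describe_status(code), "| retry:", should_retry(code))
```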

### 5. Explore data structures

We can use a range of tactics to explore this data structure. Like most `json` files, it is rather nested. There may also be documentation that tells us about the structure of returned requests.

You can use `pprint` ("pretty print") to display a json object in a more readable format.
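
Here is a sketch of a few such tactics. The `members` dictionary below is a made-up stand-in for the decoded response: only the nesting `results` → first element → `members` mirrors what this notebook uses, and the field names inside each record are placeholders.

```python
from pprint import pprint

# Made-up sample data; the real response has many more fields.
members = {
    "results": [
        {
            "members": [
                {"first_name": "Ada", "last_name": "Example", "state": "IN"},
                {"first_name": "Ben", "last_name": "Sample", "state": "OH"},
            ]
        }
    ]
}

print(type(members))                          # the outermost container
print(list(members.keys()))                   # top-level keys
print(type(members["results"]))               # what sits under "results"
print(len(members["results"][0]["members"]))  # number of member records

first = members["results"][0]["members"][0]
print(sorted(first.keys()))                   # fields available per member

pprint(members, depth=2)                      # depth limits the levels printed
```

Checking `type()` and `len()` at each level tells us whether to index with a key (dictionary) or a position (list).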

### 6. Save data

In [None]:
df = pd.DataFrame(members['results'][0]['members'])
df.shape

In [None]:
df

In [None]:
df.to_csv("congress_house_117.csv")
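
Since requests count against the daily limit, it can also help to keep the raw decoded response on disk so that it does not have to be requested again. This is a sketch using the `json` module; the helper names, the file name, and the sample data are made up, and a temporary directory is used only to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def save_json(data, path):
    """Write a decoded API response to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)


def load_json(path):
    """Read a previously saved response back into Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


sample = {"results": [{"members": [{"last_name": "Example"}]}]}  # made-up data

with tempfile.TemporaryDirectory() as folder:
    out_path = Path(folder) / "members_raw.json"  # hypothetical file name
    save_json(sample, out_path)
    restored = load_json(out_path)

print(restored == sample)
```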

## Health bills

We can look at a different dataset: `bills` instead of `members`. See the documentation [here](https://projects.propublica.org/api-docs/congress-api/bills/).


Their syntax for a request to search for bills that include a particular keyword (query) is:

`https://api.propublica.org/congress/v1/bills/search.json?query={query}`

Their syntax for a request to search for bills in a particular subject is:

`https://api.propublica.org/congress/v1/bills/subjects/{subject}.json`

Let's call and see what this yields from the subject "health":
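
A sketch of how both URL patterns can be built and the call made, assuming the same `credentials` dictionary as before. The helper names are hypothetical, and `quote` is used so that a keyword containing spaces is encoded safely.

```python
from urllib.parse import quote

BASE_URL = "https://api.propublica.org/congress/v1"


def bills_search_url(query):
    """URL for bills that include a particular keyword."""
    return f"{BASE_URL}/bills/search.json?query={quote(query)}"


def bills_subject_url(subject):
    """URL for bills in a particular subject."""
    return f"{BASE_URL}/bills/subjects/{quote(subject)}.json"


def get_json(url, headers, timeout=10):
    """Send the request and return the decoded body, raising on an error status."""
    import requests  # third-party; imported here so the URL helpers run without it

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.json()


print(bills_subject_url("health"))
print(bills_search_url("health care"))

# Uncomment to contact the API (needs a valid key and network access):
# health_bills = get_json(bills_subject_url("health"), credentials)
```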

That was only the first $20$ results. We need to use __pagination__ to go through successive pages.

Per the API documentation, we will use the `offset` term to "offset" the _first 20_ results, to get the _next 20_...

`https://api.propublica.org/congress/v1/bills/subjects/{subject}.json?offset={offset}`

Here we'll use a `for` loop. The `range` function means that we iterate from $0$ up to (but not including) $100$ in increments of $20$, which gives the offsets $0, 20, 40, 60, 80$.

In [None]:
for i, offset in enumerate(range(0, 100, 20)):
    print(i, offset)

We'll save everything in a dictionary

In [None]:
bills_100 = dict()

Go through this for loop and make a request, increasing the offset each time:
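
One possible version of this loop is sketched below. It assumes the `credentials` dictionary from earlier; the helper name `fetch_pages`, the one-second pause between requests, and the choice to key the dictionary by offset are illustrative, not prescribed by the API.

```python
from time import sleep

BASE_URL = "https://api.propublica.org/congress/v1"


def fetch_pages(subject, headers, offsets, pause=1.0, timeout=10):
    """Request one page per offset and return {offset: decoded JSON}."""
    import requests  # third-party; install with `pip install requests`

    pages = {}
    for offset in offsets:
        url = f"{BASE_URL}/bills/subjects/{subject}.json?offset={offset}"
        response = requests.get(url, headers=headers, timeout=timeout)
        if response.status_code != 200:
            # Stop early rather than store an error page.
            print("Stopping at offset", offset, "- status", response.status_code)
            break
        pages[offset] = response.json()
        sleep(pause)  # be polite to the server between requests
    return pages


# Uncomment to contact the API (needs a valid key and network access):
# bills_100 = fetch_pages("health", credentials, range(0, 100, 20))
```

Keying the dictionary by offset makes it easy to see which page failed and to request only the missing pages later.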