# Scraping APIs

A site with an API (Application Programming Interface) wants you to scrape it.

Examples abound:

* <a href="https://www.census.gov/data/developers/data-sets.html">U.S. Census APIs</a>
* <a href="https://apps.fas.usda.gov/opendataweb/home">US Agriculture Commodities and Exports</a>
* <a href="https://www.federalregister.gov/developers/documentation/api/v1">Federal Register</a>
* <a href="https://developer.dol.gov/beginners-guide/">Labor Department</a>
* <a href="https://www.eia.gov/">Energy Information Administration</a>

Government sites tend to offer ```CSVs``` for download, but their APIs offer more detailed options for data. They are not trying to hide the data.

Private sites might have APIs, but often charge hefty prices for access beyond a basic number of downloads.

The hardest part of scraping APIs is that they ***ALL HAVE DIFFERENT INSTRUCTIONS*** on how to tap their data.

Today, we'll explore different APIs that each build a different skill:

1. Census health data – **building a simple API call.**
2. USDA commodities exports – **using an API key and targeting specific commodities over several years.**
3. Federal Register – **tapping search terms.**
4. Energy Information Administration – **dealing with pagination.**

What they all have in common:

1. a base URL
2. a query string
3. a query character ```?``` that ties the two together
4. an API key (when the site requires one).

Together, these are known as an ```API endpoint```.

You make an ```API call``` (a request) using the ```API endpoint```.
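
As a minimal sketch of how those parts fit together, the snippet below assembles an endpoint from a base URL and a query string. The URL `https://api.example.com/data` and the parameter names are made-up placeholders, not a real API.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameters, for illustration only
base_url = "https://api.example.com/data"
params = {"year": 2021, "state": "CA", "api_key": "YOUR_KEY"}

# urlencode turns the dictionary into "key=value" pairs joined by "&"
query_string = urlencode(params)

# The "?" character ties the base URL to the query string
api_endpoint = f"{base_url}?{query_string}"
print(api_endpoint)
```

Building the query string from a dictionary keeps each parameter easy to change when you move from one API to the next.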


Today I've provided most of the code, but you will have to build **your own API calls**.




In [None]:
## import libraries
import requests
import pandas as pd
from icecream import ic



### 1. Census health data – **building a simple API call.**

- <a href="https://www.census.gov/data/developers/data-sets/Health-Insurance-Statistics.html">Census health landing page</a>
- List of <a href="https://api.census.gov/data/timeseries/healthins/sahie/variables.html">possible variables</a>

We want to create a dataframe with the following info for every state in 2021:

1. Total number insured
2. Percent insured
3. Total number uninsured
4. Percent uninsured

In [None]:
## the parts to build your API call.

In [None]:
## create a dictionary to know what codes mean


In [None]:
## format into api query format
get_vars = ",".join(target_dict.keys())
get_vars

In [None]:
## other targets


In [None]:
## create query string


In [None]:
## create full API call
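
One possible way to assemble the Census call is sketched below. The base URL comes from the address of the variables page linked above. The variable codes, the `for=state:*` geography and the `time=2021` filter are assumptions to confirm against the variables list before relying on them. The friendly column names are chosen here for illustration.

```python
# Base URL taken from the address of the SAHIE variables page
base_url = "https://api.census.gov/data/timeseries/healthins/sahie"

# Assumed variable codes; confirm each one against the variables list
target_dict = {
    "NAME": "state_name",
    "NIC_PT": "number_insured",
    "PCTIC_PT": "percent_insured",
    "NUI_PT": "number_uninsured",
    "PCTUI_PT": "percent_uninsured",
    "RACECAT": "race_category",
}

# Comma-separated codes go after "get="
get_vars = ",".join(target_dict.keys())

# Assumed geography and year filters
geography = "state:*"
year = 2021

query_string = f"get={get_vars}&for={geography}&time={year}"
api_call = f"{base_url}?{query_string}"
print(api_call)
```

Pasting the printed URL into a browser is a quick way to check the call before passing it to `requests.get`.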


In [None]:
## get response
response = requests.get(api_call)

# Check if the request was successful
if response.status_code == 200:
    data = response.json()
    # Process the data as needed
else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")


In [None]:
## turn response into json
data = response.json()
data

In [None]:
## create dataframe
df= pd.DataFrame(data[1:], columns = data[0])
df
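
The response here is a list of lists whose first item holds the column names, which is why the cell above splits `data[0]` from `data[1:]`. A small sketch with made-up placeholder rows (not real Census figures) shows the same split without pandas:

```python
# Made-up placeholder response in the list-of-lists shape; values are not real
sample = [
    ["NAME", "NIC_PT", "state"],
    ["State A", "100", "01"],
    ["State B", "250", "02"],
]

# First item is the header, everything after it is data
header, rows = sample[0], sample[1:]

# Pair each row's values with the header to get one dictionary per row
records = [dict(zip(header, row)) for row in rows]

# Values arrive as strings, so convert them before doing any math
total_insured = sum(int(record["NIC_PT"]) for record in records)
print(records[0])
print(total_insured)
```

In pandas, `pd.to_numeric` applied to a column does the same string-to-number conversion.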

In [None]:
## rename column headers with more meaningful headers
df.rename(columns = target_dict, inplace = True)
df

In [None]:
## build dictionary to replace codes for race
race_cats = {
      "0": "All Races",
      "1": "White alone, not Hispanic or Latino",
      "2": "Black or African American alone, not Hispanic or Latino",
      "3": "Hispanic or Latino (any race)",
      "4": "American Indian and Alaska Native alone, not Hispanic or Latino",
      "5": "Asian alone, not Hispanic or Latino",
      "6": "Native Hawaiian and Other Pacific Islander alone, not Hispanic or Latino",
      "7": "Two or More Races, not Hispanic or Latino"
    }

In [None]:
## copy race category column to be converted to description in the next cell
df["race_description"] = df["race_category"].copy()
df

In [None]:
## replace codes with descriptions
df["race_description"] = df["race_description"].replace(race_cats)
df

### 2. USDA commodities exports – **using an API key and targeting specific commodities over several years.**

- <a href="https://apps.fas.usda.gov/opendataweb/home">USDA APIs endpoints</a>
- Get an <a href="https://apps.fas.usda.gov/opendataweb/home">API key</a>

We want to create a dataframe with exports from 2020 through 2022 to all countries for the following commodities:

1. All Wheat
2. Oats
3. Cuts of Beef
4. Cuts of Pork

In [None]:
## find the parts to build your API call

In [None]:
## get commodities list


In [None]:
## create a headers dictionary with your API key (use your own)
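
One way to avoid pasting a key into the notebook is to read it from an environment variable. This sketch assumes the key is stored in a variable named `USDA_API_KEY` (a name chosen here for illustration) and that the USDA API expects it in an `API_KEY` header. Check the USDA documentation for the exact header name.

```python
import os

# Hypothetical environment variable name; falls back to a placeholder string
api_key = os.environ.get("USDA_API_KEY", "YOUR-API-KEY")

# Assumed header name; confirm it in the USDA API documentation
headers = {"API_KEY": api_key, "accept": "application/json"}

# Show the header names only, so the key itself is never printed
print(list(headers))
```

Keeping the key out of the code also keeps it out of any notebook you share or push to a repository.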


In [None]:
## now let's put it into a GET request
## we print the response status code (200 means success)
response = requests.get(url = com_url, headers = headers)
print(response.status_code)
all_commodities = response.json()
all_commodities

### Get endpoint and test out on a single commodity


In [None]:
## your endpoint here


In [None]:
## now let's put into get requests
## we check the response status code
response = requests.get(url = endpoint, headers = headers)
response.status_code

In [None]:
## let's store our response into an object called data
data = response.json()
data[-1]

In [None]:
## convert that list of dicts into a dataframe called df
df = pd.DataFrame(data)
df

In [None]:
## Now iterate through all our target items

commodities_dict = [{'commodityCode': 107, 'commodityName': 'All Wheat', 'unitId': 1},
    {'commodityCode': 601, 'commodityName': 'Oats', 'unitId': 1},
    {'commodityCode': 1701,
  'commodityName': 'Fresh, Chilled, or Frozen Muscle Cuts of Beef',
  'unitId': 1},
 {'commodityCode': 1702,
  'commodityName': 'Fresh, Chilled, or Frozen Muscle Cuts of Pork',
  'unitId': 1}]

commodities_dict

In [None]:
## endpoint templates
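
A minimal sketch of the template idea: split the endpoint into the piece before the commodity code and the piece between the code and the year, then fill in both. The URL pieces below are assumptions about the USDA export sales path, so confirm them in the USDA API documentation.

```python
# Assumed URL pieces; confirm the path in the USDA API documentation
start_endpoint = "https://apps.fas.usda.gov/OpenData/api/esr/exports/commodityCode/"
end_endpoint = "/allCountries/marketYear/"

commodity_codes = [107, 601, 1701, 1702]  # codes from the target list above
years = range(2020, 2023)                 # 2020, 2021 and 2022

# One endpoint per commodity per year
endpoints = [
    f"{start_endpoint}{code}{end_endpoint}{year}"
    for code in commodity_codes
    for year in years
]
print(len(endpoints))
print(endpoints[0])
```

Four commodities over three years means twelve calls, which is a useful number to check against once the loop finishes.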



In [None]:
## iterate to get all the data
target_data = []
for commodity in commodities_dict:
    target_commodity = commodity.get("commodityCode")
    print(target_commodity)
    for year in range(2020, 2023):
        print(year)
        endpoint = f"{start_endpoint}{target_commodity}{end_endpoint}{year}"
        print(endpoint)
        try:
            ## get response and raise an error on a bad status code
            response = requests.get(endpoint, headers = headers)
            response.raise_for_status()
            ## store each commodity-year as its own dataframe
            target_data.append(pd.DataFrame(response.json()))
        except (requests.exceptions.RequestException, ValueError) as error:
            print(f"Failed to retrieve data from {endpoint}: {error}")



In [None]:
## call list
target_data[0]

In [None]:
## concat into single df
df = pd.concat(target_data).reset_index(drop = True)
df.info()

In [None]:
## call df
df

In [None]:
## confirm we have all our target commodities
list(df["commodityCode"].unique())

### 3. Federal Register – **tapping search terms.**

We have decades of <a href="https://docs.google.com/spreadsheets/d/130WeumbMZjcoRP4D-1uJ7bM0aKBZzt4N/edit?usp=sharing&ouid=112307892189798608417&rtpof=true&sd=true">SBA Excel files</a> that detail loans given to small businesses to recover after climate disasters. The only information we have about the type of disasters are codes in one of the columns that look like:

- CA-00279
- IL-00051
- NC-00099

The <a href="https://www.federalregister.gov/">Federal Register</a> allows us to search for what these codes stand for. But we can't search for nearly a thousand such disaster codes. When we try to scrape the site, it warns us to use the API instead.

Federal Register <a href="https://www.federalregister.gov/developers/documentation/api/v1#/Federal%20Register%20Documents/get_documents__format_">API documentation</a>

In [None]:
## find the endpoint

#### Test on single endpoint after figuring out how to build API call

In [None]:
## endpoint


In [None]:
## get data
response = requests.get(url)
data = response.json()
data

In [None]:
## type
type(data)

In [None]:
## targeting incidents
content = data.get("results")

content[0].get("abstract")
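
Indexing `content[0]` raises an error when a search returns no results. A small helper can return `None` in that case instead. The function name `first_abstract` and the sample dictionaries below are made up for illustration and are not real Federal Register text.

```python
def first_abstract(data):
    """Return the abstract of the first result, or None if there are no results."""
    results = data.get("results") or []
    if not results:
        return None
    return results[0].get("abstract")

# Made-up placeholder responses, not real Federal Register text
with_result = {"results": [{"abstract": "Placeholder abstract"}]}
without_result = {}

print(first_abstract(with_result))
print(first_abstract(without_result))
```

A `None` value makes it easy to filter out the codes that need a second look once everything is in a dataframe.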

### Iterate through entire list of codes

In [None]:
## normally we would take these from a df column as a list
## build disaster code list
disaster_codes = ["CA-00279","IL-00051", "NC-00099" ]

In [None]:
## provide base url
base_url = "https://www.federalregister.gov/api/v1/documents.json?per_page=20&conditions[docket_id]="

In [None]:
## iterate through all endpoints
incidents_list = []
broken_endpoints = []

for disaster_code in disaster_codes:
    endpoint = base_url + disaster_code
    try:
        ## request this code's endpoint (not the single test url from earlier)
        response = requests.get(endpoint)
        data = response.json()
        content = data.get("results")
        incident_text = content[0].get("abstract")
        incidents_list.append({"disaster_code": disaster_code,
                         "incident_text": incident_text})
    except Exception as error:
        print(f"{disaster_code} threw an error: {error}")
        broken_endpoints.append(disaster_code)

print("Done scraping endpoints")


In [None]:
## call list
incidents_list

### 4. Energy Information Administration – **dealing with pagination.**

From the <a href="https://www.eia.gov/">Energy Information Administration</a>, we want to compile energy generation by type of fuel and region for about 5 days.

We will encounter a limit on the number of items per API call.

Find our API endpoint first.

In [None]:
## your target endpoint


In [None]:
## get response
response = requests.get(endpoint)
data = response.json()
data

In [None]:
## import ceiling division from math



In [None]:
## paginate our API calls
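
The pagination arithmetic can be sketched on its own before making any calls. The numbers below are assumed for illustration. In practice the total row count would come from your first response, if the API reports one, and the per-call limit from the API documentation.

```python
from math import ceil

# Assumed values for illustration; read the real total from your response
total_rows = 12_345      # e.g. int(data["response"]["total"]), if the API reports it
rows_per_page = 5000     # assumed per-call row limit

# Ceiling division: a partly filled last page still needs its own call
total_pages = ceil(total_rows / rows_per_page)

# Each page starts where the previous one stopped
offsets = [page * rows_per_page for page in range(total_pages)]
print(total_pages)
print(offsets)
```

Rounding down here would silently drop the last 2,345 rows, which is why ceiling division is used rather than `//`.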



In [None]:

all_data = []

for page in range(total_pages):
    ic(page)
    offset = page * rows_per_page
    api_url = f"{endpoint}&offset={offset}&length={rows_per_page}"
    ic(offset)
    # Make an API request to the constructed URL
    response = requests.get(api_url)
    data = response.json()

    # Extract and append the data to your main data storage
    all_data.append(data.get("response").get("data"))

In [None]:
## call all data
all_data

In [None]:
## length
len(all_data)

In [None]:
## use itertools to flatten list with nested lists
import itertools

In [None]:
## flatten nested lists
flat_data = list(itertools.chain.from_iterable(all_data))
flat_data

In [None]:
df = pd.DataFrame(flat_data)
df

In [None]:
## state names mapped to their two-digit FIPS codes
fips_codes = {
    'Alabama': '01',
    'Alaska': '02',
    'Arizona': '04',
    'Arkansas': '05',
    'California': '06',
    'Colorado': '08',
    'Connecticut': '09',
    'Delaware': '10',
    'Florida': '12',
    'Georgia': '13',
    'Hawaii': '15',
    'Idaho': '16',
    'Illinois': '17',
    'Indiana': '18',
    'Iowa': '19',
    'Kansas': '20',
    'Kentucky': '21',
    'Louisiana': '22',
    'Maine': '23',
    'Maryland': '24',
    'Massachusetts': '25',
    'Michigan': '26',
    'Minnesota': '27',
    'Mississippi': '28',
    'Missouri': '29',
    'Montana': '30',
    'Nebraska': '31',
    'Nevada': '32',
    'New Hampshire': '33',
    'New Jersey': '34',
    'New Mexico': '35',
    'New York': '36',
    'North Carolina': '37',
    'North Dakota': '38',
    'Ohio': '39',
    'Oklahoma': '40',
    'Oregon': '41',
    'Pennsylvania': '42',
    'Rhode Island': '44',
    'South Carolina': '45',
    'South Dakota': '46',
    'Tennessee': '47',
    'Texas': '48',
    'Utah': '49',
    'Vermont': '50',
    'Virginia': '51',
    'Washington': '53',
    'West Virginia': '54',
    'Wisconsin': '55',
    'Wyoming': '56',
    'District of Columbia': '11',
    'Puerto Rico': '72'
}
