# Get_Air_Quality_Data_From_EPA_API

This notebook requests data from the US Environmental Protection Agency (EPA) Air Quality Service (AQS) API. The [documentation](https://aqs.epa.gov/aqsweb/documents/data_api.html) for the API provides definitions of the different call parameters and examples of the various calls that can be made to the API.

For all years in the dataset (1963-2022), we search for available data from the closest sensor(s) using bounding boxes around Kearney, Nebraska. For each year, the search begins with a 50 mile bounding box and, if no sensors are found within the designated area, the bounding box is increased in 50 mile increments until at least one sensor is found or it is determined that no sensors are available within a 350 mile area.

For each year, once the relevant sensors (if any) are located, we retrieve daily data for pollutant parameter codes 88101 or 88502 (PM2.5) from those sensors.


# License & Attribution Notice
This code was developed by Susan Boyd for use in HW1 assigned in DATA 512, a course in the UW MS Data Science degree program. This code is provided under an MIT license.

In addition, some functions or code blocks in this notebook were adapted (with adaptations noted below) from code provided by David W. McDonald in a notebook entitled "epa_air_quality_history_example.ipynb" and located at https://drive.google.com/drive/folders/1lPJF73GX5Vyu2uAvT5VpAY-xGwP2fCCx, for use in UW Course DATA 512. It is licensed under the Creative Commons CC-BY license (https://creativecommons.org/licenses/by/4.0/). These code blocks may be subject to attribution or any other requirements under that Creative Commons license. In cases of doubt, the more restrictive license terms apply.

# ChatGPT Attribution
Some functions or code blocks in this notebook were created with assistance from ChatGPT (https://chat.openai.com/). The affected code is isolated in a function, the use of ChatGPT is noted there, and information on the prompts used to query ChatGPT is provided at the end of the notebook.

# Step 0 Set up Notebook 

In [1]:
# import needed packages 

import json, time, requests
import pandas as pd
import numpy as np


# Step 1 Define Constants 

### ATTRIBUTION NOTE
The code in the code block titled "CONSTANTS" is adapted from Dr. McDonald's code in "epa_air_quality_history_example.ipynb". The code was slightly reformatted for readability. Additional modifications:

 (1) The code below adds the constant AQI_PARAM_CLASS = "AQI POLLUTANTS". For this project, we will always be asking for data related to AQI pollutants. For more details regarding other possible parameter values, please see the API documentation or the "epa_air_quality_history_example.ipynb" notebook.

 (2) The code below specifies that we are only interested in particulate codes 88101 and 88502 (AQI_PARAMS_PARTICULATES = "88101,88502"), the PM2.5 measurements that will be used later for the health impact assessment functions.

 (3) The CITY_LOCATIONS constants were moved from a different part of the "epa_air_quality_history_example.ipynb" notebook, and the city of interest, Kearney, Nebraska, was added.

In [2]:
#    CONSTANTS

#    This is the root of all AQS API URLs
API_REQUEST_URL = 'https://aqs.epa.gov/data/api'

# These are 'actions' we can ask the API to take or requests that we can make of the API

#    Sign-up request - generally only performed once - unless you lose your key
API_ACTION_SIGNUP = '/signup?email={email}'

#    List actions provide information on API parameter values that are required by some other actions/requests
API_ACTION_LIST_CLASSES = '/list/classes?email={email}&key={key}'
API_ACTION_LIST_PARAMS = '/list/parametersByClass?email={email}&key={key}&pc={pclass}'
API_ACTION_LIST_SITES = '/list/sitesByCounty?email={email}&key={key}&state={state}&county={county}'
#
#    Monitor actions are requests for monitoring stations that meet specific criteria
API_ACTION_MONITORS_COUNTY = '/monitors/byCounty?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&state={state}&county={county}'
API_ACTION_MONITORS_BOX = '/monitors/byBox?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&minlat={minlat}&maxlat={maxlat}&minlon={minlon}&maxlon={maxlon}'
#
#    Summary actions are requests for summary data. These are for daily summaries
API_ACTION_DAILY_SUMMARY_COUNTY = '/dailyData/byCounty?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&state={state}&county={county}'
API_ACTION_DAILY_SUMMARY_BOX = '/dailyData/byBox?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&minlat={minlat}&maxlat={maxlat}&minlon={minlon}&maxlon={maxlon}'

#
#    It is always nice to be respectful of a free data resource.
#    We're going to observe a 100 requests per minute limit - which is fairly nice
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/100.0)-API_LATENCY_ASSUMED   # seconds to wait per request: 60 seconds / 100 requests
#
#
#    This is a template that covers most of the parameters for the actions we might take, from the set of actions
#    above. In the examples below, most of the time parameters can either be supplied as individual values to a
#    function - or they can be set in a copy of the template and passed in with the template.
# 
AQS_REQUEST_TEMPLATE = {
    "email":      "",     
    "key":        "",      
    "state":      "",     # the two digit state FIPS # as a string
    "county":     "",     # the three digit county FIPS # as a string
    "begin_date": "",     # the start of a time window in YYYYMMDD format
    "end_date":   "",     # the end of a time window in YYYYMMDD format, begin_date and end_date must be in the same year
    "minlat":    0.0,
    "maxlat":    0.0,
    "minlon":    0.0,
    "maxlon":    0.0,
    "param":     "",     # a list of comma separated 5 digit codes, max 5 codes requested
    "pclass":    ""      # parameter class is only used by the List calls
}


AQI_PARAM_CLASS = "AQI POLLUTANTS"
#AQI_PARAMS_PARTICULATES = "81102,88101,88502"
AQI_PARAMS_PARTICULATES = "88101,88502"


CITY_LOCATIONS = {
    'Kearney':        {'city'   : 'Kearney',
                       'county' : 'Buffalo',
                       'state'  : 'Nebraska',
                       'fips'   : '31019',
                       'latlon' : [40.5994, -99.0816] },
    
    'seaside' :       {'city'   : 'Seaside',
                       'county' : 'Clatsop',
                       'state'  : 'Oregon',
                       'fips'   : '41007',
                       'latlon' : [45.9932, -123.9226] }, 
    
    'bend' :          {'city'   : 'Bend',
                       'county' : 'Deschutes',
                       'state'  : 'Oregon',
                       'fips'   : '41017',
                       'latlon' : [44.0582, -121.3153] }
}
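
The action strings above are plain format templates, and each request URL is built by filling one of them from a dictionary shaped like AQS_REQUEST_TEMPLATE. As a minimal sketch of that step (the email, key, and coordinates below are placeholders, and `API_ROOT` and `ACTION_MONITORS_BOX` are illustrative names rather than the notebook's constants):

```python
# Illustrative copy of the byBox monitors action; credentials are placeholders.
API_ROOT = "https://aqs.epa.gov/data/api"
ACTION_MONITORS_BOX = (
    "/monitors/byBox?email={email}&key={key}&param={param}"
    "&bdate={begin_date}&edate={end_date}"
    "&minlat={minlat}&maxlat={maxlat}&minlon={minlon}&maxlon={maxlon}"
)

template = {
    "email": "user@example.com",   # placeholder, not a real account
    "key": "not-a-real-key",       # placeholder
    "state": "", "county": "",     # not referenced by the byBox action
    "param": "88101,88502",
    "begin_date": "20000501",
    "end_date": "20001031",
    "minlat": 40.0, "maxlat": 41.0,
    "minlon": -100.0, "maxlon": -99.0,
    "pclass": "",
}

# str.format ignores keyword arguments the string does not reference,
# so a single template dictionary can serve every action.
request_url = API_ROOT + ACTION_MONITORS_BOX.format(**template)
print(request_url)
```

Because unused keys are ignored, the same dictionary can be passed to the county and bounding box actions without removing fields.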




# Step 1 Get and Set API Tokens

### ATTRIBUTION NOTE
The code for the function request_signup is adapted from Dr. McDonald's code in "epa_air_quality_history_example.ipynb". The code was slightly reformatted for readability.

In [3]:

#    This implements the sign-up request. The parameters are standardized so that this function definition matches
#    all of the others. However, the easiest way to call this is to simply call this function with your preferred
#    email address.
#
def request_signup(email_address = None,
                   endpoint_url = API_REQUEST_URL, 
                   endpoint_action = API_ACTION_SIGNUP, 
                   request_template = AQS_REQUEST_TEMPLATE,
                   headers = None):
    
    # Make sure we have a string - if you don't have access to this email address, things might go badly for you
    if email_address:
        request_template['email'] = email_address        
    if not request_template['email']: 
        raise Exception("Must supply an email address to call 'request_signup()'")
    
    # Compose the signup url - create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
        
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
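
The request functions pause before every call so that a run of requests stays under a rate budget. The constants cell describes that budget as 100 requests per minute. As a rough sketch, the pause for a per-minute budget can be derived as below; `throttle_wait` is a hypothetical helper and is not part of the notebook:

```python
def throttle_wait(requests_per_minute, assumed_latency=0.002):
    """Seconds to sleep before each request to stay within a per-minute budget.

    assumed_latency is the time (in seconds) we assume each request already
    takes; it is subtracted so the total time per request matches the budget.
    """
    seconds_per_request = 60.0 / requests_per_minute
    # Never return a negative wait, even for very generous budgets
    return max(0.0, seconds_per_request - assumed_latency)


print(throttle_wait(100))   # pause for a budget of 100 requests per minute
```

Passing the result to `time.sleep` before each request keeps the pacing logic in one place if the budget ever changes.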


You will need to request your own sign-up key, which only needs to be done once. If your request is successful, you will receive an email asking you to confirm your email address, which you must do before using the key. I've included the signup request I made as an example, but commented it out so it does not run again with each run of this notebook.

In [4]:
# request sign up key 
#response = request_signup("suetboyd@uw.edu")
#print(json.dumps(response,indent=4))

You will receive an email with your user name and passcode. Use those to update the relevant CONSTANTS.

In [5]:
USERNAME = "" ## ENTER YOUR OWN USERNAME
APIKEY = "" ## ENTER YOUR OWN API KEY 
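
If you would rather not type credentials into the notebook, one option is to read them from environment variables. This is only a sketch: the variable names `AQS_EMAIL` and `AQS_KEY` and the helper `load_credentials` are assumptions made for illustration, not something the AQS API or the original notebook defines.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, key) read from the AQS_EMAIL and AQS_KEY variables."""
    username = env.get("AQS_EMAIL", "")
    key = env.get("AQS_KEY", "")
    if not username or not key:
        raise ValueError("Set AQS_EMAIL and AQS_KEY before running the notebook")
    return username, key


# Example with an explicit mapping, so nothing real is read or printed
example_user, example_key = load_credentials(
    {"AQS_EMAIL": "user@example.com", "AQS_KEY": "not-a-real-key"}
)
```

In the notebook this could be used as `USERNAME, APIKEY = load_credentials()`, which also keeps the key out of saved cell contents.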

# Step 2 Define Request Functions 

### ATTRIBUTION NOTE
The code for the functions "daily_request_Kearney", "request_monitors", "bounding_latlon", and "extract_summary_from_response" is adapted from Dr. McDonald's code in "epa_air_quality_history_example.ipynb". Modifications:

(1) Comments are slightly modified to make sense in the context of this notebook.

(2) The function "daily_request_Kearney" updates Dr. McDonald's "request_daily_summary" code to: A - use the bounding box method API_ACTION_DAILY_SUMMARY_BOX, as there are no sensors for Kearney, Nebraska available by county; B - add a "quiet" parameter to allow the code to run without printing results; C - cover the entire fire season; and D - add a scl parameter to designate the scale that should be used for the bounding box.

(3) The functionality regarding requesting gaseous parameters was removed from the sample call because it is not relevant to this notebook.

(4) The "find_scale" function includes snippets of code from Dr. McDonald significantly intermixed with new functionality.

Daily request function

In [6]:
#
#    This implements the daily summary request. Daily summary provides a daily summary value for each sensor being requested
#    from the start date to the end date. 
#
#    This can be called with a mixture of a defined parameter dictionary, or with function
#    parameters. If function parameters are provided, those take precedence over any parameters from the request template.
#
def request_daily_summary(email_address = None, key = None, param=None,
                          begin_date = None, end_date = None, fips = None,
                          endpoint_url = API_REQUEST_URL, 
                          endpoint_action = API_ACTION_DAILY_SUMMARY_COUNTY, 
                          request_template = AQS_REQUEST_TEMPLATE,
                          headers = None):
    
    #  This prioritizes the info from the call parameters - not what's already in the template
    if email_address:
        request_template['email'] = email_address
    if key:
        request_template['key'] = key
    if param:
        request_template['param'] = param
    if begin_date:
        request_template['begin_date'] = begin_date
    if end_date:
        request_template['end_date'] = end_date
    if fips and len(fips)==5:
        request_template['state'] = fips[:2]
        request_template['county'] = fips[2:]            

    # Make sure there are values that allow us to make a call - these are always required
    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_daily_summary()'")
    if not request_template['key']: 
        raise Exception("Must supply a key to call 'request_daily_summary()'")
    if not request_template['param']: 
        raise Exception("Must supply param values to call 'request_daily_summary()'")
    if not request_template['begin_date']: 
        raise Exception("Must supply a begin_date to call 'request_daily_summary()'")
    if not request_template['end_date']: 
        raise Exception("Must supply an end_date to call 'request_daily_summary()'")
    # Note we're not validating FIPS fields because not all of the daily summary actions require the FIPS numbers
        
    # compose the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
        
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response
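
Note that the request function returns None when the request or the JSON decoding raises an exception, while later cells index the response with `["Header"][0]['status']`. A small guard makes that lookup safe. The sketch below uses a hypothetical helper, `response_status`, and hand-written sample responses that only mimic the header layout shown in this notebook's outputs:

```python
def response_status(response):
    """Return the AQS header status string, or None if it cannot be read."""
    try:
        return response["Header"][0]["status"]
    except (TypeError, KeyError, IndexError):
        # None response, missing "Header" key, or empty header list
        return None


# Hand-written samples that mimic the header layout of real responses
ok = {"Header": [{"status": "Success", "rows": 1}], "Data": [{}]}
empty = {"Header": [{"status": "No data matched your selection", "rows": 0}], "Data": []}

print(response_status(ok))
print(response_status(empty))
print(response_status(None))
```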



Next is a function that implements the daily summary request for Kearney, Nebraska for an entire fire season.

In [7]:

# This implements the daily summary request specific to Kearney - update the fields below if another location is desired
# Provides data for an entire fire season.
# Takes the parameter year - the year of the fire season to be queried, as a string (e.g. "2000")
# and scl - the scale factor to use for that year to make sure we find at least one monitoring station.


def daily_request_Kearney(year, scl, quiet = False):

    # Create a copy of the AQS_REQUEST_TEMPLATE
    request_data = AQS_REQUEST_TEMPLATE.copy()

    #update parameters
    request_data['email'] = USERNAME
    request_data['key'] = APIKEY
    request_data['param'] = AQI_PARAMS_PARTICULATES     # the PM2.5 parameter codes defined in the constants
    request_data['state'] = CITY_LOCATIONS['Kearney']['fips'][:2]
    request_data['county'] = CITY_LOCATIONS['Kearney']['fips'][2:]

    bbox = bounding_latlon(CITY_LOCATIONS["Kearney"], scale=scl)

    #   put our bounding box into the request_data
    request_data['minlat'] = bbox[0]
    request_data['maxlat'] = bbox[1]
    request_data['minlon'] = bbox[2]
    request_data['maxlon'] = bbox[3]
    
    
    # For every year, we will collect data from May 1st through October 31st

    b_mmdd = "0501"                                 
    e_mmdd = "1031"
    
    b = year+b_mmdd
    e = year+e_mmdd
    
    
    particulate_aqi = request_daily_summary(request_template=request_data, begin_date = b, \
                        end_date= e, endpoint_action = API_ACTION_DAILY_SUMMARY_BOX)
    
    if not quiet:
        print("Response for the particulate pollutants ...")
        if particulate_aqi["Header"][0]['status'] == "Success":
            print(json.dumps(particulate_aqi['Data'],indent=4))
        elif particulate_aqi["Header"][0]['status'].startswith("No data "):
            print("Looks like the response generated no data. You might take a closer look at your request and the response data.")
        else:
            print(json.dumps(particulate_aqi,indent=4))

    return particulate_aqi
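
The function above builds its dates by string concatenation, so `year` must be a string, and the request template notes that begin_date and end_date must fall in the same year. As a rough sketch, both points can be checked in one place; `season_dates` is a hypothetical helper, not part of the notebook:

```python
from datetime import datetime


def season_dates(year, begin_mmdd="0501", end_mmdd="1031"):
    """Return (begin_date, end_date) as YYYYMMDD strings within a single year."""
    year = str(year)                      # accept 2000 as well as "2000"
    begin, end = year + begin_mmdd, year + end_mmdd
    # strptime raises ValueError for impossible dates such as 20000231
    begin_dt = datetime.strptime(begin, "%Y%m%d")
    end_dt = datetime.strptime(end, "%Y%m%d")
    if end_dt < begin_dt:
        raise ValueError("end date precedes begin date")
    return begin, end


print(season_dates(2000))                     # fire season window
print(season_dates("2000", "0101", "1231"))   # full calendar year
```

Since both strings share the same year prefix, the same-year requirement holds by construction.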


Before we can use the daily summary request function, we need to make sure we have monitors to query.
The function below requests a list of monitors for a given area using a FIPS code or a bounding box.

In [8]:

#    This implements the monitors request. This requests monitoring stations. This can be done by state, county, or bounding box. 
#
#    Like the two other functions, this can be called with a mixture of a defined parameter dictionary, or with function
#    parameters. If function parameters are provided, those take precedence over any parameters from the request template.
#
def request_monitors(email_address = None, key = None, param=None,
                          begin_date = None, end_date = None, fips = None,
                          endpoint_url = API_REQUEST_URL, 
                          endpoint_action = API_ACTION_MONITORS_COUNTY, 
                          request_template = AQS_REQUEST_TEMPLATE,
                          headers = None):
    
    #  This prioritizes the info from the call parameters - not what's already in the template
    if email_address:
        request_template['email'] = email_address
    if key:
        request_template['key'] = key
    if param:
        request_template['param'] = param
    if begin_date:
        request_template['begin_date'] = begin_date
    if end_date:
        request_template['end_date'] = end_date
    if fips and len(fips)==5:
        request_template['state'] = fips[:2]
        request_template['county'] = fips[2:]            

    # Make sure there are values that allow us to make a call - these are always required
    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_monitors()'")
    if not request_template['key']: 
        raise Exception("Must supply a key to call 'request_monitors()'")
    if not request_template['param']: 
        raise Exception("Must supply param values to call 'request_monitors()'")
    if not request_template['begin_date']: 
        raise Exception("Must supply a begin_date to call 'request_monitors()'")
    if not request_template['end_date']: 
        raise Exception("Must supply an end_date to call 'request_monitors()'")
    # Note we're not validating FIPS fields because not all of the monitors actions require the FIPS numbers
    
    # compose the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
    
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


Define a function to create a bounding box around a particular location 

In [9]:
#   These are rough estimates for creating bounding boxes based on a city location
#   You can find these rough estimates on the USGS website:
#   https://www.usgs.gov/faqs/how-much-distance-does-a-degree-minute-and-second-cover-your-maps
#
LAT_25MILES = 25.0 * (1.0/69.0)    # This is about 25 miles of latitude in decimal degrees
LON_25MILES = 25.0 * (1.0/54.6)    # This is about 25 miles of longitude in decimal degrees
#
#   Compute rough estimates for a bounding box around a given place
#   The bounding box is scaled in 50 mile increments. That is the bounding box will have sides that
#   are rough multiples of 50 miles, with the center of the box around the indicated place.
#   The scale parameter determines the scale (size) of the bounding box
#
def bounding_latlon(place=None,scale=1.0):
    minlat = place['latlon'][0] - float(scale) * LAT_25MILES
    maxlat = place['latlon'][0] + float(scale) * LAT_25MILES
    minlon = place['latlon'][1] - float(scale) * LON_25MILES
    maxlon = place['latlon'][1] + float(scale) * LON_25MILES
    return [minlat,maxlat,minlon,maxlon]
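
As a quick worked example, here is the arithmetic for Kearney at scales 1 and 2 (a 50 mile and a 100 mile box). The sketch repeats the constants and a compact version of the function so that it runs on its own; the coordinates are the ones stored in CITY_LOCATIONS.

```python
LAT_25MILES = 25.0 * (1.0 / 69.0)    # about 25 miles of latitude in decimal degrees
LON_25MILES = 25.0 * (1.0 / 54.6)    # about 25 miles of longitude in decimal degrees


def bounding_latlon(place, scale=1.0):
    """Return [minlat, maxlat, minlon, maxlon] centered on place['latlon']."""
    lat, lon = place["latlon"]
    return [lat - scale * LAT_25MILES, lat + scale * LAT_25MILES,
            lon - scale * LON_25MILES, lon + scale * LON_25MILES]


kearney = {"latlon": [40.5994, -99.0816]}
for scale in (1, 2):
    minlat, maxlat, minlon, maxlon = bounding_latlon(kearney, scale)
    print(scale, round(minlat, 4), round(maxlat, 4), round(minlon, 4), round(maxlon, 4))
```

Each unit of scale adds 25 miles on every side of the center point, which is why the box width grows in 50 mile steps.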

Now define a function that finds the bounding box needed to get at least one station with data for that particular year for Kearney, Nebraska. The scale increases in 50 mile increments up to scale 7 (a 350 mile box); if there are still no stations, the function reports that none were found and returns NaN.

In [10]:
#function to find scale factor needed for bounding box 
def find_scale(year, quiet = False):

    request_data = AQS_REQUEST_TEMPLATE.copy()
    request_data['email'] = USERNAME
    request_data['key'] = APIKEY
    request_data['param'] = AQI_PARAMS_PARTICULATES
    
    box_scale = 1.0 # start the search with a 50 mile box
    success = False  # haven't found anything yet, haven't started looking 
    
    # set the dates to look for fire season 
    b_mmdd = "0501"                                 
    e_mmdd = "1031"
    
    b = year+b_mmdd
    e = year+e_mmdd
    
    while not success:
        
        # stop once the 350 mile box (scale 7) has been searched without success
        if box_scale > 7:
            if not quiet: 
                print(f"No stations found within_{50*(box_scale-1)}miles for year_{year}")
            box_scale = np.nan
            return box_scale
            
        
        
        bbox = bounding_latlon(CITY_LOCATIONS['Kearney'],scale=box_scale)
        request_data['minlat'] = bbox[0]
        request_data['maxlat'] = bbox[1]
        request_data['minlon'] = bbox[2]
        request_data['maxlon'] = bbox[3]
        
        #print(f"Looking for stations within a bounding box of{50*box_scale}miles")
        response = request_monitors(request_template=request_data, begin_date=b, end_date=e,
                            endpoint_action = API_ACTION_MONITORS_BOX)
    
        if response["Header"][0]['status'] == "Success":
            success = True
            if not quiet: 
                print(f"for year_{year}, you can use scale of_{50*box_scale}_miles")
                print(json.dumps(response['Data'],indent=4))
            return box_scale
        else:
            if not quiet:
                print(json.dumps(response,indent=4))
            box_scale += 1
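
The search logic can be tried without calling the API by swapping the monitor request for a stub. The sketch below is a simplified stand-in for the loop above, not a replacement for it: `first_scale_with_data` and the `has_monitors` callback are hypothetical names, and `max_scale=7` reflects the 350 mile limit described earlier.

```python
import math


def first_scale_with_data(has_monitors, max_scale=7):
    """Return the smallest scale in 1..max_scale for which has_monitors(scale)
    is true, or NaN if no scale in that range finds a monitor."""
    for scale in range(1, max_scale + 1):
        if has_monitors(scale):
            return float(scale)
    return math.nan


def nearest_at_two(scale):
    """Stub for the API call: pretend the nearest monitor needs a 100 mile box."""
    return scale >= 2


print(first_scale_with_data(nearest_at_two))
print(first_scale_with_data(lambda scale: False))
```

Returning NaN rather than raising keeps the per-year results easy to store in a single numeric column.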

Example - find the bounding box scale needed for Kearney in 2000. The printout tells you the scale you can use to get at least one monitor and provides a list of the monitor(s). The function returns the scale factor.

In [11]:
find_scale("2000", quiet = False)

{
    "Header": [
        {
            "status": "No data matched your selection",
            "request_time": "2023-12-11T15:46:19-05:00",
            "url": "https://aqs.epa.gov/data/api/monitors/byBox?email=suetboyd@uw.edu&key=orangeosprey54&param=88101,88502&bdate=20000501&edate=20001031&minlat=40.23708115942029&maxlat=40.961718840579714&minlon=-99.53947545787545&maxlon=-98.62372454212453",
            "rows": 0
        }
    ],
    "Data": []
}
for year_2000, you can use scale of_100.0_miles
[
    {
        "state_code": "31",
        "county_code": "079",
        "site_number": "0003",
        "parameter_code": "88101",
        "poc": 1,
        "parameter_name": "PM2.5 - Local Conditions",
        "open_date": "1999-03-01",
        "close_date": "2004-04-19",
        "concurred_exclusions": null,
        "dominant_source": null,
        "measurement_scale": null,
        "measurement_scale_def": null,
        "monitoring_objective": "POPULATION EXPOSURE; REGIONAL TRANSPORT",
  

2.0

Example - here's the daily data request for Kearney for 2000 and a scale of 2.
Beware, it's long.

In [12]:
particulate_aqi = daily_request_Kearney("2000", 2)

Response for the particulate pollutants ...
[
    {
        "state_code": "31",
        "county_code": "079",
        "site_number": "0003",
        "parameter_code": "88101",
        "poc": 1,
        "latitude": 40.925012,
        "longitude": -98.339784,
        "datum": "WGS84",
        "parameter": "PM2.5 - Local Conditions",
        "sample_duration_code": "7",
        "sample_duration": "24 HOUR",
        "pollutant_standard": "PM25 24-hour 2006",
        "date_local": "2000-05-03",
        "units_of_measure": "Micrograms/cubic meter (LC)",
        "event_type": "No Events",
        "observation_count": 1,
        "observation_percent": 100.0,
        "validity_indicator": "Y",
        "arithmetic_mean": 8.2,
        "first_max_value": 8.2,
        "first_max_hour": 0,
        "aqi": 34,
        "method_code": "118",
        "method": "R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC",
        "local_site_name": "JEFFERSON ELEMENTARY",
        "site_address": "314 W 7TH GR

Here is a nice function to extract the data we need from this long, nested dictionary

In [13]:

#    This is a list of field names - data - that will be extracted from each record
#
EXTRACTION_FIELDS = ['sample_duration','observation_count','arithmetic_mean','aqi']

#
#    The function creates a summary record
def extract_summary_from_response(r=None, fields=EXTRACTION_FIELDS):
    ## the result will be structured around monitoring site, parameter, and then date
    result = dict()
    data = r["Data"]
    for record in data:
        # make sure the record is set up
        site = record['site_number']
        param = record['parameter_code']
        #date = record['date_local']    # this version keeps the response value in YYYY-MM-DD format
        date = record['date_local'].replace('-','') # this puts it in YYYYMMDD format
        if site not in result:
            result[site] = dict()
            result[site]['local_site_name'] = record['local_site_name']
            result[site]['site_address'] = record['site_address']
            result[site]['state'] = record['state']
            result[site]['county'] = record['county']
            result[site]['city'] = record['city']
            result[site]['pollutant_type'] = dict()
        if param not in result[site]['pollutant_type']:
            result[site]['pollutant_type'][param] = dict()
            result[site]['pollutant_type'][param]['parameter_name'] = record['parameter']
            result[site]['pollutant_type'][param]['units_of_measure'] = record['units_of_measure']
            result[site]['pollutant_type'][param]['method'] = record['method']
            result[site]['pollutant_type'][param]['data'] = dict()
        if date not in result[site]['pollutant_type'][param]['data']:
            result[site]['pollutant_type'][param]['data'][date] = list()
        
        # now extract the specified fields
        extract = dict()
        for k in fields:
            if str(k) in record:
                extract[str(k)] = record[k]
            else:
                # this makes sure we always have the requested fields, even if
                # we have a missing value for a given day/month
                extract[str(k)] = None
        
        # add this extraction to the list for the day
        result[site]['pollutant_type'][param]['data'][date].append(extract)
    
    return result


In [14]:
# Update the daily_request function to get an entire year
# Takes the parameter year - the year to be queried, as a string (e.g. "2000")
# and scl - the scale factor to use for that year
# Function is specific to Kearney - update the fields below if another location is desired


def daily_request_Kearney(year, scl, quiet = False):

    # Create a copy of the AQS_REQUEST_TEMPLATE
    request_data = AQS_REQUEST_TEMPLATE.copy()

    #update parameters
    request_data['email'] = USERNAME
    request_data['key'] = APIKEY
    request_data['param'] = AQI_PARAMS_PARTICULATES     # the PM2.5 parameter codes defined in the constants
    request_data['state'] = CITY_LOCATIONS['Kearney']['fips'][:2]
    request_data['county'] = CITY_LOCATIONS['Kearney']['fips'][2:]

    bbox = bounding_latlon(CITY_LOCATIONS["Kearney"], scale=scl)

    #   put our bounding box into the request_data
    request_data['minlat'] = bbox[0]
    request_data['maxlat'] = bbox[1]
    request_data['minlon'] = bbox[2]
    request_data['maxlon'] = bbox[3]
    
    
    # Unlike the fire-season version above, collect data for the whole calendar year (January 1st through December 31st)

    #b_mmdd = "0501"                                 
    #e_mmdd = "1031"
    
    b_mmdd = "0101"                                 
    e_mmdd = "1231"
    
    b = year+b_mmdd
    e = year+e_mmdd
    
    
    particulate_aqi = request_daily_summary(request_template=request_data, begin_date = b, \
                        end_date= e, endpoint_action = API_ACTION_DAILY_SUMMARY_BOX)
    
    if not quiet:
        print("Response for the particulate pollutants ...")
        if particulate_aqi["Header"][0]['status'] == "Success":
            print(json.dumps(particulate_aqi['Data'],indent=4))
        elif particulate_aqi["Header"][0]['status'].startswith("No data "):
            print("Looks like the response generated no data. You might take a closer look at your request and the response data.")
        else:
            print(json.dumps(particulate_aqi,indent=4))

    return particulate_aqi


Here's the extracted data for the fire season 
Beware - still long! 


In [15]:
extract_particulate = extract_summary_from_response(particulate_aqi)
print("Summary of particulate extraction ...")
print(json.dumps(extract_particulate,indent=4))

Summary of particulate extraction ...
{
    "0003": {
        "local_site_name": "JEFFERSON ELEMENTARY",
        "site_address": "314 W 7TH GRAND ISLAND",
        "state": "Nebraska",
        "county": "Hall",
        "city": "Grand Island",
        "pollutant_type": {
            "88101": {
                "parameter_name": "PM2.5 - Local Conditions",
                "units_of_measure": "Micrograms/cubic meter (LC)",
                "method": "R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC",
                "data": {
                    "20000503": [
                        {
                            "sample_duration": "24 HOUR",
                            "observation_count": 1,
                            "arithmetic_mean": 8.2,
                            "aqi": 34
                        },
                        {
                            "sample_duration": "24 HOUR",
                            "observation_count": 1,
                            "arithmetic_mean"

Now let's massage that mess into a DataFrame. First is a function to do this, then an example.

In [16]:
# See ChatGPT Attribution note at end of notebook 


# function that transforms extracted data into a DataFrame 
def transform_data(data):
    
    # Initialize empty lists to store the data
    dates = []
    site_names = []
    site_addresses = []
    states = []
    counties = []
    cities = []
    parameter_names = []
    units_of_measure = []
    methods = []
    sample_durations = []
    observation_counts = []
    arithmetic_means = []
    aqis = []


    # Iterate through the nested dictionary
    for site_id, site_data in data.items():
        for pollutant_type, pollutant_info in site_data['pollutant_type'].items():
            for date, pollutant_data in pollutant_info['data'].items():
                for entry in pollutant_data:
                    dates.append(date)
                    site_names.append(site_data['local_site_name'])
                    site_addresses.append(site_data['site_address'])
                    states.append(site_data['state'])
                    counties.append(site_data['county'])
                    cities.append(site_data['city'])
                    parameter_names.append(pollutant_info['parameter_name'])
                    units_of_measure.append(pollutant_info['units_of_measure'])
                    methods.append(pollutant_info['method'])
                    sample_durations.append(entry['sample_duration'])
                    observation_counts.append(entry['observation_count'])
                    arithmetic_means.append(entry['arithmetic_mean'])
                    aqis.append(entry['aqi'])

    # Create a DataFrame
    df = pd.DataFrame({
        'Date': dates,
        'Local Site Name': site_names,
        'Site Address': site_addresses,
        'State': states,
        'County': counties,
        'City': cities,
        'Parameter Name': parameter_names,
        'Units of Measure': units_of_measure,
        'Method': methods,
        'Sample Duration': sample_durations,
        'Observation Count': observation_counts,
        'Arithmetic Mean': arithmetic_means,
        'AQI': aqis
        })

    return df 
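
For comparison, the same flattening can be written as a list of row dictionaries, which is shorter and easy to check by hand. This is only a sketch of an alternative, with fewer columns than the function above; `flatten_extract` is a hypothetical name, and the sample mirrors the first entry of the extraction output shown earlier.

```python
def flatten_extract(extract):
    """Turn the nested site -> pollutant -> date -> entries structure into rows."""
    rows = []
    for site_id, site in extract.items():
        for param, info in site["pollutant_type"].items():
            for date, entries in info["data"].items():
                for entry in entries:
                    rows.append({
                        "Site": site_id,
                        "Parameter Code": param,
                        "Date": date,
                        "Local Site Name": site["local_site_name"],
                        "Arithmetic Mean": entry["arithmetic_mean"],
                        "AQI": entry["aqi"],
                    })
    return rows


# A one-entry sample shaped like the extraction output
sample = {
    "0003": {
        "local_site_name": "JEFFERSON ELEMENTARY",
        "pollutant_type": {
            "88101": {
                "data": {
                    "20000503": [
                        {"sample_duration": "24 HOUR", "observation_count": 1,
                         "arithmetic_mean": 8.2, "aqi": 34},
                    ]
                }
            }
        },
    }
}

rows = flatten_extract(sample)
print(rows)
```

A list of dictionaries like this can be handed directly to `pd.DataFrame(rows)`, with the dictionary keys becoming the column names.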


In [17]:
results_df = transform_data(extract_particulate)
results_df.head(5)

Unnamed: 0,Date,Local Site Name,Site Address,State,County,City,Parameter Name,Units of Measure,Method,Sample Duration,Observation Count,Arithmetic Mean,AQI
0,20000503,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,8.2,34
1,20000503,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,8.2,34
2,20000503,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,8.2,34
3,20000503,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,8.2,34
4,20000503,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,8.2,34
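
The flattening that `transform_data` performs can be illustrated on a tiny hand-made input. This is a minimal sketch, not the notebook's function: the `sample` dictionary and the `flatten_records` helper are hypothetical, only a subset of the columns is kept, and the nested layout is assumed to match the one quoted in the Chat GPT prompt at the end of the notebook. It also shows why repeated entries for one date produce repeated rows like those in the output above.

```python
# Hypothetical miniature of the nested structure passed to transform_data;
# the keys mirror those used above, the values are made up for illustration.
sample = {
    "0005": {
        "local_site_name": "Example Site",
        "site_address": "1 Example St",
        "state": "Nebraska",
        "county": "Hall",
        "city": "Grand Island",
        "pollutant_type": {
            "88101": {
                "parameter_name": "PM2.5 - Local Conditions",
                "units_of_measure": "Micrograms/cubic meter (LC)",
                "method": "Example method",
                "data": {
                    "20200101": [
                        {"sample_duration": "24 HOUR", "observation_count": 1,
                         "arithmetic_mean": 5.0, "aqi": 21},
                        {"sample_duration": "24 HOUR", "observation_count": 1,
                         "arithmetic_mean": 5.0, "aqi": 21},
                    ],
                },
            },
        },
    },
}


def flatten_records(data):
    """Return one flat dict per (site, pollutant, date, entry) combination."""
    rows = []
    for site in data.values():
        for pollutant in site["pollutant_type"].values():
            for date, entries in pollutant["data"].items():
                for entry in entries:
                    rows.append({
                        "Date": date,
                        "Local Site Name": site["local_site_name"],
                        "Parameter Name": pollutant["parameter_name"],
                        "Sample Duration": entry["sample_duration"],
                        "Arithmetic Mean": entry["arithmetic_mean"],
                        "AQI": entry["aqi"],
                    })
    return rows


rows = flatten_records(sample)
print(len(rows))  # one row per entry, so repeated entries give repeated rows
```

Passing `rows` to `pd.DataFrame(rows)` would give a frame with one column per key, which is an alternative to building a separate list for every column.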


# Step 3 - Get the Data for Kearney for All Available Years

In [18]:
# Years to search for data (1963 through 2022; range() excludes its stop value)
years = [str(i) for i in range(1963, 2023)]


In [19]:
# create a dataframe of the scale factor to use for each year 

scales_df = pd.DataFrame({"Year": years})

scales = []
for year in years:
    scale = find_scale(year, quiet = True)
    scales.append(scale)
    
scales_df["Scales"] = scales

#scales_df.head()

In [20]:
# display some results 
print(scales_df.head())
print(scales_df.tail())

   Year  Scales
0  1963     NaN
1  1964     NaN
2  1965     NaN
3  1966     NaN
4  1967     NaN
    Year  Scales
55  2018     2.0
56  2019     2.0
57  2020     2.0
58  2021     2.0
59  2022     2.0
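
The search that produces these scale values can be sketched as a simple expanding loop. This is an illustration only, not the notebook's `find_scale`: the `find_scale_sketch` and `fake_has_sensor` names are hypothetical, the sensor lookup is a stand-in for the API call, and the sketch assumes the scale is the multiple of the 50 mile base box (so the 350 mile limit corresponds to a scale of 7).

```python
import math
from typing import Callable


def find_scale_sketch(
    year: str,
    has_sensor: Callable[[str, int], bool],
    step_miles: int = 50,
    max_miles: int = 350,
) -> float:
    """Return the smallest box multiple that contains a sensor, or NaN if none."""
    max_scale = max_miles // step_miles
    for scale in range(1, max_scale + 1):
        # Grow the box by one step at a time and stop at the first hit
        if has_sensor(year, scale * step_miles):
            return float(scale)
    return math.nan


def fake_has_sensor(year: str, miles: int) -> bool:
    """Hypothetical stand-in for the API lookup: sensors exist from 1999, 100+ miles out."""
    return int(year) >= 1999 and miles >= 100


scales = {y: find_scale_sketch(y, fake_has_sensor) for y in ["1963", "1999", "2022"]}
print(scales)
```

Returning NaN for years without any sensor matches how the missing years appear in `scales_df` and lets the later loop test them with `np.isnan`.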


In [21]:
years = [str(i) for i in range(1963, 2023)]
results_all = pd.DataFrame()
no_data_yrs = []

for year in years:
    scale = float(scales_df[scales_df["Year"] == year]["Scales"].values[0])
    if np.isnan(scale):
        no_data_yrs.append(year) 
    else: 
        particulate_aqi = daily_request_Kearney(year, scale, quiet = True)
        results_sum = extract_summary_from_response(r=particulate_aqi, fields=EXTRACTION_FIELDS)
        new = transform_data(results_sum)
        results_all =  pd.concat([results_all, new], axis=0)
        
print("No data was avail for years:",  no_data_yrs)
        

No data was avail for years: ['1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998']
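
The loop above calls `pd.concat` on every iteration and keeps each year's original row labels. One common alternative, sketched here with hypothetical stand-in frames rather than real API results, is to collect the per-year frames in a list and concatenate once, with `ignore_index=True` so the combined frame gets a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical per-year frames standing in for the output of transform_data
frames_by_year = {
    "1999": pd.DataFrame({"Date": ["19990307", "19990313"], "AQI": [68.0, 52.0]}),
    "2000": pd.DataFrame({"Date": ["20000503"], "AQI": [34.0]}),
}

frames = []
for year, frame in frames_by_year.items():
    frames.append(frame)  # collect first, concatenate once at the end

# ignore_index=True renumbers the rows instead of repeating 0, 1, ... per year
combined = pd.concat(frames, axis=0, ignore_index=True)
print(combined)
```

Note that renumbering the rows would change the index labels shown in the outputs below, which retain the labels produced by the original loop.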


In [22]:
results_all.head(10)

Unnamed: 0,Date,Local Site Name,Site Address,State,County,City,Parameter Name,Units of Measure,Method,Sample Duration,Observation Count,Arithmetic Mean,AQI
0,19990307,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,20.2,68.0
1,19990307,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,20.2,68.0
2,19990307,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,20.2,68.0
3,19990307,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,20.2,68.0
4,19990307,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,20.2,68.0
5,19990307,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,20.2,68.0
6,19990313,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,12.8,52.0
7,19990313,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,12.8,52.0
8,19990313,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,12.8,52.0
9,19990313,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,12.8,52.0


In [23]:
results_all.shape

(21335, 13)

In [24]:
# Drop rows that are exact duplicates (identical readings reported for the same day)
results_all = results_all.drop_duplicates()
print(results_all.shape)
results_all.head()

(4445, 13)


Unnamed: 0,Date,Local Site Name,Site Address,State,County,City,Parameter Name,Units of Measure,Method,Sample Duration,Observation Count,Arithmetic Mean,AQI
0,19990307,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,20.2,68.0
6,19990313,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,12.8,52.0
12,19990316,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,34.0,97.0
18,19990325,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,9.8,41.0
24,19990328,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,4.7,20.0
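
Before dropping duplicates it can be useful to count them, since the row count here falls from 21335 to 4445. A minimal sketch on made-up rows (the frame below is hypothetical, not the notebook's data):

```python
import pandas as pd

# Hypothetical rows: the first three are exact copies of one another
df = pd.DataFrame({
    "Date": ["19990307", "19990307", "19990307", "19990313"],
    "Arithmetic Mean": [20.2, 20.2, 20.2, 12.8],
    "AQI": [68.0, 68.0, 68.0, 52.0],
})

# duplicated() flags every repeat after the first occurrence of a row
n_dupes = int(df.duplicated().sum())

# reset_index(drop=True) is optional; it replaces the gapped labels with 0..n-1
deduped = df.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```

Without `reset_index`, the surviving rows keep their original labels, which is why the output above shows 0, 6, 12, and so on.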


In [25]:
results_all.describe()

Unnamed: 0,Observation Count,Arithmetic Mean,AQI
count,4445.0,4445.0,3378.0
mean,6.425422,6.82826,28.28804
std,9.705781,4.503496,15.959591
min,1.0,-1.5,0.0
25%,1.0,3.9,17.0
50%,1.0,5.7,25.0
75%,1.0,8.5,37.0
max,24.0,48.833333,134.0


In [26]:
# Function that adds month and year columns
# See Chat GPT attribution at end of notebook

def add_date_columns(df, date_column_name='date'):
    """Add 'Month', 'Month Number', and 'Year' columns derived from a date column.

    Note: modifies df in place and also returns it.
    """
    # Convert the date column to datetime format
    df[date_column_name] = pd.to_datetime(df[date_column_name])

    # Add month name and month number columns
    df['Month'] = df[date_column_name].dt.month_name()
    df['Month Number'] = df[date_column_name].dt.month

    # Add a year column
    df['Year'] = df[date_column_name].dt.year

    return df

In [27]:
results_all = add_date_columns(results_all,  date_column_name = "Date")
results_all.head()

Unnamed: 0,Date,Local Site Name,Site Address,State,County,City,Parameter Name,Units of Measure,Method,Sample Duration,Observation Count,Arithmetic Mean,AQI,Month,Month Number,Year
0,1999-03-07,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,20.2,68.0,March,3,1999
6,1999-03-13,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,12.8,52.0,March,3,1999
12,1999-03-16,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,34.0,97.0,March,3,1999
18,1999-03-25,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,9.8,41.0,March,3,1999
24,1999-03-28,JEFFERSON ELEMENTARY,314 W 7TH GRAND ISLAND,Nebraska,Hall,Grand Island,PM2.5 - Local Conditions,Micrograms/cubic meter (LC),R & P Model 2025 PM2.5 Sequential w/WINS - GRA...,24 HOUR,1,4.7,20.0,March,3,1999
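
The `Date` values are `YYYYMMDD` strings, which `pd.to_datetime` infers correctly here. Passing an explicit `format` is a slightly more defensive variant; the sketch below uses a hypothetical two-row frame and is not a change to `add_date_columns` itself.

```python
import pandas as pd

# Hypothetical frame with dates stored as YYYYMMDD strings
df = pd.DataFrame({"Date": ["19990307", "20000503"]})

# An explicit format avoids relying on format inference
df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")

df["Month"] = df["Date"].dt.month_name()
df["Month Number"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
print(df)
```

With an explicit format, a value that does not match `YYYYMMDD` raises an error instead of being parsed in an unexpected way.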


In [28]:
# Save to file
f = "Data/epa_api_pm25.csv"
results_all.to_csv(f, index = False)

Investigate the data from the API pulls 

In [29]:
# Examine the distribution of monitoring sites over time 
tbl = pd.pivot_table(results_all, values="Observation Count", index=["Local Site Name"], columns=["Year"], aggfunc="sum")
tbl

Local Site Name (by Year),1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
Grand Island NDOT,,,,,,,,,,,...,,,,,,,,8541.0,8926.0,8761.0
Grand Island Senior High,,,,,,71.0,113.0,119.0,111.0,109.0,...,114.0,120.0,115.0,113.0,111.0,111.0,97.0,11.0,,
JEFFERSON ELEMENTARY,73.0,110.0,106.0,114.0,111.0,35.0,,,,,...,,,,,,,,,,


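The table above shows that, in the years displayed, most years had a single monitoring site, but the active site changed over time: JEFFERSON ELEMENTARY reported from 1999 to 2004, Grand Island Senior High from 2004 to 2020, and Grand Island NDOT from 2020 to 2022. Two sites reported in each of the transition years, 2004 and 2020.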
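
The number of sites per year can also be checked directly rather than read off the pivot table. A minimal sketch on a hypothetical frame (the site names and years below are made up; the column names follow `results_all`):

```python
import pandas as pd

# Hypothetical observations: 2004 is a handover year with two sites
obs = pd.DataFrame({
    "Year": [2003, 2004, 2004, 2005],
    "Local Site Name": ["Site A", "Site A", "Site B", "Site B"],
    "Observation Count": [1, 1, 1, 1],
})

# Count distinct site names within each year
sites_per_year = obs.groupby("Year")["Local Site Name"].nunique()

# Years in which more than one site reported
multi_site_years = sites_per_year[sites_per_year > 1].index.tolist()
print(multi_site_years)
```

Applied to `results_all`, the same two lines would list every year with more than one reporting site, including any years hidden by the truncated display.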

In [30]:
# Examine the distribution of parameters over time 
tbl = pd.pivot_table(results_all, values="Observation Count", index=["Parameter Name"], columns=["Year"], aggfunc="sum")
tbl

Parameter Name (by Year),1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
PM2.5 - Local Conditions,73,110,106,114,111,106,113,119,111,109,...,114,120,115,113,111,111,97,8552,8926,8761


The table above shows that all measurements were reported under the parameter "PM2.5 - Local Conditions".

In [31]:
# Examine the distribution of sample durations over time
tbl = pd.pivot_table(results_all, values="Observation Count", index=["Sample Duration"], columns=["Year"], aggfunc="sum")
tbl

Sample Duration (by Year),1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
1 HOUR,,,,,,,,,,,...,,,,,,,,8203.0,8569.0,8411.0
24 HOUR,73.0,110.0,106.0,114.0,111.0,106.0,113.0,119.0,111.0,109.0,...,114.0,120.0,115.0,113.0,111.0,111.0,97.0,11.0,,
24-HR BLK AVG,,,,,,,,,,,...,,,,,,,,338.0,357.0,350.0


For all displayed years through 2019, only the 24 HOUR sample duration appears. In 2020, three durations appear (1 HOUR, 24 HOUR, and 24-HR BLK AVG); in 2021 and 2022, only 1 HOUR and 24-HR BLK AVG appear.
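
Because the durations change in 2020, later analysis may need to pick which rows to treat as the daily value. One possible approach, shown as a sketch only, is to keep the two 24-hour durations and set the 1 HOUR rows aside. Treating 24 HOUR and 24-HR BLK AVG as comparable is an assumption made here for illustration, and the frame and the `DAILY_DURATIONS` name are hypothetical.

```python
import pandas as pd

# Hypothetical rows mixing the three durations seen in the table above
obs = pd.DataFrame({
    "Date": ["2019-06-01", "2021-06-01", "2021-06-01"],
    "Sample Duration": ["24 HOUR", "1 HOUR", "24-HR BLK AVG"],
    "Arithmetic Mean": [6.1, 6.4, 6.5],
})

# Assumption: both 24-hour durations are acceptable as the daily value
DAILY_DURATIONS = ["24 HOUR", "24-HR BLK AVG"]

daily = obs[obs["Sample Duration"].isin(DAILY_DURATIONS)].reset_index(drop=True)
print(daily)
```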

# Chat GPT Attribution 

The following functions or code blocks in this notebook were written with assistance from Chat GPT, available at https://chat.openai.com/. In some cases, code suggested by Chat GPT was further modified by the notebook author, Sue Boyd.

***
For assistance writing the "transform_data" function, Chat GPT was given the following prompt:

"I have data in this format.   Write code to transform it to a python dataframe : {'0005': {'local_site_name': 'Grand Island NDOT',
  'site_address': '3305 W Old Potash Hwy',
  'state': 'Nebraska',
  'county': 'Hall',
  'city': 'Grand Island',
  'pollutant_type': {'88101': {'parameter_name': 'PM2.5 - Local Conditions', . . ."
***

For assistance writing the code block to calculate AQI by year, Chat GPT was given the following prompt:

"I have a dataframe with a column called "Date" in string format YYYYMMDD.  Write code to create a new column called year that expresses the Year as  a date object"

***
For the codeblock titled "Add in years 1963 to 1984 with NAN values in the Average and Max columns", Chat GPT was given the following prompt: 

"I have a dataframe that looks like this.  Add years 1963 to 1984 with "NAN" values.  Average AQI	Max
Year		
1985	28.17	55.0
1986	46.71	63.0
1987	33.94	56.0
1988	37.89	75.0
1989	37.37	69.0
1990	38.03	71.0"

***

For assistance in writing the "add_date_columns" function, Chat GPT was given the following prompts:

"I have a dataframe called smoke_pm25 with a column called date with values in the format "YYYY-MM-DD". Write a function that adds a column "Month" to the dataframe basedo on the month in the datestring."

AND

"Update the code so that there are two columns "Month Name" as a string with values January, February, etc. and "Month Number" with the number of hte month in the year."


