Standard Python Libraries Required

This supporting notebook retrieves the actual AQI data for Alexandria, VA from 1964 to 2021. It uses the EPA AQS API, which can return data for any city by county or by bounding box. For Alexandria I went with the bounding box approach, using a 50 mile box centered on the city, because Alexandria is an independent city that is not part of any county. The code generates a final CSV file of mean annual AQI values computed from the monitoring station data.

## License
Snippets of this code, used to get AQI results for the assigned city, were developed by Dr. David W. McDonald for use in DATA 512, a course in the UW MS Data Science degree program. That code is provided under the [Creative Commons](https://creativecommons.org) [CC-BY license](https://creativecommons.org/licenses/by/4.0/). Revision 1.2 - August 16, 2024

The rest of the code is under the MIT license.

The code requires the pandas and requests Python modules. The cell below installs them for reproducibility.

In [1]:
%%capture
%pip install pandas
%pip install requests

In [2]:
# 
#    IMPORTS
#
#import json, time, urllib.parse
import json, time
import pandas as pd
#
#    The 'requests' module is a distribution module for making web requests. If you do not have it already, you'll need to install it
import requests

In [3]:
#########
#
#    CONSTANTS
#

#
#    This is the root of all AQS API URLs
#
API_REQUEST_URL = 'https://aqs.epa.gov/data/api'

#
#    These are some of the 'actions' we can ask the API to take or requests that we can make of the API
#
#    Sign-up request - generally only performed once - unless you lose your key
API_ACTION_SIGNUP = '/signup?email={email}'
#
#    List actions provide information on API parameter values that are required by some other actions/requests
API_ACTION_LIST_CLASSES = '/list/classes?email={email}&key={key}'
API_ACTION_LIST_PARAMS = '/list/parametersByClass?email={email}&key={key}&pc={pclass}'
API_ACTION_LIST_SITES = '/list/sitesByCounty?email={email}&key={key}&state={state}&county={county}'
#
#    Monitor actions are requests for monitoring stations that meet specific criteria
API_ACTION_MONITORS_COUNTY = '/monitors/byCounty?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&state={state}&county={county}'
API_ACTION_MONITORS_BOX = '/monitors/byBox?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&minlat={minlat}&maxlat={maxlat}&minlon={minlon}&maxlon={maxlon}'
#
#    Summary actions are requests for summary data. These are for daily summaries
API_ACTION_DAILY_SUMMARY_COUNTY = '/dailyData/byCounty?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&state={state}&county={county}'
API_ACTION_DAILY_SUMMARY_BOX = '/dailyData/byBox?email={email}&key={key}&param={param}&bdate={begin_date}&edate={end_date}&minlat={minlat}&maxlat={maxlat}&minlon={minlon}&maxlon={maxlon}'
#
#    It is always nice to be respectful of a free data resource.
#    We're going to observe a 100 requests per minute limit - which is fairly nice
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (60.0/100.0)-API_LATENCY_ASSUMED    # seconds between requests: 60 seconds / 100 requests
#
#
#    This is a template that covers most of the parameters for the actions we might take, from the set of actions
#    above. In the examples below, most of the time parameters can either be supplied as individual values to a
#    function - or they can be set in a copy of the template and passed in with the template.
# 
AQS_REQUEST_TEMPLATE = {
    "email":      "",     
    "key":        "",      
    "state":      "",     # the two digit state FIPS # as a string
    "county":     "",     # the three digit county FIPS # as a string
    "begin_date": "",     # the start of a time window in YYYYMMDD format
    "end_date":   "",     # the end of a time window in YYYYMMDD format, begin_date and end_date must be in the same year
    "minlat":    0.0,
    "maxlat":    0.0,
    "minlon":    0.0,
    "maxlon":    0.0,
    "param":     "",     # a list of comma separated 5 digit codes, max 5 codes requested
    "pclass":    ""      # parameter class is only used by the List calls
}

AQI_PARAM_CLASS = "AQI POLLUTANTS"

#   Gaseous AQI pollutants CO, SO2, NO2, and O3
AQI_PARAMS_GASEOUS = "42101,42401,42602,44201"
#
#   Particulate AQI pollutants PM10, PM2.5, and Acceptable PM2.5
AQI_PARAMS_PARTICULATES = "81102,88101,88502"
#
#    This is a list of field names - data - that will be extracted from each record
#
EXTRACTION_FIELDS = ['sample_duration','observation_count','arithmetic_mean','aqi']

In [27]:
#
#
#   HELPER FUNCTIONS
#
#    This implements the sign-up request. The parameters are standardized so that this function definition matches
#    all of the others. However, the easiest way to call this is to simply call this function with your preferred
#    email address.
#
def request_signup(email_address = None,
                   endpoint_url = API_REQUEST_URL, 
                   endpoint_action = API_ACTION_SIGNUP, 
                   request_template = AQS_REQUEST_TEMPLATE,
                   headers = None):
    
    # Make sure we have a string - if you don't have access to this email address, things might go badly for you
    if email_address:
        request_template['email'] = email_address        
    
    if not request_template['email']: 
        raise Exception("Must supply an email address to call 'request_signup()'")

    if '@' not in request_template['email']: 
        raise Exception(f"Must supply an email address to call 'request_signup()'. The string '{request_template['email']}' does not look like an email address.")

    # Compose the signup url - create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
        
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


#
#    This implements the list request. There are several versions of the list request that only require email and key.
#    This code sets the default action/requests to list the groups or parameter class descriptors. Having those descriptors 
#    allows one to request the individual (proprietary) 5 digit codes for individual air quality measures by using the
#    param request. Some code in later cells will illustrate those requests.
#
def request_list_info(email_address = None, key = None,
                      endpoint_url = API_REQUEST_URL, 
                      endpoint_action = API_ACTION_LIST_CLASSES, 
                      request_template = AQS_REQUEST_TEMPLATE,
                      headers = None):
    
    #  Make sure we have email and key - at least
    #  This prioritizes the info from the call parameters - not what's already in the template
    if email_address:
        request_template['email'] = email_address
    if key:
        request_template['key'] = key
    
    # For the basic request we need an email address and a key
    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_list_info()'")
    if not request_template['key']: 
        raise Exception("Must supply a key to call 'request_list_info()'")

    # compose the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
        
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

#
#   Compute rough estimates for a bounding box around a given place
#   The bounding box is scaled in 50 mile increments. That is, the bounding box will have sides that
#   are rough multiples of 50 miles, with the center of the box around the indicated place.
#   The scale parameter determines the scale (size) of the bounding box
#
def bounding_latlon(place=None,scale=1.0):
    LAT_25MILES = 25.0 * (1.0/69.0)    # This is about 25 miles of latitude in decimal degrees
    LON_25MILES = 25.0 * (1.0/54.6)    # This is about 25 miles of longitude in decimal degrees
    
    minlat = place['latlon'][0] - float(scale) * LAT_25MILES
    maxlat = place['latlon'][0] + float(scale) * LAT_25MILES
    minlon = place['latlon'][1] - float(scale) * LON_25MILES
    maxlon = place['latlon'][1] + float(scale) * LON_25MILES
    return [minlat,maxlat,minlon,maxlon]

#
#    This implements the monitors request. This requests monitoring stations. This can be done by state, county, or bounding box. 
#
#    Like the two other functions, this can be called with a mixture of a defined parameter dictionary, or with function
#    parameters. If function parameters are provided, those take precedence over any parameters from the request template.
#
def request_monitors(email_address = None, key = None, param=None,
                          begin_date = None, end_date = None, fips = None,
                          endpoint_url = API_REQUEST_URL, 
                          endpoint_action = API_ACTION_MONITORS_COUNTY, 
                          request_template = AQS_REQUEST_TEMPLATE,
                          headers = None):
    
    #  This prioritizes the info from the call parameters - not what's already in the template
    if email_address:
        request_template['email'] = email_address
    if key:
        request_template['key'] = key
    if param:
        request_template['param'] = param
    if begin_date:
        request_template['begin_date'] = begin_date
    if end_date:
        request_template['end_date'] = end_date
    if fips and len(fips)==5:
        request_template['state'] = fips[:2]
        request_template['county'] = fips[2:]            

    # Make sure there are values that allow us to make a call - these are always required
    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_monitors()'")
    if not request_template['key']: 
        raise Exception("Must supply a key to call 'request_monitors()'")
    if not request_template['param']: 
        raise Exception("Must supply param values to call 'request_monitors()'")
    if not request_template['begin_date']: 
        raise Exception("Must supply a begin_date to call 'request_monitors()'")
    if not request_template['end_date']: 
        raise Exception("Must supply an end_date to call 'request_monitors()'")
    # Note we're not validating FIPS fields because not all of the monitors actions require the FIPS numbers
    
    # compose the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
    
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

#
#    This implements the daily summary request. Daily summary provides a daily summary value for each sensor being requested
#    from the start date to the end date. 
#
#    Like the two other functions, this can be called with a mixture of a defined parameter dictionary, or with function
#    parameters. If function parameters are provided, those take precedence over any parameters from the request template.
#
def request_daily_summary(email_address = None, key = None, param=None,
                          begin_date = None, end_date = None, fips = None,
                          endpoint_url = API_REQUEST_URL, 
                          endpoint_action = API_ACTION_DAILY_SUMMARY_COUNTY, 
                          request_template = AQS_REQUEST_TEMPLATE,
                          headers = None):
    
    #  This prioritizes the info from the call parameters - not what's already in the template
    if email_address:
        request_template['email'] = email_address
    if key:
        request_template['key'] = key
    if param:
        request_template['param'] = param
    if begin_date:
        request_template['begin_date'] = begin_date
    if end_date:
        request_template['end_date'] = end_date
    if fips and len(fips)==5:
        request_template['state'] = fips[:2]
        request_template['county'] = fips[2:]            

    # Make sure there are values that allow us to make a call - these are always required
    if not request_template['email']:
        raise Exception("Must supply an email address to call 'request_daily_summary()'")
    if not request_template['key']: 
        raise Exception("Must supply a key to call 'request_daily_summary()'")
    if not request_template['param']: 
        raise Exception("Must supply param values to call 'request_daily_summary()'")
    if not request_template['begin_date']: 
        raise Exception("Must supply a begin_date to call 'request_daily_summary()'")
    if not request_template['end_date']: 
        raise Exception("Must supply an end_date to call 'request_daily_summary()'")
    # Note we're not validating FIPS fields because not all of the daily summary actions require the FIPS numbers
        
    # compose the request
    request_url = endpoint_url+endpoint_action.format(**request_template)
        
    # make the request
    try:
        # Wait first, to make sure we don't exceed a rate limit in the situation where an exception occurs
        # during the request processing - throttling is always a good practice with a free data source
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response

# This function, extract_AQI_data, retrieves air quality data from a JSON response if the request is successful. 
# It returns a list of sensor measurements for the day when the status is "Success." 
# If there's no data, it returns None, and for unknown issues, it outputs the full JSON response for inspection.
def extract_AQI_data(aqi_JSON):
    """
    Extracts AQI Data from JSON response if the request is successful.
    
    Parameters: 
        aqi_JSON (dict): The raw JSON response of the request.
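    Return:
        list: the list of dictionaries denoting each sensor's measurement for a day,
              or None if the request failed or returned no data.
    """
    if aqi_JSON is None:
        # The request functions return None when the request itself failed (e.g. a connection error)
        return None
    if aqi_JSON["Header"][0]['status'] == "Success":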
        return aqi_JSON['Data']
    elif aqi_JSON["Header"][0]['status'].startswith("No data "):
        pass
    else:
        print("Take a look!")
        print(json.dumps(aqi_JSON,indent=4))
    return None

#
#    The function creates a summary record
def extract_summary_from_response(r=None, fields=EXTRACTION_FIELDS):
    ## the result will be structured around monitoring site, parameter, and then date
    result = dict()
    data = r["Data"]
    for record in data:
        # make sure the record is set up
        site = record['site_number']
        param = record['parameter_code']
        #date = record['date_local']    # this version keeps the response value YYYY-MM-DD
        date = record['date_local'].replace('-','') # this puts it in YYYYMMDD format
        if site not in result:
            result[site] = dict()
            result[site]['local_site_name'] = record['local_site_name']
            result[site]['site_address'] = record['site_address']
            result[site]['state'] = record['state']
            result[site]['county'] = record['county']
            result[site]['city'] = record['city']
            result[site]['pollutant_type'] = dict()
        if param not in result[site]['pollutant_type']:
            result[site]['pollutant_type'][param] = dict()
            result[site]['pollutant_type'][param]['parameter_name'] = record['parameter']
            result[site]['pollutant_type'][param]['units_of_measure'] = record['units_of_measure']
            result[site]['pollutant_type'][param]['method'] = record['method']
            result[site]['pollutant_type'][param]['data'] = dict()
        if date not in result[site]['pollutant_type'][param]['data']:
            result[site]['pollutant_type'][param]['data'][date] = list()
        
        # now extract the specified fields
        extract = dict()
        for k in fields:
            if str(k) in record:
                extract[str(k)] = record[k]
            else:
                # this makes sure we always have the requested fields, even if
                # we have a missing value for a given day/month
                extract[str(k)] = None
        
        # add this extraction to the list for the day
        result[site]['pollutant_type'][param]['data'][date].append(extract)
    
    return result


### Step 1: Making a sign-up request
Before you can use the API you need to request a key. You will use an email address to make the request. The EPA then sends a confirmation email link and a 'key' that you use for all other requests.

You only need to sign-up once, unless you want to invalidate your current key (by getting a new key) or you lose your key.
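
To make the sign-up step concrete, here is a minimal sketch of how the sign-up URL is put together from the root URL and the action template. The two constants are copied from this notebook. `compose_signup_url` is a hypothetical helper that is not part of the notebook, and `someone@example.com` is a placeholder address.

```python
# Root URL and sign-up action template, copied from the notebook's constants
API_REQUEST_URL = 'https://aqs.epa.gov/data/api'
API_ACTION_SIGNUP = '/signup?email={email}'


def compose_signup_url(email):
    """Return the sign-up URL for an email address (hypothetical helper)."""
    # Same light sanity check the notebook's request_signup() performs
    if '@' not in email:
        raise ValueError(f"'{email}' does not look like an email address")
    return API_REQUEST_URL + API_ACTION_SIGNUP.format(email=email)


# Placeholder address, for illustration only
print(compose_signup_url("someone@example.com"))
# prints https://aqs.epa.gov/data/api/signup?email=someone@example.com
```

The sketch only builds the URL. Nothing is sent to the EPA until the URL is passed to `requests.get`, which is what `request_signup()` does.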

Uncomment the code below to request an API key and sign up.

In [5]:
#
#    A SIGNUP request is only to be done once, to request a key. A key is sent to that email address and needs to be confirmed with a click through
#    This code should probably be commented out after you've made your key request to make sure you don't accidentally make a new sign-up request
#
#print("Requesting SIGNUP ...")
#USERNAME = "<your_email_address>"
#response = request_signup(USERNAME)
#print(json.dumps(response,indent=4))
#

In [6]:
# once you get the API Key paste it below to get the data
# 
USERNAME = "<the_email_address_you_sent_on_signup>"

APIKEY = "<the_key_the_EPA_sent_you_in_email>"

### Step 2: Making a list request

Once you have a key, the next thing is to get information about the different types of air quality monitoring (sensors) and the different places where we might find air quality stations. The monitoring system is complex and changes all the time. The EPA implementation allows an API user to find changes to monitoring sites and sensors by making requests - maybe monthly, or daily. This API approach is probably better than having the EPA publish documentation that may be out of date as soon as it hits a web page. The one problem here is that some of the responses rely on jargon or terms-of-art. That is, one needs to know a bit about the way atmospheric science works to understand some of the terms.

The default should get us a list of the various groups or classes of sensors. These classes are user defined names for clusters of
sensors that might be part of a package or default air quality sensing station. We need a class name to start getting down to
a sensor ID. Each sensor type has an ID number. We'll eventually need those ID numbers to be able to request values that come from
that specific sensor.

In [16]:

request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY

response = request_list_info(request_template=request_data)

if response["Header"][0]['status'] == "Success":
    print(json.dumps(response['Data'],indent=4))
else:
    print(json.dumps(response,indent=4))

[
    {
        "code": "AIRNOW MAPS",
        "value_represented": "The parameters represented on AirNow maps (88101, 88502, and 44201)"
    },
    {
        "code": "ALL",
        "value_represented": "Select all Parameters Available"
    },
    {
        "code": "AQI POLLUTANTS",
        "value_represented": "Pollutants that have an AQI Defined"
    },
    {
        "code": "CORE_HAPS",
        "value_represented": "Urban Air Toxic Pollutants"
    },
    {
        "code": "CRITERIA",
        "value_represented": "Criteria Pollutants"
    },
    {
        "code": "CSN DART",
        "value_represented": "List of CSN speciation parameters to populate the STI DART tool"
    },
    {
        "code": "FORECAST",
        "value_represented": "Parameters routinely extracted by AirNow (STI)"
    },
    {
        "code": "HAPS",
        "value_represented": "Hazardous Air Pollutants"
    },
    {
        "code": "IMPROVE CARBON",
        "value_represented": "IMPROVE Carbon Parameters"
    }



We're interested in getting to something that might be the Air Quality Index (AQI). You see this reported on the news - often around smog values, but also when there is smoke in the sky. The AQI is a complex measure of different gasses and of the particles in the air (dust, dirt, ash ...).

From the list produced by our 'list/Classes' request above, it looks like there is a class of sensors called "AQI POLLUTANTS". Let's try to get a list of those specific sensors and see what we can get from those.

Once we have a list of the classes or groups of possible sensors, we can find the sensor IDs that make up that class (group). The one that looks to be associated with the Air Quality Index is "AQI POLLUTANTS". We'll use that to make another list request.
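
As a rough illustration of working with a list response, the sketch below turns the `Data` entries into a lookup from `code` to `value_represented` and checks that the class we want is present. `class_lookup` is a hypothetical helper, and the `sample` response is trimmed by hand to the fields used here. A real response carries more header fields.

```python
def class_lookup(list_response):
    """Map each 'code' to its 'value_represented' (hypothetical helper).

    Returns an empty dict when the response does not report "Success".
    """
    header = list_response.get("Header") or [{}]
    if header[0].get("status") != "Success":
        return {}
    return {item["code"]: item["value_represented"]
            for item in list_response.get("Data", [])}


# Hand-trimmed sample with the same shape as the list response shown above
sample = {
    "Header": [{"status": "Success"}],
    "Data": [
        {"code": "AQI POLLUTANTS",
         "value_represented": "Pollutants that have an AQI Defined"},
        {"code": "CRITERIA",
         "value_represented": "Criteria Pollutants"},
    ],
}

classes = class_lookup(sample)
print("AQI POLLUTANTS" in classes)   # prints True
print(classes["AQI POLLUTANTS"])     # prints Pollutants that have an AQI Defined
```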


In [17]:
AQI_PARAM_CLASS = "AQI POLLUTANTS"

#
#   Structure a request to get the sensor IDs associated with the AQI
#
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['pclass'] = AQI_PARAM_CLASS  # here we specify that we want this 'pclass' or parameter class

response = request_list_info(request_template=request_data, endpoint_action=API_ACTION_LIST_PARAMS)

if response["Header"][0]['status'] == "Success":
    print(json.dumps(response['Data'],indent=4))
else:
    print(json.dumps(response,indent=4))

[
    {
        "code": "42101",
        "value_represented": "Carbon monoxide"
    },
    {
        "code": "42401",
        "value_represented": "Sulfur dioxide"
    },
    {
        "code": "42602",
        "value_represented": "Nitrogen dioxide (NO2)"
    },
    {
        "code": "44201",
        "value_represented": "Ozone"
    },
    {
        "code": "81102",
        "value_represented": "PM10 Total 0-10um STP"
    },
    {
        "code": "88101",
        "value_represented": "PM2.5 - Local Conditions"
    },
    {
        "code": "88502",
        "value_represented": "Acceptable PM2.5 AQI & Speciation Mass"
    }
]



The city entry below includes the FIPS number for the state and county as a 5 digit string. This format, the 5 digit string, is an 'old' format that is still widely used. There are new codes that may eventually be adopted for the US government information systems. But FIPS is currently what the AQS uses, so that's what is stored in the CITY_LOCATIONS constant.

We are interested in exploring the AQI data for the city of Alexandria, VA.

The fields below were filled in by doing a simple Google search to find the county and the FIPS code. The latitude and longitude were taken from the GeoHack website.
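
The 5 digit FIPS string splits into a 2 digit state code and a 3 digit county code. The sketch below shows one way to do that split with a little validation. `split_fips` is a hypothetical helper that is not part of the notebook. The notebook itself simply slices the string.

```python
def split_fips(fips):
    """Split a 5 digit FIPS string into (state, county) strings (hypothetical helper)."""
    # Keep the code as a string: an integer would drop any leading zero
    if not (isinstance(fips, str) and len(fips) == 5 and fips.isdigit()):
        raise ValueError(f"Expected a 5 digit FIPS string, got {fips!r}")
    return fips[:2], fips[2:]


state, county = split_fips("51510")   # the FIPS code used for Alexandria in this notebook
print(state, county)                  # prints 51 510
```

Keeping both pieces as strings matters because the API template expects a two digit state and a three digit county, leading zeros included.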


In [18]:
#
#   Given the set of sensor codes, now we can create a parameter list or 'param' value as defined by the AQS API spec.
#   It turns out that we want all of these measures for AQI, but we need to have two different param constants to get
#   all seven of the code types. We can only have a max of 5 sensors/values request per param.
#
#   Gaseous AQI pollutants CO, SO2, NO2, and O3
AQI_PARAMS_GASEOUS = "42101,42401,42602,44201"
#
#   Particulate AQI pollutants PM10, PM2.5, and Acceptable PM2.5
AQI_PARAMS_PARTICULATES = "81102,88101,88502"
#   
#
CITY_LOCATIONS = {
    "alexandria":      {"city"   : "Alexandria",
                       "state"  : ["Virginia", "VA"],
                       "fips"   : "51510",
                       "monitoring_start_year" : 0,
                       "area"   : 22760.0,
                       "pop."   : 76378,
                       'latlon' : [38.820450, -77.050552]}
}

We use the FIPS code to get the list of monitoring stations based on the county. I got the FIPS code from Wikipedia. This list request should give us all the monitoring stations in the county of the city selected from the CITY_LOCATIONS dictionary.

In [19]:
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['state'] = CITY_LOCATIONS['alexandria']['fips'][:2]   # the first two digits (characters) of FIPS is the state code
request_data['county'] = CITY_LOCATIONS['alexandria']['fips'][2:]  # the last three digits (characters) of FIPS is the county code

response = request_list_info(request_template=request_data, endpoint_action=API_ACTION_LIST_SITES)

# if response["Header"][0]['status'] == "Success":
#     print(json.dumps(response['Data'],indent=4))
# else:
#     print(json.dumps(response,indent=4))

# Keep only the stations with a non-null 'value_represented' and print them
stations = [item for item in response['Data'] if item["value_represented"] is not None]
for item in stations:
    print(item)

print("Total non-null 'value_represented' entries:", len(stations))


{'code': '0009', 'value_represented': 'Alexandria Health Dept.'}
{'code': '0020', 'value_represented': 'Tucker Elementary School'}
{'code': '0021', 'value_represented': 'City of Alexandria Transportation and Env. Services Maintenance Bldg'}
Total non-null 'value_represented' entries: 3


### Step 3: Bounding box approach to get AQI data from the monitors
We can see that the county approach finds only 3 monitoring stations for the city of Alexandria. That is too few, giving only about 1000 data points. The code below uses the bounding box approach to find the monitors around Alexandria that report the gaseous and particulate AQI pollutants.
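
For reference, here is a small sketch of how a bounding box around a point can be computed. It is a variation on the notebook's `bounding_latlon()`, not a replacement for it. The notebook uses a fixed figure of 54.6 miles per degree of longitude, while this sketch assumes the miles per degree of longitude shrink with the cosine of the latitude. `bounding_box` and `half_side_miles` are hypothetical names.

```python
import math

MILES_PER_DEG_LAT = 69.0   # same rough figure the notebook uses for latitude


def bounding_box(lat, lon, half_side_miles=25.0):
    """Return [minlat, maxlat, minlon, maxlon] for a box centered on (lat, lon).

    Assumption: a degree of longitude spans about
    MILES_PER_DEG_LAT * cos(latitude) miles. This is an approximation.
    """
    miles_per_deg_lon = MILES_PER_DEG_LAT * math.cos(math.radians(lat))
    dlat = half_side_miles / MILES_PER_DEG_LAT
    dlon = half_side_miles / miles_per_deg_lon
    return [lat - dlat, lat + dlat, lon - dlon, lon + dlon]


# 25 miles in each direction gives a box with sides of about 50 miles
box = bounding_box(38.820450, -77.050552)
print(box)
```

The latitude bounds match what `bounding_latlon()` produces with `scale=1.0`. The longitude bounds differ slightly because of the cosine adjustment.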

In [None]:
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY

#   50 mile box with Alexandria in the center - comes back as [minlat,maxlat,minlon,maxlon]
bbox = bounding_latlon(CITY_LOCATIONS['alexandria'],scale=1.0)
request_data['minlat'] = bbox[0]
request_data['maxlat'] = bbox[1]
request_data['minlon'] = bbox[2]
request_data['maxlon'] = bbox[3]

################### PARTICULATES #############
request_data['param'] = AQI_PARAMS_PARTICULATES     

count = 0
# request monitors inside the bounding box, we'll just use a recent date for now
response = request_monitors(request_template=request_data, begin_date="20210701", end_date="20210731",
                            endpoint_action=API_ACTION_MONITORS_BOX)
#
# the response should be similar to the 'list' request above - but in this case we should only get monitors that
# monitor the AQI_PARAMS_PARTICULATES set of params.
#
if response and response["Header"][0]['status'] == "Success":
    count = len(response['Data'])    # count the monitor records returned
else:
    print(json.dumps(response,indent=4))

print("Total monitors found in the 50 mile box around Alexandria for particulates: ", count)

################### GASEOUS #############
request_data['param'] = AQI_PARAMS_GASEOUS

count = 0
# same bounding box request, this time for the gaseous pollutants
response = request_monitors(request_template=request_data, begin_date="20210701", end_date="20210731",
                            endpoint_action=API_ACTION_MONITORS_BOX)
#
# the response should be similar to the 'list' request above - but in this case we should only get monitors that
# monitor the AQI_PARAMS_GASEOUS set of params.
#
if response and response["Header"][0]['status'] == "Success":
    count = len(response['Data'])    # count the monitor records returned
else:
    print(json.dumps(response,indent=4))

print("Total monitors found in the 50 mile box around Alexandria for gaseous: ", count)


### Step 4: Making a daily summary request
The following code encapsulates requests to the EPA AQS API through the request_daily_summary function, streamlining the process of retrieving daily air quality summary data for specific sensors over a specified date range. Users are advised to start by creating or copying a parameter template, initializing it with static values, and then updating dynamic parameters like date ranges for each call.

The request_daily_summary function is highly flexible, accepting parameters such as email_address, key, param, begin_date, end_date, and an optional FIPS code. When provided, these parameters override any preset values in the template, ensuring customizable and accurate requests. Essential parameters, including email, key, parameter values, and date range, are checked to prevent errors, ensuring the request can be processed successfully. After constructing the request URL with the API’s base URL and action endpoint, the function enforces a throttle delay to respect rate limits before making the GET request. The response is then parsed as JSON, with errors handled gracefully by printing any encountered issues and returning None if necessary.
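
The throttle delay can also be tracked between calls rather than applied as a fixed sleep. The sketch below is one possible way to do that and is not what the notebook does. `Throttle` is a hypothetical class, and the `clock` and `sleep` arguments exist only so the behavior can be checked without really waiting.

```python
import time


class Throttle:
    """Space calls so they stay under a requests-per-minute limit (hypothetical helper)."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute   # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None                                # time of the previous call

    def wait(self):
        """Sleep only for whatever part of the interval has not already passed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage, assuming a limit of 100 requests per minute:
#   throttle = Throttle(100)
#   throttle.wait()   # call this right before each requests.get(...)
```

With a limit of 100 requests per minute the minimum interval is 0.6 seconds, so a call made 0.1 seconds after the previous one sleeps for the remaining 0.5 seconds.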

The cell below uses this function to query the AQS API for historical daily summaries, making one request per year from 1961 to 2021. Because begin_date and end_date must fall in the same year, each request covers only that year's "Wildfire Season" (May 1 to October 31).
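
The per-year date windows can be sketched as a small generator. `season_windows` is a hypothetical helper. Its default month and day values mirror the `START_MMDD` and `END_MMDD` constants used in the cell below.

```python
def season_windows(year_start, year_end, start_mmdd="0501", end_mmdd="1031"):
    """Yield (begin_date, end_date) strings in YYYYMMDD format, one pair per year.

    Both dates of a pair fall in the same year, which is what the API requires.
    """
    for year in range(year_start, year_end + 1):
        yield f"{year}{start_mmdd}", f"{year}{end_mmdd}"


for begin_date, end_date in season_windows(1961, 1963):
    print(begin_date, end_date)
# prints:
# 19610501 19611031
# 19620501 19621031
# 19630501 19631031
```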

In [44]:
#
#    PARTICULATE REQUEST
#
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['param'] = AQI_PARAMS_PARTICULATES   

START_MMDD = "0501" 
END_MMDD = "1031"
YEAR_START = 1961
YEAR_END = 2021

particulate_aqi_response = []
# 
#   Now, we need bounding box parameters

#   50 mile box with Alexandria in the center
bbox = bounding_latlon(CITY_LOCATIONS['alexandria'],scale=1.0)

# the bbox response comes back as a list - [minlat,maxlat,minlon,maxlon]

#   put our bounding box into the request_data
request_data['minlat'] = bbox[0]
request_data['maxlat'] = bbox[1]
request_data['minlon'] = bbox[2]
request_data['maxlon'] = bbox[3]

# We'll iterate through all the years
for year in range(YEAR_START, YEAR_END + 1):
    # Define start and end times for the fire season
    begin_date = str(year) + START_MMDD
    end_date = str(year) + END_MMDD

    response = request_daily_summary(request_template=request_data, begin_date=begin_date, end_date=end_date,
                                            endpoint_action = API_ACTION_DAILY_SUMMARY_BOX)
    
    part_list = extract_AQI_data(response)
    
    if part_list:
        particulate_aqi_response.extend(part_list)
    else:
        print(f"No particulate data available for {year}.")


No particulate data available for 1961.
No particulate data available for 1962.
No particulate data available for 1963.
No particulate data available for 1964.
No particulate data available for 1965.
No particulate data available for 1966.
No particulate data available for 1967.
No particulate data available for 1968.
No particulate data available for 1969.
No particulate data available for 1970.
No particulate data available for 1971.
No particulate data available for 1972.
No particulate data available for 1973.
No particulate data available for 1974.
No particulate data available for 1975.
No particulate data available for 1976.
No particulate data available for 1977.
No particulate data available for 1978.
No particulate data available for 1979.
No particulate data available for 1980.
No particulate data available for 1981.
No particulate data available for 1982.
No particulate data available for 1983.
No particulate data available for 1984.
No particulate data available for 1985.


### Challenges faced:

While running the above code I ran into the HTTPSConnectionPool error shown below. It was raised because a new connection to the API could not be established (the network was unreachable). These issues can occur at any point while the above cell is running. I made sure I was connected to a stable network and ran the cell again. The above cell takes about 70 minutes to run.

`HTTPSConnectionPool(host='aqs.epa.gov', port=443): Max retries exceeded with url: /data/api/dailyData/byBox?email=swarali@uw.edu&key=copperfox97&param=81102,88101,88502&bdate=20090501&edate=20091031&minlat=38.45813115942029&maxlat=39.18276884057971&minlon=-77.50842745787546&maxlon=-76.59267654212454 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x1748f0c20>: Failed to establish a new connection: [Errno 51] Network is unreachable'))`
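
A transient failure like this one can often be handled by retrying the request after a short wait. The helper below is a minimal sketch of retrying with exponential backoff; `call_with_retries` and its parameters are hypothetical names. It assumes the wrapped function lets the network exception propagate. If the request function catches the exception itself and returns `None`, the check would need to test for `None` instead.

```python
import time


def call_with_retries(func, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call func(), retrying on network-style errors with exponential backoff.

    OSError is caught because the built-in ConnectionError and the
    exceptions raised by the requests library both derive from it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except OSError as err:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)


# Demo: a stand-in request that fails twice before succeeding.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("Network is unreachable")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_request, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the demo fast. In the notebook the default `time.sleep` would be used, and the wrapped function would be a small lambda around the per-year request.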



In [45]:
# response example for one station
particulate_aqi_response[0]

{'state_code': '11',
 'county_code': '001',
 'site_number': '0017',
 'parameter_code': '81102',
 'poc': 2,
 'latitude': 38.903723,
 'longitude': -77.051366,
 'datum': 'WGS84',
 'parameter': 'PM10 Total 0-10um STP',
 'sample_duration_code': '7',
 'sample_duration': '24 HOUR',
 'pollutant_standard': 'PM10 24-hour 2006',
 'date_local': '1986-05-02',
 'units_of_measure': 'Micrograms/cubic meter (25 C)',
 'event_type': 'No Events',
 'observation_count': 1,
 'observation_percent': 100.0,
 'validity_indicator': 'Y',
 'arithmetic_mean': 31.0,
 'first_max_value': 31.0,
 'first_max_hour': 0,
 'aqi': 29,
 'method_code': '052',
 'method': 'HI-VOL-SA321A - GRAVIMETRIC',
 'local_site_name': None,
 'site_address': 'WEST END LIBRARY 24 & L STS. NW',
 'state': 'District Of Columbia',
 'county': 'District of Columbia',
 'city': 'Washington',
 'cbsa_code': '47900',
 'cbsa': 'Washington-Arlington-Alexandria, DC-VA-MD-WV',
 'date_of_last_change': '2024-05-22'}

In [46]:
with open("../intermediary_files/particulate_aqi_temp.json", 'w') as json_file:
    json.dump(particulate_aqi_response, json_file, indent=4) 

df_particulate = pd.DataFrame(particulate_aqi_response)
df_particulate.head()

Unnamed: 0,state_code,county_code,site_number,parameter_code,poc,latitude,longitude,datum,parameter,sample_duration_code,...,method_code,method,local_site_name,site_address,state,county,city,cbsa_code,cbsa,date_of_last_change
0,11,1,17,81102,2,38.903723,-77.051366,WGS84,PM10 Total 0-10um STP,7,...,52,HI-VOL-SA321A - GRAVIMETRIC,,WEST END LIBRARY 24 & L STS. NW,District Of Columbia,District of Columbia,Washington,47900,"Washington-Arlington-Alexandria, DC-VA-MD-WV",2024-05-22
1,11,1,17,81102,2,38.903723,-77.051366,WGS84,PM10 Total 0-10um STP,7,...,52,HI-VOL-SA321A - GRAVIMETRIC,,WEST END LIBRARY 24 & L STS. NW,District Of Columbia,District of Columbia,Washington,47900,"Washington-Arlington-Alexandria, DC-VA-MD-WV",2024-05-22
2,11,1,17,81102,2,38.903723,-77.051366,WGS84,PM10 Total 0-10um STP,7,...,52,HI-VOL-SA321A - GRAVIMETRIC,,WEST END LIBRARY 24 & L STS. NW,District Of Columbia,District of Columbia,Washington,47900,"Washington-Arlington-Alexandria, DC-VA-MD-WV",2024-05-22
3,11,1,17,81102,2,38.903723,-77.051366,WGS84,PM10 Total 0-10um STP,7,...,52,HI-VOL-SA321A - GRAVIMETRIC,,WEST END LIBRARY 24 & L STS. NW,District Of Columbia,District of Columbia,Washington,47900,"Washington-Arlington-Alexandria, DC-VA-MD-WV",2024-05-22
4,11,1,17,81102,2,38.903723,-77.051366,WGS84,PM10 Total 0-10um STP,7,...,52,HI-VOL-SA321A - GRAVIMETRIC,,WEST END LIBRARY 24 & L STS. NW,District Of Columbia,District of Columbia,Washington,47900,"Washington-Arlington-Alexandria, DC-VA-MD-WV",2024-05-22


In [47]:
df_particulate.columns

Index(['state_code', 'county_code', 'site_number', 'parameter_code', 'poc',
       'latitude', 'longitude', 'datum', 'parameter', 'sample_duration_code',
       'sample_duration', 'pollutant_standard', 'date_local',
       'units_of_measure', 'event_type', 'observation_count',
       'observation_percent', 'validity_indicator', 'arithmetic_mean',
       'first_max_value', 'first_max_hour', 'aqi', 'method_code', 'method',
       'local_site_name', 'site_address', 'state', 'county', 'city',
       'cbsa_code', 'cbsa', 'date_of_last_change'],
      dtype='object')

We repeat the same steps for the gaseous pollutants.
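
Both requests use a bounding box around Alexandria, produced by `bounding_latlon` (defined earlier in the notebook). The sketch below is not that function. It is a hypothetical `box_around` that shows how a 50-mile box can be computed from a center point. The miles-per-degree constants and the center coordinates are assumptions, chosen because they reproduce the `minlat`, `maxlat`, `minlon`, and `maxlon` values visible in the error URL quoted above.

```python
MILES_PER_DEGREE_LAT = 69.0  # assumed approximation
MILES_PER_DEGREE_LON = 54.6  # assumed approximation at this latitude


def box_around(lat, lon, half_side_miles=25.0):
    """Return [minlat, maxlat, minlon, maxlon] for a square box centred on (lat, lon)."""
    dlat = half_side_miles / MILES_PER_DEGREE_LAT
    dlon = half_side_miles / MILES_PER_DEGREE_LON
    return [lat - dlat, lat + dlat, lon - dlon, lon + dlon]


# Assumed center point; 25 miles on each side gives a 50-mile box.
bbox_example = box_around(38.82045, -77.050552)
print(bbox_example)
```

The list order matches the `[minlat, maxlat, minlon, maxlon]` order that the request template expects.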

In [48]:
#
#    GASEOUS REQUEST
#
request_data = AQS_REQUEST_TEMPLATE.copy()
request_data['email'] = USERNAME
request_data['key'] = APIKEY
request_data['param'] = AQI_PARAMS_GASEOUS   

START_MMDD = "0501" 
END_MMDD = "1031"
YEAR_START = 1961
YEAR_END = 2021

gaseous_aqi_response = []
# 
#   Now, we need bounding box parameters
#   50 mile box with Alexandria in the center
bbox = bounding_latlon(CITY_LOCATIONS['alexandria'],scale=1.0)

# the bbox response comes back as a list - [minlat,maxlat,minlon,maxlon]

#   put our bounding box into the request_data
request_data['minlat'] = bbox[0]
request_data['maxlat'] = bbox[1]
request_data['minlon'] = bbox[2]
request_data['maxlon'] = bbox[3]

# We'll iterate through all the years
for year in range(YEAR_START, YEAR_END + 1):
    # Define start and end times for the fire season
    begin_date = str(year) + START_MMDD
    end_date = str(year) + END_MMDD

    #
    #   request the daily summary using the bounding box action instead of the default
    response = request_daily_summary(request_template=request_data, begin_date=begin_date, end_date=end_date,
                                            endpoint_action = API_ACTION_DAILY_SUMMARY_BOX)
    
    gas_list = extract_AQI_data(response)
    
    if gas_list:
        gaseous_aqi_response.extend(gas_list)
    else:
        print(f"No gaseous data available for {year}.")


No gaseous data available for 1961.
No gaseous data available for 1962.
No gaseous data available for 1963.
No gaseous data available for 1964.
No gaseous data available for 1965.
No gaseous data available for 1966.
No gaseous data available for 1967.


In [49]:
# response example for one station
gaseous_aqi_response[0]

{'state_code': '24',
 'county_code': '003',
 'site_number': '1003',
 'parameter_code': '42101',
 'poc': 1,
 'latitude': 39.169533,
 'longitude': -76.627933,
 'datum': 'WGS84',
 'parameter': 'Carbon monoxide',
 'sample_duration_code': '1',
 'sample_duration': '1 HOUR',
 'pollutant_standard': 'CO 1-hour 1971',
 'date_local': '1968-05-02',
 'units_of_measure': 'Parts per million',
 'event_type': 'No Events',
 'observation_count': 4,
 'observation_percent': 17.0,
 'validity_indicator': 'Y',
 'arithmetic_mean': 3.25,
 'first_max_value': 4.0,
 'first_max_hour': 22,
 'aqi': None,
 'method_code': '011',
 'method': 'INSTRUMENTAL - NONDISPERSIVE INFRARED',
 'local_site_name': 'GLEN BURNIE',
 'site_address': ' ANNE ARUNDEL CO. PUBLIC WORKS BLDG. 7409 BALTIMORE ANNAPOLIS BLVD.',
 'state': 'Maryland',
 'county': 'Anne Arundel',
 'city': 'Glen Burnie',
 'cbsa_code': '12580',
 'cbsa': 'Baltimore-Columbia-Towson, MD',
 'date_of_last_change': '2018-06-04'}

In [50]:
with open("../intermediary_files/gaseous_aqi_temp.json", 'w') as json_file:
    json.dump(gaseous_aqi_response, json_file, indent=4) 

df_gaseous = pd.DataFrame(gaseous_aqi_response)
df_gaseous.head()

Unnamed: 0,state_code,county_code,site_number,parameter_code,poc,latitude,longitude,datum,parameter,sample_duration_code,...,method_code,method,local_site_name,site_address,state,county,city,cbsa_code,cbsa,date_of_last_change
0,24,3,1003,42101,1,39.169533,-76.627933,WGS84,Carbon monoxide,1,...,11,INSTRUMENTAL - NONDISPERSIVE INFRARED,GLEN BURNIE,ANNE ARUNDEL CO. PUBLIC WORKS BLDG. 7409 BALT...,Maryland,Anne Arundel,Glen Burnie,12580,"Baltimore-Columbia-Towson, MD",2018-06-04
1,24,3,1003,42101,1,39.169533,-76.627933,WGS84,Carbon monoxide,1,...,11,INSTRUMENTAL - NONDISPERSIVE INFRARED,GLEN BURNIE,ANNE ARUNDEL CO. PUBLIC WORKS BLDG. 7409 BALT...,Maryland,Anne Arundel,Glen Burnie,12580,"Baltimore-Columbia-Towson, MD",2018-06-04
2,24,3,1003,42101,1,39.169533,-76.627933,WGS84,Carbon monoxide,1,...,11,INSTRUMENTAL - NONDISPERSIVE INFRARED,GLEN BURNIE,ANNE ARUNDEL CO. PUBLIC WORKS BLDG. 7409 BALT...,Maryland,Anne Arundel,Glen Burnie,12580,"Baltimore-Columbia-Towson, MD",2018-06-04
3,24,3,1003,42101,1,39.169533,-76.627933,WGS84,Carbon monoxide,1,...,11,INSTRUMENTAL - NONDISPERSIVE INFRARED,GLEN BURNIE,ANNE ARUNDEL CO. PUBLIC WORKS BLDG. 7409 BALT...,Maryland,Anne Arundel,Glen Burnie,12580,"Baltimore-Columbia-Towson, MD",2018-06-04
4,24,3,1003,42101,1,39.169533,-76.627933,WGS84,Carbon monoxide,1,...,11,INSTRUMENTAL - NONDISPERSIVE INFRARED,GLEN BURNIE,ANNE ARUNDEL CO. PUBLIC WORKS BLDG. 7409 BALT...,Maryland,Anne Arundel,Glen Burnie,12580,"Baltimore-Columbia-Towson, MD",2018-06-04


In [51]:
df_gaseous_select = df_gaseous[['state_code', 'county_code', 'site_number', 'parameter_code', 'latitude', 'longitude', 'parameter', 'sample_duration', 'date_local', 'units_of_measure', 'arithmetic_mean',
       'first_max_value', 'aqi']]
df_particulate_select = df_particulate[['state_code', 'county_code', 'site_number', 'parameter_code', 'latitude', 'longitude', 'parameter', 'sample_duration', 'date_local', 'units_of_measure', 'arithmetic_mean',
       'first_max_value', 'aqi']]

print(len(df_gaseous_select))
df_gaseous_unique = df_gaseous_select.drop_duplicates()
print(len(df_gaseous_unique))

print(len(df_particulate_select))
df_particulate_unique = df_particulate_select.drop_duplicates()
print(len(df_particulate_unique))

1074053
801084
309566
68125


After dropping duplicates, 801084 unique rows remain for the gaseous pollutants and 68125 for the particulate pollutants.
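
One likely reason for the large drop is that rows which differ only in columns left out of the selection (for example `pollutant_standard` or `poc`) become identical once those columns are removed. The toy example below illustrates the effect. The records and the labels "standard A" and "standard B" are hypothetical and only mimic the shape of the real data.

```python
import pandas as pd

# Hypothetical records: the first two differ only in pollutant_standard.
records = [
    {"site_number": "0017", "date_local": "1986-05-02", "pollutant_standard": "standard A", "aqi": 29},
    {"site_number": "0017", "date_local": "1986-05-02", "pollutant_standard": "standard B", "aqi": 29},
    {"site_number": "0017", "date_local": "1986-05-08", "pollutant_standard": "standard A", "aqi": 35},
]
df = pd.DataFrame(records)

# With every column present, all rows are distinct.
rows_before_selection = len(df.drop_duplicates())

# After dropping pollutant_standard, the first two rows collapse into one.
selected = df[["site_number", "date_local", "aqi"]]
unique_rows = selected.drop_duplicates()

print(rows_before_selection, len(selected), len(unique_rows))
```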

In [52]:
df_final_AQI = pd.concat([df_gaseous_unique, df_particulate_unique], axis=0)
print(df_final_AQI.shape)

(869209, 13)


Unnamed: 0,state_code,county_code,site_number,parameter_code,latitude,longitude,parameter,sample_duration,date_local,units_of_measure,arithmetic_mean,first_max_value,aqi
0,24,3,1003,42101,39.169533,-76.627933,Carbon monoxide,1 HOUR,1968-05-02,Parts per million,3.25,4.0,
1,24,3,1003,42101,39.169533,-76.627933,Carbon monoxide,1 HOUR,1968-05-03,Parts per million,1.363636,3.0,
2,24,3,1003,42101,39.169533,-76.627933,Carbon monoxide,1 HOUR,1968-05-06,Parts per million,1.75,3.0,
3,24,3,1003,42101,39.169533,-76.627933,Carbon monoxide,1 HOUR,1968-05-07,Parts per million,1.583333,6.0,
4,24,3,1003,42101,39.169533,-76.627933,Carbon monoxide,1 HOUR,1968-05-08,Parts per million,1.0,9.0,


In [6]:
df_final_AQI['aqi'].value_counts()

aqi
31.0     12504
44.0     12188
19.0     10209
14.0      9925
6.0       9541
         ...  
184.0        1
180.0        1
168.0        1
183.0        1
196.0        1
Name: count, Length: 265, dtype: int64

In [8]:
print("Our data has the following list of pollutants that affect the AQI index:")
df_final_AQI['parameter'].unique()

Our data has the following list of pollutants that affect the AQI index:


array(['Carbon monoxide', 'Sulfur dioxide', 'Nitrogen dioxide (NO2)',
       'Ozone', 'PM10 Total 0-10um STP',
       'Acceptable PM2.5 AQI & Speciation Mass',
       'PM2.5 - Local Conditions'], dtype=object)


#### Checking for missing data

Since we aim to calculate the AQI for each year, let’s check how many rows actually have a value in the AQI field.

In [9]:
print(f"Percentage of rows with null values for AQI: {(df_final_AQI['aqi'].isna().sum()/len(df_final_AQI))*100}%")
print(f"Percentage of rows with null values for arithmetic_mean: {(df_final_AQI['arithmetic_mean'].isna().sum()/len(df_final_AQI))*100}%")


Percentage of rows with null values for AQI: 40.25015847742027%
Percentage of rows with null values for arithmetic_mean: 0.0%


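I have decided to substitute the nulls in AQI with the arithmetic_mean values. Note that arithmetic_mean is a pollutant concentration in its own units (for example, parts per million for carbon monoxide), not a value on the AQI scale, so this is only a rough stand-in. In an ideal scenario one should fill in the missing AQI values properly to get an accurate representation.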
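
For context, an AQI value is obtained from a concentration by linear interpolation between breakpoints, which is why the two numbers generally differ. The sketch below shows that interpolation for PM10. The function name is hypothetical, and the two breakpoint segments are the values as I understand them (0 to 54 maps to AQI 0 to 50, and 55 to 154 maps to AQI 51 to 100). They should be checked against the current EPA documentation before being relied on.

```python
# Assumed PM10 24-hour breakpoints: (conc low, conc high, AQI low, AQI high).
# Only the first two segments are listed; verify against EPA documentation.
PM10_BREAKPOINTS = [
    (0.0, 54.0, 0, 50),
    (55.0, 154.0, 51, 100),
]


def concentration_to_aqi(concentration, breakpoints):
    """Linearly interpolate an AQI value; return None if no segment matches."""
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= concentration <= c_high:
            aqi = (i_high - i_low) / (c_high - c_low) * (concentration - c_low) + i_low
            return round(aqi)
    return None


pm10_aqi = concentration_to_aqi(31.0, PM10_BREAKPOINTS)
print(pm10_aqi)
```

For the sample PM10 record shown earlier (arithmetic_mean of 31.0), this gives 29, which agrees with the aqi value returned by the API for that row.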

In [11]:
df_final_AQI['aqi'] = df_final_AQI['aqi'].fillna(df_final_AQI['arithmetic_mean'])

In [12]:
print(f"Percentage of rows with null values for AQI: {(df_final_AQI['aqi'].isna().sum()/len(df_final_AQI))*100}%")
print(f"Percentage of rows with null values for arithmetic_mean: {(df_final_AQI['arithmetic_mean'].isna().sum()/len(df_final_AQI))*100}%")


Percentage of rows with null values for AQI: 0.0%
Percentage of rows with null values for arithmetic_mean: 0.0%


In [13]:
df_final_AQI.to_csv("../intermediary_files/final_AQI.csv")
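
The saved file still holds one row per station, pollutant, and day. One possible way to reduce it to a mean AQI per year is sketched below on a few hypothetical rows that use the same `date_local` and `aqi` column names. This is an illustration of the grouping step, not necessarily how the final annual values are produced.

```python
import pandas as pd

# Hypothetical rows in the same shape as the date_local and aqi columns.
sample = pd.DataFrame(
    {
        "date_local": ["1986-05-02", "1986-05-08", "1987-06-01", "1987-06-02"],
        "aqi": [29.0, 35.0, 40.0, 50.0],
    }
)

# Extract the year from the date string, then average AQI within each year.
sample["year"] = pd.to_datetime(sample["date_local"]).dt.year
annual_mean_aqi = sample.groupby("year")["aqi"].mean().reset_index()
print(annual_mean_aqi)
```

A plain mean treats every row equally, so years with many stations or pollutants are dominated by whichever of them report most often.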