# OpenAQ Data
## Part 1: Compiling Data Using the OpenAQ API

In [2]:
# data from the past 2 years can be pulled through the API documented at https://docs.openaq.org/

# this is a link for a map showing sources and readings: https://openaq.org/#/map?&_k=tqphyo
# you can get more details for any location (and download data)
# San Francisco (San Francisco-Oakland-Fremont) data: https://openaq.org/#/location/San%20Francisco?_k=9pwelq
# 65k records from 07/2018 - today


In [4]:
# for archived data (06/2015 - 04/2018), you can use https://openaq-data.s3.amazonaws.com/index.html
# (the San Francisco data starts around 03/2016)
# for each city there is a record for each type of reading (o3, co, pm, etc.), noted by the key 'parameter'
# the source of the reading is noted by the key 'attribution', which is a dictionary with the name and url


In [87]:
import requests
import pandas as pd

In [88]:
# make a folder to save the data
! mkdir openAQ

A subdirectory or file openAQ already exists.


In [14]:
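import os

def save_file(filename, text):
    # write the text to openAQ/<filename>.csv
    folder = 'openAQ'
    extension = 'csv'
    # os.path.join uses the right path separator on any OS
    path = os.path.join(folder, '{}.{}'.format(filename, extension))
    # the with-block closes the file even if the write fails
    with open(path, 'w', encoding='utf-8') as file:
        file.write(text)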
    

In [15]:
def get_dates_for_range(start_date, end_date):
    date_range = pd.date_range(start=start_date, end=end_date)
    dates = [str(timestamp.date()) for timestamp in date_range]
    return dates

def is_keywords_in_string(string, keywords):
    # True if at least one keyword occurs as a substring of the string
    return any(keyword in string for keyword in keywords)
        

In [16]:
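# request the archived csv for one day (date formatted as yyyy-mm-dd)
# returns the streamed response, or None if the status code is not 200
def request_openAQ_data(date):
    url = 'https://openaq-data.s3.amazonaws.com/{}.csv'.format(date)
    # the timeout keeps a stalled connection from hanging the whole loop
    response = requests.get(url, stream=True, timeout=60)
    if response.status_code == requests.codes.ok:
        return response
    return None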

In [17]:
# this function takes in a date as a string with the format yyyy-mm-dd
# and returns the lines of that day's file that contain any of the keywords
# (the csv header row is dropped unless it happens to contain a keyword)
def filter_date_data_for_keywords(date, keywords):
    filtered_lines = []
    response = request_openAQ_data(date)
    if response is not None:
        for line in response.iter_lines():
            line = line.decode('utf-8')
            if is_keywords_in_string(line, keywords):
                filtered_lines.append(line)
    # join once at the end instead of growing a string line by line
    return ''.join(line + '\n' for line in filtered_lines)
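
The keyword filter matches a substring anywhere in the raw line, so with `keywords = ['San Fran']` the header row is dropped, and a row would also be kept if only its `location` or `attribution` field contained the keyword. A minimal sketch of a column-aware alternative, assuming each archived file starts with a header row that names a `city` column (the function name `filter_rows_by_city` and the sample rows are hypothetical):

```python
import csv
import io

def filter_rows_by_city(csv_text, keywords):
    """Keep the header plus the rows whose 'city' field contains a keyword."""
    reader = csv.DictReader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator='\n')
    writer.writeheader()
    for row in reader:
        if any(keyword in row['city'] for keyword in keywords):
            writer.writerow(row)
    return output.getvalue()

# made-up sample rows, only to show the behaviour
sample = (
    'location,city,parameter,value\n'
    'San Francisco,San Francisco-Oakland-Fremont,pm25,12\n'
    'San Francisco Street,Elsewhere,pm25,7\n'
)
filtered = filter_rows_by_city(sample, ['San Fran'])
print(filtered)
```

The second sample row is excluded even though its `location` contains the keyword, because only the `city` column is tested.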
    

In [18]:
def filter_data_in_date_range(start_date, end_date, keywords):
    dates = get_dates_for_range(start_date, end_date)
    for date in dates:
        print(date)
        data = filter_date_data_for_keywords(date, keywords)
        if data:
            save_file(date, data)

In [None]:
# pull all of the daily data, keep the San Francisco-related records, and save each day's records to a new csv
start_date = '2015-06-09'
end_date = '2018-04-06'
keywords = ['San Fran']
filter_data_in_date_range(start_date, end_date, keywords)
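
Once the daily files are saved, they can be read back into a single collection. The sketch below is one way to do that with the standard library. It assumes the saved files have no header row (the keyword filter only keeps matching lines) and that the columns follow the order listed in Part 2; both assumptions should be checked against a real file. The function name `load_filtered_rows` is hypothetical.

```python
import csv
import glob
import os

# assumed column order, taken from the dataset description in Part 2
COLUMNS = ['location', 'city', 'country', 'utc', 'local', 'parameter',
           'value', 'unit', 'latitude', 'longitude', 'attribution']

def load_filtered_rows(folder='openAQ'):
    """Read every <date>.csv in the folder into one list of dicts."""
    rows = []
    # sorted() keeps the files in date order because names are yyyy-mm-dd
    for path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        with open(path, newline='', encoding='utf-8') as f:
            for record in csv.DictReader(f, fieldnames=COLUMNS):
                rows.append(record)
    return rows
```

The resulting list can be passed straight to `pd.DataFrame(load_filtered_rows())` for analysis.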


## Part 2: Dataset Details

1. location: The name of the location of the site where the reading was observed
2. city: The name of the city
3. country: The name of the country
4. utc: The time of the reading using the UTC timezone
5. local: The time of the reading using the Local timezone
6. parameter: The name of the pollutant being observed
7. value: The value of the pollutant reading
8. unit: The units of the pollutant read
9. latitude: The latitude of the site
10. longitude: The longitude of the site
11. attribution: A dictionary containing the name and url of the source of this data
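
To make the field list concrete, here is a small sketch that parses one row into these eleven fields. The sample row is made up for illustration, and the sketch assumes that `attribution` is stored in the CSV as a JSON-encoded string; that encoding is an assumption to verify against a real file.

```python
import csv
import io
import json

COLUMNS = ['location', 'city', 'country', 'utc', 'local', 'parameter',
           'value', 'unit', 'latitude', 'longitude', 'attribution']

# hypothetical row, not taken from the archive
sample_line = (
    'Example Site,Example City,US,2016-03-01T08:00:00.000Z,'
    '2016-03-01T00:00:00-08:00,pm25,9.5,µg/m³,37.0,-122.0,'
    '"{""name"":""Example Source"",""url"":""https://example.org""}"'
)

record = next(csv.DictReader(io.StringIO(sample_line), fieldnames=COLUMNS))

# convert the numeric fields from text
record['value'] = float(record['value'])
record['latitude'] = float(record['latitude'])
record['longitude'] = float(record['longitude'])

# assumption: the attribution column holds JSON text
record['attribution'] = json.loads(record['attribution'])

print(record['parameter'], record['value'], record['unit'])
print(record['attribution']['name'], record['attribution']['url'])
```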


### Data

The OpenAQ API provides fairly clean data with a high degree of geographic specificity. This data will provide the foundation for our analysis, allowing us to determine the severity of fire-related air pollution over time in the Bay Area and thereby to examine its relationship to travel- and mobility-related factors. There is archived daily data from 2016 to 2018. Each file is about 270 KB, for a total of about 200 MB. More recent data (up to 2 years back) can be requested through the API.

In [2]:
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy

In [3]:
# getting a list of US cities available from OpenAQ
cities = requests.get('https://api.openaq.org/v1/cities?country=US&limit=1000')
print(cities.status_code)

200


From this list, we find that the areas of interest (in the Bay Area) are 'San Francisco-Oakland-Fremont', 'San Jose-Sunnyvale-Santa Clara', 'Vallejo-Fairfield', 'Napa', and 'Sonoma'.

In [4]:
# we next need to find out which 'locations' are included in each of these broader areas

areas_of_interest = [
    'San Francisco-Oakland-Fremont',
    'San Jose-Sunnyvale-Santa Clara',
    'Vallejo-Fairfield',
    'Napa',
    'Sonoma'
]

bay_area_locations = []

# getting info for each area and adding the locations that it contains to a list
for area in areas_of_interest:
    parameters = { 'city[]': area }
    response = requests.get('https://api.openaq.org/v1/locations', params=parameters)
    
    for result in response.json()['results']:
        bay_area_locations.append(result['location'])

In [None]:
bay_area_locations

In [6]:
# This is the recent data request
# getting a sense of what the data for each location will look like... 
# testing the process on data for the 'Berkeley Aquatic Par' location in Alameda (location names appear truncated to 20 characters)
# focusing on PM2.5 because that's the main pollutant used to gauge wildfires' effects on air quality

test_params = {
    'city': 'ALAMEDA',
    'location': 'Berkeley Aquatic Par',
    'parameter': 'pm25',
    'date_from': '2020-01-01',
    'limit': 10
}

alameda_resp = requests.get('https://api.openaq.org/v1/measurements', params=test_params)
print(alameda_resp.status_code)

200


In [None]:
alameda_resp.json()

In [9]:
# turning the 'results' list of the alameda_resp JSON into a df

alameda_df = pd.DataFrame.from_dict(alameda_resp.json()['results'])
alameda_df.head()

| | location | parameter | date | value | unit | coordinates | country | city |
|---|---|---|---|---|---|---|---|---|
| 0 | Berkeley Aquatic Par | pm25 | {'utc': '2020-12-04T03:00:00Z', 'local': '2020... | 30 | µg/m³ | {'latitude': 37.864767, 'longitude': -122.302741} | US | ALAMEDA |
| 1 | Berkeley Aquatic Par | pm25 | {'utc': '2020-12-04T02:00:00Z', 'local': '2020... | 32 | µg/m³ | {'latitude': 37.864767, 'longitude': -122.302741} | US | ALAMEDA |
| 2 | Berkeley Aquatic Par | pm25 | {'utc': '2020-12-04T01:00:00Z', 'local': '2020... | 39 | µg/m³ | {'latitude': 37.864767, 'longitude': -122.302741} | US | ALAMEDA |
| 3 | Berkeley Aquatic Par | pm25 | {'utc': '2020-12-04T00:00:00Z', 'local': '2020... | 35 | µg/m³ | {'latitude': 37.864767, 'longitude': -122.302741} | US | ALAMEDA |
| 4 | Berkeley Aquatic Par | pm25 | {'utc': '2020-12-03T23:00:00Z', 'local': '2020... | 36 | µg/m³ | {'latitude': 37.864767, 'longitude': -122.302741} | US | ALAMEDA |
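
The `date` and `coordinates` columns hold nested dictionaries, which are awkward to sort or plot. A minimal sketch of one way to unpack them into flat columns before building the DataFrame (the helper name `flatten_measurement` is hypothetical; the sample record copies the first row shown above, leaving out the `local` timestamp because it is truncated in the output):

```python
def flatten_measurement(record):
    """Return a copy of one measurement with date and coordinates unpacked."""
    flat = {key: value for key, value in record.items()
            if key not in ('date', 'coordinates')}
    date = record.get('date') or {}
    coordinates = record.get('coordinates') or {}
    flat['utc'] = date.get('utc')
    flat['local'] = date.get('local')
    flat['latitude'] = coordinates.get('latitude')
    flat['longitude'] = coordinates.get('longitude')
    return flat

# first row of the output; 'local' is truncated there, so it is omitted here
sample = {
    'location': 'Berkeley Aquatic Par',
    'parameter': 'pm25',
    'date': {'utc': '2020-12-04T03:00:00Z'},
    'value': 30,
    'unit': 'µg/m³',
    'coordinates': {'latitude': 37.864767, 'longitude': -122.302741},
    'country': 'US',
    'city': 'ALAMEDA',
}

flat = flatten_measurement(sample)
print(flat['utc'], flat['latitude'], flat['longitude'])
```

With the real response this would be used as `pd.DataFrame([flatten_measurement(r) for r in alameda_resp.json()['results']])`. `pd.json_normalize` is a built-in alternative that produces dotted column names such as `date.utc`.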


In [10]:
# get json of all PM2.5 data for the locations of interest; convert to df and combine into one df (pm25_df)
# print the number of records found for PM2.5 measurements in 2020 for each location

pm25_df = pd.DataFrame(columns=['city', 'coordinates', 'country', 'date', 'location', 'parameter', 'unit', 'value'])

for location in bay_area_locations:
    loc_params = {
        'location': location,
        'parameter': 'pm25',
        'limit': 10000,
        'date_from': '2020-01-01'
    }
    
    loc_resp = requests.get('https://api.openaq.org/v1/measurements', params=loc_params)
    loc_json = loc_resp.json()  # parse the response body once and reuse it
    
    print(location, ':', loc_json['meta']['found'])
    
    loc_df = pd.DataFrame.from_dict(loc_json['results'])
    
    pm25_df = pd.concat([pm25_df, loc_df])

Berkeley Aquatic Par : 6462
Bethel Island : 0
Concord : 6637
Hayward : 0
Laney College : 6837
Livermore - Rincon : 6863
Oakland : 6811
Oakland West : 6335
Patterson Pass : 0
Pleasanton - Owens C : 6675
Redwood City : 6667
Richmond - 7th St : 0
San Francisco : 6467
San Pablo - Rumrill : 6479
San Rafael : 6670
San Ramon : 0
Gilory - 9th Street : 6497
Hollister AMS : 2178
Hollister AMS : 2178
Los Gatos : 0
Pinnacles NM : 0
San Jose - Jackson S : 6662
San Jose - Knox Ave : 6698
San Martin : 0
Fairfield : 0
Rio Vista : 5510
Rio Vista : 5510
Vacaville : 6586
Vallejo : 6715
Napa - Jefferson St : 0
Napa - Napa Valley C : 6499
Sonoma Technology Mo : 0


In [11]:
pm25_df.reset_index(inplace=True)
pm25_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 127936 entries, 0 to 127935
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   index        127936 non-null  int64 
 1   city         127936 non-null  object
 2   coordinates  127936 non-null  object
 3   country      127936 non-null  object
 4   date         127936 non-null  object
 5   location     127936 non-null  object
 6   parameter    127936 non-null  object
 7   unit         127936 non-null  object
 8   value        127936 non-null  object
dtypes: int64(1), object(8)
memory usage: 8.8+ MB


This df encompasses the PM2.5 measurements for all of our locations of interest for 2020 (up to the present date). It has 127,936 rows, which is exactly the sum of the per-location counts printed above. Because Hollister AMS (2,178 records) and Rio Vista (5,510 records) each appear twice in the location list, their records are counted twice, leaving 120,248 distinct rows. Either way, it is reasonable to expect each year's data to be of the same order of magnitude. OpenAQ provides the most recent two years of data via its open API, so we will use the same process as above to acquire the earlier data in that two-year window. For data prior to that, we will need to query OpenAQ's S3 buckets, which can be done through a distributed query tool like Amazon Athena, Apache Spark, or Google BigQuery.
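
Extending the same process to earlier periods mostly means changing the request parameters. The sketch below builds one parameter dict per calendar year and estimates how many requests a location needs. It assumes the measurements endpoint accepts a `date_to` parameter alongside `date_from` and that larger result sets are read with a `page` parameter; both are assumptions to confirm in the OpenAQ documentation, as is whether `date_to` is inclusive. The function names are hypothetical.

```python
import math

def yearly_measurement_params(location, years, parameter='pm25', limit=10000):
    """Build one query-parameter dict per calendar year."""
    return [
        {
            'location': location,
            'parameter': parameter,
            'limit': limit,
            'date_from': '{}-01-01'.format(year),
            'date_to': '{}-12-31'.format(year),  # assumption: date_to is supported
        }
        for year in years
    ]

def pages_needed(found, limit=10000):
    """Number of requests needed to read `found` records at `limit` per page."""
    return math.ceil(found / limit)

for params in yearly_measurement_params('San Francisco', [2019, 2020]):
    print(params['date_from'], params['date_to'])

print(pages_needed(6467))   # fits in a single request
print(pages_needed(25000))  # would need several requests
```

Every per-location count printed for 2020 is below the `limit` of 10000, so a single request per location was enough there. A longer date range may not fit, which is why the page count is worth checking against `meta.found`.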


In [12]:
pm25_df.isnull().any()

index          False
city           False
coordinates    False
country        False
date           False
location       False
parameter      False
unit           False
value          False
dtype: bool
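
`isnull()` only reports missing entries. Since `value` is stored with dtype `object`, a complementary check is whether every reading is numeric and non-negative. A minimal sketch of such a check (the function name `summarize_values` and the sample values are hypothetical; whether negative readings should be excluded is a choice for the analysis, not something the output above establishes):

```python
import math

def summarize_values(values):
    """Count readings that are usable, non-numeric, or negative."""
    summary = {'ok': 0, 'non_numeric': 0, 'negative': 0}
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            summary['non_numeric'] += 1
            continue
        if math.isnan(number):
            # NaN converts to float but is not a usable reading
            summary['non_numeric'] += 1
        elif number < 0:
            summary['negative'] += 1
        else:
            summary['ok'] += 1
    return summary

# made-up values covering each case
result = summarize_values([30, 32, '39', -5, None, 'n/a'])
print(result)
# {'ok': 3, 'non_numeric': 2, 'negative': 1}
```

On the real data this would be called as `summarize_values(pm25_df['value'])`.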