# OpenAQ Data
## Part 1: Compiling Data using the OpenAQ API

In [2]:
# data from the past 2 years can be pulled through the API: https://docs.openaq.org/

# map showing sources and readings: https://openaq.org/#/map?&_k=tqphyo
# each location has a page with more details (and a data download)
# San Francisco (San Francisco-Oakland-Fremont) data: https://openaq.org/#/location/San%20Francisco?_k=9pwelq
# 65k records from 07/2018 - today


In [4]:
# for archived data (06/2015 - 04/2018), use https://openaq-data.s3.amazonaws.com/index.html
# * San Francisco data starts around 03/2016
# each city has a record for each type of reading (o3, co, pm, etc.), given by the key 'parameter'
# the source of the reading is given by the key 'attribution', a dictionary with the name and url


In [87]:
import requests
import pandas as pd

In [88]:
# make a folder to save the data
! mkdir openAQ

A subdirectory or file openAQ already exists.


In [89]:
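import os

def save_file(filename, text):
    # write the text to openAQ/<filename>.csv
    folder = 'openAQ'
    extension = 'csv'
    # os.path.join picks the right path separator for the operating system
    path = os.path.join(folder, '{}.{}'.format(filename, extension))
    # the with-block closes the file even if write() raises
    with open(path, 'w', encoding='utf-8') as file:
        file.write(text)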
    

In [90]:
def get_dates_for_range(start_date, end_date):
    date_range = pd.date_range(start=start_date, end=end_date)
    dates = [str(timestamp.date()) for timestamp in date_range]
    return dates

def is_keywords_in_string(string, keywords):
    # True if any keyword occurs in the string (case-sensitive substring match)
    return any(keyword in string for keyword in keywords)
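
The keyword check above is a plain substring test, so it is case-sensitive: `'San Fran'` matches `San Francisco-Oakland-Fremont` but would miss `SAN FRANCISCO`. If the archive spells a city inconsistently, a case-insensitive variant may be safer. The sketch below is one way to do that; `contains_any_keyword` is a hypothetical name, not part of the notebook, and the sample line is made up.

```python
def contains_any_keyword(string, keywords):
    # hypothetical case-insensitive version of is_keywords_in_string
    lowered = string.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# made-up line, only to show the behaviour
sample = '"Site A","San Francisco-Oakland-Fremont","US"'
print(contains_any_keyword(sample, ['san fran']))
print(contains_any_keyword(sample, ['Los Angeles']))
```

The first call matches despite the lowercase keyword; the second does not match at all.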
        

In [119]:
# request one day's file from the archive; returns None if the request fails (e.g. no file for that date)
def request_openAQ_data(date):
    url = 'https://openaq-data.s3.amazonaws.com/{}.csv'.format(date)
    response = requests.get(url, stream=True)
    if response.status_code == requests.codes.ok:
        return response
    return None

In [121]:
# takes a date as a string with the format yyyy-mm-dd and returns the lines
# of that day's file that contain any of the keywords (empty string if none)
def filter_date_data_for_keywords(date, keywords):
    matching_lines = []
    response = request_openAQ_data(date)
    if response is not None:
        for line in response.iter_lines():
            line = line.decode('utf-8')
            if is_keywords_in_string(line, keywords):
                matching_lines.append(line + '\n')
    # join once at the end instead of growing a string inside the loop
    return ''.join(matching_lines)
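
One side effect of filtering line by line: a header row contains none of the keywords, so it is dropped and the saved files end up without column names. Assuming the first line of each daily file is a header (worth checking against a downloaded file), a sketch that keeps it could look like this. `filter_lines_keep_header` is a hypothetical helper that works on any iterable of already-decoded lines, so it can be tried without a network request; the sample rows are made up.

```python
def filter_lines_keep_header(lines, keywords):
    # assumption: the first line is the CSV header row
    kept = []
    for index, line in enumerate(lines):
        if index == 0 or any(keyword in line for keyword in keywords):
            kept.append(line)
    # return nothing if only the header survived, so no empty file is saved
    if len(kept) <= 1:
        return ''
    return '\n'.join(kept) + '\n'


# made-up rows, only to show the behaviour
sample_lines = [
    'location,city,country,parameter,value',
    'Site A,San Francisco-Oakland-Fremont,US,pm25,9.0',
    'Site B,Los Angeles,US,o3,0.03',
]
print(filter_lines_keep_header(sample_lines, ['San Fran']))
```

Returning an empty string when nothing matches keeps the `if data:` check in the calling code meaningful.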
    

In [129]:
def filter_data_in_date_range(start_date, end_date, keywords):
    dates = get_dates_for_range(start_date, end_date)
    for date in dates:
        print(date)
        data = filter_date_data_for_keywords(date, keywords)
        if data:
            save_file(date, data)

In [None]:
# pull all of the daily data, keep only the San Francisco-related records, and save each day to a new csv
start_date = '2015-06-09'
end_date = '2018-04-06'
keywords = ['San Fran']
filter_data_in_date_range(start_date, end_date, keywords)


## Part 2: Dataset Details

1. location: The name of the location of the site where the reading was observed
2. city: The name of the city
3. country: The name of the country
4. utc: The time of the reading in UTC
5. local: The time of the reading in the local time zone
6. parameter: The name of the pollutant being observed
7. value: The value of the pollutant reading
8. unit: The unit of the pollutant reading
9. latitude: The latitude of the site
10. longitude: The longitude of the site
11. attribution: A dictionary containing the name and url of the source of this data
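
The filter in Part 1 keeps only lines that contain a keyword, so the saved files most likely have no header row and the column names listed above have to be supplied when reading them back. The sketch below uses the standard library's `csv` module so it runs without extra packages; with pandas, `pd.read_csv(path, names=COLUMNS)` should do the same job. `COLUMNS`, `read_records`, and the sample row are assumptions for illustration, not taken from a real file.

```python
import csv
import io

# column order as listed above
COLUMNS = ['location', 'city', 'country', 'utc', 'local', 'parameter',
           'value', 'unit', 'latitude', 'longitude', 'attribution']

# made-up row in the assumed layout, standing in for a saved file
sample = ('"Site A","San Francisco-Oakland-Fremont","US",'
          '"2018-04-06T08:00:00.000Z","2018-04-06T01:00:00-07:00",'
          '"o3",0.03,"ppm",37.77,-122.42,"source placeholder"\n')


def read_records(handle):
    # the saved files are assumed to have no header row, so pass the names in
    reader = csv.DictReader(handle, fieldnames=COLUMNS)
    records = []
    for row in reader:
        row['value'] = float(row['value'])  # readings are numeric
        records.append(row)
    return records


records = read_records(io.StringIO(sample))
print(records[0]['parameter'], records[0]['value'], records[0]['unit'])
```

For a real file, open it with `open(path, newline='', encoding='utf-8')` and pass the handle to `read_records`.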


### Data

This API also provides fairly clean data with a high degree of geographic specificity. This data will be the foundation for our analysis: it lets us determine the severity of fire-related air pollution over time in the Bay Area and examine its relationship to travel- and mobility-related factors. Archived daily data is available from 2016 to 2018. Each file is about 270 KB, for a total of about 200 MB. More recent data (up to 2 years back) can be requested through the API.

In [None]:
# TODO add a small request to show what the data looks like
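
As a possible starting point for the TODO above, the sketch below reads only the first few lines of one daily archive file. It uses the archive URL pattern from Part 1 and the standard library's `urllib` so that it runs without extra packages; the same idea should work with `requests.get(url, stream=True)` and `iter_lines()`. `build_archive_url` and `preview_daily_file` are hypothetical names, and the fetch itself is left commented out because it needs network access and assumes the archive is still online.

```python
import itertools
import urllib.request

ARCHIVE_URL = 'https://openaq-data.s3.amazonaws.com/{}.csv'


def build_archive_url(date):
    # date is a string in the format yyyy-mm-dd
    return ARCHIVE_URL.format(date)


def preview_daily_file(date, n_lines=5):
    # read only the first n_lines of the day's file instead of the whole thing
    with urllib.request.urlopen(build_archive_url(date), timeout=30) as response:
        return [line.decode('utf-8').rstrip('\r\n')
                for line in itertools.islice(response, n_lines)]


print(build_archive_url('2018-04-06'))
# to actually fetch the preview:
# for line in preview_daily_file('2018-04-06'):
#     print(line)
```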
