## AirNow

https://www.airnow.gov

AirNow is a government resource that our group was really excited about. We expected it to be relatively easy to retrieve everything we needed, but it turned out to be more complex than that. The AirNow API only offered RSS feeds for recent and current data, while all of the historical files were stored in folders and available to download (link below). Ultimately, we finished the necessary code and successfully downloaded all of the files we wanted, but we didn't end up using any of them, for reasons that included time constraints and finding another dataset that provided most of what these files would have given us. We also had some unresolved questions about the format and completeness of the AirNow files.

The one thing we missed out on by setting these files aside was data for particulate matter (PM2.5 and PM10). Wildfires inevitably emit both, and they may be among the most important metrics to record when looking at fires and air quality. In retrospect, we really wish we had included them in our analysis.

**AirNow Links**

API Documentation: https://docs.airnowapi.org  
Historical Data Downloads: http://files.airnowtech.org

**Data Dictionaries**

Daily Data: https://docs.airnowapi.org/docs/DailyDataFactSheet.pdf  
Hourly Data: https://docs.airnowapi.org/docs/HourlyDataFactSheet.pdf  


## Imports

**Primary Imports**

In [1]:
import requests
import pandas as pd
import datetime as dt
import time
import os

import PyPDF2
from bs4 import BeautifulSoup

Each day's folder contained a large number of files, most of which we didn't need. For the Daily Data, two filenames were used over the years covered by our date range:

- daily_data.dat
- daily_data_v2.dat

Sample Day (may not work without sign-in):  
http://files.airnowtech.org/?prefix=airnow/2012/20120101/
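
The folder layout described above can be turned into download URLs with a small helper. This is a minimal sketch, assuming the same S3 base URL that the download cells in this notebook request from; `BASE_URL`, `DAILY_FILES`, and `daily_file_urls` are hypothetical names introduced here for illustration.

```python
import datetime as dt

# Assumption: same S3 location that the download cells in this notebook request from.
BASE_URL = 'https://s3-us-west-1.amazonaws.com//files.airnowtech.org/airnow'

# The two daily filenames listed above.
DAILY_FILES = ['daily_data.dat', 'daily_data_v2.dat']


def daily_file_urls(day):
    """Return the candidate daily-file URLs for one date (hypothetical helper)."""
    folder = day.strftime('%Y/%Y%m%d')   # e.g. 2012/20120101
    return [f'{BASE_URL}/{folder}/{name}' for name in DAILY_FILES]


urls = daily_file_urls(dt.date(2012, 1, 1))
for url in urls:
    print(url)
```

Formatting the date with `strftime` avoids slicing the string form of a `datetime`, which depends on its default layout.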


### Daily Data

In [2]:
report = []

# establish date range
start_date = dt.datetime(2013, 10, 22)    # note: daily data files begin 2013-10-22
end_date = dt.datetime(2020, 12, 31)
delta = dt.timedelta(days=1)

# begin loop through date range
while start_date <= end_date:
    
    # create string of the current date
    start_string = str(start_date)
    
    # create substrings for year, month, day
    year = start_string[0:4]
    month = start_string[5:7]
    day = start_string[8:10]

    # create date string to add to URL
    date_string = f'{year}/{year}{month}{day}/'

    # loop through two possible filenames
    for file in ['daily_data.dat', 'daily_data_v2.dat']:     
            
        try:
            # set full url for data file
            data_url = f'https://s3-us-west-1.amazonaws.com//files.airnowtech.org/airnow/{date_string}{file}'

            # get request for each day's data file
            r = requests.get(data_url)
            
            # Check URL connection status
            if r.status_code == 200:
                
                # find or create destination folder
                file_path = f"data/{file}"
                directory = os.path.dirname(file_path)
                if not os.path.exists(directory):
                    os.makedirs(directory)       

                # write file to destination ('w' so a re-run overwrites instead of duplicating rows)
                with open(f'data/{date_string[5:-1]}{file}', 'w') as f:
                    f.write(r.text)
                
                # add logging info for successful download to report list
                cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                report.append({'event': f'Data saved successfully',
                               'date': f'{year}-{month}-{day}',
                               'file': f'{file}',
                               'datetime': f'{cur_datetime}',
                               'exception': ''
                               })
            else:
                pass
        
        # if error is raised above, add logging info for failed download to report list
        except Exception as e:
            cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print(f'{cur_datetime}: {e}')
            report.append({'event': f'Could not save data',
                           'date': f'{year}-{month}-{day}',
                           'file': f'{file}',
                           'datetime': f'{cur_datetime}',
                           'exception': f'{e}'
                           })            
            
        # sleep for the sake of the server and output the report csv after every file download attempt    
        time.sleep(.25)            
        df_report = pd.DataFrame(report)
        df_report.to_csv(f'report.csv', index=False)    
    
    # advance current position in date range by delta set at the top
    start_date += delta
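
Because the completeness of the files was an open question, one quick check is to compare the saved filenames against the requested date range. This is a minimal sketch, assuming files are saved as `YYYYMMDD` followed by the original filename, as the cell above does; `missing_days` is a hypothetical helper, and the filenames in the usage lines are made up for illustration.

```python
import datetime as dt


def missing_days(filenames, start, end):
    """Return dates in [start, end] with no saved file whose name starts with YYYYMMDD."""
    # Collect the 8-character date prefixes of the files we have.
    seen = {name[:8] for name in filenames}
    missing = []
    day = start
    while day <= end:
        if day.strftime('%Y%m%d') not in seen:
            missing.append(day)
        day += dt.timedelta(days=1)
    return missing


# Illustrative filenames only; in practice this could be os.listdir('data').
saved = ['20131022daily_data.dat', '20131024daily_data.dat']
gaps = missing_days(saved, dt.date(2013, 10, 22), dt.date(2013, 10, 24))
print(gaps)
```

A check like this is useful here because the daily cell above silently skips any response whose status code is not 200, so those days never appear in the report.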


Had we decided to use the data, our first step would have been to convert the `.dat` files into `.csv`.

In [3]:
file_test = '/Users/rileydrobertson/DSI/projects/project_5/AirQuality-USWest/data/air_now/data_airnow_daily/20131022daily_data.dat'

In [4]:
pd.read_csv(file_test, sep='|', header=None)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 10/22/13 | 000020301 | WELLINGTON | PM2.5-24hr | UG/M3 | 13.0 | 24 | Environment Canada |
| 1 | 10/22/13 | 000020301 | WELLINGTON | OZONE-1HR | PPB | 37.0 | 1 | Environment Canada |
| 2 | 10/22/13 | 000020301 | WELLINGTON | OZONE-8HR | PPB | 33.0 | 8 | Environment Canada |
| 3 | 10/22/13 | 000030118 | HALIFAX | OZONE-8HR | PPB | 35.0 | 8 | Environment Canada |
| 4 | 10/22/13 | 000030118 | HALIFAX | OZONE-1HR | PPB | 38.0 | 1 | Environment Canada |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3675 | 10/22/13 | 371230001 | Candor FRO | PM2.5-24hr | UG/M3 | 15.7 | 24 | North Carolina DENR - Divison of Air Quality |
| 3676 | 10/22/13 | 480271045 | Temple Georgia C1045 | OZONE-1HR | PPB | 46.0 | 1 | Texas Commission on Environmental Quality |
| 3677 | 10/22/13 | 480271045 | Temple Georgia C1045 | OZONE-8HR | PPB | 41.0 | 8 | Texas Commission on Environmental Quality |
| 3678 | 10/22/13 | 484690609 | Inez C609 | OZONE-8HR | PPB | 43.0 | 8 | Texas Commission on Environmental Quality |
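
The conversion itself could look something like the sketch below, which reads a pipe-delimited file and writes a comma-separated one with a header row. The column names are assumptions inferred from the eight columns in the sample output above, not taken from the data dictionary, and should be checked against the Daily Data fact sheet; `DAILY_COLUMNS` and `dat_to_csv` are hypothetical names. The example uses in-memory text streams so that it runs without the downloaded files.

```python
import csv
import io

# Assumed column names, inferred from the sample output; verify against the data dictionary.
DAILY_COLUMNS = ['date', 'aqs_id', 'site_name', 'parameter',
                 'units', 'value', 'averaging_hours', 'source']


def dat_to_csv(dat_stream, csv_stream, columns=DAILY_COLUMNS):
    """Copy pipe-delimited rows to a CSV stream, adding a header row. Returns the row count."""
    reader = csv.reader(dat_stream, delimiter='|')
    writer = csv.writer(csv_stream, lineterminator='\n')
    writer.writerow(columns)
    count = 0
    for row in reader:
        if row:                      # skip blank lines
            writer.writerow(row)
            count += 1
    return count


# Two rows copied from the sample output above.
sample = io.StringIO(
    '10/22/13|000020301|WELLINGTON|PM2.5-24hr|UG/M3|13.0|24|Environment Canada\n'
    '10/22/13|000020301|WELLINGTON|OZONE-1HR|PPB|37.0|1|Environment Canada\n'
)
out = io.StringIO()
n_rows = dat_to_csv(sample, out)
print(out.getvalue())
```

With pandas, the same header could be supplied through the `names=` argument of `pd.read_csv`. Reading the ID column as a string (`dtype=str`) would guard against losing the leading zeros.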


### Hourly Data

In [5]:
report = []

# establish date range
start_date = dt.datetime(2014, 1, 1)    # first day of the hourly download range
end_date = dt.datetime(2019, 12, 31)
delta = dt.timedelta(days=1)

# begin loop through date range
while start_date <= end_date:
    
    # create string of the current date
    start_string = str(start_date)
    
    # create substrings for year, month, day
    year = start_string[0:4]
    month = start_string[5:7]
    day = start_string[8:10]

    # create filepath string
    date_string = f'{year}/{year}{month}{day}/'
    
    
    # print on-screen updates at the first of every month
    if day == '01':
        print(f'Collecting files for {month}-{year}...')
    

    # loop through files for midnight, 6am, noon, 6pm, and the following midnight
    for file in [f'HourlyData_{year}{month}{day}00.dat', # midnight 1 
                 f'HourlyData_{year}{month}{day}06.dat',
                 f'HourlyData_{year}{month}{day}12.dat',
                 f'HourlyData_{year}{month}{day}18.dat',
                 f'HourlyData_{year}{month}{day}24.dat']: # midnight 2
         
            
        try:
            # set full url for data file
            data_url = f'https://s3-us-west-1.amazonaws.com//files.airnowtech.org/airnow/{date_string}{file}'

            # get request for each day's data files
            r = requests.get(data_url)
            
            # Check URL connection status
            if r.status_code == 200:
                
                # find or create destination folder
                file_path = f"data/{file}"
                directory = os.path.dirname(file_path)

                if not os.path.exists(directory):
                    os.makedirs(directory)       

                # write file to destination ('w' so a re-run overwrites instead of duplicating rows)
                with open(f'data/{date_string[5:-1]}{file}', 'w') as f:
                    f.write(r.text)
                
                # add logging info for successful download to report list
                cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                report.append({'event': f'Data saved successfully',
                               'date': f'{year}-{month}-{day}',
                               'file': f'{file}',
                               'timestamp': f'{cur_datetime}',
                               'exception': ''
                               })
            # if status code is not good, add logging info for failed download to report list    
            else:
                cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                report.append({'event': f'Could not save data',
                               'date': f'{year}-{month}-{day}',
                               'file': f'{file}',
                               'timestamp': f'{cur_datetime}',
                               'exception': 'Error: Status Code'
                               })    
        # if error is raised above, add logging info for failed download to report list
        except Exception as e:
            cur_datetime = dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print(f'{cur_datetime}: {e}')
            report.append({'event': f'Could not save data',
                           'date': f'{year}-{month}-{day}',
                           'file': f'{file}',
                           'timestamp': f'{cur_datetime}',
                           'exception': f'{e}'
                           })            
            
        # sleep for the sake of the server and output the report csv after every file download attempt    
        time.sleep(.25)            
        df_report = pd.DataFrame(report)
        df_report.to_csv(f'airnow_scrape_hourly_report.csv', index=False)    

    # advance current position in date range by delta set at the top    
    start_date += delta
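
The list of hourly filenames can also be generated rather than typed out, which makes it easy to change the sampling interval. This is a minimal sketch, assuming the `HourlyData_YYYYMMDDHH.dat` naming pattern used in the cell above and that hours run from 00 to 23. That range is an assumption, since the cell above also requests an hour-24 file. `hourly_filenames` is a hypothetical helper.

```python
import datetime as dt


def hourly_filenames(day, step=6):
    """Return hourly filenames for one date, one every `step` hours starting at 00."""
    stamp = day.strftime('%Y%m%d')
    return [f'HourlyData_{stamp}{hour:02d}.dat' for hour in range(0, 24, step)]


names = hourly_filenames(dt.date(2014, 1, 1))
print(names)
```

Passing `step=1` would yield one filename for every hour of the day.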
    