# Weather Data Collection

The following code was used to collect daily weather data for 2020 from the [NCDC API](https://www.ncdc.noaa.gov/cdo-web/webservices/v2), one request per day for all stations in `STATIONS`. The API requires a token; paste yours in place of `PASTE_YOUR_TOKEN_HERE` in the `make_request()` function below:

In [1]:
import datetime as dt
import requests


STATIONS = {
    'GHCND:USW00023174': 'LA',
    'GHCND:USW00014732': 'NYC',
    'GHCND:USW00023234': 'SF',
    'GHCND:USW00012960': 'Houston',
    'GHCND:USW00013874': 'Atlanta',
    'GHCND:USW00094846': 'Chicago',
    'GHCND:USW00014739': 'Boston',
    'GHCND:USW00012839': 'Miami',
    'GHCND:USW00024233': 'Seattle',
    'GHCND:USW00023183': 'Phoenix',
    'GHCND:USW00022521': 'Honolulu'
}

def make_request(endpoint, payload=None):
    """
    Make a request to a specific endpoint on the weather API
    passing headers and optional payload.
    
    Parameters:
        - endpoint: The endpoint of the API you want to 
                    make a GET request to.
        - payload: A dictionary of data to pass along 
                   with the request.
    
    Returns:
        A response object.
    """
    return requests.get(
        f'https://www.ncdc.noaa.gov/cdo-web/api/v2/{endpoint}',
        headers={
            'token': 'PASTE_YOUR_TOKEN_HERE'
        },
        params=payload
    )

start = current = dt.date(2020, 1, 1)
end = dt.date(2021, 1, 1)

results = []

while current < end:
    # update the cell with status information
    print(f'\rGathering data for {current}', end='')
    response = make_request(
        'data', 
        {
            'datasetid': 'GHCND',
            'stationid': STATIONS.keys(),
            'startdate': current,
            'enddate': current,
            'datatypeid': [
                'TAVG', 'TMAX', 'TMIN', 
                'SNOW', 'PRCP', 
                'AWND', 'TSUN',
                'ACSC', 'SCSH', 'PSUN'
            ],
            'units': 'standard',
            'limit': 1000
        }
    )

    if response.ok:
        # we extend the list instead of appending to avoid getting a nested list;
        # .get() guards against a response body that has no 'results' key
        results.extend(response.json().get('results', []))

    # update the current date to avoid an infinite loop
    current += dt.timedelta(days=1)
print('\nDone')

Gathering data for 2020-12-31
Done
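The loop above hard-codes the token and silently skips any day whose request fails. As a hedged sketch of two optional safeguards, the block below reads the token from an environment variable and records which days failed so they can be retried. The names `get_token()`, `collect_daily()`, and the `NCDC_TOKEN` variable are assumptions made for illustration; they are not part of the original notebook or of the API.

```python
import datetime as dt
import os


def get_token(env_var='NCDC_TOKEN'):
    """Read the API token from an environment variable (hypothetical name)."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f'Set the {env_var} environment variable to your API token')
    return token


def collect_daily(fetch, start, end):
    """
    Call fetch(day) for every day in [start, end) and gather the records.

    fetch is any callable that returns a list of records for the day,
    or None when the request failed.

    Returns a (results, failed_days) tuple.
    """
    results, failed_days = [], []
    current = start
    while current < end:
        records = fetch(current)
        if records is None:
            failed_days.append(current)  # remember the day so it can be retried
        else:
            results.extend(records)  # extend keeps the list flat
        current += dt.timedelta(days=1)
    return results, failed_days
```

In the notebook, `fetch` would be a small wrapper around `make_request()` that returns the parsed `results` list when `response.ok` is true and `None` otherwise, and `get_token()` would supply the value for the `token` header. An empty `failed_days` list then confirms that every day in the range was collected.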


## Generating the `weather.csv` file

The data collected from the API looks like this:

In [2]:
import pandas as pd
weather = pd.DataFrame(results)
weather.head()

,date,datatype,station,attributes,value
0,2020-01-01T00:00:00,AWND,GHCND:USW00012839,",,W,",4.3
1,2020-01-01T00:00:00,PRCP,GHCND:USW00012839,",,W,2400",0.0
2,2020-01-01T00:00:00,TAVG,GHCND:USW00012839,"H,,S,",70.0
3,2020-01-01T00:00:00,TMAX,GHCND:USW00012839,",,W,2400",82.0
4,2020-01-01T00:00:00,TMIN,GHCND:USW00012839,",,W,2400",61.0


We have several different weather observations:

In [3]:
weather.datatype.unique()

array(['AWND', 'PRCP', 'TAVG', 'TMAX', 'TMIN', 'SNOW'], dtype=object)
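Comparing this output with the `datatypeid` list sent in the request shows which requested types never came back for these stations. A minimal sketch, with both lists copied from the cells above:

```python
# Data types requested in the API call
requested = [
    'TAVG', 'TMAX', 'TMIN', 'SNOW', 'PRCP',
    'AWND', 'TSUN', 'ACSC', 'SCSH', 'PSUN'
]

# Data types present in the collected results
returned = ['AWND', 'PRCP', 'TAVG', 'TMAX', 'TMIN', 'SNOW']

# Requested types that are absent from the results
missing = sorted(set(requested) - set(returned))
print(missing)  # ['ACSC', 'PSUN', 'SCSH', 'TSUN']
```

In the notebook itself, `returned` could be replaced with `weather.datatype.unique()`, provided the check runs before the pivot, which removes the `datatype` column.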

We will pivot the data from long to wide format, giving one row per date and city with one column per observation type, and convert station IDs to city names so that the data is easier to work with:

In [4]:
weather = weather.assign(
    date=lambda x: pd.to_datetime(x.date), 
    city=lambda x: pd.Series(STATIONS).loc[x.station,].values
).drop(columns=['attributes', 'station']).pivot(
    index=['date', 'city'], columns='datatype', values='value'
).reset_index().set_index('date')
weather.columns = weather.columns.rename('')

weather

date,city,AWND,PRCP,SNOW,TAVG,TMAX,TMIN
2020-01-01,Atlanta,7.2,0.00,0.0,45.0,57.0,36.0
2020-01-01,Boston,15.4,0.00,0.0,39.0,43.0,36.0
2020-01-01,Chicago,11.9,0.00,0.0,28.0,42.0,21.0
2020-01-01,Honolulu,6.3,0.00,,76.0,81.0,68.0
2020-01-01,Houston,6.5,0.10,0.0,52.0,60.0,47.0
...,...,...,...,...,...,...,...
2020-12-31,Miami,16.6,0.00,,76.0,81.0,73.0
2020-12-31,NYC,10.7,0.53,0.0,46.0,50.0,38.0
2020-12-31,Phoenix,5.1,0.00,,51.0,60.0,41.0
2020-12-31,SF,9.8,0.03,,54.0,60.0,47.0
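To see the reshaping step in isolation, here is a small self-contained sketch on four rows whose values are taken from the outputs above. It assumes pandas 1.1 or later (needed for a list-valued `index` in `pivot()`), and it swaps in `Series.map()` for the `pd.Series(STATIONS).loc[...]` lookup; for station IDs that all appear in the dictionary, both give the same city labels.

```python
import pandas as pd

# Two of the stations from STATIONS
stations = {'GHCND:USW00012839': 'Miami', 'GHCND:USW00014739': 'Boston'}

# Long format: one row per (date, station, datatype)
long_df = pd.DataFrame({
    'date': ['2020-01-01T00:00:00'] * 4,
    'datatype': ['TMAX', 'TMIN', 'TMAX', 'TMIN'],
    'station': ['GHCND:USW00012839'] * 2 + ['GHCND:USW00014739'] * 2,
    'value': [82.0, 61.0, 43.0, 36.0],
})

# Wide format: one row per (date, city), one column per datatype
wide = (
    long_df
    .assign(
        date=lambda x: pd.to_datetime(x.date),
        city=lambda x: x.station.map(stations),
    )
    .drop(columns='station')
    .pivot(index=['date', 'city'], columns='datatype', values='value')
    .reset_index()
    .set_index('date')
)
wide.columns.name = None  # drop the leftover 'datatype' label on the columns

print(wide)
```

Note that `pivot()` raises a `ValueError` if the same date, city, and datatype combination appears more than once, so duplicate records would need to be removed or aggregated first.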


Save to a CSV file:

In [5]:
weather.to_csv('../weather.csv')
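When the file is read back, the `date` index comes in as an ordinary column of strings unless told otherwise. The sketch below shows one way to restore it, using a tiny stand-in for `weather` (values from the output above) and an in-memory buffer in place of `'../weather.csv'`:

```python
import io

import pandas as pd

# Tiny stand-in for the pivoted weather data
weather = pd.DataFrame(
    {'city': ['Atlanta', 'Boston'], 'TMAX': [57.0, 43.0], 'TMIN': [36.0, 36.0]},
    index=pd.DatetimeIndex(['2020-01-01', '2020-01-01'], name='date'),
)

buffer = io.StringIO()  # in-memory stand-in for the CSV file
weather.to_csv(buffer)
buffer.seek(0)

# index_col restores the index; parse_dates=True parses it as datetimes
restored = pd.read_csv(buffer, index_col='date', parse_dates=True)
print(restored.index)
```

With a real file, the buffer would be replaced by the path passed to `to_csv()`.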