# External Data Loading

This notebook loads data from external sources to support modeling and other tasks in this project. For now, all external data loading lives here.

# Historical Weather Data
Weather could have a large effect on crime, and it could be useful to include it in our models.  We can pull historical weather data for the Durham area using [this NOAA API](http://www.ncdc.noaa.gov/cdo-web/webservices/v2#gettingStarted).

In [121]:
import requests
import json
import csv
from pprint import pprint

API_token = "icxXgrAydpGZbNqZtPMpcsCPnIWcwQhu"  # <-- Jason's API token
headers = {'token': API_token}
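A token committed to a shared notebook is visible to anyone who can read the file. As a minimal sketch, the same `headers` dict could instead be built from an environment variable. The variable name `NOAA_API_TOKEN` and the helper `build_headers` are hypothetical and not part of the original notebook.

```python
import os

def build_headers(env_var="NOAA_API_TOKEN"):
    """Build the request headers from a token stored in an environment variable.

    `env_var` is a hypothetical name chosen for this sketch.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError("Set the %s environment variable first" % env_var)
    return {'token': token}
```

With the variable set in the shell that launches the notebook, `headers = build_headers()` would replace the hard-coded assignment.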

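First, we need to see all the stations in NC, so we can find the one closest to Durham. The API returns at most 1000 results per request and NC has more than 2000 stations, so we page through them with `offset`; the request below fetches the final page.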

In [86]:
stations_url = 'http://www.ncdc.noaa.gov/cdo-web/api/v2/stations'
stations_params = {'locationid': 'FIPS:37', 'limit': 1000, 'offset': 2000}

r = requests.get(stations_url, params=stations_params, headers=headers)

In [87]:
pprint(r.json())

{'metadata': {'resultset': {'count': 2161, 'limit': 1000, 'offset': 2000}},
 'results': [{'datacoverage': 0.9996,
              'elevation': 190.2,
              'elevationUnit': 'METERS',
              'id': 'GHCND:USC00448257',
              'latitude': 36.533333,
              'longitude': -79.533333,
              'maxdate': '1919-12-31',
              'mindate': '1917-08-01',
              'name': 'SWANSONVILLE, NC US'},
             {'datacoverage': 0.9874,
              'elevation': 6.1,
              'elevationUnit': 'METERS',
              'id': 'GHCND:USR0000NBAC',
              'latitude': 34.5328,
              'longitude': -77.7219,
              'maxdate': '2015-08-23',
              'mindate': '2002-05-01',
              'name': 'BACK ISLAND NORTH CAROLINA, NC US'},
             {'datacoverage': 0.9803,
              'elevation': 7.6,
              'elevationUnit': 'METERS',
              'id': 'GHCND:USR0000NBFT',
              'latitude': 35.5206,
              'longit

After digging around, we can find a few stations that could have useful info.  We can examine the datasets these provide.
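One way the digging could be automated is to rank the station records by great-circle distance from Durham. The sketch below assumes approximate coordinates for downtown Durham (not taken from the API) and uses two hypothetical helpers, `haversine_km` and `rank_stations`; the sample records are copied from the `/stations` response shown earlier.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate coordinates of downtown Durham, NC (an assumption for this sketch).
DURHAM_LAT, DURHAM_LON = 35.994, -78.8986

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def rank_stations(stations, active_since=None, lat=DURHAM_LAT, lon=DURHAM_LON):
    """Return stations sorted nearest-first.

    If `active_since` (an ISO date string) is given, stations whose
    'maxdate' is earlier than that date are dropped first.
    """
    if active_since is not None:
        stations = [s for s in stations if s.get('maxdate', '') >= active_since]
    return sorted(
        stations,
        key=lambda s: haversine_km(lat, lon, s['latitude'], s['longitude']),
    )

# Two records copied from the /stations response above.
sample = [
    {'id': 'GHCND:USR0000NBAC', 'name': 'BACK ISLAND NORTH CAROLINA, NC US',
     'latitude': 34.5328, 'longitude': -77.7219,
     'mindate': '2002-05-01', 'maxdate': '2015-08-23'},
    {'id': 'GHCND:USC00448257', 'name': 'SWANSONVILLE, NC US',
     'latitude': 36.533333, 'longitude': -79.533333,
     'mindate': '1917-08-01', 'maxdate': '1919-12-31'},
]

nearest_any = rank_stations(sample)[0]
nearest_recent = rank_stations(sample, active_since='2014-01-01')[0]
```

Of these two records, Swansonville is closer to Durham, but it stopped reporting in 1919, so filtering on `maxdate` matters as much as distance when the goal is recent weather.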

In [116]:
# Useful datasets: there are many with 'DURHAM' in the name, and not all are useful

durham_id = 'GHCND:USW00003758'
duke_forest_id = 'GHCND:USR0000NDUK'
rdu_radar_id = 'NEXRAD:KRAX'
rdu_id = 'GHCND:USW00013722'

In [117]:
datasets_url = 'http://www.ncdc.noaa.gov/cdo-web/api/v2/datasets'
dataset_params = {
    'stationid': rdu_id,
    'limit': 1000
}
r = requests.get(datasets_url, params=dataset_params, headers=headers)

In [118]:
pprint(r.json())

{'metadata': {'resultset': {'count': 6, 'limit': 1000, 'offset': 1}},
 'results': [{'datacoverage': 1,
              'id': 'GHCND',
              'maxdate': '2015-08-25',
              'mindate': '1763-01-01',
              'name': 'Daily Summaries',
              'uid': 'gov.noaa.ncdc:C00861'},
             {'datacoverage': 1,
              'id': 'GHCNDMS',
              'maxdate': '2015-06-01',
              'mindate': '1763-01-01',
              'name': 'Monthly Summaries',
              'uid': 'gov.noaa.ncdc:C00841'},
             {'datacoverage': 1,
              'id': 'NORMAL_ANN',
              'maxdate': '2010-01-01',
              'mindate': '2010-01-01',
              'name': 'Normals Annual/Seasonal',
              'uid': 'gov.noaa.ncdc:C00821'},
             {'datacoverage': 1,
              'id': 'NORMAL_DLY',
              'maxdate': '2010-12-31',
              'mindate': '2010-01-01',
              'name': 'Normals Daily',
              'uid': 'gov.noaa.ncdc:C00823'},
  

Now that we know which datasets a station provides, we can look at its actual data.

Note that the values are in the following units:

- PRCP = Precipitation (tenths of mm)
- SNOW = Snowfall (mm)
- SNWD = Snow depth (mm)
- TMAX = Maximum temperature (tenths of degrees C)
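- TMIN = Minimum temperature (tenths of degrees C)
- TAVG = Average temperature (tenths of degrees C)
- AWND = Average daily wind speed (tenths of meters per second)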
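Because several of these values are stored as integers in tenths of a unit, they need rescaling before they are used as model features. A minimal sketch, using a hypothetical helper `to_standard_units` and covering only the PRCP, SNOW, SNWD, TMAX, and TMIN datatypes:

```python
# Divisors that take raw GHCND integers to whole units (mm or degrees C).
SCALE = {
    'PRCP': 10.0,  # tenths of mm -> mm
    'SNOW': 1.0,   # already in mm
    'SNWD': 1.0,   # already in mm
    'TMAX': 10.0,  # tenths of degrees C -> degrees C
    'TMIN': 10.0,  # tenths of degrees C -> degrees C
}

def to_standard_units(datatype, value):
    """Convert a raw GHCND value to mm or degrees C.

    Raises KeyError for datatypes not in SCALE, rather than guessing a scale.
    """
    return value / SCALE[datatype]

tmax_c = to_standard_units('TMAX', 250)   # 25.0 degrees C
prcp_mm = to_standard_units('PRCP', 12)   # 1.2 mm
```

Raising on unknown datatypes is a deliberate choice here: silently passing a value through unscaled would be easy to miss later.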

In [119]:
data_url = 'http://www.ncdc.noaa.gov/cdo-web/api/v2/data'
data_params = {
    'datasetid': 'GHCND',
    'stationid': rdu_id,
    'limit': 1000,
    'startdate': '2014-01-01',
    'enddate': '2014-12-31'
}
r = requests.get(data_url, params=data_params, headers=headers)

In [120]:
pprint(r.json())

{'metadata': {'resultset': {'count': 4171, 'limit': 1000, 'offset': 1}},
 'results': [{'attributes': ',,W,',
              'datatype': 'AWND',
              'date': '2014-01-01T00:00:00',
              'station': 'GHCND:USW00013722',
              'value': 11},
             {'attributes': ',,W,2400',
              'datatype': 'PRCP',
              'date': '2014-01-01T00:00:00',
              'station': 'GHCND:USW00013722',
              'value': 0},
             {'attributes': ',,W,',
              'datatype': 'SNOW',
              'date': '2014-01-01T00:00:00',
              'station': 'GHCND:USW00013722',
              'value': 0},
             {'attributes': ',,W,',
              'datatype': 'SNWD',
              'date': '2014-01-01T00:00:00',
              'station': 'GHCND:USW00013722',
              'value': 0},
             {'attributes': 'H,,S,',
              'datatype': 'TAVG',
              'date': '2014-01-01T00:00:00',
              'station': 'GHCND:USW00013722',
        

The RDU Airport station has a lot of info.  We'll save it to a file so we don't have to keep hitting the API.

In [132]:
offset = 1
limit = 1000

while True:
    data_url = 'http://www.ncdc.noaa.gov/cdo-web/api/v2/data'
    data_params = {
        'datasetid': 'GHCND',
        'stationid': rdu_id,
        'limit': limit,
        'offset': offset,
        'startdate': '2014-01-01',
        'enddate': '2014-12-31'
    }
    r = requests.get(data_url, params=data_params, headers=headers)
    r_json = r.json()
    
    # Append this page to the CSV; the header row is written only for the first page.
    with open("../csv_data/rdu_weather.csv", 'a+', newline='') as f:
        writer = csv.writer(f)
        keys = ['attributes', 'datatype', 'date', 'station', 'value']
        if offset == 1:
            writer.writerow(keys)
        for row in r_json['results']:
            writer.writerow([row[key] for key in keys])
            
    offset += limit
    
    if offset > r_json['metadata']['resultset']['count']:
        break
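The paging arithmetic in the loop above can be checked in isolation. This sketch uses a hypothetical generator, `page_offsets`, that yields the 1-based `offset` for each request needed to cover `count` records; the count of 4171 comes from the `metadata` block of the earlier response.

```python
def page_offsets(count, limit=1000, start=1):
    """Yield the 1-based offset of each page needed to cover `count` records."""
    offset = start
    while offset <= count:
        yield offset
        offset += limit

offsets = list(page_offsets(4171, limit=1000))
# -> [1, 1001, 2001, 3001, 4001]
```

Five requests cover the 4171 records for 2014, which matches the loop's exit condition of stopping once `offset` exceeds the reported `count`.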