# Table of Contents
- 1 Aim of notebook
- 2 Load the main airport traffic data
- 3 Load lookup table provided by BTS
- 4 Create *enhanced* lookup table
    - 4.1 Remove codes that are not present in our dataset
    - 4.2 Parse state, city, and airport name from 'Description' field
    - 4.3 Add state 'region' information
    - 4.4 Add airport latitude and longitude information using Google geocoder

In [18]:
%matplotlib inline
import pandas as pd
import time

from pprint import pprint




# Aim of notebook

- The [airport traffic dataset](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) identifies each airport by a unique ID number.

- In this notebook, we'll create an *enhanced lookup table* by taking the lookup table provided by the BTS ([download link](http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_AIRPORT_ID)) and adding further information about each airport (such as its state, region, and latitude/longitude).

# Load the main airport traffic data

- Load one year's worth of air-traffic data provided by BTS ([link](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time))
- (from November 2015 to October 2016, matching the twelve monthly files listed in the load log below)

In [2]:
from util import load_airport_data
df_data = load_airport_data()

 ... load dataframe from 2015-11.zip 
 ... load dataframe from 2015-12.zip 
 ... load dataframe from 2016-01.zip 
 ... load dataframe from 2016-02.zip 
 ... load dataframe from 2016-03.zip 
 ... load dataframe from 2016-04.zip 
 ... load dataframe from 2016-05.zip 
 ... load dataframe from 2016-06.zip 
 ... load dataframe from 2016-07.zip 
 ... load dataframe from 2016-08.zip 
 ... load dataframe from 2016-09.zip 
 ... load dataframe from 2016-10.zip 


In [3]:
print df_data.shape
df_data.head(n=5)

(5652973, 7)


Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN_AIRPORT_ID,DEST_AIRPORT_ID
0,2015,4,11,4,3,14570,13930
1,2015,4,11,5,4,13930,14057
2,2015,4,11,6,5,13930,14057
3,2015,4,11,7,6,13930,14057
4,2015,4,11,8,7,13930,14057


# Load lookup table provided by BTS

In [4]:
df_lookup = pd.read_csv('../data/L_AIRPORT_ID.csv')
print df_lookup.shape

(6409, 2)


# Create *enhanced* lookup table

In [5]:
df_lookup.head(n=10)

Unnamed: 0,Code,Description
0,10001,"Afognak Lake, AK: Afognak Lake Airport"
1,10003,"Granite Mountain, AK: Bear Creek Mining Strip"
2,10004,"Lik, AK: Lik Mining Camp"
3,10005,"Little Squaw, AK: Little Squaw Airport"
4,10006,"Kizhuyak, AK: Kizhuyak Bay"
5,10007,"Klawock, AK: Klawock Seaplane Base"
6,10008,"Elizabeth Island, AK: Elizabeth Island Airport"
7,10009,"Homer, AK: Augustin Island"
8,10010,"Hudson, NY: Columbia County"
9,10011,"Peach Springs, AZ: Grand Canyon West"


## Remove codes that are not present in our dataset

In [6]:
# unique ID's in the dataset
uniq_orig = df_data['ORIGIN_AIRPORT_ID'].unique().tolist() 
uniq_dest = df_data['DEST_AIRPORT_ID'].unique().tolist()

# apply ``set`` function to get unique items in concatenated list
uniq_id = list(set(uniq_orig + uniq_dest))

print "There are {} Airport-Codes in the lookup table".format(df_lookup.shape[0])
print "There are {} unique airport-codes in our dataset".format(len(uniq_id))

There are 6409 Airport-Codes in the lookup table
There are 319 unique airport-codes in our dataset


Let's filter/drop the rows/records that we do not need in our analysis

In [7]:
# only keep the items in the main dataframe
_mask = df_lookup['Code'].isin( uniq_id )
df_lookup = df_lookup[ _mask ].reset_index(drop=True)

print df_lookup.shape
df_lookup.head(10)

(319, 2)


Unnamed: 0,Code,Description
0,10135,"Allentown/Bethlehem/Easton, PA: Lehigh Valley ..."
1,10136,"Abilene, TX: Abilene Regional"
2,10140,"Albuquerque, NM: Albuquerque International Sun..."
3,10141,"Aberdeen, SD: Aberdeen Regional"
4,10146,"Albany, GA: Southwest Georgia Regional"
5,10154,"Nantucket, MA: Nantucket Memorial"
6,10155,"Waco, TX: Waco Regional"
7,10157,"Arcata/Eureka, CA: Arcata"
8,10158,"Atlantic City, NJ: Atlantic City International"
9,10165,"Adak Island, AK: Adak"
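The filtered table has 319 rows, the same as the number of unique codes in the data, so every code found a match. A check like the one below makes that explicit. This is only a sketch: `split_codes` is a hypothetical helper, and the toy lists stand in for `uniq_id` and `df_lookup['Code']`.

```python
def split_codes(data_codes, lookup_codes):
    """Return (codes found in the lookup table, codes missing from it)."""
    data_codes, lookup_codes = set(data_codes), set(lookup_codes)
    return data_codes & lookup_codes, data_codes - lookup_codes

# toy codes; in the notebook these would be uniq_id and df_lookup['Code']
matched, missing = split_codes([10135, 10136, 99999], [10001, 10135, 10136])
print(sorted(matched))
print(sorted(missing))
```

An empty `missing` set means the lookup table covers every airport that appears in the traffic data.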


## Parse state, city, and airport name from 'Description' field

- The table above shows that the ``Description`` field contains the *city*, *state*, and *name* of the airport.

- Let's create an individual field for each piece of information.

- Fortunately, the ``Description`` column uses a comma (``,``) and a colon (``:``) to delimit the city, state, and airport name, so splitting them is straightforward.
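As an alternative to splitting on the delimiters, a single description can be parsed with one regular expression that names each field. This is a minimal sketch using only the standard library. `DESCRIPTION_RE` and `parse_description` are hypothetical names, and the two-letter state pattern is an assumption based on the rows shown above.

```python
import re

# Pattern for "City, ST: Airport name". The first group is greedy, so a city
# that itself contains ", " is still split at the last ", XX: " boundary.
DESCRIPTION_RE = re.compile(r'^(?P<City>.+), (?P<State>[A-Z]{2}): (?P<Airport>.+)$')

def parse_description(description):
    """Return a dict with City, State, and Airport keys, or None if there is no match."""
    match = DESCRIPTION_RE.match(description)
    return match.groupdict() if match else None

parsed = parse_description("Allentown/Bethlehem/Easton, PA: Lehigh Valley International")
print(parsed)
```

Returning `None` for a non-matching description makes malformed rows easy to spot, whereas indexing into a split result would raise an `IndexError`.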



In [8]:
# split on ", " and ": "; the capture group keeps the delimiters in the result,
# so the city, state, and airport name land at indices 0, 2, and 4
df_parse = [{'City':splits[0],'State':splits[2],'Airport':splits[4]}
            for splits in df_lookup['Description'].str.split(r'(,\s|:\s)')]

pprint(df_parse[:5])

# convert dict to dataframe
df_parse = pd.DataFrame(df_parse)
df_parse.head(5)

[{'Airport': 'Lehigh Valley International',
  'City': 'Allentown/Bethlehem/Easton',
  'State': 'PA'},
 {'Airport': 'Abilene Regional', 'City': 'Abilene', 'State': 'TX'},
 {'Airport': 'Albuquerque International Sunport',
  'City': 'Albuquerque',
  'State': 'NM'},
 {'Airport': 'Aberdeen Regional', 'City': 'Aberdeen', 'State': 'SD'},
 {'Airport': 'Southwest Georgia Regional', 'City': 'Albany', 'State': 'GA'}]


Unnamed: 0,Airport,City,State
0,Lehigh Valley International,Allentown/Bethlehem/Easton,PA
1,Abilene Regional,Abilene,TX
2,Albuquerque International Sunport,Albuquerque,NM
3,Aberdeen Regional,Aberdeen,SD
4,Southwest Georgia Regional,Albany,GA


In [9]:
# now we can readily add this information to our lookup table
df_lookup = df_lookup.join(df_parse)

print df_lookup.shape
df_lookup.head()

(319, 5)


Unnamed: 0,Code,Description,Airport,City,State
0,10135,"Allentown/Bethlehem/Easton, PA: Lehigh Valley ...",Lehigh Valley International,Allentown/Bethlehem/Easton,PA
1,10136,"Abilene, TX: Abilene Regional",Abilene Regional,Abilene,TX
2,10140,"Albuquerque, NM: Albuquerque International Sun...",Albuquerque International Sunport,Albuquerque,NM
3,10141,"Aberdeen, SD: Aberdeen Regional",Aberdeen Regional,Aberdeen,SD
4,10146,"Albany, GA: Southwest Georgia Regional",Southwest Georgia Regional,Albany,GA


## Add state 'region' information

I would also like to study patterns among the four regions of the United States:

(1) Northeast
(2) South
(3) West
(4) Midwest

I saved a JSON lookup file for this purpose.

In [10]:
%%bash
cat ../data/us_states_regions.json

{
"Northeast" : ["Connecticut","Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont","New Jersey", "New York", "Pennsylvania"],
"Midwest"   : ["Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin", "Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota"],
"South"     : [ "Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia", "District of Columbia", "West Virginia",             "Alabama", "Kentucky", "Mississippi", "Tennessee","Arkansas", "Louisiana", "Oklahoma", "Texas"],
"West"      : ["Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah",  "Wyoming", "Alaska", "California", "Hawaii", "Oregon", "Washington"]
}

In [11]:
import json
with open('../data/us_states_regions.json','r') as f:
    regions = json.load(f)

print regions.keys()
print regions.values()

[u'West', u'Northeast', u'Midwest', u'South']
[[u'Arizona', u'Colorado', u'Idaho', u'Montana', u'Nevada', u'New Mexico', u'Utah', u'Wyoming', u'Alaska', u'California', u'Hawaii', u'Oregon', u'Washington'], [u'Connecticut', u'Maine', u'Massachusetts', u'New Hampshire', u'Rhode Island', u'Vermont', u'New Jersey', u'New York', u'Pennsylvania'], [u'Illinois', u'Indiana', u'Michigan', u'Ohio', u'Wisconsin', u'Iowa', u'Kansas', u'Minnesota', u'Missouri', u'Nebraska', u'North Dakota', u'South Dakota'], [u'Delaware', u'Florida', u'Georgia', u'Maryland', u'North Carolina', u'South Carolina', u'Virginia', u'District of Columbia', u'West Virginia', u'Alabama', u'Kentucky', u'Mississippi', u'Tennessee', u'Arkansas', u'Louisiana', u'Oklahoma', u'Texas']]


In [12]:
df_region = []
for key in regions:
    _dftmp = pd.DataFrame( regions[key], columns=['State']  )
    _dftmp['Region'] = key
    df_region.append(_dftmp)
    
df_region = pd.concat(df_region,ignore_index=True)

df_region.head()

Unnamed: 0,State,Region
0,Arizona,West
1,Colorado,West
2,Idaho,West
3,Montana,West
4,Nevada,West
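The same State/Region pairs can also be held in a flat dict keyed by state, which is handy for quick lookups. The sketch below uses a shortened toy version of the JSON content rather than the real file, and `state_to_region` is a hypothetical name.

```python
# shortened toy version of us_states_regions.json (region -> list of states)
regions = {
    "Northeast": ["New York", "Pennsylvania"],
    "West": ["Arizona", "California"],
}

# invert the mapping: one entry per state, pointing at its region
state_to_region = {state: region
                   for region, states in regions.items()
                   for state in states}

print(state_to_region["Arizona"])
```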


Let's use a hash table (source) to map each state name to its abbreviation, so that the region table shares the two-letter ``State`` key used in the lookup table.

In [13]:
from util import hash_state_to_abbrev
hash_state = hash_state_to_abbrev()

df_region['State'] = df_region['State'].map(lambda key: hash_state[key])
df_region.head()

Unnamed: 0,State,Region
0,AZ,West
1,CO,West
2,ID,West
3,MT,West
4,NV,West


In [14]:
# good, we're now ready to join this "Region" information to our lookup table
df_lookup = df_lookup.merge(df_region,on='State',how='left')

df_lookup.head(10)

Unnamed: 0,Code,Description,Airport,City,State,Region
0,10135,"Allentown/Bethlehem/Easton, PA: Lehigh Valley ...",Lehigh Valley International,Allentown/Bethlehem/Easton,PA,Northeast
1,10136,"Abilene, TX: Abilene Regional",Abilene Regional,Abilene,TX,South
2,10140,"Albuquerque, NM: Albuquerque International Sun...",Albuquerque International Sunport,Albuquerque,NM,West
3,10141,"Aberdeen, SD: Aberdeen Regional",Aberdeen Regional,Aberdeen,SD,Midwest
4,10146,"Albany, GA: Southwest Georgia Regional",Southwest Georgia Regional,Albany,GA,South
5,10154,"Nantucket, MA: Nantucket Memorial",Nantucket Memorial,Nantucket,MA,Northeast
6,10155,"Waco, TX: Waco Regional",Waco Regional,Waco,TX,South
7,10157,"Arcata/Eureka, CA: Arcata",Arcata,Arcata/Eureka,CA,West
8,10158,"Atlantic City, NJ: Atlantic City International",Atlantic City International,Atlantic City,NJ,Northeast
9,10165,"Adak Island, AK: Adak",Adak,Adak Island,AK,West
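Because this is a left join, any state code that is absent from the region table keeps its row but gets a missing ``Region``. A small check like the following lists such codes. It is a sketch on toy data: `states_without_region` is a hypothetical helper, and the `'PR'` entry is only an illustration, since the output above does not show whether such codes occur in this dataset.

```python
def states_without_region(states, state_to_region):
    """Return the sorted state codes that have no region assigned."""
    return sorted({s for s in states if s not in state_to_region})

# toy inputs; in the notebook these would come from df_lookup['State'] and df_region
toy_states = ['PA', 'TX', 'PR', 'PA']
toy_mapping = {'PA': 'Northeast', 'TX': 'South'}

unmatched = states_without_region(toy_states, toy_mapping)
print(unmatched)
```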


## Add airport latitude and longitude information using Google geocoder

This information will be especially useful when plotting the airports on a map.

In [52]:
import geocoder
from util import print_time

t = time.time()
lat,lon = [],[]

n_items = df_lookup.shape[0]
for i,airport in enumerate(df_lookup['Airport']):
    if i%20==0:
        print '({:3} out of {})'.format(i,n_items),print_time(t)
    loc = geocoder.google(airport)

    # geocoder returns a result object even on failure, so test ``loc.ok``
    if loc.ok:
        lon.append(loc.lng)
        lat.append(loc.lat)
    else:
        # lookup failed
        lon.append(None)
        lat.append(None)
        
# add as new columns
df_lookup['lat'] = lat
df_lookup['lon'] = lon

n_nans = df_lookup['lat'].isnull().sum(axis=0)
print "-- {} NANs out of {} ({:.2f}%) --".format(n_nans,n_items,n_nans/float(n_items)*100)

(  0 out of 319) Elapsed time:  0.00 seconds
( 20 out of 319) Elapsed time:  1.52 seconds


KeyboardInterrupt: 

Some lookups failed, but most succeeded.

In [None]:
# redo the lookup, falling back to city and state when the airport name fails
lat,lon = [],[]  # start from empty lists so the results line up with the rows
for airport,city,state in zip(df_lookup['Airport'],df_lookup['City'],df_lookup['State']):
    loc = geocoder.google(airport)
    if not loc.ok:
        # if airport lookup failed, lookup based on city and state info
        loc = geocoder.google(city+' '+state)

    if not loc.ok:
        # even if that fails, append None for now
        lon.append(None)
        lat.append(None)
    else:
        # else, append the identified lat/lon information
        lon.append(loc.lng)
        lat.append(loc.lat)

# add as new columns
df_lookup['lat'] = lat
df_lookup['lon'] = lon
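The fallback idea in the cell above (try the airport name first, then city and state) can be written as a small reusable helper. This is a sketch under assumptions, not the geocoder API: `geocode_with_fallback` is a hypothetical function, `lookup` is any callable that returns a `(lat, lon)` tuple or `None`, and `fake_lookup` with its made-up coordinates stands in for a real geocoding service.

```python
import time

def geocode_with_fallback(queries, lookup, pause=0.0):
    """Try each query in order; return the first (lat, lon) found, else (None, None)."""
    for query in queries:
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(pause)  # brief pause before trying the next query
    return (None, None)

# hypothetical stand-in for a geocoding service, with made-up coordinates
FAKE_INDEX = {"Exampleville XX": (10.0, 20.0)}

def fake_lookup(query):
    return FAKE_INDEX.get(query)

coords = geocode_with_fallback(["Example Regional", "Exampleville XX"], fake_lookup)
print(coords)
```

Passing the lookup function as an argument keeps the retry logic testable without network access. A non-zero `pause` may help if the service limits the request rate.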