# air_passengers RAMP kit: feature extractor construction
# _A tale of two cities_
<i>Sylvain Tostain, 2017</i>

# Introduction
The aim of this notebook is to provide and document a relevant feature extractor for the air_passengers RAMP kit.

As introduced in the starting kit notebook, a good feature extractor is particularly important in this RAMP kit: the data provided is rather thin, and exploratory visualisations suggest seasonality and possible special causes that ought to be understood and captured before training a model.

First, let's import the required libraries and have a look at the dataset provided.

In [44]:
%matplotlib inline
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
from IPython.display import display

## Loading the RAMP dataset into a pandas DataFrame

In [45]:
data = pd.read_csv("../data/train.csv.bz2")

The data made available are as follows, `log_PAX` being our labels.

In [46]:
data.dtypes

DateOfDeparture      object
Departure            object
Arrival              object
WeeksToDeparture    float64
log_PAX             float64
std_wtd             float64
dtype: object

And an overview of the contents.

In [47]:
data.head(5)

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,log_PAX,std_wtd
0,2012-06-19,ORD,DFW,12.875,12.331296,9.812647
1,2012-09-10,LAS,DEN,14.285714,10.775182,9.466734
2,2012-10-05,DEN,LAX,10.863636,11.083177,9.035883
3,2011-10-09,ATL,ORD,11.48,11.169268,7.990202
4,2012-02-21,DEN,SFO,11.45,11.269364,9.517159


Let's have a look at the timeframe we are addressing. This is especially useful when enriching the data from other sources.

In [48]:
print(min(data['DateOfDeparture']))
print(max(data['DateOfDeparture']))

2011-09-01
2013-03-05
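
Taking `min` and `max` of the raw strings works here only because the dates are in ISO `YYYY-MM-DD` format, which sorts lexicographically in chronological order. A minimal sketch of a more explicit variant, assuming the same format and using a few sample dates copied from the overview above (the variable names are illustrative):

```python
import pandas as pd

# Sample dates copied from the overview above; the real column is data['DateOfDeparture'].
sample_dates = pd.Series(['2012-06-19', '2012-09-10', '2011-10-09'])

# Parse explicitly so that min/max compare dates rather than strings.
parsed = pd.to_datetime(sample_dates, format='%Y-%m-%d')
start, end = parsed.min(), parsed.max()
print(start.date(), end.date())
```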


## Fetching the external dataset built at the previous step
This RAMP kit ships with a sample external dataset, `external_data.csv`, which contains weather data and is used in the kit demonstration. We will not use this sample; instead, we import the external data we prepared.

In [49]:
ext_data = pd.read_csv("../submissions/iteration_3/external_data.csv")
ext_data.dtypes

Airport        object
Latitude      float64
Longitude     float64
Pop_2010        int64
Age_median    float64
Companies       int64
Graduates     float64
Housings        int64
Income          int64
Foreigners      int64
Poverty       float64
dtype: object

In [50]:
ext_data

Unnamed: 0,Airport,Latitude,Longitude,Pop_2010,Age_median,Companies,Graduates,Housings,Income,Foreigners,Poverty
0,ORD,41.9796,-87.9045,2695598,33.7,291007,82.3,1192544,48522,572066,22.3
1,LAS,36.0852,-115.1507,583756,36.9,55856,83.3,250279,50202,127458,17.5
2,DEN,39.8589,-104.6733,600158,34.1,79097,86.1,294191,53637,10437,17.3
3,ATL,33.641,-84.4226,420003,33.4,64593,89.0,228579,47527,32701,24.6
4,SFO,37.6218,-122.379,805235,38.5,116803,87.0,383676,81294,295417,13.2
5,EWR,40.6971,-74.1756,8175133,35.8,1050911,80.3,3422225,53373,3138169,20.6
6,IAH,29.9869,-95.3421,2099451,32.6,260347,76.7,927107,46187,632743,22.5
7,LAX,33.9425,-118.409,3792621,34.9,497999,75.5,1436543,50205,1489926,22.1
8,DFW,32.8959,-97.0372,1197816,32.4,142658,74.5,533556,43781,305921,24.0
9,SEA,47.449,-122.3093,60866,35.8,83323,93.4,31595,70594,118225,13.5


These are no longer weather data but demographic and economic data about the cities served by the airports, together with the GPS coordinates of the airports. All of these variables are numeric.

## Structuring the base dataset
As seen in the visualisations and in the starting kit, it is relevant to provide some structure to the base dataset, more specifically:
* Regarding dates, in order to allow our models to capture seasonality. Months, weeks and weekdays are probably the most relevant. We can forget about days of the month that are probably less relevant.
* Regarding factors, we'll apply one hot encoding on departure and arrival airports.

Nevertheless, we have to keep in mind that this comes at a significant cost in terms of dimensions...

In [51]:
print("Factors in Departure:")
print(data['Departure'].unique())
print("Total number of factors in Departure: {}".format(len(data['Departure'].unique())))

print("\nFactors in Arrival:")
print(data['Arrival'].unique())
print("Total number of factors in Arrival: {}".format(len(data['Arrival'].unique())))

Factors in Departure:
['ORD' 'LAS' 'DEN' 'ATL' 'SFO' 'EWR' 'IAH' 'LAX' 'DFW' 'SEA' 'JFK' 'PHL'
 'MIA' 'DTW' 'BOS' 'MSP' 'CLT' 'MCO' 'PHX' 'LGA']
Total number of factors in Departure: 20

Factors in Arrival:
['DFW' 'DEN' 'LAX' 'ORD' 'SFO' 'MCO' 'LAS' 'CLT' 'MSP' 'EWR' 'PHX' 'DTW'
 'MIA' 'BOS' 'PHL' 'JFK' 'ATL' 'LGA' 'SEA' 'IAH']
Total number of factors in Arrival: 20


Let's apply these transformations.

We'll keep weekdays and weeks, and leave out the other time factors for now, as our visualisations suggested they were of lesser interest.

In [52]:
# Directly inspired from the starting kit notebook.
data_enc = data

# One-hot encoding of departure points (the original column is kept for the merges below)
data_enc = data_enc.join(pd.get_dummies(data_enc['Departure'], prefix='d'))

# One-hot encoding of arrival points (the original column is kept as well)
data_enc = data_enc.join(pd.get_dummies(data_enc['Arrival'], prefix='a'))

# One-hot encoding of temporal variables that might catch seasonalities and/or special causes
# following http://stackoverflow.com/questions/16453644/regression-with-date-variable-using-scikit-learn
data_enc['DateOfDeparture'] = pd.to_datetime(data_enc['DateOfDeparture'])

data_enc['weekday'] = data_enc['DateOfDeparture'].dt.weekday
data_enc = data_enc.join(pd.get_dummies(data_enc['weekday'], prefix='wd'))
data_enc = data_enc.drop('weekday', axis=1)

data_enc['week'] = data_enc['DateOfDeparture'].dt.week
data_enc = data_enc.join(pd.get_dummies(data_enc['week'], prefix='w'))
data_enc = data_enc.drop('week', axis=1)

# Commented out : probably useless and most certainly costly in terms of dimensions...
#
# data_enc['year'] = data_enc['DateOfDeparture'].dt.year
# data_enc = data_enc.join(pd.get_dummies(data_enc['year'], prefix='y'))
#
# data_enc['month'] = data_enc['DateOfDeparture'].dt.month
# data_enc = data_enc.join(pd.get_dummies(data_enc['month'], prefix='m'))
#
# data_enc['day'] = data_enc['DateOfDeparture'].dt.day
# data_enc = data_enc.join(pd.get_dummies(data_enc['day'], prefix='d'))
#
# data_enc['n_days'] = data_enc['DateOfDeparture'].apply(lambda date: (date - pd.to_datetime("1970-01-01")).days)

As a result...

In [53]:
print(list(data_enc.columns))
print("Total number of columns: {}".format(len(list(data_enc.columns))))

['DateOfDeparture', 'Departure', 'Arrival', 'WeeksToDeparture', 'log_PAX', 'std_wtd', 'd_ATL', 'd_BOS', 'd_CLT', 'd_DEN', 'd_DFW', 'd_DTW', 'd_EWR', 'd_IAH', 'd_JFK', 'd_LAS', 'd_LAX', 'd_LGA', 'd_MCO', 'd_MIA', 'd_MSP', 'd_ORD', 'd_PHL', 'd_PHX', 'd_SEA', 'd_SFO', 'a_ATL', 'a_BOS', 'a_CLT', 'a_DEN', 'a_DFW', 'a_DTW', 'a_EWR', 'a_IAH', 'a_JFK', 'a_LAS', 'a_LAX', 'a_LGA', 'a_MCO', 'a_MIA', 'a_MSP', 'a_ORD', 'a_PHL', 'a_PHX', 'a_SEA', 'a_SFO', 'wd_0', 'wd_1', 'wd_2', 'wd_3', 'wd_4', 'wd_5', 'wd_6', 'w_1', 'w_2', 'w_3', 'w_4', 'w_5', 'w_6', 'w_7', 'w_8', 'w_9', 'w_10', 'w_11', 'w_12', 'w_13', 'w_14', 'w_15', 'w_16', 'w_17', 'w_18', 'w_19', 'w_20', 'w_21', 'w_22', 'w_23', 'w_24', 'w_25', 'w_26', 'w_27', 'w_28', 'w_29', 'w_30', 'w_31', 'w_32', 'w_33', 'w_34', 'w_35', 'w_36', 'w_37', 'w_38', 'w_39', 'w_40', 'w_41', 'w_42', 'w_43', 'w_44', 'w_45', 'w_46', 'w_47', 'w_48', 'w_49', 'w_50', 'w_51', 'w_52']
Total number of columns: 105
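
The count can be checked by hand: 6 original columns + 20 departure + 20 arrival + 7 weekday + 52 week indicators = 105. Note that `pd.get_dummies` only creates columns for the values present in the frame it receives, so a subset of the data covering fewer weeks would yield fewer columns. One possible way to keep the layout fixed is sketched below, under the assumption that week numbers run from 1 to 53 (the helper name `one_hot_fixed` is hypothetical and not part of the kit):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

def one_hot_fixed(series, categories, prefix):
    """One-hot encode `series` with one column per entry of `categories`,
    whether or not that value occurs in the data."""
    typed = series.astype(CategoricalDtype(categories=list(categories)))
    return pd.get_dummies(typed, prefix=prefix)

# Toy example: only weeks 25 and 37 occur, yet all 53 columns are created.
weeks = pd.Series([25, 37, 25])
encoded = one_hot_fixed(weeks, range(1, 54), prefix='w')
print(encoded.shape)
```

With a fixed category list, the encoded frame has the same columns whichever rows are passed in, which matters if the encoding is applied separately to training and test data.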


At this stage, our base dataset is prepared and ready for enrichment...

## Creating a richer external dataset
### Adding information on holidays
Given the observations made through visualisations, we have collected additional data.

Regarding holidays, the situation is rather complex because no paid days off are mandated for every employer in the US. As a simple approach, we'll make use of the federal holiday calendar available in `pandas.tseries.holiday`.

In [54]:
from pandas.tseries.holiday import Holiday, USMemorialDay, AbstractHolidayCalendar, nearest_workday, MO
from datetime import datetime

cal = pd.tseries.holiday.USFederalHolidayCalendar()
cal.rules

[Holiday: New Years Day (month=1, day=1, observance=<function nearest_workday at 0x000000A3CCC3F158>),
 Holiday: Dr. Martin Luther King Jr. (month=1, day=1, offset=<DateOffset: kwds={'weekday': MO(+3)}>),
 Holiday: Presidents Day (month=2, day=1, offset=<DateOffset: kwds={'weekday': MO(+3)}>),
 Holiday: MemorialDay (month=5, day=31, offset=<DateOffset: kwds={'weekday': MO(-1)}>),
 Holiday: July 4th (month=7, day=4, observance=<function nearest_workday at 0x000000A3CCC3F158>),
 Holiday: Labor Day (month=9, day=1, offset=<DateOffset: kwds={'weekday': MO(+1)}>),
 Holiday: Columbus Day (month=10, day=1, offset=<DateOffset: kwds={'weekday': MO(+2)}>),
 Holiday: Veterans Day (month=11, day=11, observance=<function nearest_workday at 0x000000A3CCC3F158>),
 Holiday: Thanksgiving (month=11, day=1, offset=<DateOffset: kwds={'weekday': TH(+4)}>),
 Holiday: Christmas (month=12, day=25, observance=<function nearest_workday at 0x000000A3CCC3F158>)]

In [55]:
holiday_dates = cal.holidays(min(data['DateOfDeparture']), max(data['DateOfDeparture']))
holiday_dates

DatetimeIndex(['2011-09-05', '2011-10-10', '2011-11-11', '2011-11-24',
               '2011-12-26', '2012-01-02', '2012-01-16', '2012-02-20',
               '2012-05-28', '2012-07-04', '2012-09-03', '2012-10-08',
               '2012-11-12', '2012-11-22', '2012-12-25', '2013-01-01',
               '2013-01-21', '2013-02-18'],
              dtype='datetime64[ns]', freq=None)

We now need to add this information to the base dataset:

In [56]:
data_enc['Holiday'] = data_enc['DateOfDeparture'].isin(holiday_dates)
data_enc = data_enc.join(pd.get_dummies(data_enc['Holiday'], prefix='hd'))
data_enc = data_enc.drop(['Holiday'], axis=1)
data_enc.head(10)

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,log_PAX,std_wtd,d_ATL,d_BOS,d_CLT,d_DEN,d_DFW,d_DTW,d_EWR,d_IAH,d_JFK,d_LAS,d_LAX,d_LGA,d_MCO,d_MIA,d_MSP,d_ORD,d_PHL,d_PHX,d_SEA,d_SFO,a_ATL,a_BOS,a_CLT,a_DEN,a_DFW,a_DTW,a_EWR,a_IAH,a_JFK,a_LAS,a_LAX,a_LGA,a_MCO,a_MIA,a_MSP,a_ORD,a_PHL,a_PHX,a_SEA,a_SFO,wd_0,wd_1,wd_2,wd_3,wd_4,wd_5,wd_6,w_1,w_2,w_3,w_4,w_5,w_6,w_7,w_8,w_9,w_10,w_11,w_12,w_13,w_14,w_15,w_16,w_17,w_18,w_19,w_20,w_21,w_22,w_23,w_24,w_25,w_26,w_27,w_28,w_29,w_30,w_31,w_32,w_33,w_34,w_35,w_36,w_37,w_38,w_39,w_40,w_41,w_42,w_43,w_44,w_45,w_46,w_47,w_48,w_49,w_50,w_51,w_52,hd_False,hd_True
0,2012-06-19,ORD,DFW,12.875,12.331296,9.812647,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,2012-09-10,LAS,DEN,14.285714,10.775182,9.466734,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,2012-10-05,DEN,LAX,10.863636,11.083177,9.035883,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
3,2011-10-09,ATL,ORD,11.48,11.169268,7.990202,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,2012-02-21,DEN,SFO,11.45,11.269364,9.517159,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
5,2013-01-22,ATL,MCO,10.363636,12.073649,8.232025,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
6,2011-10-20,SFO,LAS,15.266667,11.173936,9.808277,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
7,2012-01-28,EWR,ORD,8.588235,9.599952,6.16501,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
8,2012-05-27,ATL,CLT,10.238095,9.175645,6.609877,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
9,2013-02-22,ATL,DEN,8.294118,10.73432,5.542616,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
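
The flag above is true only on the holiday itself. As a possible refinement, one could also flag the days around each holiday. The sketch below is an assumption about how such a window might be built, not the method used in this notebook; the helper name `near_holiday` and its `window_days` parameter are hypothetical:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def near_holiday(dates, window_days=1):
    """Flag dates that fall within `window_days` of a US federal holiday."""
    dates = pd.to_datetime(pd.Series(dates)).dt.normalize()
    pad = pd.Timedelta(days=window_days)
    # Query a slightly wider range so holidays just outside the data still count.
    holidays = USFederalHolidayCalendar().holidays(dates.min() - pad, dates.max() + pad)
    flagged = holidays
    for shift in range(1, window_days + 1):
        delta = pd.Timedelta(days=shift)
        flagged = flagged.union(holidays - delta).union(holidays + delta)
    return dates.isin(flagged)

# 2012-07-04 appears in the holiday list above; its neighbours are flagged too.
sample = ['2012-07-03', '2012-07-04', '2012-07-05', '2012-07-10']
flags = near_holiday(sample, window_days=1)
print(flags.tolist())
```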


### Merging with airport and city data

We'll split the external dataset first.

In [14]:
ext_coords = ext_data.get(['Airport', 'Latitude', 'Longitude'])
ext_census = ext_data.drop(['Latitude', 'Longitude'], axis=1)

To begin, we merge the census data in bulk, joining on the arrival airport.

In [15]:
data_enc = pd.merge(data_enc, ext_census, how='left', left_on=['Arrival'], right_on=['Airport'], sort=False)
data_enc = data_enc.drop(['Airport'], axis=1)
data_enc.head(5)

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,log_PAX,std_wtd,d_ATL,d_BOS,d_CLT,d_DEN,d_DFW,d_DTW,d_EWR,d_IAH,d_JFK,d_LAS,d_LAX,d_LGA,d_MCO,d_MIA,d_MSP,d_ORD,d_PHL,d_PHX,d_SEA,d_SFO,a_ATL,a_BOS,a_CLT,a_DEN,a_DFW,a_DTW,a_EWR,a_IAH,a_JFK,a_LAS,a_LAX,a_LGA,a_MCO,a_MIA,a_MSP,a_ORD,a_PHL,a_PHX,a_SEA,a_SFO,wd_0,wd_1,wd_2,wd_3,wd_4,wd_5,wd_6,w_1,w_2,w_3,w_4,w_5,w_6,w_7,w_8,w_9,w_10,w_11,w_12,w_13,w_14,w_15,w_16,w_17,w_18,w_19,w_20,w_21,w_22,w_23,w_24,w_25,w_26,w_27,w_28,w_29,w_30,w_31,w_32,w_33,w_34,w_35,w_36,w_37,w_38,w_39,w_40,w_41,w_42,w_43,w_44,w_45,w_46,w_47,w_48,w_49,w_50,w_51,w_52,hd_False,hd_True,Pop_2010,Age_median,Companies,Graduates,Housings,Income,Foreigners,Poverty
0,2012-06-19,ORD,DFW,12.875,12.331296,9.812647,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1197816,32.4,142658,74.5,533556,43781,305921,24.0
1,2012-09-10,LAS,DEN,14.285714,10.775182,9.466734,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,600158,34.1,79097,86.1,294191,53637,10437,17.3
2,2012-10-05,DEN,LAX,10.863636,11.083177,9.035883,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,3792621,34.9,497999,75.5,1436543,50205,1489926,22.1
3,2011-10-09,ATL,ORD,11.48,11.169268,7.990202,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2695598,33.7,291007,82.3,1192544,48522,572066,22.3
4,2012-02-21,DEN,SFO,11.45,11.269364,9.517159,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,805235,38.5,116803,87.0,383676,81294,295417,13.2
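
A left join silently leaves `NaN` in the census columns for any airport that has no row in the external dataset. A minimal sketch of how one might check for this, using the `indicator=True` option of `pd.merge` on toy frames (the frame names and the three-flight sample are illustrative; the population figures are copied from the table above):

```python
import pandas as pd

# Toy stand-ins for the base data and the census extract.
flights = pd.DataFrame({'Arrival': ['DFW', 'JFK', 'DEN']})
census = pd.DataFrame({'Airport': ['DFW', 'DEN'],
                       'Pop_2010': [1197816, 600158]})

# indicator=True adds a '_merge' column telling where each row found a match.
merged = pd.merge(flights, census, how='left',
                  left_on='Arrival', right_on='Airport', indicator=True)

# Airports present in the flights but absent from the external data.
missing = sorted(merged.loc[merged['_merge'] == 'left_only', 'Arrival'].unique())
print(missing)
```

Running the same check on the real frames would show whether every airport listed in `Departure` and `Arrival` is covered before the features are passed to a model.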


### Enriching with the distances of the trips
Now that we have the coordinates of the airports, we'll try to compute the distances of the trips.

We'll make use of a mapped version of the <a href=https://en.wikipedia.org/wiki/Haversine_formula>Haversine Formula</a>, implemented in Python as follows:

In [16]:
# Inspired from https://stackoverflow.com
#    /questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points

from math import radians, cos, sin, asin, sqrt

def haversine(row):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # we map the rows
    lon1 = row['D_lon']
    lat1 = row['D_lat']
    lon2 = row['A_lon']
    lat2 = row['A_lat']
    
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

In [17]:
data_dist = data

# A quick view at the data...
display(ext_coords.head(5))
display(data_dist.head(5))

# We perform a first merge to import departure coordinates.
data_dist = pd.merge(
    data_dist, ext_coords,
    how='left',
    left_on=['Departure'],
    right_on=['Airport'],
    sort=False)
data_dist = data_dist.drop(['Airport'], axis=1)

data_dist = data_dist.rename(
    columns={'Latitude': 'D_lat', 'Longitude': 'D_lon'})

# We perform a second merge to import arrival coordinates.
data_dist = pd.merge(
    data_dist, ext_coords,
    how='left',
    left_on=['Arrival'],
    right_on=['Airport'],
    sort=False)
data_dist = data_dist.drop(['Airport'], axis=1)

data_dist = data_dist.rename(
    columns={'Latitude': 'A_lat', 'Longitude': 'A_lon'})

# Another view.
display(data_dist.head(5))
display(data_dist.dtypes)

# And we apply the Haversine formula.
data_dist['Distance'] = data_dist.apply(lambda row: haversine(row), axis=1)

display(data_dist.head(5))

Unnamed: 0,Airport,Latitude,Longitude
0,ORD,41.9796,-87.9045
1,LAS,36.0852,-115.1507
2,DEN,39.8589,-104.6733
3,ATL,33.641,-84.4226
4,SFO,37.6218,-122.379


Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,log_PAX,std_wtd
0,2012-06-19,ORD,DFW,12.875,12.331296,9.812647
1,2012-09-10,LAS,DEN,14.285714,10.775182,9.466734
2,2012-10-05,DEN,LAX,10.863636,11.083177,9.035883
3,2011-10-09,ATL,ORD,11.48,11.169268,7.990202
4,2012-02-21,DEN,SFO,11.45,11.269364,9.517159


Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,log_PAX,std_wtd,D_lat,D_lon,A_lat,A_lon
0,2012-06-19,ORD,DFW,12.875,12.331296,9.812647,41.9796,-87.9045,32.8959,-97.0372
1,2012-09-10,LAS,DEN,14.285714,10.775182,9.466734,36.0852,-115.1507,39.8589,-104.6733
2,2012-10-05,DEN,LAX,10.863636,11.083177,9.035883,39.8589,-104.6733,33.9425,-118.409
3,2011-10-09,ATL,ORD,11.48,11.169268,7.990202,33.641,-84.4226,41.9796,-87.9045
4,2012-02-21,DEN,SFO,11.45,11.269364,9.517159,39.8589,-104.6733,37.6218,-122.379


DateOfDeparture      object
Departure            object
Arrival              object
WeeksToDeparture    float64
log_PAX             float64
std_wtd             float64
D_lat               float64
D_lon               float64
A_lat               float64
A_lon               float64
dtype: object

Unnamed: 0,DateOfDeparture,Departure,Arrival,WeeksToDeparture,log_PAX,std_wtd,D_lat,D_lon,A_lat,A_lon,Distance
0,2012-06-19,ORD,DFW,12.875,12.331296,9.812647,41.9796,-87.9045,32.8959,-97.0372,1290.782275
1,2012-09-10,LAS,DEN,14.285714,10.775182,9.466734,36.0852,-115.1507,39.8589,-104.6733,1008.860199
2,2012-10-05,DEN,LAX,10.863636,11.083177,9.035883,39.8589,-104.6733,33.9425,-118.409,1385.066996
3,2011-10-09,ATL,ORD,11.48,11.169268,7.990202,33.641,-84.4226,41.9796,-87.9045,976.118298
4,2012-02-21,DEN,SFO,11.45,11.269364,9.517159,39.8589,-104.6733,37.6218,-122.379,1552.991274
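
Applying a Python function row by row is fine at this scale. For larger frames, the same formula can be evaluated on whole columns at once with NumPy. The following is a sketch of that alternative, not the version used in the submissions; the function name `haversine_vectorized` is hypothetical, and the coordinates are copied from the tables above:

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371  # same radius as in haversine() above

def haversine_vectorized(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between arrays of points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(np.asarray(v, dtype=float))
                              for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# ORD -> DFW and LAS -> DEN, as in the first two rows above.
routes = pd.DataFrame({'D_lat': [41.9796, 36.0852], 'D_lon': [-87.9045, -115.1507],
                       'A_lat': [32.8959, 39.8589], 'A_lon': [-97.0372, -104.6733]})
routes['Distance'] = haversine_vectorized(routes['D_lat'], routes['D_lon'],
                                          routes['A_lat'], routes['A_lon'])
print(routes['Distance'].round(1).tolist())
```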


## Wrap-up
We have built a first enriched dataset, which we'll test as a new submission...

An empty directory has been created accordingly (if it does not exist in your local environment, you need to create it).

We then propose the following feature extractor.

In [18]:
%%file ../submissions/iteration_3/feature_extractor.py
import pandas as pd
import os
from pandas.tseries.holiday import Holiday, USMemorialDay, AbstractHolidayCalendar, nearest_workday, MO

# Define a mapped version of the Haversine formula to compute distances between airports. 
# Inspired from https://stackoverflow.com
#    /questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points
from math import radians, cos, sin, asin, sqrt

def haversine(row):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # we map the rows
    lon1 = row['D_lon']
    lat1 = row['D_lat']
    lon2 = row['A_lon']
    lat2 = row['A_lat']
    
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

# Inspired from the feature extractor that comes with starting_kit.
class FeatureExtractor(object):
    def __init__(self):
        pass

    def fit(self, X_df, y_array):
        pass

    def transform(self, X_df):
        X_encoded = X_df

        
        # Fetches external data from external_data.csv
        path = os.path.dirname(__file__)
        ext_data = pd.read_csv(os.path.join(path, 'external_data.csv'))
        
        # Splits the external dataset in two subsets
        ext_coords = ext_data.get(['Airport', 'Latitude', 'Longitude'])
        ext_census = ext_data.drop(['Latitude', 'Longitude'], axis=1)
        
        # Merges (left join) census data with base data
        X_encoded = pd.merge(
            X_encoded, ext_census,
            how='left',
            left_on=['Arrival'],
            right_on=['Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['Airport'], axis=1)
        
        # Performs a first merge to import departure coordinates.
        X_encoded = pd.merge(
            X_encoded, ext_coords,
            how='left',
            left_on=['Departure'],
            right_on=['Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['Airport'], axis=1)

        X_encoded = X_encoded.rename(
            columns={'Latitude': 'D_lat', 'Longitude': 'D_lon'})

        # We perform a second merge to import arrival coordinates.
        X_encoded = pd.merge(
            X_encoded, ext_coords,
            how='left',
            left_on=['Arrival'],
            right_on=['Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['Airport'], axis=1)

        X_encoded = X_encoded.rename(
            columns={'Latitude': 'A_lat', 'Longitude': 'A_lon'})

        # And we apply the Haversine formula.
        X_encoded['Distance'] = X_encoded.apply(lambda row: haversine(row), axis=1)
        
        # Creates one hot encoding for Departure, then drop the original feature
        X_encoded = X_encoded.join(pd.get_dummies(
            X_encoded['Departure'], prefix='d'))
        X_encoded = X_encoded.drop('Departure', axis=1)
        
        # Creates one hot encoding for Arrival, then drop the original feature
        X_encoded = X_encoded.join(pd.get_dummies(
            X_encoded['Arrival'], prefix='a'))
        X_encoded = X_encoded.drop('Arrival', axis=1)
        
        # Converts DateOfDeparture to datetime first, so that the holiday lookup
        # below compares dates with dates rather than strings with dates
        X_encoded['DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])

        # Adds the Federal Holidays
        cal = pd.tseries.holiday.USFederalHolidayCalendar()
        holiday_dates = cal.holidays(min(X_encoded['DateOfDeparture']), max(X_encoded['DateOfDeparture']))
        X_encoded['Holiday'] = X_encoded['DateOfDeparture'].isin(holiday_dates)
        X_encoded = X_encoded.join(pd.get_dummies(X_encoded['Holiday'], prefix='hd'))
        X_encoded = X_encoded.drop(['Holiday'], axis=1)
        
        # Creates one hot encoding for time period likely to catch seasonality
        
        X_encoded['weekday'] = X_encoded['DateOfDeparture'].dt.weekday
        X_encoded = X_encoded.join(pd.get_dummies(X_encoded['weekday'], prefix='wd'))
        X_encoded = X_encoded.drop('weekday', axis=1)
        
        X_encoded['week'] = X_encoded['DateOfDeparture'].dt.week
        X_encoded = X_encoded.join(pd.get_dummies(X_encoded['week'], prefix='w'))
        X_encoded = X_encoded.drop('week', axis=1)
        
        # Drops DateOfDeparture
        X_encoded = X_encoded.drop('DateOfDeparture', axis=1)
        
        # Return the values
        X_array = X_encoded.values
        return X_array

Overwriting ../submissions/iteration_3/feature_extractor.py


Another version of the first feature extractor, which merges the census data for both the departure and the arrival airports:

In [19]:
%%file ../submissions/iteration_4/feature_extractor.py
import pandas as pd
import os
from pandas.tseries.holiday import Holiday, USMemorialDay, AbstractHolidayCalendar, nearest_workday, MO
from copy import deepcopy

# Define a mapped version of the Haversine formula to compute distances between airports. 
# Inspired from https://stackoverflow.com
#    /questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points
from math import radians, cos, sin, asin, sqrt

def haversine(row):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # we map the rows
    lon1 = row['D_lon']
    lat1 = row['D_lat']
    lon2 = row['A_lon']
    lat2 = row['A_lat']
    
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

# Inspired from the feature extractor that comes with starting_kit.
class FeatureExtractor(object):
    def __init__(self):
        pass

    def fit(self, X_df, y_array):
        pass

    def transform(self, X_df):
        X_encoded = X_df

        
        # Fetches external data from external_data.csv
        path = os.path.dirname(__file__)
        ext_data = pd.read_csv(os.path.join(path, 'external_data.csv'))
        
        # Splits the external dataset in two subsets
        ext_coords = ext_data.get(['Airport', 'Latitude', 'Longitude'])
        ext_census = ext_data.drop(['Latitude', 'Longitude'], axis=1)
        
        # Merges (left join) census data with base data for the departure point
        ext_census_d = deepcopy(ext_census)
        ext_census_d.columns = "D_" + ext_census_d.columns
        X_encoded = pd.merge(
            X_encoded, ext_census_d,
            how='left',
            left_on=['Departure'],
            right_on=['D_Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['D_Airport'], axis=1)
        
        # Merges (left join) census data with base data for the arrival point
        ext_census_a = deepcopy(ext_census)
        ext_census_a.columns = "A_" + ext_census_a.columns
        X_encoded = pd.merge(
            X_encoded, ext_census_a,
            how='left',
            left_on=['Arrival'],
            right_on=['A_Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['A_Airport'], axis=1)
        
        # Performs a first merge to import departure coordinates.
        X_encoded = pd.merge(
            X_encoded, ext_coords,
            how='left',
            left_on=['Departure'],
            right_on=['Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['Airport'], axis=1)

        X_encoded = X_encoded.rename(
            columns={'Latitude': 'D_lat', 'Longitude': 'D_lon'})

        # We perform a second merge to import arrival coordinates.
        X_encoded = pd.merge(
            X_encoded, ext_coords,
            how='left',
            left_on=['Arrival'],
            right_on=['Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['Airport'], axis=1)

        X_encoded = X_encoded.rename(
            columns={'Latitude': 'A_lat', 'Longitude': 'A_lon'})

        # And we apply the Haversine formula.
        X_encoded['Distance'] = X_encoded.apply(lambda row: haversine(row), axis=1)
        
        # Creates one hot encoding for Departure, then drop the original feature
        X_encoded = X_encoded.join(pd.get_dummies(
            X_encoded['Departure'], prefix='d'))
        X_encoded = X_encoded.drop('Departure', axis=1)
        
        # Creates one hot encoding for Arrival, then drop the original feature
        X_encoded = X_encoded.join(pd.get_dummies(
            X_encoded['Arrival'], prefix='a'))
        X_encoded = X_encoded.drop('Arrival', axis=1)
        
        # Converts DateOfDeparture to datetime first, so that the holiday lookup
        # below compares dates with dates rather than strings with dates
        X_encoded['DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'])

        # Adds the Federal Holidays
        cal = pd.tseries.holiday.USFederalHolidayCalendar()
        holiday_dates = cal.holidays(min(X_encoded['DateOfDeparture']), max(X_encoded['DateOfDeparture']))
        X_encoded['Holiday'] = X_encoded['DateOfDeparture'].isin(holiday_dates)
        X_encoded = X_encoded.join(pd.get_dummies(X_encoded['Holiday'], prefix='hd'))
        X_encoded = X_encoded.drop(['Holiday'], axis=1)
        
        # Creates one hot encoding for time period likely to catch seasonality
        
        X_encoded['weekday'] = X_encoded['DateOfDeparture'].dt.weekday
        X_encoded = X_encoded.join(pd.get_dummies(X_encoded['weekday'], prefix='wd'))
        X_encoded = X_encoded.drop('weekday', axis=1)
        
        X_encoded['week'] = X_encoded['DateOfDeparture'].dt.week
        X_encoded = X_encoded.join(pd.get_dummies(X_encoded['week'], prefix='w'))
        X_encoded = X_encoded.drop('week', axis=1)
        
        # Drops DateOfDeparture
        X_encoded = X_encoded.drop('DateOfDeparture', axis=1)
        
        # Return the values
        X_array = X_encoded.values
        return X_array

Overwriting ../submissions/iteration_4/feature_extractor.py
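
The two census merges in this version follow the same pattern: prefix the columns, left-join on the airport code, then drop the join key. As a sketch of how that pattern could be factored out, the helper below uses `DataFrame.add_prefix` instead of `deepcopy` plus column renaming. The name `merge_endpoint` is hypothetical and the toy frames are illustrative, with population figures copied from the external data shown earlier:

```python
import pandas as pd

def merge_endpoint(df, ext, key, prefix):
    """Left-join `ext` onto `df` for one trip endpoint, prefixing the new columns."""
    renamed = ext.add_prefix(prefix)  # returns a new frame, `ext` is left untouched
    out = df.merge(renamed, how='left', left_on=key, right_on=prefix + 'Airport')
    return out.drop(columns=prefix + 'Airport')

# Toy stand-ins for the base data and the census extract.
census = pd.DataFrame({'Airport': ['ORD', 'DFW'], 'Pop_2010': [2695598, 1197816]})
flights = pd.DataFrame({'Departure': ['ORD'], 'Arrival': ['DFW']})

enriched = merge_endpoint(flights, census, 'Departure', 'D_')
enriched = merge_endpoint(enriched, census, 'Arrival', 'A_')
print(list(enriched.columns))
```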


Another attempt: we widen the window around holidays (thank you, Fabrice!):

In [87]:
%%file ../submissions/iteration_5/feature_extractor.py
import pandas as pd
import os
from pandas.tseries.holiday import Holiday, USMemorialDay, AbstractHolidayCalendar, nearest_workday, MO
from copy import deepcopy
from datetime import datetime, timedelta

# Define a mapped version of the Haversine formula to compute distances between airports. 
# Inspired from https://stackoverflow.com
#    /questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points
from math import radians, cos, sin, asin, sqrt

def haversine(row):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # we map the rows
    lon1 = row['D_lon']
    lat1 = row['D_lat']
    lon2 = row['A_lon']
    lat2 = row['A_lat']
    
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

# Inspired from the feature extractor that comes with starting_kit.
class FeatureExtractor(object):
    def __init__(self):
        pass

    def fit(self, X_df, y_array):
        pass

    def transform(self, X_df):
        X_encoded = deepcopy(X_df)

        
        # Convert DateOfDeparture to datetime
        X_encoded['DateOfDeparture'] = pd.to_datetime(X_encoded['DateOfDeparture'], dayfirst = True)
        
        # Fetches external data from external_data.csv
        path = os.path.dirname(__file__)
        ext_data = pd.read_csv(os.path.join(path, 'external_data.csv'))
        
        # Splits the external dataset in two subsets
        ext_coords = ext_data.get(['Airport', 'Latitude', 'Longitude'])
        ext_census = ext_data.drop(['Latitude', 'Longitude'], axis=1)
        
        # Merges (left join) census data with base data for the departure point
        ext_census_d = deepcopy(ext_census)
        ext_census_d.columns = "D_" + ext_census_d.columns
        X_encoded = pd.merge(
            X_encoded, ext_census_d,
            how='left',
            left_on=['Departure'],
            right_on=['D_Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['D_Airport'], axis=1)
        
        # Merges (left join) census data with base data for the arrival point
        ext_census_a = deepcopy(ext_census)
        ext_census_a.columns = "A_" + ext_census_a.columns
        X_encoded = pd.merge(
            X_encoded, ext_census_a,
            how='left',
            left_on=['Arrival'],
            right_on=['A_Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['A_Airport'], axis=1)
        
        # Performs a first merge to import departure coordinates.
        X_encoded = pd.merge(
            X_encoded, ext_coords,
            how='left',
            left_on=['Departure'],
            right_on=['Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['Airport'], axis=1)

        X_encoded = X_encoded.rename(
            columns={'Latitude': 'D_lat', 'Longitude': 'D_lon'})

        # We perform a second merge to import arrival coordinates.
        X_encoded = pd.merge(
            X_encoded, ext_coords,
            how='left',
            left_on=['Arrival'],
            right_on=['Airport'],
            sort=False)
        X_encoded = X_encoded.drop(['Airport'], axis=1)

        X_encoded = X_encoded.rename(
            columns={'Latitude': 'A_lat', 'Longitude': 'A_lon'})

        # And we apply the Haversine formula.
        X_encoded['Distance'] = X_encoded.apply(haversine, axis=1)
        
        # Creates a one-hot encoding of Departure, then drops the original feature
        X_encoded = X_encoded.join(pd.get_dummies(
            X_encoded['Departure'], prefix='d'))
        X_encoded = X_encoded.drop('Departure', axis=1)
        
        # Creates a one-hot encoding of Arrival, then drops the original feature
        X_encoded = X_encoded.join(pd.get_dummies(
            X_encoded['Arrival'], prefix='a'))
        X_encoded = X_encoded.drop('Arrival', axis=1)
        
        # Adds the US federal holidays, widened by one day on each side
        cal = pd.tseries.holiday.USFederalHolidayCalendar()
        holiday_dates = cal.holidays(
            X_encoded['DateOfDeparture'].min(),
            X_encoded['DateOfDeparture'].max())
        holiday_dates_before = holiday_dates - timedelta(days=1)
        holiday_dates_after = holiday_dates + timedelta(days=1)

        # 0/1 flags for the holiday itself, the day before and the day after
        X_encoded['Holiday'] = X_encoded['DateOfDeparture'].isin(holiday_dates).astype(int)
        X_encoded['Holiday_before'] = X_encoded['DateOfDeparture'].isin(holiday_dates_before).astype(int)
        X_encoded['Holiday_after'] = X_encoded['DateOfDeparture'].isin(holiday_dates_after).astype(int)

        # Score: 1 on the holiday itself, 0.5 on the day before or the day after
        X_encoded['Holiday_score'] = (
            X_encoded['Holiday']
            + 0.5 * X_encoded['Holiday_before']
            + 0.5 * X_encoded['Holiday_after'])

        X_encoded = X_encoded.drop(['Holiday', 'Holiday_before', 'Holiday_after'], axis=1)
        
        # Creates one-hot encodings for the time periods likely to catch seasonality
        # (DateOfDeparture was already converted to datetime above)
        
        X_encoded['weekday'] = X_encoded['DateOfDeparture'].dt.weekday
        X_encoded = X_encoded.join(pd.get_dummies(X_encoded['weekday'], prefix='wd'))
        X_encoded = X_encoded.drop('weekday', axis=1)
        
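        # Note: Series.dt.week was deprecated in pandas 1.1 and removed in 2.0;
        # on recent versions, use X_encoded['DateOfDeparture'].dt.isocalendar().week
        X_encoded['week'] = X_encoded['DateOfDeparture'].dt.week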
        X_encoded = X_encoded.join(pd.get_dummies(X_encoded['week'], prefix='w'))
        X_encoded = X_encoded.drop('week', axis=1)
        
        # Drops DateOfDeparture
        X_encoded = X_encoded.drop('DateOfDeparture', axis=1)
        
        # Return the values
        X_array = X_encoded.values
        return X_array

Overwriting ../submissions/iteration_5/feature_extractor.py
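
To see what the widened holiday window does in isolation, here is a minimal sketch of the same scoring logic applied to a handful of dates around Independence Day 2012. The helper name `holiday_score` and its `neighbour_weight` parameter are illustrative assumptions, not part of the submission. Unlike the extractor above, the sketch also pads the calendar query by one day, so that a holiday falling just outside the date range is still caught.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar


def holiday_score(dates, neighbour_weight=0.5):
    """Hypothetical helper: 1 on a US federal holiday, plus
    `neighbour_weight` for each holiday falling on an adjacent day."""
    dates = pd.Series(pd.to_datetime(dates))
    one_day = pd.Timedelta(days=1)

    # Assumption: pad the query range by one day on each side, so holidays
    # just outside the observed dates still contribute to their neighbours.
    holidays = USFederalHolidayCalendar().holidays(
        dates.min() - one_day, dates.max() + one_day)

    on_holiday = dates.isin(holidays).astype(float)
    day_before = dates.isin(holidays - one_day).astype(float)
    day_after = dates.isin(holidays + one_day).astype(float)
    return on_holiday + neighbour_weight * (day_before + day_after)


sample = ['2012-07-03', '2012-07-04', '2012-07-05', '2012-07-10']
scores = holiday_score(sample)
print(pd.DataFrame({'DateOfDeparture': sample, 'Holiday_score': scores}))
```

With the default weight, the three days around 4 July 2012 score 0.5, 1.0 and 0.5, while an ordinary day such as 10 July scores 0.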

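
The `haversine` helper in the feature extractor above is called once per row through `DataFrame.apply`. As a possible alternative, the same formula can be evaluated on whole columns with NumPy. The sketch below is only an illustration: the function name `haversine_np` and the demo coordinates are made up, chosen so that the expected distances are easy to derive (a quarter of the equator, and a zero-length trip).

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371  # same radius as in the extractor above


def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between arrays of points (decimal degrees)."""
    lon1, lat1, lon2, lat2 = (
        np.radians(np.asarray(v, dtype=float))
        for v in (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Made-up coordinates using the column names produced by the merges above.
demo = pd.DataFrame({
    'D_lon': [0.0, 10.0],
    'D_lat': [0.0, 45.0],
    'A_lon': [90.0, 10.0],
    'A_lat': [0.0, 45.0],
})
demo['Distance'] = haversine_np(
    demo['D_lon'], demo['D_lat'], demo['A_lon'], demo['A_lat'])
print(demo)
```

The first row spans 90 degrees of longitude along the equator, so its distance should be close to `EARTH_RADIUS_KM * pi / 2`; the second row has identical endpoints and a distance of zero.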