# Iterative Model Development Steps with Application to Airlines Dataset
- Steps are outlined in https://goo.gl/A7P4vX
- Link to airlines [dataset](https://github.com/h2oai/h2o-2/wiki/Hacking-Airline-DataSet-with-H2O)

In [82]:
from collections import Counter
import inspect
from joblib import dump
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [48]:
pd.options.display.max_columns = 50
pd.options.display.max_rows = 8

## Step 1: Understand Different Modeling Approaches
Please see [https://goo.gl/A7P4vX](https://goo.gl/A7P4vX)

## Step 2: Understand Business Use Case
- Client: Airline
- Statement of Problem: The airline has to compensate passengers if a flight departs 2+ hours late or arrives 3+ hours late.
- Question: Are there (any) aspects of delay that could have been prevented?

- Production environment:
  - Jupyter notebook (for POC)
  - Running Python 3.7
  - requirements.txt

- Outcome variable: binary indicator, 1 if the arrival delay is 3+ hours or the departure delay is 2+ hours and 0 otherwise, per https://upgradedpoints.com/flight-delay-cancelation-compensation

## Step 3: Get Access to Data

### Step 3-a: Read-in Data

In [3]:
file_name = "https://s3.amazonaws.com/h2o-airlines-unpacked/year2012.csv"
df = pd.read_csv(filepath_or_buffer=file_name,
                 encoding='latin-1')
# df = pd.read_csv("2012.csv")
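Reading all 31 columns for roughly 6.1 million rows uses a lot of memory. As a minimal sketch, assuming only a handful of columns are needed for the baseline model, the `usecols` parameter of `pd.read_csv` restricts what gets loaded. The in-memory CSV is a stand-in so the example runs offline: its first two rows echo the `df.head()` output, the third is a made-up row with missing values, and `needed_columns` is a hypothetical choice.

```python
import io

import pandas as pd

# Tiny stand-in for the airlines CSV; the last row is invented to show missing values.
sample_csv = io.StringIO(
    "Year,Month,DayOfWeek,DepTime,ArrDelay,DepDelay,UniqueCarrier\n"
    "2012,1,7,855.0,-43.0,-5.0,AA\n"
    "2012,1,1,921.0,-15.0,21.0,AA\n"
    "2012,1,2,,,,AA\n"
)

# Hypothetical subset: only the columns the baseline features and outcome rely on.
needed_columns = ["Month", "DayOfWeek", "DepTime", "ArrDelay", "DepDelay"]

# usecols skips every column that is not listed, which lowers peak memory use.
df_small = pd.read_csv(sample_csv, usecols=needed_columns)
print(df_small.dtypes)
```

For the real file, the same `usecols` argument could be added to the `pd.read_csv` call above.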

### Step 3-b: EDA of Data

In [4]:
df.shape

(6096762, 31)

In [5]:
Counter(df['Month'])

Counter({1: 486133,
         2: 464826,
         3: 521628,
         4: 505218,
         5: 518423,
         6: 526933,
         7: 545131,
         8: 540793,
         9: 490199,
         10: 515254,
         11: 488006,
         12: 494218})

In [6]:
df.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,IsArrDelayed,IsDepDelayed
0,2012,1,1,7,855.0,900.0,1142.0,1225.0,AA,1,N325AA,347.0,385.0,330.0,-43.0,-5.0,JFK,LAX,2475,4.0,13.0,0,,0,0,0,0,0,0,NO,NO
1,2012,1,2,1,921.0,900.0,1210.0,1225.0,AA,1,N319AA,349.0,385.0,325.0,-15.0,21.0,JFK,LAX,2475,11.0,13.0,0,,0,0,0,0,0,0,NO,YES
2,2012,1,3,2,931.0,900.0,1224.0,1225.0,AA,1,N323AA,353.0,385.0,319.0,-1.0,31.0,JFK,LAX,2475,22.0,12.0,0,,0,0,0,0,0,0,NO,YES
3,2012,1,4,3,904.0,900.0,1151.0,1225.0,AA,1,N320AA,347.0,385.0,309.0,-34.0,4.0,JFK,LAX,2475,20.0,18.0,0,,0,0,0,0,0,0,NO,YES
4,2012,1,5,4,858.0,900.0,1142.0,1225.0,AA,1,N338AA,344.0,385.0,306.0,-43.0,-2.0,JFK,LAX,2475,22.0,16.0,0,,0,0,0,0,0,0,NO,NO


In [7]:
min(df['DepTime']), max(df['DepTime'])

(1.0, 2400.0)

In [8]:
min(df['ArrTime']), max(df['ArrTime'])

(1.0, 2400.0)
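Python's built-in `min` and `max` compare values pairwise, so their result can depend on where a `NaN` sits in the sequence. A hedged alternative, shown on a made-up four-value series, is to count missing values explicitly and use the pandas methods, which skip `NaN` by default.

```python
import numpy as np
import pandas as pd

# Made-up departure times, including one missing value.
dep_time = pd.Series([855.0, np.nan, 1.0, 2400.0])

# Count missing values before summarising the range.
n_missing = int(dep_time.isna().sum())

# Series.min() and Series.max() ignore NaN (skipna=True is the default).
lowest, highest = dep_time.min(), dep_time.max()
print(n_missing, lowest, highest)
```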

In [46]:
Counter(df['UniqueCarrier'])

Counter({'AA': 525220,
         'AS': 147569,
         'B6': 229056,
         'DL': 726879,
         'EV': 740855,
         'F9': 79255,
         'FL': 218162,
         'HA': 74109,
         'MQ': 473140,
         'OO': 617756,
         'UA': 531245,
         'US': 404263,
         'VX': 54742,
         'WN': 1140535,
         'YV': 133976})

## Step 4: To come a little later...

## Step 5: Feature Engineering for Baseline Model (v0)

- What are potential features we want to include in model?

### Step 5-a: Create an Outcome Variable

In [23]:
# Similar to function introduced in Class 2:

def delays_requiring_compensation(arrival_delay, departure_delay):
    """Return 1 if the arrival and/or departure delay (in minutes) requires passenger compensation, else 0."""
    # Arrival delay of 3+ hours, or departure delay of 2+ hours.
    # Missing delays (NaN) compare as False, so they yield 0.
    if (arrival_delay / 60.0 >= 3) or (departure_delay / 60.0 >= 2):
        return 1
    return 0

In [15]:
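# Access columns by label; positional access such as row[0] is deprecated for labelled Series.
df['compensated_delays'] = df[['ArrDelay', 'DepDelay']].apply(
    lambda row: delays_requiring_compensation(row['ArrDelay'], row['DepDelay']),
    axis=1)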
df[['ArrDelay', 'DepDelay', 'compensated_delays']].head()

Unnamed: 0,ArrDelay,DepDelay,compensated_delays
0,-43.0,-5.0,0
1,-15.0,21.0,0
2,-1.0,31.0,0
3,-34.0,4.0,0
4,-43.0,-2.0,0


In [24]:
Counter(df['compensated_delays'])

Counter({0: 5995130, 1: 101632})
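A row-wise `apply` over six million rows is slow. The same rule can be sketched with vectorized comparisons, assuming delays are in minutes as in the function above. The four rows below are made up for illustration; missing delays compare as `False` and so map to 0, matching the row-wise function.

```python
import numpy as np
import pandas as pd

# Made-up delays in minutes; the last row has missing values.
delays = pd.DataFrame({
    "ArrDelay": [-43.0, 185.0, 10.0, np.nan],
    "DepDelay": [-5.0, 30.0, 120.0, np.nan],
})

# 180 minutes = 3 hours (arrival), 120 minutes = 2 hours (departure).
compensated = (
    (delays["ArrDelay"] >= 180) | (delays["DepDelay"] >= 120)
).astype(int)
print(compensated.tolist())
```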

### Step 5-b: Create a Time-of-Day Variable
- Per [documentation](http://stat-computing.org/dataexpo/2009/the-data.html) and EDA, time of day is recorded as local time in hhmm format and stored as a float (e.g. 855.0 means 08:55).
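The cells that follow zero-pad the value to four characters and take the first two as the hour, which reads the number as hhmm. A small sketch of the same decoding with integer arithmetic, where `split_hhmm` is a hypothetical helper name:

```python
def split_hhmm(hhmm):
    """Split an hhmm-encoded time such as 855.0 into (hour, minute)."""
    # Integer division by 100 gives the hour; the remainder gives the minute.
    return divmod(int(hhmm), 100)


hour, minute = split_hhmm(855.0)
print(hour, minute)
```

Note that a value of 2400.0 decodes to hour 24, which is why the hour counts need a second look further down.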

In [25]:
print(df['DepTime'][0])
str(int(df['DepTime'][0])).zfill(4)

855.0


'0855'

In [26]:
print(min(df['DepTime']))
str(int(min(df['DepTime']))).zfill(4)

1.0


'0001'

In [27]:
# There are missing departure times:
df['DepTime'] = df['DepTime'].fillna(9999.0)

In [55]:
df['Dep_Hour'] = df['DepTime'].apply(lambda x:
                                     int(
                                         str(int(x)).zfill(4)[0:2]
                                     ))

In [56]:
df['Dep_Hour'].value_counts(sort=False)

0      14878
1       6135
2       1630
3        738
       ...  
22    119418
23     42057
24       293
99     75723
Name: Dep_Hour, Length: 26, dtype: int64

Does anything look weird?

In [57]:
# Recode hour 24 (midnight written as 2400) to hour 0.
# A single .loc assignment avoids chained assignment on a column slice.
df.loc[df['Dep_Hour'] == 24, 'Dep_Hour'] = 0

In [58]:
df['Dep_Hour'].value_counts(sort=False)

0      15171
1       6135
2       1630
3        738
       ...  
21    200855
22    119418
23     42057
99     75723
Name: Dep_Hour, Length: 25, dtype: int64
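The hour extraction, the missing-value sentinel, and the midnight recode can also be sketched in one vectorized pass. This is an alternative under the same conventions as above (9999.0 for missing times, so hour 99 marks a missing departure), shown on made-up values rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Made-up hhmm departure times, including midnight written as 2400 and a missing value.
dep_time = pd.Series([855.0, 1.0, 2400.0, np.nan, 2359.0])

# Fill missing times with the 9999.0 sentinel, then keep the hour part (hhmm // 100).
dep_hour = (dep_time.fillna(9999.0) // 100).astype(int)

# Hour 24 and hour 0 both mean midnight, so fold 24 into 0.
dep_hour = dep_hour.replace(24, 0)
print(dep_hour.tolist())
```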

### Step 5-c: Create Indicator Variables from Features for Use with Sklearn
Features:
- Month
- Day of Week
- Time of Day

In [65]:
features_tod = pd.get_dummies(df['Dep_Hour'], drop_first=True, prefix="tod_")

In [66]:
features_tod.head()

Unnamed: 0,tod__1,tod__2,tod__3,tod__4,tod__5,tod__6,tod__7,tod__8,tod__9,tod__10,tod__11,tod__12,tod__13,tod__14,tod__15,tod__16,tod__17,tod__18,tod__19,tod__20,tod__21,tod__22,tod__23,tod__99
0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [67]:
features_month = pd.get_dummies(df['Month'], drop_first=True, prefix="mo_")

In [68]:
features_dow = pd.get_dummies(df['DayOfWeek'], drop_first=True, prefix="dow_")

In [69]:
features = pd.concat([features_tod, features_month, features_dow],
                     axis=1,
                     join='inner')

In [70]:
features.columns

Index(['tod__1', 'tod__2', 'tod__3', 'tod__4', 'tod__5', 'tod__6', 'tod__7',
       'tod__8', 'tod__9', 'tod__10', 'tod__11', 'tod__12', 'tod__13',
       'tod__14', 'tod__15', 'tod__16', 'tod__17', 'tod__18', 'tod__19',
       'tod__20', 'tod__21', 'tod__22', 'tod__23', 'tod__99', 'mo__2', 'mo__3',
       'mo__4', 'mo__5', 'mo__6', 'mo__7', 'mo__8', 'mo__9', 'mo__10',
       'mo__11', 'mo__12', 'dow__2', 'dow__3', 'dow__4', 'dow__5', 'dow__6',
       'dow__7'],
      dtype='object')
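With `drop_first=True`, the first level of each variable gets no column and becomes the reference category (departure hour 0, month 1, day of week 1 here). The toy example below illustrates this on four made-up hours. It also uses `prefix="tod"` rather than `"tod_"`, since `pd.get_dummies` already inserts an underscore separator, which is where the double underscore in the column names above comes from.

```python
import pandas as pd

# Made-up departure hours; 99 is the sentinel for a missing departure time.
hours = pd.Series([0, 8, 9, 99], name="Dep_Hour")

# One indicator column per level.
with_reference = pd.get_dummies(hours, prefix="tod")

# drop_first=True removes the first level (hour 0), making it the reference category.
without_reference = pd.get_dummies(hours, prefix="tod", drop_first=True)

print(list(with_reference.columns))
print(list(without_reference.columns))
```

A row that belongs to the reference category has zeros in every remaining indicator column.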

In [75]:
dataset = pd.concat([features, df['compensated_delays']],
                    axis=1)

In [76]:
dataset.shape

(6096762, 42)

## Step 4: Determine Data Splits
- What are some data splits that you would propose?

In [77]:
df_tmp, df_test = train_test_split(dataset,
                                   test_size=0.25,
                                   random_state=2019,
                                   stratify=dataset['compensated_delays'])

In [78]:
df_train, df_valid = train_test_split(df_tmp,
                                      test_size=0.25,
                                      random_state=2019,
                                      stratify=df_tmp['compensated_delays'])

In [79]:
df_train['compensated_delays'].value_counts(sort=False)

0    3372260
1      57168
Name: compensated_delays, dtype: int64

In [80]:
df_valid['compensated_delays'].value_counts(sort=False)

0    1124087
1      19056
Name: compensated_delays, dtype: int64

In [81]:
df_test['compensated_delays'].value_counts(sort=False)

0    1498783
1      25408
Name: compensated_delays, dtype: int64
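Two successive 75/25 splits give roughly 56.25% train, 18.75% validation, and 25% test, which matches the counts above. The sketch below reproduces the two-stage stratified split on a made-up frame of 1,600 rows (the `toy` data and its 1-in-16 positive rate are assumptions for illustration only).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 1,600 rows with 100 positives.
toy = pd.DataFrame({
    "x": np.arange(1600),
    "y": np.tile([0] * 15 + [1], 100),
})

# Stage 1: hold out 25% as the test set, stratified on the outcome.
toy_tmp, toy_test = train_test_split(
    toy, test_size=0.25, random_state=2019, stratify=toy["y"])

# Stage 2: split the remaining 75% into train (75%) and validation (25%).
toy_train, toy_valid = train_test_split(
    toy_tmp, test_size=0.25, random_state=2019, stratify=toy_tmp["y"])

for name, part in [("train", toy_train), ("valid", toy_valid), ("test", toy_test)]:
    print(name, len(part) / len(toy), round(part["y"].mean(), 4))
```

Stratifying keeps the share of positives close to equal across the three parts, which matters when the positive class is this rare.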

## Step 6: Estimate a Baseline Model (v0)

In [83]:
y = df_train['compensated_delays']
X = df_train.drop(columns=['compensated_delays'])

In [84]:
inspect.signature(LogisticRegression)

<Signature (penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None)>

In [103]:
est_model = LogisticRegression(penalty="l2",
                               C=0.5,
                               fit_intercept=True,
                               class_weight='balanced',
                               random_state=2019,
                               max_iter=10000,
                               solver='saga')

In [104]:
est_model

LogisticRegression(C=0.5, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=10000,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=2019,
          solver='saga', tol=0.0001, verbose=0, warm_start=False)
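The `class_weight='balanced'` setting matters here because fewer than 2% of training rows are positive. Per the scikit-learn documentation, balanced weights are computed as `n_samples / (n_classes * np.bincount(y))`. A small sketch applying that formula to the training counts printed earlier:

```python
import numpy as np

# Training-set counts for class 0 and class 1, taken from the value_counts output above.
class_counts = np.array([3372260, 57168])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# Balanced weighting: each class contributes the same total weight to the loss.
balanced_weights = n_samples / (n_classes * class_counts)
print(balanced_weights)
```

The rare class ends up weighted roughly 60 times more heavily than the common class, so the total weight of each class is equal.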

In [None]:
est_model.fit(X, y)

In [88]:
# Save model, per: https://scikit-learn.org/stable/modules/model_persistence.html
dump(est_model, 'logistic.joblib')

## Load saved model:
# from joblib import load
# est_model = load('logistic.joblib')

['logistic.joblib']

- Aside: creating a filename that includes a [timestamp](https://stackoverflow.com/questions/10607688/how-to-create-a-file-name-with-the-current-date-time-in-python)
- Logistic Regression with [Cross-Validation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
- (in general) Cross-Validation with sklearn: [approach 1](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and [approach 2](https://scikit-learn.org/0.16/modules/generated/sklearn.grid_search.GridSearchCV.html) (the legacy `sklearn.grid_search` module from version 0.16, since replaced by `sklearn.model_selection`)
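For the timestamp aside, one possible sketch uses `datetime.strftime` from the standard library. The helper name `timestamped_filename` and the `YYYYMMDD_HHMMSS` pattern are illustrative choices, not something the notebook prescribes.

```python
from datetime import datetime


def timestamped_filename(stem, suffix=".joblib", now=None):
    """Build a filename such as 'logistic_20190102_030405.joblib'."""
    # Allow the caller to pass a fixed time; default to the current local time.
    now = now or datetime.now()
    return f"{stem}_{now.strftime('%Y%m%d_%H%M%S')}{suffix}"


model_path = timestamped_filename("logistic")
print(model_path)
```

The resulting string could be passed to `dump` in place of the fixed `'logistic.joblib'` so that repeated runs do not overwrite earlier models.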

## Step 7: Interpret Results

In [100]:
est_model.intercept_

array([2.73075787])

In [None]:
# [round(x, 2) for x in est_model.coef_.tolist()[0]]
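One way to read these numbers: when every indicator is zero (the reference categories), the model's predicted probability is the logistic function of the intercept, and exponentiating a coefficient gives an odds ratio relative to the reference category. The sketch below applies this to the intercept printed above; `sigmoid` and `odds_ratio` are hypothetical helper names. Because the model was fit with `class_weight='balanced'`, this probability reflects the reweighted data and should not be read as a calibrated estimate of the raw delay rate.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def odds_ratio(coefficient):
    """Multiplicative change in odds relative to the reference category."""
    return np.exp(coefficient)


# Intercept value as printed by est_model.intercept_ above.
intercept = 2.73075787

# Predicted probability for a row in all reference categories (under balanced weighting).
baseline_probability = sigmoid(intercept)
print(round(float(baseline_probability), 3))
```

Applied to the fitted model, `odds_ratio(est_model.coef_[0])` would give one odds ratio per feature column.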

## Step 8: Evaluate Performance
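
With under 2% positives, plain accuracy is not informative: predicting "no compensation" for every flight would already score above 98%. A hedged sketch of metrics that focus on the rare class is shown below on ten made-up labels. On the real data, `y_true` would be `df_valid['compensated_delays']` and `y_pred` the output of `est_model.predict` on the validation features.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: share of predicted positives that are real; recall: share of real positives found.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(tn, fp, fn, tp, precision, recall)
```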

## Step 9: Determine Next Steps