<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/), Data Scientist @ Mail.Ru Group <br>All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

# <center> Assignment #10 (demo)
## <center> Gradient boosting

Your task is to beat at least 2 benchmarks in this [Kaggle Inclass competition](https://www.kaggle.com/c/flight-delays-spring-2018). You won't be given detailed instructions here, only a brief description of how the second benchmark was achieved using Xgboost. Hopefully, at this stage of the course, a quick look at the data is enough to see that this is the type of task where gradient boosting will perform well. Most likely that means Xgboost; keep in mind, however, that we've got plenty of categorical features here.

<img src='../../img/xgboost_meme.jpg' width=40% />

In [25]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

In [26]:
import lightgbm as lgb
from lightgbm import LGBMClassifier

In [40]:
train = pd.read_csv('../../data/flight_delays_train.csv.zip')
test = pd.read_csv('../../data/flight_delays_test.csv.zip')

In [41]:
train.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y


In [42]:
test.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-7,c-25,c-3,615,YV,MRY,PHX,598
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377
4,c-6,c-6,c-3,1505,UA,ORD,STL,258


Given the flight departure time, carrier code, departure airport, destination, and flight distance, you have to predict whether the departure will be delayed by more than 15 minutes.

In [43]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
Month                100000 non-null object
DayofMonth           100000 non-null object
DayOfWeek            100000 non-null object
DepTime              100000 non-null int64
UniqueCarrier        100000 non-null object
Origin               100000 non-null object
Dest                 100000 non-null object
Distance             100000 non-null int64
dep_delayed_15min    100000 non-null object
dtypes: int64(2), object(7)
memory usage: 6.9+ MB


In [44]:
def prepData(df):
    """Add route and departure-time features to df in place and return it."""
#     for col in ['Month', 'DayofMonth', 'DayOfWeek']:
#         df[col] = df[col].str[2:].astype(int)
    # Route identifier, e.g. 'ATL-DFW'
    df['Flight'] = df['Origin'].map(str) + '-' + df['Dest'].map(str)
    # DepTime is an HHMM integer: split it into hour (wrapped to 0-23) and minute
    df['DepTimeHour'] = df['DepTime'] // 100 % 24
    df['DepTimeMin'] = df['DepTime'] % 100
    return df

In [45]:
train = prepData(train)
test = prepData(test)
train.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min,Flight,DepTimeHour,DepTimeMin
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N,ATL-DFW,19,34
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N,PIT-MCO,15,48
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N,RDU-CLE,14,22
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N,DEN-MEM,10,15
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y,MDW-OMA,18,28
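
As a quick check of the time decomposition in `prepData`, the sketch below applies the same integer arithmetic to a few hand-picked `DepTime` values. It assumes `DepTime` is an HHMM-style integer and that values of 2400 or more can occur (which would explain the modulo 24). The sample values and the helper name `split_dep_time` are made up for illustration.

```python
import pandas as pd


def split_dep_time(dep_time: pd.Series) -> pd.DataFrame:
    """Split HHMM-style integers into hour (wrapped to 0-23) and minute."""
    return pd.DataFrame({
        'DepTimeHour': dep_time // 100 % 24,
        'DepTimeMin': dep_time % 100,
    })


# Illustrative values only; 2534 is a hypothetical past-midnight reading.
sample = pd.Series([1934, 615, 2534], name='DepTime')
parts = split_dep_time(sample)
print(parts)
```

The first value matches the first row of the table above (1934 becomes hour 19, minute 34), and the hypothetical 2534 wraps around to hour 1.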


In [46]:
avoid_cols = ['dep_delayed_15min']
cols = [col for col in train.columns if col not in avoid_cols]
cat_cols = ['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier',
            'Origin', 'Dest', 'Flight', 'DepTimeHour', 'DepTimeMin']
print(cols)
print(cat_cols)

['Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'UniqueCarrier', 'Origin', 'Dest', 'Distance', 'Flight', 'DepTimeHour', 'DepTimeMin']
['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest', 'Flight', 'DepTimeHour', 'DepTimeMin']


In [47]:
# Copy the selected columns so that later in-place encoding does not
# write into a slice of the original train/test frames.
X_train = train[cols].copy()
y_train = train['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
X_test = test[cols].copy()

# X_train = train[['Distance', 'DepTime']].values
# y_train = train['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
# X_test = test[['Distance', 'DepTime']].values


In [48]:
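def cat_df(df):
    """Integer-encode string/category columns in place (codes start at 1)."""
    # Caveat: codes come from each frame's own sorted unique values, so train and
    # test codes agree only if both frames hold the same values in a column.
    for col in df.columns:
        col_type = df[col].dtype
        if col_type == 'object' or col_type.name == 'category':
            df[col] = df[col].astype('category').cat.codes + 1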

In [49]:
cat_df(X_train)
cat_df(X_test)
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
Month            100000 non-null int8
DayofMonth       100000 non-null int8
DayOfWeek        100000 non-null int8
DepTime          100000 non-null int64
UniqueCarrier    100000 non-null int8
Origin           100000 non-null int16
Dest             100000 non-null int16
Distance         100000 non-null int64
Flight           100000 non-null int16
DepTimeHour      100000 non-null int64
DepTimeMin       100000 non-null int64
dtypes: int16(3), int64(4), int8(4)
memory usage: 4.0 MB


In [50]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
Month            100000 non-null int8
DayofMonth       100000 non-null int8
DayOfWeek        100000 non-null int8
DepTime          100000 non-null int64
UniqueCarrier    100000 non-null int8
Origin           100000 non-null int16
Dest             100000 non-null int16
Distance         100000 non-null int64
Flight           100000 non-null int16
DepTimeHour      100000 non-null int64
DepTimeMin       100000 non-null int64
dtypes: int16(3), int64(4), int8(4)
memory usage: 4.0 MB
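
Because `cat_df` encodes each frame separately, the integer assigned to a given airport or carrier can differ between train and test whenever the two frames do not contain exactly the same values. Below is a minimal sketch of one way to keep the codes aligned, assuming it is acceptable to build the category list from both frames together. The helper name `encode_shared` and the toy frames are hypothetical.

```python
import pandas as pd


def encode_shared(train_df, test_df, columns):
    """Return copies of both frames with `columns` integer-encoded against
    one shared, sorted category list per column (codes start at 1)."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in columns:
        categories = sorted(set(train_df[col]) | set(test_df[col]))
        dtype = pd.CategoricalDtype(categories=categories)
        train_out[col] = train_df[col].astype(dtype).cat.codes + 1
        test_out[col] = test_df[col].astype(dtype).cat.codes + 1
    return train_out, test_out


# Toy data: 'BWI' appears only in the test frame.
toy_train = pd.DataFrame({'Origin': ['ATL', 'PIT', 'DEN']})
toy_test = pd.DataFrame({'Origin': ['PIT', 'BWI']})

enc_train, enc_test = encode_shared(toy_train, toy_test, ['Origin'])
print(enc_train['Origin'].tolist(), enc_test['Origin'].tolist())
```

With the shared list, `'PIT'` gets code 4 in both toy frames. Encoding each frame on its own would give it 3 in the train frame and 2 in the test frame.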


In [51]:
X_train_part, X_valid, y_train_part, y_valid = \
    train_test_split(X_train, y_train, 
                     test_size=0.3, random_state=17)

We'll train LightGBM on part of the data and estimate the holdout ROC AUC. The model is not quite default: it uses balanced class weights and limits tree depth to 3.

In [52]:
lgb_model = LGBMClassifier(random_state=17, class_weight='balanced', max_depth=3,
                           min_child_weight=1, reg_lambda=1)
lgb_model.fit(X_train_part, y_train_part, feature_name=cols)
# predict() returns hard 0/1 labels, so the score below is ROC AUC computed from
# class labels; pass predict_proba(X_valid)[:, 1] to score predicted probabilities.
lgb_valid_pred = lgb_model.predict(X_valid)

roc_auc_score(y_valid, lgb_valid_pred)

0.6515382973502944
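
The score above is computed from the 0/1 labels returned by `predict`. For a binary classifier, ROC AUC on hard labels reduces to balanced accuracy at the default threshold, whereas ROC AUC on predicted probabilities measures how well the model ranks examples. The sketch below illustrates the difference on synthetic data, using scikit-learn's `LogisticRegression` as a stand-in model. That choice is an assumption made to keep the example dependency-free; it is not the LightGBM model trained above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 80% negatives); not the flight data.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           random_state=17)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=17)

model = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)

labels = model.predict(X_va)              # hard 0/1 predictions
probas = model.predict_proba(X_va)[:, 1]  # predicted probability of class 1

auc_from_labels = roc_auc_score(y_va, labels)
auc_from_probas = roc_auc_score(y_va, probas)
print(f'AUC from labels:   {auc_from_labels:.4f}')
print(f'balanced accuracy: {balanced_accuracy_score(y_va, labels):.4f}')
print(f'AUC from probas:   {auc_from_probas:.4f}')
```

Since the competition metric is ROC AUC, the probability-based number is probably the more relevant one to track on the holdout set.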

In [67]:
y_train_part[0:100]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0])

Now we do the same with the whole training set, make predictions for the test set, and form a submission file. This is how you beat the first benchmark.

In [7]:
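# Refit the LightGBM model defined above on the full training set
lgb_model.fit(X_train, y_train)
# Probability of class 1 (delayed) for each test row
lgb_test_pred = lgb_model.predict_proba(X_test)[:, 1]

pd.Series(lgb_test_pred,
          name='dep_delayed_15min').to_csv('lgb_submission.csv',
                                           index_label='id', header=True)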

The second benchmark in the leaderboard was achieved as follows:

- Features `Distance` and `DepTime` were taken unchanged
- A feature `Flight` was created from features `Origin` and `Dest`
- Features `Month`, `DayofMonth`, `DayOfWeek`, `UniqueCarrier` and `Flight` were transformed with OHE (`LabelBinarizer`)
- Logistic regression and gradient boosting (xgboost) were trained. Xgboost hyperparameters were tuned via cross-validation. First, the hyperparameters responsible for model complexity were optimized, then the number of trees was fixed at 500 and the learning rate was tuned.
- Out-of-fold predicted probabilities were obtained via cross-validation using `cross_val_predict`. A linear mixture of logistic regression and gradient boosting predictions was set in the form $w_1 \cdot p_{logit} + (1 - w_1) \cdot p_{xgb}$, where $p_{logit}$ is the probability of class 1 predicted by logistic regression, and $p_{xgb}$ is the same for xgboost. The weight $w_1$ was selected manually.
- A similar combination of predictions was made for the test set.
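
The blending step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the author's actual code: it uses synthetic data, scikit-learn's `GradientBoostingClassifier` in place of xgboost, and an arbitrary weight `w1 = 0.4`. The out-of-fold probabilities come from `cross_val_predict` with `method='predict_proba'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the prepared feature matrix and target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=17)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

logit = LogisticRegression(max_iter=1000)
gbm = GradientBoostingClassifier(n_estimators=50, random_state=17)  # stand-in for xgboost

# Out-of-fold probabilities of class 1 for every training row.
p_logit = cross_val_predict(logit, X, y, cv=skf, method='predict_proba')[:, 1]
p_gbm = cross_val_predict(gbm, X, y, cv=skf, method='predict_proba')[:, 1]

w1 = 0.4  # arbitrary illustrative weight, not a tuned value
p_blend = w1 * p_logit + (1 - w1) * p_gbm

for name, p in [('logit', p_logit), ('gbm', p_gbm), ('blend', p_blend)]:
    print(f'{name}: {roc_auc_score(y, p):.4f}')
```

One way to pick `w1` is to compare the blended out-of-fold score for a handful of candidate weights and then reuse the chosen weight when combining the two models' test-set predictions.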

Following the same steps is not mandatory. That's just a description of how the result was achieved by the author of this assignment. You might prefer a different route and instead, let's say, add a couple of good features and train a random forest of a thousand trees.

Good luck!