<center>
<img src="https://habrastorage.org/webt/ia/m9/zk/iam9zkyzqebnf_okxipihkgjwnw.jpeg">
     
## <center>  [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 

#### <center> Author: [Yury Kashnitsky](https://yorko.github.io) (@yorko) 

# <center>Assignment #2. Fall 2019. Solution
## <center> Part 2. Gradient boosting

Beating benchmarks in [this competition](https://www.kaggle.com/c/flight-delays-fall-2018/overview).

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier

**Read the data**

In [2]:
PATH_TO_DATA = Path('../input/flight-delays-fall-2018/')

In [3]:
train_df = pd.read_csv(PATH_TO_DATA / 'flight_delays_train.csv')

In [4]:
train_df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y


In [5]:
test_df = pd.read_csv(PATH_TO_DATA / 'flight_delays_test.csv')

In [6]:
test_df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-7,c-25,c-3,615,YV,MRY,PHX,598
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377
4,c-6,c-6,c-3,1505,UA,ORD,STL,258


**Adding the same feature as in the provided [starter Notebook](https://www.kaggle.com/kashnitsky/mlcourse-ai-fall-2019-catboost-starter)**

In [7]:
train_df['flight'] = train_df['Origin'] + '-->' + train_df['Dest']
test_df['flight'] = test_df['Origin'] + '-->' + test_df['Dest']

**Now adding more features. Actually, all of them are taken from [this Notebook](https://www.kaggle.com/rohitgr/lgbm-bayesianoptimization-eda) my Rohit Gupta**

In [8]:
# Hour and minute
train_df['hour'] = train_df['DepTime'] // 100
train_df.loc[train_df['hour'] == 24, 'hour'] = 0
train_df.loc[train_df['hour'] == 25, 'hour'] = 1
train_df['minute'] = train_df['DepTime'] % 100

test_df['hour'] = test_df['DepTime'] // 100
test_df.loc[test_df['hour'] == 24, 'hour'] = 0
test_df.loc[test_df['hour'] == 25, 'hour'] = 1
test_df['minute'] = test_df['DepTime'] % 100

# Season
train_df['summer'] = (train_df['Month'].isin([6, 7, 8])).astype(np.int32)
train_df['autumn'] = (train_df['Month'].isin([9, 10, 11])).astype(np.int32)
train_df['winter'] = (train_df['Month'].isin([12, 1, 2])).astype(np.int32)
train_df['spring'] = (train_df['Month'].isin([3, 4, 5])).astype(np.int32)

test_df['summer'] = (test_df['Month'].isin([6, 7, 8])).astype(np.int32)
test_df['autumn'] = (test_df['Month'].isin([9, 10, 11])).astype(np.int32)
test_df['winter'] = (test_df['Month'].isin([12, 1, 2])).astype(np.int32)
test_df['spring'] = (test_df['Month'].isin([3, 4, 5])).astype(np.int32)

# Daytime
train_df['daytime'] = pd.cut(train_df['hour'], bins=[0, 6, 12, 18, 23], include_lowest=True)
test_df['daytime'] = pd.cut(test_df['hour'], bins=[0, 6, 12, 18, 23], include_lowest=True)

# Extract the labels
train_y = train_df.pop('dep_delayed_15min')
train_y = train_y.map({'N': 0, 'Y': 1})

# Concatenate for preprocessing
train_split = train_df.shape[0]
full_df = pd.concat((train_df, test_df))
full_df['Distance'] = np.log(full_df['Distance'])

# String to numerical
for col in ['Month', 'DayofMonth', 'DayOfWeek']:
    full_df[col] = full_df[col].apply(lambda x: x.split('-')[1]).astype(np.int32) - 1

# Label Encoding
for col in ['Origin', 'Dest', 'UniqueCarrier', 'daytime', 'flight']:
    full_df[col] = pd.factorize(full_df[col])[0]

# Categorical columns
cat_cols = ['Month', 'DayofMonth', 'DayOfWeek', 'Origin', 'Dest', 'UniqueCarrier', 'hour', 'summer', 'autumn', 'winter', 'spring', 'daytime', 'flight']

# Converting categorical columns to type 'category' as required by LGBM
for c in cat_cols:
    full_df[c] = full_df[c].astype('category')

# Split into train and test
train_df, test_df = full_df.iloc[:train_split], full_df.iloc[train_split:]
train_df.shape, train_y.shape, test_df.shape

((100000, 16), (100000,), (100000, 16))

**Remember the indexes of categorical features**

In [9]:
train_df.dtypes

Month            category
DayofMonth       category
DayOfWeek        category
DepTime             int64
UniqueCarrier    category
Origin           category
Dest             category
Distance          float64
flight           category
hour             category
minute              int64
summer           category
autumn           category
winter           category
spring           category
daytime          category
dtype: object

In [10]:
categ_feat_idx = np.where(train_df.dtypes == 'category')[0]
categ_feat_idx

array([ 0,  1,  2,  4,  5,  6,  8,  9, 11, 12, 13, 14, 15])

**Allocate a hold-out set**

In [11]:
X_train_part, X_valid, y_train_part, y_valid = train_test_split(train_df, train_y, 
                                                                test_size=0.3, 
                                                                random_state=17)

**Create an instance of CatBoost classifier. It's gonna train considerably faster with GPU (notice that GPU is turned on in settings of this Notebook).**

In [12]:
ctb = CatBoostClassifier(task_type='GPU', random_seed=17, silent=True)

**Train Catboost without setting hyperparameters, passing only the indexes of categorical features.**

In [13]:
%%time
ctb.fit(X_train_part, y_train_part,
        cat_features=categ_feat_idx)

CPU times: user 1min 27s, sys: 13.5 s, total: 1min 40s
Wall time: 1min 30s


<catboost.core.CatBoostClassifier at 0x7f676775f748>

In [14]:
ctb_valid_pred = ctb.predict_proba(X_valid)[:, 1]

**We got ~0.8 ROC AUC on a hold-out set.**

In [15]:
roc_auc_score(y_valid, ctb_valid_pred)

0.802180246997313

**Train on the whole train set, make prediction for the test set.**

In [16]:
%%time
ctb.fit(train_df.values, train_y, cat_features=categ_feat_idx)

CPU times: user 30.6 s, sys: 14.3 s, total: 44.9 s
Wall time: 32.7 s


<catboost.core.CatBoostClassifier at 0x7f676775f748>

In [17]:
ctb_test_pred = ctb.predict_proba(test_df.values)[:, 1]

In [18]:
sample_sub = pd.read_csv(PATH_TO_DATA / 'sample_submission.csv',
                         index_col='id')
sample_sub['dep_delayed_15min'] = ctb_test_pred
sample_sub.to_csv('ctb_pred.csv')

In [19]:
sample_sub.head()

Unnamed: 0_level_0,dep_delayed_15min
id,Unnamed: 1_level_1
0,0.016117
1,0.029954
2,0.097063
3,0.391396
4,0.281331
