<center>
<img src="https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg" />
</center> 
     
## <center>  [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 

#### <center> Author: [Yury Kashnitsky](https://yorko.github.io) (@yorko) 

# <center>Assignment #2. Fall 2019
## <center> Part 2. Gradient boosting

**In this assignment, you're asked to beat a baseline in the ["Flight delays" competition](https://www.kaggle.com/c/flight-delays-fall-2018).**

This time we decided to share a pretty decent CatBoost baseline; your job is to improve on the provided solution.

Before working on the assignment, review the corresponding course material:
 1. [Classification, Decision Trees and k Nearest Neighbors](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic03_decision_trees_kNN/topic3_decision_trees_kNN.ipynb?flush_cache=true), the same as an interactive web-based [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-3-decision-trees-and-knn) 
 2. Ensembles:
  - [Bagging](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part1_bagging.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-1-bagging)
  - [Random Forest](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part2_random_forest.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-2-random-forest)
  - [Feature Importance](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part3_feature_importance.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-3-feature-importance)
 3. Gradient boosting:
  - [Gradient boosting](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic10_boosting/topic10_gradient_boosting.ipynb?flush_cache=true), the same as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/topic-10-gradient-boosting)
  - Logistic regression, Random Forest, and LightGBM in the "Kaggle Forest Cover Type Prediction" competition: [Kernel](https://www.kaggle.com/kashnitsky/topic-10-practice-with-logit-rf-and-lightgbm)
 4. You can also practice with demo assignments, which are simpler and already shared with solutions:
  - "Decision trees with a toy task and the UCI Adult dataset": [assignment](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees) + [solution](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees-solution)
  - "Logistic Regression and Random Forest in the credit scoring problem": [assignment](https://www.kaggle.com/kashnitsky/assignment-5-logit-and-rf-for-credit-scoring) + [solution](https://www.kaggle.com/kashnitsky/a5-demo-logit-and-rf-for-credit-scoring-sol)
 5. There are also 7 video lectures on trees, forests, boosting and their applications: [mlcourse.ai/video](https://mlcourse.ai/video) 
 6. mlcourse.ai tutorials on [categorical feature encoding](https://www.kaggle.com/waydeherman/tutorial-categorical-encoding) (by Wayde Herman) and [CatBoost](https://www.kaggle.com/mitribunskiy/tutorial-catboost-overview) (by Mikhail Tribunskiy)
 7. Last but not least: [Public Kernels](https://www.kaggle.com/c/flight-delays-fall-2018/notebooks) in this competition

### Your task is to:
 1. beat **"A2 baseline (10 credits)"** on Public LB (**0.75914** LB score)
 2. rename your [team](https://www.kaggle.com/c/flight-delays-fall-2018/team) in full accordance with A1 and the [course rating](https://docs.google.com/spreadsheets/d/15e1K0tg5ponA5R6YQkZfihrShTDLAKf5qeKaoVCiuhQ/) (to appear on 16.09.2019)
 
This task is intended to be relatively easy. You are not required to upload a reproducible solution.
 
### <center> Deadline for A2: 2019 October 6, 20:59 CET (London time)

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from category_encoders.target_encoder import TargetEncoder
from xgboost import XGBRegressor

In [None]:
import matplotlib.pyplot as plt
from IPython.display import FileLink
import seaborn as sns
import pickle

In [None]:
!pwd
!ls -l
!ls -lR ../input/

**Read the data**

In [None]:
train_df = pd.read_csv('../input/flight-delays-fall-2018/flight_delays_train.csv')
test_df = pd.read_csv('../input/flight-delays-fall-2018/flight_delays_test.csv')

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
test_df.head()

In [None]:
test_df.info()

**Add extra features**

In [0]:
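for df in (train_df, test_df):
    # DepTime is encoded as hhmm: split it into minutes and fractional hours
    df['DepMin'] = df['DepTime'] % 100
    df['DepHours'] = df['DepTime'] // 100 + df['DepMin'] / 60
    # Rough flight duration: distance divided by an assumed constant speed of 500 per hour
    df['DurationHours'] = df['Distance'] / 500
    # Estimated arrival time in fractional hours, wrapped around midnight
    df['ArrHours'] = (df['DepHours'] + df['DurationHours']) % 24
    df['DepHour'] = df['DepTime'] // 100
    # Work in whole minutes so that ArrMin always stays within 0..59
    # (rounding ArrHours to the nearest hour produced negative minutes)
    arr_minutes = np.round(df['ArrHours'] * 60).astype(int) % (24 * 60)
    df['ArrHour'] = arr_minutes // 60
    df['ArrMin'] = arr_minutes % 60
    df['ArrTime'] = df['ArrHour'] * 100 + df['ArrMin']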
    df['DepHourRange'] = ['00-06' if x < 6 else '06-09' if x < 9 else '09-12' if x < 12 else '12-15' if x < 15 \
                          else '15-18' if x < 18 else '18-21' if x < 21 else '21-24' for x in df['DepHour'].values]
    df['DepMinRange'] = ['00-15' if x < 15 else '15-30' if x < 30 else '30-45' if x < 45 else '45-60' for x in df['DepMin'].values]
    df['ArrHourRange'] = ['00-03' if x < 3 else '03-06' if x < 6 else '06-09' if x < 9 else '09-12' if x < 12 else '12-15' if x < 15 \
                          else '15-18' if x < 18 else '18-21' if x < 21 else '21-24' for x in df['ArrHour'].values]
    df['ArrMinRange'] = ['00-15' if x < 15 else '15-30' if x < 30 else '30-45' if x < 45 else '45-60' for x in df['ArrMin'].values]
train_df.head()

In [0]:
from itertools import combinations

In [0]:
new_feature_columns = ['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest',
                       'DepHour', 'ArrHour', 'DepHourRange', 'DepMinRange', 'ArrHourRange', 'ArrMinRange']
new_features = []
for i in range(2,4):
    new_features += [x for x in combinations(new_feature_columns, i)]
len(new_features)
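
The cell above crosses every pair and every triple of the 12 listed columns, so the number of new categorical features grows combinatorially. The sketch below counts them and shows how one row would be turned into crossed values. The toy `row` and its values are made up for illustration and are not taken from the dataset.

```python
from itertools import combinations
from math import comb

columns = ['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest',
           'DepHour', 'ArrHour', 'DepHourRange', 'DepMinRange', 'ArrHourRange', 'ArrMinRange']

# Number of pairwise and three-way interactions among 12 columns
n_pairs = comb(len(columns), 2)
n_triples = comb(len(columns), 3)
total = n_pairs + n_triples
print(n_pairs, n_triples, total)

# Hypothetical single row with three of the columns (values are illustrative)
row = {'Month': 'm8', 'DayOfWeek': 'd7', 'UniqueCarrier': 'AA'}

# Same naming scheme as in the notebook: column names and values joined by '_'
crossed = {'_'.join(cols): '_'.join(str(row[c]) for c in cols)
           for cols in combinations(row, 2)}
print(crossed)
```

Each crossed column is a string concatenation per row, which is why the feature-building loop that follows takes a while on the full data.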

In [0]:
%%time
progress = 0
progress_target = len(new_features) * 2
for df in (train_df, test_df):
    for f in new_features:
        df['_'.join(f)] = df[list(f)].apply(lambda vals: '_'.join(str(x) for x in vals), axis=1)
        progress += 1
        print("Progress: %d%% (%d/%d)" % (progress / progress_target * 100, progress, progress_target))

In [0]:
train_df['dep_delayed_15min'] = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0})

In [0]:
train_df.head()

In [0]:
train_df.info()

In [0]:
train_df.to_hdf('train.h5', 'train', mode='w')
FileLink('train.h5')

In [0]:
test_df.head()

In [0]:
test_df.info()

In [0]:
test_df.to_hdf('test.h5', 'test', mode='w')
FileLink('test.h5')

In [None]:
y = train_df['dep_delayed_15min']
train_df.drop('dep_delayed_15min', axis=1, inplace=True)
y.shape, train_df.shape

In [None]:
y.to_hdf('y.h5', 'y', mode='w')
FileLink('y.h5')

**Categorical feature encoding**

Try decreasing **test_size** so that the encoder is fitted on more rows and produces a more accurate encoding. Fitting takes roughly 3 minutes per 10,000 objects.
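
For intuition, target encoding replaces each category with an estimate of the mean target for that category. The plain-Python sketch below uses simple additive smoothing towards the global mean. It is a simplified illustration, not the formula implemented by `category_encoders.TargetEncoder`, and the helper names `fit_target_encoding` and `apply_target_encoding`, the `smoothing` parameter, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=1.0):
    """Return (mapping, prior): a smoothed mean target per category and the global mean."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    mapping = {}
    for cat, count in counts.items():
        # Rare categories get a small weight and are pulled towards the global mean
        weight = count / (count + smoothing)
        mapping[cat] = weight * (sums[cat] / count) + (1 - weight) * prior
    return mapping, prior

def apply_target_encoding(categories, mapping, prior):
    # Categories never seen during fitting fall back to the global mean
    return [mapping.get(cat, prior) for cat in categories]

# Toy data: carrier code and whether the flight was delayed
carriers = ['AA', 'AA', 'AA', 'UA', 'UA', 'DL']
delayed = [1, 0, 1, 0, 0, 1]

mapping, prior = fit_target_encoding(carriers, delayed, smoothing=1.0)
encoded = apply_target_encoding(['AA', 'UA', 'ZZ'], mapping, prior)
print(encoded)
```

Because the per-category means are estimated from the rows the encoder is fitted on, more fitting rows give more stable estimates, especially for the rare values produced by feature crosses.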

In [0]:
X_train, X_valid, y_train, y_valid = train_test_split(train_df, y, test_size=0.01, random_state=17, stratify=y)
X_train.shape

In [0]:
%%time
te = TargetEncoder()
te.fit(X_train, y_train)

In [0]:
%%time
train_df = te.transform(train_df)

In [0]:
train_df.head()

In [None]:
train_df.info()

In [0]:
train_df.to_hdf('train_enc.h5', 'train_enc', mode='w')
FileLink('train_enc.h5')

In [0]:
%%time
test_df = te.transform(test_df)

In [None]:
test_df.head()

In [0]:
test_df.info()

In [0]:
test_df.to_hdf('test_enc.h5', 'test_enc', mode='w')
FileLink('test_enc.h5')

**Read previously saved data**

In [0]:
#train_df = pd.read_hdf('../input/mlcourse-ai-fall-2019-xgboost/train.h5')
#test_df = pd.read_hdf('../input/mlcourse-ai-fall-2019-xgboost/test.h5')

In [0]:
#train_df = pd.read_hdf('../input/mlcourse-ai-fall-2019-xgboost/train_enc.h5')
#test_df = pd.read_hdf('../input/mlcourse-ai-fall-2019-xgboost/test_enc.h5')
#y = pd.read_hdf('../input/mlcourse-ai-fall-2019-xgboost/y.h5')

**Train XGBoost**

In [None]:
%%time
# measure performance (GPU); the %%time cell magic must be the first line of the cell
XGBRegressor(tree_method='gpu_hist').fit(train_df[:10000], y[:10000])

In [None]:
%%time
# measure performance (CPU); the %%time cell magic must be the first line of the cell
XGBRegressor().fit(train_df[:10000], y[:10000])

Use a different **random_state** here than in the split used to fit the encoder.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train_df, y, test_size=0.3, random_state=42, stratify=y)
X_train.shape

In [None]:
%%time
params = {'max_depth': range(1, 101, 10), 'n_estimators': range(1, 1001, 100)}
grid = GridSearchCV(XGBRegressor(random_state=17), params, cv=3, scoring='roc_auc', verbose=True)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(roc_auc_score(y_valid, grid.best_estimator_.predict(X_valid)))
plt.figure(figsize=(16,4))
plt.plot([str(x) for x in grid.cv_results_['params']], grid.cv_results_['mean_test_score'])
plt.xticks(rotation=90)
plt.title('ROC AUC / train params')
plt.show()

In [None]:
# Use a context manager so the file is flushed and closed before it is linked
with open('grid.pkl', 'wb') as f:
    pickle.dump(grid, f)
FileLink('grid.pkl')

In [None]:
%%time
params = {'max_depth': range(1, 11), 'n_estimators': range(1, 101, 10)}
grid = GridSearchCV(XGBRegressor(random_state=17), params, cv=3, scoring='roc_auc', verbose=True)
grid.fit(X_train, y_train)
print(grid.best_score_)
print(roc_auc_score(y_valid, grid.best_estimator_.predict(X_valid)))
plt.figure(figsize=(16,4))
plt.plot([str(x) for x in grid.cv_results_['params']], grid.cv_results_['mean_test_score'])
plt.xticks(rotation=90)
plt.title('ROC AUC / train params')
plt.show()

In [None]:
# Use a context manager so the file is flushed and closed before it is linked
with open('grid2.pkl', 'wb') as f:
    pickle.dump(grid, f)
FileLink('grid2.pkl')

**Validate the model**

In [None]:
grid.best_score_

**Target is 0.756 ROC AUC**
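
The validation cells pass the regressor's raw predictions straight to `roc_auc_score`. That works because ROC AUC depends only on how the scores rank positives against negatives, not on their scale. The sketch below spells out the pairwise definition in plain Python. The helper `pairwise_roc_auc` and the toy labels and scores are illustrative only, and the quadratic loop is meant for small examples rather than the full dataset.

```python
def pairwise_roc_auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive is scored higher; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_roc_auc(y_true, scores)
# A monotonically increasing rescaling keeps the ranking, hence the same AUC
rescaled_auc = pairwise_roc_auc(y_true, [10 * s - 3 for s in scores])
print(auc, rescaled_auc)
```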

In [None]:
roc_auc_score(y_valid, grid.best_estimator_.predict(X_valid))

In [None]:
plt.figure(figsize=(120,4))
plt.bar(train_df.columns, grid.best_estimator_.feature_importances_)
plt.xticks(rotation=90)
plt.title('Feature importance')
plt.show()

In [None]:
from sklearn.base import clone

plt.figure(figsize=(16,4))
xx = []
yy = []
cnt = 20
for i in range(1, cnt):
    n = len(X_train) // cnt * i
    # Fit a fresh copy so that grid.best_estimator_ itself is not overwritten
    model = clone(grid.best_estimator_).fit(X_train[:n], y_train[:n])
    xx.append(n)
    yy.append(roc_auc_score(y_valid, model.predict(X_valid)))
plt.plot(xx, yy)
plt.title('ROC AUC / train sample size')
plt.show()

**Train on the whole train set, make prediction on the test set.**

In [None]:
%%time
xgb = XGBRegressor(**grid.best_params_, random_state=17)
xgb.fit(train_df, y)

In [None]:
# Use a context manager so the file is flushed and closed before it is linked
with open('xgb.pkl', 'wb') as f:
    pickle.dump(xgb, f)
FileLink('xgb.pkl')

In [None]:
plt.figure(figsize=(120,4))
plt.bar(train_df.columns, xgb.feature_importances_)
plt.xticks(rotation=90)
plt.show()

In [None]:
sample_sub = pd.read_csv('../input/flight-delays-fall-2018/sample_submission.csv', index_col='id')
sample_sub['dep_delayed_15min'] = xgb.predict(test_df)
sample_sub.to_csv('submission.csv')
FileLink('submission.csv')

In [None]:
!head submission.csv
!ls -l

Now it's your turn! Go and improve the model to beat **"A2 baseline (10 credits)"** - **0.75914** LB score. It's crucial to come up with some good features.

For discussions, stick to the **#a2_kaggle_fall2019** thread in the **mlcourse_ai_news** [ODS Slack](http://opendatascience.slack.com) channel. Serhii Romanenko (@serhii_romanenko) will be there to help. 

Welcome to Kaggle!

![img](https://habrastorage.org/webt/fs/42/ms/fs42ms0r7qsoj-da4x7yfntwrbq.jpeg)
*from the ["Nerd Laughing Loud"](https://www.kaggle.com/general/76963) thread.*