In this assignment the goal is to beat the baseline ROC AUC score of 0.75914.
To do this we need to derive some new features from the initial data.
My approach is to create as few features as possible.

Most of the code below is taken from the public kernel https://www.kaggle.com/kashnitsky/mlcourse-ai-fall-2019-catboost-starter


In [None]:
import warnings
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier

**Read the data**

In [None]:
PATH_TO_DATA = Path('../input/mini-flight-delay-prediction/')
# PATH_TO_DATA = Path('../input/flight-delays-fall-2018/')
# PATH_TO_DATA = Path('./data/')

**Create only three features**

In [None]:
train_df = pd.read_csv(PATH_TO_DATA / 'flight_delays_train.csv')
test_df = pd.read_csv(PATH_TO_DATA / 'flight_delays_test.csv')

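# DepTime is in HHMM form; hours 24 and 25 are mapped to 0.
# Assign the result back instead of calling replace(inplace=True) on a column
# selection, which may not modify the DataFrame in recent pandas versions.
train_df['DepHour'] = train_df['DepTime'] // 100
train_df['DepHour'] = train_df['DepHour'].replace(to_replace=[24, 25], value=0)
test_df['DepHour'] = test_df['DepTime'] // 100
test_df['DepHour'] = test_df['DepHour'].replace(to_replace=[24, 25], value=0)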

train_df['DepMinute'] = train_df['DepTime']%100
test_df['DepMinute'] = test_df['DepTime']%100

train_df['CarrierOriginDepHour'] = train_df['UniqueCarrier'] + ': ' + train_df['Origin'] + ': ' + train_df['DepHour'].astype('str')
test_df['CarrierOriginDepHour'] = test_df['UniqueCarrier'] + ': ' + test_df['Origin'] + ': ' + test_df['DepHour'].astype('str')
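
To make the three features concrete, here is a minimal plain-Python sketch of the same arithmetic on single values. The helper name `make_features` is hypothetical and not part of the notebook, the carrier and airport codes are made-up sample inputs, and the sketch assumes `DepTime` is an integer in HHMM form, as the integer division above implies.

```python
def make_features(dep_time, carrier, origin):
    """Mirror DepHour, DepMinute and CarrierOriginDepHour for one flight.

    Hypothetical helper for illustration; assumes dep_time is an integer in HHMM form.
    """
    # Integer division and remainder by 100 split HHMM into hour and minute.
    dep_hour, dep_minute = divmod(dep_time, 100)
    # Hours 24 and 25 are mapped to 0, as in the replace() calls above.
    if dep_hour in (24, 25):
        dep_hour = 0
    # Concatenate carrier, origin and hour into one categorical key.
    key = f"{carrier}: {origin}: {dep_hour}"
    return dep_hour, dep_minute, key


print(make_features(1934, "AA", "ATL"))  # (19, 34, 'AA: ATL: 19')
print(make_features(2435, "DL", "JFK"))  # (0, 35, 'DL: JFK: 0')
```

The combined key lets the model treat "this carrier, from this airport, at this hour" as a single category instead of learning the interaction from three separate columns.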


**Remember indexes of categorical features (to be passed to CatBoost)**

In [None]:
categ_feat_idx = np.where(train_df.drop('dep_delayed_15min', axis=1).dtypes == 'object')[0]
categ_feat_idx

In [None]:
train_df.drop('dep_delayed_15min', axis=1).columns[categ_feat_idx]
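
The index lookup can be tried on a toy frame. This sketch is an illustration rather than the notebook's code: it uses made-up rows, and it selects non-numeric columns with `pd.api.types.is_numeric_dtype` instead of comparing against `'object'`, so the result does not depend on how pandas stores strings. The names `toy_df` and `toy_cat_idx` are hypothetical.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the notebook's data.
toy_df = pd.DataFrame({
    "DepTime": [1934, 1548],
    "UniqueCarrier": ["AA", "DL"],
    "Distance": [732, 834],
    "Origin": ["ATL", "JFK"],
})

# Positions of the non-numeric columns, in column order.
toy_cat_idx = [i for i, dtype in enumerate(toy_df.dtypes)
               if not pd.api.types.is_numeric_dtype(dtype)]

print(toy_cat_idx)                        # [1, 3]
print(list(toy_df.columns[toy_cat_idx]))  # ['UniqueCarrier', 'Origin']
```

Because the model is later fitted on a plain array, these positions have to be computed on the same column order that is passed to `fit`.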

**Allocate a hold-out set (a.k.a. a validation set) to validate the model**

In [None]:
X_train = train_df.drop('dep_delayed_15min', axis=1).values
y_train = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
X_test = test_df.values

In [None]:
X_train_part, X_valid, y_train_part, y_valid = train_test_split(X_train, y_train, 
                                                                test_size=0.3, 
                                                                random_state=17)

**Train CatBoost with near-default arguments, passing only the indexes of categorical features.**

In [None]:
ctb = CatBoostClassifier(random_seed=17,
                         silent=True,
                         task_type="GPU",
                         border_count=254)

# task_type="GPU" speeds up training; remove it to train on CPU.
# border_count=254 matches the CPU default. On GPU the default border_count is 128.

In [None]:
%%time
ctb.fit(X_train_part, y_train_part,
        cat_features=categ_feat_idx);

In [None]:
ctb_valid_pred = ctb.predict_proba(X_valid)[:, 1]

**We got a ROC AUC of roughly 0.8 on the hold-out set.**

In [None]:
roc_auc_score(y_valid, ctb_valid_pred)
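
As a reminder of what this number measures, ROC AUC is the probability that a randomly chosen delayed flight gets a higher score than a randomly chosen on-time one. The sketch below computes it by brute force over all pairs. The function name `pairwise_auc` is hypothetical, the inputs are made-up, and the quadratic loop is only meant for small illustrations, not for the full validation set.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Only the ordering of the scores matters here, which is why the predicted probabilities can be passed to `roc_auc_score` without choosing a threshold.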

**Let's take a look at the importance of the features**

In [None]:
df = pd.DataFrame({'feature_name': train_df.drop('dep_delayed_15min', axis=1).columns,
                   'importance': ctb.feature_importances_})

df.sort_values(by='importance', ascending=False)

**Train on the whole train set, make predictions on the test set.**

In [None]:
%%time
ctb.fit(X_train, y_train,
        cat_features=categ_feat_idx);

In [None]:
ctb_test_pred = ctb.predict_proba(X_test)[:, 1]

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")

    sample_sub = pd.read_csv('../input/sample-data/sample_submission.csv', 
                             index_col='id')
    sample_sub['dep_delayed_15min'] = ctb_test_pred
    sample_sub.to_csv('ctb_pred.csv')

We got ~0.762 in the competition! That's well above the baseline!

<img src='https://habrastorage.org/webt/fs/42/ms/fs42ms0r7qsoj-da4x7yfntwrbq.jpeg' width=50%>
*from the ["Nerd Laughing Loud"](https://www.kaggle.com/general/76963) thread.*