# Final Prediction: LightGBM

Train a GBM using K-fold CV and use the mean test prediction across the folds for the final submission.

## Imports

Import `numpy`, `pandas`, `matplotlib` and `seaborn`, followed by LightGBM and the scikit-learn splitting utilities.

In [42]:
import datetime
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [43]:
import lightgbm as lgb

In [1]:
from sklearn.model_selection import StratifiedKFold, train_test_split

## Config

Number of CV folds.

In [45]:
NUM_FOLDS = 2

Make subsequent runs reproducible.

In [46]:
RANDOM_SEED = 2017

In [47]:
np.random.seed(RANDOM_SEED)

## Read Data

Load all features we extracted earlier.

In [48]:
df_train = pd.read_csv('../Final_Build/x_train.csv', index_col='id')
df_test = pd.read_csv('../Final_Build/x_test.csv', index_col='test_id') 


In [49]:
X_train = df_train.values
X_test = df_test.values

In [50]:
y_train = pd.read_csv('../Final_Build/y_train.csv', header=None).values.reshape(-1, )

Check the shapes of the feature matrices and the target vector.

In [51]:
print('X train:', X_train.shape)
print('X test: ', X_test.shape)
print('y train:', y_train.shape)

X train: (404290, 42)
X test:  (2345796, 42)
y train: (404290,)


## Train models & compute test predictions from each fold

Set up stratified partitions so that every fold keeps the class balance of the full training set.

In [52]:
kfold = StratifiedKFold(
    n_splits=NUM_FOLDS,
    shuffle=True,
    random_state=RANDOM_SEED
)
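To illustrate what stratification does, the sketch below splits a toy label vector (made-up data, not the competition set) with the same `StratifiedKFold` settings and checks that each validation fold keeps the overall positive rate. The names `toy_X`, `toy_y`, `toy_kfold` and `fold_positive_rates` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 20 samples, 8 positives -> 40% positive rate.
toy_y = np.array([1] * 8 + [0] * 12)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

toy_kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=2017)

fold_positive_rates = []
for ix_train, ix_val in toy_kfold.split(toy_X, toy_y):
    # Each validation fold should mirror the overall class balance.
    fold_positive_rates.append(toy_y[ix_val].mean())

print('Overall positive rate:', toy_y.mean())
print('Validation positive rate per fold:', fold_positive_rates)
```

With class counts that divide evenly by the number of folds, as here, every fold reproduces the overall rate exactly; on real data the per-fold rates should differ only slightly.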

In [53]:
# y_test_pred = np.zeros((len(X_test), NUM_FOLDS))

Fit all folds.

In [54]:
cv_scores = []

In [None]:
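%%time

lgb_params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting': 'gbdt',
    'device': 'cpu',
    'feature_fraction': 0.486,
    'num_leaves': 158,
    'lambda_l2': 50,
    'learning_rate': 0.01,
    'num_boost_round': 5000,
    'early_stopping_rounds': 10,
    'verbose': 1,
    'bagging_fraction_seed': RANDOM_SEED,
    'feature_fraction_seed': RANDOM_SEED,
}

# One column of test predictions per fold; averaged for the submission later.
y_test_pred = np.zeros((len(X_test), NUM_FOLDS))

for fold_num, (ix_train, ix_val) in enumerate(kfold.split(X_train, y_train)):
    print('Fitting fold {} of {}'.format(fold_num + 1, kfold.n_splits))

    # Use separate fold variables so the full X_train / y_train stay intact.
    lgb_data_train = lgb.Dataset(X_train[ix_train], y_train[ix_train])
    lgb_data_val = lgb.Dataset(X_train[ix_val], y_train[ix_val])
    evals_result = {}

    # Note: LightGBM 4.0 removed the evals_result, early_stopping_rounds and
    # verbose_eval arguments; newer versions need the lgb.record_evaluation
    # and lgb.early_stopping callbacks instead.
    model = lgb.train(
        lgb_params,
        lgb_data_train,
        valid_sets=[lgb_data_train, lgb_data_val],
        evals_result=evals_result,
        num_boost_round=lgb_params['num_boost_round'],
        early_stopping_rounds=lgb_params['early_stopping_rounds'],
        verbose_eval=False,
    )

    fold_train_scores = evals_result['training'][lgb_params['metric']]
    fold_val_scores = evals_result['valid_1'][lgb_params['metric']]

    # Report only the last recorded scores instead of dumping the full history.
    print('Fold {}: train loss {:.6f}, validation loss {:.6f}'.format(
        fold_num + 1, fold_train_scores[-1], fold_val_scores[-1]))

    cv_scores.append(fold_val_scores[-1])
    y_test_pred[:, fold_num] = model.predict(X_test).reshape(-1)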

Fitting fold 1 of 2
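The `early_stopping_rounds` setting halts boosting once the validation loss has stopped improving for that many rounds. The function below is a simplified, pure-Python sketch of that patience rule, not LightGBM's actual implementation; `early_stopping_round` and the toy loss curve are hypothetical.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stop_round), both 0-based, for a loss to minimise.

    Simplified illustration: stop once `patience` rounds have passed
    without improving on the best validation score seen so far.
    """
    best_round = 0
    best_score = float('inf')
    for current_round, score in enumerate(val_scores):
        if score < best_score:
            best_score = score
            best_round = current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    # Ran out of rounds before the patience was exhausted.
    return best_round, len(val_scores) - 1


# Hypothetical validation log-loss curve that bottoms out at round 2.
toy_val_scores = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58]
best_round, stop_round = early_stopping_round(toy_val_scores, patience=2)
print('Best round:', best_round, 'stopped at round:', stop_round)
```

In this sketch the score at the stopping round is slightly worse than the best one, so the last value in a recorded score history is not necessarily the minimum.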




Print CV score and feature importance.

In [None]:
important_features = pd.DataFrame({
    'column': list(df_train.columns),
    'importance': model.feature_importance(),
}).sort_values(by='importance', ascending=False)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=[8, 10])
feats_to_plot = important_features[:20]
sns.barplot(x=feats_to_plot['importance'], y=feats_to_plot['column'], ax=ax)
plt.show()

In [None]:
important_features

In [None]:
final_cv_score = np.mean(cv_scores)

In [None]:
print('Final CV score:', final_cv_score)
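The CV score is the mean of the per-fold `binary_logloss` values, i.e. the average negative log-likelihood of the true labels. As a rough sketch of how to read that number, the hypothetical helper `binary_log_loss` below computes it by hand on made-up labels and predictions.

```python
import math


def binary_log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels (lower is better)."""
    total = 0.0
    for label, prob in zip(y_true, y_pred):
        # Clip to avoid log(0) on predictions of exactly 0 or 1.
        prob = min(max(prob, eps), 1 - eps)
        total += -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return total / len(y_true)


# Made-up labels and two sets of predictions for comparison.
toy_labels = [1, 0, 1, 0]
confident_loss = binary_log_loss(toy_labels, [0.9, 0.1, 0.8, 0.2])
uninformed_loss = binary_log_loss(toy_labels, [0.5, 0.5, 0.5, 0.5])

print('Confident, correct predictions:', confident_loss)
print('Always predicting 0.5:', uninformed_loss)
```

Always predicting 0.5 gives a loss of ln 2 (about 0.693), which is a useful reference point when judging the fold scores.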

## Generate submission

In [None]:
y_test = np.mean(y_test_pred, axis=1)
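The averaging assumes `y_test_pred` has one row per test example and one column per fold. A minimal sketch on a made-up array (the name `toy_fold_preds` and its values are hypothetical) shows the shape change:

```python
import numpy as np

# Hypothetical predictions: 3 test rows x 2 folds (one column per fold).
toy_fold_preds = np.array([
    [0.20, 0.30],
    [0.80, 0.70],
    [0.50, 0.50],
])

# axis=1 averages across folds, leaving one prediction per test row.
toy_mean_pred = np.mean(toy_fold_preds, axis=1)

print(toy_mean_pred.shape)
print(toy_mean_pred)
```

If `y_test_pred` were one-dimensional, for example from a single model, `axis=1` would raise an error, so it is worth confirming the array is two-dimensional before this step.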

In [None]:
submission_id = datetime.datetime.now().strftime('%Y-%m-%d-%H%M')

In [None]:
df_submission = pd.DataFrame({
    'test_id': range(len(y_test)),
    'is_duplicate': y_test
})

### Recalibrate predictions for a different target balance on test

Based on [Mike Swarbrick Jones' blog](https://swarbrickjones.wordpress.com/2017/03/28/cross-entropy-and-training-test-class-imbalance/).

$\alpha = \frac{p_{test}}{p_{train}}$

$\beta = \frac{1 - p_{test}}{1 - p_{train}}$

$\hat{y}_{test}^{\prime} = \frac{\alpha \hat{y}_{test}}{\alpha \hat{y}_{test} + \beta(1 - \hat{y}_{test})}$

The positive class makes up 36.92% of the training set and ~16.5% of the test set.

In [None]:
def recalibrate_prediction(pred, train_pos_ratio=0.3692, test_pos_ratio=0.165):
    a = test_pos_ratio / train_pos_ratio
    b = (1 - test_pos_ratio) / (1 - train_pos_ratio)
    return a * pred / (a * pred + b * (1 - pred))

In [None]:
df_submission['is_duplicate'] = df_submission['is_duplicate'].map(recalibrate_prediction)
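Because the formula uses only elementwise arithmetic, it can also be applied to a whole NumPy array at once, which is typically faster than `.map` on millions of rows. The sketch below is one possible vectorised variant with a small sanity check; `recalibrate_array` and the toy inputs are hypothetical names, and the default ratios are the ones quoted above.

```python
import numpy as np


def recalibrate_array(pred, train_pos_ratio=0.3692, test_pos_ratio=0.165):
    """Apply the same recalibration formula elementwise to an array."""
    pred = np.asarray(pred, dtype=float)
    a = test_pos_ratio / train_pos_ratio
    b = (1 - test_pos_ratio) / (1 - train_pos_ratio)
    return a * pred / (a * pred + b * (1 - pred))


# Toy inputs: the two extremes, the training prior, and a coin flip.
toy_pred = np.array([0.0, 0.3692, 0.5, 1.0])
toy_recalibrated = recalibrate_array(toy_pred)
print(toy_recalibrated)
```

The check confirms three properties of the formula: predictions of 0 and 1 are left unchanged, a prediction equal to the training prior maps to the test prior, and every value in between is pulled downwards because the test set has fewer positives. Under these assumptions, `recalibrate_array(df_submission['is_duplicate'].values)` should give the same result as the `.map` call.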

In [None]:
df_submission = df_submission[['test_id', 'is_duplicate']]

### Explore and save submission

In [None]:
pd.DataFrame(y_test).plot.hist()

In [None]:
print('Test duplicates with >0.9 confidence:', len(df_submission[df_submission.is_duplicate > 0.9]))
print('Test mean prediction:', np.mean(y_test))
print('Calibrated mean prediction:', df_submission['is_duplicate'].mean())

In [None]:
df_submission.to_csv(
    'submission.csv',
    header=True,
    float_format='%.8f',
    index=False,
)

In [None]:
df_submission.shape

In [None]:
2345796

In [None]:
y_test.shape