# Final Prediction: LightGBM

Train a GBM using K-fold CV and use the mean test prediction across the folds for the final submission.

## Imports

Import `datetime` from the standard library along with `pandas` and `numpy`.

In [1]:
import datetime
import pandas as pd
import numpy as np

In [2]:
import lightgbm as lgb

Note for macOS: when LightGBM is installed from PyPI with `pip install lightgbm`, the gcc compiler is no longer needed.
Instead, install the OpenMP library, which LightGBM requires on systems that use the Apple Clang compiler: `brew install libomp`.


In [3]:
from sklearn.model_selection import StratifiedKFold

## Config

Number of CV folds.

In [4]:
NUM_FOLDS = 2

Make subsequent runs reproducible.

In [5]:
RANDOM_SEED = 2017

In [6]:
np.random.seed(RANDOM_SEED)
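
As a minimal sketch of what the seed buys us: re-seeding NumPy's global generator with the same value reproduces the same random draws. Note that this only covers code relying on NumPy's global state; the CV splitter and LightGBM receive their seeds explicitly further down. The variable names `first` and `second` are illustrative.

```python
import numpy as np

RANDOM_SEED = 2017

# Seed, draw, then re-seed and draw again.
np.random.seed(RANDOM_SEED)
first = np.random.rand(3)

np.random.seed(RANDOM_SEED)
second = np.random.rand(3)

# Both draws are identical because the generator state was reset.
print(np.array_equal(first, second))
```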

## Read Data

Load all features we extracted earlier.

In [7]:
df_train = pd.read_csv('../Final_Build/x_train.csv') 
df_test = pd.read_csv('../Final_Build/x_test.csv') 


In [8]:
X_train = df_train.values
X_test = df_test.values

In [9]:
y_train = pd.read_csv('../Final_Build/y_train.csv', header=None).values.reshape(-1, )

View feature summary.

In [10]:
print('X train:', X_train.shape)
print('X test: ', X_test.shape)
print('y train:', y_train.shape)

X train: (404290, 49)
X test:  (2345796, 49)
y train: (404290,)


## Train models & compute test predictions from each fold

Calculate partitions.

In [11]:
kfold = StratifiedKFold(
    n_splits=NUM_FOLDS,
    shuffle=True,
    random_state=RANDOM_SEED
)
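
A small sketch of what stratification does, using toy labels rather than the real data: each validation fold keeps roughly the same positive ratio as the full label vector. The toy size (1000 rows) and the 37% positive rate are assumptions chosen to resemble the training balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with roughly 37% positives (assumption, for illustration only).
rng = np.random.RandomState(2017)
y_toy = (rng.rand(1000) < 0.37).astype(int)
X_toy = np.zeros((len(y_toy), 1))  # features do not influence the split

kfold_toy = StratifiedKFold(n_splits=2, shuffle=True, random_state=2017)

fold_ratios = []
for ix_train, ix_val in kfold_toy.split(X_toy, y_toy):
    ratio = y_toy[ix_val].mean()
    fold_ratios.append(ratio)
    print(f'val size: {len(ix_val)}, positive ratio: {ratio:.3f}')

print(f'overall positive ratio: {y_toy.mean():.3f}')
```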

In [12]:
y_test_pred = np.zeros((len(X_test), NUM_FOLDS))

Fit all folds.

In [13]:
cv_scores = []

In [14]:
%%time

for fold_num, (ix_train, ix_val) in enumerate(kfold.split(X_train, y_train)):
    print(f'Fitting fold {fold_num + 1} of {kfold.n_splits}')
    
    X_fold_train = X_train[ix_train]
    X_fold_val = X_train[ix_val]

    y_fold_train = y_train[ix_train]
    y_fold_val = y_train[ix_val]
    
    lgb_params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'boosting': 'gbdt',
        'device': 'cpu',
        'feature_fraction': 0.486,
        'num_leaves': 158,
        'lambda_l2': 50,
        'learning_rate': 0.01,
        'num_boost_round': 5000,
        'early_stopping_rounds': 10,
        'verbose': 1,
        'bagging_fraction_seed': RANDOM_SEED,
        'feature_fraction_seed': RANDOM_SEED,
    }
    
    lgb_data_train = lgb.Dataset(X_fold_train, y_fold_train)
    lgb_data_val = lgb.Dataset(X_fold_val, y_fold_val)    
    evals_result = {}
    
    model = lgb.train(
        lgb_params,
        lgb_data_train,
        valid_sets=[lgb_data_train, lgb_data_val],
        evals_result=evals_result,
        num_boost_round=lgb_params['num_boost_round'],
        early_stopping_rounds=lgb_params['early_stopping_rounds'],
        verbose_eval=False,
    )
    
    fold_train_scores = evals_result['training'][lgb_params['metric']]
    fold_val_scores = evals_result['valid_1'][lgb_params['metric']]
    
    print('Fold {}: {} rounds, training loss {:.6f}, validation loss {:.6f}'.format(
        fold_num + 1,
        len(fold_train_scores),
        fold_train_scores[-1],
        fold_val_scores[-1],
    ))
    print()
    
    cv_scores.append(fold_val_scores[-1])
    y_test_pred[:, fold_num] = model.predict(X_test).reshape(-1)

Fitting fold 1 of 2
Fold 1: 3213 rounds, training loss 0.248407, validation loss 0.314827

Fitting fold 2 of 2
Fold 2: 3793 rounds, training loss 0.241689, validation loss 0.313148

CPU times: user 4h 15min 37s, sys: 2min 20s, total: 4h 17min 58s
Wall time: 1h 18min 12s
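
To make the role of `early_stopping_rounds` concrete, here is a simplified pure-Python sketch of patience-based early stopping: training stops once the validation loss has not improved for `patience` consecutive rounds. The helper `early_stopping_round` and the toy losses are hypothetical and only approximate what the library does internally. One consequence worth noting is that the last recorded validation loss belongs to the stopping round, so it can be slightly worse than the best one.

```python
def early_stopping_round(val_losses, patience):
    """Return (stop_round, best_round), both 1-based, for a list of validation losses."""
    best_loss = float('inf')
    best_round = 0
    for round_num, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best score: remember it and keep training.
            best_loss = loss
            best_round = round_num
        elif round_num - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return round_num, best_round
    # Ran out of rounds without triggering early stopping.
    return len(val_losses), best_round


toy_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.39]
stop_round, best_round = early_stopping_round(toy_losses, patience=3)
print(stop_round, best_round)  # 6 3
```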


Show the feature importance of the last fold's model (split counts, LightGBM's default importance type), then print the CV score.

In [15]:
pd.DataFrame({
    'column': list(df_train.columns),
    'importance': model.feature_importance(),
}).sort_values(by='importance')

| | column | importance |
|---|---|---|
| 38 | difference_between_q2 | 514 |
| 27 | when_q1 | 648 |
| 43 | min_kcore | 658 |
| 25 | who_q1 | 735 |
| 34 | when_q2 | 750 |
| 31 | difference_between_q1 | 755 |
| 35 | where_q2 | 762 |
| 28 | where_q1 | 802 |
| 32 | who_q2 | 953 |
| 29 | why_q1 | 1959 |

In [16]:
final_cv_score = np.mean(cv_scores)

In [17]:
print('Final CV score:', final_cv_score)

Final CV score: 0.31398775259783984
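
With only two folds the mean hides how much the folds disagree, so it can help to report the spread as well. A quick sketch using the rounded validation losses from the fold log above:

```python
import numpy as np

# Rounded validation losses as printed for fold 1 and fold 2.
fold_scores = np.array([0.314827, 0.313148])

cv_mean = fold_scores.mean()
cv_std = fold_scores.std()

print(f'CV mean: {cv_mean:.6f}, CV std: {cv_std:.6f}')
```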


## Generate submission

In [18]:
y_test = np.mean(y_test_pred, axis=1)
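
To illustrate the averaging step on a toy matrix (hypothetical values, 4 test rows and 2 folds): each column holds one fold's predictions, and `axis=1` collapses the fold dimension so that one value per test row remains.

```python
import numpy as np

# Hypothetical predictions: rows are test examples, columns are folds.
toy_pred = np.array([
    [0.10, 0.20],
    [0.80, 0.60],
    [0.50, 0.50],
    [0.00, 0.30],
])

# Mean across folds gives one prediction per test row.
toy_mean = toy_pred.mean(axis=1)
print(toy_mean.shape)
print(toy_mean)
```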

In [19]:
submission_id = datetime.datetime.now().strftime('%Y-%m-%d-%H%M')

In [20]:
df_submission = pd.DataFrame({
    'test_id': range(len(y_test)),
    'is_duplicate': y_test
})

### Recalibrate predictions for a different target balance on test

Based on [Mike Swarbrick Jones' blog](https://swarbrickjones.wordpress.com/2017/03/28/cross-entropy-and-training-test-class-imbalance/).

$\alpha = \frac{p_{test}}{p_{train}}$

$\beta = \frac{1 - p_{test}}{1 - p_{train}}$

$\hat{y}_{test}^{\prime} = \frac{\alpha \hat{y}_{test}}{\alpha \hat{y}_{test} + \beta(1 - \hat{y}_{test})}$

The positive class (duplicates) makes up 36.92% of the training set and roughly 16.5% of the test set.
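
An equivalent way to read the formula, which follows by dividing $\hat{y}_{test}^{\prime}$ by $1 - \hat{y}_{test}^{\prime}$: the recalibration multiplies the predicted odds by the constant factor $\alpha / \beta$.

$1 - \hat{y}_{test}^{\prime} = \frac{\beta (1 - \hat{y}_{test})}{\alpha \hat{y}_{test} + \beta(1 - \hat{y}_{test})}$

$\frac{\hat{y}_{test}^{\prime}}{1 - \hat{y}_{test}^{\prime}} = \frac{\alpha}{\beta} \cdot \frac{\hat{y}_{test}}{1 - \hat{y}_{test}}$

With the ratios above, $\alpha \approx 0.447$ and $\beta \approx 1.324$, so the odds are scaled by roughly $0.338$.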

In [21]:
def recalibrate_prediction(pred, train_pos_ratio=0.3692, test_pos_ratio=0.165):
    a = test_pos_ratio / train_pos_ratio
    b = (1 - test_pos_ratio) / (1 - train_pos_ratio)
    return a * pred / (a * pred + b * (1 - pred))
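
A short worked example of the function on a few toy predictions. The function is repeated so that the snippet is self-contained; the input values are illustrative. Because it only uses arithmetic, it also accepts a NumPy array directly. A prediction equal to the training base rate (0.3692) maps to the test base rate (0.165), while 0 and 1 stay fixed.

```python
import numpy as np

def recalibrate_prediction(pred, train_pos_ratio=0.3692, test_pos_ratio=0.165):
    a = test_pos_ratio / train_pos_ratio
    b = (1 - test_pos_ratio) / (1 - train_pos_ratio)
    return a * pred / (a * pred + b * (1 - pred))

# Toy predictions: the extremes, the training base rate, and 0.5.
toy_pred = np.array([0.0, 0.3692, 0.5, 1.0])
toy_recal = recalibrate_prediction(toy_pred)
print(toy_recal)
```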

In [22]:
df_submission['is_duplicate'] = df_submission['is_duplicate'].map(recalibrate_prediction)

In [23]:
df_submission = df_submission[['test_id', 'is_duplicate']]

### Explore and save submission

In [24]:
pd.DataFrame(y_test).plot.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1a42148898>

In [25]:
print('Test duplicates with >0.9 confidence:', len(df_submission[df_submission.is_duplicate > 0.9]))
print('Test mean prediction:', np.mean(y_test))
print('Calibrated mean prediction:', df_submission['is_duplicate'].mean())

Test duplicates with >0.9 confidence: 26330
Test mean prediction: 0.1515807229083248
Calibrated mean prediction: 0.07312322349297931


In [26]:
df_submission.to_csv(
    'submission.csv',
    header=True,
    float_format='%.8f',
    index=None,
)
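
Before uploading, it can be worth reading the file back and checking its shape and value range. The sketch below does this on a toy frame written to an in-memory buffer instead of `submission.csv`; the helper `check_submission` is hypothetical and the checks are assumptions about what a valid submission looks like here.

```python
import io
import pandas as pd

def check_submission(df, expected_rows):
    """Basic sanity checks for a submission frame (hypothetical helper)."""
    assert list(df.columns) == ['test_id', 'is_duplicate']
    assert len(df) == expected_rows
    assert df['test_id'].is_unique
    # between() is False for NaN, so this also catches missing predictions.
    assert df['is_duplicate'].between(0, 1).all()

# Toy submission written with the same options as the real one.
toy = pd.DataFrame({'test_id': range(3), 'is_duplicate': [0.1, 0.5, 0.9]})
buffer = io.StringIO()
toy.to_csv(buffer, header=True, float_format='%.8f', index=False)

# Read it back and validate.
buffer.seek(0)
toy_read = pd.read_csv(buffer)
check_submission(toy_read, expected_rows=3)
print(buffer.getvalue().splitlines()[0])
```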

In [27]:
df_submission.shape

(2345796, 2)

In [28]:
2345796

2345796

In [29]:
y_test.shape

(2345796,)