I created this notebook in response to https://www.kaggle.com/zhangcheche/work-well-on-trainset-bad-on-testset

I think the reason his validation-score (0.82) is **far higher** than the LB-score (0.74) is that data is leaking between his models.

To summarize: if you want to hold out a validation-set from the training-set, make sure you split only **once**, at the beginning. Every model should then be trained/fit on the same training-set and validated on the same validation-set.

Calling `train_test_split` each time you train a base model is a bad idea, because `train_test_split` shuffles the data by default and, unless you pass a fixed `random_state`, produces a different split on every call. Rows in the validation-set of the 1st model may end up in the training-set of the 2nd model, and so on, so data leaks between the models.
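To make the overlap concrete, here is a minimal sketch that splits the same row indices twice and counts how many validation rows of the first split end up in the training part of the second. The `row_ids` array and the two seeds are made up for illustration; the different seeds only mimic what happens when `random_state` is left unset. With the default `test_size` of 0.25, roughly three quarters of the first validation-set should be expected to leak.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(1000)  # hypothetical stand-in for the row indices of a dataset

# Two independent calls, as if made in two separate notebook cells.
# Different seeds mimic what happens when random_state is left unset.
train_1, val_1 = train_test_split(row_ids, random_state=1)
train_2, val_2 = train_test_split(row_ids, random_state=2)

# Validation rows of the 1st split that landed in the training-set of the 2nd split.
leaked = np.intersect1d(val_1, train_2)
leak_fraction = len(leaked) / len(val_1)
print(f"{len(leaked)} of {len(val_1)} validation rows leaked ({leak_fraction:.0%})")
```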

I put a lot of comments in the code, so make sure to read the code too.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

In [None]:
x = pd.read_csv('/kaggle/input/tabular-playground-series-nov-2021/train.csv', index_col=0)  # use the first column (the row id) as the index, not as a feature
y = x.pop('target')
x = StandardScaler().fit_transform(x)

### 1st model

In [None]:
train_x, val_x, train_y, val_y = train_test_split(x, y)
est = LogisticRegression()
est.fit(train_x, train_y)
val_pred = est.predict_proba(val_x)[:,1]
logreg_pred = est.predict_proba(x)[:,1]
roc_auc_score(val_y, val_pred)

### 2nd model

In [None]:
train_x, val_x, train_y, val_y = train_test_split(x, y) # BAD - DON'T SPLIT AGAIN - DATA LEAKING
est = LinearSVC(dual=False)
est.fit(train_x, train_y) # 2nd-model can see the 1st-model's validation-set, due to 'split again'
val_pred = est.decision_function(val_x)
lsvc_pred = est.decision_function(x)
roc_auc_score(val_y, val_pred)

### 3rd model

In [None]:
train_x, val_x, train_y, val_y = train_test_split(x, y) # BAD - DON'T SPLIT AGAIN - DATA LEAKING
xgb_model = XGBClassifier(use_label_encoder=False, tree_method='hist')
xgb_model.fit(train_x, train_y) # 3rd-model can see parts of the 1st+2nd model's validation-sets, due to 'split again'
val_pred = xgb_model.predict_proba(val_x)[:, 1]
xgb_pred = xgb_model.predict_proba(x)[:, 1]
roc_auc_score(val_y, val_pred)

### Stacking/Ensembling - Feeding the Prediction from 1st+2nd+3rd Models into the 4th Final Model

Remember that the 2nd-model could see part of the 1st-model's validation-set, the 3rd-model could see part of the 1st+2nd model's validation-sets, etc.

The final-model gets the predictions of the 1st+2nd+3rd models as features, so it indirectly sees whatever those models saw in their training-sets. Because of this leak, the final-model **can see almost everything** in the complete data, including the labels/answers of its own validation-set.

In [None]:
x_new = pd.DataFrame(x)
x_new['logreg'] = logreg_pred # 1st-model saw some part of the 2nd+3rd validation-sets
x_new['lsvc'] = lsvc_pred # 2nd-model saw some part of the 1st+3rd validation-sets
x_new['xgb'] = xgb_pred # 3rd-model saw some part of the 1st+2nd validation-sets
x_new.shape

In [None]:
train_x, val_x, train_y, val_y = train_test_split(x_new, y) # BAD - DON'T SPLIT AGAIN
final_model = XGBClassifier(use_label_encoder=False, tree_method='hist')
final_model.fit(train_x, train_y)
val_pred = final_model.predict_proba(val_x)[:, 1]
roc_auc_score(val_y, val_pred)

Boom: look at the high validation-score from the final-model.
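To see how much of such a score can come from the leak alone, here is a rough sketch on synthetic data where the labels are pure noise, so any honest validation AUC should sit near 0.5. Everything in it is an assumption made for illustration: the random data, the seeds, a fully grown `DecisionTreeClassifier` as the single base model (chosen because it memorizes its training rows), and `LogisticRegression` as the final model. It is not the pipeline used in this notebook, only the same splitting pattern in miniature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))     # synthetic features, unrelated to the labels
y = rng.integers(0, 2, size=4000)  # pure-noise labels: an honest AUC is about 0.5

# Leaky pipeline: the base model and the final model use different splits.
tr_x, va_x, tr_y, va_y = train_test_split(X, y, random_state=1)
base = DecisionTreeClassifier(random_state=0).fit(tr_x, tr_y)  # memorizes its training rows
base_pred_all = base.predict_proba(X)[:, 1]                    # predictions for ALL rows

X_new = np.column_stack([X, base_pred_all])
tr_x, va_x, tr_y, va_y = train_test_split(X_new, y, random_state=2)  # split again (the mistake)
final = LogisticRegression().fit(tr_x, tr_y)
leaky_auc = roc_auc_score(va_y, final.predict_proba(va_x)[:, 1])

# Consistent pipeline: one split shared by the base model and the final model.
tr_x, va_x, tr_y, va_y = train_test_split(X, y, random_state=1)
base = DecisionTreeClassifier(random_state=0).fit(tr_x, tr_y)
tr_new = np.column_stack([tr_x, base.predict_proba(tr_x)[:, 1]])
va_new = np.column_stack([va_x, base.predict_proba(va_x)[:, 1]])
final = LogisticRegression().fit(tr_new, tr_y)
honest_auc = roc_auc_score(va_y, final.predict_proba(va_new)[:, 1])

print(f"leaky AUC:  {leaky_auc:.3f}")
print(f"honest AUC: {honest_auc:.3f}")
```

In the leaky variant most rows of the final validation-set were training rows of the base model, so the base prediction feature already encodes their labels and the score ends up well above 0.5 even though there is nothing to learn.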

# How to Fix

All models should be trained on the same training-set and validated on the same validation-set. The easiest way to do this is to call `train_test_split` only once, at the beginning of your notebook/kernel.
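If splitting once is inconvenient, for example because the models live in separate notebooks, a possible alternative is to pass the same fixed `random_state` to every `train_test_split` call. This is not what this notebook does; it is only a sketch, and it assumes the data is identical and in the same row order each time. The `row_ids` array and the seed 42 are arbitrary illustration values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(1000)  # hypothetical stand-in for the row indices of a dataset

# Same data, same seed: both calls return exactly the same partition.
train_a, val_a = train_test_split(row_ids, random_state=42)
train_b, val_b = train_test_split(row_ids, random_state=42)

same_split = np.array_equal(train_a, train_b) and np.array_equal(val_a, val_b)
print("identical splits:", same_split)
```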

In [None]:
train_x, val_x, train_y, val_y = train_test_split(x, y) # only once at the beginning

est = LogisticRegression()
est.fit(train_x, train_y)
logreg_train_pred = est.predict_proba(train_x)[:,1]
logreg_val_pred = est.predict_proba(val_x)[:,1]
print('lr', roc_auc_score(val_y, logreg_val_pred))

est = LinearSVC(dual=False)
est.fit(train_x, train_y)
lsvc_train_pred = est.decision_function(train_x)
lsvc_val_pred = est.decision_function(val_x)
print('lsvc', roc_auc_score(val_y, lsvc_val_pred))

xgb_model = XGBClassifier(use_label_encoder=False, tree_method='hist')
xgb_model.fit(train_x, train_y)
xgb_train_pred = xgb_model.predict_proba(train_x)[:, 1]
xgb_val_pred = xgb_model.predict_proba(val_x)[:, 1]
print('xgb', roc_auc_score(val_y, xgb_val_pred))

train_x_new = pd.DataFrame(train_x)
train_x_new['logreg'] = logreg_train_pred
train_x_new['lsvc'] = lsvc_train_pred
train_x_new['xgb'] = xgb_train_pred
val_x_new = pd.DataFrame(val_x)
val_x_new['logreg'] = logreg_val_pred
val_x_new['lsvc'] = lsvc_val_pred
val_x_new['xgb'] = xgb_val_pred

final_model = XGBClassifier(use_label_encoder=False, tree_method='hist')
final_model.fit(train_x_new, train_y)
final_val_pred = final_model.predict_proba(val_x_new)[:, 1]
print('final', roc_auc_score(val_y, final_val_pred))

In [None]:
roc_auc_score(val_y, final_val_pred)

The validation-score from the `final` model makes more sense now :-)