# First leak found

This notebook reveals the first of 10 leaks hidden in the **Artificial data leaks** dataset. It is probably the easiest leak, as it was also the first one found in a workshop using this data.

So, where do we start looking for hidden information? The feature importance plot in the ["First look at data and baseline model"](https://www.kaggle.com/alijs1/first-look-at-data-and-baseline-model) kernel shows that the model obtains its biggest gain from the **col8** feature. Let's start with this one and see if we can find something interesting.

From the ["First look at data and baseline model"](https://www.kaggle.com/alijs1/first-look-at-data-and-baseline-model) kernel we also already know that this feature has integer values and that its distribution looks very similar to a Normal distribution with a mean of about 100.

Let's check the most frequent values of this feature:

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df_train = pd.read_csv('../input/artificial-data-leaks/train.csv')
df_test = pd.read_csv('../input/artificial-data-leaks/test.csv')

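# Count how often each col8 value occurs in train and in test (aligned on the value).
train_counts = df_train['col8'].value_counts()
df = pd.DataFrame({'Value': train_counts.index,
                   'Count in train': train_counts})
df['Count in test'] = df_test['col8'].value_counts()
# Values missing from test become NaN after alignment; treat them as zero counts.
df = df.fillna(0).astype(int)
df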

Not surprisingly, the most frequent values are all close to the mean value of 100.

As this feature is considered very important by our model, it would be interesting to find out what exactly the model sees in it. One thing that can help here is to plot some actual trees from our model and look at which splits are made on **col8**. We'll use the same baseline model as in the ["First look at data and baseline model"](https://www.kaggle.com/alijs1/first-look-at-data-and-baseline-model) kernel.

In [None]:
import lightgbm as lgb
import matplotlib.pyplot as plt
from sklearn import metrics

RS = 0
ROUNDS = 500
TARGET = 'target'

params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting': 'gbdt',
    'learning_rate': 0.1,
    'verbose': 0,
    'num_leaves': 64,
    'bagging_fraction': 0.8,
    'bagging_seed': RS,
    'feature_fraction': 0.9,
    'feature_fraction_seed': RS,
    'max_bin': 100,
    'max_depth': 5
}

x_train = lgb.Dataset(df_train.drop(TARGET, axis=1), df_train[TARGET])
model = lgb.train(params, x_train, num_boost_round=ROUNDS)

lgb.plot_tree(model, tree_index=0, figsize=(800, 48), show_info=['split_gain'])
plt.show()

We can see that the splits follow quite an interesting pattern. The values on which splits by **col8** are made look as follows:
* 92.5
* 95.5
* 98.5
* 101.5
* 104.5
* 107.5

So what's so special about them? First, they are all around 100 (the mean value of this feature), which is expected, as splits far out in the long tails wouldn't give much gain. But the most interesting thing is that the split points grow in steps of exactly 3.
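
Reading thresholds off a very wide tree plot is error-prone, so the same information can also be collected programmatically. The sketch below is only an illustration under stated assumptions: the helper `collect_thresholds` is hypothetical (it is not part of LightGBM), and it assumes the nested-dictionary layout we expect from `model.dump_model()`, namely a `feature_names` list plus `tree_info` entries whose `tree_structure` nodes carry `split_feature` (an index into `feature_names`), `threshold`, `left_child` and `right_child`. The `toy_dump` dictionary and the name `other_feature` are hand-written stand-ins, not output from the real model.

```python
def collect_thresholds(model_dump, feature_name):
    """Return the sorted, de-duplicated split thresholds used for one feature."""
    feature_idx = model_dump['feature_names'].index(feature_name)
    found = set()

    def walk(node):
        # Leaf nodes have no 'split_feature' key, so the recursion stops there.
        if 'split_feature' not in node:
            return
        if node['split_feature'] == feature_idx:
            found.add(node['threshold'])
        walk(node['left_child'])
        walk(node['right_child'])

    for tree in model_dump['tree_info']:
        walk(tree['tree_structure'])
    return sorted(found)


def leaf(value):
    # Hypothetical helper that builds a toy leaf node.
    return {'leaf_value': value}


# Hand-written stand-in for model.dump_model(); NOT output from the real model.
toy_dump = {
    'feature_names': ['other_feature', 'col8'],
    'tree_info': [{
        'tree_structure': {
            'split_feature': 1, 'threshold': 98.5,
            'left_child': {
                'split_feature': 1, 'threshold': 95.5,
                'left_child': {
                    'split_feature': 1, 'threshold': 92.5,
                    'left_child': leaf(0.0), 'right_child': leaf(0.1),
                },
                'right_child': leaf(-0.1),
            },
            'right_child': {
                'split_feature': 0, 'threshold': 0.5,
                'left_child': leaf(0.1),
                'right_child': {
                    'split_feature': 1, 'threshold': 101.5,
                    'left_child': leaf(-0.1), 'right_child': leaf(0.1),
                },
            },
        },
    }],
}

thresholds = collect_thresholds(toy_dump, 'col8')
# Distance between neighbouring thresholds; a constant value hints at a periodic pattern.
steps = [b - a for a, b in zip(thresholds, thresholds[1:])]
print(thresholds)
print(steps)
```

With the trained model, the call would presumably be `collect_thresholds(model.dump_model(), 'col8')`. Across all 500 trees we should expect more thresholds than the six listed above, so the steps will not all be equal; the constant step of 3 is what we see near the mean in the first tree.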

Let's check the mean target value for each value of **col8**. We will restrict this to 20 **col8** values close to the mean (these have the most samples).

In [None]:
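# Mean target per col8 value, aligned on the col8 value stored in the index.
df['Mean target in train'] = df_train.groupby('col8')['target'].mean()
# .copy() so that columns can be added to df_mean later without touching df.
df_mean = df[df['Value'].isin(range(90, 110))].copy()
df_mean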

We can clearly see that the mean target values differ: there is one group of values with a mean target of around 0.4 and another group with a mean target of around 0.6. Let's put the data in a plot for a better view.

In [None]:
plt.bar(df_mean.index, df_mean['Mean target in train'])
pass

So here it is! Now it becomes very obvious why the model is making splits on values with a step of 3: there's a clear pattern.
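
To see why a per-value mean makes this kind of leak stand out, here is a small self-contained simulation. Everything in it is an assumption made for illustration: the synthetic feature (a rounded Gaussian with mean 100 and standard deviation 10), the target rates of 0.6 and 0.4, and which block of three values gets which rate are all chosen to mimic what we observed above. None of it is taken from the actual process that generated the dataset.

```python
import random
from collections import defaultdict

random.seed(0)

# Synthetic data only: a rounded Gaussian feature and a target whose
# positive rate alternates between 0.6 and 0.4 in blocks of three values.
hits = defaultdict(int)
counts = defaultdict(int)
for _ in range(200_000):
    value = round(random.gauss(100, 10))
    # Blocks of three consecutive integers; even-numbered blocks get the higher rate.
    rate = 0.6 if (value // 3) % 2 == 0 else 0.4
    counts[value] += 1
    hits[value] += int(random.random() < rate)

# Mean target per feature value, restricted to the 20 values around the mean.
mean_target = {v: hits[v] / counts[v] for v in range(90, 110)}
for v in range(90, 110):
    print(v, round(mean_target[v], 3))
```

The printed means should jump between roughly 0.6 and roughly 0.4 every three values, which is the same staircase shape as in the bar plot above.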

The model is able to catch a pattern like this (as we can see, it places the necessary splits correctly), but a tree-based model needs a lot of splits to capture it fully. So let's help our model with a feature designed to capture this.

What we need is a feature with values 1 and 0 that follows the step-3 pattern we observed. Let's create it and plot it to check that it corresponds to the values we observed.

In [None]:
df_train['Leak1'] = df_train['col8'].apply(lambda x: 1 if x % 6 < 3 else 0)
df_mean['Leak1 mean'] = df_train.groupby('col8')['Leak1'].mean()
plt.bar(df_mean.index, df_mean['Leak1 mean'])
pass

Looks good! Now we can add it to our baseline model and re-run it to see if this new feature helps.
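
Before re-training, the rule `x % 6 < 3` can also be cross-checked against the split points we read off the tree plot. The short sketch below uses plain Python; the function name `leak1` and the range 90 to 109 are just our choices for illustration. It lists the points where the flag flips between neighbouring integer values.

```python
def leak1(x):
    # Same rule as the lambda used for the Leak1 feature.
    return 1 if x % 6 < 3 else 0


values = list(range(90, 110))
flags = [leak1(v) for v in values]

# A boundary sits halfway between two neighbouring integers whose flags differ.
boundaries = [v + 0.5 for v, a, b in zip(values, flags, flags[1:]) if a != b]
print(boundaries)
```

These boundaries coincide with the six **col8** split values listed earlier. A tree working on raw **col8** needs one split per boundary in this range, while a single split on **Leak1** separates the two groups.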

In [None]:
df_train = pd.read_csv('../input/artificial-data-leaks/train.csv')
df_test = pd.read_csv('../input/artificial-data-leaks/test.csv')

df = pd.concat([df_train, df_test])

df['Leak1'] = df['col8'].apply(lambda x: 1 if x % 6 < 3 else 0)

df_train = df[:df_train.shape[0]]
df_test = df[df_train.shape[0]:]

x_train = lgb.Dataset(df_train.drop(TARGET, axis=1), df_train[TARGET])
model = lgb.train(params, x_train, num_boost_round=ROUNDS)
preds = model.predict(df_test.drop(TARGET, axis=1))

score = metrics.roc_auc_score(df_test[TARGET], preds)
print('Test AUC score:',score)

fig, axs = plt.subplots(ncols=2, figsize=(15,6))
lgb.plot_importance(model, importance_type='split', ax=axs[0], title='Feature importance (split)')
lgb.plot_importance(model, importance_type='gain', ax=axs[1], title='Feature importance (gain)')
pass

#Baseline not-tuned model, raw features:  AUC 0.74632
#First leak found, same model parameters: AUC 0.74675
#All leaks found, same model parameters:  AUC 0.93927

What we can observe is that our new **Leak1** feature is very low on the split-count plot but near the top of the gain plot, which means that our model is able to capture all of its information with very few splits. That is exactly what we intended when building this feature. We can also see a slight increase in score, from 0.74632 to 0.74675. The increase is not that big for this leak, as the model was able to capture it almost fully even without our help. All we did was consolidate the information we found into a form that is easier for our model to use.

We can observe that **col8** is still used quite actively by the model, even though we captured the main signal in our new **Leak1** feature. There could be several reasons for this:
* there is some additional information (like interactions with other features) that we didn't capture fully;
* it is a consequence of using the *feature_fraction* parameter: with a value of 0.9, some trees may not get to see **Leak1** and have to fall back on **col8**;
* the model is simply overfitting on **col8**, which now duplicates information already in **Leak1**.

So it's worth checking whether dropping the original **col8** feature improves our score further.

In [None]:
df_train = pd.read_csv('../input/artificial-data-leaks/train.csv')
df_test = pd.read_csv('../input/artificial-data-leaks/test.csv')

df = pd.concat([df_train, df_test])

df['Leak1'] = df['col8'].apply(lambda x: 1 if x % 6 < 3 else 0)
df.drop(['col8'], axis=1, inplace=True)

df_train = df[:df_train.shape[0]]
df_test = df[df_train.shape[0]:]

x_train = lgb.Dataset(df_train.drop(TARGET, axis=1), df_train[TARGET])
model = lgb.train(params, x_train, num_boost_round=ROUNDS)
preds = model.predict(df_test.drop(TARGET, axis=1))

score = metrics.roc_auc_score(df_test[TARGET], preds)
print('Test AUC score:',score)

fig, axs = plt.subplots(ncols=2, figsize=(15,6))
lgb.plot_importance(model, importance_type='split', ax=axs[0], title='Feature importance (split)')
lgb.plot_importance(model, importance_type='gain', ax=axs[1], title='Feature importance (gain)')
pass

#Baseline not-tuned model, raw features:  AUC 0.74632
#First leak found:                        AUC 0.74675
#First leak found, original col8 dropped: AUC 0.74715
#All leaks found, same model parameters:  AUC 0.93927

Great, we managed to improve the score a little bit more! It looks like we have found and fully captured the first leak in the data. 9 more leaks are left to be found...