The dataset contains 300,000 samples with 14 continuous features and a continuous target. The target has a bimodal distribution. In this notebook, I show how to obtain stratified cross-validation splits from a continuous target.

### Setup

In [None]:
import numpy as np
import pandas as pd

from scipy.stats import ks_2samp

from sklearn.model_selection import StratifiedKFold, KFold

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Read the data
df = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')

### Stratifying Continuous Target

The solution is to split the continuous target distribution into N bins and use the bin indices as class labels in scikit-learn's standard `StratifiedKFold` cross-validator. The binning is easily done with `pd.cut` in pandas. The Python function that does the splitting is given below.

In [None]:
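def create_folds(df, n_s=5, n_grp=None):
    """Assign a 'Fold' column; stratify on binned targets when n_grp is given."""
    df['Fold'] = -1

    if n_grp is None:
        # Plain K-fold without shuffling: no stratification
        skf = KFold(n_splits=n_s)
        target = df['target']
    else:
        # Bin the continuous target into n_grp equal-width bins and
        # use the integer bin index as the stratification label
        skf = StratifiedKFold(n_splits=n_s)
        df['grp'] = pd.cut(df['target'], n_grp, labels=False)
        target = df['grp']

    # split() yields positional indices, so map them through df.index
    for fold_no, (_, val_idx) in enumerate(skf.split(target, target)):
        df.loc[df.index[val_idx], 'Fold'] = fold_no
    return df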
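
To make the binning step concrete, here is a minimal sketch on synthetic data. The bimodal sample, the `toy` frame, and the choice of 4 bins are assumptions for illustration only, not the competition data or the settings used in this notebook. It shows that `pd.cut(..., labels=False)` returns integer bin indices, and that stratifying on them spreads each bin almost evenly over the folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic bimodal target: an assumption for illustration, not the competition data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'target': np.concatenate([rng.normal(6.5, 0.4, 500), rng.normal(8.5, 0.4, 500)])
})

# labels=False makes pd.cut return the integer index of the equal-width bin
toy['grp'] = pd.cut(toy['target'], bins=4, labels=False)

# Use the bin index as the class label for stratification
skf = StratifiedKFold(n_splits=5)
toy['Fold'] = -1
for fold_no, (_, val_idx) in enumerate(skf.split(toy, toy['grp'])):
    toy.loc[toy.index[val_idx], 'Fold'] = fold_no

# Rows are bins, columns are folds; each bin should be split almost evenly
bin_by_fold = pd.crosstab(toy['grp'], toy['Fold'])
print(bin_by_fold)
```

Each row of the printed table should contain nearly equal counts, which is what keeps the per-fold target histograms similar.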

The train set has the following continuous target distribution:

In [None]:
plt.hist(df['target'], bins=100, density=True)
plt.xlabel('Target')
plt.ylabel('Density')
plt.show()

Now, let's split the train set into 5 folds with stratification and visualize the target distribution in each fold.

In [None]:
df = create_folds(df, n_s=5, n_grp=1000)
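
With as many as 1000 equal-width bins, some bins in the tails may hold fewer rows than there are folds. `StratifiedKFold` normally warns about this, but the warning is suppressed in this notebook by `warnings.filterwarnings('ignore')`. The sketch below, on a synthetic bimodal sample (an assumption, not the competition data; `toy_target`, `n_sparse`, and `n_empty` are hypothetical names), shows one way to count such sparse bins before trusting the stratification.

```python
import numpy as np
import pandas as pd

# Synthetic bimodal target: an assumption for illustration only
rng = np.random.default_rng(0)
toy_target = pd.Series(
    np.concatenate([rng.normal(6.5, 0.4, 5000), rng.normal(8.5, 0.4, 5000)])
)

n_splits, n_bins = 5, 1000

# Same binning call as in create_folds: integer index of each equal-width bin
bin_id = pd.cut(toy_target, n_bins, labels=False)

# Number of rows that fall into each non-empty bin
counts = bin_id.value_counts()

# Bins that cannot contribute a row to every fold
n_sparse = int((counts < n_splits).sum())
# Bins that received no rows at all
n_empty = n_bins - counts.size

print(f'{n_sparse} bins have fewer than {n_splits} rows; {n_empty} bins are empty')
```

If many bins turn out to be sparse, a smaller `n_grp` is a reasonable thing to try.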

In [None]:
fig, axs = plt.subplots(1, 5, sharex=True, sharey=True, figsize=(10,4))
for i, ax in enumerate(axs):
    ax.hist(df[df.Fold == i]['target'], bins=100, density=True, label=f'Fold-{i}')
    if i == 0:
        ax.set_ylabel('Density')
    if i == 2:
        ax.set_xlabel("Target")
    ax.legend(frameon=False, handlelength=0)
plt.tight_layout()
plt.show()

We can compare any two folds with the two-sample **Kolmogorov-Smirnov** (KS) test to examine whether they come from the same distribution. For simplicity, let's compare every other fold with the first fold. The test results are given below. The low KS statistic (~0.0008) and high p-value (1.0) show no detectable difference between the folds, which is consistent with all folds coming from the same distribution.

In [None]:
for fold in np.sort(df.Fold.unique())[1:]:
    print(f'Fold 0 vs {fold}:', ks_2samp(df.loc[df.Fold==0,'target'], df.loc[df.Fold==fold,'target']))
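
Comparing against the first fold only is a shortcut. As a sketch of the full comparison, the hypothetical helper `pairwise_ks` below runs `ks_2samp` on every pair of folds and collects the results in a table. It is demonstrated on synthetic data with randomly assigned folds (an assumption for illustration), and it expects the same `Fold` and `target` column names used in this notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def pairwise_ks(frame, fold_col='Fold', target_col='target'):
    """Run the two-sample KS test on every pair of folds (hypothetical helper)."""
    rows = []
    for a, b in combinations(np.sort(frame[fold_col].unique()), 2):
        res = ks_2samp(frame.loc[frame[fold_col] == a, target_col],
                       frame.loc[frame[fold_col] == b, target_col])
        rows.append({'fold_a': a, 'fold_b': b,
                     'statistic': res.statistic, 'pvalue': res.pvalue})
    return pd.DataFrame(rows)


# Synthetic stand-in data: bimodal target with randomly assigned folds
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'target': np.concatenate([rng.normal(6.5, 0.4, 500), rng.normal(8.5, 0.4, 500)]),
    'Fold': rng.integers(0, 5, 1000),
})

ks_table = pairwise_ks(toy)
print(ks_table)
```

On the notebook's data, `pairwise_ks(df)` would give one row per fold pair, 10 rows for 5 folds.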

What would be the target distribution in each fold without stratification?

To answer this, let's split the train data into 5 folds again, but this time without stratification. Note that setting `n_grp=None` makes `create_folds` use a plain `KFold` without shuffling, so the folds are assigned without stratification.

The figure below shows the distribution in each fold without stratification. Note that the distributions generally look alike, but the fine structures at the peaks are quite different.

In [None]:
df = create_folds(df, n_s=5, n_grp=None)

In [None]:
fig, axs = plt.subplots(1, 5, sharex=True, sharey=True, figsize=(10,4))
for i, ax in enumerate(axs):
    ax.hist(df[df.Fold == i]['target'], bins=100, density=True, label=f'Fold-{i}')
    if i == 0:
        ax.set_ylabel('Density')
    if i == 2:
        ax.set_xlabel("Target")
    ax.legend(frameon=False, handlelength=0)
plt.tight_layout()
plt.show()

To quantify the differences between the folds, let's run the KS test again. The test results are as anticipated: the KS statistic values are low for all folds, and the p-values are not small enough to reject the null hypothesis that all folds come from the same distribution.

In [None]:
for fold in np.sort(df.Fold.unique())[1:]:
    print(f'Fold 0 vs {fold}:', ks_2samp(df.loc[df.Fold==0,'target'], df.loc[df.Fold==fold,'target']))