# Iterative-Stratification: Some seeds are better than others

Stratification of multilabel data is commonly used in the MoA prediction competition. However, a randomly chosen seed does not necessarily give the best-matched split. This notebook searches for the seeds that split the train data into training and validation sets with the most similar target distributions.

- V3: Control Group Excluded
- V2: Control Group Included

In [None]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
import sys
sys.path.append('../input/iter-strat')
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

import numpy as np
import pandas as pd

from tqdm.notebook import tqdm

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

### Read Data

In [None]:
x_develop = pd.read_csv('../input/lish-moa/train_features.csv')
y_develop = pd.read_csv('../input/lish-moa/train_targets_scored.csv')

### Exclude Control Group

In [None]:
print('Number of Samples: %d' % y_develop.shape[0])
keep_rows_id = x_develop['cp_type']!='ctl_vehicle'
y_develop = y_develop[keep_rows_id].reset_index(drop=True).drop('sig_id', axis=1)
print('Number of Samples without Control Group: %d' % y_develop.shape[0])

### Metric for best seed

Here I define a Euclidean distance metric that measures how closely the validation set's distribution matches that of the training set.

For a given seed, the metric is computed as follows:
1. Split the data into 5 folds.
2. For each train/validation pair:
    - Calculate the positivity rate of each of the 206 targets in the validation set.
    - Calculate the positivity rate of each of the 206 targets in the training set.
3. Calculate the Euclidean distance between the two vectors of positivity rates from Step 2 for each train/validation pair.
4. Calculate the mean and standard deviation of these distances across the folds.

I consider the seed with the smallest Euclidean distance to be the best seed.
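As a minimal sketch of Steps 2 and 3 for a single train/validation pair, the snippet below computes the distance on a tiny made-up label matrix. The helper name `positivity_distance` and the toy data `y_toy` are hypothetical and only serve as illustration; the real notebook uses the competition targets and the folds produced by `MultilabelStratifiedKFold`.

```python
import numpy as np

def positivity_distance(y, train_idx, val_idx):
    """Euclidean distance between the per-target positivity rates of two index sets."""
    train_rate = y[train_idx].mean(axis=0)  # share of positives per target, training rows
    val_rate = y[val_idx].mean(axis=0)      # share of positives per target, validation rows
    return float(np.sqrt(np.sum((train_rate - val_rate) ** 2)))

# Hypothetical toy labels: 6 samples, 2 binary targets.
y_toy = np.array([
    [1, 0],
    [0, 1],
    [0, 0],
    [1, 0],
    [0, 1],
    [0, 0],
])

# Both halves contain one positive per target, so the rates match exactly.
balanced = positivity_distance(y_toy, train_idx=[0, 1, 2], val_idx=[3, 4, 5])

# All positives of target 1 land in train and all positives of target 2 in validation.
skewed = positivity_distance(y_toy, train_idx=[0, 2, 3], val_idx=[1, 4, 5])

print(f"balanced split distance: {balanced:.4f}")
print(f"skewed split distance:   {skewed:.4f}")
```

A smaller distance means the two sets have more similar target frequencies, which is why the balanced split scores lower than the skewed one.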

In [None]:
def calculate_metric(data, seed, n_splits):
    # One Euclidean distance between train and validation positivity rates per fold
    distances = []
    fold = MultilabelStratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_index, val_index in fold.split(data, data):
        # sig_id was already dropped above, so every column is a target
        train_mean_positivity = data.iloc[train_index].mean()
        val_mean_positivity = data.iloc[val_index].mean()
        diff = train_mean_positivity - val_mean_positivity
        distances.append(np.sqrt(np.sum(diff**2)))
    # Return the seed argument instead of relying on the global loop variable
    return seed, np.mean(distances), np.std(distances)

In [None]:
a = []
for i in tqdm(range(100)):
    a += [calculate_metric(y_develop, i, 5)]

### Best Seed to Worst Seed

In [None]:
pd.DataFrame(a, columns=['Seed', 'mean', 'std']).sort_values(by=['mean', 'std']).set_index('Seed')

Please let me know if you notice any mistakes or have any suggestions.