On [Kaggle forums](https://www.kaggle.com/c/freesound-audio-tagging/discussion), it has been reported in multiple places that the labels are noisy. Also, the recording quality varies from file to file.

Further to that, as we discovered in NB #1, some of the categories have far fewer examples than others.

All this, combined with there being relatively little train data, means the performance of our model can vary significantly depending on which files make it into the train and validation sets.

To address this high-variance scenario, training with k-fold cross-validation is a good choice.

On top of that, as things stand right now, if we were to compare two models with files being assigned randomly to the train and validation sets, we would have a really hard time telling which model performs better. Maybe one model got a 'luckier' split? Being able to assign files to the train and validation sets consistently between runs will be of great value.
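One way to get the same assignment on every run is to fix the seed of the splitter. Here is a minimal sketch of the idea, assuming scikit-learn's `StratifiedKFold`; the toy labels, the `make_splits` helper and the seed value are all made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels standing in for the real ones (hypothetical data).
labels = np.array(['a'] * 10 + ['b'] * 10)
X = np.zeros(len(labels))  # the splitter only looks at the labels

def make_splits(seed):
    # Hypothetical helper: same seed in, same folds out.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return list(skf.split(X, labels))

first = make_splits(42)
second = make_splits(42)

# Compare the train and validation indices of every fold across the two runs.
same = all(
    np.array_equal(tr1, tr2) and np.array_equal(va1, va2)
    for (tr1, va1), (tr2, va2) in zip(first, second)
)
print(same)
```

Saving the splits to disk achieves the same goal and also protects against changes in library versions.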

Another nice aspect of training with k-fold validation is that it opens the road to various [forms of ensembling](https://mlwave.com/kaggle-ensembling-guide/). This is arguably the single most important technique that can give your models a performance boost!
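The simplest form of ensembling that k-fold training gives us is averaging the predictions of the models trained on each fold. A rough sketch of what that could look like, with random numbers standing in for real model outputs (the shapes and the `fold_probs` name are assumptions, not something from this notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
n_folds, n_samples, n_classes = 5, 4, 3

# Hypothetical per-fold class probabilities on the same test files.
# Each row sums to 1, like the output of a softmax.
fold_probs = rng.dirichlet(np.ones(n_classes), size=(n_folds, n_samples))

# Average over the fold axis, then pick the most likely class per file.
ensemble_probs = fold_probs.mean(axis=0)
ensemble_pred = ensemble_probs.argmax(axis=1)

print(ensemble_probs.shape, ensemble_pred.shape)
```

Averaging probabilities is only one option; the linked guide covers rank averaging, stacking and more.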

## stratified k-fold (5 splits)

There is so little data in the train set that people often went for a 10-fold split. As I am most interested in the relative performance of models, I would much rather have a bigger validation set.

Plus, 2x fewer folds means over 2x shorter overall train time!
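A quick back-of-the-envelope check of both claims. The file count comes from the split sizes reported in this notebook; the assumption that train time scales with the number of examples seen per epoch is mine, and `work_per_epoch` is just an illustrative helper.

```python
# Train + validation sizes of one fold, as reported in this notebook.
n_files = 7550 + 1923

def work_per_epoch(k, n):
    """Examples processed in one epoch of every fold:
    k folds, each training on (k - 1) / k of the n files."""
    return k * n * (k - 1) / k

# Approximate validation set size for each choice of k.
val_size_5 = n_files / 5
val_size_10 = n_files / 10
print(val_size_5, val_size_10)

# How much more work 10 folds take compared to 5 folds.
ratio = work_per_epoch(10, n_files) / work_per_epoch(5, n_files)
print(ratio)  # about 2.25
```

So with 5 folds the validation set is twice as big, and the total work drops by a bit more than half because each fold also trains on 80% rather than 90% of the data.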

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold


sk = StratifiedKFold(n_splits=5, shuffle=True)

In [2]:
df_train = pd.read_csv('data/train.csv')

In [3]:
df_train.head()

Unnamed: 0,fname,label,manually_verified
0,00044347.wav,Hi-hat,0
1,001ca53d.wav,Saxophone,1
2,002d256b.wav,Trumpet,0
3,0033e230.wav,Glockenspiel,1
4,00353774.wav,Cello,1


We could split only on the label, but we can do better! It seems that whether a sound has been manually verified is also important, so let's take this information into consideration as well.

In [4]:
df_train['modified_label'] = df_train.label + '_' + df_train.manually_verified.astype('str')

In [5]:
df_train.head()

Unnamed: 0,fname,label,manually_verified,modified_label
0,00044347.wav,Hi-hat,0,Hi-hat_0
1,001ca53d.wav,Saxophone,1,Saxophone_1
2,002d256b.wav,Trumpet,0,Trumpet_0
3,0033e230.wav,Glockenspiel,1,Glockenspiel_1
4,00353774.wav,Cello,1,Cello_1


In [6]:
splits = list(sk.split(np.zeros(df_train.shape[0]), df_train.modified_label))



Let's double-check that everything went OK.

In [7]:
len(splits)

5

In [8]:
len(splits[0][0]), len(splits[0][1])

(7550, 1923)
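Beyond the sizes, two more properties are worth checking: every file should land in exactly one validation fold, and train and validation indices should never overlap within a fold. A small sketch on toy labels (the data and variable names here are made up; the same checks could be run against the real `splits`):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 3 classes with 10 examples each.
labels = np.repeat(['a', 'b', 'c'], 10)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
toy_splits = list(skf.split(np.zeros(len(labels)), labels))

# Every row should appear in exactly one validation fold.
val_indices = np.concatenate([val for _, val in toy_splits])
covers_all = np.array_equal(np.sort(val_indices), np.arange(len(labels)))

# Train and validation indices should be disjoint within each fold.
no_overlap = all(len(np.intersect1d(tr, va)) == 0 for tr, va in toy_splits)

print(covers_all, no_overlap)
```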

In [9]:
mask = df_train.index.isin(splits[0][0])
df_train[mask].modified_label.value_counts()

Saxophone_1             204
Violin_or_fiddle_1      200
Applause_0              191
Tearing_0               190
Bass_drum_0             186
                       ... 
Finger_snapping_0        32
Scissors_0               28
Glockenspiel_0           19
Telephone_0               6
Gunshot_or_gunfire_0      1
Name: modified_label, Length: 82, dtype: int64

In [10]:
mask = df_train.index.isin(splits[1][0])
df_train[mask].modified_label.value_counts()

Saxophone_1             205
Violin_or_fiddle_1      200
Applause_0              191
Tearing_0               190
Shatter_0               186
                       ... 
Finger_snapping_0        32
Scissors_0               29
Glockenspiel_0           19
Telephone_0               6
Gunshot_or_gunfire_0      1
Name: modified_label, Length: 82, dtype: int64

Looks quite good to me! Let's save the results so that we can use them across notebooks.

In [11]:
df_train.drop(columns='modified_label', inplace=True)

In [12]:
for i, split in enumerate(splits):
    df_train[f'train_{i}'] = df_train.index.isin(split[0])
    df_train[f'val_{i}'] = df_train.index.isin(split[1])

In [14]:
df_train.head()

Unnamed: 0,fname,label,manually_verified,train_0,val_0,train_1,val_1,train_2,val_2,train_3,val_3,train_4,val_4
0,00044347.wav,Hi-hat,0,True,False,True,False,True,False,True,False,False,True
1,001ca53d.wav,Saxophone,1,True,False,True,False,False,True,True,False,True,False
2,002d256b.wav,Trumpet,0,True,False,True,False,False,True,True,False,True,False
3,0033e230.wav,Glockenspiel,1,True,False,True,False,True,False,True,False,False,True
4,00353774.wav,Cello,1,True,False,True,False,True,False,True,False,False,True


In [16]:
df_train.to_csv('data/train_with_splits.csv', index=False)

In [17]:
pd.to_pickle(splits, 'data/splits.pkl')
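In another notebook, picking out the rows for a given fold could then look roughly like the sketch below. To keep it self-contained, it round-trips a tiny made-up DataFrame through an in-memory CSV instead of reading `data/train_with_splits.csv`; the file names and labels in `toy` are hypothetical.

```python
import io
import pandas as pd

# Tiny stand-in for train_with_splits.csv (hypothetical rows).
toy = pd.DataFrame({
    'fname': ['a.wav', 'b.wav', 'c.wav', 'd.wav'],
    'label': ['Cello', 'Cello', 'Trumpet', 'Trumpet'],
    'train_0': [True, False, True, False],
    'val_0': [False, True, False, True],
})

# Write and read back as CSV, the same way the real file would be used.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
df = pd.read_csv(buffer)

# The boolean columns can be used directly as masks.
fold = 0
train_df = df[df[f'train_{fold}']]
val_df = df[df[f'val_{fold}']]

print(len(train_df), len(val_df))
```

The pickled `splits` hold the same information as index arrays, which is handy when working with NumPy arrays rather than DataFrames.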