One seemingly minor issue in cross-validation is how to choose the random split. The worry is that, under a given split, the underlying distribution (of latent factors) of the training samples might not be uniform. This is particularly prominent when the dataset is small. For larger datasets the effect may be smaller, but in more demanding situations such as contests, where 0.01% can make a difference, the impact is still visible.

Of course, one can pick a lucky number instead of the default as the random seed, or average over a group of random seeds. But for complicated models such as neural nets or gradient boosting trees, where each CV run can take hours or longer, this is still costly and amounts to a blind search. I found a simple trick to quickly choose one or several decent splits.

The intuition is that, even though baseline models as simple as linear regression do not perform as well as more complicated models, they usually capture a large part of the structure of the samples. Therefore we use them like this:
1. Try a large number of random splits on a simple model;

2. For each split, we record the worst score among all the validation sets, i.e. across the different folds;

3. At the end, we pick the random splits with the best records.

So this is a min-max procedure: in terms of validation error, we pick the splits that minimize the maximum error across folds.
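
The three steps above can be sketched compactly with `cross_val_score` on a synthetic dataset. This is only an illustration under stated assumptions: the helper name `worst_fold_score`, the data from `make_regression`, and the range of 50 seeds are all made up for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; any regression dataset would do)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

def worst_fold_score(seed, n_folds=5):
    """Worst (lowest) negative-MSE score across the folds of one shuffled split."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = cross_val_score(LinearRegression(), X, y,
                             scoring='neg_mean_squared_error', cv=kf)
    return scores.min()

# Steps 1 and 2: try many seeds and record the worst fold score of each
records = [(seed, worst_fold_score(seed)) for seed in range(50)]

# Step 3: keep the seeds whose worst fold score is highest
records.sort(key=lambda r: r[1], reverse=True)
best_seeds = [seed for seed, _ in records[:5]]
print(best_seeds)
```

Because the scores are negative mean squared errors, "highest worst score" is the same as "smallest maximum error".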

My experiments on different problems show that the results have the following nice properties:
* Since the selected random splits have the best worst-case validation scores, the variation of their validation scores across folds tends to be small;

* Even better, performing the same search procedure on different models, including complicated models such as gradient boosting trees, we get the same candidates most of the time. This indicates that these chosen random splits are, to some extent, universally good.

Of course this cannot be guaranteed to work well for all problems. But at least it provides a simple heuristic for a quantitative selection of CV splits.
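
One way to check the two properties above on a given problem is sketched below. It is a rough check on synthetic data, not a reproduction of my experiments: the dataset, the 30 seeds, the small `GradientBoostingRegressor`, and the helper names `fold_scores` and `top_seeds` are assumptions chosen to keep the run short. The result may or may not show the properties on this toy data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
seeds = range(30)
top_k = 5

def fold_scores(model, seed, n_folds=5):
    """Per-fold negative-MSE scores for one shuffled split."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=kf)

def top_seeds(model):
    """Seeds ranked by worst fold score, best first, truncated to top_k."""
    worst = {s: fold_scores(model, s).min() for s in seeds}
    return sorted(worst, key=worst.get, reverse=True)[:top_k]

lr_top = top_seeds(LinearRegression())
gb_top = top_seeds(GradientBoostingRegressor(n_estimators=20, random_state=0))

# Property 1: fold-to-fold spread for the selected seeds vs. all seeds
lr = LinearRegression()
spread_selected = np.mean([fold_scores(lr, s).std() for s in lr_top])
spread_all = np.mean([fold_scores(lr, s).std() for s in seeds])

# Property 2: seeds that both models select
overlap = set(lr_top) & set(gb_top)
print(f'spread (selected) = {spread_selected:.1f}, spread (all) = {spread_all:.1f}')
print(f'overlap = {sorted(overlap)}')
```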

In [1]:
import numpy as np
import pandas as pd
import warnings; warnings.simplefilter('ignore')
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

In [4]:
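def find_best_k_split(esti, k, folds, rand_start, rand_end, params, X, y):
    """Return the k seeds in [rand_start, rand_end) with the best worst-fold score, plus all records."""
    fold_rec = []
    for i in range(rand_start, rand_end):
        kf = KFold(n_splits=folds, random_state=i, shuffle=True)
        # refit=False: only the per-fold validation scores are needed
        gs = GridSearchCV(esti, params, scoring='neg_mean_squared_error', cv=kf,
                          return_train_score=False, verbose=2, refit=False)
        gs.fit(X, y)
        # Index 0 is the first parameter combination; with an empty grid it is the only one
        worst_split = min(gs.cv_results_[f'split{j}_test_score'][0] for j in range(folds))
        fold_rec.append((i, worst_split))
    # Scores are negative MSE, so higher is better: sort by worst-fold score, best first
    fold_rec.sort(key=lambda x: -x[1])
    rands = [x[0] for x in fold_rec[:k]]
    return rands, fold_rec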

In [None]:
train = pd.read_csv('train.csv', index_col = 0)
cols = list(train.columns)
cols.remove('TARGET')
tar = 'TARGET'

In [None]:
lr = LinearRegression()
gsparam = {}
k = 20
folds = 5
rand_start = 0
rand_end = 1000
rand, all_rec = find_best_k_split(lr, k, folds, rand_start, rand_end, gsparam, train[cols], train[tar])
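
Once a few seeds have been selected, they can be reused for the expensive model, for example by averaging its CV score over the chosen splits. The sketch below is self-contained and therefore uses assumptions throughout: synthetic data, a small `GradientBoostingRegressor`, and a placeholder list `selected_seeds` standing in for the seeds returned by `find_best_k_split`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Placeholder seeds (an assumption); in practice use the seeds found by the search
selected_seeds = [3, 7, 11]

model = GradientBoostingRegressor(n_estimators=50, random_state=0)
mean_scores = []
for seed in selected_seeds:
    # Same KFold settings as in the search, so the splits are identical
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=kf)
    mean_scores.append(scores.mean())

# Average the per-split CV scores into a single estimate
cv_estimate = float(np.mean(mean_scores))
print(f'CV estimate averaged over {len(selected_seeds)} splits: {cv_estimate:.2f}')
```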