
⬅️

Cross-validate it!

Hey!

So, how can we assess whether the prediction model we built performs well beyond the data it has seen? How can we check whether the model generalizes enough? Yeah, right, we have to test it (or validate it). Before the model is built, one part of the dataset must be set aside for training and the other for testing, so that the test part remains completely unseen. If we use the same data for testing, there is a chance the model will run into overfitting issues. So, in statistics, several techniques have been developed for splitting the dataset. The two simplest ones:

  • Splitting into train/test with specific ratio
  • Splitting with k-fold cross-validation

Let's get started with the simplest one.

1. Splitting into train/test

In machine learning practice, people usually go with the train_test_split function from the scikit-learn library. However, it is straightforward enough that we can write it from scratch. In very simple terms, we can treat our dataset as a Python list. In that case, the implementation becomes very intuitive. For splitting the dataset:

[Figure: Splitting into train/test]

import random

def split_train_test(data, ratio=0.8):
    # Randomly split `data` into train/test parts with the given train ratio
    train_ = []

    total_len = ratio * len(data)
    data_copy = list(data)  # work on a copy so the original dataset stays untouched

    # Randomly move items into the training set until it reaches the desired size
    while len(train_) < total_len:
        train_.append(data_copy.pop(random.randrange(len(data_copy))))

    # Whatever remains in the copy becomes the test set
    test_ = data_copy

    return train_, test_

if __name__ == '__main__':
    random.seed(1)

    dataset = [1,2,3,4,5,6,7,8,9,10]
    ratio = 0.6

    data_train, data_test = split_train_test(dataset,ratio)

    print(data_train)
    print(data_test)

Output is:

[3, 2, 7, 1, 8, 9]  # Train data
[4, 5, 6, 10]       # Test data

That's it. From the output, we can take data_train for training the model and use data_test to check the model's performance on the previously unseen part of the dataset.
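For comparison, here is a minimal sketch of the same split done with scikit-learn's train_test_split. Note that the resulting split will differ from the output above, since scikit-learn shuffles with its own random number generator:

from sklearn.model_selection import train_test_split

dataset = [1,2,3,4,5,6,7,8,9,10]

# train_size=0.6 matches the ratio used above; random_state fixes the shuffle
data_train, data_test = train_test_split(dataset, train_size=0.6, random_state=1)

print(data_train)
print(data_test)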

2. Splitting with k-fold cross-validation

Another technique (the gold standard in papers) is the so-called "K-Fold Cross-Validation". The number K is commonly chosen as either 5 or 10, and it depends on the size of the dataset. For smaller datasets, a larger K is usually better. For research purposes, one can also try various values of K and pick one so that the folds cover the dataset in a more or less balanced way. The main point is that, instead of settling for a single train/test ratio, testing the model on different parts of the same dataset ensures that the test prediction is more generalized; see the figure below.

[Figure: K-fold Cross-Validation]

In the ML field, there is a handy KFold class in scikit-learn. However, as before, we are more interested in a from-scratch Python explanation:

import random

def split_kfold(data, k):
    # Randomly split `data` into k folds of equal size
    fold_size = int(len(data) / k)
    data_copy = list(data)  # work on a copy so the original dataset stays untouched

    all_folds = []

    for i in range(k):
        fold = []
        # Randomly move items into the current fold until it reaches fold_size
        while len(fold) < fold_size:
            fold.append(data_copy.pop(random.randrange(len(data_copy))))

        all_folds.append(fold)

    # Note: if len(data) is not divisible by k, the leftover items are dropped
    return all_folds

if __name__ == '__main__':
    random.seed(1)

    dataset = [1,2,3,4,5,6,7,8,9,10]

    knum = 5

    my_fold = split_kfold(dataset,knum)

    print(my_fold)

Output is:

[[3, 2], [7, 1], [8, 9], [10, 6], [5, 4]]  # All 5-folds together

As you can see, the output is a list of lists, one per fold. These folds can then be used for training and testing, as sketched below.
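Here is a minimal sketch of how these folds could be consumed: each fold takes a turn as the held-out test set while the remaining folds are merged into the training set (the fold values below are simply the output printed above):

# Folds produced by split_kfold in the example above
my_fold = [[3, 2], [7, 1], [8, 9], [10, 6], [5, 4]]

for i, fold in enumerate(my_fold):
    data_test = fold  # the current fold is the test set
    # all remaining folds are merged into the training set
    data_train = [x for j, other in enumerate(my_fold) if j != i for x in other]
    print("Fold", i, "-> train:", data_train, "test:", data_test)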

Note:
The special case where k equals the length of the dataset is called the leave-one-out cross-validation technique. It is accurate and thorough, but it takes more time and computational resources on larger datasets.
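As a quick sketch (reusing split_kfold and the import from the example above), leave-one-out is simply the same function with k set to the dataset length, so every fold contains exactly one sample:

random.seed(1)

dataset = [1,2,3,4,5,6,7,8,9,10]

# k equals the dataset length -> ten folds with a single element each
loo_folds = split_kfold(dataset, k=len(dataset))

print(loo_folds)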
