So, how can we assess whether the prediction model we built performs well beyond the data it was trained on? How can we check whether the model generalizes? Yeah, right, we have to test it (or validate it). Before the model is built, one part of the dataset must be set aside for training, and the other part, completely unseen during training, for testing. If we use the same data for testing, there is a chance the model will run into overfitting issues. So, in statistics, several techniques have been developed for splitting the dataset. The two simplest ones:
- Splitting into train/test with specific ratio
- Splitting with k-fold cross-validation
Let's get started with the simplest one.
1. Splitting into train/test
In machine learning practice, people usually go with the
`train_test_split` function from the scikit-learn library. However, it is straightforward enough to implement from scratch. In very simple terms, we can treat our dataset as a Python list. In that case, the implementation becomes very intuitive. For splitting the dataset:
```python
import random

def split_train_test(data, ratio=0.8):
    train_ = []
    total_len = ratio * len(data)
    # work on a copy so the original dataset is not modified
    data_copy = list(data)
    while len(train_) < total_len:
        # move a random element from the copy into the train part
        train_.append(data_copy.pop(random.randrange(len(data_copy))))
    test_ = data_copy  # whatever remains becomes the test part
    return train_, test_

if __name__ == '__main__':
    random.seed(1)
    dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    ratio = 0.6
    data_train, data_test = split_train_test(dataset, ratio)
    print(data_train)
    print(data_test)
```
```
[3, 2, 7, 1, 8, 9]  # train data
[4, 5, 6, 10]       # test data
```
That's it. From the output, we can take
`data_train` for training the model and use
`data_test` to check the model's performance on the previously unseen part of the dataset.
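To make the idea concrete, here is a hypothetical sketch (the trivial mean predictor is mine, not part of the article) that "trains" on `data_train` and measures its error on `data_test`:

```python
import random

def split_train_test(data, ratio=0.8):
    # same logic as above: randomly move items into the train part
    train_, data_copy = [], list(data)
    while len(train_) < ratio * len(data):
        train_.append(data_copy.pop(random.randrange(len(data_copy))))
    return train_, data_copy

random.seed(1)
train, test = split_train_test([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.6)

# "train" a trivial model: predict the mean of the training data
prediction = sum(train) / len(train)
# "test" it: mean absolute error on the unseen test part
mae = sum(abs(x - prediction) for x in test) / len(test)
print(round(mae, 2))
```

Any real model would slot into the same two steps: fit on the train part, score on the test part.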
2. Splitting with k-fold cross-validation
Another technique, the gold standard in papers, is the so-called "K-Fold Cross-Validation". The number K is commonly chosen as either 5 or 10 and depends on the size of the dataset: for smaller datasets, larger values of K tend to work better. For research purposes, one can also test several values of K and pick one for which the folds are more or less balanced. The main point is that, instead of primitively committing to a single train/test ratio, testing the model on different parts of the same dataset makes the performance estimate more general, see figure below.
In the ML field, there is a magic
`KFold` class from scikit-learn. However, as before, we are more interested in a pythonic, from-scratch implementation:
```python
import random

def split_kfold(data, k):
    fold_size = int(len(data) / k)
    # work on a copy so the original dataset is not modified
    data_copy = list(data)
    all_folds = []
    for i in range(k):
        fold = []
        while len(fold) < fold_size:
            # move a random element from the copy into the current fold
            fold.append(data_copy.pop(random.randrange(len(data_copy))))
        all_folds.append(fold)
    return all_folds

if __name__ == '__main__':
    random.seed(1)
    dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    knum = 5
    my_fold = split_kfold(dataset, knum)
    print(my_fold)
```
```
[[3, 2], [7, 1], [8, 9], [10, 6], [5, 4]]  # all 5 folds together
```
As you can see, the output is a list of lists, one inner list per fold. Each fold can then serve once as the test set while the remaining folds are used for training.
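Building on those folds, a minimal sketch of the full cross-validation loop might look like this (the mean-predictor "model" is a placeholder of mine, not from the article):

```python
import random

def split_kfold(data, k):
    # same logic as above: deal random items into k equal folds
    fold_size = int(len(data) / k)
    data_copy, all_folds = list(data), []
    for _ in range(k):
        fold = []
        while len(fold) < fold_size:
            fold.append(data_copy.pop(random.randrange(len(data_copy))))
        all_folds.append(fold)
    return all_folds

random.seed(1)
folds = split_kfold([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5)

scores = []
for i, test_fold in enumerate(folds):
    # all other folds together form the training set
    train_fold = [x for j, f in enumerate(folds) if j != i for x in f]
    prediction = sum(train_fold) / len(train_fold)  # trivial "model"
    mae = sum(abs(x - prediction) for x in test_fold) / len(test_fold)
    scores.append(mae)

print(round(sum(scores) / len(scores), 2))  # average error over all folds
```

The averaged score over all k rounds is what gets reported, which is exactly why the estimate is less sensitive to one lucky or unlucky split.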
The special case where k equals the length of the dataset is called the leave-one-out cross-validation technique. It is accurate and powerful; however, it takes more time and computational resources on larger datasets.
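A tiny sketch makes the cost visible: in leave-one-out, every element is the single-item test fold exactly once, so the model must be retrained as many times as there are data points (the trivial mean "model" here is a stand-in of mine):

```python
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

rounds = 0
for i, held_out in enumerate(dataset):
    # everything except the held-out element is the training set
    train = dataset[:i] + dataset[i + 1:]
    prediction = sum(train) / len(train)  # "model" retrained each round
    rounds += 1

print(rounds)  # one training round per data point -> 10
```

For 10 items this is cheap, but for a dataset of a million rows it means a million training runs, which is where the computational cost comes from.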