# This tutorial is divided into 3 parts:

Train and test split in Python.

k-fold cross-validation split in Python.

k-fold cross-validation with the scikit-learn API.


# 1. Train and Test Split

The train and test split involves separating a dataset into two parts:
    

Training Dataset.

Test Dataset.

The training dataset is used by the machine learning algorithm to train the model.
The test dataset is held back and is used to evaluate the performance of the model.



# We can implement the train and test split of a dataset in a single function.

Below is a function named train_test_split() that splits a dataset into training and test sets. It accepts two arguments: the dataset to split, as a list of lists, and an optional split fraction.

A default split of 0.6 (60%) is used. This assigns 60% of the dataset to the training dataset and leaves the remaining 40% for the test dataset. A 60/40 train/test split is a reasonable default, although the best ratio depends on how much data is available.

The function first calculates how many rows the training set requires from the provided dataset. A copy of the original dataset is made. Random rows are selected and removed from the copied dataset and added to the train dataset until the train dataset contains the target number of rows.

The rows that remain in the copy of the dataset are then returned as the test dataset.

The randrange() function from the random module is used to generate a random integer from 0 up to, but not including, the current size of the list.

In [2]:

from random import seed
from random import randrange
 
# Split a dataset into a train and test set
def train_test_split(dataset, split=0.60):
	train = list()
	train_size = split * len(dataset)
	dataset_copy = list(dataset)
	while len(train) < train_size:
		index = randrange(len(dataset_copy))
		train.append(dataset_copy.pop(index))
	return train, dataset_copy
 
# test train/test split
seed(2) # Change the seed value to get a different split

dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
train, test = train_test_split(dataset)
print(train)
print(test)

[[1], [3], [4], [6], [5], [8]]
[[2], [7], [9], [10]]


In [3]:
seed(12) # Change the seed value to get a different split

dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
train, test = train_test_split(dataset)
print(train)
print(test)

[[8], [5], [7], [2], [6], [1]]
[[3], [4], [9], [10]]
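
The same idea can also be sketched by shuffling a copy of the dataset and cutting it at the split point, instead of popping one random row at a time. This is only a minimal sketch: the name shuffle_split and the seed_value parameter are hypothetical, and round() is an assumption about how to handle a split that does not produce a whole number of rows.

```python
from random import Random

def shuffle_split(dataset, split=0.60, seed_value=2):
    """Hypothetical helper: shuffle a copy, then cut it at the split point."""
    rows = list(dataset)               # copy so the caller's list is untouched
    Random(seed_value).shuffle(rows)   # local generator; the global seed is not changed
    cut = round(split * len(rows))     # number of rows assigned to the training set
    return rows[:cut], rows[cut:]

dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
train, test = shuffle_split(dataset)
print(len(train), len(test))
```

Every row ends up in exactly one of the two lists, so the train and test sets never overlap.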


# 2. k-fold Cross Validation Split

A limitation of using the train and test split method is that you get a noisy estimate of algorithm performance.

The k-fold cross validation method (also called just cross validation) is a resampling method that provides a more accurate estimate of algorithm performance.

It does this by first splitting the data into k groups. The algorithm is then trained and evaluated k times and the performance summarized by taking the mean performance score. Each group of data is called a fold, hence the name k-fold cross-validation.

It works by first training the algorithm on k-1 groups of the data and evaluating it on the kth hold-out group as the test set. This is repeated so that each of the k groups is given an opportunity to be held out and used as the test set.

As such, the number of rows in your training dataset should be divisible by k, to ensure each of the k groups has the same number of rows.
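
The hold-out loop described above can be sketched as follows. This is an illustrative sketch, not a complete evaluation harness: the function evaluate_k_fold is hypothetical, the "model" simply predicts the mean of the training values, and mean absolute error is an assumed scoring choice.

```python
from statistics import mean

def evaluate_k_fold(folds):
    """Hypothetical sketch: hold out each fold once and score a toy model."""
    scores = []
    for i, test_fold in enumerate(folds):
        # Training set = rows from every fold except the held-out one
        train_rows = [row for j, fold in enumerate(folds) if j != i for row in fold]
        # Toy "model": always predict the mean of the training values
        prediction = mean(row[0] for row in train_rows)
        # Score = mean absolute error on the held-out fold
        error = mean(abs(row[0] - prediction) for row in test_fold)
        scores.append(error)
    return scores

folds = [[[1], [2]], [[3], [4]], [[5], [6]]]
scores = evaluate_k_fold(folds)
print(scores)        # one score per held-out fold
print(mean(scores))  # summary of performance across the k runs
```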

You should choose a value for k that splits the data into groups with enough rows that each group is still representative of the original dataset. A good default to use is k=3 for a small dataset or k=10 for a larger dataset. A quick way to check if the fold sizes are representative is to calculate summary statistics such as mean and standard deviation and see how much the values differ from the same statistics on the whole dataset.
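
As a rough illustration of that check, the sketch below compares the mean and standard deviation of each fold with those of the whole dataset. The two folds here are deliberately built without shuffling (an assumption made for the example), so their statistics differ noticeably from the full dataset.

```python
from statistics import mean, stdev

dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
# Two unshuffled folds, chosen to show what a poor match looks like
folds = [dataset[0:5], dataset[5:10]]

values = [row[0] for row in dataset]
print('all:    mean=%.2f stdev=%.2f' % (mean(values), stdev(values)))

fold_means = []
for i, fold in enumerate(folds):
    fold_values = [row[0] for row in fold]
    fold_means.append(mean(fold_values))
    print('fold %d: mean=%.2f stdev=%.2f' % (i, mean(fold_values), stdev(fold_values)))
```

Fold statistics that sit far from the whole-dataset values suggest the folds are too small or were not drawn randomly.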

We can reuse what we learned in the previous section in creating a train and test split here in implementing k-fold cross validation.

Instead of two groups, we must return k-folds or k groups of data.

Below is a function named cross_validation_split() that implements the cross validation split of data.

As before, we create a copy of the dataset from which to draw randomly chosen rows.

We calculate the size of each fold as the size of the dataset divided by the number of folds required.

# fold size = total rows / total folds

If the dataset does not cleanly divide by the number of folds, 
there may be some remainder rows and they will not be used in the split.

We then create a list of rows with the required size and 
add them to a list of folds which is then returned at the end.

We can test this resampling method on the same small contrived dataset as above. Each row has only a single column value, but we can imagine how this might scale to a standard machine learning dataset.

The complete example is listed below.

As before, we fix the seed for the random number generator to ensure that the same rows are used in the same folds each time the code is executed.

A k value of 4 is used for demonstration purposes. We would expect that the 10 rows divided into 4 folds will result in 2 rows per fold, with a remainder of 2 that will not be used in the split.
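
The arithmetic can be checked quickly with divmod(), which returns the fold size and the number of leftover rows together. The variable names below are illustrative only.

```python
n_rows = 10

# Fold size and unused remainder for the k used in this example
fold_size, unused = divmod(n_rows, 4)
print('k=4: fold size %d, unused rows %d' % (fold_size, unused))

# Values of k that divide the 10 rows evenly leave no rows unused
even_ks = [k for k in range(2, n_rows + 1) if n_rows % k == 0]
print(even_ks)
```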

In [7]:

from random import seed
from random import randrange
 
# Split a dataset into k folds
def cross_validation_split(dataset, folds):
	dataset_split = list()
	dataset_copy = list(dataset)
	fold_size = len(dataset) // folds  # integer division; remainder rows are dropped
	for i in range(folds):
		fold = list()
		while len(fold) < fold_size:
			index = randrange(len(dataset_copy))
			fold.append(dataset_copy.pop(index))
		dataset_split.append(fold)
	return dataset_split
 
# test cross validation split
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
folds = cross_validation_split(dataset, 4)
print(folds)

[[[3], [2]], [[7], [1]], [[8], [9]], [[10], [6]]]


In [8]:
seed(76)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
folds = cross_validation_split(dataset, 2)
print(folds)

[[[6], [9], [8], [2], [4]], [[1], [10], [3], [7], [5]]]


# 3. Cross-Validation API

We do not have to implement k-fold cross-validation manually.

The scikit-learn library provides an implementation that splits a given data sample into folds.

The KFold() scikit-learn class can be used.

It takes as arguments the number of splits (n_splits), whether or not to shuffle the sample (shuffle), and the seed for the pseudorandom number generator used prior to the shuffle (random_state).

For example, we can create an instance that splits a dataset into 3 folds, shuffles prior to the split, and uses a value of 1 for the pseudorandom number generator.

kfold = KFold(n_splits=3, shuffle=True, random_state=1)

The split() method can then be called on the KFold instance, with the data sample provided as an argument. Iterating over the result yields each pair of train and test sets in turn. Specifically, arrays are returned containing the indexes into the original data sample of the observations to use for the train and test sets on each iteration.

For example, we can enumerate the splits of the indices for a data sample using the created KFold instance as follows:


In [16]:


# scikit-learn k-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold
# data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])


In [17]:
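# shuffle and random_state must be passed by keyword in recent scikit-learn versions
kfold = KFold(n_splits=3, shuffle=True, random_state=87)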
# enumerate splits
for train, test in kfold.split(data):
	print('train: %s, test: %s' % (data[train], data[test]))

train: [0.3 0.4 0.5 0.6], test: [0.1 0.2]
train: [0.1 0.2 0.4 0.6], test: [0.3 0.5]
train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]


In [19]:
kfold = KFold(n_splits=3, shuffle=True, random_state=54)    # changing the seed value generates new sets
# enumerate splits
for train, test in kfold.split(data):
	print('train: %s, test: %s' % (data[train], data[test]))

train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
train: [0.1 0.3 0.5 0.6], test: [0.2 0.4]
train: [0.1 0.2 0.4 0.5], test: [0.3 0.6]
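
The examples above print the data values selected by each split. To make it clearer that split() itself yields index arrays, the sketch below prints the indices directly and then checks that every observation is held out exactly once across the folds. The variable names test_indices and all_test are illustrative, and the seed of 1 is an arbitrary choice.

```python
from numpy import array, concatenate
from sklearn.model_selection import KFold

data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

test_indices = []
for train_idx, test_idx in kfold.split(data):
    # Each iteration yields two arrays of row indices, not the rows themselves
    print('train idx: %s, test idx: %s' % (train_idx, test_idx))
    test_indices.append(test_idx)

# Across all folds, each index appears in a test set exactly once
all_test = sorted(concatenate(test_indices).tolist())
print(all_test)
```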
