
# Cross-validate it!

### Hey!

So, how can we assess whether a prediction model we built will perform well beyond the data it was trained on? How can we check whether the model generalizes? Right, we have to test (or validate) it. Once the model is built, one part of the dataset must be used for training, and another, completely unseen part for testing. If we use the same data for both, there is a chance the model will run into overfitting issues. In statistics, several techniques have been developed for splitting a dataset. The two simplest ones:

- Splitting into train/test with a specific ratio
- Splitting with k-fold cross-validation

Let's get started with the simplest one.

### 1. Splitting into train/test

In machine learning practice, people usually reach for the `train_test_split` function from the scikit-learn library. However, it is straightforward enough that we can write it from scratch. In very simple terms, we can treat our dataset as a Python list; the implementation then becomes very intuitive. For splitting the dataset:

*Splitting into train/test*

```python
import random

def split_train_test(data, ratio=0.8):
    train_ = []
    test_ = []

    total_len = ratio * len(data)
    data_copy = list(data)  # copy, so the caller's dataset is not modified

    # Randomly move items into the train part until the ratio is reached
    while len(train_) < total_len:
        train_.append(data_copy.pop(random.randrange(len(data_copy))))

    # Whatever remains becomes the test part
    test_ = data_copy

    return train_, test_

if __name__ == '__main__':
    random.seed(1)

    dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    ratio = 0.6

    data_train, data_test = split_train_test(dataset, ratio)

    print(data_train)
    print(data_test)
```

Output is:

```
[3, 2, 7, 1, 8, 9]  # Train data
[4, 5, 6, 10]       # Test data
```

That's it. We can now take `data_train` for training the model and use `data_test` to check the model's performance on a previously unseen part of the dataset.
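As a quick sketch of how the split is consumed downstream, the toy "model" below simply memorizes the mean of the training part and is scored with mean squared error on the test part. The mean predictor and the MSE score are hypothetical placeholders standing in for a real model, not part of the recipe above:

```python
import random

def split_train_test(data, ratio=0.8):
    train_ = []
    total_len = ratio * len(data)
    data_copy = list(data)  # copy, so the caller's list stays intact
    while len(train_) < total_len:
        train_.append(data_copy.pop(random.randrange(len(data_copy))))
    return train_, data_copy  # leftovers become the test part

random.seed(1)
train, test = split_train_test([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ratio=0.6)

model = sum(train) / len(train)                        # "fit": just the train mean
mse = sum((x - model) ** 2 for x in test) / len(test)  # "score": MSE on unseen data
print(mse)
```

Any real model would slot into the same two steps: fit on `train`, score on `test`.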

### 2. Splitting with k-fold cross-validation

Another technique, the gold standard in papers, is the so-called "K-Fold Cross-Validation". The number K is commonly chosen as either 5 or 10, guided by the size of the dataset: for smaller datasets, larger values of K tend to work better. For research purposes, one can also try several values of K and pick one for which the folds are reasonably balanced. The main point is that, instead of committing to a single train/test split, testing the model on different parts of the same dataset gives a more reliable estimate of how well it generalizes; see the figure below.

*Figure: K-fold Cross-Validation*

In the ML field, scikit-learn offers a ready-made `KFold` helper. As before, though, we are more interested in a pythonic, from-scratch implementation:

```python
import random

def split_kfold(data, k):
    fold_size = int(len(data) / k)
    data_copy = list(data)  # copy, so the caller's dataset is not modified

    all_folds = []

    # Build k folds by randomly drawing items without replacement
    for i in range(k):
        fold = []
        while len(fold) < fold_size:
            fold.append(data_copy.pop(random.randrange(len(data_copy))))

        all_folds.append(fold)

    return all_folds

if __name__ == '__main__':
    random.seed(1)

    dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    knum = 5

    my_fold = split_kfold(dataset, knum)

    print(my_fold)
```

Output is:

```
[[3, 2], [7, 1], [8, 9], [10, 6], [5, 4]]  # All 5 folds together
```

As you can see, the output is a list of lists, one inner list per fold. These folds can then be used for training and testing.
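To show how the folds drive the validation loop, here is a minimal sketch: each fold takes one turn as the test set while the remaining k-1 folds form the training set, and the per-round scores are averaged. As in the earlier sketch, the mean predictor and MSE score are hypothetical stand-ins for a real model:

```python
import random

def split_kfold(data, k):
    fold_size = int(len(data) / k)
    data_copy = list(data)  # copy, so the caller's dataset is not modified
    all_folds = []
    for _ in range(k):
        fold = []
        while len(fold) < fold_size:
            fold.append(data_copy.pop(random.randrange(len(data_copy))))
        all_folds.append(fold)
    return all_folds

def cross_validate(data, k):
    folds = split_kfold(data, k)
    scores = []
    for i in range(k):
        test = folds[i]                                       # fold i is held out
        train = [x for j, fold in enumerate(folds)            # the other k-1 folds
                 if j != i for x in fold]
        model = sum(train) / len(train)                       # "fit" the toy model
        mse = sum((x - model) ** 2 for x in test) / len(test)  # score on held-out fold
        scores.append(mse)
    return sum(scores) / len(scores)  # average score over all k rounds

random.seed(1)
avg_mse = cross_validate([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], k=5)
print(avg_mse)
```

Every sample gets used for testing exactly once, which is what makes the averaged score a steadier estimate than a single split.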

Note:
The special case where k equals the length of the dataset is called the leave-one-out cross-validation technique. It is accurate and powerful; however, it takes more time and computational resources on larger datasets.
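A minimal sketch of that special case, simply reusing `split_kfold` from above with k set to the dataset length, so every fold holds exactly one sample:

```python
import random

def split_kfold(data, k):
    fold_size = int(len(data) / k)
    data_copy = list(data)  # copy, so the caller's dataset is not modified
    all_folds = []
    for _ in range(k):
        fold = []
        while len(fold) < fold_size:
            fold.append(data_copy.pop(random.randrange(len(data_copy))))
        all_folds.append(fold)
    return all_folds

random.seed(1)
dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Leave-one-out: k equals the dataset size
loo_folds = split_kfold(dataset, k=len(dataset))

print(len(loo_folds))                        # 10 folds
print(all(len(f) == 1 for f in loo_folds))   # True: one sample per fold
```

The model is then trained and scored len(dataset) times, which is why the cost grows quickly with dataset size.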
