# Implement Resampling Methods From Scratch in Python

## Description 
The goal of resampling methods is to make the best use of your training data in order to accurately estimate the performance of a model on new unseen data.

Accurate estimates of performance can then be used to help you choose which set of model parameters to use or which model to select.

Once you have chosen a model, you can train a final model on the entire training dataset and start using it to make predictions.

There are two common resampling methods that you can use:

* A train and test split of your data.
* k-fold cross validation.

In this notebook, we will implement both resampling methods from scratch. [Source](https://machinelearningmastery.com/implement-resampling-methods-scratch-python/)

In [1]:
# Imports 
from random import randrange
from random import seed 

## Train and Test Split 
First, we will create a function to split the dataset into an 80% train and 20% test set. 

In [2]:
# Split a dataset into a train and test set
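def train_test_split(df, split=0.8):
    train = list()
    # Target number of training rows (may be fractional)
    train_size = split * len(df)
    # Copy the input so the caller's list is left unchanged
    dataset_copy = list(df)
    while len(train) < train_size:
        # Move one randomly chosen row from the copy into train
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    # Rows never drawn form the test set
    return train, dataset_copy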

In [3]:
# Test train/test split 
seed(1)
dataset = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]]
train, test = train_test_split(dataset)
print(train)
print(test)

[[3], [2], [7], [1], [8], [9], [10], [6]]
[[4], [5]]
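
As a quick check of how `split` maps to row counts, the sketch below runs the same splitting logic with a few illustrative `split` values (0.5 and 0.75 are assumptions chosen for this example, not values from the source). Because the loop keeps drawing rows while `len(train) < split * len(df)`, a fractional target such as 7.5 rounds up to 8 training rows.

```python
from random import randrange, seed


def train_test_split(df, split=0.8):
    # Same logic as the function above: draw random rows until train is full
    train = list()
    train_size = split * len(df)
    dataset_copy = list(df)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy


seed(1)
dataset = [[value] for value in range(1, 11)]

# Illustrative split values (assumptions chosen for this sketch)
sizes = {}
for split in (0.5, 0.75, 0.8):
    train, test = train_test_split(dataset, split)
    # Every row lands in exactly one of the two lists
    assert sorted(train + test) == dataset
    sizes[split] = (len(train), len(test))

print(sizes)  # {0.5: (5, 5), 0.75: (8, 2), 0.8: (8, 2)}
```

The row counts are fixed by `split` and the dataset length; only which rows end up on each side depends on the seed.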


## K-fold Cross Validation Split 
A limitation of the train and test split method is that the performance estimate is noisy: it depends on which rows happen to land in the test set.
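
One rough way to see this noise is to repeat the split with different seeds and score the same model each time. The sketch below is only an illustration: `holdout_error` is a hypothetical stand-in for a real model (it predicts the mean of the training values and reports mean absolute error on the test rows), and the five seeds are arbitrary.

```python
from random import randrange, seed


def train_test_split(df, split=0.8):
    # Train/test split as described in the previous section
    train = list()
    train_size = split * len(df)
    dataset_copy = list(df)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy


def holdout_error(train, test):
    # Hypothetical baseline "model": always predict the training mean,
    # then measure mean absolute error on the held-out rows
    prediction = sum(row[0] for row in train) / len(train)
    return sum(abs(row[0] - prediction) for row in test) / len(test)


dataset = [[value] for value in range(1, 11)]
scores = []
for s in range(5):  # arbitrary seeds, one split per seed
    seed(s)
    train, test = train_test_split(dataset)
    scores.append(holdout_error(train, test))

print(scores)
```

The scores will usually differ from seed to seed even though the data and the baseline are unchanged, which is the variability that k-fold cross validation aims to average out.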

The k-fold cross validation method (also called just cross validation) is a resampling method that typically provides a more reliable estimate of algorithm performance.

It does this by first splitting the data into k groups. The algorithm is then trained and evaluated k times, each time holding out a different group as the test set and training on the remaining k-1 groups, and the performance is summarized by taking the mean performance score. Each group of data is called a fold, hence the name k-fold cross-validation.

In [6]:
# Split a dataset into k folds
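def cross_validation_split(df, folds=3):
    dataset_split = list()
    # Work on a copy of the argument so the caller's list is not modified
    dataset_copy = list(df)
    # Whole rows per fold; any remainder rows are left out of every fold
    fold_size = int(len(df) / folds)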
    for i in range(folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

In [7]:
# test cross validation split
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
folds = cross_validation_split(dataset, 4)
print(folds)

[[[3], [2]], [[7], [1]], [[8], [9]], [[10], [6]]]
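
The train-and-evaluate loop described at the start of this section could be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `mean_baseline_error` is a hypothetical stand-in for a real algorithm, 5 folds is an arbitrary choice that divides the 10 rows evenly, and the split function is redefined here so that it reads only from its `df` argument.

```python
from random import randrange, seed


def cross_validation_split(df, folds=3):
    # k-fold split as described above, reading only from the df argument
    dataset_split = list()
    dataset_copy = list(df)
    fold_size = int(len(df) / folds)
    for _ in range(folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split


def mean_baseline_error(train, test):
    # Hypothetical baseline "model": always predict the training mean,
    # then measure mean absolute error on the held-out fold
    prediction = sum(row[0] for row in train) / len(train)
    return sum(abs(row[0] - prediction) for row in test) / len(test)


seed(1)
dataset = [[value] for value in range(1, 11)]
folds = cross_validation_split(dataset, 5)

scores = []
for i, test_fold in enumerate(folds):
    # Train on every fold except the one currently held out
    train = [row for j, fold in enumerate(folds) if j != i for row in fold]
    scores.append(mean_baseline_error(train, test_fold))

# Summarize the k evaluations with their mean
mean_score = sum(scores) / len(scores)
print(scores)
print(mean_score)
```

Each row is used for testing exactly once and for training k-1 times, so the mean score draws on every row assigned to a fold. When the dataset size is not a multiple of k, as in the 4-fold example above where 10 rows yield four folds of 2, the leftover rows are not placed in any fold.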
