### Train and Test Split
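The train and test split is the simplest resampling method. It separates a dataset into two parts:

1. Training dataset
2. Test dataset

The training dataset is used by the machine learning algorithm to train the model. The test dataset is held back and used to evaluate the model's performance.
The rows assigned to each dataset are selected at random, without replacement. The `split` argument sets the fraction of rows that go to the training dataset (0.6 by default). When `split * len(dataset)` is not a whole number, the training set is rounded up, which is why a 0.6 split of 12 rows yields 8 training rows and 4 test rows.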

In [1]:
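from random import randrange

# Split a dataset into a training set and a test set
def train_test_split(dataset, split=0.6):
    train = list()
    dataset_copy = list(dataset)  # copy so the caller's list is not modified
    train_size = split * len(dataset)
    # Move randomly chosen rows into train until it reaches the target size
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy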

In [4]:
from random import randrange
from random import seed

def train_test_split(dataset, split=0.6) :
    train = list()
    dataset_copy = list(dataset)
    train_size = split * len(dataset)
    while len(train) < train_size :
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy

dataset = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12]]
seed(1)
train,test = train_test_split(dataset)
print len(train), train
print len(test), test

8 [[2], [11], [9], [4], [6], [7], [8], [10]]
4 [[1], [3], [5], [12]]


In [5]:
train,test = train_test_split(dataset,0.5)
print len(train), train
print len(test), test

6 [[2], [1], [11], [6], [10], [3]]
6 [[4], [5], [7], [8], [9], [12]]
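
To make the evaluation step concrete, here is a minimal sketch of scoring a model on the held-back test set. The "model" is a hypothetical zero-rule baseline that always predicts the most common training label; `zero_rule_fit`, `accuracy`, and the labelled rows are assumptions made for illustration and are not part of the original notebook. The code is written to run on both Python 2 and Python 3.

```python
from random import randrange, seed

def train_test_split(dataset, split=0.6):
    # Same logic as the notebook's function: draw rows at random, without replacement
    train = list()
    dataset_copy = list(dataset)
    train_size = split * len(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy

def zero_rule_fit(train):
    # Hypothetical baseline "model": the most common label in the training rows
    labels = [row[-1] for row in train]
    return max(set(labels), key=labels.count)

def accuracy(actual, predicted):
    # Fraction of predictions that match the true labels
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual))

# Made-up rows of the form [feature, label]
dataset = [[1, 0], [2, 0], [3, 0], [4, 1], [5, 0],
           [6, 1], [7, 0], [8, 0], [9, 1], [10, 0]]
seed(1)
train, test = train_test_split(dataset)

# Fit on the training rows only, then score on the rows that were held back
majority_label = zero_rule_fit(train)
predictions = [majority_label for _ in test]
score = accuracy([row[-1] for row in test], predictions)
print("test accuracy: %.2f" % score)
```

The score depends on which rows land in the test set, so a different seed can give a different estimate.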


### k-fold Cross Validation
A limitation of the train and test split is that it gives a noisy estimate of algorithm performance, because the result depends on which rows happen to land in the test set.
k-fold cross validation addresses this by resampling the dataset as follows:
1. Split the dataset into k groups.
2. Train the algorithm on k-1 of the groups.
3. Evaluate the algorithm on the remaining kth group, which acts as the test set.
4. Repeat so that each of the k groups is held out and used as the test set once.

These groups are called folds, and this approach is called k-fold cross validation.

In [6]:
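from random import randrange

# Split a dataset into k folds
def cross_validation_split(dataset, folds=3):
    dataset_copy = list(dataset)  # copy so the caller's list is not modified
    dataset_split = list()
    # Integer fold size; leftover rows are dropped when the division is not exact
    fold_size = int(len(dataset) / folds)
    for i in range(folds):
        fold = list()
        # Draw fold_size rows at random, without replacement
        for j in range(fold_size):
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split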

In [8]:
from random import randrange
from random import seed

def cross_validation_split(dataset, folds=3) :
    dataset_copy = list(dataset)
    dataset_split = list()
    fold_size = int(len(dataset)/folds)
    for i in range(folds) :
        fold = list()
        for j in range(fold_size) :
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

dataset = [[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12]]
seed(1)
folds = cross_validation_split(dataset, 4)
print folds

[[[2], [11], [9]], [[4], [6], [7]], [[8], [10], [1]], [[3], [12], [5]]]


In [9]:
folds = cross_validation_split(dataset, 3)
print folds

[[[10], [1], [6], [9]], [[3], [12], [11], [2]], [[4], [7], [8], [5]]]


In [10]:
folds = cross_validation_split(dataset, 5)
print folds

[[[3], [6]], [[1], [4]], [[8], [9]], [[5], [7]], [[2], [11]]]
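
The five-fold output above contains only 10 of the 12 rows. The sketch below, which reuses the notebook's `cross_validation_split` logic, counts the rows that were assigned to a fold and those that were left out. The names `used` and `unused` are introduced here for illustration. The code is written to run on both Python 2 and Python 3.

```python
from random import randrange, seed

def cross_validation_split(dataset, folds=3):
    dataset_copy = list(dataset)
    dataset_split = list()
    # 12 rows / 5 folds truncates to 2 rows per fold
    fold_size = int(len(dataset) / folds)
    for i in range(folds):
        fold = list()
        for j in range(fold_size):
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

dataset = [[i] for i in range(1, 13)]
seed(1)
folds = cross_validation_split(dataset, 5)

# Flatten the folds, then find the rows that never made it into any fold
used = [row for fold in folds for row in fold]
unused = [row for row in dataset if row not in used]
print("rows used: %d, rows left out: %d" % (len(used), len(unused)))
# prints "rows used: 10, rows left out: 2"
```

Choosing a number of folds that divides the dataset size evenly, as in the 3-fold and 4-fold examples, avoids discarding rows.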


k-fold cross validation is the standard method for estimating the performance of machine learning algorithms on new data.
Its downside is that it can be time consuming to run, because k different models must be trained and evaluated.
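
The cost of training k models is visible in a sketch of the full evaluation loop. This is one possible way to use the folds, not the notebook's own code: `evaluate_k_fold`, `zero_rule_fit`, `accuracy`, and the labelled rows are hypothetical names and data, and the zero-rule baseline stands in for a real learning algorithm. The code is written to run on both Python 2 and Python 3.

```python
from random import randrange, seed

def cross_validation_split(dataset, folds=3):
    dataset_copy = list(dataset)
    dataset_split = list()
    fold_size = int(len(dataset) / folds)
    for i in range(folds):
        fold = list()
        for j in range(fold_size):
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

def zero_rule_fit(train):
    # Hypothetical baseline "model": the most common label in the training rows
    labels = [row[-1] for row in train]
    return max(set(labels), key=labels.count)

def accuracy(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual))

def evaluate_k_fold(dataset, folds=3):
    splits = cross_validation_split(dataset, folds)
    scores = list()
    for i, test in enumerate(splits):
        # Training rows are all rows from the other k-1 folds
        train = [row for j, fold in enumerate(splits) if j != i for row in fold]
        # One model is fitted per fold, which is where the k-fold cost comes from
        majority_label = zero_rule_fit(train)
        predicted = [majority_label for _ in test]
        actual = [row[-1] for row in test]
        scores.append(accuracy(actual, predicted))
    return scores

# Made-up rows of the form [feature, label]
dataset = [[1, 0], [2, 0], [3, 1], [4, 0], [5, 1], [6, 0],
           [7, 0], [8, 1], [9, 0], [10, 0], [11, 1], [12, 0]]
seed(1)
scores = evaluate_k_fold(dataset, 4)
mean_score = sum(scores) / float(len(scores))
print("fold scores: %s" % scores)
print("mean accuracy: %.2f" % mean_score)
```

Averaging the per-fold scores gives a single estimate that is less sensitive to any one split than a single train and test evaluation.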

### Extensions
1. Repeated train and test
2. LOOCV, or leave-one-out cross validation
3. Stratification
3. Stratification
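
The three extensions can be sketched briefly. These are illustrative versions under stated assumptions, not reference implementations: `repeated_train_test`, `loocv_split`, and `stratified_split` are hypothetical names, and the stratified version assumes the class label is the last value in each row and that labels can be sorted. The code is written to run on both Python 2 and Python 3.

```python
from random import randrange, seed, shuffle

def train_test_split(dataset, split=0.6):
    train = list()
    dataset_copy = list(dataset)
    train_size = split * len(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy

def repeated_train_test(dataset, repeats=3, split=0.6):
    # Several independent random splits; scores from each can be averaged
    return [train_test_split(dataset, split) for _ in range(repeats)]

def loocv_split(dataset):
    # Leave-one-out: k equals the number of rows, so every fold holds one row
    return [[row] for row in dataset]

def stratified_split(dataset, folds=3):
    # Group rows by class label (assumed to be the last column)
    by_class = {}
    for row in dataset:
        by_class.setdefault(row[-1], []).append(row)
    dataset_split = [list() for _ in range(folds)]
    position = 0
    for label in sorted(by_class):
        rows = by_class[label]
        shuffle(rows)
        # Deal rows round-robin so each fold gets a similar class mix
        for row in rows:
            dataset_split[position % folds].append(row)
            position += 1
    return dataset_split

# Made-up rows of the form [feature, label], six rows per class
rows = [[i, i % 2] for i in range(12)]
seed(1)
pairs = repeated_train_test(rows, repeats=3, split=0.5)
single_row_folds = loocv_split(rows)
strat_folds = stratified_split(rows, 3)
print("class counts per stratified fold: %s"
      % [[sum(1 for r in f if r[-1] == c) for c in (0, 1)] for f in strat_folds])
```

Leave-one-out trains one model per row, so it is the most expensive of the three on large datasets.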