# Resampling Methods

Resampling methods partition a dataset so that a model can be trained on one part and evaluated on another. Typically, the larger portion is used for training and the smaller portion is held out for validation or testing, to check how well the ML model performs on data it has not seen.

### k-fold Cross Validation

k-fold cross validation splits the dataset into k partitions (folds) and performs a train-test split k times, each time holding out a different partition as the test set and training on the remaining k-1 partitions. The model is therefore trained and tested on a different combination of data in each round. Its overall performance is the average of its performance across the k rounds.

For example, with k=3, suppose the dataset is partitioned as follows:
partitions = {A, B, C}

The train-test split is then performed k times, once per partition:
```
k     Train     Test
1     A,B       C
2     B,C       A
3     A,C       B
```
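
The rotation in the table can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notebook: the strings `'A'`, `'B'`, `'C'` are placeholders for real partitions, and `rounds` is a hypothetical name. The rounds come out in a different order than in the table, but each partition still serves as the test set exactly once.

```python
# Placeholder partitions; in practice each would be a list of rows.
partitions = ['A', 'B', 'C']

rounds = []
for i, test_fold in enumerate(partitions):
    # Every partition except the held-out one forms the training set.
    train_folds = partitions[:i] + partitions[i + 1:]
    rounds.append((train_folds, test_fold))

for train_folds, test_fold in rounds:
    print(','.join(train_folds), '|', test_fold)
```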

In [8]:
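from random import randrange

def cross_validation_split(dataset, k_folds=3):
    """Randomly split dataset into k_folds folds of equal size."""
    # Integer fold size; any remainder rows are left out of every fold.
    fold_size = len(dataset) // k_folds
    # Work on a copy so the caller's dataset is not modified.
    dataset_copy = list(dataset)
    dataset_split = []

    for _ in range(k_folds):
        fold = []

        # Move randomly chosen rows out of the copy (sampling without replacement).
        while len(fold) < fold_size:
            random_index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(random_index))

        dataset_split.append(fold)

    return dataset_split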

In [9]:
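from csv import reader

def load_csv(filename):
    """Read a CSV file into a list of rows; every value stays a string."""
    dataset = []

    # newline='' is what the csv module recommends when opening files.
    with open(filename, 'r', newline='') as file:
        csv_reader = reader(file)

        for row in csv_reader:
            # Skip blank lines.
            if not row:
                continue

            dataset.append(row)

    return dataset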

In [10]:
filename = 'data/pima-indians-diabetes.data.csv'
dataset = load_csv(filename)

In [11]:
dataset[:4]

[['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1'],
 ['1', '85', '66', '29', '0', '26.6', '0.351', '31', '0'],
 ['8', '183', '64', '0', '0', '23.3', '0.672', '32', '1'],
 ['1', '89', '66', '23', '94', '28.1', '0.167', '21', '0']]
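
The rows come back from `load_csv` as lists of strings, as the output above shows. Most models need numbers, so a conversion step is usually required before training. A minimal sketch, assuming every column is numeric; `rows_to_float` is a hypothetical helper, not part of the original notebook:

```python
def rows_to_float(rows):
    """Return a new dataset with every value converted from str to float."""
    return [[float(value.strip()) for value in row] for row in rows]

# Two rows copied from the output above.
sample = [['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1'],
          ['1', '85', '66', '29', '0', '26.6', '0.351', '31', '0']]
converted = rows_to_float(sample)
print(converted[0])
```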

In [12]:
len(dataset)

768

In [13]:
dataset_folds = cross_validation_split(dataset, 4)

The function returns a list of "folds", that is, the dataset partitions. With 768 rows and k=4, each fold holds 192 rows.

In [14]:
len(dataset_folds)

4
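
To turn the folds into a performance estimate, each fold is held out once and the per-round scores are averaged, as described at the start of this section. The sketch below is an illustration under stated assumptions: `evaluate_folds` and `zero_rule_accuracy` are hypothetical names, the class label is assumed to be the last item in each row, and the baseline simply predicts the most common training label rather than fitting a real model. Toy folds are used so the block runs on its own.

```python
from collections import Counter

def zero_rule_accuracy(train, test):
    """Hypothetical baseline: predict the most common label seen in training."""
    majority = Counter(row[-1] for row in train).most_common(1)[0][0]
    correct = sum(1 for row in test if row[-1] == majority)
    return correct / len(test)

def evaluate_folds(folds, score_fn):
    """Hold out each fold once, score on it, and average the scores."""
    scores = []
    for i, test_fold in enumerate(folds):
        # Flatten all remaining folds into a single training set.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(score_fn(train, test_fold))
    return scores, sum(scores) / len(scores)

# Toy folds for illustration; the last item in each row is the class label.
toy_folds = [
    [[1, '0'], [2, '0']],
    [[3, '0'], [4, '1']],
    [[5, '0'], [6, '0']],
]
scores, mean_score = evaluate_folds(toy_folds, zero_rule_accuracy)
print(scores, mean_score)
```

With the folds built above, the equivalent call would be `evaluate_folds(dataset_folds, zero_rule_accuracy)`.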

### Train-Test Split

A train-test split divides the dataset once: the fraction of rows given by split is moved at random into the training set, and the remaining rows form the test set.

In [1]:
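from random import randrange

def train_test_split(dataset, split):
    """Randomly split dataset into train and test sets; split is the training fraction."""
    train_set = []
    # Target training size; a fractional value is effectively rounded up by the loop below.
    train_size = split * len(dataset)
    # Start from a copy of all rows; whatever is not moved to train_set stays as the test set.
    test_set = list(dataset)

    while len(train_set) < train_size:
        rand_index = randrange(len(test_set))
        train_set.append(test_set.pop(rand_index))

    return train_set, test_set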

In [2]:
dataset = [1,2,3,4,5,6,7,8,9,10]
train, test = train_test_split(dataset, 0.7)
print(train)
print(test)

[8, 9, 6, 3, 10, 4, 5]
[1, 2, 7]


In [4]:
dataset = [[1,1],[2,2],[3,3],[4,4],[5,5],[6,6],[7,7],[8,8],[9,9],[10,10]]
train, test = train_test_split(dataset, 0.8)
print(train)
print(test)

[[8, 8], [1, 1], [2, 2], [4, 4], [9, 9], [5, 5], [6, 6], [3, 3]]
[[7, 7], [10, 10]]
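
Because `train_test_split` draws indices with `randrange`, rerunning a cell produces a different split each time, so the outputs shown above will generally not be reproduced exactly. The following is a minimal sketch of one way to get a repeatable split, assuming it is acceptable to shuffle a copy and cut it in two. `seeded_train_test_split` and its `seed` parameter are hypothetical and not part of the original notebook. Unlike the function above, it truncates a fractional training size instead of rounding it up.

```python
from random import Random

def seeded_train_test_split(dataset, split, seed):
    """Shuffle a copy of the dataset with a private RNG, then cut it in two."""
    rng = Random(seed)          # private generator; global random state is untouched
    shuffled = list(dataset)    # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    train_size = int(split * len(shuffled))  # truncates any fractional size
    return shuffled[:train_size], shuffled[train_size:]

dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train_a, test_a = seeded_train_test_split(dataset, 0.7, seed=1)
train_b, test_b = seeded_train_test_split(dataset, 0.7, seed=1)
print(train_a == train_b, test_a == test_b)
```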
