# Resampling

- The best way to evaluate the performance of an algorithm would be to make predictions for new data to which you already know the answers. 
- Use **resampling methods** that allow you to make accurate estimates for how well your algorithm will perform on new data.
- **Overfitting**: a model scored on the same data it was trained on looks better than it really is, so we must evaluate machine learning algorithms on data that was not used to train them.
- The evaluation is an estimate that we can use to talk about how well we think the algorithm may actually do in practice. It is not a guarantee of performance.
- Once we estimate the performance of our algorithm, we can then re-train the final algorithm on the entire training dataset and get it ready for operational use.
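
The estimate-then-retrain workflow described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the synthetic data from `make_classification`, the choice of `LogisticRegression`, and `max_iter=1000` are placeholders and not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Step 1: estimate performance on data not used for fitting (5-fold CV).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Estimated accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

# Step 2: re-train the final model on all of the available training data.
final_model = LogisticRegression(max_iter=1000).fit(X, y)
```

The printed score is only an estimate of how `final_model` may do in practice; the final model itself is never scored on held-out data in this sketch.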


## Four different techniques

- Train and Test Sets.
  - The size of the split can depend on the size and specifics of your dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.
  - This algorithm evaluation technique is very fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of the speed, it is useful to use this approach when the algorithm you are investigating is slow to train.
  - Downside: the estimate can have high variance. Differences between the training and test datasets can result in meaningful differences in the estimated accuracy.
- K-fold Cross Validation.
  - Cross validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test set split.
  - It works by splitting the dataset into k-parts (e.g. k=5 or k=10). Each split of the data is called a **fold**. The algorithm is trained on k-1 folds with one held back and tested on the held back fold. This is repeated so that each fold of the dataset is given a chance to be the held back test set.
  - After running cross validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.
  - The choice of k must allow each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation to provide a fair estimate of the algorithm's performance on unseen data. For modestly sized datasets in the thousands or tens of thousands of records, k values of 3, 5 and 10 are common.
- Leave One Out Cross Validation.
  - You can configure cross validation so that the size of the fold is 1 (k is set to the number of observations in your dataset). This variation of cross validation is called leave-one-out cross validation.
  - The result is a large number of performance measures that can be summarized in an effort to give a more reasonable estimate of the accuracy of your model on unseen data. A downside is that it can be a computationally more expensive procedure than k-fold cross validation.
  - LeaveOneOut() is equivalent to KFold(n_splits=n) and LeavePOut(p=1) where n is the number of samples.
  - Due to the high number of test sets (which is the same as the number of samples) this cross-validation method can be very costly.
- Repeated Random Test-Train Splits.
  - Another variation on k-fold cross validation is to create a random split of the data like the train/test split described above, but repeat the process of splitting and evaluation of the algorithm multiple times, like cross validation.
  - This has the speed of a train/test split and the reduced variance in estimated performance of k-fold cross validation. You can also repeat the process as many times as needed. A downside is that repetitions may include much of the same data in the train or the test split from run to run, introducing redundancy into the evaluation.
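
To make the fold mechanics concrete, here is a small sketch on six toy samples. It assumes scikit-learn's `KFold`, `LeaveOneOut` and `LeavePOut` splitters; the toy array is made up for illustration. It prints the 3-fold train/test indices and then checks the equivalence between leave-one-out, `KFold(n_splits=n)` and `LeavePOut(p=1)` stated above.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

X = np.arange(6).reshape(-1, 1)  # six toy samples, indices 0..5

# 3-fold split without shuffling: each sample lands in exactly one test fold.
test_folds = []
for train_idx, test_idx in KFold(n_splits=3).split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
    test_folds.append(test_idx)

# Collect the test sets produced by the three splitters for comparison.
n = len(X)
loo_tests = [tuple(test) for _, test in LeaveOneOut().split(X)]
kfold_n_tests = [tuple(test) for _, test in KFold(n_splits=n).split(X)]
lpo_tests = [tuple(test) for _, test in LeavePOut(p=1).split(X)]
print("Same test sets:", loo_tests == kfold_n_tests == lpo_tests)
```

Each of the `n` leave-one-out iterations fits a model on `n - 1` samples, which is where the computational cost comes from.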

## What Techniques to Use When

- Generally, k-fold cross validation with k set to 3, 5, or 10 is the gold standard for evaluating the performance of a machine learning algorithm on unseen data.
- Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.
- Techniques like leave-one-out cross validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.
- The best advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions. If in doubt, use 10-fold cross validation.
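
One way to experiment, as suggested above, is to repeat each technique with different random seeds and look at how much the estimate moves. The sketch below is a hedged illustration on synthetic data (`make_classification`, `LogisticRegression` and the 20 seeds are assumptions, not from the original notes). On many datasets you would expect the 10-fold estimates to vary less than the single-split estimates, but this is not guaranteed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

split_estimates, cv_estimates = [], []
for seed in range(20):
    # Single 67/33 train/test split.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_estimates.append(model.score(X_te, y_te))

    # Mean accuracy over 10 shuffled folds.
    folds = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=folds)
    cv_estimates.append(scores.mean())

print("single split: mean=%.3f std=%.3f"
      % (np.mean(split_estimates), np.std(split_estimates)))
print("10-fold CV  : mean=%.3f std=%.3f"
      % (np.mean(cv_estimates), np.std(cv_estimates)))
```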


## Validation Set

- A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning the model’s hyperparameters.
- The validation dataset differs from the test dataset, which is also held back from training but is used instead to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.

## Definitions of Train, Validation, and Test Datasets

- **Training Dataset**: The sample of data used to fit the model.
- **Validation Dataset**: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
- **Test Dataset**: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

```python
# split data
data = ...
train, validation, test = split(data)

# tune model hyperparameters
parameters = ...
for params in parameters:
    model = fit(train, params)
    skill = evaluate(model, validation)

# evaluate final model for comparison with other models
model = fit(train)
skill = evaluate(model, test)
```

- The validation dataset may also play a role in other forms of model preparation, such as feature selection.
- The final model could be fit on the aggregate of the training and validation datasets.
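
A runnable version of the pseudocode above might look like the following sketch. The synthetic data, the 60/20/20 split proportions, the `LogisticRegression` model and the candidate values for its `C` parameter are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out the test set first, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Tune one hyperparameter (regularization strength C) on the validation set.
best_C, best_skill = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    skill = model.score(X_val, y_val)
    if skill > best_skill:
        best_C, best_skill = C, skill

# Refit on train + validation with the chosen setting, then score once on test.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_rest, y_rest)
test_skill = final_model.score(X_test, y_test)
print("best C=%s, validation=%.3f, test=%.3f" % (best_C, best_skill, test_skill))
```

The test set is touched exactly once, after all tuning decisions have been made, which is what keeps its estimate unbiased.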


- There is clear precedent for what “training dataset,” “validation dataset,” and “test dataset” refer to when evaluating models.
- The “validation dataset” is predominantly used to describe the evaluation of models when tuning hyperparameters and data preparation, and the “test dataset” is predominantly used to describe the evaluation of a final tuned model when comparing it to other final models.
- The notions of “validation dataset” and “test dataset” may disappear when adopting alternate resampling methods like k-fold cross validation, especially when the resampling methods are nested.

## Validation and Test Datasets Disappear

Reference to a “validation dataset” disappears if the practitioner is choosing to tune model hyperparameters using k-fold cross-validation with the training dataset.

```python
# split data
data = ...
train, test = split(data)
 
# tune model hyperparameters
parameters = ...
k = ...
for params in parameters:
    skills = list()
    # iterate over the k folds of the training dataset
    for i in range(k):
        fold_train, fold_val = cv_split(i, k, train)
        model = fit(fold_train, params)
        skill_estimate = evaluate(model, fold_val)
        skills.append(skill_estimate)
    skill = summarize(skills)
 
# evaluate final model for comparison with other models
model = fit(train)
skill = evaluate(model, test)
```

Reference to the “test dataset” too may disappear if the cross-validation of model hyperparameters using the training dataset is nested within a broader cross-validation of the model.
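
Nested cross-validation can be sketched with scikit-learn by passing a `GridSearchCV` object (the inner loop, which tunes hyperparameters) to `cross_val_score` (the outer loop, which estimates skill). The synthetic data, fold counts and parameter grid below are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Inner loop: tune C with 3-fold CV on each outer training partition.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
)

# Outer loop: 5-fold CV estimates the skill of the whole tuning procedure.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print("Nested CV accuracy: %.3f (+/- %.3f)"
      % (nested_scores.mean(), nested_scores.std()))
```

No single fixed validation or test set exists here: every sample serves in each role at some point across the inner and outer folds.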



# How Much Training Data
It is important to know why you are asking about the required size of the training dataset.

- **Do you have too much data?**
  - Consider developing some learning curves to find out just how big a representative sample is (below).
  - Consider using a big data framework in order to use all available data.
- **Do you have too little data?**
  - Consider confirming that you indeed have too little data.
  - Consider collecting more data, or using data augmentation methods to artificially increase your sample size.
- **Have you not collected data yet?**
  - Consider collecting some data and evaluating whether it is enough.
  - If it is for a study, or data collection is expensive, consider talking to a domain expert and a statistician.


## It Depends; No One Can Tell You

The amount of data required depends on many factors, such as:

- The complexity of the problem, nominally the unknown underlying function that best relates your input variables to the output variable.
- The complexity of the learning algorithm, nominally the algorithm used to inductively learn the unknown underlying mapping function from specific examples.

## Reason by Analogy
- Perhaps you can look at studies on problems similar to yours as an estimate for the amount of data that may be required.
- It is common to perform studies on how algorithm performance scales with dataset size.

## Use Domain Expertise
- You need a sample of data from your problem that is representative of the problem you are trying to solve.
- In general, the examples must be independent and identically distributed.
- This means that there needs to be enough data to reasonably capture the relationships that may exist both between input features and between input features and output features.

## Use a Statistical Heuristic
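- A factor of the number of classes: some number of independent examples for each class.
- A factor of the number of input features: some multiple more examples than there are input features.
- A factor of the number of model parameters: some number of independent examples for each parameter in the model.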


## Nonlinear Algorithms Need More Data
- These algorithms are often more flexible and even nonparametric (they can figure out how many parameters are required to model your problem in addition to the values of those parameters). They are also high-variance, meaning predictions vary based on the specific data used to train them. This added flexibility and power comes at the cost of requiring more training data, often a lot more data.
- In fact, some nonlinear algorithms like deep learning methods can continue to improve in skill as you give them more data.
- If a linear algorithm achieves good performance with hundreds of examples per class, you may need thousands of examples per class for a nonlinear algorithm such as a random forest or an artificial neural network.

## Evaluate Dataset Size vs Model Skill
- Design a study that evaluates model skill versus the size of the training dataset.
- **Learning Curve**: Plotting the result as a line plot with training dataset size on the x-axis and model skill on the y-axis will give you an idea of how the size of the data affects the skill of the model on your specific problem.
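
Such a study can be sketched with scikit-learn's `learning_curve` helper, which refits the model on growing subsets of the training data and scores each fit with cross validation. The synthetic data, model and the five training sizes are assumptions for illustration; the sketch prints the values rather than plotting them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption: replace with your own X, y).
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Evaluate 5 training sizes, from 10% to 100% of each CV training partition.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, shuffle=True, random_state=0,
)

# One row per training size: mean and spread of the held-out accuracy.
for size, scores in zip(train_sizes, val_scores):
    print("n_train=%4d  accuracy=%.3f (std %.3f)"
          % (size, scores.mean(), scores.std()))
```

Plotting `train_sizes` against `val_scores.mean(axis=1)` gives the line plot described above; a curve that has flattened suggests that more data of the same kind may add little.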

## Naive Guesstimate

- Get and use as much data as you can.
- You need thousands of examples.
- No fewer than hundreds.
- Ideally, tens or hundreds of thousands for “average” modeling problems.
- Millions or tens-of-millions for “hard” problems like those tackled by deep learning.

## Get More Data (No Matter What!?)

- Keep in mind that machine learning is a process of induction. The model can only capture what it has seen. If your training data does not include edge cases, they will very likely not be supported by the model.

## Don’t Procrastinate; Get Started
- Get all the data you can, use what you have, and see how effective models are on your problem.



```python
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

# Load the Pima Indians diabetes dataset: 8 input columns and 1 class column.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data_frame = pandas.read_csv(url, names=names)
data = data_frame.values
```

```python
# Train and test sets
X = data[:, 0:8]  # shape (768, 8)
Y = data[:, 8]
test_size = 0.33
# Because the split of the data is random, we want the results to be reproducible.
# Specifying the random seed gives the same random numbers each time the code runs.
seed = 42
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))
# The estimated accuracy for the model was approximately 77%.
```

```
Accuracy: 76.772%
```


```python
# K-fold cross validation
# Folds are taken in order (shuffle defaults to False), so no random seed is needed.
kfold = model_selection.KFold(n_splits=10)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)

# It is good practice to summarize the distribution of the scores, in this case
# assuming a Gaussian distribution of performance (a very reasonable assumption)
# and recording the mean and standard deviation.
print('Accuracy: %.3f%% (min: %.3f%%, max: %.3f%%, std:%.3f%%)' % (results.mean() * 100.0, results.min() * 100.0, results.max() * 100.0, results.std() * 100.0))
```

```
Accuracy: 76.951% (min: 70.130%, max: 85.714%, std:4.841%)
```


```python
# Leave One Out cross validation
loocv = model_selection.LeaveOneOut()
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=loocv)
print('Accuracy: %.3f%% (min: %.3f%%, max: %.3f%%, std:%.3f%%)' % (results.mean() * 100.0, results.min() * 100.0, results.max() * 100.0, results.std() * 100.0))
```

```
Accuracy: 76.823% (min: 0.000%, max: 100.000%, std:42.196%)
```
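
The large standard deviation in the leave-one-out result is expected rather than alarming. Each test set holds a single sample, so every score is either 0 or 1, and the standard deviation of such scores with mean p is sqrt(p * (1 - p)). The sketch below reproduces the reported figures, assuming 590 of the 768 samples were predicted correctly; that count is inferred from the 76.823% accuracy and is not stated in the original output.

```python
import numpy as np

n_samples = 768  # dataset size, from the shape noted in the train/test example
n_correct = 590  # inferred: 590 / 768 = 76.823%

# Leave-one-out scores are all-or-nothing: 1.0 for a hit, 0.0 for a miss.
scores = np.concatenate([np.ones(n_correct), np.zeros(n_samples - n_correct)])

p = scores.mean()
print("mean             : %.3f%%" % (p * 100.0))
print("std              : %.3f%%" % (scores.std() * 100.0))
print("sqrt(p * (1 - p)): %.3f%%" % (np.sqrt(p * (1 - p)) * 100.0))
```

For leave-one-out, the mean is therefore the informative number; the standard deviation is determined entirely by the mean and says nothing extra about stability.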


```python
# Shuffle Split cross validation (repeated random train/test splits)
test_size = 0.33
seed = 42
shuffle_split = model_selection.ShuffleSplit(n_splits=10, test_size=test_size, random_state=seed)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=shuffle_split)
print('Accuracy: %.3f%% (min: %.3f%%, max: %.3f%%, std:%.3f%%)' % (results.mean() * 100.0, results.min() * 100.0, results.max() * 100.0, results.std() * 100.0))
```

```
Accuracy: 77.953% (min: 73.228%, max: 81.890%, std:2.728%)
```


```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Each iteration holds out exactly one sample as the test set.
X = np.array([1, 2, 3, 4, 5, 6])
loo = LeaveOneOut()
for train_idx, test_idx in loo.split(X):
    print(X[train_idx], X[test_idx])
```

```
[2 3 4 5 6] [1]
[1 3 4 5 6] [2]
[1 2 4 5 6] [3]
[1 2 3 5 6] [4]
[1 2 3 4 6] [5]
[1 2 3 4 5] [6]
```


```python
from sklearn.model_selection import ShuffleSplit
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
rs.get_n_splits(X)
```

```
5
```

```python
rs
```

```
ShuffleSplit(n_splits=5, random_state=0, test_size=0.25, train_size=None)
```

```python
# The same index can appear in the test set of more than one split.
for train_index, test_index in rs.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
```

```
TRAIN: [3 1 0] TEST: [2]
TRAIN: [2 1 3] TEST: [0]
TRAIN: [0 2 1] TEST: [3]
TRAIN: [0 2 3] TEST: [1]
TRAIN: [2 3 0] TEST: [1]
```
