# Training Techniques
Never evaluate an ML model on the same data that was used to train it: the resulting score is optimistically biased and can hide overfitting. Always evaluate the model on data that was not used to train it.

In [1]:
import pandas
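# note: sklearn.cross_validation was removed in scikit-learn 0.20; newer releases provide these tools in sklearn.model_selection
from sklearn import cross_validation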
from sklearn.linear_model import LogisticRegression



In [2]:
# these examples use the Pima Indian diabetes dataset
url = "pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

In [3]:
# separate array into features (X) and label (y) parts
X = array[:,0:8]
y = array[:,8]

## Train and Test Sets
The simplest technique is to split the original dataset into two parts: one for training the model and one for testing it.

In [4]:
test_size = 0.3   # the testing part is 30% of the original dataset
seed = 8

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, y_train)

result = model.score(X_test, y_test)
print("Accuracy: %.3f%%" % (result*100.0))

Accuracy: 77.922%
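
For readers on a recent scikit-learn release, the split above can be sketched with `sklearn.model_selection`. This is a minimal sketch, not the notebook's original cell: the synthetic data from `make_classification` is an assumed stand-in for the diabetes CSV, the names `X_demo` and `y_demo` are hypothetical, and `max_iter=1000` is added only to avoid convergence warnings. The accuracy it prints will differ from the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in with the same shape as the diabetes data
# (768 rows, 8 features); replace with the real X and y when available.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=8)

# Hold back 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=8)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score only on rows the model has never seen
accuracy = model.score(X_test, y_test)
print("Accuracy: %.3f%%" % (accuracy * 100.0))
```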


## K-fold Cross Validation
The score from a single train/test split depends heavily on which rows happen to land in the test set; k-fold cross validation gives a performance estimate with less variance. It works by splitting the dataset into k parts (each part is called a fold). The model is trained on k-1 folds and tested on the remaining fold. This is repeated k times so that each fold is used exactly once as the held-back test set. The result is k performance scores, which are then summarised using their mean and standard deviation.

In [5]:
num_folds = 10
num_instances = len(X)
seed = 8

# note: random_state only has an effect when shuffle=True is also passed
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LogisticRegression()

results = cross_validation.cross_val_score(model, X, y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.951% (4.841%)
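
To make the fold mechanics concrete, here is a small pure-Python sketch of how k contiguous folds can be carved out of the row indices. The helper `k_fold_indices` is hypothetical and written for illustration only; it is not scikit-learn's implementation, although it follows the same idea of giving the first few folds one extra row when the data does not divide evenly.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        # Everything outside the current fold is used for training
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each row appears in exactly one test fold, which is why every observation contributes to the evaluation exactly once.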


## Leave One Out Cross Validation
This variation of cross validation sets the number of folds to the number of observations in the dataset (i.e. each fold contains exactly one observation). It produces one performance score per observation, and these scores are then summarised to estimate the accuracy of the model. The downside is that this technique is computationally more expensive, because the model is fitted once for every observation.

In [6]:
num_instances = len(X)

loocv = cross_validation.LeaveOneOut(n=num_instances)
model = LogisticRegression()

results = cross_validation.cross_val_score(model, X, y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.823% (42.196%)


The individual scores vary far more here than in k-fold cross validation (as the standard deviation above shows), because each fold tests a single observation and therefore scores either 0% or 100%.
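
The size of that standard deviation can be checked by hand. With one observation per fold, every score is either 0 or 1, so the standard deviation of the scores is sqrt(p * (1 - p)), where p is the mean accuracy. The sketch below plugs in the mean reported above; the variable names are hypothetical.

```python
import math

mean_accuracy = 0.76823  # LOOCV mean accuracy reported above

# Standard deviation of a set of 0/1 scores whose mean is mean_accuracy
std = math.sqrt(mean_accuracy * (1.0 - mean_accuracy))
print("Std: %.3f%%" % (std * 100.0))  # prints "Std: 42.196%"
```

The result matches the printed 42.196%, so the figure reflects the all-or-nothing nature of single-observation folds.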

## Repeated Random Train-Test Splits
This technique repeats the process of randomly splitting the data and evaluating the model multiple times. It has the speed of the train/test split technique and the reduced variance of the k-fold cross validation technique. The downside is that repetitions may include much of the same data in the train or test split from run to run, introducing redundancy into the evaluation.

In [7]:
num_samples = 10
test_size = 0.3
num_instances = len(X)

seed = 8
shuffle_split = cross_validation.ShuffleSplit(n=num_instances, n_iter=num_samples, test_size=test_size, random_state=seed)
model = LogisticRegression()

results = cross_validation.cross_val_score(model, X, y, cv=shuffle_split)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.840% (1.909%)
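
The redundancy mentioned above can be seen in a small pure-Python sketch. The helper `shuffle_split_indices` is hypothetical and only imitates the idea of repeated random splits; as an assumption it rounds the test-set size, which may differ from how scikit-learn computes it. With 5 repetitions of a 6-row test set drawn from 20 rows, there are 30 test slots for 20 rows, so at least one row must be tested more than once, while others may never be tested.

```python
import random

def shuffle_split_indices(n_samples, n_iter, test_size, seed):
    """Yield (train, test) index lists for n_iter independent random splits."""
    rng = random.Random(seed)
    n_test = int(round(test_size * n_samples))
    for _ in range(n_iter):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled rows form the test set, the rest train
        yield sorted(indices[n_test:]), sorted(indices[:n_test])

splits = list(shuffle_split_indices(n_samples=20, n_iter=5, test_size=0.3, seed=8))

# Count how often each row lands in a test set across the repetitions
test_counts = [0] * 20
for _, test in splits:
    for i in test:
        test_counts[i] += 1
print("Times each row was tested:", test_counts)
```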


## Which technique to use?
* K-fold cross validation is the "best practice" for evaluating the performance of a model on unseen data, with k set to 3, 5 or 10.
* A train/test split is fast, which helps when using a slow algorithm, and it produces performance scores with lower bias when the dataset is large.
* If in doubt, use 10-fold cross validation.
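
As a rough way to compare the options side by side on a recent scikit-learn release, the sketch below runs 10-fold cross validation and repeated random splits through `sklearn.model_selection`. It rests on several assumptions: the data is a synthetic stand-in from `make_classification` rather than the diabetes CSV, the names `X_demo`, `y_demo`, `strategies` and `summary` are hypothetical, `shuffle=True` is passed so that the seed actually affects the folds, and `max_iter=1000` is added only to avoid convergence warnings. Its numbers are not comparable to the outputs recorded above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# Assumption: synthetic stand-in for the diabetes data (768 rows, 8 features)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=8)
model = LogisticRegression(max_iter=1000)

strategies = {
    "10-fold": KFold(n_splits=10, shuffle=True, random_state=8),
    "repeated 70/30 split": ShuffleSplit(n_splits=10, test_size=0.3, random_state=8),
}

summary = {}
for name, cv in strategies.items():
    # One accuracy score per fold or repetition
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    summary[name] = scores
    print("%s: %.3f%% (%.3f%%)" % (name, scores.mean() * 100.0, scores.std() * 100.0))
```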