In [6]:
import warnings
warnings.filterwarnings('ignore')

# Evaluate the Performance of Machine Learning Algorithms with Resampling


- The best way to evaluate an algorithm is to make predictions on new data for which you already know the answers.
- The second-best way is to estimate the accuracy of your machine learning algorithms using resampling methods.
- Evaluating a model on the same data it was trained on gives an overly optimistic estimate and hides overfitting: the model can score well there and still predict poorly on new, unseen data.
- That's why the model should be tested on unseen data.
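
The gap between training accuracy and held-out accuracy can be seen in a small sketch. It is not part of the original notebook: it assumes synthetic data from `make_classification` and swaps in an unpruned `DecisionTreeClassifier`, chosen only because it memorizes training rows easily. All variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     flip_y=0.2, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33, random_state=7)

# An unpruned tree can memorize the training rows, noise included
tree = DecisionTreeClassifier(random_state=7).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)   # scored on data the model has seen
test_acc = tree.score(X_te, y_te)    # scored on held-out data

print("Training accuracy: %.3f" % train_acc)
print("Held-out accuracy: %.3f" % test_acc)
```

The training score alone would suggest a near-perfect model; only the held-out score reflects how it behaves on unseen rows.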


## Different techniques used to split up the training set and create useful estimates of performance on the ML algorithm



- 1- Train and Test Sets.
- 2- k-fold Cross-Validation.
- 3- Leave One Out Cross-Validation.
- 4- Repeated Random Test-Train Splits.

### 1- Split into Train and Test Sets
- The simplest method.
- Train the model on the training set and evaluate it on the test set.
- Well suited to large datasets (millions of records) because it is fast: the model is trained and evaluated only once.
- Downside: the estimate has high variance, meaning that differences between the training set and the test set can have a major impact on the estimated accuracy.

In [7]:
# Evaluate using a train and a test set
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names)

# Separate the eight input features (columns 0-7) from the class label (column 8)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size,
    random_state=seed)

model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result*100.0))

Accuracy: 75.591%


- Setting the seed makes the results reproducible. You can use any number, as long as it stays the same between runs.
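
The high variance of a single train/test split shows up as soon as the seed changes. The sketch below is an illustration under assumptions, not the notebook's own experiment: it uses synthetic data from `make_classification` instead of the Pima file, and `max_iter=1000` is set only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset (assumption: not the Pima data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     flip_y=0.1, random_state=0)

split_scores = []
for split_seed in range(5):
    # Same data, same model, only the random split differs
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.33, random_state=split_seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    split_scores.append(clf.score(X_te, y_te))

print(["%.3f" % s for s in split_scores])
print("Spread: %.3f" % (max(split_scores) - min(split_scores)))
```

Rerunning with the same seed reproduces the same score, which is why a fixed seed matters when comparing models.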

### 2- K-fold Cross-Validation
- Estimates accuracy with less variance than a single train/test split.
- Splits the dataset into k parts (e.g. k = 5 or k = 10); each part is called a fold.
- Train the model on k-1 folds, test on the held-back fold, and repeat so that each fold is held back once.
- After cross-validation you end up with k performance scores, which can be summarized using their mean and standard deviation.
- The estimate is more reliable because every observation is used for testing exactly once and for training k-1 times.
- For modest-sized datasets in the thousands or tens of thousands of records, k values of 3, 5 and 10 are common.
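
The steps above can be written as an explicit loop, which is roughly what `cross_val_score` does with a `KFold` splitter. This is a minimal sketch on synthetic data; the variable names and `max_iter=1000` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in dataset (assumption: not the Pima data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=7)

fold_scores = []
held_out = 0
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    # Train on k-1 folds, score on the one fold that was held back
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))
    held_out += len(test_idx)

fold_scores = np.array(fold_scores)
print("Accuracy: %.3f%% (%.3f%%)" % (fold_scores.mean()*100.0,
                                     fold_scores.std()*100.0))
```

Across the five iterations every row is held out exactly once, so `held_out` equals the number of rows.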

In [8]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

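num_folds = 10
# Without shuffle=True the folds are consecutive blocks, so random_state has no
# effect; recent scikit-learn versions raise an error if it is set here.
kfold = KFold(n_splits=num_folds)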
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.951% (4.841%)


### 3- Leave One Out Cross-Validation
- Fold size is one.
- k is set to the number of observations in the dataset.
- The result is a large number of performance measures, one per observation.
- Downside: it is computationally expensive, because the model is trained once per observation.

In [9]:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

loocv = LeaveOneOut()
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.823% (42.196%)
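
The large standard deviation is expected rather than alarming. Each leave-one-out fold holds a single observation, so every score is either 0 or 1, and the standard deviation of such values with mean p is sqrt(p(1-p)). A short check, taking the mean accuracy from the output above as the only input (the variable names are illustrative):

```python
import math

# Mean accuracy reported by the leave-one-out run
loocv_accuracy = 0.76823

# Standard deviation of 0/1 scores whose mean is loocv_accuracy
loocv_std = math.sqrt(loocv_accuracy * (1.0 - loocv_accuracy))
print("Implied standard deviation: %.3f%%" % (loocv_std * 100.0))
```

The value agrees with the reported spread to three decimals, so for leave-one-out the standard deviation carries no information beyond the mean itself.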


### 4- Repeated Random Test-Train Splits
- Randomly splits the data into train and test sets.
- Repeats the splitting and the evaluation of the algorithm multiple times.
- Combines the speed of a train/test split with the reduced variance of cross-validation.
- Downside: the repetitions may include much of the same data in the train or test split from run to run, which introduces redundancy into the evaluation.

In [12]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

n_splits = 10
test_size = 0.33
seed = 7
shuffle_split = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=shuffle_split)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

Accuracy: 76.496% (1.698%)


### What Techniques to Use When
- Generally, k-fold cross-validation is the gold standard for evaluating the performance of a machine learning algorithm on unseen data, with k set to 3, 5, or 10.
- A train/test split is good for speed when using a slow algorithm, and it produces performance estimates with lower bias when using large datasets.
- Techniques like leave-one-out cross-validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed, and dataset size.
- 10-fold cross-validation is the usual default.