# Why use repeated cross-validation? #

Repeated cross-validation simply runs cross-validation several times, splitting the data into folds differently in each repetition. After each repetition, a model assessment metric (e.g. accuracy or RMSE) is computed. The scores from all repetitions are then averaged (the median also works) to give a final model assessment score. This gives a more “robust” score than performing cross-validation only once, which is what I aim to demonstrate here. Afterwards, I will tune a hyper-parameter using repeated cross-validation.

First I will import the data and form the feature matrix (**$X$**) and the output variable vector (**$y$**). My output variable is whether the given wine is “good wine” or “bad wine”. I define a “good wine” as one that has a quality score above 5/10. Therefore this is a classification problem.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from numpy import random
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing

wine_dataset = pd.read_csv('../input/winequality-red.csv')

#form feature matrix and target variable vector
X = wine_dataset[['fixed acidity', 'volatile acidity', 'citric acid',
       'chlorides',  'total sulfur dioxide', 'density',
        'sulphates', 'alcohol']].values #feature matrix

y = (wine_dataset['quality']>5).values.astype(int) #target variable, considers wines with quality score>5 good wines



Here is my function that performs repeated cross-validation. Note that scikit-learn has its own repeated cross-validation utilities, but I wanted more control over the process, so I wrote my own. The function takes the data set (feature matrix $X$ and output variable vector $y$) and the model to be used (e.g. logistic regression, SVM etc.), and returns the accuracy, precision and recall scores from all 50 repetitions of cross-validation.

In [3]:
def perform_repeated_cv(X, y , model):
    #set random seed for repeatability
    random.seed(1)

    #set the number of repetitions
    n_reps = 50

    # perform repeated cross validation
    accuracy_scores = np.zeros(n_reps)
    precision_scores=  np.zeros(n_reps)
    recall_scores =  np.zeros(n_reps)

    for u in range(n_reps):

        #randomly shuffle the dataset
        indices = np.arange(X.shape[0])
        np.random.shuffle(indices)
        X = X[indices]
        y = y[indices] #dataset has been randomly shuffled

        #initialize vector to keep predictions from all folds of the cross-validation
        y_predicted = np.zeros(y.shape)

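        #perform 5-fold cross-validation; the data was already shuffled above, so KFold
        #does not shuffle again (passing random_state without shuffle=True raises an error in recent scikit-learn)
        kf = KFold(n_splits=5)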
        for train, test in kf.split(X):

            #split the dataset into training and testing
            X_train = X[train]
            X_test = X[test]
            y_train = y[train]
            y_test = y[test]

            #standardization
            scaler = preprocessing.StandardScaler().fit(X_train)
            X_train = scaler.transform(X_train)
            X_test = scaler.transform(X_test)

            #train model
            clf = model
            clf.fit(X_train, y_train)

            #make predictions on the testing set
            y_predicted[test] = clf.predict(X_test)

        #record scores
        accuracy_scores[u] = accuracy_score(y, y_predicted)
        precision_scores[u] = precision_score(y, y_predicted)
        recall_scores[u]  = recall_score(y, y_predicted)

    #return all scores
    return accuracy_scores, precision_scores, recall_scores
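
For comparison, here is a minimal sketch of how a similar procedure could look with scikit-learn's built-in `RepeatedKFold`. It is not a drop-in replacement for the function above: it scores each fold separately and then averages, whereas my function pools the predictions from all folds before scoring. The synthetic data from `make_classification`, the names `X_demo`, `y_demo` and `pipeline`, and the small numbers of trees and repetitions are assumptions chosen only to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: any binary problem works here)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The pipeline refits the scaler on the training folds only, as in the function above
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=0),
)

# 5 folds repeated 4 times, with a different shuffle in every repetition
n_splits, n_repeats = 5, 4
cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=1)

results = cross_validate(
    pipeline, X_demo, y_demo, cv=cv,
    scoring=["accuracy", "precision", "recall"],
)

# cross_validate returns one score per fold, ordered repetition by repetition;
# averaging each group of n_splits folds gives one score per repetition
accuracy_per_repetition = (
    results["test_accuracy"].reshape(n_repeats, n_splits).mean(axis=1)
)
print(accuracy_per_repetition)
```

Wrapping the scaler and the classifier in one pipeline keeps the standardization inside each training fold, which is the same safeguard the hand-written loop applies.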

Now I will perform repeated cross-validation on this dataset with a random forest classifier, in order to look at how much the model assessment scores vary from one repetition to the next.

In [6]:


#perform repeated CV with a random forest classifier
accuracy_scores, precision_scores, recall_scores = perform_repeated_cv(X, y ,  RandomForestClassifier(n_estimators=100) )

#plot results from the 50 repetitions
fig, axes = plt.subplots(3, 1)

axes[0].plot(100*accuracy_scores , color = 'xkcd:cherry' , marker = 'o')
axes[0].set_xlabel('Repetition')
axes[0].set_ylabel('Accuracy (%)')
axes[0].set_facecolor((1,1,1))
axes[0].spines['left'].set_color('black')
axes[0].spines['right'].set_color('black')
axes[0].spines['top'].set_color('black')
axes[0].spines['bottom'].set_color('black')
axes[0].spines['left'].set_linewidth(0.5)
axes[0].spines['right'].set_linewidth(0.5)
axes[0].spines['top'].set_linewidth(0.5)
axes[0].spines['bottom'].set_linewidth(0.5)
axes[0].grid(linestyle='--', linewidth='0.5', color='grey', alpha=0.5)

axes[1].plot(100*precision_scores , color = 'xkcd:royal blue' , marker = 'o')
axes[1].set_xlabel('Repetition')
axes[1].set_ylabel('Precision (%)')
axes[1].set_facecolor((1,1,1))
axes[1].spines['left'].set_color('black')
axes[1].spines['right'].set_color('black')
axes[1].spines['top'].set_color('black')
axes[1].spines['bottom'].set_color('black')
axes[1].spines['left'].set_linewidth(0.5)
axes[1].spines['right'].set_linewidth(0.5)
axes[1].spines['top'].set_linewidth(0.5)
axes[1].spines['bottom'].set_linewidth(0.5)
axes[1].grid(linestyle='--', linewidth='0.5', color='grey', alpha=0.5)

axes[2].plot(100*recall_scores , color = 'xkcd:emerald' , marker = 'o')
axes[2].set_xlabel('Repetition')
axes[2].set_ylabel('Recall (%)')
axes[2].set_facecolor((1,1,1))
axes[2].spines['left'].set_color('black')
axes[2].spines['right'].set_color('black')
axes[2].spines['top'].set_color('black')
axes[2].spines['bottom'].set_color('black')
axes[2].spines['left'].set_linewidth(0.5)
axes[2].spines['right'].set_linewidth(0.5)
axes[2].spines['top'].set_linewidth(0.5)
axes[2].spines['bottom'].set_linewidth(0.5)
axes[2].grid(linestyle='--', linewidth='0.5', color='grey', alpha=0.5)

plt.grid(True)
plt.tight_layout()

The scores from each repetition of the cross-validation are shown in the figure above. Accuracy, precision and recall scores vary by ~2.5% across repetitions, so a single run of cross-validation could land anywhere in that range depending on how the folds happen to be split. Taking the mean of the scores across all repetitions therefore gives a more robust measure. The increased robustness comes at the expense of increased computational cost and running time.
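
As a small illustration of how the per-repetition scores could be condensed into a single figure plus a measure of spread, here is a minimal sketch. The helper `summarize_scores` and the example accuracies are made up for this illustration; they are not results from the wine dataset.

```python
import numpy as np

def summarize_scores(scores):
    """Return the mean, median, sample standard deviation and range of the scores."""
    scores = np.asarray(scores, dtype=float)
    return {
        "mean": scores.mean(),
        "median": np.median(scores),
        "std": scores.std(ddof=1),
        "range": scores.max() - scores.min(),
    }

# Made-up accuracies standing in for the output of repeated cross-validation
example_scores = np.array([0.80, 0.81, 0.79, 0.82, 0.80])
summary = summarize_scores(example_scores)
print(summary)
```

The same helper could be applied to the `accuracy_scores`, `precision_scores` and `recall_scores` arrays returned by `perform_repeated_cv`.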
___

# Tuning Hyper-parameters using repeated cross-validation
Now I will demonstrate how a hyper-parameter can be tuned using repeated cross-validation. I will tune the $C$ parameter of logistic regression. To do this, I will perform repeated cross-validation for each value in a grid of $C$ values and compute the mean accuracy at each one. I will then plot accuracy vs. $C$ and use the plot to choose $C$. I chose logistic regression because it runs quickly (compared to SVM, for example).

In [None]:

from sklearn.linear_model import LogisticRegression

#set up the parameter sweep
c_sweep = np.logspace(-4, 4, 50)

#perform repeated cross-validation by sweeping the parameter
accuracy_parameter_sweep = [] #keep the mean accuracy for each C here
std_parameter_sweep = [] #keep the standard deviation of the accuracy for each C here
for c in c_sweep:

    #perform repeated cross-validation
    accuracy_scores, precision_scores, recall_scores = perform_repeated_cv(X, y ,  LogisticRegression(C=c) )

    ##append scores
    accuracy_parameter_sweep.append(np.mean(100*accuracy_scores))
    std_parameter_sweep.append(np.std(100*accuracy_scores))


#plot C vs. accuracy
plt.fill_between(c_sweep , np.array(accuracy_parameter_sweep) - np.array(std_parameter_sweep) ,
                 np.array(accuracy_parameter_sweep) + np.array(std_parameter_sweep) , facecolor = 'xkcd:light pink', alpha=0.7)
plt.semilogx(c_sweep,accuracy_parameter_sweep , color= 'xkcd:red' , linewidth=4)
plt.xlabel('C')
plt.ylabel('Accuracy (%)')
plt.title('Logistic Regression Accuracy vs. Hyper-parameter C')
plt.grid(True, which='both')
plt.tight_layout()

At $C=11.5$, accuracy is maximized at $74.2\%$. As $C$ decreases, and the model is therefore more heavily regularized, accuracy decreases.
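
The same kind of sweep could also be delegated to scikit-learn by combining `GridSearchCV` with `RepeatedKFold`. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the wine dataset, a coarser grid of 9 values, and 3 repetitions to keep it fast. The names `X_demo`, `y_demo`, `pipeline` and `search` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Standardize inside each training fold, then fit logistic regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so the C parameter is addressed as "logisticregression__C"
param_grid = {"logisticregression__C": np.logspace(-4, 4, 9)}

# 5-fold cross-validation repeated 3 times with different shuffles
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

best_c = search.best_params_["logisticregression__C"]
mean_accuracy = search.cv_results_["mean_test_score"]  # one value per C
std_accuracy = search.cv_results_["std_test_score"]    # spread across all folds
print(best_c, search.best_score_)
```

Note that `std_test_score` here is the spread across all individual folds, not across repetitions, so it is not directly comparable to the shaded band in the plot above.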
