# PC Lab 5: Resampling Methods for Model Evaluation
---

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import urllib.request

from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, LeaveOneOut
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import pc5

## 1. Introduction

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/2880px-K-fold_cross_validation_EN.svg.png" width=500>

This tutorial covers an important tool for evaluating how accurately your model will perform in practice. _Resampling methods_ are an indispensable tool in modern statistics: they involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about
the fitted model.

For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the
extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample. 
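The linear regression example above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, and the samples are drawn with replacement (a bootstrap), which is one possible way of "drawing different samples" and is not prescribed by this lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (assumption for illustration): y = 2x + 1 + noise
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=n)

# Refit the regression line on many resampled versions of the training set
n_resamples = 500
slopes = np.empty(n_resamples)
for b in range(n_resamples):
    idx = rng.integers(0, n, size=n)  # sample indices with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes[b] = slope

# The spread of the refitted slopes estimates the variability of the fit
print(f"mean slope: {slopes.mean():.3f}, std of slope: {slopes.std(ddof=1):.3f}")
```

The standard deviation of `slopes` is information that a single fit on the original sample would not provide.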

Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing
power, the computational requirements of resampling methods generally are not prohibitive. In this PC lab, we discuss two of the most commonly used resampling methods, _cross-validation_ and nested cross-validation.


As before, we first download the files needed for this PC lab:

In [2]:
!wget https://raw.githubusercontent.com/tfmortie/mlmust/main/05_evaluation/pc5.py
!wget https://raw.githubusercontent.com/tfmortie/mlmust/main/05_evaluation/promoters.csv


## 2. Data exploration: A simple promoter library

The dataset used for the first exercise is a simple [E. coli promoter database](https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.names) that contains a set of both promoter regions and non-promoter regions. A promoter region is the DNA sequence upstream of a gene to which the RNA polymerase binds before transcription of that gene is initiated.

<img src="https://ib.bioninja.com.au/_Media/sections-of-a-gene_med.jpeg" width=500>

<div class="alert alert-success">

<b>EXERCISE 2.1</b>: **Load in the data and explore. Use _pd.Series.value_counts()_ to evaluate the empirical distribution of the labels. The data is stored in promoters.csv**
</div>  

In [None]:
df_data = "..." # TODO: load the data from the promoters.csv file

"..." # explore the data

**We see that the first row of the file contains data rather than column names**. Hence, we can load the dataframe without a header by setting <code>header=None</code>.

In [1]:
df_data = "..." # load data again without column names
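To see what `header=None` changes, here is a small sketch on a made-up three-row table (the values are hypothetical and are not taken from promoters.csv). It also shows `value_counts()` on the label column.

```python
import io
import pandas as pd

# Hypothetical CSV text without a header row: label, name, sequence
csv_text = "+,S10,tactag\n+,AMPC,tgctat\n-,867,gtatac\n"

# Default behaviour: the first row is consumed as column names
with_header = pd.read_csv(io.StringIO(csv_text))
# header=None: every row is data, columns are numbered 0, 1, 2
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)     # one data row is lost to the header
print(without_header.shape)  # all three rows are kept

# Empirical distribution of the labels in column 0
label_counts = without_header[0].value_counts()
print(label_counts)
```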

Now evaluate the distribution of the two labels, using the built-in pandas function <code>pd.Series.value_counts()</code>. Is our dataset balanced?

In [None]:
"..." # count the number of promoters and non-promoters

<div class="alert alert-success">

<b>EXERCISE 2.2</b>: **Create the features and labels for the model. Create your own features by using insights from the previous practical or use _pc5.CreateDummyNucleotideFeatures()_. In order to transform the labels, take a look at _sklearn.preprocessing.LabelEncoder_.**
</div>  

**Hint: As stated in the docstring, the features are returned as a pandas DataFrame object. To convert it into a NumPy array, use the <code>pd.DataFrame.values</code> attribute.**
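As a hedged illustration of the two steps in the hint, the sketch below encodes a toy label column and converts a toy DataFrame to a NumPy array. The labels and feature columns are made up for the example; they are not the output of `pc5.CreateDummyNucleotideFeatures`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumption): "+" for promoter, "-" for non-promoter
labels = pd.Series(["+", "-", "+", "-"])

encoder = LabelEncoder()
y_toy = encoder.fit_transform(labels)  # maps each class to an integer
print(encoder.classes_)  # classes in the order of their integer codes
print(y_toy)

# Toy feature table (hypothetical dummy columns)
features = pd.DataFrame({"pos0_A": [1, 0, 0, 1], "pos0_C": [0, 1, 0, 0]})
X_toy = features.values  # .values is an attribute, not a method call
print(type(X_toy), X_toy.shape)
```

`encoder.inverse_transform(y_toy)` recovers the original string labels if they are needed later.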

In [None]:
X = "..." # TODO: use pc5.CreateDummyNucleotideFeatures to create a feature matrix from the DNA sequences
# TODO: get the NumPy array from the pandas DataFrame
X = "..."

y = "..." # TODO: encode the labels 

## 3. Cross-validation

In this section we will elaborate on two specific questions that are strongly related to cross-validation. Both questions stem from the fact that we aim to build models with a high predictive power, based on a finite dataset.
 - **Question 1**: Given a relatively small dataset, how can we use the promoter dataset as efficiently as possible to construct a model that optimally predicts the presence of a promoter region in prokaryotic DNA?
 - **Question 2**: How can we decide which machine learning method should be preferred to predict the presence of a promoter region?


### Question 1: k-fold cross-validation

K-fold cross-validation is a resampling technique that is often used in machine learning in order to assess the generalization performance of a machine learning model. It consists of partitioning a dataset into k equally-sized folds or subsets. Next, the model is trained and validated k times, each time using a different fold as the evaluation set and the remaining k-1 folds as training set. The final performance metric is often computed as the average of the performance scores obtained in each iteration, providing a more reliable estimate of the model's generalization ability.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*AAwIlHM8TpAVe4l2FihNUQ.png" width=600>
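The procedure described above can be sketched in a few lines. This is a minimal illustration, assuming a synthetic dataset from `make_classification`, five folds, and a logistic regression model; none of these choices come from the exercises below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic binary classification data (assumption for illustration)
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=0)

# Partition the data into k = 5 folds
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(X_toy):
    # Train on k-1 folds, evaluate on the remaining fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    scores.append(model.score(X_toy[test_idx], y_toy[test_idx]))

# Average the k scores to estimate generalization performance
print(f"fold accuracies: {np.round(scores, 3)}")
print(f"mean accuracy: {np.mean(scores):.3f}")
```

Each sample is used for evaluation exactly once and for training k-1 times, which is why the averaged score makes better use of a small dataset than a single train/test split.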

<div class="alert alert-success">

<b>EXERCISE 3.1</b>: **Write code that performs cross-validation for a k-nearest neighbor model by using the [_StratifiedKFold()_](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) class in Scikit-learn. Use four folds and determine the number of neighbors that maximizes accuracy. What is the optimal number of neighbors?**
</div>  

In [None]:
fold_scores = []
# define space to iterate over
space = np.arange(1,20)
# define folds
folds = StratifiedKFold("...") # TODO: define number of folds, set shuffle to True

for tr_idx, tu_idx in "...": # TODO: iterate over the folds splitting the data
    hyper_scores = [] # store scores for each hyperparameter (number of neighbors in this case)
    for hyper in space:
        knn = "..." # TODO: define classifier with hyper as number of neighbors
        "..." # TODO: fit classifier on training data
        score = "..." # TODO: evaluate score on held-out data
        "..." # TODO: save score in list
    fold_scores.append(hyper_scores) # save scores for each fold

fig, ax = plt.subplots(1,1,figsize=(10,6))
ax.plot(space, np.mean(fold_scores, axis=0), '-o') # use the mean of the scores for each hyperparameter
ax.set_xlabel("Number of neighbors")
ax.set_ylabel("Mean Score")
ax.set_title("KNN cross-validation")

<div class="alert alert-success">

<b>EXERCISE 3.2</b>: **Copy your code from the previous exercise below. Replace the plot function with the _ax.errorbar()_ function and display the variance of the scores across the folds as vertical error bars. In what range do they lie?**
</div>  
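As a hedged sketch of how `ax.errorbar()` is called, the example below uses randomly generated numbers in place of real cross-validation results; `toy_scores` is a hypothetical array with one row per fold and one column per hyperparameter value.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
space = np.arange(1, 20)

# Hypothetical scores (not real results): 4 folds x 19 hyperparameter values
toy_scores = rng.uniform(0.6, 0.9, size=(4, len(space)))

# Mean and variance across the folds, one value per hyperparameter
mean_scores = toy_scores.mean(axis=0)
var_scores = toy_scores.var(axis=0)

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
# yerr sets the half-length of the vertical bar drawn at each point
ax.errorbar(space, mean_scores, yerr=var_scores, fmt='-o', capsize=3)
ax.set_xlabel("Number of neighbors")
ax.set_ylabel("Mean Score")
```

Passing the standard deviation (`toy_scores.std(axis=0)`) to `yerr` instead gives bars in the same units as the score, which can be easier to read.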

In [None]:
fold_scores = []
# define space to iterate over
space = np.arange(1,20)
# define folds
folds = "..."

"..." # TODO: repeat the same procedure, using error bars to show the variance between the folds

<div class="alert alert-success">

<b>EXERCISE 3.3</b>: **Evaluate the variance of the performance measure for an increasing number of folds. Use as many folds as possible. Applying cross-validation with $n$ folds, where $n$ is the number of samples, is also known as leave-one-out cross-validation (LOOCV). Note that _StratifiedKFold()_ does not allow you to perform LOOCV: to preserve the percentage of samples per class, every fold must contain samples of each class, which limits a balanced two-class dataset to $n/2$ folds. Use the [_LeaveOneOut_](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html) class in Scikit-learn instead. Think about the (dis)advantages of using few or many folds.**
</div> 
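To show what `LeaveOneOut` produces, here is a minimal sketch on a toy array of five samples (the array is an assumption for illustration, not the promoter data).

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy data: 5 samples with 2 features each
X_toy = np.arange(10).reshape(5, 2)

loo = LeaveOneOut()
n_folds = loo.get_n_splits(X_toy)  # one fold per sample
print(f"number of folds: {n_folds}")

splits = list(loo.split(X_toy))
for train_idx, test_idx in splits:
    # Each test set holds exactly one sample; the rest is used for training
    print("train:", train_idx, "test:", test_idx)
```

With a single test sample per fold, the accuracy of each fold is either 0 or 1, so it is the average over all folds that carries the information.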

In [None]:
n_splits = "..."

"..." # TODO: explore the effect of the number of folds on the variance of the scores