# Data Literacy Exercise 06

Machine Learning in Science, University of Tübingen, Winter Semester 2022



## Introduction

Permutation testing is a procedure for testing whether there is an association between two random variables $Z$ and $Y$. Given a set of pairs $D=\{(z_1, y_1), (z_2, y_2), \ldots, (z_N, y_N)\}$, we want to test whether $Z$ and $Y$ are statistically independent or not.

Permutation tests can be defined for different test statistics; as an example, we use $T(D) = \sum_i (z_i-y_i)^2$. The beauty of the permutation test is that there is a very simple way to compute the distribution of the test statistic under the null hypothesis: if $Z$ and $Y$ are independent (which they are under $H_0$), then it should not matter whether we compute $(z_i - y_i)$ or $(z_i - y_j)$ for an index $j$ drawn uniformly at random! One way to implement this is to permute the indices of the $y$'s and to compute $T(D^*)$ on the permuted data set $D^*$. By repeating this many times for different (uniformly) random permutations, we obtain a histogram of the test statistic under the null. Finally, we calculate the p-value as the fraction of permuted statistics that are at least as small as the one we observed (a small value of $T$ means that the $z_i$ are close to the $y_i$).
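The procedure described above can be sketched on synthetic data. This is only an illustration under stated assumptions: the data are made up (`z` is a noisy copy of `y`), the helper name `sum_squared_diff` is hypothetical, and the sketch uses NumPy's `default_rng` generator rather than the global seed used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic, purely illustrative data: z is a noisy copy of y,
# so Z and Y are dependent by construction.
N = 200
y = rng.normal(size=N)
z = y + rng.normal(scale=0.5, size=N)


def sum_squared_diff(z, y):
    """Test statistic T(D) = sum_i (z_i - y_i)^2 (hypothetical helper)."""
    return float(np.sum((z - y) ** 2))


# Statistic on the original pairing
t_obs = sum_squared_diff(z, y)

# Null distribution: recompute the statistic for randomly permuted y's
n_perm = 1000
t_null = np.empty(n_perm)
for k in range(n_perm):
    y_perm = rng.permutation(y)  # shuffled copy; y itself is left untouched
    t_null[k] = sum_squared_diff(z, y_perm)

# Small T means z is close to y, so count the permutations
# whose statistic is at least as small as the observed one.
p_value = float(np.mean(t_null <= t_obs))
print(f"T observed: {t_obs:.1f}, p-value: {p_value}")
```

Because `z` was constructed from `y`, the observed statistic should fall far below the permuted ones here; with independent `z` and `y` it would typically land somewhere inside the null histogram.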

As a concrete example, let's apply this to a regression task. In this case, we assume we have some algorithm that computed predictions $z_i$, and we want to see whether these predictions are closer to the targets $y_i$ than one would expect under the null hypothesis, using the test statistic introduced above.

In [None]:
import pandas as pd
import numpy as np
np.random.seed(42)

## Predicting bike rentals from weather

### 1. Load our dataset

We will be using the [Bike Sharing Dataset](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset) from the University of Porto to predict the number of rental bikes from the weather (you can download the .csv from Ilias).

In [None]:
# Load csv data using the pandas library
df = pd.read_csv("bikedata.csv")

In [None]:
df.head()

In [None]:
# Use the "temp" column as our features (X), and the "cnt" column for labels (y)
# Note: both should be NumPy arrays, since they are reshaped with .reshape() further below
X = ...
y = ...

### 2. Splitting into train and test sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

# Split the arrays into training and test sets, using 30% for test
X_train, X_test, y_train, y_test = ...

### 3. Linear regression

In [None]:
from sklearn.linear_model import LinearRegression

# Fit a linear regression model to the data
reg = ...

In [None]:
from sklearn.metrics import mean_squared_error

# Generate a vector of predictions
y_pred = ...

# Calculate the mean squared error (you can use the sklearn method above for this)
mse = ...

### 4. Implement permutation testing

Permutation testing is a procedure which allows us to measure how likely it is that the observed metric (in this case, the mean squared error) was obtained by chance. Here, the p-value is the fraction of random data sets, generated under a certain null hypothesis, on which the model performed as well as or better than it did on the true labels.

Here, our null hypothesis ($H_0$) is that there is no difference between the performance of our trained model and chance (random guessing). To test this hypothesis, we must create a null distribution from a set of random (i.e. permuted, or shuffled) data sets. We then calculate our p-value by comparing the error of our predictions on the true labels (calculated above) with the errors on the shuffled labels (chance level, to be filled in below).
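One practical detail when shuffling labels: `numpy.random.shuffle` works in place and returns `None`, whereas `numpy.random.permutation` returns a shuffled copy. The toy example below illustrates the difference; the array `labels` is a made-up stand-in for a column vector of test labels, not part of the exercise data.

```python
import numpy as np
from numpy.random import shuffle

np.random.seed(0)

# Toy column vector standing in for a set of test labels (illustrative only)
labels = np.arange(5).reshape(-1, 1)

# shuffle() reorders the rows of its argument in place and returns None,
# so work on a copy if the original order is still needed afterwards.
labels_copy = labels.copy()
result = shuffle(labels_copy)

# permutation() leaves its argument untouched and returns a shuffled copy.
shuffled = np.random.permutation(labels)

print(result)                # None
print(labels.ravel())        # original order is preserved
print(labels_copy.ravel())   # same values, shuffled order
print(shuffled.ravel())      # same values, shuffled order
```

Either function can be used inside the loop; the point is only that shuffling the true labels in place would overwrite them for any cell that runs afterwards.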

In [None]:
from numpy.random import shuffle

perm_scores = []

for i in np.arange(0, 1000):
    
    # Shuffle the labels
    ...
    
    # Calculate the MSE
    mse_perm = ...
    
    # Append the MSE to perm_scores
    ...

### 5. Plot

Plot a histogram of the test statistic under the null hypothesis. Additionally, plot a vertical line for the value of the observed test statistic.

In [None]:
import matplotlib.pyplot as plt

...

### 6. Calculate p-value

In [None]:
# Calculate the p-value
# This is the proportion of MSEs from our permuted labels that are lower than or equal to the MSE on the true labels
p_val = ...

In [None]:
p_val

If you get a p-value of 0, this means that your observed MSE was better than random guessing across all permutations. You would not report $p = 0$; instead, you would say that your p-value is smaller than the smallest nonzero value the test can resolve, i.e. $p<.001$ (since we do 1000 permutations, the resolution is $1/1000$).
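Because only finitely many permutations are drawn, the estimated p-value can be exactly 0 even though the true one is not. A common alternative convention, not the one used in this exercise, is to add one to both the count and the number of permutations. The sketch below compares the two estimates; the function name `permutation_p_value` and the example numbers are hypothetical.

```python
import numpy as np


def permutation_p_value(null_scores, observed, add_one=True):
    """Fraction of null scores that are <= observed (lower score = better).

    With add_one=True, one is added to both the count and the number of
    permutations, so the estimate can never be exactly zero.
    """
    null_scores = np.asarray(null_scores, dtype=float)
    n_extreme = int(np.sum(null_scores <= observed))
    if add_one:
        return (n_extreme + 1) / (null_scores.size + 1)
    return n_extreme / null_scores.size


# Made-up null scores that are all worse (larger) than the observed score
example_null = np.linspace(1.0, 2.0, 1000)
example_obs = 0.5

p_plain = permutation_p_value(example_null, example_obs, add_one=False)
p_corrected = permutation_p_value(example_null, example_obs, add_one=True)
print(p_plain, p_corrected)
```

With 1000 permutations and no permuted score at or below the observed one, the plain estimate is 0 and the corrected estimate is $1/1001$, which is consistent with reporting $p<.001$.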

### 7. Interpret your results

What statement can you make based on the p-value you obtained?

(We note that permutation testing is often presented as a way to test whether there is a difference in distribution between two sets of observations; see e.g. Wasserman 10.5, or the [Wikipedia entry](https://en.wikipedia.org/wiki/Permutation_test). Permutation tests can be used for either the purpose described here or for testing for a difference between groups, and the two are indeed related.)
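To illustrate the two-group variant mentioned in the note above, here is a minimal sketch on synthetic data. One way to see the relation is to treat the group label as $Z$ and the observation as $Y$: shuffling the pooled observations then amounts to permuting the group labels. The data, the group sizes, and the choice of the absolute difference in means as test statistic are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic groups whose means differ by construction (illustrative only)
group_a = rng.normal(loc=0.0, size=50)
group_b = rng.normal(loc=1.5, size=50)

# Test statistic: absolute difference between the group means
obs_diff = abs(group_a.mean() - group_b.mean())

# Under H_0 the group labels are exchangeable, so shuffle the pooled
# observations and split them into two groups of the original sizes.
pooled = np.concatenate([group_a, group_b])
n_a = group_a.size

n_perm = 2000
null_diffs = np.empty(n_perm)
for k in range(n_perm):
    shuffled = rng.permutation(pooled)
    null_diffs[k] = abs(shuffled[:n_a].mean() - shuffled[n_a:].mean())

# Here a LARGE statistic is evidence against H_0, so count permutations
# whose difference is at least as large as the observed one.
p_two_sample = float(np.mean(null_diffs >= obs_diff))
print(f"observed difference: {obs_diff:.2f}, p-value: {p_two_sample}")
```

Note that the direction of the comparison is reversed relative to the MSE case: there, small values of the statistic count as extreme, while here large values do.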

----------

## Predicting diabetes from a single feature

Similar to what you just did above, we will recreate the pipeline using different data.

### 1. Making our dataset

In [None]:
from sklearn import datasets

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature
diabetes_X = diabetes_X[:, np.newaxis, 0]  # TO BE CHANGED LATER ON

### 2. Splitting into train and test sets

In [None]:
# Split the data into training/testing sets
# In this case we just take the last 20 samples for test
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

### 3. Linear regression

In [None]:
reg = ...

In [None]:
diabetes_y_pred = ...
mse = ...

### 4. Implement permutation testing

In [None]:
perm_scores = []

for i in np.arange(0, 1000):
    
    # Shuffle the labels
    ...
    
    # Calculate the MSE
    mse_perm = ...
    
    # Append the MSE to perm_scores
    ...

### 5. Plot

Plot a histogram of the test statistic under the null hypothesis. Additionally, plot a vertical line for the value of the observed test statistic.

In [None]:
import matplotlib.pyplot as plt

...

### 6. Calculate p-value

In [None]:
p_val = ...

In [None]:
p_val

### 7. Interpret your results

What statement can you make based on the p-value you obtained?

### 8. Try again with a different feature

In [None]:
from sklearn import datasets
data = datasets.load_diabetes()

data['feature_names']

In the output above, we can see that the feature we used previously (index 0) was 'age'.

Copy and paste the code from "Predicting diabetes from a single feature" into the cell below, then change the "0" to a "2" (there is a comment in the code indicating where to do this). With this change, we use BMI ('bmi') as our feature instead of age.

In [None]:
# Paste code below
