# Computing the p-value using permutation tests

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["font.size"] = 14

First, let us generate a simple synthetic dataset and load it into a pandas DataFrame.

In [None]:
df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))
print(df)

This is our dataset for the moment. A question that commonly arises is: are the means of the two populations the same? Let us first see what the sample means are.

In [None]:
df.mean()

A somewhat restrictive way of answering this question is to perform a $t$-test. The $t$-statistic for a two-sample test, assuming the two populations have equal variance and both samples have the same size $n$, is

$$t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{\sigma^2_A + \sigma^2_B}{n}}}$$

where $\bar{x}$ and $\sigma^2$ are the mean and the unbiased variance of each sample.

**Exercise**: compute the $t$ statistic for the sample above using DataFrame methods.

In [None]:
tval =

In [None]:
# Run this cell after you're done to see the solution.
%load permutation1.py
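One possible way to compute the statistic is sketched below. It is not necessarily what `permutation1.py` contains; the names `df_demo` and `tval_demo` and the seeded data are assumptions made so that the example is self-contained and reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's `df`, seeded for reproducibility.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.standard_normal((10, 2)), columns=list("AB"))

n = df_demo.shape[0]        # number of samples in each group
means = df_demo.mean()      # per-column sample means
variances = df_demo.var()   # per-column unbiased variances (pandas uses ddof=1)

# Equal-variance two-sample t statistic for two groups of equal size n
tval_demo = (means["A"] - means["B"]) / np.sqrt((variances["A"] + variances["B"]) / n)
print(tval_demo)
```

Note that `DataFrame.var` divides by $n - 1$ by default, whereas `np.var` divides by $n$ unless `ddof=1` is passed.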

We are now ready to compute the $p$-value for this statistic. Our null hypothesis $H_0$ is that the two means are the same; we will do a two-tailed test, which means the alternative hypothesis $H_A$ is that the two means are different. For that we need to look up the survival function of Student's $t$-distribution with $2n - 2$ degrees of freedom.

In [None]:
from scipy.stats import t
# Two samples of n rows each give 2n - 2 degrees of freedom
pval = 2 * t.sf(np.abs(tval), 2 * (df.shape[0] - 1))
print(pval)

We could also have done all of this with a black-box function.

In [None]:
from scipy.stats import ttest_ind
res = ttest_ind(df["A"], df["B"])
print(res)
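As a sanity check, the manual computation and `ttest_ind` can be compared on the same data. The sketch below assumes the equal-variance test with $n_A + n_B - 2$ degrees of freedom, which is what `ttest_ind` uses by default (`equal_var=True`); the arrays `a` and `b` are illustrative stand-ins for the two columns.

```python
import numpy as np
from scipy.stats import t, ttest_ind

# Illustrative data: two groups of equal size, seeded for reproducibility.
rng = np.random.default_rng(1)
a = rng.standard_normal(10)
b = rng.standard_normal(10)
n = len(a)

# Manual equal-variance t statistic and two-tailed p-value
t_manual = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / n)
p_manual = 2 * t.sf(np.abs(t_manual), 2 * n - 2)

# Library result for comparison
res_demo = ttest_ind(a, b)
print(t_manual, res_demo.statistic)
print(p_manual, res_demo.pvalue)
```

Both lines should print matching pairs of numbers, up to floating-point rounding.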

## Permutation test

The permutation test gives us an alternative way of computing the $p$-value. The first step is to enumerate all $\binom{20}{10}$ ways of splitting the 20 samples we have into two groups of 10.
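To get a feel for the size of this enumeration, the sketch below counts the splits with `math.comb` and lists them for a toy case of 4 samples in two groups of 2. The variable names are illustrative only.

```python
from itertools import combinations
from math import comb

# Number of ways to choose which 10 of the 20 samples go to the first group
print(comb(20, 10))  # 184756

# Toy illustration: 4 sample indices split into two groups of 2
labels = range(4)
splits = [
    (group_a, tuple(sorted(set(labels) - set(group_a))))
    for group_a in combinations(labels, 2)
]
for group_a, group_b in splits:
    print(group_a, group_b)
```

Each split appears twice in the list, once with the roles of the two groups swapped, which is harmless for a two-tailed test because only $|D|$ matters.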

First let us put all samples in a single array.

In [None]:
# Column-major order: the 10 values of A come first, then the 10 values of B
samples = df.to_numpy().ravel(order="F")
print(samples)

Now we compute the difference between the means in the original set

$$D_0 = \bar{x}_A - \bar{x}_B$$

and then the same quantity $D$ for every combination. Finally, we count the combinations for which this quantity is larger than $D_0$. In practice, for a two-tailed test, one should compare absolute values, that is

$$p = \frac{\#\{|D| > |D_0|\}}{\binom{20}{10}}$$

**Exercise**: compute the quantity above. Tip: use the `combinations` function from the `itertools` module.

In [None]:
from itertools import combinations

In [None]:
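# %load permutation2.py
all_indices = set(range(20))
d = []
for p in combinations(range(20), 10):
    # Indices in p form group A; the remaining indices form group B
    p_tilde = all_indices.difference(p)
    A = samples[list(p)]
    B = samples[list(p_tilde)]
    d.append(np.mean(A) - np.mean(B))

# Observed difference between the means of the original groups
d0 = df["A"].mean() - df["B"].mean()
# Fraction of splits whose |D| exceeds the observed |D0|
print(np.mean(np.abs(d) > np.abs(d0)))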

Note that the value obtained is very close to the one we got using the $t$-statistic. Here, however, we made fewer assumptions: the permutation test does not require the samples to be normally distributed.
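The enumeration can also be wrapped in a small reusable function and checked on a case that is easy to verify by hand. This is a minimal sketch; the function name `exact_permutation_pvalue` is hypothetical, and it keeps the strict inequality used in the formula above.

```python
from itertools import combinations

import numpy as np


def exact_permutation_pvalue(a, b):
    """Two-tailed exact permutation p-value for the difference of means."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled = np.concatenate([a, b])
    n_total, n_a = len(pooled), len(a)
    d0 = a.mean() - b.mean()  # observed difference

    diffs = []
    for idx in combinations(range(n_total), n_a):
        mask = np.zeros(n_total, dtype=bool)
        mask[list(idx)] = True  # True marks the members of group A
        diffs.append(pooled[mask].mean() - pooled[~mask].mean())

    # Fraction of splits with |D| strictly greater than |D0|
    return np.mean(np.abs(diffs) > np.abs(d0))


print(exact_permutation_pvalue([1, 4, 5], [2, 3, 6]))
```

For this toy input there are 20 splits, and 6 of them have group sums of 10 and 11, giving $|D| = |D_0| = 1/3$; the remaining 14 are more extreme, so the result is $14/20 = 0.7$.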

Finally, we can plot the histogram of $D$ and mark $\pm D_0$ with red lines; the $p$-value is the fraction of the histogram lying beyond them.

In [None]:
plt.hist(d, bins=30)
plt.axvline(d0, color="r", lw=2)
plt.axvline(-d0, color="r", lw=2)