# Tutorial

Import the necessary packages.

In [1]:
import parallelPermutationTest as ppt
import numpy as np
import pandas as pd

# Synthetic data: Integer data

The permutation test implemented here operates on integer data, so an integer dataset needs no binning pre-processing and can be passed directly to the integer methods, such as ppt.GreenIntCuda.

Let's construct a synthetic dataset of integers drawn from 0 to 499, with 500 elements per sample.

In [7]:
n_samples = 1
n = m =  500

In [8]:
data = lambda n,n_samples : np.asarray([np.random.randint(0,n,n,dtype=np.int32) for _ in range(n_samples)])

In [9]:
np.random.seed(1)
A,B = data(n,1), data(n,1)
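
Before timing the fast implementations, it may help to see what an exact permutation test computes. The following is a minimal brute-force sketch in plain Python, not the ppt implementation. The function name `exact_perm_pvalue` is hypothetical, and the two-sided difference-in-means statistic is an assumption, since the statistic used by ppt is not stated here. It enumerates every split of the pooled sample, so it is only feasible for tiny inputs.

```python
from itertools import combinations


def exact_perm_pvalue(a, b):
    """Two-sided exact permutation p-value for the difference in means.

    Brute force: enumerates every way of choosing len(a) values out of the
    pooled sample, so it is only feasible for very small inputs.
    """
    pooled = list(a) + list(b)
    n, total = len(a), sum(pooled)
    m = len(pooled) - n

    def abs_mean_diff(sum_a):
        # |mean of first group - mean of second group| for a given group sum
        return abs(sum_a / n - (total - sum_a) / m)

    observed = abs_mean_diff(sum(a))
    hits = count = 0
    for idx in combinations(range(len(pooled)), n):
        s = sum(pooled[i] for i in idx)
        count += 1
        # Small tolerance so float rounding does not drop exact ties
        if abs_mean_diff(s) >= observed - 1e-12:
            hits += 1
    return hits / count


print(exact_perm_pvalue([1, 2, 3], [4, 5, 6]))  # 2 of 20 splits are as extreme: 0.1
```

With 500 elements per sample the number of splits is astronomically large, which is why the algorithms below avoid explicit enumeration.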

The shift algorithm implemented in the R package coin (https://cran.r-project.org/web/packages/coin/index.html) is probably the fastest permutation test available today. We have ported a slightly faster version of it to Python.

In [10]:
%time p_shift = ppt.CoinShiftInt(A,B)

CPU times: user 24.3 s, sys: 253 ms, total: 24.6 s
Wall time: 24.4 s
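
To give an intuition for why integer data matters, here is a minimal sketch of the counting recurrence behind shift-type algorithms. It is not the code used by coin or by ppt.CoinShiftInt, and the names `subset_sum_counts` and `shift_pvalue` are hypothetical. Instead of enumerating all splits, it counts how many subsets of a given size reach each possible sum; the sums serve as table indices, which only works for non-negative integers. A two-sided difference-in-means statistic is assumed.

```python
from math import comb


def subset_sum_counts(values, k):
    """counts[s] = number of size-k subsets of `values` whose sum is s.

    `values` must be non-negative integers.
    """
    max_sum = sum(values)
    # table[j][s]: number of size-j subsets with sum s among the values seen so far
    table = [[0] * (max_sum + 1) for _ in range(k + 1)]
    table[0][0] = 1
    for z in values:
        # Go from large j to small j so each value is used at most once
        for j in range(k, 0, -1):
            prev, cur = table[j - 1], table[j]
            for s in range(max_sum, z - 1, -1):
                cur[s] += prev[s - z]
    return table[k]


def shift_pvalue(a, b):
    """Two-sided exact p-value for the difference in means (integer data)."""
    pooled = list(a) + list(b)
    n, n_total, total = len(a), len(pooled), sum(pooled)
    counts = subset_sum_counts(pooled, n)
    # mean_a - mean_b is proportional to s * n_total - n * total, where s is the
    # sum of the first group, so the comparison can stay in exact integers.
    observed = abs(sum(a) * n_total - n * total)
    hits = sum(c for s, c in enumerate(counts)
               if abs(s * n_total - n * total) >= observed)
    return hits / comb(n_total, n)


print(shift_pvalue([1, 2, 3], [4, 5, 6]))  # 0.1
```

The table has one row per subset size and one column per possible sum, so its size depends on the range of the integers rather than on the number of permutations.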


Green's algorithm is a slight variation of the shift algorithm. Unfortunately, on a single thread it is slower than the shift algorithm. However, it has the advantage of being parallelizable. Let's first check the single-threaded version.

In [11]:
%time p_green = ppt.GreenInt(A,B)

CPU times: user 1min 12s, sys: 388 ms, total: 1min 12s
Wall time: 1min 12s


So when only a single thread is available, we recommend ppt.CoinShiftInt over ppt.GreenInt. With several threads, however, Green's algorithm starts to shine. Let us look at the multithreaded version of Green's algorithm. Here we use an Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, which has eight threads.

In [12]:
%time p_green_mt = ppt.GreenIntMultiThread(A,B)

CPU times: user 3min 16s, sys: 782 ms, total: 3min 17s
Wall time: 25 s


Quite a large speed-up over ppt.GreenInt (wall time down from 1min 12s to 25 s), but still only on par with ppt.CoinShiftInt. Let us use more threads! We can use a GPU for this; here we use a GeForce RTX 2070.

In [13]:
%time p_green_gpu = ppt.GreenIntCuda(A,B)

CPU times: user 4.73 s, sys: 96.1 ms, total: 4.82 s
Wall time: 4.83 s


Great! The run-time is roughly five times shorter than with ppt.CoinShiftInt (4.83 s versus 24.4 s).
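
The speed-ups can be read off the wall times reported by the `%time` outputs above. The short snippet below only redoes that arithmetic; the helper name `speedup` is ours, and the numbers are copied from this particular run, so they will differ on other hardware.

```python
# Wall-clock times in seconds, copied from the %time outputs above
wall = {
    "CoinShiftInt": 24.4,
    "GreenInt": 72.0,
    "GreenIntMultiThread": 25.0,
    "GreenIntCuda": 4.83,
}


def speedup(baseline, candidate):
    """How many times faster `candidate` ran compared with `baseline`."""
    return wall[baseline] / wall[candidate]


print(f"GPU vs CoinShiftInt:   {speedup('CoinShiftInt', 'GreenIntCuda'):.1f}x")
print(f"GPU vs GreenInt:       {speedup('GreenInt', 'GreenIntCuda'):.1f}x")
print(f"8 threads vs GreenInt: {speedup('GreenInt', 'GreenIntMultiThread'):.1f}x")
```

On this machine the GPU version is about 5 times faster than ppt.CoinShiftInt and about 15 times faster than the single-threaded ppt.GreenInt.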

Let us ensure that they all yield the same result.

In [14]:
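# np.allclose compares exactly two arrays (its 3rd and 4th arguments are rtol and atol),
# so check every result against p_shift in turn.
all(np.allclose(p_shift, p) for p in (p_green, p_green_mt, p_green_gpu))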

True

# Real data: Finance

This example is based on https://www.datacamp.com/community/tutorials/stocks-significance-testing-p-hacking.

### The claim is: "Over the past 32 years, October has been the most volatile month on average for the S&P500 and December the least volatile". 


### Let's check if this is statistically significant.

We have to download and pre-process the data.

In [15]:
#Daily S&P500 data from 1986==>
url = "https://raw.githubusercontent.com/Patrick-David/Stocks_Significance_PHacking/master/spx.csv"
df = pd.read_csv(url,index_col='date', parse_dates=True)



In [16]:
#To model returns we will use daily % change
daily_ret = df['close'].pct_change()
#drop the 1st value - nan
daily_ret.dropna(inplace=True)


In [17]:
mnthly_annu = daily_ret.resample('M').std()* np.sqrt(12)

In [18]:
dec_vol = mnthly_annu[mnthly_annu.index.month==12]
rest_vol = mnthly_annu[mnthly_annu.index.month!=12]

In [19]:
dec_vol.head(2)

date
1986-12-31    0.026474
1987-12-31    0.061435
Name: close, dtype: float64

In [20]:
(dec_vol.values.shape, rest_vol.values.shape)

((32,), (358,))

Here we have float data, i.e., real values, so we cannot use ppt.GreenIntCuda. We first have to pre-process the data with a binning procedure that maps all values into integer bins. Let us take 500 bins.

In [21]:
n_bins = 500

In [22]:
%time p = ppt.GreenFloatCuda(dec_vol.values, rest_vol.values, n_bins)

CPU times: user 5.08 ms, sys: 7 µs, total: 5.08 ms
Wall time: 4.37 ms
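
To illustrate what a binning procedure does, here is a minimal sketch in plain Python. How ppt.GreenFloatCuda bins internally is not shown in this tutorial, so equal-width bins over the pooled range of both samples are an assumption, the function name `bin_to_int` is hypothetical, and the input values are toy numbers.

```python
def bin_to_int(a, b, n_bins):
    """Map two float samples onto shared integer bins 0 .. n_bins - 1."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / n_bins or 1.0  # guard against a constant sample

    def to_bin(x):
        # The maximum value would land in bin n_bins, so clamp it
        return min(int((x - lo) / width), n_bins - 1)

    return [to_bin(x) for x in a], [to_bin(x) for x in b]


# Toy values, not the S&P500 data
a_int, b_int = bin_to_int([0.026, 0.061, 0.045], [0.030, 0.120, 0.075], n_bins=5)
print(a_int, b_int)
```

Both samples share the same bin edges, so the ordering between the two groups is preserved up to the bin resolution. More bins give a finer approximation of the float data at the cost of a larger integer range.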


In [23]:
p

array([0.36120981])

With p ≈ 0.36, the claim that December is the least volatile month does not appear to be statistically significant.

# Real data: Biomedical data

### We want to see if there are any significant genes in breast cancer patients.

Let us import the pre-processed data from the Experiment 6 notebook.

In [24]:
NotTNP_df = pd.read_csv("experiment_data/experiment6/notTNPdf")
TNP_df = pd.read_csv("experiment_data/experiment6/TNPdf")

In [25]:
(TNP_df.shape, NotTNP_df.shape)

((8051, 26), (8051, 80))

In [26]:
TNP_df.head(2)

Unnamed: 0,A2-A0CM.07TCGA,A2-A0D2.31TCGA,A2-A0SX.36TCGA,A2-A0YM.36TCGA,A7-A0CE.13TCGA,AN-A0AL.28TCGA,AN-A0FL.19TCGA,AO-A0J6.11TCGA,AO-A0JL.35TCGA,AO-A12F.22TCGA,...,C8-A134.32TCGA,D8-A142.18TCGA,E2-A158.29TCGA,A2-A0D0.06TCGA,A2-A0T2.21TCGA,AR-A1AQ.34TCGA,BH-A0E0.10TCGA,BH-A18V.12TCGA,E2-A150.27TCGA,E2-A159.24TCGA
0,0.683404,0.107491,-0.39856,0.65585,-1.123173,0.323663,2.455138,0.831132,-0.10668,-1.947792,...,0.140182,0.538596,-1.086529,-2.579532,-0.642941,-2.367201,-2.316136,0.500384,-1.44604,-1.463032
1,0.694424,0.104164,-0.392601,0.658143,-1.123173,0.326973,2.480137,0.85654,-0.10668,-1.952718,...,0.126054,0.542211,-1.095492,-2.536336,-0.600538,-2.312576,-2.280216,0.520345,-1.380212,-1.404991


This dataset is quite large (8051 experiments) and contains floating-point values. Let us take 100 bins. However, we have to be careful not to exceed the GPU's memory, so let us run a memory check first.

In [27]:
n_bins = 100

In [28]:
ppt.GreenFloatCuda_memcheck(TNP_df.values, NotTNP_df.values, n_bins)



We need to divide our data into batches.

In [29]:
batch_size = int(TNP_df.shape[0] / 4)

In [30]:
%time p_values = ppt.GreenFloatCuda(TNP_df.values, NotTNP_df.values, n_bins, batch_size=batch_size)

CPU times: user 4.21 s, sys: 524 ms, total: 4.74 s
Wall time: 4.72 s
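
As a rough picture of what batching means here, the sketch below splits the 8051 rows into chunks of at most `batch_size` rows. It is a generic illustration with a hypothetical helper `batch_slices`; how ppt.GreenFloatCuda actually partitions the data internally is not shown in this tutorial.

```python
def batch_slices(n_rows, batch_size):
    """Yield slices that cover range(n_rows) in chunks of at most batch_size."""
    for start in range(0, n_rows, batch_size):
        yield slice(start, min(start + batch_size, n_rows))


n_rows = 8051
batch_size = n_rows // 4  # 2012, the same value as in the cell above
batches = list(batch_slices(n_rows, batch_size))
print(len(batches), batches[-1])
```

Because 8051 is not divisible by 4, a split of this kind produces four full batches plus a fifth batch holding the remaining 3 rows.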
