# T-Test

Code to test whether the difference in means between two normally distributed datasets is statistically significant, using a two-sample t statistic.

http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-t-tests-t-values-and-t-distributions

In [32]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

In [27]:
#Create 2 normally distributed datasets with N points each (datasetA is shifted to mean 2)
N = 50
datasetA = np.random.randn(N) + 2
datasetB = np.random.randn(N)
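
Because the cell above draws fresh random numbers on every run, the printed t and p values further down will not reproduce exactly. A minimal sketch of a repeatable version, assuming NumPy's `default_rng` generator is available and using an arbitrary seed of 0 (the seed is an assumption, not part of the original notebook):

```python
import numpy as np

# Hypothetical seed; any fixed integer makes the draws repeatable.
rng = np.random.default_rng(seed=0)

N = 50
# Same construction as above: unit-variance normal samples, with A shifted by +2.
datasetA = rng.standard_normal(N) + 2
datasetB = rng.standard_normal(N)

print(datasetA.mean(), datasetB.mean())
```

With a fixed seed the sample means stay the same between runs, so any downstream t and p values can be compared across sessions.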

In [28]:
#Calculate the t statistic
#We want the unbiased variance, which is why we use ddof=1 ("Delta Degrees of Freedom")
varA = datasetA.var(ddof=1)
varB = datasetB.var(ddof=1)
#Pooled standard deviation (this simple average is valid because both samples have size N)
s = np.sqrt((varA + varB) / 2)
t = (datasetA.mean() - datasetB.mean()) / (s * np.sqrt(2.0 / N))
#Degrees of freedom
df = 2*N - 2
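
The formula above relies on both samples having the same size N. As a rough sketch of how the same pooled-variance idea extends to unequal sample sizes, the helper below (`pooled_t_statistic` is a hypothetical name, not something from the notebook or from SciPy) weights each variance by its own degrees of freedom:

```python
import numpy as np

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (assumes equal population variances)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    df = n_a + n_b - 2
    # Weighted average of the unbiased sample variances.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / df
    t = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return t, df

# Illustrative data with different sample sizes (seed and sizes are arbitrary).
rng = np.random.default_rng(0)
a = rng.standard_normal(30) + 2
b = rng.standard_normal(50)
t, df = pooled_t_statistic(a, b)
print("t:%s \t df:%s" % (t, df))
```

When both samples have the same size, the weighted average reduces to `(varA + varB) / 2` and the denominator to `s * np.sqrt(2.0 / N)`, which is the expression used in the cell above.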

In [29]:
#The p-value: the probability of a result at least this extreme if the null hypothesis is true
p = 1 - stats.t.cdf(np.abs(t), df=df)   #cdf is the cumulative distribution function: P(T <= t), integrated from -infinity to t
#Because the t distribution is symmetric, we multiply the one-tail probability by 2 for a two-sided test
print("t:%s \t p:%s" % (t, 2*p))

t:8.88204886616 	 p:3.21964677141e-14
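
For a p-value this small, `1 - cdf` subtracts two nearly equal numbers and loses precision. One possible alternative is the survival function `stats.t.sf`, which returns the upper-tail probability directly. The sketch below plugs in the t value printed above and df = 98 (from N = 50) as fixed inputs; the variable names are illustrative only:

```python
import numpy as np
from scipy import stats

t_value = 8.88204886616  # t statistic printed above
df = 98                  # 2*N - 2 with N = 50

# Survival function: P(T > |t|), computed without the 1 - cdf subtraction.
p_two_sided = 2 * stats.t.sf(np.abs(t_value), df=df)

# Same quantity via 1 - cdf, for comparison.
p_via_cdf = 2 * (1 - stats.t.cdf(np.abs(t_value), df=df))

print("sf:%s \t 1-cdf:%s" % (p_two_sided, p_via_cdf))
```

The two results should agree to a few significant digits; the `sf` version is the one less affected by rounding.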


The built-in t-test is two-sided by default, so there is no need to multiply its p-value by 2.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

In [31]:
#Built-in t-test; should give the same result (the small difference in p comes from rounding error in 1 - cdf above)
t2, p2 = stats.ttest_ind(datasetA, datasetB)    #two-sided p-value, so no need to multiply by 2
print("t:%s \t p:%s" % (t2, p2))

t:8.88204886616 	 p:3.21988253508e-14
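
Rather than comparing the printed numbers by eye, the agreement between the manual calculation and `stats.ttest_ind` can be checked numerically. This is a minimal sketch under a few assumptions: the data are regenerated with an arbitrary seed so the block runs on its own, the manual p-value uses `stats.t.sf` instead of `1 - cdf`, and `equal_var=True` (the default) is passed explicitly to make the pooled-variance assumption visible:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # hypothetical seed, only so the check is repeatable
N = 50
datasetA = rng.standard_normal(N) + 2
datasetB = rng.standard_normal(N)

# Manual pooled-variance t statistic, as in the cells above.
s = np.sqrt((datasetA.var(ddof=1) + datasetB.var(ddof=1)) / 2)
t = (datasetA.mean() - datasetB.mean()) / (s * np.sqrt(2.0 / N))
df = 2 * N - 2
p = 2 * stats.t.sf(np.abs(t), df=df)

# Built-in test; equal_var=True is the pooled-variance version.
t2, p2 = stats.ttest_ind(datasetA, datasetB, equal_var=True)

# atol=0 so that the comparison of very small p-values is purely relative.
print(np.isclose(t, t2), np.isclose(p, p2, rtol=1e-6, atol=0))
```

Setting `atol=0` matters here: the default absolute tolerance of `np.isclose` is far larger than the p-values involved and would make the comparison pass trivially.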
