# Performing a T-Test
---
**Author:** Robert Kelley  
**Version:** 1.0  
**Semester:** Spring 2021  
**Summary:**  

I developed this notebook to demonstrate how to use NumPy and scipy.stats to run a t-test. The original article from which I derived this code can be found at: https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f.

## Import Packages

In [1]:
import numpy as np
from scipy import stats

## Define random distributions
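Define two random samples and set the sample size. The original article used 10 observations per sample; this notebook uses 30. The t-test was designed for small samples, so it applies at either size as long as the data are roughly normally distributed.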

In [7]:
N = 30
# Gaussian-distributed data with mean = 2 and var = 1
a = np.random.randn(N) + 2
# Gaussian-distributed data with mean = 0 and var = 1
b = np.random.randn(N)
print('The Distributions are:')
print('A:', a)
print('B:', b)

The Distributions are:
A: [ 2.634866    1.0478871   2.23890356  1.70962278  1.41299146  2.21346797
  2.20635029 -0.14278988  0.50220704  2.69980627  3.75547318  1.5556421
  1.63806643  0.77458061  1.31474997  1.57162656  2.50160824  3.13434712
  1.43535674  2.50905948  1.61208325  0.75168194  3.1705244   2.24861477
  0.62126949  1.49071959  0.81971374  3.44813792  2.96878878  3.55443862]
B: [ 1.1833217   0.68586386  0.60076522 -2.88767256 -0.59511918  0.86401639
 -1.31610959 -0.39610883 -0.92343755  0.63459437 -0.25882198  2.0993389
 -1.22420063  0.6537706  -1.11044033  0.41004048  0.04786377  1.33155102
  0.16082557 -1.06911373 -0.05778172  0.34485765  1.23841149 -1.21104328
  0.1625334  -1.78399631  1.13882417 -0.39679161  0.27445829  0.86291582]
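Because the samples are drawn without a seed, every run of the cell above produces different numbers. A minimal sketch of a reproducible variant follows, assuming NumPy's `default_rng` generator; the seed value 42 is arbitrary and is not taken from the original notebook, so its output will not match the arrays printed above.

```python
import numpy as np

# Seeded generator: the seed (42) is an arbitrary choice for illustration
rng = np.random.default_rng(42)

N = 30
# Gaussian-distributed data with mean = 2 and standard deviation = 1
a = rng.normal(loc=2.0, scale=1.0, size=N)
# Gaussian-distributed data with mean = 0 and standard deviation = 1
b = rng.normal(loc=0.0, scale=1.0, size=N)

print('Sample mean of A:', a.mean())
print('Sample mean of B:', b.mean())
```

Re-running this cell with the same seed should give the same two samples each time.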


## Calculate the Standard Deviation
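Here we calculate the variance of each sample and then the pooled standard deviation `s`, which is the square root of the average of the two variances. Averaging the variances this way is valid because both samples have the same size N.
For an unbiased estimate of the variance we divide by N-1 rather than N, hence the parameter ddof = 1. (The maximum likelihood estimate divides by N and is biased.)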

In [8]:
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)

s = np.sqrt((var_a + var_b)/2)
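The simple average of the two variances used above relies on both samples having the same size. As a hedged sketch, the more general pooled standard deviation weights each variance by its degrees of freedom; the helper name `pooled_std` and the small arrays `x` and `y` are hypothetical and are not part of the original notebook.

```python
import numpy as np

def pooled_std(x, y):
    """Pooled standard deviation, assuming both groups share a common variance."""
    n_x, n_y = len(x), len(y)
    var_x = np.var(x, ddof=1)  # unbiased sample variance of x
    var_y = np.var(y, ddof=1)  # unbiased sample variance of y
    # Weight each variance by its degrees of freedom
    return np.sqrt(((n_x - 1) * var_x + (n_y - 1) * var_y) / (n_x + n_y - 2))

# Illustrative data only (equal sizes, so both formulas should agree)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

simple = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
print(pooled_std(x, y), simple)
```

When the two samples have the same length, the weighted formula reduces to the simple average, so the two printed values should match.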

## Calculate the T-Statistic
We calculate the t-statistic and the degrees of freedom, then obtain the two-sided p-value from the CDF of the t-distribution.

In [15]:
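# t-statistic for two equal-sized samples with pooled standard deviation s
t = (a.mean() - b.mean())/(s*np.sqrt(2/N))
# Degrees of freedom for two samples of size N each
df = 2*N - 2
# Upper-tail (one-sided) p-value; it is doubled when printed, which assumes t > 0
p = 1 - stats.t.cdf(t,df=df)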

print("df = " + str(df))
print("t = " + str(t))
print("p = " + str(2*p))

df = 58
t = 7.17847066837843
p = 1.4695462624558786e-09


The two-sided p-value is about 1.5e-09, far below the conventional 0.05 significance level, so we reject the null hypothesis that the two distributions have the same mean. The difference between the sample means is statistically significant. Note that a small p-value is strong evidence against the null hypothesis, not a proof.
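The calculation above doubles the upper-tail probability, which only gives a valid two-sided p-value when t is positive, and `1 - cdf` loses precision for very small tail probabilities. A minimal sketch of a sign-independent alternative uses the survival function `stats.t.sf` on the absolute value of t. The values of `t` and `df` below are copied from the printed output above; the variable names `p_two_sided` and `p_flipped` are illustrative.

```python
from scipy import stats

# Values copied from the notebook output above
t = 7.17847066837843
df = 58

# Two-sided p-value: twice the upper-tail probability of |t|
p_two_sided = 2 * stats.t.sf(abs(t), df=df)

# The same formula gives the same answer if the samples are swapped (t changes sign)
p_flipped = 2 * stats.t.sf(abs(-t), df=df)

print("p (two-sided) = " + str(p_two_sided))
print("p (sign flipped) = " + str(p_flipped))
```

Using `sf` rather than `1 - cdf` avoids subtracting two nearly equal numbers, which is likely why the manual p-value above differs from SciPy's in the later digits.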


## Cross Checking 
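Let's cross-check the result with SciPy's built-in function `stats.ttest_ind`, which by default assumes equal variances and returns a two-sided p-value, matching the manual calculation.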

In [14]:
t2, p2 = stats.ttest_ind(a,b)
print("t = " + str(t2))
print("p = " + str(p2))

t = 7.178470668378428
p = 1.4695461321187283e-09
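Comparing the printed numbers by eye works, but the agreement can also be checked programmatically. The sketch below is self-contained and draws its own seeded samples (the seed 0 and the `_manual` / `_scipy` variable names are assumptions for illustration), so its values will differ from those printed above.

```python
import numpy as np
from scipy import stats

# Seeded samples for illustration; not the data from the cells above
rng = np.random.default_rng(0)
N = 30
a = rng.normal(2.0, 1.0, N)
b = rng.normal(0.0, 1.0, N)

# Manual pooled-variance t-test for equal sample sizes
s = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
t_manual = (a.mean() - b.mean()) / (s * np.sqrt(2 / N))
p_manual = 2 * stats.t.sf(abs(t_manual), df=2 * N - 2)

# SciPy's independent two-sample t-test (equal variances by default)
t_scipy, p_scipy = stats.ttest_ind(a, b)

# Compare with a relative tolerance only, since the p-values are tiny
t_match = np.isclose(t_manual, t_scipy, rtol=1e-9, atol=0)
p_match = np.isclose(p_manual, p_scipy, rtol=1e-6, atol=0)
print("t matches:", t_match)
print("p matches:", p_match)
```

Setting `atol=0` matters here: the default absolute tolerance of `np.isclose` is larger than the p-values themselves, so it would report a match even for quite different values.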
