# Hypothesis Testing
In the previous module, we studied how a single population varies over a single variable, such as voting preference or visitation frequency. Now we will study relationships between multiple variables. The most famous example is the classical randomized controlled trial, which has a "treatment" group and a "control" group, and in which we must determine whether there are significant differences between the two groups.

## Programming
This topic forces us to talk about data organization. We are no longer working with a simple "list" of numbers. Increasingly, data are organized into "data frames". A data frame is a table, or two-dimensional array-like structure, in which each column contains the values of one variable and each row contains one value from each column. The data stored in a data frame can be of numeric, factor (categorical), or character type. We will introduce Python pandas DataFrames, which are widely used in data science practice.
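
To make the row and column picture concrete, here is a minimal sketch that builds a tiny data frame by hand. The name `toy` and its values are made up for illustration; they are not the contents of `data.csv`.

```python
import pandas as pd

# Hypothetical toy data: a grouping column and a numeric outcome column.
toy = pd.DataFrame({
    "group": ["t", "t", "c", "c"],
    "outcome": [8.0, 4.0, 1.0, 6.0],
})

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # each column carries its own type
```

Each dictionary key becomes a column (one variable), and each position across the lists becomes a row (one observation).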

In [22]:
import pandas
df = pandas.read_csv('data.csv')
df.describe()

Unnamed: 0,outcome
count,32.0
mean,7.71875
std,8.023813
min,1.0
25%,1.75
50%,6.5
75%,10.25
max,41.0


In [2]:
df.head(10)  # show the first 10 rows

Unnamed: 0,group,outcome
0,t,8.0
1,t,4.0
2,t,7.0
3,t,9.0
4,t,3.4
5,t,8.4
6,t,10.0
7,t,15.0
8,t,21.0
9,t,41.0


The dataset above has two columns: one indicating a grouping and one indicating an outcome. Let's do some analysis on this data. I can retrieve a specific column by name:

In [24]:
df['outcome']

0      8.0
1      4.0
2      7.0
3      9.0
4      3.4
5      8.4
6     10.0
7     15.0
8     21.0
9     41.0
10    16.0
11    11.0
12     9.0
13     7.0
14     3.2
15     1.0
16     4.0
17     6.0
18    11.0
19     2.0
20    14.0
21     4.0
22     1.0
23     1.0
24     1.0
25    13.0
26     8.0
27     1.0
28     1.0
29     1.0
30     1.0
31     4.0
Name: outcome, dtype: float64

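I can also retrieve specific rows, either by position with `iloc` or by building a boolean mask that marks the rows satisfying a condition: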

In [27]:
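df.iloc[22]  # the row at position 22 (not displayed: only the last expression in a cell is shown)
df['group'] == 't'  # boolean mask, True for rows in group 't'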

0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
Name: group, dtype: bool
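
The two styles of row selection can be sketched on a small made-up frame. The frame `toy` and the variable names below are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data, not the contents of data.csv.
toy = pd.DataFrame({
    "group": ["t", "t", "c", "c"],
    "outcome": [8.0, 4.0, 1.0, 6.0],
})

row = toy.iloc[2]            # a single row by position, returned as a Series
mask = toy["group"] == "c"   # boolean Series: True where group is 'c'
n_control = int(mask.sum())  # True counts as 1, so the sum counts matching rows

print(row["outcome"], n_control)
```

A mask does not remove anything by itself. It only records, row by row, whether the condition holds.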

I can chain these steps together, using the boolean mask to filter the data frame:

In [28]:
treatment = df[df['group'] == 't']  # keep only the rows in group 't'
treatment.head(10)  # show the first 10 rows

Unnamed: 0,group,outcome
0,t,8.0
1,t,4.0
2,t,7.0
3,t,9.0
4,t,3.4
5,t,8.4
6,t,10.0
7,t,15.0
8,t,21.0
9,t,41.0


In [29]:
control = df[df['group'] == 'c']

In [30]:
control.head(10)  # show the first 10 rows

Unnamed: 0,group,outcome
15,c,1.0
16,c,4.0
17,c,6.0
18,c,11.0
19,c,2.0
20,c,14.0
21,c,4.0
22,c,1.0
23,c,1.0
24,c,1.0


In [8]:
control.describe()

Unnamed: 0,outcome
count,17.0
mean,4.352941
std,4.499183
min,1.0
25%,1.0
50%,2.0
75%,6.0
max,14.0


In [9]:
treatment.describe()

Unnamed: 0,outcome
count,15.0
mean,11.533333
std,9.490948
min,3.2
25%,7.0
50%,9.0
75%,13.0
max,41.0
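
The same kind of per-group summary can also be produced in one step with `groupby`, which this notebook does not otherwise use. The sketch below runs on a made-up frame (`toy` is hypothetical), so the numbers differ from the tables above.

```python
import pandas as pd

# Hypothetical toy data, not the contents of data.csv.
toy = pd.DataFrame({
    "group": ["t", "t", "t", "c", "c", "c"],
    "outcome": [8.0, 4.0, 9.0, 1.0, 4.0, 1.0],
})

# One row per group, one column per statistic.
summary = toy.groupby("group")["outcome"].agg(["count", "mean", "std"])
print(summary)
```

This avoids building a separate filtered frame for every group, which helps when there are more than two groups.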


The control group has a mean of about 4.35 and the treatment group has a mean of about 11.53. Did this difference arise by chance?

## Simulation
To drive this point home, consider the following simulation. We draw two independent numbers from the same standard normal distribution and measure their difference:

In [33]:
import numpy as np
n1,n2 = np.random.randn(2,1)
n1-n2

array([2.68565754])

Run this simulation many times and count how often the absolute difference exceeds a threshold:

In [34]:
def simulate(thresh, trials=100):
    cnt = 0
    
    for t in range(trials):
        n1,n2 = np.random.randn(2,1)
        if np.abs(np.squeeze(n1-n2)) > thresh:
            cnt += 1
    
    # Prints the fraction of trials (cnt/trials) above the threshold, not the raw count
    print('# Times Greater than ', thresh, cnt/trials)

simulate(1)
simulate(2)
simulate(3)

# Times Greater than  1 0.53
# Times Greater than  2 0.21
# Times Greater than  3 0.03


Large differences occur more often than you might expect! This sets up the concept of a two-sample hypothesis test. We want to quantify the probability that the observed difference in the means of the two populations could have arisen purely by chance.
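
As a cross-check on the simulation, here is a sketch that repeats it with many more trials and compares the result to the exact probability. It relies on the fact that the difference of two independent standard normal numbers is itself normal with mean 0 and variance 2. The function names, the seed, and the trial count are my own choices.

```python
import numpy as np
from scipy import stats


def exceed_fraction(thresh, trials=100_000, seed=0):
    """Simulated fraction of trials in which |n1 - n2| > thresh."""
    rng = np.random.default_rng(seed)
    n1 = rng.standard_normal(trials)
    n2 = rng.standard_normal(trials)
    return float(np.mean(np.abs(n1 - n2) > thresh))


def exact_probability(thresh):
    """Exact P(|n1 - n2| > thresh), using n1 - n2 ~ Normal(0, variance 2)."""
    return float(2 * stats.norm.sf(thresh / np.sqrt(2)))


for thresh in (1, 2, 3):
    print(thresh, exceed_fraction(thresh), exact_probability(thresh))
```

With only 100 trials the simulated fractions bounce around quite a bit, and with more trials they settle close to the exact values.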

## Null Hypothesis
The null hypothesis is a general statement or default position that there is nothing new happening, such as no association among groups or no relationship between two measured phenomena. We start by computing the same statistics that we needed before to calculate confidence intervals.

In [35]:
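# numeric_only=True skips the non-numeric 'group' column (recent pandas versions raise an error otherwise)
c_mean, c_std, c_count = control.mean(numeric_only=True), control.std(numeric_only=True), control.count()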

In [36]:
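# numeric_only=True skips the non-numeric 'group' column (recent pandas versions raise an error otherwise)
t_mean, t_std, t_count = treatment.mean(numeric_only=True), treatment.std(numeric_only=True), treatment.count()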

In [37]:
diff_mean = t_mean - c_mean
print(diff_mean)

outcome    7.180392
dtype: float64


In [15]:
diff_std = np.sqrt( (t_std/np.sqrt(t_count))**2 +  (c_std/np.sqrt(c_count))**2)
print(diff_std)

group           NaN
outcome    2.682527
dtype: float64
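
The formula in the cell above can be checked by hand with plain numbers. This is a small sketch that plugs in the standard deviations and counts from the two `describe()` tables. The function name `se_of_difference` is made up for illustration.

```python
import math


def se_of_difference(std1, n1, std2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(std1**2 / n1 + std2**2 / n2)


# Treatment: std 9.490948 over 15 rows; control: std 4.499183 over 17 rows.
se = se_of_difference(9.490948, 15, 4.499183, 17)
print(se)
```

Working with scalars also avoids the `NaN` entry for `group`, which appears above only because `count()` still includes the non-numeric column.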


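Under the null hypothesis, the difference in means follows a normal distribution with mean 0 and the standard deviation computed above. We can quantify the probability that a difference as large as 7.18 could arise by first expressing it in units of that standard deviation: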

In [38]:
import scipy.stats as st

print(diff_mean/diff_std)

group           NaN
outcome    2.676727
dtype: float64


In [20]:
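# Index 1 selects the 'outcome' entry (index 0 is the NaN for 'group'); this is a one-sided tail probability
print(1 - st.norm.cdf(diff_mean/diff_std)[1])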

0.0037172587728658835


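This gives us a "p-value": the probability, assuming the null hypothesis is true, of observing a difference at least this large. Here it is about 0.0037, so a difference this large would rarely arise by chance alone.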
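
The value above is a one-sided tail probability: it only counts differences in the positive direction. As a sketch, the same calculation can be written with `scipy.stats.norm.sf` (the survival function, equal to 1 minus the CDF), together with the two-sided version that counts large differences in either direction. The variable names are my own, and the inputs are the rounded values printed earlier.

```python
from scipy import stats

# Difference in means divided by its standard error, using the values printed above.
z = 7.180392 / 2.682527

p_one_sided = stats.norm.sf(z)           # P(Z > z)
p_two_sided = 2 * stats.norm.sf(abs(z))  # P(|Z| > |z|)

print(p_one_sided, p_two_sided)
```

Which version to report depends on the question. If a treatment effect in either direction would count as interesting, the two-sided value is the one to use.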