## Two Sample Hypothesis Tests

### Matched Pairs

Blood samples from n = 10 people were sent to each of two laboratories (Lab 1 and Lab 2) for cholesterol determinations. The resulting data are summarized here:
- Sample 1 mean: 260.6
- Sample 2 mean: 275
- Mean of differences: -14.4
- Standard deviation of differences: 6.77

Is there a statistically significant difference, at the α = 0.01 level, between the (population) mean cholesterol levels reported by Lab 1 and Lab 2?
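Because each person's blood was analyzed by both laboratories, the test works on the per-person differences rather than on the two samples separately. As a sketch, write $d_i$ for the Lab 1 minus Lab 2 difference for person $i$ (this sign convention is an assumption, chosen because 260.6 - 275 = -14.4), $\bar{d}$ for the mean of the differences, and $s_d$ for their sample standard deviation (the 6.77 listed above is read here as that quantity). The hypotheses and test statistic are then:

```latex
H_0:\ \mu_d = 0 \qquad \text{vs.} \qquad H_1:\ \mu_d \neq 0

t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}, \qquad \text{df} = n - 1
```

Under $H_0$ the statistic follows a Student t distribution with $n - 1 = 9$ degrees of freedom, provided the differences are approximately normal.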

In [3]:
import numpy as np
import pandas as pd
from scipy import stats

In [4]:
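# Paired t-test computed from summary statistics (H0: mean difference = 0)
difference_mean = -14.4   # mean of the paired differences
difference_std = 6.77     # standard deviation of the paired differences
n = 10
# t statistic: mean difference divided by its standard error
statistic = (difference_mean - 0)/(difference_std/np.sqrt(n))
# Two-sided p-value, since the question asks about a difference in either direction
pval = 2 * stats.t.sf(np.abs(statistic), n-1)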
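The cell above does not display its results, so the decision at α = 0.01 is easy to miss. A minimal sketch of the full decision step follows, assuming 6.77 is the standard deviation of the paired differences; the names `t_stat`, `t_crit`, `p_two_sided`, and `reject_h0` are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

# Summary statistics from the example (paired differences)
difference_mean = -14.4
difference_std = 6.77   # assumed: standard deviation of the differences
n = 10
alpha = 0.01

# Standard error of the mean difference and the t statistic
standard_error = difference_std / np.sqrt(n)
t_stat = difference_mean / standard_error

# Two-sided p-value: probability of a |t| at least this large under H0
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Equivalent decision rule via the two-sided critical value
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

reject_h0 = p_two_sided < alpha
print(f"t = {t_stat:.3f}, critical value = ±{t_crit:.3f}")
print(f"two-sided p-value = {p_two_sided:.2e}, reject H0: {reject_h0}")
```

With these numbers the statistic lies well beyond the critical value, so the hypothesis of equal mean cholesterol readings is rejected at the 0.01 level.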


In [5]:
# 95% confidence interval for the mean difference
stats.t.interval(0.95, df=n-1, loc=difference_mean, scale=(difference_std/np.sqrt(n)))

(-19.242966253298913, -9.557033746701087)
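The interval above uses a 95% confidence level, while the question is posed at α = 0.01. A sketch of the matching 99% interval is given below, again assuming 6.77 is the standard deviation of the differences. The confidence level is passed positionally because its keyword name differs between SciPy versions.

```python
import numpy as np
from scipy import stats

difference_mean = -14.4
difference_std = 6.77   # assumed: standard deviation of the differences
n = 10
alpha = 0.01

standard_error = difference_std / np.sqrt(n)

# (1 - alpha) confidence interval for the mean difference, df = n - 1
low, high = stats.t.interval(1 - alpha, n - 1, loc=difference_mean, scale=standard_error)

# A two-sided test at level alpha rejects H0 exactly when 0 lies outside this interval
zero_inside = low <= 0 <= high
print(f"99% CI: ({low:.3f}, {high:.3f}); contains 0: {zero_inside}")
```

The interval is wider than the 95% one but still excludes zero, which agrees with rejecting the null hypothesis at the 0.01 level.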

#### Using data

In [7]:
bp = pd.read_csv('https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/data/module-2/blood_pressure.csv')
bp.head()


| Unnamed: 0 | before | after |
|---|---|---|
| 0 | 136.713072 | 92.432965 |
| 1 | 134.735618 | 105.022643 |
| 2 | 127.529115 | 82.242766 |
| 3 | 144.527126 | 93.607172 |
| 4 | 124.21472 | 103.212223 |


In [8]:
from scipy.stats import ttest_rel
help(ttest_rel)

Help on function ttest_rel in module scipy.stats.stats:

ttest_rel(a, b, axis=0, nan_policy='propagate')
    Calculate the t-test on TWO RELATED samples of scores, a and b.
    
    This is a two-sided test for the null hypothesis that 2 related or
    repeated samples have identical average (expected) values.
    
    Parameters
    ----------
    a, b : array_like
        The arrays must have the same shape.
    axis : int or None, optional
        Axis along which to compute test. If None, compute over the whole
        arrays, `a`, and `b`.
    nan_policy : {'propagate', 'raise', 'omit'}, optional
        Defines how to handle when input contains nan.
        The following options are available (default is 'propagate'):
    
          * 'propagate': returns nan
          * 'raise': throws an error
          * 'omit': performs the calculations ignoring nan values
    
    Returns
    -------
    statistic : float or array
        t-statistic.
    pvalue : float or array
        Two-sided p-value.

In [10]:
ttest_rel(bp['after'], bp['before'])

Ttest_relResult(statistic=-27.291841767560236, pvalue=7.303035069608042e-48)

In [11]:
from scipy.stats import ttest_1samp
ttest_1samp(bp['after']-bp['before'], 0)

Ttest_1sampResult(statistic=-27.291841767560236, pvalue=7.303035069608042e-48)
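The two calls above return identical results because a paired t-test is a one-sample t-test on the differences. The sketch below checks this against the manual formula. It uses synthetic data generated only for illustration (the arrays `before` and `after` here are hypothetical and are not the blood pressure dataset), so the numbers will differ from the outputs above.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements, generated only for illustration
rng = np.random.default_rng(0)
before = rng.normal(loc=135, scale=8, size=50)
after = before - rng.normal(loc=5, scale=10, size=50)

diff = after - before
n = diff.size

# Manual paired t statistic: mean difference over its standard error
# (ddof=1 gives the sample standard deviation)
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Library versions: paired test, and one-sample test on the differences
res_rel = stats.ttest_rel(after, before)
res_one = stats.ttest_1samp(diff, 0)

print(t_manual, res_rel.statistic, res_one.statistic)
print(p_manual, res_rel.pvalue, res_one.pvalue)
```

Note that the sign of the statistic depends on argument order: `ttest_rel(after, before)` tests `after - before`, so a drop after treatment gives a negative statistic.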

### Independent Samples

In [14]:
ab = pd.read_csv('https://s3-eu-west-1.amazonaws.com/ih-materials/uploads/data-static/data/module-2/ab_test.csv')
ab.head()

| Unnamed: 0 | a | b |
|---|---|---|
| 0 | 0.27 | 13.61 |
| 1 | 6.08 | 21.53 |
| 2 | 13.74 | 9.23 |
| 3 | 9.7 | 5.36 |
| 4 | 7.0 | 12.9 |


In [15]:
from scipy.stats import ttest_ind

ttest_ind(ab['a'], ab['b'], equal_var=True)

Ttest_indResult(statistic=-2.637533181209767, pvalue=0.009713140852447347)

In [16]:
ttest_ind(ab['a'], ab['b'], equal_var=False)

Ttest_indResult(statistic=-2.637533181209767, pvalue=0.009776243024828825)
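In the two outputs above the statistic is the same and only the p-value changes. That is what one would expect when both groups have the same number of observations (an assumption here, since the group sizes are not shown): the pooled and Welch standard errors then coincide, and only the degrees of freedom differ. The sketch below illustrates this with synthetic data; the arrays `a` and `b` are hypothetical and are not the `ab_test.csv` columns.

```python
import numpy as np
from scipy import stats

# Hypothetical independent samples of equal size but different spread
rng = np.random.default_rng(1)
a = rng.normal(loc=10, scale=3, size=40)
b = rng.normal(loc=12, scale=6, size=40)

n_a, n_b = a.size, b.size
var_a, var_b = a.var(ddof=1), b.var(ddof=1)

# Pooled (Student) version: one common variance estimate, df = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
t_pooled = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
df_pooled = n_a + n_b - 2

# Welch version: separate variances, Welch-Satterthwaite degrees of freedom
se2_a, se2_b = var_a / n_a, var_b / n_b
t_welch = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
df_welch = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))

p_pooled = 2 * stats.t.sf(abs(t_pooled), df_pooled)
p_welch = 2 * stats.t.sf(abs(t_welch), df_welch)

print(t_pooled, t_welch)      # equal (up to rounding) when n_a == n_b
print(df_pooled, df_welch)    # Welch df is never larger than the pooled df
print(p_pooled, p_welch)
```

When the group sizes or variances differ noticeably, `equal_var=False` is usually the safer choice, since it does not rely on the equal-variance assumption.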