<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

# two-sample hypothesis testing

## 0. python imports

In [2]:
import pandas as pd
from scipy.stats import ttest_rel, ttest_1samp, ttest_ind

## 1. data loading

let's load some data about the blood pressure of patients before and after a treatment...

In [3]:
blood_pressure = pd.read_csv('./data/blood_pressure.csv')
blood_pressure.head()

Unnamed: 0,before,after
0,136.713072,92.432965
1,134.735618,105.022643
2,127.529115,82.242766
3,144.527126,93.607172
4,124.21472,103.212223


and some data from a generic A/B testing experiment... because the data is stored as a table, samples a and b have the same number of observations here, but independent samples do not need to be the same size...

In [4]:
ab_test = pd.read_csv('./data/ab_test.csv')
ab_test.head()

Unnamed: 0,a,b
0,0.27,13.61
1,6.08,21.53
2,13.74,9.23
3,9.7,5.36
4,7.0,12.9


## 2. hypothesis test example (related samples)

test two related (paired) samples: is the difference between their means due to chance?

In [5]:
ttest_rel(blood_pressure['before'], blood_pressure['after'])

Ttest_relResult(statistic=27.291841767560236, pvalue=7.303035069608042e-48)

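this is equivalent to a one-sample test against a constant: take the element-wise difference between the two samples and test whether its mean differs from zero. the difference below is computed as after - before, so the statistic changes sign while the p-value stays the same...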

In [6]:
blood_pressure['diff'] = blood_pressure['after'] - blood_pressure['before']
blood_pressure.head()

Unnamed: 0,before,after,diff
0,136.713072,92.432965,-44.280107
1,134.735618,105.022643,-29.712975
2,127.529115,82.242766,-45.286349
3,144.527126,93.607172,-50.919953
4,124.21472,103.212223,-21.002497


In [7]:
ttest_1samp(blood_pressure['diff'], 0)

Ttest_1sampResult(statistic=-27.291841767560236, pvalue=7.303035069608042e-48)
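
the equivalence above can be checked by hand. the sketch below uses synthetic paired data (not the blood_pressure file; the means and spreads are made-up assumptions) and computes the statistic as t = mean(d) / (s_d / sqrt(n)), where d is the vector of paired differences:

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# synthetic paired measurements (hypothetical values, not the notebook data)
rng = np.random.default_rng(0)
before = rng.normal(loc=135, scale=10, size=50)
after = before - rng.normal(loc=30, scale=12, size=50)

# paired differences, oriented as in the notebook: after - before
diff = after - before

# t = mean(d) / (s_d / sqrt(n)), using the sample standard deviation (ddof=1)
n = diff.size
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

paired = ttest_rel(after, before)     # same orientation as diff
one_sample = ttest_1samp(diff, 0)     # differences tested against zero
flipped = ttest_rel(before, after)    # opposite orientation

print(t_manual, paired.statistic, one_sample.statistic, flipped.statistic)
```

the first three numbers should match, and the last one should be the same value with the opposite sign.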

note: the tests above are two-sided. for one-tailed versions, see https://stackoverflow.com/questions/15984221/how-to-perform-two-sample-one-tailed-t-test-with-numpy-scipy
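
as a minimal sketch of a one-tailed test: recent SciPy versions (1.6 and later, to the best of my knowledge) accept an `alternative` argument in `ttest_rel`, `ttest_1samp` and `ttest_ind`. the data below is synthetic and the effect size is an assumption chosen for illustration:

```python
import numpy as np
from scipy.stats import ttest_rel

# synthetic paired data (hypothetical, not the blood_pressure file)
rng = np.random.default_rng(1)
before = rng.normal(135, 10, size=40)
after = before - rng.normal(30, 12, size=40)

# default: two-sided alternative (the means differ in either direction)
two_sided = ttest_rel(before, after)

# one-sided alternative: mean(before - after) > 0,
# i.e. the treatment lowers blood pressure
greater = ttest_rel(before, after, alternative='greater')

print(two_sided.pvalue, greater.pvalue)
```

when the statistic points in the direction of the alternative, the one-sided p-value is half of the two-sided one. on older SciPy versions the same result can be obtained by halving the two-sided p-value after checking the sign of the statistic.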

## 3. hypothesis test example (independent samples)

assuming equal variances

In [8]:
ttest_ind(ab_test['b'], ab_test['a'], equal_var=True)

Ttest_indResult(statistic=2.637533181209767, pvalue=0.009713140852447347)

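assuming unequal variances (Welch's t-test). note that the samples are passed in the opposite order here (a, b instead of b, a), so the statistic changes sign...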

In [10]:
ttest_ind(ab_test['a'], ab_test['b'], equal_var=False)

Ttest_indResult(statistic=-2.637533181209767, pvalue=0.009776243024828825)
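
with equal-sized samples, as in ab_test, the pooled and Welch versions give the same statistic and differ only in the degrees of freedom used for the p-value. the sketch below uses synthetic groups of different size and spread (hypothetical values) to show a case where they diverge, and computes the Welch statistic by hand as (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b):

```python
import numpy as np
from scipy.stats import ttest_ind

# synthetic independent groups (hypothetical): different sizes and variances
rng = np.random.default_rng(2)
a = rng.normal(10, 5, size=80)
b = rng.normal(12, 9, size=120)

student_ab = ttest_ind(a, b, equal_var=True)    # pooled variance
welch_ab = ttest_ind(a, b, equal_var=False)     # Welch's t-test
welch_ba = ttest_ind(b, a, equal_var=False)     # arguments swapped

# Welch statistic by hand, using sample variances (ddof=1)
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
t_welch = (a.mean() - b.mean()) / se

print(student_ab.statistic, welch_ab.statistic, t_welch)
print(welch_ab.statistic, welch_ba.statistic)
```

swapping the arguments only flips the sign of the statistic; the two-sided p-value is unchanged.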

## 4. chi-squared contingency test (bonus)

**if there is enough time, perform a test on the renfe dataset and on the class example**

In [11]:
from scipy.stats import chi2_contingency

In [12]:
renfe = pd.read_csv('../../week_07/dataframe_calculations_and_big_data_tools/data/renfe_sample.csv')

In [22]:
renfe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   100000 non-null  int64  
 1   insert_date  100000 non-null  object 
 2   origin       100000 non-null  object 
 3   destination  100000 non-null  object 
 4   start_date   100000 non-null  object 
 5   end_date     100000 non-null  object 
 6   train_type   100000 non-null  object 
 7   price        68978 non-null   float64
 8   train_class  71104 non-null   object 
 9   fare         71104 non-null   object 
 10  price_tree   70016 non-null   object 
 11  batch        58099 non-null   object 
 12  id           71561 non-null   float64
dtypes: float64(2), int64(1), object(10)
memory usage: 9.9+ MB


In [23]:
contingency_table = pd.crosstab(renfe['train_class'], renfe['train_type'])

In [24]:
contingency_table

train_class \ train_type,ALTARIA,ALVIA,AV City,AVANT,AVANT-AVE,AVANT-LD,AVANT-MD,AVE,AVE-AVANT,AVE-AVE,...,LD-AVE,LD-MD,MD,MD-AVANT,MD-AVE,MD-LD,R. EXPRES,REG.EXP.,REGIONAL,TRENHOTEL
Cama Turista,0,2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,21
Preferente,89,1268,3,0,0,0,0,3975,0,0,...,2,8,7,0,2,12,0,0,0,0
PreferenteSólo plaza H,0,27,0,0,0,0,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0
Turista,305,4027,1783,1575,0,0,0,38183,0,0,...,37,10,1028,0,13,24,364,410,2267,604
Turista Plus,0,3,91,0,0,0,0,3231,0,0,...,10,0,0,0,2,0,0,0,0,0
Turista con enlace,0,2,5,0,428,23,22,15,186,35,...,1064,474,353,43,653,1290,0,0,0,1
TuristaSólo plaza H,0,4,9,0,0,0,0,168,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
chi2, p, dof, ex = chi2_contingency(contingency_table, correction=False)

In [26]:
chi2

71477.89846077052

In [27]:
dof

156
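
the degrees of freedom of a contingency test are (rows - 1) * (columns - 1); the table above has 7 train classes and 27 train types, hence 6 * 26 = 156. a small sketch on a toy table (the counts are hypothetical) that also rebuilds the expected counts and the statistic by hand:

```python
import numpy as np
from scipy.stats import chi2_contingency

# toy 3x4 table of observed counts (hypothetical values)
table = np.array([[12, 30, 18, 25],
                  [20, 22, 31, 14],
                  [ 9, 17, 26, 33]])

chi2_stat, p_value, dof, expected = chi2_contingency(table, correction=False)

# degrees of freedom: (rows - 1) * (columns - 1)
n_rows, n_cols = table.shape
dof_manual = (n_rows - 1) * (n_cols - 1)

# expected count under independence: row total * column total / grand total
expected_manual = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()

# Pearson statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((table - expected_manual) ** 2 / expected_manual).sum()

print(dof, dof_manual)
print(chi2_stat, chi2_manual)
```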

In [28]:
ex

array([[1.38529478e-01, 1.87507032e+00, 6.64871175e-01, 5.53766314e-01,
        1.50483798e-01, 8.08674617e-03, 7.73514851e-03, 1.60240633e+01,
        6.53971647e-02, 1.23059181e-02, 6.74364311e-01, 2.57017889e-01,
        3.34720972e-01, 1.14620837e-01, 7.45035441e-01, 2.90771265e-01,
        2.46118362e-02, 3.91328195e-01, 1.72986049e-01, 4.88017552e-01,
        1.51186994e-02, 2.35570432e-01, 4.66218497e-01, 1.27981548e-01,
        1.44155041e-01, 7.97071895e-01, 2.20100135e-01],
       [3.01717203e+01, 4.08390316e+02, 1.44808942e+02, 1.20610303e+02,
        3.27753713e+01, 1.76129332e+00, 1.68471535e+00, 3.49004100e+03,
        1.42435025e+01, 2.68022896e+00, 1.46876547e+02, 5.59784963e+01,
        7.29022277e+01, 2.49644183e+01, 1.62268719e+02, 6.33299814e+01,
        5.36045792e+00, 8.52312809e+01, 3.76763614e+01, 1.06290223e+02,
        3.29285272e+00, 5.13072401e+01, 1.01542389e+02, 2.78743812e+01,
        3.13969678e+01, 1.73602259e+02, 4.79378094e+01],
       [1.77317732e-01

In [29]:
p

0.0
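
a p-value of 0.0 should be read with some care here: many of the expected counts in `ex` are well below 1, and a common rule of thumb asks for expected counts of at least 5 in most cells for the chi-squared approximation to be reliable. the sketch below shows one way to check this; it uses a 3x3 excerpt of the counts printed above (Cama Turista, Preferente and Turista against ALTARIA, ALVIA and TRENHOTEL) purely as an illustration, not as a replacement for the full test:

```python
import numpy as np
from scipy.stats import chi2_contingency

# 3x3 excerpt of the contingency table shown above (illustration only)
table = np.array([[  0,    2,  21],    # Cama Turista
                  [ 89, 1268,   0],    # Preferente
                  [305, 4027, 604]])   # Turista

_, _, _, expected = chi2_contingency(table, correction=False)

# cells whose expected count falls below the usual threshold of 5
low = expected < 5
share_low = low.mean()

print(expected.round(2))
print(low.sum(), share_low)
```

if a large share of cells falls below the threshold, one option is to merge rare categories before running the test.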
