In [1]:
import pandas as pd
from scipy import stats
import statsmodels.stats.weightstats as ws

In [2]:
regular_co2 = pd.read_csv('../data/regular_co2_gms.csv')
holiday_co2 = pd.read_csv('../data/holiday_co2_gms.csv')

### Check Assumptions
https://medium.com/@ntran19/hypothesis-testing-and-python-cheat-sheet-7799e90e6ae9

#### Test for normality assumption
1. scipy.stats.shapiro
2. scipy.stats.normaltest
3. scipy.stats.anderson
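
As a rough sketch of how the three normality tests listed above are called, the cell below runs each one on a synthetic sample. The sample, its parameters, and the variable names are assumptions made for illustration; in this notebook the input would be a column such as `regular_co2['co2_gms']`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a real column (an assumption, not the notebook's data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=100.0, scale=15.0, size=200)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
shapiro_stat, shapiro_p = stats.shapiro(sample)

# D'Agostino-Pearson: combines skewness and kurtosis into a single statistic.
dagostino_stat, dagostino_p = stats.normaltest(sample)

# Anderson-Darling: usually read against critical values rather than a p-value.
anderson_result = stats.anderson(sample, dist='norm')

print(f"shapiro:    statistic={shapiro_stat:.4f}, p={shapiro_p:.4f}")
print(f"normaltest: statistic={dagostino_stat:.4f}, p={dagostino_p:.4f}")
print(f"anderson:   statistic={anderson_result.statistic:.4f}")
```

For the first two tests, a small p-value is evidence against normality; a large one only means the test found no evidence against it.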

#### Test for homogeneity of variance assumption
1. scipy.stats.bartlett
2. scipy.stats.levene
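
The two variance tests listed above can be tried the same way. This is a minimal sketch on synthetic groups whose spreads are deliberately different; the data and names are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Two synthetic groups with the same mean but clearly different spread (assumed data).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=100.0, scale=10.0, size=150)
group_b = rng.normal(loc=100.0, scale=30.0, size=120)

# Bartlett's test: sensitive to departures from normality.
bartlett_stat, bartlett_p = stats.bartlett(group_a, group_b)

# Levene's test centered on the median: more robust for non-normal data.
levene_stat, levene_p = stats.levene(group_a, group_b, center='median')

print(f"bartlett: statistic={bartlett_stat:.4f}, p={bartlett_p:.4g}")
print(f"levene:   statistic={levene_stat:.4f}, p={levene_p:.4g}")
```

A small p-value here suggests the variances differ, which is the situation `equal_var=False` is meant for in `stats.ttest_ind`.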

In [3]:
stats.ttest_ind(regular_co2['co2_gms'], holiday_co2['co2_gms'], equal_var=False)

Ttest_indResult(statistic=0.8997465667104911, pvalue=0.3688562986477767)

Statistical significance is not the same thing as practical significance.
If two means are actually different, to any degree no matter how small,
a significance test will almost certainly reject the null hypothesis given sufficiently large samples;
this is a well-known feature or bug, depending on one's point of view.
My advice is to dump all significance tests and work only with practical significance,
i.e. assign a value (in money, time, resources, whatever) to your actions and go from there.
src: https://stackoverflow.com/questions/53517313/why-does-z-test-indicate-significantly-different-for-2-distributions-that-looks
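
The sample-size effect described in the quote can be illustrated with a small simulation. This is only a sketch: the shift of 0.01 standard deviations, the sample sizes, and the helper `welch_p_value` are hypothetical choices, not values from the CO2 data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_shift = 0.01  # tiny true difference in means, in units of one standard deviation

def welch_p_value(n):
    """Draw two samples of size n that differ by true_shift and return the Welch p-value."""
    a = rng.normal(loc=0.0, scale=1.0, size=n)
    b = rng.normal(loc=true_shift, scale=1.0, size=n)
    return stats.ttest_ind(a, b, equal_var=False).pvalue

p_small = welch_p_value(100)        # small samples: the shift is hard to detect
p_large = welch_p_value(1_000_000)  # huge samples: the same shift becomes "significant"

print(f"n=100:       p={p_small:.4g}")
print(f"n=1,000,000: p={p_large:.4g}")
```

The true difference is identical in both runs; only the amount of data changes, which is why a p-value alone says little about whether a difference matters in practice.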


In [7]:
# Wrap each sample in DescrStatsW, which is what CompareMeans expects
regular_array = ws.DescrStatsW(regular_co2['co2_gms'])
holiday_array = ws.DescrStatsW(holiday_co2['co2_gms'])

cm_obj = ws.CompareMeans(regular_array, holiday_array)

# Two-sample z-test without assuming equal variances; returns (statistic, p-value)
ztest_results = cm_obj.ztest_ind(usevar='unequal')

print(ztest_results)

(0.8997465667104919, 0.36825513579480185)
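
The z statistic above agrees with the Welch t statistic from In [3] to about 15 digits, while the p-values differ slightly. The sketch below reproduces that pattern by hand on synthetic data. The formula is my reading of what `usevar='unequal'` computes, consistent with the outputs above, and the samples and names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal sizes and variances (assumed data).
rng = np.random.default_rng(2)
x1 = rng.normal(loc=50.0, scale=8.0, size=80)
x2 = rng.normal(loc=49.0, scale=12.0, size=60)

# Standard error of the mean difference using separate (unpooled) variances.
mean_diff = x1.mean() - x2.mean()
std_err = np.sqrt(x1.var(ddof=1) / x1.size + x2.var(ddof=1) / x2.size)

# z-test: refer the statistic to the standard normal distribution.
z_stat = mean_diff / std_err
p_z = 2 * stats.norm.sf(abs(z_stat))

# Welch t-test: same statistic, but referred to a t distribution.
t_stat, p_t = stats.ttest_ind(x1, x2, equal_var=False)

print(f"z-test: statistic={z_stat:.6f}, p={p_z:.6f}")
print(f"Welch:  statistic={t_stat:.6f}, p={p_t:.6f}")
```

Because the t distribution has heavier tails than the normal, the Welch p-value is never smaller than the z-test p-value for the same statistic, and the gap shrinks as the samples grow.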


The API documentation does not make it obvious how to use this method.
The method signature is

CompareMeans.ztest_ind(alternative='two-sided', usevar='pooled', value=0)

At first glance, there seems to be no argument for passing the data on which the z-test is run.

The parameter descriptions on the same page, however, do list the two samples:

x1: array_like, 1-D or 2-D
first of the two independent samples, see notes for 2-D case

x2: array_like, 1-D or 2-D
second of the two independent samples, see notes for 2-D case

These are not arguments of the method call itself. The samples are supplied earlier, as the two DescrStatsW objects passed to CompareMeans(...), as in the cell above.


documentation: https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.CompareMeans.ztest_ind.html?highlight=comparemeans%20ztest_ind#generated-statsmodels-stats-weightstats-comparemeans-ztest-ind--page-root

source code: https://www.statsmodels.org/stable/_modules/statsmodels/stats/weightstats.html#CompareMeans.ztest_ind

In [6]:
import pandas as pd
import statsmodels.stats.weightstats as ws

### 2 sample ztest to compare means - unequal variances and unequal sample sizes ###

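# Wrap each sample in DescrStatsW (the data frames are the ones loaded in In [2])
col1 = ws.DescrStatsW(regular_co2['co2_gms'])
col2 = ws.DescrStatsW(holiday_co2['co2_gms'])

cm_obj = ws.CompareMeans(col1, col2)

# ztest_ind returns a z statistic and its p-value
zstat, pval = cm_obj.ztest_ind(usevar='unequal')

print(zstat, pval)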

0.8997465667104919 0.36825513579480185
