## Statistical Tests for Continuous Dataset

### Flow & Scope

#### A. Chi Sq Test
Reference (chi-square test of independence in Python): https://www.pythonfordatascience.org/chi-square-test-of-independence-python/

The chi-square test of independence checks whether two categorical variables are related. It is an omnibus test: a significant result says that some dependence exists, but not which groups differ, so post-hoc testing is needed to locate the differences. A proportions test is typically used as the follow-up.

Test of independence assumptions:

(a) The two samples are independent.

(b) No expected cell count is 0.

(c) No more than 20% of the cells have an expected cell count < 5.
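Assumptions (b) and (c) can be checked directly from the expected counts. The sketch below is a minimal pure-Python illustration, assuming the region by age-category counts that appear in the pivot table later in this notebook; the helper names `expected_counts` and `check_assumptions` are hypothetical and not part of any library.

```python
# Observed counts (rows: region, columns: agecat 19-29, 30-34, 35+),
# copied from the pivot table shown later in this notebook.
observed = {
    "NE":      [46, 83, 37],
    "N Cntrl": [162, 92, 30],
    "South":   [139, 68, 43],
    "West":    [160, 73, 23],
}


def expected_counts(table):
    """Expected count per cell under independence: row total * column total / grand total."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def check_assumptions(table):
    """Check assumptions (b) and (c) on the expected cell counts."""
    cells = [e for row in expected_counts(table) for e in row]
    any_zero = any(e == 0 for e in cells)
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return {
        "min_expected": min(cells),
        "any_zero": any_zero,
        "share_below_5": share_below_5,
        "ok": (not any_zero) and share_below_5 <= 0.20,
    }


report = check_assumptions(observed)
print(report)
```

Assumption (a) concerns how the data were collected and cannot be verified from the table itself.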

Hypothesis:

H0: Variables are independent.

H1: Variables are dependent.

#### B. Chi Sq Post Hoc Test (Pairwise Chi Squared Test with BH Adjustments)

## Step 1: Omnibus Test

In [113]:
## CHI-SQUARE TEST OF INDEPENDENCE WITH PYTHON

import pandas as pd
import researchpy as rp
import scipy.stats as stats

# To load a sample dataset for this demonstration
import statsmodels.api as sm

df = sm.datasets.webuse("citytemp2")

In [114]:
df.head()

Unnamed: 0,division,region,heatdd,cooldd,tempjan,tempjuly,agecat
0,N. Eng.,NE,,,16.6,69.599998,19-29
1,N. Eng.,NE,7947.0,250.0,18.200001,68.0,19-29
2,Mid Atl,NE,7480.0,424.0,18.4,70.199997,19-29
3,N. Eng.,NE,7482.0,353.0,19.9,69.5,19-29
4,N. Eng.,NE,7482.0,353.0,19.9,69.5,19-29


In [115]:
rp.summary_cat(df[["agecat", "region"]])

Unnamed: 0,Variable,Outcome,Count,Percent
0,agecat,19-29,507,53.03
1,,30-34,316,33.05
2,,35+,133,13.91
3,region,N Cntrl,284,29.71
4,,West,256,26.78
5,,South,250,26.15
6,,NE,166,17.36


In [116]:
# Alternative 1: scipy.stats

crosstab = pd.crosstab(df["region"], df["agecat"])

# Run the test once and unpack the statistic, p-value, degrees of freedom, and expected counts
chisq_test_statistic, p_value, dof, expected = stats.chi2_contingency(crosstab)

if p_value < 0.05:
    print("Chi Sq is significant at p-value: {}, dof: {}, and score: {}. ".format(round(p_value, 10), dof, round(chisq_test_statistic, 2)))
else:
    print("Chi Sq is not significant at p-value: {}, dof: {}, and score: {}. ".format(round(p_value, 10), dof, round(chisq_test_statistic, 2)))

Chi Sq is significant at p-value: 0.0, dof: 6, and score: 61.29. 
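As a cross-check on the score printed above, the Pearson statistic can be recomputed by hand as the sum of (observed - expected)^2 / expected over all cells. This is a minimal sketch that assumes the same region by age-category counts shown in the pivot table later in this notebook; it reproduces only the statistic and degrees of freedom, not the p-value.

```python
# Observed counts, rows = region, columns = agecat (19-29, 30-34, 35+)
observed = [
    [46, 83, 37],    # NE
    [162, 92, 30],   # N Cntrl
    [139, 68, 43],   # South
    [160, 73, 23],   # West
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected count under independence
        chi2 += (o - e) ** 2 / e

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof = (len(observed) - 1) * (len(col_totals) - 1)

print(f"chi2 = {chi2:.4f}, dof = {dof}")
```

The result should land close to the score and degrees of freedom reported by `stats.chi2_contingency`.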


In [117]:
# Alternative 2: researchpy
crosstab, test_results, expected = rp.crosstab(df["region"], df["agecat"],
                                               test= "chi-square",
                                               expected_freqs= True,
                                               prop= "cell")

# The last column of test_results holds, in order: chi-square statistic, p-value, Cramer's V
chi2_stat, p_value, cramers_v = test_results.iloc[:, -1]

# Label the strength of a significant relationship by Cramer's V (first cutoff exceeded wins).
# Note: a significant result with Cramer's V <= 0.05 also falls through to the final message.
strength = None
if p_value < 0.05:
    for cutoff, label in [(0.25, "Very Strong"), (0.15, "Strong"), (0.1, "Moderate"), (0.05, "Weak")]:
        if cramers_v > cutoff:
            strength = label
            break

if strength:
    print("Chi Sq is significant with {} relationship at p-value: {}, and score: {}. ".format(strength, round(p_value, 10), round(chi2_stat, 2)))
else:
    print("Chi Sq is not significant at p-value: {}, and score: {}. ".format(round(p_value, 10), round(chi2_stat, 2)))

test_results
    

Chi Sq is significant with Strong relationship at p-value: 0.0, and score: 61.29. 


Unnamed: 0,Chi-square test,results
0,Pearson Chi-square ( 6.0) =,61.2877
1,p-value =,0.0
2,Cramer's V =,0.179
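
The Cramer's V in the table above can be recovered from the chi-square statistic with the usual formula V = sqrt(chi2 / (n * min(r - 1, c - 1))). The sketch below assumes that formula and plugs in the values reported in this notebook (chi2 = 61.2877, n = 956 observations, a 4 x 3 table); the function name `cramers_v_from_chi2` is hypothetical.

```python
import math


def cramers_v_from_chi2(chi2, n, n_rows, n_cols):
    """Cramer's V effect size for an r x c contingency table."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Values taken from the outputs in this notebook: 4 regions x 3 age categories
v = cramers_v_from_chi2(chi2=61.2877, n=956, n_rows=4, n_cols=3)
print(f"Cramer's V = {v:.3f}")
```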


## Step 2: Chi Square post-hoc tests

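The chi-square post-hoc test (pairwise chi-square test) compares groups two at a time and returns a p-value for each comparison, indicating whether that pair of groups differs. Because several tests are run, the p-values should then be adjusted for multiple comparisons, for example with the Benjamini-Hochberg (BH) procedure.
Reference: https://neuhofmo.github.io/chi-square-and-post-hoc-in-python/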

In [248]:
# Do the West and South regions have significantly different age-group distributions?
dfx = df.pivot_table(index='region',columns='agecat', values = 'division', aggfunc='count').reset_index()
dfx

agecat,region,19-29,30-34,35+
0,NE,46,83,37
1,N Cntrl,162,92,30
2,South,139,68,43
3,West,160,73,23


In [249]:
df_test = df[df['region'].isin(['West','South'])]

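# dtype=int keeps the 0/1 coding shown below (recent pandas versions default to True/False)
dummies = pd.get_dummies(df_test['agecat'], dtype=int)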
dummies.head()

Unnamed: 0,19-29,30-34,35+
450,1,0,0
451,1,0,0
452,1,0,0
453,1,0,0
454,1,0,0


In [253]:
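nl = "\n"

for series in dummies:
    # 2x2 table: membership in this age category (0/1) by region
    crosstab = pd.crosstab(dummies[series], df_test['region'])
    print(crosstab, nl)
    # `correction` is a boolean that toggles Yates' continuity correction for 2x2 tables.
    # It does not perform a Benjamini-Hochberg adjustment (a string such as 'fdr-bh' is
    # simply treated as True), so these p-values are unadjusted for multiple comparisons.
    chi2, p, dof, expected = stats.chi2_contingency(crosstab, correction=True)
    if p < 0.05:
        print(f"Significant difference. Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")
    else:
        print(f"Insignificant difference. Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")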
        
    

region  South  West
19-29              
0         111    96
1         139   160 

Insignificant difference. Chi2 value= 2.213817574786325
p-value= 0.13677982353820128
Degrees of freedom= 1

region  South  West
30-34              
0         182   183
1          68    73 

Insignificant difference. Chi2 value= 0.05329532995725284
p-value= 0.8174252655811318
Degrees of freedom= 1

region  South  West
35+                
0         207   233
1          43    23 

Significant difference. Chi2 value= 6.819964192708333
p-value= 0.009014437554213467
Degrees of freedom= 1
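
The three p-values printed above come from three separate tests, so a Benjamini-Hochberg adjustment still has to be applied to them as a set. The sketch below is a hand-rolled, pure-Python version of the BH step-up procedure, with the p-values copied from the output above; the function name `benjamini_hochberg` is hypothetical. In practice, `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` is the usual library route.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Unadjusted p-values from the pairwise South vs West tests above
raw = {
    "19-29": 0.13677982353820128,
    "30-34": 0.8174252655811318,
    "35+": 0.009014437554213467,
}

adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

for label, p_adj in adjusted.items():
    verdict = "Significant" if p_adj < 0.05 else "Insignificant"
    print(f"{label}: raw p = {raw[label]:.4f}, BH-adjusted p = {p_adj:.4f} -> {verdict}")
```

With only three comparisons the adjustment is mild, but it grows in importance when every pair of regions is tested against every age category.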



#### Additional Note: for a 2x2 table, the two-proportion z-test and the chi-square test without continuity correction give the same p-value, so they are interchangeable

Reference: 
1. https://rinterested.github.io/statistics/chi_square_same_as_z_test.html
2. https://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html

In [234]:
crosstab = pd.crosstab(dummies['19-29'], df_test['region'])
crosstab

region  South  West
19-29              
0         111    96
1         139   160


In [242]:
import numpy as np
nobs = np.array([crosstab[i].sum() for i in ['South','West']])
count = np.array([crosstab[i][1] for i in ['South','West']])

In [241]:
nobs  # group sizes for South and West

array([250, 256], dtype=int64)

In [256]:
from statsmodels.stats.proportion import proportions_ztest
stat, pval = proportions_ztest(count, nobs)
if pval < 0.05:
    print('Significant difference, pvalue at: {0:0.3f}'.format(pval))
else:
    print('Insignificant difference, pvalue at: {0:0.3f}'.format(pval))

Insignificant difference, pvalue at: 0.114


In [257]:
## The z-test p-value above matches the chi-square p-value when Yates' continuity
# correction is turned off (correction = False); see the first p-value printed below (19-29 age category).

for series in dummies:
    nl = "\n"
    
    crosstab = pd.crosstab(dummies[f"{series}"], df_test['region'])
    print(crosstab, nl)
    chi2, p, dof, expected = stats.chi2_contingency(crosstab, correction = False)
    if p < 0.05:
        print(f"Significant difference. Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")
    else:
        print(f"Insignificant difference. Chi2 value= {chi2}{nl}p-value= {p}{nl}Degrees of freedom= {dof}{nl}")
        

region  South  West
19-29              
0         111    96
1         139   160 

Insignificant difference. Chi2 value= 2.4910769230769234
p-value= 0.11449335733066458
Degrees of freedom= 1

region  South  West
30-34              
0         182   183
1          68    73 

Insignificant difference. Chi2 value= 0.10891375935101574
p-value= 0.7413842102700469
Degrees of freedom= 1

region  South  West
35+                
0         207   233
1          43    23 

Significant difference. Chi2 value= 7.526881770833334
p-value= 0.006078502758847744
Degrees of freedom= 1
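
The agreement shown above is not a coincidence: for a 2x2 table, the square of the pooled two-proportion z statistic equals the uncorrected chi-square statistic, so the two-sided p-values coincide. The sketch below illustrates this for the 19-29 category using only the standard library; the function name `two_proportion_ztest` is hypothetical, and it assumes the pooled-variance form of the z-test.

```python
import math


def two_proportion_ztest(count, nobs):
    """Two-sided, pooled-variance two-proportion z-test. Returns (z, p_value)."""
    p1 = count[0] / nobs[0]
    p2 = count[1] / nobs[1]
    pooled = (count[0] + count[1]) / (nobs[0] + nobs[1])
    se = math.sqrt(pooled * (1 - pooled) * (1 / nobs[0] + 1 / nobs[1]))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 19-29 age category: 139 of 250 in South, 160 of 256 in West
z, p_value = two_proportion_ztest(count=[139, 160], nobs=[250, 256])

print(f"z = {z:.4f}, z squared = {z ** 2:.4f}, p-value = {p_value:.4f}")
```

Comparing `z ** 2` and `p_value` with the chi-square value and p-value printed for 19-29 above shows the match.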

