# Chi-square association test

*Pearson's chi-squared test* is used here as a test of independence. It assesses whether two categorical variables, cross-tabulated in a contingency table, are independent of each other (e.g. polling responses from people of different nationalities, to see whether nationality is related to the response). A p-value less than or equal to the chosen significance level (0.05 in this notebook) is interpreted as justification for rejecting the null hypothesis that the row variable is independent of the column variable. The alternative hypothesis is that the variables are associated; the structure of that association is not specified.

More info: https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test#Testing_for_statistical_independence
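
To make the test concrete, the statistic can be computed by hand and compared with SciPy. This is a minimal sketch on a hypothetical 2x2 table (the counts are illustrative and not taken from the dataset). It passes `correction=False` because `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, which would otherwise make the two results differ.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x2 table of counts (illustrative numbers, not from the dataset).
observed = np.array([[30, 10],
                     [20, 40]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Expected counts under independence: (row total * column total) / grand total.
expected = row_totals @ col_totals / total

# Pearson statistic: sum over cells of (observed - expected)^2 / expected.
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(stat, dof)  # upper-tail probability of the chi-squared distribution

# Cross-check against SciPy without the continuity correction.
stat_ref, p_ref, dof_ref, expected_ref = chi2_contingency(observed, correction=False)
print(stat, stat_ref)
print(p_value, p_ref)
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.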

Import dependencies and data:

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
%matplotlib inline

In [3]:
data = pd.read_csv("../results/dnds_for_targets.csv", index_col=0)
print(data.shape)
data.head()

(14907, 9)


Unnamed: 0,Gene name,Gene description,dN with Chimpanzee,dS with Chimpanzee,dN/dS with Chimpanzee,dN with Mouse,dS with Mouse,dN/dS with Mouse,present in DrugBank
0,MT-ATP8,mitochondrially encoded ATP synthase membrane ...,0.0325,0.3331,0.0976,0.4871,0.848,0.5744,False
1,MT-ND6,mitochondrially encoded NADH:ubiquinone oxidor...,0.0204,0.6559,0.0311,0.3455,13.3498,0.0259,True
2,KLF13,Kruppel like factor 13 [Source:HGNC Symbol;Acc...,0.0026,0.0771,0.0337,0.0219,0.681,0.0322,False
3,KCNE1B,potassium voltage-gated channel subfamily E re...,0.0165,0.0649,0.2542,0.1382,0.8135,0.1699,False
4,PDK3,pyruvate dehydrogenase kinase 3 [Source:HGNC S...,0.0152,0.03,0.5067,0.0151,0.5213,0.029,True


Drop rows with missing values, then filter out rows with infinite dN/dS values:

In [4]:
data_filtered = data.dropna()
print(data_filtered.shape)
data_filtered = data_filtered[(data_filtered["dN/dS with Chimpanzee"] != float("inf")) & (data_filtered["dN/dS with Mouse"] != float("inf"))]
print(data_filtered.shape)
data_filtered.head()

(11882, 9)
(11480, 9)


Unnamed: 0,Gene name,Gene description,dN with Chimpanzee,dS with Chimpanzee,dN/dS with Chimpanzee,dN with Mouse,dS with Mouse,dN/dS with Mouse,present in DrugBank
0,MT-ATP8,mitochondrially encoded ATP synthase membrane ...,0.0325,0.3331,0.0976,0.4871,0.848,0.5744,False
1,MT-ND6,mitochondrially encoded NADH:ubiquinone oxidor...,0.0204,0.6559,0.0311,0.3455,13.3498,0.0259,True
2,KLF13,Kruppel like factor 13 [Source:HGNC Symbol;Acc...,0.0026,0.0771,0.0337,0.0219,0.681,0.0322,False
3,KCNE1B,potassium voltage-gated channel subfamily E re...,0.0165,0.0649,0.2542,0.1382,0.8135,0.1699,False
4,PDK3,pyruvate dehydrogenase kinase 3 [Source:HGNC S...,0.0152,0.03,0.5067,0.0151,0.5213,0.029,True
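
As a possible alternative for the ratio columns, `np.isfinite` rejects NaN, `+inf` and `-inf` in a single mask. The sketch below uses a toy frame with hypothetical values. Note that it is not a drop-in replacement for the cell above: `dropna()` also removes rows with missing values in any other column, which this mask does not check.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one infinite and one missing ratio.
toy = pd.DataFrame({
    "dN/dS with Chimpanzee": [0.10, np.inf, 0.30, np.nan],
    "dN/dS with Mouse":      [0.05, 0.20, 0.15, 0.40],
})

ratio_cols = ["dN/dS with Chimpanzee", "dN/dS with Mouse"]

# np.isfinite is False for NaN, +inf and -inf; keep rows where every ratio is finite.
finite_mask = np.isfinite(toy[ratio_cols]).all(axis=1)
toy_filtered = toy[finite_mask]
print(toy_filtered)
```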


This function takes the two rows of a 2×2 contingency table and prints the chi-squared test p-value:

In [5]:
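def chi2_test(row1, row2):
    """Run a chi-squared independence test on a 2x2 table and print the verdict."""
    cont_table = [row1, row2]
    print(cont_table)

    # For 2x2 tables chi2_contingency applies Yates' continuity correction by default.
    stat, p, dof, expected = chi2_contingency(cont_table)
    print('significance=0.05, p={}'.format(p))

    if p <= 0.05:
        print('Dependent (reject H0)')
    else:
        print('Independent (fail to reject H0)')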

1. **Chi-square test for comparison of dN/dS with 1.**

The contingency table looks like this (a gene counts as a good target if it is present in DrugBank and as a bad target otherwise):

|             | dN/dS < 1 | dN/dS ≥ 1 |
|:-----------:|:---------:|:---------:|
| bad target  |           |           |
| good target |           |           |

In [6]:
# for chimpanzee vs human
bad_targets = [data_filtered[(data_filtered["dN/dS with Chimpanzee"] < 1) & (data_filtered["present in DrugBank"] == False)].shape[0], 
               data_filtered[(data_filtered["dN/dS with Chimpanzee"] >= 1) & (data_filtered["present in DrugBank"] == False)].shape[0]]
good_targets = [data_filtered[(data_filtered["dN/dS with Chimpanzee"] < 1) & (data_filtered["present in DrugBank"] == True)].shape[0], 
               data_filtered[(data_filtered["dN/dS with Chimpanzee"] >= 1) & (data_filtered["present in DrugBank"] == True)].shape[0]]

print('Results of Chi2 test for Chimpanzee:')
print()
chi2_test(bad_targets, good_targets)
print()

# for mouse vs human
bad_targets = [data_filtered[(data_filtered["dN/dS with Mouse"] < 1) & (data_filtered["present in DrugBank"] == False)].shape[0], 
               data_filtered[(data_filtered["dN/dS with Mouse"] >= 1) & (data_filtered["present in DrugBank"] == False)].shape[0]]
good_targets = [data_filtered[(data_filtered["dN/dS with Mouse"] < 1) & (data_filtered["present in DrugBank"] == True)].shape[0], 
               data_filtered[(data_filtered["dN/dS with Mouse"] >= 1) & (data_filtered["present in DrugBank"] == True)].shape[0]]

print('Results of Chi2 test for Mouse:')
print()
chi2_test(bad_targets, good_targets)
print()

Results of Chi2 test for Chimpanzee:

[[9357, 704], [1370, 49]]
significance=0.05, p=5.999795685706191e-07
Dependent (reject H0)

Results of Chi2 test for Mouse:

[[10058, 3], [1419, 0]]
significance=0.05, p=0.8207081731028516
Independent (fail to reject H0)
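
The Mouse table has only three genes with dN/dS ≥ 1, so the expected counts in that column are very small, and the chi-squared approximation is commonly considered unreliable when an expected count falls below 5. One possible cross-check, which is not part of the original analysis, is Fisher's exact test. A minimal sketch using the table printed above:

```python
from scipy.stats import chi2_contingency, fisher_exact

# Mouse table printed above: rows are bad/good targets,
# columns are dN/dS < 1 and dN/dS >= 1.
mouse_table = [[10058, 3], [1419, 0]]

# Expected counts under independence, to check the "at least 5" rule of thumb.
_, _, _, expected = chi2_contingency(mouse_table)
print("smallest expected count:", expected.min())

# Fisher's exact test does not rely on the large-sample approximation.
odds_ratio, p_exact = fisher_exact(mouse_table, alternative="two-sided")
print("Fisher exact p =", p_exact)
```

With so few genes above the threshold, neither test has much power to detect an association here, so a failure to reject H0 should be read cautiously.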



2. **Chi-square test for comparison of dN/dS with the average.** 

The contingency table looks like this:

|             | dN/dS < mean | dN/dS ≥ mean |
|:-----------:|:------------:|:------------:|
| bad target  |              |              |
| good target |              |              |

In [7]:
# for chimpanzee vs human
mean_dNdS = np.mean(data_filtered["dN/dS with Chimpanzee"])
bad_targets = [data_filtered[(data_filtered["dN/dS with Chimpanzee"] < mean_dNdS) & (data_filtered["present in DrugBank"] == False)].shape[0], 
               data_filtered[(data_filtered["dN/dS with Chimpanzee"] >= mean_dNdS) & (data_filtered["present in DrugBank"] == False)].shape[0]]
good_targets = [data_filtered[(data_filtered["dN/dS with Chimpanzee"] < mean_dNdS) & (data_filtered["present in DrugBank"] == True)].shape[0], 
               data_filtered[(data_filtered["dN/dS with Chimpanzee"] >= mean_dNdS) & (data_filtered["present in DrugBank"] == True)].shape[0]]

print('Results of Chi2 test for Chimpanzee:')
print()
print('Average dN/dS is {}'.format(mean_dNdS))
chi2_test(bad_targets, good_targets)
print()

# for mouse vs human
mean_dNdS = np.mean(data_filtered["dN/dS with Mouse"])
bad_targets = [data_filtered[(data_filtered["dN/dS with Mouse"] < mean_dNdS) & (data_filtered["present in DrugBank"] == False)].shape[0], 
               data_filtered[(data_filtered["dN/dS with Mouse"] >= mean_dNdS) & (data_filtered["present in DrugBank"] == False)].shape[0]]
good_targets = [data_filtered[(data_filtered["dN/dS with Mouse"] < mean_dNdS) & (data_filtered["present in DrugBank"] == True)].shape[0], 
               data_filtered[(data_filtered["dN/dS with Mouse"] >= mean_dNdS) & (data_filtered["present in DrugBank"] == True)].shape[0]]

print('Results of Chi2 test for Mouse:')
print()
print('Average dN/dS is {}'.format(mean_dNdS))
chi2_test(bad_targets, good_targets)


Results of Chi2 test for Chimpanzee:

Average dN/dS is 0.4354058972125436
[[6674, 3387], [1060, 359]]
significance=0.05, p=3.813229073735863e-10
Dependent (reject H0)

Results of Chi2 test for Mouse:

Average dN/dS is 0.1616633275261324
[[6038, 4023], [1043, 376]]
significance=0.05, p=1.755996317294364e-22
Dependent (reject H0)
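
The four counts in each table can also be obtained in one step with `pd.crosstab`, which may be easier to read than four separate boolean filters. The sketch below is an assumption-laden illustration: `toy` is a hypothetical stand-in for `data_filtered` (same column names, made-up values), and `contingency` is a helper name introduced here, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for data_filtered (hypothetical values, same column names).
toy = pd.DataFrame({
    "dN/dS with Chimpanzee": [0.10, 0.25, 1.20, 0.40, 2.00, 0.05, 0.90, 1.50],
    "present in DrugBank":   [False, False, False, True, False, True, True, False],
})

def contingency(df, column, threshold):
    """Cross-tabulate DrugBank membership against `column >= threshold`."""
    above = (df[column] >= threshold).rename("dN/dS >= threshold")
    table = pd.crosstab(df["present in DrugBank"], above)
    # Reindex so both rows and both columns exist even when a group is empty.
    return table.reindex(index=[False, True], columns=[False, True], fill_value=0)

table = contingency(toy, "dN/dS with Chimpanzee", 1)
print(table)

# Rows in the same order as the notebook: bad targets first, then good targets.
bad_targets, good_targets = table.to_numpy().tolist()
```

The resulting `bad_targets` and `good_targets` lists have the same layout as the ones built above, so they could be passed to `chi2_test` unchanged; for the mean-based comparison, the threshold would be the column mean instead of 1.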
