# Determining Statistical Dependence

We are given a __contingency table__, containing the counts of individuals categorized according to two factors.  For example, the following table gives numbers of patients in a New York hospital categorized by blood type (A, AB, B or O) and COVID-19 test result.

|   |  positive test | negative test  |
|:---:|:---:|:---:|
| A  | 231  | 245  |
| AB  | 21  | 47  |
| B  | 116  |  136 |
| O  |  312 | 449  |

We want to determine whether there is evidence of statistical dependence between the two factors, in this case blood type and test result.  The null hypothesis is that the two factors are independent; the alternative hypothesis is that they are not. 

For each cell of the table, we compute its __expected count__ under the null hypothesis, which is given by:
$$\mbox{expected count}=\frac{\mbox{row total}\times\mbox{column total}}{\mbox{grand total}}.$$
Here "row total" is the sum of all the counts in the row containing the cell, and similarly for the column, and "grand total" is the total of all counts in the table.  We then compute the test statistic:
$$s^2=\sum_{\mbox{cell}} \frac{(\mbox{observed} - \mbox{expected})^2}{\mbox{expected}}.$$
Here the sum is over all cells in the table, "observed" is the count in the cell in the original table, and "expected" is the expected count computed above. 
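
As a worked instance of both formulas, take the cell for blood type A and a positive test in the table above (figures rounded to two decimal places). The row total is $231+245=476$, the column total is $231+21+116+312=680$, and the grand total is $1557$, so
$$\mbox{expected count}=\frac{476\times 680}{1557}\approx 207.89,\qquad \frac{(231-207.89)^2}{207.89}\approx 2.57.$$
The second quantity is the contribution of this single cell to $s^2$; the other seven cells contribute in the same way.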

Under the null hypothesis, $s^2$ has approximately a chi-squared distribution with $(\mbox{rows}-1)(\mbox{columns}-1)$ degrees of freedom, where "rows" and "columns" are the numbers of rows and columns in the table ($4$ and $2$ in the example above, giving $3$ degrees of freedom).  This distribution is available as `chi2` in the `scipy` `stats` package (note: _not_ `chisquare`, which is a function that performs a different chi-squared test rather than a distribution).  The p-value is the probability that a random variable with this distribution is greater than or equal to the computed statistic $s^2$.  We reject the null hypothesis if the p-value is less than or equal to the required significance level $\alpha$.
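
The p-value calculation described above can be sketched as follows. The statistic value `s2 = 5.0` and the level `alpha = 0.05` are assumptions chosen only for illustration; they are not computed from the blood-type table.

```python
from scipy import stats

# Hypothetical inputs, for illustration only.
s2 = 5.0                  # an assumed value of the test statistic
rows, columns = 4, 2      # shape of the contingency table
alpha = 0.05              # assumed significance level

df = (rows - 1) * (columns - 1)

# sf is the survival function: P(X >= s2) for X ~ chi-squared with df degrees of freedom.
p = stats.chi2.sf(s2, df)

# Reject the null hypothesis when the p-value is at most alpha.
reject = bool(p <= alpha)
print(df, p, reject)
```

With these assumed numbers the p-value is well above $0.05$, so the null hypothesis would not be rejected.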

The test involves an approximation.  It is not considered appropriate if any expected cell count is less than $5$.  (In that case there are other "exact" tests that can be used.)  
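
A minimal sketch of this check is given below, using a hypothetical $2\times 2$ table with small counts. It falls back on Fisher's exact test (`scipy.stats.fisher_exact`), which is one example of an exact test; it applies to $2\times 2$ tables and is not necessarily the alternative the text has in mind.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table with small counts, for illustration only.
small = np.array([[3, 1], [1, 3]])

# Expected counts under independence: row total * column total / grand total.
exp = np.outer(small.sum(axis=1), small.sum(axis=0)) / small.sum()
too_small = bool(np.any(exp < 5))

if too_small:
    # Fisher's exact test makes no large-sample approximation.
    odds_ratio, p = stats.fisher_exact(small)
    print("expected counts too small; exact p-value:", p)
```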

We will write a function `expected(table)` that takes a two-dimensional contingency table of counts (with any numbers of rows and columns), in the form of a numpy array, and returns a numpy array of the same shape containing the expected counts under the null hypothesis.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

In [2]:
def expected(table):
    # Expected count for each cell: row total * column total / grand total.
    table = np.asarray(table)
    row_totals = table.sum(axis=1)      # one total per row
    column_totals = table.sum(axis=0)   # one total per column
    grand_total = table.sum()
    
    # The outer product pairs every row total with every column total,
    # so the result has the same shape as the input, whatever its size.
    return np.outer(row_totals, column_totals) / grand_total

In [3]:
t=np.array([[4,2,2],[7,3,2]])
print(t)
print(expected(t))
print(expected(expected(t)))  # applying the function twice should give same as once
assert np.allclose(expected(t),[[4.4,2,1.6],[6.6,3,2.4]])
assert np.allclose(expected(t),expected(expected(t)))

[[4 2 2]
 [7 3 2]]
[[4.4 2.  1.6]
 [6.6 3.  2.4]]
[[4.4 2.  1.6]
 [6.6 3.  2.4]]
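
As an independent cross-check, SciPy provides `scipy.stats.contingency.expected_freq`, which computes the same expected counts. The sketch below applies it to the test table used above; the variable name `reference` is introduced here only for illustration.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

t = np.array([[4, 2, 2], [7, 3, 2]])

# Expected counts under independence, as computed by SciPy.
reference = expected_freq(t)
print(reference)
```

Comparing `reference` with the output of `expected(t)` using `np.allclose` is a quick way to catch mistakes such as swapped rows and columns.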


Next, we will write a function `statistic(table)` that computes the test statistic $s^2$.

In [4]:
def statistic(table):
    # Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected.
    # Reuses expected() from above rather than redefining it here.
    observed = np.asarray(table)
    exp = expected(observed)
    return np.sum((observed - exp)**2 / exp)

In [5]:
# Informal testing
print(statistic(np.array([[12,8],[9,6]])))
print(statistic(np.array([[6,10,4],[3,9,7]])))

0.0
1.8463863006799308
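
The second value can be cross-checked against SciPy's built-in `stats.chi2_contingency`, which performs the whole test in one call. This is only a sketch for comparison; `correction=False` switches off the Yates continuity correction so that the statistic matches the formula for $s^2$ used here.

```python
import numpy as np
from scipy import stats

table = np.array([[6, 10, 4], [3, 9, 7]])

# Returns the statistic, the p-value, the degrees of freedom and the expected counts.
stat, p, dof, exp = stats.chi2_contingency(table, correction=False)
print(stat, dof)
```

Note that `chi2_contingency` does not refuse to run when some expected counts are below $5$, as happens in this small table, so that check remains our responsibility.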


We will now write a function `p_value(table)` that computes the p-value.  If any expected cell count is less than $5$ it will instead print an error message and return `None`.

In [6]:
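def p_value(table):
    ex = expected(table)
    # The chi-squared approximation is not appropriate for small expected counts.
    if np.any(ex < 5):
        print('Error: an expected cell count is less than 5')
        return None
    rows, cols = np.shape(table)
    df = (rows-1)*(cols-1)
    # sf is the survival function 1 - cdf, i.e. P(X >= statistic).
    return stats.chi2.sf(statistic(table), df)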
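
The upper-tail probability can be obtained either as `1 - cdf` or from the survival function `sf` of `stats.chi2`. The sketch below, with assumed statistic values of $5$ and $100$ on $3$ degrees of freedom, suggests why `sf` is the safer choice: the two agree for moderate values, but `1 - cdf` can lose all of its significant digits far out in the tail.

```python
from scipy import stats

df = 3
moderate, extreme = 5.0, 100.0   # assumed statistic values, for illustration

# For a moderate statistic the two routes agree to rounding error.
p_cdf = 1 - stats.chi2.cdf(moderate, df)
p_sf = stats.chi2.sf(moderate, df)

# For an extreme statistic the cdf is so close to 1 that the subtraction
# cancels, while sf still returns a tiny positive probability.
tail_cdf = 1 - stats.chi2.cdf(extreme, df)
tail_sf = stats.chi2.sf(extreme, df)
print(p_cdf, p_sf, tail_cdf, tail_sf)
```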

We will determine whether or not we can reject the null hypothesis (and hence conclude that there is evidence of dependence between blood type and test result) in the above example, at significance levels $0.05$, $0.01$ and $0.001$.  We will store the answers in three boolean variables (which will be `True` if and only if we can reject the null hypothesis) named `reject_5`, `reject_1` and `reject_01` respectively. 

In [7]:
tests = np.array([[231,245],[21,47],[116,136],[312,449]])

print(tests)
print(np.round(expected(tests), 2))
print(f"{statistic(tests):.2f}")
p = p_value(tests)
print(f"{p:.3f}")

# Reject the null hypothesis when the p-value is at most the significance level.
reject_5 = bool(p <= 0.05)
reject_1 = bool(p <= 0.01)
reject_01 = bool(p <= 0.001)

[[231 245]
 [ 21  47]
 [116 136]
 [312 449]]
[[207.89 268.11]
 [ 29.7   38.3 ]
 [110.06 141.94]
 [332.36 428.64]]
11.87
0.008
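
An equivalent way to reach the three decisions is to compare the statistic with the critical value of the chi-squared distribution at each significance level, obtained from `stats.chi2.ppf`. The following self-contained sketch recomputes the statistic directly with `np.outer`; the names `observed`, `s2` and `decisions` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([[231, 245], [21, 47], [116, 136], [312, 449]])

# Expected counts and test statistic, as defined earlier.
exp = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
s2 = ((observed - exp) ** 2 / exp).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Reject at level alpha when s2 is at least the upper-alpha critical value.
decisions = {
    alpha: bool(s2 >= stats.chi2.ppf(1 - alpha, dof))
    for alpha in (0.05, 0.01, 0.001)
}
print(decisions)
```

With $3$ degrees of freedom the critical values are roughly $7.81$, $11.34$ and $16.27$, so the null hypothesis is rejected at levels $0.05$ and $0.01$ but not at $0.001$.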


In [8]:
assert (reject_5==True or reject_5==False)