# DASC 512 - 26 - Tests for Categorical Data

This lesson focuses on **hypothesis tests for categorical data**. We'll cover both one-way and two-way analyses using the chi-squared test: a goodness-of-fit test for one-way tables and a contingency test for two-way tables.

***

In [2]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.graphics.api as smg
import statsmodels.formula.api as smf

sns.set_style('whitegrid')

## One-Way Multinomial Experiments (Chi-Squared Test)

First, let's look at the case where the hypothesized proportions are all equal.
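
For reference, the statistic computed step by step in the cells below is Pearson's chi-squared statistic for $k$ categories. The notation is chosen for this note rather than taken from the lesson: $n_i$ are the observed counts, $n$ is their total, and $p_{i,0}$ are the hypothesized proportions.

$$
\chi^2 = \sum_{i=1}^{k} \frac{(n_i - n\,p_{i,0})^2}{n\,p_{i,0}}, \qquad \text{df} = k - 1
$$

With equal hypothesized proportions, $p_{i,0} = 1/k$, so every expected count is $n/k$.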

In [6]:
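# Observed counts in each of the four categories
grades = np.array([82, 103, 128, 97])
# Null hypothesis: all four categories are equally likely
p0 = np.ones(4) / 4
# Total number of observations
n = grades.sum()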

In [7]:
# Expected observations
expected = p0 * n
expected

array([102.5, 102.5, 102.5, 102.5])

In [9]:
# Test statistic
stat = np.sum((grades - expected) ** 2 / expected)
print(stat)

10.741463414634145


In [10]:
# Critical value at alpha = 0.05; df = k - 1 = 3 for k = 4 categories
dist = stats.chi2(df=3)
dist.isf(0.05)

7.814727903251178

In [11]:
# p-value
dist.sf(stat)

0.013209268576135736

In [33]:
# And the easy way: when f_exp is omitted, chisquare assumes equal expected counts
stat, pval = stats.chisquare(f_obs=grades)
print(f'The test statistic is {stat:.2f} and the p-value is {pval:.4f}.')

The test statistic is 10.74 and the p-value is 0.0132.


And now let's do the same assuming unequal hypothesized probabilities.

In [26]:
p0 = np.array([1/5,3/10,3/10,1/5])

In [27]:
# Expected counts under the new null; n is the total count computed earlier (410)
expected = p0 * n

In [29]:
stat = np.sum((grades - expected) ** 2 / expected)
print(stat)

6.199186991869919


In [30]:
# p-value; still df = 3, so the chi2 distribution object from above is reused
pval = dist.sf(stat)
print(pval)

0.10231141486152456


In [32]:
# And the easy way
stat, pval = stats.chisquare(f_obs=grades, f_exp=expected)
print(f'The test statistic is {stat:.2f} and the p-value is {pval:.4f}.')

The test statistic is 6.20 and the p-value is 0.1023.
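
The manual steps above can be bundled into a single function. This is a minimal sketch rather than part of the original lesson: `chi2_gof` and `observed_counts` are hypothetical names introduced here, and the only SciPy calls it relies on are `stats.chi2.sf` and `stats.chisquare`.

```python
import numpy as np
from scipy import stats


def chi2_gof(observed, p0=None):
    """Hypothetical helper: one-way chi-squared test computed by hand.

    observed : counts per category
    p0       : hypothesized proportions (assumed equal if omitted)
    """
    observed = np.asarray(observed, dtype=float)
    k = observed.size
    if p0 is None:
        p0 = np.full(k, 1 / k)
    p0 = np.asarray(p0, dtype=float)

    # Expected counts under the null hypothesis
    expected = p0 * observed.sum()
    # Pearson's chi-squared statistic
    stat = np.sum((observed - expected) ** 2 / expected)
    # Upper-tail probability with k - 1 degrees of freedom
    pval = stats.chi2.sf(stat, df=k - 1)
    return stat, pval


observed_counts = np.array([82, 103, 128, 97])

stat_eq, pval_eq = chi2_gof(observed_counts)
stat_uneq, pval_uneq = chi2_gof(observed_counts, p0=[1/5, 3/10, 3/10, 1/5])

print(f'Equal p0:   stat = {stat_eq:.2f}, p-value = {pval_eq:.4f}')
print(f'Unequal p0: stat = {stat_uneq:.2f}, p-value = {pval_uneq:.4f}')

# Cross-check the equal-proportions case against SciPy's built-in test
scipy_stat, scipy_pval = stats.chisquare(f_obs=observed_counts)
print(np.isclose(stat_eq, scipy_stat), np.isclose(pval_eq, scipy_pval))
```

Both calls should reproduce the values printed in the cells above (10.74 with p = 0.0132, and 6.20 with p = 0.1023).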


## Two-Way Multinomial Experiment (Contingency Test)

In [34]:
# Observed counts in a 2x2 contingency table (rows and columns are the two factors)
grades = np.array([[141, 161], [44, 61]])

In [35]:
grades

array([[141, 161],
       [ 44,  61]])

In [36]:
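# Note: when df = 1 (a 2x2 table), chi2_contingency applies Yates' continuity
# correction by default; pass correction=False for the uncorrected statistic
stat, pval, df, expected = stats.chi2_contingency(grades)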
print(f'The test statistic is {stat:.2f} and the p-value is {pval:.4f}.')

The test statistic is 0.54 and the p-value is 0.4628.


In [37]:
df

1

In [38]:
expected

array([[137.27272727, 164.72727273],
       [ 47.72727273,  57.27272727]])

In [39]:
# Critical value
stats.chi2.isf(q=0.05, df=1)

3.8414588206941285
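
To see where the numbers from `chi2_contingency` come from, the two-way test can also be worked by hand. The sketch below is an illustration under stated assumptions, not the lesson's own code: the variable names (`table`, `expected_manual`, `stat_plain`, `stat_yates`) are hypothetical, and the Yates adjustment is written in its simple form, which matches SciPy here only because every |observed - expected| exceeds 0.5.

```python
import numpy as np
from scipy import stats

table = np.array([[141, 161], [44, 61]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = table.sum(axis=1)
col_totals = table.sum(axis=0)
n_total = table.sum()
expected_manual = np.outer(row_totals, col_totals) / n_total

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (table.shape[0] - 1) * (table.shape[1] - 1)

# Pearson statistic without any continuity correction
stat_plain = np.sum((table - expected_manual) ** 2 / expected_manual)

# Yates' continuity correction: shrink each |O - E| by 0.5 before squaring
# (simple form; assumes every |O - E| is larger than 0.5, as it is here)
stat_yates = np.sum((np.abs(table - expected_manual) - 0.5) ** 2 / expected_manual)

pval_plain = stats.chi2.sf(stat_plain, dof)
pval_yates = stats.chi2.sf(stat_yates, dof)

print(expected_manual)
print(f'Uncorrected: stat = {stat_plain:.2f}, p-value = {pval_plain:.4f}')
print(f'Yates:       stat = {stat_yates:.2f}, p-value = {pval_yates:.4f}')

# Cross-check against SciPy with the correction switched off and on
scipy_plain = stats.chi2_contingency(table, correction=False)[0]
scipy_yates = stats.chi2_contingency(table, correction=True)[0]
print(np.isclose(stat_plain, scipy_plain), np.isclose(stat_yates, scipy_yates))
```

The Yates-corrected line should agree with the 0.54 and 0.4628 reported above, since `chi2_contingency` applies that correction by default for a 2x2 table. Either way the statistic falls well below the critical value of 3.84, so the conclusion does not depend on the correction.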