In [1]:
%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});


In [4]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.stats as smstats

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Goodness of fit tests

The multinomial distribution generalizes the binomial distribution. Instead of two outcomes, each of $n$ independent trials results in one of $k$ outcomes with probabilities $p_i$, $i = 1, \dots, k$, where $\sum_{i=1}^k p_i = 1$. The random vector $\mathbf{X} = (X_1, \dots, X_k)$ records the number of times each outcome occurs.

$$
    f(\mathbf{x};n,\mathbf{p}) = P[X_1 = x_1, \dots, X_k = x_k] = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}, \qquad \sum_{i=1}^k x_i = n
$$

Note that this is a **multivariate and joint** probability mass function for $k$ random variables. 
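
As a quick sanity check on the formula, the sketch below evaluates the pmf directly and compares it with `scipy.stats.multinomial.pmf`. The helper `multinomial_pmf` and the values `x_demo` and `p_demo` are illustrative choices, not part of the original notebook.

```python
from math import factorial, prod
from scipy import stats

def multinomial_pmf(x, p):
    """Evaluate the multinomial pmf directly from its formula."""
    n = sum(x)
    coef = factorial(n)
    for xi in x:
        coef //= factorial(xi)          # n! / (x_1! ... x_k!), exact integer arithmetic
    return coef * prod(pi**xi for pi, xi in zip(p, x))

# Hypothetical example: 6 trials, 3 outcomes
x_demo = [3, 2, 1]
p_demo = [0.5, 0.3, 0.2]

by_formula = multinomial_pmf(x_demo, p_demo)
by_scipy = stats.multinomial.pmf(x_demo, n=sum(x_demo), p=p_demo)
print(by_formula, by_scipy)
```

Both values should agree up to floating point rounding, since they compute the same quantity.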

This is also the sampling distribution of the category counts in a random sample of size $n$ drawn from a large population in which each individual falls into one of $k$ categories. Under the null hypothesis that the population proportion vector is $\boldsymbol{\pi} = \mathbf{p_0}$, we can, in principle, compute an exact p value for a given sample realization $\mathbf{x}$ by summing the probabilities of all outcomes that are no more likely than the observed one:

$$
p \text{ value} = \sum_{\mathbf{y} \in S} f (\mathbf{y};n,\mathbf{p_0}) \quad \text{where}\; S = \{\mathbf{y}: f(\mathbf{y};n,\mathbf{p_0}) \le f(\mathbf{x};n,\mathbf{p_0})\}
$$
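
For small $n$ the exact p value can be obtained by enumerating every possible count vector. The following is a minimal sketch, assuming the p value sums the probabilities of all outcomes that are no more likely than the observed one. The helpers `compositions` and `exact_multinomial_pvalue` and the sample `x_small` are hypothetical, introduced only for illustration.

```python
import numpy as np
from scipy import stats

def compositions(n, k):
    """Yield every length-k tuple of non-negative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def exact_multinomial_pvalue(x, p0):
    """Brute-force exact p value: total probability of outcomes no more likely than x."""
    x = np.asarray(x)
    n, k = int(x.sum()), len(x)
    outcomes = np.array(list(compositions(n, k)))       # all possible count vectors
    probs = stats.multinomial.pmf(outcomes, n=n, p=p0)
    p_obs = stats.multinomial.pmf(x, n=n, p=p0)
    # Small relative tolerance so that exact ties are not lost to rounding
    return probs[probs <= p_obs * (1 + 1e-9)].sum()

# Hypothetical small sample: 10 trials, 3 categories
x_small = [7, 2, 1]
p0_small = [0.45, 0.45, 0.10]
p_exact = exact_multinomial_pvalue(x_small, p0_small)
print(p_exact)
```

The number of count vectors is $\binom{n+k-1}{k-1}$, so the enumeration grows quickly with $n$ and $k$.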

The computation is a brute-force enumeration of all possible outcomes, which can take too long if $n$ and/or $k$ is large. A simple approximate test, valid in the large $n$ limit, is the chi-squared goodness of fit test. Given the null hypothesis, we compute the expected count for each of the $k$ categories ($n$ times the hypothesized proportion) and compare it with the observed count. With $E_j$ and $O_j$ as the expected and observed counts respectively, the test statistic is approximately chi-squared distributed with $k-1$ degrees of freedom under the null hypothesis:

$$
\sum_{j=1}^k \frac{(O_j-E_j)^2}{E_j} \sim \chi^2_{k-1}
$$

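Large values of the statistic indicate disagreement with the null hypothesis, so the p value is the upper-tail probability. This test is widely used in the analysis of categorical data; a common rule of thumb is to require every expected count to be at least 5.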
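
The quality of the chi-squared approximation can be checked by simulation. The sketch below draws multinomial samples under the null hypothesis and estimates how often the statistic exceeds the 95% critical value of $\chi^2_{k-1}$. The sample size `n_sim`, the number of replications, and the seed are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily

n_sim, k_sim = 200, 3                    # hypothetical sample size and number of categories
p0_sim = np.array([0.45, 0.45, 0.10])    # null proportions
expected = n_sim * p0_sim

# Each row is one simulated vector of category counts under H0
counts = rng.multinomial(n_sim, p0_sim, size=20000)
sim_stats = ((counts - expected) ** 2 / expected).sum(axis=1)

# Fraction of simulated statistics beyond the 95% chi-squared critical value
critical = stats.chi2.ppf(0.95, df=k_sim - 1)
rejection_rate = np.mean(sim_stats > critical)
print(rejection_rate)
```

If the approximation is good, the estimated rejection rate should be close to the nominal 0.05.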

In [12]:
# https://www.newsweek.com/what-latest-polls-say-about-joe-biden-vs-donald-trump-2020-election-1505533
# May 21, 2020
# According to a new Quinnipiac University poll, 50 percent of voters said they would vote for Biden 
# if the election were today, while 39 percent favored Trump. The poll surveyed 1,323 registered voters 
# from May 14 to 18 and had a margin of error of plus or minus 2.7 percentage points.

n = 1323
k = 3         # Biden, Trump, other/undecided
ConfLevel = 0.95
alpha = 1 - ConfLevel

p0 = np.array([0.45,0.45,0.1])     # H0: pi = p0, tie between Biden and Trump

# Counts are reconstructed from the rounded poll percentages, so they are not integers
Observed = n*np.array([0.50,0.39, 0.11])
Expected = n*p0

TestStat = np.sum((Observed-Expected)**2/Expected)

UpperTailProb = 1-stats.chi2.cdf(TestStat,k-1)

Observed
Expected
TestStat
UpperTailProb   # Null hypothesis H0 rejected if UpperTailProb < alpha
if UpperTailProb < alpha:
    print("Null hypothesis rejected")
else:
    print("Failed to reject null hypothesis")

array([661.5 , 515.97, 145.53])

array([595.35, 595.35, 132.3 ])

19.256999999999987

6.582571437507845e-05

Null hypothesis rejected
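
The same result can be cross-checked with `scipy.stats.chisquare`, which computes the statistic and the upper-tail probability in one call, using $k-1$ degrees of freedom by default. This sketch rebuilds the counts from the poll percentages used above; the variable names `n_poll`, `observed`, `expected`, and `result` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

n_poll = 1323
observed = n_poll * np.array([0.50, 0.39, 0.11])   # counts implied by the poll percentages
expected = n_poll * np.array([0.45, 0.45, 0.10])   # counts under H0

result = stats.chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)
```

The statistic and p value should match `TestStat` and `UpperTailProb` from the manual calculation up to rounding.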
