# Goodness of Fit and Contingency Tables

```python
from IPython.display import Markdown

base_path = (
    "https://raw.githubusercontent.com/rezahabibi96/GitBook/refs/heads/main/"
    "books/applied-statistics-with-python/.resources"
)
```

```python
import math
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.utils import resample
from scipy.stats import norm, binom

import seaborn as sns
import matplotlib.pyplot as plt
from PIL import Image
from matplotlib.pyplot import figure

import requests
from io import BytesIO
```

## Goodness of Fit

The goodness of fit refers to a method for assessing a distribution of binned data. It is best illustrated with an example.

**Example**

A shop owner wants to compare the number of t-shirts of each size that were sold with the ordered proportions. Assume the sold (observed) numbers of t-shirts and the expected proportions for each size are as given in the Observed and Expected Proportions columns of the following table:

| Size    | Observed | Expected Proportions | Expected Values |
| ------- | -------- | -------------------- | --------------- |
| Small   | 25       | 0.1                  | 22.5            |
| Medium  | 41       | 0.2                  | 45              |
| Large   | 91       | 0.4                  | 90              |
| X-Large | 68       | 0.3                  | 67.5            |

The total number of the shirts sold is $Total = 25 + 41 + 91 + 68 = 225$. If the sales follow the expected proportions, we would expect $225 \cdot 0.1 = 22.5$ small shirts to be sold, $225 \cdot 0.2 = 45$ medium, etc. (see the column of the Expected Values in the table above).

The numbers of shirts of each size (observed values) are quite close to the expected values in this case (not the same due to sampling variation). Are the differences large enough to indicate that the observed (sample) sales do not follow the expected (population) sales?

Let's formulate it as a hypothesis test:

$H_0$: The observed sales follow the expected proportions or $O \approx E$ (no bias, natural sampling fluctuation).

$H_1$: The observed sales do NOT follow the expected proportions or $O \ne E$.

In the hypothesis tests for proportions, a test statistic was:

$$ \dfrac{\text{point estimate} - \text{null estimate}}{\text{SE of point estimate}} $$

As always in hypothesis testing, the null hypothesis $H_0$ is initially assumed true, so the expected counts $E$’s are taken as null estimates. In Mathematical Statistics, it is shown that standard error $SE = \sqrt{E}$. Therefore, the standardized residuals for each category are

$$ \dfrac{O_i - E_i}{\sqrt{E_i}} $$

where $i$ indexes the categories (Small, Medium, etc.). As always, residuals have varying signs, so adding them would lead to cancellations; instead their squares are considered.

Let $O_1, O_2, ..., O_k$ be the observed counts in $k$ groups, and $E_1, E_2, ..., E_k$ be the corresponding expected counts under a null hypothesis. Then provided that all $E_i \geq 5$, the **Chi-Squared test statistic**

$$ \chi^2 = \dfrac{(O_1 - E_1)^2}{E_1} + \dfrac{(O_2 - E_2)^2}{E_2} + ... + \dfrac{(O_k - E_k)^2}{E_k} $$

follows a chi-squared distribution with $k - 1$ degrees of freedom. A larger difference between observed and expected values (against $H_0$) leads to larger $\chi^2$, so the p-value is the upper tail area of this chi-squared distribution.

Unlike the normal distribution, the $\chi^2$ is a family of distributions whose shape depends on the degrees of freedom. Each member is skewed right rather than symmetric, as shown by the code and the resulting Figure below.
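One way to draw such a figure is sketched below. The particular degrees of freedom and the plotting range are illustrative assumptions, not values taken from the text; the densities come from `scipy.stats.chi2.pdf`.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(0.01, 30, 500)
dfs = [2, 3, 6, 10]  # illustrative choices of degrees of freedom

fig, ax = plt.subplots(figsize=(7, 4))
for df in dfs:
    ax.plot(x, stats.chi2.pdf(x, df), label=f"df = {df}")
ax.set_xlabel(r"$\chi^2$")
ax.set_ylabel("density")
ax.set_title("Chi-squared densities for several degrees of freedom")
ax.legend()
plt.show()

# The theoretical skewness is sqrt(8 / df): always positive, shrinking as df grows.
skewness = {df: float(stats.chi2.stats(df, moments="s")) for df in dfs}
print(skewness)
```

The positive skewness for every choice of `df` matches the right-skewed shape of the curves, and it decreases as the degrees of freedom grow.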

For our data:

$$ \chi^2 = \dfrac{(O_1 - E_1)^2}{E_1} + \dfrac{(O_2 - E_2)^2}{E_2} + ... + \dfrac{(O_k - E_k)^2}{E_k} $$

$$ = \dfrac{(25 - 22.5)^2}{22.5} + \dfrac{(41 - 45)^2}{45} + ... + \dfrac{(68 - 67.5)^2}{67.5} $$

$$ = 0.278 + 0.356 + ... + 0.004 = 0.648 $$
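The arithmetic above can be reproduced step by step. This is a minimal sketch using the counts and proportions from the table; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([25, 41, 91, 68])          # Small, Medium, Large, X-Large
proportions = np.array([0.1, 0.2, 0.4, 0.3])   # expected proportions under H0

expected = observed.sum() * proportions         # total count times each proportion
std_resid = (observed - expected) / np.sqrt(expected)  # standardized residuals
chi_sq = np.sum(std_resid**2)                   # chi-squared test statistic
df = len(observed) - 1                          # k - 1 degrees of freedom
p_value = stats.chi2.sf(chi_sq, df)             # upper-tail area

print(np.round(std_resid**2, 3))                # per-category contributions
print(round(chi_sq, 3), df, round(p_value, 4))
```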

The function defined below computes the steps of the goodness of fit method and plots a chi-squared distribution with $df = k - 1 = 4 - 1 = 3$ degrees of freedom with the shaded area corresponding to the p-value as shown in the Figure below.
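The original function is not reproduced here. As a stand-in, the sketch below wraps `scipy.stats.chisquare` in a hypothetical helper named `goodness_of_fit`; the name, signature, and returned table layout are assumptions made for illustration, and the plotting step is omitted.

```python
import numpy as np
import pandas as pd
from scipy import stats


def goodness_of_fit(observed, proportions, labels=None):
    """Chi-squared goodness-of-fit test; returns (table, chi_sq, df, p_value)."""
    observed = np.asarray(observed, dtype=float)
    proportions = np.asarray(proportions, dtype=float)
    expected = observed.sum() * proportions
    if (expected < 5).any():
        # The chi-squared approximation assumes every expected count is >= 5.
        raise ValueError("every expected count should be at least 5")
    result = stats.chisquare(f_obs=observed, f_exp=expected)
    table = pd.DataFrame(
        {
            "Observed": observed,
            "Expected": expected,
            "Contribution": (observed - expected) ** 2 / expected,
        },
        index=labels,
    )
    return table, result.statistic, len(observed) - 1, result.pvalue


# Usage with the t-shirt data from the table above
table, chi_sq, df, p_value = goodness_of_fit(
    [25, 41, 91, 68],
    [0.1, 0.2, 0.4, 0.3],
    labels=["Small", "Medium", "Large", "X-Large"],
)
print(table.round(3))
print(f"chi2 = {chi_sq:.3f}, df = {df}, p-value = {p_value:.4f}")
```

Raising an error when an expected count falls below 5 is one possible design choice; a warning would serve equally well.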

For our example, the observed $O$’s are very close to the expected $E$’s, leading to a very small $\chi^2 = 0.648$, which results in a very large p-value $= 0.8853268 > 0.05$ as shown in the chi-squared distribution Figure above. Therefore, there is not enough evidence to reject $H_0$: the observed shirt sales are consistent with the expected proportions. We can also see it in the side-by-side bar plot of the observed and expected counts above.

**Example**

The data for a particular hospital's emergency room admissions over the previous month are given below. To properly allocate resources, investigate if the admissions are *uniformly* distributed by the day of the week.

$H_0$: The observed values of the number of hospital admissions follow the expected proportions $O \approx E$ (equally likely 7 days, so $p_i = 1/7$).

$H_1$: The observed values do NOT follow the expected proportions $O \ne E$.

Adding the observed values, we obtain $Total = 535$. If the observed values follow the expected proportions, $535 \cdot \frac{1}{7} \approx 76.429$ admissions are expected on Sun, the same on other days. All the Expected Values are given in the table above and none is below 5. The resulting test statistic $\chi^2$ is:

$$ \chi^2 = \dfrac{(O_1 - E_1)^2}{E_1} + \dfrac{(O_2 - E_2)^2}{E_2} + ... + \dfrac{(O_k - E_k)^2}{E_k} $$

$$ = \dfrac{(99 - 76.429)^2}{76.429} + \dfrac{(72 - 76.429)^2}{76.429} + ... + \dfrac{(92 - 76.429)^2}{76.429} $$

$$ = 6.666 + 0.257 + ... + 3.172 = 14.781 $$

The $\chi^2$ degrees of freedom are $df = k - 1 = 7 - 1 = 6$.

The $\chi^2 = 14.781$ with $df = 6$ degrees of freedom is sufficiently large to produce a small p-value $= 0.0220276 < 0.05$ as shown in the chi-squared distribution Figure above. Therefore, there is enough evidence to reject $H_0$, i.e., the observed admissions on different days of the week are not uniformly distributed. We can see from the table that there are more emergency visits on weekends when regular doctors are not available. We can also see it in the side-by-side bar plot of the observed and expected counts above.

**Example**

In the `HELPrct` file on substance abusers' health data, investigate if the `substance` variable is equally distributed between alcohol, cocaine, and heroin. Note that we employ `value_counts()` to get the counts for each substance level.

$H_0$: The observed values of the substance abusers in each group follow the expected proportions $O \approx E$ (equally likely).

$H_1$: The observed values do NOT follow the expected proportions or $O \ne E$.

The observed values add up to $Total = 453$. If the observed values follow the expected proportions, $453 \cdot \frac{1}{3} = 151$ should be the expected count for each substance (none of them is below 5). The resulting test statistic $\chi^2$ is:

$$ \chi^2 = \dfrac{(O_1 - E_1)^2}{E_1} + \dfrac{(O_2 - E_2)^2}{E_2} + ... + \dfrac{(O_k - E_k)^2}{E_k} $$

$$ = \dfrac{(177 - 151)^2}{151} + \dfrac{(152 - 151)^2}{151} + \dfrac{(124 - 151)^2}{151} $$

$$ = 4.477 + 0.007 + 4.828 = 9.311 $$
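The same statistic can be checked with `scipy.stats.chisquare`, which assumes equal expected counts when `f_exp` is not supplied. The sketch below types in the three counts from the computation above rather than loading the data file.

```python
from scipy import stats

observed = [177, 152, 124]          # counts in the order used in the computation above
result = stats.chisquare(observed)  # f_exp defaults to equal expected counts

print(round(result.statistic, 3), round(result.pvalue, 4))
```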

The chi-squared distribution with $df = k - 1 = 3 - 1 = 2$ degrees of freedom is shown in the Figure above. Here, $\chi^2 = 9.311$ and the resulting p-value $= 0.0095 < 0.05$ is the right-tail area in that Figure.

Thus, there is enough evidence to reject $H_0$; the substances are not distributed equally. Alcohol is reported more often than either of the other two substances. It can also be seen in the side-by-side bar plot of the observed and expected counts above.

**Example**

An M&M candy pack is supposed to have 23% blue, 23% orange, 15% green, 15% yellow, 12% red, and 12% brown candies. A random sample of several packs is selected and different colors are counted. Do the observed counts follow the claimed proportions?

$H_0$: The observed counts of colors follow the expected proportions $O \approx E$.

$H_1$: The observed values do NOT follow the expected proportions or $O \ne E$.

Observed values sum up to $Total = 289$. If they follow the expected proportions, there should be $289 \cdot 0.23 \approx 66.47$ Blue, etc. The Expected Values are in the table above, none of them is below 5. The resulting test statistic $\chi^2$ is:

$$ \chi^2 = \dfrac{(O_1 - E_1)^2}{E_1} + \dfrac{(O_2 - E_2)^2}{E_2} + ... + \dfrac{(O_k - E_k)^2}{E_k} $$

$$ = \dfrac{(58 - 66.47)^2}{66.47} + \dfrac{(69 - 66.47)^2}{66.47} + ... + \dfrac{(29 - 34.68)^2}{34.68} $$

$$ = 1.079 + 0.096 + ... + 0.93 = 4.32 $$

The chi-squared distribution with $df = k - 1 = 6 - 1 = 5$ degrees of freedom is shown above. The $\chi^2 = 4.32$ is small and produces a very large p-value $= 0.5044 > 0.05$ (very large right tail) as shown in the chi-squared distribution Figure above. Thus, there is NOT enough evidence to reject $H_0$: the observed colors of candies are consistent with the manufacturer's claimed proportions. We can also see it in the side-by-side bar plot of the observed and expected counts above.

**Example**

For many numerical data sets, the leading-digit distribution is surprisingly heavily skewed right (Benford's law): a leading 1 occurs about 30.1% of the time, a leading 2 about 17.6%, …, and a leading 9 less than 5%. If the leading digits were uniformly distributed, each of the nine digits would occur about 11.1% of the time. The formula for the probability distribution of the first digit $d = 1, 2, ..., 9$ is given by:

$$ P(d) = \log_{10}(d + 1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right) $$
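The formula is easy to evaluate directly. This short sketch tabulates the nine probabilities and confirms that they sum to 1; the variable names are illustrative.

```python
import numpy as np

digits = np.arange(1, 10)
benford = np.log10(1 + 1 / digits)   # P(d) for d = 1, ..., 9

for d, p in zip(digits, benford):
    print(d, round(float(p), 3))
print("sum =", round(float(benford.sum()), 6))
```

Multiplying these probabilities by the total number of observations gives the expected counts used in a goodness of fit test.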

The observed counts of leading digits from a large financial document are given below. Do they follow Benford's law?

$H_0$: The observed values of the leading digits follow the expected proportions $O \approx E$.

$H_1$: The observed values do NOT follow the expected proportions or $O \ne E$.

The observed counts add up to $Total = 3542$. If the observed counts follow the expected proportions, $3542 \cdot \log_{10}(2) \approx 3542 \cdot 0.30103 \approx 1066.248$ of the leading digits should be 1, etc. The Expected Values are shown in the table and the second Figure above, none of them is below 5. The resulting test statistic $\chi^2$ is:

$$ \chi^2 = \dfrac{(O_1 - E_1)^2}{E_1} + \dfrac{(O_2 - E_2)^2}{E_2} + ... + \dfrac{(O_k - E_k)^2}{E_k} $$

$$ = \dfrac{(1110 - 1066.248)^2}{1066.248} + \dfrac{(601 - 623.715)^2}{623.715} + ... + \dfrac{(156 - 162.073)^2}{162.073} $$

$$ = 1.795 + 0.827 + ... + 0.228 = 6.058 $$

The $\chi^2$ follows a chi-squared distribution with $df = k - 1 = 9 - 1 = 8$ degrees of freedom shown in the Figure above. Here, $\chi^2 = 6.058$ is quite small, resulting in a very large p-value $= 0.64 > 0.05$ (right tail in the chi-squared plot in the Figure above). Therefore, there is NOT enough evidence to reject $H_0$; the observed distribution of leading digits in this financial report is consistent with Benford's law. We can also see it in the side-by-side bar plot of the observed and expected counts above.

## Chi-Squared Test of Independence in a Two-Way Table

In this section, two-way contingency tables of frequency counts for categorical data are studied to assess whether there is a dependence between these two variables.

**Example**

A supplement company claims that their extract is effective in preventing common cold viruses. Healthy volunteers randomized into groups were given a placebo, low dose, or high dose of the supplement and exposed to a cold virus. The results are summarized in the Table below. Test the claim that getting a cold infection is independent of the treatment group (i.e., row and column variables are independent) at the 5% level.

|              | Placebo | Low Dose | High Dose | Sum |
| ------------ | ------- | -------- | --------- | --- |
| Infected     | 10      | 28       | 31        | 69  |
| Not Infected | 40      | 60       | 64        | 164 |
| Sum          | 50      | 88       | 95        | 233 |

$H_0$: Getting an infection is independent of the treatment (i.e., **row and column variables are independent**).

$H_1$: **Dependent**.

The $\chi^2$ statistic approach is used again, but what are the expected counts?

As always, initially assume $H_0$ is true — getting an infection is independent of the treatment group. Let's concentrate on a particular cell in the upper left corner:

$$ P(\text{Infected and Placebo}) = [H_0 \text{ true } \Rightarrow \text{independent}]$$

$$= P(\text{Infected}) \cdot P(\text{Placebo}) = \frac{69}{233} \cdot \frac{50}{233}$$

Therefore, the expected count in this cell should be:

$$ E = (\text{Grand Total}) \cdot (\text{Probability}) = 233 \cdot \frac{69}{233} \cdot \frac{50}{233} = \frac{69 \cdot 50}{233} $$

Therefore, the expected value in each cell is given by:

$$ E = \frac{(\text{Row Total}) \cdot (\text{Column Total})}{\text{Grand Total}} $$

Therefore:

$$ E_{1,1} = \frac{69 \cdot 50}{233} = 14.807 $$

$$ E_{1,2} = \frac{69 \cdot 88}{233} = 26.060 $$

$$ \vdots $$

$$ E_{2,3} = \frac{164 \cdot 95}{233} = 66.867 $$
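All six expected counts can be obtained at once from the row and column totals. The sketch below applies the (Row Total)(Column Total)/(Grand Total) formula with `numpy.outer`; the array layout is an assumption that mirrors the table above.

```python
import numpy as np

observed = np.array([[10, 28, 31],    # Infected: Placebo, Low Dose, High Dose
                     [40, 60, 64]])   # Not Infected

row_totals = observed.sum(axis=1)     # 69, 164
col_totals = observed.sum(axis=0)     # 50, 88, 95
expected = np.outer(row_totals, col_totals) / observed.sum()

print(np.round(expected, 3))
```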

Given the expected counts, the same $\chi^2$ computations as in Goodness of Fit are performed, but the degree of freedom is:

$$ df = (\text{number of rows} - 1)(\text{number of columns} - 1) = (2 - 1)(3 - 1) = 2 $$

To illustrate these degrees of freedom, let's consider the same contingency table as above, but specify only two cell entries along with the totals. Then all other entries can be found by subtracting from the totals. For example, the entry for Infected and High Dose is $69 - 10 - 28 = 31$, and the entry for Not Infected and Placebo is $50 - 10 = 40$.

|              | Placebo | Low Dose | High Dose | Sum |
| ------------ | ------- | -------- | --------- | --- |
| Infected     | 10      | 28       | ?         | 69  |
| Not Infected | ?       | ?        | ?         | 164 |
| Sum          | 50      | 88       | 95        | 233 |

The $\chi^2$ statistic is

$$ \chi^2 = \dfrac{(O_{1,1} - E_{1,1})^2}{E_{1,1}} + \dfrac{(O_{1,2} - E_{1,2})^2}{E_{1,2}} + ... + \dfrac{(O_{2,3} - E_{2,3})^2}{E_{2,3}} $$

$$ = \dfrac{(10 - 14.807)^2}{14.807} + \dfrac{(28 - 26.060)^2}{26.060} + ... + \dfrac{(64 - 66.867)^2}{66.867} $$

$$ = 1.56 + 0.144 + ... + 0.123 = 2.837 $$

The $\chi^2$ follows a chi-squared distribution with $df = (2 - 1)(3 - 1) = 2$ degrees of freedom shown in the Figure below. $\chi^2 = 2.837$ is small and results in a large p-value $= 0.2420422 > 0.05$ (the right tail in the Figure).

Therefore, there is NOT enough evidence to reject $H_0$: the data are consistent with infection being independent of the treatment group, so they do not support the claim that the supplement is effective.

Note in the code below that the counts must be entered column-by-column, and row names are in the data frame index. The second Figure shows a stacked barplot of the data.
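The original code cell is not shown here. As a hedged substitute, the sketch below enters the counts column by column into a data frame (the name `cold` is an assumption) and runs the test with `scipy.stats.chi2_contingency`.

```python
import pandas as pd
from scipy import stats

# Counts entered column by column; row names go into the index.
cold = pd.DataFrame(
    {"Placebo": [10, 40], "Low Dose": [28, 60], "High Dose": [31, 64]},
    index=["Infected", "Not Infected"],
)

chi_sq, p_value, df, expected = stats.chi2_contingency(cold)

print(cold)
print(pd.DataFrame(expected, index=cold.index, columns=cold.columns).round(3))
print(f"chi2 = {chi_sq:.3f}, df = {df}, p-value = {p_value:.4f}")
```

A stacked bar plot of the same table could then be drawn with `cold.T.plot(kind="bar", stacked=True)`, which places one bar per treatment group.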

Note that when there are only two rows (as in this case) or only two columns, we can look at this problem from the proportion comparison point of view:

$H_0$: $p_1 = p_2 = p_3$ — proportions of infected individuals in each treatment group are the same.

$H_1$: At least one of the proportions is different.

As such, it is a generalization of the two-proportion test studied before. In fact, the proportions of infected individuals in our example are given below for Placebo, low dose, and high dose of the supplement, respectively.
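These proportions follow directly from the table. A minimal sketch, with illustrative variable names:

```python
import numpy as np

infected = np.array([10, 28, 31])      # Placebo, Low Dose, High Dose
group_sizes = np.array([50, 88, 95])   # column totals

proportions = infected / group_sizes
pooled = infected.sum() / group_sizes.sum()   # overall infection rate, 69 / 233

print(np.round(proportions, 3))
print(round(float(pooled), 3))
```

Under $H_0$ each group's proportion would estimate the same pooled rate, so the test asks whether the spread among the three values exceeds what sampling variation alone would produce.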

**Example**

A polling company conducted a study to determine the support for the National healthcare plan among randomly chosen respondents with different party affiliations. The data are shown in the code below. Test the claim that the response is independent of the party affiliation (i.e., row and column variables are independent).

$H_0$: The response is independent of the party affiliation (i.e., row and column variables are independent)

$H_1$: Dependent

Let's start with the code and then explain the steps. Note that none of the expected counts are below 5, so the assumptions of the applicability of $\chi^2$ test hold.

The $\chi^2 = 146.45$ is very large, producing a very small p-value $= 1.5803313 \cdot 10^{-32} \ll 0.05$, so there is more than enough evidence to reject $H_0$ as shown in the chi-squared distribution Figure above. The response to the National plan is very much dependent on the party affiliation — Republicans mostly oppose any national plans, Democrats mostly support, and independents are about 50/50. We can also see it in the stacked bar plot above.

There are only two columns in this table; therefore, the problem could be rewritten from the proportion comparison point of view:

$H_0$: $p_1 = p_2 = p_3$ — proportions of support in each party are the same.

$H_1$: At least one of the proportions is different.

The proportions of support in our example are given below for Republicans, Democrats, and Independents, respectively, and according to the problem results, they are significantly different.

**Example**

A marketing study of shopping habits by social class produced the data below. Investigate if the brand choice is independent of the social class (i.e., row and column variables are independent).

$H_0$: The brand choice is independent of the social class (i.e., row and column variables are independent).

$H_1$: Dependent.

None of the expected counts below is less than 5, so the $\chi^2$ approach is applicable.

The $\chi^2 = 236.89$ is very large, producing a p-value of essentially 0 $(= 2.591575 \cdot 10^{-49} \ll 0.05)$, as shown in the chi-squared distribution Figure above. There is more than enough evidence to reject $H_0$: the brand choice is very much dependent on social class. We can also see it in the stacked bar plot above.

Note that in this case there are more than two row and column levels, so this problem cannot be thought of as a proportion test.

**Example**

Consider the `HELPrct` data file again. Test the claim that the gender (`sex`) is independent of the preferred substance (`substance`) at the 5% level.

$H_0$: The substance is independent of the gender (i.e., row and column variables are independent).

$H_1$: Dependent.

First, since the data file is available, `pd.crosstab()` is used to tabulate `substance` and `sex` (the row and column names are assigned automatically). Its output is fed directly into the chi-squared test function. None of the expected counts are below 5, so the $\chi^2$ test can be applied.
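To show the mechanics of `pd.crosstab()` without the data file, the sketch below uses a few made-up rows. They are toy values for illustration only and are NOT the `HELPrct` data, so the numbers it prints say nothing about the actual study.

```python
import pandas as pd
from scipy import stats

# Toy rows for illustration only; these are NOT the HELPrct data.
toy = pd.DataFrame({
    "substance": ["alcohol", "cocaine", "heroin", "alcohol",
                  "heroin", "cocaine", "alcohol", "heroin"],
    "sex": ["male", "female", "male", "female",
            "male", "male", "male", "female"],
})

# Rows are substance levels, columns are sex levels; names are taken from the data.
table = pd.crosstab(toy["substance"], toy["sex"])
print(table)

# The crosstab can be passed straight to the test. With counts this small the
# chi-squared approximation is not valid; only the mechanics are shown.
chi_sq, p_value, df, expected = stats.chi2_contingency(table)
print(df)
```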

$\chi^2 = 2.026$ is very small, resulting in a large p-value $= 0.363 > 0.05$, as shown in the chi-squared distribution Figure above. Therefore, there is not enough evidence to reject $H_0$; the data are consistent with substance being independent of gender. The stacked bar plot above shows it as well.

With only two columns, the problem can be reformulated as a proportion test:

$H_0$: $p_1 = p_2 = p_3$, i.e., the proportions of males for each substance are the same.

$H_1$: At least one of the proportions is different.

The observed proportions of males for each substance are:

Based on the results, these proportions are not significantly different.

**Example**

Let's come back to an example considered in the previous chapter. Recall that 87 out of 100 students preparing with the Barron guide passed the Medical College Admission Test (MCAT), while 91 out of 120 passed it with the Princeton guide. Is there a significant difference at the 1% level? How about the 5% level?

We recast this problem as a contingency table as follows: test the claim that passing is independent of the type of review used (i.e., row and column variables are independent).

$\chi^2 = 4.4033$ results in p-value $= 0.0358 > 0.01$, as shown in the chi-squared distribution Figure above. Therefore, there is *not* enough evidence to reject $H_0$ at the 0.01 level, so at this stricter level we cannot conclude that passing the MCAT depends on the preparation type. We can also see it in the stacked bar plot above.

On the other hand, if $\alpha = 0.05$ were used, we would have rejected $H_0$.

Comparing these results with the 2-proportions test in the previous chapter, we notice that the p-values are the same and $z^2 = 2.0984^2 = 4.4033 = \chi^2$. This identity holds in general for a $2 \times 2$ table when no continuity correction is applied.
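The identity can be checked numerically. The sketch below assumes the pooled two-proportion $z$ test and calls `scipy.stats.chi2_contingency` with `correction=False`, because SciPy applies the Yates continuity correction to $2 \times 2$ tables by default and that would change the statistic.

```python
import numpy as np
from scipy import stats

table = np.array([[87, 13],    # Barron: passed, failed
                  [91, 29]])   # Princeton: passed, failed

# Chi-squared test of independence without the Yates continuity correction
chi_sq, p_value, df, expected = stats.chi2_contingency(table, correction=False)

# Pooled two-proportion z test on the same data
n1, n2 = table.sum(axis=1)
p1, p2 = table[:, 0] / table.sum(axis=1)
p_pool = table[:, 0].sum() / table.sum()
z = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_two_sided = 2 * stats.norm.sf(abs(z))

print(round(chi_sq, 4), round(z**2, 4))
print(round(p_value, 4), round(p_two_sided, 4))
```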