# Four Main Applications of the Chi-square Distribution

**1. One sample test - for population variance.**

**2. Chi-square test - for one-way table (goodness-of-fit).**

**3. Two sample test - for two population variances.**

**4. Chi-square test - for two-way tables (independence test).**

# 1. One sample test - for population variance

## 1.1 Formula

$$\chi^2 = \frac {(n-1)s^2} {\sigma_0^2}$$

## 1.2 Question
- **At a cereal filling plant, quality control engineers do not want the variance of the weights of 750 gram cereal boxes to exceed 100 grams^2.**

- **Test the null hypothesis that the true population variance equals 100 against the alternative that it is greater than 100.**

$$H_0: \sigma^2 = 100$$

$$H_a: \sigma^2 > 100$$

In [1]:
from scipy.stats import chi2
import numpy as np

In [2]:
def my_chi_square(data, h0, side, direction):
    '''
    Perform a chi-square test for a population variance.
    
    data: a list of observations.
    h0: population variance under the null hypothesis.
    side: 'one' or 'two'.
    direction: 'left' tests Ha: variance > h0 (upper-tail p value),
               'right' tests Ha: variance < h0 (lower-tail p value),
               'equal' is used only with side = 'two'.
    
    Output: Return the chi-square statistic and its corresponding p value.
    '''
    df = len(data) - 1
    sample_var = np.var(data, ddof = 1)
    chi_statistic = (df * sample_var) / h0
    
    # Upper tail is 1 - chi2.cdf (computed with chi2.sf); lower tail is chi2.cdf.
    upper_tail = chi2.sf(chi_statistic, df = df)
    lower_tail = chi2.cdf(chi_statistic, df = df)
    
    if side == 'one' and direction == 'left':
        p_value = upper_tail
    elif side == 'one' and direction == 'right':
        p_value = lower_tail
    # For a two sided test, double the smaller of the two tail probabilities.
    elif side == 'two' and direction == 'equal':
        p_value = 2 * min(upper_tail, lower_tail)
    else:
        raise ValueError('Unsupported combination of side and direction.')
        
    return chi_statistic, p_value

In [3]:
my_data = [775, 780, 781, 795, 803, 810, 823]
alpha_level = 0.05
chi_statistic, p_value = my_chi_square(my_data, h0 = 100, side = 'one', direction = 'left')

if p_value < alpha_level:
    print('There is strong evidence that the true variance of the weights of cereal in boxes of this type \nis greater than 100.')
else: 
    print("We don't have strong evidence from our data to reject the null hypothesis that the \n"
          "true variance of the weights of cereal in boxes of this type equals 100.")

There is strong evidence that the true variance of the weights of cereal in boxes of this type 
is greater than 100.


- **Find a 95% confidence interval for the population variance.**

$$ P\left( \frac {(n-1)S^2} {\chi^2_{\alpha/2}} \le \sigma^2 \le \frac {(n-1)S^2} {\chi^2_{1-\alpha/2}} \right) = 1 - \alpha $$

In [4]:
lower_bound = ((len(my_data) - 1) * np.var(my_data, ddof = 1)) / chi2.ppf(0.975, 6)
upper_bound = ((len(my_data) - 1) * np.var(my_data, ddof = 1)) / chi2.ppf(0.025, 6)
print('The 95% C.I. is [{}, {}].'.format(
    round(lower_bound, 2), round(upper_bound, 2)))

The 95% C.I. is [131.04, 1530.24].
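The interval calculation can also be wrapped in a small reusable helper. The following is a minimal sketch, not part of the original notebook: the function name `variance_confidence_interval` and its `confidence` parameter are hypothetical, and it assumes the data come from a normal population.

```python
import numpy as np
from scipy.stats import chi2

def variance_confidence_interval(data, confidence=0.95):
    """Return (lower, upper) bounds of a confidence interval for the population variance.

    Hypothetical helper: assumes the observations come from a normal population.
    """
    n = len(data)
    alpha = 1 - confidence
    # (n - 1) * S^2, the numerator shared by both bounds
    scaled_var = (n - 1) * np.var(data, ddof=1)
    # Dividing by the larger quantile gives the lower bound, and vice versa.
    lower = scaled_var / chi2.ppf(1 - alpha / 2, df=n - 1)
    upper = scaled_var / chi2.ppf(alpha / 2, df=n - 1)
    return lower, upper

weights = [775, 780, 781, 795, 803, 810, 823]
lower, upper = variance_confidence_interval(weights)
print('The 95% C.I. is [{}, {}].'.format(round(lower, 2), round(upper, 2)))
```

Passing a different `confidence` value, such as 0.99, widens the interval without any other change to the code.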


# 2. Chi-square test for one-way table (goodness-of-fit)

## 2.1 Formula

$$ \chi^2 = \sum_{i=1}^{n} \frac {(\text{Observed}_i - \text{Expected}_i)^2} {\text{Expected}_i}$$

$$ d.f. = n - 1$$

## 2.2 Question
- **One pure line of peas that had purple flowers and long pollen grains was crossed with another pure line that had red flowers and round pollen grains. The first generation was then self-crossed. If these two genes are inherited independently, a 9:3:3:1 ratio would be expected.**

- **Test the null hypothesis that the ratio is 9:3:3:1.**

**H0: P(Purple and long) = 9/16, P(Purple and round) = 3/16, 
P(Red and long) = 3/16, P(Red and round) = 1/16**

**Ha: True ratio of phenotypes is not 9:3:3:1.**

| Phenotype | Purple/Long | Purple/Round | Red/Long | Red/Round|
| --- | ---  | --- | --- | --- |
| Observed | 284 | 21 | 21 | 55 |
|Expected | 214.3125 | 71.4375 | 71.4375 | 23.8125 |

In [5]:
observed_phenotype = [284, 21, 21, 55]
expected_phenotype = [9/16*381, 3/16*381, 3/16*381, 1/16*381]

# The sum of the expected values will always be the same as 
# the sum of the observed values.
print(sum(expected_phenotype))
print(expected_phenotype)

381.0
[214.3125, 71.4375, 71.4375, 23.8125]


In [6]:
from scipy.stats import chisquare
chisquare(f_obs = observed_phenotype, f_exp = expected_phenotype, 
          ddof = len(expected_phenotype) - 1, axis = 0)

Power_divergenceResult(statistic=134.72820064158648, pvalue=nan)

In [7]:
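# The p-value returned by chisquare above is 'nan' because its ddof argument is subtracted
# from the default k - 1 degrees of freedom: passing ddof = k - 1 leaves 0 degrees of freedom.
# Here the p-value is computed directly from the chi-square distribution with k - 1 = 3 degrees of freedom.
# It prints as 0.0 because the true tail probability is too small for 1 - cdf to resolve.
print('p-value is {}.'.format(1 - chi2.cdf(134.72820064158648, df = len(expected_phenotype) - 1)))
print('There is very strong evidence that these phenotypes do not occur in a 9:3:3:1 ratio.')

p-value is 0.0.
There is very strong evidence that these phenotypes do not occur in a 9:3:3:1 ratio.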
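A `nan` p-value from `chisquare` goes away when `ddof` is left at its default of 0, since the function already uses k - 1 degrees of freedom for k categories. The sketch below, which is an addition and not part of the original notebook, recomputes the statistic directly from the formula in section 2.1 and uses the survival function `chi2.sf`, which keeps very small tail probabilities that `1 - chi2.cdf` rounds to zero. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([284, 21, 21, 55])
# Expected counts under the 9:3:3:1 ratio, scaled to the observed total.
expected = observed.sum() * np.array([9, 3, 3, 1]) / 16

# Statistic computed directly from the goodness-of-fit formula.
manual_statistic = np.sum((observed - expected) ** 2 / expected)

# With the default ddof=0, chisquare uses k - 1 = 3 degrees of freedom.
result = chisquare(f_obs=observed, f_exp=expected)

# Upper-tail probability from the chi-square distribution with k - 1 degrees of freedom.
manual_p_value = chi2.sf(manual_statistic, df=len(observed) - 1)

print(manual_statistic, result.statistic)
print(manual_p_value, result.pvalue)
```

Both routes should report the same statistic and a p-value that is extremely small but not exactly zero.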


# 3. Two sample test - for two population variances

## 3.1 Formulas 

**Suppose $X_1, \dots, X_n$ is a random sample of size n from a normal population with mean $\mu_X$ and variance $\sigma_X^2$. Suppose also that, independent of the first sample, $Y_1, \dots, Y_m$ is a random sample of size m from a normal population with mean $\mu_Y$ and variance $\sigma_Y^2$.**

**In this situation:**

$$ \frac {(n-1)S_X^2} {\sigma_X^2} $$ 

**and**

$$ \frac {(m-1)S_Y^2} {\sigma_Y^2} $$ 

**have independent chi-square distributions with n−1 and m−1 degrees of freedom, respectively. Therefore:**

$$ F = \frac {\frac {(n-1)S_X^2} {\sigma_X^2} \cdot \frac {1} {n-1}} {\frac {(m-1)S_Y^2} {\sigma_Y^2} \cdot \frac {1} {m-1}} = \frac {S_X^2} {S_Y^2} \cdot \frac {\sigma_Y^2} {\sigma_X^2}$$

**follows an F distribution with  n−1 numerator degrees of freedom and m−1 denominator degrees of freedom.**
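This distributional claim can be checked empirically with a short simulation. The sketch below is illustrative only: the population means and standard deviations, the seed, and the number of replications are hypothetical choices. It draws many pairs of normal samples, forms the scaled variance ratio, and compares the share of values above the 95th percentile of the F distribution with the nominal 5%.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible

n, m = 34, 29                    # sample sizes
sigma_x, sigma_y = 20.0, 12.0    # hypothetical population standard deviations
n_sim = 20_000                   # number of simulated pairs of samples

# Each row is one simulated sample from the corresponding normal population.
x = rng.normal(loc=105.0, scale=sigma_x, size=(n_sim, n))
y = rng.normal(loc=90.0, scale=sigma_y, size=(n_sim, m))

# Sample variances S_X^2 and S_Y^2 for every simulated sample.
s2_x = x.var(axis=1, ddof=1)
s2_y = y.var(axis=1, ddof=1)

# The ratio from the formula above: (S_X^2 / S_Y^2) * (sigma_Y^2 / sigma_X^2).
f_values = (s2_x / s2_y) * (sigma_y**2 / sigma_x**2)

# If the ratio follows F(n - 1, m - 1), about 5% of values exceed this quantile.
critical = f.ppf(0.95, dfn=n - 1, dfd=m - 1)
tail_fraction = np.mean(f_values > critical)
print(tail_fraction)
```

The printed fraction should land close to 0.05, with some simulation noise.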

## 3.2 Question

- **A psychologist was interested in exploring whether or not male and female college students have different driving behaviors. The particular statistical question she framed was as follows:**

| Males(X) | Females (Y) |
| --- | --- |
| n = 34 | m = 29 |
| mean(x) = 105.5 | mean(y) = 90.9 |
| std(x) = 20.1 | std(y) = 12.2 |

- **Is there sufficient evidence at the α = 0.05 level to conclude that the variance of the fastest speed driven by male college students differs from the variance of the fastest speed driven by female college students?**

- **Construct a 95% confidence interval for the ratio of the two population variances.**

$$H_0: \sigma_X^2 = \sigma_Y^2$$

$$H_1: \sigma_X^2 \neq \sigma_Y^2$$

$$F = \frac {s_X^2} {s_Y^2}$$

$$d.f. = n - 1, m - 1$$

$$C.I. = \left( \frac {1}{F_{\alpha/2}} \frac {s_X^2}{s_Y^2}, \frac {1}{F_{1-\alpha/2}} \frac {s_X^2}{s_Y^2} \right)$$

In [8]:
# Helper function: F test for comparing two population variances
from scipy.stats import f
def f_test_two_p_variance(var1, var2, n1, n2, alpha_level):
    '''
    Perform the F-test for comparing variances between two populations.
    
    var1: the sample variance from population 1
    var2: the sample variance from population 2
    n1: the size of the sample from population 1
    n2: the size of the sample from population 2
    alpha_level: significance level, e.g. 0.1, 0.05, or 0.01
    
    Outputs: F ratio, two sided p-value, and the (1 - alpha_level) confidence
    interval for the ratio of the population variances.
    '''
    
    # F test statistic
    F_ratio = var1 / var2
    
    # Two sided p-value (the usual question is whether the two variances are equal):
    # double the smaller of the two tail probabilities.
    p_value = 2 * min(f.sf(F_ratio, dfn = n1-1, dfd = n2-1),
                      f.cdf(F_ratio, dfn = n1-1, dfd = n2-1))
    
    # (1 - alpha_level) C.I. for the ratio of the population variances
    upper = F_ratio / f.ppf(alpha_level/2, n1-1, n2-1)
    lower = F_ratio / f.ppf(1 - (alpha_level/2), n1-1, n2-1)
    
    # Outputs
    return F_ratio, p_value, (lower, upper)

In [9]:
alpha_level = 0.05
F_test_statistic, p_value, CI = f_test_two_p_variance(var1 = 20.1**2, var2 = 12.2**2, n1 = 34, n2 = 29, alpha_level = 0.05)

In [10]:
if p_value < alpha_level:
    print('There is strong evidence that the variance of the fastest speed driven by male college students differs\nfrom the variance of the fastest speed driven by female college students.')
else:
    print("We don't have strong evidence that the variance of the fastest speed driven by male college students differs\nfrom the variance of the fastest speed driven by female college students.")

# These lines are the same for both outcomes, so they are printed once.
print('F test statistic is {}.'.format(round(F_test_statistic, 4)))
print('The corresponding p-value is {}.'.format(round(p_value, 4)))
print('The 95% C.I. is [{}, {}].'.format(round(CI[0], 4), round(CI[1], 4)))

There is strong evidence that the variance of the fastest speed driven by male college students differs
from the variance of the fastest speed driven by female college students.
F test statistic is 2.7144.
The corresponding p-value is 0.0086.
The 95% C.I. is [1.2994, 5.5484].


# 4. Chi-square test - for two-way tables (independence test)

## 4.1 Formula

$$ \chi^2 = \sum_{i=1}^{n} \frac {(\text{Observed}_i - \text{Expected}_i)^2} {\text{Expected}_i}$$

**d.f. = (number of rows - 1) * (number of columns - 1)**

**Expected count = (row total * column total) / overall total**

## 4.2 Question

- **A study of 11,160 alcohol drinkers on university campuses revealed:**

| | Binge drinking: Never | Binge drinking: Occasional | Binge drinking: Frequent |
| --- | --- | --- | --- |
| Trouble with police | 71 | 154 | 398 |
| No trouble with police | 4992 | 2808 | 2737 |

- **Test the null hypothesis that binge drinking and trouble with police are independent variables.**

**H0: Binge drinking and trouble with police are independent variables.**

**Ha: They are not independent.**

In [11]:
from scipy.stats import chi2_contingency
observed_values = [[71,154,398],
                   [4992,2808,2737]
                  ]
chi_test_statistic, p_value, df, expected_values = chi2_contingency(observed_values)
chi_test_statistic, p_value, df, expected_values

(469.5949136516296,
 1.0684645945577432e-102,
 2,
 array([[ 282.63879928,  165.35179211,  175.0094086 ],
        [4780.36120072, 2796.64820789, 2959.9905914 ]]))

In [13]:
alpha_level = 0.05
if p_value < alpha_level:
    print("There is strong evidence that binge drinking and trouble with police are not independent variables.")
else:
    print("We don't have strong evidence that binge drinking and trouble with police are not independent variables.")

# Scientific notation keeps a very small p-value from being rounded to 0.0.
print('Chi test statistic is {}.'.format(round(chi_test_statistic, 4)))
print('The corresponding p-value is {:.4e}.'.format(p_value))

There is strong evidence that binge drinking and trouble with police are not independent variables.
Chi test statistic is 469.5949.
The corresponding p-value is 1.0685e-102.
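To make the formulas in section 4.1 concrete, the expected counts and the test statistic can also be computed by hand. The sketch below is an addition, not part of the original notebook, and its variable names are illustrative. It builds each expected count as (row total * column total) / overall total and applies no continuity correction, which is assumed to match what `chi2_contingency` does for a table with more than one degree of freedom.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[71, 154, 398],
                     [4992, 2808, 2737]])

row_totals = observed.sum(axis=1)   # totals for trouble / no trouble with police
col_totals = observed.sum(axis=0)   # totals for never / occasional / frequent
total = observed.sum()              # overall total (11,160)

# Expected count for each cell = (row total * column total) / overall total
expected = np.outer(row_totals, col_totals) / total

# Sum of (Observed - Expected)^2 / Expected over all cells
statistic = np.sum((observed - expected) ** 2 / expected)

# d.f. = (number of rows - 1) * (number of columns - 1)
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution
p_value = chi2.sf(statistic, df)

print(expected)
print(statistic, df, p_value)
```

The values should agree with the statistic, degrees of freedom, and expected counts returned by `chi2_contingency` above.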
