<span style="font-size: xx-large; font-weight: bold;">Lecture 5 - Chi-square Test</span>

Ref: https://courses.edx.org/courses/BerkeleyX/Stat_2.3x/2T2014/course/

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy.stats import chisquare

# %pylab inline

## Not everything's normal: a chi-square test!

Sample of students at a large university

- freshmen: 54
- sophomores: 40
- juniors: 51
- seniors: 39
- graduate students: 16

**Known**:
- Among students at university, 10% are graduate students
- The rest are evenly distributed among freshmen, sophomores, juniors and seniors.

**Question**: Does the sample look like a simple random sample (SRS) of the university's students?

In [2]:
from scipy.stats import chisquare

f_obs = [54, 40, 51, 39, 16]
f_exp = [45, 45, 45, 45, 20]
chisquare(f_obs, f_exp)

Power_divergenceResult(statistic=4.7555555555555555, pvalue=0.3133107571278041)
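
The expected counts used above follow directly from the known proportions. A minimal sketch of that arithmetic, assuming the categories are ordered as in the sample list (the name `proportions` is introduced here for illustration):

```python
import numpy as np
from scipy.stats import chisquare

# Observed counts: freshmen, sophomores, juniors, seniors, graduate students
f_obs = np.array([54, 40, 51, 39, 16])

# Null-hypothesis proportions: 10% graduate students,
# the remaining 90% split evenly over the four undergraduate years
proportions = np.array([0.9 / 4] * 4 + [0.10])

# Expected count in each category = sample size * proportion
f_exp = f_obs.sum() * proportions
print(f_exp)

# Goodness-of-fit test with k - 1 = 4 degrees of freedom
result = chisquare(f_obs, f_exp)
print(result.statistic, result.pvalue)
```

With a p-value of about 0.31, the data give no reason to reject the hypothesis that the sample behaves like an SRS.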

## Lec 5.3:   Chi-squared test for independence

Simple random sample of students at a large university

<table>
  <tr><th>observed<th>Female<th>Male<th>Total</tr>
  <tr><th>Declared science<td>62<td>21<td>83</tr>
  <tr><th>Declared other<td>137<td>74<td>211</tr>
  <tr><th>Undeclared<td>48<td>58<td>106</tr>
  <tr><th>Total<td>247<td>153<td>400</tr></table>

Question: At the university, are gender and major declaration status independent?

Hypothesis:
- Null: independent
- Alternative: not independent


In [3]:
# Observed counts, flattened row by row (female, male for each status)
f_obs = [62, 21, 137, 74, 48, 58]
# Expected counts under independence, from the marginal totals:
# row total * column total / grand total
f_exp = [51.25, 31.75, 130.29, 80.71, 65.45, 40.55]
# Degrees of freedom for a 3 x 2 table
dof = (3 - 1) * (2 - 1)
# chisquare() uses k - 1 - ddof degrees of freedom (k = number of cells),
# so ddof must be k - 1 - dof = 3 here, not dof itself
ddof = len(f_obs) - 1 - dof

# Get the X2 statistic and its p-value
x2 = chisquare(f_obs, f_exp, ddof=ddof)
print("X2 statistic={:.3f}, p-value={:.8f}".format(x2.statistic, x2.pvalue))

X2 statistic=18.960, p-value=0.00007637
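
The expected counts for a test of independence come from the marginal totals. The sketch below shows one way to compute them without rounding and to get the p-value from a chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom. The variable names are illustrative, and `scipy.stats.chi2.sf` is used here in place of `chisquare`.

```python
import numpy as np
from scipy.stats import chi2

# Rows: declared science, declared other, undeclared; columns: female, male
observed = np.array([[62, 21],
                     [137, 74],
                     [48, 58]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Under independence: expected count = row total * column total / n
expected = np.outer(row_totals, col_totals) / n

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
statistic = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-squared distribution
p_value = chi2.sf(statistic, dof)

print(expected.round(2))
print("statistic={:.3f}, dof={}, p-value={:.6f}".format(statistic, dof, p_value))
```

Note that `scipy.stats.chisquare` computes its p-value with k - 1 - ddof degrees of freedom, where k is the number of cells. For a flattened 3 x 2 table (k = 6), 2 degrees of freedom therefore correspond to `ddof=3`.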


## Genetics Model

According to a genetics model, plants of a particular species occur in the categories A, B, C, and D, in the ratio 9:3:3:1. The categories of different plants are mutually independent. At a lab that grows these plants, 218 are in Category A, 69 in Category B, 84 in Category C, and 29 in Category D.

 

Does the model look good?

In [4]:
ratios = np.array([9, 3, 3, 1])
s = ratios.sum()
f_obs = np.array([218, 69, 84, 29])
f_exp = f_obs.sum() * ratios / s
print("observed = {}".format(f_obs))
print("expected = {}".format(f_exp))

print("X2 = {}".format(chisquare(f_obs, f_exp)))
print("=> The model looks good!")

observed = [218  69  84  29]
expected = [225.  75.  75.  25.]
X2 = Power_divergenceResult(statistic=2.417777777777778, pvalue=0.4903338749035878)
=> The model looks good!
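
To make the "looks good" judgment explicit, the statistic and p-value can also be computed by hand and compared with a significance level. This is a sketch only: the 5% level (`alpha`) is an assumption, not something stated in the problem.

```python
import numpy as np
from scipy.stats import chi2

ratios = np.array([9, 3, 3, 1])
f_obs = np.array([218, 69, 84, 29])
f_exp = f_obs.sum() * ratios / ratios.sum()

# Chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = ((f_obs - f_exp) ** 2 / f_exp).sum()

# Goodness of fit with 4 categories: 4 - 1 = 3 degrees of freedom
dof = len(f_obs) - 1
p_value = chi2.sf(statistic, dof)

alpha = 0.05  # assumed significance level
reject = p_value < alpha
print("statistic={:.4f}, dof={}, p-value={:.4f}, reject={}".format(
    statistic, dof, p_value, reject))
```

A p-value of about 0.49 is far above 5%, so the data are consistent with the 9:3:3:1 model.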


## Car places and fuel types
A simple random sample of cars in a city was categorized according to fuel type and place of manufacture.

       fuel type                     domestic   foreign
       gasoline                         146       191
       diesel                            18        26
       gasoline/electricity hybrid       51        79

Are place of manufacture and fuel type independent? Follow the steps in Problems 2A-2D.


In [5]:
df_obs = pd.DataFrame({'fuel': ['gasoline', 'diesel', 'hybrid'],
                       'domestic': [146, 18, 51],
                       'foreign': [191, 26, 79]})
df_obs['total'] = df_obs['domestic'] + df_obs['foreign']
df_obs.head()

Unnamed: 0,fuel,domestic,foreign,total
0,gasoline,146,191,337
1,diesel,18,26,44
2,hybrid,51,79,130


In [6]:
df_exp = df_obs.copy()
sum_domestic = df_obs['domestic'].sum()
sum_foreign = df_obs['foreign'].sum()
sum_total = df_obs['total'].sum()
# Marginal proportions of domestic and foreign cars
domestic_ratio = sum_domestic / sum_total
foreign_ratio = sum_foreign / sum_total
print("domestic ratio = {}".format(domestic_ratio))
print("foreign ratio = {}".format(foreign_ratio))
# Expected proportion of domestic gasoline cars under independence
print("domestic gasoline ratio = {}".format(domestic_ratio * (df_obs.loc[0, 'total'] / sum_total)))

# Expected counts under independence: row total * column proportion
df_exp['domestic'] = df_obs['total'] * domestic_ratio
df_exp['foreign'] = df_obs['total'] * foreign_ratio
display(df_exp.head())


domestic ratio = 0.4207436399217221
foreign ratio = 0.5792563600782779
domestic gasoline ratio = 0.2774767253495506


Unnamed: 0,fuel,domestic,foreign,total
0,gasoline,141.790607,195.209393,337
1,diesel,18.51272,25.48728,44
2,hybrid,54.696673,75.303327,130


In [7]:
f_obs = list(df_obs['domestic'].values) + list(df_obs['foreign'].values)
f_exp = list(df_exp['domestic'].values) + list(df_exp['foreign'].values)
print(f_obs)
print(f_exp)
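# A 3 x 2 table has (3 - 1) * (2 - 1) = 2 degrees of freedom; chisquare()
# uses k - 1 - ddof with k = 6 cells, so ddof = 3
x2 = chisquare(f_obs, f_exp, ddof=3)
print("statistic={:.4f}, p-value={:.4f}".format(x2.statistic, x2.pvalue))

[146, 18, 51, 191, 26, 79]
[141.79060665362036, 18.512720156555773, 54.69667318982387, 195.20939334637964, 25.487279843444227, 75.30332681017612]
statistic=0.6716, p-value=0.7148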
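
As a cross-check, SciPy also provides `scipy.stats.chi2_contingency`, which takes the two-way table directly and derives the expected counts and degrees of freedom itself. This is a sketch of an alternative route, not the method used in the course problems.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: gasoline, diesel, hybrid; columns: domestic, foreign
observed = np.array([[146, 191],
                     [18, 26],
                     [51, 79]])

# correction=False disables the Yates continuity correction
# (it would only apply to tables with 1 degree of freedom anyway)
statistic, p_value, dof, expected = chi2_contingency(observed, correction=False)

print("statistic={:.4f}, dof={}, p-value={:.4f}".format(statistic, dof, p_value))
print(expected.round(2))
```

The function reports 2 degrees of freedom for this 3 x 2 table, and the large p-value gives no evidence against independence of fuel type and place of manufacture.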


## scipy.stats.chisquare

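Returns:
- statistic: The chi-squared test statistic.
- pvalue: The p-value, computed from a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of categories.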

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
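
A short sketch of how the `f_exp` and `ddof` arguments behave. The counts below are made up for illustration and do not come from the lecture.

```python
import numpy as np
from scipy.stats import chi2, chisquare

counts = [16, 18, 16, 14, 12, 12]  # illustrative counts, k = 6 categories

# Without f_exp, all categories are assumed equally likely;
# the p-value uses k - 1 = 5 degrees of freedom
default = chisquare(counts)

# ddof is subtracted from k - 1, so ddof=1 means 4 degrees of freedom
adjusted = chisquare(counts, ddof=1)

print(default.statistic, default.pvalue)
print(adjusted.statistic, adjusted.pvalue)

# The same p-values from the chi-squared survival function
print(chi2.sf(default.statistic, 5), chi2.sf(adjusted.statistic, 4))
```

The statistic is identical in both calls; only the reference distribution, and therefore the p-value, changes with `ddof`.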