# Multi-category chi-squared tests

Each row in the dataset represents a single person who was counted in the 1990 US Census, and contains information about their income and demographics. Here are some of the relevant columns:

- `age` -- the person's age.
- `workclass` -- the sector the person is employed in.
- `race` -- the race of the person.
- `sex` -- the gender of the person, either Male or Female.

In [14]:
import pandas as pd
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

In [6]:
income = pd.read_csv('income.csv')

In [7]:
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


### Calculating expected values

In [18]:
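# 0.669 / 0.331: proportion of males / females in the dataset
# 0.241 / 0.759: proportion earning >50K / <=50K
# 32561: total number of rows
males_over50k = 0.669 * 0.241 * 32561
males_under50k = 0.669 * 0.759 * 32561
females_over50k = 0.331 * 0.241 * 32561
females_under50k = 0.331 * 0.759 * 32561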

print(males_over50k)
print(males_under50k)
print(females_over50k)
print(females_under50k)

5249.777469000001
16533.531531000004
2597.423531
8180.267469000001
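The proportions 0.669 and 0.241 used above can be recovered from the observed counts themselves. The following is a minimal sketch, assuming the four observed cell counts that appear elsewhere in this notebook; the variable names are illustrative and not part of the original code.

```python
# Observed counts by sex and income bracket (copied from the notebook's data).
males_over, females_over = 6662, 1179
males_under, females_under = 15128, 9592

total = males_over + females_over + males_under + females_under

# Marginal proportions: share of males, and share earning more than 50K.
prop_male = (males_over + males_under) / total
prop_over50k = (males_over + females_over) / total

# Under independence, the expected count in a cell is the product of the
# two marginal proportions times the total number of rows.
expected_males_over50k = prop_male * prop_over50k * total

print(total)
print(round(prop_male, 3), round(prop_over50k, 3))
print(expected_males_over50k)
```

With unrounded proportions the expected count for males earning more than 50K comes out at about 5247 rather than 5249.8, so rounding the proportions to three decimals shifts the expected values slightly.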


### Calculating chi-squared

In [22]:
observed = [6662, 1179, 15128, 9592]
expected = [males_over50k, females_over50k, males_under50k, females_under50k]

values = []

for obs, exp in zip(observed, expected):
    values.append((obs - exp) ** 2 / exp)

chisq_gender_income = sum(values)
chisq_gender_income

1517.6008631215655

### Finding statistical significance

In [23]:
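# Expected counts rounded to one decimal place, so the statistic differs
# slightly from the 1517.60 computed by hand above.
expected = [5249.8, 2597.4, 16533.5, 8180.3]
observed = [6662, 1179, 15128, 9592]
pvalue_gender_income = chisquare(observed, expected)
pvalue_gender_income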

Power_divergenceResult(statistic=1517.5510981525103, pvalue=0.0)
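One caveat: by default `chisquare` treats the four cells as a flat list of k = 4 categories and uses k - 1 = 3 degrees of freedom, whereas a 2x2 table of sex by income has (2 - 1) x (2 - 1) = 1. The p-value is effectively zero either way here, but for borderline results the choice can matter. Below is a sketch of two ways to evaluate the statistic with one degree of freedom, assuming SciPy's `ddof` argument to `chisquare` and the `chi2` distribution object.

```python
from scipy.stats import chi2, chisquare

observed = [6662, 1179, 15128, 9592]
expected = [5249.8, 2597.4, 16533.5, 8180.3]

# ddof=2 reduces the degrees of freedom from k - 1 = 3 to k - 1 - 2 = 1.
result = chisquare(observed, expected, ddof=2)

# Equivalent: evaluate the chi-squared survival function directly with df=1.
pvalue_df1 = chi2.sf(result.statistic, df=1)

print(result.statistic, result.pvalue, pvalue_df1)
```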

### Cross tables

In [24]:
table = pd.crosstab(income["sex"], [income["high_income"]])
print(table)

high_income   <=50K   >50K
sex                       
 Female        9592   1179
 Male         15128   6662


In [25]:
table = pd.crosstab(income["sex"], [income["race"]])
print(table)

race      Amer-Indian-Eskimo   Asian-Pac-Islander   Black   Other   White
sex                                                                      
 Female                  119                  346    1555     109    8642
 Male                    192                  693    1569     162   19174


### Finding expected values

In [26]:
chisq_value, pvalue_gender_race, df, expected = chi2_contingency(table)
print(chisq_value)
print(pvalue_gender_race)
print(df)
print(expected)

454.2671089131088
5.192061302760456e-97
4
[[  102.87709223   343.69549461  1033.40204539    89.64531188
   9201.3800559 ]
 [  208.12290777   695.30450539  2090.59795461   181.35468812
  18614.6199441 ]]
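
The `expected` array returned by `chi2_contingency` follows the usual independence logic: each cell is its row total times its column total, divided by the grand total. Here is a short sketch that checks this, assuming NumPy is available and re-entering the counts from the sex-by-race cross table by hand rather than reading `income.csv`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Female, Male. Columns: Amer-Indian-Eskimo, Asian-Pac-Islander,
# Black, Other, White (counts copied from the cross table).
observed = np.array([
    [119, 346, 1555, 109, 8642],
    [192, 693, 1569, 162, 19174],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected count per cell under independence of sex and race.
expected_manual = np.outer(row_totals, col_totals) / grand_total

chisq_value, pvalue, dof, expected = chi2_contingency(observed)

print(np.allclose(expected_manual, expected))
print(dof)
```

The degrees of freedom are (rows - 1) x (columns - 1), which gives 4 for this 2x5 table.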
