# Chi-Squared Tests

In this lesson, we'll learn about the chi-squared test for categorical data. This test allows us to find the statistical significance of observing a set of categorical values.

- We'll work with data on U.S. income and demographics throughout this lesson. Each row represents a single person who was counted in the 1990 US Census and contains information about their income and demographics. 


- The entire dataset has <font color='red'>32,561</font> rows and is a sample of the full census. Of these rows, <font color='red'>10,771</font> are Female and <font color='red'>21,790</font> are Male. These numbers look off, because the full census shows that the U.S. is about <font color='red'>50%</font> Male and <font color='red'>50%</font> Female. If the sample matched that split, we would expect half of the 32,561 rows, or <font color='red'>16,280.5</font>, in each category.


We know that the numbers look off, but we don't yet have a way to quantify how far the observed values are from the expected values. We also can't tell whether the difference between the two is statistically significant and worth examining further.


<font color='blue'>This is where a chi-squared test can help. The chi-squared test enables us to **quantify the difference between sets of observed and expected categorical values**. </font>

In [3]:
import pandas as pd
import numpy as np
import sys
print(sys.version)
pd.options.display.max_columns

3.8.12 (default, Oct 12 2021, 06:23:56) 
[Clang 10.0.0 ]


20

On the last screen, our observed values were 10771 Females, and 21790 Males. Our expected values were 16280.5 Females and 16280.5 Males.

- Compute the proportional difference in number of observed Females vs number of expected Females. Assign the result to female_diff.


- Compute the proportional difference in number of observed Males vs number of expected Males. Assign the result to male_diff.
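
As a worked sketch of the two steps above: the proportional difference is the observed count minus the expected count, divided by the expected count. Plugging in the counts from this lesson (results rounded to three decimals) gives:

$$\text{Female: } \frac{10771 - 16280.5}{16280.5} \approx -0.338 \qquad \text{Male: } \frac{21790 - 16280.5}{16280.5} \approx 0.338$$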


In [4]:
df=pd.read_csv('Datasets/income.csv')
print(df.shape)
df.head()

(32561, 15)


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [15]:
# Count rows per category; strip stray whitespace so the labels match exactly
sex_counts = df['sex'].str.strip().value_counts(dropna=False)
n_males = sex_counts['Male']
n_females = sex_counts['Female']

# Expected counts under a 50/50 split of the 32,561 rows
exp_males = 16280.5
exp_females = 16280.5

# Proportional difference: (observed - expected) / expected
female_diff = (n_females - exp_females) / exp_females
male_diff = (n_males - exp_males) / exp_males

print(f'''Observed females in survey: {n_females}
Expected females: {exp_females}
Proportional difference: {female_diff}''')
print('\n')
print(f'''Observed males in survey: {n_males}
Expected males: {exp_males}
Proportional difference: {male_diff}''')



Observed females in survey: 10771
Expected females: 16280.5
Proportional difference: -0.33841098246368356


Observed males in survey: 21790
Expected males: 16280.5
Proportional difference: 0.33841098246368356


On the last screen, we got -0.338 for the Female difference, and 0.338 for the Male difference. These are great for finding individual differences for each category, but since both values add up to 0, they don't give us an accurate measure of how our overall observed counts deviate from the expected counts.

No matter what numbers you type in for observed Male or Female counts, the differences between observed and expected will always add to 0, because the total observed count for Male and Female items always comes out to 32561. If the observed count of Females is high, the count of Males has to be low to compensate, and vice versa.
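
The cancellation described above can be checked with a minimal sketch. It uses only plain Python and the totals from this lesson; the helper name `proportional_diffs` and the trial Female counts are hypothetical, chosen for illustration.

```python
# Hypothetical check: the two proportional differences cancel for any split.
TOTAL = 32561          # rows in the sample (from the lesson)
EXPECTED = TOTAL / 2   # 16280.5 expected per category under a 50/50 split


def proportional_diffs(n_females):
    """Return (female_diff, male_diff) for a given observed Female count."""
    n_males = TOTAL - n_females  # the two counts must add up to TOTAL
    female_diff = (n_females - EXPECTED) / EXPECTED
    male_diff = (n_males - EXPECTED) / EXPECTED
    return female_diff, male_diff


# Try a few arbitrary Female counts, including the one observed in the data
for n_females in (0, 5000, 10771, 16000, 32561):
    female_diff, male_diff = proportional_diffs(n_females)
    print(n_females, round(female_diff, 3), round(male_diff, 3),
          round(female_diff + male_diff, 12))
```

Whatever Female count is tried, the last column should come out as zero, which is why the raw differences cannot serve as an overall measure of deviation.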


What we really want is a single number that tells us how much all of our observed counts deviate from their expected counterparts. This helps us figure out whether the difference in counts is statistically significant. We can get one step closer by squaring the top term (the numerator) of our difference formula:

$$\frac{(observed - expected)^2}{expected}$$

Squaring the differences ensures that they can't cancel out to zero (a square is never negative), giving us a non-negative number we can use to assess statistical significance.

We can calculate **$X^2$**, the chi-squared value, by adding up these squared differences, each divided by its expected value, across all categories.

In [17]:
# Each category's contribution: (observed - expected)^2 / expected
female_diff = (n_females - exp_females)**2 / exp_females
male_diff = (n_males - exp_males)**2 / exp_males
# The chi-squared value is the sum of the contributions
gender_chisq = female_diff + male_diff

print(f'''Observed females in survey: {n_females}
Expected females: {exp_females}
Chi-squared contribution: {female_diff}''')
print('\n')
print(f'''Observed males in survey: {n_males}
Expected males: {exp_males}
Chi-squared contribution: {male_diff}''')
print('\n')
print(f'Chi-squared value (female_diff + male_diff): {gender_chisq}')

Observed females in survey: 10771
Expected females: 16280.5
Chi-squared contribution: 1864.4753078836645


Observed males in survey: 21790
Expected males: 16280.5
Chi-squared contribution: 1864.4753078836645


Chi-squared value (female_diff + male_diff): 3728.950615767329


In [36]:
def chi_squared(n):
    """Simulate n people under a 50/50 split and return the chi-squared value."""
    rand_list = np.random.rand(n)
    # Values below 0.5 count as Male, the rest as Female
    male_count = int((rand_list < 0.5).sum())
    female_count = n - male_count

    # Expected count per category under a 50/50 split
    # (equals exp_males and exp_females, 16280.5, when n is 32561)
    expected = n / 2
    male_diff = (male_count - expected)**2 / expected
    female_diff = (female_count - expected)**2 / expected

    return male_diff + female_diff

In [37]:
# Build a sampling distribution: 1,000 simulated chi-squared values
chi_squared_values = [chi_squared(32561) for _ in range(1000)]
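
One way to use the simulated values is to compare them with the chi-squared value computed from the data. The block below is a sketch under stated assumptions, not the lesson's own method: it redoes the simulation with the standard library's `random` module instead of NumPy so that it runs on its own, and the names `simulate_chisq`, `OBSERVED_CHISQ`, and `share_as_extreme`, as well as the seed, are hypothetical.

```python
import random

N = 32561                            # sample size from the lesson
OBSERVED_CHISQ = 3728.950615767329   # gender_chisq computed earlier in the lesson


def simulate_chisq(n, rng):
    """Chi-squared value for one simulated sample of n people under a 50/50 split."""
    # Each of n random bits is one person; the number of 1-bits is the Male count
    male_count = bin(rng.getrandbits(n)).count("1")
    female_count = n - male_count
    expected = n / 2
    return ((male_count - expected) ** 2 + (female_count - expected) ** 2) / expected


rng = random.Random(0)  # arbitrary fixed seed so the run is repeatable
simulated = [simulate_chisq(N, rng) for _ in range(1000)]

# Share of simulated values at least as large as the observed one
share_as_extreme = sum(v >= OBSERVED_CHISQ for v in simulated) / len(simulated)
print("largest simulated value:", max(simulated))
print("share >= observed:", share_as_extreme)
```

If few or none of the simulated values reach the observed one, a gap this large between observed and expected counts is unlikely to come from random sampling of a 50/50 population.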
    