### Multi Category Chi-Squared Tests

#### REFERENCES
https://en.wikipedia.org/wiki/Chi-squared_test

#### QUESTIONS

1. How to calculate Expected Values?

#### Multiple Categories

![title](./img/1_mulchi.png)
![title](./img/2_mulchi.png)

#### Calculating Expected Values

![title](./img/3_mulchi.png)
![title](./img/4_mulchi.png)

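Each proportion in the table above is a category count divided by the total number of rows. For example, 7841 of the 32561 people earn >50k, so the expected proportion for that category is:

7841 / 32561 = 0.241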
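
The same division can be applied to every category at once. The sketch below is a minimal illustration: the marginal counts are not stated directly in these notes but are summed from the observed counts used further down (for example, 6662 + 15128 = 21790 males), and the dictionary names are chosen here for illustration only.

```python
# Marginal counts, derived by summing the observed cell counts
# (an assumption of this sketch; the notes show them only as images).
total = 32561
counts = {"Male": 21790, "Female": 10771, ">50k": 7841, "<=50k": 24720}

# Each proportion is the category count divided by the total.
proportions = {name: count / total for name, count in counts.items()}

for name, p in proportions.items():
    print(f"{name}: {p:.3f}")
```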

Using the expected proportions in the table above, calculate the expected values for each of the 4 cells in the table.

* Calculate the expected value for Males who earn >50k, and assign to males_over50k.
* Calculate the expected value for Males who earn <=50k, and assign to males_under50k.
* Calculate the expected value for Females who earn >50k, and assign to females_over50k.
* Calculate the expected value for Females who earn <=50k, and assign to females_under50k.

In [4]:
males_over50k = .67 * .241 * 32561
males_under50k = .67 * .759 * 32561
females_over50k = .33 * .241 * 32561
females_under50k = .33 * .759 * 32561
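
Multiplying rounded proportions by hand introduces a small rounding error. As a hedged alternative, the expected values can be computed from the row and column totals of the observed table, since each expected cell is (row total × column total) / grand total. The array layout below (rows for sex, columns for income) is an assumption of this sketch.

```python
import numpy as np

# Observed counts: rows = (Male, Female), columns = (>50k, <=50k).
observed = np.array([[6662, 15128],
                     [1179, 9592]])

total = observed.sum()
row_totals = observed.sum(axis=1)   # Male, Female
col_totals = observed.sum(axis=0)   # >50k, <=50k

# Expected count per cell under independence.
expected = np.outer(row_totals, col_totals) / total
print(expected.round(1))
```

Because the proportions are not rounded to two or three decimals first, these values differ slightly from the hand-calculated ones above.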

#### Calculating Chi-Squared

![title](./img/mulchi_5.png)
![title](./img/mulchi_6.png)

In [5]:
observed = [6662, 1179, 15128, 9592]
expected = [5257.6, 2589.6, 16558.2, 8155.6]
values = []

for i, obs in enumerate(observed):
    exp = expected[i]
    value = (obs - exp) ** 2 / exp
    values.append(value)

chisq_gender_income = sum(values)
chisq_gender_income

1520.0362248035606
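
Written out, the loop above computes the usual chi-squared statistic, where $O_i$ is the observed count and $E_i$ the expected count for cell $i$:

$$
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
$$

For the first cell (Males who earn >50k), the contribution is:

$$
\frac{(6662 - 5257.6)^2}{5257.6} \approx 375.1
$$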

#### Finding Statistical Significance

![title](./img/mulchi_7.png)

![title](./img/mulchi_8.png)


In [6]:
import numpy as np
from scipy.stats import chisquare

observed = np.array([6662, 1179, 15128, 9592])
expected = np.array([5257.6, 2589.6, 16558.2, 8155.6])

chisq_value, pvalue_gender_income = chisquare(observed, expected)

print('[Chi-Squared Value] - ', chisq_value, '[P-Value] -',pvalue_gender_income)

[Chi-Squared Value] -  1520.0362248035606 [P-Value] - 0.0
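
One detail worth checking is the degrees of freedom. By default, `chisquare` uses the number of cells minus one (3 here), while a test of independence on a 2x2 table has (2 - 1) × (2 - 1) = 1 degree of freedom. The sketch below shows two ways to get the p-value with 1 degree of freedom; whether that is the right choice depends on how the expected values were obtained, so treat it as an illustration rather than a correction.

```python
from scipy.stats import chi2, chisquare

observed = [6662, 1179, 15128, 9592]
expected = [5257.6, 2589.6, 16558.2, 8155.6]

# Default: degrees of freedom = number of cells - 1 = 3.
stat, p_default = chisquare(observed, expected)

# ddof=2 reduces the degrees of freedom to 4 - 1 - 2 = 1,
# matching (rows - 1) * (columns - 1) for a 2x2 table.
_, p_df1 = chisquare(observed, expected, ddof=2)

# The same p-value taken directly from the chi-squared survival function.
p_manual = chi2.sf(stat, df=1)

print(stat, p_default, p_df1, p_manual)
```

With a statistic this large, the p-value is effectively zero in both cases, so the conclusion does not change.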


#### Cross Tables

![title](./img/mulchi_9.png)

In [9]:
# Use the pandas.crosstab function to print out a table comparing the sex column of income to the race column of income.

import pandas as pd

income = pd.read_csv('./datasets/income.csv')
table = pd.crosstab(income["sex"], [income["race"]])

table

| sex | Amer-Indian-Eskimo | Asian-Pac-Islander | Black | Other | White |
|---|---|---|---|---|---|
| Female | 119 | 346 | 1555 | 109 | 8642 |
| Male | 192 | 693 | 1569 | 162 | 19174 |
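
When reading a cross table, the row and column totals are often useful too. As a small sketch, `pandas.crosstab` accepts `margins=True`, which adds an `All` row and column. The `toy` DataFrame below is hypothetical data standing in for `income.csv`, so the example runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for income.csv.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Male"],
    "race": ["White", "Black", "Black", "White", "White", "Other"],
})

# margins=True appends an "All" row and column holding the totals.
table = pd.crosstab(toy["sex"], toy["race"], margins=True)
print(table)
```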


#### Finding Expected Values
![title](./img/mulchi_10.png)

You can also directly pass the result of the **pandas.crosstab** function into the **scipy.stats.chi2_contingency** function, which makes it easier to perform a chi-squared test.

In [18]:
import pandas
from scipy.stats import chi2_contingency

table = pandas.crosstab(income["sex"], [income["race"]])

# The function takes in a cross table of observed counts, and returns the chi-squared value, the p-value, the degrees of freedom, and the expected frequencies.
chisq_value, pvalue_gender_race, df, expected = chi2_contingency(table)

print('[chi-squared value] ',chisq_value, '\n')

print('[p-value] ', pvalue_gender_race,'\n')

print('[degrees of freedom] ', df,'\n')

print('[expected frequencies] ', expected)

[chi-squared value]  454.2671089131088 

[p-value]  5.192061302760456e-97 

[degrees of freedom]  4 

[expected frequencies]  [[  102.87709223   343.69549461  1033.40204539    89.64531188
   9201.3800559 ]
 [  208.12290777   695.30450539  2090.59795461   181.35468812
  18614.6199441 ]]
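
To see what `chi2_contingency` does internally, the sketch below recomputes the expected frequencies, the chi-squared value, and the degrees of freedom by hand from the observed counts printed earlier. The `manual_*` names are introduced here for illustration only.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the sex x race cross table above.
observed = np.array([
    [119, 346, 1555, 109, 8642],    # Female
    [192, 693, 1569, 162, 19174],   # Male
])

chisq_value, pvalue, dof, expected = chi2_contingency(observed)

# Expected cell = row total * column total / grand total.
manual_expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Chi-squared = sum over cells of (observed - expected)^2 / expected.
manual_chisq = ((observed - manual_expected) ** 2 / manual_expected).sum()

# Degrees of freedom = (rows - 1) * (columns - 1).
manual_dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(np.allclose(expected, manual_expected), manual_chisq, manual_dof)
```

Note that for 2x2 tables, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so a manual calculation like this one would not match exactly in that case.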


#### Caveats

Now that we've covered the chi-squared test, you should be able to determine whether the association between two columns of categorical data is statistically significant. There are a few important caveats to using the chi-squared test, though:

* Finding that a result isn't significant doesn't mean that no association between the columns exists. For instance, if the chi-squared test between the sex and race columns returned a p-value of .1, it wouldn't mean that there is no relationship between sex and race. It would only mean that this data doesn't provide statistically significant evidence of one.
* Finding a statistically significant result doesn't tell you anything about the nature of the association. For instance, finding that a chi-squared test between sex and race results in a p-value of .01 doesn't mean that the dataset contains too many Females who are White (or too few). **A statistically significant finding means that some evidence of a relationship between the variables exists but needs to be investigated further**.
* Chi-squared tests can only be applied when **the possibilities within a category are mutually exclusive**, so that each observation is counted in exactly one cell. For instance, the Census counts individuals as either Male or Female, not both.
* **Chi-squared tests are more valid when the numbers in each cell of the cross table are larger**. So if each number is 100, great -- if each number is 1, you may need to gather more data.
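
The last caveat can be checked before running a test. The sketch below computes the smallest expected cell count for a table using `scipy.stats.contingency.expected_freq`. The threshold of 5 is a commonly quoted rule of thumb and is an assumption here, not something stated in these notes; the helper name `smallest_expected_count` is also illustrative.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

MIN_EXPECTED = 5  # assumed rule-of-thumb threshold, not from the source


def smallest_expected_count(table):
    """Return the smallest expected cell count under independence."""
    return expected_freq(np.asarray(table)).min()


# Observed counts from the sex x race cross table.
observed = [[119, 346, 1555, 109, 8642],
            [192, 693, 1569, 162, 19174]]

smallest = smallest_expected_count(observed)
print(smallest, smallest >= MIN_EXPECTED)
```

If the smallest expected count falls below the chosen threshold, gathering more data or merging sparse categories are the usual options to consider.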
