[Link to videos and exercises](https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests)

In [1]:
# a cell to import modules and define helper functions
import math
import scipy.stats as ss

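def calculate_chi_cdf(lower, upper, ddof):
    """Print and return P(lower < X < upper) for a chi-square variable.

    ddof is the degrees of freedom; pass None for an unbounded side.
    """
    # None means "no bound"; ints are accepted as well as floats
    lower_bound = -math.inf if lower is None else float(lower)
    upper_bound = math.inf if upper is None else float(upper)

    # chi2 is continuous, so P(a < X < b) = CDF(b) - CDF(a)
    cdf_lower = ss.chi2.cdf(lower_bound, ddof)
    cdf_upper = ss.chi2.cdf(upper_bound, ddof)
    interval = cdf_upper - cdf_lower

    print("Probability of %.3f < X < %.3f is %.4f" % (lower_bound, upper_bound, interval))

    return interval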

Conditions for a goodness-of-fit test:
* random sampling
* large counts (an expected count of at least 5 in each category)
* independence (the sample is less than 10% of the population, or sampling is done with replacement)
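
The large-counts condition is easy to check in code before running the test. A minimal sketch, in which the helper name `large_counts_ok` and the example numbers are illustrative assumptions rather than part of the original notebook:

```python
def large_counts_ok(expected_counts, minimum=5):
    """Return True when every expected count is at least `minimum`."""
    return all(count >= minimum for count in expected_counts)

# illustrative example: 40 observations, null proportions 50% / 40% / 10%
example_expected = [40 * p for p in (0.5, 0.4, 0.1)]  # 20, 16 and 4 expected outcomes

# the last category expects fewer than 5 outcomes, so the condition fails
print(large_counts_ok(example_expected))
```

When the check fails, a common remedy is to collect more data or merge sparse categories before applying the test.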

# Test statistic and P-value in a goodness-of-fit test
![](img/chi_squared_p1.png)

In [2]:
# expected probability of events, our null hypothesis
P_EXPECTED = [20, 25, 20, 20, 15] # in percent, must sum to 100
# observed outcomes
OBSERVED = [16, 11, 16, 18, 19] # in units

# calculating total number of outcomes 
sample_size = sum(OBSERVED)
# expected number of outcomes
expected = [ sample_size * x/100 for x in P_EXPECTED ]
# degrees of freedom is total number of buckets - 1
ddof = len(OBSERVED) - 1

# chi-squared statistic: sum of (observed - expected)^2 / expected over all categories
chi_squared = sum((o-e)**2/e for o,e in zip(OBSERVED, expected))

p = calculate_chi_cdf(chi_squared, None, ddof)

Probability of 8.383 < X < inf is 0.0785
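
As a cross-check, SciPy's `scipy.stats.chisquare` computes the same statistic and upper-tail p-value in a single call. The sketch below repeats the data from the cell above so that it runs on its own; the lowercase variable names are illustrative.

```python
import scipy.stats as ss

p_expected = [20, 25, 20, 20, 15]  # null-hypothesis percentages
observed = [16, 11, 16, 18, 19]    # observed counts

n = sum(observed)
expected = [n * pct / 100 for pct in p_expected]

# chisquare uses len(observed) - 1 degrees of freedom by default
statistic, p_value = ss.chisquare(f_obs=observed, f_exp=expected)

# the same upper-tail probability via the survival function (1 - CDF)
p_from_sf = ss.chi2.sf(statistic, len(observed) - 1)

print("chi-squared = %.3f, p-value = %.4f" % (statistic, p_value))
```

Both `p_value` and `p_from_sf` should agree with the probability printed above, since the p-value of the test is the area to the right of the statistic.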


# Test statistic and P-value in chi-square tests with two-way tables
![](img/chi_squared_p2.png)

In [3]:
# expected outcomes, our null hypothesis
EXPECTED = [35, 21, 14, 15, 9, 6]
# observed outcomes
OBSERVED = [30, 25, 15, 20, 5, 5]
# degrees of freedom: (number of rows - 1) * (number of columns - 1)
DDOF = 2

# chi-squared statistic: sum of (observed - expected)^2 / expected over all cells
chi_squared = sum((o-e)**2/e for o,e in zip(OBSERVED, EXPECTED))

p = calculate_chi_cdf(chi_squared, None, DDOF)

Probability of 5.159 < X < inf is 0.0758
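
In a two-way table, the expected count of each cell comes from the margins: row total × column total / grand total. The sketch below assumes that the six observed counts form a table with 2 rows and 3 columns, read row by row. That layout is an assumption, but it reproduces the `EXPECTED` values and the 2 degrees of freedom used above. The names `table`, `expected_table` and `chi_squared_check` are illustrative.

```python
# assumed layout: 2 rows x 3 columns, read row by row
table = [[30, 25, 15],
         [20, 5, 5]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# expected count under the null: row total * column total / grand total
expected_table = [[r * c / grand_total for c in col_totals] for r in row_totals]

# degrees of freedom: (rows - 1) * (columns - 1)
df = (len(table) - 1) * (len(table[0]) - 1)

# sum of (observed - expected)^2 / expected over every cell of the table
chi_squared_check = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(table, expected_table)
    for o, e in zip(obs_row, exp_row)
)

print(expected_table, df, round(chi_squared_check, 3))
```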


### Chi-square test use cases:

Separate, independent samples or groups (**chi-square test for homogeneity**)

A chi-square test can help us when we want to know whether different populations or groups are alike with regard to the distribution of a variable. Our hypotheses would look something like this:

- H0: The distribution of a variable is the same in each population or group
- Ha: The distribution of a variable differs between some of the populations or groups

One sample or group (**chi-square test of association/independence**)

A chi-square test can help us see whether individuals from a sample who belong to a certain category are more likely than others in the sample to also belong to another category. Our hypotheses would look something like this:

- H0: There is no association between the two variables (they are independent)
- Ha: There is an association between the two variables (they are not independent)
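
Both use cases are computed the same way from a two-way table of counts; only the sampling design and the wording of the hypotheses differ. A minimal sketch using `scipy.stats.chi2_contingency`, with a hypothetical table of counts and an assumed significance level of 0.05:

```python
import scipy.stats as ss

# hypothetical counts: rows are groups (or one variable), columns are categories
counts = [[30, 25, 15],
          [20, 5, 5]]
ALPHA = 0.05  # assumed significance level

# returns the statistic, the p-value, the degrees of freedom and the expected counts
statistic, p_value, dof, expected = ss.chi2_contingency(counts, correction=False)

if p_value < ALPHA:
    print("Reject H0 (p = %.4f)" % p_value)
else:
    print("Fail to reject H0 (p = %.4f)" % p_value)
```

The `correction=False` argument switches off Yates' continuity correction, which SciPy would otherwise apply only to tables with one degree of freedom; it is set here to make the plain chi-square statistic explicit.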