## Proportion tests

A **proportion test** checks whether the share of a population that has a specific trait matches a hypothesized value. We assume that the number of individuals with the trait follows a binomial distribution, where each individual has the trait with probability $p_0$ under the null hypothesis, and we measure a proportion $\hat{p}$ in the sample.

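It also relies on the CLT, with the following tweaks:
+ the difference being tested is $\hat{p} - p_0$
+ the population variance equals $p_0 \times (1 - p_0)$ according to the binomial distribution

For a sample of size $n$, the test statistic is $z = \frac{\hat{p} - p_0}{\sqrt{p_0 \times (1 - p_0) / n}}$.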


In [None]:
import numpy as np
from scipy import stats

# example: test if most customers of a website are teenagers. H0: teen_proportion <= 0.5
# sample_teen_proportion = 0.58, sample_size = 400
# H0_teen_proportion = 0.5, H0_variance = H0_teen_proportion * (1 - H0_teen_proportion)
# standard error of the sample proportion = sqrt(H0_variance / sample_size)

z_score = (0.58 - 0.5) / (np.sqrt(0.5 * (1 - 0.5)) / np.sqrt(400))

# the sample_teen_proportion under H0 is above the 99.9th percentile, inside our cutoff area: we reject H0
print('p-value: {:.1%} < 5% cutoff'.format(stats.norm.sf(x=z_score))) # right-tail so survival function = 1 - cdf
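
The calculation above generalizes to any sample proportion. A minimal sketch using only the Python standard library follows; the function name `proportion_z_test` and its parameters are hypothetical (they are not part of SciPy), and the sample is assumed to be large enough for the normal approximation to hold.

```python
import math
from statistics import NormalDist


def proportion_z_test(p_hat, p0, n):
    """Right-tailed one-sample z-test for a proportion (hypothetical helper)."""
    # standard error of the sample proportion under H0
    std_error = math.sqrt(p0 * (1 - p0) / n)
    z_score = (p_hat - p0) / std_error
    # right-tail p-value: P(Z >= z_score) = 1 - cdf
    p_value = 1 - NormalDist().cdf(z_score)
    return z_score, p_value


# same figures as the teenager example: 58% of 400 visitors, H0 proportion of 50%
z_score, p_value = proportion_z_test(p_hat=0.58, p0=0.5, n=400)
print('z-score: {:.2f} - p-value: {:.2%}'.format(z_score, p_value))
```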


___

# Chi-Square Analysis 

The variable to analyze is **nominal**: you want to compare the **frequencies** among categories.
+ **Chi-Square** test (the test statistic follows a **Chi-Square distribution**).
    + One-way when there is only one categorical variable (ex: distribution of patient discharges per day of the week).
    + Two-way when there are two: test of independence or homogeneity (ex: comparing proportions among population categories).
    + The only alternative hypothesis is that the different categories have different frequencies.
+ **Fisher's exact test** if the sample size is small.

The [Chi-Square test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) measures how far the **observed frequencies of events** are from the expected frequencies, and how likely such a gap is under the null hypothesis. For example: if we get heads ten times in a row, how likely is that if we assume the coin is fair?

$\chi^2 = \sum (O - E)^2 / E$, summed over all possible outcomes of an experiment, where $O$ is the observed count and $E$ the expected count of each outcome.
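
As a rough illustration of the formula, the sketch below computes the statistic by hand for the coin example (ten heads and zero tails observed, five of each expected if the coin is fair) and compares it with `scipy.stats.chisquare`. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

observed = np.array([10, 0])  # ten heads, zero tails
expected = np.array([5, 5])   # a fair coin: five of each on average

# sum of (O - E)^2 / E over the two possible outcomes
chi2_manual = np.sum((observed - expected) ** 2 / expected)

# two possible outcomes, so 2 - 1 = 1 degree of freedom
p_manual = stats.chi2.sf(chi2_manual, df=1)

# same test with SciPy
chi2_scipy, p_scipy = stats.chisquare(observed, f_exp=expected)

print('manual: chi2 = {:.2f} - p-value: {:.4f}'.format(chi2_manual, p_manual))
print('scipy:  chi2 = {:.2f} - p-value: {:.4f}'.format(chi2_scipy, p_scipy))
```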


## Chi-Square Distribution

The Chi-Square Distribution depends on its degrees of freedom. For a one-way test, they equal the number of possible outcomes minus one.
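
To make the role of the degrees of freedom concrete, the sketch below lists the cutoff that a test statistic must exceed to reject the null hypothesis. It assumes the same 5% significance level used in the examples of this section; the names `alpha` and `critical_values` are illustrative.

```python
from scipy import stats

alpha = 0.05  # significance level, assumed to match the 5% cutoff used in this section

# critical value = 95th percentile of the Chi-Square distribution
critical_values = {df: stats.chi2.ppf(1 - alpha, df) for df in range(1, 7)}

for df, cutoff in critical_values.items():
    print('df = {}: reject H0 if chi2 > {:.2f}'.format(df, cutoff))
```

The cutoff grows with the degrees of freedom, since more categories add more terms to the sum.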


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# example of Chi-Square distributions
fig, ax = plt.subplots(1, 1)

# degrees of freedom
for df in [3, 4, 5, 6]:

    # 100 x points between the 1st and 99th percentile of the Chi-Square distribution & corresponding densities
    x = np.linspace(stats.chi2.ppf(0.01, df), stats.chi2.ppf(0.99, df), 100)
    y = stats.chi2.pdf(x, df)

    ax.plot(x, y, lw=2, alpha=0.6, label='df = {}'.format(df))

ax.legend()
plt.show()


## One-way Chi-Square

In [None]:
# example: we get 12 heads out of 18 coin tosses. Is the coin fair?
stats.chisquare([6, 12], f_exp=[9,9]) # p-value is > 5%: we fail to reject the null hypothesis


In [None]:
# a company has six servers. Do they fail at the same rate?
# we have 240 failures. If the null hypothesis is true, the probability of failure should be the same for all the six servers: 1/6 or 40 failures per server.

obs_failures = [46,36,52,26,42,38]
mean_failure = np.mean(obs_failures)

stats.chisquare(obs_failures, f_exp=mean_failure) # p-value is > 5%: we fail to reject the null hypothesis


In [None]:
# Are most customers of a website teenagers?
# we have a sample of 400 visitors, 58% of which are teenagers. Under the null hypothesis, the proportion of teenagers is 50%.
# the Chi-Square test is two-sided: it tells us that the proportion differs from 50%, not in which direction.

obs_values = [232, 400 - 232]
exp_values = [200, 200]

stats.chisquare(obs_values, f_exp=exp_values) # p-value is < 5%: we reject the null hypothesis
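
With only two categories, the one-way Chi-Square test is equivalent to a two-tailed z-test of the proportion. The following sketch checks this on the figures of the example above (232 teenagers out of 400 visitors, $p_0 = 0.5$); the variable names are illustrative.

```python
import numpy as np
from scipy import stats

n, n_teens, p0 = 400, 232, 0.5

# z-score of the sample proportion under H0
z_score = (n_teens / n - p0) / np.sqrt(p0 * (1 - p0) / n)

# one-way Chi-Square test on the counts [teenagers, others]
chi2, p_chi2 = stats.chisquare([n_teens, n - n_teens], f_exp=[n * p0, n * (1 - p0)])

# the Chi-Square statistic is the squared z-score...
print('chi2: {:.2f} - z squared: {:.2f}'.format(chi2, z_score ** 2))

# ...and its p-value is twice the right-tail p-value of the z-test
p_two_tailed = 2 * stats.norm.sf(z_score)
print('chi2 p-value: {:.4f} - two-tailed z p-value: {:.4f}'.format(p_chi2, p_two_tailed))
```

This is why the p-value here is about twice the one obtained with the right-tailed proportion test.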


##  Two-way Chi-Square

Suppose there is a city of 1,000,000 residents with four neighborhoods: $A$, $B$, $C$, and $D$. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". 

|                | $A$   | $B$   | $C$   | $D$   | **Total** |
|----------------|-------|-------|-------|-------|-----------|
|White Collar    |  90   |  60   | 104   |  95   |  **349**  |
|Blue Collar     |  30   |  50   |  51   |  20   |  **151**  |
|No Collar       |  30   |  40   |  45   |  35   |  **150**  |
|**Total**       |**150**|**150**|**200**|**150**|  **650**  |


The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. By the assumption of independence under the hypothesis we should "expect" the number of white-collar workers in neighborhood $A$ to be:

$WC_A = 150\times\frac{349}{650} \approx 80.54$

So the contribution of this cell to $\chi^2$ is $\frac{(90 - 80.54)^2}{80.54} \approx 1.11$

The sum of these quantities over all of the cells is the test statistic; in this case, $\approx 24.6$. Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is:

$(\text{number of rows}-1)(\text{number of columns}-1) = (3-1)(4-1) = 6$

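If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence. In this example, the statistic is well above the 5% critical value of $\approx 12.59$ for 6 degrees of freedom, so we conclude that neighborhood and occupation are linked.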
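
The expected counts and the test statistic can be reproduced by hand from the table margins. A minimal sketch with NumPy, using the counts from the table above (the variable names are illustrative):

```python
import numpy as np
from scipy import stats

# rows: white, blue, no collar - columns: neighborhoods A, B, C, D
observed = np.array([
    [90, 60, 104, 95],
    [30, 50,  51, 20],
    [30, 40,  45, 35],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# under independence, expected count = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / observed.sum()

# sum of (O - E)^2 / E over all the cells
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2, dof)

print('expected white collar in A: {:.2f}'.format(expected[0, 0]))
print('chi2: {:.1f} - dof: {} - p-value: {:.4f}'.format(chi2, dof, p_value))
```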

In [None]:
df = pd.DataFrame({
    'A': [ 90,30,30], 
    'B': [ 60,50,40], 
    'C': [104,51,45], 
    'D': [ 95,20,35]}, index=['White Collar', 'Blue Collar', 'No Collar']
)
display(df)

chi2, p, _, _ = stats.chi2_contingency(df)
print('Chi-Square test: {:.3} - p-value: {:.3}'.format(chi2, p)) # we reject the null hypothesis of independence


Another example: does the proportion of kids taking swimming lessons depend on their ethnicity? The sample includes:
+ 247 Black kids. 36.8% take swimming lessons
+ 308 Hispanic kids. 38.9% take swimming lessons

The null hypothesis is that the proportion of kids taking swimming lessons does not depend on their ethnicity.

In [None]:
# contingency matrix
df = pd.DataFrame({'black': [91, 156], 'hisp': [120, 188]}, index=['Swim', 'No Swim'])
display(df)

chi2, p, _, _ = stats.chi2_contingency(df)
print('Chi-Square test: {:.3} - p-value: {:.3}'.format(chi2, p)) # we fail to reject the null hypothesis of independence
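
For a 2x2 table such as this one, `scipy.stats.chi2_contingency` applies Yates' continuity correction by default (`correction=True`). The sketch below runs the test with and without the correction on the same counts, to check whether the conclusion depends on it; the variable names are illustrative.

```python
import pandas as pd
from scipy import stats

table = pd.DataFrame({'black': [91, 156], 'hisp': [120, 188]}, index=['Swim', 'No Swim'])

# default: Yates' continuity correction is applied to 2x2 tables
chi2_corr, p_corr, dof, _ = stats.chi2_contingency(table)

# same test without the correction
chi2_raw, p_raw, _, _ = stats.chi2_contingency(table, correction=False)

print('with correction:    chi2 = {:.3} - p-value: {:.3}'.format(chi2_corr, p_corr))
print('without correction: chi2 = {:.3} - p-value: {:.3}'.format(chi2_raw, p_raw))
```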
