**Table of contents**<a id='toc0_'></a>    
- [Hypothesis Testing - Part II](#toc1_)    
  - [Goodness of fit test](#toc1_1_)    
  - [Chi-squared test - observed vs expected distribution](#toc1_2_)    
      - [💡 Do it yourself](#toc1_2_1_1_)    
  - [Chi-squared test - correlation between categoricals](#toc1_3_)    
    - [G-test](#toc1_3_1_)    
- [References](#toc2_)    
- [Acknowledgements](#toc3_)    
- [Extra Reading](#toc4_)    


# <a id='toc1_'></a>[Hypothesis Testing - Part II](#toc0_)

![](https://media0.giphy.com/media/26gR0YFZxWbnUPtMA/giphy.gif?cid=ecf05e47uz8y2qkbtctaa2w4h7kmmfnqodcfo3k9g9f1z93a&ep=v1_gifs_search&rid=giphy.gif&ct=g)

## <a id='toc1_1_'></a>[Goodness of fit test](#toc0_)

> The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically **summarize the discrepancy between observed values and the values expected under the model in question**.   

> Such measures can be used in statistical hypothesis testing, for example:
> - to test for normality of residuals
> - to test whether two samples are drawn from identical distributions (Kolmogorov–Smirnov test)
> - to test whether outcome frequencies follow a specified distribution (see Pearson's chi-square test). [$^{[1]}$](https://en.wikipedia.org/wiki/Goodness_of_fit)

## <a id='toc1_2_'></a>[Chi-squared test - observed vs expected distribution](#toc0_)

In [None]:
import scipy.stats as stats
import numpy as np

**Null hypothesis ($H_0$):** Beverage distribution for our store is the same as the one in all stores.

**Alternative hypothesis ($H_1$):** Beverage distribution for our store is different from the one in all stores.

In [None]:
E = np.array([120, 90, 90])
O = np.array([120, 80, 100])

In [None]:
# Compute statistic for chi2
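
One way to fill in this cell is sketched below. It assumes Pearson's statistic, $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, and repeats the `E` and `O` arrays from the previous cell so that it runs on its own. The name `chi2_stat` is only illustrative.

```python
import numpy as np

# Expected and observed counts, copied from the cell above
E = np.array([120, 90, 90])
O = np.array([120, 80, 100])

# Pearson's chi-squared statistic: squared deviations scaled by the expected counts
chi2_stat = np.sum((O - E) ** 2 / E)

print(f"chi2 statistic = {chi2_stat:.4f}")  # chi2 statistic = 2.2222
```

Only the second and third categories deviate from the expected counts, each contributing $10^2 / 90$ to the total.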

Look up the critical chi2 value for the given significance level and the appropriate number of degrees of freedom (number of categories - 1):

In [None]:
# Calculate p-value
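
A possible sketch for the critical value and the p-value, assuming a 5% significance level and the chi-squared distribution object `stats.chi2` from SciPy. The variable names (`alpha`, `dof`, `critical_value`, `p_value`) are illustrative, and the statistic is recomputed so the block is self-contained.

```python
import numpy as np
import scipy.stats as stats

E = np.array([120, 90, 90])
O = np.array([120, 80, 100])
chi2_stat = np.sum((O - E) ** 2 / E)

alpha = 0.05      # significance level (assumed here to be 5%)
dof = len(O) - 1  # degrees of freedom: number of categories - 1

# Critical value: H0 is rejected only if the statistic exceeds this threshold
critical_value = stats.chi2.ppf(1 - alpha, df=dof)

# p-value: probability of a statistic at least this large if H0 is true
p_value = stats.chi2.sf(chi2_stat, df=dof)

print(f"critical value = {critical_value:.3f}, p-value = {p_value:.4f}")
# critical value = 5.991, p-value = 0.3292
```

`sf` (the survival function) is used instead of `1 - cdf` because it stays accurate for very small p-values.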

Do we reject the null hypothesis at a significance level of 5%?

In [None]:
# Use the stats library instead
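
The same test can be run in a single call. The sketch below assumes `scipy.stats.chisquare`, which takes the observed and expected counts and uses `len(f_obs) - 1` degrees of freedom by default. The `alpha` and `reject_h0` names are illustrative.

```python
import numpy as np
import scipy.stats as stats

E = np.array([120, 90, 90])
O = np.array([120, 80, 100])

# chisquare returns the Pearson statistic and its p-value
chi2_stat, p_value = stats.chisquare(f_obs=O, f_exp=E)

alpha = 0.05
reject_h0 = p_value < alpha

print(f"chi2 = {chi2_stat:.4f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With these counts the p-value is well above 0.05, so the sample gives no evidence that our store's beverage distribution differs from the one in all stores. Note that `f_obs` and `f_exp` must sum to the same total.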

#### <a id='toc1_2_1_1_'></a>[💡 Do it yourself](#toc0_)

Now repeat this exercise for the same null hypothesis and the following **observed** sample (it totals 600 drinks, so rescale the expected counts to the same total before comparing):

> 180 cokes, 210 pepsis and 210 waters

This time, use a significance level of 1%. What do you see?

In [None]:
# Your code here

## <a id='toc1_3_'></a>[Chi-squared test - correlation between categoricals](#toc0_)

> A chi-squared test (also χ2 test) is a statistical hypothesis test used in the analysis of *contingency tables* when the sample sizes are large.   

> In simpler terms, this test is primarily used to examine whether two categorical variables (two dimensions of the contingency table) are independent in influencing the test statistic (values within the table).[$^{[2]}$](https://en.wikipedia.org/wiki/Chi-squared_test)

In [None]:
cars_table=[[256, 74], 
            [41, 42], 
            [66, 34]]

**Null hypothesis ($H_0$):** The car brand does **not** influence the audience perception.

**Alternative hypothesis ($H_1$):** The car brand does influence the audience perception.

In [None]:
# Contingency tables and independence of effects
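
A minimal sketch of the independence test, assuming `scipy.stats.chi2_contingency` and assuming that each row of `cars_table` is one car brand and each column one perception category (the notebook does not label them). The table is repeated so the block runs on its own.

```python
import scipy.stats as stats

# Assumed layout: rows = car brands, columns = perception categories
cars_table = [[256, 74],
              [41, 42],
              [66, 34]]

# Returns the statistic, the p-value, the degrees of freedom and the
# table of counts expected if rows and columns were independent
chi2_stat, p_value, dof, expected = stats.chi2_contingency(cars_table)

alpha = 0.05
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p-value = {p_value:.2e}")
print("expected counts under H0:")
print(expected.round(1))
print("reject H0:", p_value < alpha)
```

The degrees of freedom are (rows - 1) × (columns - 1) = 2. Every expected count is well above 5, so the rule of thumb quoted in the note is satisfied for this table.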

What can we say about our car manufacturers?

**Note:**
> An often quoted guideline for the validity of this calculation is that the test should be used only if the observed and expected frequencies in each cell are at least 5. [$^{[3]}$](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html#scipy.stats.chi2_contingency)

### <a id='toc1_3_1_'></a>[G-test](#toc0_)

The G-test, a likelihood-ratio test, is increasingly recommended over Pearson's chi-squared test because its statistic is better approximated by the theoretical chi-squared distribution. For reasonably large samples, however, the two tests usually lead to the same conclusion.

To run a G-test with `scipy.stats`, pass `lambda_="log-likelihood"` to `chi2_contingency`.
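
A short sketch of that change, reusing the car table from the previous section and comparing the G-test with the default Pearson statistic. The variable names are illustrative.

```python
import scipy.stats as stats

cars_table = [[256, 74],
              [41, 42],
              [66, 34]]

# Default call: Pearson's chi-squared statistic
chi2_stat, chi2_p, dof, _ = stats.chi2_contingency(cars_table)

# Same call with the log-likelihood ratio (G-test) statistic
g_stat, g_p, g_dof, _ = stats.chi2_contingency(cars_table, lambda_="log-likelihood")

print(f"Pearson: statistic = {chi2_stat:.2f}, p-value = {chi2_p:.2e}")
print(f"G-test:  statistic = {g_stat:.2f}, p-value = {g_p:.2e}")
```

Both statistics are compared against the same chi-squared distribution with the same degrees of freedom, so only the statistic itself changes.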

# <a id='toc2_'></a>[References](#toc0_)

[1] [Goodness of fit, Wikipedia](https://en.wikipedia.org/wiki/Goodness_of_fit)  
[2] [Chi-squared test, Wikipedia](https://en.wikipedia.org/wiki/Chi-squared_test)  
[3] [`chi2_contingency`, SciPy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html#scipy.stats.chi2_contingency)

# <a id='toc3_'></a>[Acknowledgements](#toc0_)

Thank you, David Henriques, for your awesome lesson structure and content.

# <a id='toc4_'></a>[Extra Reading](#toc0_)

- [Using Applied Statistics to Expand Human Knowledge - Statistics by Jim](https://statisticsbyjim.com/basics/applied-statistics-expand-knowledge/)
- [Fisher's Exact Test (for small samples & 2x2 contingency tables) - StatQuest (5 min)](https://www.youtube.com/watch?v=udyAvvaMjfM)