# MODULES


In [5]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

from sklearn import datasets
import pandas as pd
from statsmodels.formula.api import ols
import statsmodels.api as sm

from scipy.stats import chi2
from scipy.stats import chisquare


## Independent two-samples t-test

The null hypothesis $H_0$ states that the population means of the two samples are equal: $\mu_1 = \mu_2$. The test statistic is $t=(\bar{X}_1-\bar{X}_2) / s$, where $s$ is an estimate of the standard error of the difference between the sample means; its formula depends on whether the two samples have equal size and/or variance.

Under the null hypothesis, the test statistic follows a $t$-distribution (its degrees of freedom depend on the assumptions of equal or unequal variance). 

Note: Student's t-test assumes equality of variances; Welch's unequal-variances t-test (Welch's t-test) makes no such assumption. The literature [does not recommend](https://onlinelibrary.wiley.com/doi/abs/10.1348/000711004849222) testing equality of variances before choosing between the two tests. In general, Welch's t-test should be [preferred](https://www.rips-irsp.com/articles/10.5334/irsp.82/).


In [None]:
# comparing the production of two plants using the
# number of cars each produced over the same 10 days
cars_plant1 = [1184, 1203, 1219, 1238, 1243, 1204, 1269, 1256, 1156, 1248]
cars_plant2 = [1136, 1178, 1212, 1193, 1226, 1154, 1230, 1222, 1161, 1148]

# two-sided Welch test: t > 0 and p/2 < 0.05, so plant1 produces significantly more than plant2
stats.ttest_ind(cars_plant1, cars_plant2, equal_var=False)
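
The statistic and the one-sided reading of the p-value can be checked by hand. The sketch below assumes the Welch formulation (standard error $\sqrt{s_1^2/n_1 + s_2^2/n_2}$ with Welch-Satterthwaite degrees of freedom) and a SciPy version that accepts the `alternative` argument of `ttest_ind` (1.6 or later). Names such as `t_manual` and `df_welch` are illustrative only.

```python
import numpy as np
from scipy import stats

cars_plant1 = np.array([1184, 1203, 1219, 1238, 1243, 1204, 1269, 1256, 1156, 1248])
cars_plant2 = np.array([1136, 1178, 1212, 1193, 1226, 1154, 1230, 1222, 1161, 1148])

n1, n2 = len(cars_plant1), len(cars_plant2)

# squared standard error of each sample mean (unbiased variance / sample size)
se1_sq = cars_plant1.var(ddof=1) / n1
se2_sq = cars_plant2.var(ddof=1) / n2

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (cars_plant1.mean() - cars_plant2.mean()) / np.sqrt(se1_sq + se2_sq)
df_welch = (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))
p_two_sided = 2 * stats.t.sf(abs(t_manual), df_welch)

# same test through SciPy, two-sided and one-sided (H1: mean of plant1 > mean of plant2)
two_sided = stats.ttest_ind(cars_plant1, cars_plant2, equal_var=False)
one_sided = stats.ttest_ind(cars_plant1, cars_plant2, equal_var=False, alternative='greater')

print(t_manual, two_sided.statistic)
print(p_two_sided, two_sided.pvalue, one_sided.pvalue)
```

Because the observed $t$ is positive, the one-sided p-value is half the two-sided one, which is where the `p/2` rule comes from.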


___

# ANOVA

The variable to analyze is **continuous**: you want to compare its **mean** across categories. ANalysis Of VAriance ([ANOVA](https://en.wikipedia.org/wiki/Analysis_of_variance)) tests are used when there are more than two samples.

+ one-way ANOVA null hypothesis: the means of three or more populations are equal _(see example [here](https://en.wikipedia.org/wiki/One-way_analysis_of_variance#Example))._
+ repeated measures ANOVA null hypothesis: the mean difference between repeated measurements taken on the same subjects is zero.

## Assumptions

ANOVA is mathematically a special case of the [general linear model (GLM)](https://pythonfordatascience.org/anova-python/), in which the levels of every categorical variable are one-hot encoded. In particular, factorial ANOVA includes interaction terms between categorical factors and should therefore be interpreted like a traditional linear model.

Since ANOVA is a linear model, its assumptions are the same as for linear regression:

+ Normality of the residuals
+ Homogeneity of variance across groups
+ Independent observations

_Note: If group sizes are equal, the F-statistic is fairly robust to moderate violations of normality and homogeneity of variance._
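
A minimal sketch of how the first two assumptions could be checked with SciPy, using the invoice-payment data from the one-way ANOVA example in this notebook. The choice of Shapiro-Wilk (on the residuals) and Levene's test is an assumption made here for illustration, not something this notebook prescribes; with only five values per group, both tests have little power.

```python
import numpy as np
from scipy import stats

groups = {
    '0%': np.array([14, 11, 18, 16, 21]),
    '1%': np.array([21, 15, 23, 10, 16]),
    '2%': np.array([11, 16, 9, 14, 10]),
}

# residuals of a one-way ANOVA: each value minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups.values()])

# normality of the residuals (H0: residuals are normally distributed)
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# homogeneity of variance (H0: all groups have equal variance);
# SciPy's default center='median' is the Brown-Forsythe variant
levene_stat, levene_p = stats.levene(*groups.values())

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Levene p       = {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption is questionable.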

## F-distribution

In an ANOVA, the test statistic follows an **F-distribution** under the null hypothesis.

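The [F-distribution](https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Continous-Random-Variables/F-Distribution/index.html) has two degrees-of-freedom parameters. In a one-way ANOVA with $k$ groups and $N$ observations in total, the numerator has $k - 1$ degrees of freedom (number of groups minus one) and the denominator has $N - k$ (total number of observations minus the number of groups).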


In [None]:
# degrees of freedom
dfn, dfd = 30, 8

# 100 x points between the 1st and 99th percentiles of the F-distribution, and the corresponding density values
x = np.linspace(stats.f.ppf(0.01, dfn, dfd), stats.f.ppf(0.99, dfn, dfd), 100)
y = stats.f.pdf(x, dfn, dfd)

# plot
fig, ax = plt.subplots(1, 1)
ax.plot(x, y, 'r-', lw=2, alpha=0.6, label='f pdf')
ax.legend()
plt.show()
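
As a sketch of how the two degrees of freedom are obtained in practice, assume a one-way ANOVA with `k` groups and `n_total` observations in total (hypothetical names; the values below match a design with three groups of five observations). The critical value is the point beyond which the F-statistic would be significant at level `alpha`.

```python
from scipy import stats

k = 3          # number of groups
n_total = 15   # total number of observations across all groups
alpha = 0.05   # significance level

dfn = k - 1        # numerator degrees of freedom (between groups)
dfd = n_total - k  # denominator degrees of freedom (within groups)

# critical value: 95th percentile of the F(dfn, dfd) distribution
f_crit = stats.f.ppf(1 - alpha, dfn, dfd)
print(dfn, dfd, round(f_crit, 3))
```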


## F-value

ANOVA compares two types of variance:

+ between groups: how far group means stray from the total mean
+ within groups: how far individual values stray from their respective group mean

The **F-value** is the variance between groups divided by the variance within groups, where:

+ the variance between groups equals the sum of squares group divided by the degrees of freedom (groups)
+ the variance within groups equals the sum of squares errors divided by the degrees of freedom (error)

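If the **variance between groups** (numerator) is **small** compared to the **variance within groups** (denominator), the F-value is small and the data are consistent with the groups belonging to the **same population**. A large F-value is evidence against the null hypothesis.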

## One-way ANOVA

In [None]:
# example: number of days each customer took to pay an invoice, grouped by the discount (in %) offered for early payment
discount_0perc = [14, 11, 18, 16, 21]
discount_1perc = [21, 15, 23, 10, 16]
discount_2perc = [11, 16,  9, 14, 10]

stats.f_oneway(discount_0perc, discount_1perc, discount_2perc) # p-value > 0.05, the discounts make no significant difference
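
The F-value returned by `stats.f_oneway` can be reproduced from the definition given in the F-value section. The following is a sketch on the same data; intermediate names such as `ss_between` and `ms_within` are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([14, 11, 18, 16, 21]),  # 0% discount
    np.array([21, 15, 23, 10, 16]),  # 1% discount
    np.array([11, 16, 9, 14, 10]),   # 2% discount
]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# between groups: how far each group mean strays from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# within groups: how far each value strays from its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)       # sum of squares (groups) / df (groups)
ms_within = ss_within / (n_total - k)   # sum of squares (error) / df (error)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

result = stats.f_oneway(*groups)
print(f_manual, result.statistic)
print(p_manual, result.pvalue)
```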


## Two-way ANOVA

The **two-way ANOVA** is an extension of the one-way ANOVA that tests **two independent variables** at the same time, taking interactions between these variables into account.

_Note: this can be further generalized to N-way ANOVA._


In [None]:
# data - same as before, but checking if the amount has an impact
df = pd.DataFrame({
    'discount': ['2p','2p','2p','2p','2p','1p','1p','1p','1p','1p','0p','0p','0p','0p','0p'],
    'amount': [50,100,150,200,250,50,100,150,200,250,50,100,150,200,250],
    'days': [16,14,11,10,9,23,21,16,15,10,21,16,18,14,11]
})

# fit without interaction factor
model = ols('days ~ C(discount) + C(amount)', df).fit()

# discount has now become significant
sm.stats.anova_lm(model, typ=2)

# model.summary()


In [None]:
# data - three fertilizers, warm vs cold, size of plant
df = pd.DataFrame({
    'fertilizer': ['A','A','A','A','A','A','B','B','B','B','B','B','C','C','C','C','C','C'],
    'temperature': ['W','W','W','C','C','C','W','W','W','C','C','C','W','W','W','C','C','C'],
    'size': [13,14,12,16,18,17,21,19,17,14,11,14,18,15,15,15,13,8]
})

# fit with interaction factor
model = ols('size ~ C(fertilizer) * C(temperature)', df).fit()

# the C(fertilizer):C(temperature) row tests the interaction, which is significant here (p < 0.01)
sm.stats.anova_lm(model, typ=2)

# model.summary()
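
One way to see what an interaction means is to look at the mean plant size in each fertilizer/temperature cell. This is a sketch using `pandas.DataFrame.pivot_table` on the same data; the variable name `cell_means` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'fertilizer': ['A','A','A','A','A','A','B','B','B','B','B','B','C','C','C','C','C','C'],
    'temperature': ['W','W','W','C','C','C','W','W','W','C','C','C','W','W','W','C','C','C'],
    'size': [13,14,12,16,18,17,21,19,17,14,11,14,18,15,15,15,13,8]
})

# mean plant size for every fertilizer (rows) x temperature (columns) combination
cell_means = df.pivot_table(index='fertilizer', columns='temperature',
                            values='size', aggfunc='mean')
print(cell_means)
```

Fertilizer A does better in the cold, while B and C do better in the warm: the effect of temperature depends on the fertilizer, which is what the interaction term captures.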


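Note on `model.summary()`:
+ Durbin-Watson detects autocorrelation in the residuals (values close to 2 indicate none).
+ Jarque-Bera tests the normality of the residuals.
+ Omnibus tests the normality of the residuals from their skewness and kurtosis (homogeneity of variance has to be checked separately, e.g. with Levene's test).
+ Condition Number assesses multicollinearity (as a rule of thumb, it should be < 20).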
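
These diagnostics can also be computed individually rather than read off the summary table. The sketch below assumes the helper functions in `statsmodels.stats.stattools` and the `condition_number` attribute of a fitted OLS result, applied to the fertilizer model; with only 18 observations the normality tests are approximate and SciPy may emit a small-sample warning.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.stattools import durbin_watson, jarque_bera, omni_normtest

df = pd.DataFrame({
    'fertilizer': ['A','A','A','A','A','A','B','B','B','B','B','B','C','C','C','C','C','C'],
    'temperature': ['W','W','W','C','C','C','W','W','W','C','C','C','W','W','W','C','C','C'],
    'size': [13,14,12,16,18,17,21,19,17,14,11,14,18,15,15,15,13,8]
})
model = ols('size ~ C(fertilizer) * C(temperature)', df).fit()

# autocorrelation of the residuals: the statistic lies between 0 and 4
dw = durbin_watson(model.resid)

# normality of the residuals (Jarque-Bera also returns skewness and kurtosis)
jb_stat, jb_p, skew, kurt = jarque_bera(model.resid)
omni_stat, omni_p = omni_normtest(model.resid)

# multicollinearity: condition number of the design matrix
cond = model.condition_number

print(f"Durbin-Watson = {dw:.2f}")
print(f"Jarque-Bera p = {jb_p:.3f}, Omnibus p = {omni_p:.3f}")
print(f"Condition number = {cond:.1f}")
```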