In [1]:
%%latex
Variation without repetition
$$V_n^k = \frac{n!}{(n - k)!}\\$$
Combination without repetition
$$C_n^k = \binom{n}{k} = \frac{n!}{k!(n - k)!}\\$$


### Examples

The number of 2-element combinations of the 4-element set A = {a, b, c, d} is 6.

The combinations are the subsets: {a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}

The number of 2-element variations of the 3-element set B = {e, f, g} is 6: ef, eg, fe, fg, ge, gf.
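
These counts can be checked with Python's standard library. The sketch below uses `math.comb` and `math.perm` (available since Python 3.8) for the formulas and `itertools` to enumerate the arrangements themselves. The variable names are illustrative, and the second example assumes 2-element variations of B, as the listed pairs suggest.

```python
from itertools import combinations, permutations
from math import comb, perm

A = ["a", "b", "c", "d"]
B = ["e", "f", "g"]

# Combinations: order does not matter, so {a, b} and {b, a} count once.
pairs_of_A = list(combinations(A, 2))

# Variations: order matters, so "ef" and "fe" are counted separately.
variations_of_B = ["".join(p) for p in permutations(B, 2)]

print(comb(4, 2), len(pairs_of_A))       # formula vs. enumeration for C(4, 2)
print(perm(3, 2), len(variations_of_B))  # formula vs. enumeration for V(3, 2)
print(variations_of_B)
```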

## Between-group variability

The smaller the distance between sample means, the less likely it is that the population means differ significantly.

The greater the distance between sample means, the more likely it is that the population means differ significantly.

## Within-group variability

The greater the variability within each individual sample, the less likely it is that the population means differ significantly.

The smaller the variability within each individual sample, the more likely it is that the population means differ significantly.
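
A small illustration of the two ideas, using made-up numbers: both pairs of samples below have the same distance between their means, but very different spread inside each sample. The data and the helper `describe` are hypothetical and only meant as a sketch.

```python
from statistics import mean, variance

# Hypothetical data: in both cases the sample means are 10 and 20.
tight = {"a": [9, 10, 11], "b": [19, 20, 21]}
noisy = {"a": [0, 10, 20], "b": [10, 20, 30]}

def describe(samples):
    """Return (distance between the two sample means, average sample variance)."""
    gap = abs(mean(samples["a"]) - mean(samples["b"]))
    spread = mean(variance(values) for values in samples.values())
    return gap, spread

tight_gap, tight_spread = describe(tight)
noisy_gap, noisy_spread = describe(noisy)

print("tight:", tight_gap, tight_spread)
print("noisy:", noisy_gap, noisy_spread)
```

The gap between the means is identical, but the noisy samples overlap heavily, so the same gap is much weaker evidence of a real difference between populations.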


# ANOVA 

When we compare more than two samples, we extend the idea of the t-test (t = difference / error). We compare samples by measuring how far each sample mean lies from the grand mean (the mean of all values, which equals the mean of the sample means when the samples are the same size). This is called *between-group variability*.
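
The remark about sample sizes can be checked directly. The sketch below, with hypothetical data and helper names, compares the mean of all pooled values with the mean of the sample means for equal and unequal sample sizes.

```python
from statistics import mean

def grand_mean(samples):
    """Mean of all values pooled together."""
    return mean(value for sample in samples for value in sample)

def mean_of_means(samples):
    """Mean of the individual sample means."""
    return mean(mean(sample) for sample in samples)

equal_sizes = [[1, 2, 3], [4, 5, 6]]
unequal_sizes = [[1, 2, 3], [10]]

# The two quantities agree only when every sample has the same size.
print(grand_mean(equal_sizes), mean_of_means(equal_sizes))
print(grand_mean(unequal_sizes), mean_of_means(unequal_sizes))
```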

We also look at the variability within each sample, because it affects whether or not the samples are significantly different.

Since we are analyzing variabilities, this process is called *Analysis of Variance*, shortened to **ANOVA**.

**ANOVA** can compare as many means as we want with just one test. We say one-way ANOVA when we have one independent variable (called a *factor*).

An ANOVA test tells us whether at least one sample mean differs significantly from the others. It does not tell us which sample in the group is the different one.

### F-ratio

**F-ratio** = between-group variability / within-group variability



In [2]:
%%latex
$$F = \frac{\sum_k n_k (\bar{x}_k - \bar{x}_G)^2 / (k - 1)}{\sum_k \sum_i (x_i - \bar{x}_k)^2 / (N - k)}\\$$

$n_k$ - size of sample $k$ $$$$
$\bar{x}_k$ - mean of sample $k$, $\bar{x}_G$ - grand mean $$$$
$k$ - number of samples $$$$
$N$ - total number of values from all samples
$$df_{between} = k - 1 \\$$
$$df_{within} = N - k \\$$

$$F = \frac{SS_{between} / df_{between}}{SS_{within} / df_{within}} = \frac{MS_{between}}{MS_{within}}\\$$

The between-group and within-group degrees of freedom add up to the total:
$$df_{between} + df_{within} = N - 1 = df_{total}\\$$

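
The formulas above translate almost line by line into plain Python. The function below is a minimal sketch, not a library API: the name `one_way_anova` and the toy data are assumptions made for illustration.

```python
from statistics import mean

def one_way_anova(samples):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(samples)                       # number of samples
    N = sum(len(s) for s in samples)       # total number of values
    grand = mean(x for s in samples for x in s)

    # Between-group sum of squares: sample means vs. the grand mean,
    # each weighted by the sample size n_k.
    ss_between = sum(len(s) * (mean(s) - grand) ** 2 for s in samples)
    # Within-group sum of squares: each value vs. its own sample mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)

    df_between = k - 1
    df_within = N - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

f_value, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_value, df_b, df_w)
```

For this toy data the between-group sum of squares is 26 and the within-group sum of squares is 6, so the mean squares are 13 and 1.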

### F-statistic distribution

![F-statistic distribution](images/f-stat-distribution.jpg)

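The F distribution is *positively skewed*. When the null hypothesis is true, the F-statistic is expected to be close to **1**.

The F-test is **non-directional**: F cannot be negative, so the critical region lies only in the right tail.

If the F-statistic falls in the critical region, we conclude that at least two of the sample means differ significantly.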

[Link to F-Table](http://www.socr.ucla.edu/applets.dir/f_table.html)

# Clothing example

In [3]:
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'snapzi': [15, 12, 14, 11],
    'irisa': [39, 45, 48, 60],
    'lola': [65, 45, 32, 38]
})
data

| | irisa | lola | snapzi |
|---|---|---|---|
| 0 | 39 | 65 | 15 |
| 1 | 45 | 45 | 12 |
| 2 | 48 | 32 | 14 |
| 3 | 60 | 38 | 11 |


In [4]:
means = data.mean()
means['grand'] = data.values.mean()
means

irisa     48.000000
lola      45.000000
snapzi    13.000000
grand     35.333333
dtype: float64

In [5]:
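# Squared distance of each sample mean from the grand mean
squared = (means[data.columns] - means['grand']) ** 2
# Weight each squared distance by the number of values in that sample (n_k)
ss_between = (data.count() * squared).sum()
ss_between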

3010.6666666666665

In [6]:
# Squared deviation of every value from the mean of its own sample
ss_within = ((data - data.mean()) ** 2).sum().sum()
ss_within

862.0
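
As a sanity check on the two sums of squares computed above, they should add up to the total sum of squares, that is, the squared deviations of all values from the grand mean. The sketch below repeats the clothing data as plain lists so that it runs on its own; the variable names are illustrative.

```python
from statistics import mean

# Clothing data copied from the DataFrame defined earlier.
samples = {
    "snapzi": [15, 12, 14, 11],
    "irisa": [39, 45, 48, 60],
    "lola": [65, 45, 32, 38],
}
all_values = [x for values in samples.values() for x in values]
grand = mean(all_values)

ss_total = sum((x - grand) ** 2 for x in all_values)
ss_between = sum(len(v) * (mean(v) - grand) ** 2 for v in samples.values())
ss_within = sum((x - mean(v)) ** 2 for v in samples.values() for x in v)

# The two numbers should agree up to floating-point rounding.
print(ss_total, ss_between + ss_within)
```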

In [7]:
# k - 1, where k is the number of samples (columns)
df_between = len(data.columns) - 1
# N - k, where N is the total number of values
df_within = data.count().sum() - len(data.columns)

df_between, df_within

(2, 9)

In [8]:
# Calculating mean squares
ms_between = ss_between / df_between
ms_within = ss_within / df_within

ms_between, ms_within

(1505.3333333333333, 95.777777777777771)

In [9]:
# F-statistic - ratio between MS_between and MS_within
F = ms_between / ms_within
F

15.716937354988399

In [10]:
# Critical value from the F-table for alpha = 0.05 with (2, 9) degrees of freedom
F_critical = 4.2565

`F > F_critical`, so we **reject the null**. At least one of the sample means is significantly different from the others.
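
The critical value was read from an F-table. For the special case of 2 numerator degrees of freedom, which applies here, the tail probability of the F distribution has a closed form, P(F > x) = (1 + 2x / df_within)^(-df_within / 2), so the table value can be cross-checked in plain Python. This is a sketch under that assumption: the function names are illustrative, alpha = 0.05 is assumed, and the formulas do not hold for other values of df_between.

```python
def f_tail_probability(x, df_within):
    """P(F > x) for an F distribution with (2, df_within) degrees of freedom."""
    return (1 + 2 * x / df_within) ** (-df_within / 2)

def f_critical_value(alpha, df_within):
    """Value x with P(F > x) = alpha, valid only for df_between = 2."""
    return df_within / 2 * (alpha ** (-2 / df_within) - 1)

# df_within = 9 and the F-statistic come from the clothing example.
critical = f_critical_value(0.05, 9)
p_value = f_tail_probability(15.716937354988399, 9)

print(critical)  # should agree with the F-table entry used above
print(p_value)   # a p-value below alpha leads to the same decision
```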