# ANOVA

In [None]:
import numpy as np

# Completely Randomized Design: One-Way ANOVA

## SST = SSC + SSE

$$ \sum_{j=1}^{C}\sum_{i=1}^{n_{j}}(x_{ij}- \bar{x})^2  = \sum_{j=1}^{C} n_{j}(\bar{x}_{j} - \bar{x})^2   + \sum_{j=1}^{C}\sum_{i=1}^{n_{j}}(x_{ij}- \bar{x}_{j})^2 $$

1. SST = total sum of squares
2. SSC = sum of squares column
3. SSE = sum of squares error
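
The partition SST = SSC + SSE can be checked numerically. The following is a minimal sketch on hypothetical toy data; the helper name `sum_of_squares_partition` and the values in `toy_groups` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def sum_of_squares_partition(groups):
    """Return (SST, SSC, SSE) for a list of 1-D samples, one per column."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # SST: squared deviation of every observation from the grand mean
    sst = np.sum((all_values - grand_mean) ** 2)
    # SSC: squared deviation of each column mean from the grand mean,
    # weighted by the column size n_j
    ssc = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # SSE: squared deviation of every observation from its own column mean
    sse = sum(np.sum((g - g.mean()) ** 2) for g in groups)
    return sst, ssc, sse

# Hypothetical toy data: three columns with three observations each
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
sst, ssc, sse = sum_of_squares_partition(toy_groups)
print(sst, ssc, sse)
print(np.isclose(sst, ssc + sse))  # the partition should hold up to rounding
```

Because the helper accepts samples of different lengths, it also applies to unbalanced designs such as the five samples generated in the next cell.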

In [None]:
sample1=np.random.randint(600,615,15)/10
# sample2=sample3=sample4=sample5=sample1
sample2=np.random.randint(600,615,14)/10
sample3=np.random.randint(600,615,13)/10
sample4=np.random.randint(600,615,15)/10
sample5=np.random.randint(600,700,12)/10

In [None]:
meansample1=sample1.mean()
meansample2=sample2.mean()
meansample3=sample3.mean()
meansample4=sample4.mean()
meansample5=sample5.mean()

In [None]:
meansample1,meansample2,meansample3,meansample4,meansample5

In [None]:
overAllMean=(sum(sample1)+sum(sample2)+sum(sample3)+sum(sample4)+sum(sample5))/(len(sample1)+len(sample2)+len(sample3)+len(sample4)+len(sample5))

In [None]:
overAllMean

## SST

In [None]:
sqSumofTotal=np.sum([(overAllMean-i)**2 for i in sample1]+[(overAllMean-i)**2 for i in sample2]+\
                    [(overAllMean-i)**2 for i in sample3]+[(overAllMean-i)**2 for i in sample4]+\
                    [(overAllMean-i)**2 for i in sample5])

In [None]:
sqSumofTotal

## SSC

In [None]:
sqSumofColumns=len(sample1)*(meansample1-overAllMean)**2+\
                len(sample2)*(meansample2-overAllMean)**2+\
                len(sample3)*(meansample3-overAllMean)**2+\
                len(sample4)*(meansample4-overAllMean)**2+\
                len(sample5)*(meansample5-overAllMean)**2

In [None]:
sqSumofColumns

## SSE

In [None]:
sqSumofErrors=np.sum([(meansample1-i)**2 for i in sample1])+np.sum([(meansample2-i)**2 for i in sample2])+\
                np.sum([(meansample3-i)**2 for i in sample3])+np.sum([(meansample4-i)**2 for i in sample4])+\
                np.sum([(meansample5-i)**2 for i in sample5])

In [None]:
sqSumofErrors


Analysis of variance is used to determine statistically whether the variance between the categorical level means is greater than the variance within levels, i.e. the error variance.

Several important assumptions underlie analysis of variance:
1. Observations are drawn from normally distributed populations.
2. Observations represent random samples from the populations.
3. Variances of the populations are equal.
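
Two of these assumptions can be probed from the data. The sketch below uses the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality and Levene's test (`scipy.stats.levene`) for equal variances; the choice of these two tests and the synthetic `groups` are assumptions made for illustration. Random sampling (assumption 2) is a property of how the data were collected and cannot be verified this way.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical samples: three groups drawn from normal populations
# with different means and a common standard deviation
groups = [rng.normal(loc=mu, scale=1.0, size=20) for mu in (60.0, 60.5, 61.0)]

# Assumption 1 (normality): Shapiro-Wilk test on each group separately
shapiro_pvalues = []
for g in groups:
    _, p = st.shapiro(g)
    shapiro_pvalues.append(p)

# Assumption 3 (equal variances): Levene's test across all groups
levene_stat, levene_pvalue = st.levene(*groups)

print(shapiro_pvalues)
print(levene_pvalue)
```

A small p-value in either test is evidence against the corresponding assumption; a large one does not prove that the assumption holds.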

As shown previously, SST can be partitioned into SSC and SSE. Each sum of squares has a corresponding mean square:
- MSC: mean square of columns (treatments)
- MSE: mean square of error
- MST: mean square total

A mean square is an average, computed by dividing a sum of squares by its degrees of freedom: MSC = SSC / (C - 1) and MSE = SSE / (N - C), where C is the number of columns (treatment levels) and N is the total number of observations. The F value is the ratio of these two variances: the treatment (column) variance MSC divided by the error variance MSE.
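
As a worked illustration of these definitions, the sketch below computes MSC, MSE and F for a small hypothetical data set and compares the result with `scipy.stats.f_oneway`. The function name `one_way_f` and the toy values are assumptions introduced only for this example.

```python
import numpy as np
import scipy.stats as st

def one_way_f(groups):
    """Return (MSC, MSE, F) for a list of 1-D samples, one per column."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n_total = sum(len(g) for g in groups)   # N
    n_cols = len(groups)                    # C
    grand_mean = np.concatenate(groups).mean()

    ssc = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    sse = sum(np.sum((g - g.mean()) ** 2) for g in groups)

    msc = ssc / (n_cols - 1)        # column degrees of freedom: C - 1
    mse = sse / (n_total - n_cols)  # error degrees of freedom: N - C
    return msc, mse, msc / mse

# Hypothetical toy data: SSC = 26 and SSE = 6, so MSC = 13 and MSE = 1
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
msc, mse, f_value = one_way_f(toy_groups)
f_scipy = st.f_oneway(*toy_groups).statistic

print(msc, mse, f_value)
print(np.isclose(f_value, f_scipy))
```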

In [None]:
dofC=5-1
dofE=(len(sample1)+len(sample2)+len(sample3)+len(sample4)+len(sample5))-5
dofT=(len(sample1)+len(sample2)+len(sample3)+len(sample4)+len(sample5))-1

In [None]:
MSC=sqSumofColumns/dofC
MSE=sqSumofErrors/dofE

In [None]:
MSC,MSE

In [None]:
dofC,dofE

In [None]:
fValueObserved=MSC/MSE

In [None]:
fValueObserved

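From these computations, an analysis of variance table can be constructed. The observed F value is 66.24 for the run reported here; since the samples are drawn randomly without a fixed seed, a rerun will give a different value. The observed F is compared to a critical value from the F table to determine whether there is a significant difference between treatments or classifications.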
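
Instead of reading the critical value from a printed table, it can be computed from the F distribution. The following is a minimal sketch, assuming a significance level of `alpha = 0.05` (the text does not state one) and the degrees of freedom that follow from five samples with 15 + 14 + 13 + 15 + 12 = 69 observations in total. The helper `p_value` is a hypothetical name.

```python
import scipy.stats as st

alpha = 0.05          # assumed significance level (not given in the text)
dof_columns = 5 - 1   # C - 1 for five samples
dof_error = 69 - 5    # N - C for 69 observations in five samples

# Critical value: the (1 - alpha) quantile of F(dof_columns, dof_error)
f_critical = st.f.ppf(1 - alpha, dof_columns, dof_error)

def p_value(f_observed):
    """Upper-tail probability of observing an F at least this large under H0."""
    return st.f.sf(f_observed, dof_columns, dof_error)

print(f_critical)
print(p_value(66.24))  # the observed F value quoted in the text
```

The null hypothesis is rejected when the observed F exceeds `f_critical`, or equivalently when the p-value is below `alpha`. A different choice of `alpha` gives a different critical value.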

In [None]:
fCritical= 3.01

## The null hypothesis is rejected

## Through SciPy

In [None]:
import scipy.stats as st
st.f_oneway(sample1,sample2,sample3,sample4,sample5)

# Randomized Block Design

$$ \sum_{i=1}^{n}\sum_{j=1}^{C}(x_{ij}- \bar{x})^2  = n\sum_{j=1}^{C}(\bar{x}_{j} - \bar{x})^2   + 
C\sum_{i=1}^{n}(\bar{x}_{i}-\bar{x})^2 +
\sum_{i=1}^{n}\sum_{j=1}^{C}(x_{ij}- \bar{x}_{j} -\bar{x}_{i} + \bar{x})^2 $$

1. SST = total sum of squares
2. SSC = sum of squares column
3. SSR = sum of squares of rows (Blocks)
4. SSE = sum of squares error

- Column (treatment) effects:
    - H0: Treatment means are all equal.
    - Ha: At least one treatment mean is different from the others.
- Row (block) effects:
    - H0: Block means are all equal.
    - Ha: At least one block mean is different from the others.
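
The randomized block partition SST = SSC + SSR + SSE and the two F tests can be sketched with NumPy. The data matrix `x` below is hypothetical, with rows as blocks and columns as treatments and one observation per cell; the error degrees of freedom are taken as (C - 1)(n - 1).

```python
import numpy as np
import scipy.stats as st

# Hypothetical data: rows are blocks (n = 3), columns are treatments (C = 3)
x = np.array([[1.0, 2.0, 6.0],
              [2.0, 4.0, 7.0],
              [3.0, 3.0, 8.0]])
n_blocks, n_treatments = x.shape

grand_mean = x.mean()
col_means = x.mean(axis=0)  # treatment (column) means
row_means = x.mean(axis=1)  # block (row) means

sst = np.sum((x - grand_mean) ** 2)
ssc = n_blocks * np.sum((col_means - grand_mean) ** 2)
ssr = n_treatments * np.sum((row_means - grand_mean) ** 2)
# Residual: observation minus its column mean and row mean, plus the grand mean
residuals = x - col_means[np.newaxis, :] - row_means[:, np.newaxis] + grand_mean
sse = np.sum(residuals ** 2)

dof_c = n_treatments - 1
dof_r = n_blocks - 1
dof_e = dof_c * dof_r

mse = sse / dof_e
f_treatments = (ssc / dof_c) / mse
f_blocks = (ssr / dof_r) / mse
p_treatments = st.f.sf(f_treatments, dof_c, dof_e)
p_blocks = st.f.sf(f_blocks, dof_r, dof_e)

print(np.isclose(sst, ssc + ssr + sse))
print(f_treatments, p_treatments)
print(f_blocks, p_blocks)
```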

# Factorial Design

- Row effects:
    - H0: Row means are all equal.
    - Ha: At least one row mean is different from the others.
- Column effects:
    - H0: Column means are all equal.
    - Ha: At least one column mean is different from the others.
- Interaction effects:
    - H0: The interaction effects are zero.
    - Ha: An interaction effect is present.

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
import pandas as pd
data=pd.read_excel('Sample - Superstore.xls')

In [None]:
data.head(2)

In [None]:
pd.unique(data['Segment'])

# Randomized Block Design approach

In [None]:
model = ols('Sales ~ C(Category) + C(Segment)', data=data).fit()

In [None]:
anovaAnaRep = sm.stats.anova_lm(model, typ=2)
anovaAnaRep

# Factorial Design (Interaction)

In [None]:
model = ols('Sales ~ C(Category) + C(Segment) + C(Category):C(Segment)', data=data).fit()

In [None]:
anovaAnaRep = sm.stats.anova_lm(model, typ=2)
anovaAnaRep