# ANOVA

In [1]:
import numpy as np

# Completely Randomized Design: One-Way ANOVA

## SST = SSC + SSE

$$ \sum_{j=1}^{C}\sum_{i=1}^{n_{j}}(x_{ij}- \bar{x})^2  = \sum_{j=1}^{C} n_{j}(\bar{x}_{j} - \bar{x})^2   + \sum_{j=1}^{C}\sum_{i=1}^{n_{j}}(x_{ij}- \bar{x}_{j})^2 $$

1. SST = total sum of squares
2. SSC = sum of squares of columns (treatments)
3. SSE = sum of squares error

In [2]:
sample1=np.random.randint(600,615,15)/10
# sample2=sample3=sample4=sample5=sample1
sample2=np.random.randint(600,615,14)/10
sample3=np.random.randint(600,615,13)/10
sample4=np.random.randint(600,615,15)/10
sample5=np.random.randint(600,700,12)/10

In [3]:
meansample1=sample1.mean()
meansample2=sample2.mean()
meansample3=sample3.mean()
meansample4=sample4.mean()
meansample5=sample5.mean()

In [4]:
meansample1,meansample2,meansample3,meansample4,meansample5

(60.773333333333326,
 60.70714285714286,
 60.646153846153844,
 60.533333333333324,
 64.69166666666668)

In [5]:
overAllMean=(sum(sample1)+sum(sample2)+sum(sample3)+sum(sample4)+sum(sample5))/(len(sample1)+len(sample2)+len(sample3)+len(sample4)+len(sample5))

In [6]:
overAllMean

61.36521739130435

# SST

In [7]:
sqSumofTotal=np.sum([(overAllMean-i)**2 for i in sample1]+[(overAllMean-i)**2 for i in sample2]+\
                    [(overAllMean-i)**2 for i in sample3]+[(overAllMean-i)**2 for i in sample4]+\
                    [(overAllMean-i)**2 for i in sample5])

In [8]:
sqSumofTotal

304.1565217391306

# SSC

In [9]:
sqSumofColumns=len(sample1)*(meansample1-overAllMean)**2+\
                len(sample2)*(meansample2-overAllMean)**2+\
                len(sample3)*(meansample3-overAllMean)**2+\
                len(sample4)*(meansample4-overAllMean)**2+\
                len(sample5)*(meansample5-overAllMean)**2

In [10]:
sqSumofColumns

161.2030949992048

# SSE

In [11]:
sqSumofErrors=np.sum([(meansample1-i)**2 for i in sample1])+np.sum([(meansample2-i)**2 for i in sample2])+\
                np.sum([(meansample3-i)**2 for i in sample3])+np.sum([(meansample4-i)**2 for i in sample4])+\
                np.sum([(meansample5-i)**2 for i in sample5])

In [12]:
sqSumofErrors

142.9534267399268
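
The three sums of squares above satisfy the partition SST = SSC + SSE (here 161.2031 + 142.9534 ≈ 304.1565). As a minimal sketch, the same identity can be checked on any set of groups. The seeded data and the names `groups`, `sst`, `ssc`, and `sse` below are illustrative assumptions, not the notebook's samples.

```python
import numpy as np

# Hypothetical groups of unequal size; the seed is only for reproducibility.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=60.5, scale=0.4, size=n) for n in (15, 14, 13, 15)]
groups.append(rng.normal(loc=65.0, scale=3.0, size=12))

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group (column) and within-group (error) sums of squares.
sst = np.sum((all_values - grand_mean) ** 2)
ssc = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sse = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# The two printed numbers should agree up to floating-point rounding.
print(sst, ssc + sse)
```

Working on NumPy arrays instead of Python list comprehensions keeps the three formulas close to their mathematical definitions.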


Analysis of variance is used to determine statistically whether the variance between the category-level (treatment) means is greater than the variance within the levels, i.e. the error variance.

Several important assumptions underlie analysis of variance:
1. Observations are drawn from normally distributed populations.
2. Observations represent random samples from the populations.
3. Variances of the populations are equal.
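
Two of these assumptions can be screened with standard tests. The sketch below uses `scipy.stats.shapiro` (normality, per group) and `scipy.stats.levene` (equal variances) on hypothetical seeded data. The choice of these two tests is an assumption made here for illustration, and the random-sampling assumption is a matter of study design that no test on the data can confirm.

```python
import numpy as np
from scipy import stats

# Hypothetical groups; replace with the real samples.
rng = np.random.default_rng(1)
groups = [rng.normal(loc=60.5, scale=0.4, size=n) for n in (15, 14, 13)]

# Shapiro-Wilk: H0 is that the sample comes from a normal population.
shapiro_pvalues = []
for g in groups:
    _, p = stats.shapiro(g)
    shapiro_pvalues.append(p)

# Levene: H0 is that all populations have equal variances.
levene_stat, levene_p = stats.levene(*groups)

print(shapiro_pvalues, levene_p)
```

A small p-value in either test is evidence against the corresponding assumption. A large p-value does not prove the assumption holds.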

As shown previously, SST can be partitioned into SSC and SSE. Each sum of squares has a corresponding mean square:
- MSC: mean square of columns
- MSE: mean square of error
- MST: mean square total

A mean square is an average, computed by dividing a sum of squares by its degrees of freedom. The F value is a ratio of two variances: in ANOVA, it is the treatment (column) variance MSC divided by the error variance MSE.
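
Written out for the one-way design, with $C$ columns and $N$ total observations (the symbol $N$ is introduced here for illustration):

$$ MSC = \frac{SSC}{C-1}, \qquad MSE = \frac{SSE}{N-C}, \qquad F = \frac{MSC}{MSE} $$

With the sums of squares computed above, this gives $MSC = 161.2031/4 \approx 40.30$, $MSE = 142.9534/64 \approx 2.23$, and $F \approx 18.04$.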

In [13]:
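dofC=5-1  # number of groups minus 1
dofE=(len(sample1)+len(sample2)+len(sample3)+len(sample4)+len(sample5))-5  # total observations minus number of groups
dofT=(len(sample1)+len(sample2)+len(sample3)+len(sample4)+len(sample5))-1  # total observations minus 1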

In [14]:
MSC=sqSumofColumns/dofC
MSE=sqSumofErrors/dofE

In [15]:
MSC,MSE

(40.3007737498012, 2.2336472928113564)

In [16]:
dofC,dofE

(4, 64)

In [17]:
fValueObserved=MSC/MSE

In [18]:
fValueObserved

18.042586168148805

From these computations, an analysis of variance table can be constructed. The observed F value is 18.04. It is compared to a critical value from the F table, for the same degrees of freedom, to determine whether there is a significant difference among the treatment (classification) means.

In [19]:
fCritical = 3.01  # critical F value read from an F table
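
Instead of reading the critical value from a printed table, it can be computed from the F distribution. The sketch below uses `scipy.stats.f.ppf` for the critical value and `scipy.stats.f.sf` for the p-value. The significance level `alpha = 0.05` is an assumption, since the level behind the tabulated 3.01 is not stated, so the computed value need not equal 3.01.

```python
from scipy import stats

alpha = 0.05                      # assumed significance level
dof_c, dof_e = 4, 64              # degrees of freedom computed in the notebook
f_observed = 18.042586168148805   # observed F value from the notebook

# Critical value: the point with upper-tail probability alpha.
f_critical = stats.f.ppf(1 - alpha, dof_c, dof_e)

# p-value: upper-tail probability of the observed F.
p_value = stats.f.sf(f_observed, dof_c, dof_e)

reject_null = f_observed > f_critical
print(f_critical, p_value, reject_null)
```

Comparing `f_observed` with `f_critical` and comparing `p_value` with `alpha` are equivalent decision rules.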

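## The null hypothesis is rejected

The observed F value (18.04) exceeds the critical value (3.01), so the null hypothesis that all group means are equal is rejected.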

## Through SciPy

In [20]:
import scipy.stats as st
st.f_oneway(sample1,sample2,sample3,sample4,sample5)

F_onewayResult(statistic=18.042586168148695, pvalue=5.77330732222661e-10)

# Randomized Block Design

$$ \sum_{i=1}^{n}\sum_{j=1}^{C}(x_{ij}- \bar{x})^2  = n\sum_{j=1}^{C}(\bar{x}_{j} - \bar{x})^2   +
C\sum_{i=1}^{n}(\bar{x}_{i}-\bar{x})^2 +
\sum_{i=1}^{n}\sum_{j=1}^{C}(x_{ij}- \bar{x}_{j} -\bar{x}_{i} + \bar{x})^2 $$

1. SST = total sum of squares
2. SSC = sum of squares of columns (treatments)
3. SSR = sum of squares of rows (blocks)
4. SSE = sum of squares error

- Treatment (column) effects:
    - H0: Treatment means are all equal.
    - Ha: At least one treatment mean is different from the others.
- Block (row) effects:
    - H0: Block means are all equal.
    - Ha: At least one block mean is different from the others.
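
The randomized block partition SST = SSC + SSR + SSE can be sketched with NumPy. The layout below is hypothetical, with `n = 4` blocks as rows, `C = 3` treatments as columns, and one observation per cell. The data and all variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: rows are blocks, columns are treatments.
rng = np.random.default_rng(2)
x = rng.normal(loc=50.0, scale=2.0, size=(4, 3))
n, C = x.shape

grand_mean = x.mean()
col_means = x.mean(axis=0)   # treatment means, shape (C,)
row_means = x.mean(axis=1)   # block means, shape (n,)

sst = np.sum((x - grand_mean) ** 2)
ssc = n * np.sum((col_means - grand_mean) ** 2)
ssr = C * np.sum((row_means - grand_mean) ** 2)
residuals = x - col_means[np.newaxis, :] - row_means[:, np.newaxis] + grand_mean
sse = np.sum(residuals ** 2)

# Mean squares and F ratios; the error term has (C - 1)(n - 1) degrees of freedom.
mse = sse / ((C - 1) * (n - 1))
f_treatments = (ssc / (C - 1)) / mse
f_blocks = (ssr / (n - 1)) / mse

print(sst, ssc + ssr + sse, f_treatments, f_blocks)
```

Removing the block sum of squares from the error term is what distinguishes this design from the one-way analysis of the same data.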

# Factorial Design

- Row effects:
    - H0: Row means are all equal.
    - Ha: At least one row mean is different from the others.
- Column effects:
    - H0: Column means are all equal.
    - Ha: At least one column mean is different from the others.
- Interaction effects:
    - H0: The interaction effects are zero.
    - Ha: An interaction effect is present.
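
Each of the three hypotheses has its own F ratio against the same error mean square. The sketch below assumes a balanced design with `R = 2` row levels, `C = 3` column levels, and `r = 4` replicates per cell. The data and variable names are hypothetical.

```python
import numpy as np

# Hypothetical balanced data with shape (rows, columns, replicates).
rng = np.random.default_rng(3)
x = rng.normal(loc=10.0, scale=1.0, size=(2, 3, 4))
R, C, r = x.shape

grand_mean = x.mean()
row_means = x.mean(axis=(1, 2))   # shape (R,)
col_means = x.mean(axis=(0, 2))   # shape (C,)
cell_means = x.mean(axis=2)       # shape (R, C)

ssr = C * r * np.sum((row_means - grand_mean) ** 2)
ssc = R * r * np.sum((col_means - grand_mean) ** 2)
interaction = cell_means - row_means[:, None] - col_means[None, :] + grand_mean
ssi = r * np.sum(interaction ** 2)
sse = np.sum((x - cell_means[:, :, None]) ** 2)
sst = np.sum((x - grand_mean) ** 2)

# One F ratio per hypothesis, all against the same error mean square.
mse = sse / (R * C * (r - 1))
f_rows = (ssr / (R - 1)) / mse
f_cols = (ssc / (C - 1)) / mse
f_interaction = (ssi / ((R - 1) * (C - 1))) / mse

print(f_rows, f_cols, f_interaction)
```

The closed-form sums of squares used here rely on the design being balanced. Unbalanced data generally need a model-based approach.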

In [21]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [22]:
import pandas as pd
data=pd.read_excel('Sample - Superstore.xls')

In [23]:
data.head(2)

| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2 | 0.0 | 41.9136 |
| 1 | 2 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 3 | 0.0 | 219.582 |


In [None]:
pd.unique(data['Segment'])

# Randomized Block Design approach

In [24]:
model = ols('Sales ~ C(Category) + C(Segment)', data=data).fit()

In [25]:
anovaAnaRep = sm.stats.anova_lm(model, typ=2)
anovaAnaRep

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Category) | 195872900.0 | 2.0 | 265.457361 | 4.706146e-113 |
| C(Segment) | 453622.0 | 2.0 | 0.614772 | 0.5407844 |
| Residual | 3685290000.0 | 9989.0 | | |


# Factorial Design (Interaction)

In [26]:
model = ols('Sales ~ C(Category) + C(Segment) + C(Category):C(Segment)', data=data).fit()

In [27]:
anovaAnaRep = sm.stats.anova_lm(model, typ=2)
anovaAnaRep

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Category) | 195872900.0 | 2.0 | 265.551842 | 4.313699e-113 |
| C(Segment) | 453622.0 | 2.0 | 0.614991 | 0.5406661 |
| C(Category):C(Segment) | 2786415.0 | 4.0 | 1.888821 | 0.1093869 |
| Residual | 3682504000.0 | 9985.0 | | |
