# ANOVA

The analysis of variance, or more briefly ANOVA, refers broadly to a collection of experimental situations and statistical procedures for the analysis of quantitative responses from experimental units.

The simplest ANOVA problem is referred to variously as a single-factor, single-classification, or one-way ANOVA.

## ONE WAY ANOVA
One-way ANOVA involves the analysis either of data sampled from more than two
numerical populations (distributions) or of data from experiments in which
more than two treatments have been used.

There are three primary assumptions in ANOVA:

- The responses for each factor level have a normal population distribution.
- These distributions have the same variance.
- The data are independent.

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \cdots = \mu_n$

$H_a: \text{at least two of the } \mu_i\text{'s are different}$
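
For reference, here is a sketch of the test statistic behind these hypotheses. The notation is an assumption introduced here, not taken from the source: $n$ groups as in $H_0$, $m_i$ observations in group $i$, $N = \sum_i m_i$ observations in total, $\bar{x}_{i\cdot}$ the mean of group $i$, and $\bar{x}_{\cdot\cdot}$ the grand mean.

$$
\mathrm{SSTr} = \sum_{i=1}^{n} m_i \left(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot}\right)^2,
\qquad
\mathrm{SSE} = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \left(x_{ij} - \bar{x}_{i\cdot}\right)^2
$$

$$
F = \frac{\mathrm{SSTr}/(n-1)}{\mathrm{SSE}/(N-n)}
$$

When $H_0$ and the three assumptions hold, $F$ follows an $F$ distribution with $n-1$ and $N-n$ degrees of freedom, and the P-value is the area to the right of the observed $F$.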

SciPy provides this test as `scipy.stats.f_oneway`: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html

#### QUESTIONS

Six samples of each of four types of cereal grain grown in a
certain region were analyzed to determine thiamin content,
resulting in the following data (mg/g):

- Wheat 5.2 4.5 6.0 6.1 6.7 5.8
- Barley 6.5 8.0 6.1 7.5 5.9 5.6
- Maize 5.8 4.7 6.4 4.9 6.0 5.2
- Oats 8.3 6.1 7.8 7.0 5.5 7.2

Do these data suggest that at least two of the grains differ
with respect to true average thiamin content? Use a level $\alpha = 0.05$
test based on the P-value method.

In [1]:
import scipy.stats as stats

# Thiamin content data for each grain type
wheat = [5.2, 4.5, 6.0, 6.1, 6.7, 5.8]
barley = [6.5, 8.0, 6.1, 7.5, 5.9, 5.6]
maize = [5.8, 4.7, 6.4, 4.9, 6.0, 5.2]
oats = [8.3, 6.1, 7.8, 7.0, 5.5, 7.2]

# Perform one-way ANOVA
f_value, p_value = stats.f_oneway(wheat, barley, maize, oats)

print("F-value:", f_value)
print("P-value:", p_value)

F-value: 3.9565440798649343
P-value: 0.022934212492442103
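
As a cross-check, the F statistic can be rebuilt by hand from the between-group and within-group sums of squares. The sketch below assumes the usual one-way decomposition (mean square for treatments divided by mean square for error); the variable names are illustrative, not from the source.

```python
import numpy as np
from scipy import stats

# Thiamin data from the question above
groups = {
    "wheat": [5.2, 4.5, 6.0, 6.1, 6.7, 5.8],
    "barley": [6.5, 8.0, 6.1, 7.5, 5.9, 5.6],
    "maize": [5.8, 4.7, 6.4, 4.9, 6.0, 5.2],
    "oats": [8.3, 6.1, 7.8, 7.0, 5.5, 7.2],
}
samples = [np.asarray(v, dtype=float) for v in groups.values()]

k = len(samples)                         # number of groups
n_total = sum(len(s) for s in samples)   # total number of observations
grand_mean = np.concatenate(samples).mean()

# Between-group (treatment) and within-group (error) sums of squares
ss_treatment = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ss_error = sum(((s - s.mean()) ** 2).sum() for s in samples)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_treatment = ss_treatment / (k - 1)
ms_error = ss_error / (n_total - k)

f_manual = ms_treatment / ms_error
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)  # upper-tail area

print("F (manual):", f_manual)
print("P (manual):", p_manual)
```

Both values should agree with the `f_oneway` output above up to floating-point rounding.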


CONCLUSION

Since the P-value (0.0229) is less than 0.05, we reject the null hypothesis. The data suggest that at least two of the grains differ
with respect to true average thiamin content.
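
The conclusion rests on the normality and equal-variance assumptions listed earlier. One possible way to probe them is sketched below. The choice of the Shapiro-Wilk and Levene tests is an assumption made here, not something the source prescribes, and with only six observations per group both tests have little power, so the results are a rough screen at best.

```python
from scipy import stats

# Thiamin data from the question above
samples = {
    "wheat": [5.2, 4.5, 6.0, 6.1, 6.7, 5.8],
    "barley": [6.5, 8.0, 6.1, 7.5, 5.9, 5.6],
    "maize": [5.8, 4.7, 6.4, 4.9, 6.0, 5.2],
    "oats": [8.3, 6.1, 7.8, 7.0, 5.5, 7.2],
}

# Shapiro-Wilk test per group: H0 = the sample comes from a normal distribution
normality_p = {}
for name, values in samples.items():
    w_stat, p_normal = stats.shapiro(values)
    normality_p[name] = p_normal
    print(f"{name}: W = {w_stat:.3f}, p = {p_normal:.3f}")

# Levene's test across groups: H0 = all groups have the same variance
levene_stat, levene_p = stats.levene(*samples.values())
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

Small P-values here would cast doubt on the corresponding assumption; large ones do not prove it holds.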

## TWO WAY ANOVA

An example from the book *Probability and Statistics for Engineering and the Sciences* by Jay Devore.



Is it really as easy to remove marks on fabrics from erasable pens as the word
erasable might imply? Consider the following data from an experiment to compare three different brands of pens and four different wash treatments with respect
to their ability to remove marks on a particular type of fabric (based on “An
Assessment of the Effects of Treatment, Time, and Heat on the Removal of
Erasable Pen Marks from Cotton and Cotton/Polyester Blend Fabrics,” J. of
Testing and Evaluation, 1991: 394–397). The response variable is a quantitative
indicator of overall specimen color change; the lower this value, the more marks
were removed. Is there any difference in the true average amount of color change due either to the
different brands of pens or to the different washing treatments?

In [1]:
import pandas as pd 

data = [(x, y) for x in range(1, 4) for y in range(1, 5)]

# Creating the DataFrame
df = pd.DataFrame(data, columns=['BRAND', 'TREATMENT'])

df['ABILITY'] = [0.97, 0.48, 0.48, 0.46, 0.77, 0.14, 0.22, 0.25, 0.67, 0.39, 0.57, 0.19]

df

Unnamed: 0,BRAND,TREATMENT,ABILITY
0,1,1,0.97
1,1,2,0.48
2,1,3,0.48
3,1,4,0.46
4,2,1,0.77
5,2,2,0.14
6,2,3,0.22
7,2,4,0.25
8,3,1,0.67
9,3,2,0.39


In [2]:
import statsmodels.api as sm 
from statsmodels.formula.api import ols 

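# Performing two-way ANOVA
model = ols('ABILITY ~ C(BRAND) + C(TREATMENT) + C(BRAND):C(TREATMENT)',
            data=df).fit()
# anova_lm's keyword is `typ`, not `type`; an unrecognized `type=2` is silently
# ignored, so the default Type I table (typ=1) is what the output below shows.
result = sm.stats.anova_lm(model, typ=1)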
  
# Print the result 
print(result) 

                        df        sum_sq   mean_sq    F  PR(>F)
C(BRAND)               2.0  1.282167e-01  0.064108  0.0     NaN
C(TREATMENT)           3.0  4.796917e-01  0.159897  0.0     NaN
C(BRAND):C(TREATMENT)  6.0  8.678333e-02  0.014464  0.0     NaN
Residual               0.0  2.255418e-29       inf  NaN     NaN




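- In this case, all F-values are 0.0 and all P-values are NaN. The residual has 0 degrees of freedom: there is only one observation for each of the 12 brand/treatment combinations, so the model with interaction fits the data exactly and leaves nothing with which to estimate the error variance.

- As a first attempt, let us change the BRAND and TREATMENT columns to categorical and apply the model again. (`C()` in the formula already treats them as categorical, so this is not expected to change the result.)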
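
A quick count makes the zero residual degrees of freedom concrete. The sketch below rebuilds the same 12 rows in a separate frame (the name `df_check` is illustrative) and compares the number of observations with the number of parameters each model needs.

```python
import pandas as pd

# Same 3 brands x 4 treatments layout and responses as above
cells = [(x, y) for x in range(1, 4) for y in range(1, 5)]
df_check = pd.DataFrame(cells, columns=["BRAND", "TREATMENT"])
df_check["ABILITY"] = [0.97, 0.48, 0.48, 0.46, 0.77, 0.14,
                       0.22, 0.25, 0.67, 0.39, 0.57, 0.19]

# Number of observations in each brand/treatment cell
cell_counts = df_check.groupby(["BRAND", "TREATMENT"]).size()

n_obs = len(df_check)
n_brand = df_check["BRAND"].nunique()
n_treat = df_check["TREATMENT"].nunique()

# Parameters: intercept + main effects (+ interaction for the full model)
n_params_additive = 1 + (n_brand - 1) + (n_treat - 1)
n_params_full = n_params_additive + (n_brand - 1) * (n_treat - 1)

print("max observations per cell:", cell_counts.max())
print("residual df with interaction:", n_obs - n_params_full)
print("residual df without interaction:", n_obs - n_params_additive)
```

With one observation per cell the full model uses all 12 degrees of freedom, while the additive model leaves 6 for the error term.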

In [3]:
# Change BRAND and TREATMENT columns into categorical variables
df['BRAND'] = df['BRAND'].astype('category')
df['TREATMENT'] = df['TREATMENT'].astype('category')

In [4]:
# Performing two-way ANOVA
model = ols('ABILITY ~ C(BRAND) + C(TREATMENT) + C(BRAND):C(TREATMENT)',
            data=df).fit()
# `typ` (not `type`) is the keyword anova_lm recognizes; typ=1 matches the table shown below
result = sm.stats.anova_lm(model, typ=1)
  
# Print the result 
print(result) 

                        df        sum_sq   mean_sq    F  PR(>F)
C(BRAND)               2.0  1.282167e-01  0.064108  0.0     NaN
C(TREATMENT)           3.0  4.796917e-01  0.159897  0.0     NaN
C(BRAND):C(TREATMENT)  6.0  8.678333e-02  0.014464  0.0     NaN
Residual               0.0  2.255418e-29       inf  NaN     NaN




- Same result. So let us remove the interaction term from the model and evaluate further.

In [5]:
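# Performing two-way ANOVA without interaction term
model = ols('ABILITY ~ C(BRAND) + C(TREATMENT)',
            data=df).fit()
# `typ` (not `type`) is the keyword anova_lm recognizes; typ=1 matches the table shown below.
# For a balanced design like this one, Type I and Type II sums of squares coincide.
result = sm.stats.anova_lm(model, typ=1)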
  
# Print the result 
print(result) 

               df    sum_sq   mean_sq          F    PR(>F)
C(BRAND)      2.0  0.128217  0.064108   4.432303  0.065765
C(TREATMENT)  3.0  0.479692  0.159897  11.054926  0.007399
Residual      6.0  0.086783  0.014464        NaN       NaN
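
The additive table can be reproduced by hand, which shows where each entry comes from. This is a minimal sketch assuming the standard two-factor decomposition with one observation per cell; the array layout and variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Rows = brands 1-3, columns = treatments 1-4 (same values as the DataFrame above)
x = np.array([
    [0.97, 0.48, 0.48, 0.46],
    [0.77, 0.14, 0.22, 0.25],
    [0.67, 0.39, 0.57, 0.19],
])
n_brand, n_treat = x.shape
grand_mean = x.mean()

# Sums of squares for each factor, the total, and the error by subtraction
ss_brand = n_treat * ((x.mean(axis=1) - grand_mean) ** 2).sum()
ss_treat = n_brand * ((x.mean(axis=0) - grand_mean) ** 2).sum()
ss_total = ((x - grand_mean) ** 2).sum()
ss_error = ss_total - ss_brand - ss_treat

df_brand, df_treat = n_brand - 1, n_treat - 1
df_error = df_brand * df_treat
ms_error = ss_error / df_error

f_brand = (ss_brand / df_brand) / ms_error
f_treat = (ss_treat / df_treat) / ms_error
p_brand = stats.f.sf(f_brand, df_brand, df_error)  # upper-tail area
p_treat = stats.f.sf(f_treat, df_treat, df_error)

print(f"BRAND:     F = {f_brand:.6f}, p = {p_brand:.6f}")
print(f"TREATMENT: F = {f_treat:.6f}, p = {p_treat:.6f}")
```

The F and P values should match the `anova_lm` table above to the displayed precision.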


CONCLUSION

- BRAND: p = 0.065765

p > 0.05, so we fail to reject the null hypothesis; there is no evidence for $H_a$.


True average color change does not appear to depend on the brand of pen.

- TREATMENT: p = 0.007399

p < 0.05, so we reject the null hypothesis in favor of the alternative hypothesis.

This means true average color change varies with washing treatment.

In [10]:
import pandas as pd

path = "C:/Users/sidiq/OneDrive/Desktop/GIT HUB PROJECTS/STATISTICAL TESTING/STATISTICAL-TESTING/DATA/crop.data.csv"

data = pd.read_csv(path)

data.head()

Unnamed: 0,density,block,fertilizer,yield
0,1,1,1,177.228692
1,2,2,1,177.550041
2,1,3,1,176.408462
3,2,4,1,177.703625
4,1,1,1,177.125486


In [3]:
data["yield"].dtype

dtype('float64')

In [4]:
data.isnull().sum()

density       0
block         0
fertilizer    0
yield         0
dtype: int64

In [5]:
data.duplicated().sum()

0

In [6]:
data.fertilizer.nunique()

3

In [7]:
data.density.nunique()

2

In [8]:
# Alias the DataFrame as df (no copy is made, so changes to df also affect data)
df = data

In [9]:
df.head()

Unnamed: 0,density,block,fertilizer,yield
0,1,1,1,177.228692
1,2,2,1,177.550041
2,1,3,1,176.408462
3,2,4,1,177.703625
4,1,1,1,177.125486


In [11]:
import statsmodels.api as sm 
from statsmodels.formula.api import ols 

# Performing two-way ANOVA 
model = ols('yield ~ C(fertilizer) + C(density) ', 
            data=df).fit() 
result = sm.stats.anova_lm(model, typ=2)  # the keyword is `typ`, not `type`
  
# Print the result 
print(result)

PatsyError: Error evaluating factor: SyntaxError: unexpected EOF while parsing (<string>, line 1)
    yield ~ C(fertilizer) + C(density)
    ^^^^^

In [10]:
# import ols
from statsmodels.formula.api import ols

In [11]:
df.columns

Index(['density', 'block', 'fertilizer', 'yield'], dtype='object')

In [12]:
df.drop('block', axis=1, inplace=True)
df.columns

Index(['density', 'fertilizer', 'yield'], dtype='object')

In [21]:
# Model 
model = ols('yield ~ C(fertilizer) + C(density) + C(fertilizer):C(density)', data=df).fit()


PatsyError: Error evaluating factor: SyntaxError: unexpected EOF while parsing (<string>, line 1)
    yield ~ C(fertilizer) + C(density) + C(fertilizer):C(density)
    ^^^^^

In [20]:
print(df['yield'].dtype)
print(df['fertilizer'].dtype)
print(df['density'].dtype)


float64
int64
int64


In [22]:
simple_model = ols('yield ~ C(fertilizer)', data=df).fit()
print(simple_model.summary())

PatsyError: Error evaluating factor: SyntaxError: unexpected EOF while parsing (<string>, line 1)
    yield ~ C(fertilizer)
    ^^^^^

In [23]:
import statsmodels
import patsy

print(statsmodels.__version__)
print(patsy.__version__)


0.13.5
0.5.3


In [25]:
import statsmodels.api as sm

# Assuming 'fertilizer' and 'density' should be treated as categorical
df['fertilizer'] = df['fertilizer'].astype('category')
df['density'] = df['density'].astype('category')

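# Create dummy variables (dtype=float keeps the design matrix numeric;
# recent pandas versions return boolean columns by default)
fertilizer_dummies = pd.get_dummies(df['fertilizer'], prefix='fertilizer', drop_first=True, dtype=float)
density_dummies = pd.get_dummies(df['density'], prefix='density', drop_first=True, dtype=float)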

# Prepare the features and target variable
X = pd.concat([fertilizer_dummies, density_dummies], axis=1)
y = df['yield']

# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Print the summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                  yield   R-squared:                       0.267
Model:                            OLS   Adj. R-squared:                  0.243
Method:                 Least Squares   F-statistic:                     11.15
Date:                Mon, 08 Apr 2024   Prob (F-statistic):           2.60e-06
Time:                        05:17:41   Log-Likelihood:                -81.595
No. Observations:                  96   AIC:                             171.2
Df Residuals:                      92   BIC:                             181.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const          176.5261      0.118   1495.490   