# Lab | Inferential statistics - ANOVA

## Part 1
- In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. 
- Here are the steps that you would need to work on:
  - Null hypothesis
  - Alternate hypothesis
  - Level of significance
  - Test statistic
  - P-value
  - F table

### Context
Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing the power (in watts) of the plasma beam. Data was collected and provided to you so that you can check statistically whether changing the power of the plasma beam has any effect on the etching rate of the machine. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.


1. State the null hypothesis
   - Changing the power of the plasma beam does not have an effect on the etching rate of the machine.
   - H0 : mean1 = mean2 = mean3

2. State the alternate hypothesis
   - Changing the power of the plasma beam has an effect on the etching rate of the machine.
   - Ha : at least one group mean differs from the others (mean1, mean2, and mean3 are not all equal)

3. What is the significance level
   - alpha = 0.05
   
4. What are the degrees of freedom of model, error terms, and total DoF
   
   - degrees of freedom of model = If there are m groups being compared, then there are m−1 degrees of freedom associated with the factor of interest.
   - error terms = If there are n total data points collected and m groups being compared, then there are n−m error degrees of freedom. The residual sum of squares divided by this value gives the mean square of the residuals.
   - total DoF = If there are n total data points collected, then there are n−1 total degrees of freedom, which is the sum of the model and error degrees of freedom.
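
As a quick check of the formulas above, the degrees of freedom for this lab can be worked out in a few lines. This is a minimal sketch that assumes the design seen in the data (three power levels with five measurements each); the variable names are illustrative only.

```python
# Degrees of freedom for a one-way ANOVA.
# Assumed design (from this lab's data): 3 power levels, 5 runs each.
group_sizes = {"160 W": 5, "180 W": 5, "200 W": 5}

m = len(group_sizes)           # number of groups being compared
n = sum(group_sizes.values())  # total number of data points

df_model = m - 1  # factor (between-group) degrees of freedom
df_error = n - m  # residual (within-group) degrees of freedom
df_total = n - 1  # total degrees of freedom

print(df_model, df_error, df_total)
```

The model and error degrees of freedom always add up to the total, which is a handy sanity check when reading an ANOVA table.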

## Part 2
- In this section, use Python to conduct ANOVA.
- What conclusions can you draw from the experiment and why?

#### Import Libraries

In [48]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

#### Gather Data

In [49]:
df = pd.read_excel('files_for_lab/anova_lab_data.xlsx', sheet_name='data_collected')
df

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01
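
The table above shows only the first ten rows. If the Excel file is not at hand, the same 15 measurements can be typed in directly from the grouped output that appears later in this notebook. The sketch below is one possible way to do that; `df_inline` and `rates` are hypothetical names, the column names already use the renamed form, and the row order is grouped by power, so it may not match the order in the sheet.

```python
import pandas as pd

# Etching rates per power level, copied from the grouped output in this notebook.
rates = {
    "160 W": [5.43, 5.71, 6.22, 6.01, 5.59],
    "180 W": [6.24, 6.71, 5.98, 5.66, 6.6],
    "200 W": [8.79, 9.2, 7.9, 8.15, 7.55],
}

# Long format: one row per measurement, as expected by ols() and groupby().
df_inline = pd.DataFrame(
    [(power, rate) for power, values in rates.items() for rate in values],
    columns=["power", "etching_rate"],
)

print(df_inline.shape)
print(df_inline.groupby("power")["etching_rate"].mean())
```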


In [60]:
# rename columns
df.columns = ['power', 'etching_rate']
df

Unnamed: 0,power,etching_rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


In [51]:
df.describe()

Unnamed: 0,etching_rate
count,15.0
mean,6.782667
std,1.228643
min,5.43
25%,5.845
50%,6.24
75%,7.725
max,9.2


In [61]:
# average etching rate per power
df.groupby('power')['etching_rate'].mean()

power
160 W    5.792
180 W    6.238
200 W    8.318
Name: etching_rate, dtype: float64

In [62]:
# ANOVA results
model = ols('etching_rate ~ C(power)',data=df).fit()
sm.stats.anova_lm(model)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
C(power),2.0,18.176653,9.088327,36.878955,8e-06
Residual,12.0,2.95724,0.246437,,


In [66]:
# another way to get f-statistic and p-value
df_grouped = df.groupby('power')['etching_rate'].apply(list)
df_grouped

power
160 W    [5.43, 5.71, 6.22, 6.01, 5.59]
180 W     [6.24, 6.71, 5.98, 5.66, 6.6]
200 W      [8.79, 9.2, 7.9, 8.15, 7.55]
Name: etching_rate, dtype: object

In [68]:
anova_grouped = stats.f_oneway(*df_grouped)
anova_grouped

F_onewayResult(statistic=36.87895470100505, pvalue=7.506584272358903e-06)
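
Both approaches report the same F-statistic. To see where that number comes from, the sums of squares can be computed by hand. The following is a minimal sketch using only the standard library and the values listed in the grouped output above; the variable names are illustrative, and the results should match the `sum_sq`, `mean_sq`, and `F` columns of the ANOVA table up to rounding.

```python
from statistics import mean

# Etching rates per power level, copied from the grouped output above.
groups = {
    "160 W": [5.43, 5.71, 6.22, 6.01, 5.59],
    "180 W": [6.24, 6.71, 5.98, 5.66, 6.6],
    "200 W": [8.79, 9.2, 7.9, 8.15, 7.55],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
# Within-group (residual) sum of squares: spread of values around their own group mean.
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(round(ss_between, 4), round(ss_within, 4), round(f_stat, 2))
```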

In [69]:
# critical value
crit_value = stats.f.ppf(q = 1-.05, dfn= 2, dfd = 12)
crit_value.round(2)

3.89
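
The critical value and the p-value are two views of the same F distribution, so the two decision rules should agree. A small sketch of that comparison, assuming the F-statistic from the ANOVA table and the degrees of freedom 2 and 12 (the variable names are illustrative):

```python
from scipy import stats

alpha = 0.05
f_stat = 36.878955  # F-statistic taken from the ANOVA table above
dfn, dfd = 2, 12    # model and residual degrees of freedom

# Critical value: the F value that cuts off the upper alpha tail.
crit_value = stats.f.ppf(1 - alpha, dfn, dfd)
# p-value: the upper-tail probability beyond the observed F-statistic.
p_value = stats.f.sf(f_stat, dfn, dfd)

reject_by_critical_value = f_stat > crit_value
reject_by_p_value = p_value < alpha

print(round(float(crit_value), 2), p_value)
print(reject_by_critical_value, reject_by_p_value)
```

Either rule can be used on its own; reporting both, as this lab does, simply makes the decision easier to cross-check against an F table.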

> - With the p-value (about 7.5e-06) much lower than 0.05 and the F-statistic (36.88) higher than the critical value (3.89), we reject the null hypothesis that the mean etching rate is the same at every power level.
> 
> - To find out which power levels differ from each other, we will conduct pairwise t-tests against the 200 W group:

In [70]:
# looking at the descriptive statistics of each power level
df.pivot(columns='power').describe()

Unnamed: 0_level_0,etching_rate,etching_rate,etching_rate
power,160 W,180 W,200 W
count,5.0,5.0,5.0
mean,5.792,6.238,8.318
std,0.319875,0.434304,0.669604
min,5.43,5.66,7.55
25%,5.59,5.98,7.9
50%,5.71,6.24,8.15
75%,6.01,6.6,8.79
max,6.22,6.71,9.2


> Above, we can see that there seems to be an increase in the average etching rate as we increase the power supplied to the beam.

In [77]:
# t-tests comparing the 200 W group against each power level
# (the 200 W row compares the group with itself, so it trivially gives t = 0 and p = 1)

reference = df[df['power'] == "200 W"]['etching_rate']

for level in df['power'].unique():
    other = df[df['power'] == level]['etching_rate']
    print(level, stats.ttest_ind(reference, other))

160 W Ttest_indResult(statistic=7.611403634613074, pvalue=6.237977344615716e-05)
180 W Ttest_indResult(statistic=5.827496614588661, pvalue=0.0003926796476049085)
200 W Ttest_indResult(statistic=0.0, pvalue=1.0)


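> Both comparisons against the 200 W group give p-values well below 0.05, so the mean etching rate at 200 W differs significantly from the means at 160 W and 180 W. This agrees with the ANOVA result (F-statistic of 36.88 against a critical value of 3.89), and we have sufficient evidence to reject the null hypothesis. Note that these t-tests do not compare 160 W with 180 W.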
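
The loop above only compares each level against 200 W. One possible extension, not part of the original analysis, is to test every pair of power levels and divide the significance level by the number of comparisons (a Bonferroni correction) to keep the overall error rate near 0.05. The sketch below assumes the values from the grouped output; Tukey's HSD would be another reasonable choice here.

```python
from itertools import combinations
from scipy import stats

# Etching rates per power level, copied from the grouped output above.
groups = {
    "160 W": [5.43, 5.71, 6.22, 6.01, 5.59],
    "180 W": [6.24, 6.71, 5.98, 5.66, 6.6],
    "200 W": [8.79, 9.2, 7.9, 8.15, 7.55],
}

alpha = 0.05
pairs = list(combinations(groups, 2))
# Bonferroni correction (an assumption of this sketch, not used in the lab).
alpha_adjusted = alpha / len(pairs)

results = {}
for a, b in pairs:
    t_stat, p_value = stats.ttest_ind(groups[a], groups[b])
    results[(a, b)] = (t_stat, p_value)
    verdict = "different" if p_value < alpha_adjusted else "no clear difference"
    print(f"{a} vs {b}: t = {t_stat:.3f}, p = {p_value:.6f} -> {verdict}")
```

With this data, both comparisons involving 200 W stay significant after the correction, while the 160 W versus 180 W comparison does not reach significance even at the uncorrected 0.05 level.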