# ANOVA using Python

ANOVA (analysis of variance) is a statistical test that helps determine whether the means of two or more data samples differ significantly. Consider one scenario: we have collected several samples independently from the same dataset for cross-validation, and we wish to know whether their means differ significantly. In another scenario, we have developed three different machine learning models, obtained a set of results from each, and wish to know whether their performance differs significantly. There are many such scenarios in practical applications where ANOVA is useful as part of data analytics.

To read more about it, please refer to [this article](https://analyticsindiamag.com/a-complete-python-guide-to-anova/).

## **Comparing Means using ANOVA**

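Install the required packages, restart the kernel so that the newly installed versions are picked up, and then import the necessary libraries to set up the environment.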

In [None]:
!python -m pip install pip --upgrade --user -q
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels --user -q

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
# import libraries
import numpy as np
import pandas as pd
import scipy.stats  # import the submodule explicitly; scipy.stats.f is used later
import statsmodels.api as sm
from statsmodels.formula.api import ols
from matplotlib import pyplot as plt
import seaborn as sns

np.random.seed(1)

Generate some normally distributed synthetic data using NumPy’s random module. ANOVA assumes that all groups share a common variance, so we use the same standard deviation (3) for every method and vary only the mean.
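
The equal-variance requirement can also be checked on data rather than assumed. The following is a minimal sketch, assuming Levene's test (`scipy.stats.levene`) is an acceptable check for this purpose; the group names, seed and sample sizes are illustrative and not part of the original example.

```python
import numpy as np
from scipy import stats

# Illustrative data (not the article's samples): three groups with
# different means but the same standard deviation of 3.
rng = np.random.default_rng(0)
group_a = rng.normal(10, 3, 200)
group_b = rng.normal(11, 3, 200)
group_c = rng.normal(12, 3, 200)

# Levene's test: the null hypothesis is that all groups share one variance.
stat_equal, p_equal = stats.levene(group_a, group_b, group_c)
print(f"Equal spread:   statistic = {stat_equal:.3f}, p-value = {p_equal:.3f}")

# For contrast, a group with a much larger spread (standard deviation 30).
group_wide = rng.normal(10, 30, 200)
stat_unequal, p_unequal = stats.levene(group_a, group_wide)
print(f"Unequal spread: statistic = {stat_unequal:.3f}, p-value = {p_unequal:.3g}")
```

A very small p-value, as in the second comparison, suggests that the groups do not share a common variance and that the ANOVA result should be interpreted with caution.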

In [None]:
method_1 = np.random.normal(10,3,10)
method_2 = np.random.normal(11,3,10)
method_3 = np.random.normal(12,3,10)
method_4 = np.random.normal(13,3,10)

data = pd.DataFrame({'method_1':method_1, 
                     'method_2':method_2, 
                     'method_3':method_3,
                     'method_4':method_4})
data.head()

Before proceeding with ANOVA, we should establish a null hypothesis. ANOVA is a hypothesis test: it decides, at a chosen level of risk, whether the observed differences could plausibly be due to chance alone. Our null hypothesis (common to most ANOVA problems) can be expressed as:

    Means of all the four methods are the same.

We know very well that the true means are not the same: we set 10, 11, 12 and 13 as the means of the four methods while generating the data. From a statistical point of view, however, the question is whether the samples provide enough evidence of that difference at a chosen level of significance. We set the most common level of significance, 0.05 (i.e. a 5% risk of rejecting the null hypothesis when it is actually true).

In other words, a level of significance of zero would correspond to a purely mathematical decision in which no error is permitted. In our case, we could then reject the null hypothesis without any analysis, because we know that the means differ. With real data, however, many factors introduce random variation, so we must allow for some chance deviation among the samples before declaring a difference significant.
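
The meaning of the 0.05 level of significance can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not part of the original notebook: it repeatedly draws four samples from the same distribution (so the null hypothesis is true by construction), runs a one-way ANOVA with `scipy.stats.f_oneway`, and counts how often the null hypothesis is wrongly rejected. The seed and the number of trials are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability
alpha = 0.05
n_trials = 2000
false_rejections = 0

for _ in range(n_trials):
    # All four groups share mean 10 and standard deviation 3,
    # so any rejection here is a false rejection (Type I error).
    groups = [rng.normal(10, 3, 10) for _ in range(4)]
    _, p_value = stats.f_oneway(*groups)
    if p_value <= alpha:
        false_rejections += 1

type_1_rate = false_rejections / n_trials
print(f"Fraction of false rejections: {type_1_rate:.3f}")
```

The printed fraction should land close to the chosen level of significance, which is what "5% risk" means in practice.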

ANOVA uses the F-test. The F-statistic is the ratio of the variance between the group means to the variance within the groups. If the probability (p-value) of the F-statistic is less than or equal to the level of significance (0.05 here), we reject the null hypothesis. Otherwise, we fail to reject it (often loosely described as accepting it).
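
The F-statistic and its p-value can be computed by hand as the ratio of the between-group mean square to the within-group mean square. The following sketch uses its own illustrative samples (the seed and variable names are assumptions, not the notebook's data) and cross-checks the manual result against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Illustrative samples: four groups of 10 with a common standard deviation.
rng = np.random.default_rng(1)
groups = [rng.normal(mean, 3, 10) for mean in (10, 11, 12, 13)]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k

# F = mean square between / mean square within.
f_stat = (ss_between / df_between) / (ss_within / df_within)
# p-value: upper-tail probability of the F distribution.
p_value = stats.f.sf(f_stat, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA.
f_check, p_check = stats.f_oneway(*groups)
print(f"manual: F = {f_stat:.4f}, p = {p_value:.4f}")
print(f"scipy:  F = {f_check:.4f}, p = {p_check:.4f}")
```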

Reshape the data frame into long format, with a single column of values, using Pandas’ melt method.
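
As a small illustration of what melting does, the sketch below reshapes a tiny two-column frame. The toy values are made up for this example, and it uses the `var_name` and `value_name` parameters of `pd.melt` to name the output columns directly, as an alternative to renaming the columns afterwards.

```python
import pandas as pd

# Toy wide-format frame: one column per method (values are made up).
wide_df = pd.DataFrame({"method_1": [1.0, 2.0],
                        "method_2": [3.0, 4.0]})

# Long format: one row per observation, with the source column as a label.
long_df = pd.melt(wide_df, var_name="treatment", value_name="value")
print(long_df)
```

Each original column name becomes a label in the `treatment` column, and all observations are stacked into the single `value` column, which is the layout the formula interface expects.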

In [None]:
df = pd.melt(data,  
             value_vars=['method_1', 'method_2', 'method_3', 'method_4'])

df.columns = [ 'treatment', 'value']
df.sample(10)

Develop an Ordinary Least Squares model with the melted data.

In [None]:
model = ols('value~C(treatment)', data=df).fit()
model.summary()

We can already draw a conclusion at this step. The p-value, reported as Prob (F-statistic) in the summary, is 0.135, which is greater than 0.05. Hence, we fail to reject the null hypothesis. In other words, the data do not show a significant difference among the means of the four methods. However, an ANOVA table presents the result more clearly. Obtain the ANOVA table with the following code.

In [None]:
anova = sm.stats.anova_lm(model, typ=1)
anova

Note that the terms groups and methods are used interchangeably in this example.

We reached this conclusion based on the p-value. We can arrive at the same conclusion from the F-statistic by comparing it with its critical value. In the following code, 3 and 36 are the degrees of freedom between groups (4 - 1) and within groups (40 - 4), and 0.95 is one minus the level of significance.

In [None]:
scipy.stats.f(3,36).ppf(0.95)

If the observed F-statistic is greater than or equal to its critical value, we reject the null hypothesis; otherwise, we fail to reject it. Here the observed value 1.975314 is less than the critical value 2.86626. Therefore, we fail to reject the null hypothesis.
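
The p-value rule and the critical-value rule always lead to the same decision, because both are read from the same F distribution. The sketch below illustrates this with the observed F-statistic quoted in the text (1.975314) and the degrees of freedom 3 and 36; the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05
df_between, df_within = 3, 36   # k - 1 and N - k for 4 groups of 10
f_observed = 1.975314           # observed F-statistic quoted in the text

# Critical value: the point that leaves a probability of alpha in the upper tail.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
# p-value: the upper-tail probability beyond the observed F-statistic.
p_value = stats.f.sf(f_observed, df_between, df_within)

reject_by_f = f_observed >= f_critical
reject_by_p = p_value <= alpha

print(f"critical F = {f_critical:.5f}, p-value = {p_value:.3f}")
print(f"reject by F-statistic: {reject_by_f}, reject by p-value: {reject_by_p}")
```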

We can visualize the actual data to get a better understanding.

In [None]:
sns.set_style('darkgrid')
data.plot()
plt.xlabel('Data points')
plt.ylabel('Data value')
plt.show()

We can see a great deal of overlap among the different data groups. This overlap is exactly why we cannot draw conclusions by inspection alone. Statistical tools help us make sound business decisions in such difficult scenarios.

How do the means vary among the different groups? Let’s visualize that too.

In [None]:
data.mean(axis=0).plot(kind='bar')
plt.xlabel('Methods / Groups')
plt.ylabel('Mean value')
plt.show()

## **Limitation of ANOVA**

ANOVA has an important limitation when we reject the null hypothesis. Let’s study it with a code example. Increase the mean value of method_4 from 13 to 15.

In [None]:
# Alter the mean value of method_4
method_1 = np.random.normal(10,3,10)
method_2 = np.random.normal(11,3,10)
method_3 = np.random.normal(12,3,10)
method_4 = np.random.normal(15,3,10)

data = pd.DataFrame({'method_1':method_1, 
                     'method_2':method_2, 
                     'method_3':method_3,
                     'method_4':method_4})
data.head()

Melt the data into a single column of values.

In [None]:
df = pd.melt(data,  
             value_vars=['method_1', 'method_2', 'method_3', 'method_4'])

df.columns = [ 'treatment', 'value']
df.sample(10)

Develop the Ordinary Least Squares model.

In [None]:
model = ols('value~C(treatment)', data=df).fit()
model.summary()

Obtain the ANOVA table.

In [None]:
anova = sm.stats.anova_lm(model, typ=1)
anova

Since the p-value is less than the level of significance, 0.05, we reject the null hypothesis. This means that at least one mean differs from the others. However, ANOVA alone cannot identify which method or methods differ. This is where ANOVA needs a follow-up method to shed light on its decision.

This issue can be tackled with the help of Post Hoc Analysis.

In [None]:
sns.set_style('darkgrid')
data.plot()
plt.xlabel('Data points')
plt.ylabel('Data value')
plt.show()

In [None]:
data.mean(axis=0).plot(kind='bar')
plt.xlabel('Methods / Groups')
plt.ylabel('Mean value')
plt.show()

## **Post Hoc Analysis**

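Post hoc analysis refers to the follow-up comparisons carried out after an ANOVA test rejects the null hypothesis. A widely used post hoc method is the Tukey test (Tukey’s HSD, called the Tukey-Kramer method when group sizes are unequal), which is a multiple-comparison test. Whenever we reject the null hypothesis in an ANOVA test, we use it to explore individual comparisons among the mean values of the different groups (methods).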

Import the necessary module from the statsmodels library.

In [None]:
from statsmodels.stats.multicomp import MultiComparison

comparison = MultiComparison(df['value'], df['treatment'])
tukey = comparison.tukeyhsd(alpha=0.05)
tukey.summary()

This method compares the means of every possible pair of groups while controlling the overall (family-wise) error rate across all comparisons. It yields an individual decision with an adjusted p-value for each pair.

Here, the null hypothesis is not rejected (no significant difference in means) for the pairs:

method_1 and method_2  

method_1 and method_3

method_2 and method_3 

On the other hand, the null hypothesis is rejected (means are significantly different) for the pairs:

method_1 and method_4

method_2 and method_4

method_3 and method_4

Hence, we can conclude that methods 1, 2 and 3 do not differ significantly in their means, while method 4 differs from them all.
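
The pairs that differ can also be listed programmatically rather than read from the summary table. The sketch below swaps in `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later) in place of the statsmodels `MultiComparison` class used above, and runs on deliberately well-separated illustrative data; the seed, the means and the variable names are assumptions made for this example only.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Illustrative data: two groups with the same mean and one clearly shifted group.
rng = np.random.default_rng(0)
samples = {
    "method_1": rng.normal(10, 1, 20),
    "method_2": rng.normal(10, 1, 20),
    "method_3": rng.normal(30, 1, 20),   # deliberately far from the others
}
names = list(samples)

# Tukey's HSD test on all groups; result.pvalue is a matrix indexed by group pair.
result = stats.tukey_hsd(*samples.values())

alpha = 0.05
differing_pairs = [
    (names[i], names[j])
    for i, j in combinations(range(len(names)), 2)
    if result.pvalue[i, j] <= alpha
]
print("Pairs with significantly different means:", differing_pairs)
```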

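Note: We seed NumPy’s random module only once, at the start of the notebook (np.random.seed(1)). The values and results in these examples are therefore reproducible only when the cells are run once, in order, from a freshly started kernel; re-running a data-generation cell draws new values.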
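
One way to make each data-generation step reproducible on its own is to give it a dedicated seeded generator. The following is a minimal sketch, assuming NumPy's `np.random.default_rng`; `make_samples` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def make_samples(seed):
    """Draw four illustrative samples from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return [rng.normal(mean, 3, 10) for mean in (10, 11, 12, 15)]

# Two calls with the same seed return exactly the same values.
first = make_samples(1)
second = make_samples(1)
identical = all(np.array_equal(a, b) for a, b in zip(first, second))
print(identical)  # True
```

Because the generator is created inside the function, re-running the cell does not advance any shared random state.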