<a href="https://colab.research.google.com/github/simodepth/Stat-Models/blob/main/One_Way_ANOVA_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Import Packages and Load the Data Frame

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import f_oneway
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import MultiComparison
from statsmodels.graphics.gofplots import qqplot
import warnings
from IPython.display import display, Math, Latex, Markdown

warnings.filterwarnings("ignore")

In [3]:
Balenciaga = pd.read_excel('/content/Balenciaga top queries.xlsx')

# Keep only the two integer columns (Clicks, Impressions) needed for the ANOVA
df = pd.DataFrame(Balenciaga, columns=['Clicks', 'Impressions'])
df

| | Clicks | Impressions |
|---|---|---|
| 0 | 2683305 | 18335608 |
| 1 | 235934 | 3259875 |
| 2 | 215741 | 1033109 |
| 3 | 102150 | 1163610 |
| 4 | 99343 | 371258 |
| ... | ... | ... |
| 995 | 408 | 3555 |
| 996 | 407 | 3732 |
| 997 | 407 | 3615 |
| 998 | 407 | 2344 |
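
Before fitting a model, it can help to confirm that the selected columns really are integer-typed and free of missing values. The following is a minimal sketch on a made-up miniature of the spreadsheet; the `Top queries` column name, the query labels, and all figures are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the spreadsheet; names and figures are made up.
raw = pd.DataFrame(
    {
        "Top queries": ["query a", "query b", "query c"],
        "Clicks": [120, 45, 30],
        "Impressions": [3400, 900, 1250],
    }
)

# Select only the two numeric columns used in the analysis
numeric = raw[["Clicks", "Impressions"]]

# Check that every selected column is integer-typed and has no missing values
is_integer = numeric.dtypes.map(pd.api.types.is_integer_dtype)
missing = int(numeric.isna().sum().sum())
print(f"all integer: {is_integer.all()}, missing values: {missing}")
```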


#ANOVA Analysis Setup


---

###**Research Objective**
We want to investigate whether the number of Clicks depends on the number of Impressions generated by a batch of URLs.

###**Define Variables**
Let's define the variables under the assumption that Clicks could be impacted by the Impressions quota.

Dependent = Clicks

Independent = Impressions
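
One-way ANOVA compares the mean of the dependent variable across a small number of groups. As a minimal sketch on synthetic data, one possible way to form such groups from a count like Impressions is to bin it into quantile bands and pass the per-band Clicks to `f_oneway`. The three bands, the `ImpressionBand` column, and the generated values are assumptions for illustration, not part of this notebook's analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic stand-in for the notebook's data (hypothetical values)
rng = np.random.default_rng(0)
impressions = rng.integers(500, 20_000, size=300)
clicks = (impressions * 0.1 + rng.normal(0, 50, size=300)).round().clip(min=0)
demo = pd.DataFrame({"Clicks": clicks, "Impressions": impressions})

# Bin Impressions into three equally sized bands (an illustrative choice)
demo["ImpressionBand"] = pd.qcut(
    demo["Impressions"], q=3, labels=["low", "mid", "high"]
)

# Collect the Clicks of each band and run the one-way ANOVA F-test
groups = [
    g["Clicks"].to_numpy()
    for _, g in demo.groupby("ImpressionBand", observed=True)
]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

With only three bands, each group mean is estimated from many rows, which leaves ample residual degrees of freedom for the F-test.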

In [4]:
model = ols("Clicks ~ C(Impressions)", df).fit()
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Clicks | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 42460.0 |
| Date: | Wed, 28 Sep 2022 | Prob (F-statistic): | 1.14e-26 |
| Time: | 16:45:30 | Log-Likelihood: | -5243.3 |
| No. Observations: | 1000 | AIC: | 12460.0 |
| Df Residuals: | 12 | BIC: | 17310.0 |
| Df Model: | 987 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 447.0000 | 418.139 | 1.069 | 0.306 | -464.047 | 1358.047 |
| C(Impressions)[T.609] | 19.0000 | 591.338 | 0.032 | 0.975 | -1269.415 | 1307.415 |
| C(Impressions)[T.619] | -8.0000 | 591.338 | -0.014 | 0.989 | -1296.415 | 1280.415 |
| C(Impressions)[T.656] | -32.0000 | 591.338 | -0.054 | 0.958 | -1320.415 | 1256.415 |
| C(Impressions)[T.662] | -21.0000 | 591.338 | -0.036 | 0.972 | -1309.415 | 1267.415 |
| C(Impressions)[T.664] | -11.0000 | 591.338 | -0.019 | 0.985 | -1299.415 | 1277.415 |
| C(Impressions)[T.671] | 37.0000 | 591.338 | 0.063 | 0.951 | -1251.415 | 1325.415 |
| C(Impressions)[T.716] | 92.0000 | 591.338 | 0.156 | 0.879 | -1196.415 | 1380.415 |
| C(Impressions)[T.717] | -18.0000 | 591.338 | -0.030 | 0.976 | -1306.415 | 1270.415 |

| | | | |
|---|---|---|---|
| Omnibus: | 778.53 | Durbin-Watson: | 1.973 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1068419.191 |
| Skew: | 2.112 | Prob(JB): | 0.0 |
| Kurtosis: | 163.076 | Cond. No. | 995.0 |
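
The Omnibus and Jarque-Bera rows of the summary relate to the normality of the residuals, which is one of the assumptions behind the ANOVA F-test. As a rough sketch on synthetic groups (the group means, spreads, and sizes below are made up), normality of residuals and equality of variances could be checked with `scipy.stats.shapiro` and `scipy.stats.levene`:

```python
import numpy as np
from scipy.stats import levene, shapiro

# Three synthetic groups of Clicks (hypothetical means and spread)
rng = np.random.default_rng(2)
low = rng.normal(100, 20, 50)
mid = rng.normal(150, 20, 50)
high = rng.normal(220, 20, 50)

# In a one-way ANOVA, residuals are the deviations from each group's own mean
residuals = np.concatenate([g - g.mean() for g in (low, mid, high)])

# Shapiro-Wilk: null hypothesis is that the residuals are normally distributed
shapiro_stat, shapiro_p = shapiro(residuals)

# Levene: null hypothesis is that all groups share the same variance
levene_stat, levene_p = levene(low, mid, high)

print(f"Shapiro-Wilk p = {shapiro_p:.3f}, Levene p = {levene_p:.3f}")
```

A small p-value in either test suggests that the corresponding assumption may not hold for the data at hand.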


In [5]:
res = anova_lm(model, typ=1)
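
`anova_lm` returns a data frame with one row per model term plus a `Residual` row. The sketch below, which uses a synthetic three-group data set (the `Band` column and all values are assumptions), shows how such a table could be inspected:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic data: three groups of 40 observations each (hypothetical values)
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    {
        "Band": np.repeat(["low", "mid", "high"], 40),
        "Clicks": np.concatenate(
            [
                rng.normal(100, 20, 40),
                rng.normal(150, 20, 40),
                rng.normal(220, 20, 40),
            ]
        ),
    }
)

# Fit the one-way model and build the type-1 ANOVA table
demo_model = ols("Clicks ~ C(Band)", data=demo).fit()
table = anova_lm(demo_model, typ=1)

# Columns: df, sum_sq, mean_sq, F, PR(>F)
print(table)
print("p-value of the group effect:", table.loc["C(Band)", "PR(>F)"])
```

For a model with a single factor, the `F` value in the factor's row matches the overall F-statistic reported by `model.summary()`.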

In [6]:
def model_evaluation(
    model,
    independent_name: str = "Impressions",
    dependent_name: str = "Clicks",
    alpha: float = 0.05,  # conventional significance level
):
    # Compare the p-value of the model's overall F-test with the significance level
    p_value = model.f_pvalue
    display(
        Markdown(
            f"""
**Null hypothesis**: All means are equal.<br>
**Alternative hypothesis**: Not all means are equal.<br>
**Significance level**: α = {alpha}

The F-statistic of the model is {round(model.fvalue, 6)}. The p-value of the model is {round(p_value, 6)}."""
        )
    )
    if p_value > alpha:
        # Fail to reject the null hypothesis
        display(
            Markdown(
                f"""Since the p-value is greater than the significance level of {alpha}, the differences between the means are not statistically significant."""
            )
        )
    else:
        # Reject the null hypothesis
        display(
            Markdown(
                f"""Since the p-value is less than the significance level of {alpha}, there is enough evidence to claim that the differences between some of the means are statistically significant."""
            )
        )

In [7]:
model_evaluation(model)


**Null hypothesis**: All means are equal.<br>
**Alternative hypothesis**: Not all means are equal.<br>
**Significance level**: α = 0.05

The F-statistic of the model is 42457.317014. The p-value of the model is 0.0.

Since the p-value is less than the significance level of 0.05, there is enough evidence to claim that the differences between some of the means are statistically significant.

At a significance level of 0.05, we can reject the null hypothesis and claim that a relationship between Clicks and Impressions exists.

Hence, the **number of Clicks is impacted by the Impressions quota, according to the F-test of the regression model at a significance level of 0.05**.
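
A significant F-test only says that not all group means are equal; it does not say which groups differ. One common follow-up is a Tukey HSD post-hoc comparison, which the `MultiComparison` class imported at the top of this notebook supports. The following is a sketch on synthetic three-group data (the `Band` labels and all values are assumptions), not a result from the Balenciaga data set:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import MultiComparison

# Synthetic data: three groups of 40 observations each (hypothetical values)
rng = np.random.default_rng(3)
demo = pd.DataFrame(
    {
        "Band": np.repeat(["low", "mid", "high"], 40),
        "Clicks": np.concatenate(
            [
                rng.normal(100, 20, 40),
                rng.normal(150, 20, 40),
                rng.normal(220, 20, 40),
            ]
        ),
    }
)

# Pairwise comparison of group means with Tukey's HSD at the 0.05 level
mc = MultiComparison(demo["Clicks"], demo["Band"])
tukey = mc.tukeyhsd(alpha=0.05)

# One row per pair of groups; `reject` flags pairs whose means differ
print(tukey.summary())
```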