In [5]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import scipy.stats
import statsmodels.api as sm

# <font face="gotham" color="purple"> ANOVA </font>

If you have studied statistics, you probably already know the **Analysis of Variance** (ANOVA) and can skip this section; if not, read on.

Simply put, ANOVA is a technique for comparing the means of multiple $(\geq 3)$ populations. The name comes from the way the calculation is performed: it compares the variation between the samples with the variation within them.

The usual hypotheses of ANOVA for $k$ populations are
$$
H_0:\quad \mu_1=\mu_2=\mu_3=\cdots=\mu_k\\
H_1:\quad \text{At least two means differ}
$$
To construct the $F$-statistic we need two more statistics. The first is the **Mean Square for Treatments** (MST), where $k$ is the number of samples, $\bar{\bar{x}}$ is the grand mean, $\bar{x}_i$ is the mean of sample $i$, and $n_i$ is the number of observations in sample $i$:
$$
MST=\frac{SST}{k-1},\qquad\text{where } SST=\sum_{i=1}^kn_i(\bar{x}_i-\bar{\bar{x}})^2
$$
The second is the **Mean Square for Error** (MSE), where $s_i^2$ is the sample variance of sample $i$ and $n=n_1+n_2+\cdots+n_k$ is the total number of observations:
$$
MSE=\frac{SSE}{n-k},\qquad\text{where } SSE =(n_1-1)s_1^2+(n_2-1)s_2^2+\cdots+(n_k-1)s_k^2
$$
Putting the two together gives the $F$-statistic
$$
F=\frac{MST}{MSE}
$$
If the $F$-statistic is larger than the critical value of the $F$ distribution with $(k-1,\ n-k)$ degrees of freedom, we reject the null hypothesis.
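
As a rough illustration of the formulas above, the sketch below computes $SST$, $SSE$, $MST$, $MSE$ and $F$ by hand and compares the result with `scipy.stats.f_oneway`. The three samples are made-up numbers chosen only for this example, and the $5\%$ significance level is likewise an assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (made-up numbers, not the notebook's data)
groups = [
    np.array([160.0, 162.0, 158.0, 165.0]),
    np.array([178.0, 182.0, 180.0, 184.0]),
    np.array([174.0, 176.0, 171.0, 175.0]),
]

k = len(groups)                          # number of samples
n = sum(len(g) for g in groups)          # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares for treatments and for error, as defined above
sst = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sse = sum((len(g) - 1) * g.var(ddof=1) for g in groups)

mst = sst / (k - 1)
mse = sse / (n - k)
f_stat = mst / mse

# Critical value at an assumed 5% level, and the p-value of the statistic
f_crit = stats.f.ppf(0.95, dfn=k - 1, dfd=n - k)
p_value = stats.f.sf(f_stat, dfn=k - 1, dfd=n - k)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"F = {f_stat:.3f}, critical value = {f_crit:.3f}, p-value = {p_value:.3g}")
print(f"scipy.stats.f_oneway: F = {f_scipy:.3f}, p-value = {p_scipy:.3g}")
```

The hand-computed statistic and the SciPy result should agree up to floating-point error.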

# <font face="gotham" color="purple"> Dummy Variable </font>

Here is a dataset with dummy variables, each of which takes the value $1$ or $0$.

In [6]:
df = pd.read_excel('Basic_Econometrics_practice_data.xlsx', sheet_name = 'Hight_ANOVA')

In [7]:
df

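| Unnamed: 0 | Height | NL_dummpy | DM_dummpy | FI_dummy |
|---|---|---|---|---|
| 0 | 161.783130 | 0 | 0 | 0 |
| 1 | 145.329934 | 0 | 0 | 0 |
| 2 | 174.569597 | 0 | 0 | 0 |
| 3 | 160.003162 | 0 | 0 | 0 |
| 4 | 162.242898 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 83 | 180.962477 | 0 | 0 | 1 |
| 84 | 172.791579 | 0 | 0 | 1 |
| 85 | 174.951880 | 0 | 0 | 1 |
| 86 | 176.059861 | 0 | 0 | 1 |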


The dataset has five columns. $Height$ holds a sample of $88$ male heights, and the dummy columns encode a qualitative feature of each observation, here nationality.

There are four countries in the sample: Japan, the Netherlands, Denmark and Finland. However, the dataset contains only $3$ dummies. This avoids _perfect multicollinearity_: with four dummies, they would sum to $1$ for every observation, which makes the constant term a perfect linear combination of the dummy variables.
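
The multicollinearity problem can be checked numerically. The sketch below is only an illustration: the `country` labels are hypothetical, and it compares the rank of a design matrix containing a constant plus all four dummies with one that drops the Japanese dummy.

```python
import numpy as np
import pandas as pd

# Hypothetical nationality labels (not read from the Excel file)
country = pd.Series(["JP", "NL", "DK", "FI", "JP", "NL", "DK", "FI"])

# One dummy column per country
all_dummies = pd.get_dummies(country, dtype=float)
const = np.ones(len(country))

# Constant + four dummies: the dummies sum to the constant column
X_trap = np.column_stack([const, all_dummies.to_numpy()])

# Constant + three dummies: Japan is left out as the reference category
X_ok = np.column_stack([const, all_dummies.drop(columns="JP").to_numpy()])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(f"with 4 dummies: {X_trap.shape[1]} columns, rank {rank_trap}")
print(f"with 3 dummies: {X_ok.shape[1]} columns, rank {rank_ok}")
```

A design matrix whose rank is lower than its number of columns has no unique OLS solution, which is why one category is dropped.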

If we use a model with only dummy variables as regressors, we are essentially estimating an ANOVA model, i.e.
$$
Y_{i}=\beta_{1}+\beta_{2} D_{2 i}+\beta_{3} D_{3 i}+\beta_{4} D_{4 i}+u_{i}
$$

where $Y_i =$ the male height, $D_{2i}=1$ if the male is from the Netherlands, $D_{3i}=1$ if the male is from Denmark and $D_{4i}=1$ if the male is from Finland. Japan has no dummy variable, so it serves as the reference category; this will become clearer later.
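
To see why a dummy-only regression amounts to an ANOVA model, the following sketch compares the overall $F$-statistic of such a regression with the one-way ANOVA $F$-statistic. It uses simulated heights (the group means, spread and group sizes are assumptions for illustration) and plain `np.linalg.lstsq` in place of `sm.OLS`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data: 4 groups of 10; group 0 plays the role of the reference
labels = np.repeat([0, 1, 2, 3], 10)
assumed_means = np.array([163.0, 181.0, 175.0, 176.0])
y = assumed_means[labels] + rng.normal(0, 6, size=labels.size)

# Design matrix: constant plus dummies for groups 1, 2 and 3
X = np.column_stack(
    [np.ones(labels.size)] + [(labels == j).astype(float) for j in (1, 2, 3)]
)
b, *_ = np.linalg.lstsq(X, y, rcond=None)

n, p = X.shape
fitted = X @ b
ess = ((fitted - y.mean()) ** 2).sum()   # explained sum of squares
rss = ((y - fitted) ** 2).sum()          # residual sum of squares

# Overall F-statistic of the regression
f_reg = (ess / (p - 1)) / (rss / (n - p))

# One-way ANOVA on the same groups
f_anova, _ = stats.f_oneway(*[y[labels == j] for j in range(4)])

print(f"regression F = {f_reg:.4f}, ANOVA F = {f_anova:.4f}")
```

The explained and residual sums of squares of the regression play the roles of $SST$ and $SSE$, so the two statistics should coincide.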

Now we run the regression and print the result. How do we interpret the estimated coefficients?

In [13]:
X = df[['NL_dummpy', 'DM_dummpy', 'FI_dummy']]
Y = df['Height']

X = sm.add_constant(X) # adding a constant

model = sm.OLS(Y, X).fit() 
print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:                 Height   R-squared:                       0.453
Model:                            OLS   Adj. R-squared:                  0.434
Method:                 Least Squares   F-statistic:                     23.20
Date:                Tue, 24 Aug 2021   Prob (F-statistic):           4.93e-11
Time:                        10:56:32   Log-Likelihood:                -300.08
No. Observations:                  88   AIC:                             608.2
Df Residuals:                      84   BIC:                             618.1
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        163.0300      1.416    115.096      0.0

First, all the $p$-values are very small, so every coefficient is statistically significant. Next we examine the coefficients one by one.

The estimated constant $b_1 = 163.03$ is the mean height of Japanese males. The mean height of Dutch males is $b_1+b_2 = 163.03+17.71=180.74$, that of Danish males is $b_1+b_3=163.03+12.21=175.24$, and that of Finnish males is $b_1+b_4=163.03+12.85=175.88$. Each dummy coefficient is therefore the difference between that country's mean height and the Japanese mean.
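
This interpretation can be verified on simulated data. The sketch below is an illustration under stated assumptions: the heights are randomly generated rather than taken from the Excel file, and `np.linalg.lstsq` stands in for `sm.OLS`. It checks that the intercept equals the reference group's sample mean and that the intercept plus each dummy coefficient equals the corresponding group's sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 4 groups of 22 observations; group 0 is the reference
labels = np.repeat([0, 1, 2, 3], 22)
assumed_means = np.array([163.0, 181.0, 175.0, 176.0])
height = assumed_means[labels] + rng.normal(0, 6, size=labels.size)

# Constant plus three dummies (no dummy for the reference group)
X = np.column_stack(
    [np.ones(labels.size)] + [(labels == j).astype(float) for j in (1, 2, 3)]
)
b, *_ = np.linalg.lstsq(X, height, rcond=None)

# Sample mean of each group, computed directly
group_means = np.array([height[labels == j].mean() for j in range(4)])

# Means implied by the regression: b1, b1 + b2, b1 + b3, b1 + b4
implied_means = np.concatenate([[b[0]], b[0] + b[1:]])

for j in range(4):
    print(f"group {j}: sample mean = {group_means[j]:.3f}, "
          f"regression-implied mean = {implied_means[j]:.3f}")
```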