# MODS202 - Econometrics
## Final Project - Nov 2024
DE CARVALHO MACHADO PINHEIRO Rafaela  
SANGINETO JUCA Marina

#### ***Imports***

In [10]:
import pandas as pd
import numpy as np

In [11]:
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import t, f
from statsmodels.tsa.stattools import adfuller

### Part 1 - Cross-section Data
Using the HPRICE2.RAW dataset.

In [9]:
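columns = ["price", "crime", "nox", "rooms", "dist", "radial", "proptax",
           "stratio", "lowstat", "lprice", "lnox", "lproptax"]
# The file is whitespace-delimited and has no header row.
# sep=r'\s+' replaces delim_whitespace=True, which is deprecated in recent pandas versions.
df = pd.read_csv('HPRICE2.raw', sep=r'\s+', header=None, names=columns)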
df.head()

#### **1. State the fundamental hypothesis under which the Ordinary Least Squares (OLS) estimators are unbiased.**

The OLS estimators are unbiased under the zero conditional mean (exogeneity) assumption. This is one of the Gauss-Markov assumptions, under which the OLS estimators are not only unbiased but also have the minimum variance among all linear unbiased estimators:  


Linearity: The model is linear in the parameters.  
Exogeneity: The expected value of the error term, given the explanatory variables, is zero.  
Homoscedasticity: The variance of the error term is constant (does not depend on the explanatory variables).  
No perfect multicollinearity: The explanatory variables are not perfectly collinear.  
No autocorrelation: The error terms are uncorrelated with each other.  

Homoscedasticity and no autocorrelation matter for the minimum-variance property, not for unbiasedness.

The fundamental hypothesis for unbiasedness is therefore that the unobserved variable has a zero mean conditional on the explanatory variables. In other words, to obtain unbiased estimators in a population model such as:

$$
y = \beta_0 + \beta_1 x + u
$$

where $u$ is the error term, also known as the disturbance or unobserved variable, we need to ensure that the expected value of the error term conditional on the independent variable(s) $X$ is equal to its unconditional expected value, which is zero. Mathematically, this can be expressed as:

$$
E(u|X) = E(u) = 0
$$


#### **2. Show that under this assumption the OLS estimators are indeed unbiased.**


To show that the OLS estimators are unbiased under the Gauss-Markov assumptions, we can use the properties of expected values:

$$E[\hat{\beta}] = E[(X'X)^{-1}X'y] = E[(X'X)^{-1}X'(X\beta + u)] = \beta$$

where $\hat{\beta}$ is the OLS estimator, $X$ is the matrix of explanatory variables, $y$ is the dependent variable, and $u$ is the error term. The second equality follows from the linearity assumption, and the third equality follows from the exogeneity assumption (i.e., $E[u \mid X] = 0$).

We can rewrite the following linear model:

$$
y_i = \beta_0 + \beta_1 x_{i1} + ... + \beta_{K} x_{iK} + u_i
$$

in the matrix form, such as:

$$
y = X \beta + u
$$

with $y = (y_1, ..., y_n)'$, $x_k = (x_{1k}, ..., x_{nk})'$, $u=(u_1,...,u_n)'$, $\beta = [\beta_0, \beta_1, ..., \beta_K]'$ and $X = [\mathbf{1}, x_1, ..., x_K]$, where $\mathbf{1}$ is a column of ones for the constant. So, to derive OLS we need to find the $\beta$ that minimizes the following expression:

$$
u' u = (y - X \beta)' (y - X \beta)
$$

In order to do that, we set the derivative with respect to $\beta$ to zero, which is equivalent to imposing orthogonality between $X$ and the residuals, leading to:

$$
-2 X' (y - X \beta) = 0
$$

Rearranging, and using the fact that $(X'X)$ is invertible since there is no perfect multicollinearity:

$$
\hat{\beta} = (X' X)^{-1} X' y
$$

To be unbiased, the estimator above must satisfy:

$$
b(\beta, \hat{\beta}) = E(\hat{\beta}) - \beta = 0
$$

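The expectation of the estimator is given by:

$$
E(\hat{\beta}) = E[(X'X)^{-1} X' (X \beta + u)] = \beta + E[(X'X)^{-1} X' u]
$$

So, in order to satisfy $b(\beta, \hat{\beta}) = 0$, we need $E[(X'X)^{-1} X' u] = 0$. This follows from the fundamental hypothesis stated in the last item: by the law of iterated expectations, $E[(X'X)^{-1} X' u] = E[(X'X)^{-1} X' E(u|X)] = 0$ when $E(u|X) = 0$.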
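
The unbiasedness result can also be illustrated numerically. The following is a minimal Monte Carlo sketch on synthetic data (the coefficient values, sample size and number of replications are illustrative assumptions, not taken from the dataset): the error is drawn independently of the regressors, so $E(u|X) = 0$ holds by construction, and the average of $\hat{\beta}$ over many samples should be close to $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" parameters: intercept and two slopes (assumed values)
beta_true = np.array([1.0, 2.0, -0.5])
n_obs, n_sims = 200, 2000

estimates = np.empty((n_sims, beta_true.size))
for s in range(n_sims):
    # The error is drawn independently of X, so E(u | X) = 0 by construction
    X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, 2))])
    u = rng.normal(size=n_obs)
    y = X @ beta_true + u
    # OLS estimator (X'X)^{-1} X'y, computed with a linear solve
    estimates[s] = np.linalg.solve(X.T @ X, X.T @ y)

mean_estimate = estimates.mean(axis=0)
print("true beta:       ", beta_true)
print("mean of beta_hat:", mean_estimate.round(3))
```

Each individual estimate differs from $\beta$, but the differences average out across replications, which is what unbiasedness means.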




#### **3. Explain the sample selection bias with an example from the course.**

Sample selection bias occurs when the sample used for estimation is not representative of the population of interest. This can happen, for example, when the data is collected only from individuals who choose to participate in a study or program. An example from the course could be estimating the effect of a job training program on wages, but the data is only available for individuals who chose to participate in the program. In this case, the sample may not be representative of the entire population of eligible individuals, and the estimated effect may be biased.

Another example discussed in the course comes from World War II: a sample of RAF planes returning from war zones was used to determine which areas needed reinforcement. However, this sample excluded the planes that were shot down, so it was biased. Consequently, the conclusion that the areas with bullet holes should be reinforced was erroneous. In reality, the other areas should have been reinforced: planes hit where the holes were observed were still able to return, while planes hit elsewhere were not.
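
The mechanism can be sketched with a small simulation on synthetic data (all numbers are illustrative assumptions, not course material): when only the units whose outcome exceeds a threshold are observed, the OLS slope estimated on the selected sample no longer recovers the population slope.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # true slope is 2 (assumed value)

def slope(x, y):
    """OLS slope of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

full_slope = slope(x, y)

# Keep only the units whose outcome exceeds a threshold, in the same way
# that only the planes that came back could be inspected
kept = y > 1.0
selected_slope = slope(x[kept], y[kept])

print("slope on the full sample:    ", round(full_slope, 3))
print("slope on the selected sample:", round(selected_slope, 3))
```

Because selection depends on the outcome itself, the error term is no longer mean-independent of the regressor within the selected sample, and the slope is pulled toward zero.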

#### **4. Explain the omitted variable bias with an example from the course**

Omitted variable bias occurs when an important explanatory variable is left out of the model, and this variable is correlated with the included explanatory variables. This can lead to biased estimates of the coefficients of the included variables. An example from the course could be estimating the effect of education on income, but failing to include ability as an explanatory variable. Since ability is likely correlated with both education and income, omitting it from the model would lead to biased estimates of the effect of education on income.
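
A minimal simulation sketch of this example, on synthetic data with assumed coefficients (the variable names mirror the education/ability example, but none of the numbers come from the course): the slope of the short regression equals the slope of the long regression plus the coefficient of the omitted variable times the slope from regressing the omitted variable on the included one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

ability = rng.normal(size=n)
# Education is correlated with ability (0.5 is an assumed value)
education = 0.5 * ability + rng.normal(size=n)
beta_educ, beta_ability = 1.0, 2.0          # assumed true effects
income = beta_educ * education + beta_ability * ability + rng.normal(size=n)

def ols(y, *columns):
    """OLS coefficients of y on a constant and the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

long_fit = ols(income, education, ability)   # ability included
short_fit = ols(income, education)           # ability omitted

# Slope from regressing the omitted variable on the included one
aux_slope = ols(ability, education)[1]
implied_short_slope = long_fit[1] + long_fit[2] * aux_slope

print("education slope, ability included:", round(long_fit[1], 3))
print("education slope, ability omitted: ", round(short_fit[1], 3))
print("slope implied by the bias formula:", round(implied_short_slope, 3))
```

With these assumed values the short regression overstates the effect of education, since ability raises both education and income.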

#### **5. Explain the problem of multicollinearity. Is it a problem in this dataset?**

The problem of multicollinearity occurs when two or more explanatory variables in a regression model are highly correlated with each other. This can lead to unstable and imprecise estimates of the coefficients, as it becomes difficult to disentangle the individual effects of the correlated variables. Multicollinearity can also make it difficult to determine the statistical significance of the individual coefficients. An example from the course could be estimating the effect of both years of education and years of work experience on income, as these two variables are likely to be highly correlated.


In [3]:
# X'X is singular (determinant equal to zero) only under perfect multicollinearity
det_df = np.linalg.det(df.T @ df)

print(f"Determinant of X'X is {det_df}")

if np.isclose(det_df, 0):
    print("There is perfect multicollinearity in the data.")
else:
    print("There is no perfect multicollinearity in the data.")

From the pairwise correlations, this dataset contains some highly correlated variables, such as radial and proptax, with a correlation of 0.91. Therefore, we can state that multicollinearity is a problem in this dataset.
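
A determinant close to zero only detects perfect collinearity. As a complementary check, the sketch below computes pairwise correlations and variance inflation factors (VIF) with NumPy. It runs on synthetic data with an assumed correlation of about 0.9 between two regressors; the helper `vif` and the demo arrays are illustrative names, not part of the original notebook.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (X must not contain a constant)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(target)), np.delete(X, j, axis=1)])
        resid = target - others @ np.linalg.lstsq(others, target, rcond=None)[0]
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = 0.9 * a + np.sqrt(1 - 0.9**2) * rng.normal(size=500)  # corr(a, b) is about 0.9
c = rng.normal(size=500)                                   # unrelated regressor
X_demo = np.column_stack([a, b, c])

corr_demo = np.corrcoef(X_demo, rowvar=False)
vif_demo = vif(X_demo)

print("correlation matrix:\n", corr_demo.round(2))
print("VIF:", vif_demo.round(2))
```

On the housing data, the same helper could be applied to the regressor columns, for instance `vif(df[['crime', 'nox', 'rooms', 'radial', 'proptax']].to_numpy())`. A VIF well above 1 signals that the corresponding coefficient is estimated imprecisely.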

#### **6. Create three categories of nox levels (low, medium, high), corresponding to the following percentiles: 0-25%, 26%-74%, 75%-100%**

In [None]:
low, high = df['nox'].quantile([0.25, 0.75])
df['nox_level'] = pd.cut(df['nox'], bins=[df['nox'].min(), low, high, df['nox'].max()], labels=['low', 'medium', 'high'], include_lowest=True)
df.head()

#### **7. Compute for each category of nox level the average median price and comment on your results**

In [None]:
# Average of the median house price within each nox category
grouped = df.groupby('nox_level', observed=True)
average_prices = grouped['price'].mean()

average_pricestable = pd.DataFrame({'NOx level': average_prices.index, 'Average Price': average_prices.values})
average_pricestable

#### **8. Produce a scatter plot with the variable price on the y-axis and the variable nox on the x-axis. Is this a ceteris paribus effect?**

In [None]:
plt.scatter(df['nox'], df['price'])
plt.xlabel('NOX')
plt.ylabel('Price')
plt.title('Scatter plot of Price vs NOX')
plt.show()

When analyzing the graph, it is possible to observe that prices tend to decrease as NOx levels rise. However, it cannot be asserted that this is a ceteris paribus effect, because the other variables are not held constant and a scatter plot alone is not sufficient to isolate the effect of NOx. Moreover, there are different prices for the same NOx level, indicating the likely presence of other variables influencing the price.

#### **9. Run a regression of price on a constant, crime, nox, rooms, proptax. Comment on the histogram of the residuals. Interpret all coefficients.**

In [None]:
X = df[['crime', 'nox', 'rooms', 'proptax']]
y_9 = df['price']

X = sm.add_constant(X)

model = sm.OLS(y_9, X)
results9 = model.fit()
print(results9.summary())

plt.hist(results9.resid, bins='auto')
plt.title('Histogram of residuals')
plt.show()

The OLS regression results show that all the variables (crime, nox, rooms, proptax) are statistically significant at the 5% level, as their p-values are less than 0.05.

Each coefficient measures the change in price associated with a one-unit increase in the corresponding variable, holding the other regressors constant. The signs indicate that price decreases with crime, nox, and proptax, and increases with rooms.

The histogram of residuals is approximately bell-shaped, which suggests that the normality assumption for the errors is reasonable for this model.

#### **10. Run a regression of lprice on a *constant, crime, nox, rooms, proptax*. Interpret all coefficients.**


In [None]:
X = df[['crime', 'nox', 'rooms', 'proptax']]
y_10 = df['lprice']

X = sm.add_constant(X)

model = sm.OLS(y_10, X)
results10 = model.fit()
print(results10.summary())

plt.hist(results10.resid, bins='auto')
plt.title('Histogram of residuals')
plt.show()


The OLS regression results show that all the variables (crime, nox, rooms, proptax) are statistically significant at the 5% level, as their p-values are less than 0.05.

Since the dependent variable is now in logs, each coefficient multiplied by 100 gives the approximate percentage change in price associated with a one-unit increase in the corresponding variable, holding the other regressors constant. The signs indicate that price decreases with crime, nox, and proptax, and increases with rooms.

The histogram of residuals is approximately bell-shaped, which suggests that the normality assumption for the errors is reasonable for this model.

Compared with the previous model, the coefficients are smaller because they now measure proportional rather than absolute changes in price. The R-squared is higher, but since the dependent variable differs (lprice instead of price), the two R-squared values are not directly comparable.
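
In a log-level specification such as this one, a coefficient is usually read as a percentage effect: $100 \cdot \beta$ is an approximation of the exact change $100 \cdot (e^{\beta} - 1)$. The short sketch below compares the two. The coefficient values are hypothetical and are not taken from the regression output; `pct_change_log_level` is an illustrative helper name.

```python
import math

def pct_change_log_level(beta, delta_x=1.0):
    """Approximate and exact % change in y when x rises by delta_x and log(y) is regressed on x."""
    approx = 100.0 * beta * delta_x
    exact = 100.0 * (math.exp(beta * delta_x) - 1.0)
    return approx, exact

# Hypothetical coefficients: one small, one large
approx_small, exact_small = pct_change_log_level(-0.01)
approx_large, exact_large = pct_change_log_level(0.30)

print(f"beta = -0.01: approx {approx_small:.3f}%, exact {exact_small:.3f}%")
print(f"beta =  0.30: approx {approx_large:.3f}%, exact {exact_large:.3f}%")
```

The approximation is accurate for small coefficients and deteriorates as the coefficient grows, so the exact formula is preferable for large effects.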

#### **11. Run a regression of lprice on a *constant, crime, lnox, rooms, lproptax*. Interpret all coefficients.**

In [None]:
X = df[['crime', 'lnox', 'rooms', 'lproptax']]
y = df['lprice']

X = sm.add_constant(X)

model = sm.OLS(y, X)
results11 = model.fit()
print(results11.summary())

plt.hist(results11.resid, bins='auto')
plt.title('Histogram of residuals')
plt.show()

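The OLS regression results show that all the variables (crime, lnox, rooms, lproptax) are statistically significant at the 5% level, as their p-values are less than 0.05.

The coefficients of lnox and lproptax are elasticities: a 1% increase in nox (or proptax) is associated with a percentage change in price approximately equal to the coefficient, holding the other regressors constant. The coefficients of crime and rooms, multiplied by 100, give the approximate percentage change in price for a one-unit increase. The signs indicate that price decreases with crime, nox, and proptax, and increases with rooms.

The histogram of residuals is approximately bell-shaped, which suggests that the normality assumption for the errors is reasonable for this model.

The R-squared is slightly higher than in the previous model. Since both models explain lprice with the same number of regressors, this indicates a slightly better fit.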


#### **12. In the specification of question 9, test the hypothesis $H_{0}: β_{nox} = 0$ vs. $H_{1}: β_{nox} \neq 0$ at the 1% level using the p-value of the test**

In [None]:
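alpha = 0.01

# Question 9 specification: t-statistic of the nox coefficient under H0: beta_nox = 0
t_statistic = results9.params['nox'] / results9.bse['nox']

k = len(results9.params) - 1   # number of regressors, excluding the constant
ndf = len(y_9) - k - 1         # residual degrees of freedom

# Two-sided p-value: probability mass in both tails beyond |t|
p_value = 2 * t.sf(abs(t_statistic), df=ndf)


print('T-statistic:', t_statistic)
print('P-value:', p_value)

if p_value < alpha:
    print("We reject the null hypothesis at the 1% level.")
else:
    print("We do not reject the null hypothesis at the 1% level.")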
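
For a two-sided alternative, the p-value is the probability mass in both tails beyond $|t|$. The sketch below uses hypothetical values of the t-statistic and degrees of freedom (not the regression output) to show the computation with `scipy.stats.t`, and why the cumulative distribution function alone is not a p-value.

```python
from scipy.stats import t

def two_sided_p_value(t_stat, dof):
    """Two-sided p-value of a t-statistic with dof degrees of freedom."""
    return 2.0 * t.sf(abs(t_stat), df=dof)

# Hypothetical values for illustration only
t_demo, dof_demo = 2.5, 501

p_two_sided = two_sided_p_value(t_demo, dof_demo)
cdf_value = t.cdf(abs(t_demo), df=dof_demo)   # P(T <= |t|): close to 1, NOT a p-value

print("two-sided p-value:", round(p_two_sided, 4))
print("cdf at |t|:       ", round(cdf_value, 4))
```

The survival function `t.sf` is used instead of `1 - t.cdf` because it is more accurate for large statistics.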

#### **13. In the specification of question 9, test the hypothesis $H_{0}: β_{crime} = β_{proptax}$ at the 10% level**

In [None]:
# Two-sided p-value reported by statsmodels for nox in the question 9 specification
p_value = results9.pvalues['nox']
alpha = 0.01
print('p_value = ', p_value)

if p_value < alpha:
    print("We reject the null hypothesis at the 1% level.")
else:
    print("We do not reject the null hypothesis at the 1% level.")

#### **14. In the specification of question 9, test the hypothesis $H_{0}: β_{nox} = 0, β_{proptax} = 0$ at the 10% level**

Model:

$$
\text{{lprice}} = \beta_0 + \beta_1 \cdot \text{{crime}} + \beta_2 \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot \text{{proptax}} + u
$$

$$
\theta = \beta_1 - \beta_4
$$

Hypotheses:

$$
H_0: \theta = 0
$$

$$
H_1: \theta \neq 0
$$

Expressing $\beta_1$ in terms of $\theta$ and $\beta_4$:

$$
\beta_1 = \theta + \beta_4
$$

Substituting $\beta_1$ back into the model:

$$
\text{{lprice}} = \beta_0 + (\theta + \beta_4) \cdot \text{{crime}} + \beta_2 \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot \text{{proptax}} + u
$$
$$
\text{{lprice}} = \beta_0 + \theta \cdot \text{{crime}} + \beta_2 \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot (\text{{crime}} + \text{{proptax}}) + u
$$

Creating a new variable:

$$
\text{{crime\_tax}} = \text{{crime}} + \text{{proptax}}
$$

Perform OLS regression:

$$
\text{{model}}: \quad \text{{lprice}} = \beta_0 + \theta \cdot \text{{crime}} + \beta_2 \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot (\text{{crime}} + \text{{proptax}}) + u
$$

Hypothesis test:

$$
H_0: b_j = a_j  \quad \Rightarrow \quad \theta = 0
$$

$$
t = \frac{{b_j - a_j}}{{\text{{se}}(b_j)}}
$$

$$
t = \frac{{\theta}}{{\text{{se}}(\theta)}}
$$


In [None]:
X = df[['crime', 'nox', 'rooms']].copy()
X['crime_tax'] = X['crime'] + df['proptax']
X = sm.add_constant(X)

y = df['lprice']

model = sm.OLS(y, X)
results14 = model.fit()

# Hypothesis test:
# t = theta / se(theta)
t_stat = results14.params['crime'] / results14.bse['crime']

alpha = 0.1
k = len(results14.params) - 1 
ndf = len(y) - k - 1 # number of degrees of freedom
p_value = 2 * t.sf(abs(t_stat), df=ndf)

print('p_value:', p_value)
print('t-stat:', t_stat)

if p_value < alpha:
  print("We reject the null hypothesis at the", alpha*100, "% level.")
else:
  print("We do not reject the null hypothesis at the", alpha*100, "% level.")
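
The reparameterisation used above can be cross-checked against the direct formula $t = (b_1 - b_4) / \text{se}(b_1 - b_4)$, where $\text{se}(b_1 - b_4)^2 = \text{Var}(b_1) + \text{Var}(b_4) - 2\,\text{Cov}(b_1, b_4)$. The following sketch does this on synthetic data with NumPy (the data, coefficients and the helper `ols_fit` are illustrative assumptions); both routes should give the same t-statistic.

```python
import numpy as np
from scipy.stats import t

def ols_fit(X, y):
    """Return OLS coefficients, their covariance matrix and residual degrees of freedom."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)
    return beta, cov, dof

rng = np.random.default_rng(3)
n = 400
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.5 * x1 - 0.2 * x2 + 0.5 * x3 + rng.normal(size=n)

# Direct route: theta = b1 - b3, with its standard error from the covariance matrix
X = np.column_stack([np.ones(n), x1, x2, x3])
b, V, dof = ols_fit(X, y)
theta_direct = b[1] - b[3]
se_direct = np.sqrt(V[1, 1] + V[3, 3] - 2 * V[1, 3])
t_direct = theta_direct / se_direct

# Reparameterised route: regress on x1, x2 and (x1 + x3); the coefficient on x1 is theta
X_rep = np.column_stack([np.ones(n), x1, x2, x1 + x3])
b_rep, V_rep, _ = ols_fit(X_rep, y)
t_reparam = b_rep[1] / np.sqrt(V_rep[1, 1])

p_value_demo = 2 * t.sf(abs(t_direct), df=dof)
print("t (direct):", round(t_direct, 4), " t (reparameterised):", round(t_reparam, 4))
print("p-value:", round(p_value_demo, 4))
```

The reparameterised regression has the advantage that the standard error of $\theta$ is read directly from the regression output.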

#### **15. In the specification of question 9, test the hypothesis $H_{0}: β_{nox} = -500, β_{proptax} = -100$ at the 10% level using the p-value of the test**

In [None]:
x_unrestricted = sm.add_constant(df[['crime', 'nox', 'rooms', 'proptax']])
x_restricted = sm.add_constant(df[['crime', 'rooms']])

y = df['lprice']

model_unrestricted = sm.OLS(y, x_unrestricted)
model_restricted = sm.OLS(y, x_restricted)

results_unrestricted = model_unrestricted.fit()
results_restricted = model_restricted.fit()

# Hypothesis test:
SSR_unrestricted = results_unrestricted.ssr
SSR_restricted = results_restricted.ssr

k_unrestricted = x_unrestricted.shape[1] - 1
k_restricted = x_restricted.shape[1] - 1

q = k_unrestricted - k_restricted  # numerator degrees of freedom
n = len(y)
ddf = n - k_unrestricted - 1  # denominator degrees of freedom

F_statistic = ((SSR_restricted - SSR_unrestricted) / q) / \
    (SSR_unrestricted / ddf)


alpha = 0.10  # 10% significance level
# The F-test rejects only for large values, so the p-value is the upper-tail probability (no factor 2)
p_value = f.sf(F_statistic, q, ddf)

print(f'F Statistic: {F_statistic}')
print(f'p-value: {p_value}')

if p_value < alpha:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

#### **16. In the specification of question 9, test the hypothesis that all coefficients are the same for observations with low levels of nox vs. medium and high levels of *nox*.**

Unrestricted model:

$$
\text{{lprice}} = \beta_0 + \beta_1 \cdot \text{{crime}} + \beta_2 \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot \text{{proptax}} + u
$$

Restricted model:

$$
\text{{lprice}} - \beta_2 \cdot \text{{nox}} - \beta_4 \cdot \text{{proptax}} = \beta_0 + \beta_1 \cdot \text{{crime}} + \beta_3 \cdot \text{{rooms}}
$$

Given values:

$$
\beta_2 = -500, \quad \beta_4 = -100
$$

Substituting the values into the restricted model:

$$
\text{{lprice}} + 500 \cdot \text{{nox}} + 100 \cdot \text{{proptax}} = \beta_0 + \beta_1 \cdot \text{{crime}} + \beta_3 \cdot \text{{rooms}}
$$

This represents the restricted model with the specified values for $\beta_2$ and $\beta_4$.


In [None]:
x_unrestricted = sm.add_constant(df[['crime', 'nox', 'rooms', 'proptax']])
x_restricted = sm.add_constant(df[['crime', 'rooms']])

y_unrestricted = df['lprice']
y_restricted = df['lprice'] + 500 * df['nox'] + 100 * df['proptax']

model_unrestricted = sm.OLS(y_unrestricted, x_unrestricted)
model_restricted = sm.OLS(y_restricted, x_restricted)

results_unrestricted = model_unrestricted.fit()
results_restricted = model_restricted.fit()

# Hypothesis test:
SSR_unrestricted = results_unrestricted.ssr
SSR_restricted = results_restricted.ssr

k_unrestricted = x_unrestricted.shape[1] - 1
k_restricted = x_restricted.shape[1] - 1

q = k_unrestricted - k_restricted  # numerator degrees of freedom
n = len(y_unrestricted)  # sample size of the unrestricted fit
ddf = n - k_unrestricted - 1  # denominator degrees of freedom

F_statistic = ((SSR_restricted - SSR_unrestricted) / q) / \
    (SSR_unrestricted / ddf)


# The F-test rejects only for large values, so the p-value is the upper-tail probability (no factor 2)
p_value = f.sf(F_statistic, q, ddf)

alpha = 0.10  # 10% significance level

print(f'P-Value: {p_value}')
print(f'F-Statistic: {F_statistic}')

if p_value < alpha:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')
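
The restricted-regression approach above (moving the hypothesised terms to the left-hand side) can be checked against the Wald form of the F-statistic, $F = (R b - r)' [R (X'X)^{-1} R']^{-1} (R b - r) / (q \hat{\sigma}^2)$. The sketch below runs both on synthetic data; the data, the hypothesised values and the helper `fit` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import f

def fit(X, y):
    """Return OLS coefficients and the sum of squared residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return beta, resid @ resid

rng = np.random.default_rng(4)
n = 300
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 + 1.0 * x1 - 5.0 * x2 - 1.0 * x3 + rng.normal(size=n)

X_u = np.column_stack([np.ones(n), x1, x2, x3])
beta_u, ssr_u = fit(X_u, y)

# H0: beta_2 = -5 and beta_3 = -1. Move the restricted terms to the left-hand side.
y_r = y - (-5.0) * x2 - (-1.0) * x3
X_r = np.column_stack([np.ones(n), x1])
_, ssr_r = fit(X_r, y_r)

q = 2                      # number of restrictions
ddf = n - X_u.shape[1]     # residual degrees of freedom of the unrestricted model
F_ssr = ((ssr_r - ssr_u) / q) / (ssr_u / ddf)

# Wald form of the same test
R = np.array([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
r = np.array([-5.0, -1.0])
diff = R @ beta_u - r
middle = R @ np.linalg.inv(X_u.T @ X_u) @ R.T
F_wald = diff @ np.linalg.solve(middle, diff) / (q * ssr_u / ddf)

p_value_demo = f.sf(F_ssr, q, ddf)   # upper-tail probability only
print("F (SSR form):", round(F_ssr, 4), " F (Wald form):", round(F_wald, 4))
print("p-value:", round(p_value_demo, 4))
```

Both forms are algebraically identical in a linear model, so a disagreement would point to a coding error in one of them.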

#### **17. Repeat the test of question 16 but now assuming that only the coefficients of nox and proptax can change between the two groups of observations. State and test $H_{0}$.**

$$
\text{{model}}: \text{{lprice}} = \beta_0 + \beta_1 \cdot \text{{crime}} + \beta_2 \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot \text{{proptax}} + u
$$

$$
\theta = \beta_2 + \beta_4
$$

Hypotheses:
- $H_0: \theta = -1000$
- $H_1: \theta \neq -1000$

Expressing $\beta_2$ in terms of $\theta$ and $\beta_4$:

$$
\beta_2 = \theta - \beta_4
$$

Substituting $\beta_2$ back into the model:

$$
\text{{lprice}} = \beta_0 + \beta_1 \cdot \text{{crime}} + (\theta - \beta_4) \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot \text{{proptax}} + u
$$
$$
\text{{lprice}} = \beta_0 + \beta_1 \cdot \text{{crime}} + \theta \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot 
(\text{{proptax}} - \text{{nox}}) + u
$$
Creating a new variable:

$$
\text{{proptax\_nox}} = \text{{proptax}} - \text{{nox}}
$$

Perform OLS regression:

$$
\text{{model}}: \quad \text{{lprice}} = \beta_0 + \beta_1 \cdot \text{{crime}} + \theta \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot \text{{proptax\_nox}} + u
$$

Hypothesis test:

$$
H_0: b_j = a_j  \quad \Rightarrow \quad \theta = -1000
$$

$$
t = \frac{{b_j - a_j}}{{\text{{se}}(b_j)}}
$$

$$
t = \frac{{\theta + 1000}}{{\text{{se}}(\theta)}}
$$


In [None]:
X = df[['crime', 'nox', 'rooms']].copy()
X['proptax_nox'] = df['proptax'] - df['nox']
X = sm.add_constant(X)

y = df['lprice']

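model = sm.OLS(y, X)
results17 = model.fit()  # separate name, so that results14 is not overwritten

# After the reparameterisation, the coefficient on nox estimates theta
theta_hat = results17.params['nox']
t_stat = (theta_hat + 1000) / results17.bse['nox']

alpha = 0.1
k = len(results17.params) - 1
ndf = len(y) - k - 1 # number of degrees of freedom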

p_value = 2 * t.sf(abs(t_stat), df=ndf)

print('P-value:', p_value)
print('t-stat:', t_stat)

if p_value < alpha:
  print("We reject the null hypothesis at the", alpha*100, "% level.")
else:
  print("We do not reject the null hypothesis at the", alpha*100, "% level.")

### Part 2 - Heteroskedasticity

#### **18. Explain the problem of heteroskedasticity with an example of the course.**

First step: estimate the coefficient for low levels of nox:

$$
\text{{model}}: \quad \text{{lprice}} = \beta_0 + \beta_1 \cdot \text{{crime}} + \beta_2 \cdot \text{{nox}} + \beta_3 \cdot \text{{rooms}} + \beta_4 \cdot \text{{proptax}} + u
$$

Now we know the coefficients, and we can make the hypothesis test for medium and high levels of nox:

- $H_0$: $b_{i_{\text{{low}}}} = b_{i_{\text{{high\_medium}}}} \quad \forall \; i$
- $H_1$: $b_{i_{\text{{low}}}} \neq b_{i_{\text{{high\_medium}}}} \quad \exists \; i$

$$
\text{{Restricted model}}: \quad \text{{lprice}} - \beta_1 \cdot \text{{crime}} - \beta_2 \cdot \text{{nox}} - \beta_3 \cdot \text{{rooms}} - \beta_4 \cdot \text{{proptax}} = \beta_0 + u
$$

where $\beta_n \forall\; n \in \{1,2,3,4\}$ are the coefficients estimated in the previous model.

In [None]:
# First step: estimate the coefficient for low levels of nox
x_unrestricted = df[df['nox_level'] == 'low'][[
    'crime', 'nox', 'rooms', 'proptax']]
x_unrestricted = sm.add_constant(x_unrestricted)

y_unrestricted = df[df['nox_level'] == 'low']['lprice']

model = sm.OLS(y_unrestricted, x_unrestricted)
results_unrestricted = model.fit()

# Restricted model:

x_medium_high = df[df['nox_level'] != 'low'][[
    'crime', 'nox', 'rooms', 'proptax']]
y_medium_high = df[df['nox_level'] != 'low']['lprice']

x_restricted = sm.add_constant(x_medium_high)[
    ['const']]  # Only the constant goes here
y_restricted = y_medium_high - results_unrestricted.params['crime'] * x_medium_high['crime'] \
    - results_unrestricted.params['nox'] * x_medium_high['nox']\
    - results_unrestricted.params['rooms'] * x_medium_high['rooms'] \
    - results_unrestricted.params['proptax'] * x_medium_high['proptax']

model_restricted = sm.OLS(y_restricted, x_restricted)
results_restricted = model_restricted.fit()

# Hypothesis test:
SSR_unrestricted = results_unrestricted.ssr
SSR_restricted = results_restricted.ssr

k_unrestricted = x_unrestricted.shape[1] - 1
k_restricted = x_restricted.shape[1] - 1

q = k_unrestricted - k_restricted  # numerator degrees of freedom

n = len(y_unrestricted)  # observations used in the unrestricted (low-nox) fit
ddf = n - k_unrestricted - 1  # denominator degrees of freedom

F_statistic = ((SSR_restricted - SSR_unrestricted) / q) / \
    (SSR_unrestricted / ddf)


alpha = 0.10  # 10% significance level
p_value = f.sf(F_statistic, q, ddf)

print(f'P-Value: {p_value}')
print(f'F-Statistic: {F_statistic}')

if p_value < alpha:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')
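
For comparison, the textbook Chow test compares a single regression pooled over both groups (restricted: same coefficients) with separate regressions on each group (unrestricted). With $k$ coefficients including the constant, $F = \frac{(SSR_{pooled} - SSR_1 - SSR_2)/k}{(SSR_1 + SSR_2)/(n - 2k)}$. The following is a minimal sketch on synthetic data; `chow_test`, the group sizes and the coefficients are illustrative assumptions, and this is a different procedure from the two-step approach used above.

```python
import numpy as np
from scipy.stats import f

def ssr(X, y):
    """Sum of squared residuals of the OLS regression of y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

def chow_test(X, y, in_group):
    """Chow test of equal coefficients across two groups (X includes the constant)."""
    n, k = X.shape
    ssr_pooled = ssr(X, y)
    ssr_split = ssr(X[in_group], y[in_group]) + ssr(X[~in_group], y[~in_group])
    ddf = n - 2 * k
    F = ((ssr_pooled - ssr_split) / k) / (ssr_split / ddf)
    return F, f.sf(F, k, ddf)

rng = np.random.default_rng(6)
n = 400
x = rng.normal(size=n)
group = rng.random(n) < 0.25                  # plays the role of the "low" category
slope_by_group = np.where(group, 1.0, 3.0)    # coefficients differ across groups
y = 0.5 + slope_by_group * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

F_chow, p_chow = chow_test(X, y, group)
print("F:", round(F_chow, 2), " p-value:", p_chow)
```

On the housing data, `in_group` would be a boolean array such as `(df['nox_level'] == 'low').to_numpy()`, with `X` and `y` built from the full sample.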

#### **19. In the specification of question 9, test the hypothesis of no heteroskedasticity of linear form, i.e. in the regression of $u^2$ on constant, crime, nox, rooms, proptax, test $H_{0}: \delta_{crime} = \delta_{nox} = \delta_{rooms} = \delta_{proptax} = 0$, where the coefficients $\delta_{k}$ (k = crime, nox, rooms, proptax) are associated with the corresponding explanatory variables.**

Now that we have the coefficients, we can conduct a hypothesis test for medium and high levels of NOx. The hypotheses are as follows:

- Null Hypothesis ($H_0$): $b_{i_{\text{low}}} = b_{i_{\text{high\_medium}}} \quad \forall \; i \in \{2,4\}$
- Alternative Hypothesis ($H_1$): $b_{i_{\text{low}}} \neq b_{i_{\text{high\_medium}}} \quad \text{for some } i \in \{2,4\}$

We have two models:

1. **Unrestricted model:**
   - $ \text{lprice} = \beta_0 + \beta_1 \cdot \text{crime} + \beta_2 \cdot \text{nox} + \beta_3 \cdot \text{rooms} + \beta_4 \cdot \text{proptax} + u $

2. **Restricted model:**
   - $ \text{lprice} - \beta_2 \cdot \text{nox} - \beta_4 \cdot \text{proptax} = \beta_0 + \beta_1 \cdot \text{crime} + \beta_3 \cdot \text{rooms} + u $
   - where $ \beta_2 $ and $ \beta_4 $ are the results of the unrestricted model

considering $\alpha = 0.1$:

In [None]:
# First step: estimate the coefficient for low levels of nox
x_unrestricted = df[df['nox_level'] == 'low'][[
    'crime', 'nox', 'rooms', 'proptax']]
x_unrestricted = sm.add_constant(x_unrestricted)

y_unrestricted = df[df['nox_level'] == 'low']['lprice']

model = sm.OLS(y_unrestricted, x_unrestricted)
results_unrestricted = model.fit()

# Second step: estimate the restricted model for medium and high levels of nox
x_medium_high = df[df['nox_level'] != 'low'][[
    'crime', 'nox', 'rooms', 'proptax']]
x_restricted = x_medium_high[['crime', 'rooms']]
x_restricted = sm.add_constant(x_restricted)

y_medium_high = df[df['nox_level'] != 'low']['lprice']
y_restricted = y_medium_high - results_unrestricted.params['nox'] * x_medium_high['nox']\
    - results_unrestricted.params['proptax'] * x_medium_high['proptax']

model_restricted = sm.OLS(y_restricted, x_restricted)
results_restricted = model_restricted.fit()

# Hypothesis test:
SSR_unrestricted = results_unrestricted.ssr
SSR_restricted = results_restricted.ssr

k_unrestricted = x_unrestricted.shape[1] - 1
k_restricted = x_restricted.shape[1] - 1

q = k_unrestricted - k_restricted  # numerator degrees of freedom

n = len(y_unrestricted)  # observations used in the unrestricted (low-nox) fit
ddf = n - k_unrestricted - 1  # denominator degrees of freedom

F_statistic = ((SSR_restricted - SSR_unrestricted) / q) / \
    (SSR_unrestricted / ddf)


alpha = 0.10  # 10% significance level
p_value = f.sf(F_statistic, q, ddf)

print(f'P-Value: {p_value}')
print(f'F-Statistic: {F_statistic}')

if p_value < alpha:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

#### **20. In the specification of question 10, test the hypothesis of no heteroskedasticity of linear form**

In [None]:
X = df[['crime', 'nox', 'rooms', 'proptax']]

X = sm.add_constant(X)

u = results10.resid
u2 = u**2
y = u2
model = sm.OLS(y, X)
results = model.fit()

f_statistic = results.fvalue

alpha = 0.1
k = len(results.params) - 1
n = len(y)
ddf = n - k - 1

p_value = f.sf(f_statistic, dfn=k, dfd=ddf)

print('P-value = ', p_value)
print('F-statistic = ', f_statistic)

if p_value < alpha:
    print("We reject the null hypothesis at the 10% level.")
else:
    print("We do not reject the null hypothesis at the 10% level.")
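
The test above is the F-form of the Breusch-Pagan test: the squared residuals are regressed on the explanatory variables and the overall significance of that regression is tested. An equivalent LM form uses $LM = n R^2_{u^2}$, which is asymptotically $\chi^2_k$. The sketch below computes both on synthetic data whose error variance depends on one regressor; `breusch_pagan` and the data-generating process are illustrative assumptions.

```python
import numpy as np
from scipy.stats import f, chi2

def breusch_pagan(X, resid):
    """F and LM versions of the Breusch-Pagan test (X includes the constant)."""
    n, p = X.shape
    k = p - 1                                   # number of slope coefficients
    u2 = resid ** 2
    fitted = X @ np.linalg.lstsq(X, u2, rcond=None)[0]
    centered = u2 - u2.mean()
    r2 = 1.0 - ((u2 - fitted) @ (u2 - fitted)) / (centered @ centered)
    F = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    LM = n * r2
    return F, f.sf(F, k, n - k - 1), LM, chi2.sf(LM, k)

rng = np.random.default_rng(7)
n = 500
x1, x2 = rng.normal(size=(2, n))
X = np.column_stack([np.ones(n), x1, x2])
# Heteroskedastic errors: the standard deviation grows with x1
u = rng.normal(size=n) * np.exp(0.5 * x1)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + u

resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
F_bp, p_F, LM_bp, p_LM = breusch_pagan(X, resid)
print("F:", round(F_bp, 2), " p-value:", p_F)
print("LM:", round(LM_bp, 2), " p-value:", p_LM)
```

The two versions usually lead to the same conclusion in samples of this size.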

#### **21. In the specification of question 11, test the hypothesis of no heteroskedasticity of linear form**

In [None]:
X = df[['crime', 'lnox', 'rooms', 'lproptax']]

X = sm.add_constant(X)

u = results11.resid
u2 = u**2
y = u2
model = sm.OLS(y, X)
results = model.fit()

f_statistic = results.fvalue

alpha = 0.1
k = len(results.params) - 1
n = len(y)
ddf = n - k - 1

p_value = f.sf(f_statistic, dfn=k, dfd=ddf)

print('P-value:', p_value)
print('F-statistic = ', f_statistic)

if p_value < alpha:
    print("We reject the null hypothesis at the 10% level.")
else:
    print("We do not reject the null hypothesis at the 10% level.")

#### **22. Comment on the differences between your results of questions 20,21, 22.**

In all three tests, the null hypothesis is rejected, indicating the presence of heteroskedasticity. The F-statistic differs markedly across specifications: in the first test, the F-value is 6.799; in the second, F is 19.98; and in the third, F is 18.27. All three are well above the critical value, which is approximately 1.95 in every test. The further the F-value lies from this critical value, the stronger the evidence against homoskedasticity, so the rejection is most clear-cut in the second and third tests.

In [None]:
x_22 = df[['crime', 'nox', 'rooms', 'proptax']]
x_22 = sm.add_constant(x_22)

u = results9.resid
u2 = u**2
y_22 = u2
model = sm.OLS(y_22, x_22)
results22 = model.fit()

f_statistic = results22.fvalue

alpha = 0.1
k = len(results22.params) - 1
n = len(y_22)
ddf = n - k - 1

p_value = f.sf(f_statistic, dfn=k, dfd=ddf)

print('P-value = ', p_value)
print('F-statistic = ', f_statistic)

if p_value < alpha:
    print("We reject the null hypothesis at the 10% level.")
else:
    print("We do not reject the null hypothesis at the 10% level.")

#### **23. Using the specification of question 9, identify the most significant variable causing heteroskedasticity using the student statistics and run a WLS regression with the identified variable as weight. Compare the standards errors with those of question 9. Comment on your results.**

### Part 3 - Time Series Data
Using the threecenturies_v2.3 datasets.

#### **24. Define strict and weak stationarity.**

#### **25. Explain ergodicity and state the ergodic theorem. Illustrate with an example.**

#### **26. Why do we need both stationarity and ergodicity?**

#### **27. Explain “spurious regression”.**

#### **28. Make all time series stationary by computing the difference between the original variable and a moving average of order 2x10. Give the formula for the exact weights.**

#### **29. Using the original dataset, test the unit root hypothesis for all variables**

#### **30. Transform all variables so that they are stationary using either your answers to questions 28 or to question 29.**

#### **31. Explain the difference between ACF and PACF.**

#### **32. Plot and comment on the ACF and PACF of all variables.**

#### **33. Explain the principle of parsimony and its relationship with Ockham’s razor using the theory of information criterion.**

#### **34. Explain the problem of autocorrelation of the errors.**

#### **35. Using only stationary variables, run a regression of GDP on constant, unemployment and inflation and test the hypothesis of no-autocorrelation of errors.**

#### **36. Regardless of your answer to question 35, correct auto-correlation with GLS. Test again for the presence of auto-correlation. Comment on your results.**

#### **37. For all variables, construct their lag 1 and lag 2 variables.**

#### **38. Run a regression of GDP on constant, lag 1 unemployment, lag 2 unemployment, lag 1 inflation, lag 2 inflation. What is the number of observations and why?**

#### **39. State and test the no-Granger causality hypothesis of unemployment on GDP at the 1% level.**

#### **40. Divide the sample in two groups: 1900-1960 and 1961-2000. Test the stability of coefficients between the two periods.**

#### **41. Test the structural breakpoint using a trim ratio of 30% at the 1% level.**

#### **42. Divide the sample into 3 periods of equal length. Test that the coefficients of the second and the third periods are equal. Formulate the null hypothesis and interpret your results.**