# Problem Set 2


## An econometric analysis of Engel curves for U.S. households
In problem set 1, we undertook the initial analysis of the data and discussed regression models of Engel curves for food, clothes and alcohol. The objective of this week’s problem set is to estimate a simple regression model of the Engel curve using the OLS estimator.

The starting point is a regression model with one explanatory variable. Specifically, let us consider the case where the dependent variable represents food expenditures, while the explanatory variable is total expenditure:

\begin{align}
\text{xfath}_i = \beta_0 + \beta_1 \text{xtot}_i + u_i \tag{1}
\end{align}

In the consumption literature, it is common to use the expenditure share, $\text{xfath}/\text{xtot}$, as the dependent variable instead of total food expenditure. Furthermore, the logarithm of total expenditure deflated by an individual "consumer price index" is often used as the explanatory variable. In this case, the regression model is:

\begin{align}
\frac{\text{xfath}_i}{\text{xtot}_i} = \delta_0 + \delta_1 \log \left(\frac{\text{xtot}_i}{\text{price}_i}
\right) + v_i \tag{2}
\end{align}


## Group work: Discuss model (2)

### Question 1
**Task:** What is the interpretation of $\delta_1$ when $\delta_1>0$ and when $\delta_1<0$?
(Hint: luxury versus necessity goods)

**Your answer:**

When $\delta_1 > 0$, the budget share of the good rises as total expenditure rises, so in this case the good is a luxury.

When $\delta_1 < 0$, the budget share of the good falls as total expenditure rises, so the good can be interpreted as a necessity.
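
A short derivation, using only model (2), may make the link to luxury and necessity goods more explicit. It assumes that $\text{price}_i$ is held fixed and writes $w_i = \text{xfath}_i/\text{xtot}_i$ for the food share (the symbol $w_i$ is introduced here for illustration). Since $\log \text{xfath}_i = \log w_i + \log \text{xtot}_i$, the total-expenditure elasticity of food spending is:

\begin{align}
\frac{\partial w_i}{\partial \log \text{xtot}_i} = \delta_1
\quad \Longrightarrow \quad
\frac{\partial \log \text{xfath}_i}{\partial \log \text{xtot}_i} = 1 + \frac{\delta_1}{w_i}
\end{align}

Under these assumptions, $\delta_1>0$ corresponds to an elasticity above one (luxury) and $\delta_1<0$ to an elasticity below one (necessity).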

### Question 2
**Task:** What is likely to be included in the error term in model (2)?
(Hint: What other variables may influence the dependent variable, $\text{xfath}/\text{xtot}$, besides the explanatory variable?)

**Your answer:**


## Python exercises

### Exercise 1: Estimate model (1)
In this exercise, we are going to estimate model (1) **for men only** using OLS:
\begin{align}
\text{xfath}_i = \beta_0 + \beta_1 \text{xtot}_i + u_i \tag{1}
\end{align}

To do this, revisit the lecture notebook and use the `statsmodels` approach described there.



---

**Task 1.** Load the data from PS2.dta

**Your code:**

In [5]:
import pandas as pd
df = pd.read_stata('PS2.dta')
df

Unnamed: 0,year,province,hgy,hage,nety,xfath,xrest,xhhop,xwcloth,xmcloth,...,pcaruse,pcare,stonep,price,rxtot,xtot,wfath,wwcloth,wmcloth,walc
0,92.0,5.0,18795.0,35.0,16901.0,1510.0,526.0,1200.0,1066.0,0.0,...,279.678772,214.639633,5.491577,242.639587,37.038475,8987.0,0.168020,0.118616,0.000000,0.007789
1,92.0,2.0,36000.0,34.0,26350.0,2680.0,260.0,1734.0,842.0,0.0,...,272.327911,221.057159,5.456213,234.208679,48.947803,11464.0,0.233775,0.073447,0.000000,0.004361
2,92.0,4.0,29288.0,39.0,20346.0,1820.0,50.0,647.0,2270.0,0.0,...,228.363876,203.841293,5.604066,271.528290,39.590717,10750.0,0.169302,0.211163,0.000000,0.025581
3,92.0,4.0,29225.0,28.0,24005.0,1440.0,1986.0,4186.0,2125.0,0.0,...,228.363876,203.841293,5.370063,214.876373,73.349152,15761.0,0.091365,0.134826,0.000000,0.000508
4,92.0,4.0,29300.0,27.0,22118.0,3100.0,515.0,1041.0,769.0,0.0,...,228.363876,203.841293,5.377327,216.443039,45.526989,9854.0,0.314593,0.078039,0.000000,0.020296
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
445,92.0,3.0,60600.0,42.0,41900.0,2490.0,1500.0,565.0,0.0,310.0,...,284.803314,223.677643,5.693949,297.064484,54.829845,16288.0,0.152873,0.000000,0.019032,0.083497
446,92.0,1.0,45148.0,41.0,35073.0,900.0,4500.0,933.0,0.0,880.0,...,236.936798,197.392685,5.489823,242.214417,55.417839,13423.0,0.067049,0.000000,0.065559,0.207852
447,92.0,4.0,41200.0,29.0,30900.0,1500.0,1580.0,580.0,0.0,2195.0,...,228.363876,203.841293,5.392261,219.699570,69.258217,15216.0,0.098580,0.000000,0.144256,0.018402
448,92.0,2.0,40400.0,38.0,28440.0,2790.0,2410.0,874.0,0.0,1000.0,...,272.327911,221.057159,5.520148,249.672058,62.822407,15685.0,0.177877,0.000000,0.063755,0.155690


**Task 2:** Estimate model (1) **for the male participants in the survey** only using the `statsmodels` module as described in lecture.

**Your code:**

In [7]:
import statsmodels.api as sm

# Keep the male respondents only
df_male = df[df.dmale == 1]

# Regressor matrix: total expenditure plus a constant column
x_male = df_male[["xtot"]].copy()
x_male["konstant"] = 1
y_male = df_male["xfath"]

# OLS estimation using OLS from statsmodels
model_sm = sm.OLS(y_male, x_male).fit()
print(model_sm.summary())

                            OLS Regression Results                            
Dep. Variable:                  xfath   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.045
Method:                 Least Squares   F-statistic:                     13.11
Date:                Mon, 09 Sep 2024   Prob (F-statistic):           0.000354
Time:                        11:18:11   Log-Likelihood:                -2157.0
No. Observations:                 258   AIC:                             4318.
Df Residuals:                     256   BIC:                             4325.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
xtot           0.0488      0.013      3.621      0.0


**Task 3:** What is the interpretation of the slope $\beta_1$ and intercept $\beta_0$ in model (1)? 


**Your answer:**

The intercept $\beta_0$ is the predicted food expenditure when total expenditure ($\text{xtot}_i$) is zero.

In practice, a household with zero total expenditure is essentially never observed, so $\beta_0$ has no meaningful economic interpretation on its own; it simply fixes where the regression line crosses the vertical axis.

The slope $\beta_1$ is the expected change in food expenditure ($\text{xfath}_i$) for a one-unit increase in total expenditure ($\text{xtot}_i$).
In other words, $\beta_1$ tells us how much of one additional dollar (or other currency unit) of total expenditure is spent on food. A positive $\beta_1$ means that food expenditure rises with total expenditure, while a negative $\beta_1$ would mean the opposite.

**Task 4:** What is the estimate of the slope? And the intercept?



**Your answer:**

The estimates can be read off the regression output in Task 2:
$\hat{\beta}_0 = 1443.0723$
$\hat{\beta}_1 = 0.0488$.
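
To make the interpretation of the two estimates concrete, the sketch below plugs them into model (1). The expenditure levels of 10,000 and 11,000 are hypothetical values chosen for illustration, and the helper `predict_xfath` is not part of the problem set.

```python
# Rounded OLS estimates reported in the answer to Task 4
beta0_hat = 1443.0723
beta1_hat = 0.0488


def predict_xfath(xtot):
    """Predicted food expenditure from model (1) for a given total expenditure."""
    return beta0_hat + beta1_hat * xtot


# Hypothetical total expenditure levels (illustration only)
pred_10k = predict_xfath(10_000)
pred_11k = predict_xfath(11_000)

print(f"Predicted xfath at xtot = 10,000: {pred_10k:.2f}")             # 1931.07
print(f"Change for 1,000 extra in xtot:   {pred_11k - pred_10k:.2f}")  # 48.80
```

With these estimates, each additional unit of total expenditure is associated with roughly 0.05 units of additional food expenditure.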

**Task 5:** What is the total variation in the dependent variable, SST? The explained variation in the dependent variable, SSE? The variation in the residuals, SSR?

_Hint:_ You can calculate these statistics manually using the code from the lecture. Alternatively, you can access the statistics directly as properties of your OLS results object `sm.OLS.fit()`. If you named this object `results`, you can access the SST using `results.centered_tss`, the SSR using `results.ssr` and the SSE using `results.ess`.

**Your code:**

In [10]:
import numpy as np

# Rounded estimates read off the regression summary in Task 2
bhat0 = 1443.0723
bhat1 = 0.0488

# Predicted values (y-hat); use the xtot column only, not the whole regressor matrix
yhat = bhat0 + bhat1 * x_male["xtot"]

# Goodness of fit
SST = np.sum((y_male - np.mean(y_male)) ** 2)  # Total sum of squares
SSE = np.sum((yhat - np.mean(y_male)) ** 2)    # Explained sum of squares
SSR = np.sum((y_male - yhat) ** 2)             # Residual sum of squares
R2 = SSE / SST                                 # Coefficient of determination

# Print the results
print(f"SST (Total sum of squares):        {SST}")
print(f"SSE (Explained sum of squares):    {SSE}")
print(f"SSR (Residual sum of squares):     {SSR}")
print(f"R^2 (Goodness of fit):             {R2}")


In [11]:
SST_sm = model_sm.ess + model_sm.ssr  # Total Sum of Squares (SST)
SSR_sm = model_sm.ssr                 # Residual Sum of Squares (SSR)
SSE_sm = model_sm.ess                 # Explained Sum of Squares (SSE)
R2_sm = model_sm.rsquared             # R^2
print(f"SST (Total sum of squares):     {SST_sm:15.2f}")
print(f"SSE (Explained sum of squares): {SSE_sm:15.2f}")
print(f"SSR (Residual sum of squares):  {SSR_sm:15.2f}")
print(f"R^2 (Goodness of fit):          {R2_sm:15.4f}")

SST (Total sum of squares):        290257414.48
SSE (Explained sum of squares):     14140148.17
SSR (Residual sum of squares):     276117266.31
R^2 (Goodness of fit):                   0.0487


**Your answer:**

**Task 6:** Find the coefficient of determination, $R^2$. How can it be calculated from the three measures from the previous question? How would you interpret the calculated $R^2$?


**Your answer:**

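$R^2 = \text{SSE}/\text{SST} = 1 - \text{SSR}/\text{SST}$

Here $R^2 = 0.0487$, so total expenditure explains about 4.9% of the sample variation in food expenditure. The closer $R^2$ is to 1, the better the fit.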
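
As a quick arithmetic check, the sketch below recomputes $R^2$ from the three sums of squares. The numbers are copied from the printed statsmodels output (two decimals), so the decomposition only holds up to rounding.

```python
# Sums of squares copied from the printed output (rounded to two decimals)
SST = 290257414.48
SSE = 14140148.17
SSR = 276117266.31

# The decomposition SST = SSE + SSR should hold up to rounding
decomposition_gap = SST - (SSE + SSR)

# Two equivalent ways of computing R^2
r2_from_sse = SSE / SST
r2_from_ssr = 1 - SSR / SST

print(f"SST - (SSE + SSR) = {decomposition_gap:.2f}")
print(f"R^2 = SSE/SST     = {r2_from_sse:.4f}")   # 0.0487
print(f"R^2 = 1 - SSR/SST = {r2_from_ssr:.4f}")   # 0.0487
```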

**Task 7:** What is the estimate of the variance of the error term, $\hat \sigma^2$?

_Hint:_ use `results.mse_resid` to retrieve the estimated variance of the error term directly from the model object

**Your code:**

In [None]:
# The results object in this notebook is called model_sm; the property is mse_resid
model_sm.mse_resid
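
The value returned by `mse_resid` can be cross-checked by hand, since $\hat\sigma^2 = \text{SSR}/(n-k-1)$. The sketch below uses the SSR and the number of observations from the printed regression output; because the SSR is rounded to two decimals, the result is approximate.

```python
# Inputs copied from the printed regression output
SSR = 276117266.31   # residual sum of squares
n = 258              # "No. Observations"
k = 1                # "Df Model": number of slope parameters

# Degrees of freedom of the residuals: n minus the number of estimated parameters
df_resid = n - k - 1             # matches "Df Residuals" = 256

sigma2_hat = SSR / df_resid      # estimated variance of the error term
sigma_hat = sigma2_hat ** 0.5    # standard error of the regression

print(f"sigma^2 hat = {sigma2_hat:.2f}")
print(f"sigma hat   = {sigma_hat:.2f}")
```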

**Your answer:**

### Exercise 2: Illustrate the estimated Engel curve

**Task 1:** Illustrate the estimated Engel curve in a graph together with a scatterplot of the actual data observations. 

_Hints:_ To plot the estimated Engel curve, you need the $\hat{y}$ values (your estimated xfath values) for each of the observed $x$-values (the xtot values) in the dataset. When plotted together, these $\hat{y}$ values form the line estimated by OLS.

If your OLS results object is called `results` and your observations of xtot are called `X`, you can extract the $\hat{y}$ values like this:

```py
y_hat = results.predict(X)
```

To do a lineplot in Seaborn, use the `sns.lineplot` function with keyword arguments, e.g. `sns.lineplot(data=..., x=..., y=...)`.

To layer two plots on top of each other, simply execute two Seaborn commands after one another in a single cell.

**Your code:**

**Task 2:** Make scatterplots of the residuals from the regression against xtot and against the predicted value of food consumption, separately. What should you expect given the assumptions presented in the lectures? Should you expect total expenditure to correlate with the residuals?

_Hint:_
You can access the residuals using the `.resid` property of your results object.

_Pro tip:_ You can manually change the labels of the y- and x-axis of Seaborn plots by adding `.set(ylabel="Residuals", xlabel="xtot")` to the end of your plot command

**Your code:**

In [1]:
#Task 2a 


In [2]:
# Task 2b


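Remember the mechanical properties of the OLS estimator: the residuals sum to zero, $\sum_{i=1}^{n}\hat{u}_i=0$, and are always uncorrelated with the explanatory variable in the sample, $\sum_{i=1}^{n}x_i\hat{u}_i=0$. These are the sample counterparts of the population conditions $E(u)=0$ and $E(xu)=0$, the latter being implied by $E(u|x)=0$.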
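
These mechanical properties can be checked numerically. The sketch below uses synthetic data (not `PS2.dta`) and NumPy's least-squares solver in place of `statsmodels`; the data-generating values are arbitrary assumptions chosen only to resemble the scale of the survey variables.

```python
import numpy as np

# Synthetic data for illustration only (NOT the survey data)
rng = np.random.default_rng(0)
x = rng.uniform(5_000, 20_000, size=200)
y = 1_500 + 0.05 * x + rng.normal(0, 1_000, size=200)

# OLS with a constant: solve min ||y - X b||^2
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Mechanical properties: zero mean and zero sample correlation with x
resid_mean = resid.mean()
corr_x_resid = np.corrcoef(x, resid)[0, 1]

print(f"Mean of residuals:        {resid_mean:.2e}")
print(f"Corr(x, residuals):       {corr_x_resid:.2e}")
```

Both numbers are zero up to floating-point error regardless of how the data were generated, which is why a residual plot against xtot cannot reveal a linear correlation.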

### Exercise 3: Estimate model (2)

In this exercise, we are going to estimate model (2) for men and women individually:
\begin{align}
\frac{\text{xfath}_i}{\text{xtot}_i} = \delta_0 + \delta_1 \log \left(\frac{\text{xtot}_i}{\text{price}_i}
\right) + v_i \tag{2}
\end{align}


**Task 1:** Construct the variables needed to estimate model (2). 

_Hint:_ The $\text{price}_i$ variable is included in the dataset under the name `price`.
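
As an illustration of the two constructed variables, the sketch below uses the first three rows of the DataFrame displayed earlier. It uses plain NumPy arrays rather than the DataFrame itself, and the names `food_share`, `real_xtot` and `log_real_xtot` are hypothetical. In the displayed rows, the existing columns `wfath` and `rxtot` appear to match the computed share and deflated expenditure.

```python
import numpy as np

# First three rows of the displayed DataFrame (columns xfath, xtot, price)
xfath = np.array([1510.0, 2680.0, 1820.0])
xtot = np.array([8987.0, 11464.0, 10750.0])
price = np.array([242.639587, 234.208679, 271.528290])

# Dependent variable in model (2): food budget share
food_share = xfath / xtot

# Explanatory variable in model (2): log of deflated total expenditure
real_xtot = xtot / price
log_real_xtot = np.log(real_xtot)

print(food_share)     # compare with the wfath column
print(real_xtot)      # compare with the rxtot column
print(log_real_xtot)
```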


**Your code:**

**Task 2:** Estimate model (2) by OLS for the budget shares of food, clothing and alcohol for men and women, separately. For each gender and each of the three expenditure categories, print the estimated slope parameters. 

_Hint_: Write a nested for-loop where you estimate the model and print the parameter estimate for each of the genders and each of the three dependent variables.

To access the individual model parameters, you can use the `.params` property of the results object. For example, if you have an explanatory variable called `log_xtot_adj`, you can access the parameter estimate like this:


```py
model = sm.OLS(y, X)
results = model.fit()
delta1 = results.params['log_xtot_adj']
```
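
The structure of the nested loop can be sketched as follows. This is a minimal illustration on synthetic data, not `PS2.dta`: the share names, the "true" slopes and the use of `np.linalg.lstsq` instead of `sm.OLS` are all assumptions made to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Synthetic stand-in for the survey data (illustration only)
dmale = rng.integers(0, 2, size=n)              # 1 = male, 0 = female
log_xtot_adj = rng.normal(4.0, 0.3, size=n)     # log deflated expenditure
true_slopes = {"wfath": -0.10, "wcloth": 0.02, "walc": 0.01}  # hypothetical
shares = {
    name: 0.5 + slope * log_xtot_adj + rng.normal(0, 0.01, size=n)
    for name, slope in true_slopes.items()
}

estimates = {}
for gender, label in [(1, "men"), (0, "women")]:      # outer loop: gender
    mask = dmale == gender
    X = np.column_stack([np.ones(mask.sum()), log_xtot_adj[mask]])
    for name, w in shares.items():                    # inner loop: budget share
        (delta0, delta1), *_ = np.linalg.lstsq(X, w[mask], rcond=None)
        estimates[(label, name)] = delta1
        print(f"{label:5s} {name:7s} delta1 = {delta1: .4f}")
```

With `statsmodels`, the inner step would instead fit `sm.OLS` on the gender-specific subsample and read the slope from `.params`, as in the hint.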

**Your code:**

**Task 3:** Interpret the estimation results in light of the discussion on luxury versus necessity goods (see group work). Which parameter is central to the analysis? What conclusions can be drawn on the basis of the analysis?

**Your answer:**

## Theoretical exercise
Solve the following theoretical exercises (using pen and paper). Estimated time for the exam is 30
minutes.

**Task 1:**
Write up the simple linear regression model (SLR) with a constant term and 1
explanatory variable in matrix form for $n$ observations.
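
One way to set this up, as a sketch in which the symbols $\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$ and $\mathbf{u}$ are assumed notation, is to stack the $n$ equations $y_i = \beta_0 + \beta_1 x_i + u_i$:

\begin{equation*}
\underbrace{\left[
\begin{array}{c}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{array}
\right]}_{\mathbf{y}\ (n\times 1)}
=
\underbrace{\left[
\begin{array}{cc}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_n
\end{array}
\right]}_{\mathbf{X}\ (n\times 2)}
\underbrace{\left[
\begin{array}{c}
\beta_0 \\
\beta_1
\end{array}
\right]}_{\boldsymbol{\beta}\ (2\times 1)}
+
\underbrace{\left[
\begin{array}{c}
u_1 \\
u_2 \\
\vdots \\
u_n
\end{array}
\right]}_{\mathbf{u}\ (n\times 1)}
\end{equation*}

The column of ones in $\mathbf{X}$ carries the constant term.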

**Task 2:** Write the OLS estimator in matrix form. Show that when one calculates the OLS estimator, then:
\begin{equation*}
\widehat{\beta}_{0}=\bar{y}-\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\bar{x},\qquad
\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}
\end{equation*}
where $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}$ and $\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_{i}$.

_Hints_: Use the following rule for inverting a $2\times 2$ matrix:
\begin{equation*}
\left[
\begin{array}{cc}
a & b \\
c & d
\end{array}
\right]^{-1}=\frac{1}{ad-bc}\left[
\begin{array}{cc}
d & -b \\
-c & a
\end{array}
\right]
\end{equation*}
as well as rules (A.7) and (A.8) in Math Refresher A in the textbook.
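
Before doing the algebra, it can help to confirm numerically that the matrix expression and the scalar formulas agree. The sketch below uses four made-up data points and applies the $2\times 2$ inversion rule from the hint to $X'X$; it is a sanity check, not a substitute for the pen-and-paper derivation.

```python
import numpy as np

# Made-up data (illustration only)
x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 3.0, 7.0, 9.0])
n = len(x)

# Matrix form: beta_hat = (X'X)^{-1} X'y, with the 2x2 inverse from the hint
X = np.column_stack([np.ones(n), x])
XtX = X.T @ X
a, b, c, d = XtX.ravel()
XtX_inv = np.array([[d, -b], [-c, a]]) / (a * d - b * c)
beta_matrix = XtX_inv @ X.T @ y

# Scalar formulas from Task 2
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(beta_matrix)     # [beta0_hat, beta1_hat] from the matrix formula
print(beta0, beta1)    # the same two numbers from the scalar formulas
```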
