In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("proj04.ipynb")

<table style="width: 100%;" id="nb-header">
    <tr style="background-color: transparent;"><td>
        <img src="https://data-88e.github.io/assets/images/blue_text.png" width="250px" style="margin-left: 0;" />
    </td><td>
        <p style="text-align: right; font-size: 10pt;"><strong>Economic Models</strong>, Fall 2024<br>
            Dr. Eric Van Dusen <br>
        Vaidehi Bulusu <br>
        Akhil Venkatesh</p>
    </td></tr>
</table>

# Project 4: Econometrics

In the textbook and this week's lecture, we gave you a theoretical introduction to econometrics. In this project, you'll get the chance to apply what you've learned and see how econometrics is actually used by economists!

This project is based on an influential study on the relationship between a person's height and labor market outcomes, and is divided into 3 sections:

1. Simple Linear Regression

2. Multiple Linear Regression

3. Reading Econometrics Tables

You can refer to the [*Econometrics*](https://data-88e.github.io/textbook/content/11-econometrics/index.html) chapter in the textbook and week 11 resources for help on this project.

In [2]:
from datascience import *
import numpy as np
import pandas as pd
import statsmodels.api as sm

## Part 1: Simple Linear Regression

Several studies have identified a positive correlation between a person's height and labor market outcomes: on average, taller people have jobs that are of a higher status and pay more. In their paper, *[Stature and Status: Height, Ability, and Labor Market Outcomes](https://www.nber.org/system/files/working_papers/w12466/w12466.pdf)* (2008), economists Anne Case and Christina Paxson analyze data from the 1994 US National Health Interview Survey to explain this association. This is the data we will be using for our analysis.

In the first part of the project, we will use simple (bivariate) linear regression to look at the association between a person's height and earnings, and consider the problems with limiting our regression model to just 1 regressor.

We start by importing the `earnings.csv` dataset which contains information about the characteristics and labor market outcomes for 17,870 workers.

In [3]:
earnings = Table.read_table("earnings.csv")
earnings.show(5)

sex,age,mrd,educ,cworker,region,race,earnings,height,weight,occupation
0,48,1,13,1,3,1,84054.8,65,133,1
0,41,6,12,1,2,1,14021.4,65,155,1
0,26,1,16,1,1,1,84054.8,60,108,1
0,37,1,16,1,2,1,84054.8,67,150,1
0,35,6,16,1,1,1,28560.4,68,180,1


<div class="alert alert-warning">
    
Before proceeding, please read the <a href="https://www.princeton.edu/~mwatson/Stock-Watson_3u/Students/EE_Datasets/Earnings_and_Height_Description.pdf" target="_blank">data description</a> for this study which gives you information about each variable. **This is very important.**
    
</div>

<!-- BEGIN QUESTION -->

**Question 1.1:** What do you expect the sign of the relationship between height and earnings to be? Explain your answer.


_The expected sign of the relationship between height and income would be positive, as several studies support such a correlation. Additionally, height can be linked to physical health, self-confidence, and factors such as an advantage in negotiation._

<!-- END QUESTION -->

Note: We generally take the log of earnings in regression models but we’re not doing that in this project for the sake of simplicity. We’ll be using log earnings as the dependent variable in section 3.

**Question 1.2:** Here's a simple linear regression equation to model the association between height and earnings:

$$\text{Earnings} = \beta_1 \times \text{Height} + \alpha$$

Perform a regression of `earnings` on `height`. Don't forget to add a constant term!
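To see what the constant term does, here is a minimal sketch on hypothetical toy data (not `earnings.csv`). It uses plain NumPy least squares rather than `statsmodels`, and builds by hand the column of ones that `sm.add_constant` adds to the regressors. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical toy data (not from earnings.csv): y = 3 + 2 * x exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 + 2.0 * x

# With a constant: a column of ones sits next to x, so the fit can
# estimate an intercept as well as a slope.
X_with_const = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X_with_const, y, rcond=None)[0]

# Without a constant: the fitted line is forced through the origin,
# so the slope has to absorb the missing intercept.
slope_no_const = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]

print("with constant:   ", intercept, slope)
print("without constant:", slope_no_const)
```

With the constant, the fit recovers the intercept and slope used to build the toy data. Without it, the slope is distorted because the line must pass through the origin.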


In [4]:
y_1_2 = earnings.column('earnings')
x_1_2 = earnings.column('height')
x_1_2 = sm.add_constant(x_1_2)
model_1_2 = sm.OLS(y_1_2, x_1_2)
results_1_2 = model_1_2.fit()
results_1_2.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.011
Model:,OLS,Adj. R-squared:,0.011
Method:,Least Squares,F-statistic:,196.5
Date:,"Tue, 26 Nov 2024",Prob (F-statistic):,2.13e-44
Time:,21:59:31,Log-Likelihood:,-207550.0
No. Observations:,17870,AIC:,415100.0
Df Residuals:,17868,BIC:,415100.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-512.7336,3386.856,-0.151,0.880,-7151.299,6125.832
x1,707.6716,50.489,14.016,0.000,608.708,806.635

0,1,2,3
Omnibus:,346151.826,Durbin-Watson:,1.683
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1827.913
Skew:,0.397,Prob(JB):,0.0
Kurtosis:,1.649,Cond. No.,1130.0


In [5]:
grader.check("q1_2")

**Question 1.3:**
Why should we include a constant term in this regression model?

<ol type="A" style="list-style-type: lower-alpha;">
    <li>Because we expect the slope of the line of best fit may be zero.  </li>
    <li>Because we expect the slope of the line of best fit may be non-zero.  </li>
    <li>Because we expect the x-intercept of the line of best fit may be non-zero.  </li>
    <li>Because we expect the y-intercept of the line of best fit may be non-zero.  </li>
</ol>

Assign a letter corresponding to your answer to `q1_3` below. For example, `q1_3 = 'a'`.


In [6]:
q1_3 = 'd'

In [7]:
grader.check("q1_3")

**Question 1.4:** What is the estimated association between `earnings` and `height`?


In [8]:
result_1_4 = results_1_2.params[1]
result_1_4

707.67155842741215

In [9]:
grader.check("q1_4")

**Question 1.5:** Is the association statistically significant? Answer `True` or `False`.


In [10]:
result_1_5 = True
result_1_5

True

In [11]:
grader.check("q1_5")
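One way to check significance by hand is to rebuild the t-statistic from the coefficient and standard error reported in the summary table for question 1.2. The sketch below copies those two numbers from the table. The critical value 1.96 is a large-sample normal approximation and an assumption of this sketch, so the interval only roughly matches the one `statsmodels` reports.

```python
# Values copied from the height row (x1) of the question 1.2 summary table.
coef = 707.6716
std_err = 50.489

# t-statistic: how many standard errors the estimate is away from zero.
t_stat = coef / std_err

# Approximate 95% confidence interval using the normal critical value 1.96
# (an assumption of this sketch; statsmodels uses the t distribution).
critical_value = 1.96
ci_low = coef - critical_value * std_err
ci_high = coef + critical_value * std_err

# The coefficient is significant at the 5% level if |t| exceeds the
# critical value, which is the same as the interval excluding zero.
significant_at_5pct = abs(t_stat) > critical_value

print(round(t_stat, 3), round(ci_low, 1), round(ci_high, 1), significant_at_5pct)
```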

**Question 1.6:** Interpret the slope on the `height` variable (including the units). 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>A 1 inch increase in height corresponds to around a \$707.7 increase in earnings  </li>
    <li>A 1 inch increase in height corresponds to around a \$707.7 decrease in earnings  </li>
    <li>A 1% increase in height corresponds to around a \$707.7 increase in earnings  </li>
    <li>A 1% increase in height corresponds to around a \$707.7 decrease in earnings  </li>
</ol>

Assign a letter corresponding to your answer to `q1_6` below. For example, `q1_6 = 'a'`.


In [12]:
q1_6 = 'a'

In [13]:
grader.check("q1_6")

**Question 1.7:** Interpret the intercept of the regression (including the units). Does this make practical sense? 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>When height is 0, earnings are estimated to be around -\$512.7.  </li>
    <li>When height is 0, earnings are estimated to be around \$512.7.  </li>
    <li>When earnings are 0, height is estimated to be around -512.7 inches.  </li>
    <li>When earnings are 0, height is estimated to be around 512.7 inches.  </li>
</ol>

Assign a letter corresponding to your answer to `q1_7` below. For example, `q1_7 = 'a'`.


In [14]:
q1_7 = 'a'

In [15]:
grader.check("q1_7")

<!-- BEGIN QUESTION -->

**Question 1.8:** If the slope on the independent variable is statistically significant, can we infer a causal relationship between the two variables? Why or why not? When can you infer a causal relationship between 2 variables based on the results of the study?

*Hint: Think about the type of study we're looking at.*


_Just because the slope of the independent variable is statistically significant does not mean that we can infer a causal relationship between the two variables. This is because regression analysis is only a tool to show correlation, not a tool to prove causality.
To infer causality, we need a study that can control external influences, such as an experimental study or a randomized controlled trial. In observational studies, we cannot completely rule out the influence of latent variables or other external factors, so we cannot be sure of a causal relationship based on the results of regression analysis alone._

<!-- END QUESTION -->

**Question 1.9:** Use the slope and intercept from the regression in question 1.2 to generate predictions for `earnings`.


In [29]:
predictions_1_9 = results_1_2.params[0] + results_1_2.params[1] * earnings.column('height')
predictions_1_9

array([ 45485.91770787,  45485.91770787,  41947.55991573, ...,
        45485.91770787,  47608.93238315,  47608.93238315])

In [30]:
grader.check("q1_9")
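As a sanity check, the first few predictions can be reproduced by hand from the rounded coefficients in the question 1.2 summary table. In this sketch, `predict_earnings` is a hypothetical helper, and the heights 65 and 60 come from the first rows of the table shown earlier. Because the coefficients are rounded to four decimals, the results agree with the array above only to within about a cent.

```python
# Rounded coefficients copied from the question 1.2 summary table.
intercept = -512.7336
slope = 707.6716


def predict_earnings(height_inches):
    """Predicted earnings in dollars for a given height in inches."""
    return intercept + slope * height_inches


pred_65 = predict_earnings(65)  # first two rows of the data have height 65
pred_60 = predict_earnings(60)  # third row has height 60

print(round(pred_65, 2), round(pred_60, 2))
```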

**Question 1.10:** Calculate the RMSE for your regression predictions from question 1.9.


In [32]:
rmse_1_10 = np.sqrt(np.mean((predictions_1_9 - y_1_2) **2))
rmse_1_10

26775.741156313969

In [33]:
grader.check("q1_10")
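To make the RMSE formula concrete, here is a small worked sketch on hypothetical numbers (not the earnings data). It computes the RMSE step by step and, for comparison, the mean absolute error, which weights large misses less heavily.

```python
import numpy as np

# Hypothetical actual and predicted values, chosen so the arithmetic is easy.
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 36.0])

errors = predicted - actual        # [ 2., -2.,  3., -4.]
mse = np.mean(errors ** 2)         # (4 + 4 + 9 + 16) / 4 = 8.25
rmse = np.sqrt(mse)                # square root of the mean squared error

# Mean absolute error, for comparison: (2 + 2 + 3 + 4) / 4 = 2.75
mae = np.mean(np.abs(errors))

print(rmse, mae)
```

The RMSE is in the same units as the dependent variable (dollars, in this project) and is never smaller than the mean absolute error, because squaring gives extra weight to the largest misses.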

**Question 1.11:** Which one of the following is true about the RMSE? 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>RMSE is the total sum of squared error in the regression.   </li>
    <li>RMSE means that on average the predicted earnings are off by \$26,775.7.  </li>
    <li>It is possible for the RMSE to increase if we add unrelated variables to the regression.  </li>
    <li>Higher RMSE means the model more accurately predicts the dependent variable.  </li>
</ol>

Assign a letter corresponding to your answer to `q1_11` below. For example, `q1_11 = 'a'`.


In [21]:
q1_11 = 'b'

In [22]:
grader.check("q1_11")

## Part 2: Multiple Linear Regression

Now, let's perform multiple linear regression to account for potential confounding variables in our model. For simplicity, we will be using only the following additional regressors: `age`, `educ`, `sex` and `weight`.

<!-- BEGIN QUESTION -->

**Question 2.1:** Pick 2 variables from the 4 additional variables we will be incorporating in the model (`age`, `educ`, `sex` and `weight`) that you think might be potential confounders in the study. Provide a brief explanation for why you think each of the variables may be a confounder. 

*Hint: Recall that a confounder is related to both the independent and dependent variables, so in your answer, make sure to talk about how each confounder is related to `height` and `earnings`.*


_Education level (educ) is a potential confounder because it may be related to both height and income. People with more education generally earn higher incomes. People with more education are also more likely to have had better health and nutrition, which may lead to greater height on average. Therefore, education level may affect both height and income. Age is also a potential confounder. Older people generally earn higher incomes because of accumulated career experience. Age also reflects the period in which a person grew up, which may be related to height._

<!-- END QUESTION -->

**Question 2.2:** Perform another regression with the following new regressors: `age`, `educ`, `sex` and `weight` (also include `height`).


In [36]:
y_2_2 = earnings.column('earnings')
x_2_2 = pd.DataFrame({
    'age': earnings.column('age'),
    'educ': earnings.column('educ'),
    'sex': earnings.column('sex'),
    'weight': earnings.column('weight'),
    'height': earnings.column('height')
})
x_2_2 = sm.add_constant(x_2_2)
model_2_2 = sm.OLS(y_2_2, x_2_2)
results_2_2 = model_2_2.fit()
results_2_2.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.17
Model:,OLS,Adj. R-squared:,0.169
Method:,Least Squares,F-statistic:,729.7
Date:,"Tue, 26 Nov 2024",Prob (F-statistic):,0.0
Time:,22:28:46,Log-Likelihood:,-205980.0
No. Observations:,17870,AIC:,412000.0
Df Residuals:,17864,BIC:,412000.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-4.745e+04,4348.852,-10.911,0.000,-5.6e+04,-3.89e+04
age,335.0852,18.420,18.191,0.000,298.980,371.190
educ,3948.2298,70.529,55.980,0.000,3809.986,4086.473
sex,586.9203,520.681,1.127,0.260,-433.665,1607.505
weight,-5.0913,3.878,-1.313,0.189,-12.692,2.509
height,414.7890,68.097,6.091,0.000,281.312,548.265

0,1,2,3
Omnibus:,3334.172,Durbin-Watson:,1.908
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1045.069
Skew:,0.362,Prob(JB):,1.16e-227
Kurtosis:,2.062,Cond. No.,4610.0


In [37]:
grader.check("q2_2")

**Question 2.3:** Compare the coefficient on `height` from the regression model in question 2.2 to the coefficient on `height` from the regression model in question 1.2. What does this tell you about the nature of the omitted variable bias in the previous model (is it positive or negative)?

Fill in the blanks: The coefficient in 2.2 is \_\_\_\_\_ which means that the omitted variable bias in 1.2 is overall \_\_\_\_\_. 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>higher; positive</li>
    <li>lower; positive</li>
    <li>higher; negative</li>
    <li>lower; negative</li>
</ol>

Assign a letter corresponding to your answer to `q2_3` below. For example, `q2_3 = 'a'`.


In [38]:
q2_3 = 'b'

In [39]:
grader.check("q2_3")
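The direction of omitted variable bias can be illustrated with a small simulation. This is a sketch on synthetic data, not the earnings dataset: `z` stands in for an omitted variable (loosely, something like education) that is positively correlated with the regressor `x` and also raises `y`. All names and parameter values here are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: z is the omitted variable, x is positively correlated
# with z, and y depends on both with true slopes 1.0 (x) and 2.0 (z).
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)


def ols(columns, y):
    """Least-squares coefficients with a constant term in position 0."""
    X = np.column_stack([np.ones(len(y))] + columns)
    return np.linalg.lstsq(X, y, rcond=None)[0]


short_slope = ols([x], y)[1]     # z omitted: slope on x absorbs part of z's effect
long_slope = ols([x, z], y)[1]   # z included: slope on x is close to the true 1.0

print("slope on x without z:", short_slope)
print("slope on x with z:   ", long_slope)
```

When the omitted variable is positively related to both the regressor and the outcome, the short regression overstates the slope, and adding the variable pulls the coefficient down. That is the same pattern as the drop in the `height` coefficient between questions 1.2 and 2.2.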

<!-- BEGIN QUESTION -->

**Question 2.4:** Look at the regression coefficients for the additional variables we included in question 2.2. Do the coefficients align with your intuition from 2.1 (in terms of how they are related to earnings)?


_Looking at the regression coefficients of the added variables, we can see that they are consistent with what we expected in Question 2.1.
The regression coefficient for "educ" is 3948.23, which means that a one-year increase in education is associated with an average increase in income of \$3,948.23, holding the other regressors fixed. This is consistent with the intuition that a higher education level leads to higher income.
The coefficient for "age" is 335.09, which shows that income tends to increase with age. This is consistent with the general pattern of income increasing with career experience.
The coefficients for "sex" and "weight" are statistically insignificant, which suggests that these variables do not have a significant effect on income._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.5:** Let's consider the effect of education on earnings. According to the result in 2.2, is the level of education correlated with earnings? Does adding this variable help alleviate omitted variable bias? Why or why not? Explain your answers. 


_According to the results of Question 2.2, the level of education is clearly correlated with income. The regression coefficient of "educ" is 3948.23, and the p-value is close to 0, so the coefficient is highly statistically significant. This strongly suggests that a higher education level is associated with higher income. Adding the education variable helps to mitigate omitted variable bias. Since the initial regression model did not include important variables such as education level, the effect of height on income may have been overestimated. Therefore, by adding the education level variable to the regression model, the relationship between height and income can be estimated more accurately._

<!-- END QUESTION -->

**Question 2.6:** If we computed the RMSE for this new regression model, do you think it would be higher or lower than the RMSE we computed in 1.10? 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>Lower</li>
    <li>Higher</li>
    <li>It depends on the specific choice of variables </li>
</ol>

Assign a letter corresponding to your answer to `q2_6` below. For example, `q2_6 = 'a'`.


In [40]:
q2_6 = 'a'

In [41]:
grader.check("q2_6")
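A quick way to build intuition for this question is to compare in-sample RMSE for two nested models. The sketch below uses synthetic data and a hypothetical helper, `in_sample_rmse`. The second model adds a regressor that is pure noise, so it has no real relationship with `y`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic data: y depends on x only; noise_col is unrelated to y.
x = rng.normal(size=n)
noise_col = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)


def in_sample_rmse(columns, y):
    """Fit OLS with a constant and return the RMSE on the same data."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    residuals = y - X @ beta
    return np.sqrt(np.mean(residuals ** 2))


rmse_small = in_sample_rmse([x], y)
rmse_big = in_sample_rmse([x, noise_col], y)

print(rmse_small, rmse_big)
```

Because least squares could always set the new coefficient to zero, the RMSE computed on the data used for fitting cannot go up when a regressor is added. This holds only in-sample. On held-out data, an unrelated regressor can make predictions worse.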

**Question 2.7:** Now that we've accounted for some additional confounding variables, do you think it makes sense for us to infer a causal relationship between height and earnings?

<ol type="A" style="list-style-type: lower-alpha;">
    <li>Yes, because the coefficient for height is now highly significant. </li>
    <li>Yes, because we have eliminated omitted variable bias by adding control variables in 2.2. </li>
    <li>No, because there may be other omitted variables we haven't accounted for.</li>
    <li>No, because the set of control variables added in 2.2 is a poor choice. </li>
</ol>

Assign a letter corresponding to your answer to `q2_7` below. For example, `q2_7 = 'a'`.


In [42]:
q2_7 = 'c'

In [43]:
grader.check("q2_7")

**Question 2.8:** Using regression results in 2.2, now let’s try to predict a person’s earnings based on their characteristics.

What would be the predicted earnings of a 35-year-old woman with a Bachelor's degree who is 64 inches tall and weighs 124 pounds?

*Hint: A person with a Bachelor's degree has completed 15 years of education.*


In [44]:
prediction_2_8 = (
    results_2_2.params['const'] +
    results_2_2.params['age'] * 35 +
    results_2_2.params['educ'] * 15 +
    results_2_2.params['sex'] * 0 +
    results_2_2.params['weight'] * 124 +
    results_2_2.params['height'] * 64
)
prediction_2_8

49414.269967442233

In [45]:
grader.check("q2_8")

**Question 2.9:** Say you wanted to know the relationship between gender (`sex`: a binary variable) and income (`earnings`: a continuous variable). Based on our regression results, how is gender correlated with income?

*Hint*: Think about what it means when the value of `sex` is 0 versus 1. Also, you can try changing your input for `sex` in 2.8, and see what happens.

Assuming all else equal, ...

<ol type="A" style="list-style-type: lower-alpha;">
    <li>On average, males earn \$586.92 more than females. </li>
    <li>On average, females earn \$586.92 more than males. </li>
    <li>For every 1 inch increase in height, a male's income will increase on average by \$586.92 more than a female's. </li>
    <li>For every 1 inch increase in height, a female's income will increase on average by \$586.92 more than a male's. </li>
</ol>

Assign a letter corresponding to your answer to `q2_9` below. For example, `q2_9 = 'a'`.


In [46]:
q2_9 = 'a'

In [47]:
grader.check("q2_9")
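The hint can be tried out directly: flip `sex` from 0 to 1 while holding everything else fixed and look at how the prediction changes. The sketch below uses the rounded coefficients printed in the question 2.2 summary table. `predict` is a hypothetical helper, and since the table shows the intercept to only four significant figures, the predicted level differs from the output in 2.8 by a few dollars.

```python
# Rounded coefficients copied from the question 2.2 summary table.
coefs = {
    "const": -4.745e4,
    "age": 335.0852,
    "educ": 3948.2298,
    "sex": 586.9203,
    "weight": -5.0913,
    "height": 414.7890,
}


def predict(person):
    """Predicted earnings: intercept plus coefficient times value for each regressor."""
    return coefs["const"] + sum(coefs[name] * value for name, value in person.items())


# Same characteristics as in question 2.8; only `sex` differs.
woman = {"age": 35, "educ": 15, "sex": 0, "weight": 124, "height": 64}
man = dict(woman, sex=1)

gap = predict(man) - predict(woman)
print(round(predict(woman), 2), round(gap, 4))
```

The gap between the two predictions equals the coefficient on `sex`, whatever values the other characteristics take. Note that this coefficient has a p-value of 0.260 in the summary table, so the estimated gap is not statistically distinguishable from zero at the 5% level.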

## Part 3: Reading Econometrics Tables

Researchers tend to run multiple regression models which they summarize in econometrics tables.

Below is a table taken from the paper. It shows the regression results from 2 studies on the relationship between height and earnings: the British Cohort Study (BCS) and National Child Development Study (NCDS).

<img src = "https://i.imgur.com/a2o9OPA.png">

<div class="alert alert-warning">
    
Make sure to read the table (including the note at the bottom) before proceeding.
    
</div>

Note that for the questions below, **log earnings** is the dependent variable. Recall the [interpretation of the slope](https://data-88e.github.io/textbook/content/01-demand/03-log-log.html) in this case.
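When the dependent variable is in logs, a coefficient is read as an approximate proportional change. The sketch below compares the usual approximation (100 times the coefficient) with the exact conversion. The coefficient value 0.010 is used purely for illustration, and `pct_change_from_log_coef` is a hypothetical helper.

```python
import math


def pct_change_from_log_coef(beta, delta=1.0):
    """Exact percent change in the outcome when the regressor rises by `delta`,
    given a coefficient `beta` from a regression of log(outcome) on the regressor."""
    return 100.0 * (math.exp(beta * delta) - 1.0)


beta = 0.010  # illustrative coefficient on a regressor measured in inches

approx_pct = 100.0 * beta                    # rule of thumb: about 1%
exact_pct = pct_change_from_log_coef(beta)   # slightly above 1%

print(approx_pct, round(exact_pct, 4))
```

For small coefficients the two numbers are nearly identical, which is why a coefficient of 0.010 on height is usually read as "about a 1% change in earnings per inch" rather than as a change in dollars.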

**Question 3.1:** According to the table, what did the British Cohort Study (1970) find about the relationship between height at age 30, test scores at ages 5 and 10, and earnings for women? Use the results with extended controls added in. Which of the following are correct? There can be 1-4 correct answers. 

<ol type="A" style="list-style-type: lower-alpha;">
    <li>The coefficient on height is statistically significant. </li>
    <li>The coefficient on height means for every 1 inch increase in height at age 30, the annual earnings will on average increase by 0.002 dollars. </li>
    <li>The coefficient on test scores is statistically significant. </li>
    <li>The coefficient on test scores means for every 1 point increase in test scores at ages 5 and 10, the annual earnings will on average increase by 19.75 dollars. </li>
</ol>

Assign an array of letters corresponding to your answer to `q3_1` below. For example, `q3_1 = make_array('a', 'b', 'c', 'd')`.


In [48]:
q3_1 = make_array('a', 'b', 'c', 'd')

In [49]:
grader.check("q3_1")

<!-- BEGIN QUESTION -->

**Question 3.2:** Now look at the results of the British Cohort Study for men. There are 3 coefficients reported for height at age 30. Give a brief interpretation of each of these coefficients (including the units). In your explanation, talk about what is causing the differences among these coefficients (hint: look at the *Test scores* and *Extended controls* rows).

Note: Assume height is measured in inches for this study.


_First regression coefficient (Height at age 30 without test scores and controls):
Coefficient: 0.010 (standard error: 0.003)
On average, a 1-inch increase in height increases log earnings by 0.010, which corresponds to roughly a 1% increase in earnings. This regression model may overestimate the effect of height on earnings because it does not include test scores or additional control variables.

Second regression coefficient (Height at age 30 with test scores):
Coefficient: 0.004 (standard error: 0.003)
After controlling for test scores (ages 5 and 10), a 1-inch increase in height increases log earnings by 0.004. Adding test scores weakens the relationship between height and earnings.

Third regression coefficient (Height at age 30 with test scores and extended controls):
Coefficient: -0.004 (standard error: 0.004)
After controlling for test scores and additional control variables (parental education, birth weight, etc.), a 1-inch increase in height reduces log earnings by 0.004 on average, or roughly a 0.4% decrease in earnings. In this case, the estimated effect of height is negative, which suggests that the positive correlation between height and earnings can actually be explained by several other variables.

In the first model, the effect of height was overestimated because we did not control for test scores or the additional variables. In the second model, the effect of height was reduced when we added test scores, which is because height and test scores are positively correlated. In the third model, the effect of height became negative when we included the additional control variables, which suggests that the effect of height on earnings can actually be explained by other variables. Therefore, the difference in each coefficient is reflected in the way that the effect of height on earnings is increasingly reduced as we include test scores and the additional control variables._

<!-- END QUESTION -->

## Conclusion

This brings us to the end of project 4! You've learned some key econometrics skills such as running regressions and reading econometrics tables. You've also developed an intuition for ordinary linear regression, omitted variable bias and regression with dummy variables.

Also, we didn't cover a large part of Case and Paxson's [fascinating study](https://www.nber.org/system/files/working_papers/w12466/w12466.pdf) so if you're interested in how they explain the positive association between height and earnings, we recommend giving the paper a read!

Congratulations on finishing your final DATA 88E project! We're so proud of you!

If you have any feedback for us, we'd love to hear it! Feel free to reach out to Akhil at akhil.v@berkeley.edu or make a post on Ed!

---

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [50]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)

Running your submission against local test cases...



Your submission received the following results when run against available test cases:

    q1_2 results: All test cases passed!

    q1_3 results: All test cases passed!

    q1_4 results: All test cases passed!

    q1_5 results: All test cases passed!

    q1_6 results: All test cases passed!

    q1_7 results: All test cases passed!

    q1_9 results: All test cases passed!

    q1_10 results: All test cases passed!

    q1_11 results: All test cases passed!

    q2_2 results: All test cases passed!

    q2_3 results: All test cases passed!

    q2_6 results: All test cases passed!

    q2_7 results: All test cases passed!

    q2_8 results: All test cases passed!

    q2_9 results: All test cases passed!

    q3_1 results: All test cases passed!
