# ECON 490: Regression Analysis (10)

***

## Prerequisites 
---
- Open a dataset.
- Look up the documentation and syntax of a command.


## Learning objectives:
---
- Identify the difference between a linear model and OLS.
- Propose an econometric (linear) model from a given dataset.
- Understand the interpretation of the coefficients in the linear regression output.
- Distinguish good control variables from bad ones in the proposed model.




Understanding how to run a well-structured OLS regression and how to interpret its results are the most important skills for undertaking empirical economic analysis. You have acquired a solid understanding of the theory behind OLS regression in ECON 326. Here we will cover the practical side of running regressions and, perhaps more importantly, how to interpret the results.

One word of caution before we begin. Before conducting a regression analysis, a great deal of work must go into understanding the data and investigating the theoretical relationships between variables. The biggest mistake that students make at this stage is not in how they run the regression analysis; it is failing to spend enough time preparing the data for analysis.



<div class="alert alert-warning">

**Warning:** A variable that is qualitative and not ranked cannot be used in an OLS regression without first being converted into one or more dummy variables. Examples of variables that must always be included as dummy variables are sex, race, religiosity, immigration status, and marital status. Examples of variables that are sometimes included as dummy variables are education, income, and age.
</div>

<div class="alert alert-warning">

**Warning:** Take a good look at how your variables are coded before you begin running regressions and interpreting the results. Make sure that missing values are coded as "." and not as some value (such as "99"). Also, check that qualitative ranked variables are coded in the way you expect (e.g. higher education is coded with a larger number). If you do not do this, you could misinterpret your results.
    
</div>
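
A minimal sketch of these checks in Stata, assuming `fake_data.dta` is in the working directory and contains the variables `age`, `region` and `treated`. The value 99 is only a hypothetical missing-value code used for illustration; `fake_data` may contain no such values.

```stata
* Load the data (assumes fake_data.dta is in the working directory)
use fake_data, clear

* Storage types and labels of all variables
describe

* Compact summary: observations, mean, min, max for selected variables
codebook age region treated, compact

* Tabulate a categorical variable, listing missing values as their own row
tabulate region, missing

* Hypothetical check: count observations that use 99 as a missing-value code
count if age == 99

* If 99 really were a missing-value code, this would recode it to "."
* (left commented out because 99 could also be a genuine age)
* mvdecode age, mv(99)
```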



Before we proceed further, we will re-open the fake data dataset. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

In [3]:
clear*
use fake_data,clear

## 10.1 Linear Model 

An econometric model is an equation (or set of equations) that imposes some structure on how the data were generated. The most natural way to summarize statistical information is the *mean*. Therefore, we typically model the mean of a (dependent) variable and how it depends on different factors (covariates).

To consider an example, suppose you were an omnipotent being that could generate any variable (e.g. earnings) in the world. How would you do it? Let's think of ingredients that we would need to generate the earnings of every person in this world: 

- Age 
- Year (e.g. macroeconomic shocks in that particular year)
- Region (local determinants on earnings)
- Labor Market Experience
- Tenure at that particular firm
- Firm where that individual is working
- How productive a person is
- How passionate a person is about their particular job
- etc., etc., there are so many!



For simplicity, let's assume there are only two regions in this world: A and B. In this world, we'll make it such that workers in region B earn, on average, $\beta_1$ more in log earnings (approximately $100 \times \beta_1$ percent more) than workers in region A. Furthermore, an extra year of age increases log earnings by $\beta_2$ on average. And we keep on doing the same for every variable in the list above. The econometrician (us!) only observes a subset of all these variables, which we call the observables or covariates $X_{it}$. Let's suppose that the econometrician only observes the region and age of the workers.

We could generate log-earnings of worker $i$ at time $t$ as follows. 

\begin{align}
logearn_{it} &=  \beta_1 \mathbf{1}\{region_{it}=B\} + \beta_2 age_{it} + \underbrace{ \beta_3 exp_{it} + \beta_4 tenure_{it} + \dots }_{\text{Unobservable, so we'll call this }u_{it}^*} \\
&= \mathbf{E}[logearn_{it} \mid region_{it}=A, age_{it}=0] + \beta_1 \mathbf{1}\{region_{it}=B\} + \beta_2 age_{it} + u_{it}^* - \mathbf{E}[logearn_{it} \mid region_{it}=A, age_{it}=0] \\ 
&= \beta_0 + \beta_1 \mathbf{1}\{region_{it}=B\} + \beta_2 age_{it}  + u_{it}
\end{align}


    
In the second line we did one of the most powerful tricks in all of mathematics: add and subtract the same term! The term we chose is the mean log earnings of those who are in region A and have age equal to zero, i.e., we "turn off" the effect of the covariates. This term is the interpretation of the constant in our linear model. The redefined unobservable term is a deviation from that mean, which we expect to be zero on average.
    
<div class="alert alert-info">

**Note:** Notice how adding a constant term to our linear model provides the intuition for why we should expect the unobservable (error) term to be zero in expectation.
    
</div>


    
<div class="alert alert-info">


**Note:** Why did we model log-earnings? Notice that the right-hand side can, in theory, be negative. Negative earnings do not make a lot of sense, but negative log earnings simply mean that a person earns less than 1 dollar.
</div>



If we knew those $\beta$s, we would know a lot about the means of different sets of workers. For instance, we could compute the mean log earnings of 18-year-old workers in region B:

$$ \mathbf{E}[logearn_{it} \mid region_{it}=B, age_{it}=18] = \beta_0 + \beta_1 + \beta_2 \times 18  $$

or the mean log earnings of 20-year-old workers in region A:

$$ \mathbf{E}[logearn_{it} \mid region_{it}=A, age_{it}=20] = \beta_0  + \beta_2 \times 20  $$


This is the intuition that we should follow to interpret the coefficients! 
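
As a short worked example of this reasoning, using only the two conditional means just computed, we can subtract one from the other:

$$ \mathbf{E}[logearn_{it} \mid region_{it}=B, age_{it}=18] - \mathbf{E}[logearn_{it} \mid region_{it}=A, age_{it}=20] = (\beta_0 + \beta_1 + \beta_2 \times 18) - (\beta_0 + \beta_2 \times 20) = \beta_1 - 2 \beta_2 $$

The constant cancels, and the difference combines the region gap ($\beta_1$) with the two-year age gap ($-2\beta_2$). If we instead compared workers of the same age in the two regions, the age terms would cancel as well and the difference would be exactly $\beta_1$.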

## 10.2 Ordinary Least Squares

If we are given some dataset and we have to find the unknown $\beta$s, the most common and powerful tool is known as OLS. Let all the observations be indexed by $j=1,2,\dots, n$. Let $b_0, b_1, b_2$ be estimators of $\beta_0, \beta_1, \beta_2$, and let $\hat{u}_j$ be the residual, the sample counterpart of the error term $u$. The estimated linear model can be expressed as

$$ logearn_{j} = b_0 + b_1 \mathbf{1}\{region_{j}=B\} + b_2 age_{j}  + \hat{u}_{j} $$

OLS chooses the estimators that minimize the sum of squared residuals. This is given by the following minimization problem:

$$ \min_{b} \frac{1}{n} \sum_{j=1}^n \hat{u}_{j}^2 $$

This expression can also be written as

$$ \min_{b} \frac{1}{n} \sum_{j=1}^n (logearn_{j} - b_0 - b_1 \mathbf{1}\{region_{j}=B\} - b_2 age_{j} )^2 $$

In other words, OLS minimizes the squared residuals (the sample version of the error term) given our data. This minimization problem can be solved using calculus, and the first-order conditions are given by:

\begin{align}
\frac{1}{n} \sum_{j=1}^n 1 \times \hat{u}_{j} &= 0 \quad \text{(sample analogue of } \mathbf{E}[u] = 0 \text{)} \\
\frac{1}{n} \sum_{j=1}^n age_j \times \hat{u}_{j} &= 0  \\
\frac{1}{n} \sum_{j=1}^n \mathbf{1}\{region_j = B\} \times \hat{u}_{j} &= 0 
\end{align}

In other words, by construction, the sample version of our error term will be uncorrelated with all the covariates. The constant term works the same way as including a variable equal to 1 in the regression (try it yourself!).
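
Both claims can be checked directly in Stata. The following is a minimal sketch, assuming `fake_data` is in memory; the variable names `uhat` and `one` are introduced here for illustration only.

```stata
* Assumes fake_data is already in memory
cap drop logearn
cap drop uhat
cap drop one
gen logearn = log(earnings)

* Regression with Stata's default constant
regress logearn age

* Residuals: the sample version of the error term
predict uhat, residuals

* First-order conditions: the residuals have mean (numerically) zero
* and are uncorrelated with the covariate
summarize uhat
correlate uhat age

* Same regression, replacing the default constant with a variable equal to 1
gen one = 1
regress logearn age one, noconstant
```

The coefficient on `one` in the last regression should match `_cons` in the first one. Stata computes the R-squared differently when `noconstant` is specified, so only the coefficients are comparable across the two outputs.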


<div class="alert alert-info">


**Note:** Because this is an optimization problem, all of our variables must be numeric. If a variable is categorical we must be able to re-code it into a numerical variable. The next module will discuss more about this. 
    
</div>
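
As a small illustration of such a re-coding, here is a hedged sketch that turns `sex` into a 0/1 dummy. It assumes `fake_data` is in memory and that `sex` is stored as a string taking the values "M" and "F"; the value "F" and the variable name `female` are assumptions made for this example.

```stata
* Check how sex is stored before recoding it
describe sex

* Assumption: sex is a string with values "M" and "F".
* If sex is numeric with value labels, compare against the numeric code instead.
cap drop female
gen female = (sex == "F") if !missing(sex)

* Verify the recoding against the original variable
tabulate sex female, missing
```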



In [4]:
%browse 10

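| | workerid | year | sex | birth_year | age | start_year | region | treated | earnings |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1999 | M | 1944 | 55 | 1997 | 1 | 0 | 39975.008 |
| 2 | 1 | 2001 | M | 1944 | 57 | 1997 | 1 | 0 | 278378.06 |
| 3 | 2 | 2001 | M | 1947 | 54 | 2001 | 4 | 0 | 18682.6 |
| 4 | 2 | 2002 | M | 1947 | 55 | 2001 | 4 | 0 | 293336.41 |
| 5 | 2 | 2003 | M | 1947 | 56 | 2001 | 4 | 0 | 111797.26 |
| 6 | 3 | 2005 | M | 1951 | 54 | 2005 | 5 | 0 | 88351.672 |
| 7 | 3 | 2010 | M | 1951 | 59 | 2005 | 5 | 0 | 46229.574 |
| 8 | 4 | 1997 | M | 1952 | 45 | 1997 | 5 | 1 | 24911.029 |
| 9 | 4 | 2001 | M | 1952 | 49 | 1997 | 5 | 1 | 9908.3623 |
| 10 | 5 | 2009 | M | 1954 | 55 | 1998 | 2 | 1 | 137207.34 |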


The command to run OLS is called `regress`. You can consult its help file (`help regress`) to see the different options that this command provides.

### 10.2.1 OLS with Stata 
#### 1. Univariate Regressions

To run a linear regression using OLS we can write the command along with the dependent variable and a list of covariates.

In [6]:
cap drop logearn 
gen logearn = log(earnings)

regress logearn age 





      Source |       SS           df       MS      Number of obs   = 2,861,772
-------------+----------------------------------   F(1, 2861770)   =  45115.57
       Model |  61445.5785         1  61445.5785   Prob > F        =    0.0000
    Residual |  3897615.23 2,861,770  1.36195964   R-squared       =    0.0155
-------------+----------------------------------   Adj R-squared   =    0.0155
       Total |   3959060.8 2,861,771  1.38343033   Root MSE        =     1.167

------------------------------------------------------------------------------
     logearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0140839   .0000663   212.40   0.000     .0139539    .0142139
       _cons |    10.0016   .0024028  4162.52   0.000     9.996893    10.00631
------------------------------------------------------------------------------


By default, Stata includes a constant (which is usually what we want, since it ensures that the residuals are zero on average). The estimated coefficients are $\hat{\beta}_0 = 10$ and $\hat{\beta}_1 = 0.014$. Notice that we only included one covariate here; this is known as a univariate (linear) regression.

The interpretation in a univariate regression is fairly simple: $\hat{\beta}_1$ says that one extra year of age is associated with an increase of $0.014$ in log earnings. In other words, one extra year of age is associated with approximately 1.4 percent higher earnings.
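
Reading a log coefficient as a percent change by multiplying it by 100 is an approximation that works well for small coefficients; the exact percent change implied by a coefficient $b$ is $100 \times (e^{b} - 1)$. A short sketch of both calculations, assuming `logearn` and `age` are in memory as created earlier in this module:

```stata
* Re-run the univariate regression without printing the output table
quietly regress logearn age

* Approximate percent change in earnings for one extra year of age
display "Approximate % change: " 100 * _b[age]

* Exact percent change implied by the log specification
display "Exact % change:       " 100 * (exp(_b[age]) - 1)
```

With the coefficient reported in the regression output (0.0140839), the two numbers are about 1.41 and 1.42, so the approximation is very close here. The gap grows as the coefficient gets larger.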


#### 2. Multivariate Regression

The command `regress` (abbreviated `reg`) allows us to list multiple covariates.

In [None]:
reg logearn age treated

How would we interpret the coefficient corresponding to being treated, which we denote $\beta_2$? Consider the following comparisons:

- Mean log earnings of treated workers of 18 years old minus the mean log earnings of untreated workers of 18 years old = $\beta_2$. 
- Mean log earnings of treated workers of 20 years old minus the mean log earnings of untreated workers of 20 years old = $\beta_2$. 
- and so on. 


Therefore, the coefficient gives the difference in mean log earnings between treated and untreated workers *with the same other characteristics*. We economists usually refer to this as *ceteris paribus*.

The second column of the regression output shows the standard errors. Using those, we can compute the third column, the t-statistic for testing the null hypothesis that the coefficient is equal to zero:

$$ t = \frac{ \hat{\beta} - 0 }{StdErr} $$

If the t-statistic is greater than roughly 2 (more precisely, 1.96 in large samples) in absolute value, we reject the null hypothesis that the coefficient is zero at the 5% significance level. This means the data support the hypothesis that the variable in question is related to earnings. Equivalently, we can use the p-value: if the p-value is less than 0.05, we reject the null hypothesis at the 5% significance level (that is, with 95% confidence).
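
The t-statistic and p-value that Stata reports can be reproduced by hand from the stored results. This is a minimal sketch, assuming `logearn`, `age` and `treated` are in memory; the scalar name `t_treated` is introduced here for illustration.

```stata
* Multivariate regression from this section
regress logearn age treated

* t-statistic for the null hypothesis that the coefficient on treated is zero
scalar t_treated = _b[treated] / _se[treated]
display "t-statistic: " t_treated

* Two-sided p-value from the t distribution with e(df_r) residual degrees of freedom
display "p-value: " 2 * ttail(e(df_r), abs(t_treated))

* Same hypothesis using Stata's built-in Wald test (the reported F equals t squared)
test treated
```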




## 10.3 What can we do with OLS? 

Notice that OLS gives us a linear approximation to the conditional mean of some dependent variable given some observables. We can use this information for prediction: if we had different observables, how would the conditional mean differ? Another thing we could do with OLS is discuss causality: how does manipulating one variable impact the dependent variable on average?

To give a causal interpretation to our OLS estimates, we require that, in the population, $\mathbf{E}[X_i u_i] = 0$: the unobservables must be uncorrelated with the independent variables of the equation. If the unobservables are correlated with an independent variable, the estimated coefficient on that variable partly reflects changes in the unobservables rather than changes in the variable itself, and we cannot establish causality. Notice that the sample analogue of this condition always holds, because OLS by construction generates residuals that are uncorrelated with the observables; this tells us nothing about whether the condition holds in the population.

For instance, if we want to interpret the estimated coefficient on being treated in the previous regression (-0.81) as a causal effect, it must be the case that treatment is uncorrelated (in the population sense) with the error term. However, it could be the case that treated workers are the ones who usually perform worse at their jobs, and that would invalidate a causal interpretation of our OLS estimates.
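
One standard way to see the size of the problem is the omitted variable bias formula. The sketch below makes simplifying assumptions: a single regressor $x$ (e.g. being treated), a single omitted determinant $w$ (e.g. job performance), and an error $e$ that is uncorrelated with $x$. The symbols $x$, $w$, $\gamma$ and $e$ are introduced here for illustration only. Suppose the outcome is generated by

$$ y = \beta_0 + \beta_1 x + \gamma w + e $$

If we regress $y$ on $x$ alone, the OLS slope does not converge to $\beta_1$ in large samples but to

$$ \beta_1 + \gamma \frac{\mathrm{Cov}(x, w)}{\mathrm{Var}(x)} $$

The second term vanishes only if $w$ does not affect the outcome ($\gamma = 0$) or is uncorrelated with $x$. In the example, if better job performance raises earnings ($\gamma > 0$) and treated workers tend to perform worse ($\mathrm{Cov}(x, w) < 0$), the OLS estimate understates the effect of treatment.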


    
<div class="alert alert-info">

**Good Controls:** To think about good controls, we need to consider which *unobserved* determinants of the outcome are possibly correlated with our variable of interest.
    
</div>



<div class="alert alert-warning">

**Bad Controls:** It is bad practice to include variables that are themselves outcomes. For instance, consider studying the causal effect of college on earnings. If we include an indicator for working at a high-paying job as a covariate, we block part of the causal channel between college and earnings, since people with more years of education are more likely to land a high-paying job!
    
</div>

