# ECON 490: Regression Analysis (12)

## Prerequisites 
---
1. Econometric approaches to linear regression taught in ECON 326.
2. Importing data into Stata.
3. Creating new variables using `generate`.

## Learning objectives:
---

1. Implement the econometric theory for linear regressions learned in ECON 326.
2. Run simple univariate and multivariate regressions using the command `regress`.
3. Understand the interpretation of the coefficients in the linear regression output.
4. Consider the quality of control variables in a proposed model.


## 12.1 A Word of Caution Before We Begin

Before conducting a regression analysis, a great deal of work must go into understanding the data and investigating the theoretical relationships between variables. The biggest mistake that students make at this stage is not in how they run the regression analysis; it is failing to spend enough time preparing the data for analysis. 
- A variable that is qualitative and not ranked cannot be used in an OLS regression without first creating a dummy variable (or a set of dummy variables, one per category). Examples of variables that must always be included as dummy variables are sex, race, religiosity, immigration status, and marital status. Examples of variables that are sometimes included as dummy variables are education, income, and age. 
- You will want to take a good look at how your variables are coded before you begin running regressions and interpreting the results. Make sure that missing values are coded as "." and not as some numeric value (such as "99"). Also, check that qualitative ranked variables are coded in the way you expect (e.g. higher education is coded with a larger number). If you do not do this, you could misinterpret your results.
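
The checks described above might look like the following in Stata. This is a minimal sketch on a made-up toy dataset: the variable names `id`, `educ`, and `region` and the placeholder code `99` are assumptions for illustration and are not taken from the course data.

```stata
* Hypothetical toy data: educ uses 99 as a placeholder for "missing".
clear
input id educ region
1 12 1
2 16 2
3 99 3
4 14 2
end

* Recode the placeholder value 99 to Stata's missing value "."
mvdecode educ, mv(99)

* Inspect how each variable is coded (including missing values)
tabulate educ, missing
codebook region

* Create one dummy variable per category of an unranked qualitative variable
tabulate region, generate(region_)
```

After running this, `educ` is missing for the third observation, and `region_1`, `region_2`, and `region_3` are the dummy variables for the three regions.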


## 12.2 Linear Regression Models 

Understanding how to run a well-structured OLS regression and how to interpret the results of that regression are the most important skills for undertaking empirical economic analysis. You have acquired a solid understanding of the theory behind the OLS regression in ECON 326. Here, we will cover the practical side of running regressions and, perhaps more importantly, how to interpret the results. 

An econometric model is an equation (or set of equations) that imposes some structure on how the data were generated. The most natural way to describe statistical information is the mean. Therefore, we typically model the mean of a (dependent) variable and how it depends on different factors (independent variables or covariates). The easiest way to describe a relationship between a dependent variable, y, and one or more independent variables, x, is linearly. 

Suppose we want to know what variables are needed to understand why and how earnings vary between each person in the world. What would be the ingredients needed to predict everyone's earnings?  

Some explanatory variables can be:
- Age 
- Year (e.g. macroeconomic shocks in that particular year)
- Region (local determinants on earnings)
- Labor Market Experience
- Tenure at that particular firm
- Firm where that individual is working
- Level of productivity
- Passion about their job
- etc., etc., there are so many!


For simplicity, let's assume we want to predict earnings but we only have access to data on people's age and earnings. If we want a model of the relationship between these two variables, we could write a linear model $$ y = b + mx. $$ In this case, the independent variable (x) would be age, the slope (m) would be how much an extra year of age changes earnings, the y-intercept (b) would be the value of earnings when age is equal to 0, and the dependent variable (y) would be earnings.

Because we only have access to two variables, we are unable to observe the rest of the variables (or covariates $X_{i}$) that change earnings. Even though we do not observe these variables, they still affect earnings, so actual earnings will deviate from the line above. To model earnings truthfully, we therefore need to include an error term, $u_{i}$, that collects everything we do not observe. (Its sample counterpart, which we compute after estimation, is called the residual.) Using the natural log of earnings as the dependent variable, the model becomes:

$$ logearn_{i} =\beta_0 + \beta_1 age_{i}  + u_{i}$$
Here, $\beta_0$ is the y-intercept and $\beta_1$ is the slope.

It's important to understand what $\beta_0$ and $\beta_1$ stand for in the linear model. We said above that we typically model the mean of a (dependent) variable and how it depends on different factors (covariates). Therefore, we are in fact modeling the expected value of y conditional on x. 
$$ E[y_{i}|x_{i}]=\beta_0 + \beta_1 E[x_{i}|x_{i}]  + E[u_{i}|x_{i}]$$
The expected value of a random variable, given that we already know its value, is that value. Thus, $$E[x_{i}|x_{i}]=x_{i}$$ and, as we will explain below, we assume $$E[u_{i}|x_{i}]=0$$
Hence, $$ E[y_{i}|x_{i}]=\beta_0 + \beta_1x_{i}$$
If $x=0$ then $\beta_1x=0$ and $$ E[y_{i}|x_{i}=0]=\beta_0 $$
If $x=1$ then $\beta_1x=\beta_1$ and $$ E[y_{i}|x_{i}=1]=E[y_{i}|x_{i}=0]+ \beta_1$$
$$ E[y_{i}|x_{i}=1]- E[y_{i}|x_{i}=0]= \beta_1$$ 
$\beta_1$ is the difference in the expected value of y when x increases by one unit.

If we know those $\beta$s, we can learn a lot about the means of different sets of workers. For instance, we can compute the mean log earnings of 18-year-old workers: 

$$ \mathbf{E}[logearn_{i} \mid  age_{i}=18] = \beta_0 + \beta_1 \times 18  $$


This is the intuition that we should follow to interpret the coefficients! 
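
As a worked example, suppose for illustration that $\beta_0 = 10$ and $\beta_1 = 0.014$. These are assumed values, chosen only to match the rough magnitudes estimated later in this module. The expected log earnings of 18-year-old and 19-year-old workers would then be:

\begin{align}
\mathbf{E}[logearn_{i} \mid age_{i}=18] &= 10 + 0.014 \times 18 = 10.252 \\
\mathbf{E}[logearn_{i} \mid age_{i}=19] &= 10 + 0.014 \times 19 = 10.266 \\
\mathbf{E}[logearn_{i} \mid age_{i}=19] - \mathbf{E}[logearn_{i} \mid age_{i}=18] &= 0.014 = \beta_1
\end{align}

The last line shows that the difference in expected log earnings between workers who are one year apart in age is exactly the slope coefficient.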


<div class="alert alert-warning">
    

For simplicity, let's assume there are only two regions in this world: A and B. In this world, workers in region B earn $\beta_1$ log points more (roughly $100 \times \beta_1$ percent more when $\beta_1$ is small) than workers in region A on average. Furthermore, an extra year of age increases log earnings by $\beta_2$ on average. We keep doing the same for every variable in the list above. The econometrician (us!) only observes a subset of all these variables, which we call the observables or covariates $X_{it}$. Let's suppose that the econometrician only observes the region and age of the workers.

We could generate log-earnings of worker $i$ at time $t$ as follows. 

\begin{align}
logearn_{it} &=  \beta_1 \mathbf{1}\{region_{it}=B\} + \beta_2 age_{it} + \underbrace{ \beta_3 exp_{it} + \beta_4 tenure_{it} + \dots }_{\text{Unobservable, so we'll call this }u_{it}^*} \\
&= \mathbf{E}[logearn_{it} \mid region_{it}=A, age_{it}=0] + \beta_1 \mathbf{1}\{region_{it}=B\} + \beta_2 age_{it} + u_{it}^* - \mathbf{E}[logearn_{it} \mid region_{it}=A, age_{it}=0] \\ 
&= \beta_0 + \beta_1 \mathbf{1}\{region_{it}=B\} + \beta_2 age_{it}  + u_{it}
\end{align}


    
In the second line we did one of the most powerful tricks in all of mathematics: add and subtract the same term! The term we chose was the mean log earnings of those who are in region A and have age equal to zero, i.e., we "turn off" the effect of the covariates. This term is the interpretation of the constant in our linear model. The redefined unobservable term is a deviation from that mean, which we expect to be zero on average. 


>**Note:** Notice how adding a constant term in our linear model provides the intuition why we should expect the unobservable (error) term to be zero in expectation.
    
>**Note:** Why did we model log-earnings? Notice that the right hand side can be, in theory, negative. Negative earnings do not make a lot of sense. However, negative log earnings can occur whenever a person earns below 1 dollar. 
</div>


## 12.3 Ordinary Least Squares

If we are given some dataset and we have to find the unknown $\beta$s, the most common and powerful tool is known as ordinary least squares (OLS). For simplicity, let's assume there are only two regions in this world: A and B. In this world, workers in region B earn $\beta_1$ log points more (roughly $100 \times \beta_1$ percent more when $\beta_1$ is small) than workers in region A on average. Furthermore, an extra year of age increases log earnings by $\beta_2$ on average. Every other variable that affects earnings is inaccessible to us. Let all the observations be indexed by $j=1,2,\dots, n$. Let $b_0, b_1, b_2$ be the estimators of $\beta_0, \beta_1, \beta_2$, and let the residual $\hat{u}_{j}$ be the sample counterpart of the error term $u_{j}$. The estimated linear model can be expressed as

$$ logearn_{j} = b_0 + b_1 \mathbf{1}\{region_{j}=B\} + b_2 age_{j}  + \hat{u}_{j} $$

OLS finds the estimators that minimize the sum of squared residuals. This is given by the following minimization problem:
$$ \min_{b} \frac{1}{n} \sum_{j=1}^n \hat{u}_{j}^2 $$ This expression can also be written as

$$ \min_{b} \frac{1}{n} \sum_{j=1}^n (logearn_{j} - b_0 - b_1 \mathbf{1}\{region_{j}=B\} - b_2 age_{j} )^2 $$

OLS minimizes the sum of squared residuals (the sample version of the error term) given our data. This minimization problem can be solved using calculus: we differentiate the objective with respect to each of $b_0$, $b_1$, and $b_2$ (using the chain rule) and set each derivative equal to zero. The resulting first-order conditions are: 

\begin{align}
\frac{1}{n} \sum_{j=1}^n 1 \times \hat{u}_{j} &= 0  \\
\frac{1}{n} \sum_{j=1}^n age_j \times \hat{u}_{j} &= 0  \\
\frac{1}{n} \sum_{j=1}^n \mathbf{1}\{region_j = B\} \times \hat{u}_{j} &= 0 
\end{align}

These first-order conditions are the sample versions of the most important restrictions for OLS: $$E[u]=E[u\times  age]=E[u\times \mathbf{1}\{region = B\}]=0$$
In other words, by construction, the residuals (the sample version of our error term) have mean zero and are uncorrelated with all the covariates. The constant term works the same way as including a variable equal to 1 in the regression (try it yourself!).
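
One way to see these conditions at work is to check them numerically. The following is a minimal sketch on simulated stand-in data, not the course dataset: the variable `regionB` and every parameter value used to generate the data are assumptions made purely for illustration.

```stata
* Simulated stand-in data; all parameter values below are made up.
clear
set seed 12345
set obs 500
generate age = 18 + floor(48*runiform())
generate regionB = runiform() < 0.5
generate logearn = 9 + 0.2*regionB + 0.01*age + rnormal(0, 0.5)

* Estimate by OLS and store the residuals in double precision
regress logearn regionB age
predict double uhat, residuals

* Sample first-order conditions: each mean should be numerically zero
generate double uhat_age = uhat*age
generate double uhat_regionB = uhat*regionB
summarize uhat uhat_age uhat_regionB

* The constant behaves like a regressor that always equals 1
generate one = 1
regress logearn regionB age one, noconstant
```

The means reported by `summarize` should be zero up to rounding error, and the coefficient on `one` in the last regression should match the constant from the first one.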

We can also construct the formulas for the estimators $b_0, b_1, b_2$ using these conditions.
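
For intuition, consider the simpler case with a constant and a single covariate, $age_j$. This is a simplification of the model above, which also includes the region dummy. With only these two unknowns, the first two first-order conditions can be solved by hand:

\begin{align}
\frac{1}{n}\sum_{j=1}^n (logearn_j - b_0 - b_1 age_j) = 0 &\implies b_0 = \overline{logearn} - b_1 \overline{age} \\
\frac{1}{n}\sum_{j=1}^n age_j (logearn_j - b_0 - b_1 age_j) = 0 &\implies b_1 = \frac{\sum_{j=1}^n (age_j - \overline{age})(logearn_j - \overline{logearn})}{\sum_{j=1}^n (age_j - \overline{age})^2}
\end{align}

Here $\overline{logearn}$ and $\overline{age}$ denote sample means. The slope $b_1$ is the sample covariance between age and log earnings divided by the sample variance of age, and the intercept $b_0$ makes the fitted line pass through the point of sample means.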


<div class="alert alert-info">


**Note:** Because this is an optimization problem, all of our variables must be numeric. If a variable is categorical we must be able to re-code it into a numerical variable. The next module will discuss more about this. 
    
</div>
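
As a small illustration of re-coding a categorical variable, the sketch below converts a text variable into numeric codes with `encode`. The toy data and the names `region` and `region_num` are assumptions for illustration only; the variables in the course dataset may be stored differently.

```stata
* Toy data: region is stored as text, so it cannot enter a regression directly.
clear
input str1 region logearn
"A" 9.8
"B" 10.1
"A" 9.9
"C" 10.4
"B" 10.2
"C" 10.3
end

* Map each distinct string to an integer code (value labels are attached)
encode region, generate(region_num)

* The i. prefix treats the integer codes as categories, not as a quantity
regress logearn i.region_num
```

Each reported coefficient is the difference in mean log earnings between that region and the omitted base region (here, region A).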



## 12.4 Ordinary Least Squares Regressions with Stata 

For this module we will be using the fake data dataset. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings. 

```stata
* Below you will need to include the path on your own computer to where the data is stored between the quotation marks.

clear *
cd " "
import delimited using "fake_data.csv", clear
```

### 12.4.1 Univariate Regressions

To run a linear regression using OLS we use the command `regress`. The basic syntax of the command is

```stata
regress dep_varname indep_varname
```
You can consult the help file (`help regress`) to see the different options that this command provides. 

Let's start by creating a new variable that is the natural log of earnings and then run our regression. 

```stata
* Create the natural log of earnings, then regress it on age
gen logearn = log(earnings)
regress logearn age 
```

By default, Stata includes a constant (which is usually what we want, since it guarantees that the residuals are zero on average). The estimated coefficients are $\hat{\beta}_0 = 10$ and $\hat{\beta}_1 = 0.014$.  Notice that we only included one covariate here; this is known as a univariate (linear) regression. 

The interpretation in a univariate regression is fairly simple: $\hat{\beta}_1$ says that one extra year of age is associated with log earnings that are higher by $0.014$ on average. In other words, one extra year of age corresponds to approximately 1.4 percent higher earnings. 
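
The stored coefficients can be used directly to reproduce this interpretation. The sketch below uses simulated stand-in data rather than the course dataset; the "true" values 10 and 0.014 used to generate it are assumptions chosen to mimic the estimates above.

```stata
* Simulated stand-in data; the "true" values 10 and 0.014 are assumptions.
clear
set seed 490
set obs 1000
generate age = 18 + floor(48*runiform())
generate logearn = 10 + 0.014*age + rnormal(0, 0.3)

regress logearn age

* Stored coefficients: _b[_cons] is the intercept, _b[age] is the slope
display "Predicted mean log earnings at age 18: " _b[_cons] + _b[age]*18

* Approximate vs. exact percent change in earnings per extra year of age
display "Approximate % change: " 100*_b[age]
display "Exact % change:       " 100*(exp(_b[age]) - 1)
```

For a coefficient this small, the approximation $100 \times \hat{\beta}_1$ and the exact value $100 \times (e^{\hat{\beta}_1} - 1)$ are very close; the gap grows as the coefficient gets larger.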


### 12.4.2 Multivariate Regression

The command `regress` (which can be abbreviated to `reg`) also allows us to list multiple covariates. When we want to carry out a multivariate regression we write, 
```stata
regress dep_varname indep_varname1 indep_varname2
```
and so on.

```stata
reg logearn age treated
```

How would we interpret the coefficient corresponding to being treated (call it $\beta_2$)? Consider the following comparisons: 

- Mean log earnings of treated workers of 18 years old minus the mean log earnings of untreated workers of 18 years old = $\beta_2$. 
- Mean log earnings of treated workers of 20 years old minus the mean log earnings of untreated workers of 20 years old = $\beta_2$. 
- and so on. 


Therefore, the coefficient gives the difference in mean log earnings between treated and untreated workers *with the same other characteristics*. We economists usually refer to this as *ceteris paribus*.
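
These comparisons can be reproduced with `margins` after the regression. The following is a sketch on simulated stand-in data, not the course dataset, and the effect sizes used to generate the data are made-up numbers.

```stata
* Simulated stand-in data; the effect sizes are made-up numbers.
clear
set seed 2003
set obs 1000
generate age = 18 + floor(48*runiform())
generate treated = runiform() < 0.4
generate logearn = 10 + 0.014*age + 0.2*treated + rnormal(0, 0.3)

regress logearn age treated

* Predicted mean log earnings for untreated and treated workers at ages 18 and 20
margins, at(age=(18 20) treated=(0 1))

* Treated-minus-untreated gap at each age: both equal the coefficient on treated
margins, dydx(treated) at(age=(18 20))
display "Coefficient on treated: " _b[treated]
```

Because the model is linear with no interaction between `age` and `treated`, the gap is the same at every age by construction.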

The second column of the regression output shows the standard errors. Using those, we can compute the third column, which is the t-statistic for testing the null hypothesis that the coefficient is equal to zero: 

$$ t = \frac{ \hat{\beta} - 0 }{StdErr} $$

If the t-statistic is greater than roughly 2 in absolute value, we reject the null hypothesis that there is no effect at the 5% significance level (equivalently, at a 95% confidence level). This would mean that the data support the hypothesis that the variable in question has some effect on earnings. 

An alternative way to perform the same test is to use the p-value: if the p-value is less than 0.05, we reject the null hypothesis at the 5% significance level.
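
The t-statistic and p-value in the regression table can be reproduced by hand from the stored results. This is a sketch on simulated stand-in data (all generating values are assumptions); the scalar names `t_treated`, `p_treated`, and `t_crit` are chosen here for illustration.

```stata
* Simulated stand-in data; all generating values are made up.
clear
set seed 326
set obs 1000
generate age = 18 + floor(48*runiform())
generate treated = runiform() < 0.4
generate logearn = 10 + 0.014*age + 0.2*treated + rnormal(0, 0.3)

regress logearn age treated

* t-statistic for H0: the coefficient on treated equals zero
scalar t_treated = (_b[treated] - 0) / _se[treated]

* Two-sided p-value and 5% critical value from the t distribution
scalar p_treated = 2*ttail(e(df_r), abs(t_treated))
scalar t_crit = invttail(e(df_r), 0.025)

display "t = " t_treated "   p = " p_treated "   critical value = " t_crit

* Stata's built-in test of the same hypothesis
test treated
```

With a large sample the critical value is close to 1.96, which is where the "roughly 2" rule of thumb comes from.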

<div class="alert alert-info">

**Note:** Without statistical significance we cannot reject the null hypothesis that the coefficient is zero. This does not prove that the coefficient is exactly zero; it only means that the data do not provide enough evidence of an effect. 

</div>


## 12.5 What can we do with OLS? 

Notice that OLS gives us a linear approximation to the conditional mean of some dependent variable given some observables. We can use this information for prediction: if we had different observables, how would the expected value of the dependent variable differ? Another thing we could do with OLS is discuss causality: how does manipulating one variable impact the dependent variable on average?

To give a causal interpretation to our OLS estimates, we require that in the population it holds that 
$\mathbf{E}[X_i u_i] = 0$; that is, the unobservables are uncorrelated with the independent variables of the equation. If the unobservables are correlated with an independent variable, then part of the estimated relationship between that variable and the dependent variable may reflect changes in the unobservables rather than changes in the independent variable itself, and we cannot claim causality. This is called an endogeneity problem. Notice that in the sample the corresponding condition always holds, because OLS by construction generates residuals that are uncorrelated with the observables; this tells us nothing about whether the condition holds in the population. 

For instance, if we want to interpret the previous regression as saying that the causal effect of being treated is equal to -0.81, it must be the case that treatment is not correlated (in the population sense) with the error term. However, it could be the case that treated workers are the ones who usually perform worse at their job, and that would invalidate a causal interpretation of our OLS estimates.

- Good Controls: To think about good controls we need to consider which *unobserved* determinants of the outcome are possibly correlated to our variable of interest.
    
- Bad Controls: It is bad practice to include variables that are themselves outcomes. For instance, consider studying the causal effect of college on earnings. If we include a covariate of working at a high paying job, then we're blocking part of the causal channel between college and earnings, i.e. you are more likely to have a nice job if you study more years!
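
The endogeneity problem described above can be illustrated with a small simulation. Everything here is an assumption made for illustration: the variable `ability`, the true treatment effect of 0.1, and the rule that lower-ability workers are more likely to be treated. In real data `ability` would be unobserved, so the second regression would not be available to us.

```stata
* Simulation with made-up parameters: the true effect of treatment is +0.1,
* but low-ability workers are more likely to be treated.
clear
set seed 2012
set obs 5000
generate ability = rnormal()
generate treated = (ability + rnormal()) < 0
generate age = 18 + floor(48*runiform())
generate logearn = 10 + 0.1*treated + 0.5*ability + 0.01*age + rnormal(0, 0.5)

* ability left in the error term: the treated coefficient is biased
regress logearn age treated
scalar b_short = _b[treated]

* ability included as a control: the coefficient is close to the true 0.1
regress logearn age treated ability
scalar b_long = _b[treated]

display "Without ability: " b_short "   With ability: " b_long
```

When `ability` is omitted, the coefficient on `treated` also picks up the lower ability of treated workers, so it can even take the wrong sign.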
    



## 12.6 Wrap up 
In this module we distinguished between a linear model and the OLS estimator. OLS estimates a linear model by minimizing the sum of squared residuals: $$ \min_{b} \frac{1}{n} \sum_{j=1}^n \hat{u}_{j}^2 $$ Other estimators can be applied to the same linear model. One example minimizes the sum of absolute residuals instead: $$ \min_{b} \frac{1}{n} \sum_{j=1}^n |\hat{u}_{j}| $$ The model is still linear, but the estimator is not OLS.


We also learned how to interpret coefficients in any linear regression. $\beta_0$ is the y-intercept of the line, so it is the expected value of y when x=0: $$ E[y_{i}|x_{i}=0]=\beta_0 $$
For any other coefficient, such as $\beta_1$, $\beta_2$, or $\beta_3$, 
$$ E[y_{i}|x_{i}=1]- E[y_{i}|x_{i}=0]= \beta$$ it is the difference in the expected value of y associated with a one-unit change in x. Therefore, the betas tell us the effect that a particular covariate has on y, ceteris paribus, making them values of great importance when we are developing our research project!
