# 10. Regression Analysis

***

Prerequisites: 
- Open a dataset.
- Understand how to look up the documentation and syntax of a command.


Learning Objectives: 
- Identify the difference between a linear model and OLS.
- Propose an econometric (linear) model from a given dataset.
- Understand the interpretation of the coefficients in the linear regression output.
- Distinguish which control variables are good in the proposed model.

***

Understanding how to run a well-structured OLS regression and how to interpret its results are the most important skills for undertaking empirical economic analysis. You have acquired a solid understanding of the theory behind OLS regression in ECON 326. Here we will cover the practical side of running regressions and, perhaps more importantly, how to interpret the results. 

One word of caution before we begin: a great deal of work must go into understanding the data and investigating the theoretical relationships between variables before conducting a regression analysis. The biggest mistake that students make at this stage is not in how they run the regression analysis but in failing to spend enough time preparing the data for analysis. 



<div class="alert alert-warning">

**Warning:** A variable that is qualitative and not ranked cannot be used in an OLS regression without first creating a dummy variable. Examples of variables that must always be included as dummy variables are sex, race, religiosity, immigration status, and marital status. Examples of variables that are sometimes included as dummy variables are education, income and age. 
</div>

<div class="alert alert-warning">

**Warning:** Take a good look at how your variables are coded before you begin running regressions and interpreting the results. Make sure that missing values are coded as "." and not as some numeric value (such as "99"). Also, check that ranked qualitative variables are coded in the way you expect (e.g. higher education is coded with a larger number). If you skip these checks, you could misinterpret your results.
    
</div>



Before we proceed further, we will re-open the earnings dataset of workers across different years.

In [1]:
clear *
use fake_data, clear
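
Following the two warnings above, a first pass over the data might look like the sketch below. It assumes `fake_data` is in memory with the variables `sex`, `region`, `age`, and `earnings`. The new variable names (`female`, `region_1`, and so on) are illustrative, and the value `"F"` for women is an assumption about how `sex` is coded, which the `tabulate` output should confirm before you rely on it.

```stata
* Inspect storage types and basic summary statistics
describe
summarize age earnings

* List every category of the qualitative variables, including missing values
tabulate sex, missing
tabulate region, missing

* Count observations where key variables are missing
count if missing(earnings)
count if missing(age)

* Dummy for an unranked qualitative variable (assumes women are coded "F")
generate female = (sex == "F") if !missing(sex)

* One dummy per region category, named region_1, region_2, ...
tabulate region, generate(region_)
```

Inside a regression command, factor-variable notation such as `i.region` has the same effect without creating new variables, so the explicit dummies are mainly useful for seeing what is going on.
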

## 10.1 Linear Model 

It is very important to understand the distinction between a model and ordinary least squares (OLS). To understand what a model is, suppose you were an omnipotent being who could generate any variable (e.g. earnings) in the world. How would you do it? Let's think of the ingredients that we would need to generate the earnings of every person in this world: 

- Age 
- Year (e.g. macroeconomic shocks in that particular year)
- Region (local determinants on earnings)
- Labor Market Experience
- Tenure at that particular firm
- Firm where that individual is working
- How productive the person is
- How passionate the person is about their particular job
- And many, many more!

We could generate log-earnings of worker $i$ at time $t$ as follows. 

\begin{align}
\log(earnings_{it}) &= \beta_0 + \beta_1 age_{it} + \gamma_A \mathbf{1}\{region_{it}=A\} + \dots + \gamma_E\mathbf{1}\{region_{it}=E\} +  \\
 & \quad \beta_2 tenure_{it} + \beta_3 exper_{it} + \dots 
\end{align}


<div class="alert alert-info">

**Note:** Why do we model log-earnings? The right-hand side can, in theory, be negative, and negative earnings do not make much sense. Negative log-earnings, however, simply mean that a person earns less than 1 dollar. 
</div>
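
To make the idea of a model as a data-generating process concrete, here is a minimal simulation sketch. The coefficient values, the sample size, and the variable names are all made up for illustration and do not come from `fake_data`. The `preserve` and `restore` commands set aside whatever dataset is in memory and bring it back at the end, so the lines are meant to be run together in one cell.

```stata
* Set the current dataset aside and start from an empty one
preserve
clear

* Make the random draws reproducible and create 1,000 hypothetical workers
set seed 12345
set obs 1000

* Hypothetical ingredients: integer ages 25 to 65, tenure 0 to 20,
* and an unobserved component u that stands in for everything else
generate age    = floor(25 + 41*runiform())
generate tenure = floor(21*runiform())
generate u      = rnormal(0, 0.5)

* The model: made-up coefficients generate log-earnings
generate logearnings = 9 + 0.02*age + 0.03*tenure + u

* Exponentiating gives earnings, which can never be negative
generate earnings = exp(logearnings)
assert earnings > 0

summarize logearnings earnings

* Bring the original dataset back
restore
```

The point of the exercise is that the model is the recipe on the `generate logearnings` line. OLS is only one way of trying to recover the coefficients of that recipe from data.
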




In [2]:
%browse 10

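| Unnamed: 0 | workerid | year | sex | birth_year | age | start_year | region | treated | earnings |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1999 | M | 1944 | 55 | 1997 | 1 | 0 | 39975.008 |
| 2 | 1 | 2001 | M | 1944 | 57 | 1997 | 1 | 0 | 278378.06 |
| 3 | 2 | 2001 | M | 1947 | 54 | 2001 | 4 | 0 | 18682.6 |
| 4 | 2 | 2002 | M | 1947 | 55 | 2001 | 4 | 0 | 293336.41 |
| 5 | 2 | 2003 | M | 1947 | 56 | 2001 | 4 | 0 | 111797.26 |
| 6 | 3 | 2005 | M | 1951 | 54 | 2005 | 5 | 0 | 88351.672 |
| 7 | 3 | 2010 | M | 1951 | 59 | 2005 | 5 | 0 | 46229.574 |
| 8 | 4 | 1997 | M | 1952 | 45 | 1997 | 5 | 1 | 24911.029 |
| 9 | 4 | 2001 | M | 1952 | 49 | 1997 | 5 | 1 | 9908.3623 |
| 10 | 5 | 2009 | M | 1954 | 55 | 1998 | 2 | 1 | 137207.34 |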


## 10.2 Univariate Regression

We'll begin with a simple analysis of the relationship between two variables. In this particular case, we might be interested in how the earnings profile looks across workers of different ages. 
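
As a sketch of what this looks like in practice, the lines below regress log-earnings on age. They assume `fake_data` is in memory with the variables `earnings` and `age`. The variable name `logearnings` is our own choice here, and `capture` simply skips the `generate` step if a variable with that name already exists.

```stata
* Create log-earnings (skipped silently if logearnings already exists)
capture generate logearnings = ln(earnings)

* Univariate OLS: log-earnings on a constant and age
regress logearnings age

* The estimated slope and intercept are stored after estimation
display "Slope on age: " _b[age]
display "Intercept:    " _b[_cons]
```

Because the dependent variable is in logs, the coefficient on `age` is read approximately as the proportional change in earnings associated with one more year of age.
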



<div class="alert alert-info">

**Note:** The $R^2$ measures the share of the variation in the dependent variable that is explained by the regressors. 
    
</div>
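
One way to see what the $R^2$ is made of is to rebuild it from the sums of squares that `regress` stores. This sketch again assumes `fake_data` is in memory and uses the illustrative variable name `logearnings`.

```stata
* Create log-earnings if needed and run the univariate regression
capture generate logearnings = ln(earnings)
regress logearnings age

* R-squared as reported by regress
display "Reported R-squared:         " e(r2)

* The same number rebuilt as model sum of squares over total sum of squares
display "Model SS / (Model + Resid): " e(mss) / (e(mss) + e(rss))
```

The two displayed numbers should agree up to rounding, since the total sum of squares is the model sum of squares plus the residual sum of squares.
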
