# ECON 490: Regression Analysis (12)

## Prerequisites 
---
1. Econometric approaches to linear regression taught in ECON 326.
2. Importing data into R.
3. Creating new variables in R.

## Learning objectives:
---

1. Implement the econometric theory for linear regressions learned in ECON 326.
2. Run simple univariate and multivariate regressions using the command `lm()`.
3. Understand the interpretation of the coefficients in linear regression output.
4. Consider the quality of control variables in a proposed model.


## 12.1 A Word of Caution Before We Begin

Before conducting a regression analysis, a great deal of work must go into understanding the data and investigating the theoretical relationships between variables. The biggest mistake that students make at this stage is not in how they run the regression analysis; it is failing to spend enough time preparing the data for analysis. Keep the following points in mind:
- A variable that is qualitative and not ranked cannot be used in an OLS regression without first creating a dummy variable (or a series of dummy variables). Examples of variables that must always be included as dummy variables are sex, race, religiosity, immigration status, and marital status. Examples of variables that are sometimes included as dummy variables are education, income and age. 
- You will want to take a good look at how your variables are coded before you begin running regressions and interpreting the results. Make sure that missing values are coded as `NA` and not as some numeric value (such as 99). Also, check that qualitative ranked variables are coded in the way you expect (e.g. higher education is coded with a larger number). If you do not do this, you could misinterpret your results.
- Some samples are not proper representations of the population and must be weighted accordingly (we will deal with this in depth later).
- You should always think about the theoretical relationship between your variables before you start your regression analysis: Does economic theory predict a linear relationship? Are the explanatory variables independent of each other, or is there possibly an interaction?
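
The coding checks in the list above can be sketched in R as follows. This is only an illustration: the data frame `survey`, its columns, and the use of 99 as a missing-value code are all made up for the example.

```R
# Hypothetical toy data: 99 stands in for a missing education code
survey <- data.frame(
  earnings  = c(42000, 38000, 51000, 47000, 39000),
  education = c(2, 99, 3, 1, 2),
  marital   = c("single", "married", "married", "divorced", "single")
)

# Recode the placeholder value to NA so R treats it as missing
survey$education[survey$education == 99] <- NA

# Inspect how each variable is coded before running any regression
summary(survey$education)
table(survey$marital, useNA = "ifany")

# Store the unranked qualitative variable as a factor;
# lm() then builds the series of dummy variables on its own
survey$marital <- factor(survey$marital)
levels(survey$marital)

# Alternatively, build a single dummy variable by hand
survey$married <- as.numeric(survey$marital == "married")
```

Whether to rely on a factor or on hand-made dummies is a matter of taste; the point is that the check happens before the regression, not after.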


## 12.2 Linear Regression Models 

Understanding how to run a well-structured OLS regression and how to interpret the results of that regression are the most important skills for undertaking empirical economic analysis. You have acquired a solid understanding of the theory behind the OLS regression in ECON 326; keep this in mind throughout your analysis. Here, we will cover the practical side of running regressions and, perhaps more importantly, how to interpret the results. 

An econometric model describes an equation (or set of equations) that imposes some structure on how the data was generated. The most natural way to describe statistical information is the mean. Therefore, we typically model the mean of a (dependent) variable and how it can depend on different factors (independent variables or covariates). The easiest way to describe a relationship between a dependent variable, y, and one or more independent variables, x, is linearly. 

Suppose we want to know what variables are needed to understand why and how earnings vary between each person in the world. What would be the measures needed to predict everyone's earnings?  

Some explanatory variables might be:
- Age 
- Year (e.g. macroeconomic shocks in that particular year)
- Region (local determinants on earnings)
- Hours worked
- Education 
- Labor Market Experience
- Industry / Occupation 
- Number of children
- Level of productivity
- Passion about their job
- etc., etc., there are so many!

For simplicity, let's assume we want to predict earnings but we only have access to datasets relating to people's age and earnings. If we want to generate a model that predicts the relationship between these two variables, we could create a linear model where the dependent variable (y) would be annual earnings, the independent variable (x) would be age, the slope (m) would be how much an extra year of age affects earnings, and the y-intercept (b) would be earnings when age is equal to 0. We would write this relationship as,

$$ y = b +mx.$$

We only have access to two variables, so we are unable to observe the rest of the variables (independent variables or covariates $X_{i}$) that might determine earnings. Even though we do not observe these variables, they still affect earnings, so our model above would have error: the observed values would diverge from the line. 

To capture this, we add an error term $u_i$ to the model. From here on we also measure earnings in logs. We write the model as follows, where $\beta_0$ is the y-intercept, $\beta_1$ is the slope, and $i$ indexes the worker observations in our data:

$$ logearnings_{i} =\beta_0 + \beta_1 age_{i}  + u_{i}. \tag{1}$$

It's important to understand what $\beta_0$ and $\beta_1$ stand for in the linear model. We said above that we typically model the mean of a (dependent) variable and how it can depend on different factors (independent variables or covariates). Therefore we are in fact modeling the expected value of *log earnings* conditional on the value of *age*. This is called the conditional expectation function or CEF. We assume that it takes the form of: 

$$  E[logearnings_{i}|age_{i}] =\beta_0 + \beta_1 age_i \tag{2} $$


How do equations (1) and (2) relate? If you take the expectation of equation (1) conditional on age, you will notice that 
$$E[age_{i}|age_{i}]=age_{i},$$ 
so equation (2) holds only if
$$E[u_{i}|age_{i}]=0.$$

If $age=0$ then, $\beta_1 \times age=0$ and $$ E[logearnings_{i}|age_{i}=0]=\beta_0 $$

If $age=1$ then, $\beta_1 \times age=\beta_1$ and $$ E[logearnings_{i}|age_{i}=1]=E[logearnings_{i}|age_{i}=0]+ \beta_1$$

Differencing the two equations above gives us the solution,

$$ E[logearnings_{i}|age_{i}=1]- E[logearnings_{i}|age_{i}=0]= \beta_1,$$ 

where $β_1$ is the difference in the expected value of *logearnings* when there is a one unit increase in *age*. If you choose any two values that differ by 1 unit you will also get $\beta_1$ as the solution (try it yourself!).

If we know the $\beta$s, we know a lot about the means of different sets of workers. For instance, we can compute the mean log-earnings of 18 year old workers: 

$$ E[logearnings_{i} \mid  age_{i}=18] = \beta_0 + \beta_1 \times 18  $$


This is the intuition that we should follow to interpret the coefficients! 
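
A small simulation can make this interpretation concrete. The sketch below invents "true" values for $\beta_0$ and $\beta_1$ and an error term with $E[u_i|age_i]=0$ (all of these numbers are assumptions chosen for illustration, not taken from any dataset), and then compares sample means with the values implied by the CEF.

```R
set.seed(490)

# Assumed "true" parameters, chosen only for this illustration
beta0 <- 10
beta1 <- 0.014

n   <- 200000
age <- sample(18:65, n, replace = TRUE)
u   <- rnorm(n, mean = 0, sd = 0.1)     # error term with E[u | age] = 0
logearnings <- beta0 + beta1 * age + u

# Sample mean of log earnings among 18-year-olds ...
mean_18 <- mean(logearnings[age == 18])
# ... versus the value implied by the CEF
cef_18  <- beta0 + beta1 * 18

# Difference in means between two adjacent ages, to compare with beta1
diff_19_18 <- mean(logearnings[age == 19]) - mean_18

c(mean_18 = mean_18, cef_18 = cef_18, diff_19_18 = diff_19_18)
```

With a large simulated sample, the mean for 18 year olds should sit close to $\beta_0 + 18\beta_1$, and the gap between adjacent ages should sit close to $\beta_1$.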

Consider a slightly more complicated example. 
    
Let's assume there are only two regions in this world: region **A** and region **B**. In this world, we'll make it such that workers in region **B** have log earnings that are $\beta_1$ higher (roughly $100 \times \beta_1$ percent higher earnings) than workers in region **A** on average. We are going to create a dummy variable called $region$ that takes the value of 1 if the worker's region is **B** and a value of 0 if the worker's region is **A**.

Furthermore, an extra year of age increases log earnings by $\beta_2$ on average, and we take the same approach with every explanatory variable on the list above. The empirical economist (us!) only observes a subset of all these variables, which we call the observables or covariates $X_{i}$. Let's suppose that the empirical economist only observes the region and age of the workers.

We could generate log-earnings of worker $i$ as follows.

\begin{align}
logearnings_{i} &=  \beta_1 \{region_{i}=1\} + \beta_2 age_{i} + \underbrace{ \beta_3 education_{i} + \beta_4 hours_{i} + \dots }_{\text{Unobservable, so we'll call this }u_{i}^*} \\
&= E[logearnings_{i} \mid region_{i}=0, age_{i}=0] + \beta_1 \{region_{i}=1\} + \beta_2 age_{i} + u_{i}^* - E[logearnings_{i} \mid region_{i}=0, age_{i}=0] \\
&= \beta_0 + \beta_1 \{region_{i}=1\} + \beta_2 age_{i}  + u_{i}
\end{align}

    
In the second line we did one of the most powerful tricks in all of mathematics: add and subtract the same term! The term we chose was the mean log earnings for those who are in region **A** and have age equal to zero, i.e., we "turn off" the effect of the covariates. This term is the interpretation of the constant in our linear model. The re-defined unobservable term is a deviation from that mean, which we expect to be zero on average. 


So far we have made an assumption at the population level. Remember that to know the CEF we need to know the true betas, which in turn depend on the joint distribution of the outcome ($Y_i$) and covariates ($X_i$). However, in practice, we are given a random sample where we can compute averages instead of expectations, and empirical distributions instead of the true distributions. We can use these in a formula (also known as an estimator!) to obtain a reasonable guess of the true $\beta$s. For a given sample, the numbers produced by the estimator or formula are known as estimates. One of the most powerful estimators out there is the Ordinary Least Squares Estimator (OLS).




## 12.3 Ordinary Least Squares

If we are given some dataset and have to find the unknown $\beta$s, the most common and powerful tool is OLS. Continuing with the example above, let the observations be indexed by $j=1,2,\dots, n$, and let $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2$ be the estimators of $\beta_0, \beta_1, \beta_2$. For any candidate values $b_0, b_1, b_2$ we can write a sample version of the population model: 

$$ logearnings_{j} = b_0 + b_1\{region_{j}=1\} + b_2 age_{j}  + \hat{u}_{j}, $$

where $\hat{u}_{j}$ is called a residual (the sample version of the true population error $u_j$, given the current candidate values). OLS chooses the $\hat{\beta}$s as the values of $b$ that minimize the sum of squared residuals. This is given by the following minimization problem:

$$ \min_{b} \frac{1}{n} \sum_{j=1}^n \hat{u}_{j}^2. $$

This expression can also be written as

$$ \min_{b} \frac{1}{n} \sum_{j=1}^n (logearnings_{j} - b_0 - b_1 \{region_{j}=1\} - b_2age_{j} )^2. $$

OLS minimizes the sum of squared residuals (the sample version of the error term) given our data. This minimization problem can be solved using calculus: we differentiate the objective with respect to each $b$ (using the chain rule) and set the derivatives equal to zero. The first order conditions are given by: 

\begin{align}
\frac{1}{n} \sum_{j=1}^n 1 \times \hat{u}_{j} &= 0  \\
\frac{1}{n} \sum_{j=1}^n age_j \times \hat{u}_{j} &= 0  \\
\frac{1}{n} \sum_{j=1}^n \{region_j = 1\} \times \hat{u}_{j} &= 0 
\end{align}

From these first order conditions we construct the most important restrictions for OLS: 

$$\frac{1}{n} \sum_{j=1}^n \hat{u}_j = \frac{1}{n} \sum_{j=1}^n \hat{u}_j \times  age_j=\frac{1}{n} \sum_{j=1}^n \hat{u}_j\times\{region_j = 1\}=0$$

In other words, by construction, the sample version of our error term will be uncorrelated with all the covariates. The constant term works the same way as including a variable equal to 1 in the regression (try it yourself!).
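
These restrictions can be checked numerically. The sketch below simulates a small dataset (the coefficients, sample size and variable names are assumptions made up for the example), runs `lm()`, and computes the three sample averages from the first order conditions.

```R
set.seed(1)

# Simulated data; all numbers here are invented for illustration
n      <- 500
age    <- sample(18:65, n, replace = TRUE)
region <- rbinom(n, size = 1, prob = 0.5)   # 1 = region B, 0 = region A
logearnings <- 10 + 0.05 * region + 0.014 * age + rnorm(n, sd = 0.3)

fit   <- lm(logearnings ~ region + age)
u_hat <- resid(fit)                         # the residuals

# Sample averages from the three first order conditions
foc <- c(
  constant = mean(1 * u_hat),
  age      = mean(age * u_hat),
  region   = mean(region * u_hat)
)
foc
```

All three averages should be zero up to floating-point rounding, whatever data we feed in.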

Notice that the formula for $\beta_0, \beta_1, \beta_2$ (the true values!) uses these same conditions, but with expectations in place of sample averages. Computing the true values directly is infeasible, since we argued before that we need to know the true joint distribution of the variables to compute such expectations. As a matter of fact, many useful estimators rely on this approach: replace an expectation by a sample average, which is called the sample analogue approach.
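
As a sketch of the sample analogue idea: solving the population conditions $E[X_i u_i]=0$ for the coefficients gives $\beta = E[X_i X_i']^{-1} E[X_i y_i]$, and replacing each expectation with a sample average gives the OLS estimates. The simulated data and all numbers below are assumptions for illustration only.

```R
set.seed(2)

# Simulated data; the "true" coefficients are invented for the example
n      <- 1000
age    <- sample(18:65, n, replace = TRUE)
region <- rbinom(n, size = 1, prob = 0.5)
logearnings <- 10 + 0.05 * region + 0.014 * age + rnorm(n, sd = 0.3)

# Covariate matrix with a column of ones for the constant
X <- cbind(1, region, age)

# Sample averages standing in for E[X X'] and E[X y]
Sxx <- crossprod(X) / n
Sxy <- crossprod(X, logearnings) / n

# Solve the sample moment conditions for b
b_manual <- drop(solve(Sxx, Sxy))

# Compare with the coefficients reported by lm()
b_lm <- coef(lm(logearnings ~ region + age))
rbind(manual = b_manual, lm = b_lm)
```

The two rows should agree up to rounding error; `lm()` uses a more numerically stable algorithm internally, but it solves the same problem.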


<div class="alert alert-info">


**Note:** Because this is an optimization problem, all of our variables must be numeric. If a variable is categorical we must be able to re-code it into a numerical variable. You will understand more about this after completing our next module. 
    
</div>



## 12.4 Ordinary Least Squares Regressions with R 

For this module we will be using the fake data dataset. Recall that this dataset simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings. 

In [1]:
#Clear the memory from any pre-existing objects
rm(list=ls())

# loading in our packages
library(tidyverse) #This includes ggplot2! 
library(haven)
library(IRdisplay)

#Open the dataset 
fake_data <- read_csv("../econ490-stata/fake_data.csv")  #change me!

# inspecting the data
glimpse(fake_data)



### 12.4.1 Univariate Regressions

To run a linear regression using OLS we use the command `lm()`. The basic syntax of the command is

```R
lm(data=dataset_name, dep_varname ~ indep_varnames)
```
You can consult the help file (run `?lm`) to see the different options that this command provides. 

Let's start by creating a new variable that is the natural log of earnings and then run our regression. 

In [2]:
fake_data <- fake_data %>%
        mutate(log_earnings = log(earnings)) #the log function

In [4]:
lm(data=fake_data, log_earnings ~ age)


Call:
lm(formula = log_earnings ~ age, data = fake_data)

Coefficients:
(Intercept)          age  
     10.014        0.014  


By default R includes a constant (which is usually what we want, since it ensures that the residuals are zero on average). The estimated coefficients are $\hat{\beta}_0 = 10.014$ and $\hat{\beta}_1 = 0.014$.  Notice that we included only one covariate here; this is known as a univariate (linear) regression. 

The interpretation in a univariate regression is fairly simple: $\hat{\beta}_1$ says that one extra year of age is associated with log earnings that are higher by $0.014$. In other words, one extra year of age corresponds to roughly 1.4 percent higher earnings. 
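
The percent reading of a coefficient in a log model is an approximation that works well when the coefficient is small. A quick sketch of the arithmetic, using the age coefficient reported in the text above (the variable names are just for this example):

```R
beta1_hat <- 0.014   # age coefficient reported in the text above

# Rule-of-thumb reading: 100 * coefficient, in percent
approx_pct <- 100 * beta1_hat

# Exact percent change implied by a change of beta1_hat in the log
exact_pct  <- 100 * (exp(beta1_hat) - 1)

c(approx = approx_pct, exact = exact_pct)
```

For coefficients this small the two numbers are nearly identical; for larger coefficients (say, above 0.1 in absolute value) the exact version is the safer one to report.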


### 12.4.2 Multivariate Regression

The command `lm()` also allows us to list multiple covariates. When we want to carry out a multivariate regression we write, 
```R
lm(data=dataset_name, dep_varname ~ indep_varname1 + indep_varname2 + ... )
```
and so on.

In [7]:
lm(data=fake_data, log_earnings ~ age + treated )


Call:
lm(formula = log_earnings ~ age + treated, data = fake_data)

Coefficients:
(Intercept)          age      treated  
  10.646445     0.006083    -0.817872  


How would we interpret the coefficient corresponding to being treated? Consider the following comparisons: 

- Mean log earnings of treated workers of 18 years old minus the mean log earnings of untreated workers of 18 years old = $\beta_2$. 
- Mean log earnings of treated workers of 20 years old minus the mean log earnings of untreated workers of 20 years old = $\beta_2$. 
- and so on. 


Therefore, the coefficient gives the difference in log earnings between treated and untreated workers *with the same other characteristics*. We economists usually refer to this as *ceteris paribus*.

The output above reports only the coefficients. Wrapping the regression in `summary()` prints a coefficient table: its second column shows the standard errors, and its third column shows the t-statistic for testing that the coefficient is equal to zero: 

$$ t = \frac{ \hat{\beta} - 0 }{StdErr} $$

If the t-statistic is greater than roughly 2 in absolute value, we reject the null hypothesis that there is no effect. This would mean that the data support the hypothesis that the variable in question has some effect on earnings at the 5% significance level (a confidence level of 95%). 

An alternative test can be performed using the p-value: if the p-value is less than 0.05 we reject the null hypothesis at the 5% significance level.
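
The following sketch shows where these numbers come from in R. It uses a small simulated data frame named `sim` (an assumption for this example, so that the code runs without the course dataset) and rebuilds the t-statistics and p-values by hand from the coefficient table returned by `summary()`.

```R
set.seed(3)

# Simulated stand-in data; the coefficients are invented for the example
n       <- 400
age     <- sample(18:65, n, replace = TRUE)
treated <- rbinom(n, size = 1, prob = 0.3)
log_earnings <- 10 + 0.01 * age - 0.5 * treated + rnorm(n, sd = 0.5)
sim <- data.frame(log_earnings, age, treated)

fit <- lm(log_earnings ~ age + treated, data = sim)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients
coef_table

# t-statistic for H0: coefficient = 0, computed by hand
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with n - k degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

cbind(t_manual, p_manual)
```

The hand-computed columns should match the `t value` and `Pr(>|t|)` columns of the table.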

<div class="alert alert-info">

**Note:** Without statistical significance we cannot reject the null hypothesis. This does not prove that the coefficient is zero; it only means that the data do not allow us to distinguish it from zero. 

</div>


### 12.4.3 Sample Weights
The data that is provided to us is often not statistically representative of the  population as a whole. This is because the agencies that collect data (like Statistics Canada) often decide to over-sample some segments of the population. They do this to ensure that there is a large enough sample size of subgroups of the population to conduct meaningful statistical analysis of those sub-populations. For example, the population of Indigenous identity in Canada accounts for approximately 5% of the total population. If we took a representative sample of 10,000 Canadians, there would only be 500 people who identified as Indigenous in the sample. 

This creates two problems. The first is that this is not a large enough sample to undertake any meaningful analysis of characteristics of the Indigenous population in Canada. The second is that when the sample is this small, it might be possible for researchers to identify individuals in the data. This would be extremely unethical, and Statistics Canada works hard to make sure that data remains anonymized. 

To resolve this issue, Statistics Canada over-samples people of Indigenous identity when they collect data. For example, they might survey 1000 people of Indigenous identity so that those people now account for 10% of observations in the sample. This would allow researchers who want to specifically look at the experiences of Indigenous people to conduct reliable research, and maintain the anonymity of the individuals represented by the data. 

When we use this whole sample of 10,000, however, the data is no longer nationally representative since it overstates the share of the population of Indigenous identity - 10% instead of 5%. This sounds like a complex problem to resolve, but the solution is provided by the statistical agency that created the data in the form of "sample weights" that can be used to recreate data that is nationally representative.

<div class="alert alert-info">

**Note**: Before applying any weights in your regression, it is important that you read the user guide that comes with your data to see how weights should be applied. There are several options for weights and you should never apply weights without first understanding the intentions of the authors of the data.
    
</div>
    
Our sample weights will be commonly coded as an additional variable in our data set such as *weight_pct*. To include the weights in regression analysis, we can simply include the following option immediately after our independent variable(s) in the `lm` function:
```R
    lm(data = data, y ~ x, weights = weight_pct)  
```
We can do that with the variable _sample_weight_ which is provided to us in the "fake_data" data set, re-running the regression of log earnings on age and treatment status from above.

In [None]:
lm(data = fake_data, log_earnings ~ age + treated, weights = sample_weight)

Often, after weighting our sample, the coefficients from our regression will change in magnitude. In these cases, there was some sub-sample of the population that was over-represented in the data and skewed the results of the unweighted regression.

Finally, while this section described the use of weighted regressions, it is important to know that there are many times we might want to apply weights to our sample that have nothing to do with running regressions. For example, if we wanted to calculate the mean of a variable using data from a skewed sample, we would want to make sure to use the weighted mean. While `mean` is used in R to calculate means, R also has an incredibly useful command called `weighted.mean` which directly weights observations to calculate the weighted mean. Many packages exist which can calculate the weighted form of numerous other summary statistics.
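
As a sketch of how weights undo over-sampling, the toy data below mirror the example above: one group makes up 10% of the sample but only 5% of the population. The data frame `toy`, the earnings values, and the way the weights are built are all assumptions for illustration; real survey weights come from the data provider.

```R
# Toy sample of 20 people: 2 from the over-sampled group (10% of the sample)
toy <- data.frame(
  indigenous = c(rep(1, 2), rep(0, 18)),
  earnings   = c(30, 34, rep(c(40, 44), 9))   # thousands of dollars, invented
)

# Weight = population share / sample share for each group
toy$weight <- ifelse(toy$indigenous == 1, 0.05 / 0.10, 0.95 / 0.90)

# The unweighted mean treats the sample as if it were representative
unweighted_mean <- mean(toy$earnings)

# The weighted mean restores the population shares
weighted_mean <- weighted.mean(toy$earnings, w = toy$weight)

c(unweighted = unweighted_mean, weighted = weighted_mean)
```

Here the group means are 32 and 42, so the population mean is $0.05 \times 32 + 0.95 \times 42 = 41.5$, which is what the weighted mean recovers; the unweighted mean of 41 gives the over-sampled group too much influence.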

## 12.5 What can we do with OLS? 

Notice that OLS gives us a linear approximation to the conditional mean of some dependent variable given some observables. We can use this information for prediction: if the observables were different, how would the expected value of the dependent variable differ? Another thing we could do with OLS is discuss causality: how does manipulating one variable impact the dependent variable on average?

To give a causal interpretation to our OLS estimates we require that in the population it holds that 
$\mathbf{E}[X_i u_i] = 0$, that is, the unobservables are uncorrelated with the independent variables of the equation (remember this is untestable because we cannot compute the expectations in practice!). If the unobservables are correlated with an independent variable, then the estimated coefficient on that variable may reflect a change in an unobservable rather than a change in the independent variable itself, making us unable to establish causality. This is also called an endogeneity problem. 

You might be tempted to think that we can test this using the sample version $\frac{1}{n} \sum_{j=1}^n  X_j \hat{u}_j = 0$, but notice that from the first order conditions this is true by construction! Using it as a test would be a circular argument: we impose this condition when we compute the solution to OLS.

For instance, if we want to interpret the previous regression as saying that the causal effect of being treated is equal to -0.82, it must be the case that treatment is not correlated (in the population sense) with the error term. However, it could be the case that treated workers are the ones that usually perform worse at their job, and that would invalidate a causal interpretation of our OLS estimates.
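
A simulation can illustrate both points at once. In the sketch below, an unobserved `ability` variable (a hypothetical stand-in for job performance) lowers the chance of treatment and raises earnings. Every number, including the "true" treatment effect of 0.10, is an assumption made up for the example.

```R
set.seed(4)

n       <- 5000
ability <- rnorm(n)                          # unobserved by the researcher
# Assumption: lower-ability workers are more likely to be treated
treated <- as.numeric(ability + rnorm(n) < 0)

true_effect  <- 0.10
log_earnings <- 10 + true_effect * treated + 0.5 * ability + rnorm(n, sd = 0.3)

fit_short <- lm(log_earnings ~ treated)            # ability left in the error term
fit_long  <- lm(log_earnings ~ treated + ability)  # infeasible in real data

# The sample moment is zero by construction, even though treatment is endogenous
sample_moment <- mean(resid(fit_short) * treated)

b_short <- unname(coef(fit_short)["treated"])
b_long  <- unname(coef(fit_long)["treated"])

c(sample_moment = sample_moment, b_short = b_short, b_long = b_long)
```

The short regression passes the sample "test" trivially, yet its treatment coefficient is far from 0.10 (it even has the wrong sign), while the regression that controls for ability lands close to the assumed effect.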

- Good Controls: To think about good controls we need to consider which *unobserved* determinants of the outcome are possibly correlated to our variable of interest.
    
- Bad Controls: It is bad practice to include variables that are themselves outcomes. For instance, consider studying the causal effect of college on earnings. If we include a covariate of working at a high paying job, then we're blocking part of the causal channel between college and earnings, i.e. you are more likely to have a nice job if you study more years!
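
The bad-control problem can also be seen in a simulation. In this sketch, college is assigned at random, raises the chance of a high-paying job, and affects earnings both directly and through that job. The variable names and every number are assumptions for illustration.

```R
set.seed(5)

n       <- 5000
college <- rbinom(n, size = 1, prob = 0.4)   # randomly assigned in this toy world

# Assumption: college raises the probability of a high-paying job by 0.5
high_pay_job <- rbinom(n, size = 1, prob = 0.2 + 0.5 * college)

# Direct effect of college: 0.10; effect through the job: 0.30 * 0.5 = 0.15
log_earnings <- 10 + 0.10 * college + 0.30 * high_pay_job + rnorm(n, sd = 0.3)

# Without the bad control we recover the total effect (about 0.25)
total_effect <- unname(coef(lm(log_earnings ~ college))["college"])

# With the bad control, the job channel is blocked (about 0.10 remains)
blocked_effect <- unname(coef(lm(log_earnings ~ college + high_pay_job))["college"])

c(total = total_effect, with_bad_control = blocked_effect)
```

Including the outcome-like variable does not produce a "cleaner" estimate; it removes part of the effect we set out to measure.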
    



## 12.6 Exercises
1. Run a regression with log_earnings as the dependent variable and age, treatment and sex as independent variables. Interpret the four coefficients.

In [None]:
fake_data <- fake_data %>% 
    mutate(female = case_when(
        sex == , 
        sex == )) 

fake_data <- fake_data %>% 
    mutate(log_earnings = ))

lm()

In [2]:
display_html('<iframe src="https://h5p.open.ubc.ca/wp-admin/admin-ajax.php?action=h5p_embed&id=1210" width="841" height="483" frameborder="0" allowfullscreen="allowfullscreen" title="module 12 q1"></iframe><script src="https://h5p.open.ubc.ca/wp-content/plugins/h5p/h5p-php-library/js/h5p-resizer.js" charset="UTF-8"></script>')

In [3]:
display_html('<iframe src="https://h5p.open.ubc.ca/wp-admin/admin-ajax.php?action=h5p_embed&id=1211" width="841" height="310" frameborder="0" allowfullscreen="allowfullscreen" title="module 12 q2"></iframe><script src="https://h5p.open.ubc.ca/wp-content/plugins/h5p/h5p-php-library/js/h5p-resizer.js" charset="UTF-8"></script>')

In [5]:
display_html('<iframe src="https://h5p.open.ubc.ca/wp-admin/admin-ajax.php?action=h5p_embed&id=1212" width="841" height="339" frameborder="0" allowfullscreen="allowfullscreen" title="module 12 q3"></iframe><script src="https://h5p.open.ubc.ca/wp-content/plugins/h5p/h5p-php-library/js/h5p-resizer.js" charset="UTF-8"></script>')

4. Run a multivariate regression that includes the variable start-year or its adjusted version and explain what its coefficient means.

In [None]:
lm()

In [6]:
display_html('<iframe src="https://h5p.open.ubc.ca/wp-admin/admin-ajax.php?action=h5p_embed&id=1213" width="841" height="358" frameborder="0" allowfullscreen="allowfullscreen" title="module 12 q4"></iframe><script src="https://h5p.open.ubc.ca/wp-content/plugins/h5p/h5p-php-library/js/h5p-resizer.js" charset="UTF-8"></script>')

## 12.7 Wrap up 


In this module we distinguished the following concepts: 

- Linear Model : an equation that describes how the outcome is generated, and depends on some coefficients $\beta$. 
- Ordinary Least Squares: a method to obtain a good approximation of the true $\beta$ of a linear model from a given sample. 

Therefore, notice that there is no such thing as an OLS model. We could apply a different method (estimator) to the same linear model. For example, consider minimizing the sum of the absolute values of the residuals: $$ \min_{b} \frac{1}{n} \sum_{j=1}^n | \hat{u}_j | $$ 

The model is still linear, but the solution to this problem is not the OLS estimate.
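
To see that the same linear model can be estimated in more than one way, the sketch below fits a line to simulated data by OLS and then by numerically minimizing the sum of absolute residuals with base R's `optim()`. The data are invented, and the general-purpose optimizer is only a rough illustration; dedicated quantile-regression routines would be the usual choice in practice.

```R
set.seed(6)

# Simulated data with heavy-tailed errors; all numbers are made up
n <- 300
x <- runif(n, min = 18, max = 65)
y <- 10 + 0.014 * x + 0.3 * rt(n, df = 3)

# Estimator 1: OLS (minimizes the sum of squared residuals)
ols <- lm(y ~ x)

# Estimator 2: minimize the sum of absolute residuals instead
sum_abs_resid <- function(b) sum(abs(y - b[1] - b[2] * x))
lad <- optim(par = coef(ols), fn = sum_abs_resid)   # Nelder-Mead, started at OLS

rbind(OLS = coef(ols), least_absolute = lad$par)

# Under the absolute-value criterion, OLS is generally not the minimizer
c(ols_value = sum_abs_resid(coef(ols)), lad_value = lad$value)
```

Both rows estimate the same $\beta_0$ and $\beta_1$; they differ because the estimators differ, not because the model does.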


We also learned how to interpret coefficients in any linear model. $\beta_0$ is the y-intercept of the line; therefore it is equal to $$ E[y_{i}|x_{i}=0]=\beta_0.$$ It is the expected value of y when x=0. Because we only have a sample, our estimate of it approximates the mean of y when x=0.


For any other coefficient, such as $\beta_1$, $\beta_2$ or $\beta_3$, 
$$ E[y_{i}|x_{i}=1]- E[y_{i}|x_{i}=0]= \beta, $$ that is, it is the difference in the expected value of y associated with a one-unit change in x. Therefore, the betas tell us the effect that a particular covariate has on y, ceteris paribus, making them values of great importance when we are developing our research project!
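
For the special case where x is a single dummy variable, this interpretation holds exactly in the sample, as the following sketch with simulated data shows (the numbers are assumptions for illustration):

```R
set.seed(7)

# Simulated data with one binary covariate; values are invented
n <- 200
x <- rbinom(n, size = 1, prob = 0.5)
y <- 2 + 0.5 * x + rnorm(n)

b <- coef(lm(y ~ x))

# Sample means of y within each group
mean_x0 <- mean(y[x == 0])
mean_x1 <- mean(y[x == 1])

# Intercept = mean of y when x = 0; slope = difference in group means
rbind(
  from_lm    = unname(b),
  from_means = c(mean_x0, mean_x1 - mean_x0)
)
```

With continuous or multiple covariates the match is no longer exact, because OLS fits a linear approximation to the conditional mean, but the reading of each coefficient is the same.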
