# ECON 490: Dummy Variables and Interactions (13)


## Prerequisites: 
---
1. Importing data into R.
2. Examining data using `glimpse()`.
3. Creating new variables in R.
4. Linear regression analysis. 

## Learning Outcomes:  
---
1. Understand when dummy variables are needed in analysis.
2. Code dummy variables from qualitative variables with two or more categories.
3. Interpret coefficients associated with a dummy variable from an OLS regression.
4. Interpret coefficients of an interaction between a numeric variable and a dummy variable from an OLS regression.

## 13.1 Introduction to Dummy Variables for Regression Analysis

You will remember dummy variables from when they were introduced in [Module 6](linked needed). There we discussed both how to interpret and how to generate these types of variables. If you have any uncertainty about what these variables measure, please make sure you review that module.

Here we will discuss including qualitative variables as explanatory variables in a linear regression model.

Imagine that we want to include a new explanatory variable in our multivariate regression from [Module 12](link needed) that indicates whether the individual represented by the observation was identified as female. To do this we will need to include a new dummy variable in our regression and then interpret the coefficient on that variable from the regression results.

For this module we will again be using the fake data set. Recall that this data simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

In [1]:
#Clear the memory from any pre-existing objects
rm(list=ls())

# loading in our packages
library(tidyverse) #This includes ggplot2! 
library(haven)

#Open the dataset 
fake_data <- read_csv("../econ490-stata/fake_data.csv")  #change me!

# inspecting the data
glimpse(fake_data)

-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.5     v dplyr   1.0.7
v tidyr   1.1.4     v stringr 1.4.0
v readr   2.0.2     v forcats 0.5.1

-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Rows: 2861772 Columns: 9

-- Column specification ------------------------------------------------------------------------------------------------
Delimiter: ","
chr (1): sex
dbl (

Rows: 2,861,772
Columns: 9
$ workerid   <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9,~
$ year       <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 2009,~
$ sex        <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",~
$ birth_year <dbl> 1944, 1944, 1947, 1947, 1947, 1951, 1951, 1952, 1952, 1954,~
$ age        <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, 49,~
$ start_year <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 1998,~
$ region     <dbl> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4, 2,~
$ treated    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ earnings   <dbl> 39975.010, 278378.100, 18682.600, 293336.400, 111797.300, 8~


In [Module 6]() we showed how to create new variables. Here, we are generating a new variable based on the values of the already existing variable `earnings`. 

In [2]:
fake_data <- fake_data %>%
        mutate(log_earnings = log(earnings)) #the log function

Let's take a look at the data. 

In [3]:
glimpse(fake_data)

Rows: 2,861,772
Columns: 10
$ workerid     <dbl> 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, ~
$ year         <dbl> 1999, 2001, 2001, 2002, 2003, 2005, 2010, 1997, 2001, 200~
$ sex          <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M~
$ birth_year   <dbl> 1944, 1944, 1947, 1947, 1947, 1951, 1951, 1952, 1952, 195~
$ age          <dbl> 55, 57, 54, 55, 56, 54, 59, 45, 49, 55, 57, 41, 45, 46, 4~
$ start_year   <dbl> 1997, 1997, 2001, 2001, 2001, 2005, 2005, 1997, 1997, 199~
$ region       <dbl> 1, 1, 4, 4, 4, 5, 5, 5, 5, 2, 2, 5, 5, 5, 5, 2, 2, 4, 4, ~
$ treated      <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, ~
$ earnings     <dbl> 39975.010, 278378.100, 18682.600, 293336.400, 111797.300,~
$ log_earnings <dbl> 10.596010, 12.536736, 9.835348, 12.589075

As expected, `log_earnings` is a quantitative variable showing the logarithm of each value of `earnings`. We also observe a variable named `sex`, but it doesn't seem to be coded as a numeric variable. Notice that next to `sex` it says `<chr>`.

This tells us that `sex` is a string variable, not a numeric one. A string cannot enter a regression as raw text; we have to create a new variable that indicates the sex of the individual represented by the observation.

A dummy variable is a numeric variable that takes either the value of 0 or 1 depending on a condition. A very simple way to create different categories for a variable in R is to use the `as.factor()` function.

In [5]:
as.factor(fake_data$sex)
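
To make the difference between a factor and an explicit 0/1 dummy concrete, here is a minimal sketch in base R. The data frame `toy` and the column names `sex_factor` and `female_dummy` are made up for illustration and are not part of the fake data set; the same expressions could be used inside `mutate()`.

```r
# Toy data standing in for fake_data (values are made up for illustration)
toy <- data.frame(sex = c("M", "F", "F", "M", "F"))

# as.factor() stores the categories as labelled levels, not as 0/1 numbers
toy$sex_factor <- as.factor(toy$sex)
levels(toy$sex_factor)

# An explicit dummy: 1 if the worker is recorded as female, 0 otherwise
toy$female_dummy <- as.numeric(toy$sex == "F")
toy
```

When a factor is used in `lm()`, R builds the corresponding 0/1 columns behind the scenes, which is why the regressions in this module can use the factor directly.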

## 13.2 Interpreting the Coefficient on a Dummy Variable

Whenever we include a dummy variable in a regression, we are making direct comparisons between the 1-category and the 0-category. In the case of this female dummy, we are directly comparing the mean log-earnings of female identified workers against the mean log-earnings of male identified workers.

Let's consider the regression below, 

In [6]:
lm(data=fake_data, log_earnings ~ as.factor(sex))


Call:
lm(formula = log_earnings ~ as.factor(sex), data = fake_data)

Coefficients:
    (Intercept)  as.factor(sex)M  
        10.1384           0.5467  


Notice that the regression by default used females as the reference category and only estimated a male premium. This happens because R orders factor levels alphabetically, so "F" comes before "M". Typically, we want it the other way around. To change the reference group, we write:

In [8]:
# Change reference level
fake_data = fake_data %>% mutate(female = relevel(as.factor(sex), "M"))

In [9]:
lm(data=fake_data, log_earnings ~ female)


Call:
lm(formula = log_earnings ~ female, data = fake_data)

Coefficients:
(Intercept)      femaleF  
    10.6851      -0.5467  


We remember from [Module 12]() that the intercept, reported by R as `(Intercept)`, is the constant $β_0$, and we know that here $β_0 = E[logearnings_{i}|female_{i}=0]$. Therefore, the results of this regression suggest that the average log-earnings of males are 10.69. We also know from [Module 12]() that

$$\beta_1 = E[logearnings_{i}|female_{i}=1]- E[logearnings_{i}|female_{i}=0].$$

The regression results here suggest that the log-earnings of female identified persons are on average 0.55 lower than those of male identified persons and, as a result, female identified persons have average log-earnings of 10.69 - 0.55 = 10.14.

In other words, the coefficient on the female variable shows the mean difference in log-earnings relative to males (the reference category, which plays the role of the 0-category). Thus, $\hat{β}_1$ also provides a measure of the raw gender gap.
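
The link between the regression coefficients and the group means can be checked directly. The sketch below uses a tiny made-up data frame (`toy`, with invented values) and a 0/1 `female` column; it is only an illustration of the identity, not a result from the fake data set.

```r
# Made-up log-earnings for three female (1) and three male (0) workers
toy <- data.frame(
  log_earnings = c(10.2, 10.8, 9.9, 10.6, 10.1, 11.0),
  female       = c(1, 0, 1, 0, 1, 0)
)

fit <- lm(log_earnings ~ female, data = toy)

# Group means computed by hand
mean_male   <- mean(toy$log_earnings[toy$female == 0])
mean_female <- mean(toy$log_earnings[toy$female == 1])

coef(fit)                              # intercept and coefficient on female
c(mean_male, mean_female - mean_male)  # the same two numbers
```

The intercept equals the mean for the 0-category, and the coefficient on the dummy equals the difference between the two group means.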

<div class="alert alert-info">


**Note:** We are only able to state this result because the p-values for both $\hat{β}_0$ and $\hat{β}_1$ are less than 0.05, so we can reject the null hypotheses that $β_0 = 0$ and $β_1 = 0$ at the 5% significance level.
    
</div>
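
Printing an `lm()` object shows only the point estimates; standard errors and p-values are reported by `summary()`. Here is a minimal sketch on simulated data, where the data frame `toy`, the seed, and the parameter values are all assumptions chosen for illustration.

```r
# Simulated data: 100 male (0) and 100 female (1) observations
set.seed(490)
n <- 200
toy <- data.frame(female = rep(c(0, 1), each = n / 2))
toy$log_earnings <- 10.7 - 0.5 * toy$female + rnorm(n, sd = 0.3)

fit <- lm(log_earnings ~ female, data = toy)

# Coefficient table with estimates, standard errors, t values and p-values
coef(summary(fit))
```

The last column of the table, `Pr(>|t|)`, holds the p-values for the null hypotheses that each coefficient equals zero.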


The interpretation remains the same once we control for more variables, although it now holds ceteris paribus, that is, holding constant the other observables in the regression.

In [10]:
lm(data=fake_data, log_earnings ~ female + age)


Call:
lm(formula = log_earnings ~ female + age, data = fake_data)

Coefficients:
(Intercept)      femaleF          age  
   10.29798     -0.51423      0.01082  


In this case, among people who are the same age, the gender gap is (not surprisingly) slightly smaller than in our previous regression: -0.51 compared with -0.55. That is expected, since previously we compared all females to all males irrespective of the age composition of those two categories of workers.

## 13.3 Dummy Variables with Multiple Categories

The logic of the previous section also applies when a qualitative variable has more than two categories, as with `region`.

In [11]:
lm(data=fake_data, log_earnings ~ as.factor(region))


Call:
lm(formula = log_earnings ~ as.factor(region), data = fake_data)

Coefficients:
       (Intercept)  as.factor(region)2  as.factor(region)3  as.factor(region)4  
         10.502641           -0.028816           -0.011934           -0.027173  
as.factor(region)5  
          0.009242  


Suppose we created a separate 0/1 dummy for each of the five regions (call them regdummy1 to regdummy5). The sum of the five dummies in any row would equal 1, because every worker is located in exactly one region. If we included all of the regional dummies in a regression along with the constant, we would create the problem of perfect multicollinearity: the full set of dummy variables adds up exactly to the constant term. Think about it this way: if a person is in region 1 (regdummy1 = 1), then we know that the person is not in region 2 (regdummy2 = 0). Therefore, being in region 1 predicts not being in region 2.

We must always exclude one of the dummies. Choose the dummy variable that you exclude carefully because this will be the category of reference. That is, we will be comparing means of any one region dummy variable to the excluded region. 
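
The need to exclude one dummy can be seen by building the full set of dummies and checking how many columns are linearly independent. This is a minimal sketch on a made-up `region` column; the object names `toy`, `all_dummies` and `X` are assumptions for illustration.

```r
# Made-up data: two workers in each of five regions
toy <- data.frame(region = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5))

# One 0/1 column per region (the "0 +" removes the intercept column)
all_dummies <- model.matrix(~ 0 + as.factor(region), data = toy)
rowSums(all_dummies)   # every row sums to 1

# Add the constant: six columns, but the dummies add up to the constant
X <- cbind(intercept = 1, all_dummies)
ncol(X)       # 6 columns
qr(X)$rank    # only 5 of them are linearly independent
```

Because one column is redundant, OLS cannot separately identify all six coefficients, which is why one category has to serve as the reference.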

We have actually already seen this approach in action in the regression we ran above; there we didn't add a separate dummy variable for "male". Instead, we essentially excluded the male dummy variable and interpreted the coefficient on "female" as the difference between female and male log-earnings. 

You may have noticed that R drops the first region dummy (region = 1) and includes dummy variables for the regions 2 - 5. 

We can use the same trick as the previous section to change the reference group!
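
As a sketch of how this might look for `region`, the example below uses simulated data and makes region 3 the reference category. The choice of region 3, the data frame `toy`, and the column name `region_f` are arbitrary assumptions for illustration.

```r
# Simulated data: 20 workers in each of five regions
set.seed(490)
toy <- data.frame(region = rep(1:5, times = 20))
toy$log_earnings <- 10.5 + 0.01 * toy$region + rnorm(100, sd = 0.2)

# Make region 3 the reference category instead of region 1
toy$region_f <- relevel(as.factor(toy$region), ref = "3")

fit <- lm(log_earnings ~ region_f, data = toy)
names(coef(fit))   # dummies for regions 1, 2, 4 and 5; region 3 is omitted
```

Each coefficient is now the difference in mean log-earnings between that region and region 3.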

## 13.4 Interactions

It is well documented that a wage gap exists between male and female workers. However, it is possible that the wage gap changes depending on the age of the workers. For example, female and male high school students tend to work minimum wage jobs; hence, we might believe that the wage gap between people aged 15-18 is very small. However, once people have the experience to start looking for better paying jobs, we might believe the wage gap starts to increase. This means that the effect of increasing age on earnings will also differ across the sexes. The way to capture that differential effect of age across males and females is to create a new variable that is the product of the female dummy and age.

Whenever we do this it is *very important* that we also include both the female dummy and age as control variables. 
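
In R's formula syntax, `a * b` is shorthand for `a + b + a:b`, so the main effects are included together with the interaction. The sketch below checks this on simulated data; the data frame `toy`, the seed, and the parameter values are made up for illustration.

```r
# Simulated workers with a sex factor (reference level "M") and an age
set.seed(490)
n <- 200
toy <- data.frame(
  female = factor(rep(c("M", "F"), each = n / 2), levels = c("M", "F")),
  age    = sample(18:65, n, replace = TRUE)
)
toy$log_earnings <- 10.2 + 0.013 * toy$age - 0.27 * (toy$female == "F") -
  0.007 * toy$age * (toy$female == "F") + rnorm(n, sd = 0.3)

# The shorthand and the fully written-out formula
fit_star <- lm(log_earnings ~ female * age, data = toy)
fit_long <- lm(log_earnings ~ female + age + female:age, data = toy)

all.equal(coef(fit_star), coef(fit_long))   # TRUE: the two models are identical
```

Writing only `female:age` would include the product term without the two main effects, which is the specification the paragraph above warns against.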


In [12]:
lm(data=fake_data, log_earnings ~ female * age )


Call:
lm(formula = log_earnings ~ female * age, data = fake_data)

Coefficients:
(Intercept)      femaleF          age  femaleF:age  
  10.213877    -0.266154     0.013171    -0.007353  


Notice that R has automatically included the female dummy and age as control variables. Because both the coefficient on `femaleF` and the interaction coefficient are negative, the estimates imply that people identified as female earn less on average than those identified as male at every age. We can also see that each additional year of age is associated with log-earnings that are higher by 0.013 for the reference category (males). This effect of age on log-earnings is lower for females by 0.007, so it seems that our theory is correct: the wage gap between males and females of the same age increases as they get older.
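
To see how the estimated gap changes with age, we can plug the coefficients from the output above into the fitted equation. In this sketch the ages 20, 40 and 60 are arbitrary illustrative choices, and the object names are assumptions.

```r
# Coefficients copied from the regression output above
b_female     <- -0.266154
b_age        <-  0.013171
b_female_age <- -0.007353

# Effect of one more year of age on log-earnings, by group
slope_male   <- b_age                  # 0.013171
slope_female <- b_age + b_female_age   # 0.005818

# Female-male gap in log-earnings at a given age
ages <- c(20, 40, 60)
gap  <- b_female + b_female_age * ages
data.frame(age = ages, gap = gap)
```

The gap becomes more negative at each successive age, which is the pattern described above.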

Try this yourself below with the set of region dummies we created above, and think about what these results mean!

## 13.5 Wrap up

There are very few empirical research projects using micro data that do not require researchers to use dummy variables. Important qualitative measures such as marital status, immigration status, occupation, industry, and race always require that we use dummy variables. Other important variables such as education, income, age, and number of children often call for dummy variables too, even though they can be measured numerically or as ranked categories. For example, we could have a variable that measures years of education that could be included as a continuous variable. However, you might instead want to include a variable that indicates whether the person has a university degree. If that is the case, you can use `as.factor()` to create a dummy variable that indicates that level of education.
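
As a sketch of the education example, suppose a data set had a column `educ_years` and we treated 16 or more years as indicating a university degree. Both the column name and the 16-year cutoff are assumptions made purely for illustration.

```r
# Hypothetical years of education for five people
toy <- data.frame(educ_years = c(10, 12, 16, 18, 14))

# Dummy: 1 if at least 16 years of education (assumed cutoff), 0 otherwise
toy$university <- as.numeric(toy$educ_years >= 16)
toy
```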

Even empirical research projects that use macro data sometimes require that we use dummy variables. For example, you might have a data set that measures macro variables for African countries and includes information about historical colonization. You might want to create dummy variables that indicate the origin of the colonizers, and then include them in your analysis to understand that effect. As another example, you might have a time series data set and want to indicate whether or not a specific policy was implemented in any one time period. You will need a dummy variable for that, and can include one in your analysis using the same process described above.
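
For the time series case, a policy dummy can be built from the period variable. In this sketch the data frame `toy`, the column name `post_policy`, and the policy start year of 2003 are hypothetical choices for illustration.

```r
# Hypothetical annual time series
toy <- data.frame(year = 2000:2006)

# Dummy: 1 in every year from the assumed policy start (2003) onward
toy$post_policy <- as.numeric(toy$year >= 2003)
toy
```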

Try this yourself with any data set that you have loaded into R. You will find that this approach is not complicated, but has the power to yield meaningful results!