# ECON 490: Dummy Variables and Interactions (13)


## Pre-requisites: 
---
1. Importing data into Stata.
2. Examining data using `browse` and `codebook`.
3. Creating new variables using the command `generate`.

## Learning Outcomes:  
---
1. Code a dummy variable from variables with two or more categories.
2. Interpret coefficients associated with dummy variables from an OLS regression.
3. Interpret coefficients of an interaction between a numeric variable and a dummy variable from an OLS regression.

## 3.1 Introduction to Dummy Variables for Regression Analysis

Dummy variables were introduced in [Module 6](linked needed), where we discussed how to interpret and generate those variables. Please make sure to review that module if you have any uncertainty about what these variables measure.

Here we will discuss including qualitative variables as explanatory variables in a regression model.

Using our multivariable regression from [Module 12](link needed), we will include a new explanatory variable that indicates whether the individual represented by the observation identified as female. To do this, we are going to include a dummy variable in our regression as a control variable and then interpret the regression coefficients.

For this module we will again be using the fake data data set. Recall that this data simulates information on workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

In [None]:
* Below you will need to include the path on your own computer to where the data is stored between the quotation marks.

clear *
cd "/Users/marinaadshade/Documents/TELF Project/raw"
use fake_data, clear

In [Module 6] we introduced the command `gen` (or `generate`). It is used to create new variables. Here, we are generating a new variable based on the values of the already existing variable `earnings`. 

In [None]:
gen logearn = log(earnings)
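One thing worth checking after a log transformation is whether any observations were lost. In Stata, `log()` is a synonym for `ln()` (the natural logarithm), and it returns a missing value when its argument is zero or negative. The cell below is a small sketch of that check; it assumes `earnings` and `logearn` exist as created above and makes no claim about what the counts will be in this data set.

In [None]:
* How many observations have zero or negative earnings?
count if earnings <= 0

* How many observations have a missing logearn (this also counts missing earnings)?
count if missing(logearn)

If the second count is not zero, those observations will be dropped from any regression that uses `logearn`.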

Let's take a look at the data.

In [None]:
%browse 10

As expected, `logearn` is a quantitative variable showing the logarithm of each value of `earnings`.

We observe a variable named `sex`, but it doesn't seem to be coded as a numeric variable. Let's take a closer look:

In [None]:
codebook sex

As expected, `sex` is a string variable, not a numeric one. We cannot use a string variable in a regression analysis; we have to create a new numeric variable that indicates the sex of the individual represented by the observation.

A dummy variable is a numeric variable that takes either the value of 0 or 1 depending on a condition. Thus, we want to create a variable that equals 1 whenever the worker identified as "female". We learned how to do this in [Module 6].

In [None]:
gen female = sex == "F"
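A quick way to confirm that the new dummy was coded as intended is to cross-tabulate it against the original string variable. The cell below is a minimal sketch that assumes `sex` and `female` exist as created above. Note that the expression `sex == "F"` evaluates to 0 for every observation where `sex` is anything other than "F", including an empty string, so it is worth counting empty values as well.

In [None]:
* Every observation with sex equal to "F" should fall in the female = 1 column
tab sex female, missing

* Observations with no recorded sex would have been coded as female = 0
count if sex == ""

If the count of empty strings is zero, the 0-category of `female` contains only workers whose recorded sex is something other than "F".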

## 13.1 Interpreting dummy variables

Whenever we include a dummy variable in a regression, we are making direct comparisons between the 1-category and the 0-category. In the case of this female dummy, we are directly comparing the mean log-earnings of female-identified workers against the mean log-earnings of male-identified workers.

Let's consider the regression below, 

In [None]:
reg logearn female

We remember from [Module 12]() that `_cons` is the constant $β_0$, and we know that here $β_0 = E[logearnings_{i}|female_{i}=0]$. Therefore, the results of this regression suggest that, on average, male-identified workers have log-earnings of 10.68. We also know from [Module 12]() that

$$\beta_1 = E[logearnings_{i}|female_{i}=1]- E[logearnings_{i}|female_{i}=0].$$

The regression results here suggest that female-identified persons have, on average, log-earnings that are 0.55 lower than those of male-identified persons and, as a result, female-identified persons have average log-earnings of 10.68 - 0.55 = 10.13.

In other words, the coefficient on the female variable shows the mean difference in log-earnings relative to males (the ones who were coded as 0). Thus, $\hat{β}_1$ also provides a measure of the raw gender gap.
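To see the link between the regression coefficients and the group means directly, one option is to compute the means by group and compare them with the stored coefficients. The cell below is a sketch under the assumption that `logearn` and `female` exist as created above. The last line converts the log-point gap into a percentage using $100 \times (e^{\hat{β}_1} - 1)$, which is a common way to read a coefficient when the dependent variable is in logs (strictly speaking, it compares geometric means of earnings).

In [None]:
* Re-run the regression quietly so that the stored coefficients are available
quietly reg logearn female

* Mean log-earnings in each category of the dummy
tabstat logearn, by(female) statistics(mean n)

* The constant is the mean for female == 0; constant plus slope is the mean for female == 1
display "Mean log-earnings, female == 0: " (_b[_cons])
display "Mean log-earnings, female == 1: " (_b[_cons] + _b[female])

* Percentage difference in earnings implied by the log-point gap
display "Implied percentage gap: " (100*(exp(_b[female]) - 1))

With a coefficient of about -0.55, the last expression comes to roughly -42 percent.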

<div class="alert alert-info">


**Note:** We are only able to state this result because the p-values for both $\hat{β}_0$ and $\hat{β}_1$ are less than 0.05, so we can reject the null hypotheses that $β_0 = 0$ and $β_1 = 0$ at the 5% significance level.
    
</div>


The interpretation remains the same once we control for more variables, except that it is now ceteris paribus: the comparison holds constant the other observables in the regression.

In [None]:
reg logearn female age 

In this case, among people who are the same age, the gender gap is (not surprisingly) slightly smaller than in our previous regression. That is expected, since previously we compared all females to all males irrespective of the composition of age groups in those two categories of workers.

## 13.2 Dummy variables with multiple categories

In this dataset we also have a region variable that has 5 different regions. As in [Module 6](), we can create dummies for each category using `tabulate`.

First, we `tabulate` the categorical variable we want to make into a set of dummy variables. Then we use the option `gen` to create five new dummy variables for the 5 regions represented in the data.

In [None]:
tab region, gen(regdummy)

In [None]:
%browse 10

Notice that the sum of the five dummies in any row is equal to 1. This is because every worker is located in exactly one region. If we included all of the regional dummies in a regression, we would create the problem of perfect multicollinearity: the full set of dummy variables always sums to 1, which is exactly the value of the constant term, so any one of the dummies can be written as a linear combination of the constant and the other dummies. Think about it this way - if a person is in region 1 (regdummy1 = 1) then we know that the person is not in region 2 (regdummy2 = 0). Therefore being in region 1 predicts not being in region 2.

We must always exclude one of the dummies. Choose the dummy variable that you exclude carefully because this will be the category of reference. That is, we will be comparing means of any one region dummy variable to the excluded region. 
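The cell below is a short sketch of both points: it checks that the dummies add up to 1 in every row, and then shows what happens if all five are included alongside the constant. It assumes `regdummy1` to `regdummy5` were created by the `tab region, gen(regdummy)` cell above; `dummysum` is a hypothetical helper variable introduced only for this check.

In [None]:
* Add up the five region dummies for each observation
egen dummysum = rowtotal(regdummy1-regdummy5)
tab dummysum

* Include all five dummies plus the constant: Stata detects the perfect
* collinearity, omits one of the dummies, and notes this in the output
reg logearn regdummy1-regdummy5

Because Stata chooses which dummy to omit in this situation, it is better to make the choice of reference category ourselves.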

We have actually already seen this approach in action in the regression we ran above; there we didn't add a separate dummy variable for "male". Instead, we essentially excluded the male dummy variable and interpreted the coefficient on "female" as the difference between female and male log-earnings. 

The easiest way to include multiple categories in a regression is to write the list of variables using the notation *i.variable*. Below you will see that Stata drops the first region dummy (region = 1) and includes dummy variables for the regions 2 - 5. 

In [None]:
reg logearn i.region
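For comparison, the same regression can be written with the dummies we created by hand. This is a sketch that assumes `regdummy1` to `regdummy5` exist from the `tab region, gen(regdummy)` cell above. Since `tabulate` numbers the new dummies in the sorted order of the values of `region`, leaving out `regdummy1` should correspond to the default reference category used by `i.region`.

In [None]:
* Leave out regdummy1 so that the first region is the reference category
reg logearn regdummy2 regdummy3 regdummy4 regdummy5

The coefficients should line up with those from `reg logearn i.region`; the factor-variable notation simply saves us from creating and listing the dummies ourselves.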

Most of the time you will want to control which dummy variable is selected as the reference. If that is the case, you will first have to set the reference category using the command `fvset base`.

In [None]:
fvset base 3 region 

When you run the regression below, the reference region is 3.

In [None]:
reg logearn i.region
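An alternative that does not change any stored settings is to name the base level inside the regression itself with the `ib#.` operator. The cell below is a sketch of that approach, followed by a joint test of the region coefficients and a reset of the `fvset` setting; it assumes the `fvset base 3 region` cell above has been run.

In [None]:
* ib3. makes region 3 the reference category for this command only
reg logearn ib3.region

* Joint test that all of the region coefficients are equal to zero
testparm i.region

* Remove the fvset setting so that i.region returns to its default base
fvset clear region

Using `ib3.region` keeps the choice of reference visible in the regression command, which can make a do-file easier to read later.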

## 13.3 Interactions 

It is an established fact that a wage gap exists between male and female workers. However, it is possible that the wage gap changes depending on the age of the workers. For example, female and male high school students tend to work minimum wage jobs; hence, we might believe that the wage gap among people aged 15-18 is very small. However, once people have the experience to start looking for better paying jobs, we might believe the wage gap starts to increase. This means that the effect of increasing age on earnings will also differ across the sexes. The way to capture that differential effect of age across males and females is to create a new variable that is the product of the female dummy and age.

Whenever we do this it is *very important* that we also include both the female dummy and age as control variables. 
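Built by hand, the product described above could look like the following sketch. It assumes `female`, `age` and `logearn` exist as in the earlier cells; `female_age` is a hypothetical name chosen for the new interaction variable.

In [None]:
* The interaction equals age for female workers and 0 for male workers
gen female_age = female * age

* Include the dummy, the continuous variable, and their product
reg logearn female age female_age

In this specification the coefficient on `age` is the slope for the reference category (female = 0), and the coefficient on `female_age` is the difference between the slope for female workers and that slope.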

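To run this with Stata's factor-variable notation, categorical variables must be preceded by `i.`, continuous variables must be preceded by `c.`, and terms are interacted with the `#` symbol. Writing `##` instead of `#` tells Stata to include the interaction together with each of the interacted variables on its own. For our example, we have the categorical variable `i.female` interacted with the continuous variable `c.age`, and the regression looks like this: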

In [None]:
reg logearn i.female##c.age

Notice that Stata has automatically included the female dummy and age as control variables. We can see that, on average, people who are identified as female earn less than those identified as male. We can also see that each additional year of age increases log-earnings by 0.013 for the reference category (males). This effect of age on log-earnings is lower for females by 0.007, which is consistent with our theory: the wage gap between males and females of the same age increases as they get older.
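With an interaction in the model, the gap between female and male workers at a given age is the coefficient on the female dummy plus that age multiplied by the coefficient on the interaction. The cell below sketches two ways to compute this after the regression above; the ages 20 to 50 are illustrative assumptions and may need adjusting to the range of `age` in the data.

In [None]:
* Re-run the interaction regression quietly so that its results are in memory
quietly reg logearn i.female##c.age

* Gap in log-earnings at age 20 and at age 50, with standard errors
lincom 1.female + 20*1.female#c.age
lincom 1.female + 50*1.female#c.age

* The same comparison at several ages at once
margins, dydx(female) at(age=(20 30 40 50))

If the interaction coefficient is negative, the estimated gap should become more negative as age increases.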

Try this yourself below with the set of region dummies we created above, and think about what these results mean!

## 13.4 Wrap up

There are very few empirical research projects using micro data that do not require researchers to use dummy variables. Important qualitative measures such as marital status, immigration status, occupation, industry, and race always require that we use dummy variables. Other important variables such as education, income, age and number of children often require us to use dummy variables, even though they can also be measured as continuous or ranked categorical variables. For example, we could have a variable that measures years of education that could be included as a continuous variable. However, you might instead want to include a variable that indicates whether the person has a university degree. If that is the case, you can use `generate` to create a dummy variable that indicates that level of education.

Even empirical research projects that use macro data sometimes require that we use dummy variables. For example, you might have a data set that measures macro variables for African countries and includes information about historic colonization. You might want to create dummy variables that indicate the origin of the colonizers, and then include them in your analysis to understand that effect. As another example, you might have a time series data set and want to indicate whether or not a specific policy was implemented in any one time period. You will need a dummy variable for that, and can include one in your analysis using the same process described above.

Try this yourself with any dataset that you have downloaded into Stata. You will find that this is not complicated, but it has the power to yield meaningful results!