# ECON 490: Dummy Variables and Interactions (13)

## Prerequisites 

1. Importing data into Stata.
2. Examining data using `browse` and `codebook`.
3. Creating new variables using the commands `generate` and `tabulate`.
4. Using globals in your analysis.
5. Understanding linear regression analysis. 

## Learning Outcomes  

1. Understand when a dummy variable is needed in analysis.
2. Create dummy variables from qualitative variables with two or more categories.
3. Interpret coefficients on a dummy variable from an OLS regression.
4. Interpret coefficients on an interaction between a numeric variable and a dummy variable from an OLS regression.

## 13.1 Introduction to Dummy Variables for Regression Analysis

You will remember dummy variables from when they were introduced in [Module 6](econometrics/econ490-stata/6_Creating_Variables.ipynb). There we discussed both how to interpret and how to generate this type of variable. If you have any uncertainty about what dummy variables measure, please make sure you review that module.

Here we will discuss including qualitative variables as explanatory variables in a linear regression model.

Imagine that we want to include a new explanatory variable in our multivariate regression from [Module 12](econometrics/econ490-stata/12_Linear_Reg.ipynb) that indicates whether an individual is identified as female. To do this we need to include a new dummy variable in our regression.

For this module we will again be using the `fake_data` data set. Recall that this data set simulates information for workers in the years 1982-2012 in a fake country where a training program was introduced in 2003 to boost their earnings.

In [2]:
** Below you will need to include the path on your own computer to where the data is stored between the quotation marks.

clear *
** cd " "
use fake_data,clear

In [Module 6](econometrics/econ490-stata/6_Creating_Variables.ipynb) we introduced the command `gen` (or `generate`). It is used to create new variables. Here, we are generating a new variable based on the values of the already existing variable _earnings_. 

In [3]:
gen logearnings = log(earnings)

Let's take a look at the data. 

In [4]:
%browse 10

Unnamed: 0,workerid,year,sex,age,start_year,region,treated,earnings,sample_weight,logearnings
1,1,1999,M,55,1997,1,0,39975.008,0.26076493,10.59601
2,1,2001,M,57,1997,1,0,278378.06,0.014273916,12.536736
3,2,2001,M,54,2001,4,0,18682.6,0.032186829,9.8353481
4,2,2002,M,55,2001,4,0,293336.41,0.47120222,12.589075
5,2,2003,M,56,2001,4,0,111797.26,0.70438099,11.624442
6,3,2005,M,54,2005,5,0,88351.672,0.35590065,11.38908
7,3,2010,M,59,2005,5,0,46229.574,0.8969152,10.741375
8,4,1997,M,45,1997,5,1,24911.029,0.39900845,10.123066
9,4,2001,M,49,1997,5,1,9908.3623,0.55194622,9.2011347
10,5,2009,M,55,1998,2,1,137207.34,0.014438981,11.829248


As expected, _logearnings_ is a quantitative variable showing the logarithm of each value of _earnings_. We observe a variable named _sex_, but it doesn't seem to be coded as a numeric variable. Let's take a closer look:

In [5]:
codebook sex


--------------------------------------------------------------------------------
sex                                                                          Sex
--------------------------------------------------------------------------------

                  Type: String (str1)

         Unique values: 2                         Missing "": 0/138,138

            Tabulation: Freq.  Value
                       30,519  "F"
                      107,619  "M"


As suspected, _sex_ is a string variable, not a numeric one. We cannot use a string variable in a regression analysis; we have to create a new variable which indicates the sex of the individual represented by the observation in numeric form.

A dummy variable is a numeric variable that takes either the value of 0 or 1 depending on a condition. In this case, we want to create a variable that equals 1 whenever a worker is identified as "female". We have seen how to do this in previous notebooks.

In [6]:
gen female = sex == "F"
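As a quick sanity check on the new dummy, here is a minimal sketch that assumes the cells above have been run. The name `female_check` is hypothetical and used only for illustration. The `if` qualifier leaves the dummy missing, rather than 0, for any observation where _sex_ is empty; the codebook output above shows no such observations, so here it should match `female` exactly.

```stata
* Sketch only: female_check is a hypothetical name, not a variable in fake_data
* The if-qualifier leaves the dummy missing (not 0) when sex is an empty string
gen byte female_check = (sex == "F") if !missing(sex)

* Cross-tabulate to confirm that "F" maps to 1 and "M" maps to 0
tabulate sex female_check, missing
```

The cross-tabulation should show every "F" observation in the column for 1 and every "M" observation in the column for 0.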

## 13.2 Interpreting the Coefficient on a Dummy Variable

Whenever we interpret the coefficient on a dummy variable in a regression, we are making a direct comparison between the 1-category and the 0-category for that dummy. In the case of this female dummy, we are directly comparing the mean log earnings of female identified workers against the mean log earnings of male identified workers.

Let's consider the regression below.

In [7]:
reg logearnings female


      Source |       SS           df       MS      Number of obs   =   138,138
-------------+----------------------------------   F(1, 138136)    =   5952.64
       Model |   7425.4984         1   7425.4984   Prob > F        =    0.0000
    Residual |  172314.796   138,136   1.2474286   R-squared       =    0.0413
-------------+----------------------------------   Adj R-squared   =    0.0413
       Total |  179740.295   138,137  1.30117416   Root MSE        =    1.1169

------------------------------------------------------------------------------
 logearnings | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |  -.5588429   .0072433   -77.15   0.000    -.5730396   -.5446463
       _cons |   10.80163   .0034046  3172.68   0.000     10.79496     10.8083
------------------------------------------------------------------------------


We remember from [Module 12](econometrics/econ490-stata/12_Linear_Reg.ipynb) that `_cons` is the constant $\beta_0$, and we know that here $\beta_0 = E[logearnings_{i}|female_{i}=0]$. Therefore, the results of this regression suggest that, on average, males have log earnings of 10.80. We also know from [Module 12](econometrics/econ490-stata/12_Linear_Reg.ipynb) that

$$
\beta_1 = E[logearnings_{i}|female_{i}=1]- E[logearnings_{i}|female_{i}=0].
$$

The regression results here suggest that female identified persons have, on average, log earnings that are 0.56 lower than those of male identified persons. As a result, female identified persons have average log earnings of 10.80 - 0.56 = 10.24.

In other words, the coefficient on the female variable shows the mean difference in log-earnings relative to males. $\hat{\beta}_1$ thus provides the measure of the raw gender gap.
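The link between the regression coefficients and the two group means can be checked directly. The sketch below assumes the `female` dummy and `logearnings` created earlier in this module; it is one possible check, not the only one.

```stata
* Group means of log earnings: 0 = male identified, 1 = female identified
tabstat logearnings, by(female) statistics(mean)

* Re-run the regression quietly and rebuild the same two means from its coefficients
quietly regress logearnings female
display "Mean for female == 0: " (_b[_cons])
display "Mean for female == 1: " (_b[_cons] + _b[female])
```

The two displayed values should coincide with the group means reported by `tabstat`.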

<div class="alert alert-info">


**Note:** We are only able to state this result because the p-values for both $\hat{\beta}_0$ and $\hat{\beta}_1$ are less than 0.05, allowing us to reject the null hypotheses that $\beta_0 = 0$ and $\beta_1 = 0$ at the 5% significance level.
    
</div>

The interpretation remains the same once we control for more variables, although it is ceteris paribus (holding constant) the other observables now also included in the regression. An example is below.

In [8]:
reg logearnings female age 


      Source |       SS           df       MS      Number of obs   =   138,138
-------------+----------------------------------   F(2, 138135)    =   3034.44
       Model |  7564.45279         2   3782.2264   Prob > F        =    0.0000
    Residual |  172175.842   138,135  1.24643169   R-squared       =    0.0421
-------------+----------------------------------   Adj R-squared   =    0.0421
       Total |  179740.295   138,137  1.30117416   Root MSE        =    1.1164

------------------------------------------------------------------------------
 logearnings | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      female |   -.542311   .0074077   -73.21   0.000      -.55683   -.5277919
         age |   .0043713    .000414    10.56   0.000     .0035598    .0051827
       _cons |   10.60057   .0193443   547.99   0.000     10.56266    10.63848
------------------------------------------------------------------------------

In this case, among people who are the same age, the gender gap is (not surprisingly) slightly smaller than in our previous regression: about 0.54 rather than 0.56. That is expected since previously we compared all females to all males irrespective of the composition of age groups in those two categories of workers. Once we control for age, this differential decreases.

## 13.3 Dummy Variables with Multiple Categories

In this data set we also have a region variable that has 5 different regions. As in [Module 6](econometrics/econ490-stata/6_Creating_Variables.ipynb), we can create dummies for each category using `tabulate`. 

First, we `tabulate` the categorical variable we want to make into a set of dummy variables. Then we use the option `gen` to create five new dummy variables for the 5 regions represented in the data.

In [9]:
tab region, gen(regdummy)


group(prov) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     54,364       39.35       39.35
          2 |     34,072       24.67       64.02
          3 |      6,216        4.50       68.52
          4 |     17,572       12.72       81.24
          5 |     25,914       18.76      100.00
------------+-----------------------------------
      Total |    138,138      100.00


In [10]:
%browse 10

Unnamed: 0,workerid,year,sex,age,start_year,region,treated,earnings,sample_weight,logearnings,female,regdummy1,regdummy2,regdummy3,regdummy4,regdummy5
1,1,1999,M,55,1997,1,0,39975.008,0.26076493,10.59601,0,1,0,0,0,0
2,1,2001,M,57,1997,1,0,278378.06,0.014273916,12.536736,0,1,0,0,0,0
3,2,2001,M,54,2001,4,0,18682.6,0.032186829,9.8353481,0,0,0,0,1,0
4,2,2002,M,55,2001,4,0,293336.41,0.47120222,12.589075,0,0,0,0,1,0
5,2,2003,M,56,2001,4,0,111797.26,0.70438099,11.624442,0,0,0,0,1,0
6,3,2005,M,54,2005,5,0,88351.672,0.35590065,11.38908,0,0,0,0,0,1
7,3,2010,M,59,2005,5,0,46229.574,0.8969152,10.741375,0,0,0,0,0,1
8,4,1997,M,45,1997,5,1,24911.029,0.39900845,10.123066,0,0,0,0,0,1
9,4,2001,M,49,1997,5,1,9908.3623,0.55194622,9.2011347,0,0,0,0,0,1
10,5,2009,M,55,1998,2,1,137207.34,0.014438981,11.829248,0,0,1,0,0,0


Notice that the sum of the five dummies in any row is equal to 1. This is because every worker is located in exactly one region. If we included all of the regional dummies in a regression, we would introduce the problem of perfect collinearity: the five dummy variables always add up to 1, which is exactly the value of the constant term, so any one of them can be written as a linear combination of the others and the constant. Think about it this way - if a person is not in regions 1 through 4 (regdummy1 = regdummy2 = regdummy3 = regdummy4 = 0), then we know for certain that the person is in region 5 (regdummy5 = 1). The fifth dummy adds no new information.

Whenever the regression includes a constant, we must exclude one of the dummies. Failing to do so means falling into the **dummy variable trap** of perfect collinearity described above. To avoid this, choose one region to serve as a base level for which you will not define a dummy. The category whose dummy you exclude will be the category of reference, or base level, when interpreting coefficients in the regression. That is, the coefficient on each region dummy variable will compare the mean log earnings of people in that region to the mean log earnings of people in the one region excluded.
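The collinearity can be seen in the data itself. The following sketch assumes the five `regdummy` variables created by `tabulate` above; `regsum` is a hypothetical helper variable introduced only for this check.

```stata
* Sketch: regsum is a hypothetical helper variable
* Every worker belongs to exactly one region, so the five dummies add up to 1
egen regsum = rowtotal(regdummy1-regdummy5)
assert regsum == 1

* Including all five dummies alongside the constant: Stata detects the perfect
* collinearity and omits one of them, flagging it in a note above the table
regress logearnings regdummy1-regdummy5
```

If the `assert` passes silently, the dummies sum to 1 in every row. In the regression output, one of the dummies should be reported as omitted rather than estimated.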

We have actually already seen this approach in action in the regression we ran above; there we didn't add a separate dummy variable for "male". Instead, we excluded the male dummy variable and interpreted the coefficient on "female" as the difference between female and male log-earnings.

The easiest way to include multiple categories in a regression is to write the list of variables using the notation `i.variable`. Below you will see that Stata drops the first region dummy (region = 1) and includes dummy variables for the regions 2 - 5. In this way, Stata automatically helps us avoid the dummy variable trap.

In [11]:
reg logearnings i.region


      Source |       SS           df       MS      Number of obs   =   138,138
-------------+----------------------------------   F(4, 138133)    =      9.84
       Model |  51.1939778         4  12.7984945   Prob > F        =    0.0000
    Residual |  179689.101   138,133  1.30084122   R-squared       =    0.0003
-------------+----------------------------------   Adj R-squared   =    0.0003
       Total |  179740.295   138,137  1.30117416   Root MSE        =    1.1405

------------------------------------------------------------------------------
 logearnings | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      region |
          2  |   -.033541   .0078808    -4.26   0.000    -.0489872   -.0180947
          3  |   .0365522   .0152709     2.39   0.017     .0066214    .0664829
          4  |  -.0312389   .0098974    -3.16   0.002    -.0506375   -.0118402
          5  |   .0045515   .008609

Often we will want to control which category is selected as the reference or base level. If that is the case, we first have to set the base level using the command `fvset base`. We do this below by setting the base level category to be region 3.

In [12]:
fvset base 3 region 

When you run the regression below, the reference is now region 3 and not region 1.

In [13]:
reg logearnings i.region


      Source |       SS           df       MS      Number of obs   =   138,138
-------------+----------------------------------   F(4, 138133)    =      9.84
       Model |  51.1939778         4  12.7984945   Prob > F        =    0.0000
    Residual |  179689.101   138,133  1.30084122   R-squared       =    0.0003
-------------+----------------------------------   Adj R-squared   =    0.0003
       Total |  179740.295   138,137  1.30117416   Root MSE        =    1.1405

------------------------------------------------------------------------------
 logearnings | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      region |
          1  |  -.0365522   .0152709    -2.39   0.017    -.0664829   -.0066214
          2  |  -.0700931   .0157306    -4.46   0.000    -.1009248   -.0392614
          4  |  -.0677911   .0168316    -4.03   0.000    -.1007806   -.0348015
          5  |  -.0320007   .016108
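As an alternative to `fvset base`, the base level can be chosen inside the regression command itself with the `ib#.` factor-variable operator. This is a minimal sketch of that option; it should give the same estimates as the regression above.

```stata
* ib3. makes region 3 the base level for this command only;
* unlike fvset base, it does not change the stored setting for later commands
regress logearnings ib3.region
```

Writing the base level in the command makes the choice visible to anyone reading the regression line, which can be convenient when several base levels are tried in one do-file.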

Of course, we could also create a new `global`, as we learned in [Module 4](econometrics/econ490-stata/4_Locals_and_Globals.ipynb), that lists all of the dummy variables except the one for the base level, and include that global in the regression. Here is an example of what that would look like with region 3 as the base level:

In [14]:
global regiondummies "regdummy1 regdummy2 regdummy4 regdummy5"
reg logearnings ${regiondummies}




      Source |       SS           df       MS      Number of obs   =   138,138
-------------+----------------------------------   F(4, 138133)    =      9.84
       Model |  51.1939778         4  12.7984945   Prob > F        =    0.0000
    Residual |  179689.101   138,133  1.30084122   R-squared       =    0.0003
-------------+----------------------------------   Adj R-squared   =    0.0003
       Total |  179740.295   138,137  1.30117416   Root MSE        =    1.1405

------------------------------------------------------------------------------
 logearnings | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   regdummy1 |  -.0365522   .0152709    -2.39   0.017    -.0664829   -.0066214
   regdummy2 |  -.0700931   .0157306    -4.46   0.000    -.1009248   -.0392614
   regdummy4 |  -.0677911   .0168316    -4.03   0.000    -.1007806   -.0348015
   regdummy5 |  -.0320007   .0161081    -1.99   

When interpreting the coefficients in the regression above, our intercept is again the mean log earnings among those for which all dummies in the regression are 0; here, that is the mean log earnings for all people in region 3. Each individual coefficient gives the difference in average log earnings between people in that region and people in region 3. For instance, the mean log earnings in region 1 are about 0.037 lower than in region 3 and the mean log earnings in region 2 are about 0.070 lower than in region 3. Both of these differences are statistically significant at the 5% level; the difference for region 2 is also significant at the 1% level, while the difference for region 1 (p = 0.017) is not.

It follows from this logic of interpretation that we can compare mean log earnings among non-reference groups. The mean log earnings in region 3 are given by the intercept coefficient, $\hat{\beta}_0$ (the `_cons` row of the output). Since the mean log earnings in region 1 are about 0.037 lower than this, they must be about $\hat{\beta}_0 - 0.037$. In region 2, the mean log earnings are similarly about $\hat{\beta}_0 - 0.070$. We can thus conclude that the mean log earnings in region 1 are about $(\hat{\beta}_0 - 0.037) - (\hat{\beta}_0 - 0.070) = 0.033$ higher than in region 2. In this way, we compared the levels of the dependent variable for 2 dummy variables, neither of which is the reference group excluded from the regression. We could have compared these groups much more quickly through their deviations from the base group. Region 1 has mean log earnings about 0.037 below the reference level, while region 2 has mean log earnings about 0.070 below this same reference level; thus, region 1 has mean log earnings about -0.037 - (-0.070) = 0.033 above region 2. Up to rounding, this agrees with the first regression, where region 1 was the base level and the coefficient on region 2 was about -0.034.
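Stata can also make this comparison between two non-reference regions directly, together with a standard error. The sketch below uses `lincom` after a regression with region 3 as the base level; it is one possible approach, shown under the assumption that the data from this module are loaded.

```stata
* Region 3 is the base level, so each coefficient is a difference from region 3
quietly regress logearnings ib3.region

* Difference in mean log earnings between region 1 and region 2,
* reported with its standard error and confidence interval
lincom 1.region - 2.region
```

The advantage over subtracting the coefficients by hand is that `lincom` also reports whether the difference between the two regions is statistically significant.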

## 13.4 Interactions 

It is an established fact that a wage gap exists between male and female workers. However, it is possible that the wage gap changes depending on the age of the workers. For example, female and male high school students tend to work minimum wage jobs; hence, we might believe that the wage gap between people within the 15-18 age bracket is very small. Conversely, once people have the experience to start looking for better paying jobs, we might believe the wage gap starts to increase, meaning that this gap might be much larger in higher age brackets. The way to capture this differential effect of age across males and females is to create a new variable that is the product of the female dummy and age.

Whenever we do this it is *very important* that we also include both the female dummy and age as control variables. 
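To make the construction concrete, here is a sketch of building the interaction by hand, assuming the `female` dummy created earlier in this module. The name `female_age` is hypothetical. The factor-variable notation introduced next does the same thing in one step.

```stata
* Sketch: female_age is a hypothetical name for the hand-built interaction term
gen female_age = female * age

* The dummy and age must both appear alongside their product
regress logearnings female age female_age
```

For male identified workers the product is always 0, so the coefficient on `female_age` measures how much the slope on age differs for female identified workers.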

To run this in Stata, categorical variables must be preceded by an `i.`, continuous variables must be preceded by a `c.`, and terms are interacted with the `##` operator, which includes each variable on its own as well as their product (a single `#` would include only the product). For our example, we have the categorical variable `i.female` interacted with the continuous variable `c.age`, and the regression looks like this:

In [15]:
reg logearnings i.female##c.age


      Source |       SS           df       MS      Number of obs   =   138,138
-------------+----------------------------------   F(3, 138134)    =   2083.15
       Model |  7779.82557         3  2593.27519   Prob > F        =    0.0000
    Residual |  171960.469   138,134  1.24488156   R-squared       =    0.0433
-------------+----------------------------------   Adj R-squared   =    0.0433
       Total |  179740.295   138,137  1.30117416   Root MSE        =    1.1157

------------------------------------------------------------------------------
 logearnings | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    1.female |  -1.127321   .0450886   -25.00   0.000    -1.215694   -1.038948
         age |   .0016534   .0004625     3.57   0.000     .0007469    .0025598
             |
female#c.age |
          1  |   .0136147   .0010351    13.15   0.000      .011586    .0156435
             |
     

Notice that the `##` operator automatically includes the female dummy and age as separate controls alongside their interaction. Because age now enters the interaction, the coefficient on `1.female` (about -1.13) is the estimated gap in log earnings between female and male identified workers at an age of zero, which lies outside the data and is therefore not meaningful on its own. We can also see that each additional year of age is associated with log earnings that are about 0.0017 higher for the reference category (males). This effect of age on log earnings is higher for females by about 0.0136, meaning that an extra year of age increases log earnings for women by about 0.0017 + 0.0136 = 0.0153. In this data set, then, our conjecture is not supported: the estimated gap between males and females of the same age narrows, rather than widens, as they get older. For a man and a woman who are both 20, the estimated gap is about -1.127 + 0.0136 × 20 = -0.86. If they are both 50, it is about -1.127 + 0.0136 × 50 = -0.45. We can also see from the statistical significance of the coefficient on our interaction term that it was worth including!
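One way to read an interaction model is to evaluate the implied slope and gap at specific ages instead of reading single coefficients in isolation. The sketch below uses `lincom` after the regression above; the ages 20 and 50 are simply the illustrative values used in the text.

```stata
quietly regress logearnings i.female##c.age

* Slope of log earnings in age for female identified workers
lincom age + 1.female#c.age

* Estimated female-male gap in log earnings at ages 20 and 50
lincom 1.female + 20*1.female#c.age
lincom 1.female + 50*1.female#c.age
```

Comparing the last two estimates shows how the gap changes between the two ages, with a confidence interval for each.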

Try this yourself below with the set of region dummies we created above. Think about what these results mean.

In [16]:
reg logearnings i.female##i.region


      Source |       SS           df       MS      Number of obs   =   138,138
-------------+----------------------------------   F(9, 138128)    =    666.78
       Model |  7483.78659         9  831.531844   Prob > F        =    0.0000
    Residual |  172256.508   138,128  1.24707886   R-squared       =    0.0416
-------------+----------------------------------   Adj R-squared   =    0.0416
       Total |  179740.295   138,137  1.30117416   Root MSE        =    1.1167

-------------------------------------------------------------------------------
  logearnings | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
     1.female |  -.6356915     .03537   -17.97   0.000     -.705016    -.566367
              |
       region |
           1  |  -.0559027   .0167259    -3.34   0.001     -.088685   -.0231203
           2  |  -.0666064   .0172797    -3.85   0.000    -.1004743   -.0327386
           4

## 13.5 Wrap Up

There are very few empirical research projects using micro data that do not require researchers to use dummy variables. Important qualitative measures such as marital status, immigration status, occupation, industry, and race always require that we use dummy variables. Other important variables such as education, income, age and number of children often require us to use dummy variables as well, even though they are measured numerically or as ranked categories. For example, we could have a variable that measures years of education, which could be included as a continuous variable. However, you might instead want to include a variable that indicates whether the person has a university degree. If that is the case, you can use `generate` to create a dummy variable indicating that specific level of education.
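The same idea can be sketched with a numeric variable that is in our data. The example below turns _age_ into a dummy; the cutoff of 50 and the name `age50plus` are assumptions chosen only for illustration.

```stata
* Sketch: age50plus is a hypothetical name and 50 is an arbitrary cutoff
* Stata treats a missing number as larger than any value, so without the
* if-qualifier a missing age would wrongly be coded as 1
gen byte age50plus = (age >= 50) if !missing(age)

* Check the split between the two groups
tabulate age50plus, missing
```

A dummy for holding a university degree would follow the same pattern, with the education variable and the relevant threshold in place of _age_ and 50.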

Even empirical research projects that use macro data sometimes require that we use dummy variables. For example, you might have a data set that measures macro variables for African countries with additional information about historic colonization. You might want to create a dummy variable that indicates the origin of the colonizers, and then include that in your analysis to estimate its effect. As another example, you might have a time series data set and want to indicate whether or not a specific policy was implemented in any one time period. You will need a dummy variable for that, and can include one in your analysis using the same process described above. Finally, you can use interaction terms to capture the effect of one variable on another if you believe that it varies between groups. If the coefficient on this interaction term is statistically significant, it can justify this term's inclusion in your regression. Keep in mind that including an interaction changes how the other coefficients in the regression are interpreted.
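A policy-period dummy of the kind mentioned above can be sketched with our own data, where the training program was introduced in 2003. The name `post2003` is hypothetical, and treating 2003 itself as a policy year is an assumption.

```stata
* Sketch: post2003 is a hypothetical name; 2003 is counted as a policy year
gen byte post2003 = (year >= 2003) if !missing(year)

* Confirm that each year falls on the intended side of the cutoff
tabulate year post2003
```

The resulting variable can then be added to a regression like any other dummy in this module.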

Try this yourself with any data set that you have downloaded in Stata. You will find that this approach is not complicated, but has the power to yield meaningful results!

## References

[Use factor variables in Stata to estimate interactions between two categorical variables](https://www.youtube.com/watch?v=f-tLLX8v11c)