# ECON 490: Dummy Variables and Interactions (13)


## Pre-requisites: 
---
- Know what a linear model is.
- Interpret coefficients from OLS output.

## Learning Outcomes:  
---
By the end of this module you will be able to:
- Understand how to code a dummy variable.
- Interpret the coefficient associated with a dummy variable from OLS output.
- Interpret the coefficient of an interaction between a variable and a dummy variable from OLS output.


Dummy variables are essential when you want to include qualitative variables as explanatory variables in your regression model.

Using our multivariable regression from the previous module, we can add an explanatory variable that controls for whether the individual represented by an observation identifies as female. To do this, we will include a dummy variable in our regression as a control and then interpret the regression coefficients.


In [1]:
clear*
use fake_data,clear

In [2]:
cap drop logearn 
gen logearn = log(earnings)

In [3]:
%browse 10

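|  | workerid | year | sex | birth_year | age | start_year | region | treated | earnings | logearn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1999 | M | 1944 | 55 | 1997 | 1 | 0 | 39975.008 | 10.59601 |
| 2 | 1 | 2001 | M | 1944 | 57 | 1997 | 1 | 0 | 278378.06 | 12.536736 |
| 3 | 2 | 2001 | M | 1947 | 54 | 2001 | 4 | 0 | 18682.6 | 9.8353481 |
| 4 | 2 | 2002 | M | 1947 | 55 | 2001 | 4 | 0 | 293336.41 | 12.589075 |
| 5 | 2 | 2003 | M | 1947 | 56 | 2001 | 4 | 0 | 111797.26 | 11.624442 |
| 6 | 3 | 2005 | M | 1951 | 54 | 2005 | 5 | 0 | 88351.672 | 11.38908 |
| 7 | 3 | 2010 | M | 1951 | 59 | 2005 | 5 | 0 | 46229.574 | 10.741375 |
| 8 | 4 | 1997 | M | 1952 | 45 | 1997 | 5 | 1 | 24911.029 | 10.123066 |
| 9 | 4 | 2001 | M | 1952 | 49 | 1997 | 5 | 1 | 9908.3623 | 9.2011347 |
| 10 | 5 | 2009 | M | 1954 | 55 | 1998 | 2 | 1 | 137207.34 | 11.829248 |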


We observe a variable named *sex*, but it doesn't seem to be coded as a numeric variable. 

In [4]:
codebook sex


--------------------------------------------------------------------------------
sex                                                                          Sex
--------------------------------------------------------------------------------

                  type:  string (str1)

         unique values:  2                        missing "":  0/2,861,772

            tabulation:  Freq.  Value
                     1,018,860  "F"
                     1,842,912  "M"


As expected, this is a string variable, not a numeric one. A dummy variable takes the value 0 or 1 depending on whether a condition holds. In this case, we want to create a variable that equals 1 whenever the worker is female and 0 otherwise.

In [5]:
cap drop female
gen female = sex == "F"
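
It is worth checking a new dummy against the variable it was built from. The sketch below uses a small made-up dataset rather than `fake_data`, and it assumes one record has an empty `sex` value to show a caveat: a logical expression such as `sex == "F"` evaluates to 0 for empty strings, so an `if` qualifier is needed if those records should stay missing. In `fake_data` the codebook reports no missing values, so the simpler command above is fine there.

```stata
* Toy data standing in for fake_data (values are made up for illustration)
clear
set obs 5
gen str1 sex = "M"
replace sex = "F" in 1/2
replace sex = ""  in 5      // pretend one record has no recorded sex

* A logical expression evaluates to 1 (true) or 0 (false);
* the if-qualifier leaves the dummy missing where sex is empty
gen female = (sex == "F") if sex != ""

* Cross-check the dummy against the original string variable
tab sex female, missing
assert inlist(female, 0, 1) if sex != ""
assert missing(female) if sex == ""
```

The `assert` lines stop execution with an error if the dummy is not coded as intended, which is a cheap safeguard before running any regression.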

## 13.1 Why is a dummy just 0 and 1? Could it be 1 and 2?

Whenever we code a dummy variable as 0 and 1, the regression compares the 1-category against the 0-category. In the case of the female dummy, we compare the mean outcome of female workers against the mean outcome of male workers, which is exactly the comparison we want. For instance, consider a regression of log earnings on only the female dummy.


In [6]:
reg logearn female


      Source |       SS           df       MS      Number of obs   = 2,861,772
-------------+----------------------------------   F(1, 2861770)   >  99999.00
       Model |  196072.692         1  196072.692   Prob > F        =    0.0000
    Residual |  3762988.11 2,861,770  1.31491633   R-squared       =    0.0495
-------------+----------------------------------   Adj R-squared   =    0.0495
       Total |   3959060.8 2,861,771  1.38343033   Root MSE        =    1.1467

------------------------------------------------------------------------------
     logearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   -.546659   .0014157  -386.15   0.000    -.5494336   -.5438844
       _cons |    10.6851   .0008447  1.3e+04   0.000     10.68344    10.68675
------------------------------------------------------------------------------


This shows that males have mean log earnings of about 10.69, whereas the mean for females is about 0.55 lower. In other words, the coefficient on the female variable is the difference in mean log earnings relative to males (the group coded as 0). It also provides a measure of the raw gender gap.
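
The link between these coefficients and the two group means can be written out explicitly. As a sketch, write $D_i$ for the female dummy; the symbols $D_i$, $\hat\beta_0$, $\hat\beta_1$ and $\tilde D_i$ are notation introduced here, not part of the Stata output. With a single dummy regressor, the OLS fitted values are the sample means of the two groups:

$$
\begin{aligned}
\widehat{\text{logearn}}_i &= \hat\beta_0 + \hat\beta_1 D_i,\\
\text{mean for males } (D_i = 0) &= \hat\beta_0 \approx 10.685,\\
\text{mean for females } (D_i = 1) &= \hat\beta_0 + \hat\beta_1 \approx 10.685 - 0.547 = 10.138.
\end{aligned}
$$

If the dummy were instead coded as 1 and 2, say $\tilde D_i = D_i + 1$, the same fitted line would read

$$
\widehat{\text{logearn}}_i = (\hat\beta_0 - \hat\beta_1) + \hat\beta_1 \tilde D_i .
$$

The slope would still be the female-male gap, but the intercept, about $10.685 + 0.547 = 11.232$, would no longer be the mean of either group. That is the practical reason for the 0/1 coding.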

The interpretation remains the same once we control for more variables, except that the comparison now holds the other observables fixed (ceteris paribus).

In [7]:
reg logearn female age 


      Source |       SS           df       MS      Number of obs   = 2,861,772
-------------+----------------------------------   F(2, 2861769)   =  88924.98
       Model |  231647.054         2  115823.527   Prob > F        =    0.0000
    Residual |  3727413.75 2,861,769  1.30248589   R-squared       =    0.0585
-------------+----------------------------------   Adj R-squared   =    0.0585
       Total |   3959060.8 2,861,771  1.38343033   Root MSE        =    1.1413

------------------------------------------------------------------------------
     logearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -.5142334   .0014225  -361.49   0.000    -.5170215   -.5114452
         age |   .0108197   .0000655   165.27   0.000     .0106914    .0109481
       _cons |   10.29799   .0024887  4137.96   0.000     10.29311    10.30286
------------------------------------------------------------------------------

In this case, among people of the same age, the gender gap is (not surprisingly) slightly smaller in magnitude than before: about 0.51 rather than 0.55. The earlier regression compared all males with all females regardless of the age composition of each group.

## 13.2 Dummy variables with multiple categories

In this dataset we also have a region variable with 5 categories. We can create a dummy corresponding to each category.

In [9]:
tab region, gen(regdummy)


group(prov) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |    660,053       23.06       23.06
          2 |    575,994       20.13       43.19
          3 |    516,046       18.03       61.22
          4 |    612,871       21.42       82.64
          5 |    496,808       17.36      100.00
------------+-----------------------------------
      Total |  2,861,772      100.00


In [10]:
%browse 10

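|  | workerid | year | sex | birth_year | age | start_year | region | treated | earnings | logearn | female | regdummy1 | regdummy2 | regdummy3 | regdummy4 | regdummy5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1999 | M | 1944 | 55 | 1997 | 1 | 0 | 39975.008 | 10.59601 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 2001 | M | 1944 | 57 | 1997 | 1 | 0 | 278378.06 | 12.536736 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 2 | 2001 | M | 1947 | 54 | 2001 | 4 | 0 | 18682.6 | 9.8353481 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 2 | 2002 | M | 1947 | 55 | 2001 | 4 | 0 | 293336.41 | 12.589075 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 2 | 2003 | M | 1947 | 56 | 2001 | 4 | 0 | 111797.26 | 11.624442 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 3 | 2005 | M | 1951 | 54 | 2005 | 5 | 0 | 88351.672 | 11.38908 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 3 | 2010 | M | 1951 | 59 | 2005 | 5 | 0 | 46229.574 | 10.741375 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 4 | 1997 | M | 1952 | 45 | 1997 | 5 | 1 | 24911.029 | 10.123066 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9 | 4 | 2001 | M | 1952 | 49 | 1997 | 5 | 1 | 9908.3623 | 9.2011347 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 5 | 2009 | M | 1954 | 55 | 1998 | 2 | 1 | 137207.34 | 11.829248 | 0 | 0 | 1 | 0 | 0 | 0 |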


Notice that the 5 dummies in each row sum to 1. This is because every worker belongs to exactly one of those regions. As a consequence, the five dummies together are exactly collinear with the intercept in the linear model. Therefore, we typically exclude one of the dummies, and the excluded category becomes the reference category. That is, we compare means relative to the excluded region.
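
The collinearity can be checked directly. The following is a minimal sketch on made-up data (the variables `y` and `dummysum` and all values are hypothetical, chosen only so the example runs on its own): it confirms that the dummies sum to 1 in every row and then includes all five of them alongside the constant to see how Stata reacts.

```stata
* Toy data: 10 observations spread evenly over 5 regions (made-up values)
clear
set obs 10
gen region = mod(_n, 5) + 1
gen y = region + mod(_n, 3)    // arbitrary outcome, for illustration only

* One dummy per region, as in the text
tab region, gen(regdummy)

* Each observation belongs to exactly one region, so the dummies sum to 1
egen dummysum = rowtotal(regdummy1-regdummy5)
assert dummysum == 1

* All five dummies plus the constant are perfectly collinear,
* so Stata omits one of the dummies and prints a note saying so
reg y regdummy1-regdummy5
```

Only four region coefficients are estimated, which is the same outcome as leaving one dummy out by hand.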

A convenient way to include multiple categories in a regression is to use Stata's factor-variable notation and write the variable as *i.variable*.

In [11]:
reg logearn i.region


      Source |       SS           df       MS      Number of obs   = 2,861,772
-------------+----------------------------------   F(4, 2861767)   =    112.62
       Model |  623.109096         4  155.777274   Prob > F        =    0.0000
    Residual |  3958437.69 2,861,767  1.38321453   R-squared       =    0.0002
-------------+----------------------------------   Adj R-squared   =    0.0002
       Total |   3959060.8 2,861,771  1.38343033   Root MSE        =    1.1761

------------------------------------------------------------------------------
     logearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      region |
          2  |  -.0288158   .0021206   -13.59   0.000    -.0329721   -.0246594
          3  |   -.011934   .0021854    -5.46   0.000    -.0162173   -.0076507
          4  |  -.0271732   .0020863   -13.02   0.000    -.0312623   -.0230842
          5  |   .0092417    .00220

Because including all five categories alongside the intercept would be perfectly collinear, Stata automatically excludes one of them. By default it omits the lowest value, here region 1. We can also change which category is used as the reference.

In [14]:
fvset base 3 region 

In [15]:
reg logearn i.region


      Source |       SS           df       MS      Number of obs   = 2,861,772
-------------+----------------------------------   F(4, 2861767)   =    112.62
       Model |  623.109096         4  155.777274   Prob > F        =    0.0000
    Residual |  3958437.69 2,861,767  1.38321453   R-squared       =    0.0002
-------------+----------------------------------   Adj R-squared   =    0.0002
       Total |   3959060.8 2,861,771  1.38343033   Root MSE        =    1.1761

------------------------------------------------------------------------------
     logearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      region |
          1  |    .011934   .0021854     5.46   0.000     .0076507    .0162173
          2  |  -.0168817   .0022543    -7.49   0.000    -.0213001   -.0124634
          4  |  -.0152392    .002222    -6.86   0.000    -.0195943   -.0108842
          5  |   .0211758   .002337

And now the reference is region number 3.
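
An alternative is to set the base inside the regression command with the `ib#.` operator, which applies to that command only, whereas `fvset` changes the default for every later command that uses `i.region`. The sketch below uses made-up data (the outcome `y` and the scalars `mean1` and `mean3` are hypothetical names) and also checks that a region coefficient is the difference in means relative to the base.

```stata
* Toy data: 10 observations spread evenly over 5 regions (made-up values)
clear
set obs 10
gen region = mod(_n, 5) + 1
gen y = region + mod(_n, 3)    // arbitrary outcome, for illustration only

* ib3. makes region 3 the reference category for this command only
reg y ib3.region

* The coefficient on region 1 should equal mean(region 1) - mean(region 3)
sum y if region == 1
scalar mean1 = r(mean)
sum y if region == 3
scalar mean3 = r(mean)

* Difference between the coefficient and the gap in means (zero up to rounding)
display _b[1.region] - (mean1 - mean3)
```

The same check explains the output above: each coefficient under base 3 is the corresponding coefficient under base 1 minus the region 3 coefficient, for example -0.0288 - (-0.0119) = -0.0169 for region 2.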

## 13.3 Interactions 

It is also natural to think that the effect of increasing one variable (e.g. age) on earnings differs by gender. The way to capture this differential effect of age between males and females is to create a new variable that is the product of the female dummy and age. Whenever we do this, it is *very important* to also include both the female dummy and age as control variables.
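
Written as an equation, the model looks as follows. This is a sketch in notation introduced here ($\beta_0,\dots,\beta_3$, $u_i$ and the age value $a$ are not names used elsewhere in this module):

$$
\text{logearn}_i = \beta_0 + \beta_1\,\text{female}_i + \beta_2\,\text{age}_i + \beta_3\,(\text{female}_i \times \text{age}_i) + u_i
$$

The implied effect of one more year of age and the implied gender gap are then

$$
\begin{aligned}
\text{slope for males} &= \beta_2, \\
\text{slope for females} &= \beta_2 + \beta_3, \\
\text{gap at age } a &= \beta_1 + \beta_3\,a .
\end{aligned}
$$

This also shows why both main effects are needed. Dropping the female dummy imposes $\beta_1 = 0$, which forces the gap to be zero at age 0, and dropping age imposes $\beta_2 = 0$, which forces the age profile for males to be flat.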

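To run this in Stata we can interact variables with the `#` symbol, which creates the interaction term alone; the `##` operator used below includes both variables on their own as well as their interaction. Categorical variables are preceded by `i.` and continuous variables must be preceded by `c.`.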

In [16]:
reg logearn i.female##c.age


      Source |       SS           df       MS      Number of obs   = 2,861,772
-------------+----------------------------------   F(3, 2861768)   =  60255.52
       Model |  235220.283         3  78406.7611   Prob > F        =    0.0000
    Residual |  3723840.52 2,861,768  1.30123774   R-squared       =    0.0594
-------------+----------------------------------   Adj R-squared   =    0.0594
       Total |   3959060.8 2,861,771  1.38343033   Root MSE        =    1.1407

------------------------------------------------------------------------------
     logearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |  -.2661537    .004943   -53.84   0.000    -.2758418   -.2564655
         age |   .0131706   .0000793   166.01   0.000     .0130151    .0133261
             |
female#c.age |
          1  |  -.0073528   .0001403   -52.40   0.000    -.0076278   -.0070778
             |
     

The coefficient on age is now the effect of one extra year of age on log earnings for the reference category (males), about 0.013. For females this effect is about 0.007 lower, that is, 0.0132 - 0.0074 = 0.0058.
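
To see that the factor-variable notation is doing nothing more than building a product term, the sketch below compares it with a hand-made interaction on simulated data. Everything in it is an assumption made for illustration: the data-generating numbers, the seed, and the names `y`, `female_age` and `b_fv` are hypothetical and unrelated to `fake_data`.

```stata
* Simulated data (made-up parameters, for illustration only)
clear
set obs 200
set seed 490
gen female = mod(_n, 2)
gen age = 20 + mod(_n, 41)
gen y = 10 + 0.01*age - 0.3*female - 0.005*female*age + rnormal(0, 0.1)

* Factor-variable notation: both main effects plus the interaction
reg y i.female##c.age
scalar b_fv = _b[1.female#c.age]

* Slope of age for females (age coefficient plus interaction), with a standard error
lincom age + 1.female#c.age

* Equivalent regression with a hand-made product term
gen female_age = female * age
reg y female age female_age

* Difference between the two interaction coefficients (zero up to rounding)
display _b[female_age] - b_fv
```

The `lincom` step is the useful part in practice: it reports the slope for the non-reference group together with a confidence interval, rather than leaving the sum to be computed by hand.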