# Dummy variable 
Lecture notes by
_Sunil Paul_

- The material in this notebook is prepared using G. S. Maddala's "Introduction to Econometrics".
- Kindly refrain from circulating these notes outside the class, as I haven't included citations for all the textbooks/papers I referenced in preparing them.
- Also see Ch. 6 of Diebold, F. X. (2019), Econometric Data Science: A Predictive Modeling Approach, Department of Economics, University of Pennsylvania, http://www.ssc.upenn.edu/~fdiebold/Textbooks.html.

## Introduction

In some instances, it becomes necessary to incorporate qualitative variables as explanatory factors in regression analysis. In such scenarios, proxy variables are constructed to represent these qualitative aspects. Dummy variables serve this purpose effectively. A dummy variable is an artificial construct that assumes a value of 1 when the qualitative attribute it represents is present, and 0 otherwise. For instance, 1 might denote female gender while 0 signifies male gender, or 1 could represent college graduation while 0 denotes otherwise. Dummy variables are integrated into regression models alongside other explanatory variables.

__Example:__ Suppose we have data on the income of females and males. If income depends on the gender of the individual, we can model this relation as follows: $$Y=\alpha _{f}D_{f}+\alpha _{m}D_{m}+\varepsilon,$$ where $Y$ is the income and $D_{f}$ is a dummy variable taking the value one whenever the observation in question is a female and zero otherwise. $D_{m}$ is defined likewise for males.

The equation in the example given above does not contain an intercept. Note that $D_{f}+D_{m}=1$ for every observation, so including both dummies together with an intercept would make the regressors perfectly collinear. This is known as the __dummy variable trap__. Hence, if a qualitative variable has $m$ categories, we can use only $m-1$ dummy variables to represent it in a regression with an intercept. Dropping $D_{m}$, we may rewrite the example given above as $$Y=\alpha _{0}+\alpha _{f}D_{f}+\varepsilon.$$ The category for which no dummy variable is assigned is known as the base, benchmark, control, comparison, reference, or omitted category, and all comparisons are made relative to it.
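The collinearity behind the dummy variable trap can be checked numerically. The following is a minimal sketch using NumPy and a made-up gender indicator (the data and variable names are assumptions for illustration only): the design matrix with an intercept and both dummies has fewer independent columns than it has columns.

```python
import numpy as np

# Hypothetical sample: 1 = female, 0 = male
female = np.array([1, 0, 1, 1, 0, 0])
D_f = female
D_m = 1 - female
const = np.ones_like(female)

# Intercept plus BOTH dummies: const = D_f + D_m, so the columns are dependent
X_trap = np.column_stack([const, D_f, D_m])
# Intercept plus m - 1 = 1 dummy: full column rank
X_ok = np.column_stack([const, D_f])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print("trap:", rank_trap, "independent columns out of", X_trap.shape[1])
print("ok:  ", rank_ok, "independent columns out of", X_ok.shape[1])
```

Because the rank of `X_trap` falls short of its column count, $X'X$ is singular and the OLS coefficients are not uniquely determined.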

## Uses and Interpretations 

We will begin with models that exclusively employ dummy variables as explanatory factors. Such models are known as Analysis of Variance (ANOVA) models.

### Models with one dummy variable as explanatory variable

Consider the following model

$Y_{i}=\alpha _{0}+\varepsilon_{i}$

Then we can interpret the estimated intercept $\hat{\alpha}_{0}$ as the unconditional (sample) mean of $Y$.

Suppose we have data on the income of males and females. We can run two separate regression models as follows:

$$Y_{f}=\alpha _{1}+\varepsilon_f,$$ and $$Y_{m}=\alpha _{2}+\varepsilon_m,$$ where $Y_{f}$ is the income of females and $Y_{m}$ is the income of males. $\alpha _{1}$ and $\alpha _{2}$ are the average incomes of females and males, respectively. We can combine these equations using dummy variables as follows:

$$Y=\alpha _{f}D_{f}+\alpha _{m}D_{m}+\varepsilon.$$

In this case, the coefficients of the dummy variables can be interpreted as the expected income of the respective categories ($\alpha _{f}=\alpha _{1}$ and $\alpha _{m}=\alpha _{2}$).

However, if we combine the two equations with an intercept in the model, the combined regression model can be expressed as follows: $$\begin{aligned}Y&=\alpha _{2}+(\alpha _{1}-\alpha _{2})D_{f}+\varepsilon\\&=\beta _{0}+\beta _{f}D_{f}+\varepsilon,\end{aligned}$$
where the intercept $\beta _{0}=\alpha _{2}$ reflects the average income of the omitted category (males) and $\beta _{f}=\alpha _{1}-\alpha _{2}$ is the difference in average income between females and males in the sample. Generally we follow a specification with an intercept term. (You can verify this by working out the conditional expectation of $Y$ given $D_{f}=1$ or $D_{f}=0$.)

| Average Income | With intercept | Without intercept |
|-----------------|---------------------------------------|-------------------------------------|
| Males ($D_{m}=1$) | $E(Y\mid D_{f}=0)=\beta _{0}$ | $E(Y\mid D_{f}=0,D_{m}=1)=\alpha _{m}$ |
| Females ($D_{f}=1$) | $E(Y\mid D_{f}=1)=\beta _{0}+\beta _{f}$ | $E(Y\mid D_{f}=1,D_{m}=0)=\alpha _{f}$ |
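
The entries in this table can be verified on a small sample. The sketch below uses invented income figures (an assumption, not data from the notes) and plain least squares from NumPy: without an intercept the two coefficients are the group means, and with an intercept the coefficient on $D_{f}$ is the female-minus-male difference in means.

```python
import numpy as np

# Hypothetical incomes (arbitrary units) and gender indicator
y = np.array([52.0, 61.0, 48.0, 70.0, 66.0, 58.0])
D_f = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
D_m = 1.0 - D_f

# Without intercept: Y = a_f*D_f + a_m*D_m + e
X_no = np.column_stack([D_f, D_m])
a_f, a_m = np.linalg.lstsq(X_no, y, rcond=None)[0]

# With intercept: Y = b_0 + b_f*D_f + e (males are the base category)
X_int = np.column_stack([np.ones_like(y), D_f])
b_0, b_f = np.linalg.lstsq(X_int, y, rcond=None)[0]

mean_f = y[D_f == 1].mean()
mean_m = y[D_f == 0].mean()
print("group means:      ", mean_f, mean_m)
print("without intercept:", a_f, a_m)
print("with intercept:   ", b_0 + b_f, b_0)
```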


### Models with an additional dummy variable
Now, let's consider the scenario where we assume that individuals' income also depends on their profession. Assuming the profession has three categories, we can create the following dummy variables

$D_{d}=1$ if the individual is a doctor and 0 otherwise,

$D_{p}=1$ if the individual is a professor and 0 otherwise, and

$D_{l}=1$ if the individual is a lawyer and 0 otherwise.

We have already defined $D_{f}$ and $D_{m}$.

Then we may specify the model as follows

$Y=\alpha _{f}D_{f}+\alpha _{m}D_{m}+\alpha _{p}D_{p}+\alpha _{l}D_{l}+\varepsilon$
(without intercept)

or

$Y=\beta _{0}+\beta _{m}D_{m}+\beta _{p}D_{p}+\beta _{l}D_{l}+\varepsilon $ (with
intercept)

We have omitted one profession dummy ($D_{d}$) from the model without intercept. If that dummy were added, there would be perfect multicollinearity, since $D_{f}+D_{m}=D_{d}+D_{p}+D_{l}=1$. Similarly, in the second model with intercept we have omitted one category each from gender and profession ($D_{f}$ and $D_{d}$) for the same reason, so female doctors form the base category.

The coefficients of the two specifications are then related as follows:

$\alpha _{f}=\beta _{0}$ = average income of a female doctor

$\alpha _{m}=\beta _{0}+\beta _{m}$ = average income of a male doctor

$\alpha _{f}+\alpha _{p}=\beta _{0}+\beta _{p}$ = average income of a female professor

$\alpha _{f}+\alpha _{l}=\beta _{0}+\beta _{l}$ = average income of a female lawyer

$\alpha _{m}+\alpha _{p}=\beta _{0}+\beta _{p}+\beta _{m}$ = average income of a male professor

$\alpha _{m}+\alpha _{l}=\beta _{0}+\beta _{l}+\beta _{m}$ = average income of a male lawyer
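
The mapping between the two parameterisations can be checked numerically. This is a sketch on simulated data (the sample design, coefficient values, and seed are assumptions chosen for illustration), with doctors as the omitted profession in both models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical balanced sample: 60 people, 10 in each gender-profession cell
female = np.tile([1, 0], 30)
prof = np.repeat([0, 1, 2], 20)  # 0 = doctor, 1 = professor, 2 = lawyer

D_f = (female == 1).astype(float)
D_m = 1.0 - D_f
D_p = (prof == 1).astype(float)
D_l = (prof == 2).astype(float)
y = 50 + 8 * D_m + 5 * D_p + 12 * D_l + rng.normal(0, 2, female.size)

# Without intercept: both gender dummies, doctor dummy omitted
X_a = np.column_stack([D_f, D_m, D_p, D_l])
alpha = np.linalg.lstsq(X_a, y, rcond=None)[0]

# With intercept: female and doctor dummies omitted
X_b = np.column_stack([np.ones_like(y), D_m, D_p, D_l])
beta = np.linalg.lstsq(X_b, y, rcond=None)[0]

print("alpha_f vs beta_0:         ", alpha[0], beta[0])
print("alpha_m vs beta_0 + beta_m:", alpha[1], beta[0] + beta[1])
print("alpha_p, alpha_l vs beta_p, beta_l:", alpha[2:], beta[2:])
```

Both design matrices span the same column space, so the fitted values are identical and only the labelling of the coefficients differs.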

Similarly, we can create a set of new dummy variables combining profession and gender and specify the model as follows

$Y=\alpha _{fd}D_{fd}+\alpha _{md}D_{md}+\alpha _{fp}D_{fp}+\alpha _{mp}D_{mp}+\alpha _{fl}D_{fl}+\alpha _{ml}D_{ml}+\varepsilon$

or

$Y=\beta _{0}^{\prime }+\beta _{fd}D_{fd}+\beta _{md}D_{md}+\beta _{fp}D_{fp}+\beta _{mp}D_{mp}+\beta _{fl}D_{fl}+u$

where

$D_{fd}=1$ for a female doctor and 0 otherwise ($D_{f}\times D_{d}$),

$D_{fp}=1$ for a female professor and 0 otherwise ($D_{f}\times D_{p}$),

$D_{fl}=1$ for a female lawyer and 0 otherwise ($D_{f}\times D_{l}$),

$D_{md}=1$ for a male doctor and 0 otherwise ($D_{m}\times D_{d}$),

$D_{mp}=1$ for a male professor and 0 otherwise ($D_{m}\times D_{p}$),

$D_{ml}=1$ for a male lawyer and 0 otherwise ($D_{m}\times D_{l}$).

The interpretations of these coefficients are straightforward. The key difference between this method and the former method is that the former assumes the difference between males and females is equal for all professions, while the latter model allows for interaction effects. From now on, we will use regression models with intercept only.
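
To illustrate the difference between the two approaches, here is a sketch on simulated data in which the gender gap is deliberately made to differ across professions (all numbers are assumptions). The six cell dummies reproduce the six cell means, whereas the additive model reports a single gender gap.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical balanced sample: 10 people in each gender-profession cell
female = np.tile([1, 0], 30)
prof = np.repeat([0, 1, 2], 20)  # 0 = doctor, 1 = professor, 2 = lawyer

# The male-female gap is set to 2, 6 and 10 for the three professions
gap = np.array([2.0, 6.0, 10.0])[prof]
y = 50 + 5 * (prof == 1) + 12 * (prof == 2) + gap * (female == 0) \
    + rng.normal(0, 1, female.size)

# Cell dummies in the order D_fd, D_md, D_fp, D_mp, D_fl, D_ml (no intercept)
cells = [(1, 0), (0, 0), (1, 1), (0, 1), (1, 2), (0, 2)]
X_cells = np.column_stack(
    [((female == g) & (prof == p)).astype(float) for g, p in cells]
)
a = np.linalg.lstsq(X_cells, y, rcond=None)[0]
cell_means = np.array([y[(female == g) & (prof == p)].mean() for g, p in cells])
gaps = a[1::2] - a[0::2]  # male minus female, one value per profession

# Additive model with intercept: one common gender gap for all professions
X_add = np.column_stack([np.ones_like(y), female == 0, prof == 1, prof == 2])
b_add = np.linalg.lstsq(X_add.astype(float), y, rcond=None)[0]

print("gap by profession (interaction model):", gaps)
print("common gap (additive model):          ", b_add[1])
```

In this balanced design the additive estimate is the average of the three profession-specific gaps, which hides how much they differ.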

Regression models with only dummy variables as explanatory variables are unrealistic. More often, we will have a combination of qualitative and quantitative variables. In such models, dummy variables are used to:

- allow differences in intercept terms,
- allow for differences in slopes,
- test for the stability of regression coefficients, and
- conduct piecewise regressions.

### Dummy variable for changes in the intercept term

If two groups share the same slope but have different intercepts, we can use a dummy variable to capture the difference in the intercept. That is, we have the regression equation $Y=\alpha _{1}+\beta X+\varepsilon$ for the first group and $Y=\alpha _{2}+\beta X+\varepsilon$ for the second group, where $\alpha _{1}\neq \alpha _{2}.$

These equations can be combined into a single equation: $$\begin{aligned}Y&=\alpha _{1}+(\alpha _{2}-\alpha _{1})D+\beta X+\varepsilon\\&=\alpha _{1}+\gamma D+\beta X+\varepsilon,\end{aligned}$$
where $D=1$ for group 2 and 0 otherwise.

The interpretation of coefficients is straightforward. The constant term ($\alpha _{1}$) gives the intercept term for the first group, and the sum of the intercept and the coefficient of the dummy variable ($\alpha _{1}+\gamma$) gives the intercept for the second group.
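
As a quick check of this interpretation, the sketch below builds noise-free data from assumed values $\alpha _{1}=2$, $\alpha _{2}=5$ and $\beta =1.5$ (chosen only for illustration) and recovers them from the combined regression.

```python
import numpy as np

# Assumed "true" parameters for the illustration
alpha1, alpha2, beta = 2.0, 5.0, 1.5

# Ten observations per group, same X values in both groups
x = np.tile(np.arange(10.0), 2)
D = np.repeat([0.0, 1.0], 10)  # D = 1 for group 2

# Noise-free outcome so the coefficients are recovered exactly
y = np.where(D == 1, alpha2, alpha1) + beta * x

X = np.column_stack([np.ones_like(x), D, x])
a1_hat, gamma_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print("intercept, group 1:", a1_hat)
print("intercept, group 2:", a1_hat + gamma_hat)
print("common slope:      ", beta_hat)
```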

(We can also use dummy variables to pick out and control for seasonal variation in data. The idea is to include a set of dummy variables for each quarter (or month or day), which will then net out the average change in a variable resulting from any seasonal fluctuations.)
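
A minimal sketch of the seasonal case, assuming a made-up quarterly series with no trend and with the first quarter as the base category: the dummy coefficients equal the differences between each quarter's mean and the first-quarter mean, and the residuals are the series with the average seasonal pattern removed.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical quarterly series: 5 years with a fixed seasonal pattern
quarter = np.tile([1, 2, 3, 4], 5)
seasonal = np.array([0.0, 2.0, -1.0, 4.0])[quarter - 1]
y = 10.0 + seasonal + rng.normal(0, 0.5, quarter.size)

# Intercept plus dummies for Q2, Q3 and Q4 (Q1 is the base category)
X = np.column_stack(
    [np.ones(quarter.size)] + [(quarter == q).astype(float) for q in (2, 3, 4)]
)
b = np.linalg.lstsq(X, y, rcond=None)[0]

q_means = np.array([y[quarter == q].mean() for q in (1, 2, 3, 4)])
y_deseasonalised = y - X @ b  # residuals: average seasonal effect netted out

print("Q1 mean (intercept):        ", b[0])
print("Q2..Q4 effects relative Q1: ", b[1:])
```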


### Dummy variable for changes in slope coefficients
We can use dummy variables to allow for differences in slope coefficients as well. For example, if the regression equations are:
$$Y_{1}=\alpha _{1}+\beta _{1}X_{1}+\varepsilon_{1} $$ for the first group,

$$ Y_{2}=\alpha _{2}+\beta _{2}X_{2}+\varepsilon_{2} $$ for the second group,

we can combine these equations using a dummy variable as follows:

$$\begin{aligned}Y&=\alpha _{1}+(\alpha _{2}-\alpha _{1})D+\beta _{1}X+(\beta _{2}-\beta _{1})(X\ast D)+\varepsilon\\&=\alpha _{1}+\gamma _{1}D_{1}+\beta _{1}X+\gamma _{2}D_{2}+\varepsilon,\end{aligned}$$

where $Y$ and $X$ stack the observations of the two groups ($Y_{1},X_{1}$ followed by $Y_{2},X_{2}$), $D_{1}$ (or $D$) equals 1 for group 2 and 0 otherwise, and $D_{2}$ $(=X\ast D)$ is the product of the dummy and $X$. Here $D_{2}=X_{2}$ for all observations in the second group and zero otherwise. The coefficient of $D_{1}$, $\gamma _{1}=\alpha _{2}-\alpha _{1}$, measures the difference in the intercept terms, and the coefficient of $D_{2}$, $\gamma _{2}=\beta _{2}-\beta _{1}$, measures the difference in the slopes. The distribution of the error term is assumed to be identical for both groups. In the same way, we can use dummy variables to allow for changes in the intercept only, in the slope only, or in both.
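
The equivalence between the combined regression and the two separate regressions can be confirmed numerically. The sketch below uses simulated data (group sizes, coefficients, and seed are assumptions) and checks that the dummy coefficients equal the differences between the separately estimated intercepts and slopes.

```python
import numpy as np

rng = np.random.default_rng(2)

def ols(X, y):
    """Least-squares coefficients for y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical data for two groups with different intercepts and slopes
n1, n2 = 30, 40
x1 = rng.uniform(0, 10, n1)
x2 = rng.uniform(0, 10, n2)
y1 = 1.0 + 0.5 * x1 + rng.normal(0, 1, n1)
y2 = 3.0 + 1.2 * x2 + rng.normal(0, 1, n2)

# Separate regressions, one per group
a1, b1 = ols(np.column_stack([np.ones(n1), x1]), y1)
a2, b2 = ols(np.column_stack([np.ones(n2), x2]), y2)

# Combined regression on the stacked data with D and X*D
y = np.concatenate([y1, y2])
x = np.concatenate([x1, x2])
D = np.concatenate([np.zeros(n1), np.ones(n2)])
X = np.column_stack([np.ones(n1 + n2), D, x, x * D])
alpha1, gamma1, beta1, gamma2 = ols(X, y)

print("intercept difference:", gamma1, "vs", a2 - a1)
print("slope difference:    ", gamma2, "vs", b2 - b1)
```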


### Dummy variables for testing stability of regression coefficients

Dummy variables can be used to test for stability of regression coefficients.

Consider the equations below:

$ Y_{1}=\alpha _{1}+\beta _{1}X_{1}+\gamma _{1}Z_{1}+\varepsilon_{1} $ (for the first period)

$ Y_{2}=\alpha _{2}+\beta _{2}X_{2}+\gamma _{2}Z_{2}+\varepsilon_{2} $ (for the second period)


where $Y_{1}, X_{1}, Z_{1}$ are the observations of $Y, X,$ and $Z$ for the first period, and $Y_{2}, X_{2}, Z_{2}$ are the observations for the second period.

A test for the stability of the parameters between the populations generating the two data sets is a test of the hypothesis:

$$ H_{0}:\alpha _{1}=\alpha _{2},\beta _{1}=\beta _{2},\gamma _{1}=\gamma _{2}$$

To test this hypothesis, we can use an F statistic.

Let us define:

$ SSR_{1} $ =  sum of squares of residuals  for the first data set

$ SSR_{2} $ = sum of squares of residuals  for the second data set


The unrestricted SSR (USSR) can be obtained by $ SSR_{1}+SSR_{2} $. Restricted SSR (RSSR) can be obtained from regression with the pooled data.

Then, we can define the F statistic as follows:

$$ F=\frac{(RSSR-USSR)/(k)}{USSR/(n_{1}+n_{2}-2k)} $$

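where $k$ is the number of parameters in each regression (here $k=3$: the intercept and the coefficients of $X$ and $Z$), $n_{1}$ is the number of observations in the first period, and $n_{2}$ is the number of observations in the second period. Under $H_{0}$ the statistic follows an $F$ distribution with $k$ and $n_{1}+n_{2}-2k$ degrees of freedom, so the usual test procedure can be applied.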
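
The calculation can be sketched as follows on simulated data (sample sizes, coefficients, and seed are assumptions). The code only computes the statistic; comparing it with a critical value from the $F(k,\,n_{1}+n_{2}-2k)$ distribution is left out.

```python
import numpy as np

rng = np.random.default_rng(4)

def ssr(X, y):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return float(resid @ resid)

# Hypothetical data: intercept and slope on X differ between the two periods
n1, n2, k = 30, 40, 3
x1, z1 = rng.uniform(0, 10, n1), rng.uniform(0, 5, n1)
x2, z2 = rng.uniform(0, 10, n2), rng.uniform(0, 5, n2)
y1 = 1.0 + 0.5 * x1 + 2.0 * z1 + rng.normal(0, 1, n1)
y2 = 2.0 + 0.9 * x2 + 2.0 * z2 + rng.normal(0, 1, n2)

X1 = np.column_stack([np.ones(n1), x1, z1])
X2 = np.column_stack([np.ones(n2), x2, z2])

# Unrestricted: separate regressions; restricted: one pooled regression
ussr = ssr(X1, y1) + ssr(X2, y2)
rssr = ssr(np.vstack([X1, X2]), np.concatenate([y1, y2]))

F = ((rssr - ussr) / k) / (ussr / (n1 + n2 - 2 * k))
print("USSR:", ussr, "RSSR:", rssr, "F:", F)
```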

The same test can be done using dummy variables. Let us combine the two equations into one as follows:

$$ Y=\alpha _{1}+(\alpha _{2}-\alpha _{1})D+\beta _{1}X+(\beta _{2}-\beta _{1})(D\ast X)+\gamma _{1}Z+(\gamma _{2}-\gamma _{1})(D\ast Z)+\varepsilon$$

where $D=1$ for the second period and 0 for the first period.

Or,

$$ Y=\alpha _{1}+\alpha _{2}^{\ast }D_{1}+\beta _{1}X+\beta _{2}^{\ast }D_{2}+\gamma _{1}Z+\gamma _{2}^{\ast }D_{3}+\varepsilon
$$

where

$D_{1}=1$ for period 2, 0 for period 1

$D_{2}=X$ for period 2 ($X_{2}$), 0 for period 1

$D_{3}=Z$ for period 2 ($Z_{2}$), 0 for period 1

$\alpha _{2}^{\ast }=\alpha _{2}-\alpha _{1}$

$\beta _{2}^{\ast }=\beta _{2}-\beta _{1}$

$\gamma _{2}^{\ast }=\gamma _{2}-\gamma _{1}$


The unrestricted SSR (USSR) can be obtained from the equation above, and the restricted SSR (RSSR) can be obtained by deleting the dummy variables that correspond to the hypothesis we intend to test. A summary of the tests based on dummy variables is given below:

| Hypothesis                                                     | Variables Deleted |
|----------------------------------------------------------------|-------------------|
| 1) All coefficients same ($ \alpha _{1}=\alpha _{2},\beta _{1}=\beta _{2},\gamma _{1}=\gamma _{2} $) | $ D_{1},D_{2},D_{3} $ |
| 2) Only Intercepts Change ($ \beta _{1}=\beta _{2},\gamma _{1}=\gamma _{2} $)                  | $ D_{2},D_{3} $ |
| 3) Only intercepts and coefficients of Z change ($ \beta _{1}=\beta _{2} $)                        | $ D_{2} $ |

Once you obtain RSSR and USSR, you can calculate the F statistic, using the number of deleted variables as the numerator degrees of freedom, and follow the usual testing procedure.
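
As an illustration, the sketch below runs the second test in the table (only intercepts change) on simulated data in which only the intercept actually shifts; the data-generating values and seed are assumptions. It also checks that the USSR from the dummy-variable regression matches the sum of the SSRs from two separate regressions.

```python
import numpy as np

rng = np.random.default_rng(5)

def ssr(X, y):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return float(resid @ resid)

n1, n2, k = 30, 40, 3
n = n1 + n2
x = rng.uniform(0, 10, n)
z = rng.uniform(0, 5, n)
D1 = np.concatenate([np.zeros(n1), np.ones(n2)])  # 1 for period 2
y = 1.0 + 1.5 * D1 + 0.5 * x + 2.0 * z + rng.normal(0, 1, n)

const = np.ones(n)
D2, D3 = D1 * x, D1 * z  # slope dummies for X and Z

# Unrestricted model with all three dummy terms
ussr = ssr(np.column_stack([const, D1, x, D2, z, D3]), y)

# Same quantity from two separate regressions
X_base = np.column_stack([const, x, z])
p1, p2 = D1 == 0, D1 == 1
ssr_sep = ssr(X_base[p1], y[p1]) + ssr(X_base[p2], y[p2])

# Hypothesis 2: delete D2 and D3, so q = 2 restrictions
q = 2
rssr = ssr(np.column_stack([const, D1, x, z]), y)
F = ((rssr - ussr) / q) / (ussr / (n - 2 * k))
print("USSR:", ussr, "separate-regression SSR:", ssr_sep)
print("F for hypothesis 2:", F)
```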

### Piecewise Linear Regression

In certain cases, the regression relationship changes once a threshold is crossed. We can capture such relationships using dummy variables; these models are known as piecewise regressions. For example, let $Y$ represent sales commission, $X$ the volume of sales by the salesperson, and $X^{\ast }$ a threshold value, and suppose that once sales reach $X^{\ast }$ the incentive structure changes and the salesperson earns more commission per sale. Now let us define the dummy $D=1$ if $X>X^{\ast }$ and $D=0$ if $X\leq X^{\ast }$.

Then we will get the model as follows

$$Y=\alpha +\beta _{1}X+\beta _{2}(X-X^{\ast })D+\varepsilon$$

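It is easy to verify that

$E(Y\mid D=0,X,X^{\ast })=\alpha +\beta _{1}X$

$E(Y\mid D=1,X,X^{\ast })=\alpha -\beta _{2}X^{\ast }+(\beta _{1}+\beta _{2})X$

so the slope changes from $\beta _{1}$ to $\beta _{1}+\beta _{2}$ at the threshold, while the two segments meet at $X=X^{\ast }$.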
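
A small numerical sketch of the piecewise model, assuming a threshold $X^{\ast }=10$ and noise-free data generated from $\alpha =1$, $\beta _{1}=0.5$, $\beta _{2}=0.3$ (all values are illustrative assumptions). It recovers the coefficients and checks that the two segments give the same value at the threshold.

```python
import numpy as np

# Assumed threshold and "true" parameters for the illustration
x_star = 10.0
alpha, beta1, beta2 = 1.0, 0.5, 0.3

x = np.arange(0.0, 21.0)
D = (x > x_star).astype(float)  # 1 above the threshold, 0 otherwise

# Noise-free commission schedule with a kink at x_star
y = alpha + beta1 * x + beta2 * (x - x_star) * D

X = np.column_stack([np.ones_like(x), x, (x - x_star) * D])
a_hat, b1_hat, b2_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Segment above the threshold: intercept a - b2*x_star, slope b1 + b2
intercept_hi = a_hat - b2_hat * x_star
slope_hi = b1_hat + b2_hat

# Both segments evaluated at the threshold
left = a_hat + b1_hat * x_star
right = intercept_hi + slope_hi * x_star
print("slope below / above threshold:", b1_hat, slope_hi)
print("value at threshold from each segment:", left, right)
```

Writing the extra regressor as $(X-X^{\ast })D$ rather than $XD$ is what forces the two segments to join at the threshold.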

