# Chapter 7. Multiple Regression Analysis with Qualitative Information: Binary (or Dummy) Variables

In previous chapters, the dependent and independent variables in our multiple regression models have had quantitative meaning. In empirical work, we must also incorporate qualitative factors into regression models. The gender or race of an individual, the industry of a firm or the region where a city is located are all considered to be qualitative factors.

## 7-1 Describing qualitative information

Qualitative factors often come in the form of binary information: a person is female or male, a person does or does not have a personal computer, a state administers capital punishment or not. In all these examples this information can be captured by defining a binary variable or a zero-one variable. In econometrics binary variables are most commonly called dummy variables.

In assigning a dummy variable we must decide which event is assigned the value one and which is assigned the value zero. For example, in a study of individual wage determination, we might define female to be a binary variable taking on the value one for females and the value zero for males. The name in this case indicates the event with the value one.

## 7-2 A Single Dummy Independent Variable

How do we incorporate binary information into regression models? In the simplest case, with only a single dummy explanatory variable, we just add it as an independent variable in the equation. For example, consider the following simple model of hourly wage determination:

\begin{equation}
wage=\beta_0+\delta_0female+\beta_1educ+u
\tag{7.1}
\end{equation}

In model (7.1), only two observed factors affect wage: gender and education. Because female=1 when the person is female, and female=0 when the person is male, the parameter $\delta_0$ has the following interpretation: $\delta_0$ is the difference in hourly wage between females and males, given the same amount of education (and the same error term u). Thus the coefficient $\delta_0$ determines whether there is discrimination against women: if $\delta_0<0$ then, for the same level of other factors, women earn less on average than men.

In terms of expectations, if we assume the zero conditional mean assumption $E(u|female,educ)=0$, then

\begin{equation}
\delta_0=E(wage|female=1,educ)-E(wage|female=0,educ)
\tag{7.2}
\end{equation}

The key here is that the level of education is the same in both expectations; the difference $\delta_0$ is due to gender only.
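The step from (7.1) to (7.2) can be written out explicitly. The lines below are only algebra under the zero conditional mean assumption, evaluating the conditional expectation of (7.1) at female=1 and at female=0 for the same value of educ:

\begin{equation}
\begin{aligned}
E(wage|female=1,educ) &= \beta_0+\delta_0+\beta_1 educ \\
E(wage|female=0,educ) &= \beta_0+\beta_1 educ \\
E(wage|female=1,educ)-E(wage|female=0,educ) &= \delta_0
\end{aligned}
\end{equation}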

In (7.1) we have chosen males to be the base group or benchmark group, that is, the group against which comparisons are made. This is why $\beta_0$ is the intercept for males and $\delta_0$ is the difference in intercepts between females and males.
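The intercept-shift interpretation can be checked on artificial data. The following is a minimal sketch that assumes made-up parameter values (it does not use the WAGE1 sample): wages are generated from a model of the form (7.1) with a true $\delta_0$ of -2, and OLS is then used to recover the male intercept and the female intercept.

```r
# Simulated data with hypothetical parameter values (not the WAGE1 sample)
set.seed(123)
n      <- 5000
female <- rbinom(n, 1, 0.5)                 # 1 = female, 0 = male (base group)
educ   <- sample(8:18, n, replace = TRUE)   # years of education
u      <- rnorm(n, sd = 2)
wage   <- 1.5 - 2.0 * female + 0.6 * educ + u   # true delta_0 = -2

fit <- lm(wage ~ female + educ)
coef(fit)

# The intercept for males is beta_0; for females it is beta_0 + delta_0
int_male   <- unname(coef(fit)["(Intercept)"])
int_female <- unname(coef(fit)["(Intercept)"] + coef(fit)["female"])
c(male = int_male, female = int_female)
```

The two regression lines share the slope on educ and differ only by the estimated coefficient on female.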

### Wooldridge. Example 7.1 Hourly Wage Equation

Using the data in WAGE1, we estimate the following model (7.3) 

\begin{equation}
wage=\beta_0+\delta_0*female+\beta_1*educ+\beta_2*exper+\beta_3*tenure+u
\tag{7.3}
\end{equation}

If educ, exper and tenure are all relevant productivity characteristics, the null hypothesis of no difference between men and women is $H_0:\delta_0=0$. The alternative that there is discrimination against women is $H_1:\delta_0<0$. We now estimate the model by OLS:

In [None]:
library(foreign)
wage1 <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/wage1.dta?raw=true")

wageres <- lm(wage ~ female+educ+exper+tenure, data=wage1)
summary(wageres)

The coefficient on female is interesting because it measures the average difference in hourly wage between a man and a woman who have the same levels of educ, exper and tenure. If we take a woman and a man with the same levels of education, experience and tenure the woman earns, on average, $1.81 less per hour than the man (in 1976 wages).

### Wooldridge. Example 7.2 Effects of Computer Ownership on College GPA

In order to determine the effects of computer ownership on college grade point average, we estimate the model

\begin{equation}
colGPA=\beta_0+\delta_0*PC+\beta_1*hsGPA+\beta_2*ACT+u
\end{equation}

where the dummy variable PC equals one if a student owns a personal computer and zero otherwise. The variables hsGPA (high school GPA) and ACT (achievement test score) are used as controls: it could be that stronger students, as measured by high school GPA and ACT scores, are more likely to own computers. We control for these factors because we would like to know the average effect on colGPA if a student is picked at random and given a personal computer. Using the data in GPA1, we obtain

In [None]:
library(foreign)
gpa1 <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/gpa1.dta?raw=true")

# Store results under "GPAres" and display full table:
GPAres <- lm(colGPA ~ PC+hsGPA+ACT, data=gpa1)
summary(GPAres)


This equation implies that a student who owns a PC has a predicted GPA about .16 points higher than a comparable student without a PC. The effect is also very statistically significant, with $t_{PC}=.157/.057\approx 2.75$.

Each of the previous examples can be viewed as having relevance for policy analysis. In the first example, we were interested in gender discrimination in the workforce. In the second example we were concerned with the effect of computer ownership on college performance. A special case of policy analysis is program evaluation, where we would like to know the effect of economic or social programs on individuals, firms, neighborhoods, cities and so on.

In the simplest case, there are two groups of subjects. The control group does not participate in the program. The experimental group or treatment group does take part in the program. Except in rare cases the choice of the control and the treatment groups is not random. However, in some cases, multiple regression analysis can be used to control for enough other factors in order to estimate the causal effect of the program.

### Wooldridge. Example 7.3 Effects of Training Grants on Hours of Training

Using the 1988 data for Michigan manufacturing firms in JTRAIN, we obtain the following estimated equation

In [None]:
library(foreign)
jtrain <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/jtrain.dta?raw=true")

# Store results under "TRAINres" and display full table (1988 observations only):
TRAINres <- lm(hrsemp ~ grant+log(sales)+log(employ), data=jtrain[jtrain$year == 1988,])
summary(TRAINres)


The dependent variable is hours of training per employee, at the firm level. The variable grant is a dummy variable equal to one if the firm received a job training grant for 1988, and zero otherwise. The variables sales and employ represent annual sales and number of employees respectively.

The variable grant is very statistically significant, with $t_{grant}=4.70$. Controlling for sales and employment, firms that received a grant trained each worker, on average, 26.25 hours more. The coefficient on log(sales) is small and very insignificant. The coefficient on log(employ) means that, if a firm is 10% larger, it trains its workers about .61 hours less. Its t statistic is -1.56, which is only marginally statistically significant.

As with any other independent variable, we should ask whether the measured effect of a qualitative variable is causal. In the previous model, is the difference in training between firms that receive grants and those that do not due to the grant, or is grant receipt simply an indicator of something else? It might be that the firms receiving grants would have, on average, trained their workers more even in the absence of a grant. Nothing in this analysis tells us whether we have estimated a causal effect; we must know how the firms receiving grants were determined. We can only hope we have controlled for as many factors as possible that might be related to whether a firm received a grant and to its levels of training.

### 7-2a Interpreting Coefficients on Dummy Explanatory Variables When the Dependent Variable is log(y) 

A common specification in applied work has the dependent variable appearing in logarithmic form, with one or more dummy variables appearing as independent variables. In this case the coefficients have a percentage interpretation.

### Wooldridge. Example 7.4 Housing Price Regression

Using the data in HPRICE1, we obtain the equation

In [None]:


library(foreign)
hprice <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/hprice1.dta?raw=true")

# Store results under "HPRICEres" and display full table:
HPRICEres <- lm(log(price) ~ log(lotsize)+log(sqrft)+bdrms+colonial, hprice)
summary(HPRICEres)


All the variables are self-explanatory except colonial, which is a binary variable equal to one if the house is of the colonial style. For given levels of lotsize, sqrft and bdrms, the difference in log(price) between a house of colonial style and that of another style is .054. This means that a colonial-style house is predicted to sell for about 5.4% more, holding other factors fixed.

### Wooldridge. Example 7.5 Log Hourly Wage Equation

Let us reestimate the wage equation from Example 7.1, using log(wage) as the dependent variable and adding quadratics on exper and tenure

In [12]:
library(foreign)
wage1 <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/wage1.dta?raw=true")

wageres <- lm(log(wage) ~ female+educ+exper+I(exper^2)+tenure+I(tenure^2), data=wage1)
summary(wageres)


Call:
lm(formula = log(wage) ~ female + educ + exper + I(exper^2) + 
    tenure + I(tenure^2), data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.83160 -0.25658 -0.02126  0.25500  1.13370 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.4166910  0.0989279   4.212 2.98e-05 ***
female      -0.2965110  0.0358054  -8.281 1.04e-15 ***
educ         0.0801966  0.0067573  11.868  < 2e-16 ***
exper        0.0294324  0.0049752   5.916 6.00e-09 ***
I(exper^2)  -0.0005827  0.0001073  -5.431 8.65e-08 ***
tenure       0.0317139  0.0068452   4.633 4.56e-06 ***
I(tenure^2) -0.0005852  0.0002347  -2.493    0.013 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3998 on 519 degrees of freedom
Multiple R-squared:  0.4408,	Adjusted R-squared:  0.4343 
F-statistic: 68.18 on 6 and 519 DF,  p-value: < 2.2e-16


The coefficient on female gives the difference in predicted log wages: $\widehat{\log(wage_F)}-\widehat{\log(wage_M)}=-.297$. What we usually want, however, is the proportionate difference in wages between females and males, holding other factors fixed: $(\widehat{wage}_F-\widehat{wage}_M)/\widehat{wage}_M$.

Remember that $\log(wage_F)-\log(wage_M)=\log(wage_F/wage_M)$. Exponentiating and subtracting one therefore gives $(\widehat{wage}_F-\widehat{wage}_M)/\widehat{wage}_M=\exp(-.297)-1 \approx -.257$.

This estimate implies that a woman's wage is, on average, 25.7% below a comparable man's wage.
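The conversion from a dummy coefficient in a log(y) model to an exact percentage difference is a one-line calculation. The sketch below uses a helper named `exact_pct`, which is a hypothetical name introduced here for illustration, and applies it to the coefficient on female reported in the output above.

```r
# Hypothetical helper: exact percentage difference implied by a dummy
# coefficient b in a model with log(y) as the dependent variable
exact_pct <- function(b) 100 * (exp(b) - 1)

b_female   <- -0.2965110        # coefficient on female, copied from the output above
approx_pct <- 100 * b_female    # approximation: read the coefficient as a percentage
exact      <- exact_pct(b_female)

round(c(approx = approx_pct, exact = exact), 1)
```

The gap between the two numbers grows with the absolute size of the coefficient, which is why the exact calculation matters here but barely changes the 5.4% figure in Example 7.4.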

## 7-3 Using Dummy Variables for Multiple Categories

We can use several dummy independent variables in the same equation.

### Wooldridge. Example 7.6 Log Hourly Wage Equation

Let us estimate a model that allows for wage differences among four groups: married men, married women, single men, and single women. To do this we must select a base group; we choose single men. Then, we must define dummy variables for each of the remaining groups. Call these marrmale, marrfem and singfem. We drop female as it is now redundant

In [20]:
library(foreign)
wage1 <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/wage1.dta?raw=true")

# Example 7.6
  # Generate the subgroup dummies

marrmale<-as.numeric(wage1$female==0 & wage1$married==1)
marrfem<-as.numeric(wage1$female==1 & wage1$married==1)
singfem<-as.numeric(wage1$female==1 & wage1$married==0)
 
wageres<-lm(lwage ~ marrmale + marrfem + singfem + educ + exper + 
             I(exper^2) + tenure + I(tenure^2), data=wage1)
summary(wageres)


Call:
lm(formula = lwage ~ marrmale + marrfem + singfem + educ + exper + 
    I(exper^2) + tenure + I(tenure^2), data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89697 -0.24060 -0.02689  0.23144  1.09197 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.3213780  0.1000090   3.213 0.001393 ** 
marrmale     0.2126756  0.0553572   3.842 0.000137 ***
marrfem     -0.1982676  0.0578355  -3.428 0.000656 ***
singfem     -0.1103502  0.0557421  -1.980 0.048272 *  
educ         0.0789103  0.0066945  11.787  < 2e-16 ***
exper        0.0268006  0.0052428   5.112 4.50e-07 ***
I(exper^2)  -0.0005352  0.0001104  -4.847 1.66e-06 ***
tenure       0.0290875  0.0067620   4.302 2.03e-05 ***
I(tenure^2) -0.0005331  0.0002312  -2.306 0.021531 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3933 on 517 degrees of freedom
Multiple R-squared:  0.4609,	Adjusted R-squared:  0.4525 
F-statistic: 55.25 

All of the coefficients, with the exception of singfem (single female), have t statistics well above two in absolute value. The t statistic for singfem is about -1.98, which is just significant at the 5% level against a two-sided alternative.

To interpret the coefficients on the dummy variables, we must remember that the base group is single males. Thus the estimates on the three dummy variables measure the proportionate difference in wage relative to single males. For example, married men (marrmale) are estimated to earn about 21.3% more than single men (the base group), holding levels of education, experience and tenure fixed. The more precise estimate, obtained by exponentiating and subtracting one as in Example 7.5, is about 23.7%. A married woman (marrfem), on the other hand, earns a predicted 19.8% less than a single man (the base group) with the same levels of the other variables.

Because the base group is represented by the intercept, we have included dummy variables for only three of the four groups. If we were to add a dummy variable for single males to the model, we would fall into the dummy variable trap by introducing perfect collinearity.
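The dummy variable trap is easy to reproduce. The following sketch uses simulated data with made-up parameter values (not WAGE1) and includes all four group dummies together with an intercept. Because the four dummies sum to one for every observation, they duplicate the intercept column, and `lm()` reports one coefficient as `NA`.

```r
# Simulated data with hypothetical parameter values (not the WAGE1 sample)
set.seed(1)
n       <- 400
female  <- rbinom(n, 1, 0.5)
married <- rbinom(n, 1, 0.6)
lwage   <- 1.5 + 0.2 * married - 0.3 * female + rnorm(n, sd = 0.4)

marrmale <- as.numeric(female == 0 & married == 1)
marrfem  <- as.numeric(female == 1 & married == 1)
singfem  <- as.numeric(female == 1 & married == 0)
singmale <- as.numeric(female == 0 & married == 0)

# Every observation belongs to exactly one group
all(marrmale + marrfem + singfem + singmale == 1)

# Intercept plus all four dummies: perfect collinearity
trap <- lm(lwage ~ marrmale + marrfem + singfem + singmale)
coef(trap)   # one coefficient cannot be estimated and is reported as NA
```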

We can use the previous model to obtain the estimated difference between any two groups. The estimated proportionate difference between single and married women is -.110 - (-.198)=.088, which means that single women earn about 8.8% more than married women.

Unfortunately, we cannot use the reported output directly for testing whether the estimated difference between single and married women is statistically significant. Knowing the standard errors on marrfem and singfem is not enough to carry out the test (refer to Section 4-4). The easiest thing to do is to choose one of these groups to be the base group and to reestimate the equation. Nothing substantive changes, but we get the needed estimate and its standard error directly. When we use married women as the base group, we obtain

In [1]:

library(foreign)
wage1 <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/wage1.dta?raw=true")

# Example 7.6
  # Generate the subgroup dummies

marrmale<-as.numeric(wage1$female==0 & wage1$married==1)
singmale<-as.numeric(wage1$female==0 & wage1$married==0)
singfem<-as.numeric(wage1$female==1 & wage1$married==0)
 
wageres<-lm(lwage ~ marrmale + singmale + singfem + educ + exper + 
             I(exper^2) + tenure + I(tenure^2), data=wage1)
summary(wageres)



Call:
lm(formula = lwage ~ marrmale + singmale + singfem + educ + exper + 
    I(exper^2) + tenure + I(tenure^2), data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.89697 -0.24060 -0.02689  0.23144  1.09197 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.1231104  0.1057937   1.164 0.245089    
marrmale     0.4109433  0.0457709   8.978  < 2e-16 ***
singmale     0.1982676  0.0578355   3.428 0.000656 ***
singfem      0.0879174  0.0523481   1.679 0.093664 .  
educ         0.0789103  0.0066945  11.787  < 2e-16 ***
exper        0.0268006  0.0052428   5.112 4.50e-07 ***
I(exper^2)  -0.0005352  0.0001104  -4.847 1.66e-06 ***
tenure       0.0290875  0.0067620   4.302 2.03e-05 ***
I(tenure^2) -0.0005331  0.0002312  -2.306 0.021531 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3933 on 517 degrees of freedom
Multiple R-squared:  0.4609,	Adjusted R-squared:  0.4525 
F-statistic: 55.25

The estimate on singfem (single female) is, as expected, .088. Now we have a standard error to go along with this estimate. The t statistic for the null that there is no difference in the population between married women (base group) and single women (singfem) is 1.68. This is marginal evidence against the null hypothesis. We also see that the estimated difference between married men (marrmale) and married women (the base group) is very statistically significant (t = 8.98).
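Re-estimating with a different base group is not the only route. The ingredient missing from the printed table is the covariance between the two estimates, which R exposes through `vcov()`. The sketch below uses simulated data with made-up parameter values (not WAGE1) and shows that the standard error of the difference between two dummy coefficients, computed from the covariance matrix, matches the standard error obtained after changing the base group.

```r
# Simulated data with hypothetical parameter values (not the WAGE1 sample)
set.seed(42)
n        <- 600
female   <- rbinom(n, 1, 0.5)
married  <- rbinom(n, 1, 0.6)
educ     <- sample(8:18, n, replace = TRUE)
marrmale <- as.numeric(female == 0 & married == 1)
marrfem  <- as.numeric(female == 1 & married == 1)
singfem  <- as.numeric(female == 1 & married == 0)
singmale <- as.numeric(female == 0 & married == 0)
lwage <- 0.3 + 0.2 * marrmale - 0.2 * marrfem - 0.1 * singfem +
         0.08 * educ + rnorm(n, sd = 0.4)

# Base group: single men
fit <- lm(lwage ~ marrmale + marrfem + singfem + educ)
b <- coef(fit)
V <- vcov(fit)

# Difference singfem - marrfem and its standard error:
# Var(b1 - b2) = Var(b1) + Var(b2) - 2 Cov(b1, b2)
diff_hat <- unname(b["singfem"] - b["marrfem"])
se_diff  <- sqrt(V["singfem", "singfem"] + V["marrfem", "marrfem"] -
                 2 * V["singfem", "marrfem"])
c(estimate = diff_hat, se = se_diff, t = diff_hat / se_diff)

# Same numbers after re-estimating with married women as the base group
fit2 <- lm(lwage ~ marrmale + singmale + singfem + educ)
summary(fit2)$coefficients["singfem", c("Estimate", "Std. Error", "t value")]
```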

### 7-3a Incorporating Ordinal Information by Using Dummy Variables

Suppose that we would like to estimate the effect of city credit ratings on the municipal bond interest rate (MBR). Several financial companies, such as Moody's Investors Service and Standard and Poor's, rate the quality of debt for local governments, where the ratings depend on things like probability of default. (Local governments prefer lower interest rates in order to reduce their costs of borrowing.) For simplicity, suppose that rankings take on the integer values {0, 1, 2, 3, 4}, with zero being the worst credit rating and four being the best. This is an example of an ordinal variable. Call this CR for concreteness. The question we need to address is: how do we incorporate the variable CR into a model to explain MBR? One possibility is to just include CR as we would include any other explanatory variable:

$MBR=\beta_0+\beta_1*CR+others$

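This specification forces a one-unit increase in CR to have the same effect on MBR whatever the starting rating, which is hard to justify for an ordinal scale. A more flexible approach, viable because CR takes on relatively few values, is to define dummy variables for each value of CR. Thus, let $CR_1=1$ if $CR=1$ and $CR_1=0$ otherwise; $CR_2=1$ if $CR=2$ and $CR_2=0$ otherwise; and so on. Effectively, we take the single credit rating and turn it into five categories. Then, we can estimate the model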

\begin{equation}
MBR=\beta_0+\delta_1*CR_1+\delta_2*CR_2+\delta_3*CR_3+\delta_4*CR_4+others
\end{equation}

Following our rule for including dummy variables in a model, we include four dummy variables because we have five categories. The omitted category here is the credit rating of zero, and so it is the base group. The coefficients are easy to interpret: $\delta_1$ is the difference in MBR (other factors fixed) between a municipality with a credit rating of one and a municipality with a credit rating of zero; $\delta_2$ is the difference in MBR between a municipality with a credit rating of two and a municipality with a credit rating of zero; and so on.
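In R the four dummies do not need to be built by hand: wrapping the ordinal variable in `factor()` inside the formula generates them and omits the lowest level as the base group. The sketch below uses simulated data, with made-up and deliberately non-linear rating effects, since no municipal bond data set accompanies this section.

```r
# Simulated data: the rating effects in `step` are hypothetical
set.seed(7)
n    <- 300
CR   <- sample(0:4, n, replace = TRUE)     # ordinal rating, 0 = worst, 4 = best
step <- c(0, -0.2, -0.9, -1.0, -1.6)       # effect of each rating relative to CR = 0
MBR  <- 6 + step[CR + 1] + rnorm(n, sd = 0.3)

# factor(CR) creates dummies for CR = 1, 2, 3, 4; CR = 0 is the base group
dummies <- lm(MBR ~ factor(CR))
coef(dummies)
```

Each coefficient named `factor(CR)j` estimates $\delta_j$, the difference in MBR between rating j and rating zero.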

### Wooldridge. Example 7.8 Effects of Law School Rankings on Starting Salaries

The file LAWSCH85 contains data on median starting salaries for law school graduates. One of the key explanatory variables is the rank of the law school. Because each law school has a different rank, we clearly cannot include a dummy variable for each rank. If we do not wish to put the rank directly in the equation, we can break it down into categories.


Define the dummy variables top10, r11_25, r26_40, r41_60 and r61_100 to take on the value unity when the variable rank falls into the appropriate range. We let schools ranked below 100 be the base group.

Given a numeric variable, we need to generate a categorical (factor) variable to represent the range into which the rank of a school falls. In R, the command cut is very convenient for this. It takes a numeric variable and a vector of cut points and returns a factor variable. By default, the upper cut points are included in the corresponding range.
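The boundary behavior of `cut` is worth checking before relying on it. The small sketch below applies the same cut points used in this example to a handful of made-up ranks that sit exactly on or just past the boundaries.

```r
cutpts <- c(0, 10, 25, 40, 60, 100, 175)
ranks  <- c(1, 10, 11, 100, 101)      # hypothetical ranks at the boundaries

# Intervals are closed on the right by default, so rank 10 falls in (0,10]
# and rank 100 falls in (60,100]
data.frame(rank = ranks, rankcat = cut(ranks, cutpts))
```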

In [4]:
library(foreign)
lawsch85<-
     read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/lawsch85.dta?raw=true")

# Define cut points for the rank
cutpts <- c(0,10,25,40,60,100,175)

# Create factor variable containing ranges for the rank
lawsch85$rankcat <- cut(lawsch85$rank, cutpts)

# Display frequencies
table(lawsch85$rankcat)

# Choose reference category
lawsch85$rankcat <- relevel(lawsch85$rankcat,"(100,175]")

# Run regression
reg<-lm(log(salary)~rankcat+LSAT+GPA+log(libvol)+log(cost), data=lawsch85)
summary(reg)


   (0,10]   (10,25]   (25,40]   (40,60]  (60,100] (100,175] 
       10        16        13        18        37        62 


Call:
lm(formula = log(salary) ~ rankcat + LSAT + GPA + log(libvol) + 
    log(cost), data = lawsch85)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.294888 -0.039691 -0.001682  0.043888  0.277497 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     9.1652952  0.4114243  22.277  < 2e-16 ***
rankcat(0,10]   0.6995659  0.0534919  13.078  < 2e-16 ***
rankcat(10,25]  0.5935434  0.0394400  15.049  < 2e-16 ***
rankcat(25,40]  0.3750763  0.0340812  11.005  < 2e-16 ***
rankcat(40,60]  0.2628191  0.0279621   9.399 3.18e-16 ***
rankcat(60,100] 0.1315950  0.0210419   6.254 5.71e-09 ***
LSAT            0.0056908  0.0030630   1.858   0.0655 .  
GPA             0.0137255  0.0741919   0.185   0.8535    
log(libvol)     0.0363619  0.0260165   1.398   0.1647    
log(cost)       0.0008412  0.0251360   0.033   0.9734    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08564 on 126 degrees of freedom
  

We observe that all of the dummy variables defining the different ranks are very statistically significant. The estimate on rankcat(60,100] means that, holding LSAT, GPA, libvol and cost fixed, the median salary at a law school ranked between 61 and 100 is about 13.2% higher than that at a law school ranked below 100. The difference between a top 10 school and a below 100 school is quite large. Using the exact calculation from Example 7.5 (exponentiating and subtracting one) gives $\exp(.700)-1 \approx 1.014$, and so the predicted median salary is more than 100% higher at a top 10 school than it is at a below 100 school.

## 7-4 Interactions Involving Dummy Variables

### 7-4b Allowing for different slopes

Just as variables with quantitative meaning can be interacted in regression models, so can dummy variables.

There are also occasions for interacting dummy variables with explanatory variables that are not dummy variables to allow for a difference in slopes. 

### Wooldridge. Example 7.10 Log Hourly Wage Equation

Continuing with the wage example, suppose that we wish to test whether the return to education is the same for men and women, allowing for a constant wage differential between men and women (a differential for which we have already found evidence). For simplicity, we include only education and gender in the model. To apply OLS, we must write the model with an interaction between female and educ:

\begin{equation}
log(wage)=\beta_0+\delta_0*female+\beta_1*educ+\delta_1*female*educ+u
\end{equation}

An important hypothesis is that the return to education is the same for women and men. 
In terms of the previous model, this is stated as $H_0: \delta_1=0$, which means that the slope of log( wage ) with respect to educ is the same for men and women. Note that this hypothesis puts no restrictions on the difference in intercepts, $\delta_0$. A wage differential between men and women is allowed under this null, but it must be the same at all levels of education.

In [9]:
library(foreign)
wage1 <- read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/wage1.dta")

wagereg<-(lm(log(wage)~female+educ+exper+female*educ+exper+I(exper^2)+tenure+I(tenure^2),
                                                           data=wage1))
summary(wagereg)


Call:
lm(formula = log(wage) ~ female + educ + exper + female * educ + 
    exper + I(exper^2) + tenure + I(tenure^2), data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.83265 -0.25261 -0.02374  0.25396  1.13584 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.3888060  0.1186871   3.276  0.00112 ** 
female      -0.2267886  0.1675394  -1.354  0.17644    
educ         0.0823692  0.0084699   9.725  < 2e-16 ***
exper        0.0293366  0.0049842   5.886 7.11e-09 ***
I(exper^2)  -0.0005804  0.0001075  -5.398 1.03e-07 ***
tenure       0.0318967  0.0068640   4.647 4.28e-06 ***
I(tenure^2) -0.0005900  0.0002352  -2.509  0.01242 *  
female:educ -0.0055645  0.0130618  -0.426  0.67028    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4001 on 518 degrees of freedom
Multiple R-squared:  0.441,	Adjusted R-squared:  0.4334 
F-statistic: 58.37 on 7 and 518 DF,  p-value: < 2.2e-16


The estimated return to education for men in this equation is .082, or 8.2%. For women, it is .082-.0056=.0764, or about 7.6%. The difference, -.56%, or just over one-half a percentage point less for women, is neither economically large nor statistically significant: the t statistic is -0.426. Thus, we conclude that there is no evidence against the hypothesis that the return to education is the same for men and women.

The coefficient on female, while remaining economically large, is no longer significant at conventional levels (t=-1.35). Its coefficient and t statistic in the equation without the interaction were -.297 and -8.28, respectively (see Example 7.5). Should we now conclude that there is no statistically significant evidence of lower pay for women at the same levels of educ, exper, and tenure? This would be a serious error. Because we have added the interaction female*educ to the equation, the coefficient on female is now estimated much less precisely than it was in Example 7.5: the standard error has increased by almost fivefold (.168/.036 ≈ 4.67). This occurs because female and female*educ are highly correlated in the sample.

In this example, there is a useful way to think about the multicollinearity: in the previous model, $\delta_0$ measures the wage differential between women and men when educ=0. Very few people in the sample have very low levels of education, so it is not surprising that we have a difficult time estimating the differential at educ=0 (nor is the differential at zero years of education very informative).
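One common way to obtain a more informative differential, sketched here as an illustration rather than as part of the example above, is to interact female with education measured as a deviation from its sample mean. The coefficient on female then measures the differential at average education. The code uses simulated data with made-up parameter values, and `educ_c` is a name introduced here.

```r
# Simulated data with hypothetical parameter values (not the WAGE1 sample)
set.seed(99)
n      <- 500
female <- rbinom(n, 1, 0.5)
educ   <- sample(8:18, n, replace = TRUE)
lwage  <- 0.4 - 0.25 * female + 0.08 * educ - 0.005 * female * educ +
          rnorm(n, sd = 0.4)

raw      <- lm(lwage ~ female * educ)      # female coefficient: gap at educ = 0
educ_c   <- educ - mean(educ)              # education as deviation from its mean
centered <- lm(lwage ~ female * educ_c)    # female coefficient: gap at mean educ

# Estimate and standard error of the coefficient on female in each model
rbind(raw      = summary(raw)$coefficients["female", 1:2],
      centered = summary(centered)$coefficients["female", 1:2])

# The interaction estimate itself is unchanged by centering
c(coef(raw)["female:educ"], coef(centered)["female:educ_c"])
```

The two models have identical fit; only the point at which the gender differential is evaluated changes.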

### 7-4c Testing for Differences in Regression Functions across Groups

The previous examples illustrate that interacting dummy variables with other independent variables can be a powerful tool. Sometimes, we wish to test the null hypothesis that two populations or groups follow the same regression function, against the alternative that one or more of the slopes differ across the groups. We will also see examples of this in Chapter 13, when we discuss pooling different cross sections over time.

Suppose we want to test whether the same regression model describes college grade point averages for male and female college athletes. The equation is

\begin{equation}
cumgpa=\beta_0+\beta_1*sat+\beta_2*hsperc+\beta_3*tothrs+u
\end{equation}

where sat is SAT score, hsperc is high school rank percentile, and tothrs is total hours of college courses. We know that, to allow for an intercept difference, we can include a dummy variable for either males or females. If we want any of the slopes to depend on gender, we simply interact the appropriate variable with, say, female , and include it in the equation.

If we are interested in testing whether there is any difference between men and women, then we must allow a model where the intercept and all slopes can be different across the two groups:

\begin{equation}
cumgpa=\beta_0+\delta_0*female+\beta_1*sat+\delta_1*female*sat+\beta_2*hsperc+\delta_2*female*hsperc+\beta_3*tothrs+\delta_3*female*tothrs+u
\end{equation}

The parameter $\delta_0$ is the difference in the intercept between women and men, $\delta_1$ is the slope difference with respect to sat between women and men, and so on. The null hypothesis that cumgpa follows the same model for males and females is stated as: $H_0:\delta_0=0, \delta_1=0,\delta_2=0,\delta_3=0$

If one of the $\delta_j$ is different from zero, then the model is different for men and women. Using the spring semester data from the file GPA3, the full model is estimated as

In [17]:
library(foreign)
gpa3 <- read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/gpa3.dta")

# Model with full interactions with female dummy (only for spring data)
reg<-lm(cumgpa~female*(sat+hsperc+tothrs), data=gpa3, subset=(spring==1))
summary(reg)




Call:
lm(formula = cumgpa ~ female * (sat + hsperc + tothrs), data = gpa3, 
    subset = (spring == 1))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.51370 -0.28645 -0.02306  0.27555  1.24760 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.4808117  0.2073336   7.142 5.17e-12 ***
female        -0.3534862  0.4105293  -0.861  0.38979    
sat            0.0010516  0.0001811   5.807 1.40e-08 ***
hsperc        -0.0084516  0.0013704  -6.167 1.88e-09 ***
tothrs         0.0023441  0.0008624   2.718  0.00688 ** 
female:sat     0.0007506  0.0003852   1.949  0.05211 .  
female:hsperc -0.0005498  0.0031617  -0.174  0.86206    
female:tothrs -0.0001158  0.0016277  -0.071  0.94331    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4678 on 358 degrees of freedom
Multiple R-squared:  0.4059,	Adjusted R-squared:  0.3943 
F-statistic: 34.95 on 7 and 358 DF,  p-value: < 2.2e-16


None of the four terms involving the female dummy variable is very statistically significant; only the female·sat interaction has a t statistic close to two.

The large standard errors on female and the interaction terms make it difficult to tell exactly how men and women differ. We must be very careful in interpreting the previous model because, in obtaining differences between women and men, the interaction terms must be taken into account. If we look only at the female variable, we would wrongly conclude that cumgpa is about .353 less for women than for men, holding other factors fixed. This is the estimated difference only when sat, hsperc, and tothrs are all set to zero, which is not close to being a possible scenario.

At sat=1100, hsperc=10, and tothrs=50, the predicted difference between a woman and a man is -.353+(.00075*1100)-(.00055*10)-(.00011*50)=.461. That is, the female athlete is predicted to have a GPA that is almost one-half a point higher than the comparable male athlete.
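The same calculation can be scripted so that the difference is easy to re-evaluate at other values. The sketch below copies the four coefficients involving female from the output above; the vector names `d` and `x` are introduced here for illustration.

```r
# Coefficients on female and its interactions, copied from the output above
d <- c(female = -0.3534862, sat = 0.0007506,
       hsperc = -0.0005498, tothrs = -0.0001158)

# Values of the regressors at which the female-male difference is evaluated
x <- c(sat = 1100, hsperc = 10, tothrs = 50)

# Difference = delta_0 + delta_1*sat + delta_2*hsperc + delta_3*tothrs
diff_gpa <- unname(d["female"] + sum(d[names(x)] * x))
round(diff_gpa, 3)
```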

The so-called Chow statistic allows us to formally test the null hypothesis that the coefficient on female and the coefficients on all of its interactions are zero.

In [18]:
# F-Test from package "car". H0: all coefficients involving female are zero
# matchCoefs(...) selects all coeffs with names containing "female",
# that is, the female dummy itself and its three interactions
library(car)
linearHypothesis(reg, matchCoefs(reg, "female"))

Res.Df       RSS  Df  Sum of Sq         F        Pr(>F)
   362  85.51507
   358  78.35451   4   7.160561  8.179112  2.544637e-06


The F statistic is about 8.18 and the p-value is zero to five decimal places, which leads us to strongly reject $H_0$. Thus, male and female athletes do follow different GPA models.
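The F statistic can also be reproduced by hand from the two residual sums of squares in the table above, using the usual formula $F=\frac{(SSR_r-SSR_{ur})/q}{SSR_{ur}/df_{ur}}$. The sketch below plugs in the reported numbers; the variable names are introduced here for illustration.

```r
rss_r  <- 85.51507   # restricted model: no terms involving female
rss_ur <- 78.35451   # unrestricted model: female and all its interactions
q      <- 4          # number of restrictions under H0
df_ur  <- 358        # residual degrees of freedom, unrestricted model

F_stat <- ((rss_r - rss_ur) / q) / (rss_ur / df_ur)
p_val  <- pf(F_stat, q, df_ur, lower.tail = FALSE)
c(F = F_stat, p = p_val)
```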

## 7-5 A binary dependent variable: The linear probability model

By now, we have learned much about the properties and applicability of the multiple linear regression model. In the last several sections, we studied how, through the use of binary independent variables, we can incorporate qualitative information as explanatory variables in a multiple regression model. In all of the models up until now, the dependent variable y has had quantitative meaning (for example, y is a dollar amount, a test score, a percentage, or the logs of these). What happens if we want to use multiple regression to explain a qualitative event?

In the simplest case, and one that often arises in practice, the event we would like to explain is a binary outcome. In other words, our dependent variable, y , takes on only two values: zero and one. For example, y can be defined to indicate whether an adult has a high school education; y can indicate whether a college student used illegal drugs during a given school year; or y can indicate whether a firm was taken over by another firm during a given year. In each of these examples, we can let y=1 denote one of the outcomes and y= 0 the other outcome.

What does it mean to write down a multiple regression model, such as

\begin{equation}
y=\beta_0+\beta_1*x_1+ \ldots +\beta_k*x_k+u
\end{equation}

when y is a binary variable? Because y can take on only two values, $\beta_j$ cannot be interpreted as the change in y given a one-unit increase in $x_j$ , holding all other factors fixed: y either changes from zero to one or from one to zero (or does not change). Nevertheless, the $\beta_j$ still have useful interpretations. If we assume that the zero conditional mean assumption MLR.4 holds, that is, $E(u|x_1,\ldots,x_k)=0$, then we have, as always,

\begin{equation}
E(y|x)=\beta_0+\beta_1*x_1+ \ldots +\beta_k*x_k
\end{equation}

The key point is that when y is a binary variable taking on the values zero and one, it is always true that $P(y=1|x)=E(y|x)$: the probability of "success", that is, the probability that y=1, is the same as the expected value of y. Thus, we have the important equation

\begin{equation}
P(y=1|x)=\beta_0+\beta_1*x_1+ \ldots +\beta_k*x_k
\end{equation}

which says that the probability of success, say, $p(x)=P(y=1|x)$, is a linear function of the $x_j$. The previous equation is an example of a binary response model. Refer to Chapter 17 for other examples.

The multiple linear regression model with a binary dependent variable is called the linear probability model (LPM) because the response probability is linear in the parameters $\beta_j$. In the LPM, $\beta_j$ measures the change in the probability of success when $x_j$ changes, holding other factors fixed:

\begin{equation}
\Delta P(y=1|x)=\beta_j \Delta x_j
\end{equation}
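The probability interpretation is easiest to see with a single binary regressor. The sketch below uses simulated data with made-up probabilities, $P(y=1|x)=0.2+0.3x$, and shows that the OLS slope from the LPM is exactly the difference between the two groups in the sample proportion of successes.

```r
# Simulated data: the probabilities 0.2 and 0.5 are hypothetical
set.seed(2024)
n <- 2000
x <- rbinom(n, 1, 0.5)             # binary explanatory variable
y <- rbinom(n, 1, 0.2 + 0.3 * x)   # P(y = 1 | x) = 0.2 + 0.3 * x

lpm <- lm(y ~ x)                   # linear probability model estimated by OLS
coef(lpm)

# With one dummy regressor, the slope equals the difference in the
# sample proportions of y = 1 between x = 1 and x = 0
mean(y[x == 1]) - mean(y[x == 0])
```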

## 7-7 Interpreting Regression Results with Discrete Dependent Variables

A binary response is the most extreme form of a discrete random variable: it takes on only two values, zero and one. As we discussed in Section 7-5, the parameters in a linear probability model can be interpreted as measuring the change in the probability that y=1 due to a one-unit increase in an explanatory variable. We also discussed that, because y is a zero-one outcome, $P(y=1)=E(y)$, and this equality continues to hold when we condition on explanatory variables.

To interpret regression results generally, even in cases where y is discrete and takes on a small number of values, it is useful to remember the interpretation of OLS as estimating the effects of the $x_j$ on the expected (or average ) value of y . Generally, under Assumptions MLR.1 and MLR.4,

\begin{equation}
E(y|x_1,x_2,\ldots,x_k)=\beta_0+\beta_1*x_1+ \ldots +\beta_k*x_k
\end{equation}

Therefore, $\beta_j$ is the effect of a ceteris paribus increase of $x_j$ on the expected value of y . As we discussed in Section 6-4, for a given set of $x_j$ values we interpret the predicted value, $\hat\beta_0+\hat\beta_1*x_1+\ldots+\hat\beta_k*x_k$ , as an estimate of $E(y|x_1,x_2,\ldots,x_k)$. Therefore, $\hat\beta_j$ is our estimate of how the average of y changes when $\Delta x_j=1$ (keeping other factors fixed).

Incidentally, when y is discrete the linear model does not always provide the best estimates of partial effects on $E(y|x_1,x_2,\ldots,x_k)$. Chapter 17 contains more advanced models and estimation methods that tend to fit the data better when the range of y is limited in some substantive way. Nevertheless, a linear model estimated by OLS often provides a good approximation to the true partial effects, at least on average.