# Chapter 17. Limited Dependent Variable Models and Sample Selection Corrections

In Chapter 7, we studied the linear probability model, which is simply an application of the multiple regression model to a binary dependent variable. A binary dependent variable is an example of a limited dependent variable (LDV). An LDV is broadly defined as a dependent variable whose range of values is substantively restricted. A binary variable takes on only two values, zero and one. In Section 7-7, we discussed the interpretation of multiple regression estimates for generally discrete response variables, focusing on the case where y takes on a small number of integer values (for example, the number of times a young man is arrested during a year or the number of children born to a woman). Elsewhere, we have encountered several other limited dependent variables, including the percentage of people participating in a pension plan (which must be between zero and 100) and college grade point average (which is between zero and 4.0 at most colleges).

Most economic variables we would like to explain are limited in some way, often because they must be positive. For example, hourly wage, housing price, and nominal interest rates must be greater than zero. But not all such variables need special treatment. If a strictly positive variable takes on many different values, a special econometric model is rarely necessary. When y is discrete and takes on a small number of values, it makes no sense to treat it as an approximately continuous variable. Discreteness of y does not in itself mean that linear models are inappropriate. However, as we saw in Chapter 7 for binary response, the linear probability model has certain drawbacks. In Section 17-1, we discuss logit and probit models, which overcome the shortcomings of the LPM; the disadvantage is that they are more difficult to interpret.

Other kinds of limited dependent variables arise in econometric analysis, especially when the behavior of individuals, families, or firms is being modeled. Optimizing behavior often leads to a corner solution response for some nontrivial fraction of the population. That is, it is optimal to choose a zero quantity or dollar value, for example. During any given year, a significant number of families will make zero charitable contributions. Therefore, annual family charitable contributions has a population distribution that is spread out over a large range of positive values, but with a pileup at the value zero. Although a linear model could be appropriate for capturing the expected value of charitable contributions, a linear model will likely lead to negative predictions for some families. Taking the natural log is not possible because many observations are zero. The Tobit model, which we cover in Section 17-2, is explicitly designed to model corner solution dependent variables.

Another important kind of LDV is a count variable, which takes on nonnegative integer values. Section 17-3 illustrates how Poisson regression models are well suited for modeling count variables.

In some cases, we encounter limited dependent variables due to data censoring, a topic we introduce in Section 17-4. The general problem of sample selection, where we observe a nonrandom sample from the underlying population, is treated in Section 17-5.

Limited dependent variable models can be used for time series and panel data, but they are most often applied to cross-sectional data. Sample selection problems are usually confined to cross-sectional or panel data. We focus on cross-sectional applications in this chapter. Wooldridge (2010) analyzes these problems in the context of panel data models and provides many more details for cross-sectional and panel data applications.

## 17-1. Logit and Probit Models for Binary Response

The linear probability model is simple to estimate and use, but it has some drawbacks that we discussed in Section 7-5. The two most important disadvantages are that the fitted probabilities can be less than zero or greater than one and the partial effect of any explanatory variable (appearing in level form) is constant. These limitations of the LPM can be overcome by using more sophisticated binary response models.

In a binary response model, interest lies primarily in the response probability

\begin{equation}
P(y=1|x)=P(y=1|x_1,x_2,\ldots,x_k) \tag{17.1}
\end{equation}

where we use x to denote the full set of explanatory variables. For example, when y is an employment indicator, x might contain various individual characteristics such as education, age, marital status, and other factors that affect employment status, including a binary indicator variable for participation in a recent job training program.

### 17.1a Specifying Logit and Probit Models

In the LPM, we assume that the response probability is linear in a set of parameters, $\beta_j$; see equation (7.27). To avoid the LPM limitations, consider a class of binary response models of the form

\begin{equation}
P(y=1|x)=G(\beta_0+\beta_1 x_1+\ldots+\beta_k x_k)=G(\beta_0+x \beta) \tag{17.2}
\end{equation}

where G is a function taking on values strictly between zero and one: $0 < G(z) < 1$ for all real numbers z. This ensures that the estimated response probabilities are strictly between zero and one.

Various nonlinear functions have been suggested for the function G to make sure that the probabilities are between zero and one. The two we will cover here are used in the vast majority of applications (along with the LPM). In the logit model, G is the logistic function:

\begin{equation}
G(z)=\exp(z)/[1+\exp(z)]=\Lambda(z) \tag{17.3}
\end{equation}

which is between zero and one for all real numbers z. This is the cumulative distribution function (cdf) for a standard logistic random variable. In the probit model, G is the standard normal cdf, which is expressed as an integral:

\begin{equation}
G(z)=\Phi(z) \equiv \int_{-\infty}^{z} \phi(v) dv \tag{17.4}
\end{equation}

where $\phi(z)$ is the standard normal density, $\phi(z)=(2\pi)^{-1/2}\exp(-z^2/2)$.
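To make the two choices of G concrete, the following minimal sketch evaluates the logistic cdf and the standard normal cdf at a few points. It assumes only base R; `Lambda` is a helper name introduced here for illustration, while `plogis()` and `pnorm()` are the built-in cdfs.

```r
# Logistic cdf written out as in (17.3); Lambda is a helper defined here for illustration
Lambda <- function(z) exp(z) / (1 + exp(z))

z <- c(-3, -1, 0, 1, 3)

# Base R provides both cdfs: plogis() is the logistic cdf, pnorm() the standard normal cdf
logit_G  <- plogis(z)
probit_G <- pnorm(z)

print(data.frame(z, logit_G, probit_G))

# The hand-written Lambda() agrees with plogis()
print(all.equal(Lambda(z), logit_G))
```

Both functions are increasing in z, equal one-half at z = 0, and stay strictly inside the unit interval, which is the property required of G in (17.2).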

In most applications of binary response models, the primary goal is to explain the effects of the $x_j$ on the response probability $P(y=1|x)$.

If, say, $x_1$ is a binary explanatory variable, then the partial effect from changing $x_1$ from zero to one, holding all other variables fixed, is simply

\begin{equation}
G(\beta_0+\beta_1+\beta_2 x_2+\ldots+\beta_k x_k)-G(\beta_0+\beta_2 x_2+\ldots+\beta_k x_k) \tag{17.8}
\end{equation}

In the job training example, where $x_1$ indicates program participation, knowing the sign of $\beta_1$ is sufficient for determining whether the program had a positive or negative effect. But to find the magnitude of the effect, we have to estimate the quantity in (17.8).
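As a rough numerical illustration of (17.8), the sketch below evaluates the effect of switching a binary $x_1$ from zero to one at several values of a second variable. The coefficient values are hypothetical and chosen only for the example; they are not estimates from any data set.

```r
# Hypothetical coefficients, for illustration only (not estimates from any data set)
beta0 <- -1.0   # intercept
beta1 <-  0.5   # coefficient on the binary variable x1 (e.g., program participation)
beta2 <-  0.1   # coefficient on x2 (e.g., years of education)

# Partial effect of switching x1 from 0 to 1 at a given x2, as in (17.8)
pe_binary <- function(x2, G = pnorm) {
  G(beta0 + beta1 + beta2 * x2) - G(beta0 + beta2 * x2)
}

x2_values <- c(8, 12, 16)
effects <- data.frame(x2     = x2_values,
                      probit = pe_binary(x2_values, G = pnorm),
                      logit  = pe_binary(x2_values, G = plogis))
print(effects)
```

The sign of every entry matches the sign of `beta1`, but the magnitude changes with `x2`, which is why the sign of the coefficient alone does not give the size of the effect.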

### 17.1b Maximum Likelihood Estimation of Logit and Probit Models

How should we estimate nonlinear binary response models? To estimate the LPM, we can use ordinary least squares (see Wooldridge Section 7-5) or, in some cases, weighted least squares (see Section 8-5). Because of the nonlinear nature of $E(y|x)$, OLS and WLS are not applicable. We could use nonlinear versions of these methods, but it is no more difficult to use maximum likelihood estimation (MLE) (see Appendix 17A in Wooldridge for a brief discussion).
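To show what maximum likelihood estimation involves in practice, here is a minimal sketch that maximizes the probit log-likelihood, the sum over observations of $y_i\log G(x_i\beta)+(1-y_i)\log[1-G(x_i\beta)]$, with a general-purpose optimizer and compares the result with `glm()`. The data are simulated and the "true" coefficients are made up for the illustration.

```r
set.seed(123)

# Simulated data; the "true" coefficients below are made up for the illustration
n  <- 2000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
y  <- as.numeric(-0.5 + 1.0 * x1 + 0.7 * x2 + rnorm(n) > 0)
X  <- cbind(1, x1, x2)

# Negative probit log-likelihood; note that 1 - pnorm(xb) equals pnorm(-xb)
negloglik <- function(b) {
  xb <- drop(X %*% b)
  -sum(y * pnorm(xb, log.p = TRUE) + (1 - y) * pnorm(-xb, log.p = TRUE))
}

mle_by_hand <- optim(c(0, 0, 0), negloglik, method = "BFGS")$par
mle_by_glm  <- unname(coef(glm(y ~ x1 + x2, family = binomial(link = "probit"))))

print(round(cbind(mle_by_hand, mle_by_glm), 3))
```

The two columns should agree up to optimizer tolerance, since `glm()` with a probit link maximizes the same likelihood by a different numerical algorithm.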

### 17.1d Interpreting the Logit and Probit Estimates

Given modern computers, from a practical perspective the most difficult aspect of logit or probit models is presenting and interpreting the results. The coefficient estimates, their standard errors, and the value of the log-likelihood function are reported by all software packages that do logit and probit, and these should be reported in any application. The coefficients give the signs of the partial effects of each $x_j$ on the response probability, and the statistical significance of $x_j$ is determined by whether we can reject $H_0: \beta_j=0$ at a sufficiently small significance level.

Often, we want to estimate the effects of the $x_j$ on the response probabilities, $P(y=1|x)$. If $x_j$ is (roughly) continuous, then

\begin{equation}
\Delta \hat P(y=1|x) \approx [g(\hat \beta_0+x\hat \beta)\hat \beta_j]\Delta x_j \tag{17.13}
\end{equation}

for "small" changes in $x_j$. So, for $\Delta x_j=1$, the change in the estimated success probability is roughly $g(\hat \beta_0+x \hat \beta)\hat \beta_j$. Compared with the linear probability model, the cost of using probit and logit models is that the partial effects in equation (17.13) are harder to summarize because the scale factor, $g(\hat \beta_0+x \hat \beta)$, depends on x (that is, on all of the explanatory variables).

As a quick summary for getting at the magnitudes of the partial effects, it is handy to have a single scale factor that can be used to multiply each $\hat \beta_j$ (or at least those coefficients on roughly continuous variables). One method, commonly used in econometrics packages that routinely estimate probit and logit models, is to replace each explanatory variable with its sample average. In other words, the adjustment factor is

\begin{equation}
g(\hat \beta_0+\bar x \hat \beta)=g(\hat \beta_0+\hat \beta_1 \bar x_1+\hat \beta_2 \bar x_2+\ldots+\hat \beta_k \bar x_k) \tag{17.14}
\end{equation}

where $g(\cdot)$ is the standard normal density in the probit case and $g(z)=\exp(z)/[1+\exp(z)]^2$ in the logit case. The idea behind (17.14) is that, when it is multiplied by $\hat \beta_j$, we obtain the partial effect of $x_j$ for the "average" person in the sample. Thus, if we multiply a coefficient by (17.14), we generally obtain the partial effect at the average (PEA).

A different approach to computing a scale factor circumvents the issue of which values to plug in for the explanatory variables. Instead, the second scale factor results from averaging the individual partial effects across the sample, leading to what is called the average partial effect (APE) or, sometimes, the average marginal effect (AME).
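The difference between the two scale factors can be sketched by computing both by hand from a fitted logit. The data below are simulated with made-up coefficients, and the variable names are illustrative; for a probit fit, `dnorm` would replace `dlogis`.

```r
set.seed(42)

# Simulated data with made-up coefficients
n  <- 1000
x1 <- rnorm(n, mean = 1)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 - 0.6 * x2))

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
b   <- coef(fit)
X   <- model.matrix(fit)          # includes the intercept column

g <- dlogis                       # logistic density; use dnorm for probit

# PEA: scale factor evaluated at the sample averages, as in (17.14)
scale_pea <- g(sum(colMeans(X) * b))
# APE: average of the individual scale factors across the sample
scale_ape <- mean(g(drop(X %*% b)))

print(round(rbind(PEA = scale_pea * b[-1], APE = scale_ape * b[-1]), 4))
```

Both rows rescale the same coefficients, so the ratios between variables are identical; only the common scale factor differs.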

### Wooldridge Example 17.1 Married Women's Labor Force Participation

We now use the data on 753 married women in MROZ to estimate the labor force participation model from Example 8.8 (see also Section 7-5) by logit and probit. We also report the linear probability model estimates from Example 8.8. Wooldridge's Table 17.1 uses heteroskedasticity-robust standard errors for the LPM; the stargazer call below prints the usual OLS standard errors. The results, with standard errors in parentheses, are given in Table 17.1.

In [10]:
install.packages('stargazer'); install.packages('mfx')
library(foreign);library(car); library(lmtest); library(stargazer)  # for robust SE
mroz <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/mroz.dta?raw=true")

# Estimate linear probability model
linprob <- lm(inlf~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6,data=mroz)

# Estimate logit model
logitres<-glm(inlf~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6,
                                family=binomial(link=logit),data=mroz)

# Estimate probit model
probitres<-glm(inlf~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6,
                                family=binomial(link=probit),data=mroz)


stargazer(linprob,logitres, probitres,type="text")

                                Dependent variable:            
                    -------------------------------------------
                                       inlf                    
                              OLS           logistic   probit  
                              (1)              (2)       (3)   
---------------------------------------------------------------
nwifeinc                   -0.003**         -0.021**  -0.012** 
                            (0.001)          (0.008)   (0.005) 
                                                               
educ                       0.038***         0.221***  0.131*** 
                            (0.007)          (0.043)   (0.025) 
                                                               
exper                      0.039***         0.206***  0.123*** 
                            (0.006)          (0.032)   (0.019) 
                                                               
I(exper2)                  -0.001***   

The estimates from the three models tell a consistent story. The signs of the coefficients are the same across models, and the same variables are statistically significant in each model. The pseudo R-squared for the LPM is just the usual R-squared reported for OLS; for logit and probit, the pseudo R-squared is the measure based on the log-likelihoods described earlier.

The following table reports the average partial effects for all explanatory variables and for each of the three estimated models.

In [73]:
 #Automatic APE calculations with package mfx
library(mfx)
logitpartialeffects<-logitmfx(inlf~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6, 
                                              data=mroz, atmean=FALSE)
probitpartialeffects<-probitmfx(inlf~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6, 
                                              data=mroz, atmean=FALSE)
linpartialeffects <- lm(inlf~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6,data=mroz)

In [77]:
# Let's print it in a single table for comparison
data.frame(linpartialeffects$coefficients[-1],logitpartialeffects$mfxest[,1],probitpartialeffects$mfxest[,1])

| Variable | LPM (OLS coefficient) | Logit APE | Probit APE |
|---|---|---|---|
| nwifeinc | -0.0034051689 | -0.0038118134 | -0.003616175 |
| educ | 0.0379953029 | 0.0394965237 | 0.039370095 |
| exper | 0.0394923894 | 0.0367641055 | 0.037097345 |
| I(exper^2) | -0.0005963119 | -0.0005632587 | -0.000567546 |
| age | -0.0160908062 | -0.0157193607 | -0.015895665 |
| kidslt6 | -0.261810467 | -0.2577536552 | -0.261153464 |
| kidsge6 | 0.0130122345 | 0.0107348185 | 0.010828887 |


As is clear from the table, the APEs are very similar for all explanatory variables across all three models. The biggest difference between the LPM and the logit and probit models is that the LPM assumes constant marginal effects for educ, kidslt6, and so on, while the logit and probit models imply diminishing magnitudes of the partial effects.

## 17-2. The Tobit Model for Corner Solution Responses 

As mentioned in the chapter introduction, another important kind of limited dependent variable is a corner solution response. Such a variable is zero for a nontrivial fraction of the population but is roughly continuously distributed over positive values. An example is the amount an individual spends on alcohol in a given month. In the population of people over age 21 in the United States, this variable takes on a wide range of values. For some significant fraction, the amount spent on alcohol is zero.

Let y be a variable that is essentially continuous over strictly positive values but that takes on a value of zero with positive probability. Nothing prevents us from using a linear model for y. In fact, a linear model might be a good approximation to $E(y|x_1,x_2,\ldots,x_k)$, especially for $x_j$ near the mean values. But we would possibly obtain negative fitted values, which lead to negative predictions for y; this is analogous to the problems with the LPM for binary outcomes. Also, the assumption that an explanatory variable appearing in level form has a constant partial effect on $E(y|x)$ can be misleading.

The Tobit model is quite convenient for these purposes. Typically, the Tobit model expresses the observed response, y , in terms of an underlying latent variable:

\begin{equation}
y^*=\beta_0+x \beta+u, \quad u|x \sim \mathrm{Normal}(0,\sigma^2) \tag{17.18}
\end{equation}
\begin{equation}
y=\max(0,y^*) \tag{17.19}
\end{equation}

The latent variable $y^*$ satisfies the classical linear model assumptions; in particular, it has a normal, homoskedastic distribution with a linear conditional mean. Equation (17.19) implies that the observed variable, y, equals $y^*$ when $y^* \geq 0$, but $y=0$ when $y^* < 0$. Because $y^*$ is normally distributed, y has a continuous distribution over strictly positive values.
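A short simulation can make the latent-variable formulation in (17.18) and (17.19) tangible. This is a sketch with made-up parameter values; it generates the latent variable, applies the max operator, and shows the resulting pileup at zero.

```r
set.seed(1)

# Made-up parameter values for the latent model (17.18)
n     <- 10000
x     <- rnorm(n)
sigma <- 2
ystar <- 1 + 1.5 * x + rnorm(n, sd = sigma)   # latent variable
y     <- pmax(0, ystar)                        # observed outcome, as in (17.19)

# Share of observations piled up at zero, and the range of y
share_zero <- mean(y == 0)
print(share_zero)
print(range(y))

# OLS of y on x does not recover the latent slope of 1.5
ols_slope <- unname(coef(lm(y ~ x))["x"])
print(ols_slope)
```

In this design roughly a third of the observations equal zero, and the OLS slope from regressing y on x is noticeably smaller than the latent slope.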

### 17.2a Interpreting the Tobit Estimates

Using modern computers, it is usually not much more difficult to obtain the maximum likelihood estimates for Tobit models than the OLS estimates of a linear model. Further, the outputs from Tobit and OLS are often similar. This makes it tempting to interpret the $\hat \beta_j$ from Tobit as if these were estimates from a linear regression. Unfortunately, things are not so easy.

From equation (17.18), we see that the $\beta_j$ measure the partial effects of the $x_j$ on $E(y^*|x)$, where $y^*$ is the latent variable. Sometimes, $y^*$ has an interesting economic meaning, but more often it does not. The variable we want to explain is y, as this is the observed outcome (such as hours worked or amount of charitable contributions). For example, as a policy matter, we are interested in the sensitivity of hours worked to changes in marginal tax rates.
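For reference, the standard expressions for the conditional mean of the observed outcome under (17.18) and (17.19) are stated here without derivation, with the intercept written separately to match (17.18):

\begin{equation}
E(y|x)=\Phi\left(\frac{\beta_0+x\beta}{\sigma}\right)(\beta_0+x\beta)+\sigma\,\phi\left(\frac{\beta_0+x\beta}{\sigma}\right)
\end{equation}

For a continuous $x_j$ that is not functionally related to the other explanatory variables, differentiating gives

\begin{equation}
\frac{\partial E(y|x)}{\partial x_j}=\beta_j\,\Phi\left(\frac{\beta_0+x\beta}{\sigma}\right)
\end{equation}

Because the scale factor $\Phi(\cdot)$ lies strictly between zero and one, the partial effect on $E(y|x)$ has the same sign as $\beta_j$ but is smaller in magnitude, which is why Tobit coefficients cannot be compared directly with OLS coefficients.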

### Wooldridge Example 17.2 Married Women's Annual Labor Supply

The file MROZ includes data on hours worked for 753 married women, 428 of whom worked for a wage outside the home during the year; 325 of the women worked zero hours. For the women who worked positive hours, the range is fairly broad, extending from 12 to 4,950. Thus, annual hours worked is a good candidate for a Tobit model. We also estimate a linear model (using all 753 observations) by OLS. The following code performs the calculations:

In [80]:
install.packages('censReg')




In [97]:
library(foreign)
mroz <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/mroz.dta?raw=true")

# Estimate Tobit model using censReg:
library(censReg)
TobitRes <- censReg(hours~nwifeinc+educ+exper+I(exper^2)+ 
                                    age+kidslt6+kidsge6, data=mroz )
#summary(TobitRes)

In [98]:
linpartialeffects <- lm(hours~nwifeinc+educ+exper+I(exper^2)+age+kidslt6+kidsge6,data=mroz)
#linpartialeffects$coefficients

In [96]:
# Let's print it in a single table for comparison
data.frame(linpartialeffects$coefficients,head(TobitRes$estimate,-1))

| Variable | OLS coefficient | Tobit coefficient |
|---|---|---|
| (Intercept) | 1330.4824036 | 965.305296 |
| nwifeinc | -3.4466356 | -8.814243 |
| educ | 28.7611246 | 80.645605 |
| exper | 65.6725131 | 131.564299 |
| I(exper^2) | -0.7004939 | -1.864158 |
| age | -30.5116345 | -54.405012 |
| kidslt6 | -442.0899082 | -894.02174 |
| kidsge6 | -32.7792266 | -16.217996 |


This table has several noteworthy features. First, the Tobit coefficient estimates have the same sign as the corresponding OLS estimates. Second, though it is tempting to compare the magnitudes of the OLS and Tobit estimates, this is not very informative. We must be careful not to think that, because the Tobit coefficient on kidslt6 is roughly twice that of the OLS coefficient, the Tobit model implies a much greater response of hours worked to young children.

The following code computes the Tobit partial effects at the average values of the explanatory variables, using margEff from the censReg package, and prints them next to the OLS coefficients. For the linear model, the partial effects are simply the OLS coefficients, except for the variable exper, which appears as a quadratic. The Tobit partial effects for nwifeinc, educ, and kidslt6 are all larger in magnitude than the corresponding OLS coefficients.

In [103]:
# Partial Effects at the average x:
data.frame(linpartialeffects$coefficients[-1],margEff(TobitRes))

| Variable | OLS coefficient | Tobit partial effect (margEff) |
|---|---|---|
| nwifeinc | -3.4466356 | -5.326442 |
| educ | 28.7611246 | 48.734094 |
| exper | 65.6725131 | 79.504231 |
| I(exper^2) | -0.7004939 | -1.126509 |
| age | -30.5116345 | -32.876918 |
| kidslt6 | -442.0899082 | -540.256832 |
| kidsge6 | -32.7792266 | -9.800526 |


## 17-3. The Poisson Regression Model

Another kind of nonnegative dependent variable is a count variable, which can take on nonnegative integer values: $\{0,1,2,\ldots\}$. We are especially interested in cases where y takes on relatively few values, including zero. Examples include the number of children ever born to a woman, the number of times someone is arrested in a year, or the number of patents applied for by a firm in a year. For the same reasons discussed for binary and Tobit responses, a linear model for $E(y|x_1,\ldots,x_k)$ might not provide the best fit over all values of the explanatory variables. (Nevertheless, it is always informative to start with a linear model.)

As with a Tobit outcome, we cannot take the logarithm of a count variable because it takes on the value zero. A profitable approach is to model the expected value as an exponential function:

\begin{equation}
E(y|x_1,x_2,\ldots,x_k)=exp(\beta_0+\beta_1 x_1+\ldots+\beta_k x_k) \tag{17.31}
\end{equation}

Although (17.31) is more complicated than a linear model, we basically already know how to interpret the coefficients. Taking the log of equation (17.31) shows that

\begin{equation}
\log[E(y|x_1,x_2,\ldots,x_k)]=\beta_0+\beta_1 x_1+\ldots+\beta_k x_k \tag{17.32}
\end{equation}

so that the log of the expected value is linear. Therefore, $100\beta_j$ is roughly the percentage change in $E(y|x)$, given a one-unit increase in $x_j$.

If, say, $x_j=\log(z_j)$ for some variable $z_j > 0$, then its coefficient, $\beta_j$, is interpreted as an elasticity with respect to $z_j$. The bottom line is that, for practical purposes, we can interpret the coefficients in equation (17.31) as if we have a linear model, with $\log(y)$ as the dependent variable.

A count variable cannot have a normal distribution (because the normal distribution is for continuous variables that can take on all values), and if it takes on very few values, the distribution can be very different from normal. Instead, the nominal distribution for count data is the Poisson distribution.

Because we are interested in the effect of explanatory variables on y, we must look at the Poisson distribution conditional on x. The Poisson distribution is entirely determined by its mean, so we only need to specify $E(y|x)$. We assume this has the same form as (17.31), which we write in shorthand as $\exp(x\beta)$. Then, the probability that y equals the value h, conditional on x, is

\begin{equation}
P(y=h|x)=\exp[-\exp(x \beta)][\exp(x \beta)]^h/h!, \quad h=0,1,\ldots
\end{equation}

where $h!$ denotes factorial (see Appendix B). This distribution, which is the basis for the Poisson regression model, allows us to find conditional probabilities for any values of the explanatory variables. For example, $P(y=0|x)=\exp[-\exp(x \beta)]$. Once we have estimates of the $\beta_j$, we can plug them into the probabilities for various values of x.
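The probability formula can be checked numerically. The sketch below uses hypothetical coefficients and covariate values (not estimates from any data set) and compares the formula written out by hand with R's built-in `dpois()`.

```r
# Hypothetical coefficients and covariate values, for illustration only
beta <- c(intercept = -0.6, x1 = 0.3, x2 = -0.2)
x    <- c(1, 2, 1)                  # leading 1 multiplies the intercept

mu <- exp(sum(x * beta))            # conditional mean exp(x beta)

h <- 0:5
# Probabilities written out from the formula above ...
p_formula <- exp(-mu) * mu^h / factorial(h)
# ... and from R's built-in Poisson probability mass function
p_dpois   <- dpois(h, lambda = mu)

print(round(data.frame(h, p_formula, p_dpois), 4))
```

The first row reproduces the special case $P(y=0|x)=\exp[-\exp(x\beta)]$.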

As with the probit, logit, and Tobit models, we cannot directly compare the magnitudes of the Poisson estimates of an exponential function with the OLS estimates of a linear function.

### Wooldridge Example 17.3 Poisson Regression for the Number of Arrests

We now apply the Poisson regression model to the arrest data in CRIME1, used, among other places, in Example 9.1. The dependent variable, narr86, is the number of times a man is arrested during 1986. This variable is zero for 1,970 of the 2,725 men in the sample, and only eight values of narr86 are greater than five. Thus, a Poisson regression model is more appropriate than a linear regression model. Table 17.5 also presents the results of OLS estimation of a linear regression model.

In [6]:
install.packages('stargazer')



In [9]:
library(foreign) ; library(stargazer) # package for regression output
crime1 <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/crime1.dta?raw=true")

# Estimate linear model
lm.res      <-  lm(narr86~pcnv+avgsen+tottime+ptime86+qemp86+inc86+
                    black+hispan+born60, data=crime1)
# Estimate Poisson model
Poisson.res <- glm(narr86~pcnv+avgsen+tottime+ptime86+qemp86+inc86+
                    black+hispan+born60, data=crime1, family=poisson)

stargazer(lm.res,Poisson.res,type="text",keep.stat="n")


                 Dependent variable:     
             ----------------------------
                        narr86           
                  OLS          Poisson   
                  (1)            (2)     
-----------------------------------------
pcnv           -0.132***      -0.402***  
                (0.040)        (0.085)   
                                         
avgsen           -0.011        -0.024    
                (0.012)        (0.020)   
                                         
tottime          0.012         0.024*    
                (0.009)        (0.015)   
                                         
ptime86        -0.041***      -0.099***  
                (0.009)        (0.021)   
                                         
qemp86         -0.051***       -0.038    
                (0.014)        (0.029)   
                                         
inc86          -0.001***      -0.008***  
                (0.0003)       (0.001)   
                                 

The OLS and Poisson coefficients are not directly comparable, and they have very different meanings. For example, the OLS coefficient on pcnv implies that, if $\Delta pcnv=.10$, the expected number of arrests falls by .013 (pcnv is the proportion of prior arrests that led to conviction). The Poisson coefficient implies that $\Delta pcnv=.10$ reduces expected arrests by about 4% [.402(.10)=.0402, and we multiply this by 100 to get the percentage effect]. As a policy matter, this suggests we can reduce overall arrests by about 4% if we can increase the probability of conviction by .1.

The Poisson coefficient on black implies that, other factors being equal, the expected number of arrests for a black man is estimated to be about $100\cdot[\exp(.661)-1]\approx 93.7\%$ higher than for a white man with the same values for the other explanatory variables.
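The arithmetic behind these two interpretations can be reproduced directly. This sketch uses only the coefficient values quoted in the text (.402 in absolute value for pcnv and .661 for black) and contrasts the approximate percentage change, $100\beta_j\Delta x_j$, with the exact change implied by the exponential mean.

```r
# Coefficients quoted in the text above
b_pcnv  <- -0.402
b_black <-  0.661

# Approximate percentage change in expected arrests for a 0.10 increase in pcnv
approx_pct <- 100 * b_pcnv * 0.10
# Exact percentage change implied by the exponential mean
exact_pct  <- 100 * (exp(b_pcnv * 0.10) - 1)

# For a dummy variable with a large coefficient, the exact formula matters more
approx_black <- 100 * b_black
exact_black  <- 100 * (exp(b_black) - 1)

print(round(c(approx_pct = approx_pct, exact_pct = exact_pct,
              approx_black = approx_black, exact_black = exact_black), 1))
```

For the small change in pcnv the approximation and the exact figure nearly coincide, whereas for the dummy variable the approximation of about 66% understates the exact figure of about 93.7%.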

## 17-4. Censored and Truncated Regression Models

The models in Sections 17-1, 17-2, and 17-3 apply to various kinds of limited dependent variables that arise frequently in applied econometric work. In using these methods, it is important to remember that we use a probit or logit model for a binary response, a Tobit model for a corner solution outcome, or a Poisson regression model for a count response because we want models that account for important features of the distribution of y. There is no issue of data observability. For example, in the Tobit application to women's labor supply in Example 17.2, there is no problem with observing hours worked: it is simply the case that a nontrivial fraction of married women in the population choose not to work for a wage. In the Poisson regression application to annual arrests, we observe the dependent variable for every young man in a random sample from the population, but the dependent variable can be zero as well as other small integer values.

Unfortunately, the distinction between lumpiness in an outcome variable (such as taking on the value zero for a nontrivial fraction of the population) and problems of data censoring can be confusing. This is particularly true when applying the Tobit model. In this book, the standard Tobit model described in Section 17-2 is only for corner solution outcomes. But the literature on Tobit models usually treats another situation within the same framework: the response variable has been censored above or below some threshold. Typically, the censoring is due to survey design and, in some cases, institutional constraints. Rather than treat data censoring problems along with corner solution outcomes, we solve data censoring by applying a censored regression model. Essentially, the problem solved by a censored regression model is one of missing data on the response variable, y. Although we are able to randomly draw units from the population and obtain information on the explanatory variables for all units, the outcome on $y_i$ is missing for some i. Still, we know whether the missing values are above or below a given threshold, and this knowledge provides useful information for estimating the parameters.

A truncated regression model arises when we exclude, on the basis of y , a subset of the population in our sampling scheme. In other words, we do not have a random sample from the underlying population, but we know the rule that was used to include units in the sample. This rule is determined by whether y is above or below a certain threshold. We explain more fully the difference between censored and truncated regression models later.

### 17.4a Censored Regression Models

While censored regression models can be defined without distributional assumptions, in this subsection we study the censored normal regression model. The variable we would like to explain, y, follows the classical linear model. But rather than observing $y_i$, we observe it only if it is less than a censoring value, $c_i$.

One example of right data censoring is top coding. When a variable is top coded, we know its value only up to a certain threshold. For responses greater than the threshold, we only know that the variable is at least as large as the threshold. For example, in some surveys family wealth is top coded. Suppose that respondents are asked their wealth, but people are allowed to respond with "more than \$500,000." Then, we observe actual wealth for those respondents whose wealth is less than \$500,000 but not for those whose wealth is greater than \$500,000. In this case, the censoring threshold, $c_i$, is the same for all i. In many situations, the censoring threshold changes with individual or family characteristics.

If we observed a random sample for (x,y), we would simply estimate $\beta$ by OLS, and statistical inference would be standard. (We again absorb the intercept into x for simplicity.) The censoring causes problems. Using arguments similar to the Tobit model, an OLS regression using only the uncensored observations (that is, those with $y_i<c_i$) produces inconsistent estimators of the $\beta_j$. An OLS regression of $w_i$ on $x_i$, where $w_i=\min(y_i,c_i)$ is the recorded, possibly censored, outcome, using all observations, does not consistently estimate the $\beta_j$, unless there is no censoring. This is similar to the Tobit case, but the problem is much different. In the Tobit model, we are modeling economic behavior, which often yields zero outcomes; the Tobit model is supposed to reflect this. With censored regression, we have a data collection problem because, for some reason, the data are censored.

It is important to know that we can interpret the $\beta_j$ just as in a linear regression model under random sampling. This is much different than Tobit applications to corner solution responses, where the expectations of interest are nonlinear functions of the $\beta_j$.
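The consequences of right censoring can be illustrated with a small simulation. This is a sketch under made-up parameter values, using `survreg()` from the survival package with a Gaussian distribution as the censored normal regression; the variable names are illustrative.

```r
library(survival)
set.seed(2024)

# Simulated data; all parameter values are made up for the illustration
n   <- 5000
x   <- rnorm(n)
y   <- 1 + 2 * x + rnorm(n)         # outcome we would like to observe
c_i <- 2                            # common right-censoring threshold
w   <- pmin(y, c_i)                 # what is actually recorded
uncensored <- y < c_i

# OLS on the uncensored subsample and on all recorded values
ols_uncens <- unname(coef(lm(w ~ x, subset = uncensored))["x"])
ols_all    <- unname(coef(lm(w ~ x))["x"])

# Censored normal regression by maximum likelihood
cens_fit <- survreg(Surv(w, uncensored, type = "right") ~ x, dist = "gaussian")
mle      <- unname(coef(cens_fit)["x"])

print(round(c(ols_uncens = ols_uncens, ols_all = ols_all, censored_mle = mle), 3))
```

In this design both OLS slopes fall well below the true value of 2, while the censored regression estimate is close to it and can be read like an ordinary linear regression coefficient.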

An important application of censored regression models is duration analysis. A duration is a variable that measures the time before a certain event occurs. For example, we might wish to explain the number of days before a felon released from prison is arrested. For some felons, this may never happen, or it may happen after such a long time that we must censor the duration in order to analyze the data.

### Wooldridge Example 17.4 Duration of Recidivism

The file RECID contains data on the time in months until an inmate in a North Carolina prison is arrested after being released from prison; call this durat. Some inmates participated in a work program while in prison. We also control for a variety of demographic variables, as well as for measures of prison and criminal history.

Of 1,445 inmates, 893 had not been arrested during the period they were followed; therefore, these observations are censored. The censoring times differed among inmates, ranging from 70 to 81 months.

In this example, it is crucial to account for the censoring, especially because almost 62% of the durations are censored.

The following code computes a censored normal regression for log(durat). Each of the coefficients, when multiplied by 100, gives the estimated percentage change in expected duration, given a ceteris paribus increase of one unit in the corresponding explanatory variable.

In [11]:
library(foreign);library(survival)
recid <- read.dta("https://github.com/thousandoaks/Wooldridge/blob/master/recid.dta?raw=true")

# Define Dummy for UNcensored observations
recid$uncensored <- recid$cens==0
# Estimate censored regression model:
res<-survreg(Surv(log(durat),uncensored, type="right") ~ workprg+priors+
                     tserved+felon+alcohol+drugs+black+married+educ+age, 
                     data=recid, dist="gaussian")
# Output:
summary(res)


Call:
survreg(formula = Surv(log(durat), uncensored, type = "right") ~ 
    workprg + priors + tserved + felon + alcohol + drugs + black + 
        married + educ + age, data = recid, dist = "gaussian")
               Value Std. Error      z        p
(Intercept)  4.09939   0.347535 11.796 4.11e-32
workprg     -0.06257   0.120037 -0.521 6.02e-01
priors      -0.13725   0.021459 -6.396 1.59e-10
tserved     -0.01933   0.002978 -6.491 8.51e-11
felon        0.44399   0.145087  3.060 2.21e-03
alcohol     -0.63491   0.144217 -4.402 1.07e-05
drugs       -0.29816   0.132736 -2.246 2.47e-02
black       -0.54272   0.117443 -4.621 3.82e-06
married      0.34068   0.139843  2.436 1.48e-02
educ         0.02292   0.025397  0.902 3.67e-01
age          0.00391   0.000606  6.450 1.12e-10
Log(scale)   0.59359   0.034412 17.249 1.13e-66

Scale= 1.81 

Gaussian distribution
Loglik(model)= -1597.1   Loglik(intercept only)= -1680.4
	Chisq= 166.74 on 10 degrees of freedom, p= 0 
Number of Newton-Raphson Iterat

Several of the coefficients are interesting. The variables priors (number of prior convictions) and tserved (total months spent in prison) have negative effects on the time until the next arrest occurs. This suggests that these variables measure proclivity for criminal activity rather than representing a deterrent effect. For example, an inmate with one more prior conviction has a duration until next arrest that is almost 14% less. A year of time served reduces duration by about $100 \cdot 12(.019)=22.8\%$. A somewhat surprising finding is that a man serving time for a felony has an estimated expected duration that is almost 56% [$\exp(.444)-1 \approx .56$] longer than a man serving time for a nonfelony.

Those with a history of drug or alcohol abuse have substantially shorter expected durations until the next arrest. (The variables alcohol and drugs are binary variables.) Older men, and men who were married at the time of incarceration, are expected to have significantly longer durations until their next arrest. Black men have substantially shorter durations, on the order of 42% [$\exp(-.543)-1 \approx -.42$].
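The percentage effects quoted in the two paragraphs above can be reproduced directly from the reported coefficients. This is a minimal sketch: the coefficients are copied by hand from the printed output rather than extracted from `res`, and the names `b`, `delta`, and `effects` are introduced here only for illustration.

```r
# Coefficients copied from the survreg output (dependent variable: log(durat))
b <- c(priors = -0.13725, tserved = -0.01933, felon = 0.44399, black = -0.54272)

# Size of the change considered for each variable:
# one unit, except tserved, where 12 months = one year
delta <- c(priors = 1, tserved = 12, felon = 1, black = 1)

effects <- data.frame(
  approx_pct = 100 * b * delta,              # log approximation: 100 * beta * delta
  exact_pct  = 100 * (exp(b * delta) - 1),   # exact change: 100 * [exp(beta * delta) - 1]
  row.names  = names(b)
)
round(effects, 1)
```

The two columns are close for small coefficients and diverge for larger ones such as felon and black, which is why the text switches to the exponential formula for those variables.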

The key policy variable, workprg, does not have the desired effect. The point estimate is that, other things being equal, men who participated in the work program have estimated recidivism durations that are about 6.3% shorter than men who did not participate. The coefficient has a small z statistic (about -0.52), so we would probably conclude that the work program has no effect. This could be due to a self-selection problem, or it could be a product of the way men were assigned to the program. Of course, it may simply be that the program was ineffective.
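To see how little the data say about workprg, one can compute the test statistic and a confidence interval from the reported estimate and standard error. The sketch below assumes the usual asymptotic normal approximation for the maximum likelihood estimator; the variable names are hypothetical and the numbers are copied from the printed output.

```r
# Estimate and standard error for workprg, copied from the survreg output
b_workprg  <- -0.06257
se_workprg <- 0.120037

z       <- b_workprg / se_workprg       # asymptotic z statistic
p_value <- 2 * pnorm(-abs(z))           # two-sided p-value

# 95% confidence interval for the coefficient (log points)
ci <- b_workprg + c(-1, 1) * qnorm(0.975) * se_workprg
# Same interval expressed as an exact percentage change in expected duration
ci_pct <- 100 * (exp(ci) - 1)

round(c(z = z, p_value = p_value), 3)
round(rbind(coefficient = ci, percent = ci_pct), 3)
```

The interval runs from roughly 26% shorter to roughly 19% longer durations, so the estimate is too imprecise to rule out either a harmful or a beneficial effect of moderate size.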

### 17.4b Truncated Regression Models

The truncated regression model differs in an important respect from the censored regression model. In the case of data censoring, we do randomly sample units from the population. The censoring problem is that, while we always observe the explanatory variables for each randomly drawn unit, we observe the outcome on y only when it is not censored above or below a given threshold. With data truncation, we restrict attention to a subset of the population prior to sampling; so there is a part of the population for which we observe no information. In particular, we have no information on explanatory variables. The truncated sampling scenario typically arises when a survey targets a particular subset of the population and, perhaps due to cost considerations, entirely ignores the other part of the population. Subsequently, researchers might want to use the truncated sample to answer questions about the entire population, but one must recognize that the sampling scheme did not generate a random sample from the whole population.

As an example, Hausman and Wise (1977) used data from a negative income tax experiment to study various determinants of earnings. To be included in the study, a family had to have income less than 1.5 times the 1967 poverty line, where the poverty line depended on family size. Hausman and Wise wanted to use the data to estimate an earnings equation for the entire population.

The truncated normal regression model begins with an underlying population model that satisfies the classical linear model assumptions:

\begin{equation}
y=\beta_0+x\beta+u, \quad u|x \sim \mathrm{Normal}(0,\sigma^2) \tag{17.40}
\end{equation}

Recall that this is a strong set of assumptions, because u must not only be independent of x, but also normally distributed.

Under (17.40) we know that, given a random sample from the population, OLS is the most efficient estimation procedure. The problem arises because we do not observe a random sample from the population: Assumption MLR.2 is violated. 
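A small simulation illustrates what goes wrong. The sketch below is an illustration under stated assumptions, not the procedure used by Hausman and Wise: the population model, the fixed truncation point `cutoff`, the seed, and all object names are hypothetical. Units are kept only if y falls below the cutoff, so both y and x are lost for everyone else. The likelihood used for the last estimator is the normal density of y divided by the probability that y falls below the cutoff, which is the standard form for a normal variable truncated from above.

```r
set.seed(1740)                          # arbitrary seed, for reproducibility only
n <- 20000
pop <- data.frame(x = rnorm(n))
pop$y <- 1 + 0.5 * pop$x + rnorm(n)     # population model; the true slope is 0.5

cutoff <- 1.5                           # hypothetical truncation point
trunc_sample <- pop[pop$y < cutoff, ]   # units above the cutoff are never sampled

ols_pop   <- lm(y ~ x, data = pop)           # infeasible: needs the full population sample
ols_trunc <- lm(y ~ x, data = trunc_sample)  # OLS on the truncated sample

# Negative log-likelihood for a normal regression truncated from above at `cutoff`
negll <- function(par, y, x, cutoff) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])                  # optimize over log(sigma) to keep sigma > 0
  -sum(dnorm(y, mu, sigma, log = TRUE) - pnorm(cutoff, mu, sigma, log.p = TRUE))
}
start <- c(coef(ols_trunc), log(summary(ols_trunc)$sigma))
mle <- optim(start, negll, y = trunc_sample$y, x = trunc_sample$x,
             cutoff = cutoff, method = "BFGS")

# Slope estimates from the three approaches
round(c(ols_population = unname(coef(ols_pop)["x"]),
        ols_truncated  = unname(coef(ols_trunc)["x"]),
        truncated_mle  = unname(mle$par[2])), 3)
```

OLS on the truncated sample should understate the slope noticeably, while the estimator that accounts for the sampling scheme should land near 0.5, at the cost of relying on the normality assumption in (17.40).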