### Classification

The linear regression model discussed in Chapter 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative. For example, eye color is qualitative. Often qualitative variables are referred to as categorical; we will use these terms interchangeably. In this chapter, we study approaches for predicting qualitative responses, a process that is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.

There are many possible classification techniques, or classifiers, that one might use to predict a qualitative response. We touched on some of these in Sections 2.1.5 and 2.2.3. In this chapter we discuss some widely-used classifiers: logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes, and K-nearest neighbors. The discussion of logistic regression is used as a jumping-off point for a discussion of generalized linear models, and in particular, Poisson regression.

#### An Overview of Classification

 Classification problems occur often, perhaps even more so than regression problems. Some examples include:
  1. A person arrives at the emergency room with a set of symptoms
 that could possibly be attributed to one of three medical conditions.
 Which of the three conditions does the individual have?
 2. An online banking service must be able to determine whether or not
 a transaction being performed on the site is fraudulent, on the basis
 of the user’s IP address, past transaction history, and so forth.
 3. On the basis of DNA sequence data for a number of patients with
 and without a given disease, a biologist would like to figure out which
 DNA mutations are deleterious (disease-causing) and which are not.

 Just as in the regression setting, in the classification setting we have a
 set of training observations $(x_1, y_1), \ldots, (x_n, y_n)$ that we can use to build
 a classifier. We want our classifier to perform well not only on the training
 data, but also on test observations that were not used to train the classifier.
 
 In this chapter, we will illustrate the concept of classification using the
 simulated Default data set. We are interested in predicting whether an
 individual will default on his or her credit card payment, on the basis of
 annual income and monthly credit card balance. The data set is displayed
 in Figure 4.1. In the left-hand panel of Figure 4.1, we have plotted annual
 income and monthly credit card balance for a subset of 10,000 individuals.
 The individuals who defaulted in a given month are shown in orange, and
 those who did not in blue. (The overall default rate is about 3%, so we
 have plotted only a fraction of the individuals who did not default.) It
 appears that individuals who defaulted tended to have higher credit card
 balances than those who did not. In the center and right-hand panels of
 Figure 4.1, two pairs of boxplots are shown. The first shows the distribution
 of balance split by the binary default variable; the second is a similar plot
 for income. In this chapter, we learn how to build a model to predict default
 ($Y$) for any given value of balance ($X_1$) and income ($X_2$). Since $Y$ is not
 quantitative, the simple linear regression model of Chapter 3 is not a good
 choice: we will elaborate on this further in Section 4.2.
 
It is worth noting that Figure 4.1 displays a very pronounced relationship between the predictor balance and the response default. In most real applications, the relationship between the predictor and the response will not be nearly so strong. However, for the sake of illustrating the classification procedures discussed in this chapter, we use an example in which the relationship between the predictor and the response is somewhat exaggerated.

#### Why Not Linear Regression?

Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, $Y$, as follows:

$$
Y = \begin{cases}
1 & \text{if stroke;} \\
2 & \text{if drug overdose;} \\
3 & \text{if epileptic seizure.}
\end{cases}
$$

Using this coding, least squares could be used to fit a linear regression model to predict $Y$ on the basis of a set of predictors $X_1, \ldots, X_p$. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

$$
Y = \begin{cases}
1 & \text{if epileptic seizure;} \\
2 & \text{if stroke;} \\
3 & \text{if drug overdose,}
\end{cases}
$$

which would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.

If the response variable’s values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.
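
The dependence of the predictions on an arbitrary coding can be seen in a small sketch. Everything here is made up for illustration: the single predictor (a hypothetical symptom score), the nine training patients, and the rule of decoding each fitted value to the nearest code. Only the two codings come from the text.

```python
def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def predict_labels(x_train, labels, coding, x_new):
    """Regress the coded labels on x, then decode each fitted value to the
    diagnosis whose numeric code is nearest (an assumed decoding rule)."""
    b0, b1 = fit_line(x_train, [coding[label] for label in labels])
    decode = {code: label for label, code in coding.items()}
    predictions = []
    for xi in x_new:
        y_hat = b0 + b1 * xi
        nearest_code = min(decode, key=lambda code: abs(y_hat - code))
        predictions.append(decode[nearest_code])
    return predictions


# Hypothetical training data: one symptom score per patient.
x_train = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = ["stroke"] * 3 + ["drug overdose"] * 3 + ["epileptic seizure"] * 3

# The two codings discussed in the text.
coding_a = {"stroke": 1, "drug overdose": 2, "epileptic seizure": 3}
coding_b = {"epileptic seizure": 1, "stroke": 2, "drug overdose": 3}

x_new = [2, 5, 8]
preds_a = predict_labels(x_train, labels, coding_a, x_new)
preds_b = predict_labels(x_train, labels, coding_b, x_new)
print("coding A:", preds_a)
print("coding B:", preds_b)
```

On this toy data the first coding recovers all three diagnoses, while the second predicts stroke for every test patient, even though the training data are identical.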
For a binary (two level) qualitative response, the situation is better. For instance, perhaps there are only two possibilities for the patient’s medical condition: stroke and drug overdose. We could then potentially use the dummy variable approach from Section 3.3.1 to code the response as follows:

$$
Y = \begin{cases}
0 & \text{if stroke;} \\
1 & \text{if drug overdose.}
\end{cases}
$$

We could then fit a linear regression to this binary response, and predict drug overdose if $\hat{Y} > 0.5$ and stroke otherwise. In the binary case it is not hard to show that even if we flip the above coding, linear regression will produce the same final predictions.
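
The invariance to flipping a 0/1 coding can be checked numerically. In the sketch below the eight training points and the four test points are invented. The check rests on the fact that, with an intercept in the model, the fitted values under the flipped coding are one minus the original fitted values.

```python
def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: a single predictor and a binary diagnosis.
x_train = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 0, 1]          # 0 = stroke, 1 = drug overdose
y_flipped = [1 - yi for yi in y]       # 1 = stroke, 0 = drug overdose

b0, b1 = fit_line(x_train, y)
c0, c1 = fit_line(x_train, y_flipped)

x_new = [2, 4, 5, 7]
fits = [b0 + b1 * xi for xi in x_new]
fits_flipped = [c0 + c1 * xi for xi in x_new]

# Under each coding, predict the class coded as 1 when the fit exceeds 0.5.
labels = ["drug overdose" if f > 0.5 else "stroke" for f in fits]
labels_flipped = ["stroke" if f > 0.5 else "drug overdose" for f in fits_flipped]
print(labels)
print(labels_flipped)
```

The two label lists agree. The only case in which they could differ is a fitted value of exactly 0.5, where the strict inequality breaks the tie differently under the two codings.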
For a binary response with a 0/1 coding as above, regression by least squares is not completely unreasonable: it can be shown that the $X\hat{\beta}$ obtained using linear regression is in fact an estimate of $\Pr(\text{drug overdose} \mid X)$ in this special case. However, if we use linear regression, some of our estimates might be outside the $[0, 1]$ interval (see Figure 4.2), making them hard to interpret as probabilities! Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates. Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.
To summarize, there are at least two reasons not to perform classification using a regression method: (a) a regression method cannot accommodate a qualitative response with more than two classes; (b) a regression method will not provide meaningful estimates of $\Pr(Y \mid X)$, even with just two classes. Thus, it is preferable to use a classification method that is truly suited for qualitative response values. In the next section, we present logistic regression, which is well-suited for the case of a binary qualitative response; in later sections we will cover classification methods that are appropriate when the qualitative response has two or more classes.


#### Logistic Regression

Consider again the Default data set, where the response default falls into one of two categories, Yes or No. Rather than modeling this response $Y$ directly, logistic regression models the probability that $Y$ belongs to a particular category.

For the Default data, logistic regression models the probability of default. For example, the probability of default given balance can be written as

$$
\Pr(\text{default} = \text{Yes} \mid \text{balance}).
$$

The values of $\Pr(\text{default} = \text{Yes} \mid \text{balance})$, which we abbreviate $p(\text{balance})$, will range between 0 and 1. Then for any given value of balance, a prediction can be made for default. For example, one might predict default = Yes for any individual for whom $p(\text{balance}) > 0.5$. Alternatively, if a company wishes to be conservative in predicting individuals who are at risk for default, then they may choose to use a lower threshold, such as $p(\text{balance}) > 0.1$.
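
The effect of the threshold can be sketched in a few lines. The five probabilities below are invented placeholders for values of p(balance), and `classify` is a hypothetical helper, not part of any library.

```python
def classify(probabilities, threshold=0.5):
    """Predict default = Yes whenever the estimated probability exceeds
    the threshold, and No otherwise."""
    return ["Yes" if p > threshold else "No" for p in probabilities]


# Hypothetical estimated default probabilities for five individuals.
p_hat = [0.02, 0.08, 0.15, 0.40, 0.65]

print(classify(p_hat, threshold=0.5))  # flags only the last individual
print(classify(p_hat, threshold=0.1))  # flags the last three individuals
```

Lowering the threshold flags more individuals as likely defaulters, which is the conservative behavior described above.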
 


##### The Logistic Model

How should we model the relationship between $p(X) = \Pr(Y = 1 \mid X)$ and $X$? (For convenience we are using the generic 0/1 coding for the response.) In Section 4.2 we considered using a linear regression model to represent these probabilities:

$$
p(X) = \beta_0 + \beta_1 X. \tag{4.1}
$$

If we use this approach to predict default = Yes using balance, then we obtain the model shown in the left-hand panel of Figure 4.2. Here we see the problem with this approach: for balances close to zero we predict a negative probability of default; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of default, regardless of credit card balance, must fall between 0 and 1. This problem is not unique to the credit default data. Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we can always predict $p(X) < 0$ for some values of $X$ and $p(X) > 1$ for others (unless the range of $X$ is limited).
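
A tiny numerical sketch shows the out-of-range predictions of model (4.1). The six balance and default pairs are invented and are not the Default data.

```python
def fit_line(x, y):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical balances and 0/1 default indicators.
balance = [0, 500, 1000, 1500, 2000, 2500]
default = [0, 0, 0, 0, 1, 1]

b0, b1 = fit_line(balance, default)


def p_linear(x):
    """Straight-line 'probability' from model (4.1)."""
    return b0 + b1 * x


print(p_linear(0))     # negative: not a valid probability
print(p_linear(3000))  # greater than 1: not a valid probability
```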
To avoid this problem, we must model $p(X)$ using a function that gives outputs between 0 and 1 for all values of $X$. Many functions meet this description. In logistic regression, we use the logistic function,

$$
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}. \tag{4.2}
$$

To fit the model (4.2), we use a method called maximum likelihood, which we discuss in the next section. The right-hand panel of Figure 4.2 illustrates the fit of the logistic regression model to the Default data. Notice that for low balances we now predict the probability of default as close to, but never below, zero. Likewise, for high balances we predict a default probability close to, but never above, one. The logistic function will always produce an S-shaped curve of this form, and so regardless of the value of $X$, we will obtain a sensible prediction. We also see that the logistic model is better able to capture the range of probabilities than is the linear regression model in the left-hand plot. The average fitted probability in both cases is 0.0333 (averaged over the training data), which is the same as the overall proportion of defaulters in the data set.
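
The bounded, S-shaped behavior of (4.2) can be tabulated directly. The coefficient values below are hypothetical round numbers chosen only to make the curve easy to read; the two-branch form of the function is a standard way to avoid overflow in the exponential and is an implementation choice, not something the text prescribes.

```python
import math


def logistic(x, beta0, beta1):
    """Logistic function (4.2), written in a numerically stable form."""
    eta = beta0 + beta1 * x
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# Hypothetical coefficients (not estimates from the Default data).
beta0, beta1 = -10.0, 0.005

grid = list(range(0, 4001, 500))
probs = [logistic(x, beta0, beta1) for x in grid]
for x, p in zip(grid, probs):
    print(f"X = {x:4d}   p(X) = {p:.5f}")
```

Every value lies strictly between 0 and 1, and the values increase with X because the assumed slope is positive.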
After a bit of manipulation of (4.2), we find that

$$
\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}. \tag{4.3}
$$

The quantity $p(X)/[1 - p(X)]$ is called the odds, and can take on any value between 0 and $\infty$. Values of the odds close to 0 and $\infty$ indicate very low and very high probabilities of default, respectively. For example, on average 1 in 5 people with an odds of 1/4 will default, since $p(X) = 0.2$ implies an odds of $\frac{0.2}{1 - 0.2} = 1/4$. Likewise, on average nine out of every ten people with an odds of 9 will default, since $p(X) = 0.9$ implies an odds of $\frac{0.9}{1 - 0.9} = 9$. Odds are traditionally used instead of probabilities in horse-racing, since they relate more naturally to the correct betting strategy.
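
The conversion between probabilities and odds is simple enough to check by hand, and the sketch below reproduces the two worked cases from the text. The function names are illustrative.

```python
def odds(p):
    """Odds corresponding to a probability p with 0 <= p < 1."""
    return p / (1.0 - p)


def prob_from_odds(o):
    """Probability corresponding to odds o >= 0 (the inverse of odds)."""
    return o / (1.0 + o)


print(odds(0.2))            # about 1/4
print(odds(0.9))            # about 9
print(prob_from_odds(9.0))  # about 0.9
```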
By taking the logarithm of both sides of (4.3), we arrive at

$$
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X. \tag{4.4}
$$

The left-hand side is called the log odds or logit. We see that the logistic regression model (4.2) has a logit that is linear in $X$.
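
For readers who want the intermediate steps, the manipulation leading from (4.2) to (4.3) and (4.4) can be written out as follows. The shorthand $\eta = \beta_0 + \beta_1 X$ is introduced here only for brevity and does not appear elsewhere in the chapter.

```latex
\begin{aligned}
p(X) &= \frac{e^{\eta}}{1 + e^{\eta}},
  \qquad \eta = \beta_0 + \beta_1 X
  && \text{(4.2)} \\
1 - p(X) &= \frac{1 + e^{\eta} - e^{\eta}}{1 + e^{\eta}}
          = \frac{1}{1 + e^{\eta}} \\
\frac{p(X)}{1 - p(X)}
  &= \frac{e^{\eta}}{1 + e^{\eta}} \cdot \left(1 + e^{\eta}\right)
   = e^{\beta_0 + \beta_1 X}
  && \text{(4.3)} \\
\log\left(\frac{p(X)}{1 - p(X)}\right) &= \beta_0 + \beta_1 X
  && \text{(4.4)}
\end{aligned}
```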
Recall from Chapter 3 that in a linear regression model, $\beta_1$ gives the average change in $Y$ associated with a one-unit increase in $X$. By contrast, in a logistic regression model, increasing $X$ by one unit changes the log odds by $\beta_1$ (4.4). Equivalently, it multiplies the odds by $e^{\beta_1}$ (4.3). However, because the relationship between $p(X)$ and $X$ in (4.2) is not a straight line, $\beta_1$ does not correspond to the change in $p(X)$ associated with a one-unit increase in $X$. The amount that $p(X)$ changes due to a one-unit change in $X$ depends on the current value of $X$. But regardless of the value of $X$, if $\beta_1$ is positive then increasing $X$ will be associated with increasing $p(X)$, and if $\beta_1$ is negative then increasing $X$ will be associated with decreasing $p(X)$. The fact that there is not a straight-line relationship between $p(X)$ and $X$, and the fact that the rate of change in $p(X)$ per unit change in $X$ depends on the current value of $X$, can also be seen by inspection of the right-hand panel of Figure 4.2.
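
Both statements can be checked numerically: the odds are multiplied by the same factor at every value of X, while the change in p(X) is not constant. The coefficient values below are used purely for illustration (they match the rounded estimates reported for the Default data in Table 4.1), and the two evaluation points are arbitrary.

```python
import math


def p(x, beta0, beta1):
    """Logistic model (4.2)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def odds(prob):
    return prob / (1.0 - prob)


# Illustrative coefficients.
beta0, beta1 = -10.6513, 0.0055

odds_ratio = {}
delta_p = {}
for x in (1000, 2000):
    p_now, p_next = p(x, beta0, beta1), p(x + 1, beta0, beta1)
    odds_ratio[x] = odds(p_next) / odds(p_now)  # multiplicative change in odds
    delta_p[x] = p_next - p_now                 # additive change in probability
    print(x, odds_ratio[x], delta_p[x])

print(math.exp(beta1))  # the common odds ratio, e^{beta_1}
```

The odds ratio is the same at both balances, whereas the one-unit change in probability is far larger near X = 2000, where the curve is steep, than near X = 1000.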

##### Estimating the Regression Coefficients

The coefficients $\beta_0$ and $\beta_1$ in (4.2) are unknown, and must be estimated based on the available training data. In Chapter 3, we used the least squares approach to estimate the unknown linear regression coefficients. Although we could use (non-linear) least squares to fit the model (4.4), the more general method of maximum likelihood is preferred, since it has better statistical properties. The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ of default for each individual, using (4.2), corresponds as closely as possible to the individual’s observed default status. In other words, we try to find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that plugging these estimates into the model for $p(X)$, given in (4.2), yields a number close to one for all individuals who defaulted, and a number close to zero for all individuals who did not. This intuition can be formalized using a mathematical equation called a likelihood function:

$$
\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr). \tag{4.5}
$$

The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize this likelihood function.
 Maximum likelihood is a very general approach that is used to fit many
 of the non-linear models that we examine throughout this book. In the
 linear regression setting, the least squares approach is in fact a special case
 of maximum likelihood. The mathematical details of maximum likelihood
 are beyond the scope of this book. However, in general, logistic regression
 and other models can be easily fit using statistical software such as R, and
 so we do not need to concern ourselves with the details of the maximum
 likelihood fitting procedure.
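
Although the fitting details are left to software, a minimal sketch may help make the likelihood (4.5) concrete. The code below evaluates (4.5) on ten invented observations and maximizes its logarithm with Newton's method. Newton's method is one common choice and is an assumption here: the text does not say which algorithm a given package uses.

```python
import math


def p_of(x, b0, b1):
    """Logistic model (4.2) for a single predictor."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def likelihood(b0, b1, xs, ys):
    """Likelihood (4.5): product of p(x_i) over observations with y_i = 1
    and of 1 - p(x_i) over observations with y_i = 0."""
    value = 1.0
    for x, y in zip(xs, ys):
        p = p_of(x, b0, b1)
        value *= p if y == 1 else 1.0 - p
    return value


def fit_logistic(xs, ys, max_iter=50, tol=1e-10):
    """Maximize the log of (4.5) by Newton's method, starting from (0, 0)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        ps = [p_of(x, b0, b1) for x in xs]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Entries of the negated Hessian (a symmetric 2x2 matrix).
        h00 = sum(p * (1 - p) for p in ps)
        h01 = sum(p * (1 - p) * x for x, p in zip(xs, ps))
        h11 = sum(p * (1 - p) * x * x for x, p in zip(xs, ps))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system by Cramer's rule.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Hypothetical, non-separable training data.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0_hat, b1_hat = fit_logistic(x, y)
fitted = [p_of(xi, b0_hat, b1_hat) for xi in x]
print(b0_hat, b1_hat)
print(sum(fitted) / len(fitted), sum(y) / len(y))
```

At the maximum, the average fitted probability equals the proportion of ones in the training data, which is a property of maximum likelihood fits that include an intercept. The data were chosen so that the classes overlap; with perfectly separated classes the likelihood has no finite maximizer and this simple iteration would not converge.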
Table 4.1 shows the coefficient estimates and related information that result from fitting a logistic regression model on the Default data in order to predict the probability of default = Yes using balance. We see that $\hat{\beta}_1 = 0.0055$; this indicates that an increase in balance is associated with an increase in the probability of default. To be precise, a one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.
Many aspects of the logistic regression output shown in Table 4.1 are similar to the linear regression output of Chapter 3. For example, we can measure the accuracy of the coefficient estimates by computing their standard errors. The z-statistic in Table 4.1 plays the same role as the t-statistic in the linear regression output, for example in Table 3.1 on page 77. For instance, the z-statistic associated with $\beta_1$ is equal to $\hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$, and so a large (absolute) value of the z-statistic indicates evidence against the null hypothesis $H_0: \beta_1 = 0$. This null hypothesis implies that $p(X) = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$: in other words, that the probability of default does not depend on balance. Since the p-value associated with balance in Table 4.1 is tiny, we can reject $H_0$. In other words, we conclude that there is indeed an association between balance and probability of default. The estimated intercept in Table 4.1 is typically not of interest; its main purpose is to adjust the average fitted probabilities to the proportion of ones in the data (in this case, the overall default rate).
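
As a rough check on how the z-statistic and p-value columns relate to the first two columns, the sketch below recomputes them for the intercept row of Table 4.1 (estimate -10.6513, standard error 0.3612). It assumes a two-sided p-value based on the standard normal distribution, which is the usual convention but is not stated in the text. The balance row cannot be reproduced as closely from the rounded entries 0.0055 and 0.0002, since rounding the standard error to one significant digit loses too much precision.

```python
import math


def z_statistic(estimate, std_error):
    """Ratio of a coefficient estimate to its standard error."""
    return estimate / std_error


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, via the complementary
    error function (assumed normal reference distribution)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z_intercept = z_statistic(-10.6513, 0.3612)
p_intercept = two_sided_p_value(z_intercept)
print(round(z_intercept, 1), p_intercept)
```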

##### Making Predictions

| | Coefficient | Std. error | z-statistic | p-value |
|---|---|---|---|---|
| Intercept | -10.6513 | 0.3612 | -29.5 | <0.0001 |
| balance | 0.0055 | 0.0002 | 24.9 | <0.0001 |

TABLE 4.1. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using balance. A one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.

| | Coefficient | Std. error | z-statistic | p-value |
|---|---|---|---|---|
| Intercept | -3.5041 | 0.0707 | -49.55 | <0.0001 |
| student[Yes] | 0.4049 | 0.1150 | 3.52 | 0.0004 |

TABLE 4.2. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using student status. Student status is encoded as a dummy variable, with a value of 1 for a student and a value of 0 for a non-student, and represented by the variable student[Yes] in the table.

Once the coefficients have been estimated, we can compute the probability of default for any given credit card balance. For example, using the coefficient estimates given in Table 4.1, we predict that the default probability for an individual with a balance of \$1,000 is

$$
\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}} = \frac{e^{-10.6513 + 0.0055 \times 1{,}000}}{1 + e^{-10.6513 + 0.0055 \times 1{,}000}} = 0.00576,
$$

which is below 1%. In contrast, the predicted probability of default for an individual with a balance of \$2,000 is much higher, and equals 0.586 or 58.6%.
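
The two predicted probabilities can be reproduced by plugging the Table 4.1 estimates into (4.2). The function name below is illustrative.

```python
import math

# Coefficient estimates from Table 4.1 (the intercept is negative).
BETA0_HAT = -10.6513
BETA1_HAT = 0.0055


def predict_default_probability(balance):
    """Estimated probability of default from (4.2) for a given balance."""
    eta = BETA0_HAT + BETA1_HAT * balance
    return math.exp(eta) / (1.0 + math.exp(eta))


p_1000 = predict_default_probability(1000)
p_2000 = predict_default_probability(2000)
print(f"{p_1000:.5f}")  # 0.00576
print(f"{p_2000:.3f}")  # 0.586
```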
One can use qualitative predictors with the logistic regression model using the dummy variable approach from Section 3.3.1. As an example, the Default data set contains the qualitative variable student. To fit a model that uses student status as a predictor variable, we simply create a dummy variable that takes on a value of 1 for students and 0 for non-students. The logistic regression model that results from predicting probability of default from student status can be seen in Table 4.2. The coefficient associated with the dummy variable is positive, and the associated p-value is statistically significant. This indicates that students tend to have higher default probabilities than non-students:

$$
\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{Yes}) = \frac{e^{-3.5041 + 0.4049 \times 1}}{1 + e^{-3.5041 + 0.4049 \times 1}} = 0.0431,
$$

$$
\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{No}) = \frac{e^{-3.5041 + 0.4049 \times 0}}{1 + e^{-3.5041 + 0.4049 \times 0}} = 0.0292.
$$
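
The dummy-variable encoding and the two resulting probabilities can be sketched as follows, using the Table 4.2 estimates. The `encode_student` helper is hypothetical and stands in for whatever encoding step a statistics package performs internally.

```python
import math

# Coefficient estimates from Table 4.2.
INTERCEPT = -3.5041
COEF_STUDENT_YES = 0.4049


def encode_student(status):
    """Dummy variable: 1 for a student ('Yes'), 0 for a non-student ('No')."""
    return 1 if status == "Yes" else 0


def predict_default_probability(status):
    eta = INTERCEPT + COEF_STUDENT_YES * encode_student(status)
    return math.exp(eta) / (1.0 + math.exp(eta))


p_student = predict_default_probability("Yes")
p_non_student = predict_default_probability("No")
print(f"{p_student:.4f}")      # 0.0431
print(f"{p_non_student:.4f}")  # 0.0292
```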


##### Multiple Logistic Regression

We now consider the problem of predicting a binary response using multiple predictors. By analogy with the extension from simple to multiple linear regression in Chapter 3, we can generalize (4.4) as follows:

$$
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \tag{4.6}
$$

where $X = (X_1, \ldots, X_p)$ are $p$ predictors. Equation 4.6 can be rewritten as

$$
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}. \tag{4.7}
$$
Just as in Section 4.3.2, we use the maximum likelihood method to estimate $\beta_0, \beta_1, \ldots, \beta_p$.

| | Coefficient | Std. error | z-statistic | p-value |
|---|---|---|---|---|
| Intercept | -10.8690 | 0.4923 | -22.08 | <0.0001 |
| balance | 0.0057 | 0.0002 | 24.74 | <0.0001 |
| income | 0.0030 | 0.0082 | 0.37 | 0.7115 |
| student[Yes] | -0.6468 | 0.2362 | -2.74 | 0.0062 |

TABLE 4.3. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using balance, income, and student status. Student status is encoded as a dummy variable student[Yes], with a value of 1 for a student and a value of 0 for a non-student. In fitting this model, income was measured in thousands of dollars.

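
To see how (4.7) is used with several predictors, the sketch below plugs the Table 4.3 estimates into the formula. The individual considered (a balance of 1,500 dollars and an income of 40 thousand dollars) is a hypothetical example chosen for illustration.

```python
import math

# Coefficient estimates from Table 4.3; income is in thousands of dollars.
COEF = {
    "intercept": -10.8690,
    "balance": 0.0057,
    "income": 0.0030,
    "student": -0.6468,
}


def predict_default_probability(balance, income_thousands, is_student):
    """Estimated probability of default from (4.7)."""
    eta = (COEF["intercept"]
           + COEF["balance"] * balance
           + COEF["income"] * income_thousands
           + COEF["student"] * (1 if is_student else 0))
    return math.exp(eta) / (1.0 + math.exp(eta))


# Hypothetical individual, evaluated as a student and as a non-student.
p_student = predict_default_probability(1500, 40, is_student=True)
p_non_student = predict_default_probability(1500, 40, is_student=False)
print(p_student, p_non_student)
```

With balance and income held fixed, the negative student coefficient gives the student the lower estimated probability of default.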
Table 4.3 shows the coefficient estimates for a logistic regression model that uses balance, income (in thousands of dollars), and student status to predict probability of default. There is a surprising result here. The p-values associated with balance and the dummy variable for student status are very small, indicating that each of these variables is associated with the probability of default. However, the coefficient for the dummy variable is negative, indicating that students are less likely to default than non-students. In contrast, the coefficient for the dummy variable is positive in Table 4.2. How is it possible for student status to be associated with an increase in probability of default in Table 4.2 and a decrease in probability of default in Table 4.3? The left-hand panel of Figure 4.3 provides a graphical illustration of this apparent paradox. The orange and blue solid lines show the average default rates for students and non-students, respectively, as a function of credit card balance. The negative coefficient for student in the multiple logistic regression indicates that for a fixed value of balance and income, a student is less likely to default than a non-student. Indeed, we observe from the left-hand panel of Figure 4.3 that the student default rate is at or below that of the non-student default rate for every value of balance. But the horizontal broken lines near the base of the plot, which show the default rates for students and non-students averaged over all values of balance and income, suggest the opposite effect: the overall student default rate is higher than the non-student default rate. Consequently, there is a positive coefficient for student in the single variable logistic regression output shown in Table 4.2.

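
The apparent paradox can be reproduced with a small simulation. The sketch below makes several assumptions that are not in the text: it drops income from the model (its estimated coefficient is small and not significant), it reuses the rounded balance and student coefficients from Table 4.3, and it draws balances from made-up normal distributions in which students carry more debt on average. It illustrates the mechanism and does not reconstruct the Default data.

```python
import math
import random


def p_default(balance, is_student,
              b0=-10.869, b_balance=0.0057, b_student=-0.6468):
    """Default probability with a negative student effect at fixed balance."""
    eta = b0 + b_balance * balance + b_student * is_student
    return 1.0 / (1.0 + math.exp(-eta))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Assumed balance distributions (hypothetical means and spread).
student_balances = [max(0.0, rng.gauss(1000, 450)) for _ in range(n)]
other_balances = [max(0.0, rng.gauss(750, 450)) for _ in range(n)]

# Overall default rates, averaged over each group's own balances.
overall_student = sum(p_default(b, 1) for b in student_balances) / n
overall_other = sum(p_default(b, 0) for b in other_balances) / n

print("overall:", overall_student, overall_other)
for b in (500, 1000, 1500, 2000):
    print("balance", b, p_default(b, 1), p_default(b, 0))
```

At every fixed balance the student probability is the lower of the two, yet the student group has the higher overall rate because its balances are shifted upward.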
The right-hand panel of Figure 4.3 provides an explanation for this discrepancy. The variables student and balance are correlated. Students tend to hold higher levels of debt, which is in turn associated with higher probability of default. In other words, students are more likely to have large credit card balances, which, as we know from the left-hand panel of Figure 4.3, tend to be associated with high default rates. Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students. This is an important distinction for a credit card company that is trying to determine to whom they should offer credit. A student is riskier than a non-student if no information about the student’s credit card