### Classification

The linear regression model discussed in Chapter 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative. For example, eye color is qualitative. Often qualitative variables are referred to as categorical; we will use these terms interchangeably. In this chapter, we study approaches for predicting qualitative responses, a process that is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.

There are many possible classification techniques, or classifiers, that one might use to predict a qualitative response. We touched on some of these in Sections 2.1.5 and 2.2.3. In this chapter we discuss some widely-used classifiers: logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes, and K-nearest neighbors. The discussion of logistic regression is used as a jumping-off point for a discussion of generalized linear models, and in particular, Poisson regression.

#### An Overview of Classification

 Classification problems occur often, perhaps even more so than regression problems. Some examples include:
  1. A person arrives at the emergency room with a set of symptoms
 that could possibly be attributed to one of three medical conditions.
 Which of the three conditions does the individual have?
 2. An online banking service must be able to determine whether or not
 a transaction being performed on the site is fraudulent, on the basis
 of the user’s IP address, past transaction history, and so forth.
 3. On the basis of DNA sequence data for a number of patients with
 and without a given disease, a biologist would like to figure out which
 DNA mutations are deleterious (disease-causing) and which are not.

 Just as in the regression setting, in the classification setting we have a
 set of training observations $(x_1, y_1), \ldots, (x_n, y_n)$ that we can use to build
 a classifier. We want our classifier to perform well not only on the training
 data, but also on test observations that were not used to train the classifier.
 
 In this chapter, we will illustrate the concept of classification using the
 simulated Default data set. We are interested in predicting whether an
 individual will default on his or her credit card payment, on the basis of
 annual income and monthly credit card balance. The data set is displayed
 in Figure 4.1. In the left-hand panel of Figure 4.1, we have plotted annual
 income and monthly credit card balance for a subset of 10,000 individuals.
 The individuals who defaulted in a given month are shown in orange, and
 those who did not in blue. (The overall default rate is about 3%, so we
 have plotted only a fraction of the individuals who did not default.) It
 appears that individuals who defaulted tended to have higher credit card
 balances than those who did not. In the center and right-hand panels of
 Figure 4.1, two pairs of boxplots are shown. The first shows the distribution
 of balance split by the binary default variable; the second is a similar plot
 for income. In this chapter, we learn how to build a model to predict default
 ($Y$) for any given value of balance ($X_1$) and income ($X_2$). Since $Y$ is not
 quantitative, the simple linear regression model of Chapter 3 is not a good
 choice: we will elaborate on this further in Section 4.2.
 
It is worth noting that Figure 4.1 displays a very pronounced relationship between the predictor balance and the response default. In most real applications, the relationship between the predictor and the response will not be nearly so strong. However, for the sake of illustrating the classification procedures discussed in this chapter, we use an example in which the relationship between the predictor and the response is somewhat exaggerated.

#### Why Not Linear Regression?

Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, $Y$, as follows:

$$
Y = \begin{cases}
1 & \text{if stroke;} \\
2 & \text{if drug overdose;} \\
3 & \text{if epileptic seizure.}
\end{cases}
$$

Using this coding, least squares could be used to fit a linear regression model to predict $Y$ on the basis of a set of predictors $X_1, \ldots, X_p$. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

$$
Y = \begin{cases}
1 & \text{if epileptic seizure;} \\
2 & \text{if stroke;} \\
3 & \text{if drug overdose,}
\end{cases}
$$

which would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.

If the response variable’s values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.
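To make the dependence on the coding concrete, the following minimal Python sketch fits least squares under both codings to a small made-up data set with a single hypothetical predictor. The data values, the shortened class labels, and the rule of assigning the class whose code is nearest to the fitted value are all assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical training data: one predictor and three diagnoses.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = ["stroke", "stroke", "overdose", "overdose", "seizure", "seizure"]

coding_a = {"stroke": 1, "overdose": 2, "seizure": 3}
coding_b = {"seizure": 1, "stroke": 2, "overdose": 3}

def classify(coding, x_new):
    """Fit least squares to the coded response, then pick the nearest code."""
    y = np.array([coding[label] for label in labels], dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = beta[0] + beta[1] * x_new
    names = list(coding)
    codes = np.array([coding[name] for name in names], dtype=float)
    # Assumed decision rule: the class whose numeric code is closest to y_hat.
    return names[int(np.argmin(np.abs(codes - y_hat)))]

x_new = 5.5
pred_a = classify(coding_a, x_new)
pred_b = classify(coding_b, x_new)
print(pred_a, pred_b)
```

With these made-up numbers the two codings assign the same new observation to different classes, although the training data are identical.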
For a binary (two level) qualitative response, the situation is better. For instance, perhaps there are only two possibilities for the patient’s medical condition: stroke and drug overdose. We could then potentially use the dummy variable approach from Section 3.3.1 to code the response as follows:

$$
Y = \begin{cases}
0 & \text{if stroke;} \\
1 & \text{if drug overdose.}
\end{cases}
$$

We could then fit a linear regression to this binary response, and predict drug overdose if $\hat{Y} > 0.5$ and stroke otherwise. In the binary case it is not hard to show that even if we flip the above coding, linear regression will produce the same final predictions.
For a binary response with a 0/1 coding as above, regression by least squares is not completely unreasonable: it can be shown that the $X\hat{\beta}$ obtained using linear regression is in fact an estimate of $\Pr(\text{drug overdose} \mid X)$ in this special case. However, if we use linear regression, some of our estimates might be outside the [0,1] interval (see Figure 4.2), making them hard to interpret as probabilities! Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates. Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.
To summarize, there are at least two reasons not to perform classification using a regression method: (a) a regression method cannot accommodate a qualitative response with more than two classes; (b) a regression method will not provide meaningful estimates of $\Pr(Y \mid X)$, even with just two classes. Thus, it is preferable to use a classification method that is truly suited for qualitative response values. In the next section, we present logistic regression, which is well-suited for the case of a binary qualitative response; in later sections we will cover classification methods that are appropriate when the qualitative response has two or more classes.


#### Logistic Regression

Consider again the Default data set, where the response default falls into one of two categories, Yes or No. Rather than modeling this response $Y$ directly, logistic regression models the probability that $Y$ belongs to a particular category.

For the Default data, logistic regression models the probability of default. For example, the probability of default given balance can be written as

$$
\Pr(\text{default} = \text{Yes} \mid \text{balance}).
$$

The values of $\Pr(\text{default} = \text{Yes} \mid \text{balance})$, which we abbreviate $p(\text{balance})$, will range between 0 and 1. Then for any given value of balance, a prediction can be made for default. For example, one might predict default = Yes for any individual for whom $p(\text{balance}) > 0.5$. Alternatively, if a company wishes to be conservative in predicting individuals who are at risk for default, then they may choose to use a lower threshold, such as $p(\text{balance}) > 0.1$.
 


##### The Logistic Model

How should we model the relationship between $p(X) = \Pr(Y = 1 \mid X)$ and $X$? (For convenience we are using the generic 0/1 coding for the response.) In Section 4.2 we considered using a linear regression model to represent these probabilities:

$$
p(X) = \beta_0 + \beta_1 X. \tag{4.1}
$$

If we use this approach to predict default = Yes using balance, then we obtain the model shown in the left-hand panel of Figure 4.2. Here we see the problem with this approach: for balances close to zero we predict a negative probability of default; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of default, regardless of credit card balance, must fall between 0 and 1. This problem is not unique to the credit default data. Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we can always predict $p(X) < 0$ for some values of $X$ and $p(X) > 1$ for others (unless the range of $X$ is limited).
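The out-of-range behavior of a straight-line fit can be reproduced on a tiny example. The sketch below uses a made-up 0/1 response (not the Default data) and fits the linear model (4.1) by least squares with NumPy.

```python
import numpy as np

# Hypothetical toy data: a single predictor and a response coded as 0 or 1.
x = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Least squares fit of the straight line p(X) = beta0 + beta1 * X, as in (4.1).
# np.polyfit returns the highest-degree coefficient first.
beta1, beta0 = np.polyfit(x, y, deg=1)
fitted = beta0 + beta1 * x

# The smallest fitted value falls below 0 and the largest exceeds 1.
print(fitted.min(), fitted.max())
```

Even within the observed range of the predictor, the fitted "probabilities" at the two ends fall outside the [0,1] interval.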
To avoid this problem, we must model $p(X)$ using a function that gives outputs between 0 and 1 for all values of $X$. Many functions meet this description. In logistic regression, we use the logistic function,

$$
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}. \tag{4.2}
$$

To fit the model (4.2), we use a method called maximum likelihood, which we discuss in the next section. The right-hand panel of Figure 4.2 illustrates the fit of the logistic regression model to the Default data. Notice that for low balances we now predict the probability of default as close to, but never below, zero. Likewise, for high balances we predict a default probability close to, but never above, one. The logistic function will always produce an S-shaped curve of this form, and so regardless of the value of $X$, we will obtain a sensible prediction. We also see that the logistic model is better able to capture the range of probabilities than is the linear regression model in the left-hand plot. The average fitted probability in both cases is 0.0333 (averaged over the training data), which is the same as the overall proportion of defaulters in the data set.
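As a quick numerical check of these properties, the sketch below evaluates the logistic function (4.2) on a grid of balances. The coefficient values are illustrative (they match the estimates reported in Table 4.1), and the grid of balances is an arbitrary choice.

```python
import numpy as np

def logistic(x, beta0, beta1):
    """Logistic function (4.2), written in the equivalent form 1 / (1 + e^(-z))."""
    z = beta0 + beta1 * np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -10.6513, 0.0055       # illustrative coefficient values
balances = np.arange(0, 3001, 500)    # arbitrary grid: 0, 500, ..., 3000
probs = logistic(balances, beta0, beta1)

for balance, p in zip(balances, probs):
    print(f"balance={balance:5d}  p={p:.5f}")
```

Every value lies strictly between 0 and 1, and because the slope coefficient is positive the values increase with balance, tracing out the S-shaped curve.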
After a bit of manipulation of (4.2), we find that

$$
\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}. \tag{4.3}
$$

The quantity $p(X)/[1 - p(X)]$ is called the odds, and can take on any value between 0 and $\infty$. Values of the odds close to 0 and $\infty$ indicate very low and very high probabilities of default, respectively. For example, on average 1 in 5 people with an odds of 1/4 will default, since $p(X) = 0.2$ implies an odds of $\frac{0.2}{1 - 0.2} = 1/4$. Likewise, on average nine out of every ten people with an odds of 9 will default, since $p(X) = 0.9$ implies an odds of $\frac{0.9}{1 - 0.9} = 9$. Odds are traditionally used instead of probabilities in horse-racing, since they relate more naturally to the correct betting strategy.
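The two conversions in this paragraph can be checked exactly with rational arithmetic. The helper names `odds` and `prob_from_odds` below are hypothetical, introduced only for this sketch.

```python
from fractions import Fraction

def odds(p):
    """Odds corresponding to a probability p: p / (1 - p)."""
    return p / (1 - p)

def prob_from_odds(o):
    """Inverse map: the probability corresponding to odds o."""
    return o / (1 + o)

odds_low = odds(Fraction(1, 5))     # p = 0.2
odds_high = odds(Fraction(9, 10))   # p = 0.9
print(odds_low, odds_high)          # 1/4 9
```

Applying `prob_from_odds` to either result recovers the original probability, which is why odds and probabilities carry the same information.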
By taking the logarithm of both sides of (4.3), we arrive at

$$
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X. \tag{4.4}
$$

The left-hand side is called the log odds or logit. We see that the logistic regression model (4.2) has a logit that is linear in $X$.
Recall from Chapter 3 that in a linear regression model, $\beta_1$ gives the average change in $Y$ associated with a one-unit increase in $X$. By contrast, in a logistic regression model, increasing $X$ by one unit changes the log odds by $\beta_1$ (4.4). Equivalently, it multiplies the odds by $e^{\beta_1}$ (4.3). However, because the relationship between $p(X)$ and $X$ in (4.2) is not a straight line, $\beta_1$ does not correspond to the change in $p(X)$ associated with a one-unit increase in $X$. The amount that $p(X)$ changes due to a one-unit change in $X$ depends on the current value of $X$. But regardless of the value of $X$, if $\beta_1$ is positive then increasing $X$ will be associated with increasing $p(X)$, and if $\beta_1$ is negative then increasing $X$ will be associated with decreasing $p(X)$. The fact that there is not a straight-line relationship between $p(X)$ and $X$, and the fact that the rate of change in $p(X)$ per unit change in $X$ depends on the current value of $X$, can also be seen by inspection of the right-hand panel of Figure 4.2.
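The distinction between a constant multiplicative effect on the odds and a varying effect on the probability can be seen numerically. The sketch below uses illustrative coefficient values and two arbitrary starting points, $X = 500$ and $X = 2{,}000$; the helper names are hypothetical.

```python
import math

beta0, beta1 = -10.6513, 0.0055   # illustrative coefficient values

def prob(x):
    """p(X) from the logistic model (4.2)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(x):
    """Odds p(X) / (1 - p(X)), as in (4.3)."""
    p = prob(x)
    return p / (1.0 - p)

def one_unit_effect(x):
    """Return (odds ratio, change in probability) for moving from x to x + 1."""
    return odds(x + 1) / odds(x), prob(x + 1) - prob(x)

ratio_500, dp_500 = one_unit_effect(500)
ratio_2000, dp_2000 = one_unit_effect(2000)
print(ratio_500, ratio_2000, math.exp(beta1))  # the three values agree
print(dp_500, dp_2000)                         # the changes in p(X) differ
```

The odds ratio equals $e^{\beta_1}$ at both starting points, whereas the change in $p(X)$ is far larger near the middle of the S-shaped curve than in its flat lower tail.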

##### Estimating the Regression Coefficients

The coefficients $\beta_0$ and $\beta_1$ in (4.2) are unknown, and must be estimated based on the available training data. In Chapter 3, we used the least squares approach to estimate the unknown linear regression coefficients. Although we could use (non-linear) least squares to fit the model (4.4), the more general method of maximum likelihood is preferred, since it has better statistical properties. The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ of default for each individual, using (4.2), corresponds as closely as possible to the individual’s observed default status. In other words, we try to find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that plugging these estimates into the model for $p(X)$, given in (4.2), yields a number close to one for all individuals who defaulted, and a number close to zero for all individuals who did not. This intuition can be formalized using a mathematical equation called a likelihood function:

$$
\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr). \tag{4.5}
$$

The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize this likelihood function.
 Maximum likelihood is a very general approach that is used to fit many
 of the non-linear models that we examine throughout this book. In the
 linear regression setting, the least squares approach is in fact a special case
 of maximum likelihood. The mathematical details of maximum likelihood
 are beyond the scope of this book. However, in general, logistic regression
 and other models can be easily fit using statistical software such as R, and
 so we do not need to concern ourselves with the details of the maximum
 likelihood fitting procedure.
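For readers curious about what such software does internally, the following is a minimal sketch of one common way to maximize the logarithm of (4.5): Newton–Raphson iterations. The choice of algorithm, the simulated data, the "true" coefficients, and the fixed number of iterations are all assumptions of this sketch, not a description of any particular package.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Maximize the log of the likelihood (4.5) by Newton-Raphson (assumed algorithm)."""
    X = np.column_stack([np.ones_like(x), x])      # intercept column plus predictor
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # current fitted probabilities
        gradient = X.T @ (y - p)                   # gradient of the log-likelihood
        information = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(information, gradient)
    return beta

def log_likelihood(beta, x, y):
    """Log of (4.5): sum over i of y_i * z_i - log(1 + e^(z_i))."""
    z = beta[0] + beta[1] * x
    return np.sum(y * z - np.log1p(np.exp(z)))

# Simulated (hypothetical) data from a logistic model with known coefficients.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
true_beta = np.array([-1.0, 2.0])
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = (rng.uniform(size=500) < p_true).astype(float)

beta_hat = fit_logistic(x, y)
p_hat = 1.0 / (1.0 + np.exp(-(beta_hat[0] + beta_hat[1] * x)))
print(beta_hat, p_hat.mean(), y.mean())
```

At the maximum, the average fitted probability equals the observed proportion of ones, which is the property noted earlier for the Default fit.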
Table 4.1 shows the coefficient estimates and related information that result from fitting a logistic regression model on the Default data in order to predict the probability of default = Yes using balance. We see that $\hat{\beta}_1 = 0.0055$; this indicates that an increase in balance is associated with an increase in the probability of default. To be precise, a one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.
Many aspects of the logistic regression output shown in Table 4.1 are similar to the linear regression output of Chapter 3. For example, we can measure the accuracy of the coefficient estimates by computing their standard errors. The z-statistic in Table 4.1 plays the same role as the t-statistic in the linear regression output, for example in Table 3.1 on page 77. For instance, the z-statistic associated with $\beta_1$ is equal to $\hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$, and so a large (absolute) value of the z-statistic indicates evidence against the null hypothesis $H_0 : \beta_1 = 0$. This null hypothesis implies that $p(X) = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$: in other words, that the probability of default does not depend on balance. Since the p-value associated with balance in Table 4.1 is tiny, we can reject $H_0$. In other words, we conclude that there is indeed an association between balance and probability of default. The estimated intercept in Table 4.1 is typically not of interest; its main purpose is to adjust the average fitted probabilities to the proportion of ones in the data (in this case, the overall default rate).

##### Making Predictions

| | Coefficient | Std. error | z-statistic | p-value |
|---|---|---|---|---|
| Intercept | -10.6513 | 0.3612 | -29.5 | <0.0001 |
| balance | 0.0055 | 0.0002 | 24.9 | <0.0001 |

TABLE 4.1. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using balance. A one-unit increase in balance is associated with an increase in the log odds of default by 0.0055 units.

| | Coefficient | Std. error | z-statistic | p-value |
|---|---|---|---|---|
| Intercept | -3.5041 | 0.0707 | -49.55 | <0.0001 |
| student[Yes] | 0.4049 | 0.1150 | 3.52 | 0.0004 |

TABLE 4.2. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using student status. Student status is encoded as a dummy variable, with a value of 1 for a student and a value of 0 for a non-student, and represented by the variable student[Yes] in the table.

Once the coefficients have been estimated, we can compute the probability of default for any given credit card balance. For example, using the coefficient estimates given in Table 4.1, we predict that the default probability for an individual with a balance of \$1,000 is

$$
\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}} = \frac{e^{-10.6513 + 0.0055 \times 1{,}000}}{1 + e^{-10.6513 + 0.0055 \times 1{,}000}} = 0.00576,
$$

which is below 1%. In contrast, the predicted probability of default for an individual with a balance of \$2,000 is much higher, and equals 0.586 or 58.6%.
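The two predictions for balances of \$1,000 and \$2,000 can be reproduced in a few lines. The function name `predict_default` is hypothetical; the coefficient values are the rounded estimates from Table 4.1, so results agree with the text only up to rounding.

```python
import math

def predict_default(balance, beta0=-10.6513, beta1=0.0055):
    """Plug the estimates from Table 4.1 into the logistic function (4.2)."""
    z = beta0 + beta1 * balance
    return math.exp(z) / (1.0 + math.exp(z))

p_1000 = predict_default(1000)
p_2000 = predict_default(2000)
print(round(p_1000, 5), round(p_2000, 3))  # approximately 0.00576 and 0.586
```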
One can use qualitative predictors with the logistic regression model using the dummy variable approach from Section 3.3.1. As an example, the Default data set contains the qualitative variable student. To fit a model that uses student status as a predictor variable, we simply create a dummy variable that takes on a value of 1 for students and 0 for non-students. The logistic regression model that results from predicting probability of default from student status can be seen in Table 4.2. The coefficient associated with the dummy variable is positive, and the associated p-value is statistically significant. This indicates that students tend to have higher default probabilities than non-students:

$$
\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{Yes}) = \frac{e^{-3.5041 + 0.4049 \times 1}}{1 + e^{-3.5041 + 0.4049 \times 1}} = 0.0431,
$$

$$
\Pr(\text{default} = \text{Yes} \mid \text{student} = \text{No}) = \frac{e^{-3.5041 + 0.4049 \times 0}}{1 + e^{-3.5041 + 0.4049 \times 0}} = 0.0292.
$$


##### Multiple Logistic Regression

We now consider the problem of predicting a binary response using multiple predictors. By analogy with the extension from simple to multiple linear regression in Chapter 3, we can generalize (4.4) as follows:

$$
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \tag{4.6}
$$

where $X = (X_1, \ldots, X_p)$ are $p$ predictors. Equation 4.6 can be rewritten as

$$
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}. \tag{4.7}
$$

Just as in Section 4.3.2, we use the maximum likelihood method to estimate $\beta_0, \beta_1, \ldots, \beta_p$.

| | Coefficient | Std. error | z-statistic | p-value |
|---|---|---|---|---|
| Intercept | -10.8690 | 0.4923 | -22.08 | <0.0001 |
| balance | 0.0057 | 0.0002 | 24.74 | <0.0001 |
| income | 0.0030 | 0.0082 | 0.37 | 0.7115 |
| student[Yes] | -0.6468 | 0.2362 | -2.74 | 0.0062 |

TABLE 4.3. For the Default data, estimated coefficients of the logistic regression model that predicts the probability of default using balance, income, and student status. Student status is encoded as a dummy variable student[Yes], with a value of 1 for a student and a value of 0 for a non-student. In fitting this model, income was measured in thousands of dollars.

Table 4.3 shows the coefficient estimates for a logistic regression model that uses balance, income (in thousands of dollars), and student status to predict probability of default. There is a surprising result here. The p-values associated with balance and the dummy variable for student status are very small, indicating that each of these variables is associated with the probability of default. However, the coefficient for the dummy variable is negative, indicating that students are less likely to default than non-students. In contrast, the coefficient for the dummy variable is positive in Table 4.2. How is it possible for student status to be associated with an increase in probability of default in Table 4.2 and a decrease in probability of default in Table 4.3? The left-hand panel of Figure 4.3 provides a graphical illustration of this apparent paradox. The orange and blue solid lines show the average default rates for students and non-students, respectively, as a function of credit card balance. The negative coefficient for student in the multiple logistic regression indicates that for a fixed value of balance and income, a student is less likely to default than a non-student. Indeed, we observe from the left-hand panel of Figure 4.3 that the student default rate is at or below that of the non-student default rate for every value of balance. But the horizontal broken lines near the base of the plot, which show the default rates for students and non-students averaged over all values of balance and income, suggest the opposite effect: the overall student default rate is higher than the non-student default rate. Consequently, there is a positive coefficient for student in the single variable logistic regression output shown in Table 4.2.
The right-hand panel of Figure 4.3 provides an explanation for this discrepancy. The variables student and balance are correlated. Students tend to hold higher levels of debt, which is in turn associated with higher probability of default. In other words, students are more likely to have large credit card balances, which, as we know from the left-hand panel of Figure 4.3, tend to be associated with high default rates. Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that overall, students tend to default at a higher rate than non-students. This is an important distinction for a credit card company that is trying to determine to whom they should offer credit. A student is riskier than a non-student if no information about the student’s credit card balance is available. However, that student is less risky than a non-student with the same credit card balance!
This simple example illustrates the dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant. As in the linear regression setting, the results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors. In general, the phenomenon seen in Figure 4.3 is known as confounding.
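The reversal can be reproduced with simple counts. The numbers below are entirely made up for illustration (they are not taken from the Default data), and balance is reduced to two hypothetical levels, "low" and "high".

```python
# (group, balance level) -> (number of people, number of defaults).
# All counts are hypothetical and chosen only to illustrate confounding.
counts = {
    ("student", "low"): (200, 2),
    ("student", "high"): (800, 64),
    ("non-student", "low"): (800, 16),
    ("non-student", "high"): (200, 20),
}

def default_rate(group, level=None):
    """Default rate for a group, optionally restricted to one balance level."""
    cells = [value for (g, lvl), value in counts.items()
             if g == group and (level is None or lvl == level)]
    people = sum(n for n, _ in cells)
    defaults = sum(d for _, d in cells)
    return defaults / people

for level in ("low", "high"):
    print(level, default_rate("student", level), default_rate("non-student", level))
print("overall", default_rate("student"), default_rate("non-student"))
```

Within each balance level the hypothetical students default less often than non-students, yet their overall rate is higher because most of them sit in the high-balance group.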
By substituting estimates for the regression coefficients from Table 4.3 into (4.7), we can make predictions. For example, a student with a credit card balance of \$1,500 and an income of \$40,000 has an estimated probability of default of

$$
\hat{p}(X) = \frac{e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 1}}{1 + e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 1}} = 0.058. \tag{4.8}
$$

A non-student with the same balance and income has an estimated probability of default of

$$
\hat{p}(X) = \frac{e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 0}}{1 + e^{-10.869 + 0.00574 \times 1{,}500 + 0.003 \times 40 - 0.6468 \times 0}} = 0.105. \tag{4.9}
$$

(Here we multiply the income coefficient estimate from Table 4.3 by 40, rather than by 40,000, because in that table the model was fit with income measured in units of \$1,000.)
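The arithmetic in (4.8) and (4.9) can be checked directly. In this sketch the function and dictionary names are hypothetical; the coefficient values are the ones written out in (4.8) and (4.9), with income entered in thousands of dollars.

```python
import math

# Coefficient values as they appear in (4.8) and (4.9).
coef = {"intercept": -10.869, "balance": 0.00574, "income": 0.003, "student": -0.6468}

def predict_default(balance, income_thousands, is_student):
    """Evaluate (4.7) with the plugged-in coefficient estimates."""
    z = (coef["intercept"]
         + coef["balance"] * balance
         + coef["income"] * income_thousands
         + coef["student"] * (1 if is_student else 0))
    return 1.0 / (1.0 + math.exp(-z))

p_student = predict_default(1500, 40, is_student=True)
p_non_student = predict_default(1500, 40, is_student=False)
print(round(p_student, 3), round(p_non_student, 3))  # approximately 0.058 and 0.105
```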

##### Multinomial Logistic Regression

We sometimes wish to classify a response variable that has more than two classes. For example, in Section 4.2 we had three categories of medical condition in the emergency room: stroke, drug overdose, epileptic seizure. However, the logistic regression approach that we have seen in this section only allows for $K = 2$ classes for the response variable.
It turns out that it is possible to extend the two-class logistic regression approach to the setting of $K > 2$ classes. This extension is sometimes known as multinomial logistic regression. To do this, we first select a single class to serve as the baseline; without loss of generality, we select the $K$th class for this role. Then we replace the model (4.7) with the model

$$
\Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}} \tag{4.10}
$$

for $k = 1, \ldots, K-1$, and

$$
\Pr(Y = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}}. \tag{4.11}
$$

It is not hard to show that for $k = 1, \ldots, K-1$,

$$
\log\left(\frac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)}\right) = \beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p. \tag{4.12}
$$

Notice that (4.12) is quite similar to (4.6). Equation 4.12 indicates that once again, the log odds between any pair of classes is linear in the features.
It turns out that in (4.10)–(4.12), the decision to treat the $K$th class as the baseline is unimportant. For example, when classifying emergency room visits into stroke, drug overdose, and epileptic seizure, suppose that we fit two multinomial logistic regression models: one treating stroke as the baseline, another treating drug overdose as the baseline. The coefficient estimates will differ between the two fitted models due to the differing choice of baseline, but the fitted values (predictions), the log odds between any pair of classes, and the other key model outputs will remain the same.

Nonetheless, interpretation of the coefficients in a multinomial logistic regression model must be done with care, since it is tied to the choice of baseline. For example, if we set epileptic seizure to be the baseline, then we can interpret $\beta_{\text{stroke}0}$ as the log odds of stroke versus epileptic seizure, given that $x_1 = \cdots = x_p = 0$. Furthermore, a one-unit increase in $X_j$ is associated with a $\beta_{\text{stroke}j}$ increase in the log odds of stroke over epileptic seizure. Stated another way, if $X_j$ increases by one unit, then

$$
\frac{\Pr(Y = \text{stroke} \mid X = x)}{\Pr(Y = \text{epileptic seizure} \mid X = x)}
$$

is multiplied by $e^{\beta_{\text{stroke}j}}$.
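A small numerical sketch may help fix ideas about the baseline coding in (4.10)–(4.12). The coefficient matrix `B`, the feature vector `x`, and the function name are hypothetical; with $K = 3$ the first two rows can be read as stroke and drug overdose, and epileptic seizure plays the role of the baseline.

```python
import numpy as np

def multinomial_probs(B, x):
    """Class probabilities under (4.10)-(4.11).

    B has shape (K-1, p+1); row k holds (beta_k0, beta_k1, ..., beta_kp).
    Returns the K probabilities (baseline class last) and the linear predictors.
    """
    eta = B[:, 0] + B[:, 1:] @ x            # linear predictors for classes 1..K-1
    denom = 1.0 + np.exp(eta).sum()
    probs = np.append(np.exp(eta) / denom, 1.0 / denom)
    return probs, eta

# Hypothetical coefficients: K = 3 classes, p = 2 predictors.
B = np.array([[0.4, 1.2, -0.3],     # stroke vs. baseline
              [-0.6, 0.5, 0.9]])    # drug overdose vs. baseline
x = np.array([0.8, -1.1])

probs, eta = multinomial_probs(B, x)
log_odds_vs_baseline = np.log(probs[:-1] / probs[-1])
print(probs, probs.sum())
print(log_odds_vs_baseline, eta)    # these two vectors agree, as (4.12) states
```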
We now briefly present an alternative coding for multinomial logistic regression, known as the softmax coding. The softmax coding is equivalent to the coding just described in the sense that the fitted values, log odds between any pair of classes, and other key model outputs will remain the same, regardless of coding. But the softmax coding is used extensively in some areas of the machine learning literature (and will appear again in Chapter 10), so it is worth being aware of it. In the softmax coding, rather than selecting a baseline class, we treat all $K$ classes symmetrically, and assume that for $k = 1, \ldots, K$,

$$
\Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1} x_1 + \cdots + \beta_{lp} x_p}}. \tag{4.13}
$$

Thus, rather than estimating coefficients for $K - 1$ classes, we actually estimate coefficients for all $K$ classes. It is not hard to see that as a result of (4.13), the log odds ratio between the $k$th and $k'$th classes equals

$$
\log\left(\frac{\Pr(Y = k \mid X = x)}{\Pr(Y = k' \mid X = x)}\right) = (\beta_{k0} - \beta_{k'0}) + (\beta_{k1} - \beta_{k'1}) x_1 + \cdots + (\beta_{kp} - \beta_{k'p}) x_p. \tag{4.14}
$$
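The equivalence between the softmax coding and a baseline coding can be illustrated numerically. In the sketch below, the coefficient matrix `B` and feature vector `x` are hypothetical; subtracting the last row of `B` from every row turns the last class into a baseline with all-zero coefficients, which leaves the probabilities unchanged.

```python
import numpy as np

def softmax_probs(B, x):
    """Class probabilities under the softmax coding (4.13).

    B has shape (K, p+1); row k holds (beta_k0, beta_k1, ..., beta_kp).
    """
    eta = B[:, 0] + B[:, 1:] @ x
    eta = eta - eta.max()      # numerical safeguard; cancels in the ratio
    weights = np.exp(eta)
    return weights / weights.sum()

# Hypothetical coefficients: K = 3 classes, p = 2 predictors.
B = np.array([[0.5, 1.0, -0.5],
              [-0.2, 0.3, 0.8],
              [0.1, -0.4, 0.2]])
x = np.array([1.5, -0.7])

probs = softmax_probs(B, x)

# Re-express with the last class as baseline: its coefficients become zero.
B_baseline = B - B[-1]
probs_baseline = softmax_probs(B_baseline, x)

# Log odds between the first two classes, computed two ways as in (4.14).
log_odds = np.log(probs[0] / probs[1])
from_coefficients = (B[0, 0] - B[1, 0]) + (B[0, 1:] - B[1, 1:]) @ x
print(probs, probs_baseline)
print(log_odds, from_coefficients)
```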

#### Generative Models for Classification

 Logistic regression involves directly modeling Pr(Y = k|X = x) using the
 logistic function, given by (4.7) for the case of two response classes. In
 statistical jargon, we model the conditional distribution of the response Y ,
 given the predictor(s) X. We now consider an alternative and less direct
 approach to estimating these probabilities. In this new approach, we model
 the distribution of the predictors X separately in each of the response
 classes (i.e. for each value of Y ). We then use Bayes’ theorem to flip these
 around into estimates for Pr(Y = k|X = x). When the distribution of X
 within each class is assumed to be normal, it turns out that the model is
 very similar in form to logistic regression.
 Why do we need another method, when we have logistic regression?
 There are several reasons:
 • When there is substantial separation between the two classes, the
 parameter estimates for the logistic regression model are surprisingly
 unstable. The methods that we consider in this section do not suffer
 from this problem.
 • If the distribution of the predictors X is approximately normal in
 each of the classes and the sample size is small, then the approaches
 in this section may be more accurate than logistic regression.
 • The methods in this section can be naturally extended to the case
 of more than two response classes. (In the case of more than two
 response classes, we can also use multinomial logistic regression from
 Section 4.3.5.)
Suppose that we wish to classify an observation into one of $K$ classes, where $K \geq 2$. In other words, the qualitative response variable $Y$ can take on $K$ possible distinct and unordered values. Let $\pi_k$ represent the overall or prior probability that a randomly chosen observation comes from the $k$th class. Let $f_k(X) \equiv \Pr(X \mid Y = k)$ denote the density function of $X$ for an observation that comes from the $k$th class. In other words, $f_k(x)$ is relatively large if there is a high probability that an observation in the $k$th class has $X \approx x$, and $f_k(x)$ is small if it is very unlikely that an observation in the $k$th class has $X \approx x$. Then Bayes’ theorem states that

$$
\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}. \tag{4.15}
$$

In accordance with our earlier notation, we will use the abbreviation $p_k(x) = \Pr(Y = k \mid X = x)$; this is the posterior probability that an observation $X = x$ belongs to the $k$th class. That is, it is the probability that the observation belongs to the $k$th class, given the predictor value for that observation.

Equation 4.15 suggests that instead of directly computing the posterior probability $p_k(x)$ as in Section 4.3.1, we can simply plug in estimates of $\pi_k$ and $f_k(x)$ into (4.15). In general, estimating $\pi_k$ is easy if we have a random sample from the population: we simply compute the fraction of the training observations that belong to the $k$th class. However, estimating the density function $f_k(x)$ is much more challenging. As we will see, to estimate $f_k(x)$, we will typically have to make some simplifying assumptions.

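The plug-in idea behind (4.15) can be sketched in a few lines. Here the class labels and the density values at the point of interest are hypothetical; in practice the densities would come from whatever estimate of $f_k(x)$ is adopted.

```python
import numpy as np

def posterior(priors, densities):
    """Bayes' theorem (4.15): pi_k f_k(x) normalized over the K classes."""
    weighted = np.asarray(priors, dtype=float) * np.asarray(densities, dtype=float)
    return weighted / weighted.sum()

# Hypothetical training labels for K = 2 classes (coded 0 and 1).
labels = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

# Estimate each prior as the fraction of training observations in that class.
priors = np.bincount(labels) / len(labels)

# Hypothetical values of f_0(x) and f_1(x) at the point x being classified.
densities = np.array([0.05, 0.6])

post = posterior(priors, densities)
print(priors, post)
```

With these made-up numbers the second class has the larger posterior probability despite its smaller prior, because its density at $x$ is much higher.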
We know from Chapter 2 that the Bayes classifier, which classifies an
observation x to the class for which pk(x) is largest, has the lowest possible
error rate out of all classifiers. (Of course, this is only true if all of the
terms in (4.15) are correctly specified.) Therefore, if we can find a way to
estimate fk(x), then we can plug it into (4.15) in order to approximate the
Bayes classifier.
In the following sections, we discuss three classifiers that use different estimates of $f_k(x)$ in (4.15) to approximate the Bayes classifier: linear discriminant analysis, quadratic discriminant analysis, and naive Bayes.

##### Linear Discriminant Analysis for p = 1

For now, assume that $p = 1$; that is, we have only one predictor. We would like to obtain an estimate for $f_k(x)$ that we can plug into (4.15) in order to estimate $p_k(x)$. We will then classify an observation to the class for which $p_k(x)$ is greatest. To estimate $f_k(x)$, we will first make some assumptions about its form.

In particular, we assume that $f_k(x)$ is normal or Gaussian. In the one-dimensional setting, the normal density takes the form

$$
f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right), \tag{4.16}
$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance parameters for the $k$th class. For now, let us further assume that $\sigma_1^2 = \cdots = \sigma_K^2$: that is, there is a shared variance term across all $K$ classes, which for simplicity we can denote by $\sigma^2$. Plugging (4.16) into (4.15), we find that

$$
p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}. \tag{4.17}
$$

(Note that in (4.17), k denotes the prior probability that an observation
belongs to the kth class, not to be confused with 
3.14159, the math
ematical constant.) The Bayes classifier2 involves assigning an observation

X =xto the class for which (4.17) is largest. Taking the log of (4.17) and
rearranging the terms, it is not hard to show3 that this is equivalent to
assigning the observation to the class for which
k(x)=x· µk
2 
µ2
k
2 2 
+log( k)
(4.18)
is largest. For instance, if $K = 2$ and $\pi_1 = \pi_2$, then the Bayes classifier
assigns an observation to class 1 if $2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2$, and to class
2 otherwise. The Bayes decision boundary is the point for which $\delta_1(x) = \delta_2(x)$; one can show that this amounts to

$$x = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} = \frac{\mu_1 + \mu_2}{2}. \tag{4.19}$$

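The step from (4.17) to (4.18) can be sketched as follows. This is a reconstruction of the standard argument rather than a derivation quoted from the text, and the symbol $D(x)$, standing for the denominator of (4.17), is introduced here only for convenience.

```latex
\begin{aligned}
\log p_k(x)
  &= \log \pi_k - \log\bigl(\sqrt{2\pi}\,\sigma\bigr)
     - \frac{(x - \mu_k)^2}{2\sigma^2} - \log D(x) \\
  &= \underbrace{x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}
     + \log \pi_k}_{\delta_k(x)}
     \; - \frac{x^2}{2\sigma^2} - \log\bigl(\sqrt{2\pi}\,\sigma\bigr) - \log D(x).
\end{aligned}
```

The three trailing terms do not depend on $k$, so the class that maximizes $p_k(x)$ is the class that maximizes $\delta_k(x)$. When $K = 2$ and $\pi_1 = \pi_2$, the difference $\delta_1(x) - \delta_2(x)$ equals $\frac{\mu_1 - \mu_2}{\sigma^2}\bigl(x - \frac{\mu_1 + \mu_2}{2}\bigr)$, which vanishes exactly at the midpoint of the two means.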
An example is shown in the left-hand panel of Figure 4.4. The two normal
density functions that are displayed, $f_1(x)$ and $f_2(x)$, represent two distinct
classes. The mean and variance parameters for the two density functions
are $\mu_1 = -1.25$, $\mu_2 = 1.25$, and $\sigma_1^2 = \sigma_2^2 = 1$. The two densities overlap,
and so given that $X = x$, there is some uncertainty about the class to which
the observation belongs. If we assume that an observation is equally likely
to come from either class (that is, $\pi_1 = \pi_2 = 0.5$), then by inspection of
(4.19), we see that the Bayes classifier assigns the observation to class 1
if $x < 0$ and class 2 otherwise. Note that in this case, we can compute
the Bayes classifier because we know that $X$ is drawn from a Gaussian
distribution within each class, and we know all of the parameters involved.
In a real-life situation, we are not able to calculate the Bayes classifier.
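As a minimal sketch of the example above, the following Python code evaluates the discriminant (4.18) and the posterior (4.17) using the known parameters of the Figure 4.4 example. The function and variable names are illustrative choices, not part of the text.

```python
import math

# Parameters of the Figure 4.4 example, as stated in the text.
MU = {1: -1.25, 2: 1.25}       # class means
SIGMA2 = 1.0                   # shared variance
PRIOR = {1: 0.5, 2: 0.5}       # prior class probabilities


def discriminant(x, k):
    """delta_k(x) from (4.18)."""
    return x * MU[k] / SIGMA2 - MU[k] ** 2 / (2 * SIGMA2) + math.log(PRIOR[k])


def normal_density(x, mu, sigma2):
    """One-dimensional normal density, as in (4.16)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def posterior(x, k):
    """p_k(x) from (4.17)."""
    numer = PRIOR[k] * normal_density(x, MU[k], SIGMA2)
    denom = sum(PRIOR[l] * normal_density(x, MU[l], SIGMA2) for l in MU)
    return numer / denom


def bayes_classify(x):
    """Assign x to the class with the largest discriminant."""
    return max(MU, key=lambda k: discriminant(x, k))


# Decision boundary from (4.19); valid here because the priors are equal.
boundary = (MU[1] + MU[2]) / 2
```

Ranking classes by `discriminant` or by `posterior` gives the same answer, since the two differ only by terms that are constant across classes.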
In practice, even if we are quite certain of our assumption that $X$ is
drawn from a Gaussian distribution within each class, to apply the Bayes
classifier we still have to estimate the parameters $\mu_1, \ldots, \mu_K$, $\pi_1, \ldots, \pi_K$,
and $\sigma^2$. The linear discriminant analysis (LDA) method approximates the
Bayes classifier by plugging estimates for $\pi_k$, $\mu_k$, and $\sigma^2$ into (4.18). In
particular, the following estimates are used:

$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2, \tag{4.20}$$

where $n$ is the total number of training observations, and $n_k$ is the number
of training observations in the kth class. The estimate for $\mu_k$ is simply the
average of all the training observations from the kth class, while $\hat{\sigma}^2$ can
be seen as a weighted average of the sample variances for each of the $K$
classes. Sometimes we have knowledge of the class membership probabilities
$\pi_1, \ldots, \pi_K$, which can be used directly. In the absence of any additional
information, LDA estimates $\pi_k$ using the proportion of the training
observations that belong to the kth class. In other words,

$$\hat{\pi}_k = n_k / n. \tag{4.21}$$

The LDA classifier plugs the estimates given in (4.20) and (4.21) into (4.18),
and assigns an observation $X = x$ to the class for which

$$\hat{\delta}_k(x) = x \cdot \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k) \tag{4.22}$$

is largest. The word linear in the classifier’s name stems from the fact
that the discriminant functions $\hat{\delta}_k(x)$ in (4.22) are linear functions of $x$ (as
opposed to a more complex function of $x$).
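The plug-in procedure of (4.20) through (4.22) can be sketched in a few lines of dependency-free Python. The function names, the random seed, and the simulated sample (20 draws per class, mimicking the setup of Figure 4.4) are assumptions made for illustration; they do not reproduce the book's actual sample.

```python
import math
import random


def fit_lda_1d(xs, ys):
    """Estimate (mu_k, sigma^2, pi_k) from 1-D training data via (4.20)-(4.21)."""
    classes = sorted(set(ys))
    n, K = len(xs), len(classes)
    groups = {k: [x for x, y in zip(xs, ys) if y == k] for k in classes}
    mu = {k: sum(g) / len(g) for k, g in groups.items()}
    # Pooled variance: within-class squared deviations divided by n - K.
    sigma2 = sum((x - mu[k]) ** 2 for k, g in groups.items() for x in g) / (n - K)
    prior = {k: len(g) / n for k, g in groups.items()}
    return mu, sigma2, prior


def lda_predict(x, mu, sigma2, prior):
    """Assign x to the class maximizing the estimated discriminant (4.22)."""
    def delta(k):
        return x * mu[k] / sigma2 - mu[k] ** 2 / (2 * sigma2) + math.log(prior[k])
    return max(mu, key=delta)


# Hypothetical training sample: 20 draws per class from N(-1.25, 1) and N(1.25, 1).
rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
xs = [rng.gauss(-1.25, 1.0) for _ in range(20)] + [rng.gauss(1.25, 1.0) for _ in range(20)]
ys = [1] * 20 + [2] * 20

mu_hat, sigma2_hat, prior_hat = fit_lda_1d(xs, ys)
# With equal estimated priors, the LDA boundary is the midpoint of the sample means.
boundary_hat = (mu_hat[1] + mu_hat[2]) / 2
```

Because the sample means differ from the true means, `boundary_hat` will generally not equal the Bayes boundary of 0, which is the source of the gap between the LDA and Bayes error rates.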
The right-hand panel of Figure 4.4 displays a histogram of a random
sample of 20 observations from each class. To implement LDA, we began
by estimating $\pi_k$, $\mu_k$, and $\sigma^2$ using (4.20) and (4.21). We then computed the
decision boundary, shown as a black solid line, that results from assigning
an observation to the class for which (4.22) is largest. All points to the left
of this line will be assigned to the green class, while points to the right of
this line are assigned to the purple class. In this case, since $n_1 = n_2 = 20$,
we have $\hat{\pi}_1 = \hat{\pi}_2$. As a result, the decision boundary corresponds to the
midpoint between the sample means for the two classes, $(\hat{\mu}_1 + \hat{\mu}_2)/2$. The
figure indicates that the LDA decision boundary is slightly to the left of
the optimal Bayes decision boundary, which instead equals $(\mu_1 + \mu_2)/2 = 0$.
How well does the LDA classifier perform on this data? Since this is
simulated data, we can generate a large number of test observations in order
to compute the Bayes error rate and the LDA test error rate. These are
10.6% and 11.1%, respectively. In other words, the LDA classifier’s error
rate is only 0.5 percentage points above the smallest possible error rate! This
indicates that LDA is performing pretty well on this data set.
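The Bayes error rate for this example can also be checked in closed form instead of by simulating test observations, which is a different technique from the one described in the text. The sketch below assumes the stated parameters ($\mu_1 = -1.25$, $\mu_2 = 1.25$, $\sigma = 1$, equal priors) and evaluates the error of any rule of the form "class 1 if $x$ is below a cutoff". The shifted cutoff of $-0.2$ is a hypothetical value chosen for illustration, not the boundary from Figure 4.4.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu1, mu2, sigma = -1.25, 1.25, 1.0
pi1 = pi2 = 0.5


def error_rate(cutoff):
    """Test error of the rule 'class 1 if x < cutoff, else class 2'."""
    miss_class1 = 1.0 - normal_cdf((cutoff - mu1) / sigma)  # class 1 point falls right of cutoff
    miss_class2 = normal_cdf((cutoff - mu2) / sigma)        # class 2 point falls left of cutoff
    return pi1 * miss_class1 + pi2 * miss_class2


bayes_error = error_rate((mu1 + mu2) / 2)   # cutoff at the Bayes boundary, x = 0
shifted_error = error_rate(-0.2)            # hypothetical cutoff left of the Bayes boundary
```

Here `bayes_error` rounds to 10.6%, in agreement with the figure quoted above, and any cutoff other than 0 yields a larger error rate.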
To reiterate, the LDA classifier results from assuming that the
observations within each class come from a normal distribution with a
class-specific mean and a common variance $\sigma^2$, and plugging estimates for these
parameters into the Bayes classifier. In Section 4.4.3, we will consider a less
stringent set of assumptions, by allowing the observations in the kth class
to have a class-specific variance, $\sigma_k^2$.

#####  Linear Discriminant Analysis for p > 1

We now extend the LDA classifier to the case of multiple predictors. To
do this, we will assume that $X = (X_1, X_2, \ldots, X_p)$ is drawn from a
multivariate Gaussian (or multivariate normal) distribution, with a class-specific
mean vector and a common covariance matrix. We begin with a brief review
of this distribution.
The multivariate Gaussian distribution assumes that each individual
predictor follows a one-dimensional normal distribution, as in (4.16), with some
correlation between each pair of predictors. Two examples of multivariate
Gaussian distributions with $p = 2$ are shown in Figure 4.5. The height of
the surface at any particular point represents the probability that both $X_1$
and $X_2$ fall in a small region around that point. In either panel, if the
surface is cut along the $X_1$ axis or along the $X_2$ axis, the resulting cross-section
will have the shape of a one-dimensional normal distribution. The left-hand
panel of Figure 4.5 illustrates an example in which $\mathrm{Var}(X_1) = \mathrm{Var}(X_2)$ and
$\mathrm{Cor}(X_1, X_2) = 0$; this surface has a characteristic bell shape. However, the
bell shape will be distorted if the predictors are correlated or have unequal
variances, as is illustrated in the right-hand panel of Figure 4.5. In this
situation, the base of the bell will have an elliptical, rather than circular,
shape. To indicate that a $p$-dimensional random variable $X$ has a
multivariate Gaussian distribution, we write $X \sim N(\mu, \Sigma)$. Here $E(X) = \mu$ is
the mean of $X$ (a vector with $p$ components), and $\mathrm{Cov}(X) = \Sigma$ is the
$p \times p$ covariance matrix of $X$. Formally, the multivariate Gaussian density
is defined as

$$f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right). \tag{4.23}$$

In the case of $p > 1$ predictors, the LDA classifier assumes that the
observations in the kth class are drawn from a multivariate Gaussian
distribution $N(\mu_k, \Sigma)$, where $\mu_k$ is a class-specific mean vector, and $\Sigma$ is a
covariance matrix that is common to all $K$ classes. Plugging the density
function for the kth class, $f_k(X = x)$, into (4.15) and performing a little
bit of algebra reveals that the Bayes classifier assigns an observation $X = x$
to the class for which

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k \tag{4.24}$$

is largest. This is the vector/matrix version of (4.18).
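To make (4.24) concrete, here is a small sketch for $p = 2$ and $K = 3$ written in plain Python. The mean vectors, the covariance matrix, and all function names are hypothetical values chosen for illustration; they are not the parameters behind Figure 4.6.

```python
import math


def inv2(S):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def discriminant(x, mu_k, sigma_inv, pi_k):
    """delta_k(x) from (4.24)."""
    w = matvec(sigma_inv, mu_k)  # Sigma^{-1} mu_k
    return dot(x, w) - 0.5 * dot(mu_k, w) + math.log(pi_k)


# Hypothetical parameters: three classes, common covariance, equal priors.
SIGMA = [[1.0, 0.5], [0.5, 1.0]]
MEANS = {1: [-1.0, 0.0], 2: [1.0, 0.0], 3: [0.0, 1.5]}
PRIORS = {k: 1.0 / 3.0 for k in MEANS}
SIGMA_INV = inv2(SIGMA)


def bayes_classify(x):
    """Assign x to the class with the largest discriminant."""
    return max(MEANS, key=lambda k: discriminant(x, MEANS[k], SIGMA_INV, PRIORS[k]))
```

The discriminant depends on `x` only through the inner product `dot(x, w)`, which is why the boundaries between classes are straight lines.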
An example is shown in the left-hand panel of Figure 4.6. Three equally
sized Gaussian classes are shown with class-specific mean vectors and a
common covariance matrix. The three ellipses represent regions that
contain 95% of the probability for each of the three classes. The dashed lines
are the Bayes decision boundaries. In other words, they represent the set
of values $x$ for which $\delta_k(x) = \delta_l(x)$; i.e.

$$x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k = x^T \Sigma^{-1} \mu_l - \frac{1}{2} \mu_l^T \Sigma^{-1} \mu_l \tag{4.25}$$

for $k \neq l$. (The $\log \pi_k$ term from (4.24) has disappeared because each of
the three classes has the same number of training observations; i.e. $\pi_k$ is
the same for each class.) Note that there are three lines representing the
Bayes decision boundaries because there are three pairs of classes among
the three classes. That is, one Bayes decision boundary separates class 1
from class 2, one separates class 1 from class 3, and one separates class 2
from class 3. These three Bayes decision boundaries divide the predictor
space into three regions. The Bayes classifier will classify an observation
according to the region in which it is located.

| | True default status: No | True default status: Yes | Total |
|---|---|---|---|
| Predicted default status: No | 9644 | 252 | 9896 |
| Predicted default status: Yes | 23 | 81 | 104 |
| Total | 9667 | 333 | 10000 |

TABLE 4.4. A confusion matrix compares the LDA predictions to the true
default statuses for the 10,000 training observations in the Default data set.
Elements on the diagonal of the matrix represent individuals whose default statuses
were correctly predicted, while off-diagonal elements represent individuals that
were misclassified. LDA made incorrect predictions for 23 individuals who did
not default and for 252 individuals who did default.

Once again, we need to estimate the unknown parameters $\mu_1, \ldots, \mu_K$,
$\pi_1, \ldots, \pi_K$, and $\Sigma$; the formulas are similar to those used in the
one-dimensional case, given in (4.20). To assign a new observation $X = x$,
LDA plugs these estimates into (4.24) to obtain quantities $\hat{\delta}_k(x)$, and
classifies to the class for which $\hat{\delta}_k(x)$ is largest. Note that in (4.24) $\delta_k(x)$ is
a linear function of $x$; that is, the LDA decision rule depends on $x$ only
through a linear combination of its elements. As previously discussed, this
is the reason for the word linear in LDA.
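The text does not spell out the multivariate estimates, so the sketch below assumes the natural analogues of (4.20) and (4.21): class-wise sample mean vectors, a pooled covariance matrix with divisor $n - K$, and class proportions as priors. It is restricted to $p = 2$ to keep the matrix inverse explicit. All names, the seed, and the simulated class means are illustrative assumptions.

```python
import math
import random


def fit_lda_2d(X, y):
    """Plug-in estimates for p = 2: mean vectors, pooled covariance, priors."""
    classes = sorted(set(y))
    n, K = len(X), len(classes)
    mu, prior = {}, {}
    S = [[0.0, 0.0], [0.0, 0.0]]  # running sum of within-class outer products
    for k in classes:
        rows = [x for x, label in zip(X, y) if label == k]
        mu[k] = [sum(r[j] for r in rows) / len(rows) for j in range(2)]
        prior[k] = len(rows) / n
        for r in rows:
            d = [r[j] - mu[k][j] for j in range(2)]
            for a in range(2):
                for b in range(2):
                    S[a][b] += d[a] * d[b]
    # Assumed analogue of (4.20): divide the pooled sum by n - K.
    sigma = [[S[a][b] / (n - K) for b in range(2)] for a in range(2)]
    return mu, sigma, prior


def lda_predict(x, mu, sigma, prior):
    """Classify x by the largest estimated discriminant, following (4.24)."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]

    def delta(k):
        w = [inv[0][0] * mu[k][0] + inv[0][1] * mu[k][1],
             inv[1][0] * mu[k][0] + inv[1][1] * mu[k][1]]  # Sigma^{-1} mu_k
        quad = mu[k][0] * w[0] + mu[k][1] * w[1]
        return x[0] * w[0] + x[1] * w[1] - 0.5 * quad + math.log(prior[k])

    return max(mu, key=delta)


# Hypothetical sample: 20 points per class, independent unit-variance coordinates.
rng = random.Random(1)  # arbitrary seed
true_means = {1: (-2.0, 0.0), 2: (2.0, 0.0), 3: (0.0, 3.0)}
X = [[rng.gauss(m[0], 1.0), rng.gauss(m[1], 1.0)]
     for k, m in true_means.items() for _ in range(20)]
y = [k for k in true_means for _ in range(20)]

mu_hat, sigma_hat, prior_hat = fit_lda_2d(X, y)
```

A new point is then classified with `lda_predict(x, mu_hat, sigma_hat, prior_hat)`.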
In the right-hand panel of Figure 4.6, 20 observations drawn from each of
the three classes are displayed, and the resulting LDA decision boundaries
are shown as solid black lines. Overall, the LDA decision boundaries are
pretty close to the Bayes decision boundaries, shown again as dashed lines.
The test error rates for the Bayes and LDA classifiers are 0.0746 and 0.0770,
respectively. This indicates that LDA is performing well on this data.
We can perform LDA on the Default data in order to predict whether
or not an individual will default on the basis of credit card balance and
student status.⁴ The LDA model fit to the 10,000 training samples results
in a training error rate of 2.75%. This sounds like a low error rate, but two
caveats must be noted.
• First of all, training error rates will usually be lower than test error
rates, which are the real quantity of interest. In other words, we
might expect this classifier to perform worse if we use it to predict
whether or not a new set of individuals will default. The reason is
that we specifically adjust the parameters of our model to do well on
the training data. The higher the ratio of parameters $p$ to number
of samples $n$, the more we expect this overfitting to play a role. For
these data we don’t expect this to be a problem, since $p = 2$ and
$n = 10{,}000$.
• Second, since only 3.33% of the individuals in the training sample
defaulted, a simple but useless classifier that always predicts that
an individual will not default, regardless of his or her credit card
balance and student status, will result in an error rate of 3.33%. In
other words, the trivial null classifier will achieve an error rate that
is only a bit higher than the LDA training set error rate.
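The two error rates compared in the second caveat follow directly from the counts in Table 4.4. The short sketch below recomputes them; the dictionary layout and variable names are illustrative.

```python
# Counts from Table 4.4, keyed by (predicted status, true status).
confusion = {
    ("No", "No"): 9644, ("No", "Yes"): 252,
    ("Yes", "No"): 23, ("Yes", "Yes"): 81,
}

n = sum(confusion.values())

# LDA training error: the off-diagonal cells are the misclassified individuals.
lda_errors = confusion[("No", "Yes")] + confusion[("Yes", "No")]
lda_training_error = lda_errors / n

# The null classifier predicts "No" for everyone, so it errs on every true defaulter.
true_defaulters = confusion[("No", "Yes")] + confusion[("Yes", "Yes")]
null_error = true_defaulters / n
```

These evaluate to 275/10,000 = 2.75% for LDA and 333/10,000 = 3.33% for the null classifier.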
In practice, a binary classifier such as this one can make two types of
errors: it can incorrectly assign an individual who defaults to the no default
category, or it can incorrectly assign an individual who does not default to
the default category. It is often of interest to determine which of these two
types of errors are being made. A confusion matrix, shown for the Default
data in Table 4.4, is a convenient way to display this information. The
table reveals that LDA predicted that a total of 104 people would default.
Of these people, 81 actually defaulted and 23 did not. Hence only 23 out
of 9,667 of the individuals who did not default were incorrectly labeled.
This looks like a pretty low error rate! However, of the 333 individuals who
defaulted, 252 (or 75.7%) were missed by LDA. So while the overall error
rate is low, the error rate among individuals who defaulted is very high.
From the perspective of a credit card company that is trying to identify
high-risk individuals, an error rate of 252/333 = 75.7% among individuals
who default may well be unacceptable.
Class-specific performance is also important in medicine and biology,
where the terms sensitivity and specificity characterize the performance of
a classifier or screening test. In this case the sensitivity is the
percentage of true defaulters that are identified; it equals 24.3%. The specificity
is the percentage of non-defaulters that are correctly identified; it equals
$(1 - 23/9667) = 99.8\%$.
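As a quick check of these figures, sensitivity and specificity can be computed from the four cells of Table 4.4. The helper function and its argument names are illustrative choices, with "default = Yes" treated as the positive class.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from the four confusion-matrix cells."""
    sensitivity = tp / (tp + fn)  # share of true positives that are detected
    specificity = tn / (tn + fp)  # share of true negatives that are detected
    return sensitivity, specificity


# Cells of Table 4.4: 81 defaulters caught, 252 missed,
# 23 non-defaulters flagged in error, 9644 correctly cleared.
sens, spec = sensitivity_specificity(tp=81, fn=252, fp=23, tn=9644)
miss_rate = 1.0 - sens  # error rate among individuals who defaulted
```

Rounded to one decimal place, these give a sensitivity of 24.3%, a specificity of 99.8%, and a miss rate among defaulters of 75.7%.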
Why does LDA do such a poor job of classifying the customers who
default? In other words, why does it have such low sensitivity? As we have
seen, LDA is trying to approximate the Bayes classifier, which has the
lowest total error rate out of all classifiers. That is, the Bayes classifier will
yield the smallest possible total number of misclassified observations,
regardless of the class from which the errors stem. Some misclassifications will
result from incorrectly assigning a customer who does not default to the
default class, and others will result from incorrectly assigning a customer
who defaults to the non-default class. In contrast, a credit card company
might particularly wish to avoid incorrectly classifying an individual who
will default, whereas incorrectly classifying an individual who will not
default, though still to be avoided, is less problematic. We will now see that it
is possible to modify LDA in order to develop a classifier that better meets
the credit card company’s needs.
The Bayes classifier works by assigning an observation to the class for
which the posterior probability $p_k(X)$ is greatest. In the two-class case, this
amounts to assigning an observation to the default class if

$$\Pr(\text{default} = \text{Yes} \mid X = x) > 0.5. \tag{4.26}$$

Thus, the Bayes classifier, and by extension LDA, uses a threshold of 50%
for the posterior probability of default in order to assign an observation
to the default class. However, if we are concerned about incorrectly
predicting the default status for individuals who default, then we can consider
lowering this threshold. For instance, we might assign any customer with a
posterior probability of default above 20% to the default class. In other
words, instead of assigning an observation to the default class if (4.26)
holds, we could instead assign an observation to this class if

$$\Pr(\text{default} = \text{Yes} \mid X = x) > 0.2. \tag{4.27}$$

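The only thing that changes between (4.26) and (4.27) is the cutoff applied to the posterior probability. A minimal sketch, assuming the posterior probabilities of default have already been computed by some fitted model; the values listed below are made up for illustration and do not come from the Default data.

```python
def classify(posterior_default, threshold=0.5):
    """Label an observation "Yes" when its posterior probability of default
    exceeds the threshold, and "No" otherwise."""
    return "Yes" if posterior_default > threshold else "No"


# Hypothetical posterior probabilities of default for six customers.
posteriors = [0.03, 0.15, 0.25, 0.45, 0.55, 0.90]

labels_50 = [classify(p) for p in posteriors]                 # rule (4.26)
labels_20 = [classify(p, threshold=0.2) for p in posteriors]  # rule (4.27)
```

Lowering the threshold can only move observations from "No" to "Yes", so it catches more true defaulters but also flags more non-defaulters in error.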
The error rates that result from taking this approach are shown in Table 4.5.
Now LDA predicts that 430 individuals will default. Of the 333 individuals
who default, LDA correctly predicts all but 138, or 41.4%. This is a vast
