### Classification

The linear regression model discussed in Chapter 3 assumes that the response variable $Y$ is quantitative. But in many situations, the response variable is instead qualitative. For example, eye color is qualitative. Often qualitative variables are referred to as categorical; we will use these terms interchangeably. In this chapter, we study approaches for predicting qualitative responses, a process that is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.

There are many possible classification techniques, or classifiers, that one might use to predict a qualitative response. We touched on some of these in Sections 2.1.5 and 2.2.3. In this chapter we discuss some widely-used classifiers: logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes, and K-nearest neighbors. The discussion of logistic regression is used as a jumping-off point for a discussion of generalized linear models, and in particular, Poisson regression.

#### An Overview of Classification

 Classification problems occur often, perhaps even more so than regression problems. Some examples include:
  1. A person arrives at the emergency room with a set of symptoms
 that could possibly be attributed to one of three medical conditions.
 Which of the three conditions does the individual have?
 2. An online banking service must be able to determine whether or not
 a transaction being performed on the site is fraudulent, on the basis
 of the user’s IP address, past transaction history, and so forth.
 3. On the basis of DNA sequence data for a number of patients with
 and without a given disease, a biologist would like to figure out which
 DNA mutations are deleterious (disease-causing) and which are not.

 Just as in the regression setting, in the classification setting we have a
 set of training observations $(x_1, y_1), \ldots, (x_n, y_n)$ that we can use to build
 a classifier. We want our classifier to perform well not only on the training
 data, but also on test observations that were not used to train the classifier.
 
 In this chapter, we will illustrate the concept of classification using the
 simulated Default data set. We are interested in predicting whether an
 individual will default on his or her credit card payment, on the basis of
 annual income and monthly credit card balance. The data set is displayed
 in Figure 4.1. In the left-hand panel of Figure 4.1, we have plotted annual
 income and monthly credit card balance for a subset of 10,000 individuals.
 The individuals who defaulted in a given month are shown in orange, and
 those who did not in blue. (The overall default rate is about 3%, so we
 have plotted only a fraction of the individuals who did not default.) It
 appears that individuals who defaulted tended to have higher credit card
 balances than those who did not. In the center and right-hand panels of
 Figure 4.1, two pairs of boxplots are shown. The first shows the distribution
 of balance split by the binary default variable; the second is a similar plot
 for income. In this chapter, we learn how to build a model to predict default
 ($Y$) for any given value of balance ($X_1$) and income ($X_2$). Since $Y$ is not
 quantitative, the simple linear regression model of Chapter 3 is not a good
 choice: we will elaborate on this further in Section 4.2.
 
It is worth noting that Figure 4.1 displays a very pronounced relationship between the predictor balance and the response default. In most real applications, the relationship between the predictor and the response will not be nearly so strong. However, for the sake of illustrating the classification procedures discussed in this chapter, we use an example in which the relationship between the predictor and the response is somewhat exaggerated.

#### Why Not Linear Regression?

Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, $Y$, as follows:

$$
Y = \begin{cases}
1 & \text{if stroke;} \\
2 & \text{if drug overdose;} \\
3 & \text{if epileptic seizure.}
\end{cases}
$$

Using this coding, least squares could be used to fit a linear regression model to predict $Y$ on the basis of a set of predictors $X_1, \ldots, X_p$. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

$$
Y = \begin{cases}
1 & \text{if epileptic seizure;} \\
2 & \text{if stroke;} \\
3 & \text{if drug overdose,}
\end{cases}
$$

which would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.
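
The effect of the coding can be sketched numerically. The example below is only an illustration under stated assumptions: the single predictor, its nine values, and the rule of mapping a fitted value to the nearest code are all made up for this sketch and do not come from the text. It fits a least squares line under each of the two codings above and compares the predicted diagnoses.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def predict_label(coding, xs, labels, x_new):
    """Fit a line to the coded response, then return the label whose code
    is nearest to the fitted value (an assumed decision rule)."""
    ys = [coding[label] for label in labels]
    b0, b1 = fit_line(xs, ys)
    y_hat = b0 + b1 * x_new
    return min(coding, key=lambda label: abs(coding[label] - y_hat))


# Hypothetical training data: one made-up predictor value per patient.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = ["stroke"] * 3 + ["drug overdose"] * 3 + ["epileptic seizure"] * 3

coding_a = {"stroke": 1, "drug overdose": 2, "epileptic seizure": 3}
coding_b = {"epileptic seizure": 1, "stroke": 2, "drug overdose": 3}

for x_new in (1, 8):
    print(x_new,
          predict_label(coding_a, xs, labels, x_new),
          predict_label(coding_b, xs, labels, x_new))
```

On this toy data the two codings disagree at both test points, even though the training observations are identical.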

If the response variable's values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.

For a binary (two level) qualitative response, the situation is better. For instance, perhaps there are only two possibilities for the patient's medical condition: stroke and drug overdose. We could then potentially use the dummy variable approach from Section 3.3.1 to code the response as follows:

$$
Y = \begin{cases}
0 & \text{if stroke;} \\
1 & \text{if drug overdose.}
\end{cases}
$$

We could then fit a linear regression to this binary response, and predict drug overdose if $\hat{Y} > 0.5$ and stroke otherwise. In the binary case it is not hard to show that even if we flip the above coding, linear regression will produce the same final predictions.
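
The claim about flipping the binary coding can be checked on a small example. The sketch below uses eight made-up observations of a single hypothetical predictor. It fits a least squares line under the 0/1 coding and under the flipped coding, predicts the class coded 1 whenever the fitted value exceeds 0.5, and compares the two sets of predictions.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical training data, invented for this illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
labels = ["stroke", "stroke", "stroke", "drug overdose",
          "stroke", "drug overdose", "drug overdose", "drug overdose"]


def classify(coding, x_new):
    """Predict the class coded 1 if the fitted value exceeds 0.5,
    and the class coded 0 otherwise."""
    ys = [coding[label] for label in labels]
    b0, b1 = fit_line(xs, ys)
    label_one = next(lab for lab, code in coding.items() if code == 1)
    label_zero = next(lab for lab, code in coding.items() if code == 0)
    return label_one if b0 + b1 * x_new > 0.5 else label_zero


original = {"stroke": 0, "drug overdose": 1}
flipped = {"stroke": 1, "drug overdose": 0}

preds_original = [classify(original, x) for x in xs]
preds_flipped = [classify(flipped, x) for x in xs]
print(preds_original == preds_flipped)
```

Flipping the coding replaces each fitted value $\hat{Y}$ by $1 - \hat{Y}$, so the 0.5 threshold selects the same class either way. The only possible disagreement is a fitted value of exactly 0.5, which does not occur in this toy data.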

For a binary response with a 0/1 coding as above, regression by least squares is not completely unreasonable: it can be shown that the $X\hat{\beta}$ obtained using linear regression is in fact an estimate of $\Pr(\text{drug overdose} \mid X)$ in this special case. However, if we use linear regression, some of our estimates might be outside the $[0, 1]$ interval (see Figure 4.2), making them hard to interpret as probabilities! Nevertheless, the predictions provide an ordering and can be interpreted as crude probability estimates. Curiously, it turns out that the classifications that we get if we use linear regression to predict a binary response will be the same as for the linear discriminant analysis (LDA) procedure we discuss in Section 4.4.

To summarize, there are at least two reasons not to perform classification using a regression method: (a) a regression method cannot accommodate a qualitative response with more than two classes; (b) a regression method will not provide meaningful estimates of $\Pr(Y \mid X)$, even with just two classes. Thus, it is preferable to use a classification method that is truly suited for qualitative response values. In the next section, we present logistic regression, which is well-suited for the case of a binary qualitative response; in later sections we will cover classification methods that are appropriate when the qualitative response has two or more classes.


#### Logistic Regression

Consider again the Default data set, where the response default falls into one of two categories, Yes or No. Rather than modeling this response $Y$ directly, logistic regression models the probability that $Y$ belongs to a particular category.

For the Default data, logistic regression models the probability of default. For example, the probability of default given balance can be written as

$$
\Pr(\text{default} = \text{Yes} \mid \text{balance}).
$$

The values of $\Pr(\text{default} = \text{Yes} \mid \text{balance})$, which we abbreviate $p(\text{balance})$, will range between 0 and 1. Then for any given value of balance, a prediction can be made for default. For example, one might predict default = Yes for any individual for whom $p(\text{balance}) > 0.5$. Alternatively, if a company wishes to be conservative in predicting individuals who are at risk for default, then they may choose to use a lower threshold, such as $p(\text{balance}) > 0.1$.
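
The role of the threshold can be illustrated with a short sketch. The probabilities below are invented values standing in for estimates of p(balance); the function name `classify_default` is likewise hypothetical.

```python
def classify_default(p_balance, threshold=0.5):
    """Predict default = "Yes" when the estimated probability exceeds the threshold."""
    return "Yes" if p_balance > threshold else "No"


# Hypothetical estimated probabilities p(balance) for five individuals.
estimated = [0.02, 0.08, 0.15, 0.40, 0.75]

at_half = [classify_default(p) for p in estimated]
at_tenth = [classify_default(p, threshold=0.1) for p in estimated]

print(at_half)
print(at_tenth)
```

Lowering the threshold from 0.5 to 0.1 flags more individuals as likely to default. This is the conservative behavior described above, and it comes at the cost of flagging some individuals who would not have defaulted.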
 


##### The Logistic Model

How should we model the relationship between $p(X) = \Pr(Y = 1 \mid X)$ and $X$? (For convenience we are using the generic 0/1 coding for the response.) In Section 4.2 we considered using a linear regression model to represent these probabilities:

$$
p(X) = \beta_0 + \beta_1 X. \tag{4.1}
$$

If we use this approach to predict default = Yes using balance, then we obtain the model shown in the left-hand panel of Figure 4.2. Here we see the problem with this approach: for balances close to zero we predict a negative probability of default; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of default, regardless of credit card balance, must fall between 0 and 1. This problem is not unique to the credit default data. Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we can always predict $p(X) < 0$ for some values of $X$ and $p(X) > 1$ for others (unless the range of $X$ is limited).
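
A small numerical sketch makes the problem concrete. The balances and 0/1 default indicators below are made up for illustration and are not taken from the Default data. The code fits a least squares line to the 0/1 response and evaluates it at a very small and a very large balance.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical balances (in dollars) and default indicators (1 = Yes, 0 = No).
balances = [250, 500, 750, 1000, 1250, 1500, 1750, 2000]
defaults = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_line(balances, defaults)


def p_linear(balance):
    """Fitted value of the straight-line model at the given balance."""
    return b0 + b1 * balance


print(p_linear(0))     # a negative "probability"
print(p_linear(3000))  # a "probability" above 1
```

Even within the range of the toy training data, the fitted values at the smallest and largest balances already fall outside the interval from 0 to 1.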

To avoid this problem, we must model $p(X)$ using a function that gives outputs between 0 and 1 for all values of $X$. Many functions meet this description. In logistic regression, we use the logistic function,

$$
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}. \tag{4.2}
$$
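
The logistic function (4.2) is simple to evaluate directly. The sketch below is a minimal illustration: the coefficient values are hypothetical placeholders rather than estimates from the Default data, and the two-branch form is a choice made here to avoid numerical overflow for large positive or negative inputs.

```python
import math


def logistic(x, beta0, beta1):
    """Evaluate p(X) = e^(beta0 + beta1 * X) / (1 + e^(beta0 + beta1 * X)).

    The two branches are algebraically identical; each avoids calling
    exp() on a large positive argument.
    """
    z = beta0 + beta1 * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -10.0, 0.005

for balance in (0, 1000, 2000, 3000):
    print(balance, logistic(balance, beta0, beta1))
```

For these placeholder coefficients the output rises with balance, equals 0.5 where $\beta_0 + \beta_1 X = 0$, and stays strictly between 0 and 1 at every balance shown, unlike the straight-line model (4.1).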

To fit the model (4.2), we use a method called maximum likelihood, which we discuss in the next section. The right-hand panel of Figure 4.2 illustrates the fit of the logistic regression model to the Default data. Notice that for
