# Chapter 4 : Classification

&bullet; The __linear regression model__ discussed in <a href="http://localhost:8888/notebooks/islr-book/Chapter%203/Chapter%203.ipynb#3-Linear-Regression">Chapter 3</a> assumes that the `response variable` <font size=3>$Y$</font> is __quantitative__. 

&bullet; But in `many situations`, the `response variable` is ___instead `qualitative`___. 

>For __example__, `eye color` is `qualitative`, taking on values _blue, brown, or green_. 

&bullet; Often `qualitative variables` are referred to as __`categorical`__ ; we will `use` these terms `interchangeably`. 

&bullet; In this chapter, we study `approaches` for `predicting qualitative responses`, a process that is `known` as __`classification`__. 

&bullet; `Predicting` a `qualitative response` for an `observation` can be `referred` to as ___`classifying`___ that observation, since it `involves assigning` the `observation to a category`, or `class`. 

&bullet; On the `other hand`, often the `methods used` for `classification` first `predict the probability` of each of the `categories` of a `qualitative variable`, as the `basis` for `making` the `classification`. 

&bullet; In this sense they also behave like `regression methods`.

&bullet; There are many `possible classification techniques`, or __classifiers__, that one might use to `predict a qualitative response`. 

&bullet; We touched on some of these in <a href="http://localhost:8888/notebooks/islr-book/Chapter%202/Chapter%202.ipynb#2.1.5-Regression-Versus-Classification-Problems">Sections 2.1.5</a> and <a href="http://localhost:8888/notebooks/islr-book/Chapter%202/Chapter%202.ipynb#2.2.3-The-Classification-Setting">2.2.3</a>. 

&bullet; In this chapter we `discuss three` of the most widely-`used` `classifiers`: 
1. __`logistic regression`__, 
2. __`linear discriminant analysis`__, and
3. __`K-nearest neighbors`__. 

We will cover the following methods in later chapters:

1. generalized additive models, 
2. trees, 
3. random forests,
4. boosting, and 
5. support vector machines.


## 4.1 An Overview of Classification

    Classification problems occur often, perhaps even more so than regression problems.
    
__For Example__
1. A person arrives at the emergency room with a `set of symptoms` that could `possibly be attributed` to `one of three` `medical conditions`.<br>__Which of the three conditions does the individual have?__

2. An `online banking service` must be able to determine whether or not a `transaction being performed` on the `site is fraudulent`, on the `basis of the user’s IP address`, `past transaction history`, and `so forth`.

3. On the basis of `DNA sequence data` for a `number of patients with and without a given disease`, a `biologist` would like to `figure out` which `DNA mutations are deleterious` (disease-causing) and `which are not`.

&bullet; Just as in the `regression setting`, in the `classification setting` we have a `set of training observations` $(x_{1} , y_{1} ), \dots , (x_{n} , y_{n} )$ that `we can use` to `build a classifier`. 

&bullet; We want `our classifier` to `perform well not only on` the `training data`, but also on `test observations` that were `not used` to `train the classifier`.

<a id="Figure4.1"></a>
![image.png](Figures/Figure4.1.png)
>__FIGURE 4.1__. The `Default data set`. 

>__Left:__ The `annual incomes` and `monthly credit card balances` of a `number of individuals`. 
<br>The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. 

>__Center:__ Boxplots of `balance` as a `function of default status`. 

>__Right:__ Boxplots of `income` as a `function of default status`.



## 4.2 Why Not Linear Regression?

We have stated that linear regression is not appropriate in the case of a `qualitative response`.

___Why Not?___

Suppose that we are `trying to predict` the `medical condition` of a `patient`
in the `emergency room on the basis` of `her symptoms`. 

In this simplified example, there are `three possible diagnoses`: __stroke , drug overdose , and
epileptic seizure__. 

We `could consider encoding these values` as a `quantitative response variable`, $Y$ , as follows:

![image.png](Figures/FormulaE1.png)

Using this `coding`, `least squares` could be `used to fit a linear regression model` to `predict` $Y$ on the `basis of a set of predictors` $X_{1} , \dots , X_{p}$ . 

Unfortunately, this `coding implies` an `ordering on the outcomes`, `putting drug overdose` in `between stroke and epileptic seizure` , and `insisting` that the `difference between stroke and drug overdose` is the `same as the difference between drug overdose and epileptic seizure`. 

In practice there is `no particular reason` that `this needs to be the case`. 

For instance, one `could choose` an `equally reasonable coding`, which would `imply` a `totally different relationship` among the `three conditions`. 

Each of these `codings` would `produce` fundamentally `different linear
models` that would `ultimately lead` to `different sets` of `predictions on test observations`.

![image.png](Figures/FormulaE2.png)
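
The effect of the coding can be checked on a small made-up example. The sketch below is only an illustration under stated assumptions: the single predictor `symptom_score`, the six observations, the second coding `coding_b`, and the rule of mapping each fitted value to the nearest code are all hypothetical choices, not taken from the book.

```python
# Toy illustration (made-up data): the same three diagnoses under two codings.

def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def predict_labels(xs, labels, coding):
    """Fit on the coded response, then map each fitted value to the nearest code."""
    b0, b1 = fit_line(xs, [coding[label] for label in labels])
    decode = {code: label for label, code in coding.items()}
    preds = []
    for x in xs:
        y_hat = b0 + b1 * x
        nearest = min(decode, key=lambda code: abs(code - y_hat))
        preds.append(decode[nearest])
    return preds

symptom_score = [1, 2, 3, 4, 5, 6]  # hypothetical single predictor
diagnosis = ["stroke", "stroke", "overdose", "overdose", "seizure", "seizure"]

coding_a = {"stroke": 1, "overdose": 2, "seizure": 3}
coding_b = {"seizure": 1, "stroke": 2, "overdose": 3}  # assumed alternative ordering

preds_a = predict_labels(symptom_score, diagnosis, coding_a)
preds_b = predict_labels(symptom_score, diagnosis, coding_b)
print(preds_a)
print(preds_b)
```

On this toy data the two codings give different fitted lines and different predicted diagnoses for several observations, even though the underlying labels are identical.
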

If the response variable’s values did take on a natural ordering, such as
mild, moderate, and severe, and we felt the gap between mild and moderate
was similar to the gap between moderate and severe, then a 1, 2, 3 coding
would be reasonable. Unfortunately, in general there is no natural way to
convert a qualitative response variable with more than two levels into a
quantitative response that is ready for linear regression.


For a ___`binary (two level) qualitative response`___, the `situation is better`. 
For instance, perhaps there are `only two possibilities` for the `patient’s medical condition`: __stroke and drug overdose__. 

We could then `potentially use`
the `dummy variable approach` from <a href="http://localhost:8888/notebooks/islr-book/Chapter%203/Chapter%203.ipynb#3.3.1-Qualitative-Predictors">Section 3.3.1</a> to code the response as follows:

![image.png](Figures/FormulaE3.png)

We could then `fit a linear regression` to this `binary response`, and `predict drug overdose` if $\hat{y} > 0.5$ and `stroke` otherwise. 

In the `binary case` it is `not hard` to `show that` even `if we flip` the above `coding`, `linear regression` will produce the same `final predictions`.
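
A quick numerical check of this claim, as a sketch on made-up data (the predictor values and labels below are hypothetical). Flipping the coding replaces every fitted value $\hat{y}$ by $1 - \hat{y}$, so thresholding at 0.5 picks the same class, apart from fitted values exactly equal to 0.5.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

xs = [1, 2, 3, 4, 5, 6, 7, 8]             # hypothetical predictor values
is_overdose = [0, 0, 0, 1, 0, 1, 1, 1]    # 1 = drug overdose, 0 = stroke
is_stroke = [1 - y for y in is_overdose]  # flipped coding: 1 = stroke

b0, b1 = fit_line(xs, is_overdose)
c0, c1 = fit_line(xs, is_stroke)

# In each case, predict the class coded as 1 when the fitted value exceeds 0.5.
pred_original = ["overdose" if b0 + b1 * x > 0.5 else "stroke" for x in xs]
pred_flipped = ["stroke" if c0 + c1 * x > 0.5 else "overdose" for x in xs]

print(pred_original)
print(pred_flipped)
```
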

For a `binary response` with a $ 0/1 $ `coding` as above, `regression` by `least squares does make sense`; it can be shown that the $X \hat{\beta}$ obtained using `linear
regression` is `in fact` an `estimate of` $Pr( drug\ overdose |X)$ in this `special
case`. 
However, if we use `linear regression`, some of `our estimates` might be `outside` the $[0, 1]$ `interval` (see <a href="#Figure4.2">Figure 4.2</a>), `making them hard` to `interpret as probabilities`! 

Nevertheless, the `predictions provide` an `ordering` and `can be interpreted` as `crude probability estimates`. 

Curiously, it `turns out` that the `classifications` that we `get` if `we use linear regression` to `predict a binary response` will be the same as for the ___`linear discriminant analysis (LDA)`___ procedure.

However, the `dummy variable` approach `cannot` be `easily extended` to `accommodate qualitative responses` with `more than two levels`. 
For these `reasons`, it is `preferable` to `use` a `classification method` that is `truly suited`
for `qualitative response values`, such as the `ones presented next`.

## 4.3 Logistic Regression

Consider again the `Default data set`, where the `response default` falls into `one of two categories`, __`Yes or No`__ . 

Rather than `modeling` this `response` $Y$ `directly`, `logistic regression models` the `probability` that $Y$ belongs to a `particular category`.

<a id="Figure4.2"></a>
![image.png](Figures/Figure4.2.png)
>__FIGURE 4.2:__ `Classification` using the `Default data`.

>__Left:__ `Estimated probability` of `default` using `linear regression`. 
<br>Some `estimated probabilities` are `negative`! 
<br>The `orange ticks indicate` the __0/1 values__ `coded for default`(No or Yes).

>__Right:__ `Predicted probabilities` of `default` using `logistic regression`. 
<br>All `probabilities lie` between __0 and 1__.

For the Default data, logistic regression models the probability of default. For example, the probability of default given balance can be written as

<font size=5><center>$Pr(default = Yes|balance ).$</center></font>

The values of $Pr(default = Yes|balance )$, which we `abbreviate` $p(balance)$, will `range between` __0 and 1__. 
 
Then for `any given value of balance`, a `prediction` can be made for `default`. 

For __example__, `one might predict` __default = Yes__ for `any individual` for `whom` $p(balance) > 0.5$.

Alternatively, if a `company wishes` to be `conservative in predicting individuals` who
are at `risk for default`, then `they may choose` to `use a lower threshold`, such
as $p(balance) > 0.1$.
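
The role of the threshold can be sketched in a few lines. The probabilities below are hypothetical values of $p(balance)$ chosen for illustration, not estimates from the `Default data`.

```python
def classify(prob_default, threshold=0.5):
    """Predict default = Yes when the estimated probability exceeds the threshold."""
    return "Yes" if prob_default > threshold else "No"

estimated = [0.02, 0.08, 0.15, 0.40, 0.75]  # hypothetical values of p(balance)

usual_rule = [classify(p) for p in estimated]
conservative_rule = [classify(p, threshold=0.1) for p in estimated]

print(usual_rule)         # only the last individual is flagged
print(conservative_rule)  # the last three individuals are flagged
```

Lowering the threshold flags more individuals as likely defaulters, at the cost of flagging more people who would not in fact default.
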



### 4.3.1 The Logistic Model

__How should we model the `relationship between` $p(X) = Pr(Y = 1|X)$ and
$X$?__ 
<br>(For convenience we are using the generic 0/1 coding for the response).

In <a href="#4.2-Why-Not-Linear-Regression?">Section 4.2</a> we `talked` of `using` a `linear regression model` to represent `these probabilities`:

<a id="Formula4.1"></a>
<font size=5><center>$p(X) = \beta_{0} + \beta_{1} X$</center></font>.

If we use `this approach` to `predict` __default=Yes__ using `balance`, then we `obtain the model` shown in the `left-hand` panel of <a href="#Figure4.2">Figure 4.2</a>. 

Here `we see` the `problem with this approach`: for `balances close to zero` we `predict` a __`negative probability of default`__; <br>if we were to `predict for very large balances`, we would `get values` __`bigger than 1`__. 

These `predictions` are `not sensible`, since `of course` the `true probability of default`, `regardless` of `credit card balance`, must `fall` between __0 and 1__. 

This `problem` is `not unique` to the `credit default data`. 

Any time a `straight line` is `fit to a binary response` that is `coded as 0 or 1`, in `principle` we can `always predict` $p(X) < 0$ for `some values` of $X$ and $p(X) > 1$ for `others` (unless the range of $X$ is `limited`).
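
A small sketch of this behaviour, using a made-up 0/1 response rather than the `Default data` (the values of `xs` and `ys` are hypothetical). Even within the observed range of the predictor, the least squares line leaves the $[0, 1]$ interval at both ends.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1 * x for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

xs = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical predictor values
ys = [0, 0, 0, 1, 0, 1, 1, 1]  # hypothetical binary response coded 0/1

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]

for x, p_hat in zip(xs, fitted):
    flag = "" if 0.0 <= p_hat <= 1.0 else "  <- outside [0, 1]"
    print(f"x = {x}: fitted value = {p_hat:.3f}{flag}")
```
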

__To avoid this problem__, we `must model` $p(X)$ `using a function` that gives
`outputs between 0 and 1` for `all values` of $X$. 

Many `functions meet` this `description`. 

In __`logistic regression`__, we use the __`logistic function`__,

<a id="Formula4.2"></a>
<font size=5><center> $ p(X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}} $ </center></font>


To `fit` the `model` <a href="#Formula4.2">(4.2)</a>, we `use a method` called __`maximum likelihood`__, which
we discuss in the next section. 

The _right-hand panel_ of <a href="#Figure4.2">Figure 4.2</a> illustrates the `fit` of the `logistic regression` model to the `Default data`. 

Notice that for `low balances` we now `predict` the `probability` of `default` as `close to`, but never `below`, __zero__. 

Likewise, for `high balances` we `predict` a `default probability` close to, but `never above`, __one__. 

The ___`logistic function`___ will always produce an `S-shaped curve` of this `form`, and so `regardless of the value` of $X$, we will obtain a `sensible prediction`. 
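
The S-shape can be seen by evaluating (4.2) on a grid of balances. This is a minimal sketch: the coefficients `beta0` and `beta1` are hypothetical values picked for illustration, not the estimates fitted to the `Default data`. The function is written in an algebraically equivalent form that avoids overflow in `math.exp` for large inputs.

```python
import math

def logistic(x, beta0, beta1):
    """p(X) from (4.2), evaluated in a numerically stable way."""
    t = beta0 + beta1 * x
    if t >= 0:
        # Equivalent to e^t / (1 + e^t) after dividing through by e^t.
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

beta0, beta1 = -10.0, 0.005  # hypothetical coefficients
balances = [0, 500, 1000, 1500, 2000, 2500, 3000]
probs = [logistic(b, beta0, beta1) for b in balances]

for b, p in zip(balances, probs):
    print(f"balance = {b:4d}: p = {p:.5f}")
```

With these assumed coefficients the probabilities rise monotonically from near zero to near one, crossing 0.5 at a balance of 2000, where $\beta_{0}+\beta_{1}X = 0$.
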

We also see that the `logistic model` is `better` able `to capture the range` of `probabilities` than is the `linear regression model` in the _left-hand plot_. 

The `average fitted probability` in `both cases` is $0.0333$ (averaged over the training data), which is the `same as the overall proportion` of `defaulters` in the `data set`.

After a `bit of manipulation` of <a href="#Formula4.2">(4.2)</a>, we find that

<a id="Formula4.3"></a>
<font size=5><center> $ \frac{p(X)}{1 - p(X)} = e^{\beta_{0}+\beta_{1}X}$ </center></font>
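
The manipulation can be spelled out in two steps, starting from (4.2):

$$
\begin{aligned}
1 - p(X) &= 1 - \frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}} = \frac{1}{1+e^{\beta_{0}+\beta_{1}X}}, \\
\frac{p(X)}{1 - p(X)} &= \frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}} \cdot \left(1+e^{\beta_{0}+\beta_{1}X}\right) = e^{\beta_{0}+\beta_{1}X}.
\end{aligned}
$$
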

The `quantity` <font size=5><center> $ \frac{p(X)}{1 - p(X)}$ </center></font> is called the __`odds`__, and can `take on any value` between $0$ and $\infty$. 

`Values` of the `odds close` to $0$ and $\infty$ `indicate very low` and `very high probabilities` of `default`, respectively. 

__For example__, on `average`
1 in 5 people with `odds` of $\frac{1}{4}$ will `default`, since $p(X) = 0.2$ `implies` odds of $\frac{0.2}{1-0.2} = \frac{1}{4}$. 

Likewise, on `average nine out of every ten people` with `odds of` 9 will default, since $p(X) = 0.9$ implies `odds` of $\frac{0.9}{1-0.9} = 9$.
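
The two conversions can be checked with exact arithmetic. The helper names `odds_from_prob` and `prob_from_odds` are introduced here for illustration only. Python's `fractions.Fraction` is used so that the results are exact rather than rounded floating-point values.

```python
from fractions import Fraction

def odds_from_prob(p):
    """Odds p / (1 - p) for a probability p."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Inverse conversion: probability odds / (1 + odds)."""
    return odds / (1 + odds)

print(odds_from_prob(Fraction(1, 5)))   # 1/4
print(odds_from_prob(Fraction(9, 10)))  # 9
print(prob_from_odds(Fraction(9)))      # 9/10
```
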

`Odds` are `traditionally used instead of probabilities` in `horse-racing`, since they `relate more naturally` to the `correct betting strategy`.

By taking the `logarithm of both sides` of <a href="#Formula4.3">(4.3)</a>, we arrive at

<a id="Formula4.4"></a>
<font size=5><center> $ \log \Big(\frac{p(X)}{1 - p(X)}\Big) = \beta_{0}+\beta_{1}X$ </center></font>

The _left-hand side_ is called the __`log-odds or logit`__. 

We see that the __`logistic regression model`__ <a href="#Formula4.2">(4.2)</a> has a `logit` that is `linear` in $X$.
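
The linearity of the logit can be verified numerically. In the sketch below, `beta0` and `beta1` are hypothetical coefficients chosen for illustration. The last lines also check a direct consequence of (4.3): increasing $X$ by one unit multiplies the odds by $e^{\beta_{1}}$.

```python
import math

beta0, beta1 = -10.0, 0.005  # hypothetical coefficients

def p(x):
    """Logistic model (4.2)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(prob):
    return prob / (1.0 - prob)

def logit(prob):
    """Log-odds, the left-hand side of (4.4)."""
    return math.log(odds(prob))

for x in [500, 1500, 2500]:
    print(x, logit(p(x)), beta0 + beta1 * x)  # the two values agree up to rounding

# A one-unit increase in X multiplies the odds by e^{beta1}.
odds_ratio = odds(p(1501)) / odds(p(1500))
print(odds_ratio, math.exp(beta1))
```
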

__[Read More On Page No 132&133]__

### 4.3.2 Estimating the Regression Coefficients

The `coefficients` $\beta_{0}$ and $\beta_{1}$ in <a href="#Formula4.2">(4.2)</a> are __`unknown`__, and `must be estimated` based on the `available training data`. 

In <a href="http://localhost:8888/notebooks/islr-book/Chapter%203/Chapter%203.ipynb">Chapter 3</a>, we used the __`least squares approach`__ to `estimate the unknown linear regression coefficients`. 

Although we could use (non-linear) `least squares` to `fit the model` <a href="#Formula4.4">(4.4)</a>, the `more
general method` of ___`maximum likelihood`___ is `preferred`, since it has `better statistical properties`. 

The `basic intuition` behind using `maximum likelihood` to `fit a logistic regression model` is as follows: 
we seek estimates for $\beta_{0}$ and $\beta_{1}$ such that the `predicted probability` $\hat{p}(x_{i})$ of `default` for `each individual`, using <a href="#Formula4.2">(4.2)</a>, corresponds as `closely as possible` to the `individual’s observed default status`. 

In __other words__, we `try to find` $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ such that `plugging these estimates into the model` for $p(X)$, given in <a href="#Formula4.2">(4.2)</a>, yields a `number close to one` for `all individuals who defaulted`, and a `number close to zero`
for `all individuals who did not`. 

This `intuition` can be `formalized` using a
`mathematical equation` called a ___`likelihood function`___:

<a id="Formula4.5"></a>
![image.png](Figures/Formula4.5.png)

The estimates $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ are `chosen` to `maximize this likelihood function`.
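
To make the idea concrete, here is a minimal sketch that maximizes the likelihood on a tiny made-up data set. Several things are assumptions: the data `xs` and `ys` are hypothetical, the likelihood is taken to be the standard one for logistic regression (the product of $p(x_{i})$ over observations with $y_{i}=1$ and of $1-p(x_{i})$ over observations with $y_{i}=0$), and Newton's method is one common way to carry out the maximization, not necessarily what any particular software package does.

```python
import math

def prob(x, b0, b1):
    """Logistic model (4.2)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(xs, ys, b0, b1):
    """Sum of log p(x_i) where y_i = 1 and log(1 - p(x_i)) where y_i = 0."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = prob(x, b0, b1)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic(xs, ys, n_iter=25):
    """Maximize the log-likelihood over (b0, b1) with Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = prob(x, b0, b1)
            w = p * (1.0 - p)
            g0 += y - p        # gradient with respect to b0
            g1 += (y - p) * x  # gradient with respect to b1
            h00 += w           # entries of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

xs = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical predictor values
ys = [0, 0, 0, 1, 0, 1, 1, 1]  # hypothetical 0/1 response (classes overlap)

b0_hat, b1_hat = fit_logistic(xs, ys)
fitted = [prob(x, b0_hat, b1_hat) for x in xs]

print("estimates:", b0_hat, b1_hat)
print("log-likelihood at estimates:", log_likelihood(xs, ys, b0_hat, b1_hat))
print("log-likelihood at (0, 0):   ", log_likelihood(xs, ys, 0.0, 0.0))
print("average fitted probability: ", sum(fitted) / len(fitted))
```

The average fitted probability matches the observed proportion of ones, which is a general property of maximum likelihood logistic regression with an intercept. Note that the toy classes overlap on purpose: if the two classes were perfectly separated by the predictor, the likelihood would have no finite maximizer and this simple loop would not converge.
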


Maximum likelihood is a `very general approach` that is `used to fit many of the non-linear models` that we `examine throughout` this book. 

In the `linear regression setting`, the `least squares approach` is in `fact a special case` of __`maximum likelihood`__. 
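
A short sketch of why this holds, under the additional assumption (not stated above) that the errors are independent and normally distributed, $y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}$ with $\epsilon_{i} \sim N(0, \sigma^{2})$. The log-likelihood is then

$$
\log L(\beta_{0}, \beta_{1}, \sigma^{2}) = -\frac{n}{2}\log(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(y_{i} - \beta_{0} - \beta_{1}x_{i}\right)^{2}.
$$

For any fixed $\sigma^{2}$, maximizing this over $\beta_{0}$ and $\beta_{1}$ is the same as minimizing the residual sum of squares $\sum_{i=1}^{n}(y_{i} - \beta_{0} - \beta_{1}x_{i})^{2}$, which is exactly the least squares criterion.
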

The `mathematical details` of `maximum likelihood` are beyond the `scope of this book`. 

However, in general, `logistic regression` and `other models` can be `easily fit using a statistical software package` such as Python or R, and so we `do not need` to `concern ourselves` with the `details` of the `maximum likelihood fitting procedure`.

Page No 134-148
