# Chapter 4 : Classification

&bullet; The __linear regression model__ discussed in <a href="http://localhost:8888/notebooks/islr-book/Chapter%203/Chapter%203.ipynb#3-Linear-Regression">Chapter 3</a> assumes that the `response variable` <font size=3>$Y$</font> is __quantitative__. 

&bullet; But in `many situations`, the `response variable` is ___instead `qualitative`___. 

>For __example__, `eye color` is `qualitative`, taking on values _blue, brown, or green_. 

&bullet; Often `qualitative variables` are referred to as __`categorical`__ ; we will `use` these terms `interchangeably`. 

&bullet; In this chapter, we study `approaches` for `predicting qualitative responses`, a process that is `known` as __`classification`__. 

&bullet; `Predicting` a `qualitative response` for an `observation` can be `referred` to as ___`classifying`___ that observation, since it `involves assigning` the `observation to a category`, or `class`. 

&bullet; On the `other hand`, often the `methods used` for `classification` first `predict the probability` of each of the `categories` of a `qualitative variable`, as the `basis` for `making` the `classification`. 

&bullet; In this sense they also behave like `regression methods`.

&bullet; There are many `possible classification techniques`, or __classifiers__, that one might use to `predict a qualitative response`. 

&bullet; We touched on some of these in <a href="http://localhost:8888/notebooks/islr-book/Chapter%202/Chapter%202.ipynb#2.1.5-Regression-Versus-Classification-Problems">Sections 2.1.5</a> and <a href="http://localhost:8888/notebooks/islr-book/Chapter%202/Chapter%202.ipynb#2.2.3-The-Classification-Setting">2.2.3</a>. 

&bullet; In this chapter we `discuss three` of the most widely-`used` `classifiers`: 
1. __`logistic regression`__, 
2. __`linear discriminant analysis`__, and
3. __`K-nearest neighbors`__. 

The following methods are covered in later chapters:

1. generalized additive models, 
2. trees, 
3. random forests, 
4. boosting, and 
5. support vector machines.


## 4.1 An Overview of Classification

    Classification problems occur often, perhaps even more so than regression problems.
    
__For Example__
1. A person arrives at the emergency room with a `set of symptoms` that could `possibly be attributed` to `one of three` `medical conditions`.<br>__Which of the three conditions does the individual have?__

2. An `online banking service` must be able to determine whether or not a `transaction being performed` on the `site is fraudulent`, on the `basis of the user’s IP address`, `past transaction history`, and `so forth`.

3. On the basis of `DNA sequence data` for a `number of patients with and without a given disease`, a `biologist` would like to `figure out` which `DNA mutations are deleterious` (disease-causing) and `which are not`.

&bullet; Just as in the `regression setting`, in the `classification setting` we have a `set of training observations` $(x_{1} , y_{1} ), \dots , (x_{n} , y_{n} )$ that `we can use` to `build a classifier`. 

&bullet; We want `our classifier` to `perform well not only on` the `training data`, but also on `test observations` that were `not used` to `train the classifier`.

<a id="Figure4.1"></a>
![image.png](Figures/Figure4.1.png)
>__FIGURE 4.1__. The `Default data set`. 

>__Left:__ The `annual incomes` and `monthly credit card balances` of a `number of individuals`. 
<br>The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. 

>__Center:__ Boxplots of `balance` as a `function of default status`. 

>__Right:__ Boxplots of `income` as a `function of default status`.



## 4.2 Why Not Linear Regression?

We have stated that linear regression is not appropriate in the case of a `qualitative response`.

___Why Not?___

Suppose that we are `trying to predict` the `medical condition` of a `patient`
in the `emergency room on the basis` of `her symptoms`. 

In this simplified example, there are `three possible diagnoses`: __stroke , drug overdose , and
epileptic seizure__. 

We `could consider encoding these values` as a `quantitative response variable`, $Y$ , as follows:

![image.png](Figures/FormulaE1.png)

Using this `coding`, `least squares` could be `used to fit a linear regression model` to `predict` $Y$ on the `basis of a set of predictors` $X_{1} , \dots , X_{p}$ . 

Unfortunately, this `coding implies` an `ordering on the outcomes`, `putting drug overdose` in `between stroke and epileptic seizure` , and `insisting` that the `difference between stroke and drug overdose` is the `same as the difference between drug overdose and epileptic seizure`. 

In practice there is `no particular reason` that `this needs to be the case`. 

For instance, one `could choose` an `equally reasonable coding`,

![image.png](Figures/FormulaE2.png)

which would `imply` a `totally different relationship` among the `three conditions`. 

Each of these `codings` would `produce` fundamentally `different linear
models` that would `ultimately lead` to `different sets` of `predictions on test observations`.

If the response variable’s values did take on a natural ordering, such as
mild, moderate, and severe, and we felt the gap between mild and moderate
was similar to the gap between moderate and severe, then a 1, 2, 3 coding
would be reasonable. Unfortunately, in general there is no natural way to
convert a qualitative response variable with more than two levels into a
quantitative response that is ready for linear regression.


For a ___`binary (two level) qualitative response`___, the `situation is better`. 
For instance, perhaps there are `only two possibilities` for the `patient’s medical condition`: __stroke and drug overdose__. 

We could then `potentially use`
the `dummy variable approach` from <a href="http://localhost:8888/notebooks/islr-book/Chapter%203/Chapter%203.ipynb#3.3.1-Qualitative-Predictors">Section 3.3.1</a> to code the response as follows:

![image.png](Figures/FormulaE3.png)

We could then `fit a linear regression` to this `binary response`, and `predict drug overdose` if $\hat{y} > 0.5$ and `stroke` otherwise. 

In the `binary case` it is `not hard` to `show that` even `if we flip` the above `coding`, `linear regression` will produce the same `final predictions`.
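The claim about flipping the coding can be checked on a small example. The sketch below is only an illustration: the symptom scores and labels are made-up toy data, and `fit_line` is a hypothetical helper that implements one-predictor least squares.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical toy data: one symptom score per patient.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]            # 1 = drug overdose, 0 = stroke
y_flipped = [1 - yi for yi in y]         # 1 = stroke, 0 = drug overdose

b0, b1 = fit_line(x, y)
c0, c1 = fit_line(x, y_flipped)

# Original coding: predict drug overdose when the fitted value exceeds 0.5.
pred = ["overdose" if b0 + b1 * xi > 0.5 else "stroke" for xi in x]
# Flipped coding: predict stroke when the fitted value exceeds 0.5.
pred_flipped = ["stroke" if c0 + c1 * xi > 0.5 else "overdose" for xi in x]

print(pred == pred_flipped)  # the two codings give the same classifications
```

The fitted values under the two codings sum to one, so a value above 0.5 under one coding is below 0.5 under the other. On this toy data the smallest and largest fitted values also fall outside $[0, 1]$.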

For a `binary response` with a $ 0/1 $ `coding` as above, `regression` by `least squares does make sense`; it can be shown that the $X \hat{\beta}$ obtained using `linear
regression` is `in fact` an `estimate of` $Pr( drug\ overdose |X)$ in this `special
case`. 
However, if we use `linear regression`, some of `our estimates` might be `outside` the $[0, 1]$ `interval` (see <a href="#Figure4.2">Figure 4.2</a>), `making them hard` to `interpret as probabilities`! 

Nevertheless, the `predictions provide` an `ordering` and `can be interpreted` as `crude probability estimates`. 

Curiously, it `turns out` that the `classifications` that we `get` if `we use linear regression` to `predict a binary response` will be the same as for the ___`linear discriminant analysis (LDA)`___ procedure discussed in <a href="#4.4-Linear-Discriminant-Analysis">Section 4.4</a>.

However, the `dummy variable` approach `cannot` be `easily extended` to `accommodate qualitative responses` with `more than two levels`. 
For these `reasons`, it is `preferable` to `use` a `classification method` that is `truly suited`
for `qualitative response values`, such as the `ones presented next`.

## 4.3 Logistic Regression

Consider again the `Default data set`, where the `response default` falls into `one of two categories`, __`Yes or No`__ . 

Rather than `modeling` this `response` $Y$ `directly`, `logistic regression models` the `probability` that $Y$ belongs to a `particular category`.

<a id="Figure4.2"></a>
![image.png](Figures/Figure4.2.png)
>__FIGURE 4.2:__ `Classification` using the `Default data`.

>__Left:__ `Estimated probability` of `default` using `linear regression`. 
<br>Some `estimated probabilities` are `negative`! 
<br>The `orange ticks indicate` the __0/1 values__ `coded for default` (No or Yes).

>__Right:__ `Predicted probabilities` of `default` using `logistic regression`. 
<br>All `probabilities lie` between __0 and 1__.

For the Default data, logistic regression models the probability of default. For example, the probability of default given balance can be written as

<font size=5><center>$Pr(default = Yes|balance ).$</center></font>

The values of $Pr(default = Yes|balance)$, which we `abbreviate` $p(balance)$, will `range between` __0 and 1__. 
 
Then for `any given value of balance`, a `prediction` can be made for `default`. 

For __example__, `one might predict` __default = Yes__ for `any individual` for `whom` $p(balance) > 0.5$.

Alternatively, if a `company wishes` to be `conservative in predicting individuals` who
are at `risk for default`, then `they may choose` to `use a lower threshold`, such
as $p(balance) > 0.1$.
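The effect of the threshold can be sketched in a few lines. The function name `classify_default` and the three probabilities below are hypothetical, chosen only to show how a lower threshold flags more individuals.

```python
def classify_default(p_default, threshold=0.5):
    """Return "Yes" if the estimated default probability exceeds the threshold."""
    return "Yes" if p_default > threshold else "No"

# Hypothetical estimated probabilities p(balance) for three individuals.
estimated = [0.03, 0.25, 0.80]

standard = [classify_default(p) for p in estimated]                     # threshold 0.5
conservative = [classify_default(p, threshold=0.1) for p in estimated]  # threshold 0.1

print(standard)       # ['No', 'No', 'Yes']
print(conservative)   # ['No', 'Yes', 'Yes']
```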



### 4.3.1 The Logistic Model

__How should we model the `relationship between` $p(X) = Pr(Y = 1|X)$ and
$X$?__ 
<br>(For convenience we are using the generic 0/1 coding for the response).

In <a href="#4.2-Why-Not-Linear-Regression?">Section 4.2</a> we `talked` of `using` a `linear regression model` to represent `these probabilities`:

<a id="Formula4.1"></a>
<font size=5><center>$p(X) = \beta_{0} + \beta_{1}X$</center></font>

If we use `this approach` to `predict` __default=Yes__ using `balance`, then we `obtain the model` shown in the `left-hand` panel of <a href="#Figure4.2">Figure 4.2</a>. 

Here `we see` the `problem with this approach`: for `balances close to zero` we `predict` a __`negative probability of default`__; <br>if we were to `predict for very large balances`, we would `get values` __`bigger than 1`__. 

These `predictions` are `not sensible`, since `of course` the `true probability of default`, `regardless` of `credit card balance`, must `fall` between __0 and 1__. 

This `problem` is `not unique` to the `credit default data`. 

Any time a `straight line` is `fit to a binary response` that is `coded as 0 or 1`, in `principle` we can `always predict` $p(X) < 0$ for `some values` of $X$ and $p(X) > 1$ for `others` (unless the range of $X$ is `limited`).

__To avoid this problem__, we `must model` $p(X)$ `using a function` that gives
`outputs between 0 and 1` for `all values` of $X$. 

Many `functions meet` this `description`. 

In __`logistic regression`__, we use the __`logistic function`__,

<a id="Formula4.2"></a>
<font size=5><center> $ p(X) = \frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}} $ </center></font>


To `fit` the `model` <a href="#Formula4.2">(4.2)</a>, we `use a method` called __`maximum likelihood`__, which
we discuss in the next section. 

The _right-hand panel_ of <a href="#Figure4.2">Figure 4.2</a> illustrates the `fit` of the `logistic regression` model to the `Default data`. 

Notice that for `low balances` we now `predict` the `probability` of `default` as `close to`, but never `below`, __zero__. 

Likewise, for `high balances` we `predict` a `default probability` close to, but `never above`, __one__. 

The ___`logistic function`___ will always produce an `S-shaped curve` of this `form`, and so `regardless of the value` of $X$, we will obtain a `sensible prediction`. 

We also see that the `logistic model` is `better` able `to capture the range` of `probabilities` than is the `linear regression model` in the _left-hand plot_. 

The `average fitted probability` in `both cases` is $0.0333$ (averaged over the training data), which is the `same as the overall proportion` of `defaulters` in the `data set`.
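To see the difference in behavior numerically, here is a minimal sketch that evaluates a straight line and the logistic function on a grid of balances. The coefficients are hypothetical and picked only to illustrate the shapes; they are not fitted to the Default data.

```python
import math

def logistic(x, beta0, beta1):
    """Logistic function (4.2), written in a numerically stable form."""
    z = beta0 + beta1 * x
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear(x, beta0, beta1):
    """Straight-line model (4.1), for comparison."""
    return beta0 + beta1 * x

# Hypothetical coefficients, chosen only to illustrate the two shapes.
balances = [0, 500, 1000, 1500, 2000, 2500, 3000]
linear_values = [linear(b, -0.08, 0.0004) for b in balances]
logistic_values = [logistic(b, -10.0, 0.005) for b in balances]

for b, lin, logi in zip(balances, linear_values, logistic_values):
    print(b, round(lin, 3), round(logi, 4))
```

With these illustrative numbers the straight line is negative at a balance of 0 and exceeds 1 at a balance of 3000, while every logistic value stays strictly between 0 and 1.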

After a `bit of manipulation` of <a href="#Formula4.2">(4.2)</a>, we find that

<a id="Formula4.3"></a>
<font size=5><center> $ \frac{p(X)}{1 - p(X)} = e^{\beta_{0}+\beta_{1}X}$ </center></font>

The `quantity` $p(X)/[1 - p(X)]$ is called the __`odds`__, and can `take on any value` between $0$ and $\infty$. 

`Values` of the `odds close` to $0$ and $\infty$ `indicate very low` and `very high probabilities` of `default`, respectively. 

__For example__, on `average` 1 in 5 people with `odds` of $\frac{1}{4}$ will `default`, since $p(X) = 0.2$ `implies` odds of $\frac{0.2}{1-0.2} = \frac{1}{4}$. 

Likewise, on `average nine out of every ten people` with `odds of` 9 will default, since $p(X) = 0.9$ implies `odds` of $\frac{0.9}{1-0.9} = 9$.

`Odds` are `traditionally used instead of probabilities` in `horse-racing`, since they `relate more naturally` to the `correct betting strategy`.

By taking the `logarithm of both sides` of <a href="#Formula4.3">(4.3)</a>, we arrive at

<a id="Formula4.4"></a>
<font size=5><center> $ log \Big(\frac{p(X)}{1 - p(X)}\Big) = \beta_{0}+\beta_{1}X$ </center></font>

The _left-hand side_ is called the __`log-odds or logit`__. 

We see that the __`logistic regression model`__ <a href="#Formula4.2">(4.2)</a> has a `logit` that is `linear` in $X$.
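The conversions between probability, odds, and log-odds are short enough to write out. The helper names `odds`, `logit`, and `inverse_logit` below are hypothetical; the sketch reproduces the two worked examples ($p = 0.2$ and $p = 0.9$).

```python
import math

def odds(p):
    """Odds p / (1 - p) for a probability 0 < p < 1."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds, the left-hand side of (4.4)."""
    return math.log(odds(p))

def inverse_logit(z):
    """Map a log-odds value back to a probability, as in (4.2)."""
    return 1.0 / (1.0 + math.exp(-z))

print(odds(0.2))                  # approximately 0.25, i.e. 1/4
print(odds(0.9))                  # approximately 9
print(inverse_logit(logit(0.3)))  # approximately 0.3 (round trip)
```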

__[Read More On Page No 132&133]__

### 4.3.2 Estimating the Regression Coefficients

The `coefficients` $\beta_{0}$ and $\beta_{1}$ in <a href="#Formula4.2">(4.2)</a> are __`unknown`__, and `must be estimated` based on the `available training data`. 

In <a href="http://localhost:8888/notebooks/islr-book/Chapter%203/Chapter%203.ipynb">Chapter 3</a>, we used the __`least squares approach`__ to `estimate the unknown linear regression coefficients`. 

Although we could use (non-linear) `least squares` to `fit the model` <a href="#Formula4.4">(4.4)</a>, the `more
general method` of ___`maximum likelihood`___ is `preferred`, since it has `better statistical properties`. 

The `basic intuition` behind using `maximum likelihood` to `fit a logistic regression model` is as follows: 
we seek estimates for $\beta_{0}$ and $\beta_{1}$ such that the `predicted probability` $\hat{p}(x_{i})$ of `default` for `each individual`, using <a href="#Formula4.2">(4.2)</a>, corresponds as `closely as possible` to the `individual’s observed default status`. 

In __other words__, we `try to find` $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ such that `plugging these estimates into the model` for $p(X)$, given in <a href="#Formula4.2">(4.2)</a>, yields a `number close to one` for `all individuals who defaulted`, and a `number close to zero`
for `all individuals who did not`. 

This `intuition` can be `formalized` using a
`mathematical equation` called a ___`likelihood function`___:

<a id="Formula4.5"></a>
![image.png](Figures/Formula4.5.png)

The estimates $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ are `chosen` to `maximize this likelihood function`.


Maximum likelihood is a `very general approach` that is `used to fit many of the non-linear models` that we `examine throughout` this book. 

In the `linear regression setting`, the `least squares approach` is in `fact a special case` of __`maximum likelihood`__. 

The `mathematical details` of `maximum likelihood` are beyond the `scope of this book`. 

However, in general, `logistic regression` and `other models` can be `easily fit using a statistical software package` such as Python or R, and so we `do not need` to `concern ourselves` with the `details` of the `maximum likelihood fitting procedure`.
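For readers who want to see what such software does in the simplest case, here is a rough sketch that maximizes the log of the likelihood for one predictor with Newton's method. It is not how any particular package is implemented; the data are made-up toy values and the names `sigmoid`, `log_likelihood`, and `fit_logistic` are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(beta0, beta1, x, y):
    """Log of the likelihood function for 0/1 responses."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(beta0 + beta1 * xi)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total

def fit_logistic(x, y, n_iter=25):
    """Newton's method on the log-likelihood; returns (beta0_hat, beta1_hat)."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(beta0 + beta1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p              # gradient with respect to beta0
            g1 += (yi - p) * xi       # gradient with respect to beta1
            h00 += w                  # entries of the negative Hessian
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        beta0 += (h11 * g0 - h01 * g1) / det
        beta1 += (h00 * g1 - h01 * g0) / det
    return beta0, beta1

# Hypothetical toy data (not perfectly separable, so the estimates are finite).
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]

beta0_hat, beta1_hat = fit_logistic(x, y)
p_hat = [sigmoid(beta0_hat + beta1_hat * xi) for xi in x]
print(beta0_hat, beta1_hat)
print(sum(p_hat) / len(p_hat), sum(y) / len(y))
```

The last line illustrates a property of the maximum likelihood fit with an intercept: the average fitted probability matches the overall proportion of ones in the training data.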

<a href="#Table4.1">Table 4.1</a> shows the `coefficient estimates` and `related information` that `result from fitting` a `logistic regression model` on the `Default data` in `order to predict` the `probability` of __`default = Yes` using `balance`.__ 

We see that $\hat{\beta_{1}} = 0.0055$; this `indicates` that `an increase in balance is associated with` an `increase in the probability of default` . 

To be `precise`, a `one-unit increase` in `balance` is `associated with` an `increase in the log odds` of `default` by 0.0055 units.

<a id="Table4.1"></a>

|     &nbsp;| Coefficient | Std. error | Z-Statistic | P-Value  |
| --------- | ----------- | ---------- | ----------- | -------- |
| Intercept | -10.6513    | 0.3612     | -29.5       | < 0.0001 |
| Balance   | 0.0055      | 0.0002     | 24.9        | < 0.0001 |

>__TABLE 4.1.__ For the `Default data`, estimated `coefficients` of the `logistic regression model` that `predicts the probability of default using balance`. 
<br>A `one-unit increase` in `balance` is `associated with` an `increase in the log odds` of `default` by
0.0055 `units`.

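Many aspects of the logistic regression output in <a href="#Table4.1">Table 4.1</a> are similar to the linear regression output of Chapter 3.
1. __Std. error__ - We can measure the accuracy of the coefficient estimates by computing their standard errors.
2. __Z-statistic__ - Plays the same role as the t-statistic in the linear regression output. For instance, the z-statistic associated with $\beta_{1}$ is equal to $\hat{\beta}_{1}/SE(\hat{\beta}_{1})$, so a large absolute value is evidence against the null hypothesis $H_{0}: \beta_{1} = 0$. This null hypothesis implies that $p(X) = \frac{e^{\beta_{0}}}{1+e^{\beta_{0}}}$; in other words, that the probability of default does not depend on balance.
3. __P-value__ - The p-value associated with balance is tiny, so we can reject $H_{0}$ and conclude that there is an association between balance and the probability of default.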
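As a rough check of how a p-value follows from a z-statistic, the sketch below computes the two-sided tail probability of a standard normal variable. It assumes the usual large-sample normal approximation; the function name `two_sided_p_value` is hypothetical.

```python
import math

def two_sided_p_value(z):
    """P(|Z| > |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# z-statistics as reported in Table 4.1.
p_intercept = two_sided_p_value(-29.5)
p_balance = two_sided_p_value(24.9)

print(p_intercept < 0.0001, p_balance < 0.0001)  # True True
print(two_sided_p_value(1.96))                   # close to 0.05
```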

### 4.3.3 Making Predictions

Once the `coefficients have been estimated`, it is a `simple matter to compute` the `probability of default` for `any given credit card balance`. 

__For example__, using the `coefficient estimates` given in <a href="#Table4.1">Table 4.1</a>, we `predict` that the `default probability` for an `individual with a balance of` &dollar;1,000 is

<font size=5>
    <center>$ \hat{p}(X) = \frac{e^{\hat{\beta}_{0}+\hat{\beta}_{1}X}}{1 + e^{\hat{\beta}_{0}+\hat{\beta}_{1}X}} = \frac{e^{-10.6513 + 0.0055 \times 1{,}000}}{1 + e^{-10.6513 + 0.0055 \times 1{,}000}} = 0.00576,$
    </center>
</font>

which is `below` $1\%$. 

In contrast, the `predicted probability of default` for `an individual with a balance of` &dollar;2,000 is `much higher`, and `equals` 0.586, or 58.6%.
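These two predictions can be reproduced directly from the coefficient estimates in Table 4.1. The constant and function names below are hypothetical; the numbers are the ones from the table.

```python
import math

BETA0_HAT = -10.6513   # intercept estimate from Table 4.1
BETA1_HAT = 0.0055     # balance coefficient estimate from Table 4.1

def predict_default_probability(balance):
    """Estimated Pr(default = Yes | balance) from the fitted logistic model (4.2)."""
    z = BETA0_HAT + BETA1_HAT * balance
    return math.exp(z) / (1.0 + math.exp(z))

p_1000 = predict_default_probability(1000)
p_2000 = predict_default_probability(2000)

print(round(p_1000, 5))  # 0.00576
print(round(p_2000, 3))  # 0.586
```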


One can use `qualitative predictors` with the `logistic regression model` using the `dummy variable approach` from <a href="http://localhost:8888/notebooks/islr-book/Chapter%203/Chapter%203.ipynb#3.3.1-Qualitative-Predictors">Section 3.3.1</a>. 

As an __example__, the `Default data set contains` the `qualitative variable student`. 

To `fit the model` we simply `create a dummy variable` that `takes on a value` of __1 for `students` and 0 for `non-students`__.

The `logistic regression model` that results
from `predicting probability` of default from `student status` can be `seen` in <a href="#Table4.2">Table 4.2.</a>

The `coefficient associated` with the `dummy variable` is `positive`, and the associated __`p-value`__ is `statistically significant`. 

<a id="Table4.2"></a>

| &nbsp;        | Coefficient | Std. error | Z-Statistic | P-Value  |
| ------------- | ----------- | ---------- | ----------- | -------- |
| Intercept     | -3.5041     | 0.0707     | -49.55      | < 0.0001 |
| Student [Yes] | 0.4049      | 0.1150     | 3.52        | 0.0004   |

>__TABLE 4.2.__ For the `Default data`, estimated `coefficients of the logistic regression model` that `predicts the probability` of `default` using `student status`. 
<br>Student status is `encoded as a dummy variable`, with a `value of 1` for a `student` and a `value of 0` for a `non-student`, and `represented` by the `variable student[Yes]` in the `table`.

This `indicates` that `students tend to have higher default probabilities` than `non-students`:

<font size=5><center> $ \hat{P_{r}} ( default = Yes | student = Yes) = \frac{e^{− 3.5041 + 0.4049 * 1}}{1 + e^{− 3.5041 + 0.4049 * 1}} =0.0431 $</center></font>

<font size=5><center> $ \hat{P_{r}} ( default = Yes | student = No) = \frac{e^{− 3.5041 + 0.4049 * 0}}{1 + e^{− 3.5041 + 0.4049 * 0}} =0.0292 $</center></font>

### 4.3.4 Multiple Logistic Regression

We now consider the `problem of predicting` a `binary response` using `multiple predictors`. 

By `analogy` with the `extension from` __simple to multiple linear regression__ in <a href="http://localhost:8888/notebooks/islr-book/Chapter%203/Chapter%203.ipynb#3-Linear-Regression">Chapter 3</a>, we can `generalize` <a href="#Formula4.4">(4.4)</a> as follows:

<a id="Formula4.6"></a>
<font size=5><center> $ log \Big(\frac{p(X)}{1 - p(X)}\Big) = \beta_{0}+\beta_{1}X_{1}+ \dots + \beta_{p}X_{p}$ </center></font>
>where $X = (X_{1} , \dots , X_{p})$ are $p$ `predictors`. 

<a href="#Formula4.6">Equation 4.6</a> can be `rewritten as`
<a id="Formula4.7"></a>
<font size=5><center> $ p(X) = \frac{e^{\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}}}{1 +e^{\beta_{0}+\beta_{1}X_{1}+\dots+\beta_{p}X_{p}}}$ </center></font>

Just as in <a href="#4.3.2-Estimating-the-Regression-Coefficients">Section 4.3.2</a>, we `use` the `maximum likelihood method` to estimate $ \beta_{0},\beta_{1},\dots,\beta_{p}$.

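<a id="Table4.3"></a>

| &nbsp;        | Coefficient | Std. error | Z-Statistic | P-Value  |
| ------------- | ----------- | ---------- | ----------- | -------- |
| Intercept     | -10.8690    | 0.4923     | -22.08      | < 0.0001 |
| balance       | 0.0057      | 0.0002     | 24.74       | < 0.0001 |
| income        | 0.0030      | 0.0082     | 0.37        | 0.7115   |
| Student [Yes] | -0.6468     | 0.2362     | -2.74       | 0.0062   |

>__TABLE 4.3.__ For the `Default data`, estimated `coefficients of the logistic regression model` that `predicts the probability` of `default` using `balance`, `income`, and `student status`. 
<br>In fitting this model, `income` was measured in thousands of dollars.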

This `simple example illustrates` the `dangers` and `subtleties` associated with `performing regressions involving only a single predictor` when other `predictors may also be relevant`. 

As in the `linear regression setting`, the `results obtained using` `one predictor may be quite different` from those `obtained using multiple predictors`, especially `when there is correlation among the predictors`. 


In general, the `phenomenon` seen in <a href="#Figure4.3">Figure 4.3</a> is `known as` ___`confounding`___.

<a id="Figure4.3"></a>
![image.png](Figures/Figure4.3.png)


By `substituting estimates` for the `regression coefficients` from <a href="#Table4.3">Table 4.3</a>
into <a href="#Formula4.7">(4.7)</a>, we can `make predictions`. 

__For example__, a `student with a credit
card balance` of &dollar;1,500 and `an income` of &dollar;40,000 has an `estimated probability` of default of

<a id="Formula4.8"></a>
<font size=5><center> $ \hat{p}(X) = \frac{e^{-10.869+0.00574*1500+ 0.003*40 - 0.6468*1}}{1 +e^{-10.869+0.00574*1500+ 0.003*40 - 0.6468*1}} = 0.058$ </center></font>

A `non-student with the same balance` and `income` has an `estimated probability` of default of

<a id="Formula4.9"></a>
<font size=5><center> $ \hat{p}(X) = \frac{e^{-10.869+0.00574*1500+ 0.003*40 - 0.6468*0}}{1 +e^{-10.869+0.00574*1500+ 0.003*40 - 0.6468*0}} = 0.105$ </center></font>

(Here we multiply the income coefficient estimate from Table 4.3 by 40,
rather than by 40,000, because in that table the model was fit with income
measured in units of &dollar;1,000.)
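The two predictions above can be reproduced with a few lines of code. The dictionary and function names are hypothetical; the coefficient values are the ones used in the formulas above, with income expressed in thousands of dollars and student status entered as a 0/1 dummy variable.

```python
import math

# Coefficient estimates as used in the worked formulas above.
COEF = {"intercept": -10.869, "balance": 0.00574, "income": 0.003, "student": -0.6468}

def predict_default_probability(balance, income_thousands, is_student):
    """Estimated probability of default from the multiple logistic model (4.7)."""
    z = (COEF["intercept"]
         + COEF["balance"] * balance
         + COEF["income"] * income_thousands
         + COEF["student"] * (1 if is_student else 0))
    return 1.0 / (1.0 + math.exp(-z))

p_student = predict_default_probability(1500, 40, True)
p_non_student = predict_default_probability(1500, 40, False)

print(round(p_student, 3))      # 0.058
print(round(p_non_student, 3))  # 0.105
```

Note that `1 / (1 + exp(-z))` is algebraically the same as `exp(z) / (1 + exp(z))`.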

### 4.3.5 Logistic Regression for >2 Response Classes

&bullet; We sometimes wish to classify a response variable that has more than two classes. 

__For example__, in <a href="#4.2-Why-Not-Linear-Regression?">Section 4.2</a> we had `three categories` of `medical condition in the emergency room`: __stroke , drug overdose , epileptic seizure__.

In this `setting`, we `wish to model` both
<br>$Pr(Y = stroke |X)$ and 
<br>$Pr(Y = drug\ overdose |X)$, with `the remaining` 
<br>$Pr(Y = epileptic\ seizure |X) = 1 − Pr(Y = stroke |X) − Pr(Y = drug\ overdose |X)$. 

The `two-class` `logistic regression models` discussed in the `previous sections` have `multiple-class
extensions`, but in `practice` they `tend not to be used` all that `often`. 

One of the `reasons` is that `the method we discuss` in the `next section`, `discriminant analysis`, is popular for `multiple-class classification`. 

So we do not go into the details of `multiple-class logistic regression` here, but `simply note that` such `an approach is possible`, and that software for it is `available` in Python or R.

## 4.4 Linear Discriminant Analysis

__`Logistic regression`__ involves `directly modeling` $Pr(Y = k|X = x)$ using the `logistic function`, given by <a href="#Formula4.7">(4.7)</a> for the `case of two response classes`. 

In `statistical jargon`, we `model` the `conditional distribution` of `the response` $Y$ , `given the predictor(s)` $X$. 

We `now consider` an `alternative` and `less direct approach` to `estimating these probabilities`. 

In this `alternative approach`, we `model` the `distribution of the predictors` $X$ `separately` in `each of the
response classes` (i.e. given $Y$ ), and then `use` __Bayes’ theorem__ to `flip these around` into `estimates for` $Pr(Y = k|X = x)$. 

When these `distributions` are `assumed to be normal`, it `turns out that` the `model is very similar` in `form
to logistic regression`.

__Why do we need another method, when we have logistic regression?__

There are several reasons:
- When the `classes` are `well-separated`, the `parameter estimates` for the `logistic regression model` are `surprisingly unstable`.  <br>___`Linear discriminant analysis`___ does `not suffer from this problem`.


- If $n$ is `small` and the `distribution of the predictors` $X$ is `approximately normal` in `each of the classes`, the __`linear discriminant model`__ is `again more stable` than the `logistic regression model`.


- As mentioned in <a href="#4.3.5-Logistic-Regression-for-%3E2-Response-Classes">Section 4.3.5</a>, `linear discriminant analysis` is `popular when` we `have more than two response classes`.

### 4.4.1 Using Bayes’ Theorem for Classification

`Suppose` that we `wish to classify` an `observation` into `one of` $K$ `classes`, 
>where $K \ge 2$. 

In `other words`, the `qualitative response variable` $Y$ can `take on` $K$ `possible distinct` and `unordered values`. 

Let $\pi_{k}$ represent the `overall` or `prior probability` that a `randomly chosen observation comes from the` $kth$ `class`; this is the `probability` that a given `observation` is `associated with` the $kth$ `category of the response variable` $Y$ . 

Let $f_{k}( x ) \equiv Pr(X = x|Y = k)$ `denote` the ___`density function`___ of $X$ for an `observation` that `comes from` the $kth$ class.

In `other words`, $f_{k}(x)$ is `relatively large` __if there is a high probability__ that an `observation in the` $kth$ `class` has $X \approx x$, and $f_{k}(x)$ is `small` if it is `very unlikely` that an `observation` in the $kth$ `class` has $X \approx x$. 

Then ___`Bayes’ theorem`___ states that

<a id="Formula4.10"></a>
<font size=5><center> $ Pr(Y = k|X = x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^{K}\pi_{l}f_{l}(x)} $ </center></font>

In accordance with our `earlier notation`, we will `use` the `abbreviation` $p_{k}(X) = Pr(Y = k|X)$. 

This `suggests` that `instead of directly computing` $p_{k}(X)$ as in <a href="#4.3.1-The-Logistic-Model">Section 4.3.1</a>, we can `simply plug in estimates` of $\pi_{k}$ and $f_{k}(X)$ into <a href="#Formula4.10">(4.10)</a>. 

In general, `estimating` $\pi_{k}$ is `easy if we have a random sample of` $Y$s `from the population`: __we simply compute the fraction of the training observations that belong to the kth class.__

However, `estimating` $f_{k}(X)$ tends to be `more challenging`, `unless` we `assume some simple forms` for `these
densities`. 

We `refer to` $p_{k}(x)$ as the ___`posterior`___ __probability__ that an observation $X = x$ `belongs` to the $kth$ `class`. 

That is, it is the `probability that` the `observation belongs` to the $kth$ class, given the `predictor value` for `that observation`.


We know from <a href="http://localhost:8888/notebooks/islr-book/Chapter%202/Chapter%202.ipynb">Chapter 2</a> that the `Bayes classifier`, which `classifies` an `observation` to `the class` for which $p_{k}(X)$ is `largest`, has the `lowest possible error rate` out of `all classifiers`. 

(This is of course only true if the terms in (4.10) are all correctly specified.) 
Therefore, if we `can find a way` to `estimate` $f_{k}(X)$, then `we can develop a classifier` that `approximates the Bayes classifier`. 
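Bayes' theorem in the form (4.10) is a one-line computation once the priors and density values are available. The sketch below uses hypothetical priors and density values for three classes at a single point $x$; `posterior` is an illustrative helper name.

```python
def posterior(priors, densities):
    """Posterior probabilities p_k(x) from (4.10), given pi_k and f_k(x) for each class."""
    joint = [pi_k * f_k for pi_k, f_k in zip(priors, densities)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical values for K = 3 classes at one particular x.
priors = [0.5, 0.3, 0.2]       # pi_1, pi_2, pi_3
densities = [0.1, 0.4, 0.2]    # f_1(x), f_2(x), f_3(x)

p = posterior(priors, densities)
predicted_class = max(range(len(p)), key=lambda k: p[k]) + 1  # classes numbered from 1

print([round(v, 3) for v in p])  # [0.238, 0.571, 0.19]
print(predicted_class)           # 2
```

Even though class 1 has the largest prior, the observation is assigned to class 2 because its density at $x$ is much higher.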

### 4.4.2 Linear Discriminant Analysis for p = 1

For now, __assume that p = 1__—that is, we have only `one predictor`. 

We would `like to obtain` an `estimate for` $f_{k}(x)$ that `we can plug` into <a href="#Formula4.10">(4.10)</a> in `order to estimate` $p_{k}(x)$. 

We will then `classify an observation` to `the class` for which $p_{k}(x)$ is `greatest`. 

In order to `estimate` $f_{k}(x)$, we will first `make some assumptions` about `its form`.


Suppose we `assume that` $f_{k}(x)$ is ___`normal` or `Gaussian`___. 

In the `one-dimensional setting`, the `normal density takes` the `form`

<a id="Formula4.11"></a>
<font size=5><center> $ f_{k}(x) =  \frac{1}{\sqrt{2\pi}\sigma_{k}} \exp \Big( - \frac{1}{2\sigma_{k}^2} (x - \mu_{k})^2 \Big) $ </center></font>
>where $\mu_{k}$ and $\sigma_{k}^2$ are the `mean and variance parameters` for the $kth$ class.

For now, let us `further assume that` $\sigma_{1}^2 = \dots = \sigma_{K}^2$: that is, __there is a shared variance term across all K classes, which for simplicity we can denote by $\sigma^2$.__ 

Plugging <a href="#Formula4.11">(4.11)</a> into <a href="#Formula4.10">(4.10)</a>, we find that

<a id="Formula4.12"></a>
<font size=5><center> $ p_{k}(x) = \frac{\pi_{k}  \frac{1}{\sqrt{2\pi}\sigma} \exp \Big( - \frac{1}{2\sigma^2} (x - \mu_{k})^2 \Big)}{ \sum_{l=1}^{K} \pi_{l}  \frac{1}{\sqrt{2\pi}\sigma} \exp \Big( - \frac{1}{2\sigma^2} (x - \mu_{l})^2 \Big) }$</center></font>

__(Note that in (4.12), $\pi_{k}$ denotes the `prior probability` that an `observation belongs to` the $kth$ class, not to be confused with $\pi \approx 3.14159$, the `mathematical constant`.)__

<a id="Figure4.4"></a>
![image.png](Figures/Figure4.4.png)
>__FIGURE 4.4.__ 
<br>__Left:__ Two `one-dimensional` normal `density functions` are shown.
<br>The `dashed vertical line` represents the `Bayes decision boundary`. 

>__Right:__ 20 `observations were drawn` from `each of the two classes`, and are shown as `histograms`.
<br>The `Bayes decision boundary` is again shown as a `dashed vertical line`. 
<br>The `solid vertical line` represents the `LDA decision boundary estimated` from the `training data`.

The `Bayes classifier` involves assigning an observation $X = x$ to the `class` for which <a href="#Formula4.12">(4.12)</a> is `largest`. 

`Taking the log` of <a href="#Formula4.12">(4.12)</a> and `rearranging` the terms, it is `not hard to show` that this is `equivalent to
assigning the observation` to `the class` for which 

<a id="Formula4.13"></a>
<font size=5><center> $ \delta_{k}(x) = x * \frac{\mu_{k}}{\sigma^2} - \frac{\mu_{k}^2}{2\sigma^2} + log(\pi_{k}) $</center></font>

is `largest`. 

For `instance`, if $K = 2$ and $\pi_{1} = \pi_{2}$, then the `Bayes classifier assigns an observation to class` __1__ if $2x (\mu_{1} - \mu_{2}) \gt \mu_{1}^2 - \mu_{2}^2$, and to `class` __2__ otherwise. 

In this case, the __`Bayes decision boundary`__ corresponds to the `point` where

<a id="Formula4.14"></a>
<font size=5><center> $  x = \frac{ \mu_{1}^2 - \mu_{2}^2 }{ 2(\mu_{1} - \mu_{2})} = \frac{\mu_{1} + \mu_{2}}{2} $</center></font>

An `example` is shown in the `left-hand panel` of <a href="#Figure4.4">Figure 4.4</a>. 

The two `normal density functions` that are displayed, $f_{1}(x)$ and $f_{2}(x)$, represent `two distinct classes`. 

The __`mean and variance`__ parameters for the `two density functions` are $\mu_{1} = −1.25, \mu_{2} = 1.25,$ and $\sigma_{1}^2 = \sigma_{2}^2 = 1$. 

The `two densities overlap`, and so given that $X = x$, there is `some uncertainty` about the `class` to which the `observation belongs`. 

If we `assume` that an `observation is equally likely` to come `from either class` (that is, $\pi_{1} = \pi_{2} = 0.5$), then by `inspection` of <a href="#Formula4.14">(4.14)</a>, we see that the `Bayes classifier` assigns the `observation` to `class 1` __if $x \lt 0$ and `class 2` otherwise__. 

Note that in this `case`, we can `compute` the `Bayes classifier` because `we know that` $X$ is `drawn` from a `Gaussian distribution` within `each class`, and `we know all of the parameters involved`. 
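Because every parameter is known in this example, the Bayes classifier can be written out explicitly. The sketch below uses the stated values ($\mu_{1} = -1.25$, $\mu_{2} = 1.25$, shared variance 1, equal priors); the function and constant names are hypothetical.

```python
import math

MU = {1: -1.25, 2: 1.25}     # class means from the example
SIGMA2 = 1.0                  # shared variance
PRIOR = {1: 0.5, 2: 0.5}      # equal prior probabilities

def normal_density(x, mu, sigma2):
    """One-dimensional normal density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def posterior(x):
    """Posterior probabilities p_k(x) from Bayes' theorem (4.10)."""
    joint = {k: PRIOR[k] * normal_density(x, MU[k], SIGMA2) for k in MU}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}

def discriminant(x, k):
    """Discriminant function delta_k(x) from (4.13)."""
    return x * MU[k] / SIGMA2 - MU[k] ** 2 / (2.0 * SIGMA2) + math.log(PRIOR[k])

def bayes_classify(x):
    """Assign x to the class with the largest discriminant (ties are not handled specially)."""
    return max(MU, key=lambda k: discriminant(x, k))

boundary = (MU[1] + MU[2]) / 2.0   # equation (4.14)

print(boundary)              # 0.0
print(bayes_classify(-0.5))  # 1
print(bayes_classify(0.5))   # 2
```

Choosing the class with the largest discriminant gives the same answer as choosing the class with the largest posterior probability, which is the equivalence stated before (4.13).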

In a `real-life situation`, we `are not able` to `calculate the Bayes classifier`. 

In `practice`, even if we are `quite certain` of our `assumption that` $X$ is `drawn from a Gaussian distribution` within `each class`, we `still have to estimate` the `parameters` $\mu_{1},\mu_{2}, \dots , \mu_{K}, \pi_{1}, \dots, \pi_{K} $, and $\sigma^{2}$. 

The ___`linear discriminant analysis (LDA)`___ `method` approximates the `Bayes classifier` by `plugging estimates` for $\pi_{k}, \mu_{k}$, and $\sigma^{2}$ into <a href="#Formula4.13">(4.13)</a>. 

In particular, the following estimates are used:

<a id="Formula4.15"></a>
<font size=5><center> $ \hat{\mu}_{k} =  \frac{1}{n_{k}} \sum_{i:y_{i} = k}x_{i} $ </center></font>

<font size=5><center> $ \hat{\sigma}^2 =  \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:y_{i} = k}(x_{i} - \hat{\mu}_{k})^2 $ </center></font>
>where 
<br>$n$ is the `total number of training observations`, and 
<br>$n_{k}$ is the `number of training observations in the kth class`.
<br>The estimate for $\mu_{k}$ is `simply the average of all the training observations from the kth class`, while $\hat{\sigma}^2$ can be seen as a `weighted average of the sample variances for each of the K classes`. 

Sometimes we `have knowledge of the class membership probabilities` $\pi_{1},\dots,\pi_{K}$, which can be `used directly`. 

In the `absence of any additional information`, __`LDA estimates`__ $\pi_{k}$ using the `proportion of the training observations` that `belong to the` $k$th `class`. 

In other words,

<a id="Formula4.16"></a>
<font size=5><center> $ \hat{\pi}_{k} = n_{k}/n.$</center></font>
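
The estimates in <a href="#Formula4.15">(4.15)</a> and <a href="#Formula4.16">(4.16)</a> can be sketched in a few lines of standard-library Python. The function name `lda_estimates`, the random seed, and the simulated sample (20 draws per class, mimicking the setting of this example) are assumptions made for illustration only.

```python
import random

def lda_estimates(xs, ys):
    """Estimate class means, pooled variance, and priors for one-predictor LDA.

    xs: list of predictor values; ys: list of class labels of the same length.
    Returns (mu_hat, sigma2_hat, pi_hat), with dicts keyed by class label.
    """
    classes = sorted(set(ys))
    n, K = len(xs), len(classes)
    mu_hat, pi_hat = {}, {}
    within_class_ss = 0.0  # sum of squared deviations from each class mean
    for k in classes:
        xk = [x for x, y in zip(xs, ys) if y == k]
        mu_hat[k] = sum(xk) / len(xk)      # (4.15), class mean
        pi_hat[k] = len(xk) / n            # (4.16), class proportion
        within_class_ss += sum((x - mu_hat[k]) ** 2 for x in xk)
    sigma2_hat = within_class_ss / (n - K)  # (4.15), pooled variance
    return mu_hat, sigma2_hat, pi_hat

# Simulated training data: 20 observations per class (seed chosen arbitrarily).
rng = random.Random(0)
xs = [rng.gauss(-1.25, 1.0) for _ in range(20)] + [rng.gauss(1.25, 1.0) for _ in range(20)]
ys = [1] * 20 + [2] * 20

mu_hat, sigma2_hat, pi_hat = lda_estimates(xs, ys)
print("mu_hat:", mu_hat)
print("sigma2_hat:", sigma2_hat)
print("pi_hat:", pi_hat)
```

Dividing by $n - K$ rather than $n$ is what makes $\hat{\sigma}^2$ a weighted average of the per-class sample variances, with weights $(n_{k} - 1)/(n - K)$.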

The `LDA classifier plugs` the `estimates` given in <a href="#Formula4.15">(4.15)</a> and <a href="#Formula4.16">(4.16)</a> into <a href="#Formula4.13">(4.13)</a>, and `assigns an observation` $X = x$ to the `class` for which

<a id="Formula4.17"></a>
<font size=5><center> $ \hat{\delta}_{k}(x) = x \cdot \frac{\hat{\mu}_{k}}{\hat{\sigma}^2} - \frac{\hat{\mu}_{k}^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_{k})$</center></font>

is `largest`. 

The `word linear` in the `classifier’s name stems` from the `fact` that the `discriminant functions` $\hat{\delta}_{k}(x)$ in <a href="#Formula4.17">(4.17)</a> are `linear functions` of $x$ (as opposed to a more `complex function` of $x$).
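
To make the linearity concrete, each discriminant in <a href="#Formula4.17">(4.17)</a> can be written as a slope times $x$ plus an intercept, and a two-class decision boundary is simply where the two lines cross. The sketch below assumes hypothetical plug-in estimates (the values $-1.1$, $1.3$, and $0.9$ are made up for illustration), and the helper names `lda_line` and `two_class_boundary` are not from the text.

```python
import math

def lda_line(mu_k, sigma2, pi_k):
    """Return (slope, intercept) of the discriminant delta_k(x) in (4.17)."""
    slope = mu_k / sigma2
    intercept = -(mu_k ** 2) / (2 * sigma2) + math.log(pi_k)
    return slope, intercept

def two_class_boundary(line1, line2):
    """Value of x at which two discriminant lines cross (slopes must differ)."""
    (a1, b1), (a2, b2) = line1, line2
    return (b2 - b1) / (a1 - a2)

# Hypothetical plug-in estimates, chosen only for illustration.
mu1_hat, mu2_hat, sigma2_hat = -1.1, 1.3, 0.9

# Equal priors: the boundary is the midpoint of the two estimated means.
equal = two_class_boundary(lda_line(mu1_hat, sigma2_hat, 0.5),
                           lda_line(mu2_hat, sigma2_hat, 0.5))

# Unequal priors: the boundary moves away from the more probable class.
unequal = two_class_boundary(lda_line(mu1_hat, sigma2_hat, 0.8),
                             lda_line(mu2_hat, sigma2_hat, 0.2))

print("boundary with equal priors:", equal)
print("boundary with priors 0.8 / 0.2:", unequal)
```

With equal priors the crossing point coincides with $(\hat{\mu}_{1} + \hat{\mu}_{2})/2$; raising $\hat{\pi}_{1}$ pushes the boundary toward class 2, so more of the line is assigned to class 1.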

The `right-hand panel` of <a href="#Figure4.4">Figure 4.4</a> displays a `histogram` of a `random sample of 20 observations` from `each class`. 

To `implement LDA`, we began by `estimating` $\pi_{k}, \mu_{k}$, and $\sigma^2$ using <a href="#Formula4.15">(4.15)</a> and <a href="#Formula4.16">(4.16)</a>. 

We then `computed the decision boundary`, shown as a `black solid line`, that `results` from assigning an `observation to the class` for which <a href="#Formula4.17">(4.17)</a> is `largest`. 

__All points to the `left of this line will be assigned to the green class`__, while __points to the `right of this line are assigned to the purple class`.__

In this `case`, since $n_{1} = n_{2} = 20$, we have $\hat{\pi}_{1} = \hat{\pi}_{2}$. 

As a `result`, the `decision boundary corresponds to the midpoint` between `the sample means` for the `two classes`, $(\hat{\mu}_{1} + \hat{\mu}_{2} )/2$. 

The `figure indicates` that the `LDA decision boundary` is `slightly to the left` of the `optimal Bayes decision boundary`, which `instead equals` $(\mu_{1} + \mu_{2})/2 = 0$. 

__How well does the LDA classifier perform on this data?__

Since this is `simulated data`, we can `generate a large number of test observations` in `order to compute` the `Bayes error rate` and the `LDA test error rate`. 

These are 10.6% and 11.1%, respectively. In other words, the `LDA classifier’s error rate` is only 0.5 percentage points above the `smallest possible error rate`! This indicates that `LDA is performing pretty well` on this `data set`.
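
A comparison of this kind can be sketched as follows, using only the Python standard library. The Bayes error rate is computed exactly from the normal CDF, while the LDA test error is a Monte Carlo estimate. The seed, the test-set size, and the variable names are assumptions, and because the training sample here is freshly simulated, the LDA figure will generally differ from the 11.1% reported above.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

MU1, MU2, SIGMA = -1.25, 1.25, 1.0  # true parameters of the example

# Bayes error: class 1 is misclassified when x > 0, class 2 when x < 0.
bayes_error = (0.5 * (1.0 - normal_cdf((0.0 - MU1) / SIGMA))
               + 0.5 * normal_cdf((0.0 - MU2) / SIGMA))

# Fit LDA on 20 training observations per class (seed chosen arbitrarily).
rng = random.Random(1)
train1 = [rng.gauss(MU1, SIGMA) for _ in range(20)]
train2 = [rng.gauss(MU2, SIGMA) for _ in range(20)]
# Equal class sizes give equal estimated priors, so the LDA boundary
# is the midpoint of the two sample means.
lda_boundary = (sum(train1) / 20 + sum(train2) / 20) / 2

# Monte Carlo estimate of the LDA test error on a large simulated test set.
n_test = 100_000
errors = 0
for _ in range(n_test):
    if rng.random() < 0.5:
        x, label = rng.gauss(MU1, SIGMA), 1
    else:
        x, label = rng.gauss(MU2, SIGMA), 2
    # Assumes the class 1 sample mean lies below the class 2 sample mean.
    predicted = 1 if x < lda_boundary else 2
    errors += predicted != label
lda_test_error = errors / n_test

print(f"Bayes error rate:    {bayes_error:.4f}")
print(f"LDA test error rate: {lda_test_error:.4f}")
```

The exact Bayes error here is $\Phi(-1.25) \approx 0.106$, which agrees with the 10.6% quoted in the text.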

<a id="Figure4.5"></a>
![image.png](Figures/Figure4.5.png)
__FIGURE 4.5.__ `Two multivariate Gaussian density functions` are shown, with $p = 2$. 

>__Left:__ The `two predictors are uncorrelated`. 

>__Right:__ The `two variables have a correlation of 0.7`.

To `reiterate`, `the LDA classifier results` from assuming that the `observations` within `each class come from` a `normal distribution with` a `class-specific` __mean vector and a common variance $\sigma^2$, and plugging estimates__ for these `parameters` into the `Bayes classifier`. 

In <a href="">Section 4.4.4</a>, we will `consider` a `less stringent set of assumptions`, by allowing the `observations` in the $k$th `class to have a class-specific variance`, $\sigma_{k}^2$.


### 4.4.3 Linear Discriminant Analysis for p >1

### 4.4.4 Quadratic Discriminant Analysis


## 4.5 A Comparison of Classification Methods
