In many situations, the response variable is *qualitative*. Qualitative variables are often referred to as *categorical*. The process of predicting a qualitative response is called **classification**.

### $\underline{\text{Overview of Classification}}:$

Like regression, in the classification setting we have a set of training observations $(x_1, y_1),\dots, (x_n, y_n)$ that we can use to build a classifier. In this chapter, we learn how to build a model to predict *default* $Y$ for any given value of *balance* $X_1$ and *income* $X_2$.


**Why not linear regression?**

1. A regression method cannot accommodate a qualitative response with more than two classes.

*Example* Suppose we encode three diagnoses (stroke, drug overdose, epileptic seizure) as a quantitative response variable $Y$:

$$
Y =     
\begin{cases}
1 & \text{if stroke;} \\
2 & \text{if drug overdose;} \\
3 & \text{if epileptic seizure;}
\end{cases}
\tag{1}
$$

we can use least squares to fit a linear model. However, this implies a false ordering and equal spacing between categories, which may not reflect reality. Linear regression isn’t appropriate when the response is qualitative with no natural numeric structure.

2. A regression method will not provide meaningful estimates of $Pr(Y|X)$, even with just two classes. 

*Example* For a binary response, e.g.,

$$
Y =
\begin{cases}
0 & \text{if stroke} \\
1 & \text{if drug overdose}
\end{cases}
$$

we can fit a linear regression and predict drug overdose if $\hat{Y} > 0.5$, stroke otherwise. This gives crude probability estimates, though some may fall outside $[0, 1]$.
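
A minimal sketch of this second problem, assuming made-up toy data (ten observations, the first five coded 0 and the last five coded 1) and NumPy's `polyfit` for the least-squares line. The numbers are illustrative only and do not come from any dataset discussed in these notes.

```python
import numpy as np

# Hypothetical toy data: one predictor and a 0/1 response.
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Ordinary least squares fit of y on x (polyfit returns slope first).
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

# Classify with the 0.5 cutoff described above.
predicted_class = (fitted > 0.5).astype(int)

print("fitted values:", np.round(fitted, 3))
print("smallest / largest fitted value:", fitted.min(), fitted.max())
```

The cutoff classifies this toy sample correctly, but the fitted values at the two ends of the range fall below 0 and above 1, so they cannot be read as probabilities.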

### $\underline{\text{Logistic Regression:}}$

Logistic regression models the **probability** that $Y$ belongs to a particular category. For the **Default** data, logistic regression models the probability of default given **balance**:

$$\Pr(\text{default} = \text{Yes} \mid \text{balance}),$$

which can be abbreviated as $p(\text{balance})$ and always lies between 0 and 1. For example, we might predict $\text{default} = \text{Yes}$ for any individual with $p(\text{balance}) > 0.5$. A company that wants to be more conservative about flagging potential defaulters could use a lower threshold, such as $p(\text{balance}) > 0.1$.

**The Logistic Model**

Using linear regression to model $p(X) = \Pr(Y = 1 \mid X)$ with

$$
p(X) = \beta_0 + \beta_1 X
$$

can lead to invalid probability estimates—values below 0 or above 1—especially when $X$ has a wide range. To fix this, we use a function that maps any input to $[0, 1]$, such as the logistic function:

$$
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.
$$

This is the basis of logistic regression. Because the logistic function always returns a value strictly between 0 and 1, the predicted probabilities are valid for every value of $X$, unlike those from linear regression. The function traces an S-shaped curve: when $\beta_1 > 0$, low values of $X$ give probabilities near 0 and high values give probabilities near 1. The coefficients are fit by *maximum likelihood*, described below.

We can rewrite the model as

$$
\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}
$$

where the left-hand side is the **odds**, and its log gives the **logit**:

$$
\log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X
$$

Here, $\beta_1$ is the change in the **log-odds** for a one-unit increase in $X$; equivalently, a one-unit increase in $X$ multiplies the odds by $e^{\beta_1}$. Unlike in linear regression, the change in $p(X)$ produced by a one-unit increase in $X$ depends on the current value of $X$, so the relationship between $X$ and $p(X)$ is nonlinear. If $\beta_1 > 0$, increasing $X$ increases $p(X)$; if $\beta_1 < 0$, increasing $X$ decreases $p(X)$.
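
As a rough numerical check of these two statements, the sketch below evaluates the odds and the probability at a few values of $X$. The coefficient values are the balance estimates quoted in Table 4.1 of these notes, used here only as convenient numbers; the function names `p` and `odds` are illustrative choices.

```python
import math

# Coefficients from Table 4.1, used only for illustration.
beta0, beta1 = -10.6513, 0.0055

def p(x):
    """Logistic function p(X) for a single predictor."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def odds(x):
    """Odds p(X) / (1 - p(X)), which equal exp(beta0 + beta1 * X)."""
    return p(x) / (1.0 - p(x))

for x in (500, 1500, 2500):
    odds_ratio = odds(x + 1) / odds(x)   # the same at every x: exp(beta1)
    delta_p = p(x + 1) - p(x)            # varies with the position on the curve
    print(x, round(odds_ratio, 6), delta_p)
```

The odds ratio is constant across the three starting values, while the change in probability differs by orders of magnitude between them.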

**Estimating the Regression Coefficients**

In logistic regression, the coefficients $\beta_0$ and $\beta_1$ are unknown and estimated from training data. Instead of least squares (used in linear regression), we use **maximum likelihood**, which has better statistical properties.

The idea is to choose $\hat{\beta}_0$, $\hat{\beta}_1$ so that the predicted probability $\hat{p}(x_i)$ is close to 1 for individuals who defaulted and close to 0 for those who didn’t. This is done by maximizing the **likelihood function**:

$$
\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} (1 - p(x_{i'}))
$$

In practice, software estimates these automatically.
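
To make the idea concrete, here is a minimal sketch of one common way to maximize this likelihood: Newton's method applied to its logarithm. This is an assumption about how a fit could be done, not a description of any particular package, and the toy data are invented for illustration.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Log of the likelihood above: sum of y_i * z_i - log(1 + exp(z_i))."""
    z = X @ beta
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def fit_logistic(x, y, n_iter=25):
    """Maximize the log-likelihood with Newton's method (toy implementation)."""
    X = np.column_stack([np.ones(len(x)), x])    # column of ones for beta_0
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # current fitted probabilities
        gradient = X.T @ (y - p)
        hessian = -(X.T * (p * (1.0 - p))) @ X   # negative definite
        beta = beta - np.linalg.solve(hessian, gradient)
    return beta, X

# Hypothetical data with overlapping classes, so a finite maximum exists.
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=float)

beta_hat, X = fit_logistic(x, y)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
print("estimates:", beta_hat)
print("log-likelihood at estimates:", log_likelihood(beta_hat, X, y))
```

At the maximum the gradient $X^T(y - \hat{p})$ is zero, which is a convenient way to check convergence. The sketch has no safeguards for perfectly separated classes, where the likelihood has no finite maximizer.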

For example, if $\hat{\beta}_1 = 0.0055$, then a one-unit increase in balance raises the **log-odds** of default by 0.0055. The **z-statistic** tests if this effect is statistically significant, similar to the t-statistic in linear regression. A small **p-value** indicates we reject the null $H_0: \beta_1 = 0$, meaning balance is associated with default risk.

The intercept $\hat{\beta}_0$ is usually not of direct interest—it adjusts the overall probability to match the proportion of defaults in the data.

**Making Predictions**

Once the coefficients are estimated, we can compute predicted probabilities using the logistic model:

$$
\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}}
$$

From **Table 4.1**, for `balance` as the predictor:

* $\hat{\beta}_0 = -10.6513$
* $\hat{\beta}_1 = 0.0055$

Then:

* For a balance of \$1,000:

$$
\hat{p}(X = 1000) = \frac{e^{-10.6513 + 0.0055 \cdot 1000}}{1 + e^{-10.6513 + 0.0055 \cdot 1000}} = \frac{e^{-5.1513}}{1 + e^{-5.1513}} \approx 0.00576
$$

* For a balance of \$2,000:

$$
\hat{p}(X = 2000) = \frac{e^{-10.6513 + 0.0055 \cdot 2000}}{1 + e^{-10.6513 + 0.0055 \cdot 2000}} = \frac{e^{0.3487}}{1 + e^{0.3487}} \approx 0.586
$$
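
The two calculations above can be reproduced with a short sketch. The function name `predict_default` is an illustrative choice; the coefficients are the Table 4.1 estimates.

```python
import math

# Estimates from Table 4.1 (default modeled from balance).
beta0_hat, beta1_hat = -10.6513, 0.0055

def predict_default(balance):
    """Predicted probability of default for a given balance in dollars."""
    z = beta0_hat + beta1_hat * balance
    return 1.0 / (1.0 + math.exp(-z))   # algebraically equal to e^z / (1 + e^z)

p_1000 = predict_default(1000)
p_2000 = predict_default(2000)
print(round(p_1000, 5), round(p_2000, 3))
```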

**Table 4.1: Logistic Regression Predicting Default from Balance**

|           | Coefficient | Std. Error | z-Statistic | p-Value |
| --------- | ----------- | ---------- | ----------- | ------- |
| Intercept | -10.6513    | 0.3612     | -29.5       | <0.0001 |
| Balance   | 0.0055      | 0.0002     | 24.9        | <0.0001 |

Logistic regression also handles qualitative predictors using **dummy variables**. For example, for the qualitative variable `student` (1 = student, 0 = non-student), the estimated coefficients from **Table 4.2** are:

* $\hat{\beta}_0 = -3.5041$
* $\hat{\beta}_1 = 0.4049$

Then:

* For a student:

$$
\hat{p}(\text{default} = \text{Yes} \mid \text{student} = 1) = \frac{e^{-3.5041 + 0.4049 \cdot 1}}{1 + e^{-3.5041 + 0.4049 \cdot 1}} = \frac{e^{-3.0992}}{1 + e^{-3.0992}} \approx 0.0431
$$

* For a non-student:

$$
\hat{p}(\text{default} = \text{Yes} \mid \text{student} = 0) = \frac{e^{-3.5041 + 0.4049 \cdot 0}}{1 + e^{-3.5041 + 0.4049 \cdot 0}} = \frac{e^{-3.5041}}{1 + e^{-3.5041}} \approx 0.0292
$$

The positive coefficient indicates that students are predicted to have a higher probability of default than non-students.

**Table 4.2: Logistic Regression Predicting Default from Student Status**

|                | Coefficient | Std. Error | z-Statistic | p-Value |
| -------------- | ----------- | ---------- | ----------- | ------- |
| Intercept      | -3.5041     | 0.0707     | -49.55      | <0.0001 |
| Student \[Yes] | 0.4049      | 0.1150     | 3.52        | 0.0004  |

**Multiple Logistic Regression**

To predict a binary response using multiple predictors, we generalize the logistic regression model as:

$$
\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
$$

or equivalently,

$$
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}
$$

**Table 4.3: Logistic Regression Predicting Default from Multiple Predictors**

|                | Coefficient | Std. Error | z-Statistic | p-Value |
| -------------- | ----------- | ---------- | ----------- | ------- |
| Intercept      | -10.8690    | 0.4923     | -22.08      | <0.0001 |
| Balance        | 0.0057      | 0.0002     | 24.74       | <0.0001 |
| Income         | 0.0030      | 0.0082     | 0.37        | 0.7115  |
| Student \[Yes] | -0.6468     | 0.2362     | -2.74       | 0.0062  |

As before, we use **maximum likelihood** to estimate the coefficients. From the table:

* Balance is positively and significantly associated with default.
* Income is not statistically significant.
* Student status has a **negative coefficient**, in contrast to the positive coefficient in the earlier single-variable model (Table 4.2).

This reversal occurs due to **confounding**: student status and balance are correlated. Students tend to carry higher balances, which are strongly associated with higher default rates. So although students have higher **overall** default rates, for the **same balance and income**, students are actually **less likely** to default than non-students.

**Predictions from the Multiple Logistic Regression Model**

Using the estimated coefficients from Table 4.3, we can compute predicted probabilities:

* For a **student** with balance = \$1,500 and income = \$40,000:

$$
\hat{p}(\text{default} = \text{Yes} \mid \text{student} = 1, \text{balance} = 1500, \text{income} = 40) =
\frac{e^{-10.869 + 0.0057 \cdot 1500 + 0.0030 \cdot 40 - 0.6468 \cdot 1}}{1 + e^{-10.869 + 0.0057 \cdot 1500 + 0.0030 \cdot 40 - 0.6468 \cdot 1}} \approx 0.058
$$

* For a **non-student** with the same balance and income:

$$
\hat{p}(\text{default} = \text{Yes} \mid \text{student} = 0, \text{balance} = 1500, \text{income} = 40) =
\frac{e^{-10.869 + 0.0057 \cdot 1500 + 0.0030 \cdot 40 - 0.6468 \cdot 0}}{1 + e^{-10.869 + 0.0057 \cdot 1500 + 0.0030 \cdot 40 - 0.6468 \cdot 0}} \approx 0.105
$$

Note: Income is measured in **thousands of dollars**, so we use 40 instead of 40,000 in the calculation.
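
A minimal sketch of the same calculation, using the rounded coefficients printed in Table 4.3. With these rounded values the results come out near 0.055 and 0.100, slightly below the 0.058 and 0.105 quoted above; the likely explanation (an assumption here) is that the quoted figures were computed from unrounded coefficients. The ordering of the two probabilities is the same either way.

```python
import math

# Rounded estimates from Table 4.3; income is in thousands of dollars.
coef = {"intercept": -10.8690, "balance": 0.0057, "income": 0.0030, "student": -0.6468}

def predict_default(balance, income_thousands, student):
    """Predicted default probability; student is 1 for students, 0 otherwise."""
    z = (coef["intercept"]
         + coef["balance"] * balance
         + coef["income"] * income_thousands
         + coef["student"] * student)
    return 1.0 / (1.0 + math.exp(-z))

p_student = predict_default(1500, 40, 1)
p_non_student = predict_default(1500, 40, 0)
print(round(p_student, 3), round(p_non_student, 3))
```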

**Discussion of confounding**

The **overall default rate** refers to the empirical proportion of individuals in the dataset who defaulted, regardless of predictor values.

Mathematically, if you have $n$ observations and $y_i \in \{0, 1\}$ indicates whether person $i$ defaulted:

$$
\text{Overall default rate} = \frac{1}{n} \sum_{i=1}^n y_i
$$

This is **not** a model-based prediction—it's simply the average value of the binary response across the data. For example, if 3.3% of people in the data defaulted, the overall default rate is 0.033.

In plots like **Figure 4.3**, the horizontal dashed lines show this empirical rate **within subgroups**, e.g., students vs. non-students, averaged across all values of balance and income.

How can students be less likely to default (in the model), yet have a higher overall default rate?

This is the classic signature of **confounding**.

* In the **single-variable logistic regression** (Table 4.2), students appear **more likely** to default. That’s because:

  * Students tend to have **higher balances** (right panel of Figure 4.3).
  * Higher balances are strongly associated with higher default risk.
  * So, without adjusting for balance, it *looks* like students are riskier.

* In the **multiple logistic regression** (Table 4.3), after controlling for **balance** and **income**, the **coefficient for student becomes negative**:

  * This means that **for two people with the same balance and income**, the student is actually **less likely** to default.
  * So, once we hold balance constant, the association between student status and default that is not explained by balance becomes visible.

* The multiple logistic regression says:

  $$
  \log \left( \frac{\Pr(\text{default} = 1 \mid X)}{1 - \Pr(\text{default} = 1 \mid X)} \right) = \beta_0 + \beta_1 \cdot \text{balance} + \beta_2 \cdot \text{income} + \beta_3 \cdot \text{student}
  $$

  where $\hat{\beta}_3 = -0.6468$ implies students are less likely to default **when balance and income are the same**.

The higher **overall** student default rate is due to **students having higher average balances**, not because being a student inherently increases default risk.

**Balance is the confounding variable here.** It is associated with both student status and default risk, so leaving it out of the model makes students look riskier. Once you control for balance, being a student is associated with a **lower** default probability.
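
The reversal can be seen with plain counts. The sketch below uses invented numbers (not the Default data) in which students default less often than non-students within each balance group, yet more often overall because most students sit in the high-balance group.

```python
# Hypothetical counts, invented for illustration: (people, defaults) by group.
counts = {
    ("low balance", "non-student"): (800, 16),
    ("low balance", "student"): (200, 2),
    ("high balance", "non-student"): (200, 40),
    ("high balance", "student"): (800, 120),
}

def rate(groups):
    """Pooled default rate over a list of (people, defaults) pairs."""
    people = sum(n for n, _ in groups)
    defaults = sum(d for _, d in groups)
    return defaults / people

# Default rate within each balance group, by student status.
within = {key: rate([value]) for key, value in counts.items()}

# Default rate by student status, ignoring balance.
overall = {
    status: rate([v for (_, s), v in counts.items() if s == status])
    for status in ("student", "non-student")
}

print(within)
print(overall)
```

Within each balance group the student rate is lower (1% vs. 2%, and 15% vs. 20%), while the pooled student rate is higher (12.2% vs. 5.6%).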


**Multinomial Logistic Regression (for $K > 2$ classes)**

Logistic regression can be extended to handle **multi-class classification**—this is called **multinomial logistic regression**.

Suppose $Y \in \{1, 2, \dots, K\}$ is a response with **K classes**, and we choose one class (say class $K$) as the **baseline**.

For $k = 1, \dots, K-1$:

$$
\Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}
$$

$$
\Pr(Y = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}
$$

And the **log-odds** for class $k$ vs. the baseline class $K$ is:

$$
\log \left( \frac{\Pr(Y = k \mid X = x)}{\Pr(Y = K \mid X = x)} \right) = \beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p
$$

This generalizes binary logistic regression: the log-odds between any class and the baseline is linear in the predictors.

* The choice of **baseline class** doesn’t affect predicted probabilities or log-odds between any pair of classes, but it **does affect coefficient interpretation**.
* For example, if **epileptic seizure** is the baseline, then $\beta_{\text{stroke}, j}$ is the change in log-odds of stroke vs. epileptic seizure for a one-unit increase in $x_j$.
* A one-unit increase in $x_j$ multiplies the **odds** of stroke vs. epileptic seizure by $e^{\beta_{\text{stroke}, j}}$.

**Softmax Coding**

An alternative formulation used widely in **machine learning** is **softmax coding**, which treats all classes symmetrically:

$$
\Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{\sum_{l=1}^K e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}} \quad \text{for } k = 1, \dots, K
$$

In this version

* Coefficients are estimated for **all K classes**.
* The **log-odds between any two classes** $k$ and $k'$ is:

$$
\log \left( \frac{\Pr(Y = k \mid X = x)}{\Pr(Y = k' \mid X = x)} \right)
= (\beta_{k0} - \beta_{k'0}) + (\beta_{k1} - \beta_{k'1})x_1 + \cdots + (\beta_{kp} - \beta_{k'p})x_p
$$

Softmax coding produces the **same predicted probabilities and log-odds** as the baseline coding—it’s just a different parameterization.
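
A small numerical check of this equivalence, assuming arbitrary made-up coefficients for $K = 3$ classes and $p = 2$ predictors. Subtracting the baseline class's coefficients from every other class converts softmax coefficients into baseline-coded ones.

```python
import numpy as np

def softmax_probs(B, x):
    """B has one row per class: [intercept, slope_1, ..., slope_p]."""
    scores = B[:, 0] + B[:, 1:] @ x
    scores = scores - scores.max()       # stabilizes exp without changing ratios
    expd = np.exp(scores)
    return expd / expd.sum()

def baseline_probs(B_base, x):
    """B_base has K-1 rows; the last class K is the baseline."""
    expd = np.exp(B_base[:, 0] + B_base[:, 1:] @ x)
    denom = 1.0 + expd.sum()
    return np.append(expd / denom, 1.0 / denom)

# Hypothetical softmax coefficients (rows = classes).
B = np.array([[0.5, 1.0, -0.5],
              [-0.2, 0.3, 0.8],
              [0.1, -0.4, 0.2]])
B_base = B[:-1] - B[-1]                  # differences relative to class K
x = np.array([1.5, -0.7])

p_softmax = softmax_probs(B, x)
p_baseline = baseline_probs(B_base, x)
print(p_softmax)
print(p_baseline)
```

Adding the same constant vector to every row of `B` leaves `p_softmax` unchanged, which is why the softmax coefficients are not uniquely determined without a further constraint.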

**Summary Table**

| Concept                  | Description                                                                                        |
| ------------------------ | -------------------------------------------------------------------------------------------------- |
| Modeling approach        | **Discriminative**: directly models $\Pr(Y = k \mid X = x)$                                        |
| Functional form (binary) | $\log \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x$                                                 |
| Output                   | Probability estimates for each class                                                               |
| Probability function     | $p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$                               |
| Fitting method           | Maximum likelihood estimation                                                                      |
| Link function            | **Logit** (log-odds)                                                                               |
| Assumptions              | Linear relationship between predictors and log-odds                                                |
| Decision rule            | Predict class 1 if $p(x) > t$, with threshold $t$ (e.g., 0.5)                                      |
| Interpretability         | Coefficients represent **change in log-odds** per unit change in predictor                         |
| Extensions               | - Multinomial logistic regression (for $K > 2$)                                                    |
|                          | - Softmax regression (symmetric multiclass generalization)                                         |
| Strengths                | - Probabilistic output                                                                             |
|                          | - Handles binary and multiclass tasks                                                              |
|                          | - Can include interactions and non-linearities via features                                        |
| Limitations              | - Assumes linear log-odds                                                                          |
|                          | - Coefficient estimates can be unstable when classes are well separated                            |
| Relation to LDA          | Logistic regression is **discriminative**; LDA is **generative**                                   |
| Confounding handling     | Can adjust for multiple predictors simultaneously                                                  |
| Prediction formula       | $\hat{p}(x) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x}}$ |



### $\underline{\text{Generative Models for Classification}}:$

Logistic regression directly models the **posterior probability**:

$$
\Pr(Y = k \mid X = x)
$$

This is called a **discriminative approach**, since it models the response $Y$ given predictors $X$.

### **Generative Approach**

Instead of modeling $\Pr(Y \mid X)$ directly, we:

1. Model the **class priors** $\pi_k = \Pr(Y = k)$
2. Model the **class-conditional density** $f_k(x) = \Pr(X = x \mid Y = k)$

Then apply **Bayes' theorem**:

$$
\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}
$$

This gives us the **posterior probability** $p_k(x)$, which we can use to classify observations by assigning them to the class with the highest posterior (i.e., **Bayes classifier**).
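
A minimal sketch of this calculation for a single point $x$, assuming the priors and the density values at $x$ are already known. The numbers are made up for illustration.

```python
def posterior(priors, densities):
    """Bayes' theorem: pi_k * f_k(x) divided by the sum over all classes."""
    joint = [pi * f for pi, f in zip(priors, densities)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical values for K = 2 classes at one point x.
priors = [0.3, 0.7]        # pi_1, pi_2
densities = [0.5, 0.1]     # f_1(x), f_2(x)

post = posterior(priors, densities)
predicted_class = max(range(len(post)), key=post.__getitem__) + 1   # 1-based label
print(post, predicted_class)
```

Here class 1 wins despite its smaller prior, because its density at $x$ is five times larger.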

The generative approach can be preferable to logistic regression for several reasons:

* Its parameter estimates are **more stable** when the classes are well separated, a setting in which logistic regression estimates become unstable.
* It can be **more accurate** with small sample sizes when the distribution of $X$ within each class is approximately normal.
* It generalizes naturally to multi-class settings.
* It approximates the **Bayes classifier**, which has the lowest theoretical error (if $\pi_k$ and $f_k(x)$ are correctly specified).

**Estimating $\pi_k$ and $f_k(x)$**

* $\pi_k$ can be estimated easily as the proportion of training examples in class $k$.
* Estimating $f_k(x)$ is harder and typically requires assumptions (e.g., normal distribution).

This framework leads to three common generative classifiers:

* **Linear Discriminant Analysis (LDA)**
* **Quadratic Discriminant Analysis (QDA)**
* **Naive Bayes**

Each uses a different assumption about the form of $f_k(x)$, allowing practical approximation of the Bayes classifier.


### **Linear Discriminant Analysis for $p=1$:**

**LDA** is a generative classification method that models the conditional distribution $X \mid Y = k$ as **Gaussian** with **class-specific means** $\mu_k$ and a **shared variance** $\sigma^2$. Using **Bayes’ theorem**, it computes the posterior $\Pr(Y = k \mid X = x)$, and classifies observations by selecting the class with the highest posterior (i.e., the **Bayes classifier**).

**Assumptions (Univariate case):**

$$
f_k(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu_k)^2}{2\sigma^2} \right)
$$

* $X \mid Y = k \sim \mathcal{N}(\mu_k, \sigma^2)$
* Common variance across all classes
* Class priors $\pi_k$ are known or estimated from the training data

**Discriminant Function:**

LDA simplifies the Bayes rule into a **linear discriminant function**:

$$
\delta_k(x) = \frac{\mu_k}{\sigma^2}x - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)
$$

Predict the class with the largest $\delta_k(x)$.
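
The linear form follows from taking the log of the numerator $\pi_k f_k(x)$ in Bayes' theorem and dropping every term that is the same for all classes. A sketch of the algebra, using the Gaussian density above (inside the logarithm, $2\pi\sigma^2$ uses the constant $\pi$, not a prior):

$$
\begin{aligned}
\log\big(\pi_k f_k(x)\big)
&= \log(\pi_k) - \tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x - \mu_k)^2}{2\sigma^2} \\
&= \underbrace{\frac{\mu_k}{\sigma^2}x - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)}_{\delta_k(x)}
\;-\; \underbrace{\left( \frac{x^2}{2\sigma^2} + \tfrac{1}{2}\log(2\pi\sigma^2) \right)}_{\text{same for every } k}
\end{aligned}
$$

The denominator of Bayes' theorem is also the same for every class, so maximizing the posterior over $k$ is equivalent to maximizing $\delta_k(x)$.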

**Parameter Estimation:**

Given training data:

* **Class means**:

  $$
  \hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i
  $$

* **Shared variance**:

  $$
  \hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^K \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2
  $$

* **Class priors**:

  $$
  \hat{\pi}_k = \frac{n_k}{n}
  $$

These are substituted into the estimated discriminant:

$$
\hat{\delta}_k(x) = \frac{\hat{\mu}_k}{\hat{\sigma}^2}x - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k)
$$
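
A minimal sketch of these plug-in estimates for one predictor, written in plain Python. The function names and the six training points are hypothetical and chosen so the estimates are easy to verify by hand.

```python
import math

def fit_lda_1d(x, y):
    """Estimate class means, pooled variance, and priors for one predictor."""
    classes = sorted(set(y))
    n, K = len(x), len(classes)
    groups = {k: [xi for xi, yi in zip(x, y) if yi == k] for k in classes}
    mu = {k: sum(g) / len(g) for k, g in groups.items()}
    pooled_ss = sum((xi - mu[k]) ** 2 for k, g in groups.items() for xi in g)
    sigma2 = pooled_ss / (n - K)
    prior = {k: len(g) / n for k, g in groups.items()}
    return mu, sigma2, prior

def discriminant(x0, k, mu, sigma2, prior):
    """Estimated delta_k(x0) from the formula above."""
    return mu[k] / sigma2 * x0 - mu[k] ** 2 / (2 * sigma2) + math.log(prior[k])

def predict(x0, mu, sigma2, prior):
    """Class with the largest estimated discriminant at x0."""
    return max(mu, key=lambda k: discriminant(x0, k, mu, sigma2, prior))

# Hypothetical training data: two classes with three observations each.
x = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
y = [1, 1, 1, 2, 2, 2]

mu, sigma2, prior = fit_lda_1d(x, y)
print(mu, sigma2, prior)
print(predict(3.9, mu, sigma2, prior), predict(4.1, mu, sigma2, prior))
```

For these data the class means are 2 and 6, the pooled variance is 1, and the priors are equal, so predictions switch from class 1 to class 2 at the midpoint $x = 4$.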


**Decision Boundary (Two Classes, Equal Priors):**

If $\pi_1 = \pi_2$, then the LDA decision boundary is simply:

$$
x = \frac{\mu_1 + \mu_2}{2}
$$
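
This follows from setting $\delta_1(x) = \delta_2(x)$. With equal priors the $\log(\pi_k)$ terms cancel, and assuming $\mu_1 \ne \mu_2$:

$$
\begin{aligned}
\frac{\mu_1}{\sigma^2}x - \frac{\mu_1^2}{2\sigma^2} &= \frac{\mu_2}{\sigma^2}x - \frac{\mu_2^2}{2\sigma^2} \\
(\mu_1 - \mu_2)\,x &= \frac{\mu_1^2 - \mu_2^2}{2} = \frac{(\mu_1 - \mu_2)(\mu_1 + \mu_2)}{2} \\
x &= \frac{\mu_1 + \mu_2}{2}
\end{aligned}
$$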

**Performance Insight:**

In simulated examples, LDA achieves near-optimal performance (e.g., 11.1% test error vs. 10.6% Bayes error), assuming the Gaussian and equal variance assumptions hold.

**Model Assumptions:**

LDA assumes Gaussian-distributed predictors with **class-specific means** and a **common variance**. When this assumption holds, it closely approximates the Bayes classifier. If not, more flexible models like **QDA** (which allows class-specific variances) may perform better.


**Summary Table**

| Concept               | Description                                                             |
| --------------------- | ----------------------------------------------------------------------- |
| Generative assumption | $X \mid Y = k \sim \mathcal{N}(\mu_k, \sigma^2)$                        |
| Decision rule         | Assign $x$ to class with largest $\delta_k(x)$                          |
| Discriminant function | Linear in $x$                                                           |
| Parameters estimated  | $\mu_k$, shared $\sigma^2$, class prior $\pi_k$                         |
| Strengths             | Simple, fast, interpretable, near-optimal if assumptions hold           |
| Limitations           | Assumes equal variance; less flexible than QDA or nonparametric methods |
| Output                | Posterior probabilities, linear decision boundaries                     |



### **Linear Discriminant Analysis for $p > 1$:**

**Assumptions**

LDA assumes that the predictor $\mathbf{X} \in \mathbb{R}^p$ follows a **multivariate Gaussian distribution** within each class $k$:

$$
\mathbf{X} \mid Y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})
$$

* $\boldsymbol{\mu}_k$: class-specific mean vector ($p \times 1$)
* $\boldsymbol{\Sigma}$: shared covariance matrix ($p \times p$), same for all $K$ classes
* $\pi_k$: class prior probability, often estimated from class frequencies

**Multivariate Gaussian Density**

The density for class $k$ is:

$$
f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right)
$$

**Discriminant Function**

Instead of directly comparing the densities, LDA defines a **discriminant score** $\delta_k(\mathbf{x})$, derived from the log-posterior:

$$
\delta_k(\mathbf{x}) = \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log(\pi_k)
$$

An observation is assigned to the class with the **largest** $\delta_k(\mathbf{x})$.

Each $\delta_k(\mathbf{x})$ is a **linear function** of $\mathbf{x}$, which gives LDA its name.

**Decision Boundaries**

* Boundaries between class $k$ and class $\ell$ are defined by the set of $\mathbf{x}$ where $\delta_k(\mathbf{x}) = \delta_\ell(\mathbf{x})$
* These are **linear hyperplanes** in $\mathbb{R}^p$

**Parameter Estimation from Training Data**

Given $n$ training examples and $K$ classes:

* Sample mean vector for class $k$:

  $$
  \hat{\boldsymbol{\mu}}_k = \frac{1}{n_k} \sum_{i: y_i = k} \mathbf{x}_i
  $$

* Shared covariance matrix:

  $$
  \hat{\boldsymbol{\Sigma}} = \frac{1}{n - K} \sum_{k=1}^K \sum_{i: y_i = k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T
  $$

* Priors:

  $$
  \hat{\pi}_k = \frac{n_k}{n}
  $$

**Classification Rule**

Plug in the estimates into the discriminant function:

$$
\hat{\delta}_k(\mathbf{x}) = \mathbf{x}^T \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\mu}}_k - \frac{1}{2} \hat{\boldsymbol{\mu}}_k^T \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\mu}}_k + \log(\hat{\pi}_k)
$$

Assign $\mathbf{x}$ to the class with the largest $\hat{\delta}_k(\mathbf{x})$.
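
The estimation and classification steps can be sketched with NumPy as follows. This is a toy implementation under the assumptions above, with hypothetical function names and made-up data; it is not a substitute for a tested library routine.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, the pooled covariance matrix, and priors."""
    classes = np.unique(y)
    n, p = X.shape
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, means):
        centered = X[y == k] - mu
        pooled += centered.T @ centered
    sigma = pooled / (n - len(classes))
    priors = np.array([(y == k).mean() for k in classes])
    return classes, means, sigma, priors

def discriminants(x, means, sigma, priors):
    """Vector of estimated delta_k(x), one entry per class."""
    sigma_inv_mu = np.linalg.solve(sigma, means.T)   # column k is Sigma^{-1} mu_k
    linear = x @ sigma_inv_mu
    quadratic = 0.5 * np.sum(means.T * sigma_inv_mu, axis=0)
    return linear - quadratic + np.log(priors)

def predict(x, classes, means, sigma, priors):
    return classes[np.argmax(discriminants(x, means, sigma, priors))]

# Hypothetical training data: two classes, four observations each.
X = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 3.0], [3.0, 2.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0], [7.0, 6.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

classes, means, sigma, priors = fit_lda(X, y)
print(sigma)
print(predict(np.array([2.0, 2.0]), classes, means, sigma, priors))
```

Solving the linear system with `np.linalg.solve` avoids forming $\hat{\boldsymbol{\Sigma}}^{-1}$ explicitly, which is usually the more numerically stable choice.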

**Sensitivity to Thresholding**

With two classes, assigning each observation to the class with the larger posterior is the same as applying a **0.5 threshold** to the posterior probability of the positive class. This threshold targets the lowest overall error rate but can perform poorly on a rare class (e.g., defaulters). Lowering the threshold (e.g., to 0.2) increases **sensitivity** (true positive rate) at the cost of **specificity** (true negative rate).

This tradeoff is visualized with a **ROC curve**; the **area under the curve (AUC)** summarizes classifier performance across all thresholds. For example, AUC = 0.95 indicates excellent classification.
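
One way to compute the AUC without drawing the curve uses the fact that it equals the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The sketch below counts such pairs directly; the scores and labels are made up for illustration.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical posterior probabilities of the positive class and true labels.
scores = [0.05, 0.10, 0.30, 0.35, 0.60, 0.80]
labels = [0, 0, 1, 0, 1, 1]

area = auc(scores, labels)
print(area)   # 8 of the 9 positive-negative pairs are ordered correctly
```

The pairwise count costs time proportional to the number of positive-negative pairs, so it is suited to small examples rather than large datasets.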

**Confusion Matrix Terminology (Table 4.6)**

| True Class   | Predicted Negative  | Predicted Positive  | Total |
| ------------ | ------------------- | ------------------- | ----- |
| Negative (−) | TN (True Negative)  | FP (False Positive) | N     |
| Positive (+) | FN (False Negative) | TP (True Positive)  | P     |
| Total        | N\*                 | P\*                 |       |

**Key Metrics (Table 4.7)**

| Name                          | Formula          | Synonyms                               |
| ----------------------------- | ---------------- | -------------------------------------- |
| **False Positive Rate**       | $\frac{FP}{N}$   | Type I error, $1 - \text{Specificity}$ |
| **True Positive Rate**        | $\frac{TP}{P}$   | Power, Sensitivity, Recall             |
| **Positive Predictive Value** | $\frac{TP}{P^*}$ | Precision                              |
| **Negative Predictive Value** | $\frac{TN}{N^*}$ | —                                      |
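
A minimal sketch of these metrics and of the threshold tradeoff, assuming a list of true labels and posterior probabilities for the positive class. The data are invented; here $P = TP + FN$ and $N = FP + TN$ are the true class totals, while $P^* = TP + FP$ and $N^* = TN + FN$ are the predicted class totals.

```python
def confusion_metrics(y_true, posterior, threshold=0.5):
    """Classify as positive when the posterior exceeds the threshold."""
    y_pred = [1 if p > threshold else 0 for p in posterior]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and h == 1 for t, h in pairs)
    tn = sum(t == 0 and h == 0 for t, h in pairs)
    fp = sum(t == 0 and h == 1 for t, h in pairs)
    fn = sum(t == 1 and h == 0 for t, h in pairs)
    return {
        "TPR": tp / (tp + fn),   # TP / P, sensitivity
        "FPR": fp / (fp + tn),   # FP / N, 1 - specificity
        "PPV": tp / (tp + fp),   # TP / P*, precision
        "NPV": tn / (tn + fn),   # TN / N*
    }

# Hypothetical labels (1 = positive) and posterior probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
posterior = [0.05, 0.10, 0.15, 0.25, 0.30, 0.55, 0.15, 0.30, 0.60, 0.90]

at_half = confusion_metrics(y_true, posterior, threshold=0.5)
at_fifth = confusion_metrics(y_true, posterior, threshold=0.2)
print(at_half)
print(at_fifth)
```

Lowering the threshold from 0.5 to 0.2 raises the true positive rate from 0.50 to 0.75 on these data, while the false positive rate rises from about 0.17 to 0.50. The sketch does not guard against empty classes, which would cause a division by zero.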

---

**LDA $p>1$ Summary Table**

| Concept                   | Description                                                                          |
| ------------------------- | ------------------------------------------------------------------------------------ |
| **Assumption**            | Multivariate normal distribution with class-specific means and shared covariance     |
| **Discriminant function** | Linear in $\mathbf{x}$, derived from Bayes theorem                                   |
| **Classification rule**   | Assign to class with highest $\hat{\delta}_k(\mathbf{x})$                            |
| **Decision boundary**     | Linear hyperplanes between class regions                                             |
| **Parameter estimation**  | Estimate $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}$, and $\pi_k$ from training data |
| **Effect of threshold**   | Lower thresholds increase sensitivity at the cost of specificity                     |
| **ROC/AUC**               | Summarize classifier performance across all thresholds                               |
| **When LDA works well**   | When classes are approximately Gaussian and have equal covariances                   |
| **Limitations**           | Poor performance if assumptions are violated or classes are highly non-linear        |


### Quadratic Discriminant Analysis (QDA)

QDA is a generative classification model that, like LDA, assumes the feature vector $\mathbf{X} \in \mathbb{R}^p$ follows a multivariate Gaussian distribution **within each class**, but **allows a separate covariance matrix** for each class:

$$
\mathbf{X} \mid Y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
$$

* $\boldsymbol{\mu}_k$ is the class-specific mean vector.
* $\boldsymbol{\Sigma}_k$ is the **class-specific** covariance matrix.
* $\pi_k = \mathbb{P}(Y = k)$ is the prior probability for class $k$.

**QDA Discriminant Function**

Using Bayes’ theorem, the log-posterior (up to an additive constant that is the same for every class) becomes:

$$
\delta_k(\mathbf{x}) = 
- \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)
- \frac{1}{2} \log |\boldsymbol{\Sigma}_k| 
+ \log \pi_k
$$

This expands to:

$$
\delta_k(\mathbf{x}) 
= -\frac{1}{2} \mathbf{x}^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x}
+ \mathbf{x}^T \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k 
- \frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k 
- \frac{1}{2} \log |\boldsymbol{\Sigma}_k| + \log \pi_k
$$

Since this includes a **quadratic form** in $\mathbf{x}$, QDA’s decision boundaries are **quadratic surfaces**, unlike the linear boundaries in LDA.

**QDA Classification Rule**

Assign $\mathbf{x}$ to the class with the largest discriminant score:

$$
\hat{y} = \arg\max_k \delta_k(\mathbf{x})
$$

**Parameter Estimation**

From the training data:

- $\hat{\boldsymbol{\mu}}_k = \dfrac{1}{n_k} \sum_{i: y_i = k} \mathbf{x}_i$
- $\hat{\boldsymbol{\Sigma}}_k = \dfrac{1}{n_k - 1} \sum_{i: y_i = k} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_k)^T$
- $\hat{\pi}_k = \dfrac{n_k}{n}$
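
A toy NumPy sketch of QDA built from these estimates and the discriminant function above. The function names and the data are hypothetical: one tightly clustered class and one widely spread class, chosen so that the class-specific covariances differ clearly.

```python
import numpy as np

def fit_qda(X, y):
    """Per-class mean, covariance (divisor n_k - 1), and prior."""
    classes = np.unique(y)
    params = []
    for k in classes:
        Xk = X[y == k]
        mu = Xk.mean(axis=0)
        sigma = np.cov(Xk, rowvar=False)   # default divisor is n_k - 1
        params.append((mu, sigma, len(Xk) / len(X)))
    return classes, params

def qda_discriminant(x, mu, sigma, prior):
    """delta_k(x) from the QDA discriminant function above."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * diff @ np.linalg.solve(sigma, diff) - 0.5 * logdet + np.log(prior)

def predict(x, classes, params):
    scores = [qda_discriminant(x, *p) for p in params]
    return classes[int(np.argmax(scores))]

# Hypothetical data: class 0 is tight, class 1 is spread out.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [4.0, 4.0], [8.0, 4.0], [4.0, 8.0], [8.0, 8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

classes, params = fit_qda(X, y)
for point in ([0.5, 0.5], [6.0, 6.0], [-3.0, -3.0]):
    print(point, predict(np.array(point), classes, params))
```

The point $(-3, -3)$ is closer to the class 0 mean in Euclidean distance, yet it is assigned to class 1 because class 0 has a much smaller spread. This is the kind of curved boundary that a shared-covariance model cannot produce.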

**LDA vs. QDA: Bias–Variance Trade-Off**

| Model   | Covariance Assumption                | Decision Boundary | Covariance Params               | Variance | Bias                          |
|--------|----------------------------------------|-------------------|----------------------------------|----------|-------------------------------|
| **LDA** | Shared $\boldsymbol{\Sigma}$ across classes | Linear            | $\frac{p(p+1)}{2}$ total         | Low      | High (if assumption is wrong) |
| **QDA** | Class-specific $\boldsymbol{\Sigma}_k$     | Quadratic         | $K \cdot \frac{p(p+1)}{2}$ total | High     | Low (if assumption holds)     |

- **Use LDA** when $n$ is small or classes have similar covariance structure — lower variance improves generalization.
- **Use QDA** when $n$ is large or class covariances are clearly different — more flexibility reduces bias.
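
To get a feel for how quickly the covariance parameter count grows, the short sketch below evaluates the two formulas from the table for a few values of $p$. The choice of $K = 3$ is arbitrary.

```python
def lda_covariance_params(p):
    """Free parameters in one symmetric p x p covariance matrix."""
    return p * (p + 1) // 2

def qda_covariance_params(p, K):
    """QDA estimates one such matrix per class."""
    return K * lda_covariance_params(p)

for p in (2, 10, 50):
    print(p, lda_covariance_params(p), qda_covariance_params(p, K=3))
```

With $p = 50$ predictors, LDA estimates 1,275 covariance parameters while QDA with three classes estimates 3,825, which is why QDA needs considerably more data to keep its variance under control.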

**Geometric Interpretation (ISLR Fig. 4.9)**

* **Left Panel:** $\Sigma_1 = \Sigma_2$. Bayes boundary is linear. LDA matches Bayes well. QDA slightly overfits.
* **Right Panel:** $\Sigma_1 \ne \Sigma_2$. Bayes boundary is curved. QDA matches Bayes better. LDA underfits.


**Summary Table: LDA vs. QDA**

| Feature               | LDA                                | QDA                                     |
|-----------------------|-------------------------------------|------------------------------------------|
| Covariance Structure  | Common to all classes ($\Sigma$)   | Separate for each class ($\Sigma_k$)     |
| Decision Boundary     | Linear                             | Quadratic                                |
| Covariance Parameters | $\frac{p(p+1)}{2}$                 | $K \cdot \frac{p(p+1)}{2}$               |
| Variance              | Low                                | Higher                                   |
| Bias                  | High (if assumption violated)      | Low (if assumption holds)                |
| Best Use Case         | Few samples, similar class spreads | Many samples, class-specific spreads     |
| Computational Cost    | Low                                | Higher (due to matrix per class)         |

### Naive Bayes