# Section 7 - Classification Models: Naive Bayes, LDA, QDA

Goals:

- Review lecture content on classification methods;
- Better understand Naive Bayes, Linear and Quadratic Discriminant Analysis (LDA and QDA) models;
- Get a practical sense of model assessment (checking assumptions and measuring performance).

You should have downloaded:
- heart.csv
- gnb-lda-qda.png

# 1: Preprocessing

For this section we will use the [Heart Failure Clinical Records Dataset](https://archive.ics.uci.edu/ml/datasets/Heart%2Bfailure%2Bclinical%2Brecords). This dataset contains the medical records of patients who had heart failure, collected during their follow-up period. Each patient profile has 13 clinical features, followed by a label describing if the patient survived or not.

**Task:**
- Load the [Heart Failure Clinical Records Dataset](https://archive.ics.uci.edu/ml/datasets/Heart%2Bfailure%2Bclinical%2Brecords) from `heart.csv`
- Store the following quantitative variables as predictors:
    1. `age`
    2. `creatinine_phosphokinase`
    3. `ejection_fraction`
    4. `platelets`
    5. `serum_creatinine`
    6. `serum_sodium`
    - Apply a log transformation to the predictors so that their distributions look more Gaussian.
- Define `y` as the column `DEATH_EVENT` of the dataset. This is the target we want to eventually predict.

In [None]:
import numpy as np
import pandas as pd

dataset = pd.read_csv("heart.csv")

predictors = ["age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium"]
X = dataset[predictors]
X = np.log(X)

print(f"X shape: {X.shape}")
X.head()
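To see why a log transformation can help, the sketch below compares sample skewness before and after `np.log`. The data is simulated (log-normal) rather than taken from `heart.csv`, and the names `raw` and `logged` are made up for illustration. It assumes the values are strictly positive, which `np.log` requires.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic, strictly positive, right-skewed sample
# (assumption: a stand-in for a skewed lab measurement, not real data).
raw = rng.lognormal(mean=5.0, sigma=1.0, size=500)
logged = np.log(raw)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_raw = skew(raw)
skew_log = skew(logged)
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

A skewness close to zero does not prove normality, but a large value is a quick sign that a Gaussian model fits poorly.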

In [None]:
y = dataset["DEATH_EVENT"]
y

Split the data into training and test sets, using:
- test_size = 0.3
- random_state = 1234

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

X_train = X_train.reset_index(drop=True)
X_test  = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test  = y_test.reset_index(drop=True)

# 2: Model Fitting
Gaussian Naive Bayes (GNB), LDA, and QDA all classify a point $\boldsymbol{x}$ by choosing the class $C_k$ with the largest posterior probability (in our case $k=0,1$ for the death event):
$$
\text{posterior for class k} \ = \ P(C_k \lvert\,\boldsymbol{x}) \  = \ \frac{\pi(C_k)\,{\color{red}{{\cal{}L}_{\!\boldsymbol{x}}(C_k)}}}{Z}.
$$
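Here $\pi(C_k)$ is the prior and $Z$ is a normalizing constant that is the same for every class. The three models differ only in the likelihood ${\color{red}{{\cal{}L}_{\!\boldsymbol{x}}(C_k)}}$, which is Gaussian in each case but with a different covariance structure:
$$
\text{GNB: }  {\cal{}N} \left( \mu_k, {\color{orange}{D_k}} \right), \quad \text{LDA: }  {\cal{}N} \left( \mu_k, {\color{orange}{\Sigma}} \right), \quad \text{QDA: }  {\cal{}N} \left( \mu_k, {\color{orange}{\Sigma_k}} \right),
$$
where $D_k$ is a diagonal matrix (one per class), $\Sigma$ is a full covariance matrix shared by all classes, and $\Sigma_k$ is a full covariance matrix for each class.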
The models are best suited to the following types of data, in increasing complexity.

![](gnb-lda-qda.png)

$$ 
GNB: \ 
D_0 = \begin{pmatrix} 1&0\\0&1\end{pmatrix},
\
D_1 = \begin{pmatrix} 3&0\\0&0.5\end{pmatrix}
\qquad
LDA: \ 
\Sigma = \begin{pmatrix} 2&0.7\\0.7&1\end{pmatrix}
\qquad
QDA: \
\Sigma_0 = \begin{pmatrix} 2&0.7\\0.7&1\end{pmatrix},
\
\Sigma_1 = \begin{pmatrix} 1&-0.5\\-0.5&1\end{pmatrix}.
$$
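The three covariance structures can also be estimated from samples. The sketch below draws synthetic data using the QDA matrices $\Sigma_0$ and $\Sigma_1$ shown above, then computes a diagonal, a pooled, and a per-class estimate. The class means and sample sizes are arbitrary assumptions, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariances taken from the QDA example above; means are arbitrary assumptions.
Sigma0 = np.array([[2.0, 0.7], [0.7, 1.0]])
Sigma1 = np.array([[1.0, -0.5], [-0.5, 1.0]])
S_X0 = rng.multivariate_normal([0.0, 0.0], Sigma0, size=5000)
S_X1 = rng.multivariate_normal([3.0, 1.0], Sigma1, size=5000)

# QDA-style: one full covariance matrix per class.
S0 = np.cov(S_X0, rowvar=False)
S1 = np.cov(S_X1, rowvar=False)

# GNB-style: per-class variances only, off-diagonal entries set to zero.
D0 = np.diag(S_X0.var(axis=0, ddof=1))
D1 = np.diag(S_X1.var(axis=0, ddof=1))

# LDA-style: a single pooled covariance shared by both classes.
n0, n1 = len(S_X0), len(S_X1)
S_pooled = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)

print("D0 (diagonal):\n", D0.round(2))
print("S_pooled (shared):\n", S_pooled.round(2))
print("S0, S1 (per class):\n", S0.round(2), "\n", S1.round(2))
```

The pooled estimate weights each class covariance by its degrees of freedom, which is one common convention. Other weightings exist.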

**Discuss:**

What are the essential differences between the implementations of GNB, LDA, and QDA? Write them out below.

1. prior calculation? 

    **Ans:**
    
2. likelihood calculation?

    **Ans:**

3. posterior calculation? 

    **Ans:**

4. number of parameters needed to describe the model (complexity)? (Let $K$ be the number of classes and $d$ be the dimension of the data.)

    **Ans:**
    
**Task:**

Create a prediction function `predict()` that implements all three models. It should:
- Fit the prior and likelihood to the training data
    - Note: Infer prior probabilities from class proportions
- Evaluate the likelihood at the test points
- Use the posterior to predict `DEATH_EVENT` at the test points

In [None]:
from scipy.stats import multivariate_normal

def predict(X_train, y_train, X_test, model):
    # prior
    prior0 = None
    prior1 = None

    # likelihood for each class
    # (boolean masks select the training rows belonging to each class)
    X0 = X_train.loc[y_train == 0]
    X1 = X_train.loc[y_train == 1]
    
    mu0 = None
    mu1 = None
    
    if model == 'gnb':
        Sigma0 = None
        Sigma1 = None
        # print(Sigma0)  # note the dimensions of Sigma0
    elif model == 'lda':
        Sigma0 = None
        Sigma1 = None
    elif model == 'qda':
        Sigma0 = None
        Sigma1 = None

    likelihood0 = None
    likelihood1 = None

    # posterior
    # Since we want to predict the class label, we can ignore
    # the normalization factor. Just select the one with greatest
    # unnormalized posterior.
    posterior0 = None
    posterior1 = None

    return None
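As a reference for the overall flow (prior, likelihood, posterior), here is a minimal sketch for the QDA case only, run on synthetic two-dimensional data. The function `predict_qda_sketch` and the demo arrays are hypothetical and are not the intended solution to the template above. It works with log-probabilities, which is one way to avoid numerical underflow when likelihoods are very small.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_qda_sketch(X_train, y_train, X_test):
    """Hypothetical helper: two-class QDA-style prediction via log-posteriors."""
    X_train, y_train, X_test = map(np.asarray, (X_train, y_train, X_test))
    log_post = []
    for k in (0, 1):
        Xk = X_train[y_train == k]
        prior = len(Xk) / len(X_train)        # prior from class proportion
        mu = Xk.mean(axis=0)                  # class mean
        Sigma = np.cov(Xk, rowvar=False)      # full per-class covariance (QDA)
        log_lik = multivariate_normal.logpdf(X_test, mean=mu, cov=Sigma)
        # Unnormalized log-posterior; the normalizer Z is shared, so it is dropped.
        log_post.append(np.log(prior) + log_lik)
    # Pick the class with the larger unnormalized log-posterior.
    return np.argmax(np.column_stack(log_post), axis=1)

# Synthetic, well-separated classes (assumption: for illustration only).
rng = np.random.default_rng(0)
A = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=200)
B = rng.multivariate_normal([4, 4], [[1, -0.2], [-0.2, 1]], size=100)
X_demo = np.vstack([A, B])
y_demo = np.array([0] * 200 + [1] * 100)

pred_demo = predict_qda_sketch(X_demo, y_demo, np.array([[0.0, 0.0], [4.0, 4.0]]))
print(pred_demo)
```

Adapting this to GNB or LDA only requires changing how `Sigma` is built for each class.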

# 3: Model Assessment
## 3.1 Accuracy
For each model, print the accuracy on the test set.

In [None]:
def accuracy(pred_y, true_y):
    n = pred_y.shape[0]
    return 100*np.sum(pred_y == true_y) / n

gnb_acc = None
lda_acc = None
qda_acc = None

print(f"Acc. GNB Model: {np.round(gnb_acc, 2)}%")
print(f"Acc. LDA Model: {np.round(lda_acc, 2)}%")
print(f"Acc. QDA Model: {np.round(qda_acc, 2)}%")
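When reading accuracy numbers, it can help to compare them with a trivial baseline that always predicts the most frequent class. The sketch below repeats the `accuracy` helper from the cell above so that it runs on its own, and applies it to made-up label arrays. `true_toy` and `pred_toy` are illustrative, not model output.

```python
import numpy as np

def accuracy(pred_y, true_y):
    # Percentage of predictions that match the true labels.
    n = pred_y.shape[0]
    return 100 * np.sum(pred_y == true_y) / n

# Made-up labels and predictions (assumption: for illustration only).
true_toy = np.array([0, 0, 0, 1, 1])
pred_toy = np.array([0, 0, 0, 1, 0])
acc_toy = accuracy(pred_toy, true_toy)  # 4 of 5 correct

# Baseline: always predict the most frequent class in the true labels.
majority = np.bincount(true_toy).argmax()
baseline_acc = accuracy(np.full_like(true_toy, majority), true_toy)

print(f"toy accuracy: {acc_toy}%, majority-class baseline: {baseline_acc}%")
```

With imbalanced classes, a model needs to beat this baseline before its accuracy says much.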

## 3.2 Is the data appropriate for the models? Check assumptions.
### 3.2.1 Check independence/covariance

In [None]:
# data corresponding to each class (boolean masks select rows by label)
X0 = X.loc[y == 0]
X1 = X.loc[y == 1]

#### GNB
- Predictors are **independent**.

**Discuss:** 
1. Independence implies no correlation. Does no correlation imply independence? 
    
    **Ans:** 

2. How can we check/gauge independence using correlation?

    **Ans:** 

3. What does the correlation matrix suggest about the (in)dependence of the predictors?

    **Ans:** 

In [None]:
# GNB: Check the correlation matrix of the predictors within each class.
# You may use the pandas.DataFrame.corr() function.
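As a sketch of what a within-class correlation check looks like, the example below uses a small synthetic `DataFrame` with one correlated pair of columns and one independent column. The names `toy` and `toy_label` and the column names are made up and unrelated to `heart.csv`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)

# Synthetic predictors (assumption: for illustration only).
toy = pd.DataFrame({
    "a": a,
    "b": 0.8 * a + 0.6 * rng.normal(size=n),  # built to correlate with "a"
    "c": rng.normal(size=n),                  # generated independently of "a" and "b"
})
toy_label = pd.Series(rng.integers(0, 2, size=n))  # random binary class labels

# Correlation matrix of the predictors within each class.
for k in (0, 1):
    corr_k = toy[toy_label == k].corr()
    print(f"Class {k} correlation matrix:")
    print(corr_k.round(2))
```

Off-diagonal entries far from zero point to dependence between predictors. Entries near zero are consistent with independence but do not establish it.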


#### LDA
- Predictors are **not necessarily independent**, but it is assumed that the **covariance matrix is the same** for each class.

#### QDA
- Predictors are **not necessarily independent**, and the **covariance matrix is not necessarily the same** for each class.

**Discuss:**
- What are some ways to compare the closeness of covariance matrices? Is simply computing them and checking if each entry is exactly the same a fair comparison? What kind of tolerance seems appropriate?


In [None]:
# LDA: Check the covariance matrix within each class and see if they are the same.
# You may use the pandas.DataFrame.cov() function.
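One possible way to quantify how close two covariance matrices are is a relative Frobenius-norm difference. To judge what counts as "large", compare it with the difference between two samples drawn from the same distribution. The helper `relative_cov_difference` and the synthetic samples below are illustrative assumptions, not a standard test.

```python
import numpy as np

def relative_cov_difference(S_a, S_b):
    """Hypothetical helper: Frobenius norm of (S_a - S_b), scaled by the average norm."""
    S_a = np.asarray(S_a, dtype=float)
    S_b = np.asarray(S_b, dtype=float)
    scale = 0.5 * (np.linalg.norm(S_a) + np.linalg.norm(S_b))
    return np.linalg.norm(S_a - S_b) / scale

rng = np.random.default_rng(0)
Sigma_same = np.array([[2.0, 0.7], [0.7, 1.0]])
Sigma_other = np.array([[1.0, -0.5], [-0.5, 1.0]])

# Two samples sharing one covariance, and a third with a different one.
C_a = np.cov(rng.multivariate_normal([0, 0], Sigma_same, size=500), rowvar=False)
C_b = np.cov(rng.multivariate_normal([0, 0], Sigma_same, size=500), rowvar=False)
C_c = np.cov(rng.multivariate_normal([0, 0], Sigma_other, size=500), rowvar=False)

noise_level = relative_cov_difference(C_a, C_b)  # sampling noise only
real_diff = relative_cov_difference(C_a, C_c)    # genuinely different covariances
print(f"same covariance: {noise_level:.3f}, different covariance: {real_diff:.3f}")
```

The point of the comparison is that estimated covariances are never exactly equal, so a tolerance should be set relative to the sampling noise expected at the given sample size.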

**Discuss:**
Based on all the checks we did, how do you explain the differences in accuracy between GNB, LDA, and QDA in 3.1?

**Ans:** 

### 3.2.2 Check if the data is normally distributed
Run the code cells below. 

**Discuss:**
- What does each plot represent? How is it computed?
- Does the data look normally distributed? 
- How does that affect the appropriateness of using GNB/LDA/QDA? What are some reasons for or against using them?


In [None]:
# GNB, LDA, and QDA: Check if the predictors follow a Gaussian distribution within each class.
import matplotlib.pyplot as plt

_, ax = plt.subplots(2,3, figsize=(12,8))
for i in range(X0.shape[1]):
    row, col = divmod(i, 3)  # fill the 2x3 grid in reading order
    ax[row, col].hist(X0.iloc[:, i])
    ax[row, col].set_title(f"Class 0: {X0.columns[i]}")
plt.show()

In [None]:
_, ax = plt.subplots(2,3, figsize=(12,8))
for i in range(X1.shape[1]):
    row, col = divmod(i, 3)  # fill the 2x3 grid in reading order
    ax[row, col].hist(X1.iloc[:, i])
    ax[row, col].set_title(f"Class 1: {X1.columns[i]}")
plt.show()
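Histograms give a visual impression. A numerical complement is the correlation coefficient of a normal probability (Q-Q) plot, which is close to 1 when the sample quantiles line up with Gaussian quantiles. The sketch below applies `scipy.stats.probplot` to synthetic samples. The arrays `gaussian_like` and `skewed` are made up and stand in for a single predictor within one class.

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(0)

# Synthetic samples (assumption: stand-ins for one predictor within one class).
gaussian_like = rng.normal(loc=0.0, scale=1.0, size=500)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# probplot returns the ordered quantile pairs and a least-squares fit
# (slope, intercept, r); r measures how straight the Q-Q plot is.
(_, _), (_, _, r_gauss) = probplot(gaussian_like, dist="norm")
(_, _), (_, _, r_skew) = probplot(skewed, dist="norm")

print(f"Q-Q correlation, Gaussian sample: {r_gauss:.3f}")
print(f"Q-Q correlation, skewed sample:   {r_skew:.3f}")
```

This checks each predictor marginally. Marginal normality of every predictor is necessary but not sufficient for the joint Gaussian assumption behind LDA and QDA.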