# Homework 5 - Naive Bayes

Make sure you have downloaded:
- heart_processed_log.csv

This homework asks you to implement naive Bayes using a custom likelihood and then compare it against custom-built LDA and QDA implementations.

The implementation of naive Bayes here is slightly different from lecture and section.
- It is streamlined to take advantage of vector multiplications and numpy functions, which helps if we want to scale up our naive Bayes prediction to higher dimensions.
- However, you may need to familiarize yourself with the "dictionary" data structure.
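
If dictionaries are new to you, the sketch below shows the pattern used later in this homework: integer keys that map to stored objects. The names `make_scaler` and `models` are hypothetical and exist only for this illustration.

```python
# A dictionary maps keys to values; here integer keys map to callables.
def make_scaler(factor):
    # Return a function that multiplies its input by `factor`.
    return lambda value: factor * value

models = {}
for i in range(3):
    models[i] = make_scaler(i + 1)   # store one object per index

# Look up an entry by key, then use the stored object.
results = [models[i](10) for i in range(3)]
print(results)  # [10, 20, 30]
```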

Before attempting this homework, make sure you understand the broad strokes of naive Bayes. This will make your coding and debugging much smoother.

## 0 Data
Load `heart_processed_log.csv`, derived from the [Heart Failure Clinical Records Dataset](https://archive.ics.uci.edu/ml/datasets/Heart%2Bfailure%2Bclinical%2Brecords). It contains various predictors (in log scale) for predicting the death event `DEATH_EVENT`.

In [71]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Before submitting your homework, remember to set:
- random_state = 0

In [None]:
dataset = pd.read_csv("heart_processed_log.csv", index_col=0)
display(dataset)

X = dataset.drop("DEATH_EVENT", axis=1).values
y = dataset["DEATH_EVENT"].values

# split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# print the shapes of the training and testing sets
print('train shapes:')
print('\t X_train ->', X_train.shape)
print('\t y_train ->', y_train.shape)

print('test shapes:')
print('\t X_test ->', X_test.shape)
print('\t y_test ->', y_test.shape)

Recall: naive Bayes chooses the class $C_k$ that maximizes the posterior
$$
P(C_k \lvert\,{x}) = \frac{\pi(C_k)\,{\cal{}L}_{\!{x}}(C_k)}{Z}.
$$
Since the normalizing constant $Z$ does not depend on $k$, it suffices to maximize the numerator. We also assume that all $d$ features $x_i$ are independent given the class (the "naive" assumption). So we want to find the $k$ given by
$$
\arg\max_k \, \pi(C_k)\,{\cal{}L}_{\!{x}}(C_k) \quad = \quad \arg\max_k \, \left( \pi(C_k)\,\prod_{i=1}^d p(x_i \lvert C_k) \right).
$$
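
As a small worked illustration of this rule, the sketch below uses made-up numbers (two classes, three features, a single data point); none of the values come from the homework data.

```python
import numpy as np

# Hypothetical numbers: 2 classes, 3 features, one data point.
prior = np.array([0.7, 0.3])                 # pi(C_0), pi(C_1)
per_feature = np.array([[0.5, 0.2, 0.1],     # p(x_i | C_0), i = 1..3
                        [0.4, 0.6, 0.5]])    # p(x_i | C_1), i = 1..3

likelihood = per_feature.prod(axis=1)        # product over the features
score = prior * likelihood                   # unnormalized posterior
k_best = int(np.argmax(score))               # class with the largest score
print(score, k_best)
```

Here the likelihoods are 0.01 and 0.12, so class 1 wins even though its prior is smaller.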

## 1 Custom Naive Bayes Classifier with KDE
You will create a naive Bayes classifier:
- using the training data
- with KDE to approximate the likelihood
- with Bernoulli as the prior

**Use only the training data ```X_train, y_train``` to fit the naive Bayes classifier.**

### 1.1 Prior
1. [2 pt] Compute ```prior```, a two element array. 
    - prior[0] is the probability of death event 0, $\pi(C_0)$
    - prior[1] is the probability of death event 1, $\pi(C_1)$ 
    - You should construct the prior probabilities based on frequency of death events from the training data. 
    - Tip: Use np.unique() with return_counts.
2. [1 pt] Print ```prior```.
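
For reference, this is how `np.unique` behaves with `return_counts=True` on a toy label array (not the homework data):

```python
import numpy as np

labels = np.array([0, 1, 0, 0, 1, 0, 0, 1])   # toy labels for illustration
classes, counts = np.unique(labels, return_counts=True)
print(classes)  # [0 1]
print(counts)   # [5 3]
```

The returned values are sorted, so `counts[k]` lines up with class `k` here.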

In [None]:
prior = None      # TODO

print('The prior probabilities are:', prior)

### 1.2 Likelihood (KDE)
1. [2 pt] Define dictionaries `kde0` and `kde1` which fulfill the following:
    - `kde0[i]` is the KDE object (created by calling `scipy.stats.gaussian_kde`) for feature `i` when the death event is 0. `kde1[i]` is defined likewise.
    - Make sure you index the correct rows of `X_train` when defining kdes.
    - Use bandwidth method 'scott'. (For fun, you can try 'silverman' and see what difference in result you get.)
    - As with all arrays you throw into sklearn or scipy, you may need to take transposes.
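
The sketch below shows the basic `gaussian_kde` workflow on synthetic data, including the input-shape convention that makes transposes necessary. The arrays are random toy data, and the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=200)   # toy 1-D data

# A 1-D input is treated as a single feature.
kde = gaussian_kde(samples, bw_method='scott')
grid = np.array([-1.0, 0.0, 1.0])
density = kde.pdf(grid)                  # one density value per query point
print(density.shape)  # (3,)

# For multi-dimensional input, gaussian_kde expects shape (# dims, # data),
# so an (n, d) array has to be transposed first.
data_nd = rng.normal(size=(200, 2))
kde_2d = gaussian_kde(data_nd.T, bw_method='scott')
print(kde_2d.d, kde_2d.n)  # 2 200
```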

In [74]:
from scipy.stats import gaussian_kde
kde0 = {} 
kde1 = {} 

for i in range(X_train.shape[1]):
    kde0[i] = None # TODO
    kde1[i] = None # TODO

# display(kde1) # Use this to check what you made. swap kde0 for kde1 if you want

2. [2 pt] Complete the code for ```compute_likelihood``` function.
    - The objects kde0[i] and kde1[i] have a method .pdf(), which you will use when computing the likelihood.
        - Read the documentation to understand how it works.
    - `likelihood0[j]` is the likelihood of seeing the $j$-th data point ${x_j} = \left({x_j}_1, \dots, {x_j}_d\right)$ for death event 0, i.e., ${\cal{}L}_{{x_j}}(C_0) = \prod_{i=1}^d p({x_j}_i \lvert C_0)$
    - `likelihood1[j]` defined likewise.
    - You can loop over the kde objects kde[i] to populate the likelihood arrays.
    - Be careful with the shape of arrays. Print shapes as necessary when debugging.

(Your solution shouldn't be very complicated. A working solution needs only about 5-10 lines of code.)

In [75]:
def compute_likelihood(x, kde0, kde1):
    # input:    x, a (# data) by (# features) array of test data
    #           kde0 and kde1, dictionaries that will be used to compute the likelihood
    # output:   likelihood, a (# data) by (# classes) array. 
    #           likelihood[j,k] is the likelihood of data j given class k
    
    # likelihood0[j] is the likelihood of data j given class 0. Analogously for likelihood1
    likelihood0 = np.ones(x.shape[0])    
    likelihood1 = np.ones(x.shape[0])    

    for i in range(x.shape[1]):
        likelihood0 *= None # TODO
        likelihood1 *= None # TODO

    likelihood = np.vstack((likelihood0, likelihood1)).T
    
    return likelihood

### 1.3 Posterior
1. [2 pt] Complete the code for ```compute_posterior``` function. 
    - It should include calling the function ```compute_likelihood```.

In [76]:
def compute_posterior(x, prior, kde0, kde1):
    # input:    x, a (# data) by (# features) array of test data
    #           prior, a 1 by 2 array
    #           kde0 and kde1, kde dictionaries that will be used to compute the likelihood
    # output:   posterior, a (# data) by (# classes) array

    likelihood = None # TODO
    posterior = None # TODO
    
    return posterior

### 1.4 Combine prior, likelihood, posterior
Now we are ready to piece together all the code we prepared above.
1. [2 pt] Complete the code for ```naive_bayes_predict```.
    - Your code should include calling the ```compute_posterior``` function.
    - Computing `y_pred` should take a single line of code. Consider using a numpy function that finds the index of the largest entry in every row.
2. [1 pt] Complete the code for ```print_success_rates```.
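
As a reminder of how row-wise index lookups and element-wise comparisons work in numpy, here is a toy sketch. The arrays `scores` and `truth` are made up for illustration.

```python
import numpy as np

scores = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.6, 0.4]])          # toy (# data) by (# classes) array
row_argmax = np.argmax(scores, axis=1)   # index of the largest entry in each row
print(row_argmax)  # [0 1 0]

truth = np.array([0, 1, 1])              # toy labels
n_match = int(np.sum(row_argmax == truth))   # count positions that agree
print(n_match, len(truth))  # 2 3
```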

In [77]:
def naive_bayes_predict(x, prior, kde0, kde1):
    # input:    x, a (# data) by (# features) array
    #           prior, a 1 by 2 array
    #           kde0 and kde1, kde dictionaries that will be used to compute the likelihood
    # output:   y_pred, an array of length (# data)

    posterior = None # TODO
    y_pred = None # TODO
    
    return y_pred

def print_success_rates(y_true,y_pred):
    n_success = None   # TODO
    n_total   = None    # TODO
    print("Number of correctly labeled points: %d of %d.  Accuracy: %.2f" 
        % (n_success, n_total, n_success/n_total))

### 1.5 Predict
1. [1 pt] Use your custom naive Bayes to:
    - predict *TRAINING* data
    - print the results with ```print_success_rates```

In [None]:
# TODO predict training data and print


2. [1 pt] Use your custom naive Bayes to:
    - predict *TEST* data
    - print the results with ```print_success_rates```

In [None]:
# TODO predict test data and print


## Discussion
### 1.6 random_state = 0
Use random_state=0 and respond to the following question.

[2 pt] For **custom NB**, what is the difference between the training and test accuracy? Give an explanation for why it might be so.
    
**Ans:**  

### 1.7 change random_state
Now, experiment with a range of random_state and respond to the following question.

[2 pt] Do your responses to 1.6 change? If so, describe how your responses change and why you changed them.
- (You do not need to artificially adjust your response to 1.6 to fit any new findings you made after changing random_state.)

**Ans:** 

## 2 LDA and QDA

In this section you will demonstrate your understanding of LDA and QDA by completing the following functions. But first a quiz!

### Discussion

#### 2.1 

[2 pt] What is the main difference between the assumptions of GNB, LDA, and QDA? Explain.

**Ans**  

#### 2.2 

[2 pt] Which method is the most computationally expensive? Explain.

**Ans** 

In [92]:
# Run this cell
from scipy.stats import multivariate_normal

### 2.3 LDA Implementation

Complete the following code block to implement the `lda_predict` function. Most of the variables have already been named; you just need to assign them.

**Task**

- [1 pt] Calculate the prior of each class, assuming a uniform prior
- [$\frac{1}{2}$ pt] Split the training data into two separate class-specific datasets
- [1 pt] Calculate the mean of each class using `np.mean`
- [1 pt] Calculate the covariance matrix for each class using `np.cov`
- [1 pt] Calculate the likelihood of each class using `multivariate_normal.pdf()`
- [1 pt] Calculate the posterior for each class
- [$\frac{1}{2}$ pt] Return the predicted classifications

HINTS:
- Be careful with transposes and axis declarations 
    - Make sure to check the shapes of your calculated matrices to make sure they make sense in the context of the problem
- `np.where` might be helpful... https://numpy.org/doc/2.0/reference/generated/numpy.where.html
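
To make the shape and axis conventions concrete, the sketch below runs `np.mean`, `np.cov`, and `multivariate_normal.pdf` on random toy data. It only illustrates the library calls; which covariance matrix is appropriate for each method is left to you.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))            # toy data: 50 samples, 3 features

mean = np.mean(data, axis=0)               # shape (3,): one mean per feature
cov = np.cov(data, rowvar=False)           # shape (3, 3): columns are the variables
print(mean.shape, cov.shape)  # (3,) (3, 3)

queries = rng.normal(size=(4, 3))          # 4 points at which to evaluate the density
dens = multivariate_normal.pdf(queries, mean=mean, cov=cov)
print(dens.shape)  # (4,)
```

Note that `np.cov` treats rows as variables by default, so `rowvar=False` (or a transpose) is needed when samples are stored in rows.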

In [81]:
def lda_predict(X_train, y_train, X_test):
    
    prior_class_0 = None # TODO: prior probability of class 0
    prior_class_1 = None # TODO: prior probability of class 1

    X_class_0 = None # TODO: Separate X_train by class
    X_class_1 = None # TODO:
    
    mu_class_0 = None # TODO: Mean
    mu_class_1 = None # TODO: Mean

    sigma_class_0 = None # TODO: Proper covariance matrix for LDA
    sigma_class_1 = None # TODO: Proper covariance matrix for LDA

    likelihood_class_0 = None # TODO: Calculate likelihood
    likelihood_class_1 = None # TODO: Calculate likelihood

    posterior_class_0 = None # TODO: Posterior 
    posterior_class_1 = None # TODO: Posterior

    return None # TODO: return predicted class

### 2.4 QDA Implementation

Complete the following code block to implement the `qda_predict` function. Most of the variables have already been named; you just need to assign them. You can repeat much of the same code from LDA. Just don't get too overzealous and forget something ;)

**Task**

- [1 pt] Calculate the prior of each class, assuming a uniform prior
- [$\frac{1}{2}$ pt] Split the training data into two separate class-specific datasets
- [1 pt] Calculate the mean of each class using `np.mean`
- [1 pt] Calculate the covariance matrix for each class using `np.cov`
- [1 pt] Calculate the likelihood of each class using `multivariate_normal.pdf()`
- [1 pt] Calculate the posterior for each class
- [$\frac{1}{2}$ pt] Return the predicted classifications

HINTS:
- Be careful with transposes and axis declarations 
    - Make sure to check the shapes of your calculated matrices to make sure they make sense in the context of the problem
- `np.where` might be helpful... https://numpy.org/doc/2.0/reference/generated/numpy.where.html
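
For reference, this is how the three-argument form of `np.where` selects values element-wise. The arrays `a` and `b` are toy values chosen for illustration.

```python
import numpy as np

a = np.array([0.2, 0.7, 0.5])
b = np.array([0.4, 0.1, 0.5])
picked = np.where(a > b, 0, 1)   # 0 where a is larger, otherwise 1 (ties give 1)
print(picked)  # [1 0 1]
```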

In [87]:
def qda_predict(X_train, y_train, X_test):
    prior_class_0 = None # TODO: prior probability of class 0
    prior_class_1 = None # TODO: prior probability of class 1

    X_class_0 = None # TODO: Separate X_train by class
    X_class_1 = None # TODO:
    
    mu_class_0 = None # TODO: Mean
    mu_class_1 = None # TODO: Mean

    sigma_class_0 = None # TODO: Proper covariance matrix for QDA
    sigma_class_1 = None # TODO: Proper covariance matrix for QDA

    likelihood_class_0 = None # TODO: Calculate likelihood
    likelihood_class_1 = None # TODO: Calculate likelihood

    posterior_class_0 = None # TODO: Posterior 
    posterior_class_1 = None # TODO: Posterior

    return None # TODO: return predicted class

### 2.5 Compare

**Task**

- [1 pt] Assign your predicted y values to `y_pred_lda` and `y_pred_qda`
- [1 pt] Print their success rates with the `print_success_rates` from above

In [None]:
y_pred_lda = None # TODO
y_pred_qda = None # TODO

# TODO: Print success rates

### 2.6 Discuss

[2 pt] Note the success rates of the two methods. Are they the same or different? If they are the same, does that mean that they made exactly the same predictions for each sample? Use the code block below to support your answer.

**Ans** 

In [None]:
# TODO: Code for 2.6

## <span style="color:red"> Before submitting your hw, set train test split to random_state=0. Restart kernel and rerun all cells. </span>