# Section 6: LDA, QDA and Naive Bayes


## 1
In this section, we use LDA, QDA, and Naive Bayes to predict death from heart failure caused by cardiovascular diseases (CVDs). Our models use six predictive features: age, creatinine phosphokinase, ejection fraction, platelets, serum creatinine, and serum sodium. The target takes the label 1 for death and 0 for no death.
- More details on the dataset: [Heart Failure](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data?select=heart_failure_clinical_records_dataset.csv)

In [1]:
# Load the dataset.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

heart_data = pd.read_csv('sec06.csv')

# The six predictive features used throughout this section.
index = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
         'platelets', 'serum_creatinine', 'serum_sodium']

X_data = heart_data[index]
print(X_data)

# The target (1 = death, 0 = no death) is read from the last column of the file.
y_data = heart_data.iloc[:, -1].values.ravel()

      age  creatinine_phosphokinase  ejection_fraction  platelets  \
0    75.0                       582                 20  265000.00   
1    55.0                      7861                 38  263358.03   
2    65.0                       146                 20  162000.00   
3    50.0                       111                 20  210000.00   
4    65.0                       160                 20  327000.00   
..    ...                       ...                ...        ...   
294  62.0                        61                 38  155000.00   
295  55.0                      1820                 38  270000.00   
296  45.0                      2060                 60  742000.00   
297  45.0                      2413                 38  140000.00   
298  50.0                       196                 45  395000.00   

     serum_creatinine  serum_sodium  
0                 1.9           130  
1                 1.1           136  
2                 1.3           129  
3                 1

## 1.1

- Scale the data and split it into training and test datasets with a 30% test size and random_state=0.
- Next, fit LDA and QDA models to the training dataset and compare their performance on the test dataset using confusion matrices and prediction accuracy scores.
- Finally, explain why your prediction accuracy scores are high or low.

See more on confusion matrices here: [Confusion Matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix). See more on LDA/QDA here: [LDA/QDA](https://scikit-learn.org/stable/modules/lda_qda.html#lda-qda).
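The steps in this task can be sketched as follows. This is a minimal illustration, not a reference solution: it runs on synthetic data from `make_classification` (an assumption, used so the block is self-contained), and the names `X_demo`, `y_demo`, and `results` are hypothetical. To apply it to the notebook, substitute `X_data` and `y_data`. The scaler is fit on the training split only, which is one common choice for avoiding leakage from the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six clinical features (assumption:
# replace X_demo / y_demo with X_data / y_data from the notebook).
X_demo, y_demo = make_classification(
    n_samples=299, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.68, 0.32], random_state=0,
)

# 30% test size and random_state=0, as the task requests.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Learn the scaling from the training split, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

results = {}
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    model.fit(X_tr_s, y_tr)
    y_pred = model.predict(X_te_s)
    results[name] = {
        # Rows are true labels, columns are predicted labels.
        "confusion": confusion_matrix(y_te, y_pred),
        "accuracy": accuracy_score(y_te, y_pred),
    }
    print(name, "accuracy:", round(results[name]["accuracy"], 3))
    print(results[name]["confusion"])
```

When reading the confusion matrices, the off-diagonal entries show which kind of error each model makes, which accuracy alone hides when the classes are imbalanced.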

In [2]:
y_data

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## 1.2

- Re-train the LDA and QDA models on the two features 'age' and 'ejection_fraction'.
- Plot the decision boundaries of both models at a MAP threshold of 0.5.
- Make a scatterplot of the test dataset and highlight the points on which the LDA and QDA predictions differ.
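One way to approach this task is sketched below, under stated assumptions: the two-feature data is synthetic (a stand-in for scaled 'age' and 'ejection_fraction'), and the names `X2_train`, `X2_test`, `lda2`, `qda2`, and `disagree` are hypothetical. For a binary problem, the MAP rule with threshold 0.5 is the contour where the posterior P(y=1 | x) equals 0.5, so each boundary is drawn as a single contour level over a grid of posterior values.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)

# Synthetic two-feature data (assumption: stands in for the scaled
# 'age' and 'ejection_fraction' columns of the real dataset).
X2_train = np.vstack([
    rng.normal([0.0, 0.5], [1.0, 1.0], size=(140, 2)),   # class 0
    rng.normal([0.6, -0.6], [1.2, 0.6], size=(70, 2)),   # class 1
])
y2_train = np.r_[np.zeros(140, dtype=int), np.ones(70, dtype=int)]
X2_test = np.vstack([
    rng.normal([0.0, 0.5], [1.0, 1.0], size=(60, 2)),
    rng.normal([0.6, -0.6], [1.2, 0.6], size=(30, 2)),
])

lda2 = LinearDiscriminantAnalysis().fit(X2_train, y2_train)
qda2 = QuadraticDiscriminantAnalysis().fit(X2_train, y2_train)

# Posterior P(y=1 | x) evaluated on a regular grid.
gx, gy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
grid = np.c_[gx.ravel(), gy.ravel()]
p_lda = lda2.predict_proba(grid)[:, 1].reshape(gx.shape)
p_qda = qda2.predict_proba(grid)[:, 1].reshape(gx.shape)

# Test points on which the two models predict different labels.
disagree = lda2.predict(X2_test) != qda2.predict(X2_test)

fig, ax = plt.subplots(figsize=(6, 5))
ax.contour(gx, gy, p_lda, levels=[0.5], colors="tab:blue")
ax.contour(gx, gy, p_qda, levels=[0.5], colors="tab:red")
# Empty line plots act as legend entries for the two contours.
ax.plot([], [], color="tab:blue", label="LDA boundary")
ax.plot([], [], color="tab:red", label="QDA boundary")
ax.scatter(X2_test[~disagree, 0], X2_test[~disagree, 1],
           c="grey", s=15, label="same prediction")
ax.scatter(X2_test[disagree, 0], X2_test[disagree, 1],
           c="gold", edgecolors="k", s=60, label="LDA and QDA disagree")
ax.set_xlabel("age (scaled)")
ax.set_ylabel("ejection_fraction (scaled)")
ax.legend()
```

The LDA contour is a straight line because both classes share one covariance matrix, while the QDA contour is a conic section because each class has its own. The highlighted points fall in the regions between the two curves.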

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
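# 70/30 split. No random_state is passed here, so the selected rows change between runs; task 1.1 asks for random_state=0.
x_train, x_test, y_train, y_test = train_test_split(X_data, y_data, train_size=0.7)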

## 2.1
- Normalize the data and split it into training and test sets with test_size=0.4 and random_state=636.
- Manually fit a Naive Bayes model using two features: serum_creatinine and serum_sodium. Use a flat prior.
- Print the model accuracy.
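A manual Gaussian Naive Bayes fit might look like the sketch below. It assumes each feature is Gaussian within each class, which the task does not state explicitly, and it uses synthetic two-feature data in place of serum_creatinine and serum_sodium. The helper names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical. With a flat prior, P(y=0) = P(y=1), so the prior term is constant across classes and the prediction reduces to comparing class-conditional likelihoods.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def fit_gaussian_nb(X, y):
    """Estimate a per-class mean and variance for every feature."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, means, variances


def predict_gaussian_nb(X, classes, means, variances):
    """Pick the class with the largest summed per-feature log-likelihood."""
    diff = X[:, None, :] - means[None, :, :]          # shape (n, k, d)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)
                      + diff ** 2 / variances).sum(axis=2)
    # Flat prior: log P(y=c) is the same constant for every class,
    # so it drops out of the argmax.
    return classes[np.argmax(log_lik, axis=1)]


# Synthetic stand-in for the two features (assumption, not the real data).
rng = np.random.default_rng(636)
X_nb = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                  rng.normal(3.0, 1.0, size=(100, 2))])
y_nb = np.r_[np.zeros(200, dtype=int), np.ones(100, dtype=int)]

X_nb_tr, X_nb_te, y_nb_tr, y_nb_te = train_test_split(
    X_nb, y_nb, test_size=0.4, random_state=636
)

# Normalize with statistics from the training split only.
mu, sd = X_nb_tr.mean(axis=0), X_nb_tr.std(axis=0)
X_nb_tr_n, X_nb_te_n = (X_nb_tr - mu) / sd, (X_nb_te - mu) / sd

params = fit_gaussian_nb(X_nb_tr_n, y_nb_tr)
nb_accuracy = np.mean(predict_gaussian_nb(X_nb_te_n, *params) == y_nb_te)
print("Naive Bayes accuracy:", round(nb_accuracy, 3))
```

Working in log space avoids multiplying many small densities together, and summing over features is exactly where the conditional-independence assumption enters.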

In [5]:
x_train

|  | age | creatinine_phosphokinase | ejection_fraction | platelets | serum_creatinine | serum_sodium |
|---|---|---|---|---|---|---|
| 32 | 50.0 | 249 | 35 | 319000.00 | 1.0 | 128 |
| 48 | 80.0 | 553 | 20 | 140000.00 | 4.4 | 133 |
| 121 | 66.0 | 68 | 38 | 162000.00 | 1.0 | 136 |
| 213 | 48.0 | 131 | 30 | 244000.00 | 1.6 | 130 |
| 8 | 65.0 | 157 | 65 | 263358.03 | 1.5 | 138 |
| ... | ... | ... | ... | ... | ... | ... |
| 206 | 40.0 | 101 | 40 | 226000.00 | 0.8 | 141 |
| 291 | 60.0 | 320 | 35 | 133000.00 | 1.4 | 139 |
| 28 | 58.0 | 60 | 38 | 153000.00 | 5.8 | 134 |
| 124 | 60.0 | 582 | 40 | 217000.00 | 3.7 | 134 |


## 2.2
Plot the covariance matrix of all features and explain whether a Naive Bayes model built on all features would perform well.
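A possible starting point for this task is sketched below. The data is synthetic, and the dependence injected between the last two columns is entirely made up for illustration; it says nothing about the real dataset. The names `X_cov_demo`, `X_std`, and `cov` are hypothetical. Standardizing first makes the covariance matrix equal to the correlation matrix, so off-diagonal entries are comparable across features with very different units, such as platelets and serum_creatinine.

```python
import numpy as np
import matplotlib.pyplot as plt

feature_names = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
                 'platelets', 'serum_creatinine', 'serum_sodium']

rng = np.random.default_rng(0)
# Synthetic stand-in for X_data (assumption). The last column is made
# to depend on the fifth purely for illustration (hypothetical).
X_cov_demo = rng.normal(size=(299, 6))
X_cov_demo[:, 5] = -0.5 * X_cov_demo[:, 4] + rng.normal(scale=0.8, size=299)

# Standardize each feature, then compute the covariance matrix.
# Both steps use ddof=1, so the diagonal is exactly 1.
X_std = (X_cov_demo - X_cov_demo.mean(axis=0)) / X_cov_demo.std(axis=0, ddof=1)
cov = np.cov(X_std, rowvar=False)

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(cov, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(feature_names)))
ax.set_xticklabels(feature_names, rotation=90)
ax.set_yticks(range(len(feature_names)))
ax.set_yticklabels(feature_names)
fig.colorbar(im, ax=ax, label="covariance of standardized features")
fig.tight_layout()
```

Naive Bayes assumes the features are independent given the class, so large off-diagonal entries are a warning sign. A stricter check would compute the matrix separately within each class, since the assumption is conditional on the label.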

In [6]:
x_test

|  | age | creatinine_phosphokinase | ejection_fraction | platelets | serum_creatinine | serum_sodium |
|---|---|---|---|---|---|---|
| 45 | 50.0 | 582 | 38 | 310000.00 | 1.90 | 135 |
| 176 | 69.0 | 1419 | 40 | 105000.00 | 1.00 | 135 |
| 288 | 65.0 | 892 | 35 | 263358.03 | 1.10 | 142 |
| 184 | 58.0 | 145 | 25 | 219000.00 | 1.20 | 137 |
| 202 | 70.0 | 97 | 60 | 220000.00 | 0.90 | 138 |
| ... | ... | ... | ... | ... | ... | ... |
| 153 | 50.0 | 1846 | 35 | 263358.03 | 1.18 | 137 |
| 279 | 55.0 | 84 | 38 | 451000.00 | 1.30 | 136 |
| 23 | 53.0 | 63 | 60 | 368000.00 | 0.80 | 135 |
| 251 | 55.0 | 572 | 35 | 231000.00 | 0.80 | 143 |
