# Heart disease prediction - ML model development

This experiment follows these steps to produce an ML model that predicts the risk of heart disease:

* Load, analyse data and split into train and test
* Setup the experiment
* Choose the best model
* Build best model
* Analyse best model
* Predict on the validation (test) set
* Analyze prediction on the validation (test) set
* Conclusion

## Load, analyse data and split into train and test

Load data

In [None]:
import pandas as pd
import numpy as np

global_seed = 1236

raw = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')
raw = raw.dropna()  # dropna() returns a new frame, so keep the result
raw.head()
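`DataFrame.dropna()` returns a cleaned copy rather than modifying the frame in place, so the result has to be assigned for the rows to actually be dropped. A minimal sketch on a toy frame (the values below are invented for illustration and are not taken from `heart.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the values are invented for illustration
toy = pd.DataFrame({"age": [63, np.nan, 41], "target": [1, 0, 1]})

toy.dropna()                      # returns a cleaned copy, which is discarded here
rows_without_assign = len(toy)    # the original frame is unchanged

toy = toy.dropna()                # assigning the result keeps the cleaned frame
rows_with_assign = len(toy)

print(rows_without_assign, rows_with_assign)
```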

Analyse data

In [None]:
for column in raw.columns:
    print(f"{column}: {raw[column].unique()}")
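Beyond listing the unique values, it can help to see the type, cardinality and missing-value count of every column in one table. The sketch below builds such a summary for a small stand-in frame; `toy` and `summary` are hypothetical names, and the same three calls should work on `raw`:

```python
import pandas as pd

# Stand-in for the raw data; the values are invented for illustration
toy = pd.DataFrame({"sex": [1, 0, 1, 1], "chol": [233, 250, 204, 236], "target": [1, 0, 1, 0]})

# One row per column: dtype, number of distinct values, number of missing values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "n_missing": toy.isna().sum(),
})
print(summary)
```

Columns with only a handful of distinct values are candidates for `categorical_features` in the experiment setup.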

Split train and test

In [None]:
from sklearn.model_selection import train_test_split
train, validation = train_test_split(raw, test_size=0.05, random_state=global_seed)

print('raw data shape: ', raw.shape)
print('train data shape: ', train.shape)
print('test data shape: ', validation.shape)
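With only 5% of the rows held out, the class balance of the validation set can drift away from that of the full data. One possible variation, not what this notebook does, is to pass `stratify` to `train_test_split`. A sketch on synthetic data (the frame and the 20% test size are assumptions made so the proportions are easy to check):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data: 100 rows, 60% positive (invented values)
toy = pd.DataFrame({"feature": range(100), "target": [1] * 60 + [0] * 40})

# stratify keeps the share of each target class the same in both parts
train_s, validation_s = train_test_split(
    toy, test_size=0.2, random_state=1236, stratify=toy["target"]
)

print("train positive share:", train_s["target"].mean())
print("validation positive share:", validation_s["target"].mean())
```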

## Setup the experiment

### First install pycaret

In [None]:
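# The setup() arguments (e.g. silent, profile) and the 'Label' column used below follow the PyCaret 2.x API
!pip install "pycaret<3"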

### Define the parameters for experiment

In [None]:
from pycaret.classification import *

exp = setup(
    train, 
    target = 'target', 
    categorical_features=None,
    numeric_features=None,
    date_features=None, 
    ignore_features=None, 
    normalize=False, 
    normalize_method='zscore', 
    transformation=False, 
    transformation_method='yeo-johnson',
    n_jobs=2, 
    use_gpu=True, 
    session_id=global_seed, 
    log_experiment=False, 
    experiment_name=None, 
    log_plots=False, 
    log_profile=False, 
    log_data=False, 
    silent=True, 
    verbose=True, 
    profile=False
)

## Choose the best model

In [None]:
compare_models()

## Build best model

In [None]:
best = create_model('ridge')

## Analyse best model

Check confusion matrix of the model

In [None]:
plot_model(best, plot = 'confusion_matrix')

Check the importance of each feature

In [None]:
plot_model(best, plot = 'feature')

## Predict on the validation (test) set

Check the predictions on cases the model has never seen before.

In [None]:
test_prediction = predict_model(best, validation)
test_prediction.to_csv('validation_prediction.csv', index=False)
test_prediction

Check predictions

In [None]:
test_prediction = test_prediction.apply(pd.to_numeric)
test_prediction['comp'] = np.where(test_prediction['target'] == test_prediction['Label'], 'Correct', 'Incorrect')
test_prediction.groupby('comp').count()['Label']

## Analyze prediction on the validation (test) set

Accuracy on validation (test) set

In [None]:
# Share of rows marked 'Correct'; also works when every prediction is correct (or every one is wrong)
validation_accuracy = (test_prediction['comp'] == 'Correct').mean()
print('validation_accuracy: ', validation_accuracy)
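As a cross-check, the same figure can be obtained from `sklearn.metrics.accuracy_score`. The sketch below uses a toy frame shaped like the prediction output, with `target` as the truth and `Label` as the prediction; the values are invented for illustration:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy predictions: 'target' is the true class, 'Label' the predicted class
toy_prediction = pd.DataFrame({"target": [1, 0, 1, 1, 0], "Label": [1, 0, 0, 1, 0]})

# Accuracy as the share of rows where prediction and truth agree
manual_accuracy = (toy_prediction["target"] == toy_prediction["Label"]).mean()

# The same quantity computed by scikit-learn
sklearn_accuracy = accuracy_score(toy_prediction["target"], toy_prediction["Label"])

print(manual_accuracy, sklearn_accuracy)
```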

**The accuracy on the validation (test) set is almost the same as the training accuracy, which is a good sign: it suggests that the model generalizes and is not overfitted.**
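Because the validation set holds only 5% of the rows, a single accuracy figure carries some uncertainty. One way to gauge it is a Wilson score interval for a proportion. The sketch below is an illustration only: `wilson_interval` is a hypothetical helper, and the counts (14 correct out of 16) are made-up numbers, not results from this notebook.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Hypothetical counts, chosen only to illustrate a small validation set
low, high = wilson_interval(correct=14, total=16)
print(f"accuracy = {14 / 16:.3f}, interval = ({low:.3f}, {high:.3f})")
```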

Confusion matrix for validation (test) set

In [None]:
from sklearn.metrics import confusion_matrix

y_actu = test_prediction['target']
y_pred = test_prediction['Label']

cm = confusion_matrix(y_actu, y_pred)

import seaborn as sn
sn.heatmap(cm, cmap="Blues", annot=True,annot_kws={"size": 16})
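The four cells of a binary confusion matrix also give sensitivity and specificity, which are often more informative than accuracy for a medical screening task. A minimal sketch on toy labels, assuming `1` marks the presence of heart disease (the labels below are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 is assumed to mean "disease", 0 "no disease"
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, in the order given by labels
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # share of actual positives that were detected
specificity = tn / (tn + fp)  # share of actual negatives that were cleared

print(sensitivity, specificity)
```

The same unpacking can be applied to `cm` above, since it is built from `target` and `Label` in the same row and column order.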

## Conclusion

**The validation accuracy is:**

In [None]:
print('validation_accuracy: ', validation_accuracy)

Check the file *validation_prediction.csv* in the output folder. Enjoy!