### Attribute Information

- Age: age of the patient: years
- Sex: sex of the patient 
  - M: Male
  - F: Female
- ChestPainType: chest pain type 
  - TA: Typical Angina
  - ATA: Atypical Angina 
  - NAP: Non-Anginal Pain
  - ASY: Asymptomatic
- RestingBP: resting blood pressure: mm Hg
- Cholesterol: serum cholesterol: mg/dl
- FastingBS: fasting blood sugar 
  - 1: if FastingBS > 120 mg/dl
  - 0: otherwise
- RestingECG: resting electrocardiogram results 
  - Normal: Normal
  - ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria
- MaxHR: maximum heart rate achieved: Numeric value between 60 and 202
- ExerciseAngina: exercise-induced angina 
  - Y: Yes
  - N: No
- Oldpeak: oldpeak ST: Numeric value measured in depression
- ST_Slope: the slope of the peak exercise ST segment 
  - Up: upsloping 
  - Flat: flat
  - Down: downsloping
- HeartDisease: output class 
  - 1: heart disease
  - 0: Normal

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix
from sklearn.metrics import r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import KFold

DATASET_FILE_PATH = '/kaggle/input/heart-failure-prediction/heart.csv'


In [None]:
df = pd.read_csv(DATASET_FILE_PATH)
print(df.head(10))

In [None]:
# Transform non-numerical labels to numerical labels
category_encoders = {}
string_categories = ["Sex", "ChestPainType",
                     "RestingECG", "ExerciseAngina", "ST_Slope"]

for category in string_categories:
    category_encoders[category] = LabelEncoder()
    df[f"{category}_encoded"] = category_encoders[category].fit_transform(
        df[category])

# rename output (HeartDisease) column to target
df = df.rename(columns={"HeartDisease": "target"})

# Drop columns which contain non-numerical labels
df_processed = df.drop(string_categories, axis=1)
print(df_processed.head(10))


In [None]:
corr = df_processed.corr()
# plt.subplots returns (figure, axes), in that order
fig, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(corr, vmin=-1, cmap=plt.cm.Blues, annot=True, ax=ax)
plt.show()


In [None]:
# Features whose absolute correlation with the target is below 0.3
corr.loc[corr['target'].abs() < 0.3, 'target']


In [None]:
# Split dataset into X and Y
df_x = df_processed.iloc[:, df_processed.columns != 'target']
df_y = df_processed['target']

# Standardize features
scaler = StandardScaler()
df_x = scaler.fit_transform(df_x)

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2)

print(f"X_train shape {X_train.shape}")
print(f"Y_train shape {y_train.shape}")
print(f"X_test shape {X_test.shape}")
print(f"Y_test shape {y_test.shape}")


In [None]:
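model = SVC(tol=1e-4, verbose=1, max_iter=2500).fit(X_train, y_train)
y_pred = model.predict(X_test)

# scikit-learn metrics expect their arguments in the order (y_true, y_pred)
print('Accuracy Score: {:.4f}'.format(accuracy_score(y_test, y_pred)))
print('SVC f1-score  : {:.4f}'.format(f1_score(y_test, y_pred)))
print('SVC precision : {:.4f}'.format(precision_score(y_test, y_pred)))
print('SVC recall    : {:.4f}'.format(recall_score(y_test, y_pred)))
print("\n", classification_report(y_test, y_pred))

# Confusion matrix as percentages of the test set
# (rows: true class, columns: predicted class)
cnf_matrix = confusion_matrix(y_test, y_pred, labels=[1, 0])
plt.figure()
sns.heatmap((cnf_matrix / np.sum(cnf_matrix)*100), annot=True, fmt=".2f",
            cmap="Blues", xticklabels=[1, 0], yticklabels=[1, 0])
plt.show()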


In [None]:
# Search for the best hyper-parameters with GridSearchCV.
# GridSearchCV accepts a dictionary that maps each hyper-parameter name
# to the list of values we want to try on the SVM model.

kernels = ['linear', 'rbf', 'poly', 'sigmoid']
c = [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 1e2, 1e3, 1e4, 1e5]
gammas = [0.1, 1, 10, 100]

# GridSearchCV clones and fits the estimator itself, so no prior fit is needed
clf = SVC()
param_grid = dict(kernel=kernels, C=c, gamma=gammas)
grid = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_params_
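
By default, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for `SVC` is accuracy. The sketch below shows how the search could instead be ranked by F1 through the `scoring` parameter. It is only an illustration: the data is a synthetic stand-in generated with `make_classification`, the grid is deliberately small, and the names `X_demo`, `y_demo`, `small_grid` and `search` are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the heart dataset (assumption: 11 numeric features)
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# A reduced grid so the example runs quickly
small_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": [0.1, 1]}

# scoring="f1" makes the search pick the candidate with the best mean F1
search = GridSearchCV(SVC(), small_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("best mean cross-validated F1: {:.4f}".format(search.best_score_))
# score() uses the same scorer, so this is the F1 on the held-out split
print("held-out F1: {:.4f}".format(search.score(X_te, y_te)))
```

Whether F1 or accuracy is the better selection criterion depends on how costly each kind of error is for the task.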


In this dataset, both false negatives and false positives matter:
 - False negatives (predicting negative for a positive heart failure case): a missed chance to possibly save someone.
 - False positives (predicting positive for a negative heart failure case): it is a bad joke to tell a person and their family that they are going to die, and then someone has to stop them from going to Las Vegas to spend all their money at the casino.

The F1 score is therefore an important metric to take into consideration. To calculate F1, we also need the recall and precision scores.  
 - Recall  
<img src="https://lawtomated.com/wp-content/uploads/2019/10/Recall_1.png" alt="Recall" style='width: 200px;'>  

 - Precision  
<img src="https://anchormen.nl/wp-content/uploads/2020/02/precision-formula.png" alt="Precision" style='width: 200px;'>  

 - F1 score  
<img src="https://miro.medium.com/max/752/1*UJxVqLnbSj42eRhasKeLOA.png" alt="F1 score" style='width: 200px;'>  

A recall lower than the precision means that the model makes more false negatives than false positives, so it is more likely to misclassify a heart failure case as negative. A higher precision, in turn, means that when the model predicts heart failure, that prediction is more often correct.
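
As a small worked example, the three scores can be computed by hand from the confusion-matrix counts, using recall = TP / (TP + FN), precision = TP / (TP + FP) and F1 = 2 · precision · recall / (precision + recall), and then compared with scikit-learn. The labels below are made up purely for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels, made up for illustration (1 = heart disease, 0 = normal)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
y_hat = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 1])

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print("precision: {:.4f}".format(precision))
print("recall   : {:.4f}".format(recall))
print("f1       : {:.4f}".format(f1))

# The hand-computed values agree with scikit-learn, called as (y_true, y_pred)
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))
```

Here there are two false negatives and one false positive, so the recall (0.60) is lower than the precision (0.75).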

In [None]:
# Best hyper-parameters are  C:1.0 (default), gamma: 0.1, kernel: rbf

# Assign to `model` so the predictions below come from the tuned classifier
model = SVC(kernel='rbf', gamma=0.1, C=1.0, tol=1e-5, verbose=1, max_iter=2500).fit(X_train, y_train)
y_pred = model.predict(X_test)

# scikit-learn metrics expect their arguments in the order (y_true, y_pred)
print('Accuracy Score: {:.4f}'.format(accuracy_score(y_test, y_pred)))
print('SVC f1-score  : {:.4f}'.format(f1_score(y_test, y_pred)))
print('SVC precision : {:.4f}'.format(precision_score(y_test, y_pred)))
print('SVC recall    : {:.4f}'.format(recall_score(y_test, y_pred)))
print("\n", classification_report(y_test, y_pred))

# Confusion matrix as percentages (rows: true class, columns: predicted class)
cnf_matrix = confusion_matrix(y_test, y_pred, labels=[1, 0])
sns.heatmap((cnf_matrix / np.sum(cnf_matrix)*100), annot=True, fmt=".2f",
            cmap="Blues", xticklabels=[1, 0], yticklabels=[1, 0])


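During testing, different train/test splits gave different results
for the F1 score and the accuracy. To get an estimate that does not depend on a single
split, and to help detect overfitting, this experiment uses cross-validation. k-fold
cross-validation splits a dataset into k parts, where each part is used once as the test
set while the remaining k-1 parts form the training set; the scores are then averaged
over the k runs.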

See: https://machinelearningmastery.com/k-fold-cross-validation/
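
A compact way to run k-fold cross-validation is scikit-learn's `cross_validate`. The sketch below is an assumption-laden illustration rather than this notebook's exact procedure: it uses synthetic data from `make_classification` as a stand-in for the heart dataset, swaps in `StratifiedKFold` to keep the class ratio similar in every fold, and puts the `StandardScaler` inside a pipeline so it is fitted on the training folds only. The names `X_demo`, `y_demo`, `pipe` and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart dataset (assumption: 11 numeric features)
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)

# The scaler lives inside the pipeline, so it is refitted on the training
# part of every fold and never sees that fold's test samples
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.1, C=1.0))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(pipe, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1"])

# One score per fold is stored under the keys "test_<metric name>"
print("%0.2f f1 score with a standard deviation of %0.2f" %
      (scores["test_f1"].mean(), scores["test_f1"].std()))
print("%0.2f accuracy with a standard deviation of %0.2f" %
      (scores["test_accuracy"].mean(), scores["test_accuracy"].std()))
```

The standard deviation across folds gives a rough idea of how much the scores depend on the particular split.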

In [None]:
n_splits = 10
kf = KFold(n_splits=n_splits, shuffle=True)

acc_arr = np.empty(n_splits)
f1_arr = np.empty(n_splits)
cnf_arr = []
for fold, (train_index, test_index) in enumerate(kf.split(df_x, df_y)):
    X_train, X_test = df_x[train_index], df_x[test_index]
    # .iloc selects by position, matching the positional indices from KFold
    y_train, y_test = df_y.iloc[train_index], df_y.iloc[test_index]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # scikit-learn metrics expect their arguments in the order (y_true, y_pred)
    print('Accuracy Score: {:.4f}'.format(accuracy_score(y_test, y_pred)))
    print('SVC f1-score  : {:.4f}'.format(f1_score(y_test, y_pred)))
    print('SVC precision : {:.4f}'.format(precision_score(y_test, y_pred)))
    print('SVC recall    : {:.4f}'.format(recall_score(y_test, y_pred)))
    print("\n", classification_report(y_test, y_pred))

    # Keep the per-fold results for the summary below
    cnf_arr.append(confusion_matrix(y_test, y_pred))
    acc_arr[fold] = accuracy_score(y_test, y_pred)
    f1_arr[fold] = f1_score(y_test, y_pred)

print("%0.2f f1 score with a standard deviation of %0.2f" %
      (f1_arr.mean(), f1_arr.std()))
print("%0.2f accuracy with a standard deviation of %0.2f" %
      (acc_arr.mean(), acc_arr.std()))
