#**Heart** **Disease** **Prediction**

Heart disease is a major health concern, and identifying it early is essential for proper treatment.
Machine learning has proved effective at making decisions and predictions from the large quantity of data produced by the healthcare industry.


Here, several ML models are applied to classify whether or not a person has heart disease. The data is the [Cleveland Heart Disease dataset from the UCI Repository](https://archive.ics.uci.edu/ml/datasets/heart+disease), which is also available on [Kaggle](https://www.kaggle.com/ronitf/heart-disease-uci).

The dataset contains records for 303 individuals. It has 14 columns, described below:
1. Age: displays the age of the individual.
2. Sex: displays the gender of the individual using the following format :
- 1 = male
- 0 = female
3. Chest-pain type: displays the type of chest-pain experienced by the individual using the following format :
- 1 = typical angina
- 2 = atypical angina
- 3 = non-anginal pain
- 4 = asymptomatic
4. Resting Blood Pressure: displays the resting blood pressure value of an individual in mmHg (unit)
5. Serum Cholesterol: displays the serum cholesterol in mg/dl (unit)
6. Fasting Blood Sugar: compares the fasting blood sugar value of an individual with 120 mg/dl.
- 1 = fasting blood sugar > 120 mg/dl (true)
- 0 = otherwise (false)
7. Resting ECG : displays resting electrocardiographic results
- 0 = normal
- 1 = having ST-T wave abnormality
- 2 = left ventricular hypertrophy
8. Max heart rate achieved : displays the max heart rate achieved by an individual.
9. Exercise induced angina :
- 1 = yes
- 0 = no
10. ST depression induced by exercise relative to rest: displays the value which is an integer or float.
11. Peak exercise ST segment :
- 1 = upsloping
- 2 = flat
- 3 = downsloping
12. Number of major vessels (0–3) colored by fluoroscopy : displays the value as integer or float.
13. Thal : displays the thalassemia :
- 3 = normal
- 6 = fixed defect
- 7 = reversible defect
14. Diagnosis of heart disease : Displays whether the individual is suffering from heart disease or not :
- 0 = absence
- 1, 2, 3, 4 = present.
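
The diagnosis column above takes the values 0 to 4, while the models in this notebook treat the task as binary. A minimal sketch of collapsing it into a 0/1 label follows. The column name `num` and the toy values are assumptions for illustration; the CSV loaded later exposes a `target` column that the notebook already treats as binary.

```python
import pandas as pd

# Toy frame standing in for the raw diagnosis column; the name `num` is an assumption.
raw = pd.DataFrame({"num": [0, 1, 2, 0, 4, 3]})

# 0 stays 0 (absence); 1, 2, 3 and 4 all become 1 (presence).
raw["target"] = (raw["num"] > 0).astype(int)

print(raw["target"].tolist())
```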



# 1.Import Libraries


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV
#plot_roc_curve is not imported: it is unused here and was removed in scikit-learn 1.2
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score, f1_score, accuracy_score, roc_curve

from matplotlib import rcParams
from matplotlib.cm import rainbow

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

#2.Import Dataset

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')

In [None]:
df.head()

Data has been imported successfully

#3.Data Visualization

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
#to check for null values
df.isnull().sum()

In [None]:
df["target"].value_counts()

In [None]:
sns.countplot(x = 'target', data = df, palette = 'rocket', saturation = 1)
plt.show()

In [None]:
#sex-wise distribution of the target classes
sns.set_style('darkgrid')
sns.countplot(x = 'target', hue = 'sex', data = df, palette = 'rocket', saturation = 1)
plt.title('Heart Disease Frequency : Sex Wise')
plt.show()


In [None]:
#It is good practice to work with a dataset whose target classes are of roughly equal size, so the class counts above are used to check this.
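
One way to put a number on class balance is to compare the relative frequencies of the classes. The sketch below does this on a toy series standing in for `df["target"]`; the values and the `balance_ratio` name are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Toy labels standing in for df["target"]; the real counts come from the dataset.
target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])

# Relative frequency of each class.
proportions = target.value_counts(normalize=True)

# Ratio of the rarest class to the most common one (1.0 means perfectly balanced).
balance_ratio = proportions.min() / proportions.max()

print(proportions)
print(f"Minority/majority ratio: {balance_ratio:.2f}")
```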

In [None]:
#heart disease frequency according to sex
pd.crosstab(df.sex, df.target).plot(kind = "bar",figsize = (20, 6), color = ['salmon', 'deepskyblue'])
plt.title('Heart Disease Frequency : Sex')
plt.xlabel('Sex (0 = Female, 1 = Male)')
plt.xticks(rotation=0)
plt.legend(["0", "1"])
plt.ylabel('Frequency')
plt.show()

In [None]:
#heart disease frequency according to age
pd.crosstab(df.age,df.target).plot(kind="bar", figsize=(20, 6), color = ['salmon', 'deepskyblue'])
plt.title('Heart Disease Frequency : Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.savefig('heartDiseaseAndAges.png')
plt.show()


In [None]:
#heart disease frequency according to Maximum Heart Rate and Age
plt.figure(figsize=(10, 6))
plt.scatter(x = df.age[df.target==1], y = df.thalach[(df.target==1)], c = "salmon")
plt.scatter(x=df.age[df.target==0], y=df.thalach[(df.target==0)])
plt.legend(["Disease", "No Disease"])
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()

In [None]:
#heart disease frequency according to slope
pd.crosstab(df.slope, df.target).plot(kind="bar",figsize=(20, 6),color=['salmon', 'deepskyblue'])
plt.title('Heart Disease Frequency : Slope')
plt.xlabel('The Slope of The Peak Exercise ST Segment ')
plt.xticks(rotation = 0)
plt.ylabel('Frequency')
plt.show()


In [None]:
#heart disease frequency according to FBS
pd.crosstab(df.fbs,df.target).plot(kind="bar",figsize=(15,6),color=['salmon','deepskyblue' ])
plt.title('Heart Disease Frequency : FBS')
plt.xlabel('FBS - (Fasting Blood Sugar > 120 mg/dl) (1 = true 0 = false)')
plt.xticks(rotation = 0)
plt.legend(["No Disease", "Disease"])
plt.ylabel('Frequency of Disease or Not')
plt.show()

In [None]:
#heart disease frequency according to Chest Pain Type
pd.crosstab(df.cp,df.target).plot(kind="bar",figsize=(15,6),color=['salmon','deepskyblue' ])
plt.title('Heart Disease Frequency : Chest Pain Type')
plt.xlabel('Chest Pain Type')
plt.xticks(rotation = 0)
plt.ylabel('Frequency of Disease')
plt.show()

#4.Feature Selection

In [None]:
#correlations study
corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize = (15, 10))
g = sns.heatmap(df[top_corr_features].corr(), annot = True, cmap = "YlGnBu")

In [None]:
df.hist(figsize = (20, 20))

#5.Data Processing and Train Test Split

In [None]:
#Since 'sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca' and 'thal' are categorical variables we'll turn them into dummy variables.

In [None]:
df = pd.get_dummies(df, columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])
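
To see what this encoding does, the sketch below applies `pd.get_dummies` to a three-row toy frame. The frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame with one categorical column; values mirror the chest-pain codes.
toy = pd.DataFrame({"age": [63, 41, 57], "cp": [1, 2, 1]})

# Each distinct value of `cp` becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["cp"])

print(encoded.columns.tolist())
```

The original `cp` column is dropped and replaced by one indicator column per category, which is why the number of columns in `df` grows after this step.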

In [None]:
standardScaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df[columns_to_scale] = standardScaler.fit_transform(df[columns_to_scale])
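
The cell above fits the scaler on the whole dataset before it is split. A common alternative, sketched here on synthetic data, is to fit the scaler on the training rows only and reuse its statistics on the test rows, so that the test set does not influence preprocessing. All names and values below are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # synthetic features
y_demo = rng.integers(0, 2, size=100)                      # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from training rows
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```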

In [None]:
df.head()


In [None]:
#Split Data
Y = df['target']
X = df.drop(['target'], axis = 1)

In [None]:
df.head()

In [None]:
#train test split
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2) 

Now the data is ready to be processed.
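
The split above relies on the global NumPy seed for reproducibility. As a hedged alternative, the sketch below passes `random_state` to `train_test_split` directly and uses `stratify` to keep the class ratio the same in both parts. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)   # 20 synthetic rows
y_demo = np.array([0] * 10 + [1] * 10)  # balanced synthetic labels

# stratify=y_demo keeps the 50/50 class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape, y_te.sum())
```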

Five different models are used:

1.   Logistic Regression
2.   K Nearest Neighbors
3.   Decision Tree Classifier
4.   Random Forest Classifier
5.   Support Vector Machine


For each of these five models, the following are evaluated:

*   Accuracy Score
*   Classification Report
*   Confusion Matrix
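
As a small illustration of these three metrics, the sketch below computes them with scikit-learn on hand-written labels. The label lists are invented for the example.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]  # invented ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]  # invented predictions

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)  # rows = actual class, columns = predicted class

print(f"Accuracy: {acc:.4f}")
print(cm)
print(classification_report(y_true, y_pred))
```

In the confusion matrix, the diagonal holds the correctly classified samples, and the off-diagonal entries are the false positives and false negatives.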
  







#6.Models

Function Definition to evaluate Accuracy Score, Classification Report and Confusion Matrix of Classifier

In [None]:
#function to print accuracy score, classification report and confusion matrix
def print_score(clf, X_train, y_train, X_test, y_test, train = True):
    #training performance
    if train:
        pred = clf.predict(X_train)
        print("TRAIN RESULT \n")
        print("Accuracy Score: {0:.4f}\n".format(accuracy_score(y_train, pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_train, pred)))
        #print("Confusion Matrix: \n {}\n".format(confusion_matrix(y_train, pred)))

        res = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
        print("Average Accuracy: \t {0:.4f}".format(np.mean(res)))
        print("Accuracy SD: \t\t {0:.4f}".format(np.std(res)))
        print("_______________________________________________________________________________________")

    #test performance
    else:
        pred = clf.predict(X_test)
        print("\nTEST RESULT \n")
        print("Accuracy Score: {0:.4f}\n".format(accuracy_score(y_test, pred)))
        print("Classification Report: \n {}\n".format(classification_report(y_test, pred)))
        print("Confusion Matrix of Test Data Set: \n {}\n".format(confusion_matrix(y_test, pred)))

In [None]:
#function to plot confusion matrix (note: it reads X_test from the notebook's global scope)
def plot_cm(y_test, model):
  cnf_matrix = confusion_matrix(y_test, model.predict(X_test))
  class_names = [0,1]
  fig,ax = plt.subplots()
  tick_marks = np.arange(len(class_names))
  plt.xticks(tick_marks,class_names)
  plt.yticks(tick_marks,class_names)
  #create a heat map
  sns.heatmap(pd.DataFrame(cnf_matrix), annot = True, cmap = 'YlGnBu',
            fmt = 'g')
  ax.xaxis.set_label_position('top')
  plt.tight_layout()
  plt.title('Confusion Matrix ', y = 1.1)
  plt.ylabel('Actual label')
  plt.xlabel('Predicted label')
  plt.show()

##6.1 Logistic Regression

In [None]:
accuracy = {}

In [None]:
lr = LogisticRegression(solver='liblinear')
lr.fit(X_train, y_train)

print_score(lr, X_train, y_train, X_test, y_test, train=True)
print_score(lr, X_train, y_train, X_test, y_test, train=False)

test_score = accuracy_score(y_test, lr.predict(X_test)) * 100
train_score = accuracy_score(y_train, lr.predict(X_train)) * 100

accuracy['Logistic Regression'] = test_score

results_df = pd.DataFrame(data=[["Logistic Regression", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
plot_cm(y_test, lr)

##6.2 KNN

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

print_score(knn, X_train, y_train, X_test, y_test, train=True)
print_score(knn, X_train, y_train, X_test, y_test, train=False)

test_score = accuracy_score(y_test, knn.predict(X_test)) * 100
train_score = accuracy_score(y_train, knn.predict(X_train)) * 100

accuracy['KNN'] = test_score


results_df_2 = pd.DataFrame(data=[["K-nearest neighbors", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
plot_cm(y_test, knn)

##6.3 Decision Tree Classifier

In [None]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

print_score(dtc, X_train, y_train, X_test, y_test, train=True)
print_score(dtc, X_train, y_train, X_test, y_test, train=False)

test_score = accuracy_score(y_test, dtc.predict(X_test)) * 100
train_score = accuracy_score(y_train, dtc.predict(X_train)) * 100

accuracy['Decision Tree Classifier'] = test_score


results_df_2 = pd.DataFrame(data=[["Decision Tree Classifier", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
plot_cm(y_test, dtc)

##6.4 Random Forest Classifier

In [None]:
rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(X_train, y_train)

print_score(rf, X_train, y_train, X_test, y_test, train=True)
print_score(rf, X_train, y_train, X_test, y_test, train=False)

test_score = accuracy_score(y_test, rf.predict(X_test)) * 100
train_score = accuracy_score(y_train, rf.predict(X_train)) * 100

accuracy['Random Forest Classifier'] = test_score

results_df_2 = pd.DataFrame(data=[["Random Forest Classifier", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
plot_cm(y_test, rf)

##6.5 Support Vector Machine

In [None]:
svm = SVC(kernel='rbf', gamma=0.1, C=1.0)
svm.fit(X_train, y_train)

print_score(svm, X_train, y_train, X_test, y_test, train = True)
print_score(svm, X_train, y_train, X_test, y_test, train = False)

test_score = accuracy_score(y_test, svm.predict(X_test)) * 100
train_score = accuracy_score(y_train, svm.predict(X_train)) * 100

accuracy['SVM'] = test_score

results_df_2 = pd.DataFrame(data=[["Support Vector Machine", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
results_df = pd.concat([results_df, results_df_2], ignore_index=True)
plot_cm(y_test, svm)

## Accuracy comparison between Different Models

In [None]:
results_df

In [None]:
sns.set_style("darkgrid")
plt.figure(figsize=(16,5))
plt.yticks(np.arange(0,100,10))
plt.ylabel("Accuracy")
sns.barplot(x = list(accuracy.keys()), y = list(accuracy.values()), palette = 'Paired')
plt.show()

As seen from the table and graph, the highest accuracy is produced by KNN and SVM. 

The accuracy can be improved by hyperparameter tuning: searching over a range of candidate values for an algorithm's parameters and keeping the combination that performs best under cross-validation.
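
As a rough illustration of what such a search looks like, the sketch below tunes the `C` parameter of a logistic regression with `GridSearchCV` on synthetic data. The grid values and the `X_demo`/`y_demo` names are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the real features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values to try; each one is scored with 5-fold cross-validation.
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 4))
```

`best_params_` holds the winning combination and `best_score_` its mean cross-validated accuracy, which is the pattern the tuned models below follow.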

#7.Hyperparameter Tuning To Improve The Accuracy

In [None]:
accuracy_tuned = {}

##7.1 Logistic Regression

In [None]:
params = {"C": np.logspace(-4, 4, 20), "solver": ["liblinear"]}
lr = LogisticRegression()

#the iid argument was removed in scikit-learn 0.24, so it is not passed here
lr_cv = GridSearchCV(lr, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=5)
lr_cv.fit(X_train, y_train)
best_params = lr_cv.best_params_
#print(f"Best parameters: {best_params}")
lr = LogisticRegression(**best_params)

lr.fit(X_train, y_train)

print_score(lr, X_train, y_train, X_test, y_test, train=True)
print_score(lr, X_train, y_train, X_test, y_test, train=False)


test_score = accuracy_score(y_test, lr.predict(X_test)) * 100
train_score = accuracy_score(y_train, lr.predict(X_train)) * 100

accuracy_tuned['Logistic Regression'] = test_score

tuning_results_df = pd.DataFrame(data=[["Tuned Logistic Regression", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
#tuning_results_df
plot_cm(y_test, lr)

##7.2 KNN

In [None]:
train_score = []
test_score = []
neighbors = range(1, 31)

for k in neighbors:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    train_score.append(accuracy_score(y_train, model.predict(X_train)))
    test_score.append(accuracy_score(y_test, model.predict(X_test)))

In [None]:
plt.figure(figsize=(12, 8))
plt.plot(neighbors, train_score, label="Train score")
plt.plot(neighbors, test_score, label="Test score")
plt.xticks(np.arange(1, 31, 1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()

print(f"Maximum KNN score on the test data: {max(test_score)*100:.2f}%")
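
The loop above compares values of k by their accuracy on the test set. A hedged alternative is to pick k by cross-validation on the training data only, which keeps the test set untouched until the final evaluation. The sketch below shows the idea on synthetic data; the candidate range and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for X_train and y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

candidate_k = range(1, 16)

# Mean 5-fold cross-validated accuracy for each candidate k.
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    for k in candidate_k
]

# Pick the k with the highest mean accuracy (first one wins on ties).
best_k = candidate_k[int(np.argmax(cv_means))]
print(f"Best k by 5-fold CV: {best_k}")
```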

In [None]:
knn = KNeighborsClassifier(n_neighbors=27)
knn.fit(X_train, y_train)

print_score(knn, X_train, y_train, X_test, y_test, train=True)
print_score(knn, X_train, y_train, X_test, y_test, train=False)

test_score = accuracy_score(y_test, knn.predict(X_test)) * 100
train_score = accuracy_score(y_train, knn.predict(X_train)) * 100

accuracy_tuned['KNN'] = test_score

results_df_2 = pd.DataFrame(data=[["Tuned K-nearest neighbors", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
tuning_results_df = pd.concat([tuning_results_df, results_df_2], ignore_index=True)
#tuning_results_df
plot_cm(y_test, knn)

##7.3 Decision Tree Classifier

In [None]:
params = {"criterion":("gini", "entropy"), "splitter":("best", "random"), "max_depth":(list(range(1, 20))), 
          "min_samples_split":[2, 3, 4], "min_samples_leaf":list(range(1, 20))}

tree = DecisionTreeClassifier(random_state=42)
#the iid argument was removed in scikit-learn 0.24, so it is not passed here
tree_cv = GridSearchCV(tree, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=3)
tree_cv.fit(X_train, y_train)
best_params = tree_cv.best_params_
print(f'Best_params: {best_params}')

tree = DecisionTreeClassifier(**best_params)
tree.fit(X_train, y_train)

print_score(tree, X_train, y_train, X_test, y_test, train=True)
print_score(tree, X_train, y_train, X_test, y_test, train=False)

test_score = accuracy_score(y_test, tree.predict(X_test)) * 100
train_score = accuracy_score(y_train, tree.predict(X_train)) * 100

accuracy_tuned['Decision Tree Classifier'] = test_score

results_df_2 = pd.DataFrame(data=[["Tuned Decision Tree Classifier", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
tuning_results_df = pd.concat([tuning_results_df, results_df_2], ignore_index=True)
#tuning_results_df
#plot the confusion matrix of the tuned decision tree (not the KNN model)
plot_cm(y_test, tree)


##7.4 Random Forest Classifier

In [None]:
rf_grid = {'n_estimators': np.arange(10, 1000, 50), 'max_depth': [None, 3, 5, 10], 
           'min_samples_split': np.arange(2, 20, 2), 'min_samples_leaf': np.arange(1, 20, 2)}
np.random.seed(42)

rf = RandomizedSearchCV(RandomForestClassifier(), param_distributions = rf_grid, cv=5, n_iter=20, verbose=True)

rf.fit(X_train, y_train)
rf.best_params_

test_score = accuracy_score(y_test, rf.predict(X_test)) * 100
train_score = accuracy_score(y_train, rf.predict(X_train)) * 100

accuracy_tuned['Random Forest Classifier'] = test_score

results_df_2 = pd.DataFrame(data=[["Tuned Random Forest Classifier", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
tuning_results_df = pd.concat([tuning_results_df, results_df_2], ignore_index=True)
#tuning_results_df
plot_cm(y_test, rf)

##7.5 Support Vector Machine

In [None]:
svm = SVC(kernel='rbf', gamma=0.1, C=1.0)

params = {"C":(0.1, 0.5, 1, 2, 5, 10, 20), 
          "gamma":(0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1), 
          "kernel":('linear', 'poly', 'rbf')}

svm_cv = GridSearchCV(svm, params, n_jobs=-1, cv=5, verbose=1, scoring="accuracy")
svm_cv.fit(X_train, y_train)
best_params = svm_cv.best_params_
print(f"Best params: {best_params}")

svm = SVC(**best_params)
svm.fit(X_train, y_train)

print_score(svm, X_train, y_train, X_test, y_test, train=True)
print_score(svm, X_train, y_train, X_test, y_test, train=False)

test_score = accuracy_score(y_test, svm.predict(X_test)) * 100
train_score = accuracy_score(y_train, svm.predict(X_train)) * 100

accuracy_tuned['SVM'] = test_score

results_df_2 = pd.DataFrame(data=[["Tuned Support Vector Machine", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
#DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
tuning_results_df = pd.concat([tuning_results_df, results_df_2], ignore_index=True)
plot_cm(y_test, svm)

##Accuracy comparison between different tuned models

In [None]:
tuning_results_df

In [None]:
sns.set_style("darkgrid")
plt.figure(figsize=(16,5))
plt.yticks(np.arange(0,100,10))
plt.ylabel("Accuracy")
sns.barplot(x = list(accuracy_tuned.keys()), y = list(accuracy_tuned.values()), palette = 'Paired')
plt.show()

As seen from the table and graph, the highest testing accuracy is produced by Logistic Regression, and the highest training accuracy by SVM.

#8.Feature Importance 

Feature importance helps to understand which features are most relevant to the model's predictions.
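
As a minimal sketch of the idea, the example below fits a random forest on synthetic data and ranks the features by the `feature_importances_` attribute. The feature names `f0` to `f4` and the data itself are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which 2 are informative by construction.
X_arr, y_demo = make_classification(n_samples=200, n_features=5, n_informative=2, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances; scikit-learn normalizes them to sum to 1.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(importances)
```

These impurity-based scores are relative to the fitted model, so they indicate which features the forest relied on rather than proving a causal link with heart disease.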

##8.1 According to Random Forest Classifier

In [None]:
def feature_imp(df, model):
    fi = pd.DataFrame()
    fi["feature"] = df.columns
    fi["importance"] = model.best_estimator_.feature_importances_
    return fi.sort_values(by="importance", ascending=False)

In [None]:
feature_imp(X, rf).plot(kind='bar', figsize=(12,10), legend=False, colormap = 'seismic')