![](https://jumpstartyourheart.org/wp-content/uploads/2018/11/penn-research-team-identifies-novel-therapeutic-target-for-heart-disease.jpg)

**Attribute Information:**

age

sex

chest pain type (4 values)

resting blood pressure

serum cholesterol in mg/dl

fasting blood sugar > 120 mg/dl

resting electrocardiographic results (values 0,1,2)

maximum heart rate achieved

exercise induced angina

oldpeak = ST depression induced by exercise relative to rest

the slope of the peak exercise ST segment

number of major vessels (0-3) colored by fluoroscopy

thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

**Acknowledgements**
Creators:

**Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.**

**University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.**

**University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.**

**V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.**

**Donor:
David W. Aha (aha '@' ics.uci.edu) (714) 856-8779**

In [None]:
#read the data and show the first 5 rows
import pandas as pd
import numpy as np
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.head()

In [None]:
#check for missing values and basic information

df.info()

No NULL values.

In [None]:
df.describe().T.style.bar(subset=['mean'], color='#205ff2')\
                            .background_gradient(subset=['std'], cmap='Reds')\
                             .background_gradient(subset=['50%'], cmap='coolwarm')

Cholesterol has the highest mean (246.26) and also the highest standard deviation (51.83).


# Basic Plots

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
%matplotlib inline

plt.figure(figsize = (12,10))

sns.heatmap(df.corr(), annot =True)

Slope and oldpeak have a strong negative correlation of -0.58. This means that as the slope value increases, oldpeak tends to decrease, and vice versa.

Target and cp (chest pain type) have the highest positive correlation, 0.43.
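
To read the strongest relationships without scanning the heatmap, one option is to rank each feature by its correlation with the target. The sketch below uses a small synthetic frame (only the column names follow the dataset, the values are made up) and a hypothetical helper, `rank_by_target_corr`, that is not part of the original notebook.

```python
import pandas as pd

def rank_by_target_corr(frame, target_col="target"):
    """Return each feature's Pearson correlation with target_col, strongest first."""
    corr = frame.corr()[target_col].drop(target_col)
    # Sort by absolute value so strong negative correlations are not buried
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic illustration only; these are not rows from heart.csv
toy = pd.DataFrame({
    "cp":      [0, 1, 2, 3, 0, 2],
    "oldpeak": [2.0, 1.5, 0.5, 0.0, 2.5, 0.2],
    "target":  [0, 0, 1, 1, 0, 1],
})
ranked = rank_by_target_corr(toy)
print(ranked)
```

On the notebook's data the same helper could be called as `rank_by_target_corr(df)`.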

In [None]:
plt.figure(figsize=(20,15))
sns.set_theme(style='dark')
plt.subplot(2,3,1)
sns.countplot(data=df,x='fbs',hue='target')
plt.subplot(2,3,2)
sns.countplot(data=df,x='restecg',hue='target')
plt.subplot(2,3,3)
sns.countplot(data=df,x='slope',hue='target')
plt.subplot(2,3,4)
sns.countplot(data=df,x='ca',hue='target')
plt.subplot(2,3,5)
sns.countplot(data=df,x='exang',hue='target')
plt.subplot(2,3,6)
sns.countplot(data=df,x='thal',hue='target')
plt.show()

These plots show how each categorical feature is distributed across the two target classes.

**For example, in the first subplot ("fbs"): when fbs is 0, about 120 patients have target 0 and about 140 have target 1; when fbs is 1, about 20 patients have target 0 and just over 20 have target 1.**

*The remaining subplots read the same way.*
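
The counts read off the bars can also be tabulated directly. A minimal sketch with `pd.crosstab`, using a tiny synthetic frame rather than the real data:

```python
import pandas as pd

# Synthetic illustration only; column names follow the dataset, values do not
toy = pd.DataFrame({
    "fbs":    [0, 0, 0, 1, 1, 0],
    "target": [0, 1, 1, 0, 1, 1],
})

# Rows are fbs values, columns are target values, cells are row counts
counts = pd.crosstab(toy["fbs"], toy["target"])
print(counts)
```

On the notebook's data the equivalent call would be `pd.crosstab(df['fbs'], df['target'])`.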

In [None]:
df.hist(figsize=(20,16))
plt.show()

This is a histogram of every column.

It shows how often each value (or range of values) occurs in each column.
Imbalances in the data are visible here too.

For example, the fbs column has around 260 zeros and only about 45 ones, which matches the fbs countplot.

*The remaining columns read the same way. Studying each one and making mental notes will help you get the most out of this notebook.*
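
Imbalance can also be quantified rather than estimated from bar heights. A small sketch using `value_counts(normalize=True)` on a synthetic column (the 9:1 split is made up for illustration):

```python
import pandas as pd

# Synthetic column for illustration; the real fbs column is not reproduced here
fbs = pd.Series([0] * 9 + [1] * 1, name="fbs")

# Share of rows per value; a share far from 0.5 signals an imbalanced binary column
shares = fbs.value_counts(normalize=True)
print(shares)
```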

In [None]:


plt.figure(figsize=(13,13))

sns.set_theme(style='darkgrid')
plt.subplot(2,3,1)
sns.boxplot(x='thal',data=df)
plt.subplot(2,3,2)
sns.boxplot(x='oldpeak',data=df)
plt.subplot(2,3,3)
sns.boxplot(x='thalach',data=df)
plt.subplot(2,3,4)
sns.boxplot(x='chol',data=df)
plt.subplot(2,3,5)
sns.boxplot(x='trestbps',data=df)
plt.subplot(2,3,6)
sns.boxplot(x='age',data=df)
plt.show()

The line in the middle of each box is the median; the dots outside the whiskers are outliers.

In [None]:
import plotly.express as px

fig = px.box(df,x = 'trestbps')
fig.show()

A clearer, interactive box plot for "trestbps".

Here the median is 130, the first quartile is 120, the third quartile is 140, the upper whisker ends at 170, the lower whisker ends at 94, and the largest outlier is 200.
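
The whisker ends follow from the quartiles. Assuming the plot uses the common 1.5 × IQR rule for outliers, the fences can be worked out by hand; `whisker_fences` below is a hypothetical helper, not part of the notebook.

```python
def whisker_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles read off the trestbps box plot
lower, upper = whisker_fences(120, 140)
print(lower, upper)  # 90.0 170.0
```

Under that assumption the upper fence of 170 coincides with the top whisker, the bottom whisker stops at 94 because that is the smallest observation above the lower fence of 90, and 200 lies beyond the upper fence, so it is drawn as an outlier.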

# Scaling after train_test_split

In [None]:
X = df.drop(['target'], axis = 1)
y = df['target']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, random_state = 42)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
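
Fitting the scaler on the training split only keeps test-set statistics out of the preprocessing. The NumPy sketch below mimics what `fit_transform` and `transform` do in the cell above, on synthetic numbers chosen for illustration.

```python
import numpy as np

# Synthetic single-feature splits for illustration only
train = np.array([[100.0], [120.0], [140.0]])
test = np.array([[130.0], [200.0]])

# "fit": learn mean and standard deviation from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# "transform": apply the same training statistics to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1 by construction, while the test column generally does not, which is expected.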

In [None]:
#imports

from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB,BernoulliNB
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.linear_model import RidgeClassifier, LogisticRegression, SGDClassifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score,confusion_matrix,roc_auc_score

# Applying different models

In [None]:
models = []
models.append(['RidgeClassifier',RidgeClassifier()])
models.append(['XGBClassifier',XGBClassifier(use_label_encoder=False,objective='binary:logistic',random_state=0,eval_metric='logloss')])
models.append(['Logistic Regression',LogisticRegression(random_state=0)])
models.append(['SVM',SVC(random_state=0)])
models.append(['KNeighbors',KNeighborsClassifier()])
models.append(['GaussianNB',GaussianNB()])
models.append(['BernoulliNB',BernoulliNB()])
models.append(['DecisionTree',DecisionTreeClassifier(random_state=0)])
models.append(['RandomForest',RandomForestClassifier(random_state=0)])
models.append(['AdaBoostClassifier',AdaBoostClassifier()])
models.append(['MLPClassifier',MLPClassifier(random_state = 42, max_iter=1000)])
models.append(['ExtraTreesClassifier',ExtraTreesClassifier()])
models.append(['CatBoostClassifier', CatBoostClassifier(eval_metric = 'AUC', verbose = 0)])
models.append(['GradientBoostingClassifier', GradientBoostingClassifier()])
models.append(['SGDClassifier',SGDClassifier()])

In [None]:
lst_1 = []
for name, model in models:
    # Fit on the training split and predict on the held-out test split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    acc = accuracy_score(y_test, y_pred)

    # 10-fold cross-validation accuracy on the training split
    # (cross_val_score clones the model, so the fit above is unaffected)
    accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=10)

    # ROC AUC computed from the hard class predictions
    roc = roc_auc_score(y_test, y_pred)

    print(name, ':')
    print(cm)
    print('Accuracy Score: ', acc)
    print('')
    print('K-Fold Validation Mean Accuracy: {:.2f} %'.format(accuracies.mean()*100))
    print('')
    print('ROC AUC Score: {:.2f}'.format(roc))
    print('-'*40)
    print('')

    # One result row per model: name, test accuracy (%), CV accuracy (%), ROC AUC
    lst_1.append([name, acc*100, accuracies.mean()*100, roc])
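
The loop above passes hard class predictions to `roc_auc_score`. ROC AUC is usually computed from continuous scores, and the two can differ. The pure-Python sketch below illustrates this with a hypothetical helper, `pairwise_auc`, which computes AUC as the share of correctly ranked (positive, negative) pairs; the labels and scores are synthetic.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic labels and scores for illustration only
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.55, 0.7, 0.9]
labels = [1 if s >= 0.5 else 0 for s in scores]  # hard predictions at a 0.5 threshold

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, labels)
print(auc_from_scores, auc_from_labels)
```

With scikit-learn, such scores could come from `predict_proba(X_test)[:, 1]` or `decision_function(X_test)`, depending on which of the two a given model provides.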

In [None]:
df2 = pd.DataFrame(lst_1,columns=['Model','Accuracy','K-Fold Mean Accuracy','ROC_AUC'])

df2.sort_values(by=['ROC_AUC'],inplace=True,ascending=False)
df2

In [None]:
fig = plt.figure(figsize=(12,12))
sns.barplot(x='ROC_AUC',y='Model',data=df2,color='r')
plt.title('Model Comparison');

In [None]:
grid_models = [
               (KNeighborsClassifier(),[{'n_neighbors':np.arange(1, 100), 'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski']}]), 
               (DecisionTreeClassifier(),[{'criterion':['gini','entropy'],'max_depth':np.arange(1, 50), 'min_samples_leaf':[1,2,4]}]), 
               (RandomForestClassifier(),[{'n_estimators':[100,150,200],'criterion':['gini','entropy'], 'min_samples_leaf':[2, 10, 30]}]),
               (MLPClassifier(max_iter = 1000),[{'solver':['lbfgs', 'sgd', 'adam'], 'learning_rate' :['constant', 'invscaling', 'adaptive']}]), 
               (RidgeClassifier(),[{'alpha':[0.1,0.5,1], 'solver':['auto', 'svd', 'cholesky']}]),
               (GaussianNB(),[{'var_smoothing': np.logspace(0,-9, num=100)}]),
               (XGBClassifier(use_label_encoder = False), [{'learning_rate': [0.01, 0.05, 0.1], 'eval_metric': ['error', 'logloss']}])
               ]
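
Grid search tries every combination in a grid, so the cost grows with the product of the list lengths. A rough sketch of that arithmetic for the KNeighborsClassifier grid, assuming 5-fold cross-validation; `count_candidates` is a hypothetical helper, and `range` stands in for `np.arange`.

```python
from math import prod

def count_candidates(param_grid):
    """Number of parameter combinations in one grid dictionary."""
    return prod(len(values) for values in param_grid.values())

# Mirrors the KNeighborsClassifier grid, with range standing in for np.arange
knn_grid = {
    "n_neighbors": list(range(1, 100)),
    "metric": ["euclidean", "manhattan", "chebyshev", "minkowski"],
}
candidates = count_candidates(knn_grid)
fits = candidates * 5  # one fit per candidate per fold, assuming cv=5
print(candidates, fits)  # 396 1980
```

The same count explains why the DecisionTreeClassifier grid (2 × 49 × 3 combinations) is also comparatively slow to search.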

In [None]:
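for estimator, param_grid in grid_models:
    # Exhaustive search over the grid, scored by 5-fold cross-validated ROC AUC
    grid = GridSearchCV(estimator=estimator, param_grid=param_grid, scoring='roc_auc', cv=5)
    grid.fit(X_train, y_train)
    best_score = grid.best_score_
    best_param = grid.best_params_
    print(' {}: \n Best ROC AUC: {:.1f} %'.format(estimator, best_score*100))
    print(' Best parameters: {}'.format(best_param))
    print('')
    print('-'*25)
    print('')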
## The top-performing model is KNeighborsClassifier, followed by RandomForestClassifier.

## Do upvote if you like it or fork it. This motivates us to produce more notebooks for the community. Thank you!