# Heart Disease Prediction

#### AIM : 
#### The World Health Organization estimates that four out of five cardiovascular disease (CVD) deaths are due to heart attacks and strokes. This project aims to identify the proportion of patients who are likely to be affected by CVD and to predict that risk, starting with Logistic Regression and comparing it against several other classifiers.

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz # exports a fitted decision tree in Graphviz DOT format
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

### Data Collection and Processing

**About the Dataset**

**Context :**
This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them. The "target" field indicates the presence of heart disease in the patient. It is integer-valued: 0 = no disease and 1 = disease.

**Content :**

Attribute Information:

age  

sex  

chest pain type (4 values)  

resting blood pressure  

serum cholesterol in mg/dl  

fasting blood sugar > 120 mg/dl  

resting electrocardiographic results (values 0,1,2)  

maximum heart rate achieved  

exercise induced angina  

oldpeak = ST depression induced by exercise relative to rest  

the slope of the peak exercise ST segment  

number of major vessels (0-3) colored by fluoroscopy  

thal: 0 = normal; 1 = fixed defect; 2 = reversible defect  

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.


In [2]:
# loading the csv data to a pandas dataframe
heart_data = pd.read_csv('../DATA/heart.csv')

In [3]:
# taking a look at the dataset
heart_data.head()

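|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |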


In [4]:
# number of rows and columns in the dataset
heart_data.shape

(1025, 14)

In [5]:
# getting some info about the data
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


In [6]:
# checking for missing values
heart_data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64
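
Missing values are only one data-quality check. Duplicate rows are worth a look as well, because a row that lands in both the training and the test set can inflate test accuracy. The sketch below runs the check on a small made-up frame; `toy` and its values are hypothetical and not taken from `heart.csv`. The same two calls could be applied to `heart_data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for heart_data; the values are made up.
toy = pd.DataFrame({
    "age":    [52, 53, 52, 61],
    "chol":   [212, 203, 212, 203],
    "target": [0, 0, 0, 1],
})

# duplicated() flags every row that is identical to an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("duplicate rows:", n_duplicates)
print("rows after dropping duplicates:", len(deduplicated))
```

If duplicates turn up in the real data, dropping them before the train/test split is one option; whether that is appropriate depends on how the file was assembled.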

In [7]:
# Statistical measures about the data
heart_data.describe().transpose()

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 1025.0 | 54.434146 | 9.07229 | 29.0 | 48.0 | 56.0 | 61.0 | 77.0 |
| sex | 1025.0 | 0.69561 | 0.460373 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| cp | 1025.0 | 0.942439 | 1.029641 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| trestbps | 1025.0 | 131.611707 | 17.516718 | 94.0 | 120.0 | 130.0 | 140.0 | 200.0 |
| chol | 1025.0 | 246.0 | 51.59251 | 126.0 | 211.0 | 240.0 | 275.0 | 564.0 |
| fbs | 1025.0 | 0.149268 | 0.356527 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| restecg | 1025.0 | 0.529756 | 0.527878 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
| thalach | 1025.0 | 149.114146 | 23.005724 | 71.0 | 132.0 | 152.0 | 166.0 | 202.0 |
| exang | 1025.0 | 0.336585 | 0.472772 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| oldpeak | 1025.0 | 1.071512 | 1.175053 | 0.0 | 0.0 | 0.8 | 1.8 | 6.2 |


### Data Scaling

In [8]:
# Checking the distribution of Target variable
heart_data['target'].value_counts()

target
1    526
0    499
Name: count, dtype: int64

1 → Defective Heart  
0 → Healthy Heart
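
The two classes are nearly balanced, which makes plain accuracy a reasonable metric here. A quick way to put later accuracy figures in context is to compute the score of a trivial model that always predicts the majority class. The sketch below uses only the counts printed by `value_counts()`; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output.
counts = {1: 526, 0: 499}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(shares)
print("majority-class baseline accuracy:", round(baseline_accuracy, 3))
```

Any trained model should clear this baseline of roughly 51% by a wide margin to be considered useful.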

*Splitting Features and Target*

In [9]:
X = heart_data.drop('target',axis=1) # features
y = heart_data['target']             # target

In [10]:
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X),columns = X.columns)
X

|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.479167 | 1.0 | 0.000000 | 0.292453 | 0.196347 | 0.0 | 0.5 | 0.740458 | 0.0 | 0.161290 | 1.0 | 0.50 | 1.000000 |
| 1 | 0.500000 | 1.0 | 0.000000 | 0.433962 | 0.175799 | 1.0 | 0.0 | 0.641221 | 1.0 | 0.500000 | 0.0 | 0.00 | 1.000000 |
| 2 | 0.854167 | 1.0 | 0.000000 | 0.481132 | 0.109589 | 0.0 | 0.5 | 0.412214 | 1.0 | 0.419355 | 0.0 | 0.00 | 1.000000 |
| 3 | 0.666667 | 1.0 | 0.000000 | 0.509434 | 0.175799 | 0.0 | 0.5 | 0.687023 | 0.0 | 0.000000 | 1.0 | 0.25 | 1.000000 |
| 4 | 0.687500 | 0.0 | 0.000000 | 0.415094 | 0.383562 | 1.0 | 0.5 | 0.267176 | 0.0 | 0.306452 | 0.5 | 0.75 | 0.666667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 0.625000 | 1.0 | 0.333333 | 0.433962 | 0.216895 | 0.0 | 0.5 | 0.709924 | 1.0 | 0.000000 | 1.0 | 0.00 | 0.666667 |
| 1021 | 0.645833 | 1.0 | 0.000000 | 0.292453 | 0.301370 | 0.0 | 0.0 | 0.534351 | 1.0 | 0.451613 | 0.5 | 0.25 | 1.000000 |
| 1022 | 0.375000 | 1.0 | 0.000000 | 0.150943 | 0.340183 | 0.0 | 0.0 | 0.358779 | 1.0 | 0.161290 | 0.5 | 0.25 | 0.666667 |
| 1023 | 0.437500 | 0.0 | 0.000000 | 0.150943 | 0.292237 | 0.0 | 0.0 | 0.671756 | 0.0 | 0.000000 | 1.0 | 0.00 | 0.666667 |
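
`MinMaxScaler` rescales each column to the range [0, 1] with `(x - min) / (max - min)`. As a sanity check, the sketch below reproduces three scaled values of the first row by hand, using the column minima and maxima from the `describe()` output. The helper `min_max` is a hypothetical name introduced only for this illustration.

```python
def min_max(value, col_min, col_max):
    """Rescale value to [0, 1] given the column's minimum and maximum."""
    return (value - col_min) / (col_max - col_min)

# First row of the raw data; min/max values come from the describe() output.
age_scaled = min_max(52, 29, 77)          # age: min 29, max 77
chol_scaled = min_max(212, 126, 564)      # chol: min 126, max 564
trestbps_scaled = min_max(125, 94, 200)   # trestbps: min 94, max 200

print(round(age_scaled, 6), round(chol_scaled, 6), round(trestbps_scaled, 6))
```

The three results agree with the `age`, `chol` and `trestbps` entries of row 0 in the scaled frame.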


In [11]:
y

0       0
1       0
2       0
3       0
4       0
       ..
1020    1
1021    0
1022    0
1023    1
1024    0
Name: target, Length: 1025, dtype: int64

In [12]:
# Splitting the Data into Training and Testing Data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,stratify=y ,random_state=101)
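
Note that `MinMaxScaler` was fitted on all 1025 rows before this split, so the test rows contributed to the minimum and maximum of each column. With min-max scaling the effect is often small, but a common alternative is to fit the scaler on the training rows only. The sketch below shows one way to do that with a pipeline. It runs on synthetic data from `make_classification` (an assumption made to keep the example self-contained), and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the heart data (assumption: 13 numeric features).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# Split the raw, unscaled features first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=101
)

# The pipeline fits MinMaxScaler on the training rows only, then reuses
# those training min/max values when transforming the test rows.
model = make_pipeline(MinMaxScaler(), LogisticRegression(random_state=0))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

A pipeline also keeps scaling and model together, so the same object can be passed to cross-validation or grid search without refitting the scaler by hand.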

In [13]:
print('The Dimensions of the split: ')
print('X : ',X.shape)
print('X_train : ',X_train.shape)
print('X_test : ',X_test.shape)

The Dimensions of the split: 
X :  (1025, 13)
X_train :  (820, 13)
X_test :  (205, 13)


### Model Training and Evaluation

### Logistic Regression

In [14]:
logistic_reg = LogisticRegression(random_state=0)

# Training
logistic_reg.fit(X_train,y_train)

In [15]:
# Accuracy on training data
X_train_prediction_1 = logistic_reg.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction_1)

print('Accuracy on training data: ',training_data_accuracy)

Accuracy on training data:  0.8475609756097561


In [16]:
# Accuracy on test data
X_test_prediction_1 = logistic_reg.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction_1)

print('Accuracy on test data: ',test_data_accuracy)

Accuracy on test data:  0.8585365853658536


In [17]:
# Classification report (precision, recall and F1 per class)
print(classification_report(y_test,X_test_prediction_1))

              precision    recall  f1-score   support

           0       0.91      0.79      0.84       100
           1       0.82      0.92      0.87       105

    accuracy                           0.86       205
   macro avg       0.87      0.86      0.86       205
weighted avg       0.86      0.86      0.86       205
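
The report lists precision, recall and F1 per class rather than the confusion matrix itself. To show how those numbers relate to raw counts, the sketch below recomputes the class 1 row. The four counts are not printed by the notebook: they are reconstructed here from the rounded report, the supports (100 and 105) and the test accuracy, so treat them as an inference.

```python
# Counts reconstructed from the rounded report and the test accuracy
# (an inference, not an output of the notebook).
tn, fp = 79, 21   # actual class 0 (support 100): predicted 0, predicted 1
fn, tp = 8, 97    # actual class 1 (support 105): predicted 0, predicted 1

precision_1 = tp / (tp + fp)   # of all predicted positives, the share that is correct
recall_1 = tp / (tp + fn)      # of all actual positives, the share that is found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 4))
```

On the real data, `sklearn.metrics.confusion_matrix(y_test, X_test_prediction_1)` would give the counts directly.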



### Decision Tree Classifier

In [18]:
dec_tree_clf = DecisionTreeClassifier(random_state=0,max_depth=5,min_samples_leaf=1,min_samples_split=5)
dec_tree_clf.fit(X_train,y_train) # fitting the data
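
`export_graphviz` is imported at the top of the notebook but never called. A minimal sketch of how it could be used is shown below. It trains a small tree on the built-in iris data purely as a stand-in, so the feature and class names are not those of the heart dataset.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Iris is used only as a small built-in stand-in dataset.
iris = load_iris()
tree = DecisionTreeClassifier(random_state=0, max_depth=2)
tree.fit(iris.data, iris.target)

# With out_file=None the Graphviz DOT source is returned as a string
# instead of being written to a file.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)
print(dot_source[:60])
```

The DOT text can be saved to a file and rendered with the Graphviz command-line tool, for example `dot -Tpdf tree.dot -o tree.pdf`, provided Graphviz is installed.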

In [19]:
# Accuracy on the training data
X_train_prediction_2 = dec_tree_clf.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction_2)

print('Accuracy on training data: ',training_data_accuracy)

Accuracy on training data:  0.9243902439024391


In [20]:
# Accuracy on the test data
X_test_prediction_2 = dec_tree_clf.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction_2)

print('Accuracy on test data: ',test_data_accuracy)

Accuracy on test data:  0.9414634146341463


In [21]:
# Classification report (precision, recall and F1 per class)
print(classification_report(y_test,X_test_prediction_2))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       100
           1       0.94      0.94      0.94       105

    accuracy                           0.94       205
   macro avg       0.94      0.94      0.94       205
weighted avg       0.94      0.94      0.94       205



### Random Forest Classifier (Best Accuracy)

In [22]:
# creating object or instance
ran_for_clf = RandomForestClassifier(max_depth=6,random_state=0) 

# Fitting the data
ran_for_clf.fit(X_train,y_train)

In [23]:
# Accuracy on the training data
X_train_prediction_3 = ran_for_clf.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction_3)

print('Accuracy on training data: ',training_data_accuracy)

Accuracy on training data:  0.9804878048780488


In [24]:
# Accuracy on the test data
X_test_prediction_3 = ran_for_clf.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction_3)

print('Accuracy on test data: ',test_data_accuracy)

Accuracy on test data:  0.9853658536585366


In [25]:
# Classification report (precision, recall and F1 per class)
print(classification_report(y_test,X_test_prediction_3))

              precision    recall  f1-score   support

           0       1.00      0.97      0.98       100
           1       0.97      1.00      0.99       105

    accuracy                           0.99       205
   macro avg       0.99      0.98      0.99       205
weighted avg       0.99      0.99      0.99       205
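
A single train/test split gives one accuracy figure, and that figure can depend on which rows happen to fall into the test set. Stratified k-fold cross-validation is one way to see how stable the score is. The sketch below uses the same forest settings (`max_depth=6`, `random_state=0`) on synthetic data from `make_classification`, which is an assumption made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the heart data (assumption: 13 numeric features).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

forest = RandomForestClassifier(max_depth=6, random_state=0)

# Five stratified folds: each fold keeps the class proportions of the full set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
scores = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```

A small standard deviation across folds would suggest the single-split result is representative; a large one would suggest it is not.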



### Support Vector Classifier

In [26]:
# Linear Kernel (gamma is ignored by the linear kernel and has no effect here)
svcLinear = SVC(kernel='linear',C=10000,gamma=0.001)
svcLinear.fit(X_train,y_train)

In [27]:
# Accuracy on the training data
X_train_prediction_4 = svcLinear.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction_4)

print('Accuracy on training data: ',training_data_accuracy)

Accuracy on training data:  0.8463414634146341


In [28]:
# Accuracy on the test data
X_test_prediction_4 = svcLinear.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction_4)

print('Accuracy on test data: ',test_data_accuracy)

Accuracy on test data:  0.8682926829268293


In [29]:
# Sigmoid Kernel
svm = SVC(kernel='sigmoid',C=100000,gamma=0.005)
svm.fit(X_train,y_train)

In [30]:
# Accuracy on the training data
X_train_prediction_4 = svm.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction_4)

print('Accuracy on training data: ',training_data_accuracy)

Accuracy on training data:  0.801219512195122


In [31]:
# Accuracy on the test data
X_test_prediction_4 = svm.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction_4)

print('Accuracy on test data: ',test_data_accuracy)

Accuracy on test data:  0.824390243902439


In [32]:
# Classification report for the sigmoid-kernel SVC (X_test_prediction_4 now holds its predictions)
print(classification_report(y_test,X_test_prediction_4))

              precision    recall  f1-score   support

           0       0.86      0.77      0.81       100
           1       0.80      0.88      0.84       105

    accuracy                           0.82       205
   macro avg       0.83      0.82      0.82       205
weighted avg       0.83      0.82      0.82       205



### Grid-Search CV

In [33]:
clf = SVC()
grid = {'C' : [1e2,1e3,5e3,1e4,5e4,1e5],
        'gamma' : [1e-3,5e-4,1e-4,5e-3]}
abc = GridSearchCV(clf,grid)
abc.fit(X_train,y_train)

In [34]:
abc.best_estimator_
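
Because `SVC()` defaults to the RBF kernel, this grid search tunes an RBF model rather than the linear or sigmoid models trained earlier. Beyond `best_estimator_`, a fitted `GridSearchCV` also exposes `best_params_` and `best_score_`, and it can score held-out data directly. The sketch below illustrates this on synthetic data with a smaller grid; both the data and the grid values are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (assumption: 13 numeric features).
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=101
)

# Smaller grid than the notebook's, chosen only to keep the demo fast.
param_grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}

# SVC() defaults to the RBF kernel; cv=5 is five-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("mean CV accuracy of best setting:", round(search.best_score_, 3))
print("held-out test accuracy:", round(search.score(X_te, y_te), 3))
```

By default the search refits the best setting on the full training data, so `search.score` and `search.predict` use that refitted model.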

### KNN - K Nearest Neighbours

In [35]:
# Creating an object or instance
knn = KNeighborsClassifier()

# Fitting the data or training the model
knn.fit(X_train,y_train)

In [36]:
# Accuracy on the training data
X_train_prediction_5 = knn.predict(X_train)
training_data_accuracy = accuracy_score(y_train, X_train_prediction_5)

print('Accuracy on training data: ',training_data_accuracy)

Accuracy on training data:  0.9536585365853658


In [37]:
# Accuracy on the test data
X_test_prediction_5 = knn.predict(X_test)
test_data_accuracy = accuracy_score(y_test, X_test_prediction_5)

print('Accuracy on test data: ',test_data_accuracy)

Accuracy on test data:  0.848780487804878


In [38]:
# Classification report (precision, recall and F1 per class)
print(classification_report(y_test,X_test_prediction_5))

              precision    recall  f1-score   support

           0       0.82      0.89      0.85       100
           1       0.89      0.81      0.85       105

    accuracy                           0.85       205
   macro avg       0.85      0.85      0.85       205
weighted avg       0.85      0.85      0.85       205
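
KNN shows the largest gap between training accuracy (about 0.95) and test accuracy (about 0.85) of the models tried, and it was run with the default `n_neighbors=5`. One possible follow-up is to compare a few values of `k` with cross-validation. The sketch below does this on synthetic data, so the `k` it selects says nothing about the heart dataset; the candidate values are likewise an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart data (assumption: 13 numeric features).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# Mean 5-fold accuracy for several odd values of n_neighbors.
mean_accuracy = {}
for k in (1, 3, 5, 7, 9, 11):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    mean_accuracy[k] = cross_val_score(knn_k, X_demo, y_demo, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy)
print("best k on this synthetic data:", best_k)
```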



### Conclusion : With a test accuracy of about 98.5%, the Random Forest classifier is the best-performing model on this dataset. 
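
To make the comparison explicit, the sketch below collects the training and test accuracies reported in the cells above (rounded to four decimals) and ranks the models by test accuracy. The dictionary layout and names are illustrative only.

```python
# Train and test accuracies copied from the notebook outputs (rounded).
results = {
    "Logistic Regression": (0.8476, 0.8585),
    "Decision Tree":       (0.9244, 0.9415),
    "Random Forest":       (0.9805, 0.9854),
    "SVC (linear)":        (0.8463, 0.8683),
    "SVC (sigmoid)":       (0.8012, 0.8244),
    "KNN":                 (0.9537, 0.8488),
}

# Rank models by test accuracy, highest first.
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)
for name, (train_acc, test_acc) in ranked:
    print(f"{name:<20} train={train_acc:.4f}  test={test_acc:.4f}")

best_model = ranked[0][0]
print("best test accuracy:", best_model)
```

The ranking is based on a single 80/20 split, so small differences between neighbouring models should be read with caution.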