# Introduction

According to the WHO, cardiovascular diseases are the leading cause of death worldwide. About seventeen million people die from them every year, with heart disease accounting for a large share. Prevention is better than cure: if we can estimate the risk for every person who may have heart disease, then not only patients but everyone can act earlier to keep illness away.

This dataset contains real patient data with clinically important features. We will build a predictive model with the random forest and decision tree classifiers from scikit-learn.

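A confusion matrix is a common technique for assessing the accuracy of a model. From a medical standpoint it is especially informative, because it separates missed cases (false negatives) from false alarms (false positives).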
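
To make the link between a confusion matrix and accuracy concrete, here is a minimal sketch on made-up labels. The ten labels and the variable names are assumptions for illustration only and do not come from the heart dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for ten patients (1 = heart disease, 0 = healthy)
demo_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
demo_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(demo_true, demo_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # share of sick patients that were caught
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared
```

In a screening setting, sensitivity is often the number to watch, since a false negative means a sick patient goes undetected.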



# Exploratory Analysis

There are thirteen features and one target, as listed below:

- age: The person's age in years
- sex: The person's sex (1 = male, 0 = female)
- cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
- trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
- chol: The person's cholesterol measurement in mg/dl
- fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
- restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
- thalach: The person's maximum heart rate achieved
- exang: Exercise-induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
- ca: The number of major vessels (0-3) colored by fluoroscopy
- thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)

- target: Heart disease (0 = no, 1 = yes)
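
The coded features are easier to read once mapped back to labels. The sketch below assumes a tiny hypothetical sample (`sample`) and copies two of the mappings from the list above. The CSV loaded later may store some codes differently, so the `NaN` count at the end serves as a quick mismatch check rather than a guarantee.

```python
import pandas as pd

# Hypothetical three-row sample; the real values come from heart.csv
sample = pd.DataFrame({
    "sex": [1, 0, 1],
    "cp": [1, 4, 3],
    "target": [0, 1, 1],
})

# Code-to-label mappings copied from the feature list above
sex_labels = {1: "male", 0: "female"}
cp_labels = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}

readable = sample.assign(
    sex=sample["sex"].map(sex_labels),
    cp=sample["cp"].map(cp_labels),
)

# Codes missing from a mapping become NaN, which flags a mismatch
unmapped = readable[["sex", "cp"]].isna().sum()
```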

In [None]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import seaborn as sns

import matplotlib
import matplotlib.pyplot as plt

In [None]:

#load csv file
heart_data = pd.read_csv('../input/heart-disease-uci/heart.csv')

In [None]:
#print first 5 rows
heart_data.head()



In [None]:
# print last 5 rows
heart_data.tail()



In [None]:
# number of rows and columns in the dataset
heart_data.shape



In [None]:
# getting some info about the data
heart_data.info()




In [None]:
# checking for missing values
heart_data.isnull().sum()


In [None]:
# statistical measure about the data
heart_data.describe()



In [None]:
# checking the distribution of the target variable
# 1 => Defective Heart
# 0 => Healthy Heart
heart_data["target"].value_counts()

In [None]:
# Spearman rank correlation measures monotonic relationships, which suits the ordinal and coded features here
# compute the correlation between every pair of attributes
correlaction_matrix = heart_data.corr(method="spearman")
plt.figure(figsize=(15, 20))

sns.heatmap(correlaction_matrix, annot = True)

plt.title("Correlation matrix for Numeric Features")

plt.xlabel("Heart Disease UCI features")

plt.ylabel("Heart Disease UCI features")
plt.show()


In [None]:
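# flatten the matrix into a Series indexed by (feature, feature) pairs
corr_pair = correlaction_matrix.unstack()

# sort ascending, then keep pairs above 0.4 while excluding the trivial self-correlations of 1.0
sorted_pairs = corr_pair.sort_values()
high_corr = sorted_pairs[(sorted_pairs > 0.4) & (sorted_pairs < 1.0)]
high_corr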
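
Unstacking a full correlation matrix lists every pair twice, once as (a, b) and once as (b, a). A minimal sketch of one way to keep each pair once is to read only the cells above the diagonal. The toy frame `toy` and all names below are assumptions for illustration, standing in for `heart_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for heart_data
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 1, 4, 3, 6, 5],
    "c": [6, 5, 4, 3, 2, 1],
})
toy_corr = toy.corr(method="spearman")

# Row and column positions of the cells strictly above the diagonal
rows, cols = np.triu_indices(len(toy_corr), k=1)

# One entry per unordered feature pair, with no self-correlations
unique_pairs = pd.Series(
    toy_corr.to_numpy()[rows, cols],
    index=pd.MultiIndex.from_arrays([toy_corr.index[rows], toy_corr.columns[cols]]),
)

# Rank by absolute value so strong negative correlations are kept too
strong = unique_pairs[unique_pairs.abs() > 0.4].sort_values(key=abs, ascending=False)
```

Filtering on the absolute value is a design choice: a strongly negative correlation is as informative as a strongly positive one.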

In [None]:
# age distribution by target; the y-axis shows the number of patients per bin
plt.hist([heart_data[heart_data.target==0].age, heart_data[heart_data.target==1].age], bins = 20, alpha = 0.5, label = ["no heart disease","with heart disease"])
plt.xlabel("age")
plt.ylabel("count")
plt.legend()
plt.show()


In [None]:
# cholesterol distribution by target; the y-axis shows the number of patients per bin
plt.hist([heart_data[heart_data.target==0].chol, heart_data[heart_data.target==1].chol], bins = 20, alpha = 0.5, label = ["no heart disease","with heart disease"])
plt.xlabel("chol")
plt.ylabel("count")
plt.legend()
plt.show()
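
The histograms above plot raw counts per bin, so the larger group always looks taller. If percentages within each group are wanted instead, one option is to weight every observation by 100/n. The sketch below uses made-up ages and a hypothetical helper `percent_hist`. `plt.hist` accepts the same `weights` argument.

```python
import numpy as np

# Hypothetical ages for two groups of different size
ages_no_disease = np.array([45, 52, 58, 61, 63, 67])
ages_disease = np.array([41, 44, 50, 54])

age_bins = np.linspace(40, 70, 4)  # three 10-year bins: 40-50, 50-60, 60-70

def percent_hist(values, bins):
    # Each observation contributes 100/n, so the bars of one group sum to 100
    weights = np.full(len(values), 100.0 / len(values))
    heights, _ = np.histogram(values, bins=bins, weights=weights)
    return heights

pct_no_disease = percent_hist(ages_no_disease, age_bins)
pct_disease = percent_hist(ages_disease, age_bins)
```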

In [None]:
# splitting the features and the target

X = heart_data.drop(columns='target')
Y = heart_data["target"]




In [None]:
#splitting data into training set and testing set 

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, stratify = Y, random_state = 2)
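
The `stratify` argument keeps the class proportions the same in both splits. A minimal sketch with made-up labels shows the effect. The arrays `toy_X` and `toy_y` are assumptions for illustration, not the heart data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 healthy (0) and 30 diseased (1)
toy_y = np.array([0] * 70 + [1] * 30)
toy_X = np.arange(100).reshape(-1, 1)  # placeholder feature column

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=2
)

# With stratify=toy_y, both splits keep the 30% positive rate
train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
```

Without stratification, a small test set can end up with a noticeably different share of positive cases, which makes the accuracy estimate noisier.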



In [None]:
X.shape


In [None]:

X_train.shape

In [None]:
# import the candidate models and the grid search utility
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


In [None]:
# candidate models and the hyperparameter grid to search for each

model_param = {
    'DecisionTreeClassifier': {
        'model': DecisionTreeClassifier(),
        'param': {
            'criterion': ['gini', 'entropy']
        }
    },
    'RandomForestClassifier': {
        'model': RandomForestClassifier(),
        'param': {
            'criterion': ['gini', 'entropy'],
            'n_estimators': [20, 50, 80, 120, 150]
        }
    },
    'KNeighborsClassifier': {
        'model': KNeighborsClassifier(),
        'param': {
            'n_neighbors': [5, 10, 15, 20, 25]
        }
    },
    'SVC': {
        'model': SVC(),
        'param': {
            'kernel': ['rbf', 'linear', 'sigmoid']
        }
    }
}



In [None]:
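# run a 5-fold grid search for each candidate and record its best score and parameters
# note: the search is fitted on the full dataset (X, Y), so it also sees the rows held out in X_test
score = []
for model_name, mp in model_param.items():
    model_selection = GridSearchCV(estimator=mp['model'], param_grid=mp['param'], cv=5, return_train_score=False)
    model_selection.fit(X, Y)
    score.append({
        'model': model_name,
        'best_score': model_selection.best_score_,
        'best_param': model_selection.best_params_
    })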


In [None]:
df_model_score = pd.DataFrame(score,columns=['model','best_score','best_param'])
df_model_score
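
The grid search above is fitted on the full dataset. A common alternative, sketched here on synthetic data, is to tune on the training split only so that the test split stays unseen until the final evaluation. The `make_classification` data, the reduced grid, and all `demo_` names are assumptions for illustration and not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the heart data (13 features, binary target)
demo_X, demo_y = make_classification(n_samples=300, n_features=13, random_state=0)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=2
)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"criterion": ["gini", "entropy"], "n_estimators": [20, 50]},
    cv=5,
)
# Tune on the training split only; the test split plays no part in model selection
search.fit(demo_X_train, demo_y_train)

# With the default refit=True, score() uses the best estimator refitted on the training split
held_out_accuracy = search.score(demo_X_test, demo_y_test)
```

The cross-validated `best_score_` and the held-out accuracy then answer different questions: the first guides model selection, the second estimates performance on new patients.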

In [None]:
model = RandomForestClassifier(criterion="entropy",n_estimators=15,random_state=0)
model.fit(X_train,Y_train)


In [None]:
# evaluate the model on the held-out test set
y_pred = model.predict(X_test)
randomForest_test_data_accuracy = accuracy_score(Y_test, y_pred)
randomForest_test_data_accuracy

In [None]:

#Making confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test,y_pred)
cm

## Build predictive system


In [None]:

# feature order: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal
input_data = (26, 0, 1, 110, 199, 0, 0, 120, 0, 1, 4, 2, 3)

# wrap the single instance in a one-row DataFrame that reuses the training column names,
# so scikit-learn does not warn about missing feature names
input_df = pd.DataFrame([input_data], columns=X.columns)

prediction = model.predict(input_df)

if prediction[0] == 0:
    print("The person does not have heart disease")
else:
    print("The person has heart disease")
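
Since the goal stated in the introduction is to estimate risk, a probability can be more useful than a hard 0/1 label. The sketch below uses scikit-learn's `predict_proba` on synthetic data. The `risk_` names and the data are assumptions for illustration. For a random forest the value is the mean of the trees' predicted class probabilities, so it should be read as a model score rather than a calibrated clinical risk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart data (13 features, binary target)
risk_X, risk_y = make_classification(n_samples=200, n_features=13, random_state=0)
risk_model = RandomForestClassifier(n_estimators=15, random_state=0).fit(risk_X, risk_y)

# predict_proba returns one column per class, ordered as in classes_
proba = risk_model.predict_proba(risk_X[:1])
positive_col = list(risk_model.classes_).index(1)

# Estimated probability of the positive class for the first row
risk = proba[0, positive_col]
```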