Data Set Information:

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
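
The presence/absence simplification described above can be sketched as follows. This is a minimal illustration on made-up values, not real patient records; the column name `num` comes from the attribute list, and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical sample of the original 0-4 "goal" values (not real patient data)
num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# Any value above 0 counts as presence of heart disease; 0 stays absence
target = (num > 0).astype(int)
print(target.tolist())
```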

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

One file has been "processed": the one containing the Cleveland database. All four unprocessed files also exist in this directory.

Test costs (donated by Peter Turney) are in the folder "Costs".

Attribute Information:

Only 14 attributes are used:
    1. #3 (age)
    2. #4 (sex)
    3. #9 (cp)
    4. #10 (trestbps)
    5. #12 (chol)
    6. #16 (fbs)
    7. #19 (restecg)
    8. #32 (thalach)
    9. #38 (exang)
    10. #40 (oldpeak)
    11. #41 (slope)
    12. #44 (ca)
    13. #51 (thal)
    14. #58 (num) (the predicted attribute)

Columns:
    age:age in years
    sex:(1 = male; 0 = female)
    cp:chest pain type
    trestbps:resting blood pressure (in mm Hg on admission to the hospital)
    chol:serum cholesterol in mg/dl
    fbs:(fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    restecg:resting electrocardiographic results
    thalach:maximum heart rate achieved
    exang:exercise induced angina (1 = yes; 0 = no)
    oldpeak:ST depression induced by exercise relative to rest
    slope:the slope of the peak exercise ST segment
    ca:number of major vessels (0-3) colored by fluoroscopy
    thal:3 = normal; 6 = fixed defect; 7 = reversible defect
    target:1 or 0 


In [35]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np
import pandas as pd
import pandas_profiling as pp
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

['heart-disease-uci']


Loading the data into a DataFrame

In [None]:
df = pd.read_csv("../input/heart-disease-uci/heart.csv")

Viewing the data

In [None]:
# Once the data is loaded, we can inspect it. Instead of viewing the entire dataset, we view the first five rows
df.head()

In [None]:
# We can view the last five rows of the data
df.tail()

Viewing the size of the data (number of rows and columns)

In [None]:
# shape returns a (rows, columns) tuple
df.shape

Understanding the column attributes: data type and number of non-null rows

In [None]:
# Now that we know the overall size of the data, we can get per-column details using the info() method of the DataFrame
df.info()

The ProfileReport class from pandas_profiling generates a detailed summary report of the data

In [None]:
profile = pp.ProfileReport(df)
profile

In [None]:
# We can get the summary statistics (min,max,count...) of each column by using the describe() method
df.describe()

In [None]:
#Checking the column names
df.columns

In [None]:
# Rename the columns of the DataFrame so that each name starts with a capital letter
df=df.rename(columns={'age':'Age','sex':'Sex','cp':'Cp','trestbps':'Trestbps','chol':'Chol','fbs':'Fbs','restecg':'Restecg','thalach':'Thalach','exang':'Exang','oldpeak':'Oldpeak','slope':'Slope','ca':'Ca','thal':'Thal','target':'Target'})

In [None]:
#Recheck the column names
df.columns

In [None]:
# Check for missing values: mean() gives the fraction of missing values per column
# (use df.isnull().sum() for the count instead)
df.isnull().mean()

Detecting the Outliers

In [None]:
# Calculating the IQR for entire dataset, to detect outliers
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

In [None]:
# Print a boolean mask that is True wherever a value lies outside the 1.5 * IQR fences (nothing is removed here)
print((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))  )

In [None]:
# View the outlier values: cells that are not outliers appear as NaN in the printed frame
df_out = df[(df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))]
print(df_out)
# Alternative: drop every row that contains at least one outlier, then check whether this impacts the prediction
#df_out = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
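
To make the IQR rule easier to follow, here is a minimal sketch on a small made-up frame. The values and the name `toy` are assumptions for illustration only; the fences use the same 1.5 * IQR rule as the cells above.

```python
import pandas as pd

# Hypothetical toy data (not taken from heart.csv)
toy = pd.DataFrame({
    "Chol": [200, 210, 220, 230, 600],
    "Thalach": [150, 155, 160, 165, 170],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True wherever a value falls outside the 1.5 * IQR fences of its column
outlier_mask = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# A row is dropped if any of its columns holds an outlier
rows_with_outlier = outlier_mask.any(axis=1)
filtered = toy[~rows_with_outlier]

print(rows_with_outlier.sum(), "of", len(toy), "rows would be dropped")
```

Because a single outlying cell removes the whole row, the number of dropped rows can grow quickly as more columns are checked.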

In [None]:
#df_out.shape  # Deleting the rows with outliers may lose a large amount of data

In [None]:
# Next, try imputing the outliers instead. As all columns are numerical, median imputation is an option
#df_out = df[(df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))]
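
The notebook does not carry out the median imputation, so the following is only a sketch of one way it could be done, shown on made-up data. The frame `toy`, its values, and the choice to use the median of the whole column (outliers included) are assumptions.

```python
import pandas as pd

# Hypothetical toy data (not taken from heart.csv)
toy = pd.DataFrame({
    "Chol": [200, 210, 220, 230, 600],
    "Thalach": [150, 155, 160, 165, 170],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
outlier_mask = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# Replace each flagged value with the median of its own column
imputed = toy.copy()
for col in toy.columns:
    imputed[col] = toy[col].mask(outlier_mask[col], toy[col].median())

print(imputed)
```

Unlike dropping rows, this keeps every record and only overwrites the flagged cells.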

Identifying the duplicated values

In [None]:
# Check for duplicate rows: duplicated() flags every repeat of an earlier row
# Duplicate rows can then be removed with the drop_duplicates() method
df_dup = df[df.duplicated()]
df_dup

In [None]:
# As there is one duplicate row, delete it
df = df.drop_duplicates()
df
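
As a small illustration of how `duplicated()` and `drop_duplicates()` behave, here is a sketch on a made-up frame (the values and the name `toy` are assumptions). By default the first occurrence is kept and later repeats are flagged.

```python
import pandas as pd

# Hypothetical toy data with one repeated row
toy = pd.DataFrame({"Age": [38, 38, 57], "Chol": [175, 175, 236]})

# duplicated() marks the second and later occurrences of a row
dupes = toy[toy.duplicated()]

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()

print(dupes.index.tolist(), len(deduped))
```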

In [None]:
df['Target'].value_counts()

In [None]:
sns.countplot(x='Target', data=df)
plt.show()

In [None]:
df.hist()

Converting the categorical variables

In [None]:
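# One-hot encode the categorical columns
# Note: the result is stored in `dataset`, while the cells below keep using `df`
dataset = pd.get_dummies(df, columns = ['Sex', 'Cp', 'Fbs', 'Restecg', 'Exang', 'Slope', 'Ca', 'Thal'])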
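
To show what `pd.get_dummies` produces, here is a minimal sketch on a made-up frame (the values and the name `toy` are assumptions). Each listed column is replaced by one indicator column per distinct value, and the remaining columns are left unchanged.

```python
import pandas as pd

# Hypothetical toy data (not taken from heart.csv)
toy = pd.DataFrame({"Age": [63, 37, 41], "Cp": [3, 2, 1]})

# "Cp" is replaced by one indicator column per chest pain type
encoded = pd.get_dummies(toy, columns=["Cp"])

print(encoded.columns.tolist())
```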

Performing Feature Scaling

In [None]:
standardScaler = StandardScaler()
columns_to_scale = ['Age', 'Trestbps', 'Chol', 'Thalach', 'Oldpeak']
df[columns_to_scale] = standardScaler.fit_transform(df[columns_to_scale])
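
For reference, `StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. A minimal NumPy sketch of the same calculation on made-up values (the array `x` is an assumption):

```python
import numpy as np

# Hypothetical ages (not taken from heart.csv)
x = np.array([40.0, 50.0, 60.0])

# z = (x - mean) / std, where np.std defaults to the population formula (ddof=0)
z = (x - x.mean()) / x.std()

print(z)
```

After scaling, the values are expressed in standard deviations from the column mean, which puts columns with different units on a comparable footing.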

In [None]:
df.head()

In [None]:
y = df['Target']
X = df.drop(['Target'], axis = 1)

Logistic Regression

In [38]:
logreg = LogisticRegression()
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.4,random_state = 42)
logreg.fit(X_train,y_train)
y_pred = logreg.predict(X_test)
print("The Score is",logreg.score(X_test,y_test))
cv_score_five = cross_val_score(logreg,X,y,cv=5)
cv_score_ten = cross_val_score(logreg,X,y,cv=10)
print("Cross validation score - Five Folds",cv_score_five)
print("Cross validation score - Ten Folds",cv_score_ten)
print("Mean cross validation score",cv_score_ten.mean())
print("Confusion Matrix",confusion_matrix(y_test,y_pred))
print("Classification Report")
print(classification_report(y_test,y_pred))

The Score is 0.8429752066115702
Cross validation score - Five Folds [0.81967213 0.8852459  0.85245902 0.85       0.74576271]
Cross validation score - Ten Folds [0.87096774 0.77419355 0.87096774 0.87096774 0.9        0.76666667
 0.86666667 0.9        0.68965517 0.75862069]
Mean cross validation score 0.8268705969595848
Confusion Matrix [[43  9]
 [10 59]]
Classification Report
              precision    recall  f1-score   support

           0       0.81      0.83      0.82        52
           1       0.87      0.86      0.86        69

    accuracy                           0.84       121
   macro avg       0.84      0.84      0.84       121
weighted avg       0.84      0.84      0.84       121
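
The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[43, 9],
      [10, 59]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

accuracy = (tn + tp) / (tn + fp + fn + tp)

# Class 1 (treated as positive)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0 (treated as positive)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(round(accuracy, 2), round(precision_0, 2), round(recall_0, 2),
      round(precision_1, 2), round(recall_1, 2))
```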



Decision Tree Classifier

In [36]:
dec_clf = DecisionTreeClassifier(max_depth = 3 , random_state=1)
dec_clf.fit(X_train,y_train)
y_pred = dec_clf.predict(X_test)
print("The accuracy score of decision tree classifier is ",accuracy_score(y_test,y_pred))
cv_dec_tree_clf = cross_val_score(dec_clf,X,y,cv=10)
print("The Cross validation score ",cv_dec_tree_clf.mean())

The accuracy score of decision tree classifier is  0.8429752066115702
The Cross validation score  0.8069929551353356


Random Forest Classifier

In [37]:
randomforest_classifier= RandomForestClassifier(n_estimators=10)
score=cross_val_score(randomforest_classifier,X,y,cv=10)
print(score.mean())

0.8136744530960327
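
The forest above is created without a `random_state`, so its cross-validation score can change from run to run. The sketch below, on synthetic stand-in data (an assumption; it does not use the heart disease dataset), illustrates how fixing the seed makes the score repeatable. The helper `mean_cv_score` is hypothetical and defined only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

def mean_cv_score(seed):
    """Mean 10-fold cross-validation accuracy of a 10-tree forest with a fixed seed."""
    clf = RandomForestClassifier(n_estimators=10, random_state=seed)
    return cross_val_score(clf, X_demo, y_demo, cv=10).mean()

first_run = mean_cv_score(seed=1)
second_run = mean_cv_score(seed=1)

# With the same seed, both runs build identical forests
print(first_run == second_run)
```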
