### Heart Prediction

Importing required libraries 

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

Reading the dataset into the heart dataframe using pandas

In [None]:
heart = pd.read_csv('heart.csv')

In [None]:
heart.head()

Looking at the first few data points, we can say that all the columns are numerical

Let's cross-check by looking at the info() of the dataframe

In [None]:
heart.info()

All the columns are of type int64 except oldpeak, which is float64

From the above info() details, we can tell that there are no null values

Still, let's confirm this explicitly with the isnull() method in pandas

In [None]:
heart.isnull().sum()

Now let's check the summary statistics (mean, standard deviation, min and max) and the percentiles of each feature

In [None]:
heart.describe()

Let's see whether the data is balanced or not

In [None]:
class_share = heart.target.value_counts(normalize=True) * 100  # class shares in percent
print("percentage of people having heart problem : " + str(round(class_share[1], 2)))
print("percentage of people not having heart problem : " + str(round(class_share[0], 2)))

Based on the above result, we can say that the data is balanced
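
As a rough illustration of how such a balance check can be quantified, the sketch below computes the class shares and the ratio of the smaller class to the larger one. The `target` series here is a hypothetical stand-in, not the values from heart.csv, and the name `balance_ratio` is introduced only for this example.

```python
import pandas as pd

# Hypothetical stand-in for the `target` column (not the real heart.csv values)
target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1], name="target")

# normalize=True returns class shares that sum to 1
shares = target.value_counts(normalize=True)

# Ratio of the smaller class to the larger one; 1.0 means perfectly balanced
balance_ratio = shares.min() / shares.max()

print((shares * 100).round(1))
print("minority/majority ratio:", round(balance_ratio, 2))
```

A ratio close to 1 suggests accuracy is a reasonable metric; how low a ratio still counts as "balanced" is a judgment call.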

In [None]:
features = ["age","sex","cp","trestbps","chol","fbs","restecg","thalach","exang","oldpeak","slope","ca","thal","target"]

kind_of_cat = ["sex","cp","fbs","restecg","exang","slope","ca","thal"]

# The above features take only a few distinct values each, so let's plot a barplot for each using seaborn

plt.figure(figsize = (16,15))
loc = 1
for feature in kind_of_cat:
    plt.subplot(3,3,loc)
    plt.title(feature)
    sns.barplot(x = feature, y = "target", data = heart)
    loc = loc +1

In [None]:
outliers = ["age","trestbps","chol","thalach","oldpeak"]
# Let's see if there are any outliers in these continuous features
plt.figure(figsize = (16,10))
loc = 1
for feature in outliers:
    plt.subplot(2,3,loc)
    sns.boxplot(x = feature, data = heart)
    loc = loc + 1

From the above figure, we can see that we have quite a few outliers for trestbps, chol and oldpeak and only one outlier for thalach

As of now we are not handling the outliers.

If we want to handle the outliers, we can simply cap them, i.e. replace every outlier data point with the upper bound or lower bound value
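
The capping idea mentioned above could be sketched as follows. This is only one possible approach: it assumes the bounds are the usual 1.5 * IQR whisker limits (the notebook does not say which bounds it has in mind), and `cap_outliers_iqr` and the toy `demo` frame are hypothetical names used for illustration.

```python
import pandas as pd

def cap_outliers_iqr(df, columns, k=1.5):
    """Return a copy of df with values outside [Q1 - k*IQR, Q3 + k*IQR] clipped to those bounds."""
    capped = df.copy()
    for col in columns:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # clip() replaces values below `lower` / above `upper` with the bound itself
        capped[col] = capped[col].clip(lower=lower, upper=upper)
    return capped

# Toy data with one obvious outlier (assumed values, not taken from heart.csv)
demo = pd.DataFrame({"chol": [200, 210, 220, 230, 240, 600]})
capped_demo = cap_outliers_iqr(demo, ["chol"])
print(capped_demo["chol"].tolist())
```

On the notebook's data this would be called as `cap_outliers_iqr(heart, outliers)`; ideally the bounds would be computed on the training split only, so that the test data does not influence them.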

In [None]:
# Let's look at the correlation between the features

plt.figure(figsize = (16,10))
sns.heatmap(heart.corr(), annot = True)
plt.show()
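
A heatmap over all column pairs can be hard to read, so a complementary view is to rank the features by the strength of their correlation with target. The sketch below does this on a small made-up frame; the values are assumptions for illustration, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with made-up values (not taken from heart.csv)
demo = pd.DataFrame({
    "thalach": [150, 160, 170, 120, 110, 130],
    "oldpeak": [0.0, 0.5, 0.2, 2.5, 3.0, 1.8],
    "target":  [1, 1, 1, 0, 0, 0],
})

# Take the `target` column of the correlation matrix, drop the self-correlation,
# and sort by absolute value so the strongest relationships come first
corr_with_target = (
    demo.corr()["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Replacing `demo` with `heart` gives the same ranking for the real data.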

In [None]:
# Dividing the dependent and independent features

X = heart.drop('target', axis = 1)
y = heart['target']

In [None]:
# Let's split the data into train and test sets using train_test_split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)

In [None]:
print("X_train :"+str(X_train.shape))
print("y_train :"+str(y_train.shape))
print("X_test :"+str(X_test.shape))
print("y_test :"+str(y_test.shape))

In [None]:
# Let's fit both a Logistic Regression and a Decision Tree model

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


In [None]:
LR = LogisticRegression()

LR.fit(X_train, y_train)
LR_pred = LR.predict(X_test)

In [None]:
DTC = DecisionTreeClassifier()

DTC.fit(X_train, y_train)
DTC_pred = DTC.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score


In [None]:
print("accuracy score of LR is :"+str(accuracy_score(y_test , LR_pred)))
print("accuracy score of DTC is :"+str(accuracy_score(y_test , DTC_pred)))

The Logistic Regression model gave a better accuracy score than the Decision Tree classifier on this test split
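
Because a single train/test split can favour one model by chance, one way to check whether such a comparison holds up is k-fold cross-validation. The sketch below is not part of the original analysis: it uses synthetic data from `make_classification` as a stand-in for X and y, and `max_iter=1000` and `random_state=42` are assumptions made here for convergence and reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 13 features, mirroring the shape of X (not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DTC": DecisionTreeClassifier(random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: one score per held-out fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Passing the notebook's `X` and `y` instead of the synthetic arrays would show whether the gap between the two models is consistent across folds.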