# Objective

The aim of this project is to analyse clinical data and predict the possibility of heart failure based on certain health parameters.
Along the way, we also explore the data to get a clear picture of it before making predictions.
We use three ML models, namely Logistic Regression, KNN and Decision Tree Classifier, and compare the accuracy of their predictions.

# Initialization

**Importing necessary packages**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Reading the file**

In [None]:
df=pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
df.head(10)

**Checking for null values**

In [None]:
df.isnull().sum()


**Plotting feature importances**

In [None]:
plt.rcParams['figure.figsize']=15,6
sns.set_style('darkgrid')

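# Use every column except the last as features and the last column as the target
x=df.iloc[:,:-1]
y=df.iloc[:,-1]

# Fit a tree ensemble and plot the importance it assigns to each feature
from sklearn.ensemble import ExtraTreesClassifier
model=ExtraTreesClassifier()
model.fit(x,y)
print(model.feature_importances_)
feat_imp=pd.Series(model.feature_importances_,index=x.columns)
feat_imp.plot(kind='barh')
plt.show()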


**From the given data we select only four factors for our analysis: age, time, serum creatinine and ejection fraction**
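Because `ExtraTreesClassifier` is randomized, the importances it reports can shift a little between runs. The sketch below shows one way the top-ranked features could be picked reproducibly. It runs on a small synthetic dataset rather than the clinical records, and the column names, the fixed seed and the helper `top_k_features` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

def top_k_features(features, target, k, seed=0):
    """Return the k most important feature names, highest first (hypothetical helper)."""
    model = ExtraTreesClassifier(n_estimators=200, random_state=seed)
    model.fit(features, target)
    importances = pd.Series(model.feature_importances_, index=features.columns)
    return importances.nlargest(k).index.tolist()

# Synthetic stand-in data: only signal_a and signal_b influence the target.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(300, 5)),
                   columns=['signal_a', 'signal_b', 'noise_1', 'noise_2', 'noise_3'])
toy_target = (toy['signal_a'] + toy['signal_b'] > 0).astype(int)

top_two = top_k_features(toy, toy_target, k=2)
print(top_two)
```

On the notebook's data, the equivalent call would pass `x`, `y` and `k=4`.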

# Data Insights

In [None]:
# Box Plot for Ejection Fraction
sns.boxplot(df['ejection_fraction'])
plt.show()

We find that there are two outliers in the above boxplot. Therefore we remove them.

In [None]:
# Keep only the rows whose ejection fraction is below 70, dropping the outlier rows
df=df[df['ejection_fraction']<70]
sns.boxplot(df['ejection_fraction'])
plt.show()

**We can see that the outliers have been removed**
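By default, seaborn draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a rough sketch, the same rule can be applied directly to a column to list the flagged values. The helper name `iqr_outliers` and the toy numbers are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

def iqr_outliers(values, whisker=1.5):
    """Return the values lying more than whisker * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

# Toy values for illustration only.
toy = pd.Series([20, 25, 30, 35, 38, 40, 45, 50, 60, 80])
flagged = iqr_outliers(toy)
print(flagged.tolist())
```

In the notebook, `iqr_outliers(df['ejection_fraction'])` would list the candidate outliers before a cut-off is chosen.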

In [None]:
#Boxplot for age
sns.boxplot(df['age'])
plt.show()

There are no outliers in age.

In [None]:
#Distribution of Age

import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Histogram(
    x = df['age'],
    xbins=dict( # bins used for histogram
        start=40,
        end=95,
        size=2
    ),
    marker_color='#e8aa60',
    opacity=1
))

fig.update_layout(
    title_text='Age Distribution',
    xaxis_title_text='Age',
    yaxis_title_text='Count', 
    bargap=0.05, # gap between bars of adjacent location coordinates
    plot_bgcolor='#000000',
    xaxis =  {'showgrid': False },
    yaxis = {'showgrid': False }
)
fig.show()

In [None]:
# Now let's break the above histogram down by DEATH_EVENT

import plotly.express as px
fig = px.histogram(df, x="age", color="DEATH_EVENT", hover_data=df.columns)
fig.show()

In [None]:
#Distribution of Serum Creatinine

fig = go.Figure()
fig.add_trace(go.Histogram(
    x = df['serum_creatinine'],
    xbins=dict( # bins used for histogram
        start=0.5,
        end=9.4,
        size=0.2
    ),
    marker_color='#e8ab60',
    opacity=1
))

fig.update_layout(
    title_text='Serum Creatinine Distribution',
    xaxis_title_text='Serum Creatinine',
    yaxis_title_text='Count', 
    bargap=0.05, # gap between bars of adjacent location coordinates
    plot_bgcolor='#000000',
    xaxis =  {'showgrid': False },
    yaxis = {'showgrid': False }
)
fig.show()

In [None]:
#Histogram in comparison to DEATH_EVENT

fig = px.histogram(df, x="serum_creatinine", color="DEATH_EVENT",marginal='violin', hover_data=df.columns)
fig.show()

In [None]:
#Distribution of Platelets

fig = go.Figure()
fig.add_trace(go.Histogram(
    x = df['platelets'],
    xbins=dict( # bins used for histogram
        start=25000,
        end=850000,
        size=10000
    ),
    marker_color='#e8ab60',
    opacity=1
))

fig.update_layout(
    title_text='Platelets Distribution',
    xaxis_title_text='Platelets',
    yaxis_title_text='Count', 
    bargap=0.05, # gap between bars of adjacent location coordinates
    plot_bgcolor='#000000',
    xaxis =  {'showgrid': False },
    yaxis = {'showgrid': False }
)
fig.show()

In [None]:
#Histogram of platelets as a function of DEATH_EVENT

fig = px.histogram(df, x="platelets", color="DEATH_EVENT",marginal='violin', hover_data=df.columns)
fig.show()

In [None]:
df['time'].describe()

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(
    x = df['time'],
    xbins=dict( # bins used for histogram
        start=4,
        end=285,
        size=5
    ),
    marker_color='#e8ab60',
    opacity=1
))

fig.update_layout(
    title_text='Time Distribution',
    xaxis_title_text='Time',
    yaxis_title_text='Count', 
    bargap=0.05, # gap between bars of adjacent location coordinates
    plot_bgcolor='#000000',
    xaxis =  {'showgrid': False },
    yaxis = {'showgrid': False }
)
fig.show()

In [None]:
fig1=px.pie(df, values='diabetes',names='DEATH_EVENT', title='Diabetes VS Death Event',width=600, height=400)
fig2=px.pie(df, values='DEATH_EVENT',names='diabetes',width=500, height=400)

fig1.show()

fig2.show()


**The first pie chart shows that 32% of the people who have diabetes die of heart failure, whereas 68% don't**
**The second shows that 58.3% of the people who die of heart failure don't have diabetes**
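The two pie charts answer different questions: the share of outcomes within the diabetic group, and the share of diabetics within the group that died. A minimal sketch of computing both as tables with `pd.crosstab` follows. The toy records are invented for illustration and do not reproduce the percentages quoted above.

```python
import pandas as pd

# Toy records for illustration only; the notebook would use df instead.
toy = pd.DataFrame({
    'diabetes':    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'DEATH_EVENT': [1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

# Each row sums to 1: outcome shares within the diabetic and non-diabetic groups.
outcome_given_diabetes = pd.crosstab(toy['diabetes'], toy['DEATH_EVENT'], normalize='index')

# Each column sums to 1: diabetes shares within each outcome group.
diabetes_given_outcome = pd.crosstab(toy['diabetes'], toy['DEATH_EVENT'], normalize='columns')

print(outcome_given_diabetes)
print(diabetes_given_outcome)
```

The same pattern applies to the `smoking` and `high_blood_pressure` columns.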

In [None]:
df.head()

In [None]:
fig1=px.pie(df, values='smoking',names='DEATH_EVENT', title='Smoking VS Death Event',width=600, height=400)
fig2=px.pie(df, values='DEATH_EVENT',names='smoking',width=500, height=400)

fig1.show()

fig2.show()

**The first pie chart shows that only 31.3% of the smokers die of heart failure**

In [None]:
fig1=px.pie(df, values='high_blood_pressure',names='DEATH_EVENT', title='High BP VS Death Event',width=600, height=400)
fig2=px.pie(df, values='DEATH_EVENT',names='high_blood_pressure',width=500, height=400)

fig1.show()

fig2.show()

**It is also interesting to note that 59.4% of deaths related to heart failure occur in people without high blood pressure**

# Training and testing the model

In [None]:
#We select the following features

Features=['time','ejection_fraction','serum_creatinine','age']

In [None]:
# Select the four chosen features by name; the target is the last column
x=df[['age','ejection_fraction','serum_creatinine','time']].values
y=df.iloc[:,-1].values

In [None]:
df.head()

In [None]:
print(x)

In [None]:
#Splitting the data into train and test set

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [None]:
x_test

In [None]:
x_test=np.nan_to_num(x_test)

In [None]:
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
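The scaler is fitted on the training set only and then reused on the test set, so no information from the test rows leaks into the preprocessing. The sketch below reproduces that arithmetic by hand with NumPy on made-up numbers. It illustrates the idea and is not a replacement for `StandardScaler`.

```python
import numpy as np

# Toy feature matrices for illustration only.
train = np.array([[40.0, 1.0], [50.0, 2.0], [60.0, 3.0], [70.0, 6.0]])
test = np.array([[55.0, 3.0]])

# The statistics are learned from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation, as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```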

# Trying out different learning models

**1. Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=10000)
classifier.fit(x_train,y_train)

In [None]:
# Predicting the value for the test set

y_pred=classifier.predict(x_test)

In [None]:
# Computing the confusion matrix and accuracy score

from sklearn.metrics import confusion_matrix,accuracy_score
cm=confusion_matrix(y_test,y_pred)
ac=accuracy_score(y_test,y_pred)
print(cm)
print(ac)

**Using logistic regression, we get 85% accuracy**
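Accuracy is only one of the figures that can be read off a confusion matrix. As a small worked example, the sketch below derives accuracy, recall and precision from a hypothetical 2x2 matrix in scikit-learn's layout. The counts are made up and are not the notebook's output.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
cm = np.array([[6, 1],
               [2, 3]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that were correct
recall = tp / (tp + fn)           # share of actual positives that were caught
precision = tp / (tp + fp)        # share of predicted positives that were real

print(accuracy, recall, precision)
```

For a clinical outcome such as `DEATH_EVENT`, recall is often worth reporting alongside accuracy, since a missed positive case may matter more than a false alarm.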

**2. K Nearest Neighbours**

In [None]:
#Finding the optimum number of neighbors

from sklearn.neighbors import KNeighborsClassifier

list1=[]
for neighbors in range(1,10):
    classifier=KNeighborsClassifier(n_neighbors=neighbors,metric='minkowski')
    classifier.fit(x_train,y_train)
    y_pred=classifier.predict(x_test)
    list1.append(accuracy_score(y_test,y_pred))
plt.plot(list(range(1,10)),list1)
plt.show()
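The loop above scores each candidate against the test set, so the chosen value may end up slightly tuned to that particular split. One possible alternative, sketched below, is to pick the number of neighbours by cross-validation on the training data and keep the test set for the final check. The synthetic data and the names `x_demo`, `y_demo` and `cv_scores` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data (illustration only).
x_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)

cv_scores = {}
for k in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data.
    cv_scores[k] = cross_val_score(knn, x_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```

In the notebook, `x_train` and `y_train` would take the place of the synthetic arrays.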

In [None]:
classifier=KNeighborsClassifier(n_neighbors=7,metric='minkowski')
classifier.fit(x_train,y_train)

In [None]:
y_pred=classifier.predict(x_test)

In [None]:
#Finding the confusion matrix and accuracy score

cm=confusion_matrix(y_test,y_pred)
ac=accuracy_score(y_test,y_pred)
print(cm)
print(ac)

We get 83% accuracy with KNN

**3. Decision Tree Classifier**

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(x_train,y_train)

In [None]:
# Predicting on the test set, then computing the confusion matrix and accuracy score
y_pred=classifier.predict(x_test)

cm=confusion_matrix(y_test,y_pred)
ac=accuracy_score(y_test,y_pred)
print(cm)
print(ac)

Using the Decision Tree Classifier, we get 83.33% accuracy

**Among the three models tried above, we find that logistic regression gives us the best accuracy, at 85%**
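A single train/test split gives one accuracy figure per model, and gaps of a couple of percentage points can depend on how the split happened to fall. As a hedged sketch, the three models could also be compared by cross-validated accuracy, with scaling placed inside a pipeline. The data below is synthetic and the variable names are illustrative, so the printed scores say nothing about the clinical dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four selected features and the target.
x_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=10000),
    'KNN': KNeighborsClassifier(n_neighbors=7),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

mean_accuracy = {}
for name, model in models.items():
    # Scaling lives inside the pipeline, so it is refitted on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    mean_accuracy[name] = cross_val_score(pipeline, x_demo, y_demo, cv=5).mean()

for name, score in mean_accuracy.items():
    print(f'{name}: {score:.3f}')
```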