## <h1 style="background-color:#85B3F1" align="center">About Dataset</h1>

<br>
<p style="font-size:20px;">This dataset supports predicting death from heart failure using clinical attributes such as diabetes, sex and smoking. Our task is to find an appropriate model that predicts the death event ("DEATH_EVENT") from these attributes.</p>

## <h1 style="background-color:#85B3F1" align="center"> Contents</h1>

<li style="font-size:15px">Importing Libraries</li> 
<li style="font-size:15px">Analyzing Data</li> 
<li style="font-size:15px">Visualizing Data</li> 
<li style="font-size:15px">Data Modelling</li>
<li style="font-size:15px">Conclusions</li> 

## <h1 id='implib' style="background-color:#85B3F1;" align="center">Importing Libraries</h1>

In [None]:
# Data handling and visualizing
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Machine Learning Libraries
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

## <h1 id="anadata" style="background-color:#85B3F1;" align="center">Analyzing Dataset</h1>

In [None]:
df = pd.read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
df.head()

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

### <i>Observations:</i>
<br>
<li style="font-size:15px">The dataset has no null values.</li>
<li style="font-size:15px">There are only numerical values in the dataset.</li>
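Both observations can also be checked programmatically instead of by reading the `info()` output. The following is a minimal sketch on a hypothetical three-column frame named `toy`; in the notebook the same two expressions would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (assumption: same kind of columns).
toy = pd.DataFrame({
    "age": [60.0, 75.0, 52.0],
    "sex": [1, 0, 1],
    "DEATH_EVENT": [0, 1, 0],
})

# True if any cell in the frame is missing.
has_nulls = bool(toy.isnull().values.any())

# Names of columns whose dtype is not numeric (empty list means all numeric).
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

print("Any nulls:", has_nulls)
print("Non-numeric columns:", non_numeric)
```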

## <h1 style="background-color:#85B3F1" align="center" id="visdata">Visualizing Dataset</h1>

In [None]:
df.hist(figsize=(20,20),color="#44B1B1");

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x = "age", hue = "DEATH_EVENT",data=df,palette=['#432371',"#FAAE7B"])

### <i>Observations:</i>
<br>
<li style="font-size:15px">The distribution of "DEATH_EVENT" varies with age, which suggests that the outcome depends on the age column.</li><br>
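A count plot with one bar per age value can be hard to read, so the relationship can also be summarised as a death rate per age band. This is a rough sketch on made-up rows (the `toy` frame and the bin edges are assumptions chosen for illustration, not values from the dataset); with the real data, `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical rows; in the notebook, use the real `df` loaded above.
toy = pd.DataFrame({
    "age": [42, 45, 50, 55, 60, 63, 65, 70, 75, 80, 85, 90],
    "DEATH_EVENT": [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1],
})

# Group ages into left-closed bands: [40, 60), [60, 80), [80, 100).
bands = pd.cut(toy["age"], bins=[40, 60, 80, 100], right=False)

# The mean of a 0/1 column is the fraction of deaths within each band.
death_rate = toy.groupby(bands, observed=True)["DEATH_EVENT"].mean()
print(death_rate)
```

If the rate rises from one band to the next on the real data, that supports the observation more directly than the raw counts do.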

In [None]:
#plt.figure(figsize=(20,10))
sns.countplot(x = "diabetes", hue = "DEATH_EVENT",data=df,palette=['#CAF6D6',"#DC74AD"])

### <i>Observations:</i>
<br>
<li style="font-size:15px">The column "DEATH_EVENT" does not appear to depend on the diabetes column.</li>
<br>

In [None]:
## Plotting heatmap to identify all the relationships

plt.figure(figsize=(20,20))
sns.heatmap(df.corr(),cmap="coolwarm",annot=True)

###  <i>Observations</i>:
<br>
<p style="font-size:15px">From the heatmap, the following insights can be drawn:<br><br>
<li style="font-size:15px">The sign of a correlation coefficient gives the direction of a relationship, not its strength, so a negative value does not by itself mean that "DEATH_EVENT" is independent of a column; the magnitude is what matters. The columns 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'high_blood_pressure', 'platelets', 'sex', 'smoking' and 'time' are left out of the model in the next section.</li><br>
<li style="font-size:15px">'age' and 'serum_creatinine' are positively correlated with "DEATH_EVENT" and are kept as features, together with 'ejection_fraction' and 'serum_sodium'.</li></p><br>
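When reading a correlation heatmap for feature selection, it can help to rank features by the absolute value of their correlation with the target, since a strongly negative feature is as informative as a strongly positive one. The sketch below uses synthetic data; the column names `pos_feature`, `neg_feature` and `noise` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Synthetic features (assumption: illustrative only, not the real columns).
toy = pd.DataFrame({
    "pos_feature": 2 * target + rng.normal(0, 1.0, n),   # rises with the target
    "neg_feature": -2 * target + rng.normal(0, 1.0, n),  # falls with the target
    "noise": rng.normal(0, 1.0, n),                      # unrelated to the target
    "DEATH_EVENT": target,
})

# Correlation of every feature with the target, target itself removed.
corr = toy.corr()["DEATH_EVENT"].drop("DEATH_EVENT")

# Order by magnitude so strong negative correlations are not overlooked.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

On the real data, the equivalent expression would be `df.corr()["DEATH_EVENT"]`, ranked the same way.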

In [None]:
sns.pairplot(df,hue="DEATH_EVENT")

### <i>Observations:</i>
<br>
<p style="font-size:15px">The pair plot shows that the two classes overlap across the features, so they are not cleanly separable by a straight line. Non-linear classifiers such as Decision Tree and KNN are therefore a reasonable choice.<br></p>

## <h1 id="moddata" style="background-color:#85B3F1;" align="center">Dataset Modeling</h1>

In [None]:
X = df.drop(['anaemia', 'creatinine_phosphokinase', 
            'diabetes', 'high_blood_pressure', 
            'platelets', 'sex', 'smoking', 
            'time', 'DEATH_EVENT'], axis=1) 
y = df['DEATH_EVENT']

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [None]:
## Feature scaling to bring all features onto a common scale

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
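To make the scaling step concrete, here is a small sketch of what `StandardScaler` does: it learns the per-column mean and standard deviation from the training data only, and then applies those same values to the test data. The arrays `train` and `test` are made-up numbers, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data (assumption: values chosen for easy arithmetic).
train = np.array([[40.0, 1.0], [60.0, 2.0], [80.0, 3.0]])
test = np.array([[60.0, 5.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train, then scales it
test_scaled = scaler.transform(test)        # reuses the train statistics, no refitting

print("Learned means:", scaler.mean_)
print("Train mean/std after scaling:", train_scaled.mean(axis=0), train_scaled.std(axis=0))
print("Scaled test row:", test_scaled)
```

Because the test row is scaled with the training statistics, its values need not fall in the same range as the scaled training rows; fitting the scaler on the test set as well would leak information from it.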

### 1. KNN Algorithm

In [None]:
kclassifier = KNeighborsClassifier(n_neighbors=5,metric="minkowski",p=2)
kclassifier.fit(X_train,y_train)
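The value `n_neighbors=5` is scikit-learn's default rather than a tuned choice. One possible way to pick it is to compare a few candidate values with cross-validation on the training data. The sketch below runs on synthetic data from `make_classification`; the candidate list `ks` and the data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (assumption: not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

ks = [1, 3, 5, 7, 9, 11]  # hypothetical candidate values for n_neighbors
mean_acc = {}
for k in ks:
    # Scaling lives inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_acc[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(mean_acc, key=mean_acc.get)
print("Mean CV accuracy per k:", mean_acc)
print("Best k:", best_k)
```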

In [None]:
y_pred_kn = kclassifier.predict(X_test)

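# Note: cross_val_score clones the estimator and refits it on folds of the data it is given,
# so this score comes from models trained on parts of the test split, not from the kclassifier fitted above.
kac = cross_val_score(estimator=kclassifier,X=X_test,y = y_test,cv = 10)
print("Accuracy of KNN (10-fold CV on the test split): ", str(round(kac.mean()*100,2))+"%")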
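A common alternative to cross-validating on the test split is to cross-validate on the training split and to score the fitted model once on the held-out test split with `accuracy_score`. The sketch below shows that pattern on synthetic data; it is one possible setup, not the evaluation used in this notebook, and the names `X_demo`, `X_tr` and `X_te` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four selected features (assumption: not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Scaler and classifier in one pipeline, so each CV fold fits its own scaler.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Cross-validation uses only the training split.
cv_scores = cross_val_score(model, X_tr, y_tr, cv=10)

# The test split is used once, on the model fitted to the full training split.
model.fit(X_tr, y_tr)
test_acc = accuracy_score(y_te, model.predict(X_te))

print("CV accuracy on training split:", round(cv_scores.mean() * 100, 2), "%")
print("Accuracy on held-out test split:", round(test_acc * 100, 2), "%")
```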

In [None]:
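# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
cf_knn = confusion_matrix(y_test,y_pred_kn)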
sns.heatmap(cf_knn,annot=True)
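The argument order of `confusion_matrix` matters: scikit-learn takes the true labels first, so rows correspond to actual classes and columns to predicted classes. The sketch below uses short hand-written label lists (hypothetical values) to show the layout and how recall and precision for the death class follow from it.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = death event, 0 = no death event.
y_true = [0, 0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 0, 0]

# Rows = actual class, columns = predicted class.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)     # share of actual deaths that were predicted
precision = tp / (tp + fp)  # share of predicted deaths that were correct

# Swapping the arguments yields the transposed matrix.
cm_swapped = confusion_matrix(y_hat, y_true)

print(cm)
print("Recall:", recall, "Precision:", precision)
```

For a clinical outcome like this one, recall on the death class is often worth reporting alongside accuracy, since missed deaths and false alarms are not equally costly.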

### 2. Decision Tree Algorithm

In [None]:
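# Fix the seed so the fitted tree, and therefore the reported accuracy, is reproducible
dt = DecisionTreeClassifier(random_state=0)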
dt.fit(X_train,y_train)

In [None]:
y_pred_dt = dt.predict(X_test)

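# Note: cross_val_score clones the estimator and refits it on folds of the data it is given,
# so this score comes from trees trained on parts of the test split, not from the dt fitted above.
dac = cross_val_score(estimator=dt,X=X_test,y = y_test,cv = 10)
print("Accuracy of Decision Tree (10-fold CV on the test split): ", str(round(dac.mean()*100,2))+"%")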

In [None]:
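# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
cf_dt = confusion_matrix(y_test,y_pred_dt)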
sns.heatmap(cf_dt,annot=True)
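When two models are compared, it helps to score both on identical folds and to look at the spread of the fold scores as well as the mean. The following is a sketch on synthetic data, assuming a shared `StratifiedKFold` splitter; the dictionary `candidates` and the data are illustrative and not taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected features (assumption: not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# One splitter with a fixed seed, so both models see exactly the same folds.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

candidates = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {
    name: cross_val_score(est, X_demo, y_demo, cv=folds)
    for name, est in candidates.items()
}

for name, scores in results.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the difference between the two means is small compared with the standard deviations, the ranking of the models should be treated as tentative.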

## <h1 id="con" style="background-color:#85B3F1;" align="center">Conclusions</h1>
<br>
<p style="font-size:20px">Based on the accuracy scores reported above, the <b>KNN algorithm</b> achieves better accuracy than the <b>Decision Tree algorithm</b> for this prediction task.<br></p>