# About this dataset
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
CVDs cover a number of conditions, e.g. **heart disease**, **heart attack**, **stroke**, and ***coronary artery disease (CAD)***. In 2015, CAD affected 110 million people and resulted in 8.9 million deaths [1](https://doi.org/10.1016/S0140-6736(16)31678-6).
![Heart Disease](https://cdn.mdedge.com/files/s3fs-public/Image/November-2017/fed03405026s_1.png)
Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management, and this is where a machine learning model can be of great help.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current sessions

# **Import Libraries**

In [None]:
from colorama import Fore, Back, Style 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
plt.style.use("seaborn-v0_8")  # the plain "seaborn" style name was deprecated in matplotlib 3.6
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from mlxtend.plotting import plot_confusion_matrix

# Explore the DATA

## Read in DATA

In [None]:
heart_rate = pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
heart_rate.head()

In [None]:
print(heart_rate.columns)
print("="*80)
print("|Check missing value|")
print("=====================")
print(heart_rate.isna().sum())
print("="*80)
for col in heart_rate.columns:
    print(col, '\t',heart_rate[col].nunique())

## Feature Explanations
All the features in the DataFrame are numerical.

**Explanation of some features**
* **sex** (boolean): 1 for male, 0 for female
* **anaemia** (boolean): a condition in which there is a deficiency of red cells or of haemoglobin in the blood, resulting in tiredness, weakness, shortness of breath, and a poor ability to exercise.
* **creatinine_phosphokinase** (mcg/L): level of creatine phosphokinase (CPK, also called CK) in the blood. CPK is an enzyme that catalyzes the reaction of creatine and adenosine triphosphate (ATP). The **phosphocreatine** created by this reaction is used to supply energy to tissues and cells, e.g. the brain, skeletal muscles, and the heart.
* **diabetes** (boolean): a metabolic disease that causes high blood sugar. Symptoms include increased hunger, increased thirst, weight loss, frequent urination, blurry vision, extreme fatigue, and sores that don't heal.
* **ejection_fraction** (%): percentage of blood leaving the heart at each contraction
* **high_blood_pressure** (boolean): a common condition in which the long-term force of the blood against the artery walls is too high
* **platelets** (kiloplatelets/mL): small, colorless cell fragments in the blood that form clots and stop or prevent bleeding.
* **serum_creatinine** (mg/dL): level of serum creatinine in the blood
* **serum_sodium** (mEq/L): level of serum sodium in the blood. The reference range for serum sodium is 135-147 mmol/L (for sodium, 1 mEq/L equals 1 mmol/L).
* **time** (days): follow-up period
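
As a rough illustration of the feature types described above, the sketch below separates 0/1 flag columns from continuous ones and marks sodium readings under the lower reference bound of 135. The `sample` frame is made-up data that only borrows a few of the dataset's column names; it is not taken from the real file.

```python
import pandas as pd

# Made-up rows that reuse a few column names from the dataset (assumption).
sample = pd.DataFrame({
    "age": [75, 55, 65, 50],
    "sex": [1, 0, 1, 1],
    "anaemia": [0, 0, 1, 1],
    "serum_sodium": [130, 136, 129, 137],
})

# A column whose values are all 0 or 1 is treated as a boolean flag.
binary_cols = [c for c in sample.columns if sample[c].isin([0, 1]).all()]
continuous_cols = [c for c in sample.columns if c not in binary_cols]

# Mark readings below the lower bound of the serum sodium reference range.
low_sodium = sample["serum_sodium"] < 135

print("binary:", binary_cols)
print("continuous:", continuous_cols)
print(low_sodium.tolist())
```

The same two list comprehensions should work on the full DataFrame, assuming the boolean columns are encoded strictly as 0 and 1.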

## Are "sex" and "age" indicators of "DEATH_EVENT"?

In [None]:
# age distribution

hist_rate =[heart_rate["age"].values]
group_labels = ['age'] 

fig = ff.create_distplot(hist_rate, group_labels)
fig.update_layout(title_text='Age Distribution plot')

fig.show()

In [None]:
m = heart_rate.DEATH_EVENT[heart_rate.sex==1].mean()
f = heart_rate.DEATH_EVENT[heart_rate.sex==0].mean()

plt.bar(["Male", "Female"], [m, f])
plt.title("Mortality Rate per gender")

Males and females have approximately equal mortality rates (≈33%).

In [None]:
nbin = np.arange(40, 100, 3)
def plot_histogram(df, columns, by, bins, i):
    fig = plt.figure(i)
    # sns.distplot is deprecated; sns.histplot draws the same count histogram
    sns.histplot(df.loc[df[by] == 1, columns], color='r', bins=bins, label="{} = 1".format(by))
    sns.histplot(df.loc[df[by] == 0, columns], color='g', bins=bins, label="{} = 0".format(by))
    plt.xlabel(columns)
    plt.legend()
    plt.title("{} Distribution by {}".format(columns,by))
    plt.show()

### Distribution of age by "sex" and "DEATH_EVENT"

In [None]:
plot_histogram(heart_rate, "age", 'sex', nbin,0)

In [None]:
plot_histogram(heart_rate, 'age', 'DEATH_EVENT', nbin,1)

The two plots show similar distributions. Hence the relation between age and sex does not tell us much about "DEATH_EVENT".
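
One way to look at age and "DEATH_EVENT" more directly than with overlaid histograms is to compute the death rate per age band. The snippet below is a minimal sketch on invented values; the decade-wide bands and the `sample` frame are assumptions chosen for illustration only.

```python
import pandas as pd

# Invented ages and outcomes, not rows from the real dataset.
sample = pd.DataFrame({
    "age": [42, 48, 55, 58, 63, 67, 72, 85],
    "DEATH_EVENT": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Bin ages into decades: [40, 50), [50, 60), ..., [90, 100).
age_band = pd.cut(sample["age"], bins=list(range(40, 101, 10)), right=False)

# The mean of a 0/1 column is the fraction of deaths in each band.
# observed=True drops bands that contain no patients.
death_rate_by_age = sample.groupby(age_band, observed=True)["DEATH_EVENT"].mean()
print(death_rate_by_age)
```

With only a few patients per band the rates are noisy, so wider bands may be preferable on a dataset of this size.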

### **Let's see how the boolean features relate to DEATH_EVENT**

In [None]:
fig = sns.lmplot(y='creatinine_phosphokinase', x='age', data=heart_rate, hue='DEATH_EVENT',x_bins=10)

People who have "high_blood_pressure" or "anaemia" have a higher death rate than those who don't.
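
A claim like this can be checked by grouping on each boolean flag and averaging "DEATH_EVENT". The following is a sketch on a small made-up frame (the values are invented so the arithmetic is easy to follow), not a result from the real data.

```python
import pandas as pd

# Invented flags and outcomes; only the column names come from the dataset.
sample = pd.DataFrame({
    "anaemia":             [0, 0, 1, 1, 0, 1, 0, 1],
    "high_blood_pressure": [0, 1, 0, 1, 0, 1, 1, 0],
    "DEATH_EVENT":         [0, 0, 1, 1, 0, 1, 1, 0],
})

# For each flag, the mean of DEATH_EVENT within the 0 group and the 1 group
# is the death rate of patients without and with the condition.
rates = {}
for col in ["anaemia", "high_blood_pressure"]:
    rates[col] = sample.groupby(col)["DEATH_EVENT"].mean()

# Rows are the flag value (0 or 1), columns are the flags.
death_rates = pd.DataFrame(rates)
print(death_rates)
```

On the real DataFrame the same loop could be run over every 0/1 column to compare the groups side by side.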

## Overall relations of the features

In [None]:
plt.subplots(figsize=(20,15))
sns.heatmap(heart_rate.corr(), annot=True, fmt='.2f')

As the heatmap shows, **DEATH_EVENT** is most strongly correlated with:
* **age**
* **ejection_fraction**
* **serum_creatinine**
* **serum_sodium**
* **time**
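
Reading the strongest correlations off a large heatmap is error-prone, so it can help to rank them numerically. The sketch below does this on a tiny invented frame; the values are assumptions and only the column names mirror the dataset.

```python
import pandas as pd

# Invented values: "time" is built to fall as DEATH_EVENT goes from 0 to 1.
sample = pd.DataFrame({
    "time":        [250, 210, 180, 90, 40, 10],
    "age":         [50, 70, 55, 60, 65, 75],
    "DEATH_EVENT": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target, target itself removed.
corr_with_target = sample.corr()["DEATH_EVENT"].drop("DEATH_EVENT")

# Order by absolute value so strong negative correlations rank high too,
# while keeping the sign in the reported numbers.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting by absolute value matters here because a strongly negative coefficient is as informative as a strongly positive one.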

In [None]:
def my_regplot(ax, x, y, data, hue):
    sns.regplot(x=x, y=y, data=data[data[hue]==1][[x,y]], ax=ax, color='orange', label="death")
    sns.regplot(x=x, y=y, data=data[data[hue]==0][[x,y]], ax=ax, color='green', label='survive')
    ax.set_xlim(35,100)
    ax.set_title("Analyze the {} feature".format(y), fontsize=16)
    ax.legend()

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(3, figsize=(10,15))
# sns.lineplot(x='age', y='ejection_fraction',data=heart_rate, color='r', ax=ax1, hue="DEATH_EVENT")
# sns.lineplot(x='age', y='serum_creatinine',data=heart_rate, color='r', ax=ax2, hue="DEATH_EVENT")
# sns.lineplot(x='age', y='',data=heart_rate, color='r', ax=ax3, hue="DEATH_EVENT")
my_regplot(ax1, 'age', 'ejection_fraction', heart_rate, "DEATH_EVENT")
my_regplot(ax2, 'age', 'serum_creatinine', heart_rate, "DEATH_EVENT")
my_regplot(ax3, 'age', 'serum_sodium', heart_rate, "DEATH_EVENT")

# Quick KNN-Model for Classification

## Splitting Data

Since DEATH_EVENT is most strongly related to a handful of features, I use only a subset of them for the train/test data set.

In [None]:
# Choosing features
feats = ["ejection_fraction","serum_creatinine","serum_sodium","time"]
inputdf = heart_rate[feats]
labels = heart_rate.DEATH_EVENT

xtrain, xtest, ytrain, ytest = train_test_split(inputdf, labels)
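
The split above uses the defaults of `train_test_split`, so it changes on every run. A possible variant, sketched here on synthetic data, fixes `random_state` for reproducibility and passes `stratify` so both parts keep the same share of deaths. The synthetic `X` and `y`, the 25% test size, and the seed value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 32% positive labels (assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "ejection_fraction": rng.integers(15, 80, size=100),
    "time": rng.integers(4, 285, size=100),
})
y = pd.Series([1] * 32 + [0] * 68, name="DEATH_EVENT")

# stratify=y keeps the class ratio equal in both parts;
# random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

Stratifying is mostly useful on small or imbalanced datasets, where a purely random split can leave the test set with too few positive cases.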

## KNN Model

In [None]:
# Build - fit data to the model
neighbor = KNeighborsClassifier(n_neighbors=7)
neighbor.fit(xtrain, ytrain)

In [None]:
# Make some prediction on the test set
print(np.array(ytest[:10]))
neighbor.predict(xtest)[:10]

The model does quite well on the first 10 examples, with only 1 wrong prediction.
Let's see how it scores on the whole test set.

In [None]:
print(Fore.GREEN + "Accuracy of KNN model: {:.2f}%".format(100*neighbor.score(xtest,ytest)))
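
KNN classifies by distance, so a feature with a large numeric range can dominate the others, and a single split on a small dataset gives a noisy accuracy estimate. The sketch below, on synthetic data, compares a plain KNN with one wrapped in a `StandardScaler` pipeline using 5-fold cross-validation. The data generator, the feature scales, and the fold count are all assumptions; this is not the notebook's model.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature on a small scale and
# one pure-noise feature on a much larger scale (assumption).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
informative = y + rng.normal(0, 0.3, size=n)
noise = rng.normal(0, 1000, size=n)
X = np.column_stack([informative, noise])

raw_knn = KNeighborsClassifier(n_neighbors=7)
# The pipeline refits the scaler inside each fold, so no test data leaks in.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))

raw_acc = cross_val_score(raw_knn, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X, y, cv=5).mean()
print("raw: {:.2f}, scaled: {:.2f}".format(raw_acc, scaled_acc))
```

Whether scaling helps on the real features would need to be checked on the actual data; the synthetic example only shows the mechanism.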

## Confusion Matrix

In [None]:
cm = confusion_matrix(ytest, neighbor.predict(xtest))
# plot_confusion_matrix creates its own figure, so no separate plt.figure() call is needed
fig, ax = plot_confusion_matrix(conf_mat=cm, figsize=(12,8), hide_ticks=True, cmap=plt.cm.Blues)
plt.title("K Neighbors Model - Confusion Matrix")
plt.xticks(range(2), ["Heart Not Failed","Heart Fail"], fontsize=16)
plt.yticks(range(2), ["Heart Not Failed","Heart Fail"], fontsize=16)
plt.show()
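
Accuracy alone can hide how many deaths the model misses, and the confusion matrix already contains what is needed for precision and recall. The snippet below is a sketch on a hypothetical 2x2 matrix; the counts are invented and follow scikit-learn's layout (rows are true labels, columns are predicted labels, class order 0 then 1).

```python
import numpy as np

# Hypothetical counts, not the output of the model above.
cm = np.array([[45, 5],
               [10, 15]])

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are right
precision = tp / (tp + fp)        # share of predicted deaths that were deaths
recall = tp / (tp + fn)           # share of actual deaths that were caught

print("accuracy={:.2f} precision={:.2f} recall={:.2f}".format(accuracy, precision, recall))
```

In a medical setting recall on the positive class is often the number to watch, since a missed death event is usually costlier than a false alarm.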