# Background Data

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality from heart failure.

Most cardiovascular diseases can be prevented by using population-wide strategies to address behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful use of alcohol.

People with cardiovascular disease or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease) need early detection and management, and this is where a machine learning model can be of great help.

# Import the libraries

In [70]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objs as go

# Data exploration

### Specify the size of plot

In [20]:
HEIGHT = 500
WIDTH = 700
NBINS = 50
SCATTER_SIZE = 700

### Return the head of the data

In [21]:
heart_data = pd.read_csv("heart_failure_clinical_records_dataset.csv")
heart_data.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


### Return the summary of the data

In [22]:
heart_data.describe()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.833893,0.431438,581.839465,0.41806,38.083612,0.351171,263358.029264,1.39388,136.625418,0.648829,0.32107,130.26087,0.32107
std,11.894809,0.496107,970.287881,0.494067,11.834841,0.478136,97804.236869,1.03451,4.412477,0.478136,0.46767,77.614208,0.46767
min,40.0,0.0,23.0,0.0,14.0,0.0,25100.0,0.5,113.0,0.0,0.0,4.0,0.0
25%,51.0,0.0,116.5,0.0,30.0,0.0,212500.0,0.9,134.0,0.0,0.0,73.0,0.0
50%,60.0,0.0,250.0,0.0,38.0,0.0,262000.0,1.1,137.0,1.0,0.0,115.0,0.0
75%,70.0,1.0,582.0,1.0,45.0,1.0,303500.0,1.4,140.0,1.0,1.0,203.0,1.0
max,95.0,1.0,7861.0,1.0,80.0,1.0,850000.0,9.4,148.0,1.0,1.0,285.0,1.0


### Return the size of the data

In [23]:
print(heart_data.shape)

(299, 13)


Let's check the ratio of NaNs in every column.

In [24]:
for col in heart_data.columns:
    print(col, str(round(100* heart_data[col].isnull().sum() / len(heart_data), 2)) + '%')

age 0.0%
anaemia 0.0%
creatinine_phosphokinase 0.0%
diabetes 0.0%
ejection_fraction 0.0%
high_blood_pressure 0.0%
platelets 0.0%
serum_creatinine 0.0%
serum_sodium 0.0%
sex 0.0%
smoking 0.0%
time 0.0%
DEATH_EVENT 0.0%
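The per-column loop above can also be written as a single vectorized expression. The following is a minimal sketch, assuming a small made-up frame (`toy`, which is not the heart failure data) so that at least one column has a non-zero ratio:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration only; not the heart failure dataset.
toy = pd.DataFrame({
    "age": [75.0, np.nan, 65.0, 50.0],
    "smoking": [0, 1, 1, 0],
})

# isnull() yields a boolean frame; the column-wise mean of booleans is the
# fraction of missing values, so multiplying by 100 gives a percentage.
nan_ratio = (toy.isnull().mean() * 100).round(2)

for col, pct in nan_ratio.items():
    print(col, f"{pct}%")
```

Applied to `heart_data`, the same expression should reproduce the per-column percentages printed by the loop.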


As the outputs above show, the data has 13 columns and 299 samples.
None of the columns contains NaNs.
Before processing the data, we need to note how the following columns are encoded:

- Sex - Gender of patient: Male = 1, Female = 0
- Age - Age of patient
- Diabetes - 0 = No, 1 = Yes
- Anaemia - 0 = No, 1 = Yes
- High_blood_pressure - 0 = No, 1 = Yes
- Smoking - 0 = No, 1 = Yes
- DEATH_EVENT - 0 = No, 1 = Yes
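The encoding rules above can be turned into lookup tables so that plots and tables show readable labels instead of 0 and 1. This is a minimal sketch on a few made-up rows; the names `SEX_LABELS`, `YES_NO_LABELS`, and `labeled` are hypothetical and do not appear in the notebook:

```python
import pandas as pd

# Hypothetical rows for illustration; column names follow the dataset.
toy = pd.DataFrame({"sex": [1, 0, 1], "DEATH_EVENT": [1, 0, 0]})

# Lookup tables that mirror the encoding rules listed above.
SEX_LABELS = {1: "Male", 0: "Female"}
YES_NO_LABELS = {0: "No", 1: "Yes"}

# Build a labeled copy so the numeric columns stay available for modeling.
labeled = toy.assign(
    sex=toy["sex"].map(SEX_LABELS),
    DEATH_EVENT=toy["DEATH_EVENT"].map(YES_NO_LABELS),
)
print(labeled)
```

Keeping the original numeric frame untouched is a design choice here: most models expect the 0/1 columns, while the labeled copy is only for display.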

#### Patients age distribution with gender

In [25]:
def plot_histogram(dataframe, column, color, bins, marginal, title, width=WIDTH, height=HEIGHT):
    # Draw a histogram of `column`, colored by `color`, with a marginal subplot
    figure = px.histogram(
        dataframe,
        column,
        color=color,
        nbins=bins,
        marginal=marginal,
        title=title,
        width=width,
        height=height
    )
    figure.show()

In [29]:
plot_histogram(heart_data, 'age', 'sex', NBINS, "violin",'Figure 1: Patients age distribution with gender')

Wider sections of the violin plot represent a higher probability of observations
taking a given value, and thinner sections correspond to a lower probability.
The width at a given x is set by the kernel density estimate (KDE) at that value.
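To make the KDE idea concrete, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. The helper `gaussian_kde_at`, the sample ages, and the bandwidth are all assumptions chosen for illustration; they are not the values or the exact method Plotly uses internally:

```python
import math

def gaussian_kde_at(x, samples, bandwidth):
    """Gaussian kernel density estimate at x for a list of samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )

# Hypothetical ages and bandwidth, chosen only for illustration.
ages = [50, 55, 60, 60, 62, 65, 70, 90]
bandwidth = 5.0

# The density is higher near the cluster around 60 than in the sparse region
# near 80, so a violin drawn from these values would be wider at 60.
density_60 = gaussian_kde_at(60, ages, bandwidth)
density_80 = gaussian_kde_at(80, ages, bandwidth)
print(density_60 > density_80)
```

Each sample contributes a small bell curve centered on its value, and the estimate at x is the average of those curves, which is why dense clusters of ages produce the wide parts of the violin.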

Figure 1:
- Most patients were between 40 and 80 years old
- No patients were younger than 40 (the minimum age in the summary above is 40), and only a small number were older than 80

#### Patients age distribution with death event

In [30]:
plot_histogram(heart_data, 'age', 'DEATH_EVENT', NBINS, "violin",'Figure 2: Patients age distribution with death event')

Figure 2:
- The age distribution was similar to that in Figure 1
- Most people survived, but older patients (>80) seem to have a higher mortality rate
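The impression that older patients have a higher mortality rate can be checked numerically by grouping ages into bands. The sketch below uses made-up rows, and the bin edges are an assumption rather than something taken from the notebook, so its numbers say nothing about the real data:

```python
import pandas as pd

# Hypothetical rows for illustration only; not the heart failure dataset.
toy = pd.DataFrame({
    "age": [45, 52, 58, 63, 67, 72, 78, 82, 86, 90],
    "DEATH_EVENT": [0, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})

# Bin ages into bands (39, 60], (60, 80], (80, 100]. The lower edge is 39
# so that an age of exactly 40 would still fall into the first band.
bands = pd.cut(toy["age"], bins=[39, 60, 80, 100])

# DEATH_EVENT is 0/1, so its mean within a band is the band's mortality rate;
# the count shows how many patients support each rate.
mortality = toy.groupby(bands, observed=True)["DEATH_EVENT"].agg(["mean", "count"])
print(mortality)
```

Reporting the count next to the rate matters for the oldest band: with only a handful of patients over 80, a high rate is suggestive rather than conclusive.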

#### Box plot of age by gender

In [32]:
def plot_boxplot(dataframe, x, y, points, title, width=WIDTH, height=HEIGHT):
    # Draw a box plot of `y` grouped by `x`; `points` controls which individual points are shown
    figure = px.box(
        dataframe,
        x=x,
        y=y,
        points=points,
        title=title,
        width=width,
        height=height
    )
    figure.show()

In [53]:
# plot_boxplot(heart_data, 'sex', 'age', None,'Figure 3: Box plot of patients\' age distribution with death event')
plot_boxplot(heart_data, 'sex', 'age', "all",'Figure 3: Box plot of patients\' age distribution with gender <br> '
                                             '     -Male = 1 Female = 0')

### Analysis on survival rate (sex factor)

In [75]:
male = heart_data[heart_data["sex"] == 1]
female = heart_data[heart_data["sex"] == 0]
# Filter each subset with its own DEATH_EVENT column so the boolean mask
# shares the subset's index. Masking a subset with heart_data[...] instead
# triggers the "Boolean Series key will be reindexed" warning.
male_survival = male[male["DEATH_EVENT"] == 0]
male_death = male[male["DEATH_EVENT"] == 1]
female_survival = female[female["DEATH_EVENT"] == 0]
female_death = female[female["DEATH_EVENT"] == 1]
# Assign the labels
labels = ['Male - Survived', 'Male - Not Survived', 'Female - Survived', 'Female - Not Survived']
# Values are listed in the same order as the labels
values = [len(male_survival), len(male_death), len(female_survival), len(female_death)]
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.3)])
fig.update_layout(
    title_text="Figure 4: Analysis on Survival - Gender factor"
)
fig.show()
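The four counts behind the pie chart can also be cross-checked as a table. This is a minimal sketch using `pd.crosstab` on made-up rows (not the heart failure data), so the numbers are illustrative only:

```python
import pandas as pd

# Hypothetical rows for illustration only; not the heart failure dataset.
toy = pd.DataFrame({
    "sex": [1, 1, 1, 0, 0, 1],
    "DEATH_EVENT": [0, 1, 0, 0, 1, 0],
})

# Count every (sex, DEATH_EVENT) combination in one call.
counts = pd.crosstab(toy["sex"], toy["DEATH_EVENT"])

# normalize="index" divides each row by its total, giving the survival and
# death shares within each sex rather than shares of the whole cohort.
rates = pd.crosstab(toy["sex"], toy["DEATH_EVENT"], normalize="index")

print(counts)
print(rates)
```

The pie chart shows each group as a share of all patients, whereas the row-normalized table compares survival within each sex, which is usually the fairer comparison when one sex is overrepresented.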



### Based 

# Data modeling

##
