# Introduction
<img src='https://images.unsplash.com/photo-1517607648415-b431854daa86?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&dl=debby-hudson-jcc8sxK2Adw-unsplash.jpg' style="height:300px"/>


Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People who have cardiovascular disease, or who are at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease), need early detection and management. This is where a machine learning model can be of great help.

### Please upvote the kernel if you like it. Keep me motivated.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
Path.ls = lambda x: list(x.iterdir())
import seaborn as sns
import matplotlib.pyplot as plt

# Reading Input Data

In [None]:
path = Path("/kaggle/input/heart-failure-clinical-data/")
path.ls()

In [None]:
df = pd.read_csv(path/'heart_failure_clinical_records_dataset.csv')
df.head()

The dataset consists of various markers for each patient. Namely:

* `age`: Age of the patient
* `anaemia`: If the patient is anaemic (boolean)
* `creatinine_phosphokinase`: Level of the CPK enzyme in the blood (mcg/L)
* `diabetes`: If the patient has diabetes (boolean)
* `ejection_fraction`: Percentage of blood leaving the heart at each contraction (percentage)
* `high_blood_pressure`: If the patient has hypertension (boolean)
* `platelets`: Platelets in the blood (kiloplatelets/mL)
* `serum_creatinine`: Level of serum creatinine in the blood (mg/dL)
* `serum_sodium`: Level of serum sodium in the blood (mEq/L)
* `sex`: Woman or man (binary)
* `smoking`: If the patient smokes or not (boolean)
* `time`: Follow-up period (days)
* `DEATH_EVENT`: If the patient died during the follow-up period (boolean)


# Data Cleaning

In [None]:
df.head().T

In [None]:
print('Information about the data columns along with their null counts')
df.info()

Just by taking a look at the head of the data and the column descriptions, we can understand that:

* We have a few binary categorical data: `anaemia`, `diabetes`, `high_blood_pressure`, `sex`, `smoking`
* A few continuous columns: `creatinine_phosphokinase`, `platelets`, `serum_creatinine`, `serum_sodium`, `ejection_fraction`
* Not sure : `age`, `time`
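
The split above was made by eye. As a rough cross-check, we could count the distinct values in each column. This is a minimal sketch on a made-up toy frame; `toy`, `split_columns` and `max_binary_levels` are hypothetical names, not part of the dataset or of pandas.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "anaemia": [0, 1, 0, 1, 1, 0],
    "smoking": [0, 0, 1, 0, 1, 0],
    "age": [60.0, 65.0, 50.0, 75.0, 60.0, 42.0],
    "serum_sodium": [137, 130, 140, 136, 134, 139],
})

def split_columns(frame, max_binary_levels=2):
    """Split columns into binary and non-binary by their number of distinct values."""
    n_unique = frame.nunique()
    binary = n_unique[n_unique <= max_binary_levels].index.tolist()
    other = n_unique[n_unique > max_binary_levels].index.tolist()
    return binary, other

binary_cols, other_cols = split_columns(toy)
print("binary:", binary_cols)
print("other:", other_cols)
```

On the real data, the same call would be `split_columns(df)`. Columns that land in `other` still need a manual look, which is what we do for `age` and `time` next.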

### Let's analyse the ones we are not sure about

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x=df.age, palette='winter')
plt.title('Count Plot of Age',fontsize=14)
plt.xticks(rotation=90)
plt.show()

#### Observations:
* We can see that there are spikes at multiples of five.
* It looks like the recorded age is approximate and has mostly been rounded to the nearest multiple of 5.
* I think we can treat `age` as a continuous variable, given that it takes quite a few distinct values.
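
To put a number on the "spikes at multiples of five" impression, one option is to compute the share of ages that are exact multiples of 5. The sketch below uses made-up ages; `share_multiple_of` is a hypothetical helper, and on the real data you would pass `df.age` instead of `ages`.

```python
import pandas as pd

# Made-up ages for illustration only; replace with df.age on the real data.
ages = pd.Series([60, 65, 50, 72, 45, 58, 70, 60.667, 55, 80])

def share_multiple_of(values, k=5):
    """Fraction of values that are exact multiples of k."""
    return float((values % k == 0).mean())

share = share_multiple_of(ages)
print(f"{share:.0%} of ages are multiples of {5}")
```

If ages were not rounded, we would expect roughly one in five integer ages to be a multiple of 5, so a much larger share supports the rounding hypothesis.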

In [None]:
print('Checking the unique values in time:: ')
print(len(df.time.unique()))

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x=df.time, palette='winter')
plt.title('Count Plot of Time',fontsize=14)
plt.xticks(rotation=90)
plt.show()

#### Observation:
* There is no point in considering time as a categorical variable here. The reason I had a look at `time` was to check if there were any binnings done during data collection.
* Time clearly looks like a continuous variable.

Now that we have our variables sorted, we can start looking at the patterns in the data.

# Exploratory Data Analysis

Let's start by analysing each column, looking at the data distributions and interesting outliers.

We will be particularly looking at the continuous columns: 
* `creatinine_phosphokinase`
* `platelets`
* `serum_creatinine`
* `serum_sodium`
* `ejection_fraction`

In [None]:
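def plot_hist(col, bins=40, title="", xlabel="", ax=None):
    # Fall back to the current axes if none is passed in
    if ax is None:
        ax = plt.gca()
    # Histogram with a KDE overlay, normalised to a density
    sns.histplot(col, bins=bins, kde=True, stat="density", ax=ax)
    ax.set_title(f'Histogram of {title}', fontsize=20)
    ax.set_xlabel(xlabel)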

In [None]:
fig, axes = plt.subplots(3,2,figsize=(20,20),constrained_layout=True)
plot_hist(df.creatinine_phosphokinase,
          title='Creatinine Phosphokinase',
          xlabel="Level of the CPK (mcg/L)",
          ax=axes[0,0])
plot_hist(df.platelets,
          bins=30,
          title='Platelets',
          xlabel='Platelets in the blood (kiloplatelets/mL)',
          ax=axes[0,1])
plot_hist(df.serum_creatinine,
          title='Serum Creatinine', 
          xlabel='Level of serum creatinine in the blood (mg/dL)',
          ax=axes[1,0])
plot_hist(df.serum_sodium,
          bins=30,
          title='Serum Sodium',
          xlabel='Level of serum sodium in the blood (mEq/L)',
          ax=axes[1,1])
plot_hist(df.ejection_fraction,
          title='Ejection Fraction', 
          xlabel='Percentage of blood leaving the heart at each contraction (percentage)',
          ax=axes[2,0])
plot_hist(df.time,
          bins=30,
          title='Time',
          xlabel='Follow-up period (days)',
          ax=axes[2,1])
plt.show()

#### Observations: creatinine_phosphokinase
* We can see from the plots that **creatinine_phosphokinase** is close to zero, relative to the scale of the plot, for a lot of patients.
* A simple Google search suggests that the normal range is 10-120 mcg/L.
* I think our plot suggests that most of the patients have normal levels of **creatinine_phosphokinase**.
* There are a few outliers with very high values near 8000.

#### Observation: Platelets
* The platelets distribution looks roughly normal, with a mean of around 250,000 and a standard deviation of around 100,000.
* We can see a few outliers and an extreme outlier near 800,000.

#### Observation: Serum Creatinine
* This is a heavily right-skewed distribution: most values sit at the low end, with a long tail towards high values.
* We can see a few outliers again.
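
Both the skew and the outliers can be checked numerically instead of read off the histogram. The sketch below uses made-up values and the common 1.5 × IQR rule, which is one convention among several; `iqr_outliers` is a hypothetical helper. A positive value from `Series.skew()` indicates a long right tail.

```python
import pandas as pd

# Made-up serum creatinine values (mg/dL) with a long tail of high values.
values = pd.Series([0.8, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.3, 1.4, 1.8, 4.0, 9.0])

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

skewness = values.skew()          # > 0 means right-skewed
outliers = iqr_outliers(values)
print(f"skewness: {skewness:.2f}")
print("outliers:", outliers.tolist())
```

On the real data, the same checks would be `df.serum_creatinine.skew()` and `iqr_outliers(df.serum_creatinine)`.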

#### Observations: Serum Sodium
* The normal level of serum sodium is between 135 and 145 mEq/L.
* We can see the mass of our data is at the normal levels.


#### Observations: Ejection Fraction
* The normal range for ejection fraction is 50-70%.
* We can see that most of our patients have values less than 50%.
* We can also see some very low values. Might be the same patients that we had as outliers. 


#### Observations: Time
* The Histogram looks random.
* There is a possibility that there are two gaussians at play here. But not sure about that.

Let's check the categorical variables now.

In [None]:
def plot_categorical_var(x='DEATH_EVENT', col=None, title="",label="",ax=None):
    sns.countplot(data=df, x=col, hue=x,palette='winter',ax=ax)
    ax.set_title(title,fontsize=16)
    ax.set_xlabel(label)

In [None]:
fig, axes = plt.subplots(2,3,figsize=(20,10),constrained_layout=True)
plot_categorical_var(col='diabetes',
                     title='Death vs diabetes',
                     label='Diabetes',
                     ax=axes[0,0])
plot_categorical_var(col='high_blood_pressure',
                     title='Death vs high blood pressure',
                     label='High blood pressure',
                     ax=axes[0,1])
plot_categorical_var(col='sex',
                     title='Sex vs Death',
                     label='Sex',
                     ax=axes[0,2])
plot_categorical_var(col='smoking',
                     title='Smoking vs Death', 
                     label='Smoking Status',
                     ax=axes[1,0])
plot_categorical_var(col='anaemia',
                     title='Anaemia vs Death',
                     label='is anaemic?',
                     ax=axes[1,1])
plt.show()

#### Observation: Diabetes
* We can see that diabetes has no significant contribution to death here.

#### Observations: high blood pressure
* Patients with high blood pressure seem to have a high death count relative to the size of their group.

#### Observations: Sex
* The dataset has a lot more patients of sex corresponding to 1. 
* I think it is safe to assume 1 as Male and 0 as Female.
* The plot shows that there is no significant difference in the proportions of death count in both the sexes.

#### Observations: smoking
* We can see that most of our patients are non smokers.
* There seems to be no significant relation between smoking status and death count.

#### Observation: Anaemia
* Anaemic patients seem to have a relatively higher death count.
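
Count plots make it hard to compare groups of different sizes, so it can help to look at the death rate within each group. This is a minimal sketch on made-up rows; `death_rate_by` is a hypothetical helper, and the numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "anaemia":     [0, 0, 0, 0, 1, 1, 1, 1],
    "DEATH_EVENT": [0, 0, 0, 1, 0, 1, 1, 0],
})

def death_rate_by(frame, col, target="DEATH_EVENT"):
    """Mean of the 0/1 target within each level of `col`, i.e. the death rate."""
    return frame.groupby(col)[target].mean()

rates = death_rate_by(toy, "anaemia")
print(rates)
```

On the real data, `death_rate_by(df, 'anaemia')` (or any other binary column) gives proportions that can be compared directly across groups.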

## Multivariate Relations

We have had a look at all the individual columns. Let's have a look at how these variables relate to each other.

In [None]:
totitle= lambda x: " ".join(x.split('_'))

In [None]:
def plot_multivar(df,x,y,hue_list=None):
    if hue_list is None:
        hue_list = ['DEATH_EVENT','sex']
    fig, axes = plt.subplots(1,len(hue_list), figsize=(6*len(hue_list),5),constrained_layout=True)
    fig.suptitle(f'{totitle(x)} vs {totitle(y)}'.title(),fontsize=18)
    if not isinstance(axes, np.ndarray):
        axes = np.array(axes)
    for i,(ax,hue) in enumerate(zip(axes.flatten(),hue_list)):
        sns.scatterplot(data=df, x=x,y=y,hue=hue,alpha=0.8,ax=ax,palette='rocket')
        ax.set_title(f'{totitle(hue)}'.title(),fontsize=18)
    plt.show()

In [None]:
plot_multivar(df,'serum_creatinine','creatinine_phosphokinase')

#### Observations:
* We can see a clear relation between death counts and serum creatinine levels.
* We can also see that males have very high levels of both CPK and serum creatinine.

In [None]:
plot_multivar(df,'platelets','serum_creatinine')

#### Observations: 
* We can see high death counts for lower platelets and high serum creatinine

In [None]:
plot_multivar(df,'platelets','creatinine_phosphokinase')

#### Observation: 
* No relationship is found here

In [None]:
plot_multivar(df,'serum_creatinine','serum_sodium')

In [None]:
plot_multivar(df,'ejection_fraction','serum_creatinine')

#### Observations:
* We can see high death counts for lower ejection fraction

In [None]:
plot_multivar(df,'time','serum_creatinine')

#### Observations:
* Death counts are high at the lower end of the follow-up period.

# PCA

Now that we have a good understanding of the data distributions and relationships, let's apply PCA to the continuous variables to get better visualizations.

In [None]:
cont_cols = ['creatinine_phosphokinase', 'platelets','serum_creatinine', 'serum_sodium', 'ejection_fraction','age','time']
cat_vars = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking','DEATH_EVENT']

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(df[cont_cols])
scaled_df = df.copy()
scaled_df[cont_cols] = scaled_features
scaled_df.head()

We use PCA to reduce the dimensionality of the dataset. PCA accomplishes this by capturing the variance in the dataset: it finds components that point in the directions of highest variance.

![](https://upload.wikimedia.org/wikipedia/commons/f/f5/GaussianScatterPCA.svg?download)

We can see two components in the above image. It combines the existing features to calculate vectors that capture the variance. The longer vector captures the maximum variance here

When using PCA, we have to choose how much explained variance we are looking for. Anything between 70% and 90% is usually a reasonable choice. Furthermore, the job we are trying to accomplish dictates how much explained variance we should settle for. Below, the first four components give about 81% explained variance. Let's see what the components have captured.
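
To make the "how much explained variance" choice concrete, we can look at the cumulative explained variance and pick the smallest number of components that reaches a target. The sketch below uses synthetic data and NumPy's SVD instead of scikit-learn, so that it is self-contained; `explained_variance_ratio` and `n_components_for` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 5 made-up features with very different spreads.
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.1])
X = rng.normal(size=(200, 5)) * scales

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                        # PCA works on centred data
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2               # proportional to component variances
    return variances / variances.sum()

def n_components_for(ratios, threshold=0.8):
    """Smallest number of components whose cumulative ratio reaches `threshold`."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratios))

ratios = explained_variance_ratio(X)
k = n_components_for(ratios, threshold=0.8)
print(np.round(np.cumsum(ratios), 3), "->", k, "components")
```

With a fitted scikit-learn `PCA`, the same cumulative sum is `np.cumsum(pca.explained_variance_ratio_)`. scikit-learn's `PCA` also accepts a float `n_components` between 0 and 1 to make this selection itself.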

In [None]:
from sklearn.decomposition import PCA

N_COMPONENTS = 6

pca = PCA(n_components = N_COMPONENTS)
pca.fit(scaled_df[cont_cols].values)
print(f"Explained variance (first 4 components): {pca.explained_variance_ratio_[:4].sum():.4f}")

v = pd.DataFrame(pca.components_)

We can now combine the PCA components with our categorical columns into a dataframe

In [None]:
transformed = pca.transform(scaled_df[cont_cols])
transformed_df = pd.DataFrame(transformed)
transformed_df.columns = list(map(lambda x: f'pca_{x+1}', list(transformed_df.columns)))
transformed_df[cat_vars] = df[cat_vars]
transformed_df.head()

To understand what each PCA component represents, we can use the below function to plot how much each of our original features account for in each component

In [None]:
def display_component(v, features_list, component_num):
    # Rows of `v` (pca.components_) are the components, ordered by explained
    # variance; columns are the original features.
    weights = v.iloc[component_num - 1, :].values

    comps = pd.DataFrame({'weights': weights, 'features': features_list})
    comps['abs_weights'] = comps['weights'].abs()
    # Keep the five features with the largest absolute weight
    sorted_weight_data = comps.sort_values('abs_weights', ascending=False).head()

    fig, ax = plt.subplots(figsize=(10,6))
    sns.barplot(data=sorted_weight_data,
                x="weights",
                y="features",
                palette="Blues_d",
                ax=ax)
    ax.set_title("PCA Component Makeup, Component #" + str(component_num), fontsize=20)
    plt.show()

In [None]:
def show_component_details(num_component):
    print(f"Percent explained variance: {pca.explained_variance_ratio_[num_component-1]*100:.4f}","%")
    display_component(v,cont_cols,num_component)

# Understanding PCA Features

## PCA component 1

* We can see that the first PCA component has an `explained variance` of about 38%.
* The plot below shows how much each of our features contributes to the component.
* A positive weight in the plot means the feature is positively correlated with the component.
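
The link between the sign of a weight and the sign of the correlation can be checked directly. The sketch below uses synthetic data and NumPy's SVD in place of scikit-learn (a swap made only to keep the example self-contained): it computes the first principal direction, projects the data onto it, and compares the sign of each weight with the sign of the correlation between that feature and the component scores.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data: 300 rows, 3 made-up features. The first two move in
# opposite directions; the third is independent noise.
base = rng.normal(size=300)
X = np.column_stack([
    base + 0.3 * rng.normal(size=300),
    -base + 0.3 * rng.normal(size=300),
    rng.normal(size=300),
])

Xc = X - X.mean(axis=0)                      # centre the data
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
weights = vt[0]                              # first principal direction
scores = Xc @ weights                        # component 1 value for each row

# Correlation between each original feature and the component scores
corr = np.array([np.corrcoef(Xc[:, j], scores)[0, 1] for j in range(X.shape[1])])

for j, (w, c) in enumerate(zip(weights, corr)):
    print(f"feature {j}: weight={w:+.3f}, correlation={c:+.3f}")
```

The overall sign of a principal direction is arbitrary, so only the relative signs between features carry meaning: features with opposite-signed weights pull the component in opposite directions.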

In [None]:
show_component_details(1)

#### Observation: 
* The features with the largest absolute weights in the plot make the biggest contribution to component 1.
* Similarly, we can have a look at all the other components.

In [None]:
show_component_details(2)

In [None]:
show_component_details(3)

In [None]:
show_component_details(4)

# Using PCA Components for multivariate plots

In [None]:
plot_multivar(transformed_df, 'pca_1','pca_2',hue_list=['DEATH_EVENT'])

In [None]:
plot_multivar(transformed_df,'pca_1','pca_3',hue_list=['DEATH_EVENT'])

In [None]:
plot_multivar(transformed_df,'pca_1','pca_4',hue_list=['DEATH_EVENT'])

We can see that the two classes look easier to separate in these PCA projections than in the raw feature plots.
