# Heart Failure - Risk Factors that can indicate the possibility of heart failure

From the dataset description:
> Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.
People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

The dataset is available on Kaggle at https://www.kaggle.com/andrewmvd/heart-failure-clinical-data.
The dataset contains results from a study on cardiovascular diseases that monitored 12 different metrics in patients and recorded whether they succumbed to heart disease and the time to the outcome.

We will look at which of these metrics point to heart failure and whether there is any correlation with time.

This analysis is done as part of the course [Data Analysis with Python: Zero to Pandas](https://zerotopandas.com). Check it out if you are interested.

In [None]:
# Setup Jovian
!pip install jovian --upgrade --quiet

## Import the libraries that will be needed for this analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import jovian

## Downloading the Dataset

The dataset is available on Kaggle. As this notebook is on Kaggle, we can simply import the data after adding it to the notebook as an input source.

In [None]:
#Import the dataset as a Pandas Dataframe
raw_df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

If we look at the dataset we just imported, we can see that there are 299 records, each with 13 features. The features include age, some risk factors related to cardiovascular disease, some cardiovascular-related metrics, and details about the death event. Some features are measurements and some are Boolean values.

We also check if we have a complete dataset, or if there are any missing values.

In [None]:
#View metadata about the data
print("Shape: ", raw_df.shape, "\n")
raw_df.info()  # info() prints its report directly and returns None, so no print() is needed
print("\n Any Missing Values: ", raw_df.isnull().values.any())
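
The single True/False flag above does not say where any gaps are. As a minimal sketch, the same check can be broken down per column. The frame `toy_df` below is a made-up stand-in for `raw_df`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_df; the values are made up.
toy_df = pd.DataFrame({
    "age": [60.0, 75.0, np.nan, 50.0],
    "time": [4, 30, 120, 200],
    "DEATH_EVENT": [1, 1, 0, 1],
})

# Count the missing values in each column instead of returning one flag.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Overall flag, equivalent to the check used in the notebook.
any_missing = bool(toy_df.isnull().values.any())
print("Any missing values:", any_missing)
```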

## Statistical Analysis

Next, we determine some statistics from the data.

### Death Event

We have a 'DEATH_EVENT' column. According to the dataset details, a record with a 0 in this field means the patient dropped out of the study after the number of days in the 'time' column. A 1 means the patient died during the study, after 'time' days. Patients who did not die within the study have no conclusive outcome, so they do not contribute to this analysis and we will ignore them.

In [None]:
# Keep only patients who died during the study (DEATH_EVENT == 1)
data_df = raw_df[raw_df.DEATH_EVENT == 1]

In [None]:
num_records = len(data_df)  # number of patients who died during the study

##### From the data, we notice the following:

The mean age in the dataset is just over 65 years, the youngest patient was 42 and the oldest was 95. This makes sense, as it is well known that CVDs mostly affect people in their later years.

We also see that we have 62 males and 34 females in the data. That equates to 64.6% and 35.4% respectively. This distribution seems disproportionate, but it may be that males are more susceptible to CVDs and that a CVD was a prerequisite for the study.

### Age Distribution

In [None]:
# View statistics of the age distribution
data_df.age.describe()

### Gender Distribution

In [None]:
# View statistics of the gender distribution
print(data_df.sex.value_counts())
data_df.sex.value_counts(normalize=True)  # shares of the total instead of a hardcoded /96
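
As a minimal sketch of how the counts and percentages above relate, `value_counts(normalize=True)` divides each count by the number of rows, so no hardcoded total is needed. The series `toy_sex` is made up for illustration, and the coding of 1 for male and 0 for female is assumed from the dataset description.

```python
import pandas as pd

# Made-up sex column (assumed coding: 1 = male, 0 = female); not the real data.
toy_sex = pd.Series([1, 1, 1, 0, 0, 1, 0, 1], name="sex")

counts = toy_sex.value_counts()                 # absolute counts per value
shares = toy_sex.value_counts(normalize=True)   # counts divided by len(toy_sex)

print(counts)
print((shares * 100).round(1))
```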

### Risk Factors

We can look at the proportion of people in each risk factor category as well:

* 48% were anaemic,
* 41% were diabetic,
* 35% had high blood pressure and
* 31% were smokers.

What is interesting to note here is that 13.5% of patients who died in the study did not have any of these risk factors. We will explore this a bit later.




In [None]:
data_df.anaemia.value_counts(normalize=True)

In [None]:
data_df.diabetes.value_counts(normalize=True)

In [None]:
data_df.high_blood_pressure.value_counts(normalize=True)

In [None]:
data_df.smoking.value_counts(normalize=True)

In [None]:
# Determine number of patients not having risk factors above
no_risks_df = data_df.query('anaemia==0 and diabetes==0 and high_blood_pressure==0 and smoking==0')
print(len(no_risks_df))
print(len(no_risks_df) / len(data_df))
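
The four proportions and the share of patients with none of the risk factors can also be computed in one pass. This is a sketch on a made-up frame, relying on the fact that the mean of a 0/1 column equals the share of rows holding a 1. The names `toy_df`, `risk_shares` and `no_risk_share` are hypothetical.

```python
import pandas as pd

risk_cols = ["anaemia", "diabetes", "high_blood_pressure", "smoking"]

# Made-up 0/1 flags for five hypothetical patients; not the real data.
toy_df = pd.DataFrame({
    "anaemia":             [1, 0, 0, 1, 0],
    "diabetes":            [0, 0, 1, 1, 0],
    "high_blood_pressure": [0, 0, 0, 1, 0],
    "smoking":             [1, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of patients with that risk factor.
risk_shares = toy_df[risk_cols].mean()
print(risk_shares)

# A patient has none of the listed risk factors when all four flags are 0.
no_risk_mask = (toy_df[risk_cols] == 0).all(axis=1)
no_risk_share = no_risk_mask.mean()
print("Share without any listed risk factor:", no_risk_share)
```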


We can look at the impact each of these factors has by checking the distribution of each. The plot below shows this.

The top 4 rows of the plot indicate patients that had each of the risk factors, and the bottom 4 rows indicate patients that did not have them. This is compared to time to death.

* Red: Smoking
* Green: Anaemia
* Orange: High Blood Pressure
* Blue: Diabetes

Although we can see a concentration of each risk factor towards the lower values of time, the top and bottom distributions do not differ much, indicating that whether or not a patient has a risk factor makes little difference to the time to death.

In [None]:
# Using a scatter plot, see the distribution of patients with and without each risk factor.
# Each risk factor is offset vertically from the others to ease visualization.
plt.figure(figsize=(15,8))
plt.scatter(data_df.time, data_df.diabetes, label='Diabetes')
plt.scatter(data_df.time, data_df.high_blood_pressure + 0.2, label='High Blood Pressure')
plt.scatter(data_df.time, data_df.anaemia + 0.4, label='Anaemia')
plt.scatter(data_df.time, data_df.smoking + 0.6, label='Smoking')
plt.xlabel('Time to death (days)')
plt.legend()
plt.show()
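
The visual comparison of the top and bottom rows can be backed by numbers. The sketch below, on made-up records, summarises time to death for patients with and without one risk factor using `groupby`. The frame `toy_df` and the name `median_gap` are hypothetical, and the same pattern would apply to the other three risk columns.

```python
import pandas as pd

# Made-up records; "time" is days to death, "smoking" is a 0/1 flag.
toy_df = pd.DataFrame({
    "time":    [10, 20, 30, 40, 50, 60],
    "smoking": [1, 0, 1, 0, 1, 0],
})

# Count, quartiles and median of time to death for each group.
summary = toy_df.groupby("smoking")["time"].describe()[["count", "25%", "50%", "75%"]]
print(summary)

# Difference in median time to death: smokers minus non-smokers.
median_gap = summary.loc[1, "50%"] - summary.loc[0, "50%"]
print("Median gap (days):", median_gap)
```

A gap close to zero would be in line with the reading that the risk factor makes little difference to time to death.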


### Days to Death Analysis

In [None]:
# Plot a Pareto-style diagram: weekly histogram with a cumulative-percentage line

# Create weekly histogram bins (edges at 7, 14, ..., 245).
# Note: np.histogram does not count values outside the outermost edges.
hist_count, hist_bins = np.histogram(data_df.time, bins=list(range(7, 246, 7)))

plt.figure(figsize=(15,8))
# Plot the histogram bars, left-aligned on each bin's lower edge
plt.bar(hist_bins[:-1], hist_count, width=7, align='edge')

# Cumulative share of all counted deaths up to and including each bin
lineplot = np.cumsum(hist_count) / hist_count.sum()

# Plot the cumulative line, scaled by the tallest bar so it fits the y axis
plt.plot(hist_bins[:-1], lineplot * hist_count.max(), color='red')
plt.show()

In [None]:
print("Cumulative Percentage of Total of each Bin:")
for i in range(len(lineplot)):
    print(f"Bin {i + 1:2} - {100 * lineplot[i]:.2f}%")
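
Rather than reading the 80% point off the printed list, it can be located directly. This is a minimal sketch on made-up days-to-death values, using the same 7-day bin width as the cell above. `toy_time` and `first_bin_at_80` are hypothetical names.

```python
import numpy as np

# Made-up days-to-death values; not the real data.
toy_time = np.array([8, 9, 10, 15, 16, 22, 23, 30, 31, 60])

# Weekly bin edges (7, 14, ..., 63), mirroring the notebook's 7-day bins.
edges = np.arange(7, 64, 7)
counts, _ = np.histogram(toy_time, bins=edges)

# Running share of all counted deaths up to and including each bin.
cumulative = np.cumsum(counts) / counts.sum()

# Zero-based index of the first bin where the running share reaches 80%.
first_bin_at_80 = int(np.argmax(cumulative >= 0.8))
print(f"80% reached in bin {first_bin_at_80 + 1}, "
      f"days {edges[first_bin_at_80]} to {edges[first_bin_at_80 + 1]}")
```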

It is noticeable that roughly the first half of the bins (each bin is 7 days, or one week), the first 18 bins, represents 80% of deaths. If we also look at the box plot for the values of time, we notice the majority of values are below about 110 days.

In [None]:
# Box plot the time data
data_df.time.plot.box()

### Blood Test Parameters

Looking at the 5 blood parameters also included in the data, we can see which ones are indicators of heart failure. We can do this by comparing the ranges of the values to the time to death, using scatter plots.
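
Statements such as "the majority were within the normal range" can be checked numerically as well as visually. The helper below is a hypothetical sketch, not part of the original analysis: it reports the share of values below, within and above a reference range passed in by the caller. The readings are made up, and 135-145 is the serum sodium range quoted later in this notebook.

```python
import pandas as pd

def share_by_range(values, low, high):
    """Return the share of values below, within and above [low, high]."""
    values = pd.Series(values)
    return pd.Series({
        "below": (values < low).mean(),
        "within": values.between(low, high).mean(),  # both ends inclusive
        "above": (values > high).mean(),
    })

# Made-up serum sodium readings; not the real data.
toy_sodium = [130, 134, 136, 138, 140, 141, 144, 146]
shares = share_by_range(toy_sodium, low=135, high=145)
print(shares)
```

The same call could be applied to each blood parameter column of `data_df` with its own reference range.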

#### Creatine Phosphokinase (CPK)
We note that the majority of patients had CPK in the normal range (10-120) with a few outliers in the higher ranges. Also note that the deaths within the normal range are distributed across the whole range of days to death. This indicates that we cannot accurately infer anything from the CPK levels.

In [None]:
#Plot a scatterplot
plt.figure(figsize=(5,5))
plt.scatter(data_df.creatinine_phosphokinase, data_df.time, color="green")
plt.show()

#### Ejection Fraction

If we look at Ejection Fraction, we can see that many of the patients who died before the 80% mark of ~120 days are grouped below about 45% EF. The normal range for EF is roughly between 50% and 70%. This suggests that a lower than normal EF can indicate potential heart failure.

In [None]:
#Plot a scatterplot
plt.figure(figsize=(5,5))
plt.scatter(data_df.ejection_fraction, data_df.time, color="red")
plt.show()
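
A scatter plot shows the pattern, and a correlation coefficient puts a number on it. The sketch below uses made-up values and computes the Pearson coefficient with `Series.corr`, plus a rank-based (Spearman) coefficient obtained by correlating the ranks. `toy_df` is hypothetical and the result says nothing about the real data.

```python
import pandas as pd

# Made-up ejection fraction (%) and days-to-death values; not the real data.
toy_df = pd.DataFrame({
    "ejection_fraction": [20, 25, 30, 35, 40, 60],
    "time":              [10, 30, 20, 60, 90, 150],
})

# Pearson correlation measures linear association.
pearson_r = toy_df["ejection_fraction"].corr(toy_df["time"])

# Spearman correlation is the Pearson correlation of the ranks,
# so it captures any monotonic relationship, not only a linear one.
spearman_r = toy_df["ejection_fraction"].rank().corr(toy_df["time"].rank())

print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
```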

#### Platelets Count

The normal range for Platelet Count is between 150K and 400K. Looking at our data, we see that the majority of patients had a PC within the normal range, indicating that it would not be a factor in potential heart failure.

In [None]:
#Plot a scatterplot
plt.figure(figsize=(5,5))
plt.scatter(data_df.platelets, data_df.time, color="yellow")
plt.show()

#### Serum Creatinine

If we look at our patients' Serum Creatinine, we see that most of them were below the 4 to 9 range used here as normal, suggesting that a low SC contributes to potential heart failure.

In [None]:
#Plot a scatterplot
plt.figure(figsize=(5,5))
plt.scatter(data_df.serum_creatinine, data_df.time, color="orange")
plt.show()

#### Serum Sodium

Serum Sodium has a normal range of 135-145. Looking at our data again, the majority of patients were within the normal range, with about a quarter of them below the normal range but mostly still within 1 standard deviation. From this, Serum Sodium would not be an accurate indicator of potential heart failure.

In [None]:
#Plot a scatterplot
plt.figure(figsize=(5,5))
plt.scatter(data_df.serum_sodium, data_df.time, color="blue")
plt.show()
# Normal range for serum sodium: 135-145

In [None]:
# Calculate the Standard Deviation of serum_sodium
data_df.serum_sodium.std()
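
The standard deviation on its own does not show how many patients sit within one standard deviation of the mean. As a minimal sketch on made-up readings, that share can be computed as follows. `toy_sodium` and `within_one_sd` are hypothetical names.

```python
import pandas as pd

# Made-up serum sodium readings; not the real data.
toy_sodium = pd.Series([130, 134, 136, 138, 140, 141, 144, 146])

mean = toy_sodium.mean()
std = toy_sodium.std()  # sample standard deviation (ddof=1), the pandas default

# Share of readings within one standard deviation either side of the mean.
within_one_sd = toy_sodium.between(mean - std, mean + std).mean()

print(f"mean = {mean:.2f}, std = {std:.2f}")
print(f"Share within 1 SD: {within_one_sd:.1%}")
```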

## Conclusion

Looking at a dataset from a study that examined heart failure patients (it is assumed that all patients suffered a cardiovascular event and were then monitored afterwards to obtain the data), and only taking into account the patients that were still part of the program when they died, we have determined the following:

* The age group for CVDs is relatively high, with all patients being over 40 and on average about 65 years of age.
* Potentially, males affected by CVDs number about double the females.
* Although diabetes, high blood pressure, smoking and anaemia may increase the risk of CVDs, none of them definitely results in heart failure on its own, and 13.5% of patients had none of these risk factors.
* Some blood tests can be used as an indication of risk of heart failure, with Ejection Fraction and Serum Creatinine below normal levels being strong indicators. Serum Sodium and Platelet Count are not accurate in indicating heart failure.

In [None]:
jovian.commit(project="CVDRisks")