Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')
import random 
import statistics
from scipy import stats
from statsmodels.stats import weightstats as stests
from scipy.stats import shapiro
from statsmodels.stats import power
from scipy.stats import chi2_contingency
from scipy.stats import pearsonr
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

Loading Data

In [None]:
hdata=pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
hdata

Exploratory Data Analysis

In [None]:
hdata.shape    #to find the size of Dataset

Interpretation:

The dataset has 299 observations and 13 variables.

In [None]:
hdata.dtypes  # to check the datatype of each column

Interpretation:

All 13 variables (age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, time and DEATH_EVENT) are numeric.
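Although every column is stored as a number, several of them only take the values 0 and 1. A minimal sketch of separating such binary columns from the continuous ones is shown below. The toy frame and the helper name split_binary_columns are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "age": [60.0, 75.0, 55.0, 42.0],
    "anaemia": [0, 1, 0, 1],
    "serum_sodium": [137, 130, 140, 136],
    "DEATH_EVENT": [1, 1, 0, 0],
})

def split_binary_columns(df):
    """Return (binary, continuous) column-name lists based on observed values."""
    binary = [c for c in df.columns if set(df[c].dropna().unique()) <= {0, 1}]
    continuous = [c for c in df.columns if c not in binary]
    return binary, continuous

binary_cols, continuous_cols = split_binary_columns(toy)
print("binary:", binary_cols)
print("continuous:", continuous_cols)
```

The same helper could be called on hdata to decide which columns are meaningful inputs for distribution plots and normality checks.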

In [None]:
hdata.isnull().sum()  #to check null value

Interpretation:

There is no missing data in the dataset.

Descriptive Analysis

In [None]:
hdata.describe()  #statistical description

Measure Of Dispersion

In [None]:
#Histogram Plot to check distribution of data
hdata.hist(figsize=(16,20),bins=50,xlabelsize=8,ylabelsize=8)
plt.show()

Interpretation : 

Only the platelets column looks approximately normally distributed, with a bell-shaped distribution.

Quantile-Quantile Plot

In [None]:
#Probability (Q-Q) plot for each variable to check its distribution
variables= ['age','anaemia','creatinine_phosphokinase','diabetes','ejection_fraction',
            'high_blood_pressure','platelets','serum_creatinine','serum_sodium','time','sex','smoking','DEATH_EVENT']
plt.figure(figsize=(20,18))

for i in range(1, 14):
    ax=plt.subplot(7, 2, i)
    ax=stats.probplot(hdata[variables[i-1]],dist="norm",plot=plt)
    plt.title(variables[i-1])

Interpretation:

Age, Platelets & Serum Sodium look approximately normally distributed.

Measure of Shape

In [None]:
#Skewness
hdata.skew()

Interpretation:

The variables age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, platelets, serum_creatinine, time & DEATH_EVENT are positively skewed. The variables serum_sodium & sex are negatively skewed.

In [None]:
#Kurtosis
hdata.kurtosis()

Interpretation:

The distributions of the variables age, anaemia, ejection_fraction, high_blood_pressure, sex, smoking, time & DEATH_EVENT are platykurtic. This implies that these variables have very few extreme observations.

The variables creatinine_phosphokinase, platelets, serum_creatinine & serum_sodium are leptokurtic. This implies that their distributions are concentrated near the mean, with more extreme observations in the tails.
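The sign rules used in the two interpretations above can be sketched as follows. pandas reports excess kurtosis, so a normal distribution scores about 0, positive values are leptokurtic and negative values are platykurtic. The synthetic columns and the helper describe_shape are assumptions made for this illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic columns with known shapes (not taken from the dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "right_skewed": rng.exponential(size=5000),
    "flat": rng.uniform(size=5000),
})

def describe_shape(series):
    """Classify a column by the sign of its skewness and excess kurtosis."""
    skew = series.skew()
    kurt = series.kurtosis()  # excess kurtosis: normal distribution is ~0
    side = "positive" if skew > 0 else "negative"
    tail = "leptokurtic" if kurt > 0 else "platykurtic"
    return side, tail

shape = {name: describe_shape(toy[name]) for name in toy.columns}
for name, (side, tail) in shape.items():
    print(f"{name}: {side} skew, {tail}")
```

For a column whose skewness or kurtosis is close to 0, the sign alone is not very informative, so the labels are best read together with the histograms.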

In [None]:
#Kurtosis plot
variables= ['age','anaemia','creatinine_phosphokinase','diabetes',
            'ejection_fraction','high_blood_pressure','platelets','serum_creatinine',
            'serum_sodium','time','sex','smoking','DEATH_EVENT']
plt.figure(figsize=(20,28))

for i in range(1, 14):
    ax=plt.subplot(7, 2, i)
    ax=sns.distplot(hdata[variables[i-1]],label='Kurtosis : %.2f'%hdata[variables[i-1]].kurtosis())
    plt.legend(loc='best')
    plt.axvline(hdata[variables[i-1]].mean())

Shapiro Test

In [None]:
#Null Hypothesis - H0 : sample comes from a normal distribution (Normally Distributed)
#Alternative Hypothesis - H1 : sample doesn't come from a normal distribution (Not Normally Distributed)
alpha = 0.05  #significance level
for i in variables:
    stat,p = stats.shapiro(hdata[i])
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    if p>alpha:
        print('{} ----- Normally distributed (Retain the null hypothesis)'.format(i))
    else:
        print('{} -----  Not normally distributed (Reject the null hypothesis)'.format(i))

Interpretation:

According to the Shapiro-Wilk test, the variables in the dataset are not normally distributed.
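As a rough illustration of why the test rejects normality so readily here, the sketch below applies the Shapiro-Wilk test to two synthetic samples of the same size as the dataset: a right-skewed one and a 0/1 indicator. Both samples are made up. A binary column can never look normal, so rejections for columns such as sex or smoking are expected by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05

# Synthetic samples with 299 values each (same size as the dataset).
skewed = rng.exponential(scale=1.0, size=299)   # strongly right-skewed
binary = rng.integers(0, 2, size=299)           # 0/1 indicator column

_, p_skewed = stats.shapiro(skewed)
_, p_binary = stats.shapiro(binary)

skewed_rejected = p_skewed < alpha
binary_rejected = p_binary < alpha
print("skewed sample rejected:", skewed_rejected)
print("binary sample rejected:", binary_rejected)
```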

In [None]:
plt.figure(figsize=(20,30))
corrMatrix = hdata.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

Interpretation:

The heatmap shows correlation between Death Event & these 3 parameters:

1) Ejection Fraction

2) Serum Creatinine

3) Serum Sodium
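Reading the strongest relationships off a large annotated heatmap can be error-prone. A minimal sketch of ranking the columns by the absolute value of their correlation with DEATH_EVENT is given below. The data are synthetic and built so that ejection_fraction is related to the outcome while platelets is not; on the real data the same expression would be applied to hdata.

```python
import numpy as np
import pandas as pd

# Synthetic data: ejection_fraction depends on the outcome, platelets does not.
rng = np.random.default_rng(1)
n = 300
death = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "DEATH_EVENT": death,
    "ejection_fraction": 40 - 8 * death + rng.normal(scale=5, size=n),
    "platelets": rng.normal(loc=260000, scale=90000, size=n),
})

# Correlation of every other column with DEATH_EVENT, strongest first.
corr_with_death = (
    toy.corr()["DEATH_EVENT"]
    .drop("DEATH_EVENT")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_death)
```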

Hypothesis Testing

To check whether the Death Event is associated with Serum Sodium, Serum Creatinine and Ejection Fraction.

Normal levels: Serum Sodium (130 - 145 mEq/L), Serum Creatinine (male: 0.9-1.3 mg/dL, female: 0.6-1.1 mg/dL), Ejection Fraction (45-75 %)

In [None]:
#Dividing dataset into 2 part 
df_death=hdata[hdata['DEATH_EVENT']== 1]    #patient died
df_surv=hdata[hdata['DEATH_EVENT']== 0]     #patient survived

Ejection Fraction vs chances of survival

A reduction in ejection fraction could be associated with the death of heart patients.

Null hypothesis H0: (mu)death-(mu)surv>=0

Alternate hypothesis H1: (mu)death-(mu)surv<0

Since the variables are not normally distributed, we proceed with a non-parametric two-sample test (the Mann-Whitney U test).

In [None]:
stats.mannwhitneyu(df_death['ejection_fraction'],df_surv['ejection_fraction'],alternative='less')  #Mann-Whitney U test

Interpretation:

The p-value is below 0.05, so it falls in the rejection region.

We reject H0.

Thus we can conclude that ejection fraction tends to be lower for patients who died than for patients who survived.
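A small sketch of the one-sided Mann-Whitney U test on made-up values is shown below. The two lists are hypothetical and deliberately non-overlapping. In recent SciPy versions the reported statistic is U for the first sample, so dividing it by n1*n2 gives a rough estimate of how often a value from the first group exceeds a value from the second. That reading is an assumption about the SciPy version in use.

```python
from scipy import stats

# Hypothetical ejection-fraction values, chosen so the groups do not overlap.
died = [15, 20, 25, 28, 30, 33, 35, 38]
survived = [40, 45, 50, 55, 58, 60, 62, 65]

# H1: values in `died` tend to be smaller than values in `survived`.
res = stats.mannwhitneyu(died, survived, alternative="less")

# Share of (died, survived) pairs in which the `died` value is the larger one.
prob_died_higher = res.statistic / (len(died) * len(survived))

print("U statistic:", res.statistic)
print("p-value:", res.pvalue)
print("P(died > survived) estimate:", prob_died_higher)
```

Reporting this proportion next to the p-value gives a sense of the size of the difference, not only of its significance.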

In [None]:
sns.distplot(df_death['ejection_fraction'],color='Red')
sns.distplot(df_surv['ejection_fraction'],color='Grey')
plt.title("Chances of survival vs Ejection Fraction")
plt.legend(('Dead','Survived'))
plt.show()

Interpretation:

The graph above shows that patients who died tended to have a lower ejection fraction than patients who survived.

Serum Creatinine vs Chances of Survival

Increase in serum creatinine levels could lead to death of heart patients

Null hypothesis H0: (mu)surv-(mu)death>=0

Alternate hypothesis H1: (mu)surv-(mu)death<0

In [None]:
stats.mannwhitneyu(df_surv['serum_creatinine'],df_death['serum_creatinine'],alternative='less')   #Mann-Whitney U test

Interpretation:

The p-value is below 0.05.

We reject H0.

Thus we can conclude that the chances of survival are lower for patients with increased serum creatinine levels.

In [None]:
sns.distplot(df_death['serum_creatinine'],color='Red')
sns.distplot(df_surv['serum_creatinine'],color='Grey')
plt.title("Chances of survival vs Serum Creatinine")
plt.legend(('Dead','Survived'))
plt.show()

Interpretation:

The graph above shows that higher serum creatinine levels are associated with a larger number of deaths.

Serum Sodium vs chances of survival

Decrease in serum sodium levels could decrease chances of survival

H0: (mu)death-(mu)surv>=0

H1: (mu)death-(mu)surv<0

In [None]:
stats.mannwhitneyu(df_death['serum_sodium'],df_surv['serum_sodium'],alternative='less')  #Mann-Whitney U test

Interpretation:

Since the p-value is below 0.05, it falls in the rejection region.

We reject H0.

This indicates that the chances of survival are lower with reduced serum sodium levels.

In [None]:
plt.figure(figsize=(5,5))
ax1=sns.distplot(df_death['serum_sodium'],color='Red')
ax2=sns.distplot(df_surv['serum_sodium'],color='Grey')
plt.legend(('Dead','Survived'))
plt.title("Chances of survival vs Serum Sodium")
plt.show()

Interpretation:

The graph above shows that a decrease in serum sodium level lessens the chance of survival.

The three tests above suggest that low ejection fraction, increased serum creatinine & decreased serum sodium lessen the chance of survival.

Testing the risk of heart failure for low ejection fraction, increased serum creatinine and low serum sodium levels (at medically defined levels).

Claim 1 : People with ejection fraction less than 45 have a higher tendency for heart failure

In [None]:
#record of all the patients who have died due to heart failure
df_death.head()

In [None]:
M0 = 45           #hypothesized median

# H0: M >= 45
# H1: M < 45

alpha=0.05
diff = df_death['ejection_fraction']-M0
test_statistic,pval = stats.wilcoxon(x=diff,alternative='less')
print('Stats :',test_statistic)
print('P-Value :',pval)
if (pval < alpha):
    print('The pvalue calculated is less than the level of significance. So, we reject the null hypothesis.')
else:
    print('The pvalue calculated is greater than the level of significance. So, we fail to reject the null hypothesis.')

Interpretation:

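The test indicates that the median ejection fraction of patients who died is below 45%, which is consistent with the claim that people with ejection fraction less than 45% have a higher tendency of heart failure.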
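The one-sample Wilcoxon signed-rank test used above can be sketched on a handful of made-up values as follows. The array ef_died is hypothetical, and every value in it lies below the hypothesized median, so a small p-value is expected. Note that the test concerns the median among patients who died; it does not by itself compare risk between groups.

```python
import numpy as np
from scipy import stats

M0 = 45  # hypothesized median (H0: M >= 45, H1: M < 45)

# Hypothetical ejection-fraction values for patients who died.
ef_died = np.array([30, 25, 20, 35, 38, 40, 15, 17, 33, 28])

# Signed-rank test on the differences from the hypothesized median.
diff = ef_died - M0
stat, pval = stats.wilcoxon(diff, alternative="less")

# Descriptive companion: share of values below the threshold.
share_below = float(np.mean(ef_died < M0))

print("statistic:", stat)
print("p-value:", pval)
print("share below M0:", share_below)
```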

In [None]:
lower_values = df_death.loc[df_death['ejection_fraction'] <45]
upper_values = df_death.loc[df_death['ejection_fraction'] > 75]
normal_values = df_death.loc[(df_death['ejection_fraction'] >= 45) & (df_death['ejection_fraction'] <=75)]
x,y,z = lower_values.shape[0],normal_values.shape[0],upper_values.shape[0]
plt.bar(['lower_values','normal_values','upper_values'],[x,y,z],color='blue')
plt.ylabel('Number of Death')
plt.show()

Interpretation:

The graph above shows that most of the recorded deaths occurred in patients with ejection fraction below 45%.

Claim 2 : Risk of heart failure increases with increased levels of serum creatinine (greater than 1.21 milligrams per deciliter)

In [None]:
M0 = 1.21           #hypothesized median

# H0: M <= 1.21
# H1: M > 1.21

alpha=0.05
diff = df_death['serum_creatinine']-M0
test_statistic,pval = stats.wilcoxon(x=diff,alternative='greater')
print('Stats :',test_statistic)
print('P-Value :',pval)
if (pval < alpha):
    print('The pvalue calculated is less than the level of significance.Therefore we reject the null hypothesis')
else:
    print('The pvalue calculated is greater than the level of significance.Therefore we fail to reject the null hypothesis')

Interpretation:

The test indicates that the median serum creatinine of patients who died is above 1.21 mg/dL, which is consistent with the claim that the risk of heart failure increases with increased serum creatinine levels.

In [None]:
lower_values = df_death.loc[df_death['serum_creatinine'] <0.84]
upper_values = df_death.loc[df_death['serum_creatinine'] >1.21]
normal_values = df_death.loc[(df_death['serum_creatinine'] >=0.84) & (df_death['serum_creatinine'] <=1.21)]  #include both boundary values
x,y,z = lower_values.shape[0],normal_values.shape[0],upper_values.shape[0]
plt.bar(['lower_values','normal_values','upper_values'],[x,y,z],color='blue')
plt.ylabel('Number of Death')
plt.show()

Interpretation:

The graph above shows that most of the recorded deaths occurred in patients with increased serum creatinine levels.
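Band counts like the ones plotted above can also be computed in a single pass, so that every value, including one sitting exactly on a boundary, lands in exactly one band. The sketch below uses np.select; the helper name band and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

def band(values, low, high):
    """Label each value as lower/normal/upper; both boundaries count as normal."""
    values = pd.Series(values)
    labels = np.select(
        [(values < low).to_numpy(), (values > high).to_numpy()],
        ["lower_values", "upper_values"],
        default="normal_values",
    )
    return pd.Series(labels, index=values.index)

# Hypothetical serum creatinine values, two of them exactly on a boundary.
toy = [0.5, 0.84, 1.0, 1.21, 2.0]
counts = band(toy, 0.84, 1.21).value_counts()
print(counts)
```

Because the default branch catches everything that is neither below nor above the range, the three counts always add up to the number of rows.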

Claim 3 : Risk of heart failure increases when the serum sodium level decreases (less than 137 mEq/L)

In [None]:
M0 = 137           #hypothesized median

# H0: M >= 137
# H1: M < 137

alpha=0.05
diff = df_death['serum_sodium']-M0
test_statistic,pval = stats.wilcoxon(x=diff,alternative='less')
print('Stats :',test_statistic)
print('P-Value :',pval)
if(pval<alpha):
    print('The pvalue calculated is less than the level of significance.Therefore we reject the null hypothesis.')
else:
    print('The pvalue calculated is greater than the level of significance.Therefore we fail to reject the null hypothesis.')

Interpretation:

The test indicates that the median serum sodium of patients who died is below 137 mEq/L, which is consistent with the claim that the risk of heart failure increases when serum sodium decreases.

In [None]:
lower_values = df_death.loc[df_death['serum_sodium'] <137]
upper_values = df_death.loc[df_death['serum_sodium'] > 145]
normal_values = df_death.loc[(df_death['serum_sodium'] >= 137) & (df_death['serum_sodium'] <=145)]
x,y,z = lower_values.shape[0],normal_values.shape[0],upper_values.shape[0]
plt.bar(['lower_values','normal_values','upper_values'],[x,y,z],color='blue')
plt.ylabel('Number of Death')
plt.show()

Interpretation:

The graph above shows how the recorded deaths are distributed across low, normal and high serum sodium levels.

The three tests above support the claim that patients with ejection fraction lower than 45%, serum creatinine above 1.21 mg/dL & serum sodium below 137 mEq/L have a higher chance of heart failure.

The tests above show that abnormally low or high levels of certain variables (Ejection Fraction, Serum Sodium & Serum Creatinine) are associated with a higher chance of heart failure. We now test whether these variables, within their normal levels, and the gender of the patient affect the risk of heart failure.

Test of Claim within the Normal Level & Gender

Testing whether deaths are associated with Ejection Fraction within the normal level, and whether the gender of the patient affects it.

In [None]:
ej = hdata.loc[(hdata['ejection_fraction']>45) & (hdata['ejection_fraction']<75),
               ['ejection_fraction','sex','DEATH_EVENT']] #Subset: patients with ejection fraction strictly between 45 and 75
ej.head(10)

In [None]:
m1=ols('Q("ejection_fraction")~Q("sex")+Q("DEATH_EVENT")',data=ej).fit()  #fit the linear model used for the ANOVA table
anova_table=anova_lm(m1)
anova_table

Hypothesis for column: sex

H0: Within the normal level, mean Ejection Fraction does not differ by Gender

H1: Within the normal level, mean Ejection Fraction differs by Gender

Here we check whether deaths are associated with Ejection Fraction within its normal level. Ejection Fraction (Normal Level) : 45 to 75

In [None]:
fcrit_Gender = stats.f.isf(0.05,1,m1.df_resid)  #critical F value; residual df read from the fitted model instead of hard-coding 56
print(fcrit_Gender)

Interpretation:

f_stat(Sex) < f_critical_Gender, hence we fail to reject the null hypothesis.

Therefore we conclude that, within the normal level, Ejection Fraction does not differ significantly by Gender.
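Comparing an F statistic with the critical value is equivalent to comparing its p-value with the significance level. A short sketch of that equivalence is given below. The degrees of freedom (1 and 56) are the ones used in this section, while the helper names and the example F statistics are made up.

```python
from scipy import stats

alpha = 0.05
df_num, df_resid = 1, 56   # numerator df and residual df assumed for this sketch

# Critical value: the point with upper-tail probability alpha.
f_crit = stats.f.isf(alpha, df_num, df_resid)

def reject_by_critical_value(f_stat):
    """Decision rule used in the text: reject H0 when F exceeds the critical value."""
    return bool(f_stat > f_crit)

def reject_by_p_value(f_stat):
    """Equivalent rule: reject H0 when the upper-tail p-value is below alpha."""
    return bool(stats.f.sf(f_stat, df_num, df_resid) < alpha)

for f_stat in (0.5, 3.0, 5.0, 12.0):   # hypothetical F statistics
    print(f_stat, reject_by_critical_value(f_stat), reject_by_p_value(f_stat))
```

Since the ANOVA table already lists a p-value in its PR(>F) column, either rule can be applied without computing anything by hand.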

Hypothesis for column: death events

H0: Within the normal level, mean Ejection Fraction does not differ by Death Event

H1: Within the normal level, mean Ejection Fraction differs by Death Event

In [None]:
fcrit_Death = stats.f.isf(0.05,1,m1.df_resid)  #critical F value; residual df read from the fitted model instead of hard-coding 56
print(fcrit_Death)

Interpretation:

f_stat(Death Event) < f_critical_death, hence we fail to reject the null hypothesis.

Therefore we conclude that, within the normal level, Ejection Fraction does not differ significantly by Death Event.

In [None]:
sns.stripplot(x=ej['DEATH_EVENT'],y=ej['ejection_fraction'],hue=ej['sex'])
plt.show()

Interpretation:

From the graph above we can see that only a nominal number of deaths was recorded within the normal level, as more patients survived irrespective of their gender (DEATH_EVENT: 0 = no death, 1 = death; sex: 0 = female, 1 = male).

Testing whether deaths are associated with Serum Creatinine within the normal level, and whether the gender of the patient affects it.

In [None]:
sc = hdata.loc[(hdata['serum_creatinine']>0.6) & (hdata['serum_creatinine']<1.3),
               ['serum_creatinine','sex','DEATH_EVENT']]   #Subset: patients with serum creatinine strictly between 0.6 and 1.3
sc.head()

In [None]:
m2=ols('Q("serum_creatinine")~Q("sex")+Q("DEATH_EVENT")',data=sc).fit()  #fit the linear model used for the ANOVA table
anova_table=anova_lm(m2)
anova_table

Hypothesis for column: sex

H0: Within the normal level, mean Serum Creatinine does not differ by Gender

H1: Within the normal level, mean Serum Creatinine differs by Gender

In [None]:
fcrit_Gender = stats.f.isf(0.05,1,m2.df_resid)  #critical F value; residual df read from the fitted model instead of hard-coding 190
print(fcrit_Gender)

Interpretation:

f_stat(Sex) < f_critical_Gender, hence we fail to reject the null hypothesis.

Therefore we conclude that, within the normal level, Serum Creatinine does not differ significantly by Gender.

Hypothesis for column: Death Event

H0: Within the normal level, mean Serum Creatinine does not differ by Death Event

H1: Within the normal level, mean Serum Creatinine differs by Death Event

In [None]:
fcrit_Death = stats.f.isf(0.05,1,m2.df_resid)  #critical F value; residual df read from the fitted model instead of hard-coding 190
print(fcrit_Death)

Interpretation:

f_stat(Death) > f_critical_Death, hence we reject the null hypothesis.

Therefore we conclude that, within the normal level, Serum Creatinine differs significantly by Death Event.
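The F statistics, p-values and residual degrees of freedom used in these comparisons can all be read from the fitted model and its ANOVA table. A minimal sketch on synthetic data is shown below. The frame toy and its columns sex, death and level are made up, with an effect built in for death and none for sex.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic, balanced data: `level` depends on `death` but not on `sex`.
rng = np.random.default_rng(7)
n = 40
toy = pd.DataFrame({
    "sex": np.tile([0, 1], n // 2),
    "death": np.repeat([0, 1], n // 2),
})
toy["level"] = 1.0 + 0.3 * toy["death"] + rng.normal(scale=0.05, size=n)

model = ols("level ~ sex + death", data=toy).fit()
table = anova_lm(model)

# Residual df comes from the model, so nothing has to be typed in by hand.
f_crit = stats.f.isf(0.05, 1, model.df_resid)
significant = table["PR(>F)"].drop("Residual") < 0.05

print(table)
print("critical F:", f_crit)
print(significant)
```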

In [None]:
sns.stripplot(x=sc['DEATH_EVENT'],y=sc['serum_creatinine'],hue=sc['sex'])
plt.show()

Interpretation:

From the graph above we can see that only a nominal number of deaths was recorded within the normal level, as more patients survived irrespective of their gender, but more deaths were recorded than within the normal level of Ejection Fraction (DEATH_EVENT: 0 = no death, 1 = death; sex: 0 = female, 1 = male).

Testing whether deaths are associated with Serum Sodium within the normal level, and whether the gender of the patient affects it.

In [None]:
ss = hdata.loc[(hdata['serum_sodium']>130) & (hdata['serum_sodium']<145),
               ['serum_sodium','sex','DEATH_EVENT']]   #Subset: patients with serum sodium strictly between 130 and 145
ss.head()

In [None]:
m3=ols('Q("serum_sodium")~Q("sex")+Q("DEATH_EVENT")',data=ss).fit()
anova_table=anova_lm(m3)
anova_table

Hypothesis for column: sex

H0: Within the normal level, mean Serum Sodium does not differ by Gender

H1: Within the normal level, mean Serum Sodium differs by Gender

In [None]:
fcrit_Gender = stats.f.isf(0.05,1,m3.df_resid)  #critical F value; residual df read from the fitted model instead of hard-coding 263
print(fcrit_Gender)

Interpretation:

f_stat(Sex) > f_critical_Gender, hence we reject the null hypothesis.

Therefore we conclude that, within the normal level, Serum Sodium differs significantly by Gender.

Hypothesis for column: Death Events

H0: Within the normal level, mean Serum Sodium does not differ by Death Event

H1: Within the normal level, mean Serum Sodium differs by Death Event

In [None]:
fcrit_Death = stats.f.isf(0.05,1,m3.df_resid)  #critical F value; residual df read from the fitted model instead of hard-coding 263
print(fcrit_Death)

Interpretation:

f_stat(Death) > f_critical_Death, hence we reject the null hypothesis.

Therefore we conclude that, within the normal level, Serum Sodium differs significantly by Death Event.

In [None]:
sns.stripplot(x=ss['DEATH_EVENT'],y=ss['serum_sodium'],hue=ss['sex'])
plt.show()

Interpretation:

From the graph above we can see that more deaths were recorded among male patients than among female patients (DEATH_EVENT: 0 = no death, 1 = death; sex: 0 = female, 1 = male).

Survival Prediction Plot


Analysis 1: Serum_Creatinine vs Ejection Fraction

In [None]:
sns.lmplot(x="serum_creatinine", y="ejection_fraction", hue="DEATH_EVENT", data=hdata)
plt.show()

Interpretation:


The linear regression lines indicate that, as Ejection Fraction increases, only a nominal number of deaths is recorded within the normal level.

In contrast, the declining regression line suggests a lower chance of survival within the normal level.

Analysis 2: Serum Sodium vs Ejection Fraction

In [None]:
sns.lmplot(x="serum_sodium", y="ejection_fraction", hue="DEATH_EVENT", data=hdata)
plt.show()

Interpretation:

The linear regression lines indicate that, as Ejection Fraction increases, only a nominal number of deaths is recorded within the normal level.

The rising regression line also suggests a linear improvement in the survival of patients within the normal level.
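The direction of each fitted line in an lmplot can be checked numerically by fitting one straight line per outcome group. The sketch below does this with scipy.stats.linregress on a small made-up frame in which the slopes are known in advance (0.5 for survivors and 0.25 for deaths). The values are illustrative only and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up data with an exact linear relationship in each group.
toy = pd.DataFrame({
    "serum_sodium": [130, 132, 134, 136, 138, 140] * 2,
    "DEATH_EVENT": [0] * 6 + [1] * 6,
})
toy["ejection_fraction"] = np.where(
    toy["DEATH_EVENT"] == 0,
    0.5 * toy["serum_sodium"] - 30,    # survivors: slope 0.5
    0.25 * toy["serum_sodium"] - 5,    # deaths: slope 0.25
)

# One least-squares line per group, mirroring hue="DEATH_EVENT" in lmplot.
slopes = {
    group: stats.linregress(part["serum_sodium"], part["ejection_fraction"]).slope
    for group, part in toy.groupby("DEATH_EVENT")
}
print(slopes)
```

A positive slope means that ejection fraction rises with the x variable in that group, and a negative slope means that it falls.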

Conclusion :

The statistical analysis above is divided into two parts:

1) Descriptive Analysis

2) Inferential Analysis


The descriptive analysis indicates that the dataset has 299 patient records with 13 distinct columns. All the columns are numerical, but some of them hold categorical values coded as 0 & 1. There are no null values in the dataset. The exploratory data analysis indicates that the mean age of the patients is approximately 60 years. The mean ejection fraction is 38.08%. The mean serum creatinine level is 1.39 mg/dL. The mean serum sodium level is 136.62 mEq/L. Several normality checks were performed: the Shapiro-Wilk test rejects normality, although age, platelets & serum sodium look approximately Gaussian in the Q-Q plots. The heatmap shows correlation between Ejection Fraction, Serum Creatinine, Serum Sodium & Death Event.

The inferential analysis indicates that Ejection Fraction, Serum Creatinine & Serum Sodium are associated with the occurrence of heart failure. After hypothesis testing we conclude that patients with ejection fraction below 45%, serum creatinine above 1.21 mg/dL & serum sodium below 137 mEq/L have higher chances of heart failure. Within the normal levels, gender showed no significant effect for ejection fraction or serum creatinine, although it did for serum sodium.


The biostatistical analysis selected ejection fraction, serum creatinine & serum sodium as the three most relevant features to predict the survival of a patient. For any patient receiving treatment for a heart condition, these features should be checked as a priority to assess the possibility of heart failure.