# Introduction

In this assignment, we reproduce the study "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone" by Davide Chicco and Giuseppe Jurman. The purpose of that study is to predict whether a patient will survive, given the patient's clinical features. This gives doctors insight into which factors to focus on when they review patients' records.
The assignment is split into 3 parts:
- Descriptive statistics
- Logistic Regression
- Random Forest

# Descriptive statistics

In this section, we examine the characteristics of each feature and reproduce Tables 2, 3, 5, 6, and 7 from the paper.

In [1]:
import numpy as np
import pandas as pd
import statistics
import scipy.stats as stats
from scipy.stats import pearsonr
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

In [2]:
heart_data = pd.read_csv('data/heart_failure_records.csv')
# Express platelet counts in thousands
heart_data['platelets'] = heart_data['platelets']/1000

In [3]:
def summarize_bin(data, *cols):
    """Return the count and percentage of each value for the given binary columns."""
    summary_table = pd.DataFrame()
    for col in cols:
        # Count rows per value and express each count as a percentage of all rows in `data`
        col_summary = data.groupby(col).agg(number=(col, 'size'),
                             percentage=(col, lambda x: round((len(x)/len(data))*100,2)))
        col_summary['variable'] = col
        col_summary['value'] = col_summary.index.copy()
        col_list = ['variable', 'value', 'number', 'percentage']
        col_summary = col_summary[col_list]
        # Stack the per-column summaries vertically
        summary_table = pd.concat([summary_table,col_summary], axis = 0)
    return summary_table

In [4]:
full_sample_1 = summarize_bin(heart_data, 'anaemia', 
          'high_blood_pressure',
         'diabetes',
         'sex',
         'smoking')
full_sample_1.columns = pd.MultiIndex.from_product([['Full sample'], full_sample_1.columns])

In [5]:
dead_patients_1 = summarize_bin(heart_data[heart_data.DEATH_EVENT == 1], 'anaemia', 
          'high_blood_pressure',
         'diabetes',
         'sex',
         'smoking')
dead_patients_1.columns = pd.MultiIndex.from_product([['Dead patients'], dead_patients_1.columns])

In [6]:
survived_patients_1 = summarize_bin(heart_data[heart_data.DEATH_EVENT == 0], 'anaemia', 
          'high_blood_pressure',
         'diabetes',
         'sex',
         'smoking')
survived_patients_1.columns = pd.MultiIndex.from_product([['Survived patients'], survived_patients_1.columns])

In [7]:
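# Place the three summaries side by side; the 'variable' column of the dead and survived blocks is skipped
summary_table_1 = pd.concat([full_sample_1, dead_patients_1.iloc[:, 1:], survived_patients_1.iloc[:, 1:]], axis = 1)
summary_table_1 = summary_table_1.reset_index(drop = True)
# Drop the repeated 'value' columns (positions 4 and 7)
summary_table_1 = summary_table_1.drop(summary_table_1.columns[[4, 7]], axis=1)
summary_table_1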

| | variable | value | Full sample number | Full sample percentage | Dead patients number | Dead patients percentage | Survived patients number | Survived patients percentage |
|---|---|---|---|---|---|---|---|---|
| 0 | anaemia | 0 | 170 | 56.86 | 50 | 52.08 | 120 | 59.11 |
| 1 | anaemia | 1 | 129 | 43.14 | 46 | 47.92 | 83 | 40.89 |
| 2 | high_blood_pressure | 0 | 194 | 64.88 | 57 | 59.38 | 137 | 67.49 |
| 3 | high_blood_pressure | 1 | 105 | 35.12 | 39 | 40.62 | 66 | 32.51 |
| 4 | diabetes | 0 | 174 | 58.19 | 56 | 58.33 | 118 | 58.13 |
| 5 | diabetes | 1 | 125 | 41.81 | 40 | 41.67 | 85 | 41.87 |
| 6 | sex | 0 | 105 | 35.12 | 34 | 35.42 | 71 | 34.98 |
| 7 | sex | 1 | 194 | 64.88 | 62 | 64.58 | 132 | 65.02 |
| 8 | smoking | 0 | 203 | 67.89 | 66 | 68.75 | 137 | 67.49 |
| 9 | smoking | 1 | 96 | 32.11 | 30 | 31.25 | 66 | 32.51 |


The table above reproduces Table 2 in the paper. It shows that 'anaemia' and 'high_blood_pressure' are the two binary variables with the largest differences between dead and survived patients (about 7 and 8 percentage points, respectively); the other three differ by less than 2 percentage points.
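
The per-group counts and percentages for a single binary feature can also be obtained with `pd.crosstab`. The sketch below is not the approach used in this notebook; it runs on a small made-up DataFrame (`toy` is hypothetical and only borrows the column names), purely to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data, NOT the real heart failure records.
toy = pd.DataFrame({
    "anaemia":     [0, 1, 0, 1, 1, 0, 0, 1],
    "DEATH_EVENT": [0, 0, 0, 1, 1, 0, 1, 0],
})

# Number of patients with each anaemia value, split by outcome.
counts = pd.crosstab(toy["anaemia"], toy["DEATH_EVENT"])

# Share of each anaemia value within each outcome group, in percent.
percentages = (
    pd.crosstab(toy["anaemia"], toy["DEATH_EVENT"], normalize="columns") * 100
).round(2)

print(counts)
print(percentages)
```

With `normalize="columns"`, each outcome column sums to 100, which corresponds to the way the "Dead patients" and "Survived patients" percentages are defined in the table above.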

In [9]:
num_vars = ['age', 
            'creatinine_phosphokinase', 
            'ejection_fraction',
           'platelets',
           'serum_creatinine',
           'serum_sodium',
           'time']

In [10]:
def summarize_num(data):
    """Return the median, mean and standard deviation of each variable in num_vars."""
    median_ = pd.DataFrame(data[num_vars].median()).rename(columns={0:'Median'})
    mean_ = pd.DataFrame(data[num_vars].mean().round(2)).rename(columns={0:'Mean'})
    std_ = pd.DataFrame(data[num_vars].std().round(2)).rename(columns={0:'Std'})
    # One row per variable, one column per statistic
    summary_table = pd.concat([median_,mean_,std_], axis = 1)
    return summary_table

In [11]:
full_sample_2 = summarize_num(heart_data[num_vars])
full_sample_2.columns = pd.MultiIndex.from_product([['Full sample'], full_sample_2.columns])

In [12]:
dead_patients_2 = summarize_num(heart_data[num_vars][heart_data.DEATH_EVENT == 1])
dead_patients_2.columns = pd.MultiIndex.from_product([['Dead patients'], dead_patients_2.columns])

In [13]:
survived_patients_2 = summarize_num(heart_data[num_vars][heart_data.DEATH_EVENT == 0])
survived_patients_2.columns = pd.MultiIndex.from_product([['Survived patients'], survived_patients_2.columns])

In [14]:
summary_table_2 = pd.concat([full_sample_2, dead_patients_2, survived_patients_2], axis = 1)
summary_table_2

| | Full sample Median | Full sample Mean | Full sample Std | Dead patients Median | Dead patients Mean | Dead patients Std | Survived patients Median | Survived patients Mean | Survived patients Std |
|---|---|---|---|---|---|---|---|---|---|
| age | 60.0 | 60.83 | 11.89 | 65.0 | 65.22 | 13.21 | 60.0 | 58.76 | 10.64 |
| creatinine_phosphokinase | 250.0 | 581.84 | 970.29 | 259.0 | 670.2 | 1316.58 | 245.0 | 540.05 | 753.8 |
| ejection_fraction | 38.0 | 38.08 | 11.83 | 30.0 | 33.47 | 12.53 | 38.0 | 40.27 | 10.86 |
| platelets | 262.0 | 263.36 | 97.8 | 258.5 | 256.38 | 98.53 | 263.0 | 266.66 | 97.53 |
| serum_creatinine | 1.1 | 1.39 | 1.03 | 1.3 | 1.84 | 1.47 | 1.0 | 1.18 | 0.65 |
| serum_sodium | 137.0 | 136.63 | 4.41 | 135.5 | 135.38 | 5.0 | 137.0 | 137.22 | 3.98 |
| time | 115.0 | 130.26 | 77.61 | 44.5 | 70.89 | 62.38 | 172.0 | 158.34 | 67.74 |


The table above reproduces Table 3 in the paper; the results obtained here are similar to those in the paper.
- The average age of dead patients is higher (mean 65.22 vs. 58.76).
- The level of creatinine phosphokinase (CPK) is also higher in dead patients than in survived ones. CPK is an enzyme that is released into the blood when muscle tissue is damaged.
- In addition, the ejection fraction, i.e. the percentage of blood leaving the heart at each contraction, is lower for dead patients.
- The platelet count is slightly higher in survived patients than in dead ones.
- The level of serum creatinine is higher in dead patients.
- This is not the case for serum sodium, whose level does not vary much between survived and dead patients.
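
As a side note, the same median, mean and standard deviation per outcome group can be computed in one pass with `groupby(...).agg(...)`. This is only a sketch on made-up numbers (`toy` is a hypothetical DataFrame that reuses two column names), not the code used to produce the table above.

```python
import pandas as pd

# Hypothetical toy data, NOT the real heart failure records.
toy = pd.DataFrame({
    "age":              [50, 60, 70, 80, 65, 55],
    "serum_creatinine": [1.0, 1.1, 0.9, 2.0, 1.8, 1.2],
    "DEATH_EVENT":      [0, 0, 0, 1, 1, 0],
})

# One row per outcome group, one (variable, statistic) column pair per measure.
summary = (
    toy.groupby("DEATH_EVENT")[["age", "serum_creatinine"]]
       .agg(["median", "mean", "std"])
       .round(2)
)

# Transpose so that variables and statistics become rows, one column per group.
print(summary.T)
```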

In [15]:
vars_list = ['serum_creatinine',
          'ejection_fraction',
          'age',
          'serum_sodium',
          'high_blood_pressure',
          'anaemia',
          'platelets',
          'creatinine_phosphokinase',
          'smoking',
          'sex',
          'diabetes']

In [16]:
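# Mann-Whitney U test comparing survived and dead patients for each feature
uw_pvalue = []
for var in vars_list:
    result = stats.mannwhitneyu(heart_data[var][heart_data.DEATH_EVENT == 0],
                   heart_data[var][heart_data.DEATH_EVENT == 1], alternative='two-sided')

    # Keep the p-value, rounded to six decimal places
    uw_pvalue.append(round(result[1],6))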

In [17]:
summary_table_3 = pd.DataFrame(list(zip(vars_list, uw_pvalue)),
               columns =['Feature', 'P-value']).sort_values(by=['P-value']).reset_index(drop = True)
summary_table_3['Rank'] = summary_table_3.index + 1
summary_table_3

| | Feature | P-value | Rank |
|---|---|---|---|
| 0 | serum_creatinine | 0.0 | 1 |
| 1 | ejection_fraction | 1e-06 | 2 |
| 2 | age | 0.000167 | 3 |
| 3 | serum_sodium | 0.000293 | 4 |
| 4 | high_blood_pressure | 0.171016 | 5 |
| 5 | anaemia | 0.25297 | 6 |
| 6 | platelets | 0.425559 | 7 |
| 7 | creatinine_phosphokinase | 0.68404 | 8 |
| 8 | smoking | 0.82819 | 9 |
| 9 | sex | 0.941292 | 10 |


This table reproduces Table 5 in the paper. The top two features are 'serum creatinine' and 'ejection fraction', whose p-values are close to 0. The value 0.0 shown for serum creatinine is a result of rounding the p-value to six decimal places.
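
To see how rounding can hide a very small p-value, consider the sketch below. It uses two made-up, fully separated samples (`group_a` and `group_b` are hypothetical, not drawn from the dataset) and compares the rounded p-value with the same value printed in scientific notation.

```python
import numpy as np
from scipy import stats

# Two hypothetical samples with no overlap at all, NOT the real patient data.
group_a = np.arange(0, 50)
group_b = np.arange(100, 150)

result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Rounding to six decimals collapses a tiny p-value to exactly 0.0 ...
rounded = round(result.pvalue, 6)
# ... whereas scientific notation keeps its order of magnitude visible.
formatted = f"{result.pvalue:.2e}"

print(rounded, formatted)
```

Reporting p-values in scientific notation, or ranking on the unrounded values, avoids ties at 0.0 between strongly significant features.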

In [18]:
#Calculate Pearson correlation
pcorr = []
for var in vars_list: 
    result = pearsonr(heart_data[var], heart_data.DEATH_EVENT)
    pcorr.append(abs(result[0]))
#Create summary table
summary_table_4 = pd.DataFrame(list(zip(vars_list, pcorr)),
               columns =['Feature', 'abs(PCC)']).sort_values(by=['abs(PCC)'],ascending=False).reset_index(drop = True)
summary_table_4['Rank'] = summary_table_4.index + 1
summary_table_4


| | Feature | abs(PCC) | Rank |
|---|---|---|---|
| 0 | serum_creatinine | 0.294278 | 1 |
| 1 | ejection_fraction | 0.268603 | 2 |
| 2 | age | 0.253729 | 3 |
| 3 | serum_sodium | 0.195204 | 4 |
| 4 | high_blood_pressure | 0.079351 | 5 |
| 5 | anaemia | 0.06627 | 6 |
| 6 | creatinine_phosphokinase | 0.062728 | 7 |
| 7 | platelets | 0.049139 | 8 |
| 8 | smoking | 0.012623 | 9 |
| 9 | sex | 0.004316 | 10 |


The table above, which reproduces the left half of Table 6 in the paper, confirms the insight from the Mann-Whitney U test: 'serum creatinine' and 'ejection fraction' have the highest correlation with whether a patient died or survived.
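
Because `DEATH_EVENT` is binary, the Pearson coefficient between a numeric feature and the outcome coincides with the point-biserial correlation. The sketch below illustrates this on made-up numbers (the arrays `outcome` and `values` are hypothetical and not taken from the dataset).

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Hypothetical binary outcome and numeric feature, NOT the real patient data.
outcome = np.array([0, 0, 0, 1, 1, 0, 1, 0])
values = np.array([1.0, 1.1, 0.9, 2.0, 1.8, 1.2, 1.5, 1.0])

# Pearson correlation, as computed in the cell above.
r_pearson = pearsonr(values, outcome)[0]

# Point-biserial correlation: binary variable first, numeric variable second.
r_point_biserial = pointbiserialr(outcome, values)[0]

print(abs(r_pearson), abs(r_point_biserial))
```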

In [19]:
shapiro_pvalue = stats.shapiro(heart_data['age'])

In [20]:
#Add DEATH_EVENT to the new var list
vars_list_full = vars_list.copy()
vars_list_full.append('DEATH_EVENT')

In [21]:
#Calculate Shapiro-Wilk test
shapiro_pvalue = []
for var in vars_list_full: 
    result = stats.shapiro(heart_data[var])
    shapiro_pvalue.append(result[1])
   
#Create summary table
summary_table_5 = pd.DataFrame(list(zip(vars_list_full, shapiro_pvalue)),
               columns =['Feature', 'p-value']).sort_values(by=['p-value'],ascending=True).reset_index(drop = True)
summary_table_5['Rank'] = summary_table_5.index + 1
summary_table_5

| | Feature | p-value | Rank |
|---|---|---|---|
| 0 | creatinine_phosphokinase | 7.050336e-28 | 1 |
| 1 | serum_creatinine | 5.392758e-27 | 2 |
| 2 | smoking | 4.581843e-26 | 3 |
| 3 | DEATH_EVENT | 4.581843e-26 | 4 |
| 4 | sex | 1.168500e-25 | 5 |
| 5 | high_blood_pressure | 1.168618e-25 | 6 |
| 6 | diabetes | 5.115524e-25 | 7 |
| 7 | anaemia | 6.209964e-25 | 8 |
| 8 | platelets | 2.883561e-12 | 9 |
| 9 | serum_sodium | 9.210248e-10 | 10 |


This table reproduces the right half of Table 6 in the paper. The p-values are all far below 0.05, which indicates that none of the variables shown is normally distributed.
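
One caveat when reading these p-values: a binary variable can never look normal, so the Shapiro-Wilk test rejects normality for it almost by construction. A minimal sketch, assuming only the outcome counts from the Table 2 reproduction above (203 survived, 96 dead) instead of the real file:

```python
import numpy as np
from scipy import stats

# Rebuild a 0/1 outcome vector from the counts reported above
# (203 survived, 96 dead); the test does not depend on the order of the values.
outcome = np.array([0] * 203 + [1] * 96)

statistic, p_value = stats.shapiro(outcome)
print(f"W = {statistic:.4f}, p-value = {p_value:.3e}")
```

For this reason, the test is mainly informative for the continuous features.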

In [24]:
#Calculate Chi Square test
chi2_pvalue = []
for var in vars_list: 
    observed = pd.crosstab(heart_data['DEATH_EVENT'], heart_data[var])
    result = chi2_contingency(observed)
    chi2_pvalue.append(result[1])
#Create summary table
summary_table_6 = pd.DataFrame(list(zip(vars_list, chi2_pvalue)),
               columns =['Feature', 'p-value']).sort_values(by=['p-value'],ascending=True).reset_index(drop = True)
summary_table_6['Rank'] = summary_table_6.index + 1
summary_table_6

| | Feature | p-value | Rank |
|---|---|---|---|
| 0 | ejection_fraction | 6.459328e-08 | 1 |
| 1 | serum_creatinine | 3.145236e-06 | 2 |
| 2 | serum_sodium | 0.009600557 | 3 |
| 3 | age | 0.01522741 | 4 |
| 4 | high_blood_pressure | 0.2141034 | 5 |
| 5 | anaemia | 0.3073161 | 6 |
| 6 | creatinine_phosphokinase | 0.4317506 | 7 |
| 7 | platelets | 0.5482704 | 8 |
| 8 | smoking | 0.9317653 | 9 |
| 9 | sex | 1.0 | 10 |


This table replicates Table 7 in the paper and once again confirms that 'ejection fraction' and 'serum creatinine' are the most important factors. Even though the p-values differ slightly from those in the source paper, the ranking is the same. At the 5% significance level (95% confidence), the null hypothesis is rejected for the top 4 features; therefore, the two samples (dead patients and survived patients) differ in terms of those 4 variables.
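
For a binary feature, the chi-square test reduces to a 2x2 contingency table. As a rough check, the sketch below rebuilds the anaemia table from the counts in the Table 2 reproduction above (120 and 83 survived patients, 50 and 46 dead patients, without and with anaemia) rather than from the CSV file. It assumes SciPy's default behaviour of applying Yates' continuity correction to 2x2 tables.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: DEATH_EVENT = 0 (survived), DEATH_EVENT = 1 (dead).
# Columns: anaemia = 0, anaemia = 1. Counts copied from the Table 2 reproduction.
observed = np.array([
    [120, 83],
    [50, 46],
])

# For a 2x2 table, chi2_contingency applies Yates' correction by default.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.4f}, p-value = {p_value:.4f}, dof = {dof}")
```

The resulting p-value should be close to the one listed for anaemia in the table above, and it is well above 0.05, so the null hypothesis of independence is not rejected for this feature.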