<h1>Congestive Heart Failure Analysis</h1>

<img src='/Users/akash-5162/Desktop/chf.jpg'>

<p1>Congestive heart failure is a leading cause of death worldwide.</p1>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay, f1_score, recall_score, precision_score, accuracy_score

In [None]:
heart_failure=pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv',header=0)

In [None]:
heart_failure.head()

<p1>
<ol>
    <li>sex: 1 -> male, 0 -> female</li>
</ol>
</p1>

In [None]:
heart_failure.shape

There are 299 samples in this data set, each with 13 variables. We will build a model based on this data. To evaluate the model on data it has not seen during training, we will split the data set into training and testing sets.

<h2>Let's understand the data better!</h2>

In [None]:
heart_failure.describe()

<h2>Let's make sure that the column names are all uniform</h2>

In [None]:
heart_failure.columns=['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'death_event']

heart_failure.head()

As we can see, several columns such as diabetes and anaemia are stored as numeric types rather than Boolean. Let's fix this.

In [None]:
to_update={'age':float, 'anaemia':bool, 'creatinine_phosphokinase':int, 'diabetes':bool,
       'ejection_fraction':int, 'high_blood_pressure':bool, 'platelets':float,
       'serum_creatinine':float, 'serum_sodium':float, 'sex':bool, 'smoking':bool, 'time':int,
       'death_event':bool}

heart_failure=heart_failure.astype(to_update)
heart_failure.info()

As there are no missing values or junk values, we don't have to alter the data set much. We can proceed with building the classification model to predict the death event for patients with congestive heart failure.
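The claim that nothing is missing can be checked directly rather than read off `info()`. The following is a minimal sketch; `missing_report` and the toy frame `demo` are hypothetical names introduced for illustration, and `demo` only stands in for `heart_failure`.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the number of missing values in each column."""
    return df.isna().sum()

# Hypothetical toy frame, not the real data set.
demo = pd.DataFrame({
    'age': [60.0, 55.0, np.nan],
    'serum_sodium': [137, 130, 140],
})

report = missing_report(demo)
print(report)
```

In the notebook, `missing_report(heart_failure)` should show a zero for every column if the statement above holds.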

Let's plot the various attributes present in the data set to see which of them could be used in our prediction.

<h2>Let's plot Age and Death event</h2>

In [None]:
plt.figure(figsize=(10,10))  # set the figure size before plotting
plt.scatter(x=heart_failure['age'],y=heart_failure['death_event'],color='red')
plt.xlabel('age')
plt.ylabel('death_event')

As can be seen, this plot gives no concrete evidence that age is a strong factor in the death event.
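A scatter plot of a Boolean outcome against age hides how many points overlap, so a death rate per age band may be easier to read. This is a sketch under stated assumptions: `death_rate_by_age_band`, the band edges, and the toy frame `demo` are all hypothetical and chosen only for illustration.

```python
import pandas as pd

def death_rate_by_age_band(df, edges):
    """Share of patients with death_event == True in each age band."""
    bands = pd.cut(df['age'], bins=edges)
    return df.groupby(bands, observed=False)['death_event'].mean()

# Hypothetical toy frame, not the real data set.
demo = pd.DataFrame({
    'age': [45, 50, 62, 68, 75, 80],
    'death_event': [False, False, True, False, True, True],
})

rates = death_rate_by_age_band(demo, edges=[40, 60, 70, 90])
print(rates)
```

Applied to `heart_failure` with sensible band edges, roughly flat rates across bands would support the reading above, while a clear rise with age would argue against it.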

<h3>Distribution of Age</h3>

In [None]:
heart_failure['age'].plot(kind='hist', bins=15,figsize=(10,10))

<h3>Distribution of Ejection Fraction</h3>

In [None]:
heart_failure['ejection_fraction'].plot(kind='hist',bins=15,figsize=(10,10))

<h3>Distribution of Creatinine Phosphokinase</h3>

In [None]:
heart_failure['creatinine_phosphokinase'].plot(kind='hist',bins=15,figsize=(10,10))

<h3>Distribution of Platelets</h3>

In [None]:
heart_failure['platelets'].plot(kind='hist',bins=15,figsize=(10,10))

<h3>Distribution of Serum Creatinine</h3>

In [None]:
heart_failure['serum_creatinine'].plot(kind='hist',bins=15,figsize=(10,10))

<h3>Distribution of Serum Sodium</h3>

In [None]:
heart_failure['serum_sodium'].plot(kind='hist',bins=15,figsize=(10,10))

<h3>Distribution of other boolean variables</h3>

In [None]:
heart_failure['anaemia'].astype(int).plot(kind='hist')

In [None]:
heart_failure['high_blood_pressure'].astype(int).plot(kind='hist',bins=10)

In [None]:
heart_failure['diabetes'].astype(int).plot(kind='hist',bins=10)

In [None]:
heart_failure['smoking'].astype(int).plot(kind='hist',bins=10)

In [None]:
heart_failure['death_event'].astype(int).plot(kind='hist',bins=10)

<h3>Let's focus on cases where the death event happened</h3>

In [None]:
deaths=heart_failure[heart_failure['death_event']==True]
deaths.shape

In [None]:
deaths.describe()

Let's see how many of the 96 people who died were anaemic, had diabetes, or had high blood pressure.

In [None]:
a=deaths['anaemia'].sum()
b=deaths['diabetes'].sum()
c=deaths['high_blood_pressure'].sum()

to_print="Anaemic: {}, had diabetes: {}, had high blood pressure:{}".format(a,b,c)
print(to_print)

How many of the people who died had all three?

In [None]:
a=np.logical_and(deaths['anaemia']==True,deaths['diabetes']==True)
deaths[np.logical_and(a,deaths['high_blood_pressure']==True)]['age'].count()

Only 6 out of the 96 people who died due to congestive heart failure suffered from all three conditions. This suggests that the conditions do not need to occur together to be associated with death; each of them may be dangerous on its own.
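Counts among the deaths alone cannot show whether a condition is more common in patients who died than in those who survived. One way to check is to compare the share of each condition across the two groups. The sketch below assumes a hypothetical helper `condition_rates` and a toy frame `demo` that is not the real data.

```python
import pandas as pd

def condition_rates(df, conditions):
    """Share of patients with each condition, split by death_event."""
    return df.groupby('death_event')[conditions].mean()

# Hypothetical toy frame, not the real data set.
demo = pd.DataFrame({
    'anaemia': [True, False, True, False],
    'diabetes': [False, False, True, True],
    'death_event': [True, True, False, False],
})

# Rows are ordered death_event == False, then death_event == True.
rates = condition_rates(demo, ['anaemia', 'diabetes'])
print(rates)
```

In the notebook, `condition_rates(heart_failure, ['anaemia', 'diabetes', 'high_blood_pressure'])` would put the survivors and the deaths side by side.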

Let's see the distribution of the variables for the death cases.

In [None]:
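# Numeric columns to inspect for the patients who died
numeric_cols=['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium']

# One histogram per column, each with its own bins
deaths[numeric_cols].hist(figsize=(10,10))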



Distribution of Age

In [None]:
deaths['age'].plot(kind='hist',figsize=(10,10))

Distribution of Creatinine Phosphokinase

In [None]:
deaths['creatinine_phosphokinase'].plot(kind='hist',figsize=(10,10))

<p1><b>As we can see, there is no unusual trend in the distribution of creatinine phosphokinase. Most patients have a value between 0 and 1000, much like in the data set as a whole. Hence this might not be an important factor contributing to congestive heart failure.</b></p1>

Distribution of Ejection Fraction

In [None]:
deaths['ejection_fraction'].plot(kind='hist',figsize=(10,10))

<p1><b>Ejection fraction, however, offers an interesting insight. Among the patients who died, the values are mostly centred around 20 to 30, whereas the average for the whole data set lies between 30 and 40. Hence this could be used as a parameter.</b></p1>

Distribution of Platelets

In [None]:
deaths['platelets'].plot(kind='hist',figsize=(10,10))

<p1><b>Platelet counts of the patients who died are slightly on the lower side.</b></p1>

Distribution of Serum Creatinine

In [None]:
deaths['serum_creatinine'].plot(kind='hist',figsize=(10,10))

Distribution of Serum Sodium

In [None]:
deaths['serum_sodium'].plot(kind='hist',figsize=(10,10))

In [None]:
df = heart_failure.drop(['time'], axis=1)
f = plt.figure(figsize=(16, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X=np.asarray(heart_failure[['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']])

# Select the target as a 1-D array; a 2-D column vector triggers a DataConversionWarning in fit()
Y=np.asarray(heart_failure['death_event'])

In [None]:
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size=0.3, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR

In [None]:
yhat = LR.predict(X_test)
#yhat

In [None]:
yhat_prob = LR.predict_proba(X_test)
#yhat_prob

<h2>Jaccard index</h2>
Let's try the Jaccard index for evaluation. It is defined as the size of the intersection divided by the size of the union of two label sets, here the samples predicted as death events and the samples that actually were death events. If the two sets match exactly, the score is 1.0; if they have nothing in common, it is 0.0.
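For a binary problem the definition reduces to TP / (TP + FP + FN) for the positive class. The sketch below computes this by hand on made-up labels and compares it with scikit-learn's `jaccard_score`; `jaccard_binary` and the two demo lists are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.metrics import jaccard_score

def jaccard_binary(y_true, y_pred):
    """Jaccard index of the positive class: |intersection| / |union|."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    # Return 0.0 when neither array contains a positive label.
    return intersection / union if union else 0.0

# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true_demo = [1, 0, 1, 1, 0]
y_pred_demo = [1, 1, 0, 1, 0]

manual = jaccard_binary(y_true_demo, y_pred_demo)
library = jaccard_score(y_true_demo, y_pred_demo)
print(manual, library)
```

Both values come out as 2 / 4 = 0.5 for these labels.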

In [None]:
# jaccard_similarity_score was removed from scikit-learn; jaccard_score replaces it
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat)

In [None]:
print('Classification f1-score', f1_score(y_test, yhat))
print('Classification precision', precision_score(y_test, yhat))
print('Classification recall', recall_score(y_test, yhat))

<b>As we saw, age can't be used as a deciding factor. Hence let's drop 'age' from our model and see its impact on model performance.</b>

In [None]:
X1=np.asarray(heart_failure[['anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']])

# Select the target as a 1-D array; a 2-D column vector triggers a DataConversionWarning in fit()
Y1=np.asarray(heart_failure['death_event'])

In [None]:
X_train1, X_test1, y_train1, y_test1 = train_test_split( X1, Y1, test_size=0.3, random_state=4)
print ('Train set:', X_train1.shape,  y_train1.shape)
print ('Test set:', X_test1.shape,  y_test1.shape)

In [None]:
LR1 = LogisticRegression(C=0.01, solver='liblinear').fit(X_train1,y_train1)
LR1

In [None]:
yhat1 = LR1.predict(X_test1)
yhat1

In [None]:
yhat_prob1 = LR1.predict_proba(X_test1)
#yhat_prob1

In [None]:
from sklearn.metrics import jaccard_score
jaccard_score(y_test1, yhat1)

In [None]:
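# plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay instead
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(LR1, X_test1, y_test1)
plt.show()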

Removing age has certainly improved the performance of the model.

<h2>Let's check other performance metrics</h2>

In [None]:
print('Classification f1-score', f1_score(y_test1, yhat1))
print('Classification precision', precision_score(y_test1, yhat1))
print('Classification recall', recall_score(y_test1, yhat1))
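The comparison between the two models rests on a single 70/30 split of 299 rows, so a small difference in scores may be down to chance. Cross-validation is one way to get a steadier estimate. It is not used in the analysis above and is shown here only as a sketch. `mean_cv_f1` is a hypothetical helper, and the data comes from `make_classification` as a synthetic stand-in for the heart failure records.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def mean_cv_f1(X, y, folds=5):
    """Mean F1 of the notebook's logistic regression over stratified folds."""
    model = LogisticRegression(C=0.01, solver='liblinear')
    return cross_val_score(model, X, y, cv=folds, scoring='f1').mean()

# Synthetic stand-in data (an assumption, not the heart failure records).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=4)

score_all = mean_cv_f1(X_demo, y_demo)
score_dropped = mean_cv_f1(X_demo[:, 1:], y_demo)  # same data without the first column
print(score_all, score_dropped)
```

In the notebook, comparing `mean_cv_f1(X, Y.ravel())` with `mean_cv_f1(X1, Y1.ravel())` would show whether dropping age still helps when averaged over several splits.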