Cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality from heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management, and this is where a machine learning model can be of great help.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

In [None]:
heart=pd.read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
heart.head()

In [None]:
heart.isnull().sum()

In [None]:
heart["DEATH_EVENT"].unique()

In [None]:
heart.describe()

In [None]:
heart.head(5)

Here, we have data for patients between 40 and 95 years old.
<br><br>The normal value of creatinine phosphokinase (CPK) is between 10 mcg/L and 120 mcg/L
<br>The normal ejection fraction (E.F.) range is between 53% and 73%
<br>As we age, our hearts also age and the walls thicken
<br>E.F. below 53% for women and 52% for men is low; E.F. below 45% is a potential factor for heart issues
<br>High blood pressure (HBP) increases the chance of heart failure
<br>The normal platelet range is 150000 to 350000
<br>Normal serum creatinine is 0.84 to 1.21 or 0.6 to 1.21 mg/dL
<br>Creatinine above 5 mg/dL may indicate serious kidney impairment
<br>Diabetes may cause high creatinine
<br>Normal serum sodium is 135 to 145 mEq/L


### NORMAL PERSON ANALYSIS

In [None]:
heart[(heart["creatinine_phosphokinase"]>9)&(heart["creatinine_phosphokinase"]<120)&(heart["ejection_fraction"]>50)&
     (heart["ejection_fraction"]<74)&(heart["platelets"]>149999)&
     (heart["platelets"]<351000)]
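The reference ranges from my notes can also be kept in one place and checked column by column, rather than spelled out in one long boolean expression. This is only a minimal sketch: `NORMAL_RANGES`, `in_normal_range` and the `toy` rows are names and values I made up for illustration, the bounds are treated as inclusive, and for serum creatinine I assumed the wider of the two ranges noted above.

```python
import pandas as pd

# Reference ranges copied from the notes above (treated as inclusive bounds).
# For serum creatinine the wider of the two noted ranges is assumed.
NORMAL_RANGES = {
    "creatinine_phosphokinase": (10, 120),
    "ejection_fraction": (53, 73),
    "platelets": (150000, 350000),
    "serum_creatinine": (0.6, 1.21),
    "serum_sodium": (135, 145),
}

def in_normal_range(df, ranges=NORMAL_RANGES):
    """Return a boolean DataFrame: True where a value lies inside its range."""
    flags = {col: df[col].between(low, high) for col, (low, high) in ranges.items()}
    return pd.DataFrame(flags, index=df.index)

# Toy rows with made-up values, only to show the call.
toy = pd.DataFrame({
    "creatinine_phosphokinase": [100, 582],
    "ejection_fraction": [60, 20],
    "platelets": [250000, 265000],
    "serum_creatinine": [1.0, 1.9],
    "serum_sodium": [140, 130],
})
flags = in_normal_range(toy)
all_normal = flags.all(axis=1)  # True only for rows where every value is in range
print(all_normal)
```

In the notebook, `heart[in_normal_range(heart).all(axis=1)]` would then select the patients whose values are all inside these ranges.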

In [None]:
heart["DEATH_EVENT"].value_counts()

In [None]:
sns.boxplot(heart["age"])

In [None]:
heart["ejection_fraction"].describe()

In [None]:
sns.boxplot(heart["ejection_fraction"])

In [None]:
%matplotlib inline
sns.boxplot(heart["platelets"])

In [None]:
heart=heart[(heart["platelets"]<400000)&(heart["platelets"]>80000)]

In [None]:
heart=heart[heart["ejection_fraction"]<70]
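The cut-offs in the two cells above were picked by eye from the boxplots. As a rough alternative, the 1.5 x IQR rule that a boxplot uses by default for its whiskers can be computed directly. This is a sketch only: `iqr_fences` is a hypothetical helper and the `values` series is made up, so it is not the rule I actually applied to this dataset.

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
low, high = iqr_fences(values)
outliers = values[(values < low) | (values > high)]
print(low, high, outliers.tolist())
```

Calling `iqr_fences(heart["platelets"])` would give fences that can be compared with the 80000 and 400000 limits used above.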

In [None]:
sns.boxplot(heart["serum_creatinine"])

In [None]:
heart["serum_creatinine"].describe()

In [None]:
heart[heart["serum_creatinine"]>5]

In [None]:
sns.boxplot(heart["serum_sodium"])

In [None]:
heart["sex"]=heart["sex"].apply(lambda x:"male" if x==1 else "female")

In [None]:
sns.countplot(data=heart,x="sex")

The dataset contains more males than females.

In [None]:
heart

`time` is the follow-up period in days.

### CREATININE PHOSPHOKINASE TO DEATH

In [None]:
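plt.figure(figsize=(30,18))
# Filter to the normal CPK range first so that x and hue come from the same rows
normal_cpk=heart[(heart["creatinine_phosphokinase"]>9)&(heart["creatinine_phosphokinase"]<121)]
sns.countplot(data=normal_cpk,x="creatinine_phosphokinase",hue="DEATH_EVENT")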

The highest death counts occurred at creatinine phosphokinase values of 23, 46, 60, 68, 69, 70, 76, 81, 94, 99, 104 and 110-113.

This range is still within the normal level, which is why we didn't see many deaths.

### DIABETES TO DEATH

In [None]:
sns.countplot(data=heart,x="diabetes",hue="DEATH_EVENT")

In [None]:
%matplotlib inline
sns.barplot(data=heart,x="diabetes",y="DEATH_EVENT")

We see that diabetes-free patients survived more often, but many patients with diabetes also survived.

In [None]:
heart[heart["diabetes"]==1]["DEATH_EVENT"].value_counts()

Most people with diabetes survived
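Raw counts like the ones above are easier to compare between groups as rates. Below is a small sketch of how that could be done with `groupby`. `death_rate_by` is a hypothetical helper and the `toy` frame holds made-up rows, not values from the dataset.

```python
import pandas as pd

def death_rate_by(df, column, target="DEATH_EVENT"):
    """Patients, deaths and death rate for each value of `column`."""
    grouped = df.groupby(column)[target]
    return pd.DataFrame({
        "patients": grouped.size(),
        "deaths": grouped.sum(),
        "death_rate": grouped.mean(),  # mean of a 0/1 column is the share of deaths
    })

# Made-up rows, only to show the output shape.
toy = pd.DataFrame({
    "diabetes": [0, 0, 0, 1, 1],
    "DEATH_EVENT": [0, 1, 0, 0, 1],
})
summary = death_rate_by(toy, "diabetes")
print(summary)
```

The same call, for example `death_rate_by(heart, "smoking")` or `death_rate_by(heart, "high_blood_pressure")`, should work for the other binary columns.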

### EJECTION FRACTION TO DEATH

In [None]:
sns.boxplot(heart["ejection_fraction"])

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(data=heart,x="ejection_fraction",hue="DEATH_EVENT")

We see that most people with a very low ejection fraction died, but once the ejection fraction reaches 30, most of them survived.
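To check an observation like this without reading bar heights off the plot, the column can be binned with `pd.cut` and the death rate taken per bin. This is a sketch under my own assumptions: the bin edges (30 and 45) come from the notes in this notebook, `death_rate_by_bin` is a hypothetical helper, and the `toy` rows are made up.

```python
import pandas as pd

def death_rate_by_bin(df, column, edges, target="DEATH_EVENT"):
    """Count patients and compute the death rate in left-closed bins of `column`."""
    bins = pd.cut(df[column], bins=edges, right=False)
    return df.groupby(bins, observed=False)[target].agg(["count", "mean"])

# Made-up rows, only to show the call.
toy = pd.DataFrame({
    "ejection_fraction": [15, 20, 25, 35, 40, 60],
    "DEATH_EVENT": [1, 1, 0, 0, 0, 0],
})
by_bin = death_rate_by_bin(toy, "ejection_fraction", edges=[0, 30, 45, 100])
print(by_bin)
```

With `right=False` a value of exactly 30 falls into the second bin, which matches the "reaches 30" wording above.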

### HIGH BLOOD PRESSURE TO DEATH

In [None]:
sns.countplot(data=heart,x="high_blood_pressure",hue="DEATH_EVENT")

Here we see that the majority of patients in our dataset did not have high blood pressure, and the majority of those who did have high blood pressure did not die from heart failure.

### PLATELETS TO DEATH

In [None]:
%matplotlib inline
plt.figure(figsize=(20,10))
sns.countplot(data=heart,x="platelets",hue="DEATH_EVENT")

In [None]:
heart[heart["platelets"]<150000]["DEATH_EVENT"].value_counts()

In [None]:
heart[heart["platelets"]>350000]["DEATH_EVENT"].value_counts()

In [None]:
heart[(heart["platelets"]>=150000)&(heart["platelets"]<=350000)]["DEATH_EVENT"].value_counts()

### SERUM CREATININE

In [None]:
heart["serum_creatinine"]

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(data=heart,x="serum_creatinine",hue="DEATH_EVENT")

We see that once the serum creatinine value reaches 1.8, more people died from heart failure. So the serum creatinine value appears to be directly related to death from heart failure.

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=heart,x="serum_creatinine",y="DEATH_EVENT")


In [None]:
heart[(heart["serum_creatinine"]>0.8)&(heart["serum_creatinine"]<=1.2)]["DEATH_EVENT"].value_counts()

Most people in the normal creatinine range survived, but most people with extreme creatinine values died.

### SERUM SODIUM

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=heart,x="serum_sodium",y="DEATH_EVENT")


In [None]:
plt.figure(figsize=(20,10))
sns.countplot(data=heart,x="serum_sodium",hue="DEATH_EVENT")


Serum sodium values below 133 recorded many deaths, and values above 145 recorded many deaths too.

In [None]:
heart[(heart["serum_sodium"]<133)|(heart["serum_sodium"]>145)]["DEATH_EVENT"].value_counts()

So, at both low and high extremes of serum sodium, more deaths may occur.

### SMOKING

In [None]:
heart[heart["smoking"]==1]["DEATH_EVENT"].value_counts()

### TIME

In [None]:
heart["time"]

In [None]:
plt.figure(figsize=(20,10))

sns.countplot(data=heart,x="time",hue="DEATH_EVENT")

In [None]:
heart[heart["time"]<73]["DEATH_EVENT"].value_counts()

We see that follow-up periods shorter than 73 days recorded more deaths by a large margin.

### DATA MODELLING

Here, I use ExtraTreesClassifier to find the most important features in my dataset.

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
model=ExtraTreesClassifier()

In [None]:
X=heart.drop(["DEATH_EVENT","sex"],axis="columns")
y=heart["DEATH_EVENT"]

In [None]:
model.fit(X,y)

In [None]:
print(model.feature_importances_) # feature_importances_ is a built-in attribute of fitted tree-based classifiers

In [None]:
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
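Because `ExtraTreesClassifier()` is created without a `random_state`, the importances and their ranking can shift from run to run. One way to steady them, sketched below, is to average the importances over several seeded fits. `mean_importances` is a hypothetical helper, the number of seeds is an arbitrary choice of mine, and the toy data is synthetic (one informative column, one pure-noise column).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

def mean_importances(X, y, seeds=range(10)):
    """Average feature_importances_ over several seeded ExtraTrees fits."""
    rows = [
        ExtraTreesClassifier(n_estimators=100, random_state=seed).fit(X, y).feature_importances_
        for seed in seeds
    ]
    return pd.Series(np.mean(rows, axis=0), index=X.columns).sort_values(ascending=False)

# Synthetic data: the label depends only on the "signal" column.
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
toy_X = pd.DataFrame({"signal": signal, "noise": rng.normal(size=200)})
toy_y = (signal > 0).astype(int)

ranked = mean_importances(toy_X, toy_y)
print(ranked)
```

In the notebook, `mean_importances(X, y)` could replace the single fit before plotting with `.nlargest(10).plot(kind='barh')`.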

In [None]:
X=heart[["time","ejection_fraction","serum_creatinine","age","platelets","creatinine_phosphokinase"]]
y=heart["DEATH_EVENT"]

In [None]:
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
Xtrain,Xtest,ytrain,ytest=train_test_split(X,y,random_state=61,test_size=0.2)

In [None]:
from sklearn.preprocessing import StandardScaler
scale=StandardScaler()
Xtrain=scale.fit_transform(Xtrain)
Xtest=scale.transform(Xtest)

In [None]:
ytrain.shape

In [None]:
models={
    "svc":{"model":SVC()},
    "dtc":{"model":DecisionTreeClassifier()},
    "rfc":{"model":RandomForestClassifier()},
    "lr":{"model":LogisticRegression()},
    "gbc":{"model":GradientBoostingClassifier()},
    "knc":{"model":KNeighborsClassifier()},
}

In [None]:
for name,modell in models.items():
    mod=modell["model"].fit(Xtrain,ytrain)
    ypred=mod.predict(Xtest)
    print(f"The accuracy of {name} is {accuracy_score(ytest,ypred)}")
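The accuracies printed above come from a single train/test split, so they depend on the `random_state` passed to `train_test_split`. A less split-dependent comparison, sketched here, is stratified k-fold cross-validation with the scaler inside a pipeline so that it is refit on each training fold. `cv_accuracy` is a hypothetical helper, the choice of 5 folds is my assumption, and the toy data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_accuracy(estimator, X, y, n_splits=5, seed=0):
    """Accuracy per fold, with scaling fitted inside each training fold."""
    pipeline = make_pipeline(StandardScaler(), estimator)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(pipeline, X, y, cv=folds, scoring="accuracy")

# Synthetic data: the label depends only on the first column.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 3))
toy_y = (toy_X[:, 0] > 0).astype(int)

scores = cv_accuracy(LogisticRegression(), toy_X, toy_y)
print(scores.mean(), scores.std())
```

To compare the classifiers in this notebook, the helper could be called on the unscaled `X` and `y` for each entry of `models`, for example `cv_accuracy(modell["model"], X, y)`.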

In [None]:
#for i in range(0,100):
 #   for name,modell in models.items():
  #      Xtrain,Xtest,ytrain,ytest=train_test_split(X,y,random_state=i,test_size=0.2)
   #     mod=modell["model"].fit(Xtrain,ytrain)
    #    ypred=mod.predict(Xtest)
     #   print(f"The accuracy of {name} for random {i} is {accuracy_score(ytest,ypred)}")

In [None]:
# Record the test accuracy of a random forest for each seed instead of discarding it
seed_scores={}
for i in range(1,100):
    rfc=RandomForestClassifier(random_state=i).fit(Xtrain,ytrain)
    ypred=rfc.predict(Xtest)
    seed_scores[i]=accuracy_score(ytest,ypred)

In [None]:
rfc=RandomForestClassifier(random_state=72)
rfc.fit(Xtrain,ytrain)
ypred=rfc.predict(Xtest)
print(f"The accuracy is {accuracy_score(ytest,ypred)}")

In [None]:
mat=confusion_matrix(ytest,ypred)
mat

In [None]:
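# confusion_matrix puts actual classes on rows (y axis) and predicted classes on columns (x axis)
sns.heatmap(mat,annot=True,fmt="d",xticklabels=["Survived","Died"],yticklabels=["Survived","Died"])
plt.xlabel("Predicted value")
plt.ylabel("Actual value")
plt.title("GETTING ACTUAL AND MISLABELLED DATA POINTS")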

In [None]:
pd.DataFrame({"Actual":ytest,"Predicted":ypred})

In [None]:
from sklearn.metrics import classification_report
print(classification_report(ytest,ypred))
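The precision and recall that `classification_report` prints for the "died" class can be traced back to the four cells of the confusion matrix. The sketch below shows that link. `died_class_metrics` is a hypothetical helper and the labels are made up, with 1 standing for a death event as in `DEATH_EVENT`.

```python
from sklearn.metrics import confusion_matrix

def died_class_metrics(y_true, y_pred):
    """Precision and recall for class 1, computed from the confusion matrix."""
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp (rows are actual classes).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tn": tn, "fp": fp, "fn": fn, "tp": tp,
            "precision": precision, "recall": recall}

# Made-up labels: 1 = died, 0 = survived.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
metrics = died_class_metrics(y_true, y_pred)
print(metrics)
```

For a mortality model, recall on the "died" class is usually the number to watch, since a false negative means a patient at risk who was missed.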

In [None]:
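from sklearn.metrics import precision_recall_curve
# Use predicted probabilities for class 1 (died); hard 0/1 labels give only a trivial curve
yscore=rfc.predict_proba(Xtest)[:,1]
precision,recall,thresholds=precision_recall_curve(ytest,yscore)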

In [None]:
precision

In [None]:
recall

In [None]:
thresholds

Here I create a pickle file, which I will reference when I deploy the model in an app.

In [None]:
with open("model3.pkl","wb") as f:
    pickle.dump(rfc,f)


In [None]:
with open("model3.pkl","rb") as f:
    model3=pickle.load(f)
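One thing to keep in mind for deployment: the random forest was trained on inputs transformed by `StandardScaler`, so an app that loads only the model would also need the fitted scaler. A possible way around this, sketched below, is to bundle both in one scikit-learn pipeline and pickle that single object. The names `bundle` and `restored` and the synthetic data are my own illustration, not what is stored in `model3.pkl`.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled data standing in for the selected features.
rng = np.random.default_rng(0)
toy_X = rng.normal(loc=50, scale=10, size=(80, 3))
toy_y = (toy_X[:, 0] > 50).astype(int)

# The pipeline scales and then classifies, so raw inputs can be passed at predict time.
bundle = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
bundle.fit(toy_X, toy_y)

# Round-trip through pickle in memory and confirm the predictions are unchanged.
payload = pickle.dumps(bundle)
restored = pickle.loads(payload)
same = np.array_equal(bundle.predict(toy_X), restored.predict(toy_X))
print(same)
```

The pipeline would be fitted on the unscaled training features, and the loading side should use the same scikit-learn version that created the pickle.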
