<center><img src='https://i.ytimg.com/vi/pB7SWDcgPic/maxresdefault.jpg'></center>

## Problem Statement

Several factors affect the death of a patient. This dataset contains patient information such as age, sex, blood pressure, smoking, diabetes, ejection fraction, creatinine phosphokinase, serum_creatinine, serum_sodium and time, and we have to predict the patient's DEATH EVENT.

1. age
2. anaemia - Decrease of red blood cells or hemoglobin (boolean)
3. creatinine_phosphokinase - Level of the CPK enzyme in the blood (mcg/L)
4. diabetes - If the patient has diabetes (boolean)
5. ejection_fraction - Percentage of blood leaving the heart at each contraction (percentage)
6. high_blood_pressure - If the patient has hypertension (boolean)
7. platelets - Platelets in the blood (kiloplatelets/mL)
8. serum_creatinine - Level of serum creatinine in the blood (mg/dL)
9. serum_sodium - Level of serum sodium in the blood (mEq/L)
10. sex - Woman or man (binary)
11. smoking - If the patient smokes or not (boolean)
12. time - Follow-up period (days)
13. DEATH_EVENT - If the patient died during the follow-up period (boolean)

## Import Libraries

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Some mandatory Libraries
import string 
import warnings
import numpy as np
import pandas as pd

# plotting
import seaborn as sns
import matplotlib.pyplot as plt

# features selection
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

# scaling
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# model building
from sklearn.svm import SVC
from sklearn.svm import NuSVC
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

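# accuracy and evaluation
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2;
# keep the old names as aliases of the replacement API so later cells still run
plot_roc_curve = RocCurveDisplay.from_estimator
plot_precision_recall_curve = PrecisionRecallDisplay.from_estimator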

# others
%matplotlib inline
warnings.filterwarnings("ignore")

## [1.1] Load Data

Loading the data into a pandas data frame is the first step. The values in the data set are comma-separated, so all we have to do is read the CSV file into a data frame with pandas.

In [None]:
df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

In [None]:
df.head()

Here __"DEATH_EVENT"__ is the target column.

## [2.1] Let's Explore the Data

Exploratory Data Analysis (EDA) means understanding a data set by summarizing its main characteristics, often by plotting them. This step is very important, especially before we model the data with machine learning.

1. Checking the types of data.
2. Dropping irrelevant columns.
3. Renaming the columns.
4. Dropping the duplicate rows.
5. Dropping the missing or null values.
6. Detecting Outliers
7. Plotting

In [None]:
print('Shape of our Data:',df.shape)

In [None]:
# check datatypes

print(df.dtypes)

In [None]:
# check min, max and other details

df.describe()

In [None]:
# check the missing or null values.

print(df.isnull().sum())

In [None]:
print('DEATH_EVENT:')
print(df['DEATH_EVENT'].value_counts())

In [None]:
print('Distribution of DEATH_EVENT:')
print(df['DEATH_EVENT'].value_counts()/len(df))

__Let's try to visualise the same using plots.__

In [None]:
ax = sns.countplot(x='DEATH_EVENT', data=df, facecolor=(0, 0, 0, 0), linewidth=5, edgecolor=sns.color_palette("dark", 3))

Here we can see that our data is imbalanced, so we need to perform some preprocessing on this dataset.
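One common way to respect an imbalanced target is to keep the class ratio identical in the train and test splits. The sketch below is a minimal illustration, assuming synthetic stand-in data with one third positives (an assumed ratio, not read from the dataset); `X_demo` and `y_demo` are hypothetical names, and only the `stratify` argument of `train_test_split` is the point.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption): 300 rows, 1/3 positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.array([1] * 100 + [0] * 200)

# stratify=y_demo keeps the positive rate the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=11
)

print("positive rate (train):", y_tr.mean())
print("positive rate (test): ", y_te.mean())
```

Without `stratify`, a small test split can by chance contain noticeably more or fewer positives than the training split, which makes accuracy numbers harder to compare.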

#### [2.1.1] Plot Correlation Matrix:

In [None]:
# define the correlation matrix
corr = df.corr()

# Generate a boolean mask for the upper triangle (True cells are hidden)
mask = np.triu(np.ones_like(corr, dtype=bool))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

with sns.axes_style("white"):
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(10, 8))
    ax = sns.heatmap(corr, cmap=cmap, mask=mask, vmax=.3, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

Here we can see that __'diabetes', 'sex' and 'smoking'__ have very little correlation with our target value.
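The same reading can be taken from the numbers directly by sorting the absolute correlation of each feature with the target. This is a minimal sketch on a synthetic frame (the `informative` and `noise` columns are hypothetical); applied to the notebook's `df`, the same chain of calls would rank the real features.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in frame (assumption): one informative and one noise feature.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    "informative": target + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
    "DEATH_EVENT": target,
})

# Absolute Pearson correlation of every feature with the target, strongest first.
target_corr = (
    demo.corr()["DEATH_EVENT"]
    .drop("DEATH_EVENT")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```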

#### [2.1.2] Best Features:

In [None]:
# apply the SelectKBest class to score the features with the chi-squared test
X = df.drop(['DEATH_EVENT'], axis=1)
y = df['DEATH_EVENT']
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

# concat the two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Feature', 'Score']
feature_imp = featureScores.nlargest(X.shape[1], 'Score')

In [None]:
# print the top 5 features

print(feature_imp.head())

In [None]:
# plot each feature with its score

ax = sns.barplot(x='Score', y='Feature', data=feature_imp)
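One caveat when reading chi-squared scores on raw columns: the statistic grows with the units of a feature, so a column measured in large numbers can rank high for that reason alone. The sketch below illustrates this on synthetic data (an assumption, not the notebook's dataset) by scoring the same signal in two different units.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)

# One non-negative signal, duplicated; the second copy is later expressed in larger units.
signal = y_demo + rng.random(200)
X_same_units = np.column_stack([signal, signal])
X_mixed_units = np.column_stack([signal, signal * 1000.0])

scores_same, _ = chi2(X_same_units, y_demo)
scores_mixed, _ = chi2(X_mixed_units, y_demo)

print("same units:     ", scores_same)
print("different units:", scores_mixed)
```

Both columns carry identical information, yet the rescaled copy receives a score 1000 times larger. Bringing features to a common range first (for example with `MinMaxScaler`) is one possible way to make the scores comparable.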

#### [2.1.3] Pairplot:

In [None]:
sns.pairplot(df, hue="DEATH_EVENT", palette="husl",diag_kind="kde")
plt.show()

#### [2.1.4] Univariate Analysis:

In [None]:
for column in df.columns[:12]:
    sns.barplot(x='DEATH_EVENT',y=column, data=df, palette='Blues_d')
    plt.title('Death Event Vs. {}'.format(string.capwords(column.replace("_", " "))))
    plt.show()

Here we can see that some features have quite a strong impact on our target, such as **'serum_creatinine'** and **'time'**. Let's analyse these two features a little more.

In [None]:
# define two new dataframes for survived and non-survived patients

survived = df[df['DEATH_EVENT'] == 0]
not_survived = df[df['DEATH_EVENT'] == 1]

__Analyse Time based on PDF & CDF:__

In [None]:
counts, bin_edges = np.histogram(survived['time'], bins=10, density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)

counts, bin_edges = np.histogram(not_survived['time'], bins=10, density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)

plt.xlabel('Time')
plt.title('PDF & CDF (Time)')
plt.legend(['PDF of Survived','CDF of Survived','PDF of Non-Survived','CDF of Non-Survived'])
plt.show()

**Observations on Time:**

Time can be a key feature for analysing our target. If the follow-up period is more than **100** days, the patient most likely survived (the survivors' PDF peaks at about 20% in this range). On the other hand, if the follow-up period is less than **50** days, there is roughly a 30% chance that the patient died.
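The PDF and CDF curves used in this section can be wrapped in a small helper. This is a minimal sketch: `empirical_pdf_cdf` and the `demo_time` values are hypothetical, introduced only for illustration. With equal-width bins, normalising plain counts gives the same result as normalising the `density=True` output.

```python
import numpy as np

def empirical_pdf_cdf(values, bins=10):
    """Return (right bin edges, per-bin probability mass, cumulative mass)."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of samples falling in each bin
    cdf = np.cumsum(pdf)          # fraction of samples at or below each right edge
    return bin_edges[1:], pdf, cdf

# Hypothetical follow-up times in days, for illustration only.
demo_time = np.array([4, 10, 30, 45, 60, 90, 120, 150, 200, 250])
edges, pdf, cdf = empirical_pdf_cdf(demo_time, bins=5)
print(edges)
print(pdf)
print(cdf)
```

Reading a value off the CDF at a given edge answers questions of the form "what fraction of this group had a follow-up of at most N days".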

__Analyse Serum Creatinine based on PDF & CDF:__

In [None]:
counts, bin_edges = np.histogram(survived['serum_creatinine'], bins=10, density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)

counts, bin_edges = np.histogram(not_survived['serum_creatinine'], bins=10, density = True)
pdf = counts/(sum(counts))
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)

plt.xlabel('Serum Creatinine')
plt.title('PDF & CDF (Serum Creatinine)')
plt.legend(['PDF of Survived','CDF of Survived','PDF of Non-Survived','CDF of Non-Survived'])
plt.show()

**Observations on Serum Creatinine:**

It is also a key feature to analyse. If a patient's serum creatinine is more than 6, there is a higher chance that the patient died.

__Analyse the outliers:__

In [None]:
for column in df.columns[:12]:
    sns.boxplot(x='DEATH_EVENT',y=column, data=df, palette='Set3')
    plt.title('Death Event Vs. {}'.format(string.capwords(column.replace("_", " "))))
    plt.show()

Here we can see some extreme outliers, for example in **'creatinine_phosphokinase'** and **'serum_sodium'**.
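The points drawn beyond the whiskers in the box plots follow the 1.5 × IQR rule, and the same rule can be used to count or list outliers per column. A minimal sketch, assuming a hypothetical helper `iqr_outlier_mask` and made-up enzyme values:

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical enzyme levels with one extreme value, for illustration only.
demo_cpk = pd.Series([60, 80, 110, 120, 150, 200, 250, 300, 7800])
mask = iqr_outlier_mask(demo_cpk)
print(demo_cpk[mask])
```

Whether flagged rows should be dropped, clipped, or kept is a modelling decision; in clinical data extreme values are often real measurements rather than errors.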

### Observation on EDA

From the above analysis we can't conclude anything definite, as there is major overlap in the data. But we can point out some details:

* Most of the patients are in the 40-80 age group.
* Most of the non-surviving patients are in the 45-65 age group.
* An ejection fraction below 40 looks like a good sign here: more than 50% of the patients with an ejection fraction below 40 survived.
* Gender also overlaps a lot, but we can say that about 60% of the survivors are male and 40% are female.
* From **Time in Days** we can say that more non-surviving patients are found as the observation period increases.

In a nutshell, univariate analysis is not very conclusive because of the large overlap in the data. Let's try other techniques for a more accurate analysis.

## [3.1] Data Preprocessing

__Remove Outliers__

__Apply MinMaxScaler__

In [None]:
# define the features that need to be scaled:
# the numeric features listed below, excluding the categorical (boolean) ones
cols = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_sodium', 'time']

# define the scaler object
scaler = MinMaxScaler()

# perform Min-Max scaling, storing each result in a new 'nrm_' column
for col in cols:
    scaler.fit(df[col].values.reshape(-1, 1))
    df['nrm_' + col] = scaler.transform(df[col].values.reshape(-1, 1))

# drop the original, unscaled columns
df.drop(cols, axis=1, inplace=True)

## [4.1] Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['DEATH_EVENT'], axis=1), df['DEATH_EVENT'], test_size=0.3, random_state=11)

## [5.1] Modeling

In [None]:
# Define classifiers with default parameters.

classifiers = {
    'SVC': SVC(),
    'LinearSVC': LinearSVC(),
    'NuSVC': NuSVC(),
    'DecisionTree':DecisionTreeClassifier()
}

In [None]:
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    # mean 5-fold cross-validated accuracy on the training split
    training_score = cross_val_score(classifier, X_train, y_train, cv=5)
    print('Classifier:', name, 'has a training score of', round(training_score.mean() * 100, 2))
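When scaling is part of the workflow, one option is to cross-validate a pipeline so that the scaler is refit inside every fold and the held-out fold never influences the scaling parameters. A minimal sketch on synthetic data (an assumption; `X_demo` and `y_demo` are hypothetical), using `make_pipeline`:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): two classes separated along the first feature.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 100 + [1] * 100)
X_demo = rng.normal(size=(200, 3))
X_demo[:, 0] += 2.0 * y_demo

# The scaler is fit on the training part of each fold only.
model = make_pipeline(MinMaxScaler(), SVC())
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print("mean CV accuracy: %.2f" % scores.mean())
```

The same pipeline object can be passed to `GridSearchCV`, in which case parameter names take a step prefix such as `svc__C`.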

## [6.1] Hyperparameter Tuning

In [None]:
# SVC

params = {
    'C':[10**-3, 10**-2, 10**-1, 1, 10, 10**2, 10**3], 
    'kernel':['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

gs = GridSearchCV(SVC(), params, cv = 5, n_jobs=-1, scoring='accuracy')
gs_results = gs.fit(X_train, y_train)

SVC_best_estimator = gs.best_estimator_ # store best estimators for future analysis

print('Best Accuracy: ', gs_results.best_score_)
print('Best Parameters: ', gs_results.best_params_)

In [None]:
# LinearSVC

params = {
    'C':[10**-3, 10**-2, 10**-1, 1, 10, 10**2, 10**3], 
    'penalty':['l1', 'l2'],
    'loss': ['hinge', 'squared_hinge']
}

gs = GridSearchCV(LinearSVC(), params, cv = 5, n_jobs=-1, scoring='accuracy')
gs_results = gs.fit(X_train, y_train)

LinearSVC_best_estimator = gs.best_estimator_ # store best estimators for future analysis

print('Best Accuracy: ', gs_results.best_score_)
print('Best Parameters: ', gs_results.best_params_)
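Not every `penalty` / `loss` pair is supported by `LinearSVC`: `penalty='l1'` with `loss='hinge'` is rejected, and `penalty='l1'` with `loss='squared_hinge'` needs `dual=False`. Unsupported pairs in a grid fail to fit and score as NaN. One way to search only valid combinations is a list of grids, sketched here on synthetic data (an assumption; the names `param_grid` and `gs_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): two classes separated along the first feature.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 60 + [1] * 60)
X_demo = rng.normal(size=(120, 3))
X_demo[:, 0] += 2.0 * y_demo

C_values = [10**-2, 10**-1, 1, 10]

# Each dict is searched separately, so only supported combinations are tried.
param_grid = [
    {"penalty": ["l2"], "loss": ["hinge"], "dual": [True], "C": C_values},
    {"penalty": ["l2"], "loss": ["squared_hinge"], "dual": [True, False], "C": C_values},
    {"penalty": ["l1"], "loss": ["squared_hinge"], "dual": [False], "C": C_values},
]

gs_demo = GridSearchCV(LinearSVC(max_iter=10000), param_grid, cv=5, scoring="accuracy")
gs_demo.fit(X_demo, y_demo)
print(gs_demo.best_params_)
```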

In [None]:
# DecisionTree

params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2,4,6,8,10,12]
}

gs = GridSearchCV(DecisionTreeClassifier(), params, cv = 5, n_jobs=-1, scoring='accuracy')
gs_results = gs.fit(X_train, y_train)

DecisionTree_best_estimator = gs.best_estimator_ # store best estimators for future analysis

print('Best Accuracy: ', gs_results.best_score_)
print('Best Parameters: ', gs_results.best_params_)

In [None]:
# plot top 5 best features

pd.Series(DecisionTree_best_estimator.feature_importances_, index=X_train.columns).nlargest(5).plot(kind='barh')

__Print classification report (on the training set):__

In [None]:
train_pred = SVC_best_estimator.predict(X_train)
print(classification_report(y_train,train_pred))

In [None]:
train_pred = LinearSVC_best_estimator.predict(X_train)
print(classification_report(y_train,train_pred))

In [None]:
train_pred = DecisionTree_best_estimator.predict(X_train)
print(classification_report(y_train,train_pred))

## [7.1] Accuracy on Test:

In [None]:
print('Final Test Accuracy for')
print('     SVC:',SVC_best_estimator.score(X_test,y_test))
print('     Linear SVC:',LinearSVC_best_estimator.score(X_test,y_test))
print('     Decision Tree:',DecisionTree_best_estimator.score(X_test,y_test))

__Plot ROC Curve with DecisionTree:__

In [None]:
plot_roc_curve(DecisionTree_best_estimator, X_test, y_test)
plt.show()

In [None]:
plot_precision_recall_curve(DecisionTree_best_estimator, X_test, y_test)
plt.show()
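The ROC curve is usually summarised by the area under it (AUC), which equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A small worked example with made-up labels and scores (not results from this notebook):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and classifier scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve: false positive rate vs. true positive rate.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUC = 0.75.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```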

__Confusion Matrix with Decision Tree__

In [None]:
pred = DecisionTree_best_estimator.predict(X_test)
sns.heatmap(confusion_matrix(y_test,pred),annot=True)
plt.ylabel("Actual")
plt.xlabel("Prediction")
plt.show()
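The numbers in a confusion matrix translate directly into precision, recall and accuracy. The sketch below uses made-up labels and predictions (not the notebook's results) to show the arithmetic:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                  # 2 / 3: predicted deaths that were real
recall = tp / (tp + fn)                     # 2 / 4: real deaths that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 7 / 10
print(precision, recall, accuracy)
```

With an imbalanced target, recall on the positive class is often more informative than accuracy, because a model that predicts the majority class everywhere can still reach a high accuracy.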