# Classification
In this project, you will use a [dataset from Kaggle](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data) to predict the survival of patients with heart failure from serum creatinine and ejection fraction, and other factors such as age, anemia, diabetes, and so on.

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful alcohol use through population-wide strategies.

People with cardiovascular disease, or who are at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidemia, or already established disease), need early detection and management, and this is where a machine learning model can be of great help.

In [None]:
import pandas as pd
import os

from collections import Counter

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer
from tensorflow.keras.utils import to_categorical

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report

import numpy as np

from matplotlib import pyplot as plt


## Loading the data
Using `pandas.read_csv()`, load the data from the CSV file (**heart_failure_clinical_records_dataset.csv** in the Kaggle input directory) into a pandas DataFrame object. Assign the resulting DataFrame to a variable called `data`.

In [None]:
data = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

Use the `DataFrame.info()` method to print all the columns of the DataFrame `data` together with their types.

In [None]:
data.info()

Print the distribution of the `DEATH_EVENT` column of the DataFrame `data` using `collections.Counter`. This is the column you will need to predict.

In [None]:
print(Counter(data['DEATH_EVENT']))
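
To judge how imbalanced the label is, it can help to turn the raw counts into proportions. The sketch below uses a small made-up list of labels in place of `data['DEATH_EVENT']`; the values and the name `class_share` are illustrative assumptions, not results from the dataset.

```python
from collections import Counter

# Hypothetical labels standing in for data['DEATH_EVENT'] (0 = survived, 1 = died).
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class; with a strongly uneven split, accuracy alone can mislead.
class_share = {label: count / total for label, count in counts.items()}

print(counts)
print(class_share)
```

If one class dominates, a model that always predicts it already scores a high accuracy, which is why the per-class metrics at the end of this notebook are worth checking.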

Extract the label column `DEATH_EVENT` from the data DataFrame and assign the result to a variable called `y`.

In [None]:
y = data['DEATH_EVENT']

Extract the feature columns `['age','anaemia','creatinine_phosphokinase','diabetes','ejection_fraction','high_blood_pressure','platelets','serum_creatinine','serum_sodium','sex','smoking','time']` from the DataFrame `data` and assign the result to a variable called `x`. These are all columns except the last one, `DEATH_EVENT`.

In [None]:
x = data.iloc[:, :-1]

## Data preprocessing
Use the `pandas.get_dummies()` function to convert the categorical features in the DataFrame `x` to one-hot encoded vectors and assign the result back to the variable `x`.

In [None]:
x = pd.get_dummies(x)
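
`pd.get_dummies()` only expands columns with an object, string, or categorical dtype. If every feature is already numeric (worth checking in the `data.info()` output), the call returns the frame unchanged. The sketch below illustrates this on a toy frame; the `group` column and all values are hypothetical.

```python
import pandas as pd

# Toy frame: two numeric columns and one hypothetical text column ("group").
toy = pd.DataFrame({
    "age": [60, 75, 50],
    "smoking": [0, 1, 0],
    "group": ["a", "b", "a"],
})

numeric_only = pd.get_dummies(toy[["age", "smoking"]])
with_text = pd.get_dummies(toy)

print(list(numeric_only.columns))  # numeric columns pass through unchanged
print(list(with_text.columns))     # the text column is expanded into indicator columns
```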

Use the `sklearn.model_selection.train_test_split()` function to split the data into training features, test features, training labels, and test labels, respectively. Set the `test_size` parameter to the fraction of the data you want in the test set (the code below uses `0.2`), and use any value for the `random_state` parameter. Store the results in the variables `x_train`, `x_test`, `y_train`, `y_test`, making sure you use this order.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state= 1)
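
With an imbalanced label, a plain random split can leave the test set with a different class ratio than the training set. One optional variation, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below runs on toy arrays (`X_demo`, `y_demo` are made-up names and data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, 15 negatives and 5 positives.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

# stratify=y_demo keeps the 3:1 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(int(y_tr.sum()), int(y_te.sum()))
```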

Initialize a `ColumnTransformer` object by using `StandardScaler` to scale the numeric features in the dataset: `['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium','time']`. Assign the resulting object to a variable called `ct`.

In [None]:
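# Scale only the continuous features; remainder='passthrough' keeps the binary columns unscaled.
numerical_columns = ['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium','time']
ct = ColumnTransformer([("only numeric", StandardScaler(), numerical_columns)], remainder='passthrough')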

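Use the `ColumnTransformer.fit_transform()` function to fit the scaler instance `ct` on the training data `x_train` and assign the result back to `x_train`. Then use `ColumnTransformer.transform()` to scale `x_test` with the statistics learned from the training data; refitting on the test data would scale the two sets inconsistently.

In [None]:
x_train = ct.fit_transform(x_train)
x_test = ct.transform(x_test)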
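
A common convention is to learn the scaling statistics from the training split only and reuse them for the test split. The sketch below shows that pattern on a toy frame; `ct_demo` and the data are hypothetical, and `remainder="passthrough"` is an assumption made here so that the unscaled column is kept instead of dropped.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"age": [40.0, 60.0, 80.0], "smoking": [0, 1, 0]})
test = pd.DataFrame({"age": [60.0, 100.0], "smoking": [1, 1]})

# Scale only "age"; remainder="passthrough" keeps "smoking" instead of dropping it.
ct_demo = ColumnTransformer(
    [("numeric", StandardScaler(), ["age"])],
    remainder="passthrough",
)

train_scaled = ct_demo.fit_transform(train)  # learns mean and std from train only
test_scaled = ct_demo.transform(test)        # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

The test value `60.0` maps to zero because it equals the training mean, which would not hold if the scaler were refitted on the test split.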

## Prepare labels for classification
Initialize an instance of `LabelEncoder` and assign it to a variable called `le`.

In [None]:
le = LabelEncoder()

Using the `LabelEncoder.fit_transform()` function, fit the encoder instance `le` to the training labels `y_train`, while at the same time converting the training labels according to the trained encoder.

Using the `LabelEncoder.transform()` function, encode the test labels `y_test` using the trained encoder `le`.

In [None]:
y_train = le.fit_transform(y_train)

In [None]:
y_test = le.transform(y_test)

Using the `tensorflow.keras.utils.to_categorical()` function, transform the encoded training labels `y_train` into one-hot (binary) vectors and assign the result back to `y_train`. Do the same for `y_test`.

In [None]:
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
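
To make the one-hot format concrete, the sketch below builds the same kind of matrix with plain NumPy. It is a stand-in for illustration, not the Keras implementation, and the label values are made up.

```python
import numpy as np

encoded = np.array([0, 1, 1, 0])  # integer class labels
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[encoded]

print(one_hot)
```

Each row has a single 1 in the column of its class, which is the target format that `categorical_crossentropy` expects.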

## Design the model
Initialize a `tensorflow.keras.models.Sequential` model instance called `model`.

In [None]:
model = Sequential()

Create an input layer instance of `tensorflow.keras.layers.InputLayer` and add it to the model instance `model` using the `Model.add()` function. Its input shape is the number of feature columns, not the full shape of the training array.

Create a hidden layer instance of `tensorflow.keras.layers.Dense` with the `relu` activation function and 64 hidden neurons, and add it to the model instance `model`.

Create an output layer instance of `tensorflow.keras.layers.Dense` with a `softmax` activation function (because of classification) with the number of neurons corresponding to the number of classes in the dataset.

In [None]:
# The input shape is the number of features per sample, without the batch dimension.
shape = (x_train.shape[1],)

model.add(InputLayer(input_shape = shape))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(2, activation = 'softmax'))

model.build()
model.summary()
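
The parameter counts printed by `model.summary()` can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 12 input features and the layer sizes used in the code above (64 hidden units, 2 outputs); `dense_params` is a helper written only for this illustration.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 12  # assumed number of feature columns

hidden = dense_params(n_features, 64)
output = dense_params(64, 2)

print(hidden, output, hidden + output)
```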

Using the `Model.compile()` function, compile the model instance `model` with the `categorical_crossentropy` loss, the `adam` optimizer, and `accuracy` as the metric.

In [None]:
model.compile(loss='categorical_crossentropy', optimizer = 'adam', metrics=['accuracy'])
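
As a rough intuition for the loss, categorical cross-entropy averages the negative log of the probability assigned to the true class. The NumPy sketch below is a simplified re-implementation for illustration, not the Keras code; the clipping constant `eps` and all arrays are assumptions.

```python
import numpy as np


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability of the true class (one-hot targets)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))


y_true_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
confident = np.array([[0.9, 0.1], [0.2, 0.8]])
uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])

loss_confident = categorical_crossentropy(y_true_demo, confident)
loss_uncertain = categorical_crossentropy(y_true_demo, uncertain)

print(loss_confident, loss_uncertain)
```

Predictions that put more probability on the correct class give a lower loss; a 50/50 guess on two classes gives a loss of ln 2.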

## Train and evaluate the model
Using the `Model.fit()` function, fit the model instance `model` to the training data `x_train` and training labels `y_train`. Set the number of `epochs` to `200` and the `batch_size` parameter to `16`. Keep the returned history object so that the loss and accuracy curves can be plotted afterwards.

In [None]:
history1 = model.fit(x_train, y_train, epochs=200, batch_size=16)
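
The batch size determines how many weight updates happen per epoch. The sketch below works this out for a hypothetical training set of 239 rows (an assumed figure for illustration; check `x_train.shape[0]` for the real one) and a batch size of 16.

```python
import math

n_train = 239    # assumed number of training rows
batch_size = 16

# The final batch is smaller when n_train is not a multiple of batch_size.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, last_batch)
```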

In [None]:
# Per-epoch training metrics recorded by Model.fit()
loss = history1.history['loss']
accuracy = history1.history['accuracy']

fig, (ax1, ax2) = plt.subplots(1, 2,  figsize=(15,5))

#loss plot
ax1.plot(loss, c = 'orange')
ax1.set_xlabel('# epochs')
ax1.set_ylabel('loss')

#accuracy plot
ax2.plot(accuracy)
ax2.set_xlabel('# epochs')
ax2.set_ylabel('accuracy')

plt.show()


Using the `Model.evaluate()` function, evaluate the trained model instance `model` on the test data `x_test` and test labels `y_test`. Assign the result to a variable called `loss` (the final loss value) and a variable called `acc` (the accuracy metric), respectively.

In [None]:
loss, acc = model.evaluate(x_test, y_test, verbose =0)

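Use the `Model.predict()` function to get the predicted class probabilities for the test data `x_test` with the trained model instance `model`. Assign the result to a variable called `y_estimate`, then use `np.argmax()` to convert both the predictions and the one-hot test labels back to class indices (`y_estimate` and `y_true`).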

In [None]:
y_estimate = model.predict(x_test)
y_estimate = np.argmax(y_estimate, axis=1)
y_true = np.argmax(y_test, axis=1)
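
The `np.argmax(..., axis=1)` step picks, for each row, the index of the largest value, which turns softmax probabilities or one-hot rows into class indices. A small sketch with made-up probabilities:

```python
import numpy as np

# Hypothetical softmax outputs for three samples and two classes.
probs = np.array([
    [0.80, 0.20],
    [0.30, 0.70],
    [0.55, 0.45],
])

predicted_class = np.argmax(probs, axis=1)

print(predicted_class)
```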

Print additional metrics, such as `F1-score`, using the `sklearn.metrics.classification_report()` function by providing it with `y_true` and `y_estimate` vectors as input parameters.

In [None]:
print(classification_report(y_true, y_estimate))
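
To read the report, it helps to know how the per-class numbers are derived. The sketch below computes precision, recall, and F1 for the positive class by hand on made-up label vectors; it is an illustration of the definitions, not a replacement for `classification_report`.

```python
# Hypothetical true and predicted labels (1 = death event).
true_labels = [0, 0, 0, 1, 1, 1, 0, 1]
pred_labels = [0, 0, 1, 1, 0, 1, 0, 1]

pairs = list(zip(true_labels, pred_labels))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # how many predicted positives were correct
recall = tp / (tp + fn)     # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

For a clinical label, recall on the positive class is often the number to watch, since a false negative means a missed at-risk patient.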