<a href="https://colab.research.google.com/github/rcarrata/deeplearning_tf_examples/blob/master/8_Classification_Project_Heart_Failure.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Classification Project

In this project, you will use a dataset from Kaggle to predict the survival of patients with heart failure from serum creatinine and ejection fraction, and other factors such as age, anemia, diabetes, and so on.

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful use of alcohol through population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidemia, or already established disease) need early detection and management, where a machine learning model can be of great help.

[Kaggle Source Data](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data)

## A - Loading the data

1. Using pandas.read_csv(), load the data from heart_failure.csv to a pandas DataFrame object. Assign the resulting DataFrame to a variable called data.

2. Use the DataFrame.info() method to print all the columns and their types of the DataFrame instance data.

3. Print the distribution of the DEATH_EVENT column in the DataFrame instance data using collections.Counter. This is the column you will need to predict. Note that the column name is upper case in the CSV file.

4. Extract the label column DEATH_EVENT from the data DataFrame and assign the result to a variable called y.

5. Extract the feature columns ['age','anaemia','creatinine_phosphokinase','diabetes','ejection_fraction','high_blood_pressure','platelets','serum_creatinine','serum_sodium','sex','smoking','time'] from the DataFrame instance data and assign the result to a variable called x.
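
Steps 3 to 5 can be sketched on a tiny made-up DataFrame. The rows below are hypothetical and are not taken from heart_failure.csv; the names toy, x_toy and y_toy are illustrative only.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy rows standing in for heart_failure.csv
toy = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0],
    "ejection_fraction": [20, 38, 20, 20],
    "DEATH_EVENT": [1, 1, 0, 0],
})

# Class distribution of the label column
distribution = Counter(toy["DEATH_EVENT"])
print(distribution)

# Label vector (what we predict) and feature matrix (what we predict from)
y_toy = toy["DEATH_EVENT"]
x_toy = toy[["age", "ejection_fraction"]]
```

Checking the distribution first matters because a strongly imbalanced label makes plain accuracy hard to interpret later on.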

In [None]:
import pandas as pd
from collections import Counter

from google.colab import drive
drive.mount('/content/drive')

!ls "/content/drive/My Drive/Colab/Classification/heart_failure.csv"

root_folder = "/content/drive/My Drive/Colab/"
project_folder = "Classification/"
csv_file = "heart_failure.csv"

csv_data = root_folder + project_folder + csv_file
print(csv_data)

data = pd.read_csv(csv_data)

from google.colab.data_table import DataTable
DataTable.max_columns = 60

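# Print the column names, non-null counts and dtypes.
# DataFrame.info() prints directly and returns None, so it is not wrapped in print().
print("## Dataframe.Info")
data.info()
print("\n")

# Print the class distribution of the label column
print("## Death_Events distribution")
print('DEATH_EVENT', Counter(data["DEATH_EVENT"]))
print("\n")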


# Extract the label column
y = data["DEATH_EVENT"]

# Extract the feature columns
x = data[['age','anaemia','creatinine_phosphokinase','diabetes','ejection_fraction','high_blood_pressure','platelets','serum_creatinine','serum_sodium','sex','smoking','time']]


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
'/content/drive/My Drive/Colab/Classification/heart_failure.csv'
/content/drive/My Drive/Colab/Classification/heart_failure.csv
## Dataframe.Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex       

## B - Data preprocessing

6. Use the pandas.get_dummies() function to convert the categorical features in the DataFrame instance x to one-hot encoding vectors and assign the result back to variable x. **NOTE**: sklearn.preprocessing.LabelEncoder is not used at this stage because the binary columns are already encoded as the integers 0 and 1.

7. Use the sklearn.model_selection.train_test_split() method to split the data into training features, test features, training labels, and test labels, respectively. Set the test_size parameter to the proportion of the data (a float between 0.0 and 1.0) you wish to put in the test set, and use any value for the random_state parameter. Store the results of the function in the X_train, X_test, Y_train, Y_test variables, making sure you use this order.

8. Initialize a ColumnTransformer object by using StandardScaler to scale the numeric features in the dataset: ['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium','time']. Assign the resulting object to a variable called ct.

9. Use the ColumnTransformer.fit_transform() function to train the scaler instance ct on the training data X_train and assign the result back to X_train.

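10. Use the ColumnTransformer.transform() function to scale the test data instance X_test using the trained scaler ct, and assign the result back to X_test. Note: for the difference between transform and fit_transform, see https://stackoverflow.com/questions/23838056/what-is-the-difference-between-transform-and-fit-transform-in-sklearn

  ```python
  from sklearn.preprocessing import StandardScaler
  sc = StandardScaler()
  # fit_transform learns the mean and standard deviation from the training data, then scales it
  X_train = sc.fit_transform(X_train)
  # transform reuses the training statistics to scale the test data
  X_test = sc.transform(X_test)
  ```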
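
Steps 7 to 10 can be sketched end to end on synthetic data. The column names are borrowed from the dataset, but the values are randomly generated, and the names features, labels, numeric_cols and ct_demo are illustrative assumptions rather than part of the project code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (values are made up)
features = pd.DataFrame({
    "age": rng.uniform(40, 95, size=100),
    "serum_sodium": rng.integers(113, 149, size=100),
    "smoking": rng.integers(0, 2, size=100),
})
labels = pd.Series(rng.integers(0, 2, size=100))

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=0
)

# Scale only the continuous columns; the binary flag passes through unchanged
numeric_cols = ["age", "serum_sodium"]
ct_demo = ColumnTransformer(
    [("numeric", StandardScaler(), numeric_cols)], remainder="passthrough"
)

X_tr_scaled = ct_demo.fit_transform(X_tr)  # learns mean/std from training rows
X_te_scaled = ct_demo.transform(X_te)      # reuses the training statistics

# The scaled training columns have mean close to 0
print(X_tr_scaled[:, :2].mean(axis=0))
```

Fitting on the training rows only keeps information about the test set from leaking into the scaling statistics. The transformed output lists the scaled columns first and the passed-through columns after them.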

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer

# Apply one-hot encoding to any categorical columns and assign the result back to x.
# Every feature in this dataset is already numeric, so get_dummies returns x unchanged.
x = pd.get_dummies(x)
print(x.head())

# LabelEncoder is not used here because the binary columns are already encoded as the integers 0 and 1.

# sklearn.model_selection.train_test_split() method
# test_size - If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
# random_state - Controls the shuffling applied to the data before applying the split.
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

# List the numeric features to scale, or select them by dtype with DataFrame.select_dtypes().
# Note: every column in this dataset is float64 or int64, so this selection covers all 12 columns,
# including the binary flags such as anaemia and diabetes.
numerical_features = x.select_dtypes(include=['float64', 'int64'])
numerical_columns = numerical_features.columns
print("## Selecting Numerical Columns only to use ColumnTransformer on them")
print(numerical_columns)
print("\n")

ct = ColumnTransformer([('numeric', StandardScaler(), numerical_columns)], remainder='passthrough')

X_train = ct.fit_transform(X_train)
#print(X_train)

X_test = ct.transform(X_test)
#print(X_test)



    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  
0        0     4  
1        0     6  
2        1     7  
3        0     7  
4        

## C - Prepare labels for classification

11. Initialize an instance of LabelEncoder and assign it to a variable called le.

12. Using the LabelEncoder.fit_transform() function, fit the encoder instance le to the training labels Y_train, while at the same time converting the training labels according to the trained encoder.

13. Using the LabelEncoder.transform() function, encode the test labels Y_test using the trained encoder le.

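14. Using the tensorflow.keras.utils.to_categorical() function, transform the encoded training labels Y_train into a binary (one-hot) vector and assign the result back to Y_train.

15. Using the tensorflow.keras.utils.to_categorical() function, transform the encoded test labels Y_test in the same way and assign the result back to Y_test.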
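
The label preparation can be sketched without TensorFlow. The example below uses NumPy identity-matrix indexing as a stand-in for to_categorical; for integer labels 0 to K-1 the two should give the same one-hot matrix. The label values and variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

raw_train = np.array([0, 1, 0, 0, 1])  # hypothetical training labels
raw_test = np.array([1, 0])            # hypothetical test labels

encoder = LabelEncoder()
train_ids = encoder.fit_transform(raw_train)  # fit on the training labels only
test_ids = encoder.transform(raw_test)        # reuse the fitted mapping

# One-hot encode: row i of the identity matrix is the vector for class i
num_classes = len(encoder.classes_)
train_one_hot = np.eye(num_classes)[train_ids]
test_one_hot = np.eye(num_classes)[test_ids]

print(train_one_hot)
```

Each row has a single 1 in the column of its class, which is the shape the categorical_crossentropy loss expects.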

In [None]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical


le = LabelEncoder()
Y_train = le.fit_transform(Y_train.astype(str))
Y_test = le.transform(Y_test.astype(str))

Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)
#print(Y_train)
#print(Y_test)



[[1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]


## D - Design the model

16. Initialize a tensorflow.keras.models.Sequential model instance called model.

17. Create an input layer instance of tensorflow.keras.layers.InputLayer and add it to the model instance model using the Model.add() function.

18. Create a hidden layer instance of tensorflow.keras.layers.Dense with relu activation function and 12 hidden neurons, and add it to the model instance model.

19. Create an output layer instance of tensorflow.keras.layers.Dense with a softmax activation function (because of classification) with the number of neurons corresponding to the number of classes in the dataset.

20. Using the Model.compile() function, compile the model instance model using the categorical_crossentropy loss, adam optimizer and accuracy as metrics.
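
To see why the output layer uses softmax, here is a small NumPy sketch of what that activation computes. The logits are hypothetical raw scores, not values produced by the model in this project.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical raw outputs of the last Dense layer for three patients
logits = np.array([[2.0, -1.0], [0.0, 0.0], [-3.0, 1.5]])
probs = softmax(logits)

print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

Because each row sums to 1, the two output neurons can be read as the estimated probabilities of the two classes.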

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense

# Design the model
model = Sequential()

# Input layer: one input per feature column
model.add(InputLayer(input_shape=(X_train.shape[1],)))

# First hidden layer (10 neurons)
model.add(Dense(10, activation='relu'))

# Second hidden layer (12 neurons)
model.add(Dense(12, activation='relu'))

# Output layer
# Two neurons, one per class in DEATH_EVENT: 0 (survived) and 1 (died)
model.add(Dense(2, activation='softmax'))

# Compile with categorical cross-entropy loss (matches the one-hot labels), the Adam optimizer and accuracy as the metric
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])




## E - Train and Evaluate the model

21. Using the Model.fit() function, fit the model instance model to the training data X_train and training labels Y_train. Set the number of epochs to 100 and the batch size parameter to 16.

22. Using the Model.evaluate() function, evaluate the trained model instance model on the test data X_test and test labels Y_test. Assign the result to a variable called loss (representing the final loss value) and a variable called acc (representing the accuracy metrics), respectively.

23. For guidance on choosing a loss function, see https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
Cross-entropy loss is minimized during training: the closer it is to 0 the better, and a perfect prediction gives a cross-entropy of 0.
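
A rough NumPy sketch of categorical cross-entropy shows how the loss behaves. The function below is a simplified illustration, not the Keras implementation, and the eps clipping value is an assumption added to avoid log(0). All probabilities are made up.

```python
import numpy as np


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)  # assumed guard against log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))


# Two one-hot encoded true labels
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])

confident_right = np.array([[0.99, 0.01], [0.02, 0.98]])
unsure = np.array([[0.5, 0.5], [0.5, 0.5]])
confident_wrong = np.array([[0.01, 0.99], [0.98, 0.02]])

loss_right = categorical_crossentropy(y_true, confident_right)
loss_unsure = categorical_crossentropy(y_true, unsure)
loss_wrong = categorical_crossentropy(y_true, confident_wrong)

print(loss_right, loss_unsure, loss_wrong)
```

The loss approaches 0 when the model puts nearly all probability on the correct class, equals ln(2) when it is undecided between two classes, and grows quickly when it is confidently wrong.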

In [None]:
# Fit the model on the training data
print("\n")
print("## Fitting the model with Training Data")
model.fit(X_train, Y_train, epochs=100, batch_size=16)

# Evaluate the model on unseen test data
print("\n")
print("## Evaluate the model with Test Data")
loss, acc = model.evaluate(X_test, Y_test, verbose=0)
print("Loss:", loss, "Accuracy:", acc)



## Fitting the model with Training Data
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch

## F - Generating a classification report

24. Use the Model.predict() function to get the predictions for the test data X_test with the trained model instance model. Assign the result to a variable called y_estimate.

25. Use the numpy.argmax() method to select the index of the predicted class (the one with the highest probability) for each row of y_estimate. Assign the result back to y_estimate.

26. Use the numpy.argmax() method to select the index of the true class for each one-hot encoded label in Y_test. Assign the result to a variable called y_true.
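
The argmax step can be illustrated on a few made-up rows. The probabilities and one-hot labels below are hypothetical and only loosely resemble the model output; pred_classes and true_classes are illustrative names.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical softmax outputs for five test patients
probs = np.array([
    [0.96, 0.04],
    [0.65, 0.35],
    [0.45, 0.55],
    [0.05, 0.95],
    [0.51, 0.49],
])
# Hypothetical one-hot true labels for the same patients
one_hot_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

pred_classes = np.argmax(probs, axis=1)         # predicted class per row
true_classes = np.argmax(one_hot_true, axis=1)  # true class per row

print(pred_classes)
print(classification_report(true_classes, pred_classes))
```

With axis=1, argmax works row by row, turning each probability pair or one-hot vector into a single class index that classification_report can compare.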


In [None]:
import numpy as np
from sklearn.metrics import classification_report

# Get the predictions for the test data X_test
print("\n")
print("## Predictions for the Test Data (X_test)")
y_estimate = model.predict(X_test)
print(y_estimate)

# Select the index of the predicted class (highest probability) for each row of y_estimate
y_estimate = np.argmax(y_estimate, axis = 1)

# Select the index of the true class for each one-hot encoded label in Y_test
y_true = np.argmax(Y_test, axis = 1)

# Print additional metrics, such as F1-score
print("\n")
print("## Print additional Metrics")
print(classification_report(y_true, y_estimate))



## Predictions for the Test Data (X_test)
[[9.56739724e-01 4.32602279e-02]
 [9.72063184e-01 2.79368497e-02]
 [6.52504981e-01 3.47495049e-01]
 [9.89320755e-01 1.06792599e-02]
 [7.70613134e-01 2.29386896e-01]
 [8.22895169e-01 1.77104831e-01]
 [5.08633375e-01 4.91366655e-01]
 [9.69651103e-01 3.03489100e-02]
 [4.51679498e-01 5.48320472e-01]
 [6.43049300e-01 3.56950670e-01]
 [9.73917782e-01 2.60822382e-02]
 [5.13367541e-02 9.48663235e-01]
 [9.02368009e-01 9.76319760e-02]
 [9.67137158e-01 3.28628160e-02]
 [1.57343045e-01 8.42656970e-01]
 [5.78066260e-02 9.42193389e-01]
 [9.93327379e-01 6.67258026e-03]
 [8.11383247e-01 1.88616812e-01]
 [9.97530282e-01 2.46973406e-03]
 [9.96634424e-01 3.36561142e-03]
 [4.93412077e-01 5.06587863e-01]
 [9.65832546e-02 9.03416693e-01]
 [1.84259519e-01 8.15740407e-01]
 [6.90233529e-01 3.09766471e-01]
 [5.61238289e-01 4.38761681e-01]
 [9.82451975e-01 1.75480805e-02]
 [7.90069282e-01 2.09930673e-01]
 [9.95841086e-01 4.15890338e-03]
 [9.97981012e-01 2.01901793e-03]