# BME-336546-C10-Feature engineering versus representation learning

## Medical topic
This tutorial's medical topic is atrial fibrillation (AF) arrhythmia detection. We will rely on features that were designed and extracted in our lab and published [here](https://ieeexplore.ieee.org/document/9281068/authors#authors). Some of the features, and the model we will build today, are shown below.

<center><img src="images/Capture3.png" width=400></center>
<center><img src="images/Capture1.PNG" width=400></center>
<center><img src="images/Capture2.PNG" width=300></center>

## Our mission
In this tutorial we will compare classical supervised machine learning, in which a classifier is trained on features engineered by humans, with deep learning, in which a neural network fed with raw data performs both the feature extraction and the classification. We will also learn how to use our faculty's GPU clusters; see the [BME cluster Wiki](http://132.68.176.116/index.php/Main_Page), written by Snir Lugassy.

In [None]:
import numpy as np
import itertools
from tqdm import tqdm
import pickle
import sys
import pandas as pd
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
mpl.style.use(['ggplot']) 
# %matplotlib inline
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from IPython.display import display, clear_output
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import log_loss
import os
# os.environ['TF_CPP_MIN_LOG_LEVEL']='3'

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential, load_model 
from tensorflow.keras.layers import Dense, Dropout, Activation, Conv1D, MaxPool1D, Flatten 
from tensorflow.keras import utils

In [None]:
data_src = '/MLdata/MLcourse/LTAF/'
X = np.load(data_src + 'X_LTAF.npy')
y = np.load(data_src + 'y_LTAF.npy')


print(X.shape)
print(y.shape)


In [None]:
X_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 336546, stratify=y)

In [None]:
from sklearn.metrics import confusion_matrix
calc_TN = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[0, 0]
calc_FP = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[0, 1]
calc_FN = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[1, 0]
calc_TP = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[1, 1]

In [None]:
def stat_metric(y_test, y_pred_test, y_pred_proba_test, clf_name, temp=np.empty(())):
    TN = calc_TN(y_test, y_pred_test)
    FP = calc_FP(y_test, y_pred_test)
    FN = calc_FN(y_test, y_pred_test)
    TP = calc_TP(y_test, y_pred_test)
    Se = TP/(TP+FN)
    Sp = TN/(TN+FP)
    PPV = TP/(TP+FP)
    NPV = TN/(TN+FN)
    Acc = (TP+TN)/(TP+TN+FP+FN)
    F1 = (2*Se*PPV)/(Se+PPV)
    print('The fitted classifier is ' + clf_name + '\n')
    print('Sensitivity is {:.2f}. \nSpecificity is {:.2f}. \nPPV is {:.2f}. \nNPV is {:.2f}. \nAccuracy is {:.2f}. \nF1 is {:.2f}. '.format(Se,Sp,PPV,NPV,Acc,F1))
    if temp.size == 1:
        print('AUROC is {:.2f}'.format(roc_auc_score(y_test, y_pred_proba_test[:,1])))
    else:
        print('AUROC is {:.2f}'.format(roc_auc_score(y_test, temp[:,1])))
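
The formulas inside `stat_metric` can be checked by hand on a tiny example. The sketch below uses ten made-up labels (not LTAF data) and plain Python counting instead of `confusion_matrix`; the names `pairs` and `metrics` are illustrative only.

```python
# Toy labels: 6 negatives followed by 4 positives (illustrative, not LTAF data).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Count the four cells of the confusion matrix directly.
pairs = list(zip(y_true, y_pred))
TN = sum(1 for t, p in pairs if t == 0 and p == 0)
FP = sum(1 for t, p in pairs if t == 0 and p == 1)
FN = sum(1 for t, p in pairs if t == 1 and p == 0)
TP = sum(1 for t, p in pairs if t == 1 and p == 1)

# Same definitions as in stat_metric.
Se = TP / (TP + FN)    # sensitivity (recall)
Sp = TN / (TN + FP)    # specificity
PPV = TP / (TP + FP)   # positive predictive value (precision)
NPV = TN / (TN + FN)   # negative predictive value
Acc = (TP + TN) / (TP + TN + FP + FN)
F1 = (2 * Se * PPV) / (Se + PPV)

metrics = {"Se": Se, "Sp": Sp, "PPV": PPV, "NPV": NPV, "Acc": Acc, "F1": F1}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Here the counts are TN=4, FP=2, FN=1 and TP=3, so sensitivity is 3/4 while PPV is only 3/5: the two false alarms lower the precision without affecting the recall.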

Let's begin with feature engineering. Scale your training and testing sets, then fit the following three models to the scaled data:
*  Logistic regression with an $L_2$ penalty and a maximum of 300 iterations.
*  SVM with an RBF kernel, $C=1$ and a maximum of 1000 iterations.
*  Random forest with 20 estimators and a maximal depth of 5.

For each model, calculate the predictions on the test set and name them `y_pred_test`. Also calculate the predicted probabilities and name them `y_pred_proba_test`, except in the SVM case. Then run the function `stat_metric` for each model. Set every `random_state` to 336546.
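
A point worth keeping in mind when scaling: the scaling statistics should be estimated on the training set only and then reused on the test set. The sketch below shows this with plain NumPy on synthetic data; the array `X_demo` and its two made-up columns are assumptions for illustration, not the LTAF features. It follows the same convention as `StandardScaler` (mean and population standard deviation per column), but it is not the solution to the exercise cell.

```python
import numpy as np

# Synthetic stand-in for a feature matrix: 200 windows, 2 hypothetical features.
rng = np.random.default_rng(336546)
X_demo = rng.normal(loc=[800.0, 0.1], scale=[120.0, 0.05], size=(200, 2))
X_tr, X_te = X_demo[:160], X_demo[160:]

# Estimate the statistics on the training part only.
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# Apply the *training* statistics to both parts.
X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma

print("train mean:", X_tr_scaled.mean(axis=0).round(6))
print("train std: ", X_tr_scaled.std(axis=0).round(6))
print("test mean: ", X_te_scaled.mean(axis=0).round(3))
```

The scaled training columns have zero mean and unit standard deviation by construction. The test columns are only approximately standardized, which is expected because no test information was used.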


In [None]:
scaler = StandardScaler()
#----------------------Implement your code here:------------------------------

#------------------------------------------------------------------------------

In [None]:
stat_metric(y_test, y_pred_test, y_pred_proba_test, clf_name='logistic regression')

### Expected output:
<center><img src="outputs/1.PNG" width="480"></center>

# Non-linear

In [None]:
from sklearn.svm import SVC
#----------------------Implement your code here:------------------------------

#------------------------------------------------------------------------------

In [None]:
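# y_pred_proba_test is not recalculated for the SVM, so it still holds the
# previous model's probabilities and the AUROC printed here is not the SVM's.
stat_metric(y_test, y_pred_test, y_pred_proba_test, clf_name='RBF SVM')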

### Expected output:
<center><img src="outputs/2.PNG" width="480"></center>

In [None]:
from sklearn.ensemble import RandomForestClassifier
#----------------------Implement your code here:------------------------------

#------------------------------------------------------------------------------

In [None]:
stat_metric(y_test, y_pred_test, y_pred_proba_test, clf_name='random forest')

### Expected output:
<center><img src="outputs/3.PNG" width="480"></center>

In [None]:
rr = np.load(data_src + 'rr_LTAF.npy')
print(rr.shape)  # 84 patients, about 1700 windows per patient, 60 beats per window; 142853 windows in total
rr_train, rr_test, _, _ = train_test_split(rr, y, test_size = 0.20, random_state = 336546, stratify=y)

In [None]:
tf.keras.backend.clear_session()
config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))

Build the following model:
Conv1D with 128 filters and a kernel size of 10 samples --> MaxPool --> Conv1D with 256 filters and a kernel size of 10 samples --> Dropout with probability 0.5 --> Flatten --> 3 fully connected hidden layers with ReLU activation and 512, 256 and 128 neurons, respectively --> Dropout.
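
Before building the network it can help to trace how the window length shrinks through the convolutional part. The sketch below is plain arithmetic, not Keras code. It assumes the Keras defaults (`'valid'` padding, convolution stride 1, pooling size 2 with stride 2) and assumes that the layers listed above are stacked after the 64-filter `Conv1D` layer already present in the cell; under a different reading the numbers change. The helper names are hypothetical.

```python
def conv1d_out_len(n_in, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (n_in - kernel_size) // stride + 1


def maxpool1d_out_len(n_in, pool_size=2):
    """Output length of 1-D max pooling with stride equal to pool_size ('valid' padding)."""
    return (n_in - pool_size) // pool_size + 1


window_size = 60     # beats per window
len_sub_window = 10  # kernel size in samples

lengths = []
n = conv1d_out_len(window_size, len_sub_window)  # provided first layer, 64 filters
lengths.append(n)
n = conv1d_out_len(n, len_sub_window)            # 128 filters
lengths.append(n)
n = maxpool1d_out_len(n)                         # max pooling
lengths.append(n)
n = conv1d_out_len(n, len_sub_window)            # 256 filters
lengths.append(n)

# Dropout does not change the shape; Flatten multiplies length by channel count.
flat_features = n * 256

print("sequence length after each layer:", lengths)
print("features entering the first Dense layer:", flat_features)
```

Under these assumptions the sequence length goes 60 → 51 → 42 → 21 → 12, and the first fully connected layer receives 12 × 256 = 3072 inputs. Comparing such numbers with `model.summary()` is a quick way to confirm that the layers were added in the intended order.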

In [None]:
window_size=60
n_filters_start=64
n_hidden_start=512
len_sub_window=10
dropout=0.5
model = Sequential()
model.add(Conv1D(n_filters_start, len_sub_window, activation='relu', input_shape=(window_size, 1)))
#----------------------Implement your code here:------------------------------

#------------------------------------------------------------------------------
model.add(Dense(1, activation='sigmoid'))
# model.add(Dense(2, activation='softmax')) # should change the labels for that
model.compile(optimizer='adam', metrics=['accuracy'], loss='binary_crossentropy')

In [None]:
model.summary()

### Expected output:
<center><img src="outputs/4.PNG" width="480"></center>

In [None]:
rr_train = rr_train.reshape(rr_train.shape[0],rr_train.shape[1],1)
rr_test = rr_test.reshape(rr_test.shape[0],rr_test.shape[1],1)

Write the call that fits the model, without running the cell. Use a batch size of 1024 and 20 epochs.
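
To get a feeling for what these settings imply, the number of gradient updates can be estimated with a few lines of arithmetic. This is a rough sketch: the total window count is taken from the comment in the cell that loads `rr_LTAF.npy`, the split sizes assume that `train_test_split` rounds the test count up, and it assumes that no validation split is held out during fitting.

```python
import math

n_windows = 142853   # total number of windows, from the data-loading cell
test_size = 0.20
batch_size = 1024
epochs = 20

# Size of the 80/20 split (test count rounded up, training set gets the rest).
n_test = math.ceil(test_size * n_windows)
n_train = n_windows - n_test

# One step per batch; the last batch of an epoch may be smaller than batch_size.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print("training windows:", n_train)
print("steps per epoch: ", steps_per_epoch)
print("total steps:     ", total_steps)
```

With these numbers, each epoch consists of 112 weight updates, for 2240 updates over the full run.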

In [None]:
#----------------------Implement your code here:------------------------------

#-----------------------------------------------------------------------------

In [None]:
os.makedirs("results", exist_ok=True)  # create the output folder if it does not exist yet
save_dir = "results/"
model_name = "final_weights.h5"
model_path = os.path.join(save_dir, model_name)
model.save(model_path)
print('Saved trained model at %s ' % model_path)

In [None]:
final_model = load_model("results/final_weights.h5")
loss_and_metrics = final_model.evaluate(rr_test, y_test, verbose=2)

In [None]:
# Threshold the sigmoid outputs at 0.5 to obtain hard 0/1 predictions
y_pred = (final_model.predict(rr_test) >= 0.5).astype(int)

In [None]:
temp = final_model.predict(rr_test)
temp2 = np.zeros((temp.shape[0], 2))
temp2[:,0] = 1-temp[:,0]
temp2[:,1] = temp[:,0]

In [None]:
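# Pass the CNN predictions (y_pred), not the stale y_pred_test from the previous models.
# y_pred_proba_test is ignored here because temp is supplied.
stat_metric(y_test, y_pred.ravel(), y_pred_proba_test, clf_name='CNN', temp=temp2)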

#### *This tutorial was written by [Moran Davoodi](mailto:morandavoodi@gmail.com) with the assistance of [Yuval Ben Sason](mailto:yuvalbse@gmail.com) & Kevin Kotzen*

### Acknowledgements:
Thanks to our lab colleagues *Armand Chocron* and *Shany Biton* for helping with this tutorial, which relies on their paper.