# Predictive Modeling

In this notebook, we explore several machine learning models for predicting heart failure from MIMIC data.

### Import required packages

In [1]:
from sklearn.model_selection import KFold
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, auc, accuracy_score, confusion_matrix, f1_score, precision_score, recall_score, roc_curve, classification_report
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
import numpy as np

from sklearn.ensemble import RandomForestClassifier

### Preparing the dataset

Our dataset is in svmlight format and was generated with Spark in our previous workflow.
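
For readers unfamiliar with the format: each svmlight line holds a label followed by sparse `index:value` pairs. The sketch below parses two made-up lines by hand, only to illustrate the layout. `parse_svmlight_line` is a hypothetical helper and the rows are invented; the notebook itself relies on `load_svmlight_file`.

```python
def parse_svmlight_line(line):
    """Parse one 'label idx:val idx:val ...' line into (label, {idx: val})."""
    parts = line.split("#", 1)[0].split()  # drop an optional trailing comment
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        idx, val = token.split(":")
        features[int(idx)] = float(val)
    return label, features


# Made-up example rows, not taken from patients.svmlight
rows = ["1 3:1.0 17:0.5", "0 2:2.0"]
parsed = [parse_svmlight_line(r) for r in rows]
print(parsed)
```

Only the non-zero features are stored, which is why the loader returns a sparse matrix that the notebook later densifies with `toarray()`.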

In [2]:
X, y = load_svmlight_file("3_code/patients.svmlight")

In [3]:
X = X.toarray()

In [4]:
## split data into training/validation and test set.
X_trainval, X_final_test, y_trainval, y_final_test = train_test_split(X, y, test_size=0.3, random_state=41)

## 1. Support Vector Classifier

In [5]:
## 5 Fold CV of trainval dataset
kf = KFold(n_splits=5)
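
As a small illustration of what `KFold(n_splits=5)` yields, the sketch below splits ten made-up sample indices. Without `shuffle=True`, each validation fold is a contiguous block of rows, which is acceptable here because `train_test_split` already shuffled the data.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)  # ten made-up samples, for illustration only
folds = [val.tolist() for _, val in KFold(n_splits=5).split(toy)]
print(folds)  # each inner list is one validation fold
```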

In [6]:
svc_accuracy = []
svc_auc_scores = []

In [7]:
## fit a Linear SVC (Support Vector Classifier) model without L1 penalty with 5 fold CV
from sklearn.svm import LinearSVC
svc_model = LinearSVC(C=1.0, random_state=42)
for train_index, val_index in kf.split(X_trainval):
    # split into training and validation sets
    X_train = X_trainval[train_index]
    y_train = y_trainval[train_index]
    X_val = X_trainval[val_index]
    y_val = y_trainval[val_index]
    # fit the model on train and calculate acc and auc on validation set
    svc_model.fit(X_train, y_train)
    acc = svc_model.score(X_val, y_val)
    svc_accuracy.append(acc)
    y_score = svc_model.decision_function(X_val)
    auc_svc = roc_auc_score(y_val, y_score)
    svc_auc_scores.append(auc_svc)

In [8]:
## Mean validation accuracy and AUC of the SVC model across the 5 folds
print('SVC accuracy: {}'.format(np.mean(svc_accuracy)))
print('SVC AUC: {}'.format(np.mean(svc_auc_scores)))

SVC accuracy: 0.6142217320806446
SVC AUC: 0.6421937416824725
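
To make the AUC figure easier to interpret: ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is why raw `decision_function` values can be passed to `roc_auc_score` without converting them to probabilities. A minimal sketch on made-up scores (`pairwise_auc` is a hypothetical helper, not part of scikit-learn):

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [-1.2, 0.3, 0.1, 2.0]  # made-up decision values
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four pairs are ordered correctly
```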


In [9]:
## get confusion matrix and metrics using the last fold of CV
y_pred = svc_model.predict(X_trainval[val_index])
print("Linear SVC Metrics...")
print('SVC accuracy: {}'.format(np.mean(svc_accuracy)))
print('SVC AUC: {}'.format(np.mean(svc_auc_scores)))
print(confusion_matrix(y_trainval[val_index], y_pred))
print("f1_score: {}".format(f1_score(y_trainval[val_index], y_pred)))
print("precision_score: {}".format(precision_score(y_trainval[val_index], y_pred)))
print("recall_score: {}".format(recall_score(y_trainval[val_index], y_pred)))

Linear SVC Metrics...
SVC accuracy: 0.6142217320806446
SVC AUC: 0.6421937416824725
[[419 185]
 [187 227]]
f1_score: 0.549636803874092
precision_score: 0.5509708737864077
recall_score: 0.5483091787439613
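
The precision, recall, and F1 values above can be recomputed directly from the confusion matrix, assuming scikit-learn's layout of `[[TN, FP], [FN, TP]]`. A short sketch using the counts printed for the last fold:

```python
# Counts copied from the confusion matrix printed above (last CV fold)
tn, fp, fn, tp = 419, 185, 187, 227

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
fold_accuracy = (tp + tn) / (tn + fp + fn + tp)

print(precision, recall, f1, fold_accuracy)
```

Note that `fold_accuracy` describes the last fold only, so it need not match the mean cross-validated accuracy printed alongside it.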


In [10]:
l1_accuracy = []
l1_auc_scores = []

In [11]:
scaler = MinMaxScaler()
X_trainval_l1 = scaler.fit_transform(X_trainval)
l1_model = LinearSVC(C=1.0, random_state=42, dual=False, penalty='l1')
for train_index, val_index in kf.split(X_trainval_l1):
    # split into training and validation sets
    X_train = X_trainval_l1[train_index]
    y_train = y_trainval[train_index]
    X_val = X_trainval_l1[val_index]
    y_val = y_trainval[val_index]
    
    # fit the model on train and calculate acc and auc on validation set
    l1_model.fit(X_train, y_train)
    acc = l1_model.score(X_val, y_val)
    l1_accuracy.append(acc)
    y_score = l1_model.decision_function(X_val)
    auc_l1 = roc_auc_score(y_val, y_score)
    l1_auc_scores.append(auc_l1)

In [12]:
## confusion matrix and metrics using the last fold of CV
y_pred = l1_model.predict(X_val)
print("SVC (L1 penalty) Metrics...")
print("for sparse model, cv accuracy = {}, auc = {}".format(np.mean(l1_accuracy), np.mean(l1_auc_scores)))
print(confusion_matrix(y_val, y_pred))
print("f1_score: {}".format(f1_score(y_val, y_pred)))
print("precision_score: {}".format(precision_score(y_val, y_pred)))
print("recall_score: {}".format(recall_score(y_val, y_pred)))

SVC (L1 penalty) Metrics...
for sparse model, cv accuracy = 0.6959344170003721, auc = 0.7434077786464497
[[500 104]
 [193 221]]
f1_score: 0.5981055480378891
precision_score: 0.68
recall_score: 0.533816425120773


In [13]:
# loading mapping
mapping = []
with open('3_code/mapping.txt') as f:
    for line in f.readlines():
        splits = line.split('|') # feature-name | feature-index
        mapping.append(splits[0])

# take the last 20 entries of the ascending argsort - the 20 largest coefficients
top_20 = np.argsort(l1_model.coef_[0])[-20:]

for index, fid in enumerate(top_20[::-1]): # read in reverse order (largest first)
    print("%d: feature [%s] with coef %.3f" % (index, mapping[fid], l1_model.coef_[0][fid]))

0: feature [proc99238] with coef 3.951
1: feature [textcompared] with coef 3.082
2: feature [textmoderate] with coef 2.925
3: feature [textventricular] with coef 2.434
4: feature [diag1573] with coef 2.368
5: feature [meddesi10] with coef 2.336
6: feature [diag4255] with coef 2.200
7: feature [medenal5] with coef 2.025
8: feature [textvalve] with coef 2.000
9: feature [proc32020] with coef 1.963
10: feature [proc99232] with coef 1.906
11: feature [lab51006] with coef 1.894
12: feature [medtuss5l] with coef 1.888
13: feature [diag42971] with coef 1.883
14: feature [proc99261] with coef 1.835
15: feature [diag9351] with coef 1.826
16: feature [medntg100pb] with coef 1.795
17: feature [diag25082] with coef 1.663
18: feature [proc99239] with coef 1.632
19: feature [diag74609] with coef 1.588
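
The ranking above sorts by signed coefficient, so it surfaces the features that push most strongly toward the positive class and leaves out strongly negative ones. The sketch below shows the same `argsort` idiom on made-up coefficients, together with an absolute-value variant as one possible alternative:

```python
import numpy as np

coef = np.array([0.2, -1.5, 3.0, 0.0, 1.1])  # made-up coefficients
k = 2

# Largest signed coefficients, largest first
top_signed = np.argsort(coef)[-k:][::-1]

# Largest magnitudes, regardless of sign
top_abs = np.argsort(np.abs(coef))[-k:][::-1]

print(top_signed.tolist(), top_abs.tolist())
```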


### Build best SVC model, with L1 penalization

In [14]:
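# note: this final model is fit on the unscaled X_trainval,
# whereas the CV loop above used MinMax-scaled features
l1_model = LinearSVC(C=1.0, random_state=42, dual=False, penalty='l1')
l1_model.fit(X_trainval, y_trainval)
y_pred_l1_svc = l1_model.predict(X_final_test)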

accuracy_score(y_final_test, y_pred_l1_svc)

0.6810265811182401

## 2. Random Forest Classifier

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=41)

In [16]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_estimators=256, min_samples_split=6, n_jobs=-1, random_state=0)
rf_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=6,
            min_weight_fraction_leaf=0.0, n_estimators=256, n_jobs=-1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [17]:
rf_accuracy = rf_clf.score(X_test, y_test)
print("for sparse model, accuracy = %.3f" % (rf_accuracy))

for sparse model, accuracy = 0.704


In [18]:
rf_y_pred = rf_clf.predict(X_test)
rf_accuracy = rf_clf.score(X_test, y_test)
print("RF Metrics...")
print("for sparse model, accuracy = %.3f" % (rf_accuracy))
print(confusion_matrix(y_test, rf_y_pred))
print("f1_score: {}".format(f1_score(y_test, rf_y_pred)))
print("precision_score: {}".format(precision_score(y_test, rf_y_pred)))
print("recall_score: {}".format(recall_score(y_test, rf_y_pred)))

RF Metrics...
for sparse model, accuracy = 0.704
[[515  82]
 [220 202]]
f1_score: 0.5722379603399433
precision_score: 0.7112676056338029
recall_score: 0.4786729857819905


### Build best Random Forest Classifier

Number trees = 256, min samples split = 6

In [19]:
rf_clf = RandomForestClassifier(n_estimators=256, min_samples_split=6, n_jobs=-1, random_state=0)
rf_clf.fit(X_trainval, y_trainval)
y_pred_rf = rf_clf.predict(X_final_test)

accuracy_score(y_final_test, y_pred_rf)

0.72548120989917508

## 3. Decision Tree Classifier

In [20]:
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(min_samples_leaf=4, random_state=0)
dt_clf.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

In [21]:
dt_accuracy = dt_clf.score(X_test, y_test)
print("for sparse model, accuracy = %.3f" % (dt_accuracy))

for sparse model, accuracy = 0.656


In [22]:
dt_y_pred = dt_clf.predict(X_test)
dt_accuracy = dt_clf.score(X_test, y_test)
print("DT Metrics...")
print("for sparse model, accuracy = %.3f" % (dt_accuracy))
print(confusion_matrix(y_test, dt_y_pred))
print("f1_score: {}".format(f1_score(y_test, dt_y_pred)))
print("precision_score: {}".format(precision_score(y_test, dt_y_pred)))
print("recall_score: {}".format(recall_score(y_test, dt_y_pred)))

DT Metrics...
for sparse model, accuracy = 0.656
[[438 159]
 [192 230]]
f1_score: 0.5672009864364981
precision_score: 0.5912596401028277
recall_score: 0.5450236966824644


### Build best Decision Tree Classifier

Min samples leaf = 4

In [23]:
dt_clf = DecisionTreeClassifier(min_samples_leaf=4, random_state=0)
dt_clf.fit(X_trainval, y_trainval)
y_pred_dt = dt_clf.predict(X_final_test)

accuracy_score(y_final_test, y_pred_dt)

0.63107241063244734

## 4. Feed-forward Neural Network

We will implement a feed-forward neural network by using PyTorch.

This code is adapted from Big Data Healthcare Lab session: https://github.com/ast0414/CSE6250BDH-LAB-DL/blob/master/1_FeedforwardNet.ipynb

### Preparing the dataset

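We will scale each feature by its maximum absolute value in the training set (`MaxAbsScaler`); for non-negative features this maps the training values into the range 0 to 1.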

In [24]:
scaler_train_nn = MaxAbsScaler().fit(X_train)
X_train_nn = scaler_train_nn.transform(X_train)
X_validation_nn = scaler_train_nn.transform(X_test)
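
As a rough NumPy sketch of what the scaler does (toy numbers, not the MIMIC features): the per-column maximum absolute value is learned from the training rows only and then reused for the validation rows, so validation values can fall outside the training range.

```python
import numpy as np

train = np.array([[0.0, 2.0], [4.0, 1.0]])  # made-up training rows
val = np.array([[8.0, 1.0]])                # made-up validation row

max_abs = np.abs(train).max(axis=0)  # per-feature scale learned from train
max_abs[max_abs == 0] = 1.0          # leave all-zero columns unchanged

train_scaled = train / max_abs
val_scaled = val / max_abs           # reuses the training scale

print(train_scaled.tolist(), val_scaled.tolist())
```

Fitting the scaler on the training split alone avoids leaking validation statistics into preprocessing.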

#### 1. Loading datasets
We will use DataLoader and TensorDataset (from [torch.utils.data](http://pytorch.org/docs/master/data.html#)) for convenience in data handling.

In [25]:
import torch
from torch.utils.data import DataLoader, TensorDataset

# let's fix the random seeds for reproducibility.
torch.manual_seed(0)
if torch.cuda.is_available():
    torch.cuda.manual_seed(0)

trainset = TensorDataset(torch.from_numpy(X_train_nn.astype('float32')), torch.from_numpy(y_train.astype('float32')).view(-1,1))
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=4) # can specify as many cores/threads as your machine

validationset = TensorDataset(torch.from_numpy(X_validation_nn.astype('float32')), torch.from_numpy(y_test.astype('float32')).view(-1,1))
validationloader = torch.utils.data.DataLoader(validationset, batch_size=4, shuffle=False, num_workers=4)

#### 2. Define a Feed-forward Neural Network

Next, we will define the feed-forward neural network.

After trying out various hyperparameters, we settled on a feed-forward net with four fully connected hidden layers of 256 nodes each, followed by a hidden-to-output layer. ReLU activation is applied after each hidden layer; the output layer returns a raw logit.

In [26]:
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardNet(nn.Module):
    # this just specifies the architecture of our NN, there's no computations yet
    def __init__(self, n_input, n_hidden, n_output):
        super(FeedForwardNet, self).__init__()
        self.hidden1 = nn.Linear(n_input, n_hidden) # linear is fully-connected NN
        self.hidden2 = nn.Linear(n_hidden, n_hidden)
        self.hidden3 = nn.Linear(n_hidden, n_hidden)
        self.hidden4 = nn.Linear(n_hidden, n_hidden)
        self.out = nn.Linear(n_hidden, n_output)

    # computation
    def forward(self, x):
        x = F.relu(self.hidden1(x)) # F.relu: apply non-linearity
        x = F.relu(self.hidden2(x))
        x = F.relu(self.hidden3(x))
        x = F.relu(self.hidden4(x))
        x = self.out(x)
        return x

net = FeedForwardNet(n_input=7301, n_hidden=256, n_output=1)
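
To get a feel for the size of this network, the sketch below counts its trainable parameters by hand from the layer shapes in the class definition, where each `nn.Linear(n_in, n_out)` holds `n_in * n_out` weights plus `n_out` biases. `linear_params` is a hypothetical helper.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


n_input, n_hidden, n_output = 7301, 256, 1

# hidden1, then hidden2..hidden4, then the output layer
layer_shapes = (
    [(n_input, n_hidden)]
    + [(n_hidden, n_hidden)] * 3
    + [(n_hidden, n_output)]
)
total_params = sum(linear_params(i, o) for i, o in layer_shapes)
print(total_params)
```

Most of the parameters sit in the first layer, because the input has 7301 features.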

#### 3. Define a Loss function and Optimizer
We will use binary cross-entropy loss (computed on logits) and SGD with momentum as our optimizer.

The SGD optimizer will have a learning rate of 0.001 and a momentum of 0.9.
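
As a scalar sketch of the momentum update, assuming the plain formulation with zero dampening and no Nesterov correction: the velocity accumulates past gradients, and the parameter moves along the velocity. `sgd_momentum_step` is a hypothetical helper and the quadratic objective is made up for illustration.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One SGD-with-momentum step on a single scalar parameter."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


w, v = 1.0, 0.0
for _ in range(3):
    grad = 2.0 * w  # gradient of the toy objective f(w) = w**2
    w, v = sgd_momentum_step(w, grad, v)

print(w, v)
```

Because the velocity keeps growing while successive gradients point the same way, the effective step is larger than the bare learning rate of 0.001 suggests.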

In [27]:
import torch.optim as optim

criterion = nn.BCEWithLogitsLoss() # higher numerical stability than just sigmoid
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
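
To illustrate the numerical-stability remark, the sketch below compares a naive sigmoid-then-log loss with the standard stable rewrite `max(x, 0) - x*y + log(1 + exp(-|x|))` for a single logit `x` and target `y`. Both functions are hypothetical scalar helpers written for this illustration, not PyTorch's internal code.

```python
import math


def bce_with_logits(logit, target):
    """Stable binary cross-entropy computed directly from a logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def naive_bce(logit, target):
    """Sigmoid followed by log; breaks down once the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


# For moderate logits both versions agree
moderate_gap = abs(bce_with_logits(0.3, 1.0) - naive_bce(0.3, 1.0))

# For a confidently wrong prediction the sigmoid rounds to exactly 1.0
try:
    naive_bce(40.0, 0.0)
    naive_failed = False
except ValueError:  # math.log(0.0) raises a domain error
    naive_failed = True

stable_loss = bce_with_logits(40.0, 0.0)
print(moderate_gap, naive_failed, stable_loss)
```

A logit of 0 gives a loss of ln 2 (about 0.693), which is roughly where the training log below starts.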

#### 4. Train the network

We will use the GPU, if available, for faster computation.

In [28]:
cuda = torch.cuda.is_available()

In [29]:
if cuda:
    net = net.cuda()

In [30]:
for epoch in range(20):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data

        # wrap them in Variable
        inputs, labels = Variable(inputs), Variable(labels)
        if cuda:
            inputs, labels = inputs.cuda(), labels.cuda()

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward
        outputs = net(inputs)
        loss = criterion(outputs, labels) # our defined loss function
        # backward. this will calculate gradient for each parameter in model
        loss.backward()
        # optimize. doing one step of optimization
        optimizer.step()

        # print statistics; .item() returns the scalar loss as a Python float
        running_loss += loss.item()
        
        if i % 10 == 9:    # print every 10 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 10))
            running_loss = 0.0

print('Finished Training')

[1,    10] loss: 0.695
[1,    20] loss: 0.685
[1,    30] loss: 0.682
[1,    40] loss: 0.690
[1,    50] loss: 0.680
[1,    60] loss: 0.700
[1,    70] loss: 0.681
[1,    80] loss: 0.671
[1,    90] loss: 0.685
[1,   100] loss: 0.666
[1,   110] loss: 0.680
[1,   120] loss: 0.693
[1,   130] loss: 0.684
[1,   140] loss: 0.669
[1,   150] loss: 0.698
[1,   160] loss: 0.674
[1,   170] loss: 0.678
[1,   180] loss: 0.699
[1,   190] loss: 0.667
[1,   200] loss: 0.672
[1,   210] loss: 0.688
[1,   220] loss: 0.705
[1,   230] loss: 0.660
[1,   240] loss: 0.688
[1,   250] loss: 0.682
[1,   260] loss: 0.653
[1,   270] loss: 0.714
[1,   280] loss: 0.670
[1,   290] loss: 0.676
[1,   300] loss: 0.702
[1,   310] loss: 0.695
[1,   320] loss: 0.682
[1,   330] loss: 0.676
[1,   340] loss: 0.670
[1,   350] loss: 0.689
[1,   360] loss: 0.669
[1,   370] loss: 0.662
[1,   380] loss: 0.668
[1,   390] loss: 0.661
[1,   400] loss: 0.645
[1,   410] loss: 0.697
[1,   420] loss: 0.682
[1,   430] loss: 0.690
[1,   440] 

[4,   570] loss: 0.673
[4,   580] loss: 0.673
[4,   590] loss: 0.663
[4,   600] loss: 0.662
[4,   610] loss: 0.695
[4,   620] loss: 0.673
[4,   630] loss: 0.629
[4,   640] loss: 0.718
[4,   650] loss: 0.629
[4,   660] loss: 0.684
[4,   670] loss: 0.651
[4,   680] loss: 0.695
[4,   690] loss: 0.651
[4,   700] loss: 0.617
[4,   710] loss: 0.673
[4,   720] loss: 0.684
[4,   730] loss: 0.662
[4,   740] loss: 0.685
[4,   750] loss: 0.752
[4,   760] loss: 0.652
[4,   770] loss: 0.630
[4,   780] loss: 0.618
[4,   790] loss: 0.718
[4,   800] loss: 0.718
[4,   810] loss: 0.673
[4,   820] loss: 0.684
[4,   830] loss: 0.641
[4,   840] loss: 0.619
[4,   850] loss: 0.640
[4,   860] loss: 0.696
[4,   870] loss: 0.640
[4,   880] loss: 0.662
[4,   890] loss: 0.708
[4,   900] loss: 0.684
[4,   910] loss: 0.673
[4,   920] loss: 0.662
[4,   930] loss: 0.706
[4,   940] loss: 0.662
[4,   950] loss: 0.673
[4,   960] loss: 0.684
[4,   970] loss: 0.641
[4,   980] loss: 0.684
[4,   990] loss: 0.640
[4,  1000] 

[8,   130] loss: 0.664
[8,   140] loss: 0.673
[8,   150] loss: 0.682
[8,   160] loss: 0.654
[8,   170] loss: 0.700
[8,   180] loss: 0.700
[8,   190] loss: 0.664
[8,   200] loss: 0.682
[8,   210] loss: 0.655
[8,   220] loss: 0.637
[8,   230] loss: 0.692
[8,   240] loss: 0.673
[8,   250] loss: 0.683
[8,   260] loss: 0.672
[8,   270] loss: 0.654
[8,   280] loss: 0.634
[8,   290] loss: 0.702
[8,   300] loss: 0.682
[8,   310] loss: 0.653
[8,   320] loss: 0.672
[8,   330] loss: 0.673
[8,   340] loss: 0.662
[8,   350] loss: 0.652
[8,   360] loss: 0.663
[8,   370] loss: 0.611
[8,   380] loss: 0.683
[8,   390] loss: 0.704
[8,   400] loss: 0.629
[8,   410] loss: 0.705
[8,   420] loss: 0.640
[8,   430] loss: 0.662
[8,   440] loss: 0.684
[8,   450] loss: 0.640
[8,   460] loss: 0.641
[8,   470] loss: 0.718
[8,   480] loss: 0.585
[8,   490] loss: 0.696
[8,   500] loss: 0.741
[8,   510] loss: 0.661
[8,   520] loss: 0.662
[8,   530] loss: 0.673
[8,   540] loss: 0.662
[8,   550] loss: 0.629
[8,   560] 

[11,   600] loss: 0.671
[11,   610] loss: 0.671
[11,   620] loss: 0.641
[11,   630] loss: 0.680
[11,   640] loss: 0.682
[11,   650] loss: 0.690
[11,   660] loss: 0.630
[11,   670] loss: 0.659
[11,   680] loss: 0.702
[11,   690] loss: 0.710
[11,   700] loss: 0.593
[11,   710] loss: 0.672
[11,   720] loss: 0.705
[11,   730] loss: 0.682
[11,   740] loss: 0.679
[11,   750] loss: 0.670
[11,   760] loss: 0.710
[11,   770] loss: 0.680
[11,   780] loss: 0.689
[11,   790] loss: 0.662
[11,   800] loss: 0.653
[11,   810] loss: 0.653
[11,   820] loss: 0.642
[11,   830] loss: 0.710
[11,   840] loss: 0.641
[11,   850] loss: 0.651
[11,   860] loss: 0.651
[11,   870] loss: 0.640
[11,   880] loss: 0.627
[11,   890] loss: 0.683
[11,   900] loss: 0.680
[11,   910] loss: 0.705
[11,   920] loss: 0.659
[11,   930] loss: 0.702
[11,   940] loss: 0.619
[11,   950] loss: 0.651
[11,   960] loss: 0.630
[11,   970] loss: 0.638
[11,   980] loss: 0.661
[11,   990] loss: 0.681
[11,  1000] loss: 0.647
[11,  1010] loss

[14,  1000] loss: 0.634
[14,  1010] loss: 0.605
[15,    10] loss: 0.582
[15,    20] loss: 0.719
[15,    30] loss: 0.644
[15,    40] loss: 0.578
[15,    50] loss: 0.685
[15,    60] loss: 0.662
[15,    70] loss: 0.632
[15,    80] loss: 0.641
[15,    90] loss: 0.685
[15,   100] loss: 0.629
[15,   110] loss: 0.603
[15,   120] loss: 0.628
[15,   130] loss: 0.547
[15,   140] loss: 0.682
[15,   150] loss: 0.653
[15,   160] loss: 0.557
[15,   170] loss: 0.723
[15,   180] loss: 0.718
[15,   190] loss: 0.593
[15,   200] loss: 0.678
[15,   210] loss: 0.648
[15,   220] loss: 0.606
[15,   230] loss: 0.593
[15,   240] loss: 0.620
[15,   250] loss: 0.694
[15,   260] loss: 0.617
[15,   270] loss: 0.664
[15,   280] loss: 0.650
[15,   290] loss: 0.692
[15,   300] loss: 0.649
[15,   310] loss: 0.662
[15,   320] loss: 0.581
[15,   330] loss: 0.658
[15,   340] loss: 0.650
[15,   350] loss: 0.625
[15,   360] loss: 0.643
[15,   370] loss: 0.612
[15,   380] loss: 0.678
[15,   390] loss: 0.586
[15,   400] loss

[18,   410] loss: 0.484
[18,   420] loss: 0.448
[18,   430] loss: 0.501
[18,   440] loss: 0.515
[18,   450] loss: 0.515
[18,   460] loss: 0.458
[18,   470] loss: 0.685
[18,   480] loss: 0.574
[18,   490] loss: 0.505
[18,   500] loss: 0.542
[18,   510] loss: 0.505
[18,   520] loss: 0.557
[18,   530] loss: 0.485
[18,   540] loss: 0.552
[18,   550] loss: 0.532
[18,   560] loss: 0.513
[18,   570] loss: 0.577
[18,   580] loss: 0.624
[18,   590] loss: 0.494
[18,   600] loss: 0.505
[18,   610] loss: 0.480
[18,   620] loss: 0.511
[18,   630] loss: 0.549
[18,   640] loss: 0.622
[18,   650] loss: 0.530
[18,   660] loss: 0.496
[18,   670] loss: 0.583
[18,   680] loss: 0.577
[18,   690] loss: 0.567
[18,   700] loss: 0.615
[18,   710] loss: 0.480
[18,   720] loss: 0.374
[18,   730] loss: 0.547
[18,   740] loss: 0.569
[18,   750] loss: 0.497
[18,   760] loss: 0.527
[18,   770] loss: 0.522
[18,   780] loss: 0.565
[18,   790] loss: 0.663
[18,   800] loss: 0.552
[18,   810] loss: 0.504
[18,   820] loss

#### 5. Test the network on the validation data

In [31]:
y_true_nn = []
y_scores_nn = []

In [32]:
for data in validationloader:
    inputs, labels = data
    if cuda:
        inputs = inputs.cuda()
    outputs = net(Variable(inputs))
    outputs = torch.sigmoid(outputs)  # convert logits to probabilities
    if cuda:
        outputs = outputs.cpu()
    y_true_nn.extend(labels.numpy().flatten().tolist())
    y_scores_nn.extend(outputs.data.numpy().flatten().tolist())

In [33]:
fpr, tpr, _ = roc_curve(y_true_nn, y_scores_nn)
auc_ffnet = auc(fpr, tpr)
y_scores_round = [round(elem) for elem in y_scores_nn]
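
One detail about using `round` as the decision rule: in Python 3 it rounds halves to the nearest even integer, so a probability of exactly 0.5 becomes 0. An explicit threshold makes the rule visible and adjustable. A minimal sketch on made-up probabilities (`to_labels` is a hypothetical helper):

```python
def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels with an explicit cut-off."""
    return [1 if p >= threshold else 0 for p in probabilities]


toy_probs = [0.2, 0.5, 0.73]  # made-up sigmoid outputs
explicit = to_labels(toy_probs)
rounded = [round(p) for p in toy_probs]
print(explicit, rounded)
```

An output of exactly 0.5 is rare in practice, so the two rules will usually agree on real predictions.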

In [34]:
# Testing
print('Accuracy: ', accuracy_score(y_true_nn, y_scores_round))
print('AUC: ', auc_ffnet)
print('f1 score: ', f1_score(y_true_nn, y_scores_round))
print('Precision: ', precision_score(y_true_nn, y_scores_round))
print('Recall: ', recall_score(y_true_nn, y_scores_round))

Accuracy:  0.682041216879
AUC:  0.747880397247
f1 score:  0.64238410596
Precision:  0.601239669421
Recall:  0.689573459716


In [35]:
confusion_matrix(y_true_nn, y_scores_round)

array([[404, 193],
       [131, 291]], dtype=int64)

### Build best NN


In [36]:
scaler_train_nn = MaxAbsScaler().fit(X_trainval)
X_trainval_nn = scaler_train_nn.transform(X_trainval)
X_test_nn = scaler_train_nn.transform(X_final_test)

trainvalidationset = TensorDataset(torch.from_numpy(X_trainval_nn.astype('float32')), torch.from_numpy(y_trainval.astype('float32')).view(-1,1))
trainvalidationloader = torch.utils.data.DataLoader(trainvalidationset, batch_size=4, shuffle=False, num_workers=4)

testset = TensorDataset(torch.from_numpy(X_test_nn.astype('float32')), torch.from_numpy(y_final_test.astype('float32')).view(-1,1))
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=4)

for epoch in range(20):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainvalidationloader, 0):
        # get the inputs
        inputs, labels = data

        # wrap them in Variable
        inputs, labels = Variable(inputs), Variable(labels)
        if cuda:
            inputs, labels = inputs.cuda(), labels.cuda()

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward
        outputs = net(inputs)
        loss = criterion(outputs, labels) # our defined loss function
        # backward. this will calculate gradient for each parameter in model
        loss.backward()
        # optimize. doing one step of optimization
        optimizer.step()

        # print statistics; .item() returns the scalar loss as a Python float
        running_loss += loss.item()
        
        if i % 10 == 9:    # print every 10 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 10))
            running_loss = 0.0

print('Finished Training')


y_true_nn = []
y_scores_nn = []

for data in testloader:
    inputs, labels = data
    if cuda:
        inputs = inputs.cuda()
    outputs = net(Variable(inputs))
    outputs = torch.sigmoid(outputs)  # convert logits to probabilities
    if cuda:
        outputs = outputs.cpu()
    y_true_nn.extend(labels.numpy().flatten().tolist())
    y_scores_nn.extend(outputs.data.numpy().flatten().tolist())
    
y_scores_round = [round(elem) for elem in y_scores_nn]

accuracy_score(y_true_nn, y_scores_round)

[1,    10] loss: 0.586
[1,    20] loss: 0.340
[1,    30] loss: 0.492
[1,    40] loss: 0.530
[1,    50] loss: 0.390
[1,    60] loss: 0.480
[1,    70] loss: 0.548
[1,    80] loss: 0.520
[1,    90] loss: 0.522
[1,   100] loss: 0.866
[1,   110] loss: 0.509
[1,   120] loss: 0.556
[1,   130] loss: 0.386
[1,   140] loss: 0.570
[1,   150] loss: 0.442
[1,   160] loss: 0.481
[1,   170] loss: 0.554
[1,   180] loss: 0.533
[1,   190] loss: 0.433
[1,   200] loss: 0.444
[1,   210] loss: 0.499
[1,   220] loss: 0.342
[1,   230] loss: 0.371
[1,   240] loss: 0.685
[1,   250] loss: 0.531
[1,   260] loss: 0.540
[1,   270] loss: 0.567
[1,   280] loss: 0.512
[1,   290] loss: 0.504
[1,   300] loss: 0.402
[1,   310] loss: 0.565
[1,   320] loss: 0.457
[1,   330] loss: 0.418
[1,   340] loss: 0.456
[1,   350] loss: 0.534
[1,   360] loss: 0.522
[1,   370] loss: 0.504
[1,   380] loss: 0.517
[1,   390] loss: 0.457
[1,   400] loss: 0.545
[1,   410] loss: 0.474
[1,   420] loss: 0.597
[1,   430] loss: 0.514
[1,   440] 

[3,  1050] loss: 0.622
[3,  1060] loss: 0.434
[3,  1070] loss: 0.317
[3,  1080] loss: 0.347
[3,  1090] loss: 0.440
[3,  1100] loss: 0.408
[3,  1110] loss: 0.496
[3,  1120] loss: 0.581
[3,  1130] loss: 0.371
[3,  1140] loss: 0.315
[3,  1150] loss: 0.491
[3,  1160] loss: 0.403
[3,  1170] loss: 0.345
[3,  1180] loss: 0.417
[3,  1190] loss: 0.572
[3,  1200] loss: 0.418
[3,  1210] loss: 0.492
[3,  1220] loss: 0.400
[3,  1230] loss: 0.310
[3,  1240] loss: 0.451
[3,  1250] loss: 0.351
[3,  1260] loss: 0.422
[3,  1270] loss: 0.290
[4,    10] loss: 0.504
[4,    20] loss: 0.344
[4,    30] loss: 0.435
[4,    40] loss: 0.406
[4,    50] loss: 0.313
[4,    60] loss: 0.371
[4,    70] loss: 0.586
[4,    80] loss: 0.432
[4,    90] loss: 0.524
[4,   100] loss: 0.628
[4,   110] loss: 0.470
[4,   120] loss: 0.465
[4,   130] loss: 0.329
[4,   140] loss: 0.422
[4,   150] loss: 0.448
[4,   160] loss: 0.456
[4,   170] loss: 0.492
[4,   180] loss: 0.464
[4,   190] loss: 0.377
[4,   200] loss: 0.391
[4,   210] 

[6,   810] loss: 0.644
[6,   820] loss: 0.549
[6,   830] loss: 0.342
[6,   840] loss: 0.359
[6,   850] loss: 0.594
[6,   860] loss: 0.405
[6,   870] loss: 0.431
[6,   880] loss: 0.380
[6,   890] loss: 0.568
[6,   900] loss: 0.370
[6,   910] loss: 0.362
[6,   920] loss: 0.437
[6,   930] loss: 0.392
[6,   940] loss: 0.435
[6,   950] loss: 0.415
[6,   960] loss: 0.335
[6,   970] loss: 0.275
[6,   980] loss: 0.464
[6,   990] loss: 0.335
[6,  1000] loss: 0.409
[6,  1010] loss: 0.348
[6,  1020] loss: 0.267
[6,  1030] loss: 0.439
[6,  1040] loss: 0.509
[6,  1050] loss: 0.527
[6,  1060] loss: 0.406
[6,  1070] loss: 0.287
[6,  1080] loss: 0.305
[6,  1090] loss: 0.372
[6,  1100] loss: 0.404
[6,  1110] loss: 0.459
[6,  1120] loss: 0.556
[6,  1130] loss: 0.363
[6,  1140] loss: 0.273
[6,  1150] loss: 0.467
[6,  1160] loss: 0.370
[6,  1170] loss: 0.295
[6,  1180] loss: 0.387
[6,  1190] loss: 0.463
[6,  1200] loss: 0.384
[6,  1210] loss: 0.466
[6,  1220] loss: 0.340
[6,  1230] loss: 0.282
[6,  1240] 

[9,   570] loss: 0.308
[9,   580] loss: 0.241
[9,   590] loss: 0.533
[9,   600] loss: 0.413
[9,   610] loss: 0.379
[9,   620] loss: 0.443
[9,   630] loss: 0.347
[9,   640] loss: 0.595
[9,   650] loss: 0.359
[9,   660] loss: 0.263
[9,   670] loss: 0.343
[9,   680] loss: 0.356
[9,   690] loss: 0.474
[9,   700] loss: 0.183
[9,   710] loss: 0.269
[9,   720] loss: 0.385
[9,   730] loss: 0.417
[9,   740] loss: 0.334
[9,   750] loss: 0.319
[9,   760] loss: 0.256
[9,   770] loss: 0.274
[9,   780] loss: 0.363
[9,   790] loss: 0.387
[9,   800] loss: 0.477
[9,   810] loss: 0.586
[9,   820] loss: 0.438
[9,   830] loss: 0.310
[9,   840] loss: 0.319
[9,   850] loss: 0.586
[9,   860] loss: 0.345
[9,   870] loss: 0.416
[9,   880] loss: 0.379
[9,   890] loss: 0.550
[9,   900] loss: 0.317
[9,   910] loss: 0.312
[9,   920] loss: 0.386
[9,   930] loss: 0.333
[9,   940] loss: 0.391
[9,   950] loss: 0.356
[9,   960] loss: 0.299
[9,   970] loss: 0.251
[9,   980] loss: 0.414
[9,   990] loss: 0.308
[9,  1000] 

[12,   210] loss: 0.397
[12,   220] loss: 0.322
[12,   230] loss: 0.300
[12,   240] loss: 0.324
[12,   250] loss: 0.357
[12,   260] loss: 0.363
[12,   270] loss: 0.439
[12,   280] loss: 0.447
[12,   290] loss: 0.480
[12,   300] loss: 0.238
[12,   310] loss: 0.377
[12,   320] loss: 0.277
[12,   330] loss: 0.362
[12,   340] loss: 0.301
[12,   350] loss: 0.491
[12,   360] loss: 0.352
[12,   370] loss: 0.305
[12,   380] loss: 0.281
[12,   390] loss: 0.310
[12,   400] loss: 0.359
[12,   410] loss: 0.454
[12,   420] loss: 0.271
[12,   430] loss: 0.331
[12,   440] loss: 0.448
[12,   450] loss: 0.225
[12,   460] loss: 0.355
[12,   470] loss: 0.225
[12,   480] loss: 0.256
[12,   490] loss: 0.347
[12,   500] loss: 0.332
[12,   510] loss: 0.397
[12,   520] loss: 0.435
[12,   530] loss: 0.630
[12,   540] loss: 0.377
[12,   550] loss: 0.279
[12,   560] loss: 0.331
[12,   570] loss: 0.276
[12,   580] loss: 0.225
[12,   590] loss: 0.501
[12,   600] loss: 0.368
[12,   610] loss: 0.374
[12,   620] loss

[14,  1090] loss: 0.282
[14,  1100] loss: 0.386
[14,  1110] loss: 0.311
[14,  1120] loss: 0.533
[14,  1130] loss: 0.317
[14,  1140] loss: 0.279
[14,  1150] loss: 0.383
[14,  1160] loss: 0.322
[14,  1170] loss: 0.231
[14,  1180] loss: 0.320
[14,  1190] loss: 0.368
[14,  1200] loss: 0.320
[14,  1210] loss: 0.425
[14,  1220] loss: 0.299
[14,  1230] loss: 0.217
[14,  1240] loss: 0.275
[14,  1250] loss: 0.215
[14,  1260] loss: 0.261
[14,  1270] loss: 0.168
[15,    10] loss: 0.383
[15,    20] loss: 0.202
[15,    30] loss: 0.434
[15,    40] loss: 0.236
[15,    50] loss: 0.223
[15,    60] loss: 0.225
[15,    70] loss: 0.542
[15,    80] loss: 0.356
[15,    90] loss: 0.387
[15,   100] loss: 0.327
[15,   110] loss: 0.459
[15,   120] loss: 0.482
[15,   130] loss: 0.694
[15,   140] loss: 0.483
[15,   150] loss: 0.581
[15,   160] loss: 0.477
[15,   170] loss: 0.385
[15,   180] loss: 0.578
[15,   190] loss: 0.356
[15,   200] loss: 0.311
[15,   210] loss: 0.439
[15,   220] loss: 0.278
[15,   230] loss

[17,   730] loss: 0.409
[17,   740] loss: 0.284
[17,   750] loss: 0.244
[17,   760] loss: 0.173
[17,   770] loss: 0.224
[17,   780] loss: 0.314
[17,   790] loss: 0.274
[17,   800] loss: 0.446
[17,   810] loss: 0.471
[17,   820] loss: 0.403
[17,   830] loss: 0.277
[17,   840] loss: 0.261
[17,   850] loss: 0.437
[17,   860] loss: 0.322
[17,   870] loss: 0.442
[17,   880] loss: 0.360
[17,   890] loss: 0.381
[17,   900] loss: 0.238
[17,   910] loss: 0.266
[17,   920] loss: 0.394
[17,   930] loss: 0.289
[17,   940] loss: 0.337
[17,   950] loss: 0.313
[17,   960] loss: 0.316
[17,   970] loss: 0.242
[17,   980] loss: 0.332
[17,   990] loss: 0.303
[17,  1000] loss: 0.337
[17,  1010] loss: 0.301
[17,  1020] loss: 0.198
[17,  1030] loss: 0.424
[17,  1040] loss: 0.405
[17,  1050] loss: 0.383
[17,  1060] loss: 0.345
[17,  1070] loss: 0.208
[17,  1080] loss: 0.229
[17,  1090] loss: 0.294
[17,  1100] loss: 0.364
[17,  1110] loss: 0.282
[17,  1120] loss: 0.518
[17,  1130] loss: 0.281
[17,  1140] loss

[20,   370] loss: 0.298
[20,   380] loss: 0.255
[20,   390] loss: 0.315
[20,   400] loss: 0.335
[20,   410] loss: 0.438
[20,   420] loss: 0.230
[20,   430] loss: 0.266
[20,   440] loss: 0.351
[20,   450] loss: 0.178
[20,   460] loss: 0.320
[20,   470] loss: 0.169
[20,   480] loss: 0.241
[20,   490] loss: 0.335
[20,   500] loss: 0.267
[20,   510] loss: 0.286
[20,   520] loss: 0.317
[20,   530] loss: 0.388
[20,   540] loss: 0.207
[20,   550] loss: 0.213
[20,   560] loss: 0.310
[20,   570] loss: 0.234
[20,   580] loss: 0.251
[20,   590] loss: 0.587
[20,   600] loss: 0.416
[20,   610] loss: 0.314
[20,   620] loss: 0.347
[20,   630] loss: 0.384
[20,   640] loss: 0.491
[20,   650] loss: 0.299
[20,   660] loss: 0.220
[20,   670] loss: 0.351
[20,   680] loss: 0.259
[20,   690] loss: 0.405
[20,   700] loss: 0.161
[20,   710] loss: 0.194
[20,   720] loss: 0.326
[20,   730] loss: 0.437
[20,   740] loss: 0.244
[20,   750] loss: 0.263
[20,   760] loss: 0.165
[20,   770] loss: 0.227
[20,   780] loss

0.65948670944087995

## Majority voting

We will combine the three best-performing models (L1 SVC, Random Forest, and Neural Network) into an ensemble. The final prediction for each sample is the majority vote of the three models.

All 3 models were trained on the training + validation data, and they were used to make predictions on the final test set.
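
The voting rule can be sketched in isolation as follows. `vote_majority` is a hypothetical helper and the prediction lists are made up; with three binary voters a tie cannot occur.

```python
from collections import Counter


def vote_majority(*model_predictions):
    """Return the most common label per sample across the given models."""
    votes = []
    for sample_votes in zip(*model_predictions):
        # cast to int so that 1.0 and 1 count as the same label
        label, _ = Counter(int(v) for v in sample_votes).most_common(1)[0]
        votes.append(label)
    return votes


svc_toy = [1.0, 0.0, 1.0, 0.0]  # made-up predictions
rf_toy = [1.0, 1.0, 0.0, 0.0]
nn_toy = [0, 1, 1, 0]
combined = vote_majority(svc_toy, rf_toy, nn_toy)
print(combined)
```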

In [37]:
a = zip(list(y_pred_l1_svc), list(y_pred_rf), y_scores_round)

In [38]:
majority_vote = []
for i in a:
    majority_vote.append(max(set(i), key=i.count))

In [39]:
# ROC/AUC here is computed from the hard 0/1 majority votes, not from continuous scores
fpr, tpr, _ = roc_curve(y_final_test, majority_vote)
auc_vote = auc(fpr, tpr)
print('Accuracy: ', accuracy_score(y_final_test, majority_vote))
print('AUC: ', auc_vote)
print('f1 score: ', f1_score(y_final_test, majority_vote))
print('Precision: ', precision_score(y_final_test, majority_vote))
print('Recall: ', recall_score(y_final_test, majority_vote))

Accuracy:  0.720439963336
AUC:  0.681009039615
f1 score:  0.583333333333
Precision:  0.696574225122
Recall:  0.501762632197


In [40]:
confusion_matrix(y_final_test, majority_vote)

array([[1145,  186],
       [ 424,  427]], dtype=int64)
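
Because the majority vote produces hard 0/1 labels rather than continuous scores, its ROC curve has a single interior point, and the area under it reduces to the mean of the true-positive and true-negative rates (balanced accuracy). The sketch below checks this against the confusion matrix above, assuming the `[[TN, FP], [FN, TP]]` layout:

```python
# Counts copied from the final confusion matrix above
tn, fp, fn, tp = 1145, 186, 424, 427

tpr = tp / (tp + fn)  # recall on the positive class
tnr = tn / (tn + fp)  # recall on the negative class
auc_from_labels = (tpr + tnr) / 2

print(auc_from_labels)
```

This explains why the ensemble's AUC (about 0.681) is lower than its accuracy and is not directly comparable to the score-based AUC values reported for the individual models.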