# KE5006 Applied Research

### Identifying enhancers and their strength with deep neural networks

## Layer 1 Model with Physicochemical Property Features - Results

## Summary of Findings
* Validation Accuracy :
    * Ensemble of models of 2x16 GRU, 1x16 Dense
        * The best model ensemble of 5 models (around the best model) has an accuracy of 77.449%. The model ensemble beats the best single model accuracy of 76.735% from the same training run.
    * Ensemble of models of 1x16 conv1D, 2x16 GRU bidirectional, 1x8 Dense
        * The best model ensemble of 3 models (around the best model) has an accuracy of 76.429%. The model ensemble beats the best single model accuracy of 76.224% from the same training run.
* Test Accuracy :
    * Both ensembles achieved a higher test accuracy (**75.25%** and **75.50%**) than the best single models (74.00% and 73.75%). The single models came from training runs without warm restarts.
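
To put the validation figures in absolute terms, the sketch below converts each accuracy into a count of correctly classified sequences. It assumes the 980-sequence validation split created later in this notebook (2968 sequences in total, 1988 of them used for training); the helper name `correct_count` is illustrative only.

```python
# Convert a reported accuracy (in percent) into a number of correct predictions.
N_VAL = 2968 - 1988  # validation set size used in this notebook (980)

def correct_count(acc_percent, n=N_VAL):
    """Return the number of correct predictions implied by an accuracy in percent."""
    return round(acc_percent / 100.0 * n)

# Gain of each ensemble over the best single model from the same training run.
gru_gain = correct_count(77.449) - correct_count(76.735)
conv_gain = correct_count(76.429) - correct_count(76.224)
print(gru_gain, conv_gain)
```

Under these assumptions the two ensembles are ahead of their best single models by 7 and 2 validation sequences respectively.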

## Load libraries

In [1]:
# Set the working directory (which contains the directories source, data, etc.)
import os
os.chdir(os.path.join(os.path.sep, 'home', 'tkokkeng', 'Documents', 'KE5006-AppliedResearch', 'enhancer'))
os.getcwd()

'/home/tkokkeng/Documents/KE5006-AppliedResearch/enhancer'

In [2]:
# Check if the directory containing the source files is in the path; add it if not.
import sys
if os.path.join(os.getcwd(), 'source') not in sys.path:
    sys.path.append(os.path.join(os.getcwd(), 'source'))
sys.path

['/home/tkokkeng/python/python367/tsfvenv/lib/python36.zip',
 '/home/tkokkeng/python/python367/tsfvenv/lib/python3.6',
 '/home/tkokkeng/python/python367/tsfvenv/lib/python3.6/lib-dynload',
 '/usr/lib/python3.6',
 '',
 '/home/tkokkeng/python/python367/tsfvenv/lib/python3.6/site-packages',
 '/home/tkokkeng/.local/lib/python3.6/site-packages',
 '/usr/local/lib/python3.6/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/home/tkokkeng/python/python367/tsfvenv/lib/python3.6/site-packages/IPython/extensions',
 '/home/tkokkeng/.ipython',
 '/home/tkokkeng/Documents/KE5006-AppliedResearch/enhancer/source']

In [3]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import datetime
import pickle

import myUtilities as mu

from sklearn.metrics import classification_report, confusion_matrix, matthews_corrcoef, make_scorer
from sklearn.metrics import recall_score, roc_auc_score, roc_curve, accuracy_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.preprocessing import MinMaxScaler

from keras.preprocessing.text import Tokenizer
from keras.models import Sequential, load_model
from keras import layers
from keras.optimizers import RMSprop
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard

Using TensorFlow backend.


In [4]:
pd.set_option('display.precision', 3)  # full option name; the short form 'precision' is not accepted by newer pandas

## Load data

In [5]:
enhancer_df = pd.read_csv(os.path.join('data', 'enhancer.csv'))
enhancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1484 entries, 0 to 1483
Data columns (total 2 columns):
id          1484 non-null object
sequence    1484 non-null object
dtypes: object(2)
memory usage: 23.3+ KB


In [6]:
enhancer_df['enhancer'] = np.ones((len(enhancer_df),))

In [7]:
enhancer_df.head()

|   | id | sequence | enhancer |
|---|---|---|---|
| 0 | CHRX_48897056_48897256 | CACAATGTAGAAGCAGAGACACAGGAACCAGGCTTGGTGATGGCTC... | 1.0 |
| 1 | CHR12_6444339_6444539 | GCCCTCACATTCCCTGGCCCATCCCCTCCACCTCAAAATTTACAAA... | 1.0 |
| 2 | CHR12_6444939_6445139 | GAGCAGGAGGCCAGTCACCCTGAGTCAGCCACGGGGAGACGCTGCA... | 1.0 |
| 3 | CHR12_6445139_6445339 | CCTCTGCTGAGAACAGGACTGGGGCTTCCAGGGCAACAGGAAGGGT... | 1.0 |
| 4 | CHR12_6445339_6445539 | ACAGCCTTAAAGGGAGCTTTTCAGGGACCTCTGGCCAGTGGGGGAT... | 1.0 |


In [8]:
non_enhancer_df = pd.read_csv(os.path.join('data', 'non_enhancer.csv'))
non_enhancer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1484 entries, 0 to 1483
Data columns (total 2 columns):
id          1484 non-null object
sequence    1484 non-null object
dtypes: object(2)
memory usage: 23.3+ KB


In [9]:
non_enhancer_df['enhancer'] = np.zeros((len(non_enhancer_df),))

In [10]:
non_enhancer_df.head()

|   | id | sequence | enhancer |
|---|---|---|---|
| 0 | CHRX_2970600_2970800 | CAGTCACATCTGTAATCACAATACGTTGGGAGGCTGAGGCAGGAGG... | 0.0 |
| 1 | CHRX_6179400_6179600 | ACTTTGAAGAAGTCAGTCATCAAGATGAGAGACCCAACTGTCAAGC... | 0.0 |
| 2 | CHRX_11003079_11003279 | TCGGCCTCCCAAAGTGCTGGGATTATAGGCATGAGCTACTGCACCC... | 0.0 |
| 3 | CHRX_22042679_22042879 | TGGGAGCTGTATCAATCATGTTTTTTATTTTCTATATTTTATGATG... | 0.0 |
| 4 | CHRX_23280479_23280679 | TACAGCAAATAGCCTTGGCAGATACAGTGTTTCCCTCCAGAGCAAA... | 0.0 |


## Combine the data frames to form a single dataset

In [11]:
all_data_df = pd.concat([enhancer_df, non_enhancer_df])
all_data_df.reset_index(drop=True, inplace=True)
all_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2968 entries, 0 to 2967
Data columns (total 3 columns):
id          2968 non-null object
sequence    2968 non-null object
enhancer    2968 non-null float64
dtypes: float64(1), object(2)
memory usage: 69.6+ KB


In [12]:
all_data_df.head()

|   | id | sequence | enhancer |
|---|---|---|---|
| 0 | CHRX_48897056_48897256 | CACAATGTAGAAGCAGAGACACAGGAACCAGGCTTGGTGATGGCTC... | 1.0 |
| 1 | CHR12_6444339_6444539 | GCCCTCACATTCCCTGGCCCATCCCCTCCACCTCAAAATTTACAAA... | 1.0 |
| 2 | CHR12_6444939_6445139 | GAGCAGGAGGCCAGTCACCCTGAGTCAGCCACGGGGAGACGCTGCA... | 1.0 |
| 3 | CHR12_6445139_6445339 | CCTCTGCTGAGAACAGGACTGGGGCTTCCAGGGCAACAGGAAGGGT... | 1.0 |
| 4 | CHR12_6445339_6445539 | ACAGCCTTAAAGGGAGCTTTTCAGGGACCTCTGGCCAGTGGGGGAT... | 1.0 |


All the sequences are of length 200 characters.

In [13]:
all_data_df['sequence'].map(lambda x: len(x)).value_counts()

200    2968
Name: sequence, dtype: int64

## Load the physicochemical property data

In [14]:
pcp_df = pd.read_csv(os.path.join('data', 'S2.csv'), index_col=0)
pcp_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16 entries, AA to TT
Data columns (total 6 columns):
Rise     16 non-null float64
Roll     16 non-null float64
Shift    16 non-null float64
Slide    16 non-null float64
Tilt     16 non-null float64
Twist    16 non-null float64
dtypes: float64(6)
memory usage: 896.0+ bytes


In [15]:
scaler = MinMaxScaler()
pcp_df.loc[:, :] = scaler.fit_transform(pcp_df.values)
pcp_df

|   | Rise | Roll | Shift | Slide | Tilt | Twist |
|---|---|---|---|---|---|---|
| AA | 0.43 | 0.403 | 1.0 | 0.545 | 0.4 | 0.833 |
| AC | 0.818 | 0.696 | 0.619 | 1.0 | 0.7 | 0.833 |
| AG | 0.258 | 0.316 | 0.763 | 0.773 | 0.3 | 0.792 |
| AT | 0.861 | 1.0 | 0.32 | 0.864 | 0.6 | 0.75 |
| CA | 0.045 | 0.221 | 0.361 | 0.091 | 0.1 | 0.292 |
| CC | 0.548 | 0.171 | 0.732 | 0.545 | 0.3 | 1.0 |
| CG | 0.0 | 0.304 | 0.371 | 0.0 | 0.0 | 0.333 |
| CT | 0.258 | 0.316 | 0.763 | 0.773 | 0.3 | 0.792 |
| GA | 0.706 | 0.278 | 0.619 | 0.5 | 0.4 | 0.833 |
| GC | 1.0 | 0.536 | 0.495 | 0.5 | 1.0 | 0.75 |
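
`MinMaxScaler` rescales each column independently to the range [0, 1] using (x - min) / (max - min). The sketch below reproduces that arithmetic in plain Python on made-up values (they are not taken from `S2.csv`); `min_max_scale` is a hypothetical helper, not part of scikit-learn.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; the column is assumed not to be constant."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw values for a single property column.
raw_twist = [30.0, 36.0, 33.0, 42.0]
scaled_twist = min_max_scale(raw_twist)
print(scaled_twist)
```

After scaling, the smallest raw value in a column maps to 0 and the largest to 1, which is why every column of the table above contains both a 0.0 and a 1.0 somewhere among the 16 dinucleotides.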


## Prepare the sequence data for modelling

Create a transformation pipeline to prepare the training dataset for the RNN.

In [16]:
# This class selects the desired attributes and drops the rest.
class DataFrameSelector(BaseEstimator, TransformerMixin):

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names]

In [17]:
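# This class encodes each nucleotide base (A, C, G, T) as a 4-column binary vector.
# Note: with num_words=4 the Keras Tokenizer keeps only token indices 1-3 (column 0 is unused),
# so the least frequent base in the fitted sequence is encoded as an all-zero row.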
class one_hot_encoder(BaseEstimator, TransformerMixin):

    def __init__(self):
        self.tokenizer = Tokenizer(num_words=4, lower=False, char_level=True)

    def fit(self, X, y=None):
        # Note that X is a data frame.
        # Fit the tokenizer on the 1st sequence in the dataset.
        self.tokenizer.fit_on_texts(X.iloc[0, 0])
        self.len_sequence = len(X.iloc[0, 0])
        return self

    def transform(self, X):
        # Note that X is a data frame.
        one_hot_X = X.iloc[:, 0].map(lambda x: self.tokenizer.texts_to_matrix(x, mode='binary')).values
        one_hot_X = np.concatenate(one_hot_X)
        one_hot_X = np.reshape(one_hot_X, (-1, self.len_sequence, 4))
        return one_hot_X
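
For comparison with the Tokenizer-based encoder above, the sketch below shows a plain one-hot encoding in which every base gets its own column. It is written without Keras, assumes the column order A, C, G, T, and is not the encoding used to train the models in this notebook; `one_hot` is an illustrative name.

```python
BASES = "ACGT"  # assumed column order for this illustration

def one_hot(sequence):
    """Return one row per base, with a single 1 in the column of that base."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence]

encoded = one_hot("CACA")
print(encoded)
```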

In [18]:
# This class converts a sequence of nucleotide bases (A, C, G, T) to a sequence of dinucleotides,
# and then to a sequence of the physicochemical properties of each dinucleotide.
class pcp_encoder(BaseEstimator, TransformerMixin):

    def __init__(self, pcp_df):
        self.pcp_df = pcp_df

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Note that X is a data frame.
        pcp_df = self.pcp_df  # use the table passed to the constructor, not the global variable
        n_props = len(pcp_df.columns)
        # Split each sequence into its overlapping dinucleotides.
        dinuc_seq = X.iloc[:, 0].map(lambda x: [x[i:i+2] for i in range(len(x) - 1)])
        # Look up every property (column) of every dinucleotide (row).
        pcp_seq = dinuc_seq.map(lambda x: [pcp_df.loc[i, j] for i in x for j in pcp_df.columns])
        # Pad with -1 for the last element of the sequence; it does not have an associated dinucleotide.
        pcp_seq = pcp_seq.map(lambda x: np.array(x + [-1.] * n_props).reshape((len(X.iloc[0, 0]), n_props))).values
        # pandas values returns a 1-D array of objects; use numpy stack to reshape it to a multi-dimensional array.
        return np.stack(pcp_seq)
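
The encoding step can be illustrated on a toy example. The sketch below uses plain Python lists and a made-up property table with two properties (the real table has six columns and 16 dinucleotides); `toy_pcp` and `encode_pcp` are hypothetical names.

```python
# Toy property table: two made-up properties for three dinucleotides.
toy_pcp = {"CA": [0.1, 0.2], "AC": [0.8, 0.7], "AT": [0.9, 1.0]}

def encode_pcp(sequence, table, pad=-1.0):
    """Map each overlapping dinucleotide to its property vector and pad the last position."""
    n_props = len(next(iter(table.values())))
    dinucs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    rows = [list(table[d]) for d in dinucs]
    rows.append([pad] * n_props)  # the last base starts no dinucleotide
    return rows

rows = encode_pcp("CACAT", toy_pcp)
print(rows)
```

A sequence of length n yields n - 1 dinucleotides, so the padding row keeps the output at one row per base and lets it line up with the per-base encoding.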

In [19]:
# This class shapes a numpy array.
class Array_Shaper(BaseEstimator, TransformerMixin):
    
    def __init__(self, shape):
        self.shape = shape
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.reshape(self.shape)

In [20]:
attrbs = ['sequence']
num_bases = 4  # number of nucleotide bases
num_pcp = 6  # number of dinucleotide physicochemical properties
len_seq = len(all_data_df['sequence'][0])
one_hot_pipeline = Pipeline([
    ('selector', DataFrameSelector(attrbs)),
    ('one_hot_encoder', one_hot_encoder()),
    ('array_shaper2D', Array_Shaper((-1, num_bases)))
])
pcp_pipeline = Pipeline([
    ('selector', DataFrameSelector(attrbs)),
    ('pcp_encoder', pcp_encoder(pcp_df)),
    ('array_shaper2D', Array_Shaper((-1, num_pcp)))
])
union_pipeline = FeatureUnion(transformer_list=[
    ("one_hot_pipeline", one_hot_pipeline),
    ("pcp_pipeline", pcp_pipeline)
])
my_pipeline = Pipeline([
    ('feature_combiner', union_pipeline),
    ('array_shaper3D', Array_Shaper((-1, len_seq, num_bases + num_pcp)))
])

In [21]:
X = my_pipeline.fit_transform(all_data_df)
X.shape

(2968, 200, 10)
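
The shape follows from how the pipeline combines its two branches: each branch is flattened to one row per sequence position, `FeatureUnion` joins the branches column-wise, and the last `Array_Shaper` regroups the rows into sequences. A minimal sketch of that bookkeeping with plain lists and tiny, made-up sizes (2 sequences of length 3 instead of 2968 of length 200):

```python
n_seq, len_seq, num_bases, num_pcp = 2, 3, 4, 6  # tiny sizes for illustration

# One row per sequence position, as produced by each sub-pipeline.
one_hot_rows = [[0] * num_bases for _ in range(n_seq * len_seq)]
pcp_rows = [[1] * num_pcp for _ in range(n_seq * len_seq)]

# Column-wise concatenation (what FeatureUnion does with the two outputs).
combined_rows = [a + b for a, b in zip(one_hot_rows, pcp_rows)]

# Regroup consecutive rows into sequences: (n_seq, len_seq, num_bases + num_pcp).
combined_3d = [combined_rows[i:i + len_seq] for i in range(0, len(combined_rows), len_seq)]
shape = (len(combined_3d), len(combined_3d[0]), len(combined_3d[0][0]))
print(shape)
```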

Check that the first sequence is correctly encoded.

In [22]:
X[0, :10, :]

array([[0.        , 0.        , 0.        , 0.        , 0.04545455,
        0.22053232, 0.36082474, 0.09090909, 0.1       , 0.29166667],
       [0.        , 0.        , 0.        , 1.        , 0.81818182,
        0.69581749, 0.6185567 , 1.        , 0.7       , 0.83333333],
       [0.        , 0.        , 0.        , 0.        , 0.04545455,
        0.22053232, 0.36082474, 0.09090909, 0.1       , 0.29166667],
       [0.        , 0.        , 0.        , 1.        , 0.43030303,
        0.40304183, 1.        , 0.54545455, 0.4       , 0.83333333],
       [0.        , 0.        , 0.        , 1.        , 0.86060606,
        1.        , 0.31958763, 0.86363636, 0.6       , 0.75      ],
       [0.        , 0.        , 1.        , 0.        , 0.04545455,
        0.22053232, 0.36082474, 0.09090909, 0.1       , 0.29166667],
       [0.        , 1.        , 0.        , 0.        , 0.81818182,
        0.69581749, 0.6185567 , 1.        , 0.7       , 0.83333333],
       [0.        , 0.        , 1.       

In [23]:
X[0, -10:, :]

array([[ 0.        ,  0.        ,  1.        ,  0.        ,  0.43030303,
         0.40304183,  1.        ,  0.54545455,  0.4       ,  0.83333333],
       [ 0.        ,  0.        ,  1.        ,  0.        ,  0.04545455,
         0.22053232,  0.36082474,  0.09090909,  0.1       ,  0.29166667],
       [ 0.        ,  1.        ,  0.        ,  0.        ,  0.81818182,
         0.69581749,  0.6185567 ,  1.        ,  0.7       ,  0.83333333],
       [ 0.        ,  0.        ,  1.        ,  0.        ,  0.70606061,
         0.27756654,  0.6185567 ,  0.5       ,  0.4       ,  0.83333333],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.54848485,
         0.17110266,  0.73195876,  0.54545455,  0.3       ,  1.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.25757576,
         0.31558935,  0.7628866 ,  0.77272727,  0.3       ,  0.79166667],
       [ 0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.13636364

In [24]:
y = all_data_df['enhancer'].values
y.shape

(2968,)

In [25]:
y[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

## Split the dataset into train / validation sets

For the initial base model, we will use a simple train / validation split. 5-fold cross-validation will be used during model fine-tuning to obtain the final model.

In [26]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=123)

In [27]:
X_train.shape

(1988, 200, 10)
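
The split sizes can be checked by hand. The sketch below assumes that `train_test_split` rounds the validation count up and assigns the remainder to training, which matches the shapes reported in this notebook.

```python
import math

n_samples = 2968
test_size = 0.33

# Assumed rounding rule: the held-out fraction is rounded up, the rest is used for training.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val
print(n_train, n_val)
```

This gives 1988 training and 980 validation sequences, so one validation sequence corresponds to roughly 0.1 percentage points of accuracy.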

In [28]:
X_train[0][:10]

array([[0.        , 0.        , 0.        , 0.        , 0.25757576,
        0.31558935, 0.7628866 , 0.77272727, 0.3       , 0.79166667],
       [0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.13636364, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 1.        , 0.81818182,
        0.69581749, 0.6185567 , 1.        , 0.7       , 0.83333333],
       [0.        , 0.        , 0.        , 0.        , 0.54848485,
        0.17110266, 0.73195876, 0.54545455, 0.3       , 1.        ],
       [0.        , 0.        , 0.        , 0.        , 0.04545455,
        0.22053232, 0.36082474, 0.09090909, 0.1       , 0.29166667],
       [0.        , 0.        , 0.        , 1.        , 0.43030303,
        0.40304183, 1.        , 0.54545455, 0.4       , 0.83333333],
       [0.        , 0.        , 0.        , 1.        , 0.86060606,
        1.        , 0.31958763, 0.86363636, 0.6       , 0.75      ],
       [0.        , 0.        , 1.       

In [29]:
y_train.shape

(1988,)

In [30]:
y_train[0]

1.0

## Load models

### Ensemble of models of 2x16 GRU, 1x16 Dense
* Dropouts (.1/.1/.1)
* Warm restarts cycle = 200, lr=0.003
* Saved models are the best in each cycle
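
The notebook does not show the learning-rate schedule behind the warm restarts. A minimal sketch, assuming SGDR-style cosine annealing with a 200-epoch cycle, a peak rate of 0.003, a floor of 0 and 0-based epochs (the annealing shape, the floor and the function name `warm_restart_lr` are assumptions, not taken from the training code):

```python
import math

def warm_restart_lr(epoch, cycle_len=200, max_lr=0.003, min_lr=0.0):
    """Cosine-annealed learning rate that jumps back to max_lr at the start of each cycle."""
    position = (epoch % cycle_len) / cycle_len  # progress within the current cycle, in [0, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * position))

lr_start = warm_restart_lr(0)      # start of the first cycle
lr_mid = warm_restart_lr(100)      # halfway through the first cycle
lr_restart = warm_restart_lr(200)  # first warm restart
print(lr_start, lr_mid, lr_restart)
```

Under such a schedule the rate is lowest at the end of each cycle, which is one reason to keep a snapshot per cycle and combine the snapshots afterwards.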

In [31]:
from numpy.random import seed
seed(123)
from tensorflow import set_random_seed
set_random_seed(456)

In [32]:
# Best model
model = load_model(os.path.join('models', 'pcp-2x16gru1x16dense-dropout010101-wr41.best-epch1283.h5'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_1 (GRU)                  (None, None, 16)          1344      
_________________________________________________________________
gru_2 (GRU)                  (None, 16)                1632      
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 3,265
Trainable params: 3,265
Non-trainable params: 0
_________________________________________________________________


In [33]:
results_df = pd.DataFrame()

In [34]:
results_df['best'] = pd.Series(model.predict_classes(X_val, batch_size=128, verbose=1).flatten())



In [35]:
best_acc = 100. * (len(y_val) - (np.abs(results_df['best'].values - y_val)).sum()) / len(y_val)
best_acc

76.73469387755102
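
For 0/1 labels, the absolute difference between prediction and label is 1 for an error and 0 otherwise, so the expression above is the usual accuracy. A small sketch on made-up predictions (`accuracy_percent` is an illustrative helper):

```python
def accuracy_percent(pred, truth):
    """Accuracy in percent for 0/1 labels, computed from absolute differences."""
    errors = sum(abs(p - t) for p, t in zip(pred, truth))
    return 100.0 * (len(truth) - errors) / len(truth)

pred = [1, 0, 1, 1, 0]
truth = [1.0, 0.0, 0.0, 1.0, 1.0]
acc = accuracy_percent(pred, truth)

# Same quantity computed as the share of matching entries.
matches = 100.0 * sum(p == t for p, t in zip(pred, truth)) / len(truth)
print(acc, matches)
```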

The best model from training was saved separately from the per-cycle models loaded below, so it is identical to the best model of one of the cycles. The best model from the first cycle (epochs 1-200) is not used because the loss had not yet converged to a minimum.

In [36]:
# Path to the weights saved for the best model in each cycle
path = [os.path.join('models', 'pcp-2x16gru1x16dense-dropout010101-wr41-weights', i)
        for i in ['model_wgts_cyc0400.h5', 'model_wgts_cyc0600.h5', 'model_wgts_cyc0800.h5', 'model_wgts_cyc1000.h5', 'model_wgts_cyc1200.h5', 'model_wgts_cyc1400.h5', 'model_wgts_cyc1600.h5',
                 'model_wgts_cyc1800.h5', 'model_wgts_cyc2000.h5', 'model_wgts_cyc2200.h5', 'model_wgts_cyc2400.h5', 'model_wgts_cyc2600.h5', 'model_wgts_cyc2800.h5', 'model_wgts_cyc3000.h5']]
path

['models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc0400.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc0600.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc0800.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1000.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1200.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1400.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1600.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1800.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc2000.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc2200.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc2400.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc2600.h5',
 'models/pcp-2x16gru1x16dens
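
The file names follow a regular pattern (one file per 200-epoch cycle, from epoch 400 to epoch 3000), so the same list can be generated instead of typed out. A sketch, assuming the naming pattern seen in the list above:

```python
import os

weights_dir = os.path.join('models', 'pcp-2x16gru1x16dense-dropout010101-wr41-weights')

# Cycles end at epochs 400, 600, ..., 3000; the first cycle (epoch 200) is skipped.
path = [os.path.join(weights_dir, 'model_wgts_cyc{:04d}.h5'.format(epoch))
        for epoch in range(400, 3001, 200)]
print(len(path), path[0], path[-1])
```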

In [37]:
# Calculate the predictions for the best models in each cycle.
for idx, a_file in enumerate(path):
    model.load_weights(filepath=a_file, by_name=False)
    results_df['model' + str(idx)] = model.predict_classes(X_val, batch_size=128, verbose=1)



In [38]:
# Compare the predictions of each cycle's best model with the overall best model.
# Count the number of different predictions.
for i in range(14):
    print((results_df['best'] - results_df['model' + str(i)]).abs().sum())

116
99
86
64
58
0
31
44
66
61
65
74
83
82
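
Each number above is the count of validation sequences on which a cycle's model and the overall best model disagree. The 0 for model 5 is consistent with the best checkpoint (epoch 1283) falling in the cycle that ends at epoch 1400. A sketch of the same count on made-up predictions (`disagreements` is an illustrative helper):

```python
def disagreements(a, b):
    """Number of positions where two 0/1 prediction lists differ."""
    return sum(abs(x - y) for x, y in zip(a, b))

best = [1, 0, 1, 1, 0, 0]
other = [1, 1, 1, 0, 0, 0]
n_diff_other = disagreements(best, other)
n_diff_self = disagreements(best, best)
print(n_diff_other, n_diff_self)
```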


In [39]:
# calculate the accuracy for each model
for i in range(14):
    acc = 100. * (len(y_val) - (np.abs(results_df['model' + str(i)].values - y_val)).sum()) / len(y_val)
    print('model' + str(i) + ' = {:06.3f}%'.format(acc))

model0 = 75.510%
model1 = 76.224%
model2 = 76.531%
model3 = 76.531%
model4 = 76.531%
model5 = 76.735%
model6 = 76.429%
model7 = 75.918%
model8 = 75.918%
model9 = 75.408%
model10 = 75.612%
model11 = 75.306%
model12 = 75.408%
model13 = 75.714%


#### Evaluate the Model Ensembles
#### Use models 1-13 (13 models)

In [40]:
acc_list = []

In [41]:
cols = results_df.columns.tolist()
cols.remove('best')
cols.remove('model0')

# Decision threshold for majority voting
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))

threshold = 6


In [42]:
# Calculate the ensemble prediction
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,model10,model11,model12,model13,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [43]:
# Calculate the model ensemble accuracy 
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('1-13', 13, acc))

model ensemble accuracy = 76.531%
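
The ensemble prediction is a majority vote: a sample is labelled 1 when more than `len(cols) // 2` models predict 1. A plain-Python sketch of the same rule on made-up votes (`majority_vote` is an illustrative helper):

```python
def majority_vote(votes):
    """Return 1 if more than half of the 0/1 votes are 1, else 0."""
    threshold = len(votes) // 2
    return 1 if sum(votes) > threshold else 0

# Each inner list holds the predictions of 5 hypothetical models for one sample.
samples = [[1, 1, 1, 0, 0], [1, 0, 0, 1, 0], [0, 0, 0, 0, 0]]
ensemble = [majority_vote(v) for v in samples]
print(ensemble)
```

With an odd number of models there are no ties. With an even number, a tie would be resolved as 0 by this rule, which is a reason to keep the ensemble sizes odd, as done throughout this section.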


#### Use models 2-12 (11 models)

In [44]:
cols = results_df.columns.tolist()
for i in ['best', 'model0', 'model1', 'model13', 'ensemble']:
    cols.remove(i)
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 5


Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,model10,model11,model12,model13,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [45]:
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('2-12', 11, acc))

model ensemble accuracy = 76.633%


#### Use models 2-10 (9 models)

In [46]:
cols = results_df.columns.tolist()
for i in ['best', 'model0', 'model1', 'model11', 'model12', 'model13', 'ensemble']:
    cols.remove(i)
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 4


Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,model10,model11,model12,model13,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [47]:
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('2-10', 9, acc))

model ensemble accuracy = 77.347%


#### Use models 2-8 (7 models)

In [48]:
cols = results_df.columns.tolist()
for i in ['best', 'model0', 'model1', 'model9', 'model10', 'model11', 'model12', 'model13', 'ensemble']:
    cols.remove(i)
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 3


Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,model10,model11,model12,model13,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [49]:
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('2-8', 7, acc))

model ensemble accuracy = 77.245%


#### Use models 3-7 (5 models)

In [50]:
cols = results_df.columns.tolist()
for i in ['best', 'model0', 'model1', 'model2', 'model8', 'model9', 'model10', 'model11', 'model12', 'model13', 'ensemble']:
    cols.remove(i)
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 2


Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,model10,model11,model12,model13,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [51]:
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('3-7', 5, acc))

model ensemble accuracy = 77.449%


#### Use models 4-6 (3 models)

In [52]:
cols = results_df.columns.tolist()
for i in ['best', 'model0', 'model1', 'model2', 'model3', 'model7', 'model8', 'model9', 'model10', 'model11', 'model12', 'model13', 'ensemble']:
    cols.remove(i)
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 1


Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,model10,model11,model12,model13,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [53]:
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('4-6', 3, acc))

model ensemble accuracy = 76.837%


#### Use models 5, 9, 13 (3 models)

In [54]:
cols = ['model5', 'model9', 'model13']
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('5, 9, 13', 3, acc))

threshold = 1
model ensemble accuracy = 76.122%


#### Use models 2-6 (5 models)

In [55]:
cols = ['model2', 'model3', 'model4', 'model5', 'model6']
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('2-6', 5, acc))

threshold = 2
model ensemble accuracy = 76.837%


In [56]:
acc_df = pd.DataFrame(acc_list, columns=['Models', 'Number of models', 'Accuracy'])
acc_df

|   | Models | Number of models | Accuracy |
|---|---|---|---|
| 0 | 1-13 | 13 | 76.531 |
| 1 | 2-12 | 11 | 76.633 |
| 2 | 2-10 | 9 | 77.347 |
| 3 | 2-8 | 7 | 77.245 |
| 4 | 3-7 | 5 | 77.449 |
| 5 | 4-6 | 3 | 76.837 |
| 6 | 5, 9, 13 | 3 | 76.122 |
| 7 | 2-6 | 5 | 76.837 |


**Model ensemble 4** (models 3-7) has the best accuracy at 77.449%, which beats the best single model accuracy of 76.735%.

It is probably better to take a total of 5-9 models around the best model.
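
Selecting the models "around" the best one can be written as a small helper that returns a window of consecutive cycles centred on the best cycle. A sketch, assuming an odd window size and the model numbering used above (model 0 excluded, models 1-13 available); `window_around` is a hypothetical name:

```python
def window_around(best_idx, n_models, first_idx, last_idx):
    """Indices of n_models consecutive cycles centred on best_idx (n_models must be odd)."""
    half = n_models // 2
    start, stop = best_idx - half, best_idx + half
    if start < first_idx or stop > last_idx:
        raise ValueError('window does not fit inside the available cycles')
    return list(range(start, stop + 1))

# The 5-model window around model 5 reproduces the "models 3-7" ensemble.
cols = ['model' + str(i) for i in window_around(best_idx=5, n_models=5, first_idx=1, last_idx=13)]
print(cols)
```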

### Ensemble of models of 1x16 conv1D, 2x16 GRU bidirectional, 1x8 Dense
* Dropouts (.6/.6/.6)
* Warm restarts cycle = 200, max_lr=.001
* Saved models are the best in each cycle

In [57]:
from numpy.random import seed
seed(123)
from tensorflow import set_random_seed
set_random_seed(456)

In [58]:
acc_list = []

In [59]:
# Best model
model = load_model(os.path.join('models', 'pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44.best-epch871.h5'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 192, 16)           1456      
_________________________________________________________________
batch_normalization_1 (Batch (None, 192, 16)           64        
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 96, 16)            0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 96, 32)            3264      
_________________________________________________________________
bidirectional_2 (Bidirection (None, 32)                4800      
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 264       
__________

In [60]:
results_df = pd.DataFrame()

In [61]:
results_df['best'] = pd.Series(model.predict_classes(X_val, batch_size=128, verbose=1).flatten())



In [62]:
best_acc = 100. * (len(y_val) - (np.abs(results_df['best'].values - y_val)).sum()) / len(y_val)
'{:06.3f}%'.format(best_acc)

'76.224%'

In [63]:
path = [os.path.join('models', 'pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights', i)
        for i in ['model_wgts_cyc0200.h5', 'model_wgts_cyc0400.h5', 'model_wgts_cyc0600.h5',
                  'model_wgts_cyc0800.h5', 'model_wgts_cyc1000.h5', 'model_wgts_cyc1200.h5',
                  'model_wgts_cyc1400.h5', 'model_wgts_cyc1600.h5',
                  'model_wgts_cyc1800.h5', 'model_wgts_cyc2000.h5']]
path

['models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc0200.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc0400.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc0600.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc0800.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc1000.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc1200.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc1400.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc1600.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc1800.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc2000.h5']

In [64]:
for idx, a_file in enumerate(path):
    model.load_weights(filepath=a_file, by_name=False)
    results_df['model' + str(idx)] = model.predict_classes(X_val, batch_size=128, verbose=1)



In [65]:
for i in range(10):
    print((results_df['best'] - results_df['model' + str(i)]).abs().sum())

85
58
54
37
0
55
32
37
61
61


In [66]:
for i in range(10):
    acc = 100. * (len(y_val) - (np.abs(results_df['model' + str(i)].values - y_val)).sum()) / len(y_val)
    print('model' + str(i) + ' = {:06.3f}%'.format(acc))

model0 = 74.694%
model1 = 76.224%
model2 = 75.816%
model3 = 75.918%
model4 = 76.224%
model5 = 75.510%
model6 = 75.612%
model7 = 75.510%
model8 = 75.918%
model9 = 75.714%


#### Use models 1-9 (9 models)

In [67]:
cols = results_df.columns.tolist()
cols.remove('best')
cols.remove('model0')
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 4


Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0


In [68]:
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('1-9', 9, acc))

model ensemble accuracy = 76.020%


#### Use models 2-8 (7 models)

In [69]:
cols = results_df.columns.tolist()
for i in ['best', 'model0', 'model1', 'model9', 'ensemble']:
    cols.remove(i)
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 3


Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0


In [70]:
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('2-8', 7, acc))

model ensemble accuracy = 75.816%


#### Use models 2-6 (5 models)

In [71]:
cols = results_df.columns.tolist()
for i in ['best', 'model0', 'model1', 'model7', 'model8', 'model9', 'ensemble']:
    cols.remove(i)
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 2


Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0


In [72]:
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('2-6', 5, acc))

model ensemble accuracy = 75.816%


#### Use models 3-5 (3 models)

In [73]:
cols = results_df.columns.tolist()
for i in ['best', 'model0', 'model1', 'model2', 'model6', 'model7', 'model8', 'model9', 'ensemble']:
    cols.remove(i)
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 1


Unnamed: 0,best,model0,model1,model2,model3,model4,model5,model6,model7,model8,model9,ensemble
0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,1,1,1,1,1,1,1
3,1,1,1,1,1,1,1,1,1,1,1,1
4,0,0,0,0,0,0,0,0,0,0,0,0


In [74]:
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('3-5', 3, acc))

model ensemble accuracy = 75.918%


#### Use models 1, 4, 7 (3 models)

In [75]:
cols = ['model1', 'model4', 'model7']
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('1, 4, 7', 3, acc))

threshold = 1
model ensemble accuracy = 76.429%


#### Use models 1, 4, 8 (3 best models)

In [76]:
cols = ['model1', 'model4', 'model8']
threshold = len(cols) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.loc[:, cols].apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
acc = 100. * (len(y_val) - (np.abs(results_df['ensemble'].values - y_val)).sum()) / len(y_val)
print('model ensemble accuracy = {:06.3f}%'.format(acc))
acc_list.append(('1, 4, 8', 3, acc))

threshold = 1
model ensemble accuracy = 76.633%


In [77]:
acc_df = pd.DataFrame(acc_list, columns=['Models', 'Number of models', 'Accuracy'])
acc_df

| | Models | Number of models | Accuracy |
|---|---|---|---|
| 0 | 1-9 | 9 | 76.02 |
| 1 | 2-8 | 7 | 75.816 |
| 2 | 2-6 | 5 | 75.816 |
| 3 | 3-5 | 3 | 75.918 |
| 4 | 1, 4, 7 | 3 | 76.429 |
| 5 | 1, 4, 8 | 3 | 76.633 |


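**Model ensemble 4** (models 1, 4, 7) has the best accuracy among the ensembles considered for use, at 76.429%. Ensemble 5 (models 1, 4, 8) scores higher at 76.633% but is included only for comparison, because the best models will fall in different cycles in different training runs.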

Taking 3 models around the best model (1, 4, 7) beats the best single model accuracy of 76.224%.
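
The voting rule used in the cells above (predict 1 when the number of positive votes exceeds `len(cols) // 2`) can be sketched without pandas as follows. This is only an illustration: the helper names `majority_vote` and `accuracy_percent` and the toy predictions are made up and do not come from the notebook's data.

```python
def majority_vote(predictions):
    """Combine per-model 0/1 predictions (one list per model) by strict majority."""
    threshold = len(predictions) // 2  # same threshold rule as the cells above
    n_samples = len(predictions[0])
    return [1 if sum(p[i] for p in predictions) > threshold else 0
            for i in range(n_samples)]


def accuracy_percent(y_true, y_pred):
    """Accuracy in percent, computed as in the notebook: 100 * (n - errors) / n."""
    errors = sum(abs(p - t) for p, t in zip(y_pred, y_true))
    return 100.0 * (len(y_true) - errors) / len(y_true)


# Toy data (hypothetical): each model makes one mistake, on a different sample.
y_true = [1, 0, 1, 0, 0]
votes = [
    [1, 0, 1, 1, 0],  # hypothetical model A, wrong on sample 3
    [1, 1, 1, 0, 0],  # hypothetical model B, wrong on sample 1
    [0, 0, 1, 0, 0],  # hypothetical model C, wrong on sample 0
]

ensemble = majority_vote(votes)
print(ensemble, accuracy_percent(y_true, ensemble))
```

Because the comparison is strict (`>`), an even number of models with a tied vote resolves to class 0, which is one reason to prefer an odd number of models as in the runs above.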

## Evaluate model ensembles and best single models on the test set

In [78]:
test_df = pd.read_csv(os.path.join('data', 'independent.csv'))
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
id          400 non-null object
type        400 non-null object
sequence    400 non-null object
dtypes: object(3)
memory usage: 9.5+ KB


In [79]:
test_df.head()

Unnamed: 0,id,type,sequence
0,Chr11_6627824_6628024,strong,ATGCTGCCAGAAGGAAAAGGGGTGGAATTAATGAAACTGGAAGGTT...
1,Chr11_9587224_9587424,strong,GGCATTTTTTAACCTGTGTTTCATTTTCATCTGTGAAATGTGAATA...
2,Chr11_65187024_65187224,strong,GAAACCACAGAGCTGACCTGGCTTCAGAACAAGATGTGGGGCTCCA...
3,Chr10_74014594_74014794,strong,TTTGCATAGGGGCATTACCACTGGACTTGGGCTCAGAGCAAGTGTT...
4,Chr10_105667810_105668010,strong,CGGGAGGCGGGGGTTGCAGTGAGCCAAGATCACACCACTGCACTCC...


In [80]:
test_df['type'].value_counts()

non-enhancer    200
weak            100
strong          100
Name: type, dtype: int64

In [81]:
test_df['enhancer'] = test_df['type'].map(lambda x: 0.0 if x == 'non-enhancer' else 1.0)

In [82]:
test_df.head()

Unnamed: 0,id,type,sequence,enhancer
0,Chr11_6627824_6628024,strong,ATGCTGCCAGAAGGAAAAGGGGTGGAATTAATGAAACTGGAAGGTT...,1.0
1,Chr11_9587224_9587424,strong,GGCATTTTTTAACCTGTGTTTCATTTTCATCTGTGAAATGTGAATA...,1.0
2,Chr11_65187024_65187224,strong,GAAACCACAGAGCTGACCTGGCTTCAGAACAAGATGTGGGGCTCCA...,1.0
3,Chr10_74014594_74014794,strong,TTTGCATAGGGGCATTACCACTGGACTTGGGCTCAGAGCAAGTGTT...,1.0
4,Chr10_105667810_105668010,strong,CGGGAGGCGGGGGTTGCAGTGAGCCAAGATCACACCACTGCACTCC...,1.0


In [83]:
test_df.loc[test_df['type'] == 'weak', :][:5]

Unnamed: 0,id,type,sequence,enhancer
100,hg19_ct_UserTrack_3545_11005,weak,TTATGGTCACCTTCGACCCCAGAAATAATGGTCTCTGTTGTCAGAT...,1.0
101,hg19_ct_UserTrack_3545_8529,weak,CATCCAGGCTTGGTCCTGGTTGTTCCTTGCTGTTATACCAGCCTGG...,1.0
102,hg19_ct_UserTrack_3545_7245,weak,TTGTTTTTTTCTGTTTTGAGACGGAGTTTCGCTCTTGTTGCCCAGG...,1.0
103,hg19_ct_UserTrack_3545_12669,weak,ACTGTTAAATAGCAAAAATTATTGAGCTCAAACCATCTAACCAGGT...,1.0
104,hg19_ct_UserTrack_3545_5404,weak,GAGAATTAAGTTTGTATTAAGTTGGAGACCAGGGCAGATGGAAAGA...,1.0


In [84]:
test_df.loc[test_df['type'] == 'non-enhancer', :][:5]

Unnamed: 0,id,type,sequence,enhancer
200,hg19_ct_UserTrack_3545_158,non-enhancer,AATTTTCTCATTTTCTCATAAAGTTTAACAGTTGTTTATTTGAGTC...,0.0
201,hg19_ct_UserTrack_3545_57,non-enhancer,ACTGGTTATCTTTTAGGACTAGTTAATATAACCCATTCTCTAACCA...,0.0
202,hg19_ct_UserTrack_3545_762,non-enhancer,ATGCATATGTTCTTCAGTAAACAGAGCAGCCACTGGTACCACAGGA...,0.0
203,hg19_ct_UserTrack_3545_78,non-enhancer,CTGCTCTCCTCGCTCTATAAAAGTCAGAGTGCCTAAGCTGTTAATT...,0.0
204,hg19_ct_UserTrack_3545_9,non-enhancer,GCTTGGGTATATATTGTCCAATATAGCAGGCCTCATGTGCTCCTTA...,0.0


In [85]:
# Wrong: X_test = my_pipeline.fit_transform(test_df) would refit the pipeline on the test set.
# Apply the already-fitted pipeline instead.
X_test = my_pipeline.transform(test_df)
X_test.shape

(400, 200, 10)
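
The difference between `transform` and `fit_transform` on the test set can be illustrated with a toy scaler. This sketch assumes the pipeline contains a fitted scaling step (the feature values in [0, 1] suggest one, but the pipeline definition is not shown in this section); `MinMaxScaler1D` is a hypothetical stand-in, not the notebook's transformer.

```python
class MinMaxScaler1D:
    """Hypothetical stand-in for one fitted scaling step of a preprocessing pipeline."""

    def fit(self, values):
        # Learn the range from the data passed in.
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        # Scale using the range learned in fit(), whatever data is passed here.
        span = self.hi - self.lo
        return [(v - self.lo) / span for v in values]


train = [0.0, 5.0, 10.0]  # toy training feature
test = [2.0, 4.0]         # toy test feature

scaler = MinMaxScaler1D().fit(train)
test_ok = scaler.transform(test)                         # scaled with the training range
test_refit = MinMaxScaler1D().fit(test).transform(test)  # refitted on test: different scale

print(test_ok)
print(test_refit)
```

With the refitted scaler the same test values land on a different scale from the one the model was trained on, so the model's inputs would no longer be comparable with the training inputs.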

In [86]:
X_test[0, :10, :]

array([[0.        , 0.        , 0.        , 1.        , 0.86060606,
        1.        , 0.31958763, 0.86363636, 0.6       , 0.75      ],
       [0.        , 0.        , 1.        , 0.        , 0.04545455,
        0.22053232, 0.36082474, 0.09090909, 0.1       , 0.29166667],
       [0.        , 1.        , 0.        , 0.        , 1.        ,
        0.53612167, 0.49484536, 0.5       , 1.        , 0.75      ],
       [0.        , 0.        , 0.        , 0.        , 0.25757576,
        0.31558935, 0.7628866 , 0.77272727, 0.3       , 0.79166667],
       [0.        , 0.        , 1.        , 0.        , 0.04545455,
        0.22053232, 0.36082474, 0.09090909, 0.1       , 0.29166667],
       [0.        , 1.        , 0.        , 0.        , 1.        ,
        0.53612167, 0.49484536, 0.5       , 1.        , 0.75      ],
       [0.        , 0.        , 0.        , 0.        , 0.54848485,
        0.17110266, 0.73195876, 0.54545455, 0.3       , 1.        ],
       [0.        , 0.        , 0.       

In [87]:
y_test = test_df['enhancer'].values
y_test.shape

(400,)

In [88]:
y_test[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

### Parameters, Functions for Scoring

In [89]:
acc_list = []

In [90]:
# custom scorer function for specificity
def specificity(y, y_pred):
    # print(y_pred)
    tn, fp = 0, 0
    for idx, a_pred in enumerate(y_pred):
        if a_pred and not(y[idx]):
            fp += 1
        elif not(a_pred) and not(y[idx]):
            tn += 1
    return tn/(tn+fp)

In [91]:
# function for all scores
def all_scores(y, y_pred):
    acc = accuracy_score(y, y_pred) * 100.
    recall = recall_score(y, y_pred) * 100.
    spec = specificity(y, y_pred) * 100.
    auc = roc_auc_score(y, y_pred) * 100.
    mcc = matthews_corrcoef(y, y_pred)
    return (acc, mcc, recall, spec, auc)
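
Note that `all_scores` passes hard 0/1 class predictions to `roc_auc_score`. In that case the ROC curve has a single interior point, and the trapezoidal area reduces to the mean of recall and specificity. The sketch below shows that identity with plain Python; the helper names and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary 0/1 labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif not t and p:
            fp += 1
        elif not t and not p:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def hard_label_auc(y_true, y_pred):
    """Area under the ROC curve through (0,0), (FPR,TPR), (1,1): (recall + specificity) / 2."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn)
    spec = tn / (tn + fp)
    return (recall + spec) / 2.0


# Toy labels (hypothetical): recall = 3/4, specificity = 2/4.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 1, 1]
print(confusion_counts(y_true_toy, y_pred_toy), hard_label_auc(y_true_toy, y_pred_toy))
```

A threshold-independent AUC would need the predicted probabilities rather than the class labels.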

### 2x16 GRU, 1x16 Dense Best Single Model
* Dropout 0.1
* Best accuracy @epoch 865

In [92]:
# Best model
model = load_model(os.path.join('models', 'pcp-2x16gru1x16dense-dropout0101.best-epch865.h5'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_1 (GRU)                  (None, None, 16)          1344      
_________________________________________________________________
gru_2 (GRU)                  (None, 16)                1632      
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 3,265
Trainable params: 3,265
Non-trainable params: 0
_________________________________________________________________


In [93]:
y_pred = model.predict_classes(X_test, batch_size=None, verbose=1)



In [94]:
acc_list.append(('2x16GRU 1x16D', 'single') + all_scores(y_test, y_pred))

### 1x16 Conv1D, 2x16 GRU Bidirectional, 1x8 Dense Best Single Model
* Dropout 0.6
* Best accuracy @epoch 693

In [95]:
# Best model
model = load_model(os.path.join('models', 'pcp-1x16cv-2x16gruB-1x8d-dropout060606.best-epch693.h5'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_3 (Conv1D)            (None, 192, 16)           1456      
_________________________________________________________________
batch_normalization_3 (Batch (None, 192, 16)           64        
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 96, 16)            0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 96, 32)            3264      
_________________________________________________________________
bidirectional_2 (Bidirection (None, 32)                4800      
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 264       
__________

In [96]:
y_pred = model.predict_classes(X_test, batch_size=None, verbose=1)



In [97]:
acc_list.append(('1x16 Conv1D 2x16 GRU-B 1x8 Dense', 'single') + all_scores(y_test, y_pred))

### Model Ensemble of 2x16 GRU, 1x16 Dense using 5 models
* Dropouts 0.1
* Warm restarts cycle = 200, lr=0.003
* Using models 3-7 out of 0-10
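
For readers unfamiliar with warm restarts, the following is a minimal sketch of a cyclic learning-rate schedule with a cycle of 200 epochs and a peak rate of 0.003. It assumes cosine annealing down to a minimum of 0 and a snapshot at the end of each cycle. Neither assumption is confirmed in this section, and the function names are hypothetical.

```python
import math


def warm_restart_lr(epoch, cycle_len=200, max_lr=0.003, min_lr=0.0):
    """Learning rate at a 0-based epoch: restart at max_lr, then cosine decay (assumed shape)."""
    t = epoch % cycle_len  # position inside the current cycle
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t / cycle_len))


def is_snapshot_epoch(epoch, cycle_len=200):
    """True on the last epoch of a cycle, where a snapshot would be saved (assumption)."""
    return (epoch + 1) % cycle_len == 0


for epoch in (0, 100, 199, 200):
    print(epoch, round(warm_restart_lr(epoch), 6), is_snapshot_epoch(epoch))
```

Under these assumptions each cycle ends at a low learning rate, so the weights saved there sit near a local minimum, and the weight files used below are named at multiples of 200, which is consistent with one snapshot per cycle.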

In [98]:
# Best model. Use this model to load the saved weights for the model ensemble.
model = load_model(os.path.join('models', 'pcp-2x16gru1x16dense-dropout010101-wr41.best-epch1283.h5'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru_1 (GRU)                  (None, None, 16)          1344      
_________________________________________________________________
gru_2 (GRU)                  (None, 16)                1632      
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 3,265
Trainable params: 3,265
Non-trainable params: 0
_________________________________________________________________


In [99]:
results_df = pd.DataFrame()

In [100]:
path = [os.path.join('models', 'pcp-2x16gru1x16dense-dropout010101-wr41-weights', i)
        for i in ['model_wgts_cyc1000.h5', 'model_wgts_cyc1200.h5', 'model_wgts_cyc1400.h5',
                  'model_wgts_cyc1600.h5', 'model_wgts_cyc1800.h5']]
path

['models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1000.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1200.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1400.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1600.h5',
 'models/pcp-2x16gru1x16dense-dropout010101-wr41-weights/model_wgts_cyc1800.h5']

In [101]:
for idx, a_file in enumerate(path):
    model.load_weights(filepath=a_file, by_name=False)
    results_df['model' + str(idx)] = pd.Series(model.predict_classes(X_test, batch_size=128, verbose=1).flatten())



In [102]:
results_df.head()

Unnamed: 0,model0,model1,model2,model3,model4
0,1,1,1,1,1
1,0,0,1,1,0
2,1,1,1,1,1
3,1,1,1,1,1
4,0,1,1,1,1


In [103]:
threshold = len(results_df.columns) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 2


Unnamed: 0,model0,model1,model2,model3,model4,ensemble
0,1,1,1,1,1,1
1,0,0,1,1,0,0
2,1,1,1,1,1,1
3,1,1,1,1,1,1
4,0,1,1,1,1,1


In [104]:
acc_list.append(('2x16GRU 1x16D', 'ensemble') + all_scores(y_test, results_df['ensemble'].values))


### Model Ensemble of 1x16 Conv1D, 2x16 GRU Bidirectional, 1x8 Dense using 3 models
* Dropouts (.6/.6/.6)
* Warm restarts cycle = 200, max_lr=.001
* Using models 1, 4, 7 out of 0-9

In [105]:
# Best model
model = load_model(os.path.join('models', 'pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44.best-epch871.h5'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv1d_1 (Conv1D)            (None, 192, 16)           1456      
_________________________________________________________________
batch_normalization_1 (Batch (None, 192, 16)           64        
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 96, 16)            0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 96, 32)            3264      
_________________________________________________________________
bidirectional_2 (Bidirection (None, 32)                4800      
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 264       
__________

In [106]:
results_df = pd.DataFrame()

In [107]:
path = [os.path.join('models', 'pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights', i)
        for i in ['model_wgts_cyc0400.h5', 'model_wgts_cyc1000.h5', 'model_wgts_cyc1600.h5']]
path

['models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc0400.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc1000.h5',
 'models/pcp-1x16cv-2x16gruB-1x8d-dropout060606-wr44-weights/model_wgts_cyc1600.h5']

In [108]:
for idx, a_file in enumerate(path):
    model.load_weights(filepath=a_file, by_name=False)
    results_df['model' + str(idx)] = pd.Series(model.predict_classes(X_test, batch_size=128, verbose=1).flatten())



In [109]:
results_df.head()

Unnamed: 0,model0,model1,model2
0,1,1,1
1,1,1,1
2,1,1,1
3,1,1,1
4,1,1,1


In [110]:
threshold = len(results_df.columns) // 2
print('threshold = {}'.format(threshold))
results_df['ensemble'] = results_df.apply(lambda x: 1 if x.sum() > threshold else 0, axis=1)
results_df.head()

threshold = 1


Unnamed: 0,model0,model1,model2,ensemble
0,1,1,1,1
1,1,1,1,1
2,1,1,1,1
3,1,1,1,1
4,1,1,1,1


In [111]:
acc_list.append(('1x16 Conv1D 2x16 GRU-B 1x8 Dense', 'ensemble')
                + all_scores(y_test, results_df['ensemble'].values))

In [112]:
acc_df = pd.DataFrame(acc_list, columns=['Model', 'Type', 'Accuracy(%)', 'MCC', 'Recall(%)', 'Specificity(%)',
                                         'AUC(%)'])
acc_df

| | Model | Type | Accuracy(%) | MCC | Recall(%) | Specificity(%) | AUC(%) |
|---|---|---|---|---|---|---|---|
| 0 | 2x16GRU 1x16D | single | 74.0 | 0.48 | 75.0 | 73.0 | 74.0 |
| 1 | 1x16 Conv1D 2x16 GRU-B 1x8 Dense | single | 73.75 | 0.475 | 75.0 | 72.5 | 73.75 |
| 2 | 2x16GRU 1x16D | ensemble | 75.25 | 0.506 | 73.0 | 77.5 | 75.25 |
| 3 | 1x16 Conv1D 2x16 GRU-B 1x8 Dense | ensemble | 75.5 | 0.51 | 75.0 | 76.0 | 75.5 |
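
As a consistency check, the MCC column can be recomputed from the recall and specificity columns, given that the test set has 200 enhancers (100 strong, 100 weak) and 200 non-enhancers. The sketch below does this with the standard MCC formula; `mcc_from_rates` is a hypothetical helper, and the confusion counts are reconstructed from the rounded percentages rather than taken from the models' predictions.

```python
import math


def mcc_from_rates(recall_pct, specificity_pct, n_pos=200, n_neg=200):
    """Rebuild confusion counts from recall/specificity and return the MCC."""
    tp = round(recall_pct / 100.0 * n_pos)
    fn = n_pos - tp
    tn = round(specificity_pct / 100.0 * n_neg)
    fp = n_neg - tn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


# (Recall(%), Specificity(%)) for rows 0-3 of the table above.
rows = [(75.0, 73.0), (75.0, 72.5), (73.0, 77.5), (75.0, 76.0)]
for recall, spec in rows:
    print(recall, spec, round(mcc_from_rates(recall, spec), 3))
```

Because the two classes are the same size, accuracy equals the mean of recall and specificity, which is also why the Accuracy(%) and AUC(%) columns coincide when AUC is computed from hard class labels.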
