# RawMSA test implementation

In this notebook I test my implementation of the rawMSA network for predicting secondary structure and solvent accessibility (its original tasks). In other notebooks I will adapt the network to predict variant effects. I am using the very limited Gray2018 dataset, which covers 9 proteins.

In [11]:
import numpy as np
import joblib
import pandas as pd; pd.set_option('display.max_columns', None)

df_full = pd.read_csv('../dataset/gray2018/dmsTraining_2017-02-20.csv')
df = df_full

# some of the studies in the training set were excluded from training in the original paper
excluded_studies = ['Brca1_E3', 'Brca1_Y2H', 'E3_ligase']
for study in excluded_studies:
    df = df[df['dms_id'] != study]
    
# many of the columns are not needed for this task
df = df[['uniprot_id', 'position', 'dssp_sec_str', 'accessibility']]

# for this task I don't care about different mutations at the same position, I want each position only once
# I can drop duplicates since I removed all the columns relating to the kind of mutation
df = df.drop_duplicates()
    
# boolean vectors to slice the rows which have a value
has_dssp = df['dssp_sec_str'].notna()
has_accessibility = df['accessibility'].notna()

df_dssp = df[has_dssp]
df_accessibility = df[has_accessibility]

## Extracting the secondary structures and accessibility

Many sequence positions in the Gray2018 dataset have an annotated DSSP secondary structure and solvent accessibility. Here I extract these annotations and store them in label vectors.

In [22]:
out_rawmsa_test_path = '../processing/raw_msa/implementation_test_grey2018/'

# reduction of the 8 dssp classes to 3 classes
ss_d = {
        "H": "H",
        "B": "E",
        "E": "E",
        "G": "-",
        "I": "-",
        "T": "-",
        "S": "-",
        ".": "-", # should be a space but it was replaced by . in the Gray2018 dataset
    }
# need to code each class with an integer
ss_sparse = {
        "H": 0,
        "E": 1,
        "-": 2,
    }
ss_list = [ss_sparse[ss_d[el]] for el in df['dssp_sec_str'][has_dssp]]

# creating and saving the label arrays
ss_array = np.array(ss_list).reshape(-1,1)
accessibility_array = np.array(df['accessibility'][has_accessibility]).reshape(-1,1)
joblib.dump(ss_array, out_rawmsa_test_path + 'ss_labels.np.joblib.xz')
joblib.dump(accessibility_array, out_rawmsa_test_path + 'accessibility_labels.np.joblib.xz')

['../processing/raw_msa/implementation_test_grey2018/accessibility_labels.np.joblib.xz']
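
As a quick check of the two-step label encoding above, here is a minimal sketch on a made-up DSSP string. The two dictionaries are copied from the cell above; the string `toy_dssp` and the helper `encode_dssp` are illustrative names that do not appear elsewhere in the notebook.

```python
# 8-state DSSP -> 3-state reduction, copied from the cell above
ss_d = {"H": "H", "B": "E", "E": "E", "G": "-", "I": "-", "T": "-", "S": "-", ".": "-"}
# integer code for each of the 3 classes
ss_sparse = {"H": 0, "E": 1, "-": 2}


def encode_dssp(dssp_string):
    """Hypothetical helper: map each DSSP character to its integer class."""
    return [ss_sparse[ss_d[ch]] for ch in dssp_string]


toy_dssp = "HHGE.BT"  # made-up example, not taken from the dataset
encoded = encode_dssp(toy_dssp)
print(encoded)
```

A character outside the eight keys of `ss_d` raises a `KeyError`, which works as an early warning for unexpected annotations.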

## Obtaining sliding windows

For each position in the MSA of each protein in the dataset, I build the corresponding sliding window, zero-padded on the left, right or bottom where needed. I store the windows in a lookup table for later use.

In [114]:
window_size = 31
msa_depth = 500
in_msa_path = '../processing/gray2018/msa_vectors/'
basename_list = '../processing/gray2018/input_list.txt'

with open(basename_list) as handle:
    sliding_windows = {}
    for line in handle:
        basename = line.rstrip()
        msa_vec = joblib.load(in_msa_path + basename + '.npy.joblib.xz')[:msa_depth]
        windows_list = []
        for i, _ in enumerate(msa_vec.T):
            upper = i + ((window_size - 1)//2) + 1
            lower = i - ((window_size - 1)//2)
            pad_lower, pad_upper = 0, 0
            if lower < 0:
                pad_lower = - lower
                lower = 0
            if upper > len(msa_vec.T):
                pad_upper = upper - len(msa_vec.T)
                # no need to reset upper since numpy slicing allows indices exceeding the length
            curr_window_unpadded = msa_vec[:, lower:upper]
            # this is for the vertical padding if there are not enough sequences in the msa
            pad_vertical = 0
            if len(msa_vec) < msa_depth:
                pad_vertical = msa_depth - len(msa_vec)
            # 0 is a special padding value for the keras embedding layer
            curr_window = np.pad(curr_window_unpadded, ((0,pad_vertical),(pad_lower,pad_upper)), mode='constant', constant_values = 0)
            assert curr_window.shape == (msa_depth, window_size)
            windows_list.append(curr_window)
        sliding_windows[basename] = np.array(windows_list)
        assert sliding_windows[basename].shape == (len(msa_vec.T), msa_depth, window_size)
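
To make the padding logic easier to follow, here is a minimal sketch of the same idea on a toy alignment. The helper `window_at` and the toy sizes are assumptions made for illustration; the cell above remains the code actually used.

```python
import numpy as np


def window_at(msa, i, window_size, msa_depth):
    """Hypothetical helper: zero-padded window centred on column i."""
    half = (window_size - 1) // 2
    lower, upper = i - half, i + half + 1
    # columns missing on either side of the alignment
    pad_left = max(0, -lower)
    pad_right = max(0, upper - msa.shape[1])
    # rows missing when the alignment is shallower than msa_depth
    pad_bottom = max(0, msa_depth - msa.shape[0])
    core = msa[:msa_depth, max(lower, 0):upper]
    return np.pad(core, ((0, pad_bottom), (pad_left, pad_right)),
                  mode="constant", constant_values=0)


# toy alignment: 2 sequences, 4 columns, residues coded as integers > 0
toy_msa = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8]])
first = window_at(toy_msa, 0, window_size=3, msa_depth=3)
last = window_at(toy_msa, 3, window_size=3, msa_depth=3)
print(first)
print(last)
```

The first window gets a zero column on the left, the last one a zero column on the right, and both get a zero row at the bottom because the toy alignment has only 2 of the 3 requested sequences.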

## Map the sliding windows to the training data

For each training point, I extract the corresponding sliding window and stack the windows into an array that can be used for training together with the ss and accessibility vectors. I save the output as numpy arrays.

In [126]:
def map_windows(sliding_windows, df):
    # the first element of iterrows is the index
    windows = []
    for _, row in df.iterrows():
        # I subtract 1 from position since array indices start from 0 but positions start from 1
        curr_window = sliding_windows[row['uniprot_id']][row['position']-1]
        windows.append(curr_window)
    windows_vec = np.array(windows)
    assert windows_vec.shape[0] == len(df)
    return windows_vec

dssp_windows = map_windows(sliding_windows, df_dssp)
accessibility_windows = map_windows(sliding_windows, df_accessibility)

joblib.dump(dssp_windows, '../processing/raw_msa/implementation_test_grey2018/ss_sliding_window.np.joblib.xz')
joblib.dump(accessibility_windows, '../processing/raw_msa/implementation_test_grey2018/accessibility_sliding_window.np.joblib.xz')

['../processing/raw_msa/implementation_test_grey2018/accessibility_sliding_window.np.joblib.xz']
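
The lookup relies on converting 1-based sequence positions into 0-based array indices. A minimal sketch of that step, with a made-up protein identifier (`P00001`) and a list of tuples standing in for the dataframe rows:

```python
import numpy as np

# toy lookup table: one protein with 3 positions, depth 2, window size 3
toy_windows = {"P00001": np.arange(3 * 2 * 3).reshape(3, 2, 3)}
# (protein id, 1-based position) pairs, standing in for the dataframe rows
toy_rows = [("P00001", 1), ("P00001", 3)]

# position - 1 converts the 1-based position into a 0-based array index
selected = np.array([toy_windows[pid][pos - 1] for pid, pos in toy_rows])
print(selected.shape)
```

Position 1 picks the first window and position 3 the last one; a position of 0 would silently pick the last window through negative indexing, so positions are assumed to start at 1.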

## Deep learning

Here are the general operations common to all the deep learning steps; I also launch TensorBoard.

In [1]:
import numpy as np
import tensorflow as tf
import joblib
import json
import datetime
import os
import sklearn.metrics  # accuracy_score is referenced when building the SS network

%load_ext tensorboard

In [2]:
%tensorboard --logdir logs

Reusing TensorBoard on port 6006 (pid 227810), started 0:24:31 ago. (Use '!kill 227810' to kill it.)

### SS network

Here I create the rawMSA network for secondary structure prediction.

In [7]:
# each input window has shape Y,L where Y is the alignment depth (500 here) and L is the size of the sliding window (31 here)
# in the original implementation the input is a vector, so I presume that they first flatten the msa before giving it as input
# this is not really declared anywhere, but if I give a None,None shape as input, then the reshape removes a dimension!
# the dimensions declared here do not include the batch size, which is added automatically as a None first dimension
# note that tf.keras.Input creates a symbolic tensor, not an input layer
# it is recommended in place of InputLayer for the functional API
window_size = 31
msa_depth = 500
embedding_dim = 14
inputs = tf.keras.Input(shape=(msa_depth, window_size))
# I swap the axes to L,Y before flattening: Flatten and Reshape are row-major, so without this step
# the Reshape after the embedding would not recover the batch,L,Y,E layout
x = tf.keras.layers.Permute((2, 1))(inputs)
# I flatten the input for the embedding layer
x = tf.keras.layers.Flatten()(x)
# this is the embedding layer
# 26 is the size of the vocabulary: 25 residue codes (20 standard, the additional characters XBZU, and - for gaps) plus 0, reserved for padding
# 14 is the dimensionality of the embedding used
# I avoid specifying many parameters here; if the defaults differ from what is in the rawMSA paper I will adjust them later
# x is the running variable that I use to connect the layers
# the shape after the embedding is batch,LY,E
x = tf.keras.layers.Embedding(input_dim=26, output_dim=embedding_dim, mask_zero=True)(x)
# I reshape to undo the initial flattening of the input, so I separate the L and Y dimensions (length and alignment depth)
# my shape now is batch,L,Y,E
x = tf.keras.layers.Reshape((-1,msa_depth,embedding_dim))(x)
# for the ss_rsa network there is a single convolution on the embedded input
# I use a number of filters equal to the embedding dimensionality
# the filter is still of width 1 but much longer than the one used for the cmap network
# here the relu activation is applied directly inside the convolution instead of in a separate activation layer
# the dimensions do not change after this layer but from now on I will refer to the last axis as F and not E
x = tf.keras.layers.Conv2D(embedding_dim, [1,10], activation='relu', padding='same', data_format='channels_last')(x)
# this max pooling layer uses a really long pooling but always of 1 column width
# this compresses the Y dimension
# the shape goes from batch,L,Y,F to batch,L,S,F, where S=Y/n is 500/20=25
x = tf.keras.layers.MaxPooling2D(pool_size=[1,20], data_format='channels_last')(x)
# I remove 1 axis by concatenating S and F to SF where SF=25*14=350
# the new shape is batch,L,SF
x = tf.keras.layers.Reshape((-1,embedding_dim*(msa_depth//20)))(x)
# the last part of the network involves bidirectional LSTM units
# the sequence is read with recurrent units in both directions and then the results are averaged (merge_mode='ave')
# I want a full sequence in output and not only the last output of the recurrence
# it uses tanh activation in output but hard sigmoid as the recurrent activation
# the hard sigmoid is a piecewise linear function with similar behaviour to the sigmoid
# it is easier to compute at the cost of precision
# the initialization default is glorot while the authors used variance scaling
# I leave it glorot for now since I believe it to be due to the keras version default
# the lstm_brnn + dropout combination is applied twice
# this loop does not alter the shape (it is a seq-to-seq lstm) which remains batch,L,SF=350
for _ in range(2):
    lstm_layer = tf.keras.layers.LSTM(embedding_dim*(msa_depth//20), return_sequences=True, recurrent_activation='hard_sigmoid')
    x = tf.keras.layers.Bidirectional(lstm_layer, merge_mode='ave')(x)
    # dropout layer with a 0.4 probability of dropping each input unit
    # non-dropped inputs are rescaled so as to maintain the sum over the inputs
    x = tf.keras.layers.Dropout(0.4)(x)
# the L and SF axes are collapsed, giving dimension batch,LSF
# I am concatenating the sliding window dimension L with the depth/channel dimension SF
x = tf.keras.layers.Flatten(data_format='channels_last')(x)
# now 2 dense layers with dropout to convert all the information to 3 values, the prediction for the
# ss of the central residue of the sliding window
x = tf.keras.layers.Dense(50, activation='relu')(x)
x = tf.keras.layers.Dropout(0.4)(x)
x = tf.keras.layers.Dense(20, activation='relu')(x)
x = tf.keras.layers.Dropout(0.4)(x)
# finally softmax with 3 output units for the 3 ss classes for the central residue
outputs = tf.keras.layers.Dense(3, activation='softmax')(x)
# I generate the model linking inputs and outputs
model = tf.keras.Model(inputs=inputs, outputs=outputs, name='ss_recreated')
# the optimizer is RMSProp with a tweaked epsilon
optimizer = tf.keras.optimizers.RMSprop(epsilon=1e-9)
# the loss is sparse_categorical_crossentropy (a function), not the class SparseCategoricalCrossentropy
# when using the class it needs to be called first
loss = tf.keras.losses.sparse_categorical_crossentropy
q3_accuracy = sklearn.metrics.accuracy_score
model.compile(optimizer=optimizer, loss=loss, metrics=['sparse_categorical_accuracy'])
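
The `Flatten` and `Reshape` layers use row-major ordering, as NumPy does by default, so the axis order of the window before flattening determines what the later reshape recovers. A minimal NumPy sketch with toy sizes (not the Keras layers themselves) shows the difference between flattening a depth-by-width window directly and transposing it first:

```python
import numpy as np

depth, width = 3, 2  # toy stand-ins for msa_depth (Y) and window_size (L)
# window[y, l] = 10 * y + l, so every entry encodes its own coordinates
window = np.array([[10 * y + l for l in range(width)] for y in range(depth)])

# flatten the (Y, L) array directly, then regroup as (L, Y)
direct = window.flatten().reshape(width, depth)
# transpose to (L, Y) first, then flatten and regroup as (L, Y)
transposed_first = window.T.flatten().reshape(width, depth)

print(direct)
print(transposed_first)
```

Only the second variant yields rows that each hold one alignment column, which is the batch,L,Y layout described in the comments of the network.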

In [8]:
# input selection
x = joblib.load('../processing/raw_msa/implementation_test_grey2018/ss_sliding_window.np.joblib.xz')
y = joblib.load('../processing/raw_msa/implementation_test_grey2018/ss_labels.np.joblib.xz')
#np.random.shuffle(y)

# tensorflow logging
logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

# fit the model
model.fit(x, y, epochs=10, validation_split=0.2, callbacks=[tensorboard_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7ff4951ded30>
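
The `sparse_categorical_accuracy` metric tracked during training compares the most probable class with the integer label, which for three secondary structure states corresponds to the Q3 accuracy. A minimal NumPy sketch of that computation on made-up probabilities (the helper `q3_from_probabilities` is a hypothetical name, not a Keras function):

```python
import numpy as np


def q3_from_probabilities(probs, labels):
    """Hypothetical helper: fraction of positions whose argmax class matches the label."""
    predicted = np.argmax(probs, axis=1)
    return float(np.mean(predicted == np.ravel(labels)))


# made-up softmax outputs for 4 positions (columns: H, E, -)
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.6, 0.3, 0.1]])
toy_labels = np.array([[0], [1], [2], [1]])  # same (n, 1) layout as ss_labels
q3 = q3_from_probabilities(toy_probs, toy_labels)
print(q3)
```

Three of the four toy positions are predicted correctly, so the sketch prints 0.75.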

### RSA network

In [9]:
# similar to ss network but with different embedding dim, conv2d kernel size, max_pool size, dropout

window_size = 31
msa_depth = 500
embedding_dim = 28

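inputs = tf.keras.Input(shape=(msa_depth, window_size))
# swap the axes to L,Y before flattening so that the row-major Reshape below recovers batch,L,Y,E
x = tf.keras.layers.Permute((2, 1))(inputs)
x = tf.keras.layers.Flatten()(x)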
x = tf.keras.layers.Embedding(input_dim=26, output_dim=embedding_dim, mask_zero=True)(x)
x = tf.keras.layers.Reshape((-1,msa_depth,embedding_dim))(x)
x = tf.keras.layers.Conv2D(embedding_dim, [1,20], activation='relu', padding='same', data_format='channels_last')(x)
x = tf.keras.layers.MaxPooling2D(pool_size=[1,30], data_format='channels_last')(x)
x = tf.keras.layers.Reshape((-1,embedding_dim*(msa_depth//30)))(x)

for _ in range(2):
    lstm_layer = tf.keras.layers.LSTM(embedding_dim*(msa_depth//30), return_sequences=True, recurrent_activation='hard_sigmoid')
    x = tf.keras.layers.Bidirectional(lstm_layer, merge_mode='ave')(x)
    x = tf.keras.layers.Dropout(0.4)(x)

x = tf.keras.layers.Flatten(data_format='channels_last')(x)
x = tf.keras.layers.Dense(50)(x)
x = tf.keras.layers.Dropout(0.4)(x)
x = tf.keras.layers.Dense(20)(x)
x = tf.keras.layers.Dropout(0.4)(x)
outputs = tf.keras.layers.Dense(2, activation='softmax')(x)

model = tf.keras.Model(inputs=inputs, outputs=outputs, name='rsa_recreated')
optimizer = tf.keras.optimizers.RMSprop(epsilon=1e-9)
loss = tf.keras.losses.sparse_categorical_crossentropy
model.compile(optimizer=optimizer, loss=loss, metrics=['sparse_categorical_accuracy'])
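
The size of the axis fed to the LSTM layers follows from the pooling arithmetic used in both networks. A minimal sketch, assuming non-overlapping max pooling with the Keras default `'valid'` padding; the helper name `recurrent_feature_size` is made up for illustration:

```python
def recurrent_feature_size(msa_depth, pool_size, n_filters):
    """Hypothetical helper: size of the SF axis fed to the LSTM layers.

    Assumes non-overlapping max pooling with 'valid' padding,
    so a trailing partial pool is dropped (integer division).
    """
    pooled_depth = msa_depth // pool_size
    return pooled_depth * n_filters


ss_features = recurrent_feature_size(msa_depth=500, pool_size=20, n_filters=14)
rsa_features = recurrent_feature_size(msa_depth=500, pool_size=30, n_filters=28)
print(ss_features, rsa_features)
```

For the RSA network 500 is not a multiple of 30, so under this assumption only the first 480 rows of the alignment contribute to the 16 pooled rows.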

In [10]:
# input selection
x = joblib.load('../processing/raw_msa/implementation_test_grey2018/accessibility_sliding_window.np.joblib.xz')
y_cont = joblib.load('../processing/raw_msa/implementation_test_grey2018/accessibility_labels.np.joblib.xz')
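# binarise the continuous accessibility with a 0.3 threshold to obtain the 2 classes of the softmax output
y = np.array([1 if el > 0.3 else 0 for el in y_cont])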
#np.random.shuffle(y)

# tensorflow logging
logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

# fit the model
model.fit(x, y, epochs=10, validation_split=0.2, callbacks=[tensorboard_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7ff49bcbe580>
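
The thresholding of the accessibility labels can also be written without a Python loop. A minimal sketch on made-up values, showing that the loop and a vectorised comparison give the same two-class labels:

```python
import numpy as np

# made-up accessibility values in the same (n, 1) layout as the saved labels
toy_accessibility = np.array([[0.05], [0.30], [0.31], [0.90]])

# loop version, with the same 0.3 threshold used for the RSA labels
y_loop = np.array([1 if el > 0.3 else 0 for el in toy_accessibility])
# vectorised alternative: compare, cast to int, drop the trailing axis
y_vec = (toy_accessibility > 0.3).astype(int).ravel()
print(y_vec)
```

The comparison is strict, so a value exactly equal to the threshold falls in class 0.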