# Deep learning Gray2018 full MSA approach

Here I implement the network used to process the MSA obtained from the filtered output of HHblits. The implementation starts from the network that extracts the contact map from an MSA, and then processes its output further with additional layers. This approach considers the full MSA and NOT a sliding window.

In [10]:
import numpy as np
import tensorflow as tf
import joblib
import json

## Original rawMSA cmap network

Here I re-create the model architecture of the rawMSA cmap network. There is no source code in the repository of the rawMSA paper, and the serialised models do not de-serialise correctly. A bad marshal error is raised at loading time, which I traced to a Python version mismatch: the bytecode stored for the lambda layer is incompatible with the interpreter I use. The models load successfully under Python 3.5, so I loaded them there and saved the model architecture and weights separately for easier processing. The architecture was saved as a JSON file.
Here I code the same architecture to use as a base for my model. The lambda layer was recreated by reading the paper and interpreting the operations described there, so it may differ from the actual implementation. The layer names do not coincide, since the first layer of a given type is called layer_1 in the original cmap and layer here (for example reshape_1 versus reshape).
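
The export step described above could look roughly like the following sketch. It is an assumption about the workflow, not the code that was actually run: `export_model` and its two path arguments are hypothetical names, and the only Keras methods relied on are `to_json()` and `save_weights()`.

```python
def export_model(model, architecture_path, weights_path):
    """Save a Keras model as an architecture JSON plus a separate weights file.

    Hypothetical helper, meant to be run once in the legacy environment
    (Python 3.5 here) where the serialised model still loads.
    """
    # to_json() returns the architecture as a JSON string, without weights
    with open(architecture_path, 'w') as handle:
        handle.write(model.to_json())
    # the weights go to their own file, so that a re-coded architecture
    # can later pick them up with load_weights()
    model.save_weights(weights_path)
```

Keeping the two files separate means the architecture can be inspected as plain JSON even where the lambda layer itself cannot be de-serialised.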

In [15]:
# my input is of shape LY where L is the length of the MSA (variable for each input) and Y is the depth (a hyperparameter)
# In this original implementation the input is a vector, so I presume that they first flatten the MSA before feeding it
# this is not really declared anywhere, but if I give a None,None shape as input, then the reshape removes a dimension!
# the dimensions declared here do not include the batch size, which is added as a None first dimension automatically
# note that this is a tensor shape, not an input layer
# tf.keras.Input is recommended in place of InputLayer for the functional API
inputs = tf.keras.Input(shape=(None,))
# this is the embedding layer
# 26 is the vocabulary size: 20 standard residues, the additional characters XBZU, - for gaps, and 0 reserved for padding
# 28 is the dimensionality of the embedding used
# I leave most parameters at their defaults; if the defaults differ from what is in the rawMSA paper I will adjust them later
# x is the running variable that I use to connect the layers
# the shape after the embedding is batch,LY,E
embedding = tf.keras.layers.Embedding(input_dim=26, output_dim=28)
x = embedding(inputs)
# I reshape to undo the initial flattening of the input, so I separate the L and Y dimensions (length and alignment depth)
# my shape now is batch,L,Y,E
reshape = tf.keras.layers.Reshape((-1,1000,28))
x = reshape(x)
# here I start the first round of convolutions and pooling
# I declare the layers inside the loop since otherwise it throws an error on the shape of the input for conv2d
# probably some internal param is initialized and remains so during subsequent calls
# recreating the layer each time just resets everything
# In order to avoid possible hard-to-debug problems I re-initialize also activation and batch_norm
# the general structure is a block of 2 convolutions and batch normalization followed by max pooling
# this block is repeated 4 to 8 times (in this case 6)
# after the first convolution the third dimension (not counting batch) goes from 28 to 22 and then remains constant
# from now on the third dimension will be the filter dimension F instead of the embedding dimension E
# the second dimension will be the stride dimension S instead of the MSA depth dimension Y
# the activation and batch normalization do not affect the shape
# max pooling halves the S dimension at each round
for _ in range(6):
    for _ in range(2):
        # I use 22 convolutional filters of size 1,3 with same padding
        # The channel (the F dimension) is the last
        conv2d = tf.keras.layers.Conv2D(22, [1,3], padding='same', data_format='channels_last')
        x = conv2d(x)
        # apply a relu on the conv2d output
        # I am not sure why they did not just apply the activation in the conv2d layer
        activation = tf.keras.layers.Activation('relu')
        x = activation(x)
        # batch normalization brings the mean output close to 0 and the sd close to 1
        # during training, normalization is done with the statistics of the current batch
        # during inference, it uses the moving mean and variance accumulated during training
        batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
        x = batch_normalization(x)
    # the final max pooling layer
    # this layer halves the S dimension at each round
    max_pooling = tf.keras.layers.MaxPooling2D(pool_size=[1,2], data_format='channels_last')
    x = max_pooling(x)
# after the first convolution and pooling loop, the shape is batch,L,S,22 with S = Y/2^n
# n is the number of max pooling layers applied, and each halving is rounded down to the nearest integer
# for Y=1000 and n=6 the shape is batch,L,15,22
# this reshape concatenates the S and F axis, removing one dimension (15*22=330)
# after the reshape the shape is batch,L,330
reshape = tf.keras.layers.Reshape((-1,330))
x = reshape(x)
# the lambda layer for the outer product
# this layer has the purpose of converting the dimensionality closer to that of the output (L,L)
# the input here has shape batch,L,330 where 330 is (F*S)
# the output has shape batch,L,L,330
# first I define the function to compute the outer product
def outer_product(x):
    # actually I need only FS but I put the correct names for reference
    batch, L, FS = x.shape
    # x_hat and x_bar have both a singleton dimension added at different orders (before and after L)
    x_hat = tf.keras.layers.Reshape((-1, 1, FS))(x)
    x_bar = tf.keras.layers.Reshape((1, -1, FS))(x)
    # the multiply operator returns the element-wise multiplication
    # since there is a dimensional mismatch, each singleton dimension is broadcast along the matching L dimension of the other tensor
    x = tf.math.multiply(x_hat, x_bar)
    return x
# then I create the lambda layer implementing the custom outer product function
lambda_layer = tf.keras.layers.Lambda(outer_product, output_shape=[None, None, 330])
x = lambda_layer(x)
# now the second and final round of convolutions
# the first 3 layers are non-repetitive
# I start with a convolution with 34 filters of shape 3,3 and with relu activation
# This brings the shape from batch,L,L,330 (where 330 is the old FS) to batch,L,L,34 where 34 is the new F
conv2d = tf.keras.layers.Conv2D(34, [3,3], activation='relu', padding='same', data_format='channels_last')
x = conv2d(x)
# the batch normalization is identical to the one in the initial loop and does not alter the shape
batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
x = batch_normalization(x)
# this is similar to the previous conv2d but with a linear activation instead of relu
# this is done since the relu activation is the first component of the subsequent loop
# here the shape is not altered (batch,L,L,F=34) since I am using the same number of filters
conv2d = tf.keras.layers.Conv2D(34, [3,3], activation='linear', padding='same', data_format='channels_last')
x = conv2d(x)
# I have now a loop of relu activation followed by 2 rounds of batch normalization and conv2d, which ends
# with adding the output of the last convolution to the output of the first batch norm
# the loop is repeated 15 times (6 to 20 times in different models)
for _ in range(15):
    # the activation is identical to the one in the initial loop and does not alter the shape (batch,L,L,F=34)
    activation = tf.keras.layers.Activation('relu')
    x = activation(x)
    # batch norm and conv2d are repeated twice per each block
    # I nonetheless state them explicitly since I need a skip connection from the first batch norm
    # the batch normalization is identical to the one in the initial loop
    # I name x differently here since I need to retrieve the output for the skip connection
    # this does not alter the shape (batch,L,L,F=34)
    batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
    x_skip = batch_normalization(x)
    # convolution with 34 filters of shape 3,3 and with relu activation
    # this does not alter the shape (batch,L,L,F=34)
    conv2d = tf.keras.layers.Conv2D(34, [3,3], activation='relu', padding='same', data_format='channels_last')
    x = conv2d(x_skip)
    # the batch normalization is identical to the one in the initial loop
    # this does not alter the shape (batch,L,L,F=34)
    batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
    x = batch_normalization(x)
    # convolution with 34 filters of shape 3,3 and with relu activation
    # this does not alter the shape (batch,L,L,F=34)
    conv2d = tf.keras.layers.Conv2D(34, [3,3], activation='relu', padding='same', data_format='channels_last')
    x = conv2d(x)
    # this is a skip connection: I add to the current tensor x the output of the FIRST batch norm of the loop
    # the 2 tensors have the same shape, and the output also has the same shape of batch,L,L,F=34
    add = tf.keras.layers.Add()
    x = add([x, x_skip])
# the network finishes with a convolution preceded by activation and normalization
# the activation is identical to the one in the initial loop and does not alter the shape (batch,L,L,F=34)
activation = tf.keras.layers.Activation('relu')
x = activation(x)    
# the batch normalization is identical to the one in the initial loop
# this does not alter the shape (batch,L,L,F=34)
batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
x = batch_normalization(x)
# the final convolution has only 2 filters with size 3,3
# this reduces the F dimension to 2 bringing the shape to batch,L,L,F=2
conv2d = tf.keras.layers.Conv2D(2, [3,3], activation='relu', padding='same', data_format='channels_last')
x = conv2d(x)
# the final activation is a softmax and does not alter the shape (batch,L,L,F=2)
# this normalizes the output to a probability such that the 2 channels measure the probability of contact/no contact
activation = tf.keras.layers.Activation('softmax')
outputs = activation(x) 
model = tf.keras.Model(inputs=inputs, outputs=outputs, name='cmap_recreated')
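
The shape bookkeeping in the comments of the cell above (a depth of 1000 reduced to 15 by six pooling layers, and 15*22=330 for the reshape) can be checked with a few lines of plain Python. This is only a sketch: `pooled_size` is a hypothetical helper, and it assumes the Keras defaults for `MaxPooling2D` (stride equal to the pool size and `'valid'` padding), under which each layer floors the division by 2.

```python
def pooled_size(size, n_pools, pool=2):
    """Size of an axis after n_pools max pooling layers (stride = pool, 'valid' padding)."""
    for _ in range(n_pools):
        size //= pool  # each pooling layer floors the division
    return size

depth = 1000     # MSA depth Y
n_filters = 22   # filters of the last Conv2D in the first loop
s = pooled_size(depth, 6)
print(s, s * n_filters)  # prints: 15 330
```

The intermediate sizes are 500, 250, 125, 62, 31 and 15, so changing either the depth or the number of pooling rounds changes the 330 hard-coded in the following `Reshape`.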

## My derivative of the rawMSA cmap network

I now start from the original cmap network and modify it for my needs. I find the way the input is fed counter-intuitive, so I make the input of shape L,Y instead of LY.
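
To illustrate why the two input conventions carry the same information, here is a small NumPy sketch. It assumes row-major ordering, which is NumPy's default and, as far as I know, what the Keras `Flatten` and `Reshape` layers use as well; the sizes are toy values, not the real ones.

```python
import numpy as np

L, Y = 5, 4                            # toy length and depth, not the real sizes
msa = np.arange(L * Y).reshape(L, Y)   # stand-in for an integer-encoded MSA of shape (L, Y)
flat = msa.reshape(-1)                 # the LY vector that the original network takes
restored = flat.reshape(-1, Y)         # splitting LY back into L and Y
print(flat.shape, restored.shape)      # prints: (20,) (5, 4)
print(np.array_equal(msa, restored))   # prints: True
```

Since the round trip is lossless, flattening inside the network only moves a preprocessing step into the model.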

In [9]:
# my input is of shape L,Y where L is the length of the MSA (variable for each input) and Y is the depth (a hyperparameter)
# Y is fixed to 1000 here, matching the depth hard-coded in the reshape after the embedding
# the dimensions declared here do not include the batch size, which is added as a None first dimension automatically
# note that this is a tensor shape, not an input layer
# tf.keras.Input is recommended in place of InputLayer for the functional API
inputs = tf.keras.Input(shape=(None,1000))
# this flattening was added by me to do the flattening implicitly and not before feeding the network
# flattening is required for the embedding layer
# I am converting from batch,L,Y to batch,LY
flatten = tf.keras.layers.Flatten()
x = flatten(inputs)
# this is the embedding layer
# 26 is the vocabulary size: 20 standard residues, the additional characters XBZU, - for gaps, and 0 reserved for padding
# 28 is the dimensionality of the embedding used
# I leave most parameters at their defaults; if the defaults differ from what is in the rawMSA paper I will adjust them later
# x is the running variable that I use to connect the layers
# the shape after the embedding is batch,LY,E
embedding = tf.keras.layers.Embedding(input_dim=26, output_dim=28)
x = embedding(x)
# I do a reshape to recreate the Y dimension (alignment depth)
# my shape now is batch,L,Y,E
reshape = tf.keras.layers.Reshape((-1,1000,28))
x = reshape(x)
# here I start the first round of convolutions and pooling
# I declare the layers inside the loop since otherwise it throws an error on the shape of the input for conv2d
# probably some internal param is initialized and remains so during subsequent calls
# recreating the layer each time just resets everything
# In order to avoid possible hard-to-debug problems I re-initialize also activation and batch_norm
# the general structure is a block of 2 convolutions and batch normalization followed by max pooling
# this block is repeated 4 to 8 times (in this case 6)
# after the first convolution the third dimension (not counting batch) goes from 28 to 22 and then remains constant
# from now on the third dimension will be the filter dimension F instead of the embedding dimension E
# the second dimension will be the stride dimension S instead of the MSA depth dimension Y
# the activation and batch normalization do not affect the shape
# max pooling halves the S dimension at each round
for _ in range(6):
    for _ in range(2):
        # I use 22 convolutional filters of size 1,3 with same padding
        # The channel (the F dimension) is the last
        conv2d = tf.keras.layers.Conv2D(22, [1,3], padding='same', data_format='channels_last')
        x = conv2d(x)
        # apply a relu on the conv2d output
        # I am not sure why they did not just apply the activation in the conv2d layer
        activation = tf.keras.layers.Activation('relu')
        x = activation(x)
        # batch normalization brings the mean output close to 0 and the sd close to 1
        # during training, normalization is done with the statistics of the current batch
        # during inference, it uses the moving mean and variance accumulated during training
        batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
        x = batch_normalization(x)
    # the final max pooling layer
    # this layer halves the S dimension at each round
    max_pooling = tf.keras.layers.MaxPooling2D(pool_size=[1,2], data_format='channels_last')
    x = max_pooling(x)
# after the first convolution and pooling loop, the shape is batch,L,S,22 with S = Y/2^n
# n is the number of max pooling layers applied, and each halving is rounded down to the nearest integer
# for Y=1000 and n=6 the shape is batch,L,15,22
# this reshape concatenates the S and F axis, removing one dimension (15*22=330)
# after the reshape the shape is batch,L,330
reshape = tf.keras.layers.Reshape((-1,330))
x = reshape(x)
# the lambda layer for the outer product
# this layer has the purpose of converting the dimensionality closer to that of the output (L,L)
# the input here has shape batch,L,330 where 330 is (F*S)
# the output has shape batch,L,L,330
# first I define the function to compute the outer product
def outer_product(x):
    # actually I need only FS but I put the correct names for reference
    batch, L, FS = x.shape
    # x_hat and x_bar have both a singleton dimension added at different orders (before and after L)
    x_hat = tf.keras.layers.Reshape((-1, 1, FS))(x)
    x_bar = tf.keras.layers.Reshape((1, -1, FS))(x)
    # the multiply operator returns the element-wise multiplication
    # since there is a dimensional mismatch, each singleton dimension is broadcast along the matching L dimension of the other tensor
    x = tf.math.multiply(x_hat, x_bar)
    return x
# then I create the lambda layer implementing the custom outer product function
lambda_layer = tf.keras.layers.Lambda(outer_product, output_shape=[None, None, 330])
x = lambda_layer(x)
# now the second and final round of convolutions
# the first 3 layers are non-repetitive
# I start with a convolution with 34 filters of shape 3,3 and with relu activation
# This brings the shape from batch,L,L,330 (where 330 is the old FS) to batch,L,L,34 where 34 is the new F
conv2d = tf.keras.layers.Conv2D(34, [3,3], activation='relu', padding='same', data_format='channels_last')
x = conv2d(x)
# the batch normalization is identical to the one in the initial loop and does not alter the shape
batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
x = batch_normalization(x)
# this is similar to the previous conv2d but with a linear activation instead of relu
# this is done since the relu activation is the first component of the subsequent loop
# here the shape is not altered (batch,L,L,F=34) since I am using the same number of filters
conv2d = tf.keras.layers.Conv2D(34, [3,3], activation='linear', padding='same', data_format='channels_last')
x = conv2d(x)
# I have now a loop of relu activation followed by 2 rounds of batch normalization and conv2d, which ends
# with adding the output of the last convolution to the output of the first batch norm
# the loop is repeated 15 times (6 to 20 times in different models)
for _ in range(15):
    # the activation is identical to the one in the initial loop and does not alter the shape (batch,L,L,F=34)
    activation = tf.keras.layers.Activation('relu')
    x = activation(x)
    # batch norm and conv2d are repeated twice per each block
    # I nonetheless state them explicitly since I need a skip connection from the first batch norm
    # the batch normalization is identical to the one in the initial loop
    # I name x differently here since I need to retrieve the output for the skip connection
    # this does not alter the shape (batch,L,L,F=34)
    batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
    x_skip = batch_normalization(x)
    # convolution with 34 filters of shape 3,3 and with relu activation
    # this does not alter the shape (batch,L,L,F=34)
    conv2d = tf.keras.layers.Conv2D(34, [3,3], activation='relu', padding='same', data_format='channels_last')
    x = conv2d(x_skip)
    # the batch normalization is identical to the one in the initial loop
    # this does not alter the shape (batch,L,L,F=34)
    batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
    x = batch_normalization(x)
    # convolution with 34 filters of shape 3,3 and with relu activation
    # this does not alter the shape (batch,L,L,F=34)
    conv2d = tf.keras.layers.Conv2D(34, [3,3], activation='relu', padding='same', data_format='channels_last')
    x = conv2d(x)
    # this is a skip connection: I add to the current tensor x the output of the FIRST batch norm of the loop
    # the 2 tensors have the same shape, and the output also has the same shape of batch,L,L,F=34
    add = tf.keras.layers.Add()
    x = add([x, x_skip])
# the network finishes with a convolution preceded by activation and normalization
# the activation is identical to the one in the initial loop and does not alter the shape (batch,L,L,F=34)
activation = tf.keras.layers.Activation('relu')
x = activation(x)    
# the batch normalization is identical to the one in the initial loop
# this does not alter the shape (batch,L,L,F=34)
batch_normalization = tf.keras.layers.BatchNormalization(axis=[3])
x = batch_normalization(x)
# the final convolution has only 2 filters with size 3,3
# this reduces the F dimension to 2 bringing the shape to batch,L,L,F=2
conv2d = tf.keras.layers.Conv2D(2, [3,3], activation='relu', padding='same', data_format='channels_last')
x = conv2d(x)
# the final activation is a softmax and does not alter the shape (batch,L,L,F=2)
# this normalizes the output to a probability such that the 2 channels measure the probability of contact/no contact
activation = tf.keras.layers.Activation('softmax')
outputs = activation(x) 
model = tf.keras.Model(inputs=inputs, outputs=outputs, name='my_model')
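
The broadcasting trick used in `outer_product` can be looked at in isolation with NumPy. The sketch below drops the batch dimension and uses toy sizes, so it illustrates the operation rather than reproducing the layer: entry `[i, j, f]` of the result is `x[i, f] * x[j, f]`.

```python
import numpy as np

L, FS = 3, 2                                        # toy sizes; the network uses FS = 330
x = np.arange(L * FS, dtype=float).reshape(L, FS)   # stand-in for one sample of shape (L, FS)
x_hat = x[:, None, :]                               # shape (L, 1, FS)
x_bar = x[None, :, :]                               # shape (1, L, FS)
pairwise = x_hat * x_bar                            # broadcast to shape (L, L, FS)
print(pairwise.shape)                               # prints: (3, 3, 2)
```

The result is symmetric in its first two axes, which is consistent with a contact map where the pair (i, j) and the pair (j, i) describe the same contact.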

In [8]:
model.summary()

Model: "my_model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None, None, 1000)] 0                                            
__________________________________________________________________________________________________
flatten_3 (Flatten)             (None, None)         0           input_4[0][0]                    
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 28)     728         flatten_3[0][0]                  
__________________________________________________________________________________________________
reshape_4 (Reshape)             (None, None, 1000, 2 0           embedding_2[0][0]                
___________________________________________________________________________________________

In [5]:
from contextlib import redirect_stdout
with open('../temp/my_cmap_summary.txt', 'w') as handle:
    with redirect_stdout(handle):
        model.summary(line_length=150)

In [6]:
my_model_json = json.loads(model.to_json())

In [7]:
len(my_model_json['config']['layers'])

144

In [5]:
with open('../processing/raw_msa/cmap_sample_model/architecture.json') as handle:
    cmap_model_json = json.load(handle)

len(cmap_model_json['config']['layers'])

144

In [25]:
print(np.array([el for el in cmap_model_json['config']['layers'][0]]).reshape(-1,1))
print(np.array([el for el in my_model_json['config']['layers'][0]]).reshape(-1,1))

[['config']
 ['name']
 ['class_name']
 ['inbound_nodes']]
[['class_name']
 ['config']
 ['name']
 ['inbound_nodes']]


In [27]:
new = my_model_json['config']['layers'][2]
old = cmap_model_json['config']['layers'][2]

def recursive_differences(old, new, path=''):
    """Print every value that differs between two nested JSON-like structures.

    Lists are walked element by element; inside a dict only nested dicts are
    walked, so a list stored under a dict key is compared as a whole.
    """
    if isinstance(old, list):
        for i, el in enumerate(old):
            if isinstance(el, (dict, list)):
                recursive_differences(el, new[i], path + '[' + str(i) + ']:')
            elif el != new[i]:
                print('Difference found:', path + '[' + str(i) + ']:', '\nold:', el, '\tnew:', new[i])
    elif isinstance(old, dict):
        for key in old:
            try:
                if isinstance(old[key], dict):
                    recursive_differences(old[key], new[key], path + key + ':')
                elif old[key] != new[key]:
                    print('Difference found:', path + key, '\nold:', old[key], '\tnew:', new[key])
            except KeyError:
                # the key exists only in the old structure
                print('KeyError at:', path + key, '\nold:', old[key])
            
recursive_differences(old, new)

Difference found: config:name 
old: reshape_1 	new: reshape
Difference found: name 
old: reshape_1 	new: reshape
Difference found: inbound_nodes 
old: [[['embedding_1', 0, 0, {}]]] 	new: [[['embedding', 0, 0, {}]]]
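
Since the only differences reported above are layer names, a coarser check that ignores names can be useful. The following is a sketch with hypothetical helpers (`strip_index`, `same_layer_types`) and toy layer entries, not the real JSON files.

```python
import re

def strip_index(name):
    """Drop a trailing auto-numbering suffix such as '_1' from a layer name."""
    return re.sub(r'_\d+$', '', name)

def same_layer_types(old_layers, new_layers):
    """True if the two layer lists have the same class_name at every position."""
    return ([layer['class_name'] for layer in old_layers] ==
            [layer['class_name'] for layer in new_layers])

# toy entries mimicking the structure of config['layers']
old_layers = [{'class_name': 'Reshape', 'name': 'reshape_1'}]
new_layers = [{'class_name': 'Reshape', 'name': 'reshape'}]
print(same_layer_types(old_layers, new_layers))              # prints: True
print(strip_index('reshape_1') == strip_index('reshape'))    # prints: True
```

Stripping the suffix loses the information about which instance of a layer type is meant, so it is only suitable for a quick sanity check and not for matching weights to layers.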


In [16]:
cmap_model_json['config']['layers'][0]

{'config': {'ragged': False,
  'dtype': 'float32',
  'name': 'input_1',
  'batch_input_shape': [None, None],
  'sparse': False},
 'name': 'input_1',
 'class_name': 'InputLayer',
 'inbound_nodes': []}

In [17]:
my_model_json['config']['layers'][0]

{'class_name': 'InputLayer',
 'config': {'batch_input_shape': [None, None, None],
  'dtype': 'float32',
  'sparse': False,
  'ragged': False,
  'name': 'input_1'},
 'name': 'input_1',
 'inbound_nodes': []}

I first define the characters with which the input has been encoded, in the same order used by the encoding.

In [49]:
possible_chars = 'ARNDCQEGHILKMFPSTWYV-XBZU'
input_dim = len(possible_chars) + 1 # I add 1 since 0 is a placeholder for padding
input_dim

26
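
As an illustration of this encoding, a sequence can be mapped to integers as sketched below. The exact numbering is an assumption on my side: characters are numbered from 1 in the order of `possible_chars`, with 0 left free for padding; `char_to_int` and `encode` are hypothetical names.

```python
possible_chars = 'ARNDCQEGHILKMFPSTWYV-XBZU'
# assumption: characters are numbered from 1 in the order above, 0 is reserved for padding
char_to_int = {char: i + 1 for i, char in enumerate(possible_chars)}

def encode(sequence):
    """Convert a sequence of residue characters into a list of integer codes."""
    return [char_to_int[char] for char in sequence]

print(encode('AR-U'))  # prints: [1, 2, 21, 25]
```

With this numbering the largest code is 25, so an embedding with `input_dim=26` covers every code plus the padding value.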

I load a sample input vector. I have not defined an output y yet. For now I am filtering with a depth of 1000.

In [25]:
msa_vec = joblib.load('../processing/gray2018/hhblits_msa_filtered_vectors/P00552.npy.joblib.xz')
msa_depth = 1000
x = msa_vec[:msa_depth].T # the original input is L*Y where Y is depth, so I do the same
del msa_vec
x.shape

(4502, 1000)
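
The slicing above yields exactly 1000 columns only when the alignment has at least 1000 sequences. A possible way to handle shallower alignments is to pad with the placeholder value 0, as in the sketch below. It assumes that the stored vector has shape (number of sequences, L), which is what the slicing and transposition suggest; `fix_depth` is a hypothetical helper and the array is a toy example.

```python
import numpy as np

def fix_depth(msa, depth, pad_value=0):
    """Truncate or pad an MSA of shape (n_sequences, L) to exactly `depth` rows."""
    msa = msa[:depth]
    missing = depth - msa.shape[0]
    if missing > 0:
        # append rows filled with the padding value below the existing sequences
        msa = np.pad(msa, ((0, missing), (0, 0)), constant_values=pad_value)
    return msa

toy = np.ones((3, 7), dtype=np.int64)   # 3 sequences of length 7
x_toy = fix_depth(toy, 5).T             # transposed to (L, Y) as in the cell above
print(x_toy.shape)                      # prints: (7, 5)
```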

In [None]:
import time
for i in range(100):
    print(i)
    time.sleep(5)