<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:blue; border:0; color:white' role="tab" aria-controls="home"><center>Introduction to the Competition: OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction
 </center></h2>

COVID-19 needs no introduction: the virus has already killed millions of people around the world.

Winning the fight against the COVID-19 pandemic will require an effective vaccine that can be equitably and widely distributed. 

mRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but they currently face key potential limitations. One of the biggest challenges right now is how to design highly stable messenger RNA (mRNA) molecules. 

Researchers have observed that RNA molecules have the tendency to spontaneously degrade. 

This is a serious limitation: a single cut can render the mRNA vaccine useless. 

Currently, little is known about which positions in the backbone of a given RNA are most prone to degradation. 

Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve. Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19. 

The problem we are trying to solve has eluded academic labs, industry R&D groups, and supercomputers, and now it's our turn. 

In this competition, we are looking to leverage the data science expertise of the Kaggle community to develop models and design rules for RNA degradation. 

We need to build a model which will predict likely degradation rates at each base of an RNA molecule, trained on a subset of an Eterna dataset comprising over 3000 RNA molecules (which span a panoply of sequences and structures) and their degradation rates at each position. 

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:blue; border:0; color:white' role="tab" aria-controls="home"><center>What is RNA & Why Is It So Prone to Degradation?
 </center></h2>

    
`Ribonucleic acid / RNA`

Ribonucleic acid (RNA) is a `linear molecule` composed of `four types of smaller molecules` called `ribonucleotide bases`: 

* adenine  (A)
* cytosine (C)
* guanine  (G)
* uracil   (U)

RNA is often compared to a copy from a reference book, or a template, because it carries the same information as its DNA template but is not used for long-term storage.

Each ribonucleotide base consists of a ribose sugar, a phosphate group, and a nitrogenous base. 
    
`Adjacent ribose nucleotide` bases are chemically attached to one another in a chain via chemical bonds called `phosphodiester bonds`. 
    
Unlike DNA, RNA is usually `single-stranded`. 

Additionally, RNA contains ribose sugars rather than deoxyribose sugars. The extra 2'-hydroxyl group on ribose can attack the neighbouring phosphodiester bond, which makes RNA less stable and more prone to degradation.

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:blue; border:0; color:white' role="tab" aria-controls="home"><center> Import Necessary Tools 🛠
 </center></h2>

In [None]:
import json

import pandas as pd
import numpy as np
import plotly.express as px
import tensorflow.keras.layers as L
import tensorflow as tf
from sklearn.model_selection import train_test_split

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:blue; border:0; color:white' role="tab" aria-controls="home"><center> Read JSON Files ✔
 </center></h2>

In [None]:
data_dir = '/kaggle/input/stanford-covid-vaccine/'
train = pd.read_json(data_dir + 'train.json', lines=True)
test = pd.read_json(data_dir + 'test.json', lines=True)
sample_df = pd.read_csv(data_dir + 'sample_submission.csv')

In [None]:
train.shape,test.shape

In [None]:
train.head()

In [None]:
test.head()

In [None]:
sample_df.head()

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:blue; border:0; color:white' role="tab" aria-controls="home"><center>Let's Check the Data 📁
 </center></h2>
    
### We will check all the fields given [here](https://www.kaggle.com/c/stanford-covid-vaccine/data)

### seq_scored - (68 in Train and Public Test, 91 in Private Test) 
Integer value denoting the number of positions used in scoring with predicted values. 

This should match the length of the `reactivity`, `deg_*` and `*_error_*` columns.

In [None]:
print("Unique values & no. of occurrences for seq_scored in the training dataset:\n",train.seq_scored.value_counts())
print("\nUnique values & no. of occurrences for seq_scored in the test dataset:\n",test.seq_scored.value_counts())

<div class="alert alert-block alert-info">
<b>Key Observations 📌</b> 

* The counts match the data page of the competition exactly: all records in the training dataset have a value of 68 for seq_scored
    
* The test dataset has 2 different values: 68 for public test data and 91 for private test data
    
* The number of records for each value is also in sync with the data page, for both train and test datasets
</div>

### Now let's verify the columns which are tightly bound to seq_scored
* `reactivity`
* `deg_*`
* `*_error_*`

In [None]:
# training dataset
deg_columns = ['reactivity','deg_error_Mg_pH10', 'deg_error_pH10','deg_error_Mg_50C', 'deg_error_50C', 'deg_Mg_pH10','deg_pH10', 'deg_Mg_50C', 'deg_50C']

for col in deg_columns:
    # collect the distinct list lengths found in this column
    lengths = set(train[col].apply(len))

    print("Distinct lengths of values for " + col + " in training dataset:", lengths)


<div class="alert alert-block alert-info">
<b>Key Observations 📌</b> 

* We verified above that all the columns in question hold lists of length 68 for every record in the training dataset.
This is in sync with the information provided by the seq_scored field 
</div>

### seq_length - (107 in Train and Public Test, 130 in Private Test) 
An integer denoting the length of `sequence`. Note that molecules used for the Private Test are longer than those in the Train and Public Test data, so the size of this vector differs.

In [None]:
print("Unique values & their occurrences for seq_length in the training dataset:\n",train.seq_length.value_counts())
print("\nUnique values & their occurrences for seq_length in the test dataset:\n",test.seq_length.value_counts())

<div class="alert alert-block alert-info">
<b>Key Observations 📌</b> 

* For each record in the training dataset, the sequence length is 107
    
* The test dataset has a mix of sequence lengths 107 and 130, which correspond to the public and private test data respectively

* The data is in sync with what is mentioned on the data page for both the training and test datasets
</div>

### Now let's verify the column "sequence", which is tightly bound to seq_length

In [None]:
# training dataset
length = []
for each in range(train.shape[0]):
    length.append(len(train.sequence.iloc[each]))

print("length of different values for sequence in training dataset:",set(length))

# test dataset
length = []
for each in range(test.shape[0]):
    length.append(len(test.sequence.iloc[each]))

print("\nlength of different values for sequence in test dataset:",set(length))


<div class="alert alert-block alert-info">
<b>Key Observations 📌</b> 

* We have now verified that sequence has a length of 107 for each record in the training dataset
 
* Whereas test dataset has a mix of 130 and 107 lengths which represents private and public test data
</div>

### structure - (1x107 string in Train and Public Test, 130 in Private Test) 
An array of (, ), and . characters that describe whether a base is estimated to be paired or unpaired. Paired bases are denoted by opening and closing parentheses e.g. (....) means that base 0 is paired to base 5, and bases 1-4 are unpaired.
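
The pairing rule described above can be sketched with a small stack-based parser. This is only an illustration: the helper name `paired_positions` is hypothetical and not part of the competition data or this notebook's pipeline, and it assumes a well-formed string that contains only `(`, `)` and `.`.

```python
def paired_positions(structure):
    """Return a dict mapping each paired index to the index of its partner.

    Hypothetical helper for illustration; assumes balanced parentheses.
    """
    stack = []   # indices of '(' that are still waiting for a partner
    pairs = {}
    for i, ch in enumerate(structure):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            j = stack.pop()   # most recent unmatched '('
            pairs[i] = j
            pairs[j] = i
        # '.' marks an unpaired base, so nothing is recorded for it
    return pairs


example_pairs = paired_positions('(....)')
print(example_pairs)  # {5: 0, 0: 5}
```

For `(....)` the result says that base 0 is paired with base 5, while bases 1-4 do not appear because they are unpaired.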

In [None]:
# training dataset
length = []
for each in range(train.shape[0]):
    length.append(len(train.structure.iloc[each]))

print("length of different values for structure in training dataset:",set(length))

# test dataset
length = []
for each in range(test.shape[0]):
    length.append(len(test.structure.iloc[each]))

print("\nlength of different values for structure in test dataset:",set(length))


<div class="alert alert-block alert-info">
<b>Key Observations 📌</b> 

* We have now verified that structure has a length of 107 for each record in the training dataset
 
* Whereas test dataset has a mix of 130 and 107 lengths which represents private and public test data
</div>

### predicted_loop_type - (1x107 string)
Describes the structural context (also referred to as 'loop type') of each character in `sequence`. Loop types are assigned by bpRNA from the Vienna RNAfold 2 structure. From the bpRNA documentation:

* S: paired "Stem"
* M: Multiloop
* I: Internal loop
* B: Bulge
* H: Hairpin loop
* E: dangling End
* X: eXternal loop

In [None]:
# training dataset
length = []
for each in range(train.shape[0]):
    length.append(len(train.predicted_loop_type.iloc[each]))

print("length of different values for predicted_loop_type in training dataset:",set(length))

# test dataset
length = []
for each in range(test.shape[0]):
    length.append(len(test.predicted_loop_type.iloc[each]))

print("\nlength of different values for predicted_loop_type in test dataset:",set(length))


<div class="alert alert-block alert-info">
<b>Key Observations 📌</b> 

* We have now verified that predicted_loop_type has a length of 107 for each record in the training dataset
 
* Whereas test dataset has a mix of 130 and 107 lengths which represents private and public test data
</div>

### We have now seen what the data looks like, so let's move on to data pre-processing

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:blue; border:0; color:white' role="tab" aria-controls="home"><center>Data Preprocessing ✍
 </center></h2>

<div class="alert alert-block alert-info">
<b>Please Note:</b> 

As mentioned in the additional notes on the data page:
    
Mean signal/noise across all 5 conditions must be greater than 1.0. [Signal/noise is defined as mean( measurement value over 68 nts )/mean( statistical error in measurement value over 68 nts)]

We will filter out records with a signal-to-noise value below 1.0
                                         
</div>
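
As a rough illustration of the signal/noise definition quoted above, the ratio for one condition can be computed as follows. The function name and the toy numbers are assumptions made for this sketch (4 positions instead of 68); the `signal_to_noise` column shipped with the dataset is already computed and is what the filter actually uses.

```python
import numpy as np


def signal_to_noise_ratio(values, errors):
    """mean(measurement values) / mean(statistical errors) for one condition.

    Hypothetical helper for illustration only.
    """
    return float(np.mean(values) / np.mean(errors))


# Toy measurements and their errors over 4 positions (made-up numbers)
toy_values = np.array([0.8, 0.6, 1.0, 0.4])
toy_errors = np.array([0.2, 0.1, 0.3, 0.2])

toy_ratio = signal_to_noise_ratio(toy_values, toy_errors)
print(toy_ratio)
```

A ratio well above 1.0, as in this toy case, means the measured signal is larger than its estimated error on average.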


In [None]:
# keep only the records with signal_to_noise >= 1
train = train.query("signal_to_noise >= 1")
train.shape

### Helper Variables & Functions to pre-process the input data

In [None]:
# This function converts columns of lists into an array which can be fed into a Keras model
def pandas_list_to_array(df):
    """
    Input: dataframe of shape (x, y), containing list of length l
    Return: np.array of shape (x, l, y)
    """
    
    return np.transpose(
        np.array(df.values.tolist()),
        (0, 2, 1)
    )
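
To make the shape change concrete, here is a small sketch that repeats the same two steps on a toy dataframe. The column names `a` and `b` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe: 2 rows (x=2), 2 columns (y=2), each cell a list of length 3 (l=3)
toy_df = pd.DataFrame({
    'a': [[1, 2, 3], [4, 5, 6]],
    'b': [[10, 20, 30], [40, 50, 60]],
})

nested = np.array(toy_df.values.tolist())     # shape (x, y, l) = (2, 2, 3)
toy_array = np.transpose(nested, (0, 2, 1))   # shape (x, l, y) = (2, 3, 2)

print(toy_array.shape)
print(toy_array[0])
```

After the transpose, each row of `toy_array[0]` holds the values of all columns at one position, which is the (samples, positions, features) layout a sequence model expects.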

In [None]:
# This function converts the character columns of a dataset into an integer array
# df is the training or the test dataset
# token2int is a dictionary which contains the character/integer mapping

def preprocess_inputs(df, token2int, cols=('sequence', 'structure', 'predicted_loop_type')):
    # Series.map is applied column by column; this avoids DataFrame.applymap,
    # which is deprecated in recent pandas versions
    encoded = df[list(cols)].apply(
        lambda col: col.map(lambda seq: [token2int[x] for x in seq])
    )
    return pandas_list_to_array(encoded)

In [None]:
# target columns the model will predict
pred_cols = ['reactivity', 'deg_Mg_pH10', 'deg_Mg_50C', 'deg_pH10', 'deg_50C']

<div class="alert alert-block alert-info">
<b>Please Note:</b> 

Even though we have taken 5 target variables above, only 3 (`reactivity`, `deg_Mg_pH10`, and `deg_Mg_50C`) are going to be scored
</div>

<div class="alert alert-block alert-info">
<b>Please Note:</b> 

We have following 3 columns with character sequence:
  
* sequence
* structure 
* predicted_loop_type
    
we need to convert them to integers so that we can feed them into the model.                              
</div>

In [None]:
# we are using a dictionary here to map each character to a unique integer
token2int = {x:i for i, x in enumerate('().ACGUBEHIMSX')}

# calling the function defined above to apply the actual character to integer conversion
# train_inputs is the dataframe we are going to use to feed our keras model
train_inputs = preprocess_inputs(train, token2int)

# reshape the target variables into an array which can be fed into the keras model
train_labels = pandas_list_to_array(train[pred_cols])
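
To see what the character-to-integer mapping produces, the sketch below encodes three short made-up strings with the same vocabulary. The helper `encode` and the `_demo` names are hypothetical and exist only for this illustration.

```python
# Same vocabulary string as above: structure symbols, bases, then loop types
token2int_demo = {x: i for i, x in enumerate('().ACGUBEHIMSX')}


def encode(text, mapping):
    """Map every character of text to its integer id (illustrative helper)."""
    return [mapping[ch] for ch in text]


encoded_sequence = encode('GGAAA', token2int_demo)
encoded_structure = encode('(...)', token2int_demo)
encoded_loop_type = encode('SHHHS', token2int_demo)

print(encoded_sequence)   # [5, 5, 3, 3, 3]
print(encoded_structure)  # [0, 2, 2, 2, 1]
print(encoded_loop_type)  # [12, 9, 9, 9, 12]
```

Because the three columns share one dictionary, a single embedding table can later cover all of them.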

In [None]:
# sets the random seed
tf.random.set_seed(2020)
np.random.seed(2020)

In [None]:
# Dummy tensors of shape (batch, seq_scored, scored targets), handy for sanity-checking a loss function
y_true = tf.random.normal((32, 68, 3))
y_pred = tf.random.normal((32, 68, 3))

<div class="list-group" id="list-tab" role="tablist">
<a id="10"></a>
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:blue; border:0; color:white' role="tab" aria-controls="home"><center>MCRMSE - Mean Columnwise Root Mean Squared Error 🔬
 </center></h2>

MCRMSE is the evaluation metric used in this competition.

The reason we are using MCRMSE in this challenge is that there are multiple outputs to predict.

Normally, we can calculate RMSE to get a single-number evaluation metric for our predictions. But when we predict multiple values at once (in the OpenVaccine competition, degradation rates under multiple conditions), we get multiple RMSE values, one for each column.

The MCRMSE is simply an average across all RMSE values for each of our columns, so we can still use a single-number evaluation metric, even in the case of multiple outputs.
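
Written out, the description above corresponds to the following formula. The symbols are notation chosen for this sketch: $N_t$ is the number of scored target columns, $n$ is the number of scored positions, and $y_{ij}$ and $\hat{y}_{ij}$ are the true and predicted values at position $i$ for column $j$.

$$
\textrm{MCRMSE} = \frac{1}{N_t} \sum_{j=1}^{N_t} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_{ij} - \hat{y}_{ij} \right)^2}
$$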

### Let's write a helper function to implement MCRMSE.

In [None]:
# RMSE of each target column over the sequence positions, then the mean across columns (one value per sample)
def MCRMSE(y_true, y_pred):
    colwise_mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=1)
    return tf.reduce_mean(tf.sqrt(colwise_mse), axis=1)
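
As a quick sanity check, the same computation can be mirrored in plain NumPy on numbers small enough to verify by hand. The function name `mcrmse_numpy` and the toy arrays are assumptions for this sketch; it is not the loss that is passed to Keras.

```python
import numpy as np


def mcrmse_numpy(y_true, y_pred):
    """NumPy mirror of the loss above; inputs have shape (batch, positions, targets)."""
    colwise_mse = np.mean(np.square(y_true - y_pred), axis=1)  # (batch, targets)
    return np.mean(np.sqrt(colwise_mse), axis=1)               # (batch,)


# One sample, two positions, two target columns
toy_true = np.zeros((1, 2, 2))
toy_pred = np.array([[[3.0, 1.0],
                      [3.0, 1.0]]])

toy_score = mcrmse_numpy(toy_true, toy_pred)
print(toy_score)
```

The first column is off by 3 everywhere (RMSE 3) and the second by 1 everywhere (RMSE 1), so the column-wise mean is 2 for this sample.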

# GRU LAYER

In [None]:
def gru_layer(hidden_dim, dropout):
    return L.Bidirectional(L.GRU(
        hidden_dim, dropout=dropout, return_sequences=True, kernel_initializer='orthogonal'))

In [None]:
def build_model(embed_size, seq_len=107, pred_len=68, dropout=0.5, 
                sp_dropout=0.2, embed_dim=200, hidden_dim=256, n_layers=3):
    inputs = L.Input(shape=(seq_len, 3))
    embed = L.Embedding(input_dim=embed_size, output_dim=embed_dim)(inputs)
    
    reshaped = tf.reshape(
        embed, shape=(-1, embed.shape[1],  embed.shape[2] * embed.shape[3])
    )
    hidden = L.SpatialDropout1D(sp_dropout)(reshaped)
    
    for x in range(n_layers):
        hidden = gru_layer(hidden_dim, dropout)(hidden)
    
    # Since we are only making predictions on the first part of each sequence, 
    # we have to truncate it
    truncated = hidden[:, :pred_len]
    out = L.Dense(5, activation='linear')(truncated)
    
    model = tf.keras.Model(inputs=inputs, outputs=out)
    model.compile(tf.optimizers.Adam(), loss=MCRMSE)
    
    return model
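
The embedding and reshape steps at the top of `build_model` can be hard to picture, so here is a NumPy-only sketch of the shapes involved. It assumes an embedding layer behaves like a lookup into a table of vectors; the random table, the tiny `embed_dim` of 4, and all variable names are illustrative choices, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 14   # number of tokens, matching len(token2int)
embed_dim = 4     # shrunk from 200 to keep the example readable

# Stand-in for the embedding weights: one vector per token id
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# Fake batch of 2 samples, 107 positions, 3 token ids per position
tokens = rng.integers(0, vocab_size, size=(2, 107, 3))

embedded = embedding_table[tokens]                     # (2, 107, 3, 4)
flattened = embedded.reshape(2, 107, 3 * embed_dim)    # (2, 107, 12)

print(embedded.shape, flattened.shape)
```

After the reshape, each position carries the concatenated embeddings of its sequence, structure and loop-type tokens, which is the 3D input the recurrent layers consume.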

In [None]:
x_train, x_val, y_train, y_val = train_test_split(
    train_inputs, train_labels, test_size=.1, random_state=34, stratify=train.SN_filter)

In [None]:
public_df = test.query("seq_length == 107")
private_df = test.query("seq_length == 130")

public_inputs = preprocess_inputs(public_df, token2int)
private_inputs = preprocess_inputs(private_df, token2int)

In [None]:
model = build_model(embed_size=len(token2int))
model.summary()

In [None]:
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=32,
    epochs=50,
    verbose=2,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau(patience=5),
        tf.keras.callbacks.ModelCheckpoint('model.h5')
    ]
)


# Let's check how our model is doing

In [None]:
fig = px.line(
    history.history, y=['loss', 'val_loss'],
    labels={'index': 'epoch', 'value': 'MCRMSE'}, 
    title='Training History')
fig.show()

# Predictions

In [None]:
# Caveat: The prediction format requires the output to be the same length as the input,
# although it's not the case for the training data.
model_public = build_model(seq_len=107, pred_len=107, embed_size=len(token2int))
model_private = build_model(seq_len=130, pred_len=130, embed_size=len(token2int))

model_public.load_weights('model.h5')
model_private.load_weights('model.h5')

In [None]:
public_preds = model_public.predict(public_inputs)
private_preds = model_private.predict(private_inputs)

# Post-processing and submit

For each sample, we take the predicted tensors of shape (107, 5) or (130, 5), and convert them to the long format (i.e. $629 \times 107, 5$ or $3005 \times 130, 5$):

In [None]:
preds_ls = []

for df, preds in [(public_df, public_preds), (private_df, private_preds)]:
    for i, uid in enumerate(df.id):
        single_pred = preds[i]

        single_df = pd.DataFrame(single_pred, columns=pred_cols)
        single_df['id_seqpos'] = [f'{uid}_{x}' for x in range(single_df.shape[0])]

        preds_ls.append(single_df)

preds_df = pd.concat(preds_ls)
preds_df.head()
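
The long-format conversion is easier to follow on a toy case. The sketch below applies the same per-sample steps to two made-up ids with 3 positions and 2 columns each; all names and values are illustrative.

```python
import numpy as np
import pandas as pd

toy_cols = ['reactivity', 'deg_Mg_pH10']
toy_ids = ['id_aaa', 'id_bbb']

# Fake predictions: 2 samples, 3 positions, 2 target columns
toy_preds = np.arange(12, dtype=float).reshape(2, 3, 2)

frames = []
for uid, single_pred in zip(toy_ids, toy_preds):
    single_df = pd.DataFrame(single_pred, columns=toy_cols)
    # One row per position, keyed as "<sample id>_<position>"
    single_df['id_seqpos'] = [f'{uid}_{pos}' for pos in range(len(single_df))]
    frames.append(single_df)

toy_long = pd.concat(frames, ignore_index=True)
print(toy_long)
```

The result has 2 × 3 = 6 rows, one per (sample, position) pair, which is the layout expected in `id_seqpos`-keyed submissions.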

In [None]:
submission = sample_df[['id_seqpos']].merge(preds_df, on=['id_seqpos'])
submission.to_csv('submission.csv', index=False)