# Covid-19 Open Vaccine

**Name: Stuart Hopkins**

**A-Number: A02080107**


## Introduction
Winning the fight against the COVID-19 pandemic will require an effective vaccine that can be equitably and widely distributed. Building upon decades of research has allowed scientists to accelerate the search for a vaccine against COVID-19, but every day that goes by without a vaccine has enormous costs for the world nonetheless. We need new, fresh ideas from all corners of the world. Could online gaming and crowdsourcing help solve a worldwide pandemic? Pairing scientific and crowdsourced intelligence could help computational biochemists make measurable progress.

mRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but currently, they face key potential limitations. One of the biggest challenges right now is how to design super stable messenger RNA molecules (mRNA). Conventional vaccines (like your seasonal flu shots) are packaged in disposable syringes and shipped under refrigeration around the world, but that is not currently possible for mRNA vaccines.

Researchers have observed that RNA molecules tend to degrade spontaneously. This is a serious limitation: a single cut can render the mRNA vaccine useless. Currently, little is known about where in the backbone a given RNA is most prone to degradation. Without this knowledge, current mRNA vaccines against COVID-19 must be prepared and shipped under intense refrigeration, and are unlikely to reach more than a tiny fraction of human beings on the planet unless they can be stabilized.



The Eterna community, led by Professor Rhiju Das, a computational biochemist at Stanford’s School of Medicine, brings together scientists and gamers to solve puzzles and invent medicine. Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles. The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules. The Eterna community has previously unlocked new scientific principles, made new diagnostics against deadly diseases, and engaged the world’s most potent intellectual resources for the betterment of the public. The Eterna community has advanced biotechnology through its contribution in over 20 publications, including advances in RNA biotechnology.

In this competition, we are looking to leverage the data science expertise of the Kaggle community to develop models and design rules for RNA degradation. Your model will predict likely degradation rates at each base of an RNA molecule, trained on a subset of an Eterna dataset comprising over 3000 RNA molecules (which span a panoply of sequences and structures) and their degradation rates at each position. We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines. These final test sequences are currently being synthesized and experimentally characterized at Stanford University in parallel to your modeling efforts -- Nature will score your models!

Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve. Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19. The problem we are trying to solve has eluded academic labs, industry R&D groups, and supercomputers, and so we are turning to you. To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.

## Resources
Big thanks to @xhlulu for the foundation of this project. The submission draws on many sources (and personal code, of course), but is largely based on the notebook by @xhlulu.

## Dependencies
#### Required Files:
All required files can be downloaded <a href="https://www.kaggle.com/c/stanford-covid-vaccine/data">here</a>
- input/test.json
- input/train.json
- input/sample_submission.csv

In [None]:
### Uncomment packages that you do not currently have installed to install them
# !pip install pandas
# !pip install numpy
# !pip install plotly
# !pip install tensorflow
# !pip install scikit-learn

In [None]:
import json

import pandas as pd
import numpy as np
import plotly.express as px
import tensorflow.keras.layers as L
import tensorflow as tf
from sklearn.model_selection import train_test_split

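# This notebook uses TensorFlow rather than PyTorch, so check for a GPU through tf.config
gpus = tf.config.list_physical_devices("GPU")
print("GPUs available:", gpus)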

## Set random seed
We fix the TensorFlow and NumPy seeds so that random operations (weight initialization, dropout, sampling) produce the same numbers on every run.

In [None]:
tf.random.set_seed(42069)
np.random.seed(42069)
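
As a quick sanity check of what seeding buys us, here is a minimal NumPy-only sketch. The helper `draw` is a made-up name for illustration and is not part of the notebook. Note that seeding alone may not make GPU training fully deterministic.

```python
import numpy as np

def draw(seed, n=3):
    """Reseed NumPy's global generator, then draw n uniform samples."""
    np.random.seed(seed)
    return np.random.rand(n)

first = draw(42069)
second = draw(42069)
other = draw(7)

# Same seed gives identical draws; a different seed gives a different stream.
same_seed_matches = np.array_equal(first, second)
different_seed_matches = np.array_equal(first, other)
print(same_seed_matches, different_seed_matches)
```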

## Load and Preprocess Data

In [None]:
bs = 32
pred_len = 68
pred_cols = ['reactivity', 'deg_Mg_pH10', 'deg_Mg_50C', 'deg_pH10', 'deg_50C']

In [None]:
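# Random tensors with the label shape (batch, positions, targets),
# handy for smoke-testing a loss function before real data is loaded
y_true = tf.random.normal((bs, pred_len, len(pred_cols)))
y_pred = tf.random.normal((bs, pred_len, len(pred_cols)))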

In [None]:
data_dir = 'input/'
train = pd.read_json(data_dir + 'train.json', lines=True)
test = pd.read_json(data_dir + 'test.json', lines=True)
sample_df = pd.read_csv(data_dir + 'sample_submission.csv')

In [None]:
train = train.query("signal_to_noise >= 1")

In [None]:
def pandas_list_to_array(df):
    """
    Input: dataframe of shape (x, y), containing list of length l
    Return: np.array of shape (x, l, y)
    """
    
    return np.transpose(
        np.array(df.values.tolist()),
        (0, 2, 1)
    )
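
To make the shape bookkeeping concrete, here is a small sketch that runs the same function on a toy dataframe. The column names and values are invented for illustration; the real frames hold 107 or 130 entries per cell.

```python
import numpy as np
import pandas as pd

def pandas_list_to_array(df):
    # Same logic as the notebook function: (x, y, l) -> (x, l, y)
    return np.transpose(np.array(df.values.tolist()), (0, 2, 1))

# Toy frame: 2 rows (x=2), 3 columns (y=3), each cell a list of length 4 (l=4)
toy = pd.DataFrame({
    "a": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "b": [[10, 20, 30, 40], [50, 60, 70, 80]],
    "c": [[0, 0, 0, 0], [1, 1, 1, 1]],
})

arr = pandas_list_to_array(toy)
print(arr.shape)           # (2, 4, 3)
print(arr[0, 1].tolist())  # position 1 of row 0 across columns: [2, 20, 0]
```

The last axis indexes the original columns, so each position in a sequence ends up with one value per column.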

In [None]:
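def preprocess_inputs(df, tokens_as_ints, cols=["sequence", "structure", "predicted_loop_type"]):
    # Encode every character of each string column as its integer token.
    # Note: pandas >= 2.1 deprecates `applymap` in favor of `DataFrame.map`.
    return pandas_list_to_array(
        df[cols].applymap(lambda sequence: [tokens_as_ints[item] for item in sequence])
    )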

In [None]:
# Map each character to an integer
tokens_as_ints = {x:i for i, x in enumerate('().ACGUBEHIMSX')}

train_labels = pandas_list_to_array(train[pred_cols])
train_inputs = preprocess_inputs(train, tokens_as_ints)
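
Here is a pure-Python sketch of what the token mapping does to a single sample. The five-character sequence, structure, and loop-type strings are invented toy values, and `encode` is a hypothetical helper, not a notebook function.

```python
# Same vocabulary as the notebook: structure symbols, bases, then loop types
tokens_as_ints = {x: i for i, x in enumerate('().ACGUBEHIMSX')}

def encode(text):
    """Map each character of a string to its integer token."""
    return [tokens_as_ints[ch] for ch in text]

# Toy sample (assumed values, far shorter than the real 107-base sequences)
sequence = "GAAAC"
structure = "(...)"
loop_type = "SHHHS"

# One (sequence, structure, loop_type) triple per position
features = list(zip(encode(sequence), encode(structure), encode(loop_type)))
print(features)
# [(5, 0, 12), (3, 2, 9), (3, 2, 9), (3, 2, 9), (4, 1, 12)]
```

All three columns share one vocabulary of 14 tokens, which is why `len(tokens_as_ints)` is later passed to the model as `embed_size`.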

In [None]:
x_train, x_val, y_train, y_val = train_test_split(
    train_inputs, train_labels, test_size=.1, random_state=34, stratify=train.SN_filter)

In [None]:
public_df = test.query("seq_length == 107")
private_df = test.query("seq_length == 130")

public_inputs = preprocess_inputs(public_df, tokens_as_ints)
private_inputs = preprocess_inputs(private_df, tokens_as_ints)

## Build and train model
We will train a bidirectional GRU model with three layers and dropout. To learn more about RNNs, LSTMs, and GRUs, please see this blog post.

In [None]:
def MCRMSE(y_true, y_pred):
    column_mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=1)
    return tf.reduce_mean(tf.sqrt(column_mse), axis=1)
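
To see what the loss computes, here is a NumPy re-statement of the same two reductions on a tiny hand-checkable input. The name `mcrmse_np` and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def mcrmse_np(y_true, y_pred):
    """Mean column-wise RMSE for arrays of shape (batch, positions, columns)."""
    column_mse = np.mean((y_true - y_pred) ** 2, axis=1)  # -> (batch, columns)
    return np.mean(np.sqrt(column_mse), axis=1)           # -> (batch,)

# One sample, two positions, two target columns
y_true = np.zeros((1, 2, 2))
y_pred = np.array([[[1.0, 1.0],
                    [7.0, 1.0]]])

# Column 0: errors 1 and 7 -> MSE 25 -> RMSE 5
# Column 1: errors 1 and 1 -> MSE 1  -> RMSE 1
# Mean of the two column RMSEs: 3
score = mcrmse_np(y_true, y_pred)
print(score)  # [3.]
```

The RMSE is taken per target column first and only then averaged, so one badly predicted column cannot hide behind the others as easily as with a single pooled RMSE.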

In [None]:
def gru_layer(hidden_dim, dropout):
    return L.Bidirectional(L.GRU(
        hidden_dim, dropout=dropout, return_sequences=True, kernel_initializer='orthogonal'))

In [None]:
def build_model(embed_size, seq_len=107, pred_len=68, dropout=0.5, 
                sp_dropout=0.2, embed_dim=200, hidden_dim=256, n_layers=3):
    inputs = L.Input(shape=(seq_len, 3))
    embed = L.Embedding(input_dim=embed_size, output_dim=embed_dim)(inputs)
    
    reshaped = tf.reshape(
        embed, shape=(-1, embed.shape[1],  embed.shape[2] * embed.shape[3])
    )
    hidden = L.SpatialDropout1D(sp_dropout)(reshaped)
    
    for _ in range(n_layers):
        hidden = gru_layer(hidden_dim, dropout)(hidden)
    
    # Since we are only making predictions on the first part of each sequence, 
    # we have to truncate it
    truncated = hidden[:, :pred_len]
    out = L.Dense(5, activation='linear')(truncated)
    
    model = tf.keras.Model(inputs=inputs, outputs=out)
    model.compile(tf.optimizers.Adam(), loss=MCRMSE)
    
    return model
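
The embedding and reshape steps in `build_model` can be hard to picture, so here is a NumPy sketch of the same idea. The embedding lookup is imitated by plain table indexing, and `embed_dim` is shrunk from 200 to 4 for readability; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 14, 4, 6

# One shared lookup table: a row of embed_dim values per token
table = rng.normal(size=(vocab_size, embed_dim))

# Batch of 2 samples, each position holding 3 tokens
# (sequence, structure, loop type)
tokens = rng.integers(0, vocab_size, size=(2, seq_len, 3))

embedded = table[tokens]                             # (2, 6, 3, 4)
flat = embedded.reshape(-1, seq_len, 3 * embed_dim)  # (2, 6, 12)

# After the reshape, the three embeddings sit side by side at each position
first_channel = flat[0, 0, :embed_dim]
print(embedded.shape, flat.shape)
print(np.array_equal(first_channel, table[tokens[0, 0, 0]]))
```

With the notebook's `embed_dim=200`, the same concatenation gives 600 features per position, which is what the recurrent layers receive.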

In [None]:
model = build_model(embed_size=len(tokens_as_ints))
model.summary()

Model: "functional_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 107, 3)]          0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 107, 3, 200)       2800      
_________________________________________________________________
tf_op_layer_Reshape_3 (Tenso [(None, 107, 600)]        0         
_________________________________________________________________
spatial_dropout1d_3 (Spatial (None, 107, 600)          0         
_________________________________________________________________
bidirectional_9 (Bidirection (None, 107, 512)          1317888   
_________________________________________________________________
bidirectional_10 (Bidirectio (None, 107, 512)          1182720   
_________________________________________________________________
bidirectional_11 (Bidirectio (None, 107, 512)         

In [None]:
history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=64,
    epochs=1, # TODO: Up this number
    verbose=2,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau(patience=5),
        tf.keras.callbacks.ModelCheckpoint('model.h5')
    ]
)

30/30 - 149s - loss: 0.4533 - val_loss: 0.3862


## Evaluate training history
Let's use Plotly to quickly visualize the training and validation loss throughout the epochs.

In [None]:
fig = px.line(
    history.history, y=['loss', 'val_loss'],
    labels={'index': 'epoch', 'value': 'MCRMSE'}, 
    title='Training History')
fig.show()

## Load models and make predictions
Public and private sets have different sequence lengths, so we will preprocess them separately and build two models with different input shapes. This is possible because the weights of an RNN do not depend on the sequence length, so the same trained weights can be loaded into both models.
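
As a rough illustration of why sequence length does not matter, the sketch below runs one set of weights over inputs of length 107 and 130. It uses a plain tanh RNN cell in NumPy rather than a GRU, and all names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, hid_dim = 3, 5

# The weights are sized by feature dimensions only, never by sequence length
W_x = rng.normal(size=(in_dim, hid_dim))
W_h = rng.normal(size=(hid_dim, hid_dim))
b = np.zeros(hid_dim)

def run_rnn(x):
    """Apply a simple tanh RNN to x of shape (seq_len, in_dim)."""
    h = np.zeros(hid_dim)
    outputs = []
    for x_t in x:  # one step per position, reusing the same weights
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        outputs.append(h)
    return np.stack(outputs)  # (seq_len, hid_dim)

short_out = run_rnn(rng.normal(size=(107, in_dim)))
long_out = run_rnn(rng.normal(size=(130, in_dim)))
print(short_out.shape, long_out.shape)  # (107, 5) (130, 5)
```

A GRU adds gates to this update, but its weight shapes follow the same rule, which is why one `model.h5` file serves both models.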

In [None]:
# Caveat: The prediction format requires the output to be the same length as the input,
# although it's not the case for the training data.
model_public = build_model(seq_len=107, pred_len=107, embed_size=len(tokens_as_ints))
model_private = build_model(seq_len=130, pred_len=130, embed_size=len(tokens_as_ints))

model_public.load_weights('model.h5')
model_private.load_weights('model.h5')

In [None]:
public_preds = model_public.predict(public_inputs)
private_preds = model_private.predict(private_inputs)

## Post-processing and submit
For each sample, we take the predicted tensors of shape (107, 5) or (130, 5), and convert them to the long format, i.e. (629 × 107, 5) or (3005 × 130, 5):

In [None]:
preds_ls = []

for df, preds in [(public_df, public_preds), (private_df, private_preds)]:
    for i, uid in enumerate(df.id):
        single_pred = preds[i]

        single_df = pd.DataFrame(single_pred, columns=pred_cols)
        single_df['id_seqpos'] = [f'{uid}_{x}' for x in range(single_df.shape[0])]

        preds_ls.append(single_df)

preds_df = pd.concat(preds_ls)
preds_df.head()

| | reactivity | deg_Mg_pH10 | deg_Mg_50C | deg_pH10 | deg_50C | id_seqpos |
|---|---|---|---|---|---|---|
| 0 | 0.85786 | 1.157566 | 1.137862 | 2.102659 | 1.085711 | id_00073f8be_0 |
| 1 | 1.180233 | 1.317555 | 1.413677 | 2.000974 | 1.346914 | id_00073f8be_1 |
| 2 | 1.216386 | 1.246495 | 1.439602 | 1.650966 | 1.345469 | id_00073f8be_2 |
| 3 | 1.069229 | 1.073168 | 1.301341 | 1.28694 | 1.195011 | id_00073f8be_3 |
| 4 | 0.845142 | 0.867759 | 1.079663 | 0.973473 | 0.984996 | id_00073f8be_4 |


In [None]:
submission = sample_df[['id_seqpos']].merge(preds_df, on=['id_seqpos'])
submission.to_csv('submission.csv', index=False) 