In [None]:
import warnings
warnings.filterwarnings('ignore')

#the basics
import pandas as pd, numpy as np, seaborn as sns
import math, json, os, random
from matplotlib import pyplot as plt
from tqdm import tqdm

#tensorflow basics
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.keras import backend as K

#for model evaluation
from sklearn.model_selection import train_test_split, StratifiedKFold, RepeatedStratifiedKFold

In [None]:
def seed_everything(seed = 34):
    os.environ['PYTHONHASHSEED']=str(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    
seed_everything()

# Version Changes

**Version 10**:

* added competition metric, as inspired by [Xhlulu](https://www.kaggle.com/xhlulu)'s discussion post [here](https://www.kaggle.com/c/stanford-covid-vaccine/discussion/183211)
* removed filtering (no `SN_filter == 1` constraint)
* added kfold stratification by `SN_filter`


# Competition Overview

**In this [new competition](https://www.kaggle.com/c/stanford-covid-vaccine/overview) we are helping to fight the worldwide COVID-19 pandemic. mRNA vaccines are the fastest vaccine candidates for COVID-19, but they currently face several limitations. In particular, it is a challenge to design stable messenger RNA molecules. Typical vaccines are packaged in syringes and shipped under refrigeration around the world, but that is not (currently) possible for mRNA vaccines.**

**Researchers have noticed that RNA molecules tend to spontaneously degrade, which is highly problematic because a single cut can render an mRNA vaccine useless. Not much is known about which part of the backbone of a particular RNA is most susceptible to damage.**

**Without this knowledge, the current mRNA vaccines are shipped under intense refrigeration and are unlikely to reach enough people unless they can be stabilized. This is our task as Kagglers: we must create a model to predict the most likely degradation rates at each base of an RNA molecule.**

**We are given a subset of an Eterna dataset comprising over 3000 RNA molecules and their degradation rates at each position. Our models are then tested on the new generation of RNA sequences that were just created by Eterna players for COVID-19 mRNA vaccines.**

**Before we get started, please check out [Xhlulu](https://www.kaggle.com/xhlulu)'s notebook [here](https://www.kaggle.com/xhlulu/openvaccine-simple-gru-model), as this one is based on it. I added comments, made minor code changes, and added an LSTM and fold training:**

In [None]:
#get comp data
train = pd.read_json('/kaggle/input/stanford-covid-vaccine/train.json', lines=True)
test = pd.read_json('/kaggle/input/stanford-covid-vaccine/test.json', lines=True)
sample_sub = pd.read_csv('/kaggle/input/stanford-covid-vaccine/sample_submission.csv')

# I. Brief EDA

**From the data [description tab](https://www.kaggle.com/c/stanford-covid-vaccine/data), we must predict multiple ground truths in this competition, 5 to be exact. While the submission requires all 5, only 3 are scored: `reactivity`, `deg_Mg_pH10` and `deg_Mg_50C`. It might be interesting to see how performance differs when training for all 5 predictors vs. just the 3 that are scored.**

**The training features we are given are as follows:**

* **id** - An arbitrary identifier for each sample.
* **seq_scored** - (68 in Train and Public Test, 91 in Private Test) Integer value denoting the number of positions used in scoring with predicted values. This should match the length of `reactivity`, `deg_*` and `*_error_*` columns. Note that molecules used for the Private Test will be longer than those in the Train and Public Test data, so the size of this vector will be different.
* **seq_length** - (107 in Train and Public Test, 130 in Private Test) Integer values, denotes the length of `sequence`. Note that molecules used for the Private Test will be longer than those in the Train and Public Test data, so the size of this vector will be different.
* **sequence** - (1x107 string in Train and Public Test, 130 in Private Test) Describes the RNA sequence, a combination of `A`, `G`, `U`, and `C` for each sample. Should be 107 characters long, and the first 68 bases should correspond to the 68 positions specified in `seq_scored` (note: indexed starting at 0).
* **structure** - (1x107 string in Train and Public Test, 130 in Private Test) An array of `(`, `)`, and `.` characters that describe whether a base is estimated to be paired or unpaired. Paired bases are denoted by opening and closing parentheses e.g. (....) means that base 0 is paired to base 5, and bases 1-4 are unpaired.
* **reactivity** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as `seq_scored`. These numbers are reactivity values for the first 68 bases as denoted in `sequence`, and used to determine the likely secondary structure of the RNA sample.
* **deg_pH10** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as `seq_scored`. These numbers are reactivity values for the first 68 bases as denoted in `sequence`, and used to determine the likelihood of degradation at the base/linkage after incubating without magnesium at high pH (pH 10).
* **deg_Mg_pH10** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as `seq_scored`. These numbers are reactivity values for the first 68 bases as denoted in `sequence`, and used to determine the likelihood of degradation at the base/linkage after incubating with magnesium in high pH (pH 10).
* **deg_50C** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as `seq_scored`. These numbers are reactivity values for the first 68 bases as denoted in `sequence`, and used to determine the likelihood of degradation at the base/linkage after incubating without magnesium at high temperature (50 degrees Celsius).
* **deg_Mg_50C** - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as `seq_scored`. These numbers are reactivity values for the first 68 bases as denoted in `sequence`, and used to determine the likelihood of degradation at the base/linkage after incubating with magnesium at high temperature (50 degrees Celsius).
* **`*_error_*`** - An array of floating point numbers, should have the same length as the corresponding `reactivity` or `deg_*` columns, calculated errors in experimental values obtained in `reactivity` and `deg_*` columns.
* **predicted_loop_type** - (1x107 string) Describes the structural context (also referred to as 'loop type') of each character in `sequence`. Loop types assigned by bpRNA from Vienna RNAfold 2 structure. From the bpRNA_documentation: S: paired "Stem", M: Multiloop, I: Internal loop, B: Bulge, H: Hairpin loop, E: dangling End, X: eXternal loop.
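
**To make the `structure` notation concrete, here is a minimal sketch that recovers which positions are paired from a dot-bracket string. The helper name `paired_positions` is my own and is not part of the competition data or of Xhlulu's notebook:**

```python
def paired_positions(structure):
    """Return a dict mapping each paired index to its partner index."""
    stack = []   # indices of '(' still waiting for a matching ')'
    pairs = {}
    for i, char in enumerate(structure):
        if char == '(':
            stack.append(i)
        elif char == ')':
            j = stack.pop()   # most recent unmatched '('
            pairs[i] = j
            pairs[j] = i
        # '.' marks an unpaired base, so nothing to record
    return pairs

# The example from the description: base 0 pairs with base 5
print(paired_positions('(....)'))  # {5: 0, 0: 5}
```

**Positions missing from the returned dict are the unpaired bases.**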

In [None]:
print(train.columns)

**It seems we also have a `signal_to_noise` and a `SN_filter` column. These columns control the 'quality' of samples, and as such are important training hyperparameters. We will explore them shortly:**

In [None]:
#sneak peek
print(train.shape)
if not train.isnull().values.any(): print('No missing values')
train.head()

In [None]:
#sneak peek
print(test.shape)
if not test.isnull().values.any(): print('No missing values')
test.head()

In [None]:
#sneak peek
print(sample_sub.shape)
if not sample_sub.isnull().values.any(): print('No missing values')
sample_sub.head()

**Now we explore `signal_to_noise` and `SN_filter` distributions. As per the data tab of this competition the samples in `test.json` have been filtered in the following way:**

1. Minimum value across all 5 conditions must be greater than -0.5.
2. Mean signal/noise across all 5 conditions must be greater than 1.0. [Signal/noise is defined as mean( measurement value over 68 nts )/mean( statistical error in measurement value over 68 nts)]
3. To help ensure sequence diversity, the resulting sequences were clustered into clusters with less than 50% sequence similarity, and the 629 test set sequences were chosen from clusters with 3 or fewer members. That is, any sequence in the test set should be sequence similar to at most 2 other sequences.

**But these filters have not been applied to the samples in `train.json` or the private test set of this competition, so we will likely see public leaderboard improvement by filtering `train.json` the same way that `test.json` was filtered.**

**We can do 2) by setting `train['signal_to_noise'] > 1`. Now, I am not sure how `SN_filter` is related to this, so let's explore a bit further:**

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (15, 5))
sns.kdeplot(train['signal_to_noise'], fill = True, ax = ax[0])
sns.countplot(x = train['SN_filter'], ax = ax[1])

ax[0].set_title('Signal/Noise Distribution')
ax[1].set_title('Signal/Noise Filter Distribution');

In [None]:
print(f"Samples with signal_to_noise greater than 1: {len(train.loc[(train['signal_to_noise'] > 1 )])}")
print(f"Samples with SN_filter = 1: {len(train.loc[(train['SN_filter'] == 1 )])}")

**It seems that `SN_filter = 1` filters more than just `signal_to_noise > 1`, so my guess is that `SN_filter = 1` performs step 1) and step 2).**
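
**One way to test that guess is to apply steps 1) and 2) by hand and compare the result with `SN_filter`. The sketch below assumes that the `signal_to_noise` column already holds the mean signal/noise across the 5 conditions, which I have not verified. The function `passes_filters` and the toy samples are hypothetical, not real competition data:**

```python
TARGETS = ['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']

def passes_filters(row, min_value=-0.5, min_snr=1.0):
    """Apply step 1 (minimum value) and step 2 (mean signal/noise) to one sample."""
    lowest = min(min(row[col]) for col in TARGETS)
    return lowest > min_value and row['signal_to_noise'] > min_snr

# Two made-up samples: the second has one value below -0.5
clean = {col: [0.1, 0.4, 0.2] for col in TARGETS}
clean['signal_to_noise'] = 3.2
noisy = {col: [0.1, -0.9, 0.2] for col in TARGETS}
noisy['signal_to_noise'] = 3.2

print(passes_filters(clean), passes_filters(noisy))  # True False
```

**Because the function only indexes by column name, it should also work row by row on the real data, e.g. `train.apply(passes_filters, axis=1)`, and the result can then be compared against `train['SN_filter'] == 1`.**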

# II. Processing

In [None]:
#target columns
target_cols = ['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']

**Now we tokenize our RNA sequences so we can feed them to our model:**

In [None]:
token2int = {x:i for i, x in enumerate('().ACGUBEHIMSX')}

In [None]:
def preprocess_inputs(df, cols=['sequence', 'structure', 'predicted_loop_type']):
    return np.transpose(
        np.array(
            df[cols]
            .applymap(lambda seq: [token2int[x] for x in seq])
            .values
            .tolist()
        ),
        (0, 2, 1)
    )
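
**To see what this encoding produces, here is a small pure-Python sketch of the same idea on a made-up 6-base sample. The sample strings and the `encode` helper are hypothetical; the real function works on whole DataFrames and returns an array of shape `(samples, seq_length, 3)`:**

```python
token2int = {x: i for i, x in enumerate('().ACGUBEHIMSX')}

def encode(seq):
    """Map every character of a string to its integer token."""
    return [token2int[ch] for ch in seq]

# Hypothetical 6-base sample
sequence = 'GGAAAC'
structure = '(....)'
loop_type = 'SHHHHS'

# Encode each of the three strings: 3 lists of length 6
features = [encode(sequence), encode(structure), encode(loop_type)]

# Swap the axes so each position carries its 3 tokens: 6 lists of length 3
per_position = [list(tokens) for tokens in zip(*features)]

print(per_position[0])  # [5, 0, 12]
```

**The first position holds the tokens for `G`, `(` and `S`, which is the same layout the transpose in `preprocess_inputs` gives each sample.**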

**We can filter by `SN_filter == 1` to train on only samples with similar statistics to those in `test.json`. Be careful using this though, as only the public leaderboard test set has filtered samples whereas the private one does not (more [here](https://www.kaggle.com/c/stanford-covid-vaccine/data)). So it might improve your public LB score, but we do not know if it helps (it could even hurt) performance on the private test set. That being said, I am *not* going to train with any sort of sample filters:**

In [None]:
def rmse(y_actual, y_pred):
    mse = tf.keras.losses.mean_squared_error(y_actual, y_pred)
    return K.sqrt(mse)

def mcrmse(y_actual, y_pred, num_scored=len(target_cols)):
    score = 0
    for i in range(num_scored):
        score += rmse(y_actual[:, :, i], y_pred[:, :, i]) / num_scored
    return score
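
**For reference, here is a plain-Python sketch of the columnwise form of MCRMSE: one RMSE per target column, pooled over all samples and positions, then averaged over columns. Note that the Keras loss above takes the RMSE per sample first and lets Keras average over the batch, so the two numbers are close but not guaranteed to be identical. The name `mcrmse_reference` and the toy values are my own:**

```python
import math

def mcrmse_reference(y_true, y_pred):
    """y_true, y_pred: nested lists shaped (samples, positions, targets)."""
    n_targets = len(y_true[0][0])
    rmses = []
    for t in range(n_targets):
        # squared errors for target t over every sample and position
        sq_errs = [
            (true_pos[t] - pred_pos[t]) ** 2
            for true_s, pred_s in zip(y_true, y_pred)
            for true_pos, pred_pos in zip(true_s, pred_s)
        ]
        rmses.append(math.sqrt(sum(sq_errs) / len(sq_errs)))
    return sum(rmses) / n_targets

# One sample, two positions, two targets
y_true = [[[0.0, 0.0], [0.0, 0.0]]]
y_pred = [[[3.0, 1.0], [3.0, -1.0]]]

# Column RMSEs are 3.0 and 1.0, so their mean is 2.0
print(mcrmse_reference(y_true, y_pred))  # 2.0
```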

In [None]:
train_inputs = preprocess_inputs(train)
train_labels = np.array(train[target_cols].values.tolist()).transpose((0, 2, 1))

# III. Model

**The below RNN architecture is adapted from the one and only [Xhlulu](https://www.kaggle.com/xhlulu)'s notebook [here](https://www.kaggle.com/xhlulu/openvaccine-simple-gru-model). For his explanation of the model/procedure, see his discussion post [here](https://www.kaggle.com/c/stanford-covid-vaccine/discussion/182303). I have made minor tweaks to some parameters and added an LSTM to experiment with blending.**

**Note that for submission, the output must be the same length as the input, which is 107 for `train.json` and `test.json` and 130 for the private test set. During training, however, labels exist only for the first 68 positions, so training predictions only need to be 68 long.**

**So we actually build 3 different models: one for training, one for predicting public test, and one for predicting private test set, each with different sequence lengths and prediction lengths. Luckily, we only need to train one model, save its weights, and load these weights into the other models.**
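
**The reason one set of weights fits all three models is that recurrent weights depend on the feature and hidden sizes, not on the sequence length. The toy sketch below uses a plain one-directional tanh RNN as a stand-in for the bidirectional GRU/LSTM layers, so it only illustrates the idea. All names and sizes in it are hypothetical:**

```python
import math
import random

random.seed(34)
FEATURES, HIDDEN = 3, 4

# One shared set of weights; their shapes never mention sequence length
W_in = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(FEATURES)]
W_rec = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]

def run_rnn(sequence, pred_len):
    """Run the toy RNN over a sequence and keep the first pred_len states."""
    state = [0.0] * HIDDEN
    outputs = []
    for x in sequence:
        state = [
            math.tanh(sum(x[i] * W_in[i][h] for i in range(FEATURES))
                      + sum(state[k] * W_rec[k][h] for k in range(HIDDEN)))
            for h in range(HIDDEN)
        ]
        outputs.append(state)
    return outputs[:pred_len]

def make_sequence(length):
    """Random stand-in for one encoded RNA sample."""
    return [[random.random() for _ in range(FEATURES)] for _ in range(length)]

train_out = run_rnn(make_sequence(107), pred_len=68)
public_out = run_rnn(make_sequence(107), pred_len=107)
private_out = run_rnn(make_sequence(130), pred_len=130)
print(len(train_out), len(public_out), len(private_out))  # 68 107 130
```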

**The last thing to set is the size of the embedding layer. In the context of NLP, the input dimension of an embedding layer is the size of the vocabulary, which in our case is `len(token2int)`. The output dimension is typically the length of the pre-trained vectors you are using, like GloVe or Word2Vec vectors. We don't have pre-trained vectors in this case, so we are free to experiment with different sizes. Let's use 100 for now as a starting point.**
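
**As a rough picture of what the embedding does with our 3 tokens per position: each token is looked up in a table of vectors, and the 3 vectors are joined into one feature vector. The sketch below uses a random table and a small `EMBED_DIM` for readability; the table values and the helper `embed_position` are hypothetical stand-ins for the trained layer:**

```python
import random

random.seed(34)
VOCAB_SIZE = 14   # len(token2int)
EMBED_DIM = 4     # small for readability; the model below uses 100

# One vector per token; random here, learned during training in the real model
embedding_table = [[random.gauss(0, 1) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed_position(tokens):
    """Look up each of the 3 tokens at one position and concatenate the vectors."""
    vector = []
    for token in tokens:
        vector.extend(embedding_table[token])
    return vector

position = [5, 0, 12]  # hypothetical tokens: sequence, structure, loop type
print(len(embed_position(position)))  # 12
```

**With an output dimension of 100, the same concatenation gives 300 features per position.**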

In [None]:
len(token2int)

In [None]:
def gru_layer(hidden_dim, dropout):
    return tf.keras.layers.Bidirectional(
                                tf.keras.layers.GRU(hidden_dim,
                                dropout=dropout,
                                return_sequences=True,
                                kernel_initializer='orthogonal'))

def lstm_layer(hidden_dim, dropout):
    return tf.keras.layers.Bidirectional(
                                tf.keras.layers.LSTM(hidden_dim,
                                dropout=dropout,
                                return_sequences=True,
                                kernel_initializer='orthogonal'))

def build_model(gru=False, lstm=False, seq_len=107, pred_len=68, dropout=0.4,
                embed_dim=100, hidden_dim=128, layers=3):
    
    inputs = tf.keras.layers.Input(shape=(seq_len, 3))

    embed = tf.keras.layers.Embedding(input_dim=len(token2int), output_dim=embed_dim)(inputs)
    reshaped = tf.reshape(
        embed, shape=(-1, embed.shape[1],  embed.shape[2] * embed.shape[3]))
    
    hidden = tf.keras.layers.SpatialDropout1D(.2)(reshaped)
    
    if gru:
        for _ in range(layers):
            hidden = gru_layer(hidden_dim, dropout)(hidden)
        
    elif lstm:
        for _ in range(layers):
            hidden = lstm_layer(hidden_dim, dropout)(hidden)
    
    #only making predictions on the first part of each sequence
    truncated = hidden[:, :pred_len]
    
    out = tf.keras.layers.Dense(5, activation='linear')(truncated)

    model = tf.keras.Model(inputs=inputs, outputs=out)

    #some optimizers
    adam = tf.optimizers.Adam()
    radam = tfa.optimizers.RectifiedAdam()
    lookahead = tfa.optimizers.Lookahead(adam, sync_period=6)
    ranger = tfa.optimizers.Lookahead(radam, sync_period=6)
    
    model.compile(optimizer=adam, loss=mcrmse)
    
    return model

# IV. KFold Training and Inference

**Note that I am stratifying by `SN_filter`. I want each fold to train and be evaluated on a similar distribution of high-to-low quality samples.**
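
**To illustrate what stratification buys us, here is a simplified sketch that deals samples into folds label by label. It is not how scikit-learn implements `RepeatedStratifiedKFold` (there is no shuffling or repetition here), and the labels are made up:**

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold so every fold sees a similar label mix."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        # deal the indices of each label round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Hypothetical labels: 8 high-quality (1) and 4 low-quality (0) samples
sn_filter = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
folds = stratified_folds(sn_filter, n_folds=4)
for fold in folds:
    print(Counter(sn_filter[i] for i in fold))
```

**Every fold ends up with two samples labelled 1 and one labelled 0, mirroring the 2:1 ratio of the full set.**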

In [None]:
#basic training configuration
FOLDS = 4
EPOCHS = 90
REPEATS = 1
BATCH_SIZE = 64
VERBOSE = 2
SEED = 34

In [None]:
#get different test sets and process each
public_df = test.query("seq_length == 107").copy()
private_df = test.query("seq_length == 130").copy()

public_inputs = preprocess_inputs(public_df)
private_inputs = preprocess_inputs(private_df)

In [None]:
#list_physical_devices returns a list, which is empty when no GPU is available
if tf.config.list_physical_devices('GPU'):
    print('Training on GPU')

**We add a simple learning rate callback for now:**

In [None]:
lr_callback = tf.keras.callbacks.ReduceLROnPlateau()
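
**With no arguments, `ReduceLROnPlateau` monitors `val_loss` and multiplies the learning rate by 0.1 after 10 epochs without an improvement of at least 1e-4. The sketch below simulates that rule in plain Python on a made-up loss curve. It is a simplification that leaves out `cooldown` and `min_lr`, and `simulate_reduce_lr` is my own name, not a Keras function:**

```python
def simulate_reduce_lr(val_losses, lr=1e-3, factor=0.1, patience=10, min_delta=1e-4):
    """Return the learning rate in effect at each epoch under a reduce-on-plateau rule."""
    best = float('inf')
    wait = 0
    schedule = []
    for loss in val_losses:
        schedule.append(lr)           # rate used during this epoch
        if loss < best - min_delta:   # meaningful improvement
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:      # plateau: shrink the rate and start over
                lr *= factor
                wait = 0
    return schedule

# Hypothetical curve: improves for 5 epochs, then stays flat for 15
losses = [0.50, 0.45, 0.42, 0.40, 0.39] + [0.39] * 15
schedule = simulate_reduce_lr(losses)
print(schedule[14], schedule[15])
```

**On this curve the rate stays at its starting value through the tenth flat epoch and drops by a factor of 10 for the epoch after it.**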

### 1. GRU

In [None]:
gru_histories = []
gru_private_preds = np.zeros((private_df.shape[0], 130, 5))
gru_public_preds = np.zeros((public_df.shape[0], 107, 5))

rskf = RepeatedStratifiedKFold(FOLDS, n_repeats = REPEATS, random_state = SEED)

for f, (train_index, val_index) in enumerate(rskf.split(train_inputs, train['SN_filter'])):

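    #save_best_only keeps the epoch with the lowest val_loss instead of the last one
    sv_gru = tf.keras.callbacks.ModelCheckpoint(f'gru-{f}.h5', save_best_only=True)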

    train_ = train_inputs[train_index]
    train_labs = train_labels[train_index]
    val_ = train_inputs[val_index]
    val_labs = train_labels[val_index]

    gru = build_model(gru=True)
    history = gru.fit(train_, train_labs, 
                      validation_data=(val_,val_labs),
                      batch_size=BATCH_SIZE,
                      epochs=EPOCHS,
                      callbacks=[lr_callback,sv_gru],
                      verbose = VERBOSE)  

    gru_histories.append(history)

    #load best model and predict; each model contributes 1 / (FOLDS * REPEATS)
    gru_short = build_model(gru=True, seq_len=107, pred_len=107)
    gru_short.load_weights(f'gru-{f}.h5')
    gru_public_pred = gru_short.predict(public_inputs) / (FOLDS * REPEATS)

    gru_long = build_model(gru=True, seq_len=130, pred_len=130)
    gru_long.load_weights(f'gru-{f}.h5')
    gru_private_pred = gru_long.predict(private_inputs) / (FOLDS * REPEATS)

    gru_public_preds += gru_public_pred
    gru_private_preds += gru_private_pred

    del gru_short, gru_long
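
**A note on the averaging: repeated k-fold trains `FOLDS * REPEATS` models in total, so for the accumulated sum to be a plain average, each model's prediction has to be scaled by `1 / (FOLDS * REPEATS)`. A tiny sketch with made-up numbers for a single position:**

```python
FOLDS, REPEATS = 4, 2
n_models = FOLDS * REPEATS

# Hypothetical predictions from the 8 models for one position
model_preds = [0.8, 1.0, 1.2, 1.0, 0.9, 1.1, 1.0, 1.0]

running = 0.0
for pred in model_preds:
    running += pred / n_models  # each model contributes an equal share

print(round(running, 6))  # 1.0
```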

In [None]:
print(f"GRU mean fold validation MCRMSE: {np.mean([min(history.history['val_loss']) for history in gru_histories])}")

### 2. LSTM

In [None]:
lstm_histories = []
lstm_private_preds = np.zeros((private_df.shape[0], 130, 5))
lstm_public_preds = np.zeros((public_df.shape[0], 107, 5))

rskf = RepeatedStratifiedKFold(FOLDS, n_repeats = REPEATS, random_state = SEED)

for f, (train_index, val_index) in enumerate(rskf.split(train_inputs, train['SN_filter'])):

    #save_best_only keeps the epoch with the lowest val_loss instead of the last one
    sv_lstm = tf.keras.callbacks.ModelCheckpoint(f'lstm-{f}.h5', save_best_only=True)

    train_ = train_inputs[train_index]
    train_labs = train_labels[train_index]
    val_ = train_inputs[val_index]
    val_labs = train_labels[val_index]

    lstm = build_model(lstm=True)
    history = lstm.fit(
                        train_, train_labs, 
                        validation_data=(val_,val_labs),
                        batch_size=BATCH_SIZE,
                        epochs=EPOCHS,
                        callbacks=[lr_callback,sv_lstm],
                        verbose = VERBOSE)

    lstm_histories.append(history)

    #load best model and predict; each model contributes 1 / (FOLDS * REPEATS)
    lstm_short = build_model(lstm=True, seq_len=107, pred_len=107)
    lstm_short.load_weights(f'lstm-{f}.h5')
    lstm_public_pred = lstm_short.predict(public_inputs) / (FOLDS * REPEATS)

    lstm_long = build_model(lstm=True, seq_len=130, pred_len=130)
    lstm_long.load_weights(f'lstm-{f}.h5')
    lstm_private_pred = lstm_long.predict(private_inputs) / (FOLDS * REPEATS)

    lstm_public_preds += lstm_public_pred
    lstm_private_preds += lstm_private_pred

    del lstm_short, lstm_long

In [None]:
print(f"LSTM mean fold validation MCRMSE: {np.mean([min(history.history['val_loss']) for history in lstm_histories])}")

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (20, 10))

for history in gru_histories:
    ax[0].plot(history.history['loss'], color='C0')
    ax[0].plot(history.history['val_loss'], color='C1')
for history in lstm_histories:
    ax[1].plot(history.history['loss'], color='C0')
    ax[1].plot(history.history['val_loss'], color='C1')

ax[0].set_title('GRU')
ax[1].set_title('LSTM')

ax[0].legend(['train', 'validation'], loc = 'upper right')
ax[1].legend(['train', 'validation'], loc = 'upper right')

ax[0].set_ylabel('MCRMSE')
ax[0].set_xlabel('Epoch')
ax[1].set_ylabel('MCRMSE')
ax[1].set_xlabel('Epoch');

# V. Submission

**Again, thank you to [Xhlulu](https://www.kaggle.com/xhlulu), whose notebook [here](https://www.kaggle.com/xhlulu/openvaccine-simple-gru-model) provided this submission-formatting code.**

In [None]:
public_df = test.query("seq_length == 107").copy()
private_df = test.query("seq_length == 130").copy()

public_inputs = preprocess_inputs(public_df)
private_inputs = preprocess_inputs(private_df)

**Now we just need to change the shape of each sample to long format:**

In [None]:
preds_gru = []

for df, preds in [(public_df, gru_public_preds), (private_df, gru_private_preds)]:
    for i, uid in enumerate(df.id):
        single_pred = preds[i]

        single_df = pd.DataFrame(single_pred, columns=target_cols)
        single_df['id_seqpos'] = [f'{uid}_{x}' for x in range(single_df.shape[0])]

        preds_gru.append(single_df)

preds_gru_df = pd.concat(preds_gru)
preds_gru_df.head()
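
**For a picture of what "long format" means here, the sketch below turns one made-up prediction into one row per position, keyed by `id_seqpos`. The id `id_example`, the values, and the helper `to_long_rows` are hypothetical:**

```python
target_cols = ['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']

def to_long_rows(uid, sample_pred):
    """Turn one (positions, 5) prediction into one dict per position."""
    rows = []
    for pos, values in enumerate(sample_pred):
        row = {'id_seqpos': f'{uid}_{pos}'}
        row.update(zip(target_cols, values))
        rows.append(row)
    return rows

# Hypothetical id and a 2-position prediction
rows = to_long_rows('id_example', [[0.1, 0.2, 0.3, 0.4, 0.5],
                                   [0.6, 0.7, 0.8, 0.9, 1.0]])
print(rows[1]['id_seqpos'], rows[1]['deg_Mg_50C'])  # id_example_1 0.9
```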

**Now we do the same for the LSTM model so we can blend their predictions:**

In [None]:
preds_lstm = []

for df, preds in [(public_df, lstm_public_preds), (private_df, lstm_private_preds)]:
    for i, uid in enumerate(df.id):
        single_pred = preds[i]

        single_df = pd.DataFrame(single_pred, columns=target_cols)
        single_df['id_seqpos'] = [f'{uid}_{x}' for x in range(single_df.shape[0])]

        preds_lstm.append(single_df)

preds_lstm_df = pd.concat(preds_lstm)
preds_lstm_df.head()

**And blend; for now, we stick with 50/50:**

In [None]:
blend_preds_df = pd.DataFrame()
blend_preds_df['id_seqpos'] = preds_gru_df['id_seqpos']
#equal-weight blend of every target column
for col in target_cols:
    blend_preds_df[col] = .5*preds_gru_df[col] + .5*preds_lstm_df[col]

In [None]:
submission = sample_sub[['id_seqpos']].merge(blend_preds_df, on=['id_seqpos'])

#sanity check
submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)
print('Submission saved')