OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction

Description

Winning the fight against the COVID-19 pandemic will require an effective vaccine that can be equitably and widely distributed. Building upon decades of research has allowed scientists to accelerate the search for a vaccine against COVID-19, but every day that goes by without a vaccine has enormous costs for the world nonetheless. We need new, fresh ideas from all corners of the world. Could online gaming and crowdsourcing help solve a worldwide pandemic? Pairing scientific and crowdsourced intelligence could help computational biochemists make measurable progress.

mRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but currently, they face key potential limitations. One of the biggest challenges right now is how to design super stable messenger RNA molecules (mRNA). Conventional vaccines (like your seasonal flu shots) are packaged in disposable syringes and shipped under refrigeration around the world, but that is not currently possible for mRNA vaccines.

Researchers have observed that RNA molecules tend to degrade spontaneously. This is a serious limitation: a single cut can render the mRNA vaccine useless. Currently, little is known about where in the backbone of a given RNA degradation is most likely to occur. Without this knowledge, current mRNA vaccines against COVID-19 must be prepared and shipped under intense refrigeration, and are unlikely to reach more than a tiny fraction of human beings on the planet unless they can be stabilized.

The Eterna community, led by Professor Rhiju Das, a computational biochemist at Stanford’s School of Medicine, brings together scientists and gamers to solve puzzles and invent medicine. Eterna is an online video game platform that challenges players to solve scientific problems such as mRNA design through puzzles. The solutions are synthesized and experimentally tested at Stanford by researchers to gain new insights about RNA molecules. The Eterna community has previously unlocked new scientific principles, made new diagnostics against deadly diseases, and engaged the world’s most potent intellectual resources for the betterment of the public. The Eterna community has advanced biotechnology through its contribution in over 20 publications, including advances in RNA biotechnology.

In this competition, we are looking to leverage the data science expertise of the Kaggle community to develop models and design rules for RNA degradation. Your model will predict likely degradation rates at each base of an RNA molecule, trained on a subset of an Eterna dataset comprising over 3000 RNA molecules (which span a panoply of sequences and structures) and their degradation rates at each position. We will then score your models on a second generation of RNA sequences that have just been devised by Eterna players for COVID-19 mRNA vaccines. These final test sequences are currently being synthesized and experimentally characterized at Stanford University in parallel to your modeling efforts -- Nature will score your models!

Improving the stability of mRNA vaccines was a problem that was being explored before the pandemic but was expected to take many years to solve. Now, we must solve this deep scientific challenge in months, if not weeks, to accelerate mRNA vaccine research and deliver a refrigerator-stable vaccine against SARS-CoV-2, the virus behind COVID-19. The problem we are trying to solve has eluded academic labs, industry R&D groups, and supercomputers, and so we are turning to you. To help, you can join the team of video game players, scientists, and developers at Eterna to unlock the key in our fight against this devastating pandemic.

Data Description

In this competition, you will be predicting the degradation rates at various locations along an RNA sequence.

There are multiple ground truth values provided in the training data. While the submission format requires all 5 to be predicted, only the following are scored: reactivity, deg_Mg_pH10, and deg_Mg_50C.
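
As a rough illustration of scoring only those three columns, here is a minimal sketch of a mean column-wise RMSE (MCRMSE). It assumes the leaderboard metric matches the MCRMSE loss defined later in this notebook; the dict-of-lists layout and the `mcrmse` helper are hypothetical and chosen only for this example.

```python
import math

# Columns that are scored, per the description above
SCORED = ["reactivity", "deg_Mg_pH10", "deg_Mg_50C"]

def mcrmse(y_true, y_pred, columns=SCORED):
    """Mean of the per-column RMSE values.

    y_true and y_pred map a column name to a flat list of per-position
    values (a hypothetical layout used only in this sketch).
    """
    rmses = []
    for col in columns:
        squared = [(t - p) ** 2 for t, p in zip(y_true[col], y_pred[col])]
        rmses.append(math.sqrt(sum(squared) / len(squared)))
    return sum(rmses) / len(rmses)

# Toy values: only the reactivity column contains an error
truth = {"reactivity": [0.0, 1.0], "deg_Mg_pH10": [1.0, 1.0], "deg_Mg_50C": [0.5, 0.5]}
preds = {"reactivity": [0.0, 0.0], "deg_Mg_pH10": [1.0, 1.0], "deg_Mg_50C": [0.5, 0.5]}
score = mcrmse(truth, preds)
print(score)
```

Because each column's RMSE is taken before averaging, a column with a larger value range contributes more to the final score than it would after per-column normalization.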

Files
train.json - the training data
test.json - the test set, without any columns associated with the ground truth.
sample_submission.csv - a sample submission file in the correct format

Columns

id - An arbitrary identifier for each sample.

seq_scored - (68 in Train and Public Test, 91 in Private Test) Integer value denoting the number of positions used in scoring with predicted values. This should match the length of reactivity, deg_* and *_error_* columns. Note that molecules used for the Private Test will be longer than those in the Train and Public Test data, so the size of this vector will be different.

seq_length - (107 in Train and Public Test, 130 in Private Test) Integer values, denotes the length of sequence. Note that molecules used for the Private Test will be longer than those in the Train and Public Test data, so the size of this vector will be different.

sequence - (1x107 string in Train and Public Test, 130 in Private Test) Describes the RNA sequence, a combination of A, G, U, and C for each sample. Should be 107 characters long, and the first 68 bases should correspond to the 68 positions specified in seq_scored (note: indexed starting at 0).

structure - (1x107 string in Train and Public Test, 130 in Private Test) An array of (, ), and . characters that describe whether a base is estimated to be paired or unpaired. Paired bases are denoted by opening and closing parentheses e.g. (....) means that base 0 is paired to base 5, and bases 1-4 are unpaired.
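
The pairing notation can be decoded with a stack: each `(` is pushed, and each `)` is paired with the most recently opened position. The sketch below is an illustration only; `parse_dot_bracket` is a hypothetical helper name, not part of the competition data or tooling.

```python
def parse_dot_bracket(structure):
    """Return sorted (open_index, close_index) pairs, 0-indexed."""
    stack = []
    pairs = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            # The closing bracket pairs with the most recent unmatched '('
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)

example_pairs = parse_dot_bracket("(....)")
nested_pairs = parse_dot_bracket("((..))")
print(example_pairs, nested_pairs)
```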

reactivity - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likely secondary structure of the RNA sample.

deg_pH10 - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likelihood of degradation at the base/linkage after incubating without magnesium at high pH (pH 10).

deg_Mg_pH10 - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likelihood of degradation at the base/linkage after incubating with magnesium in high pH (pH 10).

deg_50C - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likelihood of degradation at the base/linkage after incubating without magnesium at high temperature (50 degrees Celsius).

deg_Mg_50C - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating point numbers, should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and used to determine the likelihood of degradation at the base/linkage after incubating with magnesium at high temperature (50 degrees Celsius).

*_error_* - An array of floating point numbers, should have the same length as the corresponding reactivity or deg_* columns, calculated errors in experimental values obtained in reactivity and deg_* columns.

predicted_loop_type - (1x107 string) Describes the structural context (also referred to as 'loop type') of each character in sequence. Loop types assigned by bpRNA from Vienna RNAfold 2 structure. From the bpRNA_documentation: S: paired "Stem", M: Multiloop, I: Internal loop, B: Bulge, H: Hairpin loop, E: dangling End, X: eXternal loop.
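
To make the loop-type codes concrete, the following sketch computes the fraction of positions assigned to each code for one molecule. The input string is made up for illustration, and `loop_fractions` is a hypothetical helper.

```python
from collections import Counter

# Code-to-name mapping taken from the legend above
LOOP_NAMES = {
    "S": "stem", "M": "multiloop", "I": "internal loop", "B": "bulge",
    "H": "hairpin loop", "E": "dangling end", "X": "external loop",
}

def loop_fractions(loop_types):
    """Fraction of positions carrying each loop-type code."""
    counts = Counter(loop_types)  # missing codes count as 0
    n = len(loop_types)
    return {code: counts[code] / n for code in LOOP_NAMES}

# Made-up 12-character example: a hairpin closed by a stem with dangling ends
fractions = loop_fractions("EESSSHHHSSSE")
print(fractions)
```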

Additional Notes
At the beginning of the competition, Stanford scientists have data on 3029 RNA sequences of length 107. For technical reasons, measurements cannot be carried out on the final bases of these RNA sequences, so we have experimental data (ground truth) in 5 conditions for the first 68 bases.

We have split out 629 of these 3029 sequences for a public test set to allow for continuous evaluation through the competition, on the public leaderboard. These sequences, in test.json, have been additionally filtered based on three criteria detailed below to ensure that this subset is not dominated by any large cluster of RNA molecules with poor data, which might bias the public leaderboard. The remaining 2400 sequences for which we have data are in train.json.

For our final and most important scoring (the Private Leaderboard), Stanford scientists are carrying out measurements on 3005 new RNAs, which have somewhat longer lengths of 130 bases. For these data, we expect to have measurements for the first 91 bases, again missing the ends of the RNA. These sequences constitute another 3005 of the 3634 sequences in test.json.

For those interested in how the 629 107-base sequences in test.json were filtered, here are the steps used to ensure a diverse and high-quality test set for public leaderboard scoring:

Minimum value across all 5 conditions must be greater than -0.5.

Mean signal/noise across all 5 conditions must be greater than 1.0. [Signal/noise is defined as mean( measurement value over 68 nts )/mean( statistical error in measurement value over 68 nts)]

To help ensure sequence diversity, the resulting sequences were clustered into clusters with less than 50% sequence similarity, and the 629 test set sequences were chosen from clusters with 3 or fewer members. That is, any sequence in the test set should be sequence similar to at most 2 other sequences.

Note that these filters have not been applied to the 2400 RNAs in the public training data train.json: some of those measurements have negative values or poor signal-to-noise, and some RNA sequences have near-identical sequences in that set. But we are providing all those data in case competitors can squeeze out more signal.

The three filters noted above will also not be applied to Private Test on 3005 sequences.
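
The first two filters can be sketched as follows, using the signal/noise definition given above (mean measurement divided by mean statistical error). This is an illustration under assumptions: the dict layout, the `passes_filters` name, and the toy values are hypothetical, and the clustering filter is not shown.

```python
def mean(values):
    return sum(values) / len(values)

def passes_filters(measurements, errors, min_value=-0.5, min_snr=1.0):
    """Apply the minimum-value and mean signal/noise criteria.

    measurements and errors map a condition name to per-position values.
    Assumes every condition has a non-zero mean error.
    """
    lowest = min(min(vals) for vals in measurements.values())
    snr = [mean(measurements[c]) / mean(errors[c]) for c in measurements]
    return lowest > min_value and mean(snr) > min_snr

# Toy data with two conditions instead of five
clean = {"reactivity": [0.5, 0.7], "deg_pH10": [0.4, 0.6]}
negative = {"reactivity": [-0.6, 0.7], "deg_pH10": [0.4, 0.6]}
small_err = {"reactivity": [0.1, 0.1], "deg_pH10": [0.2, 0.2]}
large_err = {"reactivity": [1.0, 1.0], "deg_pH10": [1.0, 1.0]}

keep_clean = passes_filters(clean, small_err)
keep_negative = passes_filters(negative, small_err)
keep_noisy = passes_filters(clean, large_err)
print(keep_clean, keep_negative, keep_noisy)
```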

Notebooks used for reference (Thank you to the creators of these wonderful notebooks!)

https://www.kaggle.com/tuckerarrants/openvaccine-gru-lstm <br>
https://www.kaggle.com/isaienkov/openvaccine-eda-feature-engineering-modeling <br>
https://www.kaggle.com/mrkmakr/covid-ae-pretrain-gnn-attn-cnn

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import json
import tensorflow as tf
from matplotlib import pyplot as plt

In [None]:
os.chdir('/kaggle/')
os.getcwd()

**Load Input Files**
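
The `lines = True` argument used in the next cell tells pandas that each line of the file is a separate JSON object. A minimal standard-library sketch of that layout is shown here; the two records and their ids are made up for illustration.

```python
import json

# Two invented records in the one-object-per-line layout
sample = "\n".join([
    json.dumps({"id": "id_example_a", "sequence": "GGAAA", "seq_length": 5}),
    json.dumps({"id": "id_example_b", "sequence": "GGAAAUC", "seq_length": 7}),
])

# Parse each non-empty line independently
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

# Simple consistency check between the two fields
lengths_ok = all(len(r["sequence"]) == r["seq_length"] for r in records)
print(len(records), lengths_ok)
```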

In [None]:
train_data = pd.read_json('/kaggle/input/stanford-covid-vaccine/train.json', lines = True)
test_data = pd.read_json('/kaggle/input/stanford-covid-vaccine/test.json', lines = True)
submission_format = pd.read_csv('/kaggle/input/stanford-covid-vaccine/sample_submission.csv', encoding = 'utf-8-sig')

**EDA**

In [None]:
train_data.head()

In [None]:
train_data.shape

In [None]:
train_data.groupby(['SN_filter']).size()

In [None]:
test_data.head()

In [None]:
test_data.shape

In [None]:
submission_format.head()

In [None]:
print(train_data.shape)
print(test_data.shape)
print(submission_format.shape)

In [None]:
print('Training data:\n',train_data['seq_scored'].value_counts())
print('Test data:\n',test_data['seq_scored'].value_counts())
len(train_data['reactivity'].iloc[0])

In [None]:
len(train_data['sequence'].iloc[0])

In [None]:
#Checking if error is negative in any cell
flag = False
for i in range(0,len(train_data)):
    if(([x<0 for x in train_data['reactivity_error'].iloc[i]].count(True) > 0) |
       ([x<0 for x in train_data['deg_error_Mg_pH10'].iloc[i]].count(True) > 0) |
       ([x<0 for x in train_data['deg_error_pH10'].iloc[i]].count(True) > 0) |
       ([x<0 for x in train_data['deg_error_Mg_50C'].iloc[i]].count(True) > 0) |
       ([x<0 for x in train_data['deg_error_50C'].iloc[i]].count(True) > 0)):
        flag = True
print(flag)

In [None]:
#Checking if result is negative in any cell
# Take the minimum over every position of every molecule, one variable per target column
min_reactivity_value = min(min(x) for x in train_data['reactivity'])
min_deg_Mg_pH10_value = min(min(x) for x in train_data['deg_Mg_pH10'])
min_deg_pH10_value = min(min(x) for x in train_data['deg_pH10'])
min_deg_Mg_50C_value = min(min(x) for x in train_data['deg_Mg_50C'])
min_deg_50C_value = min(min(x) for x in train_data['deg_50C'])

print(min_reactivity_value, min_deg_Mg_pH10_value, min_deg_pH10_value, min_deg_Mg_50C_value, min_deg_50C_value)

In [None]:
train_data.head()

In [None]:
# # subtracting errors from target cols
# for i in range(0,len(train_data)):
#     num_time_steps = len(train_data['reactivity'].iloc[i])
#     for j in range(num_time_steps):
#         train_data['reactivity'][i][j] =  train_data['reactivity'][i][j] - train_data['reactivity_error'][i][j]
#         train_data['deg_Mg_pH10'][i][j] =  train_data['deg_Mg_pH10'][i][j] - train_data['deg_error_Mg_pH10'][i][j]
#         train_data['deg_pH10'][i][j] =  train_data['deg_pH10'][i][j] - train_data['deg_error_pH10'][i][j]
#         train_data['deg_Mg_50C'][i][j] =  train_data['deg_Mg_50C'][i][j] - train_data['deg_error_Mg_50C'][i][j]
#         train_data['deg_50C'][i][j] =  train_data['deg_50C'][i][j] - train_data['deg_error_50C'][i][j]

**Build Model**

#### Train data
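
The embedding layer needs integer inputs, so every character of the sequence, structure, and loop-type strings is mapped to an index. The sketch below uses the same 14-character vocabulary as the `token2int` cell that follows; `vocab` and `encode` are illustrative names, not part of the notebook.

```python
# Structure symbols, then bases, then loop-type codes
TOKENS = "().ACGUBEHIMSX"
vocab = {ch: i for i, ch in enumerate(TOKENS)}

def encode(text):
    """Map each character to its integer index in the shared vocabulary."""
    return [vocab[ch] for ch in text]

encoded_sequence = encode("GGAAA")
encoded_structure = encode("(...)")
print(encoded_sequence, encoded_structure)
```

A single shared vocabulary lets one embedding table serve all three string columns, at the cost of mixing unrelated symbol types in the same index space.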

In [None]:
train_data.columns

In [None]:
# For embedding layer
token2int = {x:i for i, x in enumerate('().ACGUBEHIMSX')}
target_cols = ['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']

In [None]:
token2int

In [None]:
def read_bpps_sum(df):
    bpps_arr = []
    for mol_id in df.id.to_list():
        bpps_arr.append(np.load(f"../input/stanford-covid-vaccine/bpps/{mol_id}.npy").sum(axis=1))
    return bpps_arr

def read_bpps_max(df):
    bpps_arr = []
    for mol_id in df.id.to_list():
        bpps_arr.append(np.load(f"../input/stanford-covid-vaccine/bpps/{mol_id}.npy").max(axis=1))
    return bpps_arr

def read_bpps_nb(df):
    #mean and std from https://www.kaggle.com/symyksr/openvaccine-deepergcn 
    bpps_nb_mean = 0.077522
    bpps_nb_std = 0.08914
    bpps_arr = []
    for mol_id in df.id.to_list():
        bpps = np.load(f"../input/stanford-covid-vaccine/bpps/{mol_id}.npy")
        bpps_nb = (bpps > 0).sum(axis=0) / bpps.shape[0]
        bpps_nb = (bpps_nb - bpps_nb_mean) / bpps_nb_std
        bpps_arr.append(bpps_nb)
    return bpps_arr 

os.chdir("/kaggle/working/")
train_data['bpps_sum'] = read_bpps_sum(train_data)
test_data['bpps_sum'] = read_bpps_sum(test_data)
train_data['bpps_max'] = read_bpps_max(train_data)
test_data['bpps_max'] = read_bpps_max(test_data)
train_data['bpps_nb'] = read_bpps_nb(train_data)
test_data['bpps_nb'] = read_bpps_nb(test_data)

#sanity check
train_data.head()

In [None]:
import plotly.express as px
from collections import Counter as count

def get_bases(data):
    bases = []

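    for j in range(len(data)):
        sequence = data.iloc[j]['sequence']
        counts = count(sequence)  # Counter returns 0 for bases that are absent
        n_bases = len(sequence)   # 107 or 130, depending on the dataset
        bases.append((
            counts['A'] / n_bases,
            counts['G'] / n_bases,
            counts['C'] / n_bases,
            counts['U'] / n_bases
        ))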

    bases = pd.DataFrame(bases, columns=['A_percent', 'G_percent', 'C_percent', 'U_percent'])
    return bases

In [None]:
def get_pairs_rate(data):
    pairs_rate = []

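    for j in range(len(data)):
        structure = data.iloc[j]['structure']
        res = count(structure)  # Counter returns 0 when no '(' is present
        # Normalize by the maximum possible number of pairs (half the sequence length)
        pairs_rate.append(res['('] / (len(structure) / 2))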

    pairs_rate = pd.DataFrame(pairs_rate, columns=['pairs_rate'])
    return pairs_rate

In [None]:
def get_pairs(data):
    pairs = []
    all_partners = []
    for j in range(len(data)):
        partners = [-1 for i in range(130)]
        pairs_dict = {}
        queue = []
        for i in range(0, len(data.iloc[j]['structure'])):
            if data.iloc[j]['structure'][i] == '(':
                queue.append(i)
            if data.iloc[j]['structure'][i] == ')':
                first = queue.pop()
                try:
                    pairs_dict[(data.iloc[j]['sequence'][first], data.iloc[j]['sequence'][i])] += 1
                except:
                    pairs_dict[(data.iloc[j]['sequence'][first], data.iloc[j]['sequence'][i])] = 1

                partners[first] = i
                partners[i] = first

        all_partners.append(partners)

        pairs_num = 0
        pairs_unique = [('U', 'G'), ('C', 'G'), ('U', 'A'), ('G', 'C'), ('A', 'U'), ('G', 'U')]
        for item in pairs_dict:
            pairs_num += pairs_dict[item]
        add_tuple = list()
        for item in pairs_unique:
            try:
                add_tuple.append(pairs_dict[item]/pairs_num)
            except:
                add_tuple.append(0)
        pairs.append(add_tuple)

    pairs = pd.DataFrame(pairs, columns=['U-G', 'C-G', 'U-A', 'G-C', 'A-U', 'G-U'])
    return pairs

In [None]:
def get_loops(data):
    loops = []
    available = ['E', 'S', 'H', 'B', 'X', 'I', 'M']
    for j in range(len(data)):
        loop_types = data.iloc[j]['predicted_loop_type']
        counts = count(loop_types)
        # Counter returns 0 for loop types that do not occur in this molecule
        row = [counts[item] / len(loop_types) for item in available]
        loops.append(row)

    loops = pd.DataFrame(loops, columns=available)
    return loops

In [None]:
from tqdm.notebook import tqdm

def get_structure_adj(train):
    ## get the adjacency matrix from the structure string
    
    ## here I calculate one adjacency matrix per base-pair type,
    ## but eventually ignore the pair type and sum them into a single matrix
    Ss = []
    for i in tqdm(range(len(train))):
        seq_length = train["seq_length"].iloc[i]
        structure = train["structure"].iloc[i]
        sequence = train["sequence"].iloc[i]

        cue = []
        a_structures = {
            ("A", "U") : np.zeros([seq_length, seq_length]),
            ("C", "G") : np.zeros([seq_length, seq_length]),
            ("U", "G") : np.zeros([seq_length, seq_length]),
            ("U", "A") : np.zeros([seq_length, seq_length]),
            ("G", "C") : np.zeros([seq_length, seq_length]),
            ("G", "U") : np.zeros([seq_length, seq_length]),
        }
        a_structure = np.zeros([seq_length, seq_length])
        for j in range(seq_length):
            if structure[j] == "(":
                cue.append(j)
            elif structure[j] == ")":
                start = cue.pop()
#                 a_structure[start, i] = 1
#                 a_structure[i, start] = 1
                a_structures[(sequence[start], sequence[j])][start, j] = 1
                a_structures[(sequence[j], sequence[start])][j, start] = 1
        
        a_strc = np.stack([a for a in a_structures.values()], axis = 2)
        a_strc = np.sum(a_strc, axis = 2, keepdims = True)
        Ss.append(a_strc)
    
    Ss = np.array(Ss)
    print(Ss.shape)
    return Ss

In [None]:
As = []
data = train_data[train_data['signal_to_noise'] > 1].copy()
for id in tqdm(data['id']):
    a = np.load(f"/kaggle/input/stanford-covid-vaccine/bpps/{id}.npy")
    As.append(a)
As = np.array(As)

In [None]:
def get_distance_matrix(As):
    ## adjacency matrix based on distance along the sequence
    ## D[i, j] = 1 / (abs(i - j) + 1) ** pow, pow = 1, 2, 4
    
    idx = np.arange(As.shape[1])
    Ds = []
    for i in range(len(idx)):
        d = np.abs(idx[i] - idx)
        Ds.append(d)

    Ds = np.array(Ds) + 1
    Ds = 1/Ds
    Ds = Ds[None, :,:]
    Ds = np.repeat(Ds, len(As), axis = 0)
    
    Dss = []
    for i in [1, 2, 4]: 
        Dss.append(Ds ** i)
    Ds = np.stack(Dss, axis = 3)
    print(Ds.shape)
    return Ds

In [None]:
def preprocess_inputs(df, cols=['sequence', 'structure', 'predicted_loop_type'], seq_length = 107, flag = 'train'):
    base_fea = np.transpose(
        np.array(
            df[cols]
            .applymap(lambda seq: [token2int[x] for x in seq])
            .values
            .tolist()
        ),
        (0, 2, 1)
    )

    bpps_sum_fea = np.array(df['bpps_sum'].to_list())[:,:,np.newaxis]
    bpps_max_fea = np.array(df['bpps_max'].to_list())[:,:,np.newaxis]
    bpps_nb_fea = np.array(df['bpps_nb'].to_list())[:,:,np.newaxis]

    Ss = get_structure_adj(df)
    Ss = Ss.sum(axis = 1)
    
    if flag == 'train':
        Ds = get_distance_matrix(As)
    elif flag == 'test_private':
        Ds = get_distance_matrix(As_private)
    elif flag == 'test_public':
        Ds = get_distance_matrix(As_public)
    Ds = Ds.sum(axis = 1)
    
    data = np.concatenate([base_fea,bpps_sum_fea,bpps_max_fea,bpps_nb_fea, Ss, Ds], 2)
    
    array_data = (np.reshape(([([list(df['A_percent'])[0]] * seq_length)]),(1,seq_length,1)))
    for i in range(1,len(df)):
        array_data_i = (np.reshape(([([list(df['A_percent'])[i]] * seq_length)]),(1,seq_length,1)))
        array_data = np.concatenate([array_data,array_data_i], axis = 0)

    data = np.concatenate([data, array_data], 2)

    for col in ['G_percent','C_percent','U_percent','U-G','C-G','U-A','G-C','A-U','G-U',
                'E','S','H','B','X','I','M','pairs_rate']:
        array_data = (np.reshape(([([list(df[col])[0]] * seq_length)]),(1,seq_length,1)))
        for i in range(1,len(df)):
            array_data_i = (np.reshape(([([list(df[col])[i]] * seq_length)]),(1,seq_length,1)))
            array_data = np.concatenate([array_data,array_data_i], axis = 0)

        data = np.concatenate([data, array_data], 2)

    return data

In [None]:
bases = get_bases(train_data)
pairs = get_pairs(train_data)
loops = get_loops(train_data)
pairs_rate = get_pairs_rate(train_data)
train_data = pd.concat([train_data, bases, pairs, loops, pairs_rate], axis=1)

bases = get_bases(test_data)
pairs = get_pairs(test_data)
loops = get_loops(test_data)
pairs_rate = get_pairs_rate(test_data)
test_data = pd.concat([test_data, bases, pairs, loops, pairs_rate], axis=1)

In [None]:
train_inputs = preprocess_inputs(train_data.loc[train_data['signal_to_noise'] > 1], seq_length = 107, flag = 'train')
train_labels = np.array(train_data.loc[train_data['signal_to_noise'] > 1][target_cols].values.tolist()).transpose((0, 2, 1))

In [None]:
train_inputs.shape

In [None]:
from keras.losses import mean_squared_error

def root_mean_squared_error(y_true, y_pred):
    return tf.sqrt(mean_squared_error(y_true, y_pred))

def MCRMSE(y_true, y_pred):
    colwise_mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=1)
    return tf.reduce_mean(tf.sqrt(colwise_mse), axis=1)

def lstm_layer(hidden_dim, dropout):
    return tf.keras.layers.Bidirectional(
                                tf.keras.layers.LSTM(hidden_dim,
                                dropout=dropout,
                                return_sequences=True,
                                kernel_initializer = 'orthogonal'))

def gru_layer(hidden_dim, dropout):
    return tf.keras.layers.Bidirectional(
                                tf.keras.layers.GRU(hidden_dim,
                                dropout=dropout,
                                return_sequences=True,
                                kernel_initializer='orthogonal'))

def build_model(n_layers = 2, seq_len = 107, num_features = 28, embed_dim = 200, sp_dropout = 0.2, hidden_dim = 512, dropout = 0.5, pred_len = 68, gru_flag = False):
    
    inputs = tf.keras.layers.Input(shape=(seq_len, num_features))
    categorical_feats = inputs[:, :, :3]
    numerical_feats = inputs[:, :, 3:10]
    overall_gene_feats = inputs[:, :, 10:]

    embed = tf.keras.layers.Embedding(input_dim=len(token2int), output_dim=embed_dim)(categorical_feats)
   
    reshaped = tf.reshape(embed, shape=(-1, embed.shape[1],  embed.shape[2] * embed.shape[3]))
    
    reshaped_1 = tf.keras.layers.concatenate([reshaped, numerical_feats], axis=2)
          
    normalized_layer_1 = tf.keras.layers.BatchNormalization()(reshaped_1)
    
    spatial_dropout = tf.keras.layers.SpatialDropout1D(sp_dropout)(normalized_layer_1)
  
    # Stack the recurrent layers on top of the spatial dropout output
    recurrent_layer = gru_layer if gru_flag else lstm_layer
    hidden = spatial_dropout
    for _ in range(n_layers):
        hidden = recurrent_layer(hidden_dim, dropout)(hidden)
    normalized_layer_2 = tf.keras.layers.BatchNormalization()(hidden)
    
    concat_layer = tf.keras.layers.concatenate([normalized_layer_2, overall_gene_feats], axis=2)
    
    dense_layer_1 = tf.keras.layers.Dense(100, activation = 'linear')(concat_layer)
    normalized_layer_3 = tf.keras.layers.BatchNormalization()(dense_layer_1)
    dropout_layer_1 = tf.keras.layers.SpatialDropout1D(sp_dropout)(normalized_layer_3)
   
    dense_layer_2 = tf.keras.layers.Dense(100, activation = 'linear')(dropout_layer_1)
    normalized_layer_4 = tf.keras.layers.BatchNormalization()(dense_layer_2)
    dropout_layer_2 = tf.keras.layers.SpatialDropout1D(sp_dropout)(normalized_layer_4)
    
    #only making predictions on the first part of each sequence
    truncated = dropout_layer_2[:, :pred_len]

    out = tf.keras.layers.Dense(5, activation='linear')(truncated)

    model = tf.keras.Model(inputs=inputs, outputs=out)

    #some optimizers
    adam = tf.optimizers.Adam()

    model.compile(optimizer = adam, loss = MCRMSE)
    
    return model

In [None]:
train_inputs.shape

In [None]:
# EPOCHS = 60
# BATCH_SIZE = 32

# model_GRU_on_train_data = build_model(gru_flag = True)
# model_GRU_on_train_data.summary()
# model_GRU_callback = tf.keras.callbacks.ModelCheckpoint(f'GRU model.h5')

# history_GRU = model_GRU_on_train_data.fit(train_inputs, train_labels,
#                   batch_size=BATCH_SIZE,
#                   epochs=EPOCHS,
#                   verbose = 2,
#                   callbacks=[model_GRU_callback])  

In [None]:
# EPOCHS = 60
# BATCH_SIZE = 32

# model_LSTM_on_train_data = build_model(gru_flag = False)
# model_LSTM_on_train_data.summary()
# model_LSTM_callback = tf.keras.callbacks.ModelCheckpoint(f'LSTM model.h5')

# history_LSTM = model_LSTM_on_train_data.fit(train_inputs, train_labels,
#                   batch_size=BATCH_SIZE,
#                   epochs=EPOCHS,
#                   verbose = 2,
#                   callbacks=[model_LSTM_callback])

In [None]:
# print(f" LSTM loss: {min(history_LSTM.history['loss'])}")
# print(f" GRU loss: {min(history_GRU.history['loss'])}")

# fig, ax = plt.subplots(1, 1, figsize = (20, 10))

# ax.plot(history_LSTM.history['loss'])
# ax.plot(history_GRU.history['loss'])

# ax.set_title('Model - LSTM vs GRU')

# ax.set_ylabel('Loss')
# ax.set_xlabel('Epoch')
# ax.legend()
# plt.show()

#### Test data
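
The cells in this section turn per-position predictions into rows keyed by `id_seqpos` and then average the GRU and LSTM outputs. A plain-Python sketch of both steps is given here; the `to_rows` and `blend` helpers, the id, and the prediction values are made up for illustration, and the equal weighting mirrors the 0.5/0.5 weights used further down.

```python
TARGETS = ["reactivity", "deg_Mg_pH10", "deg_pH10", "deg_Mg_50C", "deg_50C"]

def to_rows(uid, predictions):
    """predictions: one list of len(TARGETS) values per sequence position."""
    return [
        {"id_seqpos": f"{uid}_{pos}", **dict(zip(TARGETS, values))}
        for pos, values in enumerate(predictions)
    ]

def blend(rows_a, rows_b, weight_a=0.5):
    """Weighted average of two row lists that share the same id_seqpos order."""
    blended = []
    for a, b in zip(rows_a, rows_b):
        if a["id_seqpos"] != b["id_seqpos"]:
            raise ValueError("rows are not aligned on id_seqpos")
        row = {"id_seqpos": a["id_seqpos"]}
        for col in TARGETS:
            row[col] = weight_a * a[col] + (1 - weight_a) * b[col]
        blended.append(row)
    return blended

# Invented predictions for a two-position molecule
rows_gru = to_rows("id_example", [[0.2] * 5, [0.4] * 5])
rows_lstm = to_rows("id_example", [[0.4] * 5, [0.8] * 5])
combined = blend(rows_gru, rows_lstm)
print(combined[0])
```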

In [None]:
public_df = test_data.query("seq_length == 107").copy()
private_df = test_data.query("seq_length == 130").copy()

In [None]:
As_public = []
for id in tqdm(public_df["id"]):
    a = np.load(f"/kaggle/input/stanford-covid-vaccine/bpps/{id}.npy")
    As_public.append(a)
As_public = np.array(As_public)
As_private = []
for id in tqdm(private_df["id"]):
    a = np.load(f"/kaggle/input/stanford-covid-vaccine/bpps/{id}.npy")
    As_private.append(a)
As_private = np.array(As_private)

In [None]:
public_inputs = preprocess_inputs(public_df, seq_length = 107, flag = 'test_public')
private_inputs = preprocess_inputs(private_df, seq_length = 130, flag = 'test_private')

In [None]:
model_LSTM_on_test_data_public = build_model(seq_len=107, pred_len=107, gru_flag = False)
model_LSTM_on_test_data_public.load_weights('../input/openvaccine-covid-model-weights/LSTM model.h5')
pred_test_data_public_LSTM = model_LSTM_on_test_data_public.predict(public_inputs)

model_GRU_on_test_data_public = build_model(seq_len=107, pred_len=107, gru_flag = True)
model_GRU_on_test_data_public.load_weights('../input/openvaccine-covid-model-weights/GRU model (1).h5')
pred_test_data_public_GRU = model_GRU_on_test_data_public.predict(public_inputs)

In [None]:
model_LSTM_on_test_data_private = build_model(seq_len=130, pred_len=130, gru_flag = False)
model_LSTM_on_test_data_private.load_weights('../input/openvaccine-covid-model-weights/LSTM model.h5')
pred_test_data_private_LSTM = model_LSTM_on_test_data_private.predict(private_inputs)

model_GRU_on_test_data_private = build_model(seq_len=130, pred_len=130, gru_flag = True)
model_GRU_on_test_data_private.load_weights('../input/openvaccine-covid-model-weights/GRU model (1).h5')
pred_test_data_private_GRU = model_GRU_on_test_data_private.predict(private_inputs)

In [None]:
def format_predictions(public_preds, private_preds):
    preds = []
    
    for df, preds_ in [(public_df, public_preds), (private_df, private_preds)]:
        for i, uid in enumerate(df.id):
            single_pred = preds_[i]

            single_df = pd.DataFrame(single_pred, columns=target_cols)
            single_df['id_seqpos'] = [f'{uid}_{x}' for x in range(single_df.shape[0])]

            preds.append(single_df)
    return pd.concat(preds).reset_index(drop = True)

In [None]:
lstm_preds = format_predictions(pred_test_data_public_LSTM, pred_test_data_private_LSTM)
gru_preds = format_predictions(pred_test_data_public_GRU, pred_test_data_private_GRU)

In [None]:
lstm_preds.head()

In [None]:
gru_preds.head()

In [None]:
submission_LSTM = submission_format[['id_seqpos']].merge(lstm_preds, how = 'inner', on = 'id_seqpos')
submission_GRU = submission_format[['id_seqpos']].merge(gru_preds, how = 'inner', on = 'id_seqpos')

In [None]:
print(submission_LSTM.shape)
submission_LSTM.head()

In [None]:
print(submission_GRU.shape)
submission_GRU.head()

In [None]:
target_cols

In [None]:
submission_lstm_gru_combined = submission_GRU.merge(submission_LSTM, how = 'inner', on = 'id_seqpos')

gru_weight = 0.5
lstm_weight = 0.5
for i in range(len(target_cols)):
    submission_lstm_gru_combined[target_cols[i]] = submission_lstm_gru_combined[target_cols[i]+'_x']*gru_weight + submission_lstm_gru_combined[target_cols[i]+'_y']*lstm_weight

In [None]:
submission_lstm_gru_combined = submission_lstm_gru_combined[['id_seqpos'] + target_cols]

In [None]:
submission_lstm_gru_combined.head()

In [None]:
os.chdir("/kaggle/working/")
submission_LSTM.to_csv('submission_LSTM.csv', index = False)
submission_GRU.to_csv('submission_GRU.csv', index = False)
submission_lstm_gru_combined.to_csv('submission_lstm_gru_combined.csv', index = False)