# **Version 1: Problem Statement**
# **Version 3: Transfer Learning**
# **Upcoming Version: Transferred model optimization**

Here I will discuss how to attack such problems when you have zero domain knowledge or the literature is too complex to digest :). I have no background in RNA or its theory, so I will lay out my thoughts here to make some sense of the problem.

***Introduction to DNA/RNA:*** I am a newbie to biology terminology, so I will write down the bare minimum definitions needed to understand the problem we are trying to solve:

***DNA v/s RNA:*** DNA encodes all genetic information and is the blueprint from which biological life is created. Roughly speaking, it can be thought of as a biological flash drive. RNA, on the other hand, is the reader that decodes the information on the biological flash drive [DNA] and is used in the process of creating proteins.

![DNA v/s RNA](https://scx1.b-cdn.net/csz/news/800/2020/11-newtechnolog.jpg)

There are three types of RNA: **mRNA**, tRNA and rRNA. In this data science problem we are dealing with **mRNA**, the messenger RNA that carries a copy of a portion of the genetic code; the copying process is called transcription.

**mRNA Sequencing:** a method of analyzing the transcriptome [the sum total of mRNA] of disease or biological states.


**Problem Statement:** To predict the degradation rate at each position along an RNA sequence.

***Fields Used in training set:***

1.   index
2.   id
3.   sequence [String of **max** size  107]
4.   structure [String of **max** size 107 ]
5.   predicted_loop_type [107 in both train/test]
6.   signal_to_noise
7.   SN_filter
8.   seq_length 
9.   seq_scored [**max**   68]
10.  reactivity_error [68 items]
11.  deg_error_Mg_pH10 [68 items]
12.  deg_error_pH10 [68 items]
13.  deg_error_Mg_50C [68 items]
14.  deg_error_50C [68 items]
15.  reactivity [68 items]
16.  deg_Mg_pH10 [68 items]
17.  deg_pH10 [68 items]
18.  deg_Mg_50C [68 items]
19.  deg_50C [68 items]

***Fields Used in test set:***

1.   index
2.   id
3.   sequence  [String of **max** size  130]
4.   structure [String of **max** size  130]
5.   predicted_loop_type [107 in both train/test]
6.   seq_length 
7.   seq_scored 

Data Stat:

As per the description, Stanford scientists have data on 3029 RNA sequences, of which 2400 are used for training and 629 for the public test set.

**Training set**=2400 <br>
**Test set**=629[public]+3005[added new RNAs]=3634

So, just by reading the descriptions of the data columns, we can identify the independent variables (features, the X columns) and the dependent variables (the Y columns):

**X:** sequence [string], structure [string], predicted_loop_type [string]

**Y:** 5 columns: reactivity, deg_Mg_pH10, deg_pH10, deg_Mg_50C, deg_50C

Since every X column is a string of characters, this looks like a classic NLP sequence problem.
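
Here is a minimal sketch of what treating the inputs as text means in practice: each character of the three X strings is mapped to an integer id, giving one triple of ids per position. The vocabulary string is the one used later in this notebook; the helper name `encode_example` and the 6-character toy strings are made up for illustration and are not real data.

```python
# Character vocabulary: structure symbols, bases, then loop-type codes
token2int = {ch: i for i, ch in enumerate('().ACGUBEHIMSX')}

def encode_example(sequence, structure, loop_type):
    """Return one [seq_id, struct_id, loop_id] triple per position (hypothetical helper)."""
    assert len(sequence) == len(structure) == len(loop_type)
    return [
        [token2int[a], token2int[b], token2int[c]]
        for a, b, c in zip(sequence, structure, loop_type)
    ]

# Toy inputs, made up for illustration
encoded = encode_example('GGAAAC', '(....)', 'SHHHHS')
print(len(encoded))   # 6 positions
print(encoded[0])     # [5, 0, 12] -> ids of 'G', '(' and 'S'
```

Stacking these triples for every sample gives the integer array that an embedding layer can consume.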

In the public test set, each of the 5 columns is a vector of length 68, whereas in the private test set each of the 5 columns is a vector of length 91.

In the coming days, I will add more code to the notebook for the actual analysis.


In [None]:
import json

import pandas as pd
import numpy as np
import plotly.express as px
import tensorflow.keras.layers as L
import tensorflow as tf
import tensorflow.keras as keras
from sklearn.model_selection import train_test_split
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras import layers

**In this notebook I discuss transfer learning: taking an already trained model, adding an extra layer on top of it, and training the result. This saves time and lets us tune the model further. I have used the solid model developed by @xhlulu @vbmokin. Referenced notebooks:**

https://www.kaggle.com/xhlulu/openvaccine-simple-gru-model <br>
https://www.kaggle.com/its7171/gru-lstm-with-feature-engineering-and-augmentation

**I have borrowed a few functions from the notebooks above. Credit goes to their authors; please upvote their notebooks as a token of appreciation for their quality work.**

**Utility Functions**

In [None]:
def LOSS_MCRMSE(y_true, y_pred):
    # Same computation as MCRMSE below, kept under its original name
    colwise_mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=1)
    return tf.reduce_mean(tf.sqrt(colwise_mse), axis=1)


def gru_layer(hidden_dim, dropout):
    # Bidirectional GRU that returns an output for every position
    return L.Bidirectional(L.GRU(
        hidden_dim, dropout=dropout, return_sequences=True, kernel_initializer='orthogonal'))


def MCRMSE(y_true, y_pred):
    # Mean columnwise RMSE: MSE over positions (axis 1), square root per target column,
    # then the mean over the target columns
    colwise_mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=1)
    return tf.reduce_mean(tf.sqrt(colwise_mse), axis=1)


def lstm_layer(hidden_dim, dropout):
    # Bidirectional LSTM that returns an output for every position
    return tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(
            hidden_dim,
            dropout=dropout,
            return_sequences=True,
            kernel_initializer='orthogonal'))

def preprocess_inputs(df, token2int, cols=['sequence', 'structure', 'predicted_loop_type']):
    return pandas_list_to_array(
        df[cols].applymap(lambda seq: [token2int[x] for x in seq])
    )

def pandas_list_to_array(df):
    """
    Input: dataframe of shape (x, y), containing list of length l
    Return: np.array of shape (x, l, y)
    """
    
    return np.transpose(
        np.array(df.values.tolist()),
        (0, 2, 1)
    )
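
To make the `MCRMSE` loss above less abstract, here is a plain-Python sketch of the same computation for a single sample: mean squared error over positions for each target column, square root per column, then the mean over columns. The function name `mcrmse` and the tiny 2x2 example are mine, purely for illustration.

```python
import math

def mcrmse(y_true, y_pred):
    """MCRMSE for one sample; inputs are lists of rows with shape (positions, columns)."""
    n_pos = len(y_true)
    n_cols = len(y_true[0])
    col_rmse = []
    for c in range(n_cols):
        # Mean squared error over positions for target column c
        mse = sum((y_true[p][c] - y_pred[p][c]) ** 2 for p in range(n_pos)) / n_pos
        col_rmse.append(math.sqrt(mse))
    # Average the per-column RMSE values
    return sum(col_rmse) / n_cols

# Two positions, two target columns (toy numbers)
y_true = [[0.0, 1.0], [0.0, 1.0]]
y_pred = [[3.0, 2.0], [-3.0, 0.0]]
score = mcrmse(y_true, y_pred)
print(score)  # column RMSEs are 3.0 and 1.0, so the mean is 2.0
```

The TensorFlow version does the same thing for a whole batch at once and returns one value per sample.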

# **Data Loading**

In [None]:
data_dir = '/kaggle/input/stanford-covid-vaccine/'
pred_cols = ['reactivity', 'deg_Mg_pH10', 'deg_Mg_50C', 'deg_pH10', 'deg_50C']

y_true = tf.random.normal((32, 68, 3))
y_pred = tf.random.normal((32, 68, 3))


train = pd.read_json(data_dir + 'train.json', lines=True)
test = pd.read_json(data_dir + 'test.json', lines=True)
sample_df = pd.read_csv(data_dir + 'sample_submission.csv')

train = train.query("signal_to_noise >= 1")

token2int = {x:i for i, x in enumerate('().ACGUBEHIMSX')}

train_inputs = preprocess_inputs(train, token2int)
train_labels = pandas_list_to_array(train[pred_cols])

x_train, x_val, y_train, y_val = train_test_split(
    train_inputs, train_labels, test_size=.1, random_state=34, stratify=train.SN_filter)

public_df = test.query("seq_length == 107")
private_df = test.query("seq_length == 130")

public_inputs = preprocess_inputs(public_df, token2int)
private_inputs = preprocess_inputs(private_df, token2int)
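
The reordering done by `pandas_list_to_array` is easy to lose track of, so here is a small list-only sketch of it: each sample arrives as "one list per column" and leaves as "one row per position". The helper name `rows_to_position_major` is hypothetical. With the data above I would expect `train_inputs` to have shape (n, 107, 3) and `train_labels` shape (n, 68, 5), assuming every training sequence has 107 characters and 68 scored positions as described earlier.

```python
def rows_to_position_major(rows):
    """rows: x samples, each a list of y columns, each column a list of l values.
    Returns nested lists of shape (x, l, y)."""
    return [[list(pos) for pos in zip(*row)] for row in rows]

rows = [
    [[1, 2, 3], [10, 20, 30]],   # sample 0: two columns, three positions
    [[4, 5, 6], [40, 50, 60]],   # sample 1
]
out = rows_to_position_major(rows)
print(out[0])  # [[1, 10], [2, 20], [3, 30]]
```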

# **Method to declare model structure - used in transfer learning**

In [None]:
def build_model_structure(embed_size=14, seq_len=107, pred_len=68, dropout=0.5, 
                sp_dropout=0.2, embed_dim=200, hidden_dim=256, n_layers=3):
    inputs = L.Input(shape=(seq_len, 3))
    embed = L.Embedding(input_dim=embed_size, output_dim=embed_dim)(inputs)
    
    reshaped = tf.reshape(
        embed, shape=(-1, embed.shape[1],  embed.shape[2] * embed.shape[3])
    )
    hidden = L.SpatialDropout1D(sp_dropout)(reshaped)
    
    for x in range(n_layers):
        hidden = gru_layer(hidden_dim, dropout)(hidden)
    
    # Since we are only making predictions on the first part of each sequence, 
    # we have to truncate it
    truncated = hidden[:, :pred_len]
    out = L.Dense(5, activation='linear')(truncated)
    
    model = tf.keras.Model(inputs=inputs, outputs=out)
    model.compile(tf.optimizers.Adam(), loss=MCRMSE)
    
    return model
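
To follow what happens to the tensor shapes inside `build_model_structure`, here is a small bookkeeping sketch (batch dimension omitted). It does not build any model; `trace_shapes` is a hypothetical helper that just reproduces the arithmetic, assuming the bidirectional layers concatenate their forward and backward outputs, which is the Keras default.

```python
def trace_shapes(seq_len=107, pred_len=68, n_features=3,
                 embed_dim=200, hidden_dim=256, n_targets=5):
    """Per-sample shapes after each stage of the model (hypothetical helper)."""
    return {
        'input': (seq_len, n_features),
        # Every integer token becomes a vector of size embed_dim
        'embedding': (seq_len, n_features, embed_dim),
        # The three embeddings at each position are flattened into one vector
        'reshaped': (seq_len, n_features * embed_dim),
        # Forward and backward GRU outputs are concatenated
        'bi_gru': (seq_len, 2 * hidden_dim),
        # Only the first pred_len positions are kept
        'truncated': (pred_len, 2 * hidden_dim),
        'output': (pred_len, n_targets),
    }

shapes = trace_shapes()
for name, shape in shapes.items():
    print(f'{name:10s} {shape}')
```

With the defaults this gives 600 features per position after the reshape and a (68, 5) output, which matches the shape of the training labels.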

# **Create public and private model structure**

**Public model structure -  model.h5 can  be downloaded from [model.h5](http://www.kaggle.com/gagankarora/learned-model)**

In [None]:
keras.backend.clear_session()
weight_file='/kaggle/input/learned-model/model.h5'
#public model
pre_trained_public_model = build_model_structure(seq_len=107, pred_len=107)
pre_trained_public_model.load_weights(weight_file)


#private model
pre_trained_private_model = build_model_structure(seq_len=130, pred_len=130)
pre_trained_private_model.load_weights(weight_file)
# Make all the layers in the pre-trained model non-trainable
for layer in pre_trained_public_model.layers:
    layer.trainable=False

# Make all the layers in the pre-trained model non-trainable
for layer in pre_trained_private_model.layers:
    layer.trainable=False
       
# Get the summary
pre_trained_public_model.summary()

# **Update to public Model**

In [None]:
# Select the last layer
pred_len=68
last_layer = pre_trained_public_model.get_layer('dense')
last_output = last_layer.output


# Add a bidirectional LSTM layer with 200 hidden units on top of the frozen model's output
new_layer = lstm_layer(200, dropout=.4)(last_output)

# Apply dropout with a rate of 0.2
new_layer = layers.Dropout(.2)(new_layer)

truncated = new_layer[:, :pred_len]

out = tf.keras.layers.Dense(5, activation='linear')(truncated)
transferred_model = tf.keras.Model(inputs=pre_trained_public_model.input,outputs=out)
transferred_model.compile(tf.optimizers.Adam(), loss=MCRMSE)
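
The idea behind the cell above (freeze what was already learned, train only what was added) can be shown with a framework-free toy. This is not the Keras code used in this notebook, just a sketch: the "pre-trained" base is a fixed feature map, the new head is a pair of trainable weights, and plain gradient descent updates the head only. All names and numbers here are made up for illustration.

```python
# Frozen "pre-trained" base: a fixed feature map whose weights never change
base_weights = [2.0, -1.0]

def base_features(x):
    return [w * x for w in base_weights]

# New trainable head: one weight per base feature, starting at zero
head_weights = [0.0, 0.0]

def predict(x):
    return sum(h * f for h, f in zip(head_weights, base_features(x)))

# Toy data following y = 3 * x
data = [(x, 3.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]

def mse():
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

frozen_copy = list(base_weights)
loss_before = mse()

lr = 0.05
for _ in range(200):
    grads = [0.0, 0.0]
    for x, y in data:
        err = predict(x) - y
        for i, f in enumerate(base_features(x)):
            grads[i] += 2 * err * f / len(data)
    # Only the head is updated; base_weights is never touched
    for i in range(len(head_weights)):
        head_weights[i] -= lr * grads[i]

loss_after = mse()
print(loss_before, loss_after)
```

In the notebook the same roles are played by the layers with `trainable=False` (the base) and the new LSTM, dropout and dense layers (the head).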

# **Update the private model**

In [None]:
pre_trained_private_model.summary()

**Please note that the last layer of the private model is named dense_1.**

In [None]:
# Select the last layer
pred_len=107
last_layer = pre_trained_private_model.get_layer('dense_1')
last_output = last_layer.output


# Add a bidirectional LSTM layer with 200 hidden units on top of the frozen model's output
new_layer = lstm_layer(200, dropout=.4)(last_output)

# Apply dropout with a rate of 0.2
new_layer = layers.Dropout(.2)(new_layer)

truncated = new_layer[:, :pred_len]

out = tf.keras.layers.Dense(5, activation='linear')(truncated)
transferred_pr_model = tf.keras.Model(inputs=pre_trained_private_model.input,outputs=out)
transferred_pr_model.compile(tf.optimizers.Adam(), loss=MCRMSE)

# **Model Fitting**

In [None]:
history = transferred_model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=64,
    epochs=40,
    verbose=2,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau(patience=5),
        tf.keras.callbacks.ModelCheckpoint('transferred_model.h5')
    ]
)

# **Model Evaluation**

In [None]:
fig = px.line(
    history.history, y=['loss', 'val_loss'],
    labels={'index': 'epoch', 'value': 'MCRMSE'}, 
    title='Training History')
fig.show()

# **Making Predictions**

In [None]:
transferred_model_public = transferred_model
transferred_model_private = transferred_pr_model

transferred_model_public.load_weights('transferred_model.h5')
transferred_model_private.load_weights('transferred_model.h5')

public_preds = transferred_model_public.predict(public_inputs)
private_preds = transferred_model_private.predict(private_inputs)

preds_ls = []

for df, preds in [(public_df, public_preds), (private_df, private_preds)]:
    for i, uid in enumerate(df.id):
        single_pred = preds[i]

        single_df = pd.DataFrame(single_pred, columns=pred_cols)
        single_df['id_seqpos'] = [f'{uid}_{x}' for x in range(single_df.shape[0])]

        preds_ls.append(single_df)

preds_df = pd.concat(preds_ls)
preds_df.head()
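
The loop above turns each sequence's prediction matrix into one row per position, keyed by `id_seqpos`. Here is a pandas-free sketch of that flattening for a single sequence; the helper name `to_submission_rows`, the id `id_example` and the prediction values are all made up for illustration.

```python
def to_submission_rows(uid, preds, pred_cols):
    """Flatten one sequence's (positions, columns) predictions into row dicts (hypothetical helper)."""
    rows = []
    for pos, values in enumerate(preds):
        # The key combines the sequence id with the position index
        row = {'id_seqpos': f'{uid}_{pos}'}
        row.update(zip(pred_cols, values))
        rows.append(row)
    return rows

pred_cols = ['reactivity', 'deg_Mg_pH10', 'deg_Mg_50C', 'deg_pH10', 'deg_50C']
toy_preds = [
    [0.1, 0.2, 0.3, 0.4, 0.5],   # position 0
    [0.6, 0.7, 0.8, 0.9, 1.0],   # position 1
]
rows = to_submission_rows('id_example', toy_preds, pred_cols)
print(rows[1]['id_seqpos'])  # id_example_1
```

Each row carries the five target values in the order given by `pred_cols`, so that order has to match the order of the columns the model was trained on.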

In [None]:
preds_df.to_csv('preds_df.csv', index=False)