### OpenVaccine - Stanford University - XGBoost
The following notebook constructs an XGBoost regression model to predict the degradation rate of mRNA molecules in the OpenVaccine competition held by Stanford University. 

#### 0. Import Libraries

In [None]:
!pip3 install -q forgi[all]
!conda install -y -c bioconda viennarna

In [None]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor
import forgi.graph.bulge_graph as fgb
import forgi.visual.mplotlib as fvm
sns.set(style='darkgrid')

#### 1. Exploratory Data Analysis

In [None]:
# load data
train = pd.read_json('../input/stanford-covid-vaccine/train.json',lines=True)
test = pd.read_json('../input/stanford-covid-vaccine/test.json', lines=True)
sample_sub = pd.read_csv('../input/stanford-covid-vaccine/sample_submission.csv')

##### 1.1 Shape of Input and Output and Important Variables Explained

In [None]:
# examine the shape of input and output
print("train data shape: ", train.shape)
print("test data shape: ", test.shape)
print("sample submission shape: ", sample_sub.shape)

- 2400 training examples, each with 19 attributes
- 3634 testing examples, each with 7 attributes (629 public samples + 3005 private samples)
- Public test sequences have a length of 107, while private test sequences have a length of 130 (see the sequence-length counts below). A prediction is required for every position of every sequence. Hence, total rows in submission = 629 * 107 + 3005 * 130 = 457,953
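
The row count above can be checked with a quick calculation. This is only a sketch restating the arithmetic; the variable names are illustrative and not part of the competition data.

```python
# (number of sequences, sequence length) as quoted above
public_test = (629, 107)
private_test = (3005, 130)

# one submission row per position of every test sequence
total_rows = public_test[0] * public_test[1] + private_test[0] * private_test[1]
print(total_rows)  # 457953
```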

In [None]:
train.head()

**Important variables**
- seq_length - Int, the length of sequence.
- sequence - An array of A, G, U, and C. Describes the RNA sequence. **Main feature**
- structure - An array of (, ), and . characters that describe whether a base is estimated to be paired or unpaired.
- reactivity, deg_pH10, deg_Mg_pH10, deg_50C, deg_Mg_50C - Arrays of floating point numbers. Five measures of the likelihood of degradation under different conditions. **Ground-truth Values**
- SN_filter - Int, 1 if the sample satisfies both conditions below, otherwise 0. The filter will also be applied to the public test set and private test set.
    - Minimum value across all 5 conditions must be greater than -0.5.
    - Mean signal/noise across all 5 conditions must be greater than 1.0.
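
The two conditions can be sketched as a small function. This is an illustration only: `sn_filter_sketch` is a hypothetical helper, the toy values are made up, and the way the organizers actually computed the minimum and the mean signal/noise is assumed here rather than taken from the competition code.

```python
import numpy as np

def sn_filter_sketch(target_arrays, mean_signal_to_noise):
    """Return 1 if both conditions listed above hold, else 0 (illustrative only)."""
    # smallest measured value over all five conditions
    min_value = min(float(np.min(values)) for values in target_arrays)
    return int(min_value > -0.5 and mean_signal_to_noise > 1.0)

# five made-up target arrays, one per condition
clean = [[0.3, 0.1], [0.5, 0.2], [0.4, 0.0], [0.2, 0.6], [0.1, 0.9]]
noisy = [[0.3, -0.7], [0.5, 0.2], [0.4, 0.0], [0.2, 0.6], [0.1, 0.9]]

print(sn_filter_sketch(clean, 4.2))  # 1
print(sn_filter_sketch(noisy, 4.2))  # 0, one value is below -0.5
print(sn_filter_sketch(clean, 0.8))  # 0, mean signal/noise is not above 1.0
```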

In [None]:
test.head()

The five target variables and several other variables (SN_filter, signal_to_noise, and the error columns) are absent from the test set, so they cannot be used as features in the later model.

In [None]:
sample_sub.head()

For every position of every example (denoted by id_seqpos), all five indexes are predicted.

In [None]:
train['seq_length'].value_counts()

Every example in the training data has a sequence length of 107.

In [None]:
test['seq_length'].value_counts()

- 629 examples in the test set have a sequence length of 107, the same as the training set; they form the public test set.
- 3005 examples in the test set have a sequence length of 130; they form the private test set.
- The difference in sequence length between the public and private test sets is meant to test how well the model generalizes.

##### 1.2 Distribution of Signal/Noise and SN_filter

In [None]:
# plot the boxplot of Signal/Noise and the barplot of SN_filter
fig, ax = plt.subplots(1, 2, figsize=(20,5))
sns.boxplot(data=train, x='signal_to_noise', ax=ax[0])
ax[0].set_title('Signal/Noise')
sns.countplot(data=train, y='SN_filter', ax=ax[1])
ax[1].set_title('SN_filter')
plt.show()

- There are samples with very high Signal/Noise values, which could be outliers.
- About 2/3 of the training samples have SN_filter of 1, which denotes that the example is relatively "good". **Removing samples with SN_filter of 0 in the training process might be a good choice.**

##### 1.3 Distribution of Ground-Truth vs Position

In [None]:
# Reference: https://www.kaggle.com/meemr5/openvaccine-interesting-visualizations
# obtain the average values of five indexes over positions
avg_reactivity = np.array(list(map(np.array,train.reactivity))).mean(axis=0)
avg_deg_50C = np.array(list(map(np.array,train.deg_50C))).mean(axis=0)
avg_deg_pH10 = np.array(list(map(np.array,train.deg_pH10))).mean(axis=0)
avg_deg_Mg_50C = np.array(list(map(np.array,train.deg_Mg_50C))).mean(axis=0)
avg_deg_Mg_pH10 = np.array(list(map(np.array,train.deg_Mg_pH10))).mean(axis=0)

In [None]:
# plot of the average values of five indexes over positions vs positions
plt.figure(figsize=(20,10))

sns.lineplot(x=range(68),y=avg_reactivity,label='avg_reactivity')
sns.lineplot(x=range(68),y=avg_deg_50C,label='avg_deg_50C')
sns.lineplot(x=range(68),y=avg_deg_pH10,label='avg_deg_pH10')
sns.lineplot(x=range(68),y=avg_deg_Mg_50C,label='avg_deg_Mg_50C')
sns.lineplot(x=range(68),y=avg_deg_Mg_pH10,label='avg_deg_Mg_pH10')

plt.xlabel('Positions on the RNA sequence')
plt.xticks(range(0,68))
plt.ylabel('Values')
plt.title('Average Target Values vs Positions')

plt.show()

- There is a certain pattern regarding the distribution of Reactivity and degradation: they tend to be high at the beginning of the sequence and stable and low in the middle.
- There is some correlation between these five indexes.


In [None]:
# correlation between the five indexes
np.corrcoef(np.vstack((avg_reactivity, avg_deg_50C, avg_deg_pH10, avg_deg_Mg_50C, avg_deg_Mg_pH10)))

Almost all correlation coefficients exceed 0.8, indicating that the position-averaged profiles of the five indexes are highly correlated.
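
`np.corrcoef` treats each row of its input as one variable, which is why the five averaged profiles are stacked with `np.vstack` before the call. A toy sketch with made-up numbers (not competition data) shows the convention:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2 * a + 1      # perfectly linear in a
c = a[::-1]        # a reversed, perfectly anti-correlated with a

# each row is one variable, so three rows give a 3 x 3 matrix
corr = np.corrcoef(np.vstack((a, b, c)))
print(corr.shape)         # (3, 3)
print(np.round(corr, 2))
```

Note that the coefficients in this section are computed on averages over all samples, so they describe how the mean profiles move together along the sequence rather than the correlation within any single sample.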

##### 1.4 RNA Visualization

To better understand the sequence and structure of RNA, we use forgi (installed above along with ViennaRNA) to build a graph model of each RNA from its structure and sequence and to visualize it.

In [None]:
# plotting function
def plot_sample(sample):
    
    """
    Reference: https://www.kaggle.com/erelin6613/openvaccine-rna-visualization
    Visualize the RNA secondary structure of one sample using forgi
    Arguments:
    sample: pandas.Series, a sample of RNA, must contain 'id', 'structure' and 'sequence'
    
    """
    struct = sample['structure']
    seq = sample['sequence']
    bg = fgb.BulgeGraph.from_fasta_text(f'>rna1\n{struct}\n{seq}')[0]
    
    plt.figure(figsize=(20,8))
    fvm.plot_rna(bg)
    plt.title(f"RNA Structure (id: {sample.id})")
    plt.show()

In [None]:
# example: plot a randomly chosen training sample
sample = train.iloc[np.random.choice(len(train))]
plot_sample(sample)

#### 2. Prediction with XGBoost Regression

##### 2.1 Preprocess

As we have stated in 1.2, we remove training samples with SN_filter of 0.

In [None]:
# filter training data with SN_filter
mask = train['SN_filter'] == 1
train = train[mask]

We also drop the columns that are not available in the test set.

In [None]:
# remove explanatory variables not available in the test set
train = train.drop(['signal_to_noise', 'SN_filter', 'reactivity_error', 'deg_error_Mg_pH10', 'deg_error_pH10', 'deg_error_Mg_50C', 'deg_error_50C'], axis = 1)
train.shape

Since prediction is required for all positions of every sample, we break every sample down into pieces.

In [None]:
# rearrange training data so that one row represents one position of one sample
train_data = []

for ID in train['id'].unique():
    entry = train.loc[train['id'] == ID]     
    for i in range(entry['seq_scored'].values[0]):
        sample_dict = {'id': entry['id'].values[0],
                       'id_seqpos': str(entry['id'].values[0]) + '_' + str(i),
                       'sequence': entry['sequence'].values[0][i],
                       'structure': entry['structure'].values[0][i],
                       'predicted_loop_type': entry['predicted_loop_type'].values[0][i],
                       'reactivity': entry['reactivity'].values[0][i],
                       'deg_Mg_pH10': entry['deg_Mg_pH10'].values[0][i],
                       'deg_pH10': entry['deg_pH10'].values[0][i],
                       'deg_Mg_50C': entry['deg_Mg_50C'].values[0][i],
                       'deg_50C': entry['deg_50C'].values[0][i]}
        train_data.append(sample_dict)
        
train_data = pd.DataFrame(train_data)
train_data.head()

In [None]:
# same for the test data
test_data = []

for ID in test['id'].unique():
    entry = test.loc[test['id'] == ID]     
    for i in range(entry['seq_length'].values[0]):
        sample_dict = {'id': entry['id'].values[0],
                       'id_seqpos': str(entry['id'].values[0]) + '_' + str(i),
                       'sequence': entry['sequence'].values[0][i],
                       'structure': entry['structure'].values[0][i],
                       'predicted_loop_type': entry['predicted_loop_type'].values[0][i]}
        test_data.append(sample_dict)
        
test_data = pd.DataFrame(test_data)
test_data.head()
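
The per-position reshaping above can also be sketched in a vectorized way. The example below is a minimal illustration on a made-up two-row frame, not a drop-in replacement for the cells above; `toy` and `long_df` are hypothetical names, and exploding several columns at once assumes pandas 1.3 or later.

```python
import pandas as pd

# made-up data with the same layout as the training frame
toy = pd.DataFrame({
    'id': ['id_a', 'id_b'],
    'seq_scored': [3, 2],
    'sequence': ['GGAAA', 'GCUAA'],
    'structure': ['(..).', '.(.).'],
    'reactivity': [[0.1, 0.2, 0.3], [0.4, 0.5]],
})

long_df = toy.copy()
n = long_df['seq_scored']

# truncate every per-position column to the scored length and turn it into a list
for col in ['sequence', 'structure', 'reactivity']:
    long_df[col] = [list(v[:k]) for v, k in zip(long_df[col], n)]
long_df['seqpos'] = [list(range(k)) for k in n]

# one row per (sample, position); the lists in a row must have equal length
long_df = long_df.explode(['sequence', 'structure', 'reactivity', 'seqpos'],
                          ignore_index=True)
long_df['id_seqpos'] = long_df['id'] + '_' + long_df['seqpos'].astype(str)
print(long_df[['id_seqpos', 'sequence', 'structure', 'reactivity']])
```

Compared with the nested loops, this avoids filtering the frame once per id, which may matter for the larger test set.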

'sequence', 'structure' and 'predicted_loop_type' are all single characters (strings). We turn them into integer codes using dictionaries.

In [None]:
# convert each character to an integer code
dict_sequence = {'A': 0, 'G' : 1, 'U' : 2, 'C' : 3}
dict_structure = {'(' : 0, ')' : 1, '.' : 2}
dict_looptype = {'S':0, 'M':1, 'I':2, 'B':3, 'H':4, 'E':5, 'X':6}

train_data['sequence'] = train_data['sequence'].replace(dict_sequence)
train_data['structure'] = train_data['structure'].replace(dict_structure)
train_data['predicted_loop_type'] = train_data['predicted_loop_type'].replace(dict_looptype)

test_data['sequence'] = test_data['sequence'].replace(dict_sequence)
test_data['structure'] = test_data['structure'].replace(dict_structure)
test_data['predicted_loop_type'] = test_data['predicted_loop_type'].replace(dict_looptype)

train_data.head()
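
One caveat with `Series.replace` is that a character missing from the dictionary is left unchanged, so an unexpected symbol would silently stay a string. A minimal sketch of a stricter variant uses `Series.map`, which yields NaN for unmapped values and makes them easy to detect. The toy series is made up for illustration.

```python
import pandas as pd

dict_sequence = {'A': 0, 'G': 1, 'U': 2, 'C': 3}
toy = pd.Series(['G', 'A', 'N', 'C'])   # 'N' is deliberately not in the dictionary

encoded = toy.map(dict_sequence)        # unmapped characters become NaN
unmapped = toy[encoded.isna()].tolist()
print(unmapped)  # ['N']
```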

In [None]:
# split data in features and labels
X_train = train_data.drop(['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C'], axis=1)
Y_train = train_data[['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']]

In [None]:
# split into training set and validation set
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2)
X_train.shape, X_val.shape, Y_train.shape, Y_val.shape

##### 2.2 Define Loss Function

In [None]:
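# loss function used for scoring
def mcrmse_loss(y_true, y_pred, N=None):
    """
    Calculates the competition eval metric (mean column-wise RMSE).
    N is the number of target columns; by default it is inferred from y_true,
    so a single target column gives its plain RMSE.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    if N is None:
        N = 1 if y_true.ndim == 1 else y_true.shape[1]
    return np.sum(np.sqrt(np.sum((y_true - y_pred)**2, axis=0) / n)) / N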
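
As a worked example, MCRMSE is the RMSE of each target column averaged over the columns. The sketch below uses a hypothetical helper `mcrmse` and made-up numbers with two target columns, chosen so the result is easy to verify by hand.

```python
import numpy as np

def mcrmse(y_true, y_pred):
    """Mean over columns of the per-column RMSE (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse_per_column = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return rmse_per_column.mean()

y_true = [[1.0, 2.0],
          [3.0, 4.0]]
y_pred = [[1.0, 4.0],
          [3.0, 2.0]]

# column 0: errors 0 and 0   -> RMSE 0
# column 1: errors -2 and 2  -> RMSE 2
print(mcrmse(y_true, y_pred))  # 1.0
```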

##### 2.3 Construct Model

In [None]:
# XGBoost Regressor model, best parameters after testing
xgb = XGBRegressor(
    n_estimators=800,
    eval_metric='rmse',
    learning_rate=0.1,
    subsample=0.8, # prevent overfitting
    colsample_bytree=0.8 # prevent overfitting
)

In [None]:
# keep only the feature columns used for prediction (drops 'id' and 'id_seqpos')
features = ['sequence', 'structure', 'predicted_loop_type']
targets = ['reactivity', 'deg_Mg_pH10', 'deg_pH10', 'deg_Mg_50C', 'deg_50C']
sub = pd.DataFrame(test_data['id_seqpos'])
feature_importances = pd.DataFrame(index=features)
# train X
tr_X = X_train[features]
# validation X
vl_X = X_val[features]
# test X
ts_X = test_data[features]
tr_X.shape, vl_X.shape, ts_X.shape

In [None]:
# train and test
for i in range(5):
    tr_Y, vl_Y = Y_train[targets[i]], Y_val[targets[i]]
    # train
    xgb.fit(tr_X, tr_Y)
    feature_importances.insert(i, targets[i], xgb.feature_importances_)
    # validate
    vl_pred = xgb.predict(vl_X)
    loss = mcrmse_loss(vl_Y, vl_pred)
    print(f'{targets[i]} loss : {loss}')
    # test
    sub[targets[i]] = xgb.predict(ts_X)

In [None]:
# feature importances visualization
fig, ax = plt.subplots(3, 2, figsize = (12, 8))
fig.suptitle('Feature Importances Visualization')
for i in range(5):
    sns.barplot(x = features, y = targets[i], data=feature_importances, ax=ax[i // 2][i % 2])
ax[2][1].axis('off')  # hide the unused sixth panel
plt.tight_layout()
plt.show()

In [None]:
# submission file
sub.to_csv('submission.csv', index=False)
sub.shape, sub.head()

##### 2.4 Visualize Results

Finally, we pick four test samples and visualize the predicted reactivity along each sequence.

In [None]:
# prediction results visualization
# split 'id_seqpos' on its last underscore into sample id and position
sub[['id', 'seqpos']] = sub['id_seqpos'].str.rsplit('_', n=1, expand=True)
sub['seqpos'] = sub['seqpos'].astype(int)
sub = sub.sort_values(by=['id', 'seqpos']).reset_index(drop=True)
# pick four samples at fixed positions in the id-sorted list
reac_by_id = sub.groupby('id')['reactivity'].apply(list)
reac0 = reac_by_id.iloc[0]
reac1 = reac_by_id.iloc[1000]
reac2 = reac_by_id.iloc[2000]
reac3 = reac_by_id.iloc[3000]
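
Because the sample ids themselves contain an underscore, `id_seqpos` has to be split on its last underscore only. A small sketch with made-up keys (the ids below are illustrative, not real competition ids) shows the behavior of `str.rsplit` with `n=1`:

```python
import pandas as pd

keys = pd.Series(['id_abc123_0', 'id_abc123_12', 'id_def456_7'])

# split once, starting from the right, into two columns
parts = keys.str.rsplit('_', n=1, expand=True)
parts.columns = ['id', 'seqpos']
parts['seqpos'] = parts['seqpos'].astype(int)
print(parts)
```

Converting `seqpos` to an integer before sorting matters: sorted as strings, position 12 would come before position 7.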

In [None]:
# predicted reactivity vs position plot
fig, ax = plt.subplots(4, 1, sharex=True)
fig.suptitle('Predicted Reactivity vs Position')
sns.lineplot(data = reac0, ax=ax[0])
sns.lineplot(data = reac1, ax=ax[1])
sns.lineplot(data = reac2, ax=ax[2])
sns.lineplot(data = reac3, ax=ax[3])
plt.show()