## Baseline model private simulation

The baseline is based on what is, IMHO, the greatest kernel in this competition, by @mrkmakr:
https://www.kaggle.com/mrkmakr/covid-ae-pretrain-gnn-attn-cnn
which is in turn a heavily cleaned-up version of uncle @cpmpml's graph transformer from the NFL Big Data Bowl:
https://www.kaggle.com/cpmpml/graph-transfomer

In this notebook I test ideas against the baseline. The main simple ideas, none of which needs a major architectural change, are:
- adding dropout in attention layers
- adding an LSTM layer in the regressor
- adding learnable edge feature matrices that are fed to the attention layers
- adding an L1 penalty to make gradients bigger in later epochs
- adding a bi-/tri-diagonal matrix to the base edge matrix to model non-paired neighbors
- Bpps with cut-offs
- Distance matrices weighted by Bpps matrices
- More hard-coded nodal features (column sums of bpps, inverse Hamming distance to the loops)
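
As a rough illustration of the neighbor-band and Bpps ideas in the list above, here is a minimal NumPy sketch. The helper names `neighbor_band` and `bpps_features`, the toy matrix, and the default threshold are assumptions made for this example; they are not the functions used later in the notebook.

```python
import numpy as np

def neighbor_band(seq_len, weight=1.0, include_diagonal=False):
    """Bi-diagonal (or tri-diagonal) matrix linking each base to its sequence neighbors."""
    off = np.ones(seq_len - 1)
    band = np.diag(off, -1) + np.diag(off, 1)
    if include_diagonal:
        band = band + np.eye(seq_len)
    return weight * band

def bpps_features(bpps, threshold=1e-2):
    """Return bpps with small entries zeroed out, plus the column sums of the raw bpps."""
    truncated = bpps * (bpps >= threshold)
    return truncated, bpps.sum(axis=0)

# Toy symmetric base-pairing probability matrix (hypothetical values).
toy_bpps = np.array([[0.0, 0.005, 0.9],
                     [0.005, 0.0, 0.05],
                     [0.9, 0.05, 0.0]])
band = neighbor_band(3, weight=0.5)
truncated, col_sum = bpps_features(toy_bpps)
```

The band matrix can be added to a pairing matrix as an extra edge channel, and the column sums give one scalar per base that can be appended to the node features.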


Training uses a truncated sequence length (78) and a truncated target length (50), chosen to simulate the 91/130 ratio.
The held-out (private-simulated) validation set keeps the full train seq_len (107).

Baseline private simulation score:
Version 1-9: 0.2585188

Baseline using SNR>1.5 with more AE training (30 epochs, with more time spent on the private simulated set)
Version 10-: 0.2507343

### Change notes
- Version 1 (& 2): CV 0.2585188. No model change from the best public version, just added the private simulation.
- Version 3 (& 5): CV 0.2619488. Add a tridiagonal matrix to Ss, keep only two distance matrices (linear and square).
- Version 6: CV 0.2624281. Baseline + tridiagonal neighbor matrix in Ss, only two distance matrices, diagonal weight changed from 2 to 1 in Ss.
- Version 7: CV 0.2591735. Baseline + truncated As, only two distance matrices, tridiagonal removed from Ss.
- Version 8: CV 0.2579629. Baseline + truncated As, only inverse square distance matrix, weighted tridiagonal for Ss.
- Version 9: CV 0.2606475. Truncated As, only inverse square distance matrix, only a 0.5-weighted bi-diagonal for Ss.

### SNR 1.5 versions

- Version 10: more AE epochs than the baseline
- Version 11: more AE epochs than version 10
- Version 12: only inverse square distance + 2 Conv2D 4x4 filters + a new loss function
- Version 13: (CPU) same as 12, with a distance-to-nearest-loop nodal feature added
- Version 15: (GPU) same as 13, with more AE epochs to bring down the entropy loss
- Version 16: (GPU) version 12, bidiagonal weight, inverse linear and square distance, 2 Conv2D 4x4 filters, back to the old loss
- Version 17: (CPU) version 12 + 4 Conv2D 4x4 filters, old loss
- Version 18: (CPU) 3 Conv2D learnable edge filters, L1 penalty, linear distance weighted by As (is this learnable?)
- Version 19: (CPU) version 18, stronger L1 penalty, additional linear distance weighted by As (is this learnable?)
- Version 20 & 21: (CPU) version 19 + LSTM in regressor; the learnable edge matrix now has no activation.

In [None]:
TRUNCATED_LEN = 78 # truncated length of the training inputs
TRUNCATED_LEN_y = 50 # truncated length of the scored targets
SEED_SPLIT = 42
SEED_VAL = 802
SIZE_TRAIN = 800
SN_THRESHOLD = 1.5
BPPS_THRESHOLD = 1e-2

pretrain_dir = None
one_fold = False
run_test = False

verbose=1

ae_epochs = 40
ae_epochs_each = 5
ae_batch_size = 32

epochs_list =     [30, 10, 10, 5,  5,   5]
batch_size_list = [8,  16, 32, 64, 128, 256] 

## copy pretrain model to working dir
import shutil
import json
import glob
if pretrain_dir is not None:
    for d in glob.glob(pretrain_dir + "*"):
        shutil.copy(d, ".")
    
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc
import os
import matplotlib.pyplot as plt

from tqdm import tqdm
%matplotlib inline

## Loading train/test/targets

In [None]:
train = pd.read_json("/kaggle/input/stanford-covid-vaccine/train.json",lines=True)
train = train[train.signal_to_noise > SN_THRESHOLD].reset_index(drop = True)
    
sub = pd.read_csv("/kaggle/input/stanford-covid-vaccine/sample_submission.csv")
As = []
for seq_id in tqdm(train["id"]):  # seq_id avoids shadowing the builtin id()
    a = np.load(f"/kaggle/input/stanford-covid-vaccine/bpps/{seq_id}.npy")
    As.append(a)
As = np.array(As)

In [None]:
targets = list(sub.columns[1:])
print(targets)

y_train = []
seq_len = train["seq_length"].iloc[0]
seq_len_target = train["seq_scored"].iloc[0]
ignore = -10000
ignore_length = seq_len - seq_len_target
for target in targets:
    y = np.vstack(train[target])
    dummy = np.zeros([y.shape[0], ignore_length]) + ignore
    y = np.hstack([y, dummy])
    y_train.append(y)
y = np.stack(y_train, axis = 2)
print(y.shape)

## Node
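
To make the node features concrete, here is a small self-contained sketch of the two ingredients used in the cell below: one-hot encoding of a character string, and the inverse squared distance to the nearest base of a given loop type. The names `one_hot` and `inverse_sq_distance` and the toy strings are assumptions for illustration; the vectorized distance is an alternative formulation of the two-pass loop in `get_inverse_distance_to_loop`, not a copy of it.

```python
import numpy as np

def one_hot(sequence, vocab):
    """One-hot encode a string; rows are positions, columns follow vocab order."""
    index = {s: i for i, s in enumerate(vocab)}
    out = np.zeros((len(sequence), len(vocab)), dtype=np.float32)
    out[np.arange(len(sequence)), [index[c] for c in sequence]] = 1.0
    return out

def inverse_sq_distance(loop_string, loop_type, cutoff=0.01):
    """1 / (d + 1)^2, where d is the index distance to the nearest loop_type position."""
    positions = np.flatnonzero(np.array(list(loop_string)) == loop_type)
    if positions.size == 0:
        # Loop type absent: every distance is infinite, so the feature is zero.
        return np.zeros(len(loop_string))
    idx = np.arange(len(loop_string))
    dist = np.abs(idx[:, None] - positions[None, :]).min(axis=1)
    feat = 1.0 / (dist + 1) ** 2
    return feat * (feat > cutoff)  # zero out values at or below the cutoff

encoded = one_hot("GA", ["A", "G", "C", "U"])
hairpin_feature = inverse_sq_distance("SSHSS", "H")
```

For the toy string `"SSHSS"` the distances to the hairpin base are 2, 1, 0, 1, 2, so the feature peaks at 1 on the loop itself and decays quadratically on both sides.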

In [None]:
## sequence
def return_ohe(n, i):
    tmp = [0] * n
    tmp[i] = 1
    return tmp

def get_inverse_distance_to_loop(sequence, loop_type):
    '''
    Inverse squared distance (in sequence positions) from each base to the
    nearest base whose loop type is loop_type; values <= 0.01 are set to 0.
    '''
    prev = float('-inf')
    Dist = []
    for i, x in enumerate(sequence):
        if x == loop_type: 
            prev = i
        Dist.append(i - prev)

    prev = float('inf')
    for i in range(len(sequence) - 1, -1, -1):
        if sequence[i] == loop_type: prev = i
        Dist[i] = min(Dist[i], prev - i)
    Dist = 1/(np.array(Dist)+1)**2
    
    return Dist*(Dist>0.01)


def get_input(train):
    mapping = {}
    vocab = ["A", "G", "C", "U"]
    for i, s in enumerate(vocab):
        mapping[s] = return_ohe(len(vocab), i)
    X_node = np.stack(train["sequence"].apply(lambda x : list(map(lambda y : mapping[y], list(x)))))

    mapping = {}
    vocab = ["S", "M", "I", "B", "H", "E", "X"]
    for i, s in enumerate(vocab):
        mapping[s] = return_ohe(len(vocab), i)
    X_loop = np.stack(train["predicted_loop_type"].apply(lambda x : list(map(lambda y : mapping[y], list(x)))))
    
    mapping = {}
    vocab = [".", "(", ")"]
    for i, s in enumerate(vocab):
        mapping[s] = return_ohe(len(vocab), i)
    X_structure = np.stack(train["structure"].apply(lambda x : list(map(lambda y : mapping[y], list(x)))))
    
    
    X_node = np.concatenate([X_node, X_loop], axis = 2)
    
    ## interaction
    a = np.sum(X_node * (2 ** np.arange(X_node.shape[2])[None, None, :]), axis = 2)
    vocab = sorted(set(a.flatten()))
    print(vocab)
    ohes = []
    for v in vocab:
        ohes.append(a == v)
    ohes = np.stack(ohes, axis = 2)
    X_node = np.concatenate([X_node, ohes], axis = 2).astype(np.float32)
    
    ## distance to loops
#     loop_types = ["M", "I", "B", "H", "E", "X"]
#     dist_inv_to_loops = np.zeros((train.shape[0], As.shape[1], len(loop_types)))
#     for i in tqdm(range(len(train))):
#         idx = train.index[i]
#         for j, s in enumerate(loop_types):
#             dist_inv_to_loops[i,:,j] = get_inverse_distance_to_loop(train["predicted_loop_type"][idx], s)
    
#     X_node = np.concatenate([dist_inv_to_loops, X_node], axis = 2)
    
    print(X_node.shape)
    return X_node

X_node = get_input(train)

## Edge matrices (adj)
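
The edge channels built below come from two simple constructions: a pairing matrix read off the dot-bracket structure with a stack, and an inverse-distance matrix over sequence positions. The following is a minimal sketch of both on a toy structure; `pair_adjacency` and `inverse_distance` are hypothetical helper names, and unlike `get_structure_adj` the sketch does not split pairs by base type.

```python
import numpy as np

def pair_adjacency(structure):
    """Symmetric 0/1 matrix with a 1 wherever two positions form a bracket pair."""
    n = len(structure)
    adj = np.zeros((n, n))
    stack = []
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)        # remember the opening position
        elif ch == ")":
            i = stack.pop()        # most recent unmatched "(" pairs with this ")"
            adj[i, j] = adj[j, i] = 1
    return adj

def inverse_distance(n, power=1):
    """Matrix with entries (1 / (|i - j| + 1)) ** power."""
    idx = np.arange(n)
    return (1.0 / (np.abs(idx[:, None] - idx[None, :]) + 1)) ** power

pairs = pair_adjacency("((..))")
dist_linear = inverse_distance(3)
dist_square = inverse_distance(3, power=2)
```

In `"((..))"` position 0 pairs with 5 and position 1 pairs with 4, which gives four nonzero entries in the symmetric matrix.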

In [None]:
def get_structure_adj(train):
    Ss = []
    for i in tqdm(range(len(train))):
        seq_length = train["seq_length"].iloc[i]
        structure = train["structure"].iloc[i]
        sequence = train["sequence"].iloc[i]

        cue = []
        a_structures = {
            ("A", "U") : np.zeros([seq_length, seq_length]),
            ("C", "G") : np.zeros([seq_length, seq_length]),
            ("U", "G") : np.zeros([seq_length, seq_length]),
            ("U", "A") : np.zeros([seq_length, seq_length]),
            ("G", "C") : np.zeros([seq_length, seq_length]),
            ("G", "U") : np.zeros([seq_length, seq_length]),
        }
        a_structure = np.zeros([seq_length, seq_length])
        for j in range(seq_length):  # j, so the outer loop index i is not shadowed
            if structure[j] == "(":
                cue.append(j)
            elif structure[j] == ")":
                start = cue.pop()
#                 a_structure[start, j] = 1
#                 a_structure[j, start] = 1
                a_structures[(sequence[start], sequence[j])][start, j] = 1
                a_structures[(sequence[j], sequence[start])][j, start] = 1
        
        a_strc = np.stack([a for a in a_structures.values()], axis = 2)
        a_neighbor = np.diag(np.ones(seq_length-1),-1) + np.diag(np.ones(seq_length-1),1)
#         a_neighbor += np.diag(np.ones(seq_length))
        a_strc = np.concatenate([a_strc, a_neighbor[...,None]], axis = 2)
        a_strc = np.sum(a_strc, axis = 2, keepdims = True)
        Ss.append(a_strc)
    
    Ss = np.array(Ss)
    print(Ss.shape)
    return Ss

Ss = get_structure_adj(train)

In [None]:
def get_distance_matrix(As):
    idx = np.arange(As.shape[1])
    Ds = []
    for i in range(len(idx)):
        d = np.abs(idx[i] - idx)
        Ds.append(d)

    Ds = np.array(Ds) + 1
    Ds = 1/Ds
    Ds = Ds[None, :,:]
    Ds = np.repeat(Ds, len(As), axis = 0)
    
    Dss = []
    for i in [1, 2, 4]:
        Dss.append(Ds ** i)
    Ds = np.stack(Dss, axis = 3)
    ## overwrite the last channel (power 4) with the linear inverse distance weighted by bpps
    Ds[...,-1] = Ds[...,0]*As
    print(Ds.shape)
    return Ds

Ds = get_distance_matrix(As)

# Ds = Ds[...,1]
# Ds = Ds[...,None]

In [None]:
## concatenate the adjacency channels
# As = np.stack([As, As*(As>=BPPS_THRESHOLD)], axis=3)
# As = np.concatenate([As, Ss, Ds], axis = 3).astype(np.float32)
As = np.concatenate([As[:,:,:,None], Ss, Ds], axis = 3).astype(np.float32)
del Ss, Ds
print(As.shape)

## Model
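
The metric used throughout is MCRMSE: the RMSE is computed separately for each target column over all samples and scored positions, and the column-wise RMSEs are then averaged. Below is a minimal NumPy sketch on toy data; `mcrmse_np` and the toy arrays are assumptions for illustration, and it mirrors the `mcrmse` helper in the cell below only in its structure.

```python
import numpy as np

def mcrmse_np(t, p, scored_len):
    """Mean column-wise RMSE over the first scored_len positions.

    t, p: arrays of shape (n_samples, seq_len, n_targets).
    """
    t = t[:, :scored_len]
    p = p[:, :scored_len]
    per_column_mse = ((p - t) ** 2).mean(axis=(0, 1))  # one MSE per target column
    return np.sqrt(per_column_mse).mean()

# Toy targets: 2 samples, 4 positions, 3 target columns; the last position
# is padding filled with a placeholder, as the notebook does with `ignore`.
t_toy = np.zeros((2, 4, 3))
t_toy[:, 3:] = -10000
p_toy = np.zeros((2, 4, 3)) + np.array([1.0, 2.0, 3.0])
score = mcrmse_np(t_toy, p_toy, scored_len=3)
```

With per-column errors of 1, 2 and 3 the column RMSEs are 1, 2 and 3, so the score is 2.0. The padded position does not contribute because it is sliced off before the error is computed.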

In [None]:
import tensorflow as tf
from tensorflow.keras import layers as L
from tensorflow.keras import backend as K

def mcrmse(t, p, seq_len_target = TRUNCATED_LEN_y):
    ## calculate mcrmse score by using numpy
    t = t[:, :seq_len_target]
    p = p[:, :seq_len_target]
    
    score = np.mean(np.sqrt(np.mean(np.mean((p - t) ** 2, axis = 1), axis = 0)))
    return score

def mcrmse_loss(y_target, y_pred, seq_len_target = TRUNCATED_LEN_y):
    ## MCRMSE in tf, plus a 0.6-weighted term built the same way from absolute errors
    y_target = y_target[:, :seq_len_target]
    y_pred = y_pred[:, :seq_len_target]
    
    loss = tf.reduce_mean(tf.sqrt(tf.reduce_mean(tf.reduce_mean((y_target - y_pred) ** 2, 
                                                                axis = 1), 
                                                 axis = 0)))
    loss += 6e-1*tf.reduce_mean(tf.sqrt(tf.reduce_mean(tf.reduce_mean(tf.abs(y_target - y_pred), 
                                                                      axis = 1), axis = 0)))
    return loss

def attention(x_inner, x_outer, n_factor, dropout):
    x_Q =  L.Conv1D(n_factor, 1, activation='linear', 
                  kernel_initializer='glorot_uniform',
                  bias_initializer='glorot_uniform',
                 )(x_inner)
    x_K =  L.Conv1D(n_factor, 1, activation='linear', 
                  kernel_initializer='glorot_uniform',
                  bias_initializer='glorot_uniform',
                 )(x_outer)
    x_V =  L.Conv1D(n_factor, 1, activation='linear', 
                  kernel_initializer='glorot_uniform',
                  bias_initializer='glorot_uniform',
                 )(x_outer)
    x_KT = L.Permute((2, 1))(x_K)
    res = L.Lambda(lambda c: K.batch_dot(c[0], c[1]) / np.sqrt(n_factor))([x_Q, x_KT])
#     res = tf.expand_dims(res, axis = 3)
#     res = L.Conv2D(16, 3, 1, padding = "same", activation = "relu")(res)
#     res = L.Conv2D(1, 3, 1, padding = "same", activation = "relu")(res)
#     res = tf.squeeze(res, axis = 3)
    att = L.Lambda(lambda c: K.softmax(c, axis=-1))(res)
    att = L.Lambda(lambda c: K.batch_dot(c[0], c[1]))([att, x_V])
    return att

def multi_head_attention(x, y, n_factor, n_head, dropout):
    if n_head == 1:
        att = attention(x, y, n_factor, dropout)
    else:
        n_factor_head = n_factor // n_head
        heads = [attention(x, y, n_factor_head, dropout) for i in range(n_head)]
        att = L.Concatenate()(heads)
        att = L.Dense(n_factor, 
                      kernel_initializer='glorot_uniform',
                      bias_initializer='glorot_uniform',
                     )(att)
    x = L.Add()([x, att])
    x = L.LayerNormalization()(x)
    if dropout > 0:
        x = L.Dropout(dropout)(x)
    return x

def res(x, unit, kernel = 3, rate = 0.1):
    h = L.Conv1D(unit, kernel, 1, padding = "same", activation = None)(x)
    h = L.LayerNormalization()(h)
    h = L.LeakyReLU()(h)
    h = L.Dropout(rate)(h)
    return L.Add()([x, h])

def forward(x, unit, kernel = 3, rate = 0.1):
#     h = L.Dense(unit, None)(x)
    h = L.Conv1D(unit, kernel, 1, padding = "same", activation = None)(x)
    h = L.LayerNormalization()(h)
    h = L.Dropout(rate)(h)
#         h = tf.keras.activations.swish(h)
    h = L.LeakyReLU()(h)
    h = res(h, unit, kernel, rate)
    return h

def adj_attn(x, adj, unit, n = 2, rate = 0.1):
    x_a = x
    x_as = []
    for i in range(n):
        x_a = forward(x_a, unit)
        x_a = tf.matmul(adj, x_a)
        x_as.append(x_a)
    if n == 1:
        x_a = x_as[0]
    else:
        x_a = L.Concatenate()(x_as)
    x_a = forward(x_a, unit)
    return x_a


def get_base(config):
    node = tf.keras.Input(shape = (None, X_node.shape[2]), name = "node")
    adj = tf.keras.Input(shape = (None, None, As.shape[3]), name = "adj")
    
    adj_learned = L.Dense(2, "relu")(adj)
#     adj_all = L.Concatenate(axis = 3)([adj, adj_learned])
    adj_learned_1 = L.Conv2D(4, 3, padding="same")(adj)
#     adj_learned_2 = L.Conv2D(2, 15, activation='relu', padding="same")(adj)
    adj_all = L.Concatenate(axis = 3)([adj, adj_learned, adj_learned_1])
        
    xs = []
    xs.append(node)
    x1 = forward(node, 128, kernel = 3, rate = 0.0)
    x2 = forward(x1, 64, kernel = 6, rate = 0.0)
    x3 = forward(x2, 32, kernel = 15, rate = 0.0)
    x4 = forward(x3, 16, kernel = 30, rate = 0.0)
    x = L.Concatenate()([x1, x2, x3, x4])
    
    for unit in [64, 32]:
        x_as = []
        for i in range(adj_all.shape[3]):
            x_a = adj_attn(x, adj_all[:, :, :, i], unit, rate = 0.1)
            x_as.append(x_a)
        x_c = forward(x, unit, kernel = 30)
        
        x = L.Concatenate()(x_as + [x_c])
        x = forward(x, unit)
        x = multi_head_attention(x, x, unit, 4, 0.1)
        xs.append(x)
        
    x = L.Concatenate()(xs)

    model = tf.keras.Model(inputs = [node, adj], outputs = [x])
    return model


def get_ae_model(base, config):
    node = tf.keras.Input(shape = (None, X_node.shape[2]), name = "node")
    adj = tf.keras.Input(shape = (None, None, As.shape[3]), name = "adj")

    x = base([L.SpatialDropout1D(0.3)(node), adj])
    x = forward(x, 64, rate = 0.3)
    p = L.Dense(X_node.shape[2], "sigmoid")(x)
    
    loss = - tf.reduce_mean(20 * node * tf.math.log(p + 1e-4) + (1 - node) * tf.math.log(1 - p + 1e-4))
    model = tf.keras.Model(inputs = [node, adj], outputs = [loss])
    
    opt = tf.optimizers.Adam()
    model.compile(optimizer = opt, loss = lambda t, y : y)
    return model


def get_model(base, config):
    node = tf.keras.Input(shape = (None, X_node.shape[2]), name = "node")
    adj = tf.keras.Input(shape = (None, None, As.shape[3]), name = "adj")
    
    x = base([node, adj])
    x = forward(x, 128, rate = 0.4)
#     x = L.LSTM(32, return_sequences=True, dropout=0.0)(x)
    x = L.Dense(5, None)(x)

    model = tf.keras.Model(inputs = [node, adj], outputs = [x])
    
    opt = tf.optimizers.Adam()
    model.compile(optimizer = opt, loss = mcrmse_loss)
    return model

## Private LB simulation
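
The idea of the simulation is to hold out part of the training data at full length and train only on truncated sequences, so that the model is evaluated on sequences longer than any it was trained on. Here is a minimal sketch of that split on dummy arrays; `simulate_private_split` is a hypothetical helper, the array sizes are toy values, and it uses a local `RandomState` rather than the global seed set in the cells below.

```python
import numpy as np

def simulate_private_split(n_samples, n_train, seed):
    """Randomly pick n_train indices for training; the rest form the held-out set."""
    rng = np.random.RandomState(seed)
    idx_train = rng.choice(n_samples, n_train, replace=False)
    idx_private = np.setdiff1d(np.arange(n_samples), idx_train)
    return idx_train, idx_private

# Dummy node features: 10 samples, full length 107, 4 features.
X_toy = np.zeros((10, 107, 4))
idx_tr_toy, idx_pri_toy = simulate_private_split(len(X_toy), n_train=6, seed=42)

X_short = X_toy[idx_tr_toy, :78]   # training part, truncated along the sequence axis
X_full = X_toy[idx_pri_toy]        # held-out part, kept at full length
```

The same index arrays would be applied to the adjacency tensor (truncating both sequence axes) and to the targets.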

In [None]:
print('------Before-----')
print(X_node.shape)
print(As.shape)
print(y.shape)

In [None]:
np.random.seed(SEED_SPLIT)
idx_tr = np.random.choice(X_node.shape[0], SIZE_TRAIN, replace=False)
idx_all = np.arange(X_node.shape[0])
idx_pri = np.setdiff1d(idx_all,idx_tr)


X_node, X_node_pri = X_node[idx_tr, :TRUNCATED_LEN], X_node[idx_pri]
As, As_pri = As[idx_tr, :TRUNCATED_LEN, :TRUNCATED_LEN], As[idx_pri]
y, y_pri = y[idx_tr,:TRUNCATED_LEN], y[idx_pri]

In [None]:
set(idx_all) == set(list(idx_tr)+list(idx_pri))

In [None]:
print('------After-----\n')
print(X_node.shape)
print(As.shape)
print(y.shape)
print()
print(X_node_pri.shape)
print(As_pri.shape)
print(y_pri.shape)

## Pretrain the autoencoder
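
The autoencoder reconstructs the (mostly zero) one-hot node features from a dropped-out copy, using a cross-entropy in which the positive entries are up-weighted by 20. A NumPy rendering of that loss, as a sketch: the function name `weighted_reconstruction_loss` and the toy inputs are assumptions, while the weight 20 and the 1e-4 offset are taken from `get_ae_model`.

```python
import numpy as np

def weighted_reconstruction_loss(node, p, pos_weight=20.0, eps=1e-4):
    """Binary cross-entropy with the positive (node == 1) term scaled by pos_weight."""
    return -np.mean(pos_weight * node * np.log(p + eps)
                    + (1 - node) * np.log(1 - p + eps))

# A missed "1" (true 1 predicted as 0.1) versus a false alarm (true 0 predicted as 0.9).
missed_one = weighted_reconstruction_loss(np.array([1.0]), np.array([0.1]))
false_alarm = weighted_reconstruction_loss(np.array([0.0]), np.array([0.9]))
```

Both mistakes are equally far from the truth, but the missed one costs 20 times more, which counteracts the sparsity of the one-hot features.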

In [None]:
config = {} ## not used for now
base = get_base(config)
ae_model = get_ae_model(base, config)
ae_model.summary()
del ae_model, base
gc.collect();

In [None]:
config = {}

if ae_epochs > 0:
    base = get_base(config)
    ae_model = get_ae_model(base, config)
    ## TODO : simultaneous train
    for i in range(ae_epochs//ae_epochs_each):
        print(f"------ {i} ------")
        print("--- train ---")
        ae_model.fit([X_node, As], [X_node[:,0]],
                  epochs = ae_epochs_each,
                  batch_size = ae_batch_size)
        print("--- private simulated ---")
        ae_model.fit([X_node_pri, As_pri], [X_node_pri[:,0]],
                  epochs = 2*ae_epochs_each,
                  batch_size = ae_batch_size)
        gc.collect()
    print("****** save ae model ******")
    base.save_weights("./base_ae")

## Train
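
Training follows the usual out-of-fold (OOF) pattern: each sample is predicted exactly once, by the model that did not see it during training. The sketch below shows the bookkeeping with a stand-in "model" that predicts the training mean. `kfold_indices` is a hypothetical NumPy-only helper, not scikit-learn's `KFold`, and its shuffling will not reproduce the folds used in the cell below.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, valid_idx) pairs from a shuffled split into n_splits folds."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for k in range(n_splits):
        valid_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, valid_idx

y_toy = np.arange(10, dtype=float)
oof = np.full_like(y_toy, np.nan)
for train_idx, valid_idx in kfold_indices(len(y_toy), n_splits=5, seed=802):
    # Stand-in for fit/predict: predict the mean of the training targets.
    oof[valid_idx] = y_toy[train_idx].mean()
```

After the loop every entry of `oof` has been filled, so it can be scored against the targets as a single cross-validated prediction.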

In [None]:
from sklearn.model_selection import KFold
kfold = KFold(5, shuffle = True, random_state = SEED_VAL)

scores = []
preds = np.zeros([len(X_node), X_node.shape[1], 5])

for i, (tr_idx, va_idx) in enumerate(kfold.split(X_node, As)):
    print(f"\n ------ fold {i} start -----\n ")
    print(f"Train on {len(tr_idx)}")
    print(f"Validate on {len(va_idx)}")
    X_node_tr = X_node[tr_idx]
    X_node_va = X_node[va_idx]
    As_tr = As[tr_idx]
    As_va = As[va_idx]
    y_tr = y[tr_idx]
    y_va = y[va_idx]
    
    base = get_base(config)
    if ae_epochs > 0:
        print("****** load ae model ******")
        base.load_weights("./base_ae")
        print("****** ae model loaded ******")
    model = get_model(base, config)
    if pretrain_dir is not None:
        d = f"./model{i}"
        print(f"--- load from {d} ---")
        model.load_weights(d)
    for epochs, batch_size in zip(epochs_list, batch_size_list):
        print(f"epochs : {epochs}, batch_size : {batch_size}")
        model.fit([X_node_tr, As_tr], [y_tr],
                  validation_data=([X_node_va, As_va], [y_va]),
                  epochs = epochs,
                  verbose=verbose,
                  batch_size = batch_size, 
                  validation_freq = 3)
        
    model.save_weights(f"./model{i}")
    p = model.predict([X_node_va, As_va])
    scores.append(mcrmse(y_va, p))
    print(f"fold {i}: mcrmse {scores[-1]}")
    preds[va_idx] = p
        
pd.to_pickle(preds, "oof.pkl")

In [None]:
model.summary()

In [None]:
print(scores,'\n')
print('CV score is:', np.mean(scores))

## Predict the simulated private set

In [None]:
print(mcrmse(y_va, p))
print(mcrmse(y_va, p, seq_len_target = TRUNCATED_LEN_y))
print(mcrmse(y_va, p, seq_len_target = seq_len_target))

In [None]:
p_pri_simu = 0

for i in range(5):
    model.load_weights(f"./model{i}")
    p_pri_simu += model.predict([X_node_pri, As_pri]) / 5

In [None]:
score_pri_simu = mcrmse(y_pri, p_pri_simu, seq_len_target = seq_len_target)
print(f'Simulated private score is: {score_pri_simu:.7f}')

In [None]:
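## NOTE: positions past seq_len_target in y_pri hold the `ignore` placeholder (-10000),
## so this full-length number is dominated by the padding rather than by model error.
score_pri_comp = mcrmse(y_pri, p_pri_simu, seq_len_target = y_pri.shape[1])
print(f'Full sequence score for comparison is: {score_pri_comp:.7f}')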