## Brief comments

Just copying the description I wrote in the discussion topic. It's more readable there: [link](https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/209711)


Thanks to fellow Kagglers for all the discussion and support. I learned a lot in the competition and tried a lot of things. Here is a short list of the things that I believe worked well. I'll try to cover the whole model, and may update this later to clarify and add explanations.


**Architecture**
My architecture was just 3 Encoder blocks stacked together, using d_model = 512 and 4 Encoder layers in each block.

My Questions, Interactions and Response sequences were all padded on the left by a unique vector made using historical stats from the user. This ensured that all 3 sequences were aligned on the sequence number.

1. Questions Only Block (Q-Block):
    This one was just self attention over the questions.

1. Interactions Only Block (I-Block):
    This one was just self attention over the interactions.

1. Questions, Responses, Interaction Block (QRI Block):
    I used the output from the Q-Block as queries, the I-Block as keys, and the Responses (or, after the first layer, the QRI Block's own output) as values. The residual connection after the attention module was made using the value vector (instead of the default query vector). The self-attention mask was used in the first 3 layers; for the last layer I used a custom mask that only attended to prior interactions, and that layer didn't have the residual connection.

I concatenated the outputs of the Q-Block and the QRI-Block and fed them into 3 Linear layers.
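
To make the residual choice in the QRI Block concrete, here is a stripped-down single-head attention in plain Python. It is only an illustration: it leaves out the learned projections, multiple heads, dropout, layer normalization and feed-forward parts of the real block, and the function names `softmax` and `attend` are made up for this sketch. The `residual` argument mirrors the three cases described above: add the query (the usual Transformer default), add the value (first 3 QRI layers), or add nothing (last QRI layer).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values, mask, residual="q"):
    """Single-head scaled dot-product attention with a selectable residual.

    mask[i][j] is True when position i may attend to position j.
    residual: "q" adds the query, "v" adds the value, None adds nothing.
    """
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if mask[i][j] else -1e9
            for j, k in enumerate(keys)
        ]
        weights = softmax(scores)
        context = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        if residual == "q":
            context = [c + x for c, x in zip(context, q)]
        elif residual == "v":
            context = [c + x for c, x in zip(context, values[i])]
        outputs.append(context)
    return outputs

# Toy example: each position may only look at itself.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
self_only = [[True, False], [False, True]]

out_v = attend(queries, keys, values, self_only, residual="v")
out_none = attend(queries, keys, values, self_only, residual=None)
```

With the value residual, the output at each position stays anchored to the response stream rather than to the question stream, which is the difference from the default encoder layer.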

**Data Usage**

I split the longer sequences into windows of length 256 with an overlap of 128 (0-256, 128-384, 256-512).
I saved them as TFRecords, and during training I took a uniform random window of 128 from each 256-length sequence. This ensures that any event is placed at any of the positions 0-127 in my training sequence with equal probability (excluding the last 128 events in the last window of a user). For users with shorter sequences, I padded with 0 to make the length equal to 128 and used those 128 each time.
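
A rough sketch of this windowing in plain Python follows. It is not the code that produced the TFRecords; the helper names `split_into_windows` and `random_crop` are hypothetical, and padding on the right is an assumption (the text above only says the short sequences were padded with 0).

```python
import random

def split_into_windows(seq, window=256, stride=128):
    """Split a sequence into overlapping windows: 0-256, 128-384, 256-512, ..."""
    if len(seq) <= window:
        return [list(seq)]
    windows = []
    start = 0
    while True:
        windows.append(list(seq[start:start + window]))
        if start + window >= len(seq):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows

def random_crop(window_seq, crop=128, pad_value=0, rng=random):
    """Return a uniformly random crop of length `crop`; pad short inputs."""
    if len(window_seq) <= crop:
        # Assumption: pad on the right with `pad_value`.
        return list(window_seq) + [pad_value] * (crop - len(window_seq))
    start = rng.randint(0, len(window_seq) - crop)  # inclusive on both ends
    return window_seq[start:start + crop]

events = list(range(600))           # stand-in for one user's event ids
windows = split_into_windows(events)
crops = [random_crop(w) for w in windows]
```

Because a fresh crop is drawn every time a record is read, the model sees a different 128-event slice of the same stored window in each epoch.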

**Compute Resources**

I initially only used Kaggle GPUs to train the model. I started setting up a TPU pipeline in the last 3 weeks of the competition, and in the end I was able to use the TPUs as well. I think I exhausted my entire TPU quota for the last 3 weeks.

I'd like to thank @yihdarshieh (TPU Guru) for this TPU notebook. It helped a lot in setting up the TPU pipeline.

**Sequential Encodings**

I used 2 encodings that captured the sequence of events. I subtracted the first value in each sequence from the rest to ensure that all encodings started with a 0.

**Temporal Encoding**

For this I converted the timestamp into minutes and used the same scheme as positional encoding, with a power of one year in minutes (60 x 24 x 365). I believe this enabled the model to know how far apart in time two questions/interactions were. I believe this to be a better implementation of the lag time variable used in SAINT+, as it captures the difference in time between all of the events simultaneously rather than just between 2 adjacent events.
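
As a small illustration, the encoding can be written in plain Python as below. This sketch follows the same sin/cos layout as the `Temporal_Embedding` layer defined later in this notebook (all sines first, then all cosines); the function name `sinusoidal_encoding`, the tiny dimension of 8 and the example timestamps are made up for readability.

```python
import math

def sinusoidal_encoding(value, dims, power):
    """Encode a scalar (time or position) as [sin..., cos...] features."""
    half = dims // 2
    factors = [1.0 / (power ** (2 * i / dims)) for i in range(half)]
    angles = [value * f for f in factors]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

MINUTES_PER_YEAR = 60 * 24 * 365

# Hypothetical event times in minutes; subtract the first so the sequence starts at 0.
timestamps_min = [5000.0, 5002.0, 6440.0]
relative = [t - timestamps_min[0] for t in timestamps_min]

encodings = [
    sinusoidal_encoding(t, dims=8, power=MINUTES_PER_YEAR) for t in relative
]
```

The low-index features oscillate quickly and separate events a few minutes apart, while the high-index features change slowly and only distinguish events that are weeks or months apart.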

**Positional Encoding**

For this I used the task_container_id as the position and a power of 10,000.

**Proxy for Knowledge**

Instead of just using 0/1 from responses, I added a new feature that is a heuristic for knowledge in case the response was incorrect. If a user selects option 2 when 1 is the correct one, I calculate (total number of times 2 was selected for that question)/(total number of times the question has been answered incorrectly). This ratio was maintained and updated during inference as well. I checked the correlation between the mean of proxy_knowledge for incorrect answers and overall user accuracy; it was around 0.4.
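
A minimal sketch of how such a ratio could be tracked is shown below. The class name `ProxyKnowledge` and the fallback of 0.0 for questions with no recorded wrong answers are assumptions made for this example; the value used for correct responses is not covered here.

```python
from collections import defaultdict

class ProxyKnowledge:
    """Tracks, per question, how often each wrong option was chosen."""

    def __init__(self):
        self.wrong_option_counts = defaultdict(lambda: defaultdict(int))
        self.wrong_totals = defaultdict(int)

    def update(self, question_id, chosen, correct):
        """Record one response; only incorrect responses change the counts."""
        if chosen != correct:
            self.wrong_option_counts[question_id][chosen] += 1
            self.wrong_totals[question_id] += 1

    def ratio(self, question_id, chosen):
        """Share of this question's incorrect answers that picked `chosen`."""
        total = self.wrong_totals[question_id]
        if total == 0:
            return 0.0  # assumed fallback when no wrong answers were seen yet
        return self.wrong_option_counts[question_id][chosen] / total

tracker = ProxyKnowledge()
# Question 7 has correct option 1; three users picked 2, one picked 3, one was right.
for chosen in [2, 2, 3, 2, 1]:
    tracker.update(question_id=7, chosen=chosen, correct=1)

popular_mistake = tracker.ratio(7, 2)
rare_mistake = tracker.ratio(7, 3)
```

A high ratio means the user fell for the most common distractor, while a low ratio means they picked an option that few others chose.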

**Question difficulty**

Simple feature that is calculated as (total number of times the question has been answered correctly)/(total number of times the question has been answered). This ratio was maintained and updated during inference as well.
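
The running ratio can be sketched as follows. The class name `QuestionDifficulty`, the default of 0.5 for unseen questions, and reading the feature before updating the counts are assumptions of this example, not details taken from the write-up.

```python
class QuestionDifficulty:
    """Running share of correct answers per question."""

    def __init__(self, default=0.5):
        self.correct = {}
        self.total = {}
        self.default = default  # assumed fallback for questions never seen

    def ratio(self, question_id):
        total = self.total.get(question_id, 0)
        if total == 0:
            return self.default
        return self.correct.get(question_id, 0) / total

    def update(self, question_id, answered_correctly):
        self.total[question_id] = self.total.get(question_id, 0) + 1
        self.correct[question_id] = (
            self.correct.get(question_id, 0) + int(answered_correctly)
        )

tracker = QuestionDifficulty()
features = []
for question_id, answered_correctly in [(10, 1), (10, 0), (10, 1), (11, 0)]:
    features.append(tracker.ratio(question_id))  # feature uses past data only
    tracker.update(question_id, answered_correctly)
```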

**Custom Masks**

I used a self-attention mask that didn't attend to events of the same bundle.
For the last layer of the QRI block I removed the mask entries along the diagonal to ensure it only attends to prior values.
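
The two masks can be illustrated with nested lists. This sketch mirrors the logic of the `Custom_Mask` layer defined later in this notebook, including the extra first slot for the starting vector; here `True` means "may attend", and the function name `bundle_masks` is made up for the example.

```python
def bundle_masks(task_container_ids):
    """Build (event_mask, response_mask) for one sequence.

    Index 0 is the starting (history) vector; events occupy indices 1..n.
    """
    n = len(task_container_ids)
    # An event may look at earlier events that belong to a different bundle.
    prior = [
        [
            (j <= i) and (task_container_ids[i] != task_container_ids[j])
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Prepend the history slot: every row can see column 0.
    response_mask = [[True] + [False] * n] + [[True] + row for row in prior]
    # The event mask additionally lets each position attend to itself.
    event_mask = [
        [cell or (i == j) for j, cell in enumerate(row)]
        for i, row in enumerate(response_mask)
    ]
    return event_mask, response_mask

# Events 2 and 3 (ids 1, 1) share a bundle.
event_mask, response_mask = bundle_masks([0, 1, 1, 2])
```

In the example, the third event (row 3) can see the starting vector and the first event, but not the second event, because both were answered in the same bundle.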

**Starting Vector**

For the starting vector I used counts of the times each tag was seen/answered correctly in questions and lectures. I just used a dense layer to encode this into the first vector of the sequence.

**Other Details**

* Used the Noam learning-rate schedule provided on the Transformers page of the TF documentation.
* Batch size 1024 (Trained on TPUs - 1 Epoch took 4-6 mins)
* For Validation I just separated around 3.4% of users initially and used their sequences.
* Model converged around 20-30 Epochs. (4 Hours training time at max)
* Sequence length of 128 was used.
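
For reference, the Noam schedule can be written as a plain Python function. The defaults below (d_model = 512, 4000 warmup steps) match the `CustomSchedule` class defined later in this notebook; clamping the step to at least 1 is an assumption added here to avoid dividing by zero.

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)  # assumption: guard against step == 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

schedule = {step: noam_lr(step) for step in (1, 2000, 4000, 8000, 16000)}
```

The learning rate rises linearly until `warmup_steps`, peaks there at `(d_model * warmup_steps) ** -0.5`, and then decays with the inverse square root of the step.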

**Question Embeddings**

I added up the embeddings for content id, part id, (a weighted average of) tag ids and type_of (from lectures), then concatenated the result with both Sequential encodings and fed it into a dense layer with d_model dimensions.

**Response Embeddings**

I concatenated answered correctly, proxy knowledge, question difficulty, time elapsed and question had explanation, and fed them into a dense layer with d_model dimensions.

**Interaction Embeddings**

I just took the first d_model//2 units from both Questions/Response embeddings and concatenated them for this.
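
As a tiny illustration with plain lists (the function name `interaction_input` and the toy d_model of 8 are made up for this sketch):

```python
def interaction_input(question_emb, response_emb):
    """Concatenate the first half of each embedding into one vector."""
    half = len(question_emb) // 2
    return question_emb[:half] + response_emb[:half]

d_model = 8
question_emb = [float(i) for i in range(d_model)]        # 0.0 .. 7.0
response_emb = [float(10 + i) for i in range(d_model)]   # 10.0 .. 17.0

interaction = interaction_input(question_emb, response_emb)
```

In the model code below, the concatenated vector is then passed through a `Dense(d_model)` layer before it enters the I-Block.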

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
max_sequence_length = 128

d_model = 512
embedding_dims = 512
mini_d = embedding_dims//4
num_layers = 4
num_heads = 8

temporal_encoder_power = 60*24*360  # minutes in a 360-day year (the write-up above quotes 60 x 24 x 365)
task_encoder_power     = 10000

In [None]:
for ii in {"max_sequence_length":max_sequence_length,
"d_model":d_model,
"embedding_dims":embedding_dims,
"mini_d":mini_d,
"num_layers":num_layers,
"num_heads":num_heads,
"temporal_encoder_power":temporal_encoder_power,
"task_encoder_power":task_encoder_power}.items():
    print(ii)

In [None]:
len_validation = 31306
len_train      = 904649

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.keras.utils import Sequence
import random
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score
from collections import defaultdict 
from itertools import chain

%load_ext tensorboard

import joblib
import seaborn as sns
import matplotlib.pyplot as plt
import gc
import sys
from tqdm.notebook import  tqdm
from IPython.display import SVG

In [None]:
print(tf.__version__)
print(tfa.__version__)

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection. No parameters necessary if TPU_NAME environment variable is set. On Kaggle this is always the case.
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy() # default distribution strategy in Tensorflow. Works on CPU and single GPU.

tf.config.set_soft_device_placement(True)
REPLICAS =  strategy.num_replicas_in_sync
print("REPLICAS: ", REPLICAS)

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE

In [None]:
feature_description_parse = {
    "e_timestamp" : tf.io.RaggedFeature(value_key="e_timestamp", dtype=tf.int64),
    "e_content_id" : tf.io.RaggedFeature(value_key="e_content_id", dtype=tf.int64),
    "e_content_type_id" : tf.io.RaggedFeature(value_key="e_content_type_id", dtype=tf.int64),
    "r_question_elapsed_time" : tf.io.RaggedFeature(value_key="r_question_elapsed_time", dtype=tf.int64),
    "r_question_had_explanation" : tf.io.RaggedFeature(value_key="r_question_had_explanation", dtype=tf.int64),
    "e_task_container_id" : tf.io.RaggedFeature(value_key="e_task_container_id", dtype=tf.int64),
    "e_part" : tf.io.RaggedFeature(value_key="e_part", dtype=tf.int64),
    "e_tags" : tf.io.RaggedFeature(value_key="e_tags", dtype=tf.int64),
    "e_type_of" : tf.io.RaggedFeature(value_key="e_type_of", dtype=tf.int64),
    "r_answer" : tf.io.RaggedFeature(value_key="r_answer", dtype=tf.int64),
    "r_q_difficulty" : tf.io.RaggedFeature(value_key="r_q_difficulty", dtype=tf.float32),
    "r_proxy_knowledge" : tf.io.RaggedFeature(value_key="r_proxy_knowledge", dtype=tf.float32),
    "h_tags_counts" : tf.io.FixedLenFeature([1,188], dtype=tf.int64),
    "h_tags_correct" :  tf.io.FixedLenFeature([1,188], dtype=tf.int64),
    "h_lecture_counts" : tf.io.FixedLenFeature([1,188], dtype=tf.int64),
    "h_timestamp_sums" : tf.io.FixedLenFeature([1,1], dtype=tf.float32),
    "h_parts_counts" : tf.io.FixedLenFeature([1,1], dtype=tf.int64),
    "user": tf.io.FixedLenFeature([], dtype=tf.int64),
    "len_seq" : tf.io.FixedLenFeature([], dtype=tf.int64),
    "mask": tf.io.RaggedFeature(value_key="mask", dtype=tf.int64)
}

@tf.function
def _parse_function_ragged(example_proto):
    _parsed = tf.io.parse_single_example(example_proto, feature_description_parse)
    
    parsed = {}
    len_seq = _parsed['len_seq']
    max_len = tf.constant(128, dtype=tf.dtypes.int64)

    if len_seq > max_len:
        start = tf.random.uniform(shape=[], minval=0, maxval=len_seq-max_len, dtype=tf.dtypes.int64)
    else:
        start = tf.constant(0, dtype=tf.dtypes.int64)
    
    len_seq = tf.reduce_sum(tf.ones_like(_parsed['r_answer'], dtype=tf.int64))
    
    for values in feature_description_parse:
        if values in ['len_seq','user',"h_tags_counts","h_tags_correct","h_lecture_counts","h_timestamp_sums","h_parts_counts"]:
             parsed[values] = _parsed[values]
        else:   
            if values=='e_tags':
                parsed[values] = tf.reshape(_parsed[values][6*start:6*(start+max_len)],shape=(max_len,tf.constant(6, dtype=tf.dtypes.int64)))
            else:
                parsed[values] = tf.reshape(_parsed[values][start:start+max_len],shape=(max_len,))
    
    return parsed,parsed['r_answer'][...,tf.newaxis],parsed['mask']

def read_process_tfrecords(filenames):
    raw_dataset = tf.data.TFRecordDataset(filenames)
    parsed_dataset = raw_dataset.map(_parse_function_ragged)
    return parsed_dataset

In [None]:
#Set the GCP path
from kaggle_datasets import KaggleDatasets
GCS_DS_PATH = KaggleDatasets().get_gcs_path('riiid-random-sequences') 

local_path_data = '../input/riiid-random-sequences'

In [None]:
train_dataset = read_process_tfrecords([GCS_DS_PATH+"/"+ii for ii in os.listdir(local_path_data) if 'train' in ii])
validation_dataset = read_process_tfrecords([GCS_DS_PATH+"/"+'validation_data.tfrecord'])

## Model Utils

In [None]:
class Custom_Mask(tf.keras.layers.Layer):
    def __init__(self, histories = 1, max_sequence_length = max_sequence_length, **kwargs):
        super(Custom_Mask, self).__init__(**kwargs)
        
        self.histories = histories
        self.max_sequence_length = max_sequence_length
        len_mask = self.max_sequence_length+self.histories
        
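        # band_part(..., -1, 0) keeps the lower triangle (despite the name `triu`):
        # each position can see itself and all earlier positions.
        self.triu = (tf.keras.backend.cast(tf.linalg.band_part(tf.keras.backend.ones((self.max_sequence_length, self.max_sequence_length)), -1, 0),tf.bool)[tf.newaxis,...])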
        self.ones = tf.eye(len_mask, num_columns=None, batch_shape=None, dtype=tf.dtypes.bool)[tf.newaxis,...]
        self.hist_mask = tf.keras.backend.cast(tf.keras.backend.ones(shape=(1,self.histories,self.histories)),tf.bool)
    
    def get_config(self):
        config = super(Custom_Mask, self).get_config()
        config.update({"histories": self.histories,
                      "max_sequence_length":self.max_sequence_length})
        return config
    
    
    
    def call(self,inputs):
        batch_size = tf.math.reduce_sum(tf.ones_like(inputs[:, :1], dtype=tf.int32))
        len_seq = tf.math.reduce_sum(tf.ones_like(inputs[:1, :], dtype=tf.int32))
        x = tf.broadcast_to(inputs[:, :, tf.newaxis], shape=[batch_size, len_seq, len_seq])
        y = tf.broadcast_to(inputs[:, tf.newaxis, :], shape=[batch_size, len_seq, len_seq])
        mask_bundles = tf.not_equal(x,y)
        
        bundles_pad = tf.logical_and(mask_bundles,self.triu[:,:len_seq,:len_seq])
        
        bundles_pad = tf.pad(bundles_pad, [[0,0],[1,0],[0,0]], mode='CONSTANT', constant_values=False)
        bundles_pad = tf.pad(bundles_pad, [[0,0],[0,0],[1,0]], mode='CONSTANT', constant_values=True)
        
        response_mask = bundles_pad
        
        event_mask = tf.logical_or(response_mask,self.ones[:,:len_seq+1,:len_seq+1])
        mask_bundles = tf.logical_not(mask_bundles)

        return event_mask,response_mask,mask_bundles

    def compute_output_shape(self, input_shape):
        s = input_shape
        h = self.histories
        if s[-1] is not None:
            with_hist = s[0],s[1]+h,s[1]+h
            without_hist = s[0],s[1],s[1]
        else:
            with_hist = None,None,None
            without_hist = with_hist
        return (with_hist),(with_hist),(without_hist)

In [None]:
class Temporal_Embedding(tf.keras.layers.Layer):

    def __init__(self, temporal_vector_dimensions,temporal_encoder_power, **kwargs):
        
        super(Temporal_Embedding, self).__init__(**kwargs)

        
        self.temporal_vector_dimensions = temporal_vector_dimensions
        self.temporal_encoder_power = temporal_encoder_power
        
        i_vector = tf.keras.backend.arange(self.temporal_vector_dimensions//2)

        self.factor = tf.cast(1 / (self.temporal_encoder_power ** (2*i_vector/self.temporal_vector_dimensions)), dtype=tf.float32)
        
        
    def get_config(self):
        config = super(Temporal_Embedding, self).get_config()
        config.update({"temporal_encoder_power": self.temporal_encoder_power,
                      "temporal_vector_dimensions":self.temporal_vector_dimensions})
        return config
        
     
    def get_sin(self, x):
        return tf.keras.backend.sin(x)
    
    def get_cos(self, x):
        return tf.keras.backend.cos(x)
    
    def rescale(self, x):
        return tf.cast(x, dtype=tf.float32) * self.factor
    

    def call(self, x):
        pos_vector = tf.keras.layers.Lambda(self.rescale)(x)
        pos_vector_sin = tf.keras.layers.Lambda(self.get_sin)(pos_vector)
        pos_vector_cos = tf.keras.layers.Lambda(self.get_cos)(pos_vector)
        pos_vector = tf.keras.layers.concatenate([pos_vector_sin,
                pos_vector_cos], axis=-1)
        return pos_vector


class Masked_Sum(tf.keras.layers.Layer):
    def __init__(self,**kwargs):
        super(Masked_Sum, self).__init__(**kwargs)
        
    def get_config(self):
        config = super(Masked_Sum, self).get_config()
        return config
    
    def call(self, x):
        x, mask = x
        mask = tf.cast(mask, x.dtype)[...,tf.newaxis]
        x = tf.keras.layers.Multiply()([x,mask])
        x = tf.keras.backend.sum(x,axis=-2)
        return x
    
    
    
class Elem_Divide(tf.keras.layers.Layer):
    def __init__(self,**kwargs):
        super(Elem_Divide, self).__init__(**kwargs)
        self.e =  tf.constant([1e-9],dtype=tf.float32)
        
    def get_config(self):
        config = super(Elem_Divide, self).get_config()
        return config
    
    def call(self,x):
        x,y = x
        x = tf.cast(x, tf.float32)
        y = tf.math.add(tf.cast(y, tf.float32),self.e)
        return tf.divide(x,y)

In [None]:
class weighted_embs(tf.keras.layers.Layer):
    def __init__(self,**kwargs):
        super(weighted_embs, self).__init__(**kwargs)
        self.dense_layer = tf.keras.layers.Dense(1,use_bias=False)
        self.permute_layer = tf.keras.layers.Permute((1,3,2))
        
    def get_config(self):
        config = super(weighted_embs, self).get_config()
        return config
    
    def call(self, inputs):
        inputs, mask = inputs
        ones_like = tf.ones_like(inputs)
        ones_like = ones_like*tf.cast(mask,tf.float32)[...,tf.newaxis]
        ones_like = self.permute_layer(ones_like)
        ones_like = self.dense_layer(ones_like)
        inputs = self.permute_layer(inputs)
        inputs = self.dense_layer(inputs)
        out = tf.keras.backend.squeeze(inputs/(ones_like+1e-6),axis=-1)
        return out

In [None]:
class Hindsight(tf.keras.layers.Layer):

    def __init__(self,d_model,num_heads,dff,rate=0.1,add_on = 'q',**kwargs):
        super(Hindsight, self).__init__(**kwargs)


        self.d_model = d_model
        self.num_heads = num_heads
        self.dff = dff
        self.rate = rate
        self.add_on = add_on
        self.mha = tfa.layers.MultiHeadAttention(head_size = self.d_model//self.num_heads, num_heads=self.num_heads)
        self.dense1 = tf.keras.layers.Dense(self.dff, activation='relu')  # (batch_size, seq_len, dff)
        self.dense2 = tf.keras.layers.Dense(self.d_model)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(self.rate)
        self.dropout2 = tf.keras.layers.Dropout(self.rate)
        self.dropout3 = tf.keras.layers.Dropout(self.rate)

        
        
    def get_config(self):
        config = super(Hindsight, self).get_config()
        config.update({"d_model": self.d_model,
                      "num_heads":self.num_heads,
                       "dff" : self.dff,
                      "rate":self.rate,
                     'add_on':self.add_on})
        return config
    
    
    def call(self, x):
        q, v, k, mask = x
        
        attn_output = self.mha([q, k, v], mask=mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output)
        
        if self.add_on == 'q':
            out1 = self.layernorm1(q + attn_output)  # (batch_size, input_seq_len, d_model)
            
        elif self.add_on == 'v':
            out1 = self.layernorm1(v + attn_output)  # (batch_size, input_seq_len, d_model)
            
        else:
            out1 = self.layernorm1(attn_output)  # (batch_size, input_seq_len, d_model)
            
        dense1 = self.dense1(out1)
        dense_1_output = self.dropout2(dense1)

        dense2 = self.dense2(dense_1_output)  # (batch_size, input_seq_len, d_model)
        dense_2_output = self.dropout3(dense2)

        out2 = self.layernorm2(out1 + dense_2_output)  # (batch_size, input_seq_len, d_model)

        return out2

In [None]:
class drop_timestamp(tf.keras.layers.Layer):
    def __init__(self, rate,**kwargs):
        super(drop_timestamp, self).__init__(**kwargs)
        self.rate =  rate
        self.dropout = tf.keras.layers.Dropout(self.rate,noise_shape=(None,None,1))
        
    def get_config(self):
        config = super(drop_timestamp, self).get_config()
        config.update({"rate": self.rate})
        return config
    
    
    def call(self, inputs):
        ones = tf.keras.layers.Lambda(lambda x:tf.ones_like(x))(inputs)
        ones = self.dropout(ones)
        mask = tf.keras.layers.Lambda(lambda x:tf.cast(tf.math.not_equal(x, 0),tf.float32))(ones)
        output = tf.keras.layers.Lambda(lambda x:tf.multiply(x[0],x[1]))([inputs,mask])
        return output

In [None]:
tag_embedding_dimension = 189

content_id_embedding_dimension = 13942 # len(DM.encoder_content_id.classes_)+1

part_id_embedding_dimension = 8 #DM.contents_df['part'].max()+1
type_of_embedding_dimension = 5 #DM.contents_df['type_of'].max()+1

## Main Model

In [None]:
def Generate_Model(num_layers=num_layers,
                   d_model=d_model,
                   num_heads=num_heads,
                   dff=d_model*2,
                   rate=0.1,
                   embedding_dimensions = embedding_dims,
                   seq_len = max_sequence_length):
    
    mini_d = embedding_dimensions//8
    
    e_timestamp = tf.keras.layers.Input(shape=(seq_len,1),name='e_timestamp')
    e_content_id = tf.keras.layers.Input(shape=(seq_len,),name='e_content_id')
    
    e_content_type_id  = tf.keras.layers.Input(shape=(seq_len,),name='e_content_type_id')    
    
    e_task_container_id = tf.keras.layers.Input(shape=(seq_len,),name='e_task_container_id')
    e_part = tf.keras.layers.Input(shape=(seq_len,),name='e_part')
    e_tags = tf.keras.layers.Input(shape=(seq_len,6),name='e_tags')
    e_type_of = tf.keras.layers.Input(shape=(seq_len,),name='e_type_of')
    
    r_answer = tf.keras.layers.Input(shape=(seq_len,1),name='r_answer')
    r_question_had_explanation = tf.keras.layers.Input(shape=(seq_len,1),name='r_question_had_explanation')
    r_question_elapsed_time = tf.keras.layers.Input(shape=(seq_len,1),name='r_question_elapsed_time')
    
    r_q_difficulty = tf.keras.layers.Input(shape=(seq_len,1),name='r_q_difficulty')
    r_proxy_knowledge = tf.keras.layers.Input(shape=(seq_len,1),name='r_proxy_knowledge')
    
    h_tags_counts = tf.keras.layers.Input(shape=(1,188),name='h_tags_counts')
    h_tags_correct = tf.keras.layers.Input(shape=(1,188),name='h_tags_correct')
    h_lecture_counts = tf.keras.layers.Input(shape=(1,188),name='h_lecture_counts')
    h_timestamp_sums = tf.keras.layers.Input(shape=(1,1),name='h_timestamp_sums')
    h_parts_counts = tf.keras.layers.Input(shape=(1,1),name='h_parts_counts')    
    
    temporal_vector = tf.keras.layers.Lambda(lambda x:x-x[:,:1,:])(e_timestamp)
    temporal_vector = tf.keras.layers.Lambda(lambda x:x/600)(temporal_vector)
    temporal_vector = Temporal_Embedding(embedding_dimensions,temporal_encoder_power)(temporal_vector)
    
    
    task_id_encoding_layer = Temporal_Embedding(embedding_dimensions,task_encoder_power)
    task_id_encoding = tf.keras.layers.Lambda(lambda x:tf.keras.backend.expand_dims(x,axis=-1))(e_task_container_id)
    task_id_encoding = tf.keras.layers.Lambda(lambda x:x-x[:,:1,:])(task_id_encoding)
    task_id_encoding = task_id_encoding_layer(task_id_encoding)
    
    
    tags_embedding_layer = tf.keras.layers.Embedding(tag_embedding_dimension,embedding_dimensions,mask_zero=True)
    event_tags_embedding = tags_embedding_layer(e_tags)

    event_tags_embedding =  weighted_embs()([event_tags_embedding,event_tags_embedding._keras_mask])
    
    event_content_id_embeddings = tf.keras.layers.Embedding(content_id_embedding_dimension,embedding_dimensions)(e_content_id)

    event_content_id_embeddings = drop_timestamp(rate=rate)(event_content_id_embeddings)
    
    
    event_content_type_id_embeddings = tf.keras.layers.Lambda(lambda x:tf.keras.backend.expand_dims(x,axis=-1))(e_content_type_id)
    event_content_type_id_embeddings = tf.keras.layers.Dense(embedding_dimensions)(event_content_type_id_embeddings)
    
    
    part_id_embeddings_layer = tf.keras.layers.Embedding(part_id_embedding_dimension,embedding_dimensions)
    event_part_id_embeddings = part_id_embeddings_layer(e_part)
    
    event_type_of_embeddings = tf.keras.layers.Embedding(type_of_embedding_dimension,embedding_dimensions)(e_type_of)
    
    response_answer = tf.keras.layers.Dense(embedding_dimensions)(r_answer)
    
    response_hint = tf.keras.layers.Dense(embedding_dimensions)(r_question_had_explanation)
    
    response_time = tf.keras.layers.Dense(mini_d,activation='relu')(r_question_elapsed_time)
    response_time = tf.keras.layers.Dropout(rate)(response_time)
    response_time = tf.keras.layers.Dense(embedding_dimensions)(response_time)
    
    question_difficulty = tf.keras.layers.Dense(mini_d,activation='relu')(r_q_difficulty)
    question_difficulty = tf.keras.layers.Dropout(rate)(question_difficulty)
    question_difficulty = tf.keras.layers.Dense(embedding_dimensions)(question_difficulty)
    
    proxy_knowledge = tf.keras.layers.Dense(mini_d,activation='relu')(r_proxy_knowledge)
    proxy_knowledge = tf.keras.layers.Dropout(rate)(proxy_knowledge)
    proxy_knowledge = tf.keras.layers.Dense(embedding_dimensions)(proxy_knowledge)
    
    event_content_id_embeddings = tf.keras.layers.Add()([event_content_id_embeddings,
                                                         event_tags_embedding,
                                                         event_part_id_embeddings,
                                                         event_type_of_embeddings,
                                                         question_difficulty,])
    
    
    event_stream = tf.keras.layers.Concatenate(axis=-1)([event_content_id_embeddings,
                                                         temporal_vector,
                                                         task_id_encoding])
    
    event_stream = tf.keras.layers.Dense(d_model)(event_stream)
    
    
    response_stream = tf.keras.layers.Add()([response_answer,
                                             response_hint,
                                             response_time,
                                             proxy_knowledge,
                                             event_content_type_id_embeddings])    

    histories_timestamp = tf.keras.layers.Lambda(lambda x:x/600)(h_timestamp_sums)
    histories_timestamp = Elem_Divide()([histories_timestamp,h_parts_counts])
    histories_temporal = Temporal_Embedding(embedding_dimensions,temporal_encoder_power)(histories_timestamp)
    
    
    histories_stream = tf.keras.layers.Concatenate(axis=-1)([h_tags_counts,
                                                               h_lecture_counts])
    
    histories_stream = tf.keras.layers.GaussianNoise(stddev=0.5)(histories_stream)

    histories_stream = tf.keras.layers.Concatenate(axis=-1)([histories_temporal,
                                                               histories_stream])
    

    histories_stream = tf.keras.layers.Dropout(rate)(histories_stream)
    histories_stream = tf.keras.layers.Dense(embedding_dimensions,activation='relu')(histories_stream)
    
    histories_stream = tf.keras.layers.Dropout(rate)(histories_stream)    
    histories_stream = tf.keras.layers.Dense(d_model)(histories_stream)
    
    histories_response_stream = tf.keras.layers.Concatenate(axis=-2)([h_tags_correct,
                                                                      h_tags_counts])
    
    histories_response_stream = tf.keras.layers.GaussianNoise(stddev=0.5)(histories_response_stream)
    histories_response_stream = tf.keras.layers.Permute((2,1))(histories_response_stream)
    
    histories_response_stream = tf.keras.layers.Dense(mini_d,activation='relu')(histories_response_stream)
    histories_response_stream = tf.keras.layers.Dropout(rate)(histories_response_stream)
    
    histories_response_stream = tf.keras.layers.Dense(mini_d,activation='relu')(histories_response_stream)
    histories_response_stream = tf.keras.layers.Dropout(rate)(histories_response_stream)
    
    histories_response_stream = tf.keras.layers.Dense(1,activation='relu')(histories_response_stream)
    histories_response_stream = tf.keras.layers.Permute((2,1))(histories_response_stream)
    
    histories_response_stream = tf.keras.layers.Dropout(rate)(histories_response_stream)
    histories_response_stream = tf.keras.layers.Dense(embedding_dimensions,activation='relu')(histories_response_stream)
    
    histories_response_stream = tf.keras.layers.Dropout(rate)(histories_response_stream)
    histories_response_stream = tf.keras.layers.Dense(d_model)(histories_response_stream)
    

    queries_final = tf.keras.layers.Concatenate(axis=1,name='events_with_hist')([histories_stream,event_stream])
    keys_final_0 = tf.keras.layers.Lambda(lambda x:x[...,:d_model//2])(queries_final)
    queries_final = tf.keras.layers.LayerNormalization(epsilon=1e-6)(queries_final)
    queries_final = tf.keras.layers.Dropout(rate)(queries_final)
    
    values_final = tf.keras.layers.Concatenate(axis=1,name='responses_with_hist')([histories_response_stream,response_stream])
    keys_final_1 = tf.keras.layers.Lambda(lambda x:x[...,:d_model//2])(values_final)
    values_final = tf.keras.layers.LayerNormalization(epsilon=1e-6)(values_final) 
    values_final = tf.keras.layers.Dropout(rate)(values_final)

    
    keys_final = tf.keras.layers.Concatenate(axis=-1)([keys_final_0,
                                                       keys_final_1])

    keys_final = tf.keras.layers.Dense(d_model)(keys_final)
    keys_final = tf.keras.layers.LayerNormalization(epsilon=1e-6)(keys_final)
    keys_final = tf.keras.layers.Dropout(rate)(keys_final)

    mask = Custom_Mask(histories = 1, max_sequence_length = max_sequence_length)(e_task_container_id)

    for ii in range(num_layers):
        keys_final = Hindsight(d_model, num_heads, dff, rate=rate, add_on ='q',name=f'KKK_K_{ii+1}')([keys_final, keys_final, keys_final, mask[0]])
   
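    # NOTE: keys_final is reassigned from queries_final here, so the output of the
    # I-Block layers (KKK_K_*) built in the loop above is not used downstream,
    # although the write-up describes the I-Block output as the keys.
    keys_final = drop_timestamp(rate=rate)(queries_final)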
    
    for ii in range(num_layers):
        queries_final = Hindsight(d_model, num_heads, dff, rate=rate, add_on ='q',name=f'QQQ_Q_{ii+1}')([queries_final, queries_final, queries_final, mask[0]])

    for ii in range(num_layers):        
        if ii<num_layers-1:
            values_final = Hindsight(d_model, num_heads, dff, rate=rate, add_on='v', name=f'QVK_V_{ii+1}')([queries_final, values_final, keys_final, mask[0]])
        else:
            values_final = Hindsight(d_model, num_heads, dff, rate=rate, add_on = None, name=f'QVK_0_{ii+1}')([queries_final, values_final, keys_final, mask[1]])   
    
    values_final = tf.keras.layers.Concatenate(axis=-1)([queries_final,values_final])
    cropped = tf.keras.layers.Cropping1D(cropping=(1, 0))(values_final) 
                    
    final_output = tf.keras.layers.Dropout(rate)(cropped)
    final_output = tf.keras.layers.Dense(d_model,activation='relu')(final_output)
    
    final_output = tf.keras.layers.Dropout(rate)(final_output)
    final_output = tf.keras.layers.Dense(d_model,activation='relu')(final_output)
    
    final_output = tf.keras.layers.Dropout(rate)(final_output)
    final_output = tf.keras.layers.Dense(d_model,activation='relu')(final_output)
    
    final_output = tf.keras.layers.Dropout(rate)(final_output)        
    answer_correctly = tf.keras.layers.Dense(1,activation=tf.keras.activations.sigmoid,name='answer_correctly')(final_output)
    
    
    sample_transformer = tf.keras.Model(inputs =[e_timestamp,
                                                 e_content_id,
                                                 e_content_type_id,
                                                 r_question_elapsed_time,
                                                 r_question_had_explanation,
                                                 e_task_container_id,
                                                 e_part,
                                                 e_tags,
                                                 e_type_of,
                                                 r_answer,
                                                 r_q_difficulty,
                                                 r_proxy_knowledge, 
                                                 h_tags_counts,
                                                 h_tags_correct,
                                                 h_lecture_counts,
                                                 h_timestamp_sums,
                                                 h_parts_counts],
                                        outputs = answer_correctly)
    return sample_transformer

In [None]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        self.d_model = d_model
        self.d_model_casted = tf.cast(self.d_model, tf.float32)

        self.warmup_steps = warmup_steps

    def get_config(self):
        config = {}
        config.update({"d_model": self.d_model,
                       "warmup_steps": self.warmup_steps})
        return config

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (tf.cast(self.warmup_steps, tf.float32) ** -1.5)

        return tf.math.rsqrt(self.d_model_casted) * tf.math.minimum(arg1, arg2)

In [None]:
with strategy.scope():

    learning_rate = CustomSchedule(d_model)

    optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.999, 
                                 epsilon=1e-8)
    sample_transformer = Generate_Model()
    sample_transformer.compile(optimizer= optimizer,
                           loss=tf.keras.losses.binary_crossentropy,
                           weighted_metrics = [tf.keras.metrics.AUC()])
        
sample_transformer.summary()

In [None]:
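# In case anyone wants to try training, use the following code.
# batch_size is not defined in this notebook; the write-up above used 1024.
#
# batch_size = 1024
#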
# train_dataset_batched = (train_dataset.repeat()
#                                        .shuffle(batch_size*32)
#                                        .batch(batch_size,drop_remainder=True)
#                                        .prefetch(AUTOTUNE))
#
# validation_dataset_batched = validation_dataset.cache().batch(batch_size,drop_remainder=True)
#
#
# hist = sample_transformer.fit(x=train_dataset_batched,
#                              validation_data = validation_dataset_batched,
#                              verbose=1,
#                              initial_epoch=0,
#                              epochs=60,
#                              steps_per_epoch=len_train//(batch_size))