**<font size="6">Welcome</font>**
<hr></hr>

<font size="5">This notebook is a detailed guide to implementing, training, and running inference with the SAINT model. I hope it will be useful for participants who failed to get a decent ROC score with it.</font>


<font size="3">Using this implementation, with different hyperparameters and a longer training time, I achieved a 0.804 ROC score on the private leaderboard. Ensembling my model with LGBM brought me to rank 39. I used Tito's validation split both to validate the model's score and to perform inference.</font>


In [None]:
import sys
import numpy as np

import os
import glob
import time
from os import listdir
from typing import Dict
import datatable as dt

import sklearn.preprocessing as preprocessing
from sklearn.metrics import roc_auc_score

import pandas as pd

from sys import getsizeof

#Suppress warnings
import warnings
warnings.filterwarnings("ignore")

import gc
import math

<font size="3">For the sake of demonstration, I chose smaller hyperparameters and less training data here, but you can easily change them in the configuration cell below. My final model used the following hyperparameters:</font>

    
* **<font size="3">d_model</font>**: 128
    
* **<font size="3">batch_size</font>**: 512
    
* **<font size="3">seq_len</font>**: 100
    
* **<font size="3">train with whole data</font>**
    
* **<font size="3">Decoder and encoder layers</font>**: 4
    
* **<font size="3">Dropout</font>**: 0.1
    

<br></br>
<font size="3"> If I had more computational resources, I would have run a hyperparameter search, but in my experience d_model, sequence length, and dropout matter most for model quality. Sequence length is the number of previous question-answer pairs the model uses for its prediction; longer is usually better, but it takes more time to train.</font>

<br></br>
<font size="3"> The start token shifts the past answers one step forward so that the answer to the current question does not leak into the decoder input. For example, if the question-answer input pair is ([53,23,15],[0,0,1]), adding the answer start token turns it into ([53,23,15],[2,0,0]).</font>
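
To make the shift concrete, here is a minimal sketch in plain Python. The helper name `shift_with_start_token` is mine and does not appear in the notebook; it only illustrates the idea of prepending the start token and dropping the last answer, so that position *i* of the decoder input holds the answer to question *i-1*.

```python
def shift_with_start_token(answers, start_token):
    """Prepend the start token and drop the last element (illustrative helper)."""
    return [start_token] + list(answers[:-1])

questions = [53, 23, 15]
answers = [0, 0, 1]

decoder_input = shift_with_start_token(answers, start_token=2)
print(decoder_input)  # [2, 0, 0]
```

The questions are left untouched; only the answer side is shifted.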

In [None]:
batch_size = 512
seq_len = 100

#If True, the model would be trained on +70 Million rows, 20M otherwise
train_full = False

#Answer start token
correct_start_token = 2
user_answer_start_token = 4


#Transformer hyperparameter 
d_model = 64

decoder_layers = 2
encoder_layers = 2

dropout = 0.1 
ff_model = d_model*4

att_heads = d_model // 32


#Load the questions and the part each question belongs to
que_data = pd.read_csv( "../input/riiid-test-answer-prediction/questions.csv")
part_valus = que_data.part.values

unique_ques = len(que_data)

In [None]:
#Load the training / validation split created by Tito. The validation data occurs after the training data.
train_data = pd.read_pickle("../input/riiid-cross-validation-files/cv1_train.pickle")
validation = pd.read_pickle("../input/riiid-cross-validation-files/cv1_valid.pickle")


#Remove unneeded columns, and drop lectures from the training data
del train_data["prior_question_had_explanation"]
del train_data["prior_question_elapsed_time"]
del train_data["max_time_stamp"]
del train_data["rand_time_stamp"]
del train_data["viretual_time_stamp"]

del validation["prior_question_had_explanation"]
del validation["prior_question_elapsed_time"]
del validation["max_time_stamp"]
del validation["rand_time_stamp"]
del validation["viretual_time_stamp"]

train_data = train_data[train_data.content_type_id == False]
del train_data["content_type_id"]
del train_data["row_id"]

<font size="3">This line creates a dictionary containing the last timestamp of every user in the dataset: {user_id: timestamp, .....}. It is going to be useful during inference.</font>

In [None]:
last_timestamp = train_data.groupby("user_id")[["timestamp","user_id"]].tail(1).set_index("user_id", drop=True)["timestamp"].to_dict()
train_data.reset_index(drop=True, inplace=True) #Resetting the index frees memory

<font size="3">Compute the time difference, in seconds, between consecutive interactions of every user (hence the division by 1000).</font>

In [None]:
train_data["timestamp"] = train_data.groupby("user_id")["timestamp"].diff().fillna(0)/1000
train_data["timestamp"] = train_data.timestamp.astype("int32")

**<font size="3">Encoding the timestamp</font>**<font size="3">: Here I encode the time difference into buckets for a categorical embedding, instead of using a continuous embedding. There are 70 possible values: lags below 60 seconds keep their value in seconds (0 to 59), and longer lags fall into 10 buckets whose boundaries run from 60 seconds up to 604800 seconds (a week). This feature really made the difference for me in reaching a ROC of 0.780.</font>
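
As a rough illustration of the bucketing, the sketch below maps a single lag (in whole, non-negative seconds) to its category using the standard library's `bisect`. The function name `encode_lag` is an assumption of mine; the boundaries are the ones used in this notebook, and the mapping is meant to mirror the pandas loop in the next cell rather than replace it.

```python
from bisect import bisect_right

boundaries = [120, 600, 1800, 3600, 10800, 43200, 86400, 259200, 604800]

def encode_lag(seconds):
    """Map a lag in whole seconds to one of 70 categories (0..69)."""
    if seconds < 60:
        return seconds                      # 0..59: one category per second
    # bisect_right counts the boundaries that are <= seconds
    return 60 + bisect_right(boundaries, seconds)

for lag in (5, 60, 119, 120, 604799, 604800):
    print(lag, encode_lag(lag))
```

Lags in [60, 120) land in category 60, and anything of a week or more lands in category 69.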

In [None]:
boundaries = [120,600,1800,3600,10800,43200,86400,259200,604800]
x = train_data.timestamp.copy()

for i, boundary in enumerate(boundaries):
    
    if i == 0:
        start = 60
    else:
        start = boundaries[i-1]
        
    end = boundary
    
    train_data.loc[(x >= start) & (x < end), "timestamp"] = i+60
    
train_data.loc[x >= end, "timestamp"] = i+60+1

del x
train_data["timestamp"] =train_data["timestamp"].astype("int8")
gc.collect()

<font size="3">Every row of the `group` Series contains a tuple with the user's past questions, the correctness of their answers, the encoded timestamp differences, and their answers. It is used to create the dataloader.</font>

In [None]:
group = train_data[['user_id', 'content_id', 'answered_correctly', 'timestamp',"user_answer"]].groupby('user_id').apply(lambda r: (
            r['content_id'].values,
            r['answered_correctly'].values, r['timestamp'].values,r['user_answer'].values))

<font size="6">Creating the training data, and local validation</font>

<font size="3"> **Validation data:** A random 10% of the users with a history of at least 100 interactions, truncated to 100 interactions (the code below keeps the first `seq_len` interactions of each history). Note that this validation scheme is a little biased, which is why we use Tito's validation later.</font>

In [None]:
#Creating the validation data
user_counts = group.apply(lambda x: len(x[0])).sort_values(ascending=False)
user_counts = user_counts[(user_counts >= seq_len)]

accepted_ids = user_counts.index
val_group = group.loc[accepted_ids]


def f(x):
    return (x[0][:seq_len], x[1][:seq_len], x[2][:seq_len], x[3][:seq_len])

val_group = val_group.apply(f).sample(frac=0.1)
group = group.drop(index=val_group.index)

**<font size="3">Training data:</font>** <font size="3">If `train_full` is set to True, all users with a history of at least 100 interactions are selected for training; otherwise only users with between 100 and 1000 interactions are kept. Every training sample has a sequence length of exactly 100, so I didn't use padding. </font>

<font size="3">For a user with 932 interactions, for example, I take 9 sequences of 100 interactions and drop the last 32. Each of these 9 sequences is independent, meaning the model bases its prediction solely on the interactions inside the same 100-step sequence. It is not the best approach, but it worked well given how large the dataset is. </font>
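
The chunking can be sketched in a few lines of plain Python. `split_into_chunks` is a hypothetical helper written for this illustration; the notebook itself does the same thing with `np.split` in the cells below.

```python
def split_into_chunks(history, seq_len=100):
    """Cut a history into full, non-overlapping chunks; the remainder is dropped."""
    n_chunks = len(history) // seq_len
    return [history[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]

history = list(range(932))          # stand-in for one user's 932 interactions
chunks = split_into_chunks(history)
print(len(chunks), len(chunks[0]), chunks[-1][-1])  # 9 100 899
```

Interactions 900 to 931 never reach the model, which is the price of avoiding padding.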

In [None]:
#Create sequences of 100 interactions; unless train_full is set, only users with at most 1000 interactions are used
user_counts = group.apply(lambda x: len(x[0])).sort_values(ascending=False)

if train_full:  #Train with subset of data or not
    user_counts = user_counts[(user_counts >= seq_len)]
else:
    user_counts = user_counts[((user_counts >= seq_len) & (user_counts <= 1000))]

accepted_ids = user_counts.index
group = group.loc[accepted_ids]

group.index = group.index.astype("str")

In [None]:
from tqdm import tqdm

auxiliary = []
k = 0
    
for line in tqdm(group):
    
    src, trg, ts, user_answer = line
    chunk_len = seq_len
    i = 0
    
    split_size = src.shape[0] - src.shape[0]%chunk_len
    n_splits = split_size // chunk_len  #Integer number of full chunks
    
    lst = list(zip(np.split(src[:split_size],n_splits), 
                   np.split(trg[:split_size], n_splits), 
                   np.split(ts[:split_size], n_splits), 
                   np.split(user_answer[:split_size], n_splits),
                  ))
    
    auxiliary.extend(lst) 

In [None]:
auxiliary = pd.Series(auxiliary)

In [None]:
import torch
import torch.nn as nn
import torch.nn.utils.rnn as rnn_utils
from torch.autograd import Variable

import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.nn import TransformerDecoder, TransformerDecoderLayer

In [None]:
class TextLoader(torch.utils.data.Dataset):
    def __init__(self, data):
        self.x, self.y, self.ts, self.user_answer = [], [], [], []
        
        for line in data:
            x, y, ts, user_answer = line
            
            self.x.append(x)
            self.y.append(y)      
            self.ts.append(ts)
            self.user_answer.append(user_answer)

    def __getitem__(self, index):
        return (torch.LongTensor(self.x[index]), 
                torch.LongTensor(self.y[index]), 
                torch.LongTensor(self.ts[index]),
               torch.LongTensor(self.user_answer[index]))

    def __len__(self):
        return len(self.x)

In [None]:
class TextCollate():
    
    def __call__(self, batch):
        
        x_padded = torch.LongTensor(seq_len, len(batch))
        y_padded = torch.LongTensor(seq_len, len(batch))        
        ts_padded = torch.LongTensor(seq_len, len(batch))     
        user_answer_padded = torch.LongTensor(seq_len, len(batch))

        for i in range(len(batch)):
            
            x = batch[i][0]
            x_padded[:x.size(0), i] = x
            
            y = batch[i][1]
            y_padded[:y.size(0), i] = y
            
            ts = batch[i][2]
            ts_padded[:y.size(0),i] = ts
            
            user_answer = batch[i][3]
            user_answer_padded[:y.size(0),i] = user_answer
            

        return x_padded, y_padded, ts_padded, user_answer_padded

In [None]:
pin_memory = True
num_workers = 2

trainset = TextLoader(auxiliary)
valset = TextLoader(val_group)

collate_fn = TextCollate()

train_loader = torch.utils.data.DataLoader(trainset, num_workers=num_workers, shuffle=True,
                          batch_size=batch_size, pin_memory=pin_memory,
                          drop_last=True, collate_fn=collate_fn)  #(seq_len, batch_size)

val_loader = torch.utils.data.DataLoader(valset, num_workers=num_workers, shuffle=False,
                        batch_size=batch_size, pin_memory=pin_memory,
                        drop_last=False, collate_fn=collate_fn)



<font size="6">Creating the model</font>

<font size="3">This model largely follows the SAINT+ paper. I used PyTorch's `nn.Transformer` for the implementation, and copied some snippets of code (the Noam optimizer and the positional encoding) from </font> [this](https://nlp.seas.harvard.edu/2018/04/03/attention.html)<font size="3"> glorious notebook. I think the model's implementation is pretty clear, but tell me if any further explanation is needed.</font>
    
<font size="3">It consists of an encoder and a decoder. The encoder takes the lag time, the past questions, and the question part embeddings as input, while the decoder takes the correctness of the user's past answers and the past answers themselves. </font>
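
The model code further down applies the same upper-triangular mask to the encoder, the decoder, and the encoder-decoder attention, so position *i* never attends to later interactions. As a small sketch of what that mask looks like, here is a plain-Python version; `causal_mask` is an illustrative name, and the notebook builds the real mask with `torch.triu`.

```python
def causal_mask(n_rows, n_cols=None):
    """Additive attention mask: 0 where attention is allowed, -inf where it is blocked."""
    n_cols = n_rows if n_cols is None else n_cols
    return [[float("-inf") if col > row else 0.0 for col in range(n_cols)]
            for row in range(n_rows)]

for row in causal_mask(3):
    print(row)
```

Row *i* has zeros up to and including column *i* and `-inf` afterwards, so the softmax gives zero weight to future positions.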
    
<font size="3">Usually the model converges in 60 epochs or so, but training it for longer can get you to 0.800 ROC. </font>
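
For intuition about the Noam schedule, the sketch below evaluates the learning rate at a few steps, using the values this notebook passes to `NoamOpt` (`model_size=64`, `factor=1`, `warmup=4000`). The standalone function `noam_rate` is only an illustration of the formula, not a replacement for the optimizer wrapper.

```python
def noam_rate(step, model_size=64, factor=1.0, warmup=4000):
    """Learning rate at a given optimizer step (step >= 1)."""
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, f"{noam_rate(step):.6f}")
```

The rate grows linearly during the warmup steps, peaks at `step == warmup`, and then decays with the inverse square root of the step.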

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
part_valus = torch.LongTensor(part_valus).to(device)

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.scale = nn.Parameter(torch.ones(1))

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(
            0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.scale * self.pe[:x.size(0), :]
        return self.dropout(x)

In [None]:
class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0
        
    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()
        
    def rate(self, step = None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
            min(step ** (-0.5), step * self.warmup ** (-1.5)))

In [None]:
class TransformerModel(nn.Module):
    
    def __init__(self, intoken, hidden, part_arr, enc_layers, dec_layers, dropout, nheads, ff_model, ts_unique=70):
        super(TransformerModel, self).__init__()
        
        self.encoder = nn.Embedding(intoken, hidden)
        self.pos_encoder = PositionalEncoding(hidden, dropout)

        self.decoder = nn.Embedding(3, hidden)  #0: Incorrect, 1: Correct, 2: Start token
        self.pos_decoder = PositionalEncoding(hidden, dropout)
        
        
        self.transformer = nn.Transformer(d_model=hidden, nhead=nheads, num_encoder_layers=enc_layers, num_decoder_layers=dec_layers, dim_feedforward=ff_model, dropout=dropout, activation='relu')
        self.fc_out = nn.Linear(hidden, 1)

        self.src_mask = None
        self.trg_mask = None
        self.memory_mask = None
      
        self.part_embedding = nn.Embedding(7,hidden)
        self.part_arr = part_arr
        
        self.ts_embedding = nn.Embedding(ts_unique, hidden)        
        self.user_answer_embedding = nn.Embedding(5, hidden)

        
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)
        self.dropout_4 = nn.Dropout(dropout)
        self.dropout_5 = nn.Dropout(dropout)
        self.dropout_6 = nn.Dropout(dropout)

        
    def generate_square_subsequent_mask(self, sz, sz1=None):
        
        if sz1 is None:
            mask = torch.triu(torch.ones(sz, sz), 1)
        else:
            mask = torch.triu(torch.ones(sz, sz1), 1)
            
        return mask.masked_fill(mask==1, float('-inf'))


    def forward(self, src, trg, ts, user_answer):

        if self.trg_mask is None or self.trg_mask.size(0) != len(trg):
            self.trg_mask = self.generate_square_subsequent_mask(len(trg)).to(trg.device)
            
        if self.src_mask is None or self.src_mask.size(0) != len(src):
            self.src_mask = self.generate_square_subsequent_mask(len(src)).to(trg.device)
            
        if self.memory_mask is None or self.memory_mask.size(0) != len(trg) or self.memory_mask.size(1) != len(src):
            self.memory_mask = self.generate_square_subsequent_mask(len(trg),len(src)).to(trg.device)
            

            
        #Get the part, timestamp (lag time) and user answer embeddings
        part_emb = self.dropout_1(self.part_embedding(self.part_arr[src]-1))
        ts_emb = self.dropout_3(self.ts_embedding(ts))
        user_answer_emb = self.dropout_4(self.user_answer_embedding(user_answer))        
        
        
        #Add embeddings Encoder
        src = self.dropout_5(self.encoder(src))  #Embedding
        src = torch.add(src, part_emb)
        src = torch.add(src, ts_emb)   #Lag-time bucket embedding
        src = self.pos_encoder(src)   #Pos embedding
        
        
        #Add embedding decoder
        trg = self.dropout_6(self.decoder(trg))
        trg = torch.add(trg, user_answer_emb)
        trg = self.pos_decoder(trg)

        output = self.transformer(src, trg, src_mask=self.src_mask, tgt_mask=self.trg_mask, memory_mask=self.memory_mask)
        

        output = self.fc_out(output)

        return output

In [None]:
que_emb_size = unique_ques

model = TransformerModel(que_emb_size, hidden=d_model,part_arr=part_valus, dec_layers=decoder_layers, enc_layers=encoder_layers, dropout=dropout, nheads=att_heads, ff_model=ff_model).to(device)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
print(model)

In [None]:
import torch.optim as optim


optimizer = NoamOpt(d_model, 1, 4000 ,optim.Adam(model.parameters(), lr=0))

#Since the objective of training is a binary classification
criterion = nn.BCEWithLogitsLoss()

In [None]:
#Prepend the start token to the decoder input
def add_shift(var, pad):
    
    #Build the start-token row with the same dtype and device as the input
    var_pad = torch.full((1, var.shape[1]), pad, dtype=var.dtype, device=var.device)
    
    return torch.cat((var_pad, var))

In [None]:
def train(model, optimizer, criterion, iterator):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src, trg, ts, user_answer = batch
        src, trg, ts, user_answer = src.to(device), trg.to(device), ts.to(device), user_answer.to(device)

        
        trg = add_shift(trg, correct_start_token)
        user_answer = add_shift(user_answer, user_answer_start_token)        
        
        optimizer.optimizer.zero_grad()
        output = model(src, trg[:-1,:], ts, user_answer[:-1,:])
        
        loss = criterion(output.squeeze(), trg[1:,:].float())

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
from sklearn.metrics import roc_auc_score

def evaluate(model, criterion, iterator):

    model.eval()

    epoch_loss = 0
    acc = 0
    
    preds = []
    corr = []

    with torch.no_grad():    
        for i, batch in enumerate(iterator):


            src, trg, ts, user_answer = batch
            src, trg, ts, user_answer = src.to(device), trg.to(device), ts.to(device), user_answer.to(device)


            trg = add_shift(trg, correct_start_token)
            user_answer = add_shift(user_answer, user_answer_start_token)        

            output = model(src, trg[:-1,:], ts, user_answer[:-1,:]).squeeze(-1)  #(seq_len, batch)
            target = trg[1:,:].float()
            loss = criterion(output, target)
            
            probs = torch.sigmoid(output)  #F.sigmoid is deprecated
            preds.extend(probs.reshape(-1).cpu().numpy().tolist())
            corr.extend(trg[1:,:].reshape(-1).cpu().numpy().tolist())
            
            #Fraction of positions where the rounded probability matches the label
            accuracy = (probs.round() == target).float().mean()
            
            
            epoch_loss += loss.item()
            
            acc += accuracy.item()
            
    
            
    return (epoch_loss / len(iterator), acc/len(iterator), roc_auc_score(np.array(corr),np.array(preds)))

In [None]:
%%time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

N_EPOCHS = 60

best_roc = 0

for epoch in range(N_EPOCHS):
    print(f'Epoch: {epoch+1:02} ({best_roc})')

    start_time = time.time()

    train_loss = train(model, optimizer, criterion, train_loader)
    valid_loss, acc, roc = evaluate(model, criterion, val_loader)

    epoch_mins, epoch_secs = epoch_time(start_time, time.time())

    if roc > best_roc:
        best_roc = roc
        torch.save(model.state_dict(), 'model_best.torch')

    print(f'Time: {epoch_mins}m {epoch_secs}s')
    print(f'Train Loss: {train_loss:.3f}')
    print(f'Val   Loss: {valid_loss:.3f}    Acc {acc:.3f}   ROC {roc:.3f}')
    
print(best_roc)

<font size="6">Inference</font>

<font size="3">To simulate a real submission, I used the </font> [<font size="3">timeseries API emulator</font>](https://www.kaggle.com/its7171/time-series-api-iter-test-emulator), <font size="3">again created by Tito. I used the same strategy to validate my model, and usually the ROC score I get here matches the leaderboard consistently.</font>

    
<font size="3">I re-create the group sequences of every user, so that I have access to their history.</font>


In [None]:
group = train_data[['user_id', 'content_id', 'answered_correctly', 'timestamp',"user_answer"]].groupby('user_id').apply(lambda r: (
            r['content_id'].values,
            r['answered_correctly'].values, r['timestamp'].values,r['user_answer'].values))

<font size="3"> The `pred_users` function takes a numpy array of shape (batch, 4) and returns the SAINT model's predicted probability that each user answers correctly. This function:

1. Loops through every row of the input array.
2. Fetches that user's history and caps it to the last `seq_len - 1` interactions.
3. Formats these sequences into PyTorch tensors and feeds them to the model.
4. Retrieves the prediction from the output.
</font>
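
The steps above can be sketched for a single user in plain Python. `build_inputs` is a hypothetical helper that only covers the question and answer sequences (the real function also handles timestamps and user answers, and works on a whole batch); it assumes `seq_len > 1`.

```python
def build_inputs(que_history, ans_history, question_id, seq_len=100, start_token=2):
    """Return (encoder questions, decoder answers, index of the prediction to read)."""
    cap = seq_len - 1
    que_history = list(que_history)[-cap:]
    ans_history = list(ans_history)[-cap:]

    questions = que_history + [question_id]     # encoder side: past questions + current one
    answers = [start_token] + ans_history       # decoder side: shifted by the start token

    pad = seq_len - len(questions)              # right-pad with zeros, as the notebook does
    return questions + [0] * pad, answers + [0] * pad, len(questions) - 1

q, a, idx = build_inputs([53, 23], [0, 1], question_id=15, seq_len=5)
print(q, a, idx)  # [53, 23, 15, 0, 0] [2, 0, 1, 0, 0] 2
```

The prediction for the current question is read at index `idx`, the last non-padded position.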

In [None]:
def pred_users(vals): #Input must be (eval_batch, 4): ["user_id", "content_id", "content_type_id", "timestamp"]

    eval_batch = vals.shape[0]

    #np.long was removed from recent NumPy versions; np.int64 is the explicit equivalent
    tensor_question = np.zeros((eval_batch, seq_len), dtype=np.int64)
    tensor_answers = np.zeros((eval_batch, seq_len), dtype=np.int64)
    tensor_ts = np.zeros((eval_batch, seq_len), dtype=np.int64)
    tensor_user_answer = np.zeros((eval_batch, seq_len), dtype=np.int64)


    val_len = []
    preds = []
    group_index = group.index

    for i, line in enumerate(vals):

        if line[2] == True:
            val_len.append(0)
            continue

        user_id = line[0]
        question_id = line[1]
        timestamp = get_timestamp(line[3], user_id) #Compute timestamp difference correctly
        

        que_history = np.array([], dtype=np.int32)
        answers_history = np.array([], dtype=np.int32)  
        ts_history = np.array([], dtype=np.int32)  
        user_answer_history = np.array([], dtype=np.int32)  

        if user_id in group_index:

            cap = seq_len-1
            que_history, answers_history, ts_history, user_answer_history = group[user_id]

            que_history = que_history[-cap:]
            answers_history = answers_history[-cap:]
            ts_history = ts_history[-cap:]
            user_answer_history = user_answer_history[-cap:]


        #Decoder data, add start token
        answers_history = np.concatenate(([correct_start_token],answers_history))
        user_answer_history = np.concatenate(([user_answer_start_token],user_answer_history))

        #Encoder data
        que_history = np.concatenate((que_history, [question_id]))  #Add current question
        ts_history = np.concatenate((ts_history, [timestamp]))  

        tensor_question[i][:len(que_history)] = que_history
        tensor_answers[i][:len(que_history)] = answers_history
        tensor_ts[i][:len(que_history)] = ts_history
        tensor_user_answer[i][:len(que_history)] = user_answer_history

        val_len.append(len(que_history))

    tensor_question = torch.from_numpy(tensor_question).long().T.to(device)
    tensor_answers = torch.from_numpy(tensor_answers).long().T.to(device)
    tensor_ts = torch.from_numpy(tensor_ts).long().T.to(device)
    tensor_user_answer = torch.from_numpy(tensor_user_answer).long().T.to(device)
    
    with torch.no_grad():   #Disable gradients so prediction runs faster
        out = torch.sigmoid(model(tensor_question, tensor_answers, tensor_ts, tensor_user_answer)).squeeze(dim=-1).T


    for j in range(len(val_len)):
        preds.append(out[j][val_len[j]-1].item())

    return preds

<font size="3"> The `update_group_var` function simply updates the `group` Series and the `last_timestamp` dictionary to keep track of what the user has learnt.</font>

In [None]:
def update_group_var(vals):
    
    global group
    
    for i, line in enumerate(vals):
        
        user_id = line[0]
        question_id = line[1]
        
        content_type_id = line[2]
        ts = get_timestamp(line[3], user_id)
        
        correct = line[4]
        user_answer = line[5]
        
        
        if content_type_id == True:
            continue

        if last_timestamp.get(user_id, -1) == -1:
            last_timestamp[user_id] = 0
        else:
            last_timestamp[user_id] = line[3]
            
        if user_id in group.index:
            questions= np.append(group[user_id][0],[question_id])
            answers= np.append(group[user_id][1],[correct])
            ts= np.append(group[user_id][2],[ts])
            user_answer= np.append(group[user_id][3],[user_answer])
            
            group[user_id] = (questions, answers, ts, user_answer)
        else:
            group[user_id] = (np.array([question_id], dtype=np.int32), np.array([correct], dtype=np.int32), np.array([ts], dtype=np.int32)
                             ,np.array([user_answer], dtype=np.int32))

In [None]:
#Re-creates the timestamp encoding
def get_timestamp(ts, user_id):
    
    if last_timestamp.get(user_id, -1) == -1:
        return 0
    
    diff = (ts - last_timestamp[user_id])/1000
    
    if diff < 0:
        return 0
    
    if diff <= 60:
        return int(diff)
    
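    #Return the first bucket whose upper boundary exceeds diff, matching the training encoding
    for i, boundary in enumerate(boundaries):
        if boundary > diff:
            return 60+i
            
    #diff is a week or more: last bucket
    return 60+len(boundaries)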

In [None]:
#Tito's iterator: https://www.kaggle.com/its7171/time-series-api-iter-test-emulator

class Iter_Valid(object):
    def __init__(self, df, max_user=1000):
        df = df.reset_index(drop=True)
        self.df = df
        self.user_answer = df['user_answer'].astype(str).values
        self.answered_correctly = df['answered_correctly'].astype(str).values
        df['prior_group_responses'] = "[]"
        df['prior_group_answers_correct'] = "[]"
        self.sample_df = df[df['content_type_id'] == 0][['row_id']]
        self.sample_df['answered_correctly'] = 0
        self.len = len(df)
        self.user_id = df.user_id.values
        self.task_container_id = df.task_container_id.values
        self.content_type_id = df.content_type_id.values
        self.max_user = max_user
        self.current = 0
        self.pre_user_answer_list = []
        self.pre_answered_correctly_list = []

    def __iter__(self):
        return self
    
    def fix_df(self, user_answer_list, answered_correctly_list, pre_start):
        df= self.df[pre_start:self.current].copy()
        sample_df = self.sample_df[pre_start:self.current].copy()
        df.loc[pre_start,'prior_group_responses'] = '[' + ",".join(self.pre_user_answer_list) + ']'
        df.loc[pre_start,'prior_group_answers_correct'] = '[' + ",".join(self.pre_answered_correctly_list) + ']'
        self.pre_user_answer_list = user_answer_list
        self.pre_answered_correctly_list = answered_correctly_list
        return df, sample_df

    def __next__(self):
        added_user = set()
        pre_start = self.current
        pre_added_user = -1
        pre_task_container_id = -1

        user_answer_list = []
        answered_correctly_list = []
        while self.current < self.len:
            crr_user_id = self.user_id[self.current]
            crr_task_container_id = self.task_container_id[self.current]
            crr_content_type_id = self.content_type_id[self.current]
            if crr_content_type_id == 1:
                # no more than one task_container_id of "questions" from any single user
                # so we only care for content_type_id == 0 to break loop
                user_answer_list.append(self.user_answer[self.current])
                answered_correctly_list.append(self.answered_correctly[self.current])
                self.current += 1
                continue
            if crr_user_id in added_user and ((crr_user_id != pre_added_user) or (crr_task_container_id != pre_task_container_id)):
                # known user(not prev user or differnt task container)
                return self.fix_df(user_answer_list, answered_correctly_list, pre_start)
            if len(added_user) == self.max_user:
                if  crr_user_id == pre_added_user and crr_task_container_id == pre_task_container_id:
                    user_answer_list.append(self.user_answer[self.current])
                    answered_correctly_list.append(self.answered_correctly[self.current])
                    self.current += 1
                    continue
                else:
                    return self.fix_df(user_answer_list, answered_correctly_list, pre_start)
            added_user.add(crr_user_id)
            pre_added_user = crr_user_id
            pre_task_container_id = crr_task_container_id
            user_answer_list.append(self.user_answer[self.current])
            answered_correctly_list.append(self.answered_correctly[self.current])
            self.current += 1
        if pre_start < self.current:
            return self.fix_df(user_answer_list, answered_correctly_list, pre_start)
        else:
            raise StopIteration()

In [None]:
iter_test = Iter_Valid(validation,max_user=1000)
predicted = []
def set_predict(df):
    predicted.append(df)

<font size="5"> Finally, the inference loop. For d_model = 64 it should take approximately 1 hour. </font>

In [None]:
%%time
import ast

model.eval()

preds = []
pbar = tqdm(total=2500000, position=0, leave=True)
check = None

for (test_data, current_prediction_df) in iter_test:   
        
    if check is not None:
        past_vals = np.array(ast.literal_eval(test_data.iloc[0].prior_group_answers_correct)) 
        past_answers = np.array(ast.literal_eval(test_data.iloc[0].prior_group_responses))

        past_vals = np.concatenate((vals, past_vals.reshape(len(past_vals),1)), axis=1)
        past_vals = np.concatenate((past_vals, past_answers.reshape(len(past_answers),1)), axis=1)

        update_group_var(past_vals)  #Update database with the vals of the last batch        
        
    vals = test_data[["user_id","content_id","content_type_id","timestamp"]].values
    preds.extend(pred_users(vals))
    
    check = 1

    pbar.update(len(test_data))

In [None]:
df = validation.iloc[:len(preds)].copy()  #Copy so the new column is not written to a slice of validation
df["preds"] = preds

df = df[df.content_type_id == False]
print('Validation ROC:',roc_auc_score(df.answered_correctly, df.preds))

**<font size="6"> Conclusion</font>**

<font size="3"> In this notebook I wanted to show that even a relatively simple and small model (d_model = 64) can reach a fairly good score, and that scaling it up should increase the score further. I also realised how sensitive transformer implementations can be: changing one hyperparameter can ruin everything. This led me to think that a simple notebook like this one can be very useful for those who want to play with the model and experiment with new ideas, for future Kagglers learning about this competition, or even for those who are studying transformers.</font>

<font size="3">I hope this notebook was helpful. If more explanation is needed, please feel free to ask in the comments. If you think some of my code is inefficient, please point that out too. </font>
    


**PS**: I am currently looking for an entry-level DS job. If you are hiring or can recommend me, please check my LinkedIn profile: https://www.linkedin.com/in/abdessalem-boukil-37923637/ 