# Exploratory data analysis

We first import all the necessary libraries to perform the analysis and fit the first models. We also print the directory structure of the files of the competition:

In [None]:
# General libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Modules of this competition (we don't need them until the creation of the test prediction):
#
# import riiideducation
# env = riiideducation.make_env()

# Print all the files in the competition
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
# Observe it includes all the mentioned csv files as well as the riiideducation module.

In [None]:
# The following libraries are needed to implement the SAKT model
import gc # garbage collector
import random # random numbers
from tqdm import tqdm # progress bars (from Arabic taqaddum)
from sklearn.metrics import roc_auc_score # to obtain the AUC of the ROC curve
from sklearn.model_selection import train_test_split # for splitting the dataset in two

import seaborn as sns  # statistical data visualization
import matplotlib.pyplot as plt  # more data visualization

import torch # PyTorch: tensors and automatic differentiation
import torch.nn as nn # Building blocks (layers, losses) for neural networks
import torch.nn.utils.rnn as rnn_utils # Helpers to pad and pack variable-length sequences
from torch.autograd import Variable # Legacy autograd wrapper (deprecated since PyTorch 0.4; plain tensors suffice)
from torch.utils.data import Dataset, DataLoader # For loading data

## Loading the dataset

Since the dataset is large, we first need to solve the problems related to loading it.

Issues we need to solve:
* How to load the whole dataset (or enough of it to train a model and to explore it).
* Perform EDA to discover the structure of the dataset (3 days at most, preferably half a day).
* Fit a first model using a random forest to obtain a baseline score.
* Compute the ROC curve (2 hours at most) using a library that already implements it. This is important for testing and evaluating our model without needing to submit the predictions (which seems to be slow).
* Review literature about different models. Ideas: XGBoost, ensembles that involve RNN, LSTM, state-of-art NN and trees.
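
As a reference point for the ROC item in the list above: the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count one half). The sketch below computes it directly from that definition. `pairwise_auc` is a hypothetical helper written only for illustration; its cost is quadratic, so it is suitable for small sanity checks, while `sklearn.metrics.roc_auc_score` (imported above) is the practical choice on real data.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # ties count one half
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Because the AUC only depends on the ordering of the scores, it can be computed on raw logits as well as on probabilities.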

## Data loading (based on the notebook SAKT1)

In [None]:
# Hyperparameters: maximum sequence length and number of training epochs.
MAX_SEQ = 100
epochs = 35

In [None]:
%%time
# dtype of all columns
'''
dtype={'row_id': 'int64',
    'timestamp': 'int64',
    'user_id': 'int32',
    'content_id': 'int16',
    'content_type_id': 'int8',
    'task_container_id': 'int16',
    'user_answer': 'int8', 
    'answered_correctly':'int8',
    'prior_question_elapsed_time': 'float32',
    'prior_question_had_explanation': 'boolean',}
'''
# dtype of the subset of columns used in SAKT1
dtype = {'timestamp': 'int64',
         'user_id': 'int32',
         'content_id': 'int16',
         'content_type_id': 'int8',
         'answered_correctly':'int8'}

# This loads all the rows but only 5 columns
train_df = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', usecols=[1, 2, 3, 4, 7], dtype=dtype)
train_df.head()

## Exploratory data analysis

* The dataset with only 5 columns has a size of 1.5 GB, with 101,230,332 rows (100 million!)
* With respect to the timestamp: the interactions have an average timestamp of 89 days (7.70e+09 milliseconds) and a maximum timestamp of 1011.87 days (2 years and 9 months; 8.74e+10 milliseconds).
* The dataset consists of 1,959,032 lectures being watched and 99,271,300 questions posed to students. That is, 98.06% of the interactions are questions.
* Approx. 65.72% of the questions were answered correctly.
* As for the particular questions (content_id), there are exactly 13,523 different questions in the training dataset.
* There are 393,656 different students in the training set. 
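
The unit conversions and the share of questions in the list above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the timestamps are expressed in milliseconds; `ms_to_days` is a hypothetical helper, and the inputs are the rounded figures quoted above, so the results are approximate.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24  # milliseconds in one day


def ms_to_days(ms):
    """Convert a duration in milliseconds to days."""
    return ms / MS_PER_DAY


mean_days = ms_to_days(7.70e9)    # average timestamp, rounded value from the list
max_days = ms_to_days(8.74e10)    # maximum timestamp, rounded value from the list

n_rows = 101_230_332
n_lectures = 1_959_032
question_share = (n_rows - n_lectures) / n_rows  # rows that are not lectures

print(round(mean_days, 1), round(max_days, 1), round(100 * question_share, 2))
```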

**Note**: we shouldn't use the user_id (and perhaps not the content_id either) to make our predictions, because we need to predict accurately for new students.

### The dataset used by SAKT1

* It ignores the lectures, and only uses the information about the questions.

In [None]:
# train_df.info()
#train_df[train_df.answered_correctly != -1].describe()

# Consider only questions (content_type_id == 0) as in SAKT1
train_df = train_df[train_df.content_type_id == 0]

# Then sort the dataframe by the timestamp
train_df = train_df.sort_values(['timestamp'], ascending=True).reset_index(drop = True)

## Research of models used

* Most notebooks use a model called "Self-Attentive model for Knowledge Tracing" (SAKT), which is based on this [paper](https://arxiv.org/abs/1907.06837) by Shalini Pandey and George Karypis.
    - Definition of Knowledge Tracing: the task of modeling each student's mastery of knowledge concepts (KCs) as he or she engages with a sequence of learning activities.
    - Recent methods are based on RNNs (e.g., "Deep Knowledge Tracing", DKT, and "Dynamic Key-Value Memory Network", DKVMN) and outperform traditional methods because of their ability to capture complex representations of human learning.
    - These recent methods do not generalize well when dealing with sparse data (which is the case with real-world data, as students interact with few KCs).
    - The proposed approach identifies the KCs from the student's past activities that are relevant to the given KC and predicts the student's mastery based on the relatively few KCs that it picked. This handles sparse data better than RNNs (since the prediction is based on relatively few past activities).
    - To identify the relevance between KCs, the paper proposes a self-attention based approach (which is the SAKT).
    - In the paper's experiments, SAKT outperforms state-of-the-art models, improving AUC by 4.43% on average.
    - We describe the structure of the SAKT later.
    - The most popular public notebooks that use SAKT have a score of 0.773 (which we need to improve).
    - SAKT was tested (in the paper) on 5 datasets: 4 are real-world and the last one is synthetic. All the datasets contained several thousand interactions and had variable density (defined as (#Unique interactions)/(#Users * #Skill tags) ).
    - In the paper, the model is evaluated using the area under the ROC curve (AUC). The SAKT was compared with state-of-the-art models. It was trained with 80% of the dataset and tested on the remaining 20%. SAKT was implemented in TensorFlow with the ADAM optimizer and a learning rate of 0.001, and a dropout rate of 0.2 for big datasets. The maximum length of the sequence $n$ was selected to be roughly proportional to the average number of exercise tags per student (that is, between 50 and 100).
    - Results: on most datasets the score was only slightly better for the SAKT compared to other models. The improvement was 15.87% for the most sparse dataset, from 0.727 to 0.854. However, I don't believe this improvement is achieved on every sparse dataset (perhaps I'm wrong).

## Preprocessing the data (based on SAKT1)

* To preprocess, we first obtain a list with all possible question ids (all the different possible questions).
* Then we group the dataset by the user_id, where the aggregating function produces a tuple with two entries:
    - The first entry is a list of all the ids of the questions posed to that student
    - The second entry is a list of whether the student got the answer right (1) or not (0)
* **Note**: the preprocessing based on SAKT1 gets rid of the timestamp and the content_type_id. Therefore, we use only two columns to make our predictions (user_id and content_id) and the third column (answered_correctly) is the output that we use to train. Moreover, the first version only considers users with more than 10 questions answered (the class below lowers this to at least 2) and limits the history to the last 100 questions (hence, the model doesn't consider a user's first interactions once that user has more than 100 questions answered).
* Thus, we end up with 312,824 students in the training set and 78,204 students in the test set (after an 80/20 split). That is, we ignored only 2,628 users.

In [None]:
# First we get the distinct ids of the questions

questions = train_df["content_id"].unique()
n_questions = len(questions)
print("number questions:", len(questions))

In [None]:
# This "groups" by user id, so the final Series has about 393,000 entries. The index is the user id and the value
# is a tuple with two entries: the first entry contains all the content_id (as an array) for the questions posed to that
# student while the second entry contains whether the student answered correctly or not (also as an array)
group = train_df[['user_id', 'content_id', 'answered_correctly']].groupby('user_id').apply(lambda r: (
            r['content_id'].values,
            r['answered_correctly'].values))

del train_df
# Actually calling the garbage collector.
gc.collect()
# for getting the number of users:
# len(group)

In [None]:
# We import random because we need it to randomly select interactions.
import random
random.seed(1)

### Notes on the class for the dataset

* It should inherit from the `Dataset` class (`torch.utils.data.Dataset`).
* The constructor takes the group Series, the number of different questions `n_questions`, and another parameter called `max_seq` with a default of 100.
* When called, the constructor initializes three fields: `samples` as the group Series, `n_questions`, and `max_seq`.
    - Then, it creates a list of all the `user_ids` with enough questions posed (more than 10 in the first version, at least 2 in the code below), but we don't change the other fields.
* It defines `len` as the number of user_ids that pass this filter.
* Finally, it defines the `getitem` method, which is called when we index with brackets. It receives an index (which is mapped to the user_id in that position) and returns three arrays: `x`, which contains the question id for questions answered wrong and question id + `n_questions` for questions answered right (for all questions except the last one); `target_id`, the question ids of the user (except the first one); and `label`, the correctness of those target questions.
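
To make the three returned arrays concrete, here is a small worked example on a toy history. It is a sketch of the deterministic part only (left padding plus the id offset) and leaves out the random window selection; `encode_history` is a hypothetical helper, and the toy values (`n_questions=10`, `max_seq=5`) are assumptions chosen for readability.

```python
import numpy as np


def encode_history(q_hist, qa_hist, n_questions, max_seq):
    """Left-pad a user's history and build (x, target_id, label)."""
    q = np.zeros(max_seq, dtype=int)
    qa = np.zeros(max_seq, dtype=int)
    # Keep at most the last max_seq interactions.
    q_hist = np.asarray(q_hist, dtype=int)[-max_seq:]
    qa_hist = np.asarray(qa_hist, dtype=int)[-max_seq:]
    n = len(q_hist)
    if n > 0:
        q[-n:] = q_hist
        qa[-n:] = qa_hist
    # Interaction ids: question id, plus n_questions when answered correctly.
    x = q[:-1] + (qa[:-1] == 1) * n_questions
    return x, q[1:], qa[1:]


# Questions 5, 3, 7 answered right, wrong, right.
x, target_id, label = encode_history([5, 3, 7], [1, 0, 1], n_questions=10, max_seq=5)
print(x)          # [ 0  0 15  3]
print(target_id)  # [0 5 3 7]
print(label)      # [0 1 0 1]
```

At every position, `x[i]` describes the interaction that precedes the question `target_id[i]`, and `label[i]` says whether that target question was answered correctly.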

In [None]:
# Class for the data set
# Note: I don't like that it uses too little information. We could improve our model with more features,
# though we must use feature selection in order to avoid unnecessary noise.

# Note: in the first version, MAX_SEQ was 100 and we ignored those
# users with fewer than 10 questions answered.

class SAKTDataset(Dataset):
    def __init__(self, group, n_questions, max_seq=MAX_SEQ):
        super(SAKTDataset, self).__init__()
        self.max_seq = max_seq
        self.n_questions = n_questions
        self.samples = group
        
        self.user_ids = [] # list of all the user ids kept
        for user_id in group.index:
            # cycles through all possible user_ids
            
            # gets the questions, answers for that user
            q, qa = group[user_id]
            if len(q) < 2:
                # if we have fewer than 2 questions answered, then we
                # ignore the user_id
                continue
            # we only append those ids with at least 2 questions posed
            self.user_ids.append(user_id)
            
            # To reduce the memory:
            #
            #if len(q)>self.max_seq:
            #    group[user_id] = (q[-self.max_seq:],qa[-self.max_seq:])

    # we define the len of the dataset as the number of user_ids with at
    # least 2 questions posed.
    def __len__(self):
        return len(self.user_ids)

    # We define the getitem function (indexing using brackets: [])
    def __getitem__(self, index):
        # we get the user_id of that index
        user_id = self.user_ids[index]
        # then we get the questions and answers for that user id
        q_, qa_ = self.samples[user_id]
        # finally, we compute the numbers of questions answered by that student
        seq_len = len(q_)

        # we create np.ndarrays the length of max_seq
        q = np.zeros(self.max_seq, dtype=int)
        qa = np.zeros(self.max_seq, dtype=int)
        # Originally: in case q_ and qa_ are longer than max_seq, we assigned to the np.ndarrays
        # the last max_seq elements.
        # Later, a random selection of user interactions was added.
        if seq_len >= self.max_seq:
            if random.random()>0.1:
                # We first draw a random number in [0, 1). If it is
                # greater than 0.1 (90% of the time), we draw a random
                # integer in [0, k], where k is the number of
                # observations left out of the input sequence of the
                # neural network. This integer is the index of the first
                # element to be considered. This means that we consider
                # the last exercises only if the integer turns out
                # to be k.
                start = random.randint(0,(seq_len-self.max_seq))
                end = start + self.max_seq
                q[:] = q_[start:end]
                qa[:] = qa_[start:end]
            else:
                # The previous case.
                q[:] = q_[-self.max_seq:]
                qa[:] = qa_[-self.max_seq:]
            
        else:
            if random.random()>0.1:
                # We draw a random number. If it's larger than 0.1,
                # we keep only a random-length prefix of the history
                # (at least 2 interactions). Presumably this works as
                # data augmentation: the model also sees short histories.
                start = 0
                end = random.randint(2,seq_len)
                seq_len = end - start
                q[-seq_len:] = q_[0:seq_len]
                qa[-seq_len:] = qa_[0:seq_len]
            else:
                # The previous case: we store all the elements of q
                q[-seq_len:] = q_
                qa[-seq_len:] = qa_
                   
        # we store all but the first elements of q, qa in arrays called
        # target_id (questions) and label (correctness of answer)
        target_id = q[1:]
        label = qa[1:]
        
        # Finally, x is an ndarray of length max_seq - 1 whose values are the
        # question_id for the questions answered wrong and question_id + n_questions for the
        # questions answered correctly (which produces a totally valid new id, at least
        # restricted to the training dataset; the id produced might coincide with a new question
        # appearing in the test set).
        # x is built this way because this is how the SAKT paper encodes an interaction:
        # the pair (exercise, correctness) becomes the single index y_t = e_t + r_t * E,
        # where E is the number of exercises, so that one embedding table covers both cases.
        x = q[:-1].copy()  # all the question_ids, except for the last one
        x += (qa[:-1] == 1) * self.n_questions # for the questions answered correctly
                                               # we add n_questions

        return x, target_id, label

Brief notes on DataLoader of Pytorch:

* The most important argument is the dataset argument, which indicates the data that will be loaded. Pytorch supports:
    - map-style datasets: the classes that have `__getitem__()` and `__len__()` defined.
    - iterable-style datasets: classes that inherit from `IterableDataset` and implement `__iter__()`.
* `num_workers`: how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process; a positive integer makes the data loading parallel (several worker processes).
* `batch_size`: how many samples per batch to load.
* `shuffle`: set to `True` to have the data reshuffled at every epoch.
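
To illustrate what "map-style" means and what `batch_size` and `shuffle` control, the following sketch mimics the idea in plain Python. It is not PyTorch's implementation: `SquaresDataset` and `iterate_batches` are hypothetical names, there are no worker processes, and samples are returned as plain lists instead of being collated into tensors.

```python
import random


class SquaresDataset:
    """Toy map-style dataset: item i is the pair (i, i * i)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return index, index * index


def iterate_batches(dataset, batch_size, shuffle, rng):
    """Yield lists of samples, visiting every index exactly once."""
    indices = list(range(len(dataset)))
    if shuffle:
        rng.shuffle(indices)  # a new order would be drawn at every epoch
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


batches = list(iterate_batches(SquaresDataset(10), batch_size=4,
                               shuffle=True, rng=random.Random(0)))
print([len(b) for b in batches])  # [4, 4, 2]
```

The last batch is smaller because 10 is not a multiple of 4; the real `DataLoader` behaves the same way unless `drop_last=True` is passed.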

In [None]:
# We could create an «if» statement and use it for different
# splits: train/val for the selection of the model and
# 100% train for doing the last submission.

# This splits our dataset in order to avoid testing on the competition's public test set.
train, val = train_test_split(group, test_size=0.2)

# Initializes the class we defined above.
train_dataset = SAKTDataset(train, n_questions)
# Uses the DataLoader to load data into Pytorch
train_dataloader = DataLoader(train_dataset, batch_size=2048, shuffle=True, num_workers=8)
# garbage collector:
del train

# The same for the validation dataset:
val_dataset = SAKTDataset(val, n_questions)
val_dataloader = DataLoader(val_dataset, batch_size=2048, shuffle=True, num_workers=8)
del val

# Actually calling the garbage collector.
gc.collect()

print(len(train_dataset))
print(len(val_dataset))

# The output of getitem for index 0 is the first element
# of the validation dataset.
#
# item = val_dataset[0]
# print(item[0])
# print(item[1])
# print(item[2])

## Definition of the SAKT model (based on SAKT1)

The process to define a Pytorch model:
1. First we load and prepare the data. This implies having the data in map-style or iterable-style as described above. Note that we need numerical inputs and numerical outputs. *Did we need to create a class? We could use the same dataframe we used before.*
2. Then we define the model (structure of the NN). To define a model, we need to extend `nn.Module`.
3. After that, we need to define a loss function and an optimization algorithm. This allows us to train the NN.
4. We train the NN. Then, we test it and finally we make predictions with it.


### On how we define the structure of the NN

* We need to inherit from nn.Module.
* The constructor of the class defines the layers of the NN.
* The `forward()` function defines how to forward propagate input through the defined layers (overrides predefined method).
* Among the available layers, there are `Linear` (defines a fully connected layer), `Conv2d` (convolutional), `MaxPool2d`, etcetera. 
* Among the available activation we have `ReLU`, `Softmax` and `Sigmoid`.
* The following example defines a MLP with a single layer:


    class MLP(nn.Module):

        def __init__(self, n_inputs):
            super(MLP, self).__init__()
            # single fully connected layer: n_inputs -> 1
            self.layer = nn.Linear(n_inputs, 1)
            self.activation = nn.Sigmoid()

        def forward(self, X):
            X = self.layer(X)
            X = self.activation(X)
            return X


## Description of the structure of the SAKT model

* It consists of several blocks. The last block is a feed-forward network (FFN; an MLP with one hidden layer).
    - The FFN produces the output and receives an already processed input.
* In the SAKT model, we have the following notation:
    - $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_t)$ is the sequence of the student's past interactions.
    - $\mathbf{x}_t = (e_t, r_t)$ is the interaction at time $t$, where $e_t$ is the exercise posed to the student and $r_t$ is the correctness of the answer.
* Thus, our objective is to predict $r_{t + 1}$ or estimate $P[r_{t + 1} = 1 | e_{t + 1}, \mathbf{X}]$.
* We must consider the inputs as: the interactions $\mathbf{x}_1, \dots, \mathbf{x}_{t-1}$ together with the exercises shifted one step ahead, $e_2, \dots, e_t$; the outputs are $r_2, \dots, r_t$ (this matches `x`, `target_id` and `label` in the dataset class).
* With respect to the layers, we have the following layers:
    - Embedding layer: we transform the input sequence $y = (y_1, \dots, y_t)$ into $s = (s_1, \dots, s_n)$, where $n$ is the maximum length the model handles (shorter sequences are padded on the left). Each element is then mapped to a vector of dimension $d$.
    - Self-attention layer: in the code below, the queries are the embeddings of the exercises to predict, and the keys and values are the embeddings of the past interactions; a mask prevents each position from attending to future interactions.
    - FFN: it receives $S$ (the output of the self-attention layer) as input. It has only one hidden layer with ReLU activation functions and one linear output function.
    - Residual connections: residual connections are applied to both the self-attention and FFN layers. According to the paper, residual connections propagate the embeddings of recently solved exercises to the final layer, making it easier for the model to leverage the low-layer information.
    - Layer normalization: it's applied to the inputs of the general network, to the input of the self-attention layer and the input of the FFN layer. According to the paper, it stabilizes and accelerates the NN.
* Observe we don't put the sigmoid function on the last layer of the SAKT model. Instead, the last layer is a plain linear output and we use BCEWithLogitsLoss (which combines a Sigmoid layer and BCELoss in a single class; this is more numerically stable because it can use the log-sum-exp trick).
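
The numerical-stability argument in the last bullet can be seen in a scalar example. The sketch below compares a naive "sigmoid, then log" computation with an algebraically equivalent stable form, $\max(z, 0) - z\,y + \log(1 + e^{-|z|})$. Both functions are hypothetical helpers for illustration; the stable form is assumed to reflect the idea behind `BCEWithLogitsLoss`, not its actual source code.

```python
import math


def bce_naive(z, y):
    """Binary cross-entropy computed as sigmoid followed by log."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_with_logits(z, y):
    """Same loss computed directly from the logit z, without forming p."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# For moderate logits both versions agree.
print(bce_naive(0.3, 1.0), bce_with_logits(0.3, 1.0))

# For a large logit the naive version evaluates log(0) and fails,
# while the stable version returns the (large but finite) loss.
print(bce_with_logits(1000.0, 0.0))
try:
    bce_naive(1000.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)
```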
    
    
**Note**: the first code cell below (class `FFN` and function `future_mask`) defines the FFN and the mask used by the attention layer.

**Note**: where is the self-attention layer? The code only defines it through `nn.MultiheadAttention`, which implements the (multi-head) attention computation itself.
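
To show what the attention computation with a future mask does, here is a single-head sketch in NumPy. It is a simplification under stated assumptions: there are no learned projection matrices, no multiple heads and no dropout, all of which `nn.MultiheadAttention` adds on top; `causal_attention` is a hypothetical name.

```python
import numpy as np


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a future mask."""
    n, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)
    # True above the main diagonal: position i must not see positions j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax (subtracting the row maximum for stability).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
e = rng.normal(size=(4, 8))  # stand-in for exercise embeddings (queries)
x = rng.normal(size=(4, 8))  # stand-in for interaction embeddings (keys and values)
output, weights = causal_attention(e, x, x)
print(np.round(weights, 2))  # lower-triangular matrix whose rows sum to 1
```

The first row of `weights` puts all its mass on the first interaction, so the first output is just `x[0]`; later positions mix all the interactions seen so far.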

In [None]:
# The first class, FFN, defines a feed-forward network (hence, an MLP with dropout).
# However, the FFN doesn't receive the dataset info directly as input.

class FFN(nn.Module):
    def __init__(self, state_size=200):
        # This NN has only 2 layers. They are densely connected.
        # The first layer has ReLU as activation function.
        # The second layer is linear and is followed by Dropout.
        super(FFN, self).__init__()
        self.state_size = state_size
            # state_size only refers to the number of inputs. Hence
            # we have the same number of inputs as of hidden neurons and
            # output neurons.
        
        self.lr1 = nn.Linear(state_size, state_size)
            # Self-explanatory
        self.relu = nn.ReLU()
            # Self-explanatory
        self.lr2 = nn.Linear(state_size, state_size)
            # Having the second Linear layer only creates an output layer with
            # identity activation function
        self.dropout = nn.Dropout(0.2)
            # Note on dropout: it randomly zeroes some of the inputs
            # during training. Observe the shape of input and output is the same.

    def forward(self, x):
        # we simply propagate the input forward through the layers
        x = self.lr1(x)
        x = self.relu(x)
        x = self.lr2(x)
        return self.dropout(x)

# This creates the mask that hides future positions from the attention layer:
# entry (i, j) is True when j > i, so position i cannot attend to position j.
def future_mask(seq_length):
    # We first create a matrix of ones. Triu keeps the upper triangular part;
    # with k = 1, the main diagonal and everything below it are set to zero.
    future_mask = np.triu(np.ones((seq_length, seq_length)), k=1).astype('bool')
    # Then we convert the matrix into a tensor (so that PyTorch can process it).
    return torch.from_numpy(future_mask)

In [None]:
# This defines the whole SAKT model. Observe the last layer is a FFN network.
class SAKTModel(nn.Module):
    def __init__(self, n_questions, max_seq=100, embed_dim=128):
        # It takes three parameters: n_questions, max_seq and embed_dim
        super(SAKTModel, self).__init__()
        # We first initialize every parameter
        self.n_questions = n_questions
        self.embed_dim = embed_dim
        
        # From docs: A simple lookup table that stores embeddings of a fixed dictionary and size.
        # First argument `num_embeddings` is the size of the dictionary of embeddings.
        # The second argument `embedding_dim` is the size of each embedding vector.
        # This creates the embeddings: observe that all the embedding layers are created with the
        # torch layer nn.Embedding.
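        # - embedding is a traditional embedding: from the dictionary of interactions
        #       (question id, shifted by n_questions when answered correctly) to dimension d.
        # - pos_embedding encodes the position in the sequence: from 0..max_seq-2 to dimension d.
        #       It is needed because attention by itself ignores the order of the inputs.
        # - e_embedding is similar to the first, but we don't differ using correctness.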
        self.embedding = nn.Embedding(2*n_questions+1, embed_dim)
        self.pos_embedding = nn.Embedding(max_seq-1, embed_dim)
        self.e_embedding = nn.Embedding(n_questions+1, embed_dim)
        
        # This creates the Attention layer:
        # This defines the Multihead Attention as in the paper (and I believe is the same definition
        # as the first paper that defined the Multihead Attention).
        self.multi_att = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=8, dropout=0.2)
        
        # Dropout to the attention output with p = 0.2
        self.dropout = nn.Dropout(0.2)
        # Layer normalization (LayerNorm).
        self.layer_normal = nn.LayerNorm(embed_dim) 
        
        # Defined FFN:
        self.ffn = FFN(embed_dim)
        
        # The output is a linear function of the outputs of the FFN.
        # There is no sigmoid here: it is applied inside BCEWithLogitsLoss.
        self.pred = nn.Linear(embed_dim, 1)
    
    def forward(self, x, question_ids):
        # x is a tensor
        
        # Calling x.device returns device(type='cuda', index=0) in case it was defined
        # this way.
        device = x.device
        # Transforms the tensor using the embedding.
        x = self.embedding(x)
        
        # This does several things:
        # Creates a 1D tensor that goes from 0 to k - 1, where k is the second dimension of x
        # Then inserts a dimension (converts the 1D tensor into a 2D tensor with only one row)
        # Finally, moves the tensor to the device.
        # Thus, the result is [[0, 1, 2, 3, 4, ..., k - 1]] in the device
        pos_id = torch.arange(x.size(1)).unsqueeze(0).to(device)
        
        # This creates the embedding of the position
        pos_x = self.pos_embedding(pos_id)
        # We add the position and the input tensor x (Seems like a Res Connection, but pos_x isn't
        # a layer-modification of x).
        x = x + pos_x
        
        # Finally, we make an embedding of the question_ids.
        e = self.e_embedding(question_ids)
        
        # NOTE: question_ids contains the same questions as x shifted one step ahead, and
        # without the offset that encodes correctness.

        # Changes the dimensions of the tensor. Generalization of the transpose.
        # Applied to both x and questions_id.
        x = x.permute(1, 0, 2) # x: [bs, s_len, embed] => [s_len, bs, embed]
        e = e.permute(1, 0, 2)
        
        # Creates the mask in order to prevent each position from attending
        # to future interactions.
        att_mask = future_mask(x.size(0)).to(device)
        att_output, att_weight = self.multi_att(e, x, x, attn_mask=att_mask)
        # Note that «att_output + e» is a residual connection that "jumps" the ATT layer.
        # Then we apply layer normalization to the output of the Res Connection.
        att_output = self.layer_normal(att_output + e) # Residual connection
        att_output = att_output.permute(1, 0, 2) # att_output: [s_len, bs, embed] => [bs, s_len, embed]

        # We apply the FFN to the output of the Attention output.
        x = self.ffn(att_output)
        # Note that «x + att_output» is a residual connection that "jumps" the FFN.
        # Then we apply layer normalization to the output of the Res Connection.
        x = self.layer_normal(x + att_output) # Residual connection
        # Finally, we perform the prediction (dense connection to a single neuron).
        x = self.pred(x)

        return x.squeeze(-1), att_weight

With respect to particular hyperparameters:
* The dimension of embeddings d is 128.
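
As a rough guide to what this choice costs, the number of weights in each embedding table is simply (number of rows) × (embedding dimension), since an embedding layer is a lookup table without bias. The sketch below applies this to the three tables of `SAKTModel`, using the 13,523 questions counted in the EDA; `embedding_params` is a hypothetical helper.

```python
n_questions = 13_523  # distinct questions found in the training set
max_seq = 100
embed_dim = 128


def embedding_params(num_embeddings, dim):
    """Number of weights in a lookup table of shape (num_embeddings, dim)."""
    return num_embeddings * dim


sizes = {
    "embedding": embedding_params(2 * n_questions + 1, embed_dim),
    "pos_embedding": embedding_params(max_seq - 1, embed_dim),
    "e_embedding": embedding_params(n_questions + 1, embed_dim),
}
print(sizes)
# {'embedding': 3462016, 'pos_embedding': 12672, 'e_embedding': 1731072}
```

Most parameters therefore sit in the interaction and exercise embeddings, and they grow linearly with the embedding dimension.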

In [None]:
# This line selects the GPU (cuda) if it is available, otherwise the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# This is the constructor of the model we defined.
model = SAKTModel(n_questions, embed_dim=128)

# We construct an optimizer object. In this case: ADAM.
#
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.99, weight_decay=0.005)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

# This line puts the model in the GPU.
model.to(device)
# This line puts the criterion in the GPU, also.
criterion.to(device)

In [None]:
# We define a function that defines each train epoch.
def train_epoch(model, train_iterator, optim, criterion, device="cpu"):
    # This sets our NN in the training mode (activates dropout and so on)
    model.train()

    # These variables are used throughout the training. Self-explanatory
    train_loss = []
    num_corrects = 0
    num_total = 0
    labels = []
    outs = []
    
    # Progress bar. Note that train_iterator is an argument of the function.
    tbar = tqdm(train_iterator)
    for item in tbar:
        # Gets the first element of the batch (x), sends it to the device and converts
        # it to long type.
        x = item[0].to(device).long()
        # Similar, but with the second element of the batch (target_id).
        target_id = item[1].to(device).long()
        # Idem, but with the third element (label) and converting it to float.
        label = item[2].to(device).float()
        # Creates the mask: True for real targets, False for padding (id 0).
        target_mask = (target_id != 0)

        # We set the gradients to 0.
        optim.zero_grad()
        # Get the output and weights of the model.
        output, atten_weight = model(x, target_id)
        
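        # Keep only the outputs at non-padded positions (returns a 1D tensor)
        output = torch.masked_select(output, target_mask)
        # Keep the matching labels, so that padding doesn't contribute to the loss
        label = torch.masked_select(label, target_mask)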
        
        loss = criterion(output, label)
        loss.backward()
        optim.step()
        train_loss.append(loss.item())
        pred = (torch.sigmoid(output) >= 0.5).long()
        
        num_corrects += (pred == label).sum().item()
        num_total += len(label)

        labels.extend(label.view(-1).data.cpu().numpy())
        outs.extend(output.view(-1).data.cpu().numpy())

        tbar.set_description('loss - {:.4f}'.format(loss))
    
    # After looping, we can finally compute the accuracy and loss:
    #
    # acc stands for Accuracy (proportion of right answers)
    acc = num_corrects / num_total
    # Area under the ROC curve
    auc = roc_auc_score(labels, outs)
    # Loss value using the loss function for training (BCE Loss).
    loss = np.average(train_loss)

    return loss, acc, auc
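
The masking step inside the loop can be illustrated with NumPy boolean indexing, which flattens the selected entries in the same row-by-row order as `torch.masked_select`. The arrays below are made-up toy values for a batch of two users with four positions each.

```python
import numpy as np

# Toy batch: 2 users, 4 target positions each; id 0 marks left padding.
target_id = np.array([[0, 0, 5, 3],
                      [0, 7, 2, 9]])
logits = np.array([[0.2, -0.1, 1.5, -0.7],
                   [0.0, 0.4, -1.2, 2.0]])
label = np.array([[0, 0, 1, 0],
                  [0, 1, 0, 1]], dtype=float)

target_mask = target_id != 0       # True only where a real question is present
kept_logits = logits[target_mask]  # 1D array with the 5 real predictions
kept_labels = label[target_mask]   # matching labels, in the same order

print(kept_logits)  # [ 1.5 -0.7  0.4 -1.2  2. ]
print(kept_labels)  # [1. 0. 1. 0. 1.]
```

One side effect of using 0 as the padding value is that a real question whose `content_id` is 0 would also be dropped by this mask.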

In [None]:
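# This defines the validation epoch: the same steps as train_epoch, but in eval mode
# (dropout disabled) and without gradient computation or optimizer updates.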
def val_epoch(model, val_iterator, criterion, device="cpu"):
    model.eval()

    train_loss = []
    num_corrects = 0
    num_total = 0
    labels = []
    outs = []

    tbar = tqdm(val_iterator)
    for item in tbar:
        x = item[0].to(device).long()
        target_id = item[1].to(device).long()
        label = item[2].to(device).float()
        target_mask = (target_id != 0)

        with torch.no_grad():
            output, atten_weight = model(x, target_id)
        
        output = torch.masked_select(output, target_mask)
        label = torch.masked_select(label, target_mask)

        loss = criterion(output, label)
        train_loss.append(loss.item())

        pred = (torch.sigmoid(output) >= 0.5).long()
        
        num_corrects += (pred == label).sum().item()
        num_total += len(label)

        labels.extend(label.view(-1).data.cpu().numpy())
        outs.extend(output.view(-1).data.cpu().numpy())

        tbar.set_description('loss - {:.4f}'.format(loss))

    acc = num_corrects / num_total
    auc = roc_auc_score(labels, outs)
    loss = np.average(train_loss)

    return loss, acc, auc

In [None]:
over_fit = 0
last_auc = 0
for epoch in range(epochs):
    train_loss, train_acc, train_auc = train_epoch(model, train_dataloader, optimizer, criterion, device)
    print("epoch - {} train_loss - {:.2f} acc - {:.3f} auc - {:.3f}".format(epoch, train_loss, train_acc, train_auc))
    
    val_loss, val_acc, val_auc = val_epoch(model, val_dataloader, criterion, device)
    print("epoch - {} val_loss - {:.2f} acc - {:.3f} auc - {:.3f}".format(epoch, val_loss, val_acc, val_auc))
    
    if val_auc > last_auc:
        last_auc = val_auc
        over_fit = 0
    else:
        over_fit += 1
        
    
    if over_fit >= 2:
        print("early stop epoch ", epoch)
        break
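
The stopping rule used in the loop above (stop after two consecutive epochs without a new best validation AUC) can be isolated and tried on a made-up sequence of scores. `early_stop_epoch` is a hypothetical helper; note that `last_auc` in the loop actually stores the best AUC seen so far, which is what `best_auc` represents here.

```python
def early_stop_epoch(val_aucs, patience=2):
    """Return the epoch at which training would stop, or None if it never stops."""
    best_auc = 0.0
    stale = 0  # consecutive epochs without improvement
    for epoch, auc in enumerate(val_aucs):
        if auc > best_auc:
            best_auc = auc
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch
    return None


# Epochs 2 and 3 do not beat the best value (0.72), so training stops at epoch 3
# and never reaches the better score of epoch 4.
print(early_stop_epoch([0.70, 0.72, 0.71, 0.715, 0.73]))  # 3
```

The example also shows the trade-off of a small patience: a later improvement can be missed.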

In [None]:
torch.save(model.state_dict(), "SAKT-Capo.pt")

In [None]:
#del dataset
#gc.collect()

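## Generating the predictions with the model

* Since we don't have `answered_correctly` in the test dataset, we have to define a new class for testing the model.
* In this case, each item corresponds to one row of the test dataframe: the input is built from the history stored for that user, and the question to predict is appended at the end of the question sequence.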
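
A toy example helps to see how the two arrays line up at test time. The sketch below follows the same indexing as the class defined next (`q[1:]`, `q[2:]` plus the target), but as a standalone function; `build_test_input` is a hypothetical name and the toy values (`n_questions=10`, `max_seq=5`) are assumptions.

```python
import numpy as np


def build_test_input(q_hist, qa_hist, target_id, n_questions, max_seq):
    """Build (x, questions) for one test row from a user's stored history."""
    q = np.zeros(max_seq, dtype=int)
    qa = np.zeros(max_seq, dtype=int)
    q_hist = np.asarray(q_hist, dtype=int)[-max_seq:]
    qa_hist = np.asarray(qa_hist, dtype=int)[-max_seq:]
    n = len(q_hist)
    if n > 0:
        q[-n:] = q_hist
        qa[-n:] = qa_hist
    # Past interactions, dropping the oldest one to keep max_seq - 1 entries.
    x = q[1:] + (qa[1:] == 1) * n_questions
    # Questions shifted one step ahead, ending with the question to predict.
    questions = np.append(q[2:], target_id)
    return x, questions


# History: questions 5, 3, 7 answered right, wrong, right; now question 9 is asked.
x, questions = build_test_input([5, 3, 7], [1, 0, 1], target_id=9,
                                n_questions=10, max_seq=5)
print(x)          # [ 0 15  3 17]
print(questions)  # [5 3 7 9]
```

The prediction for the new question is the model output at the last position, where the most recent interaction (17) is paired with the target question (9).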

In [None]:
# Then we create the dataset for the test. This is similar to that of the training, without answers.
class TestDataset(Dataset):
    # Same padding scheme as in SAKTDataset, but indexed by test row instead of by user.
    def __init__(self, samples, test_df, questions, max_seq=MAX_SEQ):
        super(TestDataset, self).__init__()
        self.samples = samples
        self.user_ids = [x for x in test_df["user_id"].unique()]
        self.test_df = test_df
        self.questions = questions
        self.n_questions = len(questions)
        self.max_seq = max_seq

    def __len__(self):
        return self.test_df.shape[0]

    def __getitem__(self, index):
        test_info = self.test_df.iloc[index]

        user_id = test_info["user_id"]
        target_id = test_info["content_id"]

        q = np.zeros(self.max_seq, dtype=int)
        qa = np.zeros(self.max_seq, dtype=int)

        if user_id in self.samples.index:
            q_, qa_ = self.samples[user_id]
            
            seq_len = len(q_)

            if seq_len >= self.max_seq:
                q = q_[-self.max_seq:]
                qa = qa_[-self.max_seq:]
            else:
                q[-seq_len:] = q_
                qa[-seq_len:] = qa_          
        
        # Encode each past interaction as its question id, shifted by n_questions when answered correctly.
        x = q[1:].copy()
        x += (qa[1:] == 1) * self.n_questions
        
        questions = np.append(q[2:], [target_id])
        
        return x, questions
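To make the construction in `__getitem__` concrete, here is a small worked example on toy values. It mirrors the padding and encoding steps above in a standalone function; `build_inputs`, the toy `MAX_SEQ` and `N_QUESTIONS`, and the sample history are assumptions made for illustration, and the `seq_len > 0` guard is an addition that is not in the original class.

```python
import numpy as np

MAX_SEQ = 5        # toy value; the notebook defines its own MAX_SEQ
N_QUESTIONS = 10   # toy value standing in for len(questions)

def build_inputs(q_hist, qa_hist, target_id, max_seq=MAX_SEQ, n_questions=N_QUESTIONS):
    """Mirror the array construction of TestDataset.__getitem__ for one row."""
    q = np.zeros(max_seq, dtype=int)
    qa = np.zeros(max_seq, dtype=int)
    seq_len = len(q_hist)
    if seq_len >= max_seq:
        # Keep only the most recent max_seq interactions.
        q[:] = q_hist[-max_seq:]
        qa[:] = qa_hist[-max_seq:]
    elif seq_len > 0:
        # Left-pad shorter histories with zeros.
        q[-seq_len:] = q_hist
        qa[-seq_len:] = qa_hist
    # Past interactions: question id, plus n_questions if answered correctly.
    x = q[1:].copy()
    x += (qa[1:] == 1) * n_questions
    # Questions to predict: the history shifted by one, ending with the target.
    target_seq = np.append(q[2:], [target_id])
    return x, target_seq

x, target_seq = build_inputs([3, 7, 2], [1, 0, 1], target_id=9)
print(x.tolist())           # [0, 13, 7, 12]
print(target_seq.tolist())  # [3, 7, 2, 9]
```

Both arrays have length `max_seq - 1`, and the last entry of `target_seq` is the question whose answer is being predicted.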

In [None]:
# We finally import the modules of the competition. They are needed to submit the predictions.
import riiideducation

env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
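import ast
import psutil

model.eval()

prev_test_df = None

for (test_df, sample_prediction_df) in tqdm(iter_test):
    # Update the user histories with the previous batch, whose answers are now known
    # (skipped when memory usage reaches 90%).
    if (prev_test_df is not None) and (psutil.virtual_memory().percent < 90):
        print(psutil.virtual_memory().percent)
        # The first row holds the answers of the previous batch as a string-encoded list.
        prev_test_df['answered_correctly'] = ast.literal_eval(test_df['prior_group_answers_correct'].iloc[0])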
        prev_test_df = prev_test_df[prev_test_df.content_type_id == False]
        prev_group = prev_test_df[['user_id', 'content_id', 'answered_correctly']].groupby('user_id').apply(lambda r: (
            r['content_id'].values,
            r['answered_correctly'].values))
        for prev_user_id in prev_group.index:
            prev_group_content = prev_group[prev_user_id][0]
            prev_group_ac = prev_group[prev_user_id][1]
            if prev_user_id in group.index:
                group[prev_user_id] = (np.append(group[prev_user_id][0],prev_group_content), 
                                       np.append(group[prev_user_id][1],prev_group_ac))
 
            else:
                group[prev_user_id] = (prev_group_content,prev_group_ac)
            if len(group[prev_user_id][0])>MAX_SEQ:
                new_group_content = group[prev_user_id][0][-MAX_SEQ:]
                new_group_ac = group[prev_user_id][1][-MAX_SEQ:]
                group[prev_user_id] = (new_group_content,new_group_ac)

    prev_test_df = test_df.copy()
    
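    # Keep only the questions; copy so that the predictions can be assigned below without warnings.
    test_df = test_df[test_df.content_type_id == False].copy()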
    
    test_dataset = TestDataset(group, test_df, questions)
    test_dataloader = DataLoader(test_dataset, batch_size=51200, shuffle=False)
    
    outs = []

    for item in tqdm(test_dataloader):
        x = item[0].to(device).long()
        target_id = item[1].to(device).long()

        with torch.no_grad():
            output, att_weight = model(x, target_id)
        
        
        output = torch.sigmoid(output)
        output = output[:, -1]

        # pred = (output >= 0.5).long()
        # loss = criterion(output, label)

        # val_loss.append(loss.item())
        # num_corrects += (pred == label).sum().item()
        # num_total += len(label)

        # labels.extend(label.squeeze(-1).data.cpu().numpy())
        outs.extend(output.view(-1).data.cpu().numpy())
        
    test_df['answered_correctly'] = outs
    
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
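The history update performed at the start of each iteration above can be sketched on its own. This simplified version uses a plain dictionary of lists instead of the pandas Series of NumPy arrays held in `group`, and assumes that lecture rows have already been filtered out; `update_history`, the toy `MAX_SEQ` and the sample batch are hypothetical.

```python
import ast

MAX_SEQ = 3  # toy value; the notebook defines its own MAX_SEQ

def update_history(history, user_ids, content_ids, answers, max_seq=MAX_SEQ):
    """Append each (question, answer) pair to its user's history, keeping the last max_seq."""
    for user, content, answer in zip(user_ids, content_ids, answers):
        qs, acs = history.get(user, ([], []))
        history[user] = ((qs + [content])[-max_seq:], (acs + [answer])[-max_seq:])
    return history

# Hypothetical previous batch: three question rows from two users.
prev_users = [1, 2, 1]
prev_contents = [10, 11, 12]
# The answers arrive with the next batch as a string-encoded list.
prev_answers = ast.literal_eval("[1, 0, 1]")

# User 1 already has two interactions; user 2 is new.
history = {1: ([5, 6], [0, 1])}
update_history(history, prev_users, prev_contents, prev_answers)
print(history)
```

`ast.literal_eval` only parses Python literals, which makes it a safer choice than `eval` for a string that comes from outside the notebook.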

In [None]:
# The following cells were used to produce the dummy prediction:
#
'''
(test_df, sample_prediction_df) = next(iter_test)
test_df
'''
#
'''
sample_prediction_df
'''
#
'''
env.predict(sample_prediction_df)
'''
#
'''
for (test_df, sample_prediction_df) in iter_test:
    test_df['answered_correctly'] = 0.5
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
'''

## Notes on the score

* The score is the area under the ROC curve (AUC). As of 28/12/2020, the highest public score is 0.816. The lowest score that earns a silver medal is 0.786, and the lowest that earns a bronze medal is 0.783. The highest-scoring public notebook has a score of 0.783.

## Track of attempts

* The first attempt used the dummy prediction (predicting 0.5 for every observation). This led to a score of **0.500**. Note that this isn't the worst possible score: a poor model can score lower (there are scores of 0.425 at the time of this writing).
* The second attempt uses the SAKT model based on this [notebook](https://www.kaggle.com/wangsg/a-self-attentive-model-for-knowledge-tracing/notebook). It simply replicates the model and gets a public score of **0.764**.
* The third attempt is based on this [notebook](https://www.kaggle.com/leadbest/sakt-with-randomization-state-updates). It uses random selection of user interactions and a small optimization to get a score of **0.771**. Since it uses the exact same features as the previous notebook, I'm going to also ....
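
The 0.500 score of the dummy prediction, and the possibility of scoring below it, both follow from reading the AUC as the fraction of (positive, negative) pairs that the model ranks correctly, with ties counted as one half. The following is a pure-Python sketch of that definition, not the competition's scoring code; `pairwise_auc` and the sample labels and scores are made up for illustration.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores, made up for illustration.
labels = [1, 0, 1, 1, 0]
constant = [0.5] * len(labels)         # the dummy prediction
good = [0.9, 0.2, 0.8, 0.4, 0.6]       # mostly ranks positives above negatives
inverted = [1.0 - s for s in good]     # the same model with its ranking reversed

print(pairwise_auc(labels, constant))  # every pair is a tie
print(pairwise_auc(labels, good))
print(pairwise_auc(labels, inverted))
```

A constant prediction ties every pair and therefore scores exactly 0.5, while a model whose ranking is systematically reversed scores one minus the AUC of the corresponding good model, which is below 0.5.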