# Searching for a Magic

In this notebook, we tested two methods for neural net-based models:
* Label smoothing
* Post-processing

Some references for the original SAKT models:
* https://www.kaggle.com/leadbest/sakt-with-randomization-state-updates
* https://www.kaggle.com/wangsg/a-self-attentive-model-for-knowledge-tracing
* https://www.kaggle.com/leadbest/sakt-self-attentive-knowledge-tracing-submitter
* https://www.kaggle.com/its7171/cv-strategy
* https://www.kaggle.com/mpware/sakt-fork
* https://www.kaggle.com/scaomath/riiid-sakt-baseline-minimal-inference

# Label smoothing

Basically we test if the following label smoothing works for SAKT:
$$
(\text{new label}) = (\text{orig label})\times (1-\alpha) + \dfrac{\alpha}{2} \tag{1}
$$

The idea is quite simple: if we look at the loss function of binary classification for a sample with features $x$ and label $y$:

$$
L (w; x, y) = -  
\Bigl\{y \ln\big( h(x; w) \big) 
+ (1 - y) \ln\big( 1 - h(x;w) \big) \Bigr\}.
$$

where $h(x;w)$ is a number between 0 and 1, which is the prediction of the model, and $w$ stands for the weights of whatever neural-net-based model we have. For the logistic model $h(x;w) = 1/(1+e^{-w^\top x})$, taking the gradient of the loss gives:

$$
\nabla_{w} \big( L (w) \big) 
=\big( \underbrace{h(x;w) - y}_{(\star)} \big) x. \tag{2}
$$

As we can see here, $(\star)$ is the difference between the predicted target (a number between 0 and 1) and the true label $y$ (0 or 1). When the model struggles due to its limited capacity, for example predicting 0.3 for a sample whose true label is 1, the gradient contribution from this sample becomes large and could potentially derail the optimization of the model.
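
As a quick numerical check of (2), the sketch below uses a hypothetical one-weight logistic model (the names `bce_loss`, `analytic_grad`, `numeric_grad` and the values of `w` and `x` are illustrative assumptions, not part of SAKT). It compares the closed-form gradient with a central finite difference, then evaluates the gradient for a hard label and for a smoothed one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, x, y):
    # Binary cross-entropy for a one-weight logistic model h = sigmoid(w * x)
    h = sigmoid(w * x)
    return -(y * math.log(h) + (1.0 - y) * math.log(1.0 - h))

def analytic_grad(w, x, y):
    # Formula (2): (h - y) * x
    return (sigmoid(w * x) - y) * x

def numeric_grad(w, x, y, eps=1e-6):
    # Central finite difference, used only to check the closed form
    return (bce_loss(w + eps, x, y) - bce_loss(w - eps, x, y)) / (2.0 * eps)

w, x = -0.5, 1.7                           # prediction is roughly 0.3
grad_hard = analytic_grad(w, x, 1.0)       # hard label 1
grad_smooth = analytic_grad(w, x, 0.9)     # smoothed label, alpha = 0.2

print(f"analytic {grad_hard:.6f} vs numeric {numeric_grad(w, x, 1.0):.6f}")
print(f"hard label: {grad_hard:.4f}, smoothed label: {grad_smooth:.4f}")
```

For this mispredicted sample the smoothed label gives a gradient of smaller magnitude, which is the effect the argument relies on.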

For some samples, it is better to let them go as outliers than to let them contribute more than the others to the gradient. In optimization theory, the Wolfe conditions are well-known conditions for first-order line search methods, in which the curvature condition does somewhat the same thing, restricting the magnitude of the gradient along the search direction.

Formula (1) "smooths" the labels. For example, if we choose $\alpha = 0.2$, then a sample originally labeled 1 gets the label 0.9, and a sample labeled 0 gets 0.1. In this way, by (2), the gradient contribution of a badly mispredicted sample becomes smaller. Label smoothing is commonly known to improve the loss value and the accuracy. But does it improve AUC as well? Let's find out.
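
A minimal sketch of formula (1) in plain Python follows; the helper name `smooth_labels` is an illustrative assumption. It also shows that thresholding at 0.5 recovers the original hard labels, which is what the training loop later relies on when it computes accuracy.

```python
def smooth_labels(labels, alpha):
    """Apply formula (1): y * (1 - alpha) + alpha / 2."""
    return [y * (1.0 - alpha) + alpha / 2.0 for y in labels]

hard_labels = [0, 1, 1, 0]
soft_labels = smooth_labels(hard_labels, alpha=0.2)

# Thresholding at 0.5 maps the smoothed labels back to the hard ones
recovered = [int(y >= 0.5) for y in soft_labels]

print(soft_labels)
print(recovered)
```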


# How about postprocessing?

We can try simply multiplying the output by a scaling factor before feeding it to the softmax/logistic function.

Why could it work in certain scenarios? The intuition is that the AUC metric punishes unconfident predictions...
For a binary classification problem, a confident prediction is close to either 0 or 1, something like the following image:

![](https://i.imgur.com/rKtiWXW.png)

An unconfident prediction is closer to 0.5 than to the two ends, and the histogram looks like this:

![](https://imgur.com/x9ef7xF)

Because AUC is an area, the threshold moves from 0 to 1 to check the ratio of false positives to false negatives. If your model has a lot of predictions with probability near 0.5, the thinking goes, then the AUC cannot be very high.

One simple way, given that your model is somewhat accurate, is to rescale the output of the NN before passing it to the sigmoid:

$$
\text{final probability estimate} = \frac{1}{1+\exp(-\beta z)}
$$

where $z$ is your NN output, and $\beta$ is the scaling.
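
One property worth checking before running the experiment: AUC depends only on how the scores rank positives against negatives, and multiplying the logit by a positive $\beta$ is a monotone transformation. The sketch below uses made-up toy logits and a hand-rolled pairwise AUC (the helper `pairwise_auc` is illustrative, not a library function) to show that scaling pushes probabilities away from 0.5 while leaving the ranking, and hence this AUC, unchanged. In floating point, a very large scaling can saturate the sigmoid and create ties, which is the one way the number could move.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_auc(labels, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data, chosen only for illustration
labels = [0, 0, 1, 0, 1, 1, 0, 1]
logits = [-1.2, -0.3, -0.1, 0.2, 0.4, 0.9, 1.1, 1.5]
beta = 5.0

probs_plain = [sigmoid(z) for z in logits]
probs_scaled = [sigmoid(beta * z) for z in logits]

auc_plain = pairwise_auc(labels, probs_plain)
auc_scaled = pairwise_auc(labels, probs_scaled)
print(auc_plain, auc_scaled)
```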

In [None]:
SCALING = 5

In [None]:
import gc
import psutil
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random
from tqdm import tqdm
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader

# global variables
random.seed(42)
TQDM_INT = 8
EPOCHS = 10
MAX_SEQ = 150
NUM_EMBED = 128
NUM_HEADS = 8
WORKERS = 4
BATCH_SIZE = 1024
VAL_BATCH_SIZE = 4096
n_skill = 13523
LEARNING_RATE = 1e-3

# Comparison

Basically, we compare the baseline SAKT model trained with label smoothing to the ones trained without it (found in many public notebooks, with a public LB score around 0.77).

We then compare the AUC on the validation set after 10 epochs with the same learning rate. For the label-smoothed model, we need to change the initialization of the label array (from `int` to half-precision floats).



Note that we do not change the validation set's labels.
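
To see why the dtype change matters, the following small sketch (toy arrays, assuming only NumPy) writes smoothed labels into an integer buffer and into a `float16` buffer, mirroring the `qa` array in `SAKTDataset`. NumPy truncates silently on assignment into an integer array, so the smoothed values would all collapse to 0.

```python
import numpy as np

smoothed = np.array([0.9, 0.1, 0.9])

# Integer buffer: assignment truncates toward zero without raising an error
int_buffer = np.zeros(3, dtype=int)
int_buffer[:] = smoothed

# Half-precision buffer: keeps the smoothed values up to rounding
half_buffer = np.zeros(3, dtype=np.float16)
half_buffer[:] = smoothed

print(int_buffer)
print(half_buffer)
```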

In [None]:
TRAIN_DTYPES = {'timestamp':'int64', 
         'user_id':'int32' ,
         'content_id':'int16',
         'content_type_id':'int8',
         'answered_correctly':'int8'}
TRAIN_COLS = TRAIN_DTYPES.keys()

In [None]:
%%time

train_df = pd.read_parquet('../input/cv-strategy-in-the-kaggle-environment/cv3_train.parquet')
train_df = train_df[TRAIN_COLS].astype(TRAIN_DTYPES)

train_df = train_df[train_df["content_type_id"] == False]
train_df = train_df.sort_values(['timestamp'], ascending=True).reset_index(drop = True)


In [None]:
SMOOTHING_FACTOR=0.2
train_df[['answered_correctly']] = train_df[['answered_correctly']]*(1-SMOOTHING_FACTOR)\
                                + SMOOTHING_FACTOR/2


train_group = train_df[['user_id', 'content_id', 'answered_correctly']]\
            .groupby('user_id').apply(lambda r: (
            r['content_id'].values,
            r['answered_correctly'].values))

In [None]:
train_df.head(10) # the ac column is now changed

In [None]:
del train_df
gc.collect();

In [None]:
valid_df = pd.read_parquet('../input/cv-strategy-in-the-kaggle-environment/cv3_valid.parquet')
valid_df = valid_df[TRAIN_COLS].astype(TRAIN_DTYPES)

valid_df = valid_df[valid_df["content_type_id"] == False]
valid_group = valid_df[['user_id', 'content_id', 'answered_correctly']]\
        .groupby('user_id').apply(lambda r: (
        r['content_id'].values,
        r['answered_correctly'].values))

del valid_df
gc.collect();

In [None]:
class FFN(nn.Module):
    def __init__(self, state_size=200):
        super(FFN, self).__init__()
        self.state_size = state_size

        self.lr1 = nn.Linear(state_size, state_size)
        self.relu = nn.ReLU()
        self.lr2 = nn.Linear(state_size, state_size)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, x):
        x = self.lr1(x)
        x = self.relu(x)
        x = self.lr2(x)
        return self.dropout(x)

def future_mask(seq_length):
    future_mask = np.triu(np.ones((seq_length, seq_length)), k=1).astype('bool')
    return torch.from_numpy(future_mask)


class SAKTModel(nn.Module):
    def __init__(self, n_skill, 
                       max_seq=MAX_SEQ, 
                       embed_dim=NUM_EMBED, 
                       num_heads=NUM_HEADS,
                       num_layers=1): 
        super(SAKTModel, self).__init__()
        self.n_skill = n_skill
        self.embed_dim = embed_dim

        self.embedding = nn.Embedding(2*n_skill+1, embed_dim)
        self.pos_embedding = nn.Embedding(max_seq-1, embed_dim)
        self.e_embedding = nn.Embedding(n_skill+1, embed_dim)

        self.multi_att = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=num_heads, dropout=0.2)

        self.dropout = nn.Dropout(0.2)
        self.layer_normal = nn.LayerNorm(embed_dim) 

        self.ffn = FFN(embed_dim)
        self.pred = nn.Linear(embed_dim, 1)
    
    def forward(self, x, question_ids):
        device = x.device        
        x = self.embedding(x)
        pos_id = torch.arange(x.size(1)).unsqueeze(0).to(device)

        pos_x = self.pos_embedding(pos_id)
        x = x + pos_x

        e = self.e_embedding(question_ids)

        x = x.permute(1, 0, 2) # x: [bs, s_len, embed] => [s_len, bs, embed]
        e = e.permute(1, 0, 2)
        att_mask = future_mask(x.size(0)).to(device)
        att_output, att_weight = self.multi_att(e, x, x, attn_mask=att_mask)
        att_output = self.layer_normal(att_output + e)
        att_output = att_output.permute(1, 0, 2) # att_output: [s_len, bs, embed] => [bs, s_len, embed]

        x = self.ffn(att_output)
        x = self.layer_normal(x+att_output)
        x = self.pred(x)

        return x.squeeze(-1), att_weight
    
class SAKTDataset(Dataset):
    def __init__(self, group, n_skill, subset="train", max_seq=MAX_SEQ):
        super(SAKTDataset, self).__init__()
        self.max_seq = max_seq
        self.n_skill = n_skill # 13523
        self.samples = group
        self.subset = subset
        
        # self.user_ids = [x for x in group.index]
        self.user_ids = []
        for user_id in group.index:
            '''
            q: question_id
            qa: question answer correct or not
            '''
            q, qa = group[user_id] 
            if len(q) < 2: # 2 interactions minimum
                continue
            self.user_ids.append(user_id) # user_ids indexes

    def __len__(self):
        return len(self.user_ids)

    def __getitem__(self, index):
        user_id = self.user_ids[index] # Pick a user
        q_, qa_ = self.samples[user_id] # Pick full sequence for user
        seq_len = len(q_)

        q = np.zeros(self.max_seq, dtype=int)
        qa = np.zeros(self.max_seq, dtype=np.float16)

        if seq_len >= self.max_seq:
            if self.subset == "train":
#                 if seq_len > self.max_seq:
                if random.random() > 0.1:
                    random_start_index = random.randint(0, seq_len - self.max_seq)
                    # Pick max_seq consecutive questions and answers
                    # starting from a random index
                    end_index = random_start_index + self.max_seq
                    q[:] = q_[random_start_index:end_index] 
                    qa[:] = qa_[random_start_index:end_index] 
                else:
                    q[:] = q_[-self.max_seq:]
                    qa[:] = qa_[-self.max_seq:]
            else:
                q[:] = q_[-self.max_seq:] # Pick the last max_seq questions
                qa[:] = qa_[-self.max_seq:] # Pick the last max_seq answers
        else:
            if random.random()>0.1:
                seq_len = random.randint(2,seq_len)
                q[-seq_len:] = q_[:seq_len]
                qa[-seq_len:] = qa_[:seq_len]
            else:
                q[-seq_len:] = q_ # Pick last N question with zero padding
                qa[-seq_len:] = qa_ # Pick last N answers with zero padding
                
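        target_id = q[1:] # Ignore the first item: positions 1 to max_seq-1
        label = qa[1:] # Ignore the first item: positions 1 to max_seq-1

        x = q[:-1].copy() # positions 0 to max_seq-2
        # Smoothed labels are 0.9/0.1 rather than 1/0, so compare against 0.5;
        # an equality test with 1 would never fire on the smoothed training set
        x += (qa[:-1] >= 0.5) * self.n_skill # y = et + rt x E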

        return x, target_id,  label
    
def train_epoch(model, train_iterator, optim, criterion, device="cuda"):
    model.train()

    train_loss = []
    num_corrects = 0
    num_total = 0
    labels = []
    outs = []

    len_dataset = len(train_iterator)

#     with tqdm(total=len_dataset) as pbar:
    for idx, item in enumerate(train_iterator): 
        x = item[0].to(device).long()
        target_id = item[1].to(device).long()
        label = item[2].to(device).float()

        optim.zero_grad()
        output, atten_weight = model(x, target_id)
        # print(f'X shape: {x.shape}, target_id shape: {target_id.shape}')
        loss = criterion(output, label)
        loss.backward()
        optim.step()
        train_loss.append(loss.item())

        output = output[:, -1]
        label = (label[:, -1] >=0.5).long()
        output = torch.sigmoid(output)
        pred = (output >= 0.5).long()

        num_corrects += (pred == label).sum().item()
        num_total += len(label)

        labels.extend(label.view(-1).data.cpu().numpy())
        outs.extend(output.view(-1).data.cpu().numpy())

#             if idx % TQDM_INT == 0:
#                 pbar.set_description(f'train loss - {train_loss[-1]:.4f}')
#                 pbar.update(TQDM_INT)
    
    acc = num_corrects / num_total
    auc = roc_auc_score(labels, outs)
    loss = np.mean(train_loss)

    return loss, acc, auc


def valid_epoch(model, valid_iterator, criterion, device="cuda", scaling=1):
    model.eval()
    
    valid_loss = []
    num_corrects = 0
    num_total = 0
    labels = []
    outs = []
    len_dataset = len(valid_iterator)
    
    for idx, item in enumerate(valid_iterator): 
        x = item[0].to(device).long()
        target_id = item[1].to(device).long()
        label = item[2].to(device).float()

        with torch.no_grad():
            output, _ = model(x, target_id)
        loss = criterion(output, label)
        valid_loss.append(loss.item())

        output = scaling*output[:, -1] # (BS, 1)
        output = torch.sigmoid(output)
        label = label[:, -1] 
        pred = (output >= 0.5).long()

        num_corrects += (pred == label).sum().item()
        num_total += len(label)

        labels.extend(label.view(-1).data.cpu().numpy())
        outs.extend(output.view(-1).data.cpu().numpy())

    acc = num_corrects / num_total
    auc = roc_auc_score(labels, outs)
    loss = np.mean(valid_loss)

    return loss, acc, auc
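
To make the data flow of `SAKTDataset` and `future_mask` more concrete, here is a toy sketch assuming only NumPy; `n_skill_toy` and the four-step sequence are made-up values. It shows the upper-triangular mask (where `True` marks a position that may not be attended to, which is how `nn.MultiheadAttention` interprets a boolean `attn_mask`) and how a past interaction is encoded as `question_id + correct * n_skill`.

```python
import numpy as np

def future_mask_np(seq_length):
    # True above the diagonal: step i may not look at steps j > i
    return np.triu(np.ones((seq_length, seq_length)), k=1).astype(bool)

mask = future_mask_np(4)

# Toy sequence: question ids and whether each was answered correctly
n_skill_toy = 10
question_ids = np.array([3, 7, 2, 5])
correct = np.array([1, 0, 1, 1])

# Past interactions (all but the last step), shifted by n_skill when correct
interaction_ids = question_ids[:-1] + correct[:-1] * n_skill_toy
# Questions to predict (all but the first step)
target_ids = question_ids[1:]

print(mask)
print(interaction_ids, target_ids)
```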

In [None]:
train_dataset = SAKTDataset(train_group, n_skill, subset="train")
train_loader = DataLoader(train_dataset, 
                              batch_size=BATCH_SIZE, 
                              shuffle=True, 
                              num_workers=WORKERS)

valid_dataset = SAKTDataset(valid_group, n_skill, subset="valid")
val_loader = DataLoader(valid_dataset, 
                              batch_size=VAL_BATCH_SIZE, 
                              shuffle=False, 
                              num_workers=WORKERS)

In [None]:
item = train_dataset.__getitem__(5)

print("x", len(item[0]), item[0], '\n\n')
print("target_id", len(item[1]), item[1] , '\n\n')
print("label", len(item[2]), item[2], '\n\n')

In [None]:
def get_num_params(model):
    model_parameters = filter(lambda p: p.requires_grad, model.parameters())
    no_params = sum([np.prod(p.size()) for p in model_parameters])
    return no_params

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = SAKTModel(n_skill, embed_dim=NUM_EMBED, num_heads=NUM_HEADS)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.BCEWithLogitsLoss()

model.to(device)
criterion.to(device)
num_params = get_num_params(model)
print(f"# heads  : {NUM_HEADS}")
print(f"# embed  : {NUM_EMBED}")
print(f"seq len  : {MAX_SEQ}")
print(f"# params : {num_params}")

## Train the model with a smoothed label

In [None]:
for epoch in range(EPOCHS):
    loss, acc, auc = train_epoch(model, train_loader, optimizer, criterion, device)
    print(f"Epoch - [{epoch + 1}/{EPOCHS}]")
    print(f"Train with label smoothing: loss - {loss:.4f} acc - {acc:.4f} auc - {auc:.4f}")
    val_loss, val_acc, val_auc = valid_epoch(model, val_loader, criterion, device=device)
    print(f"Valid without scaling     : loss - {val_loss:.4f} acc - {val_acc:.4f} auc - {val_auc:.4f}")
    val_loss, val_acc, val_auc = valid_epoch(model, val_loader, criterion, device=device, scaling=SCALING)
    print(f"Valid with a scaling of {SCALING} : loss - {val_loss:.4f} acc - {val_acc:.4f} auc - {val_auc:.4f}")

# Observation

Apparently the original model trained with unsmoothed labels works better (around 0.75 validation AUC); the one trained with smoothed labels performs worse on AUC despite being more accurate, and even the post-processing does not help.

# Final inference with a scaling


Here, for the final test, we multiply the model output by the following scaling factor at inference time. For fairness, this is basically the minimal baseline https://www.kaggle.com/scaomath/riiid-sakt-baseline-minimal-inference with the output multiplied by a scaling factor, so we can compare the LB score (0.772 for the baseline) against this notebook.

In [None]:
SCALING = 3.5

model_file = '../input/riiid-models/sakt_layer_1_head_8_embed_128_seq_150_auc_0.7605.pt'

In [None]:
def load_sakt_model(model_file, device='cuda'):
    # creating the model and load the weights
    configs = []
    model_file_lst = model_file.split('_')
    for c in ['head', 'embed', 'seq', 'layer']:
        idx = model_file_lst.index(c) + 1
        configs.append(int(model_file_lst[idx]))

    # configs.append(int(model_file[model_file.rfind('head')+5]))
    # configs.append(int(model_file[model_file.rfind('embed')+6:model_file.rfind('embed')+9]))
    # configs.append(int(model_file[model_file.rfind('seq')+4:model_file.rfind('seq')+7]))
    conf_dict = dict(n_skill=n_skill,
                     num_heads=configs[0],
                     num_layers=configs[3],
                     embed_dim=configs[1], 
                     max_seq=configs[2], 
                     )

    model = SAKTModel(**conf_dict)
        
    model = model.to(device)
    model.load_state_dict(torch.load(model_file, map_location=device))

    return model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = load_sakt_model(model_file, device=device)

model.to(device)
print(model)

In [None]:
class TestDataset(Dataset):
    def __init__(self, samples, test_df, n_skill, max_seq=MAX_SEQ): 
        super(TestDataset, self).__init__()
        self.samples = samples
        self.user_ids = [x for x in test_df["user_id"].unique()]
        self.test_df = test_df
        self.n_skill = n_skill
        self.max_seq = max_seq

    def __len__(self):
        return self.test_df.shape[0]

    def __getitem__(self, index):
        test_info = self.test_df.iloc[index]

        user_id = test_info["user_id"]
        target_id = test_info["content_id"]

        q = np.zeros(self.max_seq, dtype=int)
        qa = np.zeros(self.max_seq, dtype=int)

        if user_id in self.samples.index:
            q_, qa_ = self.samples[user_id]
            
            seq_len = len(q_)

            if seq_len >= self.max_seq:
                q = q_[-self.max_seq:]
                qa = qa_[-self.max_seq:]
            else:
                q[-seq_len:] = q_
                qa[-seq_len:] = qa_          
        
        x = np.zeros(self.max_seq-1, dtype=int)
        x = q[1:].copy()
        x += (qa[1:] == 1) * self.n_skill
        
        questions = np.append(q[2:], [target_id])
        
        return x, questions

In [None]:
import riiideducation

env = riiideducation.make_env()
iter_test = env.iter_test()

In [None]:
train_df = pd.read_parquet('../input/cv-strategy-in-the-kaggle-environment/cv3_train.parquet')
train_df = train_df[TRAIN_COLS].astype(TRAIN_DTYPES)

train_df = train_df[train_df["content_type_id"] == False]
train_df = train_df.sort_values(['timestamp'], ascending=True).reset_index(drop = True)
group = train_df[['user_id', 'content_id', 'answered_correctly']]\
            .groupby('user_id').apply(lambda r: (
            r['content_id'].values,
            r['answered_correctly'].values))

In [None]:
%%time
model.eval()

prev_test_df = None


for (test_df, sample_prediction_df) in iter_test:
    if (prev_test_df is not None) & (psutil.virtual_memory().percent<95):
#         print(psutil.virtual_memory().percent)
        prev_test_df['answered_correctly'] = eval(test_df['prior_group_answers_correct'].iloc[0])
        prev_test_df = prev_test_df[prev_test_df.content_type_id == False]
        prev_group = prev_test_df[['user_id', 'content_id', 'answered_correctly']].groupby('user_id').apply(lambda r: (
            r['content_id'].values,
            r['answered_correctly'].values))
        for prev_user_id in prev_group.index:
            prev_group_content = prev_group[prev_user_id][0]
            prev_group_ac = prev_group[prev_user_id][1]
            if prev_user_id in group.index:
                group[prev_user_id] = (np.append(group[prev_user_id][0],prev_group_content), 
                                       np.append(group[prev_user_id][1],prev_group_ac))
 
            else:
                group[prev_user_id] = (prev_group_content,prev_group_ac)
            if len(group[prev_user_id][0])>MAX_SEQ:
                new_group_content = group[prev_user_id][0][-MAX_SEQ:]
                new_group_ac = group[prev_user_id][1][-MAX_SEQ:]
                group[prev_user_id] = (new_group_content,new_group_ac)

    prev_test_df = test_df.copy()
    
    test_df = test_df[test_df.content_type_id == False]
                
    test_dataset = TestDataset(group, test_df, n_skill)
    test_dataloader = DataLoader(test_dataset, batch_size=25600, shuffle=False)
    
    outs = []

    for item in test_dataloader:
        x = item[0].to(device).long()
        target_id = item[1].to(device).long()

        with torch.no_grad():
            output, _ = model(x, target_id)
        
        # a scaling is multiplied
        output = torch.sigmoid(SCALING*output)
        output = output[:, -1]
        output = 0.25 + 0.75*output
        outs.extend(output.view(-1).data.cpu().numpy())
        
    test_df['answered_correctly'] =  outs
    
    env.predict(test_df.loc[test_df['content_type_id'] == 0, 
                            ['row_id', 'answered_correctly']])
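
The state update at the top of the loop above can be sketched in isolation. This is a simplified illustration under stated assumptions: it uses a plain dict and Python lists instead of the pandas `group` Series and NumPy arrays, and the helper name `update_history` is hypothetical. New interactions are appended to a user's history, which is then truncated to the most recent `max_seq` entries.

```python
def update_history(history, user_id, new_content, new_correct, max_seq):
    # Append the newly revealed interactions to the user's history
    old_content, old_correct = history.get(user_id, ([], []))
    content = list(old_content) + list(new_content)
    correct = list(old_correct) + list(new_correct)
    # Keep only the most recent max_seq interactions
    history[user_id] = (content[-max_seq:], correct[-max_seq:])
    return history

history = {1: ([10, 11, 12], [1, 0, 1])}
update_history(history, 1, [13, 14], [0, 1], max_seq=4)  # existing user
update_history(history, 2, [20], [1], max_seq=4)         # new user

print(history)
```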

## Sanity check
The average predicted probability is much more skewed toward 1 than in the unscaled version. Even though the right end of the spectrum looks good, there is still a significant portion of unconfident predictions... We can see from the leaderboard that there is no improvement.
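
When reading the histogram, note that the inference loop above passes the scaled sigmoid output through `0.25 + 0.75*output`. The sketch below (the helper name `final_prediction` is an illustrative assumption) shows the range this transform produces: predictions can never fall below 0.25, and a neutral logit of 0 lands at 0.625. Whether this accounts for the skew seen here is an assumption, not something verified in this notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def final_prediction(logit, scaling):
    # Same post-processing as the inference loop: scale, sigmoid, affine map
    return 0.25 + 0.75 * sigmoid(scaling * logit)

lowest = final_prediction(-50.0, scaling=3.5)   # very confident "incorrect"
neutral = final_prediction(0.0, scaling=3.5)    # completely unsure
highest = final_prediction(50.0, scaling=3.5)   # very confident "correct"

print(lowest, neutral, highest)
```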

In [None]:
sns.set()
sub = pd.read_csv('../working/submission.csv')
sub['answered_correctly'].hist();

# Conclusion

- Label smoothing does not improve the AUC metric.
- A simple scaling post-processing does not help either.

In [None]:
# debug:

# test_df, sample_prediction_df = next(iter_test)
# if (prev_test_df is not None):
# #         print(psutil.virtual_memory().percent)
#     prev_test_df['answered_correctly'] = eval(test_df['prior_group_answers_correct'].iloc[0])
#     prev_test_df = prev_test_df[prev_test_df.content_type_id == False]
#     prev_group = prev_test_df[['user_id', 'content_id', 'answered_correctly']].groupby('user_id').apply(lambda r: (
#         r['content_id'].values,
#         r['answered_correctly'].values))
#     for prev_user_id in prev_group.index:
#         prev_group_content = prev_group[prev_user_id][0]
#         prev_group_ac = prev_group[prev_user_id][1]
#         if prev_user_id in group.index:
#             group[prev_user_id] = (np.append(group[prev_user_id][0],prev_group_content), 
#                                    np.append(group[prev_user_id][1],prev_group_ac))

#         else:
#             group[prev_user_id] = (prev_group_content,prev_group_ac)
#         if len(group[prev_user_id][0])>MAX_SEQ:
#             new_group_content = group[prev_user_id][0][-MAX_SEQ:]
#             new_group_ac = group[prev_user_id][1][-MAX_SEQ:]
#             group[prev_user_id] = (new_group_content,new_group_ac)

# prev_test_df = test_df.copy()

# test_df = test_df[test_df.content_type_id == False]

# test_dataset = TestDataset(group, test_df, n_skill)
# test_dataloader = DataLoader(test_dataset, batch_size=25600, shuffle=False)

# outs = []

# for item in test_dataloader:
#     x = item[0].to(device).long()
#     target_id = item[1].to(device).long()

#     with torch.no_grad():
#         output, _ = model(x, target_id)

#     # a scaling is multiplied
#     output = torch.sigmoid(SCALING*output)
#     output = output[:, -1]
#     outs.extend(output.view(-1).data.cpu().numpy())

# test_df['answered_correctly'] =  outs

# env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])