In [None]:
import fasttext, string, collections
import pandas as pd, numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from tqdm import tqdm

# Embeddings for Knowledge Tracing

**Motivation comes from [this paper](https://arxiv.org/pdf/2005.12442.pdf). I want to treat the questions as words in a corpus, so that I can embed them such that similar questions are clustered together in the same way that we use GloVe or Word2Vec vectors to cluster words with similar meaning together.** 

**Facebook's [fastText](https://github.com/facebookresearch/fastText) makes use of character-level information. Each word is represented as a bag of character *n*-grams as well as the word itself. For example, the word `queen` with *n*=3 is composed of `<qu`, `que`, `uee`, `een`, and `en>`, where `<` and `>` mark the start and end of the word. This can allow you to capture the meaning of suffixes/prefixes, which motivates the encoding procedure I carry out in this notebook.**
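
**To make the `queen` example concrete, here is a minimal sketch of how such character *n*-grams can be enumerated. The helper name `char_ngrams` is my own and is not part of the fastText API; the real library also hashes the *n*-grams into buckets, which this sketch does not model.**

```python
def char_ngrams(word, n):
    """Return the character n-grams of `word` after adding the < and > boundary markers."""
    wrapped = f"<{word}>"
    # slide a window of length n over the wrapped word
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(char_ngrams("queen", 3))  # ['<qu', 'que', 'uee', 'een', 'en>']
```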

**fastText can also generate embeddings for words/questions not in our corpus by combining the vectors of the character *n*-grams that make up the unseen token. It essentially builds a representation for out-of-vocabulary (OOV) tokens by stitching together character-level *n*-grams that it has seen during training.**
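
**The OOV idea can be sketched with a toy lookup table. Everything here is an illustration under assumptions: `toy_ngram_vectors` is made-up data, `oov_vector` is a hypothetical helper, and unseen *n*-grams are simply skipped, whereas the real library hashes every *n*-gram to a bucket that always has a vector.**

```python
def char_ngrams(word, n):
    """Character n-grams of `word` with < and > boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def oov_vector(word, ngram_vectors, n, dim):
    """Average the vectors of the n-grams of `word` that exist in `ngram_vectors`."""
    known = [g for g in char_ngrams(word, n) if g in ngram_vectors]
    if not known:
        # nothing recognisable: fall back to a zero vector
        return [0.0] * dim
    return [sum(ngram_vectors[g][k] for g in known) / len(known) for k in range(dim)]


# made-up 2-dimensional vectors for three trigrams "seen during training"
toy_ngram_vectors = {"<qu": [1.0, 0.0], "que": [0.0, 1.0], "en>": [1.0, 1.0]}

# "quen" was never seen as a word, but three of its four trigrams were
vec = oov_vector("quen", toy_ngram_vectors, n=3, dim=2)
print(vec)
```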

In [None]:
#full length of dataset is 101,230,332
N_SAMPLES = 1000000

train = pd.read_pickle('../input/riiid-answer-correctness-prediction-files/train.pkl')
train = train[train['content_type_id'] == 0]

questions = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv').rename({'question_id':'content_id'},
                                                                                        axis=1)

print(train.shape)
train = train.sample(N_SAMPLES, random_state=34)
print(train.shape)
train = pd.merge(train, questions, how='left', on='content_id')
print(train.shape)
print(train['user_id'].nunique())
train.head()

**As per the [data description tab](https://www.kaggle.com/c/riiid-test-answer-prediction/data), the `tags` of the questions found in the `questions.csv` of the competition are sufficient for clustering the questions together. In total, we have the following features to play around with to group questions together. We can of course engineer more clustering features, but for now we stick with these:**

**Train**
* `content_id` - (int16) ID code for the user interaction

**Questions**
* `tags` - one or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together.
* `bundle_id` - code for which questions are served together.
* `part` - the relevant section of the TOEIC test

**Let's quickly explore these features. We begin with `content_id`. Bear in mind I am using a specific subsample of `train.csv` defined by the first block of the notebook.**

In [None]:
print(f"There are {train['content_id'].nunique()} unique content ids in train.csv")
print(f"On average, students get {round(train['answered_correctly'].mean(), 3)} of questions correct")

In [None]:
train['content_id'].value_counts()

**Now we move on to the `tags` feature. Since there are multiple tags per question, I mainly want to know the maximum number of tags a given question can have and how many unique tags there are.**

In [None]:
max_ = 0
iter_ = 0
for i, row in enumerate(train['tags'].values):
    if len(row.split()) > max_: 
        max_ = len(row.split())
        iter_ = i
train.iloc[iter_]

In [None]:
#split each space-separated tag string into a list of tag ids
tags = [row.split() for row in tqdm(train['tags'])]

In [None]:
#check nunique tags
len(set([tag for sublist in tags for tag in sublist]))

**Let's also look at the distribution of `part` and `bundle_id`:**

In [None]:
questions['part'].value_counts(normalize=True)

In [None]:
print(questions['bundle_id'].nunique())
questions['bundle_id'].value_counts()

**Okay, maybe too many unique `bundle_id` values to use for clustering, but we can definitely use `tags` and potentially `part`, although it is dominated by the value `5`.**

**Now, it is important for the embedding strategy I wish to test that each unique question tag is mapped to a single character. You will see why shortly. If you run `string.printable`, it will only give you 100 characters to use, so we are going to need some other characters. Let's go to the Greek alphabet first and maybe some Cyrillic/Latin letters if we need them. If anyone has a cleaner way of generating characters, do tell.**
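
**One possibly cleaner way of generating characters, sketched here under assumptions: walk the Unicode code points in order and keep only upper-case letters, lower-case letters and decimal digits, skipping the reserved label characters. The helper name `letter_alphabet` and the target size of 200 are hypothetical choices for illustration, not values taken from the data.**

```python
import sys
import unicodedata


def letter_alphabet(n_chars, reserved="012"):
    """Collect the first `n_chars` letters/digits in code point order, skipping `reserved`."""
    chars = []
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        if ch in reserved:
            continue
        # Lu = upper-case letter, Ll = lower-case letter, Nd = decimal digit
        if unicodedata.category(ch) in ("Lu", "Ll", "Nd"):
            chars.append(ch)
            if len(chars) == n_chars:
                break
    return "".join(chars)


alphabet = letter_alphabet(200)
print(len(alphabet), alphabet[:20])
```

**Because each code point is visited once, the characters are unique by construction, and restricting to these categories avoids whitespace and combining marks.**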

In [None]:
greek = '\u03A9\u0394\u03BC\u00B0\u0302\u03C0\u03F4\u03BB\u03B1\u03B3\u03B4\u03B5\u03B6\u03B7\u03B8\u03B9\u03BA\u03BD\u03BE\u03C1\u03C2\u03C3\u03C4\u03C5\u03C6\u03C7\u03C8\u03C9\u0391\u0395\u0396\u0397\u0398\u0399\u039A\u039B\u039C\u039D\u039E\u039F\u03A0\u03A1\u03A3\u03A4\u03A5\u03A6\u03A7\u03A8'
greek

In [None]:
cyrillic = '\u0410\u0430\u0411\u0413\u0414\u0415\u0416\u0417\u0418\u0419\u0431\u0432\u0433\u0434\u0435\u0436\u0437\u0438\u0439\u041a\u041b\u041c\u041d\u041e\u041f\u043a\u043b\u043c\u043d\u043e\u043f\u0420\u0421\u0422\u0423\u0424\u0425\u0426\u0427\u0428\u0429'
cyrillic

In [None]:
latin = '\u00A1\u00A2\u00A3\u00A4\u00A5\u00A6\u00A7\u00A8\u00A9\u00B0\u00B1\u00B2\u00B3\u00B4\u00B5\u00B6\u00B7\u00B8\u00B9\u00C0\u00C1\u00C3\u00C5\u00C6\u00C7\u00C8\u00C9\u00D0\u00D1\u00D2\u00D4\u00D5\u00D6\u00D7\u00D8\u00D9'
latin

In [None]:
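#not using 012 because they are reserved for labels
#dict.fromkeys drops repeated characters (\u00B0 appears in both greek and latin) while keeping order,
#so that no two tags end up mapped to the same character
chars = ''.join(dict.fromkeys('3456789' + string.ascii_letters + greek + cyrillic + latin))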
len(chars)

In [None]:
#count every tag once, sorted from most to least frequent
tag_counts = collections.Counter([tag for sublist in tags for tag in sublist]).most_common()
most_common = [word for word, word_count in tag_counts[:len(chars)]]
least_common = [word for word, word_count in tag_counts[len(chars):]]
len(most_common)

In [None]:
char_dict = {}

for tag, char in zip(most_common, chars):
    char_dict[tag] = char

In [None]:
#remove low frequency tags and map the remaining ones to their unique character
def filter_tags(tags):
    return_tag = []
    
    for tag in tags.split():
        #only the most common tags have an entry in char_dict
        if tag in char_dict:
            return_tag.append(char_dict[tag])
            
    return " ".join(return_tag)
     
train['tags'] = train['tags'].apply(filter_tags)

**After mapping each tag to a single character from `chars`, we then 'pad' the tags by repeating them until every question has exactly 6 characters, so that questions with only 1 tag can be compared to those with 6. We can then append the answer (`answered_correctly`) to the padded string so that the model learns it as a sort of word suffix.**
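
**As a toy illustration of the whole encoding, here is a small sketch. The tag ids and the mapping in `toy_char_dict` are made up for the example, and `encode_interaction` is a hypothetical helper rather than the code used in this notebook.**

```python
def encode_interaction(tags, answered_correctly, char_dict, width=6):
    """Map tags to characters, repeat them cyclically up to `width`, then append the answer."""
    chars = [char_dict[t] for t in tags.split() if t in char_dict]
    if not chars:
        # no usable tags: only the answer label is left
        return str(answered_correctly)
    padded = "".join(chars[i % len(chars)] for i in range(width))
    return padded + str(answered_correctly)


# made-up mapping from three tag ids to single characters
toy_char_dict = {"131": "X", "36": "y", "81": "5"}

print(encode_interaction("131 36 81", 1, toy_char_dict))  # Xy5Xy51
print(encode_interaction("131", 0, toy_char_dict))        # XXXXXX0
```

**Every encoded interaction with at least one known tag is 7 characters long: 6 tag characters followed by 1 answer character.**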

In [None]:
#repeat the tag characters cyclically until we have 6 of them,
#e.g. 1 tag -> aaaaaa, 2 tags -> ababab, 4 tags -> abcdab
#questions with 6 tags are joined without spaces so they also form a single word
def pad_tags(tag, width=6):
    parts = tag.split()
    
    #questions whose tags were all filtered out stay empty
    if len(parts) == 0:
        return tag
    
    return "".join(parts[i % len(parts)] for i in range(width))
    
train['tags'] = train['tags'].apply(pad_tags)

**I will also change the `answered_correctly = -1` value to `2`. I want to include this as a category on par with the other answers, so that we can cluster together questions that students did not answer. This makes sense if there is some logic behind the lack of an answer: was the question too hard or poorly worded? I do not know if any such logic underlies these `-1` values, but I will assume (for now) there is. Note that the first cell keeps only rows with `content_type_id == 0`, so if `-1` only ever appears on non-question rows, this replacement will not change anything in this subsample.**

In [None]:
train['answered_correctly'] = train['answered_correctly'].apply(str)
train['answered_correctly'] = train['answered_correctly'].replace({'-1':'2'})
train['interaction_enc'] = train['tags'] + train['answered_correctly']

In [None]:
#debugging step: every encoded interaction should be 6 tag characters + 1 answer character
lengths = train['interaction_enc'].str.len()
bad = lengths[lengths != 7]
if len(bad) > 0: print('stop'); print(bad.index[0])

In [None]:
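#one line per user: that user's encoded interactions separated by spaces
#utf-8 is set explicitly because the encoding uses non-ASCII characters
with open('corpus.txt', 'w', encoding='utf-8') as file:
    for user, user_enc in tqdm(train.groupby('user_id', sort=False)['interaction_enc']):
        line=' '.join(user_enc.values)
        file.write(line+'\n')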

**We can train a skipgram or a continuous bag-of-words (CBOW) model. As a rule of thumb, CBOW tends to learn syntactic relationships between words better, while skipgram tends to be better at capturing semantic relationships. Suppose we have the word `general`. CBOW would tend to fetch morphologically similar words, like `generalize` or `generalized`, whereas skipgram would find words like `universal` and `common`. As we don't really have words here, let's just try both.**
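
**The difference between the two objectives can be sketched by listing the training examples each one would build from a single corpus line. This is a simplified illustration with a fixed window; the function names are my own, and the real training also involves details such as subsampling and negative sampling that are not modelled here.**

```python
def skipgram_pairs(tokens, window):
    """Skipgram: each centre word predicts each of its context words separately."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW: the whole context window jointly predicts the centre word."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples


# one toy "user" line made of three encoded interactions
line = ["Xy5Xy51", "abcabc0", "XXXXXX1"]

print(skipgram_pairs(line, window=1))
print(cbow_examples(line, window=1))
```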

**From the [fastText documentation](https://fasttext.cc/docs/en/python-module.html), we can play with the following parameters. Values in `[]` are default values.**

* **input** - training file path (required)
* **model** - unsupervised fasttext model {cbow, skipgram} [skipgram]
* **lr** - learning rate [0.05]
* **dim** - size of word vectors [100]
* **ws** - size of the context window [5]
* **epoch** - number of epochs [5]
* **minCount** - minimal number of word occurrences [5]
* **minn** - min length of char ngram [3]
* **maxn** - max length of char ngram [6]
* **neg** - number of negatives sampled [5]
* **wordNgrams** - max length of word ngram [1]
* **loss** - loss function {ns, hs, softmax, ova} [ns]
* **bucket** - number of buckets [2000000]
* **thread** - number of threads [number of cpus]
* **lrUpdateRate** - change the rate of updates for the learning rate [100]
* **t** - sampling threshold [0.0001]
* **verbose** - verbose [2]
    
**This is why I formatted the encoded question/answer pairs the way I did: we can just set `minn=1` and `maxn=1` so that every character *n*-gram is a single character, i.e. a single tag or the answer. This lets the model capture similarities between the tags associated with a particular problem and how the user answered it.**

In [None]:
DIM = 200
WINDOW = 6
PERPLEXITY = 40
N_ITER = 2500
SEED = 34

In [None]:
%%time

cbow = fasttext.train_unsupervised('corpus.txt', model='cbow',
                                    dim=DIM, minn=1, maxn=1, ws=WINDOW)
skipgram = fasttext.train_unsupervised('corpus.txt', model='skipgram',
                                    dim=DIM, minn=1, maxn=1, ws=WINDOW)
print(cbow.get_output_matrix().mean())
print(skipgram.get_output_matrix().mean())

In [None]:
#sanity check
cbow.get_subwords('Xy5Xy51')

**Great, seems that the CBOW model is forming character *n*-grams for each unique tag ID and question answer just as we wanted. Now let's visualize these embeddings with t-SNE to see what sort of structure there is.**

In [None]:
# https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne/comments
def tsne_plot(model):
    labels = list(model.words)
    # stack the word vectors into a (n_words, DIM) array for TSNE
    tokens = np.array([model[word] for word in labels])
    
    # note: scikit-learn 1.5 renamed n_iter to max_iter; adjust if your version complains
    tsne_model = TSNE(perplexity=PERPLEXITY, n_components=2, init='pca', 
                      n_iter=N_ITER, n_jobs=-1, random_state=SEED)
    new_values = tsne_model.fit_transform(tokens)

    # first and second t-SNE components
    x = new_values[:, 0]
    y = new_values[:, 1]
        
    plt.figure(figsize=(15, 15)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        
        if i%50==0:
            plt.annotate(labels[i],
                         xy=(x[i], y[i]),
                         xytext=(5, 2),
                         textcoords='offset points',
                         ha='right',
                         va='bottom')
    plt.show()

In [None]:
tsne_plot(cbow)

In [None]:
tsne_plot(skipgram)

In [None]:
question_vectors = list(zip(train['tags'].values, [cbow.get_word_vector(word) for word in train['tags']]))

**Disclaimer: I could be wildly off with my approach here, so I welcome thoughts and suggestions. If you see a mistake or have a question, just let me know.**