We may only use the task's training data and public pre-trained word embeddings, which mostly limits us to pre-transformer-era NLP.

One of the last major comparable competitions was the 2018 [Jigsaw toxicity competition](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/) on Kaggle. Some ideas from the winners:
* [1st place](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/discussion/52557):
  * diverse pre-trained embeddings
    * "Given that >90% of a model’s complexity resides in the embedding layer, we decided to focus on the embedding layer rather than the post-embedding layers"
    * "Since most of the model complexity lay in the pre-trained embeddings, minor architecture changes made very little impact on score. Additional dense layers, gaussian vs. spatial dropout, additional dropout layers at the dense level, attention instead of max pooling, time distributed dense layers, and more barely changed the overall score of the model."
    * using FastText and Glove embeddings
  * model: "our work-horse was two BiGRU layers feeding into two final Dense layers"
  * translations augmentation - not applicable
  * pseudo-labelling, extensive work on cv+stacking - probably only marginally useful here, not worth the effort
* [2nd place](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/discussion/52612)
  * ensemble of RNN, DPCNN and GBM models
  * pre-trained embeddings: FastText, Glove twitter, BPEmb, Word2Vec, LexVec
  * translations augmentation - not applicable
* [3rd place](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/discussion/52762)
  * kitchen sink ensemble of various models from team members
  * fasttext and glove twitter vectors, alone and concatenated
  * models: GRU, LSTM, GRU+CNN, BiLSTM+GRU
  * some char models in the mix instead of usual word/token-based models
* 5th place
  * concatenated glove, fasttext embeddings with subword information
  * main model: 2-level bidirectional gru followed by max pooling and 2 fully-connected layers
  * char-level DPCNN and RNN trained over wordparts
  * 2 layers of stacking

Proposal:
  * bidirectional 2-layer LSTM model with fully connected layers at the head
  * fasttext and glove embeddings (TODO: others), concatenate multiple embeddings together
  * enhance the hidden state output (at the last timestep) with its avg/max pool over time
  * ensemble models across different CV folds and seeds, TODO: checkpoint ensembling, ensemble different archs
  * minor: zero-pad sequences at the beginning (left padding) so that the last hidden state vector is closer to actual tokens; lower learning rate for the embedding layer; TODO: cyclic learning rate scheduler
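
A rough sketch of how the proposal could look in PyTorch. This is not the final model: the class name `BiLSTMClassifier`, the helper `left_pad`, the hidden size, dropout and both learning rates are placeholder assumptions. Only the 600-d embeddings and 150 classes come from the cells below, and the random matrix here merely stands in for the real embeddings.

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    """Hypothetical sketch: 2-layer BiLSTM, [last; mean; max] features, dense head."""

    def __init__(self, embeddings, num_classes, hidden_size=128, dropout=0.2):
        super().__init__()
        # freeze=False: pre-trained vectors stay trainable (at a lower LR, see below)
        self.embedding = nn.Embedding.from_pretrained(
            embeddings, freeze=False, padding_idx=0
        )
        self.lstm = nn.LSTM(
            input_size=embeddings.shape[1],
            hidden_size=hidden_size,
            num_layers=2,
            bidirectional=True,
            batch_first=True,
            dropout=dropout,
        )
        # last state + mean pool + max pool, each 2 * hidden_size wide
        self.head = nn.Sequential(
            nn.Linear(6 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len), left-padded with id 0,
        # assuming every row contains at least one real token
        mask = (token_ids != 0).unsqueeze(-1)              # (batch, seq_len, 1)
        out, _ = self.lstm(self.embedding(token_ids))      # (batch, seq_len, 2*hidden)
        # Output at the last timestep; the backward direction has only seen
        # the final token here, which is one reason to add the pooled features.
        last = out[:, -1, :]
        lengths = mask.sum(dim=1).clamp(min=1)
        mean = (out * mask).sum(dim=1) / lengths           # average over real tokens
        max_ = out.masked_fill(~mask, float('-inf')).max(dim=1).values
        return self.head(torch.cat([last, mean, max_], dim=1))


def left_pad(sequences, pad_id=0):
    """Pad token-id lists at the beginning so real tokens end each row."""
    max_len = max(len(s) for s in sequences)
    return torch.tensor(
        [[pad_id] * (max_len - len(s)) + list(s) for s in sequences]
    )


# Toy usage: a random matrix stands in for the real 600-d embeddings
toy_emb = torch.randn(50, 600)
toy_emb[0] = 0.0
model = BiLSTMClassifier(toy_emb, num_classes=150)
batch = left_pad([[3, 7, 9], [5, 2]])
logits = model(batch)

# Lower learning rate for the embedding layer via parameter groups
other_params = [
    p for n, p in model.named_parameters() if not n.startswith('embedding.')
]
optimizer = torch.optim.Adam(
    [
        {'params': model.embedding.parameters(), 'lr': 1e-4},
        {'params': other_params},
    ],
    lr=1e-3,
)
```

Masking the mean and max pools keeps padding positions from leaking into the pooled features.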

In [1]:
import json
import os
import pickle
import numpy as np
import pandas as pd
import sklearn.model_selection

### Split training data

In [2]:
CV_FOLDS = 10
CV_SEED = 42
CV_PATH_FMT = 'cache/intent/cv{fold}/{split}.json'

In [3]:
train_df = pd.read_json('data/intent/train.json')
eval_df = pd.read_json('data/intent/eval.json')
test_df = pd.read_json('data/intent/test.json')

# TODO: each class should be equally represented in the train and eval folds;
# use sklearn.model_selection.StratifiedKFold instead
cv = sklearn.model_selection.KFold(
    n_splits=CV_FOLDS, shuffle=True, random_state=CV_SEED
)

for fold_idx, (train_idx, eval_idx) in enumerate(cv.split(train_df.index)):
    for split in ['train', 'eval']:
        filename = CV_PATH_FMT.format(fold=fold_idx, split=split)
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        df = train_df.iloc[train_idx if split == 'train' else eval_idx]
        df.to_json(filename, orient='records')
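
A possible way to address the TODO above, shown on a toy frame rather than the real data. `toy_df` and `fold_counts` are illustration-only names; the fold count and seed mirror `CV_FOLDS` and `CV_SEED`. The main difference from `KFold` is that `split()` also takes the labels to stratify on.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection

# Toy stand-in for train_df: 3 classes x 20 samples
toy_df = pd.DataFrame({
    'text': [f'sample {i}' for i in range(60)],
    'intent': np.repeat(['a', 'b', 'c'], 20),
})

cv = sklearn.model_selection.StratifiedKFold(
    n_splits=10, shuffle=True, random_state=42
)

# Collect per-class counts of every eval fold to check the balance
fold_counts = []
for fold_idx, (train_idx, eval_idx) in enumerate(
    cv.split(toy_df, toy_df.intent)
):
    fold_counts.append(toy_df.iloc[eval_idx].intent.value_counts())

for fold_idx, counts in enumerate(fold_counts):
    print(f'fold {fold_idx}: eval samples per class {counts.min()}..{counts.max()}')
```

With balanced classes whose size is divisible by the number of folds, each eval fold should receive the same number of samples from every class.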

In [4]:
intents = sorted(set(train_df.intent))
intent2idx = {s: i for (i, s) in enumerate(intents)}
!mkdir -p cache/intent
with open('cache/intent/intent2idx.json', 'w') as fp:
    json.dump(intent2idx, fp, indent=2)

c = train_df.intent.value_counts()
print(f'{len(intents)} classes, training samples per class: {c.min()}..{c.max()}')

150 classes, training samples per class: 100..100


Training data is perfectly balanced!

### Download and parse pre-trained embeddings

In [5]:
# Fasttext: https://fasttext.cc/docs/en/english-vectors.html
!cd cache/ && wget -nc https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip  # 1.5G
!cd cache/ && unzip -n crawl-300d-2M.vec.zip

# Glove: https://nlp.stanford.edu/projects/glove/
!cd cache/ && wget -nc https://nlp.stanford.edu/data/glove.840B.300d.zip  # 2.0G
!cd cache/ && unzip -n glove.840B.300d.zip

# More ideas
# https://separius.github.io/awesome-sentence-embedding/#word-embeddings esp. lexvec, bpemb

File ‘crawl-300d-2M.vec.zip’ already there; not retrieving.

Archive:  crawl-300d-2M.vec.zip
File ‘glove.840B.300d.zip’ already there; not retrieving.

Archive:  glove.840B.300d.zip


In [6]:
# Parse a text file of fasttext/glove embeddings: one "token v1 v2 ... vN" per line
def parse_embedding_txt(path):
    vectors = {}
    dim = 0
    with open(path, encoding='utf-8') as fp:
        for line in fp:
            fields = line.split()
            if len(fields) == 2:
                continue  # fasttext header: "<num_vectors> <dim>"
            if dim == 0:
                dim = len(fields) - 1  # infer dimensionality from the first vector
            elif dim != len(fields) - 1:
                continue  # skip lines whose field count does not match
            # np.array parses the numeric strings into float32
            vectors[fields[0]] = np.array(fields[1:], dtype=np.float32)
    print('Parsed %d x %dd vectors from %s' % (len(vectors), dim, path))
    return vectors

fasttext_vec = parse_embedding_txt('cache/crawl-300d-2M.vec')
glove_vec = parse_embedding_txt('cache/glove.840B.300d.txt')

Parsed 1999995 x 300d vectors from cache/crawl-300d-2M.vec
Parsed 2195875 x 300d vectors from cache/glove.840B.300d.txt
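
Side note: the dimension check in `parse_embedding_txt` silently drops any line whose token itself contains whitespace, which reportedly affects a handful of glove.840B entries. If those tokens ever matter, one option is to split from the right with a known dimension. The function below is a hypothetical variant run on a toy in-memory file, not what this notebook uses.

```python
import io

import numpy as np


def parse_embedding_lines(fp, dim=300):
    """Hypothetical variant: split from the right so tokens may contain spaces."""
    vectors = {}
    for line in fp:
        # The last `dim` fields are the vector; everything before is the token
        parts = line.rstrip().rsplit(' ', dim)
        if len(parts) != dim + 1:
            continue  # header or malformed line
        vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors


# Toy file: a fasttext-style header, a normal token, and a token with spaces
toy = io.StringIO('2 3\nhello 0.1 0.2 0.3\n. . . 0.4 0.5 0.6\n')
vecs = parse_embedding_lines(toy, dim=3)
print(sorted(vecs))
```

The trade-off is that the dimension has to be known up front instead of being inferred from the first line.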


In [7]:
# Quick fix for one fasttext/glove discrepancy: alias 't to n't for glove
glove_vec["'t"] = glove_vec["n't"]

### Generate vocab and embedding matrix

In [8]:
import torch

from dataset import basic_tokenizer
from utils import Vocab



In [9]:
# For speed, keep only the tokens we would ever encounter (train + eval + test).
# Alt: use the full fasttext+glove vocab; add OOV tokens?
# Use subword vectors for OOV?
vocab = set()
lens = []
for text in list(train_df.text) + list(eval_df.text) + list(test_df.text):
    tokens = basic_tokenizer(text)
    lens.append(len(tokens))
    vocab |= set(tokens)
vocab = Vocab(sorted(vocab))

with open('cache/intent/vocab.pkl', 'wb') as fp:
    pickle.dump(vocab, fp)
with open('cache/intent/vocab.json', 'w') as fp:
    json.dump(vocab.tokens, fp, indent=2)

print(f'vocab size {len(vocab.tokens)}, max len {max(lens)}')

vocab size 6320, max len 29


Concatenate both pre-trained embeddings (600 dims in total), trimmed to our vocab. A token missing from one of the embeddings keeps a random initialization in that 300-d half.

In [10]:
# Random init; halves without a pre-trained vector keep these values
emb = np.random.normal(
    loc=0.0, scale=0.2, size=(len(vocab.tokens), 600)
)
for token in vocab.tokens:
    i = vocab.token_to_id(token)
    if token in fasttext_vec:
        emb[i, :300] = fasttext_vec[token]  # first half: fasttext
    if token in glove_vec:
        emb[i, 300:] = glove_vec[token]  # second half: glove

emb[0, :] = 0.  # zero init the padding token

emb = torch.tensor(emb, dtype=torch.float32)
torch.save(emb, 'cache/intent/embeddings.pt')
print(emb.shape)

torch.Size([6320, 600])
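
It may be worth checking how much of the vocab actually gets a pre-trained vector. A small sketch under assumptions: `build_embedding_matrix`, the toy tables and the `[PAD]` token name are made up for illustration, and a seeded generator replaces the unseeded `np.random.normal` call so the random part is reproducible.

```python
import numpy as np


def build_embedding_matrix(tokens, tables, dim=300, scale=0.2, seed=0):
    """Concatenate several embedding tables; also return a hit mask per table."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(loc=0.0, scale=scale, size=(len(tokens), dim * len(tables)))
    hits = np.zeros((len(tokens), len(tables)), dtype=bool)
    for i, token in enumerate(tokens):
        for j, table in enumerate(tables):
            if token in table:
                emb[i, j * dim:(j + 1) * dim] = table[token]
                hits[i, j] = True
    return emb, hits


# Toy data: 2-d vectors instead of 300-d
tokens = ['[PAD]', 'hello', 'wrold']
toy_fasttext = {'hello': np.ones(2)}
toy_glove = {'hello': np.full(2, 2.0), 'wrold': np.full(2, 3.0)}

emb, hits = build_embedding_matrix(tokens, [toy_fasttext, toy_glove], dim=2)
coverage = hits.mean(axis=0)  # fraction of the vocab found in each table
fully_oov = [t for t, h in zip(tokens, hits) if not h.any()]
print('coverage per table:', coverage)
print('tokens missing from every table:', fully_oov)
```

Tokens that show up in `fully_oov` are the ones relying entirely on random initialization, so they are good candidates for tokenizer fixes like the `'t` alias above.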
