# Preprocessing: FastText Sequences & Embeddings

Based on the tokenized questions and a pre-built word embedding database, build fixed-length (padded) sequences of word indices for each question, as well as a lookup matrix that maps word indices to word vectors.

## Imports

The `pygoose` utility package imports `numpy`, `pandas`, `matplotlib` and a helper `kg` module into the root namespace.

In [54]:
from pygoose import *

In [55]:
from gensim.models.wrappers.fasttext import FastText

Hide all GPUs from TensorFlow so that it does not automatically occupy any GPU RAM.

In [56]:
kg.gpu.cuda_disable_gpus()
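
A common way to get this effect is to set `CUDA_VISIBLE_DEVICES` before TensorFlow is first imported. The sketch below only illustrates that idea; it is an assumption about what a helper like `kg.gpu.cuda_disable_gpus()` might do, not its confirmed implementation, and `hide_gpus` is a hypothetical name.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from libraries imported after this call."""
    # "-1" is not a valid device ID, so CUDA-based libraries see no GPUs.
    # Assumption: this runs before TensorFlow/Keras is imported.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus()
```

The call has to come before the Keras imports, which is why the cell order in this notebook matters.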

In [57]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

## Config

Automatically discover the paths to various data folders and compose the project structure.

In [58]:
project = kg.Project.discover()

Set the maximum allowed size of the embedding matrix and the maximum lengths to which the description and title sequences will be padded or trimmed.

In [59]:
MAX_VOCAB_SIZE = 226268
MAX_SEQUENCE_LENGTH = 30
MAX_SEQUENCE_LENGTH_TITLES = 15
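
To make "padded/trimmed" concrete, here is a minimal pure-Python sketch of what happens to a single sequence. It mirrors the Keras `pad_sequences` defaults as I understand them (zeros are added on the left, and overlong sequences keep their last items); `pad_or_trim` is a hypothetical helper, not part of the notebook.

```python
def pad_or_trim(sequence, maxlen, value=0):
    """Return a list of exactly `maxlen` items (assumes maxlen > 0)."""
    # Keep only the last `maxlen` items of an overlong sequence.
    trimmed = list(sequence[-maxlen:])
    # Left-pad with `value` until the target length is reached.
    padding = [value] * (maxlen - len(trimmed))
    return padding + trimmed


short_example = pad_or_trim([4, 7], maxlen=5)
long_example = pad_or_trim([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_example)  # [0, 0, 0, 4, 7]
print(long_example)   # [3, 4, 5, 6, 7]
```

Index 0 is used for padding, so it must never be assigned to a real word.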

## Load data

Preprocessed and tokenized question titles. Stopwords should generally be kept for neural models, although the files loaded here are the `no_stopwords` variants.

In [60]:
tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_titles_train.pickle')
tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_titles_test.pickle')

Word embedding database queried from the trained FastText model.

In [61]:
embedding_model = FastText.load_word2vec_format(project.aux_dir + 'fasttext_vocab.vec')

In [62]:
EMBEDDING_DIM = len(embedding_model['apple'])
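
For orientation, a `.vec` file in the word2vec text format starts with a header line holding the vocabulary size and the vector dimension, followed by one word and its components per line. The sketch below parses a tiny in-memory sample; `read_vec` and the sample data are hypothetical and only illustrate the format, they are not how gensim loads the file.

```python
import io


def read_vec(handle):
    """Parse word2vec text format into (dimension, {word: vector})."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


sample = io.StringIO("2 3\napple 0.1 0.2 0.3\npear 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample)
print(dim, len(vectors["apple"]))  # 3 3
```

Reading the dimension from the header avoids depending on a particular word such as `'apple'` being in the vocabulary.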

## Build features

### Collect all texts

In [63]:
texts_train = [' '.join(pair) for pair in tokens_train]

In [68]:
texts_test = [' '.join(pair) for pair in tokens_test]

In [69]:
unique_question_texts = list(set(texts_train + texts_test))

### Create question sequences

In [70]:
tokenizer = Tokenizer(
    num_words=MAX_VOCAB_SIZE,
    split=' ',
    lower=True,
    char_level=False,
)

In [71]:
tokenizer.fit_on_texts(unique_question_texts)

In [72]:
sequences_train = tokenizer.texts_to_sequences(texts_train)

In [73]:
sequences_test = tokenizer.texts_to_sequences(texts_test)
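
The following pure-Python sketch approximates what the tokenizer does in the cells above: words are ranked by frequency, indices start at 1 (0 is reserved), and unknown words or words ranked at or beyond `num_words` are dropped. It is a simplification under stated assumptions (whitespace splitting only, no punctuation filtering), and both function names are hypothetical.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count; ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words by indices, dropping unknown or too-rare words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            index = word_index.get(word)
            if index is None:
                continue  # word never seen during fitting
            if num_words is not None and index >= num_words:
                continue  # outside the allowed vocabulary
            seq.append(index)
        sequences.append(seq)
    return sequences


toy_texts = ["how to learn python", "how to cook rice", "learn to cook"]
word_index = build_word_index(toy_texts)
print(word_index)
# {'to': 1, 'how': 2, 'learn': 3, 'cook': 4, 'python': 5, 'rice': 6}
print(texts_to_sequences(["how to cook pasta"], word_index))  # [[2, 1, 4]]
```

Because unknown words are silently dropped, the resulting sequences can be shorter than the token lists they came from.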

### Create embedding lookup matrix

In [74]:
num_words = min(MAX_VOCAB_SIZE, len(tokenizer.word_index))

Allocate the embedding matrix with one extra row for the NULL word (index 0, which is also the padding value).

In [75]:
embedding_matrix = np.zeros((num_words + 1, EMBEDDING_DIM))

Fill the matrix using the vectors for individual words.

In [76]:
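for word, index in progressbar(tokenizer.word_index.items()):
    # Skip indices beyond the matrix (only possible if the vocabulary exceeds MAX_VOCAB_SIZE).
    if index > num_words:
        continue
    if word in embedding_model.vocab:
        embedding_matrix[index] = embedding_model[word]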

100%|██████████| 67535/67535 [00:00<00:00, 292896.87it/s]
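
As a toy illustration of the fill step, the sketch below builds a small lookup matrix and counts how many words actually received a vector. Plain Python lists stand in for the NumPy array, and `build_embedding_matrix`, the toy vocabulary and the toy vectors are all hypothetical.

```python
def build_embedding_matrix(word_index, vectors, dim, max_vocab_size):
    """Return (matrix, found): row i holds the vector of the word with index i."""
    num_words = min(max_vocab_size, len(word_index))
    # Row 0 stays all-zero for the NULL/padding index.
    matrix = [[0.0] * dim for _ in range(num_words + 1)]
    found = 0
    for word, index in word_index.items():
        if index > num_words:
            continue  # index would fall outside the matrix
        if word in vectors:
            matrix[index] = list(vectors[word])
            found += 1
    return matrix, found


toy_index = {"to": 1, "how": 2, "pasta": 3}
toy_vectors = {"to": [0.1, 0.2], "how": [0.3, 0.4]}
matrix, found = build_embedding_matrix(toy_index, toy_vectors, dim=2, max_vocab_size=10)
print(matrix)  # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
print(found)   # 2
```

Words without a vector keep an all-zero row, so tracking `found` against the vocabulary size gives a quick coverage check.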


## Save features

Word embedding lookup matrix.

In [77]:
kg.io.save(embedding_matrix, project.aux_dir + 'fasttext_vocab_embedding_matrix_titles.pickle')

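Padded word index sequences. The execution counts suggest that the Descriptions cells below (In [41] to In [44]) come from an earlier run on the description tokens; in this run only the title tokens were loaded, so only the Titles cells apply.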

### Descriptions

In [41]:
sequences_train = pad_sequences(sequences_train, maxlen=MAX_SEQUENCE_LENGTH)

In [42]:
sequences_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)

In [44]:
kg.io.save(sequences_train, project.preprocessed_data_dir + 'sequences_fasttext_train.pickle')

In [43]:
kg.io.save(sequences_test, project.preprocessed_data_dir + 'sequences_fasttext_test.pickle')

### Titles

In [78]:
sequences_train = pad_sequences(sequences_train, maxlen=MAX_SEQUENCE_LENGTH_TITLES)

In [79]:
sequences_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH_TITLES)

In [80]:
kg.io.save(sequences_train, project.preprocessed_data_dir + 'sequences_fasttext_titles_train.pickle')

In [81]:
kg.io.save(sequences_test, project.preprocessed_data_dir + 'sequences_fasttext_titles_test.pickle')