# Word Embeddings Evaluation and Training

In [1]:
import warnings
warnings.filterwarnings('ignore')

import itertools
import math

import mxnet as mx
import gluonnlp as nlp
import numpy as np

# context = mx.cpu()  # Enable this to run on CPU
context = mx.gpu(0)  # Enable this to run on GPU

## Unknown token handling and subword information

Sometimes we may run into a word for which the embedding does not include a word vector.
In that case the `vocab` object replaces it with a special index for unknown tokens, and the lookup returns an uninformative vector (all zeros in the example below).

In [2]:
pretrained_embedding = nlp.embedding.create('fasttext', source='wiki.simple')
print(pretrained_embedding.idx_to_vec.shape)

(111052, 300)


In [3]:
'unknownword' in pretrained_embedding.idx_to_token

False

In [4]:
pretrained_embedding['unknownword'][:5]


[ 0.  0.  0.  0.  0.]
<NDArray 5 @cpu(0)>

The fastText embeddings loaded above do not include subword information, so the unknown word is mapped to a zero vector.
fastText embeddings support computing vectors for unknown words by falling back to vectors learned for ngram-level features.
In GluonNLP, specifying `load_ngrams=True` when loading pretrained fastText embeddings also loads the ngram-level features and consequently supports meaningful embeddings for unknown words.

In [5]:
pretrained_embedding = nlp.embedding.create('fasttext', source='wiki.en', load_ngrams=True)

In [6]:
pretrained_embedding['unknownword'][:5]


[ 0.08307433  0.06700231 -0.25606179  0.16879943 -0.02845737]
<NDArray 5 @cpu(0)>

Some embedding models such as the FastText model support computing word vectors for unknown words by taking into account their subword units.
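
As a rough illustration of this fallback, the following sketch builds a vector for an unknown word from the vectors of its character n-grams. It is a simplified model of the idea, not the library's implementation: the helper names `char_ngrams` and `lookup`, the toy tables, and the choice of averaging are assumptions made for this example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', where < and > mark the word boundaries."""
    marked = '<' + word + '>'
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


def lookup(word, word_vectors, ngram_vectors, dim):
    """Stored vector if known, else mean of known n-gram vectors, else zeros."""
    if word in word_vectors:
        return word_vectors[word]
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Toy 2-dimensional tables with made-up values (illustration only).
word_vectors = {'word': [1.0, 0.0]}
ngram_vectors = {'<un': [0.0, 2.0], 'wor': [2.0, 0.0], 'rd>': [4.0, 4.0]}

print(lookup('word', word_vectors, ngram_vectors, dim=2))         # stored vector
print(lookup('unknownword', word_vectors, ngram_vectors, dim=2))  # mean of n-gram vectors
print(lookup('xyz', word_vectors, ngram_vectors, dim=2))          # no known n-gram: zeros
```

Without the n-gram table, every out-of-vocabulary word would end up in the last branch and receive the same zero vector.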

- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.

## Quantifying Word Embeddings Quality - Evaluation

The previous example introduced how to load pre-trained word embeddings
from a set of sources included in the Gluon NLP toolkit. It showed how to
use the word vectors to find the words most similar to a given word or
to solve the analogy task.

Besides manually investigating similar words or the predicted analogous words,
we can use word embedding evaluation datasets to quantify the
evaluation.

Datasets for the *similarity* task come with a list of word pairs together with
a human similarity judgement. The task is to recover the order of most-similar
to least-similar pairs.

Datasets for the *analogy* tasks supply a set of analogy quadruples of the form
‘a : b :: c : d’ and the task is to find the correct ‘d’ in as many
cases as possible given just ‘a’, ‘b’, ‘c’. For instance, “man : woman :: son :
daughter” is an analogy.

The Gluon NLP toolkit includes a set of popular *similarity* and *analogy* task
datasets as well as helpers for computing the evaluation scores. Here we show
how to make use of them.


### Word Similarity and Relatedness Task

Word embeddings should capture the relationship between words in natural language.
In the Word Similarity and Relatedness Task, word embeddings are evaluated by comparing word similarity scores computed from a pair of words with human labels for the similarity or relatedness of the pair.

`gluonnlp` includes a number of common datasets for the Word Similarity and Relatedness Task. The included datasets are listed in the [API documentation](http://gluon-nlp.mxnet.io/api/data.html#word-embedding-evaluation-datasets). We use several of them in the evaluation example below.

We first show a few samples from the WordSim353 dataset to get an overall feeling for the dataset structure.

In [7]:
wordsim353 = nlp.data.WordSim353()
for i in range(0, len(wordsim353), 30):
    print("{:<15}{:<15}{}".format(*wordsim353[i]))

drink          mouth          5.96
nature         environment    8.31
type           kind           8.97
wood           forest         7.73
Jerusalem      Palestinian    7.65
day            summer         3.94
lobster        wine           5.7
architecture   century        3.78
shower         flood          6.03
psychology     Freud          8.21
money          dollar         8.42
impartiality   interest       5.16


In [8]:
counter = nlp.data.count_tokens(wordsim353.transform(lambda e: e[0]))
counter.update(wordsim353.transform(lambda e: e[1]))

In [9]:
vocab_wordsim353 = nlp.Vocab(counter)
vocab_wordsim353.set_embedding(pretrained_embedding)

In [10]:
print(len(vocab_wordsim353))

441


The Gluon NLP toolkit includes a `WordEmbeddingSimilarity` block, which predicts similarity scores for word pairs given an embedding matrix.

In [11]:
evaluator = nlp.embedding.evaluation.WordEmbeddingSimilarity(
    idx_to_vec=vocab_wordsim353.embedding.idx_to_vec,
    similarity_function="CosineSimilarity")

In [12]:
evaluator.initialize(ctx=context)
evaluator.hybridize()

In [13]:
wordsim353_coded = wordsim353.transform(
    lambda e: (vocab_wordsim353[e[0]], vocab_wordsim353[e[1]], e[2]))
wordsim353_nd = mx.nd.array(wordsim353_coded, ctx=context)

The similarities can be predicted by passing the two arrays of word indices through the evaluator. The *i*-th word in the first array is compared with the *i*-th word in the second array.

In [14]:
pred_similarity = evaluator(wordsim353_nd[:, 0], wordsim353_nd[:, 1])

In [15]:
for (w1, w2, s), ps in zip(wordsim353[:10], pred_similarity[:10].asnumpy()):
    print("{:<15}{:<15}{:<10.2f}{:.2f}".format(w1, w2, s/10, ps))

drink          mouth          0.60      0.30
start          match          0.45      0.25
development    issue          0.40      0.14
volunteer      motto          0.26      0.29
money          laundering     0.57      0.59
energy         secretary      0.18      0.20
midday         noon           0.93      0.70
attempt        peace          0.42      0.20
psychology     science        0.67      0.57
professor      cucumber       0.03      0.10
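
What the evaluator computes for each pair can be sketched in plain Python. This is only an illustration of pairwise cosine similarity over rows of an embedding matrix. The toy matrix and the names `first_words` and `second_words` are made up for this example, and the sketch assumes that no vector is all zeros.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embedding matrix: row i holds the vector of the word with index i.
idx_to_vec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# The i-th entry of first_words is compared with the i-th entry of second_words.
first_words = [0, 0, 2]
second_words = [1, 2, 2]

similarities = [
    cosine_similarity(idx_to_vec[i], idx_to_vec[j])
    for i, j in zip(first_words, second_words)
]
print([round(s, 3) for s in similarities])
```

Orthogonal vectors score 0, identical directions score 1, and the middle pair lies in between.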


We can evaluate the predicted similarities, and thereby the word embeddings, by computing the Spearman rank correlation between the predicted similarities and the ground-truth human similarity scores from the dataset:

In [16]:
from scipy import stats
sr = stats.spearmanr(pred_similarity.asnumpy(), wordsim353_nd[:, 2].asnumpy())
print('Spearman rank correlation', sr.correlation.round(3))

Spearman rank correlation 0.68
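
To make the metric less of a black box, here is a minimal standard-library sketch of the Spearman rank correlation, computed as the Pearson correlation of the ranks. The function names are hypothetical, and the inputs are just the first five rows of the table printed above, so the result says nothing about the full dataset.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


human = [0.60, 0.45, 0.40, 0.26, 0.57]      # rescaled human scores
predicted = [0.30, 0.25, 0.14, 0.29, 0.59]  # cosine similarities
print(round(spearman(predicted, human), 3))
```

Because only the ranks enter the computation, the different scales of human scores and cosine similarities do not matter.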


## Training word embeddings

Besides making it easy to work with pre-trained word embeddings, `gluonnlp`
also provides everything needed to train your own embeddings. Datasets as well
as model definitions are included.

### Training data

We first load the Text8 corpus from the [Large Text Compression
Benchmark](http://mattmahoney.net/dc/textdata.html) which includes the first
100 MB of cleaned text from the English Wikipedia. We follow the common practice
of splitting every 10,000 tokens to obtain "sentences" for embedding training.

In [17]:
dataset = nlp.data.Text8(segment='train')
print('# sentences:', len(dataset))
for sentence in dataset[:3]:
    print('# tokens:', len(sentence), sentence[:5])

# sentences: 1701
# tokens: 10000 ['anarchism', 'originated', 'as', 'a', 'term']
# tokens: 10000 ['reciprocity', 'qualitative', 'impairments', 'in', 'communication']
# tokens: 10000 ['with', 'the', 'aegis', 'of', 'zeus']


We then build a vocabulary of all the tokens in the dataset that occur at
least 5 times (`min_freq=5`), drop the remaining tokens, and replace the words with their indices.

In [18]:
counter = nlp.data.count_tokens(itertools.chain.from_iterable(dataset))
vocab_training = nlp.Vocab(
    counter,
    unknown_token=None,
    padding_token=None,
    bos_token=None,
    eos_token=None,
    min_freq=5)


def code(s):
    return [vocab_training[t] for t in s if t in vocab_training]


coded_dataset = dataset.transform(code, lazy=False)
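
The effect of the frequency threshold and of the `code` function can be reproduced on a toy corpus with the standard library alone. This sketch assumes that indices are assigned in order of descending frequency, which is consistent with the indices printed in this notebook. The alphabetical tie-breaking is an arbitrary choice made here for determinism.

```python
from collections import Counter

# Toy corpus (made up for this example).
sentences = [['a', 'b', 'a', 'c'], ['a', 'b', 'd']]
counter = Counter(token for sentence in sentences for token in sentence)

min_freq = 2
# Keep tokens seen at least min_freq times, most frequent first.
idx_to_token = sorted(
    (t for t, c in counter.items() if c >= min_freq),
    key=lambda t: (-counter[t], t))
token_to_idx = {t: i for i, t in enumerate(idx_to_token)}


def code(sentence):
    """Map tokens to indices and silently drop out-of-vocabulary tokens."""
    return [token_to_idx[t] for t in sentence if t in token_to_idx]


coded = [code(s) for s in sentences]
print(idx_to_token)
print(coded)
```

Since no unknown token is defined, rare tokens simply disappear, which is why the coded sentences can be shorter than the original ones.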

### Sampling distribution

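- Subsample frequent words
- Each occurrence of a word $w_i$ is discarded with probability $$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $f(w_i)$ is the relative frequency with which the word $w_i$ occurs in the dataset and $t$ is the subsampling constant (here $10^{-5}$).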

Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.

In [19]:
subsampling_constant = 1e-5
total_count = sum(counter[w] for w in vocab_training.idx_to_token)
idx_to_pdiscard = [
    1 - math.sqrt(subsampling_constant / (counter[w] / total_count))
    for w in vocab_training.idx_to_token
]

In [20]:
def subsample(s):
    return [
        t for t, r in zip(s, np.random.uniform(0, 1, size=len(s)))
        if r > idx_to_pdiscard[t]
    ]

In [21]:
subsampled_dataset = coded_dataset.transform(subsample, lazy=False)

print('# tokens for sentences in coded_dataset:')
for i in range(3):
    print(len(coded_dataset[i]), coded_dataset[i][:5])

print('\n# tokens for sentences in subsampled_dataset:')
for i in range(3):
    print(len(subsampled_dataset[i]), subsampled_dataset[i][:5])

# tokens for sentences in coded_dataset:
9895 [5233, 3083, 11, 5, 194]
9858 [18214, 17356, 36672, 4, 1753]
9926 [23, 0, 19754, 1, 4829]

# tokens for sentences in subsampled_dataset:
2955 [5233, 3133, 741, 10619, 27497]
2824 [18214, 17356, 36672, 1753, 13001]
2751 [19754, 1799, 7069, 950, 8712]
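
A few worked values may help to read `idx_to_pdiscard`. The sketch below evaluates the same expression as the cell above for some hypothetical relative frequencies. The function name and the chosen frequencies are illustrative only.

```python
import math

subsampling_constant = 1e-5


def discard_probability(relative_frequency, t=subsampling_constant):
    """Same expression as in idx_to_pdiscard: 1 - sqrt(t / f)."""
    return 1 - math.sqrt(t / relative_frequency)


for f in (5e-2, 1e-3, 1e-5, 1e-6):
    print('f = {:<8g} p_discard = {:.4f}'.format(f, discard_probability(f)))
```

For words rarer than the subsampling constant the expression becomes negative. Since `subsample` keeps a token whenever the uniform random number exceeds this value, such words are always kept, while very frequent words are dropped most of the time.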


### Handling subword features

`gluonnlp` provides the concept of a SubwordFunction, which maps words to a list of indices representing their subwords.
Possible SubwordFunctions include mapping a word to the sequence of its characters/bytes or to hashes of all its ngrams.

FastText models use a hash function to map each ngram of a word to a number in range `[0, num_subwords)`.
We include the same hash function.
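
The idea can be sketched with the standard library. Note the assumptions: `zlib.crc32` is only a stand-in for the hash used by fastText and GluonNLP, so the resulting indices will differ from the library's, and the helper names are hypothetical.

```python
import zlib


def char_ngrams(word, ngrams=(3, 4, 5, 6)):
    """All character n-grams of '<word>' for the given n-gram lengths."""
    marked = '<' + word + '>'
    return [marked[i:i + n]
            for n in ngrams
            for i in range(len(marked) - n + 1)]


def ngram_hashes(word, num_subwords=500000):
    """Map every n-gram to a bucket in [0, num_subwords)."""
    # crc32 is a stand-in hash, NOT the one used by the library.
    return [zlib.crc32(g.encode('utf-8')) % num_subwords
            for g in char_ngrams(word)]


for word in ['the', 'of', 'and']:
    print('<' + word + '>', char_ngrams(word), ngram_hashes(word), sep='\t')
```

With boundary markers and n-gram lengths 3 to 6, `the`, `of`, and `and` yield 6, 3, and 6 n-grams respectively, which agrees with the number of subword indices the library reports for these words in this notebook. Because many n-grams share a limited number of buckets, collisions are possible by design.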

In [22]:
subword_function = nlp.vocab.create_subword_function(
    'NGramHashes', ngrams=[3, 4, 5, 6], num_subwords=500000)

idx_to_subwordidxs = subword_function(vocab_training.idx_to_token)
for word, subwords in zip(vocab_training.idx_to_token[:3], idx_to_subwordidxs[:3]):
    print('<' + word + '>', subwords, sep='\t')

<the>	[151151, 409726, 148960, 361980, 60934, 316280]
<of>	[497102, 164528, 228930]
<and>	[378080, 235020, 30390, 395046, 119624, 125443]


As words are of varying length, we have to pad the lists of subwords to obtain a batch. To distinguish padded values from valid subword indices we use a mask.
We first pad the subword arrays with `-1`, compute the mask and change the `-1` entries to some valid subword index (here `0`).
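
The same padding and masking logic, written without any array library, looks roughly as follows. The helper `pad_with_mask` is hypothetical. The index lists are the subword indices that this notebook prints for `the`, `of`, and `and`.

```python
def pad_with_mask(rows, fill_value=0):
    """Pad rows to equal length; mask is 1 for real entries, 0 for padding."""
    width = max(len(row) for row in rows)
    padded = [list(row) + [fill_value] * (width - len(row)) for row in rows]
    mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    return padded, mask


rows = [
    [151151, 409726, 148960, 361980, 60934, 316280],
    [497102, 164528, 228930],
    [378080, 235020, 30390, 395046, 119624, 125443],
]
padded, mask = pad_with_mask(rows)
for padded_row, mask_row in zip(padded, mask):
    print(padded_row, mask_row)
```

The fill value `0` is itself a valid subword index, which is exactly why the mask is needed: it is the only way to tell padding apart from a real subword that happens to hash to bucket 0.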

In [23]:
subword_padding = nlp.data.batchify.Pad(pad_val=-1)

subwords = subword_padding(idx_to_subwordidxs[:3])
subwords_mask = subwords != -1
subwords += subwords == -1  # Adds 1 where the value is -1, turning padding into the valid index 0
print(subwords)
print(subwords_mask)


[[ 151151.  409726.  148960.  361980.   60934.  316280.]
 [ 497102.  164528.  228930.       0.       0.       0.]
 [ 378080.  235020.   30390.  395046.  119624.  125443.]]
<NDArray 3x6 @cpu_shared(0)>

[[ 1.  1.  1.  1.  1.  1.]
 [ 1.  1.  1.  0.  0.  0.]
 [ 1.  1.  1.  1.  1.  1.]]
<NDArray 3x6 @cpu(0)>


### Model

`gluonnlp` provides model definitions for popular embedding models as Gluon Blocks.
Here we show how to train them with the Skip-Gram objective, a
simple and popular embedding training objective. It was introduced
by "Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient
estimation of word representations in vector space. ICLR Workshop, 2013."

The Skip-Gram objective trains word vectors such that the word vector of a word
at some position in a sentence can best predict the surrounding words. We call
these words *center* and *context* words.

<img src="http://blog.aylien.com/wp-content/uploads/2016/10/skip-gram.png" width="300">

Skip-Gram and picture from "Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. Efficient estimation of word representations in vector space. ICLR
Workshop, 2013."


For the Skip-Gram objective, we initialize two embedding models: `embedding`
and `embedding_out`. `embedding` is used to look up embeddings for the *center*
words. `embedding_out` is used for the *context* words.

The weights of `embedding` are the final word embedding weights.
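
To make the roles of *center* and *context* words concrete, the following sketch enumerates the training pairs for a fixed window. The function name and the `window` parameter are hypothetical. Actual training code may, for instance, sample the window size randomly, which this sketch does not attempt.

```python
def skipgram_pairs(sentence, window=2):
    """All (center, context) pairs within `window` positions of each other."""
    pairs = []
    for pos, center in enumerate(sentence):
        first = max(0, pos - window)
        last = min(len(sentence), pos + window + 1)
        for context_pos in range(first, last):
            if context_pos != pos:
                pairs.append((center, sentence[context_pos]))
    return pairs


sentence = ['anarchism', 'originated', 'as', 'a', 'term']
for center, context_word in skipgram_pairs(sentence, window=1):
    print(center, '->', context_word)
```

In the setup above, the first element of each pair would be looked up in `embedding` and the second one in `embedding_out`.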

In [24]:
emsize = 300

In [25]:
embedding = nlp.model.train.FasttextEmbeddingModel(
    token_to_idx=vocab_training.token_to_idx,
    subword_function=subword_function,
    embedding_size=emsize,
    weight_initializer=mx.init.Uniform(scale=1 / emsize))
embedding_out = nlp.model.train.SimpleEmbeddingModel(
    token_to_idx=vocab_training.token_to_idx,
    embedding_size=emsize,
    weight_initializer=mx.init.Uniform(scale=1 / emsize))

In [26]:
for e in [embedding, embedding_out]:
    e.initialize(ctx=context)
    e.hybridize(static_alloc=True)

In [27]:
params = embedding.collect_params()
params.update(embedding_out.collect_params())
trainer = mx.gluon.Trainer(params, 'adagrad', dict(learning_rate=0.05))

In [28]:
weights = mx.nd.array([counter[w] for w in vocab_training.idx_to_token])**0.75
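
The helper in `utils` is not shown here, so the following is an assumption: the `weights` are presumably used as an unnormalized distribution for drawing negative samples, following the unigram distribution raised to the power 3/4 proposed by Mikolov et al. The sketch uses made-up counts to show what the exponent does.

```python
# Made-up counts for one very frequent and one rare word.
counts = {'the': 1000, 'anarchism': 10}
power = 0.75

raw_total = sum(counts.values())
raw = {w: c / raw_total for w, c in counts.items()}

smoothed_weights = {w: c ** power for w, c in counts.items()}
smoothed_total = sum(smoothed_weights.values())
smoothed = {w: v / smoothed_total for w, v in smoothed_weights.items()}

for w in counts:
    print('{:<10} raw {:.4f}  smoothed {:.4f}'.format(w, raw[w], smoothed[w]))
```

Raising the counts to a power below 1 flattens the distribution: the rare word is sampled more often than its raw frequency suggests, and the frequent word slightly less often.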

In [29]:
import utils
utils.train_fasttext_embedding(1, embedding, embedding_out, subsampled_dataset,
                         weights, idx_to_subwordidxs, context, trainer)

epoch 1, time 2.43s, iteration 0, throughput=0.84K wps
epoch 1, time 20.99s, iteration 500, throughput=48.88K wps
epoch 1, time 38.32s, iteration 1000, throughput=53.50K wps
epoch 1, time 54.56s, iteration 1500, throughput=56.35K wps
epoch 1, time 70.92s, iteration 2000, throughput=57.79K wps
epoch 1, time 75.18s, train loss 0.28



### Evaluation of trained embedding

We create a new `TokenEmbedding` object and set the embedding vectors for the words we care about for evaluation.

In [30]:
token_embedding = nlp.embedding.TokenEmbedding(unknown_token=None, allow_extend=True)
token_embedding[vocab_wordsim353.idx_to_token] = embedding[vocab_wordsim353.idx_to_token]

vocab_wordsim353.set_embedding(token_embedding)

In [31]:
evaluator = nlp.embedding.evaluation.WordEmbeddingSimilarity(
    idx_to_vec=vocab_wordsim353.embedding.idx_to_vec,
    similarity_function="CosineSimilarity")
evaluator.initialize(ctx=context)
evaluator.hybridize()

In [32]:
pred_similarity = evaluator(wordsim353_nd[:, 0], wordsim353_nd[:, 1])
sr = stats.spearmanr(pred_similarity.asnumpy(), wordsim353_nd[:, 2].asnumpy())
print('Spearman rank correlation', sr.correlation.round(3))

Spearman rank correlation 0.404


# Practice - Quantifying Analogy Evaluation

## Background

In the Word Analogy Task word embeddings are evaluated by inferring an analogous word `D`, which is related to a given word `C` in the same way as a given pair of words `A, B` are related.

`gluonnlp` includes a number of common datasets for the Word Analogy Task. The included datasets are listed in the [API documentation](http://gluon-nlp.mxnet.io/api/data.html#word-embedding-evaluation-datasets). In this notebook we use the GoogleAnalogyTestSet dataset.
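
One common way to infer `D` is the multiplicative objective usually called 3CosMul: pick the word whose vector is similar to `B` and `C` but dissimilar to `A`. The sketch below is a small standard-library version of that idea under stated assumptions: the toy vectors, the function names, the shift of cosine values to `[0, 1]`, and the value of `eps` are choices made for this example and may differ from the library's implementation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))


def three_cos_mul(a, b, c, vectors, eps=1e-3, exclude_question_words=True):
    """Return the word d maximizing sim(d, b) * sim(d, c) / (sim(d, a) + eps)."""
    def sim(u, v):
        return (cosine(u, v) + 1) / 2  # shift cosine from [-1, 1] to [0, 1]

    best_word, best_score = None, float('-inf')
    for word, vec in vectors.items():
        if exclude_question_words and word in (a, b, c):
            continue
        score = (sim(vec, vectors[b]) * sim(vec, vectors[c])
                 / (sim(vec, vectors[a]) + eps))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Toy 3-dimensional vectors (made-up values for illustration only).
vectors = {
    'man': [1.0, 0.0, 0.0], 'woman': [0.0, 1.0, 0.0],
    'son': [1.0, 0.0, 1.0], 'daughter': [0.0, 1.0, 1.0],
    'boy': [1.0, 0.0, 0.9], 'city': [0.1, 0.1, -1.0],
}
print(three_cos_mul('man', 'woman', 'son', vectors))
```

Excluding the three question words matters in practice, because the vector of `B` or `C` is often the nearest candidate otherwise.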


In [33]:
google_analogy = nlp.data.GoogleAnalogyTestSet()

We first demonstrate the structure of the dataset by printing a few examples.

In [34]:
for i in range(0, len(google_analogy), 1000):
    print("{:<20}{:<20}{:<20}{:<20}".format(*google_analogy[i]))

athens              greece              baghdad             iraq                
baku                azerbaijan          dushanbe            tajikistan          
dublin              ireland             kathmandu           nepal               
lusaka              zambia              tehran              iran                
rome                italy               windhoek            namibia             
zagreb              croatia             astana              kazakhstan          
philadelphia        pennsylvania        tampa               florida             
wichita             kansas              shreveport          louisiana           
shreveport          louisiana           oxnard              california          
complete            completely          lucky               luckily             
comfortable         uncomfortable       clear               unclear             
good                better              high                higher              
young               younger 

## Task

- Create a vocabulary containing
  - the (most frequent) 300000 words of the pretrained embedding
  - and all words of the GoogleAnalogyTestSet
- Attach the `pretrained_embedding` to the vocabulary to obtain vectors for all words
- Then run the evaluation code below

In [35]:
# counter = nlp.data.utils.Counter(...)   # First 300000 entries of pretrained_embedding.idx_to_token
# counter.update(itertools.chain.from_iterable(google_analogy))

# vocab_google_analogy = nlp.Vocab(...)
# vocab_google_analogy.set_embedding(pretrained_embedding)

In [36]:
# google_analogy_batches = mx.gluon.data.DataLoader(
#     google_analogy.transform(vocab_google_analogy.to_indices),
#     batch_size=256)

In [37]:
# evaluator = nlp.embedding.evaluation.WordEmbeddingAnalogy(
#     idx_to_vec=vocab_google_analogy.embedding.idx_to_vec,
#     exclude_question_words=True,
#     analogy_function="ThreeCosMul")
# evaluator.initialize(ctx=context)
# evaluator.hybridize()

In [38]:
# acc = mx.metric.Accuracy()

# for batch in google_analogy_batches:
#     batch = batch.as_in_context(context)
#     pred_idxs = evaluator(batch[:, 0], batch[:, 1], batch[:, 2])
#     acc.update(pred_idxs[:, 0], batch[:, 3])

# print('Accuracy', acc.get()[1].round(3))