# Using Pre-trained Word Embeddings

A word embedding is a numerical representation of a word: a dense vector that a model can compute with.

How do we learn such representations? One answer is the distributional hypothesis:

*"You shall know a word by the company it keeps."* - John Rupert Firth

What does the word **Tezgüino** mean?

* A bottle of *Tezgüino* is on the table
* *Tezgüino* makes you drunk
* Everybody likes *Tezgüino*


How about now?
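The idea that context reveals meaning can be sketched in a few lines. The following is a minimal illustration, assuming a made-up six-sentence corpus and a hypothetical helper `context_counts`; it simply counts neighboring words and is not how Word2Vec, FastText, or GloVe are actually trained.

```python
from collections import Counter

# Toy corpus, invented for illustration: "tezgüino" and "wine" share contexts,
# while "bread" appears in different ones.
corpus = [
    "a bottle of tezgüino is on the table",
    "tezgüino makes you drunk",
    "a bottle of wine is on the shelf",
    "wine makes you drunk",
    "fresh bread is baked every morning",
    "bread tastes good with butter",
]

def context_counts(target, sentences, window=2):
    """Count the words that occur within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

def overlap(a, b):
    """Number of context occurrences shared by two count vectors."""
    return sum((a & b).values())

tezguino = context_counts("tezgüino", corpus)
wine = context_counts("wine", corpus)
bread = context_counts("bread", corpus)

# The unknown word shares far more context with "wine" than with "bread".
print(overlap(tezguino, wine), overlap(tezguino, bread))
```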

## Examples

* Word2Vec
* FastText
* GloVe

## Let's see these in practice

In [None]:
!pip install gluonnlp

In [None]:
import mxnet as mx
from mxnet import gluon
from mxnet import nd
import gluonnlp as nlp

import re
import io
import time
import multiprocessing as mp
import numpy as np

In [None]:
ctx = mx.gpu(0) if mx.test_utils.list_gpus() else mx.cpu()

In [None]:
text = " hello world \n hello nice world \n hi world \n"

We need a tokenizer to split this string into tokens before we can count them.

In [None]:
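def simple_tokenize(source_str, token_delim=' ', seq_delim='\n'):
    # Split on either delimiter and drop the empty strings left by repeated delimiters.
    return filter(None, re.split(token_delim + '|' + seq_delim, source_str))

# Count how often each token occurs.
counter = nlp.data.count_tokens(simple_tokenize(text))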

In [None]:
counter
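To see what the tokenizer and counter produce without installing anything, here is a standard-library sketch of the same two steps. It assumes the GluonNLP counter behaves like `collections.Counter`; the variable names `tokens` and `counts` are only for illustration.

```python
import re
from collections import Counter

text = " hello world \n hello nice world \n hi world \n"

# Split on spaces and newlines, then drop the empty strings that re.split
# produces for leading, trailing, and consecutive delimiters.
tokens = [t for t in re.split(r" |\n", text) if t]
counts = Counter(tokens)

print(tokens)
print(counts.most_common())
```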

In [None]:
vocab = nlp.Vocab(counter)

In [None]:
vocab.idx_to_token
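A vocabulary is essentially a two-way mapping between tokens and integer indices. The sketch below rebuilds such a mapping by hand, assuming four reserved special tokens placed first and the remaining tokens ordered by descending frequency (ties broken alphabetically here as an assumption). The names `reserved`, `token_to_idx`, and `lookup` are hypothetical.

```python
from collections import Counter

counts = Counter({"world": 3, "hello": 2, "nice": 1, "hi": 1})

# Assumed reserved tokens; they occupy the first indices.
reserved = ["<unk>", "<pad>", "<bos>", "<eos>"]

# Most frequent tokens first; ties are broken alphabetically in this sketch.
ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))

idx_to_token = reserved + ordered
token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}

def lookup(token):
    """Return the index of `token`, falling back to the unknown-token index."""
    return token_to_idx.get(token, token_to_idx["<unk>"])

print(idx_to_token)
print(lookup("world"), lookup("beautiful"))
```

Under this layout, real words start at index 4, which is why later cells slice the embedding matrix with `[4:]` and add 4 back to the returned indices.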

In [None]:
fasttext_simple = nlp.embedding.create('fasttext', source='wiki.simple')

In [None]:
vocab.set_embedding(fasttext_simple)

In [None]:
# 'beautiful' is not in this small vocabulary, so the unknown-token vector is returned.
vocab.embedding['beautiful']

In [None]:
vocab.embedding['hello', 'world'][:, :5]

## Application of Pre-trained Word Embeddings

In [None]:
embedding = nlp.embedding.create('glove', source='glove.6B.50d')

In [None]:
vocab = nlp.Vocab(nlp.data.Counter(embedding.idx_to_token))
vocab.set_embedding(embedding)

In [None]:
len(vocab.idx_to_token)

In [None]:
print(vocab['beautiful'])
print(vocab.idx_to_token[71424])

### Word Similarity

![](support/cosinesimilarity.png)

In [None]:
def cos_sim(x, y):
    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))
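Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. A small pure-Python sketch with hand-picked vectors makes the three typical cases visible; `cosine_similarity` is an illustrative stand-in for the NDArray version.

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # exactly 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1

print(same_direction, orthogonal, opposite)
```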

In [None]:
def norm_vecs_by_row(x):
    return x / nd.sqrt(nd.sum(x * x, axis=1)).reshape((-1,1))

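def get_knn(vocab, k, word):
    word_vec = vocab.embedding[word].reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    # Skip the first 4 rows, which hold the vocabulary's special tokens.
    dot_prod = nd.dot(vocab_vecs[4:], word_vec)
    # Ask for k+1 hits because the query word is its own nearest neighbor.
    indices = nd.topk(dot_prod.squeeze(), k=k+1, ret_typ='indices')
    # Add 4 back to map row positions to vocabulary indices.
    indices = [int(i.asscalar())+4 for i in indices]
    # Drop the first hit, which is the input word itself.
    return vocab.to_tokens(indices[1:])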

In [None]:
get_knn(vocab, 5, 'baby')
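The nearest-neighbor search can be tried on a tiny scale without downloading GloVe. The sketch below assumes a hypothetical dictionary `toy_embedding` of hand-picked 2-D vectors; it ranks every other word by cosine similarity to the query and excludes the query explicitly rather than dropping the first hit.

```python
import math

# Hypothetical 2-D embeddings, chosen by hand for illustration only.
toy_embedding = {
    "baby": [0.9, 0.1],
    "babies": [0.8, 0.2],
    "infant": [0.7, 0.3],
    "car": [0.1, 0.9],
    "truck": [0.2, 0.8],
}

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec]

def nearest_neighbors(word, k):
    """Return the k words whose vectors point most nearly in the query's direction."""
    query = normalize(toy_embedding[word])
    scores = {
        other: sum(q * o for q, o in zip(query, normalize(vec)))
        for other, vec in toy_embedding.items()
        if other != word  # exclude the query word itself
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

neighbors = nearest_neighbors("baby", 2)
print(neighbors)  # ['babies', 'infant']
```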

We can check this by computing the cosine similarity between the vectors of 'baby' and 'babies'.

In [None]:
cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])

Let us find the 5 words in the vocabulary most similar to 'beautiful'.

In [None]:
get_knn(vocab, 5, 'beautiful')

### Word Analogy

In [None]:
def get_top_k_by_analogy(vocab, k, word1, word2, word3):
    word_vecs = vocab.embedding[word1, word2, word3]
    
    # word1 is to word2 as word3 is to ?  ->  target = word2 - word1 + word3
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2])
    
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs[4:], word_diff.squeeze()).squeeze()
    
    indices = dot_prod.topk(k=k, ret_typ='indices')
    indices = [int(i.asscalar())+4 for i in indices]
    return vocab.to_tokens(indices)

In [None]:
get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')
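The analogy query "man is to woman as son is to ?" amounts to vector arithmetic followed by a similarity search. Here is a toy sketch with hypothetical hand-made 2-D vectors in which the second coordinate loosely encodes gender. Unlike the function above, this sketch excludes the three input words from the candidates, which is a deliberate simplification rather than the notebook's behavior.

```python
import math

# Hypothetical vectors: the first axis separates generations, the second gender.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "son": [3.0, 0.0],
    "daughter": [3.0, 1.0],
    "table": [0.0, 3.0],
}

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def analogy(word1, word2, word3, k=1):
    """Solve word1 : word2 :: word3 : ? by ranking words near word2 - word1 + word3."""
    v1, v2, v3 = (toy_vectors[w] for w in (word1, word2, word3))
    target = [b - a + c for a, b, c in zip(v1, v2, v3)]
    candidates = [w for w in toy_vectors if w not in (word1, word2, word3)]
    return sorted(candidates, key=lambda w: cosine(toy_vectors[w], target), reverse=True)[:k]

answer = analogy("man", "woman", "son")
print(answer)  # ['daughter']
```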

In [None]:
get_top_k_by_analogy(vocab, 3, 'argentina', 'messi', 'france')

In [None]:
get_top_k_by_analogy(vocab, 1, 'argentina', 'football', 'india')

In [None]:
get_top_k_by_analogy(vocab, 1, 'france', 'crepes', 'argentina')

![](support/elmo-embedding-robin-williams.png)


Context matters

# Sentence Embeddings with Pretrained ELMo



<img align="middle" src="https://pbs.twimg.com/profile_images/1092451830758547457/EqQ6Csl3_400x400.jpg" />

In [None]:
elmo_intro = """
Extensive experiments demonstrate that ELMo representations work extremely well in practice.
We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including textual entailment, question answering and sentiment analysis.
The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions.
For tasks where direct comparisons are possible, ELMo outperforms CoVe (McCann et al., 2017), which computes contextualized representations using a neural machine translation encoder.
Finally, an analysis of both ELMo and CoVe reveals that deep representations outperform those derived from just the top layer of an LSTM.
Our trained models and code are publicly available, and we expect that ELMo will provide similar gains for many other NLP problems.
"""

elmo_intro_file = 'elmo_intro.txt'
with io.open(elmo_intro_file, 'w', encoding='utf8') as f:
    f.write(elmo_intro)

dataset = nlp.data.TextLineDataset(elmo_intro_file, 'utf8')
print(len(dataset))
print(dataset[2]) # print an example sentence from the input data

## Data Transform

### Tokenization

In [None]:
tokenizer = nlp.data.NLTKMosesTokenizer()
dataset = dataset.transform(tokenizer)
dataset = dataset.transform(lambda x: ['<bos>'] + x + ['<eos>'])
print(dataset[2]) # print the same tokenized sentence as above

### Pretrained ELMo Vocab

In [None]:
vocab = nlp.vocab.ELMoCharVocab()
dataset = dataset.transform(lambda x: (vocab[x], len(x)), lazy=False)

### DataLoader

In [None]:
batch_size = 2
dataset_batchify_fn = nlp.data.batchify.Tuple(nlp.data.batchify.Pad(),
                                              nlp.data.batchify.Stack())
data_loader = gluon.data.DataLoader(dataset,
                                    batch_size=batch_size,
                                    batchify_fn=dataset_batchify_fn)
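Batching variable-length sentences requires padding them to a common length while keeping the true lengths alongside. The following is a simplified sketch of that idea using flat lists of token ids (the ELMo inputs here are actually character ids per token); `pad_batch` is a hypothetical helper, not the GluonNLP implementation.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Each sample is (token ids, length), mirroring the (data, length) tuples above.
samples = [([5, 6, 7], 3), ([8, 9], 2)]

# Apply one function per tuple field: pad the ids, stack the lengths.
token_ids, lengths = zip(*samples)
padded = pad_batch(token_ids)
stacked_lengths = list(lengths)

print(padded)           # [[5, 6, 7], [8, 9, 0]]
print(stacked_lengths)  # [3, 2]
```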

## Load pretrained ELMo Model

In [None]:
elmo_bilm, _ = nlp.model.get_model('elmo_2x1024_128_2048cnn_1xhighway',
                                   dataset_name='gbw',
                                   pretrained=True,
                                   ctx=mx.cpu())
#print(elmo_bilm)

![](support/elmo_arch.png)

## Get Sentence Features from ELMo

In [None]:
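def get_features(data, valid_lengths):
    # Take sizes from the batch itself rather than the global batch_size,
    # so a smaller final batch still works.
    batch_size, length = data.shape[0], data.shape[1]
    hidden_state = elmo_bilm.begin_state(mx.nd.zeros, batch_size=batch_size)
    # mask[i, t] is 1 for real tokens (t < valid_lengths[i]) and 0 for padding.
    mask = mx.nd.arange(length).expand_dims(0).broadcast_axes(axis=(0,), size=(batch_size,))
    mask = mask < valid_lengths.expand_dims(1).astype('float32')
    output, hidden_state = elmo_bilm(data, hidden_state, mask)
    return output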
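The mask tells the model which positions in a padded batch are real tokens. A pure-Python sketch of the same comparison, with a hypothetical helper `length_mask`, shows the resulting pattern for two sentences of lengths 3 and 1 padded to length 4.

```python
def length_mask(valid_lengths, max_len):
    """Return one row per sequence: 1 where position < valid length, else 0."""
    return [[1 if pos < n else 0 for pos in range(max_len)] for n in valid_lengths]

mask = length_mask([3, 1], 4)
print(mask)  # [[1, 1, 1, 0], [1, 0, 0, 0]]
```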


In [None]:
batch = next(iter(data_loader))
features = get_features(*batch)
print([x.shape for x in features])

# Finetuning BERT for sentence classification

<img align="middle" src="https://miro.medium.com/max/854/1*oUpWrMdvDWcWE_QSne-jOw.jpeg" />

## Get BERT base model

In [None]:
bert_base, vocabulary = nlp.model.get_model('bert_12_768_12',
                                             dataset_name='book_corpus_wiki_en_uncased',
                                             pretrained=True, ctx=ctx, use_pooler=True,
                                             use_decoder=False, use_classifier=False)

## Data Preprocessing


In [None]:
train_dataset, test_dataset = [nlp.data.IMDB(root='data/imdb', segment=segment)
                               for segment in ('train', 'test')]

In [None]:
def process_label(x):
    data, label = x
    # Label is a review score from 1 to 10. We take 6..10 as a positive sentiment
    # and 1..5 as a negative
    label = int(label > 5)
    return [data, label]

def process_dataset(dataset):
    start = time.time()
    with mp.Pool() as pool:
        # Samples are processed in parallel across worker processes.
        dataset = gluon.data.SimpleDataset(pool.map(process_label, dataset))
    end = time.time()
    print('Done! Label processing Time={:.2f}s, #Sentences={}'.format(end - start, len(dataset)))
    return dataset

train_dataset = process_dataset(train_dataset)
test_dataset = process_dataset(test_dataset)

## Data preprocessing for BERT

In [None]:
from gluonnlp.data import BERTSentenceTransform

class BERTDatasetTransform(object):
    def __init__(self, tokenizer, max_seq_length, class_labels=None,
                 label_alias=None, pad=True, pair=True, has_label=True):
        self.class_labels = class_labels
        self.has_label = has_label
        self._label_dtype = 'int32' if class_labels else 'float32'
        
        if has_label and class_labels:
            self._label_map = {}
            for (i, label) in enumerate(class_labels):
                self._label_map[label] = i
            if label_alias:
                for key in label_alias:
                    self._label_map[key] = self._label_map[label_alias[key]]
        
        self._bert_xform = BERTSentenceTransform(
            tokenizer, max_seq_length, pad=pad, pair=pair)

    def __call__(self, line):
        if self.has_label:
            input_ids, valid_length, segment_ids = self._bert_xform(line[:-1])
            label = line[-1]
            # map to int if class labels are available
            if self.class_labels:
                label = self._label_map[label]
            label = np.array([label], dtype=self._label_dtype)
            return input_ids, valid_length, segment_ids, label
        else:
            return self._bert_xform(line)


In [None]:
# Use the vocabulary from pre-trained model for tokenization
bert_tokenizer = nlp.data.BERTTokenizer(vocabulary, lower=True)

# The maximum length of an input sequence
max_len = 500

# The labels for the two classes
all_labels = [0, 1]

transform = BERTDatasetTransform(bert_tokenizer, max_len,
                                 class_labels=all_labels,
                                 has_label=True,
                                 pad=True,
                                 pair=False)

data_train = train_dataset.transform(transform)
data_test = test_dataset.transform(transform)


In [None]:
sample_id = 5
print(vocabulary)
print('%s token id = %s' % (vocabulary.padding_token, vocabulary[vocabulary.padding_token]))
print('%s token id = %s' % (vocabulary.cls_token, vocabulary[vocabulary.cls_token]))
print('%s token id = %s' % (vocabulary.sep_token, vocabulary[vocabulary.sep_token]))
print('token ids = %s' % data_train[sample_id][0][:11])
print('valid length = %s' % data_train[sample_id][1])
print('label = %s' % data_train[sample_id][3])
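The printed fields follow the usual single-sentence BERT input layout: `[CLS]`, the tokens, `[SEP]`, then padding. The sketch below reproduces that layout on token strings rather than ids. It assumes truncation to leave room for the two special tokens and segment id 0 throughout, and `build_bert_inputs` is a hypothetical stand-in rather than the library's transform.

```python
def build_bert_inputs(tokens, max_seq_length, pad_token="[PAD]"):
    """Lay out a single sentence as [CLS] tokens [SEP] followed by padding."""
    # Reserve two positions for [CLS] and [SEP], truncating if needed.
    tokens = tokens[: max_seq_length - 2]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    valid_length = len(sequence)

    # A single sentence uses segment id 0 for every position.
    padding = max_seq_length - valid_length
    sequence = sequence + [pad_token] * padding
    segment_ids = [0] * max_seq_length
    return sequence, valid_length, segment_ids

sequence, valid_length, segment_ids = build_bert_inputs(["great", "movie"], 6)
print(sequence)      # ['[CLS]', 'great', 'movie', '[SEP]', '[PAD]', '[PAD]']
print(valid_length)  # 4
```

The valid length is what lets the model ignore the padded positions.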

## Classifier Model using BERT

In [None]:
class BERTClassifier(gluon.nn.Block):
    def __init__(self, bert, num_classes=2, dropout=0.0, prefix=None, params=None):
        super(BERTClassifier, self).__init__(prefix=prefix, params=params)
        self.bert = bert

        with self.name_scope():
            self.classifier = gluon.nn.HybridSequential(prefix=prefix)
            if dropout:
                self.classifier.add(gluon.nn.Dropout(rate=dropout))
            self.classifier.add(gluon.nn.Dense(units=num_classes))

    def forward(self, inputs, token_types, valid_length=None):
        _, pooler_out = self.bert(inputs, token_types, valid_length)
        return self.classifier(pooler_out)

In [None]:
model = BERTClassifier(bert_base, num_classes=2, dropout=0.1)
# Only the classifier layer needs initializing; the BERT weights are already pre-trained.
model.classifier.initialize(init=mx.init.Normal(0.02), ctx=ctx)
# model.hybridize(static_alloc=True)

## Loss, Trainer, DataLoader

In [None]:
batch_size = 10
lr = 5e-6
log_interval = 300
num_epochs = 1
grad_clip = 1

loss_function = mx.gluon.loss.SoftmaxCELoss()
trainer = mx.gluon.Trainer(model.collect_params(), 'adam', {'learning_rate': lr, 'epsilon': 1e-9})

train_dataloader = mx.gluon.data.DataLoader(data_train, batch_size=batch_size, shuffle=True, num_workers=10)
test_dataloader = mx.gluon.data.DataLoader(data_test, batch_size=batch_size, shuffle=False, num_workers=10)

## Evaluation Metric

In [None]:
#accuracy
metric = mx.metric.Accuracy()

def evaluate(model, dataloader, ctx):
    metric = mx.metric.Accuracy()
    step_loss = 0
    
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(dataloader):
        token_ids = token_ids.as_in_context(ctx)
        valid_length = valid_length.as_in_context(ctx)
        segment_ids = segment_ids.as_in_context(ctx)
        label = label.as_in_context(ctx)

        out = model(token_ids, segment_ids, valid_length.astype('float32'))
        ls = loss_function(out, label).mean()

        step_loss += ls.asscalar()
        metric.update([label], [out])

    return metric.get()[1], step_loss / len(dataloader)
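Accuracy over class scores means taking the highest-scoring class per example and comparing it with the label. A minimal sketch with made-up logits, independent of MXNet's metric class:

```python
def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up scores for four reviews over the classes (negative, positive).
logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.6], [0.9, 0.2]]
labels = [0, 1, 0, 0]

acc = accuracy(logits, labels)
print(acc)  # 0.75
```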

## Training Loop

In [None]:
def train(model, ctx, num_epochs):
    metric = mx.metric.Accuracy()
    # Collect all differentiable parameters for gradient clipping
    params = [p for p in model.collect_params().values() if p.grad_req != 'null']

    for epoch_id in range(num_epochs):
        metric.reset()
        step_loss = 0
        for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(train_dataloader):
            token_ids = token_ids.as_in_context(ctx)
            valid_length = valid_length.as_in_context(ctx)
            segment_ids = segment_ids.as_in_context(ctx)
            label = label.as_in_context(ctx)

            with mx.autograd.record():
                out = model(token_ids, segment_ids, valid_length.astype('float32'))
                ls = loss_function(out, label).mean()

            ls.backward()

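            # Clip the global gradient norm, then apply the update.
            # update(1) skips batch-size rescaling because the loss is already a mean.
            trainer.allreduce_grads()
            nlp.utils.clip_grad_global_norm(params, grad_clip)
            trainer.update(1)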

            step_loss += ls.asscalar()
            metric.update([label], [out])

            if (batch_id + 1) % (log_interval) == 0:
                print('[Epoch {} Batch {}/{}] loss={:.4f}, lr={:.7f}, acc={:.3f}'
                             .format(epoch_id, batch_id + 1, len(train_dataloader),
                                     step_loss / log_interval, trainer.learning_rate, metric.get()[1]))
                step_loss = 0
        
        test_acc, test_loss = evaluate(model, test_dataloader, ctx)
        print('[Epoch {}] test_loss={:.4f}, test_acc={:.3f}'
             .format(epoch_id, test_loss, test_acc))
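Clipping by global norm treats all gradients as one long vector: if its length exceeds the limit, every gradient is scaled down by the same factor, which preserves the update direction. The following is a pure-Python sketch of that idea; `clip_by_global_norm` is a hypothetical helper, and the library routine may differ in numerical details such as a small epsilon.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Only shrink; gradients already within the limit are left unchanged.
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total

# Two made-up parameter gradients with a joint norm of 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_by_global_norm(grads, 1.0)

print(norm)     # 5.0
print(clipped)  # each entry scaled by 1/5
```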

In [None]:
train(model, ctx, num_epochs)

## Test with an example review

In [None]:
review_text = 'I would like to say something positive about this movie, and I can\'t'
review_transformed = transform((review_text, 0))

token_ids = mx.nd.array(review_transformed[0], ctx=ctx).reshape(1, -1)
segment_ids = mx.nd.array(review_transformed[2], ctx=ctx).reshape(1, -1)
valid_length = mx.nd.array(review_transformed[1], ctx=ctx).reshape(1)

In [None]:
positive_review_probability = model(token_ids, segment_ids, valid_length.astype('float32')).softmax()
print('"{}" is {:.1f}% likely to be positive'.format(
    review_text,
    100 * positive_review_probability[0][1].asscalar()))
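The classifier outputs two raw scores, and softmax turns them into probabilities that sum to one. A small sketch with hypothetical logits (not actual model output) shows the conversion; subtracting the maximum before exponentiating is a common trick for numerical stability.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for (negative, positive); index 1 is the positive class.
probs = softmax([1.2, -0.4])
print('{:.1f}% likely to be positive'.format(100 * probs[1]))
```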