# Word Embeddings

## word2vec
- Continuous bag-of-words (CBOW) model: predicts the middle word from the surrounding context words
- Continuous skip-gram model: predicts the words within a certain range before and after the current word
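
To make the difference between the two models concrete, here is a minimal pure-Python sketch of the training examples each one would see. The helper names `cbow_examples` and `skipgram_examples` are hypothetical and written only for illustration; they are not part of any library.

```python
def cbow_examples(tokens, window_size):
    """Return one (context_words, target_word) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        # Context = neighbors on both sides, excluding the target itself.
        context = tokens[lo:i] + tokens[i + 1:hi]
        examples.append((context, target))
    return examples


def skipgram_examples(tokens, window_size):
    """Return one (target_word, context_word) pair per neighbor."""
    pairs = []
    for context, target in cbow_examples(tokens, window_size):
        pairs.extend((target, c) for c in context)
    return pairs


tokens = "the wide road shimmered in the hot sun".split()
print(cbow_examples(tokens, 2)[3])
# (['wide', 'road', 'in', 'the'], 'shimmered')
print(skipgram_examples(tokens, 2)[:3])
# [('the', 'wide'), ('the', 'road'), ('wide', 'the')]
```

CBOW gets one example per position (many context words in, one word out), while skip-gram splits the same window into several single-word predictions.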

## Skip-gram and negative sampling
Skip-gram training data consists of (target_word, context_word) pairs, where context_word appears in the neighboring context of target_word.

The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of words $w_1, w_2, \ldots, w_T$, the objective can be written as the average log probability
>$$\frac{1}{T} \sum^T_{t=1} \sum_{-c\leq j \leq c, j\neq0} \log p(w_{t+j}|w_t)$$

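where $c$ is the size of the training context. The basic skip-gram formulation defines this probability using the softmax function
>$$p(w_O|w_I) = \frac{\text{exp}({v'_{w_O}}^T v_{w_I})}{\sum^W_{w=1}\text{exp}({v'_w}^T v_{w_I})}$$

where $v_w$ and $v'_w$ are the input and output vector representations of $w$, and $W$ is the number of words in the vocabulary.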

Computing the denominator of this formulation requires a full softmax over the entire vocabulary, which is expensive because the vocabulary is typically very large.
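
The cost of the denominator is easier to see in code. The following NumPy sketch evaluates the full softmax for a toy vocabulary; the sizes, the random vectors, and the function name `softmax_prob` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 8, 4                       # toy sizes (assumed)
v_in = rng.normal(size=(vocab_size, embedding_dim))    # input vectors v_w
v_out = rng.normal(size=(vocab_size, embedding_dim))   # output vectors v'_w


def softmax_prob(w_o, w_i):
    """p(w_O | w_I) under the full softmax."""
    # One dot product per vocabulary word: this is the expensive part.
    scores = v_out @ v_in[w_i]
    scores = scores - scores.max()      # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores[w_o] / exp_scores.sum()


probs = [softmax_prob(w, w_i=3) for w in range(vocab_size)]
print(sum(probs))  # sums to 1 up to floating-point error
```

Every single probability needs `vocab_size` dot products, so the work per training pair grows linearly with the vocabulary.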

The noise contrastive estimation (NCE) loss function is an efficient approximation of the full softmax. Because the objective is to learn word embeddings rather than to model the word distribution, the NCE loss can be simplified to use negative sampling.

The simplified negative sampling objective for a target word is to distinguish the context word from num_ns negative samples drawn from the noise distribution $P_n(w)$ of words. More precisely, for a skip-gram pair, an efficient approximation of the full softmax over the vocabulary is to **pose the loss for a target word as a classification problem between the context word and num_ns negative samples**.
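
As a rough sketch of that classification view, the loss for one skip-gram pair can be written as a sum of binary log-losses: the true context word should score high against the target, and each negative sample should score low. The vectors and the function name `negative_sampling_loss` below are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Context word is the positive class; each negative sample is class 0."""
    loss = -log_sigmoid(dot(context_vec, target_vec))
    for neg in negative_vecs:
        loss -= log_sigmoid(-dot(neg, target_vec))
    return loss


target = [0.5, -0.2, 0.1]                          # toy vectors (assumed)
context = [0.4, -0.1, 0.3]
negatives = [[-0.3, 0.2, 0.0], [0.1, 0.1, -0.5]]
print(negative_sampling_loss(target, context, negatives))
```

Only `1 + num_ns` dot products are needed per pair, instead of one per vocabulary word.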




## Setup

In [6]:
import io
import re
import string
import tqdm

import numpy as np

import tensorflow as tf
from tensorflow.keras import layers


In [7]:
# Load the TensorBoard notebook extension
%load_ext tensorboard


In [8]:
SEED = 42
AUTOTUNE = tf.data.AUTOTUNE


### Vectorize an example sentence

In [1]:
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
print(tokens)

['the', 'wide', 'road', 'shimmered', 'in', 'the', 'hot', 'sun']


In [3]:
vocab, index = {}, 1
vocab['<pad>'] = 0 # add a padding token
for token in tokens:
    if token not in vocab:
        vocab[token] = index
        index += 1
vocab_size = len(vocab)
print(vocab)

{'<pad>': 0, 'the': 1, 'wide': 2, 'road': 3, 'shimmered': 4, 'in': 5, 'hot': 6, 'sun': 7}


In [4]:
inverse_vocab = {index: token for token, index in vocab.items()}
print(inverse_vocab)

{0: '<pad>', 1: 'the', 2: 'wide', 3: 'road', 4: 'shimmered', 5: 'in', 6: 'hot', 7: 'sun'}


Vectorize the example sentence:

In [5]:
example_sentence = [vocab[word] for word in tokens]
print(example_sentence)

[1, 2, 3, 4, 5, 1, 6, 7]


### Generate skip-grams from one sentence

In [12]:
window_size = 2
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
    example_sentence,
    vocabulary_size = vocab_size,
    window_size = window_size,
    negative_samples = 0
)
print(len(positive_skip_grams))

26


In [15]:
for target, context in positive_skip_grams:
    print(f"({target},{context}): ({inverse_vocab[target]},{inverse_vocab[context]})")

(5,6): (in,hot)
(4,2): (shimmered,wide)
(3,2): (road,wide)
(1,4): (the,shimmered)
(2,3): (wide,road)
(7,1): (sun,the)
(4,3): (shimmered,road)
(1,5): (the,in)
(7,6): (sun,hot)
(1,6): (the,hot)
(3,5): (road,in)
(3,1): (road,the)
(2,4): (wide,shimmered)
(4,1): (shimmered,the)
(6,1): (hot,the)
(5,1): (in,the)
(5,4): (in,shimmered)
(1,2): (the,wide)
(4,5): (shimmered,in)
(5,3): (in,road)
(6,7): (hot,sun)
(1,3): (the,road)
(2,1): (wide,the)
(3,4): (road,shimmered)
(6,5): (hot,in)
(1,7): (the,sun)
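
The count of 26 can be checked by hand with a plain sliding window. This is a minimal sketch, not the Keras implementation: the helper name `positive_skip_grams_py` is hypothetical, and it returns pairs in sentence order, whereas the pairs printed above appear in shuffled order.

```python
def positive_skip_grams_py(sequence, window_size):
    """All (target, context) index pairs at most window_size positions apart."""
    pairs = []
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, sequence[j]))
    return pairs


example_sentence = [1, 2, 3, 4, 5, 1, 6, 7]
pairs = positive_skip_grams_py(example_sentence, window_size=2)
print(len(pairs))  # 26
```

Words at the edges of the sentence have fewer neighbors (2 or 3 instead of 4), which is why 8 words with a window of 2 give 26 pairs rather than 32.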


### Negative sampling for one skip-gram
num_ns (the number of negative samples per positive context word) in the [5, 20] range is [shown to work](https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) best for smaller datasets, while num_ns in the [2, 5] range suffices for larger datasets.

In [22]:
# Get target and context words for one positive skip-gram
target_word, context_word = positive_skip_grams[0]

# Set the number of negative samples per positive context.
num_ns = 4

context_class = tf.reshape(tf.constant(context_word,dtype="int64"),(1,1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,
    num_true=1,
    num_sampled=num_ns,
    unique=True,
    range_max=vocab_size,
    seed=SEED,
    name='negative_sampling'
)

print(negative_sampling_candidates)
print([inverse_vocab[index.numpy()] for index in negative_sampling_candidates])


tf.Tensor([3 6 0 1], shape=(4,), dtype=int64)
['road', 'hot', '<pad>', 'the']
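
The sampler used above does not draw uniformly. According to the TensorFlow documentation for `tf.random.log_uniform_candidate_sampler`, class `i` is drawn with probability `(log(i + 2) - log(i + 1)) / log(range_max + 1)`, so smaller indices are favored. The sketch below evaluates that formula in plain Python; the function name `log_uniform_prob` is hypothetical.

```python
import math


def log_uniform_prob(class_id, range_max):
    """Probability of drawing class_id under the log-uniform (Zipfian) base distribution."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)


vocab_size = 8
probs = [log_uniform_prob(i, vocab_size) for i in range(vocab_size)]
for i, p in enumerate(probs):
    print(i, round(p, 3))
```

This distribution is a good fit when indices are ordered by decreasing frequency, as in a vocabulary sorted by word count. The toy vocabulary in this example is indexed by order of appearance, so here it only serves as an illustration.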


![](https://tensorflow.org/tutorials/text/images/word2vec_negative_sampling.png)

## Compile all steps into one function 

### Skip-gram sampling table
Training examples obtained from sampling commonly occurring words (such as the, is, on) don't add much useful information for the model to learn from. Subsampling frequent words is a suggested practice for improving embedding quality.

Use tf.keras.preprocessing.sequence.make_sampling_table to generate a word-frequency-rank-based probabilistic sampling table and pass it to the skipgrams function. The function assumes a Zipf distribution of word frequencies for sampling.
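
To get a feel for what such a table contains, the sketch below follows the formulas given in the Keras documentation for `make_sampling_table`, as far as I read them: word frequency is approximated from rank with a Zipf-style expression, and the keep probability is `min(1, sqrt(f / s) / (f / s))` with sampling factor `s = 1e-5`. The function names are hypothetical and the library's handling of index 0 may differ, so treat this as an approximation.

```python
import math

EULER_GAMMA = 0.577  # Euler-Mascheroni constant, rounded


def zipf_frequency(rank):
    """Approximate relative frequency of the word with the given rank (rank >= 1)."""
    return 1.0 / (rank * (math.log(rank) + EULER_GAMMA) + 0.5 - 1.0 / (12.0 * rank))


def keep_probability(rank, sampling_factor=1e-5):
    """Probability of keeping a word of this rank when generating skip-grams."""
    ratio = zipf_frequency(rank) / sampling_factor
    return min(1.0, math.sqrt(ratio) / ratio)


for rank in (1, 10, 100, 1000, 4000):
    print(rank, round(keep_probability(rank), 4))
```

Very frequent words (low rank) are kept only rarely, while rare words are kept with probability close to or equal to 1.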

### Generate training data

In [24]:
# Generates skip-gram pairs with negative sampling for a list of sequences
# (int-encoded sentences) based on window size, number of negative samples
# and vocabulary size.
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
    # Elements of each training example are appended to these lists
    targets, contexts, labels = [], [], []

    # Build the sampling table for 'vocab_size' tokens
    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

    # Iterate over all sequences(sentences) in the dataset
    for sequence in tqdm.tqdm(sequences):

        # Generate positive skip-gram pairs for a sentence
        positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
            sequence,
            vocabulary_size = vocab_size,
            sampling_table = sampling_table,
            window_size = window_size,
            negative_samples = 0
        )

        # Iterate over each positive skip-gram pair to produce training examples
        # with a positive context word and negative samples

        for target_word, context_word in positive_skip_grams:
            context_class = tf.expand_dims(
                tf.constant([context_word], dtype='int64'),1
            )
            negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
                true_classes=context_class,
                num_true=1,
                num_sampled=num_ns,
                unique=True,
                range_max=vocab_size,
                seed=seed,
                name="negative_sampling"
            )

            # Build context and label vectors (for one target word)
            negative_sampling_candidates = tf.expand_dims(
                negative_sampling_candidates, 1
            )

            context = tf.concat([context_class, negative_sampling_candidates],0)
            label = tf.constant([1] + [0]*num_ns, dtype='int64')

            # Append each element from the training example to global lists
            targets.append(target_word)
            contexts.append(context)
            labels.append(label)
        
    return targets, contexts, labels
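
The layout of a single training example produced by this function can be sketched without TensorFlow. The helper `build_example` and the sample indices are hypothetical; the point is only that the true context word always sits at position 0 and is followed by the negative samples.

```python
def build_example(target_word, context_word, negative_samples):
    """Return (target, context, label) with the positive context word first."""
    context = [context_word] + list(negative_samples)
    label = [1] + [0] * len(negative_samples)
    return target_word, context, label


# Hypothetical indices: target 5, true context 6, four negative samples.
target, context, label = build_example(5, 6, [3, 2, 0, 1])
print(target)   # 5
print(context)  # [6, 3, 2, 0, 1]
print(label)    # [1, 0, 0, 0, 0]
```

In the TensorFlow version above, each context tensor has an extra trailing axis of size 1 (shape `(num_ns + 1, 1)`), which is why the later conversion to NumPy indexes it with `[:, :, 0]`.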



### Prepare training data for word2vec

In [25]:
path_to_file = tf.keras.utils.get_file(
    'shakespeare.txt', 
    'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')


Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


In [26]:
with open(path_to_file) as f:
    lines = f.read().splitlines()
for line in lines[:20]:
    print(line)


First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.


In [27]:
text_ds = tf.data.TextLineDataset(path_to_file).filter(lambda x: tf.cast(tf.strings.length(x), bool))


### Vectorize sentences from the corpus
Notice from the first few sentences above that the text needs to be in one case and punctuation needs to be removed. To do this, define a custom_standardization function that can be used in the TextVectorization layer.

In [28]:
# Now, create a custom standardization function to lowercase the text and
# remove punctuation.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    return tf.strings.regex_replace(lowercase,
                                  '[%s]' % re.escape(string.punctuation), '')


# Define the vocabulary size and the number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Use the `TextVectorization` layer to normalize, split, and map strings to
# integers. Set the `output_sequence_length` length to pad all samples to the
# same length.
vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)
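
To see the effect of the standardization on a single line without running TensorFlow, here is an equivalent sketch using Python's `re` module with the same character class. The function name `standardize_py` is hypothetical.

```python
import re
import string


def standardize_py(text):
    """Lowercase the text and strip ASCII punctuation."""
    lowercase = text.lower()
    return re.sub('[%s]' % re.escape(string.punctuation), '', lowercase)


print(standardize_py("We know't, we know't."))  # we knowt we knowt
```

Note that apostrophes are removed too, so contractions such as "know't" collapse into a single token.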


In [29]:
# Call TextVectorization.adapt on the text dataset to create vocabulary
vectorize_layer.adapt(text_ds.batch(1024))

In [30]:
# Save the created vocabulary for reference.
inverse_vocab = vectorize_layer.get_vocabulary()


In [31]:
# The vectorize_layer can now be used to generate vectors for each element in the text_ds 
# (a tf.data.Dataset). Apply Dataset.batch, Dataset.prefetch, Dataset.map, and Dataset.unbatch
# Vectorize the data in text_ds.
text_vector_ds = text_ds.batch(1024).prefetch(AUTOTUNE).map(vectorize_layer).unbatch()
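
Because `output_sequence_length` is set, every vectorized line ends up with exactly `sequence_length` entries: short lines are padded with 0 and long lines are truncated. A pure-Python picture of that step, with a hypothetical helper name, could look like this:

```python
def pad_or_truncate(token_ids, sequence_length, pad_id=0):
    """Force a list of token ids to exactly sequence_length entries."""
    clipped = list(token_ids[:sequence_length])
    return clipped + [pad_id] * (sequence_length - len(clipped))


print(pad_or_truncate([89, 270], 10))
# [89, 270, 0, 0, 0, 0, 0, 0, 0, 0]
print(pad_or_truncate(list(range(1, 13)), 10))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```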


### Obtain sequences from the dataset
You now have a tf.data.Dataset of integer encoded sentences. To prepare the dataset for training a word2vec model, flatten the dataset into a list of sentence vector sequences. This step is required as you would iterate over each sentence in the dataset to produce positive and negative examples.

In [32]:
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences))


32777


In [33]:
for seq in sequences[:10]:
    print(f"{seq}=>{[inverse_vocab[i] for i in seq]}")

[ 89 270   0   0   0   0   0   0   0   0]=>['first', 'citizen', '', '', '', '', '', '', '', '']
[138  36 982 144 673 125  16 106   0   0]=>['before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', '', '']
[34  0  0  0  0  0  0  0  0  0]=>['all', '', '', '', '', '', '', '', '', '']
[106 106   0   0   0   0   0   0   0   0]=>['speak', 'speak', '', '', '', '', '', '', '', '']
[ 89 270   0   0   0   0   0   0   0   0]=>['first', 'citizen', '', '', '', '', '', '', '', '']
[   7   41   34 1286  344    4  200   64    4 3690]=>['you', 'are', 'all', 'resolved', 'rather', 'to', 'die', 'than', 'to', 'famish']
[34  0  0  0  0  0  0  0  0  0]=>['all', '', '', '', '', '', '', '', '', '']
[1286 1286    0    0    0    0    0    0    0    0]=>['resolved', 'resolved', '', '', '', '', '', '', '', '']
[ 89 270   0   0   0   0   0   0   0   0]=>['first', 'citizen', '', '', '', '', '', '', '', '']
[  89    7   93 1187  225   12 2442  592    4    2]=>['first', 'you', 'know', 'caius', 'marcius', 'i

### Generate training examples from sequences


In [35]:
targets, contexts, labels = generate_training_data(
    sequences=sequences,
    window_size=2,
    num_ns=4,
    vocab_size=vocab_size,
    seed=SEED
)



100%|██████████| 32777/32777 [00:52<00:00, 620.36it/s]


In [43]:
targets = np.array(targets)
contexts = np.array(contexts)[:,:,0]
labels = np.array(labels)

print('\n')
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")




targets.shape: (65528,)
contexts.shape: (65528, 5)
labels.shape: (65528, 5)
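
When these examples are batched with a batch size of 1024 and `drop_remainder=True`, as this notebook does when configuring the dataset, the last partial batch is discarded. A quick arithmetic check, using the example count reported above:

```python
num_examples = 65528   # targets.shape[0] reported above
batch_size = 1024

full_batches, leftover = divmod(num_examples, batch_size)
print(full_batches)  # 63 batches per epoch
print(leftover)      # 1016 examples dropped from each pass
```

The number of examples varies from run to run because the sampling table subsamples words randomly, so these figures apply to this particular run only.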


### Configure the dataset for performance

In [44]:
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)


<BatchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>


In [45]:
dataset = dataset.cache().prefetch(buffer_size=AUTOTUNE)
print(dataset)


<PrefetchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>


## Model and training
The word2vec model can be implemented as a classifier that distinguishes true context words, taken from skip-grams, from false context words obtained through negative sampling. You can take the dot product between the embeddings of the target and context words to obtain predictions for the labels, and compute the loss against the true labels in the dataset. The model has the following components:
- target_embedding: looks up the embedding of a word when it appears as a target word
- context_embedding: looks up the embedding of a word when it appears as a context word
- dots: the dot product of target and context embeddings from a training pair

The target_embedding and context_embedding layers can be shared as well. You could also use a concatenation of both embeddings as the final word2vec embedding

In [46]:
class Word2Vec(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        self.target_embedding = layers.Embedding(
            vocab_size,
            embedding_dim,
            input_length=1,
            name="w2v_embedding"
        )

        self.context_embedding = layers.Embedding(
            vocab_size,
            embedding_dim,
            input_length = num_ns+1
        )
    
    def call(self, pair):
        target, context = pair
        # target: (batch, dummy?)  # The dummy axis doesn't exist in TF2.7+
        # context: (batch, context)
        if len(target.shape) == 2:
            target = tf.squeeze(target, axis=1)
        # target: (batch,)
        word_emb = self.target_embedding(target)
        # word_emb: (batch, embed)
        context_emb = self.context_embedding(context)
        # context_emb: (batch, context, embed)
        dots = tf.einsum('be,bce->bc', word_emb, context_emb)
        # dots: (batch, context)
        return dots
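
The einsum expression `'be,bce->bc'` in `call` takes, for each example in the batch, the dot product of the target embedding with each of its context embeddings. The NumPy sketch below, with toy shapes chosen only for illustration, checks that against an explicit loop.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_context, embed = 3, 5, 4     # toy sizes (assumed)
word_emb = rng.normal(size=(batch, embed))
context_emb = rng.normal(size=(batch, num_context, embed))

# b = batch, c = context position, e = embedding dimension (summed over).
dots = np.einsum('be,bce->bc', word_emb, context_emb)

# Same computation, one batch element at a time.
expected = np.stack([context_emb[b] @ word_emb[b] for b in range(batch)])

print(dots.shape)                   # (3, 5)
print(np.allclose(dots, expected))  # True
```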


In [47]:
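# Keras calls a loss as fn(y_true, y_pred), so the labels come first.
def custom_loss(y_true, x_logit):
    labels = tf.cast(y_true, x_logit.dtype)
    return tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=labels)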


In [48]:
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                 metrics=['accuracy'])
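
With labels of the form `[1, 0, 0, 0, 0]`, a categorical cross-entropy on logits treats each row of `dots` as a (num_ns + 1)-way classification in which the true context word is class 0. The plain-Python sketch below shows that computation for one row; the function name and the logit values are hypothetical.

```python
import math


def categorical_crossentropy_from_logits(logits, one_hot):
    """Cross-entropy between a one-hot label and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_sum_exp) for y, z in zip(one_hot, logits))


label = [1, 0, 0, 0, 0]
untrained = categorical_crossentropy_from_logits([0.0] * 5, label)
confident = categorical_crossentropy_from_logits([4.0, 0.0, 0.0, 0.0, 0.0], label)
print(untrained)   # log(5), about 1.609
print(confident)   # smaller, since the true context scores highest
```

A loss near log(5) therefore means the model is no better than guessing among the five candidates.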


In [49]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")


Train the model on the dataset for some number of epochs:

In [50]:
word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x1dddbd0a130>

In [52]:
#docs_infra: no_execute
%tensorboard --logdir logs

Reusing TensorBoard on port 6006 (pid 31564), started 0:00:58 ago. (Use '!kill 31564' to kill it.)

## Embedding lookup and analysis 
Obtain the weights from the model using Model.get_layer and Layer.get_weights. The TextVectorization.get_vocabulary function provides the vocabulary to build a metadata file with one token per line.

In [53]:
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [54]:
# Write one embedding vector per line and the matching token per line.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip 0, it's padding.
        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")
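
One common way to inspect embeddings like these is to look up the nearest neighbors of a word by cosine similarity. The sketch below uses tiny hypothetical vectors rather than the trained `weights`, and the helper names are made up for illustration; with the real data, `words` would correspond to `vocab` and `vectors` to the rows of `weights`.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, words, vectors, k=2):
    """Return the k words whose vectors are most similar to the query word's vector."""
    query = vectors[words.index(word)]
    scored = [(cosine_similarity(query, vec), w)
              for w, vec in zip(words, vectors) if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:k]]


# Toy 2-dimensional vectors, invented for this example.
toy_words = ['king', 'queen', 'road']
toy_vectors = [[0.9, 0.1], [0.8, 0.2], [-0.1, 1.0]]
print(nearest('king', toy_words, toy_vectors, k=1))  # ['queen']
```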
