# Word2Vec as a Model of Interactions between Words

## Word2Vec Introduction

The [word2vec](https://en.wikipedia.org/wiki/Word2vec) model was proposed to vectorize words. A word is a string, which a computer cannot compute with directly, so we have to encode words as vectors. One direct encoding is called one-hot: given a vocabulary, which is a list of words, the $i$-th word has vector $x$ with $x^i = 1$ and all other components vanishing, like $(0, \cdots, 0, 1, 0, \cdots, 0)$. One-hot encoding is not very efficient, since its dimension equals the vocabulary size, which may be quite large. This motivates the idea of word2vec: encoding words as dense vectors with a small dimension.
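
To make the contrast concrete, here is a minimal sketch in plain Python. The toy vocabulary and the numbers in the `embedding` table are made up for illustration; they are not learned values.

```python
# Toy vocabulary; the words and sizes here are illustrative assumptions.
vocab = ["the", "cat", "sat", "mat"]


def one_hot(index, size):
    """Return a list with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# A hypothetical dense embedding table: one 2-dimensional row per word.
embedding = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.6, 0.4],
]

cat_one_hot = one_hot(vocab.index("cat"), len(vocab))
cat_dense = embedding[vocab.index("cat")]

print(cat_one_hot)  # its length equals the vocabulary size (4)
print(cat_dense)    # its length is the chosen embedding dimension (2)
```

The one-hot vector grows with the vocabulary, while the dense vector keeps a fixed, small dimension.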

The basic idea behind word2vec is to model the probability that two given words $w_1$ and $w_2$ appear as neighbours in a corpus. Given two words $w_1$ and $w_2$ with vectors $x_1$ and $x_2$ respectively, the probability of being neighbours is assumed to be

$$ p_{\text{neighbour}} (w_1, w_2) \propto \exp(x_1 \cdot x_2). $$

So, for the word2vec model, the learning task is to find a vector for each word so that $p_{\text{neighbour}}$ fits the real data.
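
As a rough illustration of this assumption, the sketch below turns dot products into probabilities. Normalizing over all candidate neighbours of a fixed word is a choice made here for the example, and the vectors are invented.

```python
import math


def dot(a, b):
    """Dot product of two equally long lists of floats."""
    return sum(p * q for p, q in zip(a, b))


def neighbour_probabilities(vectors, i):
    """Probabilities proportional to exp(x_i . x_j), normalized over all j.

    Normalizing over j for a fixed i is an assumption of this sketch.
    """
    scores = [math.exp(dot(vectors[i], x_j)) for x_j in vectors]
    total = sum(scores)
    return [s / total for s in scores]


# Three hypothetical 2-dimensional word vectors.
vectors = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
probs = neighbour_probabilities(vectors, 0)
print(probs)  # word 1 is a more probable neighbour of word 0 than word 2
```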

## Interactions between Words

If words have been one-hot encoded, then for each one-hot encoded word $w_i$, its vector is given by $x_i = W \cdot w_i$. The matrix $W$ has dimension $(E, V)$, where $E$ represents the word-vector dimension and $V$ the vocabulary size. Then, for two distinct words, it can be derived directly that

$$ x_1 \cdot x_2 = \frac{1}{2} u^t \cdot A \cdot u, $$

where

$$ u := w_1 + w_2 $$

and

$$ A := W^t \cdot W - \textrm{diag} (W^t \cdot W). $$
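
The relation between $x_1 \cdot x_2$ and the quadratic form in $A$ can be checked numerically. The sketch below uses a small random $W$ (the sizes and the seed are arbitrary) and compares $x_1 \cdot x_2$ with $\frac{1}{2} u^t \cdot A \cdot u$ for two distinct words.

```python
import random

random.seed(0)
E, V = 3, 5  # small illustrative sizes: vector dimension and vocabulary size
W = [[random.gauss(0.0, 1.0) for _ in range(V)] for _ in range(E)]


def column(matrix, j):
    """The j-th column of W, i.e. the vector of the j-th word."""
    return [row[j] for row in matrix]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# A = W^t W with the diagonal removed, shape (V, V).
A = [
    [0.0 if a == b else dot(column(W, a), column(W, b)) for b in range(V)]
    for a in range(V)
]

i, j = 1, 3  # two distinct word indices
u = [0.0] * V  # u = w_i + w_j, a two-hot vector
u[i] += 1.0
u[j] += 1.0

quadratic = sum(u[a] * A[a][b] * u[b] for a in range(V) for b in range(V))
lhs = dot(column(W, i), column(W, j))
print(lhs, 0.5 * quadratic)  # the two numbers agree up to rounding
```

The quadratic form picks up both $A_{ij}$ and $A_{ji}$, which is where the factor one half comes from.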

The matrix $A$ has dimension $(V, V)$. It is symmetric, with vanishing diagonal elements. It can be recognized as the weight matrix of a Boltzmann machine at unit temperature, with the energy given by $E(u; A) := -(1/2) u^t \cdot A \cdot u$, which we also write as $E(w_1, w_2; W)$ for $u = w_1 + w_2$. Fitting a Boltzmann machine amounts to minimizing the loss

$$ L(W) = E(w_1, w_2; W) - E(\tilde{w}_1, \tilde{w}_2; W), $$

for any two neighbouring words (one-hot encoded) $(w_1, w_2)$ and two "fantasy" words $(\tilde{w}_1, \tilde{w}_2)$. The key point is to ensure that $E(w_1, w_2; W) > E(\tilde{w}_1, \tilde{w}_2; W)$ is more probable than the reverse. In this way, $W$ is adjusted so that $(w_1, w_2)$ moves towards a local minimum of the energy.
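
A minimal sketch of this loss for a single real pair and a single fantasy pair, written directly in terms of word vectors (so the energy of a pair is minus the dot product of its vectors). The vectors and the index pairs are invented for illustration.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def energy(vectors, pair):
    """Energy of a pair of word indices: minus the dot product of their vectors."""
    i, j = pair
    return -dot(vectors[i], vectors[j])


def loss(vectors, real, fantasy):
    """Energy of the real pair minus energy of the fantasy pair."""
    return energy(vectors, real) - energy(vectors, fantasy)


# Hypothetical unit vectors for three words.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
real = (0, 1)     # orthogonal vectors, energy 0
fantasy = (0, 2)  # aligned vectors, energy -0.6
value = loss(vectors, real, fantasy)
print(value)  # positive: the fantasy pair currently has the lower energy
```

Decreasing this value by gradient descent lowers the energy of the real pair and raises that of the fantasy pair.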

A Boltzmann machine ensures this by sampling $u \sim \text{Bernoulli}(\sigma(A \cdot u))$, where $\sigma$ is the sigmoid function. In general, the sample is not a two-hot vector but a multi-hot one, so we have to select only two word-indices from it. Notice that the greater $(A \cdot u)_{\alpha}$ is, the more probably word-index $\alpha$ is selected. This suggests the categorical distribution

$$ p_{\alpha} = \frac{ \exp(\sum_{\beta} A_{\alpha \beta} u^{\beta} / T) }{ \sum_{\alpha'} \exp(\sum_{\beta'} A_{\alpha' \beta'} u^{\beta'} / T) }, $$

where $T$ is a positive number that controls the randomness. It is a categorical distribution over the $V$ word-indices. This suggests sampling two word-indices from this distribution as the fantasy words.
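
The effect of $T$ can be seen in a small sketch. The `logits` list stands in for the components of $A \cdot u$; its values, the helper names, and the seed are assumptions made for the example. The two indices are drawn independently, that is, with replacement.

```python
import math
import random


def categorical_probs(logits, T):
    """Softmax of logits / T, computed stably by subtracting the maximum."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_two(logits, T, rng):
    """Draw two word indices independently from the categorical distribution."""
    probs = categorical_probs(logits, T)
    return rng.choices(range(len(logits)), weights=probs, k=2)


logits = [0.2, 1.0, -0.5, 0.9]  # stands in for the components of A . u
rng = random.Random(0)

print(categorical_probs(logits, T=1.0))   # mass spread over all indices
print(categorical_probs(logits, T=0.01))  # almost all mass on index 1
print(sample_two(logits, T=0.01, rng=rng))
```

A small $T$ makes the sampling nearly greedy, while a large $T$ approaches uniform sampling over the vocabulary.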

## Basic Setup

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.python import keras
from collections import Counter



In [42]:
VOCAB_SIZE = 2 ** 12  # vocabulary size.
BATCH_SIZE = 128  # batch size of training.
VECTOR_DIM = 500  # dimension of word-vector
NEIGHBOURS = 2  # window size used for generating dataset.

## Text Data

The original word2vec model is trained on the "text8" dataset. It is preprocessed and available [online](https://mattmahoney.net/dc/text8.zip) as a zip file. Unzipping it gives a text file named `text8`. We can use this preprocessed data directly, except that we exclude single-character words.

In [3]:
with open('../data/text8', 'r') as f:
    text8 = [word for word in list(f)[0].split(' ') if len(word) > 1]

There are lots of different words in the "text8" text. We shall limit the vocabulary used for building our model. For this purpose, we employ the most frequent words.

In [4]:
%%time
counter = Counter(text8)
vocab = {}
for i, (word, _) in enumerate(counter.most_common(VOCAB_SIZE)):
    vocab[word] = i

CPU times: user 1.67 s, sys: 10.8 ms, total: 1.68 s
Wall time: 1.67 s


In [5]:
id_to_word = {i: w for w, i in vocab.items()}
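
To see what the vocabulary and its inverse mapping look like, here is the same construction on a toy corpus. The sentence and the vocabulary size of 2 are made up for illustration.

```python
from collections import Counter

toy_text = "the cat sat on the mat the cat ran".split(" ")
toy_vocab_size = 2  # hypothetical, far smaller than VOCAB_SIZE

toy_counter = Counter(toy_text)
toy_vocab = {
    word: i
    for i, (word, _) in enumerate(toy_counter.most_common(toy_vocab_size))
}
toy_id_to_word = {i: w for w, i in toy_vocab.items()}

print(toy_vocab)       # {'the': 0, 'cat': 1}
print(toy_id_to_word)  # {0: 'the', 1: 'cat'}
```

More frequent words get smaller indices, and every word outside the top `toy_vocab_size` is simply absent from the mapping.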

Now, we construct the collection of pairs of center word (called "target" in the original paper of word2vec) and its neighbour (called "context") in the corpus.

In [6]:
%%time
targets = []
contexts = []
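# `start=NEIGHBOURS` keeps `i` equal to the position of `target` in `text8`,
# so that the window below is centered on the target word.
for (i, target) in enumerate(text8[NEIGHBOURS:-NEIGHBOURS], start=NEIGHBOURS):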
    for j in range(i-NEIGHBOURS, i+NEIGHBOURS+1):
        if j == i: continue
        context = text8[j]
        targets.append(target)
        contexts.append(context)

CPU times: user 13 s, sys: 286 ms, total: 13.3 s
Wall time: 13.3 s
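
The intended windowing is easy to check on a tiny input. The sketch below uses a hypothetical helper, `window_pairs`, on a five-word list with a window of 2, so that only the middle word has a full window.

```python
def window_pairs(words, neighbours):
    """Pair each interior word with the words at most `neighbours` positions away."""
    targets, contexts = [], []
    for i in range(neighbours, len(words) - neighbours):
        for j in range(i - neighbours, i + neighbours + 1):
            if j == i:
                continue  # a word is not its own context
            targets.append(words[i])
            contexts.append(words[j])
    return targets, contexts


words = ["a", "b", "c", "d", "e"]
targets, contexts = window_pairs(words, 2)
print(list(zip(targets, contexts)))
# [('c', 'a'), ('c', 'b'), ('c', 'd'), ('c', 'e')]
```

Each interior word contributes `2 * neighbours` pairs, and the target never appears as its own context unless the word is repeated within the window.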


While converting from word to its index in the vocabulary, we have to drop the pairs in which there is at least one word that is absent in the vocabulary.

In [7]:
%%time
target_ids = []
context_ids = []
for w1, w2 in zip(targets, contexts):
    if w1 not in vocab or w2 not in vocab:
        continue
    target_ids.append(vocab[w1])
    context_ids.append(vocab[w2])

# List -> np.ndarray -> Dataset is much faster than List -> Dataset.
target_ids = np.asarray(target_ids, dtype='int32')
context_ids = np.asarray(context_ids, dtype='int32')

CPU times: user 13.5 s, sys: 222 ms, total: 13.7 s
Wall time: 13.7 s


In [8]:
target_ids.shape

(46226743,)

Now, convert the processed data to TensorFlow's dataset protocol for training.

In [9]:
ds = tf.data.Dataset.from_tensor_slices((target_ids, context_ids))

Let us see some instances.

In [10]:
for x, y in ds.batch(5).take(1):
    tf.print(f'x: {x}')
    tf.print(f'y: {y}')

x: [   4  724 3255  126 3209]
y: [288 724 892 126   1]


## Model Implementation

In addition to the "model of interactions between words" described previously, we have to regularize the word vectors, or equivalently the matrix $W$, so that the word vectors are normalized. This prevents any word vector from becoming so large that it dominates the interaction.
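
The kind of projection that a unit-norm constraint applies after each update can be sketched in plain Python. The helper name and the small `eps` are assumptions of this example, not the library's implementation.

```python
import math


def project_rows_to_unit_norm(rows, eps=1e-7):
    """Rescale every row (one word vector per row) to have norm close to one."""
    projected = []
    for row in rows:
        norm = math.sqrt(sum(c * c for c in row))
        projected.append([c / (norm + eps) for c in row])
    return projected


# Hypothetical word vectors after a gradient step; the first one has grown large.
rows = [[30.0, 40.0], [0.3, 0.4]]
normalized = project_rows_to_unit_norm(rows)
print(normalized)  # both rows now point in the same direction with norm ~ 1
```

After the projection, only the direction of a word vector matters, so no single word can dominate the dot products through its length.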

In [11]:
class Word2Vec:
    """Word2Vec as a model of interactions between words.

    Args:
        vocab_size: Integer for the vocabulary size.
        vector_dim: Integer for the word-vector dimension.
        T: Positive float for the randomness in generating fantasy data.
        test_ansatz_steps: Non-negative integer for a step length. If it is
          greater than zero, then during the training, we test the ansatz that
          the fantasy data have lower energy than the real data every
          `test_ansatz_steps` steps.
    """

    # Implementation conventions:
    # 1. The x and y employed throughout the implementation represent
    #    word-indices. Thus they are tensors with shape [batch_size]
    #    and dtype int32.
    # 2. We use B for batch size, V for vocabulary size, and D for vector
    #    dimension.

    def __init__(self, vocab_size, vector_dim, T=1e-2, test_ansatz_steps=0):
        self.vocab_size = vocab_size
        self.vector_dim = vector_dim
        self.T = T
        self.test_ansatz_steps = test_ansatz_steps

        # It is convenient to use an Embedding layer to hold the W matrix.
        # The constraint projects W back to unit-norm rows after each step
        # of gradient descent.
        self.embed = keras.layers.Embedding(
            vocab_size, vector_dim,
            embeddings_constraint=keras.constraints.UnitNorm(axis=1),
        )
        self.embed.build([vocab_size])
        self.W = self.embed.weights[0]  # (V, D).

    def __call__(self, x):
        return self.embed(x)

    def energy(self, x, y):
        return -tf.reduce_sum(self(x) * self(y), axis=1)

    def sample_fantasy(self, x, y):
        # Compute logits (the A \cdot u, where u := x + y)
        z = self(x) + self(y)  # (B, D)
        raw_logits = tf.matmul(z, tf.transpose(self.W))  # (B, V)
        indices = tf.stack([tf.range(tf.shape(x)[0]), x], axis=1)
        update = tf.zeros(tf.shape(x))
        logits = tf.tensor_scatter_nd_update(raw_logits, indices, update)

        # Sample two samples by probability proportional to `exp(logits / self.T)`.
        samples = tf.random.categorical((1/self.T) * logits, 2, dtype=tf.int32)
        return tf.unstack(samples, axis=1)

    def loss(self, real, fantasy):
        # Mean energy difference between the real and the fantasy pairs.
        return tf.reduce_mean(
            self.energy(real[0], real[1]) -
            self.energy(fantasy[0], fantasy[1])
        )

    def test_ansatz(self, real, fantasy):
        """The resulting ratio should be no less than one half. As the
        training continues, this ratio gradually decreases towards 0.5.
        """
        ansatz = (
            self.energy(real[0], real[1]) >
            self.energy(fantasy[0], fantasy[1])
        )
        ansatz_correct_ratio = tf.reduce_mean(tf.cast(ansatz, tf.float32))
        return ansatz_correct_ratio

    def get_train_step(self, optimizer):
        step = tf.Variable(0, dtype=tf.int32)

        @tf.function
        def train_step(x, y):
            """The x and y are word-indices, tensors with shape [batch_size]
            and dtype int32.
            """
            real = (x, y)
            fantasy = self.sample_fantasy(x, y)

            # Compute loss and its gradient, and optimize.
            # The gradient with respect to the embedding weights is sparse,
            # so convert it to a dense tensor for the optimizer.
            with tf.GradientTape() as tape:
                loss_value = self.loss(real, fantasy)
            grads = tf.convert_to_tensor(tape.gradient(loss_value, self.W))
            optimizer.apply_gradients([(grads, self.W)])

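            # Test ansatz if needed. The tensor is passed to tf.print directly,
            # since an f-string would be formatted only once, at tracing time.
            if self.test_ansatz_steps and step % self.test_ansatz_steps == 0:
                ansatz_correct_ratio = self.test_ansatz(real, fantasy)
                tf.print('\nansatz correct sample ratio:', ansatz_correct_ratio, '\n')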

            step.assign_add(1)
            return loss_value

        return train_step, step

## Model Training

In [43]:
model = Word2Vec(VOCAB_SIZE, VECTOR_DIM)
# optimizer = keras.optimizers.gradient_descent_v2.SGD()
optimizer = keras.optimizers.adam_v2.Adam()
train_step, step = model.get_train_step(optimizer)

In [52]:
pbar = keras.utils.generic_utils.Progbar(len(ds.batch(BATCH_SIZE)))
for x, y in ds.shuffle(10000).batch(BATCH_SIZE):
    loss_value = train_step(x, y)
    pbar.update(tf.cast(step, tf.float32), values=[('loss', loss_value)])

 37587/361147 [==>...........................] - ETA: 54:00 - loss: 0.0431

KeyboardInterrupt: 

## Evaluation

Since the word-vectors are all normalized, it is natural to consider angular distance as a measurement of the relation between words.

In [50]:
def get_closest_k(model, vector, k):
    z = tf.convert_to_tensor([vector])  # (1, D)
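    # Clip the cosines to [-1, 1]: rounding errors may push them slightly
    # outside this range, where acos returns NaN.
    cosines = tf.clip_by_value(tf.matmul(z, tf.transpose(model.W)), -1.0, 1.0)
    distances = tf.math.acos(cosines)  # (1, V)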
    _, top_ids = tf.math.top_k(-distances, k=k)
    return top_ids.numpy()
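
The same idea can be sketched without TensorFlow on a handful of unit vectors. The function names and the vectors below are chosen for illustration only.

```python
import math


def angular_distance(a, b):
    """Angle in radians between two unit vectors."""
    cosine = sum(p * q for p, q in zip(a, b))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding outside [-1, 1]
    return math.acos(cosine)


def closest_k(vectors, query, k):
    """Indices of the k vectors with the smallest angular distance to `query`."""
    distances = [angular_distance(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: distances[i])[:k]


# Three hypothetical unit vectors: two axes and their bisector.
vectors = [[1.0, 0.0], [0.0, 1.0], [math.sqrt(0.5), math.sqrt(0.5)]]
print(closest_k(vectors, vectors[0], 2))  # [0, 2]
```

As in the evaluation, a word is always its own closest neighbour, so the first returned index is the query itself.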

In [53]:
for word in ('world', 'boy', 'happy', 'zero', 'sun', 'football'):
    closest_indices = get_closest_k(model, model(vocab[word]), 5)
    print(f'{word}: {", ".join([id_to_word[idx] for idx in closest_indices[0,:]])}\n')

world: world, war, century, prime, states

boy: boy, young, poor, former, regard

happy: happy, month, jackson, legislation, canon

zero: zero, two, three, nine, seven

sun: sun, evil, begins, marine, read

football: football, baseball, world, absolute, links



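From this simple evaluation, the word2vec re-implemented from the perspective of interactions appears to capture some relations between words: for example, the words closest to "zero" are other numerals, and "baseball" is the closest to "football". Other neighbourhoods, such as those of "happy" and "sun", are still noisy; note that the training above was interrupted after about a tenth of an epoch.

In addition, if we drop the contribution of the fantasy data from the loss, the training fails in such a way that only the most frequent words (like "the", "of", and "in") appear as the closest for any word.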