# Custom Embedding by CBOW Example (Dense Vector)

As you saw in the [previous example](./01_sparse_vector.ipynb), the generated count vectors are sparse, and many algorithms don't work well with such high-dimensional vectors.<br>
For this reason, today's trainers transform sparse vectors into non-sparse forms (dense vectors) and perform tasks (such as NLP classification) against these dense vectors in practice. (See below.)

![Dense vectorize](images/dense_vectorize.png?raw=true)

In this network, the generated dense vector (i.e., the non-sparse form) represents some aspects (meaning) of words or documents. For instance, if "dog" and "cat" are closely related to each other in this task, the generated dense vectors for "dog" and "cat" might have high cosine similarity. In this representation, "burger" and "hot-dog" might be closer to each other than either is to "ice-cream". (This idea rests on the **distributional hypothesis**: words that occur in similar contexts tend to have similar meanings.)<br>
Well-trained vectors might also capture analogies between words, such as "king" - "man" + "woman" ≈ "queen". (See Mikolov et al.)
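
To make "close cosine similarity" concrete, here is a minimal sketch with NumPy. The three vectors are hand-made toy values (not trained embeddings), and the helper name `cosine_similarity` is my own choice for this illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors (NOT trained embeddings), for illustration only.
toy_vectors = {
    "dog":    [0.9, 0.8, 0.1],
    "cat":    [0.8, 0.9, 0.2],
    "burger": [0.1, 0.2, 0.9],
}

sim_dog_cat = cosine_similarity(toy_vectors["dog"], toy_vectors["cat"])
sim_dog_burger = cosine_similarity(toy_vectors["dog"], toy_vectors["burger"])

# With these toy values, "dog" is much closer to "cat" than to "burger".
print(sim_dog_cat, sim_dog_burger)
```

Cosine similarity only looks at the direction of the vectors, so two vectors with very different lengths can still be maximally similar.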

To get dense vectors, you have the following 3 options:

1. Train embeddings from scratch.
2. Use existing embeddings pre-trained on a large text corpus. (See Hugging Face hub or TF-Hub for many pre-trained SOTA models.)
3. Use pre-trained embeddings and train them further (fine-tune) by yourself.

> Note : I assume that $\mathbf{w}$ is a word index vector (sparse one-hot vector) with vocabulary size $|V|$ and $\mathbf{E}$ is a $ |V| \times d $ matrix which converts a sparse vector into a $d$-dimensional dense vector by $ \mathbf{w} \mathbf{E} $. (i.e., the i-th row of $\mathbf{E}$ is the dense vector for a word $\mathbf{w}$, when the i-th element of $\mathbf{w}$ is $1$ and all other elements are $0$.)<br>
> To fine-tune pre-trained vectors, the following approaches also exist :<br>
> - Find an additional matrix $\mathbf{T} \in \mathbb{R}^{d \times d} $, with which we can obtain the new embedding $\mathbf{E} \mathbf{T}$
> - Find an additional matrix $\mathbf{A} \in \mathbb{R}^{|V| \times d} $, with which we can obtain the new embedding $\mathbf{E} + \mathbf{A}$
> - A hybrid of the two approaches above
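
The note above can be checked with a few lines of NumPy. This is only a sketch under my own assumptions: the sizes are toy values, $\mathbf{T}$ and $\mathbf{A}$ are initialized by hand instead of being learned, and $\mathbf{E}\mathbf{T} + \mathbf{A}$ is just one possible form of the hybrid.

```python
import numpy as np

vocab_size, d = 5, 3                    # toy sizes, chosen only for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d))    # embedding matrix, one row per word

# One-hot (sparse) vector for the word with index i = 2
i = 2
w = np.zeros(vocab_size)
w[i] = 1.0

dense = w @ E                           # shape (d,)
# Multiplying by a one-hot vector simply selects the i-th row of E
same_as_row = np.allclose(dense, E[i])

# Fine-tuning variants from the note. T and A would be learned in practice;
# here they start at values which leave the embedding unchanged.
T = np.eye(d)                           # d x d transformation
A = np.zeros((vocab_size, d))           # |V| x d additive correction
E_transformed = E @ T                   # E T
E_shifted = E + A                       # E + A
E_hybrid = E @ T + A                    # one possible hybrid (assumption)

print(same_as_row, E_transformed.shape, E_shifted.shape)
```

Because the product with a one-hot vector is just a row selection, embedding layers are usually implemented as a table lookup rather than an actual matrix multiplication.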

In this exercise, I'll show you a brief example of self-trained embeddings.

In many of today's NLP models, each word is embedded into a dense vector and the sequence of words in a document is trained by sequence models (such as an LSTM-based RNN or a Transformer) with a large corpus. (See [RNN example](./06_language_model_rnn.ipynb) for details.)<br>
However, for your first tutorial, I'll introduce a simple regression (or classification) trainer, in which each word is embedded and the sequence is combined by using CBOW (continuous bag-of-words).

**CBOW** (continuous bag-of-words) is a primitive way to combine vectors by taking their mean (average) as follows, which ignores the order of words in the sequence.

$ \frac{1}{k} \sum_{i=1}^{k} v(w_i) $ &nbsp;&nbsp;&nbsp; where $v(w_i)$ is the dense vector of the $i$-th word and $k$ is the number of words.

> Note : As I have mentioned in "[Sparse Vector](01_sparse_vector.ipynb)", you can also use weighted coefficients (such as position weighting, TF-IDF weighting, etc.) in CBOW. This is called weighted CBOW, or WCBOW for short.
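
The following NumPy sketch illustrates both the plain mean and a weighted variant. The helper `cbow` and the toy vectors are assumptions for this illustration, and dividing by the sum of weights is my own choice of normalization.

```python
import numpy as np

def cbow(vectors, weights=None):
    """Average word vectors; with weights, return the weighted average."""
    vectors = np.asarray(vectors, dtype=float)    # shape (k, d)
    if weights is None:
        return vectors.mean(axis=0)
    weights = np.asarray(weights, dtype=float)    # shape (k,)
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()

# Toy 2-dimensional vectors for a 3-word sequence
v = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

plain = cbow(v)                                # mean over the 3 rows
reordered = cbow(v[::-1])                      # same words, reversed order
weighted = cbow(v, weights=[1.0, 1.0, 2.0])    # last word counts twice

print(plain, reordered, weighted)
```

Reversing the sequence gives the same result as `plain`, which is exactly what "ignores the order" means.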

*back to [index](https://github.com/tsmatz/nlp-tutorials/)*

## Install required packages

In [None]:
!pip install tensorflow==2.6.2 tensorflow-datasets nltk numpy

In [None]:
import nltk
nltk.download("popular")

## Prepare data

In this example, we use the IMDB dataset (movie review dataset).<br>
This dataset includes the review text and a 2-class label: 0 for dissatisfied (negative) and 1 for satisfied (positive).

In [1]:
import tensorflow_datasets as tfds

train_data = tfds.load(
    name="imdb_reviews",
    split=("train"),
    as_supervised=True)

In [2]:
len(list(train_data))

25000

In [2]:
for text, label in train_data.take(1):
    print(text)
    print(label)

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)


To get better performance (accuracy), we standardize the input review text as follows.
- Convert all words to lowercase
- Remove all stop words, such as "a", "the", "is", "I", etc.
- Remove all punctuation

Furthermore, I convert the label integer (0 or 1) into a one-hot vector ([1, 0] or [0, 1]).

> Note : You can define text standardization in ```tf.keras.layers.TextVectorization()```, such as, ```TextVectorization(... , standardize=my_custom_standarize_func)```, in TensorFlow.<br>
> To show the results of standardization, I have implemented the standardization separately from text vectorization.
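
To see the three steps without any TensorFlow machinery, here is a plain-Python sketch. The stop-word set is a tiny hypothetical list (the notebook uses NLTK's English list), and the helper name `standardize` is my own. Note that this sketch removes punctuation before stop words, which differs from the order in the cell that follows. The trade-off is that a token like "this," is caught, but stop words containing an apostrophe would no longer match a list that keeps the apostrophe.

```python
import re
import string

# Tiny hypothetical stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "the", "is", "was", "i", "this"}

def standardize(text):
    # 1. lowercase
    text = text.lower()
    # 2. remove punctuation first, so tokens like "this," still match
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # 3. remove stop words
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

result = standardize("This was an absolutely terrible movie.")
print(result)
```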

In [3]:
import tensorflow as tf
import re
import string
import nltk
from nltk.corpus import stopwords

def standarize_input(text, label):
    #
    # to lowercase
    #
    text = tf.strings.lower(text)

    #
    # remove stop words
    #

    # # This doesn't work...
    # word_list = tf.strings.split(text)
    # for w in stopwords.words("english"):
    #     word_list = tf.gather(word_list, tf.where(tf.math.not_equal(word_list, w)))
    # text = tf.strings.reduce_join(word_list, separator=" ")

    for w in stopwords.words("english"):
        text = tf.strings.regex_replace(
            text,
            r"(^|\s+)%s(\s+|$)" % re.escape(w),
            " ")
    text = tf.strings.strip(text)

    #
    # remove punctuation
    #
    text = tf.strings.regex_replace(
        text,
        "[%s]" % re.escape(string.punctuation),
        "")

    #
    # get first 150 characters
    #

    #text = tf.strings.substr(
    #    text, pos=0, len=150, unit="UTF8_CHAR"
    #)
    #text = tf.strings.regex_replace(
    #    text,
    #    "\w+$",
    #    "")
    #text = tf.strings.strip(text)
    return text, label
train_data = train_data.map(standarize_input)

def label_to_one_hot(text, label):
    label = tf.one_hot(label, depth=2)
    return text, label
train_data = train_data.map(label_to_one_hot)

Cause: for/else statement not yet supported



In [4]:
for text, label in train_data.take(1):
    print(text)
    print(label)

tf.Tensor(b'absolutely terrible movie lured christopher walken michael ironside great actors must simply worst role history even great acting could redeem movies ridiculous storyline movie early nineties us propaganda piece pathetic scenes columbian rebels making cases revolutions maria conchita alonso appeared phony pseudolove affair walken nothing pathetic emotional plug movie devoid real meaning disappointed movies like this ruining actors like christopher walkens good name could barely sit it', shape=(), dtype=string)
tf.Tensor([1. 0.], shape=(2,), dtype=float32)


In [4]:
tf.data.experimental.save(train_data, "saved_data")

## Build network

First we'll build the embedding network.

![Embedding layer](images/embedding_layer.png?raw=true)

In the first step, I create a list of the words used in the training set, and convert each word in the training set into its index in this word list (i.e., tokenize the text).

![Index vectorize](images/index_vectorize.png?raw=true)
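
As a rough plain-Python picture of what the vectorizer does, the sketch below builds a frequency-ordered vocabulary and maps words to indices. The helpers `build_vocabulary` and `to_indices` are hypothetical names for this illustration. Reserving index 0 for padding and index 1 for unknown words follows, as far as I know, the convention of ```TextVectorization``` in ```int``` mode.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Most frequent words first; index 0 = padding, index 1 = unknown."""
    counts = Counter(word for t in texts for word in t.split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def to_indices(text, vocabulary):
    """Map each word to its index; out-of-vocabulary words map to 1."""
    lookup = {w: i for i, w in enumerate(vocabulary)}
    return [lookup.get(w, 1) for w in text.split()]

texts = ["great movie great acting", "terrible movie"]
vocab = build_vocabulary(texts, max_tokens=5)
ids = to_indices("great terrible plot", vocab)

print(vocab)
print(ids)
```

With `max_tokens=5`, only the 3 most frequent words get their own index, so "terrible" and "plot" both fall back to the unknown index.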

In [5]:
vocab_size = 10000

# Set up vectorizer
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_sequence_length=None, # maximum length of sequences
    output_mode="int",
    pad_to_max_tokens=False)

# create vocabulary list (max 10000)
# (UNK is automatically included)
text_data = train_data.map(lambda x, y: x)
vectorize_layer.adapt(text_data)

In [7]:
len(vectorize_layer.get_vocabulary())

10000

In [6]:
model = tf.keras.models.Sequential()
model.add(vectorize_layer)

Next, we convert each index (i.e., word) in a row into the corresponding embedded vector (dense vector).

![Word embeddings](images/word_embedding.png?raw=true)

In [7]:
embedding_dim = 16

model.add(tf.keras.layers.Embedding(
    vocab_size,
    embedding_dim,
    trainable=True,
    name="embedding"))
# model.add(tf.keras.layers.Dropout(0.2))

Now we apply CBOW (continuous bag-of-words) to the embedded word vectors as follows.

$$ \frac{1}{k} \sum_{i=1}^{k} v(w_i) $$

where $w_i$ is the $i$-th word (in this case, the integer index representing the word), $k$ is the number of words in the sequence, and $v(\cdot)$ is the embedding function.

![CBOW](images/continuous_bow.png?raw=true)

In this CBOW representation, the order of words in the sentence will be ignored.

In [8]:
model.add(tf.keras.layers.GlobalAveragePooling1D())
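
One subtlety worth knowing: when sequences in a batch are padded with index 0, a plain average also includes the vectors of the padding positions. The NumPy sketch below (toy matrix and indices, all assumed for illustration) compares the plain mean with a mean over real words only. As far as I understand, Keras can apply such a mask when the ```Embedding``` layer is created with ```mask_zero=True```, which this notebook does not set.

```python
import numpy as np

# Toy embedding matrix: row 0 belongs to the padding index
E = np.array([[0.5, 0.5],
              [1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])

ids = np.array([1, 3, 0, 0])           # two real words, then two paddings
vectors = E[ids]                       # embedding lookup, shape (4, 2)

# Plain average: padding rows are averaged as well
plain_mean = vectors.mean(axis=0)

# Masked average: only positions with a real word contribute
mask = (ids != 0).astype(float)        # 1 for real words, 0 for padding
masked_mean = (vectors * mask[:, None]).sum(axis=0) / mask.sum()

print(plain_mean, masked_mean)
```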

Now we'll build the task layer.

![Task layer](images/task_layer.png?raw=true)

In our network, we just use a fully connected feed-forward layer (dense layer), in which the final output is the logits for the 2 classes (matching the one-hot labels).

In [9]:
# # Remove hidden layer to train embedding more
# model.add(tf.keras.layers.Dense(
#     16,
#     activation="relu",
#     trainable=True))
model.add(tf.keras.layers.Dense(
    2,
    activation=None,
    trainable=True))
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
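
Since the last layer has no activation, the loss has to turn logits into probabilities by itself, which is what ```from_logits=True``` asks for. Here is a minimal NumPy sketch of that computation (softmax followed by cross-entropy against one-hot labels). The helper name is my own, and this is not the Keras implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot):
    """Mean cross-entropy between softmax(logits) and one-hot labels."""
    logits = np.asarray(logits, dtype=float)
    one_hot = np.asarray(one_hot, dtype=float)
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(one_hot * log_probs).sum(axis=-1).mean())

# Equal logits: the model has no preference between the 2 classes
loss_uninformed = softmax_cross_entropy([[0.0, 0.0]], [[1.0, 0.0]])
# Large logit on the correct class: confident and correct
loss_confident = softmax_cross_entropy([[4.0, -4.0]], [[1.0, 0.0]])

print(loss_uninformed, loss_confident)
```

An uninformed 2-class prediction gives a loss of $\ln 2 \approx 0.693$, which is why the loss in the first epoch of the training log starts near 0.69.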

## Train model

Now let's train our network.

In [10]:
train_data = tf.data.experimental.load("saved_data")

In [11]:
class CustomOutputCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        if epoch % 10 == 0:
            print("Epoch {} - loss: {:2.4f} - accuracy: {:2.4f}".format(epoch, logs["loss"], logs["accuracy"]))
    def on_train_end(self, logs=None):
        print("Final - loss: {:2.4f} - accuracy: {:2.4f}".format(logs["loss"], logs["accuracy"]))

model.fit(
    train_data.shuffle(10000).batch(512),
    epochs=300,
    verbose=0,
    callbacks=[CustomOutputCallback()])

Epoch 0 - loss: 0.6920 - accuracy: 0.5395
Epoch 10 - loss: 0.5716 - accuracy: 0.8300
Epoch 20 - loss: 0.4307 - accuracy: 0.8751
Epoch 30 - loss: 0.3425 - accuracy: 0.8953
Epoch 40 - loss: 0.2908 - accuracy: 0.9084
Epoch 50 - loss: 0.2553 - accuracy: 0.9186
Epoch 60 - loss: 0.2276 - accuracy: 0.9261
Epoch 70 - loss: 0.2051 - accuracy: 0.9328
Epoch 80 - loss: 0.1882 - accuracy: 0.9398
Epoch 90 - loss: 0.1724 - accuracy: 0.9444
Epoch 100 - loss: 0.1589 - accuracy: 0.9497
Epoch 110 - loss: 0.1466 - accuracy: 0.9541
Epoch 120 - loss: 0.1370 - accuracy: 0.9578
Epoch 130 - loss: 0.1275 - accuracy: 0.9607
Epoch 140 - loss: 0.1191 - accuracy: 0.9644
Epoch 150 - loss: 0.1092 - accuracy: 0.9666
Epoch 160 - loss: 0.1030 - accuracy: 0.9702
Epoch 170 - loss: 0.0946 - accuracy: 0.9730
Epoch 180 - loss: 0.0865 - accuracy: 0.9760
Epoch 190 - loss: 0.0833 - accuracy: 0.9783
Epoch 200 - loss: 0.0777 - accuracy: 0.9805
Epoch 210 - loss: 0.0713 - accuracy: 0.9826
Epoch 220 - loss: 0.0667 - accuracy: 0.9846

<keras.callbacks.History at 0x7efc84f0cb70>

## Show embedding results

Now let's see how the trained embedding layer performs.<br>
In this example, I'll briefly show you the top 10 words most similar to the word "```great```" using the generated embedding.

First, restore the trained embedding layer.

In [34]:
weights = model.get_layer("embedding").get_weights()
embedding_layer = tf.keras.layers.Embedding(
    vocab_size,
    embedding_dim)
embedding_layer.build((None, ))
embedding_layer.set_weights(weights)
embedding_layer.trainable = False
test_model = tf.keras.models.Sequential([vectorize_layer, embedding_layer])

Get the distance from the word "```great```" to all words in the vocabulary.

> Note : Here I didn't use cosine similarity, but used the squared Euclidean distance to measure similarity.
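
The two measures can rank neighbors differently, because the distance is sensitive to vector length while cosine similarity is not. The sketch below uses a hand-made toy matrix, and both helper names are assumptions for this illustration.

```python
import numpy as np

def nearest_by_distance(matrix, query, k):
    """Indices of the k rows with the smallest squared Euclidean distance."""
    dist = np.sum(np.square(matrix - query), axis=1)
    return np.argsort(dist)[:k]

def nearest_by_cosine(matrix, query, k):
    """Indices of the k rows with the largest cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = (matrix @ query) / norms
    return np.argsort(-sims)[:k]

toy = np.array([[1.0, 1.0],    # index 0: the query itself
                [3.0, 3.0],    # index 1: same direction, larger norm
                [1.0, 0.5]])   # index 2: nearby, but different direction
query = toy[0]

by_distance = nearest_by_distance(toy, query, k=3)
by_cosine = nearest_by_cosine(toy, query, k=3)

print(by_distance, by_cosine)
```

By distance, index 2 is the closest neighbor of the query and index 1 is the farthest. By cosine similarity, index 1 ties with the query itself and index 2 comes last.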

In [35]:
import numpy as np

# Get embedding vector for the word "great"
input_data = [["great"]]
target_word_vector = tf.squeeze(test_model.predict(input_data)).numpy()

# Get vector list for all words (10,000 words)
vocab_list = vectorize_layer.get_vocabulary()
vocab_list = vocab_list[1:] # drop the first entry (empty padding token)
vocab_vector_list = tf.squeeze(test_model.predict([[" ".join(vocab_list)]]))

# Get distance in all words
distance_list = [np.sum(np.square(v - target_word_vector)) for v in vocab_vector_list]
#[np.square(v - target_word_vector) for v in vocab_vector_list]

Get the top 10 words most similar to the word "```great```". (In this example, n-gram phrases such as "```nice job```" or "```good job```" won't be shown.)

This embedding is trained to capture the sentiment tone of each word, so it won't detect other kinds of similarity, such as that between "```dog```" and "```puppy```".

In [36]:
import numpy as np

indices_list = np.argsort(distance_list)
np.array(vocab_list)[indices_list[:10]]

array(['great', 'true', 'bit', 'best', 'job', 'apartment', 'glover',
       'worth', 'ossessione', 'intense'], dtype='<U17')