# Word2Vec algorithm (Negative Sampling example)

In the [previous example](./03_language_model_basic.ipynb), we trained a model to predict the next word for a given sequence of words.<br>
In this example, using the same training set, I'll instead train a model to predict whether a word (context word) appears within a fixed-size window around the target word (focus word). For example, when the following sentence is given, the correct context words for the target word "Obama" are the 3 words "Barack", "is", and "president".

"Barack Obama is president of U.S."

As you saw in the [previous example](./03_language_model_basic.ipynb), the model becomes computationally expensive when the vocabulary is large. To make it scale to arbitrarily large vocabularies, the algorithm can be modified to sample k incorrect words and train on only this subset of words, instead of computing probabilities for all words. (See the papers by Collobert & Weston or Bengio et al.)<br>
This method is called **Negative Sampling (NS)**.

> Note : In the Word2Vec family, you can also choose another optimization objective, called **Hierarchical Softmax**, instead of Negative Sampling (NS).

Today's refined embedding algorithms, such as Word2Vec, incorporate this Negative Sampling idea.<br>

The **Word2Vec** algorithm is based on the distributional hypothesis: words that occur in similar contexts tend to have similar meanings, so target words are represented according to the contexts in which they occur.<br>
In this example, I'll introduce the Word2Vec model as a neural network with the Negative Sampling (NS) method.

Given a target word (focus word), we first sample both correct and incorrect context words.<br>
For each sampled context word, we then compute a score against the target word, which should be high for correct words and low for incorrect words.<br>
Finally, we optimize a loss on these scores to train the Word2Vec model.
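
The sampling step above can be sketched in plain Python as follows. This is only an illustration under my own assumptions: the helper `make_training_triples`, the toy vocabulary, and the choice of `k=2` are hypothetical and are not what the notebook uses later (it relies on a Keras utility instead).

```python
import random

def make_training_triples(target, positives, vocabulary, k=2, seed=0):
    """Return (target, context, label) triples.

    label is 1 for an observed (correct) context word and
    0 for a randomly sampled (incorrect) one.
    """
    rng = random.Random(seed)
    # Negative candidates: words that are neither the target nor a true context
    candidates = [w for w in vocabulary if w != target and w not in positives]
    triples = []
    for pos in positives:
        triples.append((target, pos, 1))
        # Draw k distinct incorrect words for each correct one
        for neg in rng.sample(candidates, k):
            triples.append((target, neg, 0))
    return triples

# Toy vocabulary (made up for illustration)
vocabulary = ["barack", "obama", "is", "president", "of", "us",
              "banana", "guitar", "river", "tuesday"]
triples = make_training_triples("obama", ["barack", "is", "president"], vocabulary, k=2)
for t in triples:
    print(t)
```

Each correct pair is followed by k incorrect pairs, so the model only ever scores k+1 words per observation instead of the whole vocabulary.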

> Note : This is called the **Skip-Gram (SG)** model in Word2Vec algorithms. (See the note below for the alternative CBOW model.)

*back to [index](https://github.com/tsmatz/nlp-tutorials/)*

## Install required packages

In [None]:
!pip install tensorflow==2.6.2 pandas nltk scipy numpy

In [None]:
import nltk
nltk.download("popular")

## Prepare data

As in the [previous example](./03_language_model_basic.ipynb), here I again use the short description text from a news dataset.<br>
Before starting, please download [News_Category_Dataset_v2.json](https://www.kaggle.com/datasets/rmisra/news-category-dataset) (collected from HuffPost) from Kaggle.

In [1]:
import pandas as pd

df = pd.read_json("News_Category_Dataset_v2.json",lines=True)
train_data = df["short_description"]
train_data

0         She left her husband. He killed their children...
1                                  Of course it has a song.
2         The actor and his longtime girlfriend Anna Ebe...
3         The actor gives Dems an ass-kicking for not fi...
4         The "Dietland" actress said using the bags is ...
                                ...                        
200848    Verizon Wireless and AT&T are already promotin...
200849    Afterward, Azarenka, more effusive with the pr...
200850    Leading up to Super Bowl XLVI, the most talked...
200851    CORRECTION: An earlier version of this story i...
200852    The five-time all-star center tore into his te...
Name: short_description, Length: 200853, dtype: object

To get better performance (accuracy), we standardize the input text as follows.
- Convert all words to lowercase in order to reduce the number of distinct words
- Replace "-" (hyphen) with a space
- Remove all punctuation
- Remove stop words (using the NLTK English stop word list)
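
To see what the lowercase, hyphen, and punctuation steps do to a single sentence, here is a minimal plain-Python sketch. The function name `standardize` and the sample sentence are made up for illustration; the notebook itself applies the equivalent operations to the whole pandas Series.

```python
import re
import string

def standardize(text):
    """Lowercase, turn hyphens into spaces, and strip punctuation."""
    text = text.lower()
    text = text.replace("-", " ")
    # Remove every ASCII punctuation character
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    # Collapse runs of whitespace left behind by the removals
    return " ".join(text.split())

print(standardize("A well-known U.S. actor said: 'Hello, world!'"))
# prints: a well known us actor said hello world
```

Note that "U.S." collapses into the token "us", which is one reason why "us" can show up among the most frequent words.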

> Note : N-gram words (such as "New York" or "Barack Obama") and lemmatization (standardization of forms such as "have", "had", or "having") should also be dealt with, but here I have skipped this pre-processing.<br>
> In strict pre-processing, we should also take care of polysemy. (Different meanings of the same word should have different tokens.)<br>
> For N-gram detection, see [exercise05](./05_ngram_cnn.ipynb).

In [2]:
import nltk
from nltk.corpus import stopwords
import re
import string

# to lowercase
train_data = train_data.str.lower()

# replace hyphen (literal replacement, not a regex)
train_data = train_data.str.replace("-", " ", regex=False)

# remove stop words (only when it includes punctuation)
for w in stopwords.words("english"):
    if re.match(r"(^|\w+)[%s](\w+|$)" % re.escape(string.punctuation), w):
        train_data = train_data.str.replace(r"(^|\s+)%s(\s+|$)" % re.escape(w), " ", regex=True)
train_data = train_data.str.strip()

# remove punctuation
train_data = train_data.str.replace("[%s]" % re.escape(string.punctuation), "", regex=True)
train_data = train_data.str.strip()

# remove stop words (only when it doesn't include punctuation)
for w in stopwords.words("english"):
    if not re.match(r"(^|\w+)[%s](\w+|$)" % re.escape(string.punctuation), w):
        train_data = train_data.str.replace(r"(^|\s+)%s(\s+|$)" % re.escape(w), " ", regex=True)
train_data = train_data.str.strip()

# drop NaN
train_data = train_data.dropna()

In [13]:
# train_data.to_csv("exercise05.csv", header=True, index=False)
# train_data = pd.read_csv("exercise05.csv")

## Generate inputs

Now let's generate inputs for training.<br>
As in the previous examples, we first generate sequences of word indices (i.e., tokenize) from the text.

![Index vectorize](images/index_vectorize.png?raw=true)

Note that the generated word indices are sorted by word frequency.<br>
Index 1 is reserved for "[UNK]", so index 2 is the most frequently occurring token in this corpus, index 3 is the second most frequent, and so on.
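
The frequency-ranked indexing can be approximated in plain Python as follows. This is a sketch of the idea, not the Keras `Tokenizer` implementation; the helper `build_word_index` and the three toy texts are made up for illustration.

```python
from collections import Counter

def build_word_index(texts, oov_token="[UNK]"):
    """Assign index 1 to the OOV token, then 2, 3, ... by descending frequency."""
    counts = Counter(w for t in texts for w in t.split())
    word_index = {oov_token: 1}
    # most_common() returns words sorted from most to least frequent
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

texts = ["new year new life", "new day", "one life"]
word_index = build_word_index(texts)
print(word_index)
```

Here "new" occurs three times and gets index 2, and "life" occurs twice and gets index 3. The remaining words occur once each, so their relative order is arbitrary.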

In [3]:
import tensorflow as tf

vocab_size = 50000

corpus = " ".join(train_data)
new_tokens = [w for w in corpus.split() if w.isalpha()]
new_corpus = " ".join(new_tokens)
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=vocab_size,
    oov_token="[UNK]"
)
tokenizer.fit_on_texts([new_corpus])

sequences = tokenizer.texts_to_sequences(train_data)

In [4]:
list(tokenizer.word_index.items())[:20] # show top 20 word's index

[('[UNK]', 1),
 ('one', 2),
 ('new', 3),
 ('us', 4),
 ('time', 5),
 ('people', 6),
 ('like', 7),
 ('day', 8),
 ('said', 9),
 ('life', 10),
 ('get', 11),
 ('year', 12),
 ('many', 13),
 ('would', 14),
 ('make', 15),
 ('years', 16),
 ('first', 17),
 ('know', 18),
 ('want', 19),
 ('may', 20)]

Now let's generate inputs by Skip-Gram (SG) with Negative Sampling (NS).<br>
For instance, when the following sentence is given and we want to find context words for the target word "obama" with window size 2,

"in 2012 us president obama won votes and republican romney got 206 votes"

"us", "president", or "won" will be positive context words, but "2012", "republican", or "romney" will be negative context words.

![Skip-Gram](images/skip_gram.png?raw=true)
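
The window logic for the "obama" example can be checked with a few lines of plain Python. The helper `split_by_window` is hypothetical and only separates in-window words from the rest of the sentence. As far as I understand, the Keras utility used below draws its negative words at random from the whole vocabulary rather than from the same sentence, so treat the second list as candidates only.

```python
def split_by_window(tokens, target, window_size):
    """Split a sentence into words inside and outside the target's window."""
    i = tokens.index(target)
    lo = max(0, i - window_size)
    hi = min(len(tokens), i + window_size + 1)
    # Words within window_size positions of the target, excluding the target itself
    inside = [w for j, w in enumerate(tokens[lo:hi], start=lo) if j != i]
    # Everything else in the sentence
    outside = tokens[:lo] + tokens[hi:]
    return inside, outside

sentence = "in 2012 us president obama won votes and republican romney got 206 votes"
inside, outside = split_by_window(sentence.split(), "obama", window_size=2)
print(inside)   # positive context words
print(outside)  # words outside the window
```

The word "votes" appears in both lists because it occurs twice in the sentence, once inside the window and once outside.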

> Note : In this example, we pick context words uniformly, regardless of their position in the window. For instance, the context words "us" and "president" have the same weight with respect to the target word "obama" in the example above.<br>
> In Word2Vec, another variation that uses positional context is also possible.

To generate Skip-Gram word pairs in TensorFlow, you can use ```tf.keras.preprocessing.sequence.skipgrams``` as follows.

Note that the training set is biased by word frequency. For instance, words such as "one", "new", or "make" occur very frequently in this corpus and therefore carry little useful information for training.<br>
By specifying a sampling table as follows, these words are picked only rarely (with low probability).
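
To get a feel for how strongly frequent words are suppressed, here is a small sketch of the word2vec-style subsampling rule described in the Keras documentation for `make_sampling_table`. The function `keep_probability` is my own illustration: it takes a word's relative frequency directly, whereas the Keras function estimates frequencies from the word's rank by assuming a Zipf distribution.

```python
import math

def keep_probability(word_frequency, sampling_factor=1e-5):
    """Probability of keeping a word as a target, given its relative frequency."""
    ratio = word_frequency / sampling_factor
    # Equivalent to min(1, sqrt(sampling_factor / word_frequency))
    return min(1.0, math.sqrt(ratio) / ratio)

# Relative frequencies from very common (1e-2) to rare (1e-6)
for freq in (1e-2, 1e-4, 1e-5, 1e-6):
    print(freq, round(keep_probability(freq), 4))
```

With the default factor, a word that makes up 1% of the corpus is kept only about 3% of the time, while words at or below a frequency of 1e-5 are always kept.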

In [5]:
window_size = 3
target_list, context_list, label_list = [], [], []

sampling_tbl = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)
for seq in sequences:
    samples, labels = tf.keras.preprocessing.sequence.skipgrams(
        seq,
        vocabulary_size=vocab_size,
        sampling_table=sampling_tbl,
        window_size=window_size,
        negative_samples=4.0)
    target_list.extend([t for t, c in samples])
    context_list.extend([c for t, c in samples])
    label_list.extend(labels)

In [6]:
train_tf_data = tf.data.Dataset.from_tensor_slices((
    (target_list, context_list),
    label_list))

In [50]:
#tf.data.experimental.save(train_tf_data, "saved_data")

## Build network and Train

Now let's build the Word2Vec (Skip-Gram) network and train it.

In this network, we generate dense vectors for both target and context words with embedding layers, and take their dot product as follows.

I won't go into detail here, but in traditional NLP, a matrix of word-context pairs (the so-called PMI matrix) is built, and its dimensionality is reduced by factorization with SVD (Singular Value Decomposition) in order to avoid high computational costs and sparsity. (This is based on the idea of **PMI**, pointwise mutual information.)<br>
In this Word2Vec model (a neural method), the same PMI-based idea is achieved simply by the **dot product** between the word's embedding vector and the context's embedding vector, combined with frequency-based sampling of words.

We will then evaluate the loss by applying the [sigmoid](https://tsmatz.wordpress.com/2017/08/30/glm-regression-logistic-poisson-gaussian-gamma-tutorial-with-r/) function, $\prod_{i=1}^{k} \frac{1}{1+e^{-\mathbf{w}\cdot\mathbf{c}_i}}$, where $\mathbf{w}$ is the vector of the target word (focus word) and $\mathbf{c}_i$ are the vectors of its corresponding context words.
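
As a worked illustration of the scoring, the following plain-Python sketch computes the per-pair loss: the sigmoid of the dot product is treated as the probability that the pair is a correct one, and binary cross-entropy is taken against the label. The vectors are made-up 3-dimensional values, and `pair_loss` is a hypothetical helper. Mathematically this is the same quantity that `tf.nn.sigmoid_cross_entropy_with_logits` computes, though the TensorFlow version is written in a numerically safer form.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_loss(target_vec, context_vec, label):
    """Binary cross-entropy on sigmoid(target . context); label is 1 or 0."""
    p = sigmoid(dot(target_vec, context_vec))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

w = [0.5, -1.0, 2.0]        # made-up target embedding
c_pos = [0.4, -0.8, 1.5]    # made-up context pointing in a similar direction
c_neg = [-0.3, 0.9, -1.2]   # made-up context pointing away

print(pair_loss(w, c_pos, 1))  # small: correct pair with a high score
print(pair_loss(w, c_neg, 0))  # small: incorrect pair with a low score
print(pair_loss(w, c_neg, 1))  # large: the label contradicts the score
```

Training pushes the embeddings so that the first two cases become the norm: high dot products for observed pairs and low ones for sampled negatives.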

![Word2Vec model](images/word2vec_network.png?raw=true)

> Note : In the Word2Vec family, you can also use another context representation, $\frac{1}{1 + e^{-\sum \mathbf{w}\cdot\mathbf{c}_i}}$, instead. This is called the **CBOW** approach, in contrast to Skip-Gram (SG).

In this model, only the embeddings are trained, so training eventually yields well-trained word vectors. This is why this model is widely used to obtain a model for word vectorization.

In [None]:
#train_tf_data = tf.data.experimental.load("saved_data")

In [7]:
class Word2VecModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2VecModel, self).__init__()

        self.embedding_target = tf.keras.layers.Embedding(
            vocab_size,
            embedding_dim,
            trainable=True,
            name="embedding_target")
        self.embedding_context = tf.keras.layers.Embedding(
            vocab_size,
            embedding_dim,
            trainable=True,
            name="embedding_context")

    def call(self, inputs):
        input_target, input_context = inputs
        emb_tar = self.embedding_target(input_target)
        emb_con = self.embedding_context(input_context)
        emb_mul = tf.math.multiply(emb_tar, emb_con)
        emb_dot = tf.math.reduce_sum(emb_mul, axis=-1)
        return emb_dot

embedding_dim = 100
model = Word2VecModel(vocab_size, embedding_dim)

def custom_loss(y_true, x_pred):
    # labels arrive as integers (0 or 1); cast them to float to match the logits
    return tf.nn.sigmoid_cross_entropy_with_logits(logits=x_pred, labels=tf.cast(y_true, tf.float32))

model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    #loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    loss=custom_loss,
    metrics=["accuracy"])

In [8]:
model.fit(
    train_tf_data.shuffle(10000).batch(512),
    epochs=10)

# class CustomOutputCallback(tf.keras.callbacks.Callback):
#     def on_train_end(self, logs=None):
#         print("Final - loss: {:2.4f} - accuracy: {:2.4f}".format(logs["loss"], logs["accuracy"]))

# model.fit(
#     train_tf_data.shuffle(10000).batch(512),
#     epochs=10,
#     verbose=0,
#     callbacks=[CustomOutputCallback()])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fc61f666048>

In [12]:
# model.save("trained_model/exercise05")

INFO:tensorflow:Assets written to: trained_model/exercise05/assets


## Get similar vectors

In this example, we will get the top 15 words most similar to the target word "president" using the trained model above.

In [None]:
# model = tf.keras.models.load_model(
#     "trained_model/exercise05",
#     custom_objects={"custom_loss": custom_loss})

First we restore the embedding layer for target words.

In [9]:
weights = model.get_layer("embedding_target").get_weights()
embedding_layer = tf.keras.layers.Embedding(
    vocab_size,
    embedding_dim)
embedding_layer.build((None, ))
embedding_layer.set_weights(weights)
embedding_layer.trainable = False
trained_model = tf.keras.models.Sequential([embedding_layer])

Now let's get the top 15 most similar words with the restored model.<br>
Note that the corpus used here comes from news articles (unlike, for example, Wikipedia), so the results will include many contrasting pairs (antonyms), such as "democratic" and "republican", or "obama" and "trump".
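
The ranking itself is simply a sort by cosine distance (1.0 minus cosine similarity). Here is a minimal plain-Python sketch on made-up 2-dimensional vectors; the values in `toy_vectors` are invented for illustration and have nothing to do with the trained embeddings.

```python
import math

def cosine_distance(u, v):
    """1.0 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Made-up embeddings, for illustration only
toy_vectors = {
    "president": [1.0, 0.2],
    "elect": [0.9, 0.3],
    "inauguration": [0.7, 0.6],
    "guitar": [-0.5, 1.0],
}

target = toy_vectors["president"]
# Sort words from closest to farthest from the target vector
ranked = sorted(toy_vectors, key=lambda w: cosine_distance(target, toy_vectors[w]))
print(ranked)
```

The target word always ranks first because its distance to itself is zero, which is also why "president" heads the list in the real output.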

In [12]:
from scipy import spatial
import numpy as np

# get embedding vector for the word "president"
# (word_index holds every word seen, so keep only indices below vocab_size)
words_list = list(tokenizer.word_index.keys())[:vocab_size - 1]
index_list = list(tokenizer.word_index.values())[:vocab_size - 1]
target_index = index_list[words_list.index("president")]
target_vector = tf.squeeze(trained_model.predict(np.array([target_index])))

# get vectors for all words
vocab_vector_list = tf.squeeze(trained_model.predict(np.array(index_list)))

# get (1.0 - cosine) between target vector ("president") and others
distance_list = [spatial.distance.cosine(target_vector, v) for v in vocab_vector_list]

# sort and get top 15 similar vectors
index_list_sorted = np.argsort(distance_list)
for i in index_list_sorted[:15]:
    print(words_list[i])

president
disapprove
adamantly
ghani
elects
elect
nauseam
expendable
vocally
hirono
inauguration
autocracy
romney
destroying
testify


Here I have implemented the Word2Vec algorithm with Negative Sampling (NS) in TensorFlow, but you can also use the efficient Word2Vec implementation in the ```gensim``` package.<br>
Pre-trained word vectors for English (trained on large corpora) are available from Google (Word2Vec) and Stanford (GloVe). Pre-trained word vectors for other languages are available from the Polyglot project.<br>
When you use these off-the-shelf embeddings, it's better to apply the same normalization (standardization) scheme in pre-processing.