# Skip-gram in Action

## Imports

Here are the packages we need to import.

In [1]:
from nlpmodels.models import word2vec
from nlpmodels.utils import skipgram_dataset, utils
from nlpmodels.utils import train
from argparse import Namespace
import torch
utils.set_seed_everywhere()



## Hyper-parameters

These are the data processing, skip-gram, and model training hyper-parameters for this run.

In [2]:
args = Namespace(
    # Skip-gram data hyper-parameters
    context_window_size=5,
    subsample_t=10.e-15,  # threshold t for sub-sampling frequent words (the paper suggests ~1e-5; this run uses a much smaller value)
    # Model hyper-parameters
    embedding_size=300,
    negative_sample_size=20,  # k negative examples per positive pair in the negative-sampling loss
    # Training hyper-parameters
    num_epochs=100,
    learning_rate=0.0001,
    batch_size=4096,
)
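
To give a feel for what `subsample_t` controls, here is a minimal sketch of the sub-sampling rule from the word2vec paper, where a word with relative frequency f(w) is discarded with probability 1 - sqrt(t / f(w)). Whether `SkipGramDataset` applies exactly this formula is an assumption, and the function name and the word frequencies below are hypothetical, chosen only for illustration.

```python
import math


def discard_probability(word_freq: float, t: float) -> float:
    """Probability of dropping a word: 1 - sqrt(t / f(w)), clipped at 0."""
    if word_freq <= 0:
        return 0.0
    return max(0.0, 1.0 - math.sqrt(t / word_freq))


# Hypothetical relative frequencies, for illustration only.
freqs = {"the": 0.05, "market": 1e-4, "skipgram": 1e-7}

for t in (1e-5, 10.e-15):
    probs = {w: round(discard_probability(f, t), 4) for w, f in freqs.items()}
    print(f"t = {t:g}: {probs}")
```

Under this formula, a threshold as small as the one used in this run discards almost every occurrence of any reasonably frequent word, which is consistent with the heavily reduced training set reported further down.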

## Get Data

Call the function that loads the training data (via Hugging Face) and returns a dataloader together with the vocabulary.

In [3]:
train_dataloader, vocab = skipgram_dataset.SkipGramDataset.get_training_dataloader(args.context_window_size,
                                                                                   args.subsample_t,
                                                                                   args.batch_size)

Using custom data configuration default
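
For readers unfamiliar with how skip-gram training examples are formed, the sketch below pairs each center word with every word inside a symmetric context window. It is a simplified stand-in and not the code in `skipgram_dataset`: the function name is hypothetical, and whether the library uses a fixed window or a randomly shrunk one is not shown in this notebook.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs within window_size tokens on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the market rallied after the announcement".split()
pairs = skipgram_pairs(sentence, window_size=2)
print(len(pairs))
print(pairs[:4])
```

A larger window, such as the `context_window_size = 5` used here, produces more pairs per sentence and captures broader topical context.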


In [4]:
vocab_size = len(vocab)

print(f"The gist: context_window_size = {args.context_window_size}, "
      f"batch_size = {args.batch_size}, vocab_size = {vocab_size}, "
      f"embedding_size = {args.embedding_size}, k = {args.negative_sample_size}, "
      f"train_size = {len(train_dataloader.dataset)}"
      )

The gist: context_window_size = 5, batch_size = 4096, vocab_size = 61810, embedding_size = 300, k = 20, train_size = 240032


## Training

Here we build the model and call the trainer.

In [5]:
word_frequencies = torch.from_numpy(vocab.get_word_frequencies())
model = word2vec.SkipGramNSModel(vocab_size, args.embedding_size,
                                 args.negative_sample_size, word_frequencies)
trainer = train.Word2VecTrainer(args, model, train_dataloader)
trainer.run()

[Epoch 0]: 100%|██████████| 59/59 [01:04<00:00,  1.09s/it, loss=14.5]
[Epoch 99]: 100%|██████████| 59/59 [00:50<00:00,  1.16it/s, loss=0.744]


Finished Training...
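
The objective being minimized can be sketched as follows. For one (center, context) pair and k negative samples, the negative-sampling loss from the word2vec paper is -log σ(u_o · v_c) - Σ log σ(-u_k · v_c). This is a plain-Python illustration with hypothetical names and toy vectors, not the `SkipGramNSModel` implementation. The `word_frequencies` tensor passed to the model is presumably used to draw the negatives, but that is an assumption.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center, context, negatives):
    """Loss for one (center, context) pair given k negative context vectors."""
    positive_term = log_sigmoid(dot(center, context))
    negative_term = sum(log_sigmoid(-dot(center, neg)) for neg in negatives)
    return -(positive_term + negative_term)


# Toy 3-dimensional vectors, for illustration only.
center = [0.5, -0.2, 0.1]
context = [0.4, -0.1, 0.3]
negatives = [[-0.3, 0.2, 0.0], [0.1, 0.1, -0.4]]
print(negative_sampling_loss(center, context, negatives))

# With all dot products at zero, the loss is (k + 1) * ln 2.
k = 20
untrained = negative_sampling_loss([0.0, 0.0], [0.0, 0.0], [[0.0, 0.0]] * k)
print(round(untrained, 2))
```

With k = 20, the all-zero case gives 21 ln 2, or about 14.56. That is close to the `loss=14.5` shown for epoch 0, although how the trainer aggregates the loss over a batch is not shown here.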


## Examine Similarity of Embeddings

Now that we've trained our embeddings, let's see if the words that are clustered together make any sense.

In [6]:
embeddings = model.get_embeddings()

### Computer

Let's see the top 5 words associated with "computer".

In [15]:
utils.get_cosine_similar("computer", vocab._token_to_idx, embeddings)[0:5]

[('of', tensor(1.0000)),
 ('apple', tensor(1.0000)),
 ('israel', tensor(1.0000)),
 ('leader', tensor(1.0000)),
 ('game', tensor(1.0000))]
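
To make the ranking concrete, here is a minimal sketch of a cosine-similarity lookup. It is a hypothetical re-implementation in plain Python, not the body of `utils.get_cosine_similar`. The function names, the toy vocabulary, and the choice to exclude the query word from its own neighbours are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def most_similar(word, token_to_idx, embeddings, top_n=5):
    """Rank every other token by cosine similarity to `word`."""
    query = embeddings[token_to_idx[word]]
    scores = [
        (token, cosine_similarity(query, embeddings[idx]))
        for token, idx in token_to_idx.items()
        if token != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy 2-dimensional embeddings, for illustration only.
toy_vocab = {"computer": 0, "laptop": 1, "banana": 2}
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
result = most_similar("computer", toy_vocab, toy_embeddings, top_n=2)
print(result)
```

Cosine similarity depends only on the angle between two vectors. A score that prints as 1.0000 for every neighbour, as in the output above, means the vectors point in nearly the same direction at that precision, so the ordering among them carries little information.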

### Market

Let's see the top 5 words associated with "market".

In [10]:
utils.get_cosine_similar("market", vocab._token_to_idx, embeddings)[0:5]

[('investors', tensor(1.0000)),
 ('korea', tensor(1.0000)),
 ('out', tensor(1.0000)),
 ('israel', tensor(1.0000)),
 ('china', tensor(1.0000))]

In this particular example, we sub-sampled heavily so that the training set would be manageable.
With a training set of N = ~240k examples and a vocabulary of p = ~62k words, we might consider growing the training set so that N >> p to improve our embeddings.
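
The scale of the problem can be checked with a few lines of arithmetic on the numbers printed earlier in this notebook. The weight count below assumes a single embedding table of shape vocab_size x embedding_size. How many such tables `SkipGramNSModel` actually holds is not shown here.

```python
import math

# Values taken from the printed output earlier in the notebook.
train_size = 240_032
vocab_size = 61_810
embedding_size = 300
batch_size = 4096

batches_per_epoch = math.ceil(train_size / batch_size)
examples_per_word = train_size / vocab_size
weights_per_table = vocab_size * embedding_size  # assumes one table of this shape

print(f"batches per epoch: {batches_per_epoch}")
print(f"training examples per vocabulary word: {examples_per_word:.2f}")
print(f"weights in one embedding table: {weights_per_table:,}")
```

The 59 batches per epoch match the `59/59` shown in the training progress bar. At fewer than four training examples per vocabulary word, most words are seen only a handful of times as a center word, which supports the case for more data.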