# Skip-gram in Action

## Imports

First, if you are running this in Colab, install the package. (You can skip this step if it is already installed.)

In [1]:
!git clone https://github.com/will-thompson-k/deeplearning-nlp-models.git
%cd deeplearning-nlp-models

fatal: destination path 'deeplearning-nlp-models' already exists and is not an empty directory.
/content/deeplearning-nlp-models


In [2]:
!python setup.py install

running install
running bdist_egg
running egg_info
writing deeplearning_nlp_models.egg-info/PKG-INFO
writing dependency_links to deeplearning_nlp_models.egg-info/dependency_links.txt
writing top-level names to deeplearning_nlp_models.egg-info/top_level.txt
writing manifest file 'deeplearning_nlp_models.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/nlpmodels
creating build/bdist.linux-x86_64/egg/nlpmodels/models
creating build/bdist.linux-x86_64/egg/nlpmodels/models/transformer_blocks
copying build/lib/nlpmodels/models/transformer_blocks/gpt_decoder.py -> build/bdist.linux-x86_64/egg/nlpmodels/models/transformer_blocks
copying build/lib/nlpmodels/models/transformer_blocks/decoder.py -> build/bdist.linux-x86_64/egg/nlpmodels/models/transformer_blocks
copying build/lib/nlpmodels/models/transformer_blocks/encoder.py -> build/bdist.linux-x86_64/egg

In [4]:
!pip install -r requirements.txt

Collecting torchtext==0.7.0
  Using cached https://files.pythonhosted.org/packages/b9/f9/224b3893ab11d83d47fde357a7dcc75f00ba219f34f3d15e06fe4cb62e05/torchtext-0.7.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting torch==1.6.0
  Using cached https://files.pythonhosted.org/packages/38/53/914885a93a44b96c0dd1c36f36ff10afe341f091230aad68f7228d61db1e/torch-1.6.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting numpy==1.19.1
  Using cached https://files.pythonhosted.org/packages/b1/9a/7d474ba0860a41f771c9523d8c4ea56b084840b5ca4092d96bdee8a3b684/numpy-1.19.1-cp36-cp36m-manylinux2010_x86_64.whl
Collecting datasets==1.1.2
  Using cached https://files.pythonhosted.org/packages/f0/f4/2a3d6aee93ae7fce6c936dda2d7f534ad5f044a21238f85e28f0b205adf0/datasets-1.1.2-py3-none-any.whl
Collecting tqdm==4.49.0
  Using cached https://files.pythonhosted.org/packages/73/d5/f220e0c69b2f346b5649b66abebb391df1a00a59997a7ccf823325bd7a3e/tqdm-4.49.0-py2.py3-none-any.whl
Collecting sentencepiece
  Using cached https://files.

Here are the packages we need to import.

In [3]:
from nlpmodels.models import word2vec
from nlpmodels.utils import utils, train
from nlpmodels.utils.elt import skipgram_dataset
from argparse import Namespace
import torch
utils.set_seed_everywhere()

## Hyper-parameters

These are the data processing, skip-gram, and model training hyper-parameters for this run.

In [20]:
args = Namespace(
    # Skip-gram data hyper-parameters
    context_window_size=5,
    subsample_t=10.e-15,  # threshold t for sub-sampling frequent words (the paper suggests ~1e-5)
    # Model hyper-parameters
    embedding_size=300,
    negative_sample_size=20,  # k examples used in the negative-sampling loss function
    # Training hyper-parameters
    num_epochs=100,
    learning_rate=0.001,
    batch_size=8192,
)
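
For intuition on `subsample_t`, here is a minimal sketch of the frequent-word sub-sampling rule described in the word2vec paper, where each occurrence of a word with relative frequency f(w) is discarded with probability 1 - sqrt(t / f(w)). The helper name `discard_probability` and the toy frequencies are hypothetical, and the package's own implementation may differ in detail.

```python
import math


def discard_probability(word_freq: float, t: float) -> float:
    """Probability of dropping one occurrence of a word with relative frequency word_freq."""
    # Words rarer than the threshold t would get a negative value, so clamp at 0.
    return max(0.0, 1.0 - math.sqrt(t / word_freq))


# Toy relative frequencies, made up for illustration only.
toy_freqs = {"the": 0.05, "market": 1e-4, "skipgram": 1e-6}

for t in (1e-5, 10.e-15):
    probs = {w: round(discard_probability(f, t), 4) for w, f in toy_freqs.items()}
    print(f"t = {t:g}: {probs}")
```

Under this rule, a threshold as small as the one used in this run discards almost every occurrence of every word, which is consistent with the heavy sub-sampling noted at the end of the notebook.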

## Get Data

Call the function that fetches the training data (via Hugging Face `datasets`) and returns a training dataloader along with the vocabulary.

In [21]:
train_dataloader, vocab = skipgram_dataset.SkipGramDataset.get_training_dataloader(args.context_window_size,
                                                                                   args.subsample_t,
                                                                                   args.batch_size)

Using custom data configuration default
Reusing dataset ag_news (/root/.cache/huggingface/datasets/ag_news/default/0.0.0/fb5c5e74a110037311ef5e904583ce9f8b9fbc1354290f97b4929f01b3f48b1a)
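
To make the role of `context_window_size` concrete, the sketch below builds (center, context) pairs from a tokenized sentence. It assumes a symmetric window of `window` tokens on each side of the center word; the function `skipgram_pairs` is a hypothetical helper for illustration, not the code inside `SkipGramDataset`, which may define the window differently.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions of the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # clip the window at the sentence start
        hi = min(len(tokens), i + window + 1)  # clip the window at the sentence end
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "stocks fell as oil prices rose".split()
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

A larger window yields more pairs per sentence and pulls in more distant, topical context.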


In [22]:
vocab_size = len(vocab)

print(f"The gist: context_window_size = {args.context_window_size}, "
      f"batch_size = {args.batch_size}, vocab_size = {vocab_size}, "
      f"embedding_size = {args.embedding_size}, k = {args.negative_sample_size}, "
      f"train_size = {len(train_dataloader.dataset)}"
      )

The gist: context_window_size = 5, batch_size = 8192, vocab_size = 61811, embedding_size = 300, k = 20, train_size = 720036
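
As a quick sanity check on these numbers, the number of batches per epoch follows from `train_size` and `batch_size`. This assumes the dataloader keeps the final, partially filled batch.

```python
import math

train_size = 720036  # train_size from the printed summary
batch_size = 8192    # args.batch_size

# Round up so the last, partially filled batch is counted.
batches_per_epoch = math.ceil(train_size / batch_size)
print(batches_per_epoch)  # → 88
```

This matches the 88 iterations per epoch shown in the training progress bar.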


## Training

Here we build the model and call the trainer.

In [23]:
word_frequencies = torch.from_numpy(vocab.get_word_frequencies())
model = word2vec.SkipGramNSModel(vocab_size, args.embedding_size, args.negative_sample_size, word_frequencies)
trainer = train.Word2VecTrainer(args, model, train_dataloader)
trainer.run()

[Epoch 0]:  36%|███▋      | 32/88 [00:34<00:59,  1.07s/it, loss=12.7]


KeyboardInterrupt: ignored
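
For reference, the objective behind a skip-gram model with negative sampling can be sketched for a single (center, context) pair as follows. This is a plain-Python illustration of the standard loss, -log σ(u_o·v_c) - Σ_k log σ(-u_k·v_c), and is not taken from `SkipGramNSModel`; the function names and toy vectors are hypothetical.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center: Sequence[float],
                           context: Sequence[float],
                           negatives: Sequence[Sequence[float]]) -> float:
    """Loss for one true (center, context) pair and k sampled negative words."""
    # Reward a high score for the true context word.
    positive_term = log_sigmoid(dot(center, context))
    # Reward low scores for each of the k sampled noise words.
    negative_term = sum(log_sigmoid(-dot(center, neg)) for neg in negatives)
    return -(positive_term + negative_term)


center = [0.5, -0.2, 0.1]
context = [0.4, -0.1, 0.3]
negatives = [[-0.3, 0.2, 0.0], [0.1, 0.6, -0.4]]
print(negative_sampling_loss(center, context, negatives))
```

When every dot product is zero, each of the k + 1 terms contributes ln 2, so the loss equals (k + 1) · ln 2.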

## Examine Similarity of Embeddings

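Once the embeddings are trained, let's see whether the words that cluster together make any sense. (The run shown above was interrupted during the first epoch, so the neighbors listed below likely do not reflect a fully trained model.)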

We will use cosine similarity to find the embeddings that are closest to each other in the embedding space. This is one similarity metric.
Another popular metric is based on Euclidean distance; to use it, check out PyTorch's `torch.cdist()` function.
I also can't speak highly enough of Spotify's `annoy` package for approximate nearest-neighbor search.
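
To illustrate the difference between the two metrics, here is a small plain-Python sketch with made-up two-dimensional vectors. The helpers `cosine_similarity`, `euclidean_distance`, and `most_similar` are hypothetical and are not the implementation behind `utils.get_cosine_similar`.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between a and b; sensitive to vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query: str,
                 embeddings: Dict[str, Sequence[float]],
                 top_k: int = 5) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to the query word."""
    scores = [(word, cosine_similarity(embeddings[query], vec))
              for word, vec in embeddings.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy embeddings, made up for illustration only.
toy = {"computer": [1.0, 0.0], "laptop": [2.0, 0.1], "market": [0.0, 1.0]}
print(most_similar("computer", toy, top_k=2))
print(euclidean_distance(toy["computer"], toy["laptop"]))
```

In this toy example "laptop" points in almost the same direction as "computer", so its cosine similarity is close to 1 even though the two vectors are about one unit apart in Euclidean terms.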

In [None]:
embeddings = model.get_embeddings()

### Computer

Let's see the top 5 words associated with "computer".

In [None]:
utils.get_cosine_similar("computer", vocab._token_to_idx, embeddings)[:5]

[('of', tensor(1.0000)),
 ('apple', tensor(1.0000)),
 ('israel', tensor(1.0000)),
 ('leader', tensor(1.0000)),
 ('game', tensor(1.0000))]

### Market

Let's see the top 5 words associated with "market".

In [None]:
utils.get_cosine_similar("market", vocab._token_to_idx, embeddings)[:5]

[('investors', tensor(1.0000)),
 ('korea', tensor(1.0000)),
 ('out', tensor(1.0000)),
 ('israel', tensor(1.0000)),
 ('china', tensor(1.0000))]

In this particular example, we sub-sampled heavily so that the training set would be manageable.
With train_size ≈ 720k (as printed above) and vocab_size ≈ 60k, we might consider increasing the amount of training data so that N >> p to improve our embeddings.
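
To put rough numbers on the N versus p comparison, the sketch below uses the values from the summary printed earlier. It assumes that p counts the weights of a single `vocab_size` x `embedding_size` embedding table; that reading of p is an assumption, and a model with separate input and output tables would have twice as many weights.

```python
vocab_size = 61811    # from the printed summary
embedding_size = 300  # args.embedding_size
train_size = 720036   # from the printed summary

# Assumption: p is the weight count of one vocab_size x embedding_size table.
num_parameters = vocab_size * embedding_size
ratio = train_size / num_parameters

print(f"p = {num_parameters:,}")
print(f"N / p = {ratio:.4f}")
```

Under that assumption N is only about 4% of p, so N would need to grow by several orders of magnitude before N >> p holds.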