# Skip-gram in Action

## Colab Setup

You can skip this section if you are not running on Google Colab.

If you are running with GPUs, sanity check that they are enabled.

In [1]:
!nvidia-smi

Mon Nov 30 02:08:31 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [12]:
import torch
torch.cuda.is_available()

True

The above should be `True`. If it is not, debug before continuing. (Note: the version of PyTorch I used is not compatible with the CUDA drivers on Colab. Follow these instructions here explicitly.)

First, if you are running from Colab, you must install the package. (You may skip this step if it is already installed.)

In [23]:
!git clone --single-branch --branch colab https://github.com/will-thompson-k/deeplearning-nlp-models.git
%cd deeplearning-nlp-models

Cloning into 'deeplearning-nlp-models'...
remote: Enumerating objects: 73, done.

In [14]:
!pip install datasets



In [15]:
!python setup.py install

running install
running bdist_egg
running egg_info
creating deeplearning_nlp_models.egg-info
writing deeplearning_nlp_models.egg-info/PKG-INFO
writing dependency_links to deeplearning_nlp_models.egg-info/dependency_links.txt
writing top-level names to deeplearning_nlp_models.egg-info/top_level.txt
writing manifest file 'deeplearning_nlp_models.egg-info/SOURCES.txt'
writing manifest file 'deeplearning_nlp_models.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/nlpmodels
copying nlpmodels/__init__.py -> build/lib/nlpmodels
creating build/lib/nlpmodels/utils
copying nlpmodels/utils/gpt_sampler.py -> build/lib/nlpmodels/utils
copying nlpmodels/utils/label_smoother.py -> build/lib/nlpmodels/utils
copying nlpmodels/utils/tokenizer.py -> build/lib/nlpmodels/utils
copying nlpmodels/utils/__init__.py -> build/lib/nlpmodels/utils
copying nlpmodels/utils/train.py -> build/lib/nlp

## Imports

Here are the packages we need to import.

In [16]:
from nlpmodels.models import word2vec
from nlpmodels.utils import utils, train
from nlpmodels.utils.elt import skipgram_dataset
from argparse import Namespace
import torch
utils.set_seed_everywhere()

## Hyper-parameters

These are the data processing, skip-gram, and model training hyper-parameters for this run.

In [17]:
args = Namespace(
    # skip gram data hyper-parameters
    context_window_size = 5,
    subsample_t = 10.e-15, # threshold t for sub-sampling frequent words (the paper suggests t around 1e-5)
    # Model hyper-parameters
    embedding_size = 300,
    negative_sample_size= 20, # k examples to be used in negative sampling loss function
    # Training hyper-parameters
    num_epochs=100,
    learning_rate=0.001,
    batch_size = 8192,
)
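
To make the effect of `subsample_t` concrete, here is a minimal sketch of the discard rule from the word2vec paper, where a word with relative corpus frequency f(w) is dropped with probability 1 - sqrt(t / f(w)). The function name `discard_probability` is hypothetical, and whether `SkipGramDataset` applies exactly this rule is an assumption.

```python
import math


def discard_probability(word_freq: float, t: float) -> float:
    """Probability of dropping one occurrence of a word.

    word_freq is the word's relative frequency in the corpus (count / total tokens).
    Follows P(w) = 1 - sqrt(t / f(w)) from the word2vec paper, clipped at 0
    so that rare words (f(w) <= t) are always kept.
    """
    return max(0.0, 1.0 - math.sqrt(t / word_freq))


# Compare the paper's typical threshold with the much smaller one used in this run
# (10.e-15 is the same float as 1e-14).
for t in (1e-5, 10.e-15):
    p = discard_probability(word_freq=1e-3, t=t)
    print(f"t = {t:g}: discard probability = {p:.6f}")
```

The smaller the threshold, the closer the discard probability gets to 1 for every word, which is why a very small `subsample_t` shrinks the training set so aggressively.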

## Get Data

Call the function that grabs the training data (via Hugging Face `datasets`) and builds a vocabulary.

In [19]:
train_dataloader, vocab = skipgram_dataset.SkipGramDataset.get_training_dataloader(args.context_window_size,
                                                                                   args.subsample_t,
                                                                                   args.batch_size)

Using custom data configuration default
Reusing dataset ag_news (/root/.cache/huggingface/datasets/ag_news/default/0.0.0/fb5c5e74a110037311ef5e904583ce9f8b9fbc1354290f97b4929f01b3f48b1a)
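
For intuition about what the dataloader yields, here is a rough sketch of how (target, context) pairs can be generated from a tokenized sentence with a symmetric context window. The helper `skipgram_pairs` is hypothetical; the library's actual implementation may differ (for example, in how it handles sub-sampling or window sizing).

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return every (target, context) pair within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # clip the window at the sentence start
        hi = min(len(tokens), i + window + 1)   # and at the sentence end
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


print(skipgram_pairs(["stocks", "fell", "sharply", "today"], window=1))
```

With `window=1`, each word is paired only with its immediate neighbors; the run above uses `context_window_size = 5`.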


In [20]:
vocab_size = len(vocab)

print(f"The gist: context_window_size = {args.context_window_size}, "
      f"batch_size = {args.batch_size}, vocab_size = {vocab_size}, "
      f"embedding_size = {args.embedding_size}, k = {args.negative_sample_size}, "
      f"train_size = {len(train_dataloader.dataset)}"
      )

The gist: context_window_size = 5, batch_size = 8192, vocab_size = 61811, embedding_size = 300, k = 20, train_size = 720038


## Training

Here we build the model and call the trainer.

In [21]:
word_frequencies = torch.from_numpy(vocab.get_word_frequencies())
model = word2vec.SkipGramNSModel(vocab_size, args.embedding_size, args.negative_sample_size, word_frequencies)
trainer = train.Word2VecTrainer(args, model, train_dataloader)
trainer.run()


  0%|          | 0/88 [00:00<?, ?it/s]
[Epoch 0]:   0%|          | 0/88 [00:00<?, ?it/s]

RuntimeError: ignored
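
For reference, the negative-sampling objective from the word2vec paper that a model like `SkipGramNSModel` is presumably minimizing can be written as follows, for a center word $c$ with input embedding $v_c$, an observed context word $o$ with output embedding $u_o$, and $K$ sampled negative words $w_k$ (here $K$ = `negative_sample_size` = 20). The symbols are my notation, and the 3/4 exponent on the unigram distribution $U(w)$ is the paper's choice; that this implementation uses the same exponent is an assumption.

$$
J(c, o) = -\log \sigma\!\left(u_o^{\top} v_c\right) - \sum_{k=1}^{K} \log \sigma\!\left(-u_{w_k}^{\top} v_c\right),
\qquad w_k \sim P_n(w) \propto U(w)^{3/4}
$$

The first term pulls the embeddings of observed (target, context) pairs together, while the sum pushes the target away from the $K$ randomly drawn words. This is why the model constructor takes `word_frequencies`: they define the distribution the negatives are drawn from.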

## Examine Similarity of Embeddings

Now that we've trained our embeddings, let's see if the words that are clustered together make any sense.

We will use cosine similarity to find the embeddings that are most similar in the embedding space. This is one metric
for similarity. Another popular metric is based on Euclidean distance; to use it, check out PyTorch's
`torch.cdist()` function. For fast approximate nearest-neighbor search, I also can't speak highly enough of Spotify's `annoy` package.
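
To illustrate how the two metrics differ, here is a small sketch in plain Python on toy vectors. The function names are hypothetical and this is not the code behind `utils.get_cosine_similar`; on real embedding tensors you would more likely reach for `torch.nn.functional.cosine_similarity` and `torch.cdist`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between a and b; sensitive to magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0, 3.0]
w = [2.0, 4.0, 6.0]  # same direction as v, twice the length

print(f"cosine similarity:  {cosine_similarity(v, w):.4f}")
print(f"euclidean distance: {euclidean_distance(v, w):.4f}")
```

The two vectors point in the same direction, so their cosine similarity is 1 even though they are far apart in Euclidean terms. Which behavior you want depends on whether embedding norms carry meaning for your task.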

In [None]:
embeddings = model.get_embeddings()

### Computer

Let's see the top 5 words associated with "computer".

In [None]:
utils.get_cosine_similar("computer", vocab._token_to_idx, embeddings)[0:5]

[('of', tensor(1.0000)),
 ('apple', tensor(1.0000)),
 ('israel', tensor(1.0000)),
 ('leader', tensor(1.0000)),
 ('game', tensor(1.0000))]

### Market

Let's see the top 5 words associated with "market".

In [None]:
utils.get_cosine_similar("market", vocab._token_to_idx, embeddings)[0:5]

[('investors', tensor(1.0000)),
 ('korea', tensor(1.0000)),
 ('out', tensor(1.0000)),
 ('israel', tensor(1.0000)),
 ('china', tensor(1.0000))]

In this particular example, we sub-sampled heavily so that our training set would be manageable.
With training_N = ~200k examples and vocab_size = ~60k, we might consider increasing the amount of training data so that N >> p to improve our embeddings.