# Skip-gram in Action

## Colab Setup

You can skip this section if you are not running on Google Colab.

If running with GPUs, sanity check that the GPUs are enabled.

In [1]:
!nvidia-smi

Tue Dec  1 13:13:01 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
import torch
torch.cuda.is_available()

True

The above should be True. If not, debug. (Note: the version of PyTorch I used is not compatible with the CUDA drivers on Colab. Follow these instructions here explicitly.)
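
If the check comes back False and you still want to run the notebook, a common pattern is to fall back to the CPU. This is a minimal sketch: the `device` variable is an illustrative name of my own, and nothing else in this notebook depends on it.

```python
import torch

# Use the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Any tensor (or model) can then be moved to the chosen device.
x = torch.zeros(2, 3).to(device)
print(device.type, tuple(x.shape))
```

Training on the CPU will be far slower than the GPU timings shown later, so treat this as a way to smoke-test the code rather than to reproduce the run.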

First, if running from Colab, you must install the package. (You may skip this step if you have already installed it.)

In [3]:
!git clone --single-branch --branch colab https://github.com/will-thompson-k/deeplearning-nlp-models.git
%cd deeplearning-nlp-models

Cloning into 'deeplearning-nlp-models'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (88/88), done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 914 (delta 48), reused 34 (delta 17), pack-reused 826
Receiving objects: 100% (914/914), 3.63 MiB | 3.07 MiB/s, done.
Resolving deltas: 100% (540/540), done.
/content/deeplearning-nlp-models


In [5]:
!pip install datasets==1.0.2

Collecting datasets==1.0.2
  Downloading https://files.pythonhosted.org/packages/83/7e/8d9e2fd30e3819e6042927d379f3668a0b49fe38b92d5639194808a1d877/datasets-1.0.2-py3-none-any.whl (1.8MB)
     |████████████████████████████████| 1.8MB 5.9MB/s 
Installing collected packages: datasets
  Found existing installation: datasets 1.1.3
    Uninstalling datasets-1.1.3:
      Successfully uninstalled datasets-1.1.3
Successfully installed datasets-1.0.2


In [6]:
!python setup.py install

     |████████████████████████████████| 71kB 3.5MB/s 
     |████████████████████████████████| 153kB 4.3MB/s 
     |████████████████████████████████| 14.5MB 244kB/s 
     |████████████████████████████████| 748.8MB 23kB/s 
     |████████████████████████████████| 4.5MB 4.2MB/s 
     |████████████████████████████████| 1.1MB 5.5MB/s 
running install
running bdist_egg
running egg_info
creating deeplearning_nlp_models.egg-info
writing deeplearning_nlp_models.egg-info/PKG-INFO
writing dependency_links to deeplearning_nlp_models.egg-info/dependency_links.txt
writing top-level names to deeplearning_nlp_models.egg-info/top_level.txt
writing manifest file 'deeplearning_nlp_models.egg-info/SOURCES.txt'
writing manifest file 'deeplearning_nlp_models.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/nlpmodels
copying nlpmodels/__init__.py -> build/lib/nlpmodels

## Imports

Here are the packages we need to import.

In [7]:
from nlpmodels.models import word2vec
from nlpmodels.utils import utils, train
from nlpmodels.utils.elt import skipgram_dataset
from argparse import Namespace
import torch
utils.set_seed_everywhere()

## Hyper-parameters

These are the data processing, skip-gram, and model training hyper-parameters for this run.

In [14]:
args = Namespace(
    # skip gram data hyper-parameters
    context_window_size = 5,
    subsample_t = 10.e-5, # threshold t for sub-sampling frequent words (note: 10.e-5 == 1e-4; the paper suggests t around 1e-5)
    # Model hyper-parameters
    embedding_size = 512,
    negative_sample_size = 20, # k examples to be used in negative sampling loss function
    # Training hyper-parameters
    num_epochs=50,
    learning_rate=0.0001,
    batch_size = 4096,
)
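
To get a feel for what `subsample_t` does, here is a small sketch of the sub-sampling rule from the word2vec paper (Mikolov et al., 2013), where each occurrence of a word with relative corpus frequency f is discarded with probability 1 - sqrt(t / f). The function name and the example frequencies are hypothetical, chosen for illustration only; the repository's own implementation may differ in detail.

```python
import math

def discard_probability(word_freq: float, t: float) -> float:
    """Probability of dropping one occurrence of a word whose relative
    corpus frequency is `word_freq`, using P = 1 - sqrt(t / f).
    Clamped at 0 so that rare words (f < t) are never dropped."""
    return max(0.0, 1.0 - math.sqrt(t / word_freq))

t = 10.e-5  # same value as args.subsample_t above, i.e. 1e-4

# Hypothetical relative frequencies, for illustration only.
examples = [("very frequent", 0.05), ("moderate", 0.001), ("rare", 0.00005)]
for name, freq in examples:
    p = discard_probability(freq, t)
    print(f"{name:>13}: f = {freq:.5f}, P(discard) = {p:.3f}")
```

A larger `t` keeps more occurrences of frequent words, so it yields a bigger training set than a smaller `t` would.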

## Get Data

Call the function that fetches the training data (the `ag_news` dataset, via Hugging Face) and builds the vocabulary.

In [15]:
train_dataloader, vocab = skipgram_dataset.SkipGramDataset.get_training_dataloader(args.context_window_size,
                                                                                   args.subsample_t,
                                                                                   args.batch_size)

Using custom data configuration default
Reusing dataset ag_news (/root/.cache/huggingface/datasets/ag_news/default/0.0.0/fb5c5e74a110037311ef5e904583ce9f8b9fbc1354290f97b4929f01b3f48b1a)
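
For intuition about what the dataloader serves up, the sketch below shows the generic way skip-gram (target, context) pairs are built from a token sequence and a context window. It is not the repository's `SkipGramDataset` code (which also handles tokenization and sub-sampling); `skipgram_pairs` is a hypothetical helper and the sentence is made up.

```python
from typing import List, Tuple

def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return a (target, context) pair for every token and each neighbour
    within `window` positions on either side of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                  # clip the window at the start
        hi = min(len(tokens), i + window + 1)    # and at the end of the sequence
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "stocks rally as tech shares climb".split()
pairs = skipgram_pairs(sentence, window=2)
print(len(pairs))
for pair in pairs[:4]:
    print(pair)
```

Because every token contributes up to `2 * window` pairs, the number of training examples grows much faster than the number of tokens.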


In [16]:
vocab_size = len(vocab)

print(f"The gist: context_window_size = {args.context_window_size}, "
      f"batch_size = {args.batch_size}, vocab_size = {vocab_size}, "
      f"embedding_size = {args.embedding_size}, k = {args.negative_sample_size}, "
      f"train_size = {len(train_dataloader.dataset)}"
      )

The gist: context_window_size = 5, batch_size = 4096, vocab_size = 61811, embedding_size = 512, k = 20, train_size = 16103933
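
These numbers also tell us how many batches to expect per epoch. A quick check, assuming the dataloader keeps the final partial batch (the default behaviour of PyTorch's `DataLoader`):

```python
import math

train_size = 16_103_933  # from the printed summary above
batch_size = 4096        # args.batch_size

# Number of batches per epoch when the last, smaller batch is kept.
batches_per_epoch = math.ceil(train_size / batch_size)
last_batch_size = train_size - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch, last_batch_size)  # 3932 2557
```

The result, 3932, matches the batch count shown in the training progress bars.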


## Training

Here we build the model and call the trainer.

In [17]:
word_frequencies = torch.from_numpy(vocab.get_word_frequencies())
model = word2vec.SkipGramNSModel(vocab_size, args.embedding_size, args.negative_sample_size, word_frequencies)
trainer = train.Word2VecTrainer(args, model, train_dataloader)
trainer.run()

[Epoch 0]: 100%|██████████| 3932/3932 [03:04<00:00, 21.26it/s, loss=0.888]
[Epoch 1]: 100%|██████████| 3932/3932 [03:00<00:00, 21.73it/s, loss=0.732]
[Epoch 2]: 100%|██████████| 3932/3932 [03:00<00:00, 21.84it/s, loss=0.71]
[Epoch 3]: 100%|██████████| 3932/3932 [02:59<00:00, 21.91it/s, loss=0.705]
[Epoch 4]: 100%|██████████| 3932/3932 [02:59<00:00, 21.96it/s, loss=0.703]
[Epoch 5]: 100%|██████████| 3932/3932 [03:01<00:00, 21.71it/s, loss=0.703]
[Epoch 6]: 100%|██████████| 3932/3932 [03:01<00:00, 21.61it/s, loss=0.703]
[Epoch 7]: 100%|██████████| 3932/3932 [03:01<00:00, 21.72it/s, loss=0.703]
[Epoch 8]: 100%|██████████| 3932/3932 [03:00<00:00, 21.75it/s, loss=0.702]
[Epoch 9]: 100%|██████████| 3932/3932 [03:00<00:00, 21.84it/s, loss=0.702]
[Epoch 10]: 100%|██████████| 3932/3932 [02:59<00:00, 21.86it/s, loss=0.702]
[Epoch 11]: 100%|██████████| 3932/3932 [02:59<00:00, 21.88it/s, loss=0.702]
[Epoch 12]:  12%|█▏        | 459/3932 [00:23<02:29, 23.23it/s, loss=0.702]

KeyboardInterrupt: ignored
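
The loss reported above comes from the repository's `SkipGramNSModel`, whose internals are not shown in this notebook. As a rough guide to what a skip-gram negative-sampling objective looks like, here is a minimal sketch of the textbook formulation: maximize log σ(u_context · v_center) for the true pair and log σ(-u_noise · v_center) for each of the k sampled noise words. The function name, tensor shapes, and the mean reduction are my assumptions, so the scale of this loss need not match the numbers in the progress bars.

```python
import torch
import torch.nn.functional as F

def sgns_loss(center, context, negatives):
    """Textbook skip-gram negative-sampling loss (hypothetical helper).

    center:    (batch, dim)     input vectors of the target words
    context:   (batch, dim)     output vectors of the true context words
    negatives: (batch, k, dim)  output vectors of k sampled noise words
    """
    # Dot product between each target and its true context word: (batch,)
    pos_score = (center * context).sum(dim=1)
    # Dot products between each target and its k noise words: (batch, k)
    neg_score = torch.bmm(negatives, center.unsqueeze(2)).squeeze(2)
    pos_term = F.logsigmoid(pos_score)
    neg_term = F.logsigmoid(-neg_score).sum(dim=1)
    # Negate to turn the log-likelihood into a loss, then average over the batch.
    return -(pos_term + neg_term).mean()

torch.manual_seed(0)
batch, k, dim = 4, 20, 8  # k mirrors args.negative_sample_size; the rest is arbitrary
loss = sgns_loss(torch.randn(batch, dim),
                 torch.randn(batch, dim),
                 torch.randn(batch, k, dim))
print(loss.item())
```

In the paper, the noise words are drawn from the unigram distribution raised to the 3/4 power, which is one plausible reason the model constructor takes `word_frequencies`.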

## Examine Similarity of Embeddings

Now that we've trained our embeddings, let's see if the words that are clustered together make any sense.

We will use cosine similarity to find the words whose embeddings are closest in the embedding space. This is one metric
for similarity. Another popular metric is based on Euclidean distance; to use it, check out PyTorch's
`torch.cdist()` function. For approximate nearest-neighbor search at scale, I can't speak highly enough of Spotify's `annoy` package.
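
To make the difference between the two metrics concrete, here is a small sketch on a toy embedding matrix. `nearest_neighbours` is a hypothetical helper written for this illustration; it is not the `utils.get_cosine_similar` function used below, whose implementation may differ.

```python
import torch
import torch.nn.functional as F

def nearest_neighbours(query_idx, embeddings, k=3, metric="cosine"):
    """Return the indices of the k rows closest to row `query_idx`."""
    query = embeddings[query_idx].unsqueeze(0)                  # (1, dim)
    if metric == "cosine":
        scores = F.cosine_similarity(query, embeddings, dim=1)  # higher = closer
        scores[query_idx] = float("-inf")                       # exclude the query itself
        return torch.topk(scores, k).indices.tolist()
    dists = torch.cdist(query, embeddings).squeeze(0)           # lower = closer
    dists[query_idx] = float("inf")                             # exclude the query itself
    return torch.topk(dists, k, largest=False).indices.tolist()

# Toy 2-d "embeddings": row 1 points the same way as row 0 but is twice as long.
toy = torch.tensor([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
cosine_nn = nearest_neighbours(0, toy, k=2, metric="cosine")
euclid_nn = nearest_neighbours(0, toy, k=2, metric="euclidean")
print(cosine_nn)  # [1, 3]
print(euclid_nn)  # [3, 1]
```

Cosine similarity ignores vector length, so row 1 ranks first; Euclidean distance penalizes the difference in magnitude, so row 3 ranks first instead.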

In [18]:
embeddings = model.get_embeddings().to(torch.device('cpu'))

In [19]:
embeddings

tensor([[ 3.3240e-04, -1.8160e-04,  8.7463e-05,  ...,  4.8631e-05,
         -2.8457e-04, -6.9950e-04],
        [-1.0018e-06,  7.1809e-04, -7.2560e-04,  ..., -8.2127e-04,
         -3.2868e-04, -9.1119e-05],
        [-5.3387e-04, -2.3233e-04,  7.4901e-04,  ...,  4.8751e-04,
         -6.4552e-04, -1.1948e-04],
        ...,
        [ 6.9722e-02, -6.8210e-02,  6.9298e-02,  ...,  6.8851e-02,
          6.9208e-02, -6.6931e-02],
        [ 7.6255e-02, -7.4938e-02,  7.5374e-02,  ...,  7.5360e-02,
          7.5257e-02, -7.4191e-02],
        [ 8.0253e-02, -7.9104e-02,  7.8547e-02,  ...,  7.9173e-02,
          8.0599e-02, -7.5644e-02]])

### Computer

Let's see the top 10 words associated with "computer".

In [20]:
utils.get_cosine_similar("computer",vocab._token_to_idx,embeddings)[0:10]

[('management', tensor(0.9933)),
 ('based', tensor(0.9907)),
 ('software', tensor(0.9903)),
 ('phone', tensor(0.9903)),
 ('services', tensor(0.9897)),
 ('systems', tensor(0.9864)),
 ('technology', tensor(0.9864)),
 ('business', tensor(0.9854)),
 ('personal', tensor(0.9849)),
 ('devices', tensor(0.9846))]

### Market

Let's see the top 10 words associated with "market".

In [21]:
utils.get_cosine_similar("market",vocab._token_to_idx,embeddings)[0:10]

[('exchange', tensor(0.9840)),
 ('cost', tensor(0.9811)),
 ('store', tensor(0.9786)),
 ('awaited', tensor(0.9784)),
 ('initial', tensor(0.9769)),
 ('industry', tensor(0.9766)),
 ('stock', tensor(0.9761)),
 ('auction', tensor(0.9752)),
 ('sector', tensor(0.9751)),
 ('slashed', tensor(0.9734))]

In this particular example, we sub-sampled heavily so that our training set would be manageable.
With training_N = ~200k and vocab_size = ~60k, we might consider training on more data, so that N >> p, to improve our embeddings.