# Skip-gram in Action

## Colab Setup

You can skip this section if you are not running on Google Colab.

If you are running on a GPU, first check that it is enabled.

In [1]:
!nvidia-smi

Tue Dec  1 12:32:24 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
import torch
torch.cuda.is_available()

True

The above should be True. If it is not, debug before continuing. (Note: the version of PyTorch I used is not compatible with the CUDA drivers on Colab. Follow these instructions here explicitly.)

First, if running from Colab, you must install the package. (You may skip this step if you have already installed it.)

In [3]:
!git clone --single-branch --branch colab https://github.com/will-thompson-k/deeplearning-nlp-models.git
%cd deeplearning-nlp-models

Cloning into 'deeplearning-nlp-models'...
remote: Enumerating objects: 78, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (63/63), done.
remote: Total 904 (delta 41), reused 30 (delta 15), pack-reused 826
Receiving objects: 100% (904/904), 3.63 MiB | 3.13 MiB/s, done.
Resolving deltas: 100% (533/533), done.
/content/deeplearning-nlp-models


In [4]:
!pip install datasets

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/1a/38/0c24dce24767386123d528d27109024220db0e7a04467b658d587695241a/datasets-1.1.3-py3-none-any.whl (153kB)

In [5]:
!python setup.py install



## Imports

Here are the packages we need to import.

In [6]:
from nlpmodels.models import word2vec
from nlpmodels.utils import utils, train
from nlpmodels.utils.elt import skipgram_dataset
from argparse import Namespace
import torch
utils.set_seed_everywhere()

## Hyper-parameters

These are the data processing, skip-gram, and model training hyper-parameters for this run.

In [19]:
args = Namespace(
    # Skip-gram data hyper-parameters
    context_window_size=5,
    # Threshold for sub-sampling frequent words. Note that 10.e-5 == 1e-4;
    # the word2vec paper suggests a value around 1e-5.
    subsample_t=10.e-5,
    # Model hyper-parameters
    embedding_size=300,
    negative_sample_size=20,  # k examples to be used in the negative sampling loss function
    # Training hyper-parameters
    num_epochs=50,
    learning_rate=0.0001,
    batch_size=4096,
)
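
To make `subsample_t` concrete, here is a minimal sketch of the frequent-word sub-sampling rule from the word2vec paper, where a token with relative frequency `f` is dropped with probability `1 - sqrt(t / f)`. The function name `discard_probability` and the example frequencies are illustrative assumptions; the repository's `SkipGramDataset` may implement sub-sampling differently.

```python
import math


def discard_probability(freq: float, t: float) -> float:
    """Probability of dropping a token with relative frequency `freq`,
    using the rule P = 1 - sqrt(t / freq), clipped below at 0."""
    if freq <= 0.0:
        raise ValueError("freq must be positive")
    return max(0.0, 1.0 - math.sqrt(t / freq))


t = 10.e-5  # same literal as in args above; this equals 1e-4

# Hypothetical relative frequencies, chosen only for illustration.
for word, freq in [("the", 0.05), ("market", 1e-3), ("capriati", 1e-5)]:
    print(f"{word:>10}: P(discard) = {discard_probability(freq, t):.3f}")
```

Very frequent tokens are dropped most of the time, while tokens rarer than `t` are always kept.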

## Get Data

Call the function that fetches the training data (the AG News dataset, via the Hugging Face `datasets` library) and builds the vocabulary.

In [20]:
train_dataloader, vocab = skipgram_dataset.SkipGramDataset.get_training_dataloader(args.context_window_size,
                                                                                   args.subsample_t,
                                                                                   args.batch_size)

Using custom data configuration default
Reusing dataset ag_news (/root/.cache/huggingface/datasets/ag_news/default/0.0.0/fb5c5e74a110037311ef5e904583ce9f8b9fbc1354290f97b4929f01b3f48b1a)
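
For intuition about what the dataloader yields, the sketch below generates (center, context) pairs from a tokenized sentence with a fixed symmetric window. It illustrates the general skip-gram recipe, not the repository's code: `skipgram_pairs` is a hypothetical helper, and the real `SkipGramDataset` may differ (for example, by applying sub-sampling first). This notebook does not say whether `context_window_size` counts tokens on each side or in total; the sketch assumes each side.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Pair each token with up to `window` neighbours on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "stocks rallied as the market opened".split()
pairs = skipgram_pairs(sentence, window=2)
print(len(pairs))
print(pairs[:4])
```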


In [21]:
vocab_size = len(vocab)

print(f"The gist: context_window_size = {args.context_window_size}, "
      f"batch_size = {args.batch_size}, vocab_size = {vocab_size}, "
      f"embedding_size = {args.embedding_size}, k = {args.negative_sample_size}, "
      f"train_size = {len(train_dataloader.dataset)}"
      )

The gist: context_window_size = 5, batch_size = 4096, vocab_size = 61811, embedding_size = 300, k = 20, train_size = 16100272
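
As a quick sanity check on these numbers, the number of batches per epoch is the training-set size divided by the batch size, rounded up. The sketch assumes the dataloader keeps the final partial batch (the default `drop_last=False` behavior of PyTorch's `DataLoader`), which this notebook does not confirm.

```python
train_size = 16_100_272  # from the printout above
batch_size = 4096        # args.batch_size

# Ceiling division: a final partial batch still counts as one step.
full_batches, remainder = divmod(train_size, batch_size)
batches_per_epoch = full_batches + (1 if remainder else 0)

print(batches_per_epoch)  # 3931
print(remainder)          # 2992 examples in the last, partial batch
```

This matches the 3931 steps per epoch shown in the training progress bars.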


## Training

Here we build the model and call the trainer.

In [22]:
word_frequencies = torch.from_numpy(vocab.get_word_frequencies())
model = word2vec.SkipGramNSModel(vocab_size, args.embedding_size, args.negative_sample_size, word_frequencies)
trainer = train.Word2VecTrainer(args, model, train_dataloader)
trainer.run()

[Epoch 0]: 100%|██████████| 3931/3931 [02:17<00:00, 28.58it/s, loss=1]
[Epoch 1]: 100%|██████████| 3931/3931 [02:16<00:00, 28.82it/s, loss=0.759]
[Epoch 2]: 100%|██████████| 3931/3931 [02:20<00:00, 28.05it/s, loss=0.729]
[Epoch 3]:  34%|███▎      | 1319/3931 [00:47<01:21, 32.03it/s, loss=0.717]

KeyboardInterrupt: ignored
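
For reference, the negative-sampling objective from the word2vec paper scores one (center, context) pair against k sampled noise words: `loss = -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)`. The pure-Python sketch below evaluates that formula on toy vectors. It is illustrative only: `sgns_loss` and the vectors are hypothetical, and `SkipGramNSModel` may differ in details such as batching, loss averaging, or how noise words are drawn.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sgns_loss(center, context, negatives) -> float:
    """Negative-sampling loss for one (center, context) pair."""
    positive_term = log_sigmoid(dot(center, context))
    negative_term = sum(log_sigmoid(-dot(center, neg)) for neg in negatives)
    return -(positive_term + negative_term)


# Toy 2-d vectors, made up for illustration.
center = [1.0, 0.0]
context = [1.0, 0.0]                  # aligned with the center word
negatives = [[-1.0, 0.0], [0.0, 1.0]]  # k = 2 noise words
print(round(sgns_loss(center, context, negatives), 4))
```

With all-zero vectors every sigmoid equals 0.5, so the loss is `(1 + k) * log(2)`; training should push it well below that value.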

## Examine Similarity of Embeddings

Now that we've trained our embeddings, let's see if the words that are clustered together make any sense.

We will use cosine similarity to find the words whose embeddings are closest in the embedding space. This is one
similarity metric; another popular choice is Euclidean distance, which you can compute with PyTorch's
`torch.cdist()` function. For fast approximate nearest-neighbor lookups, I also highly recommend Spotify's `annoy` package.
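
To make the two metrics concrete, here is a small pure-Python sketch that ranks words by cosine similarity and also computes Euclidean distance. The toy vectors and the helper `most_similar` are made up for illustration; the repository's `utils.get_cosine_similar` operates on the trained embedding tensor and may be implemented differently.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query: str, vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k words with the highest cosine similarity to `query`."""
    q = vectors[query]
    scores = [(w, cosine_similarity(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical 2-d embeddings.
toy = {"computer": [1.0, 0.2], "software": [0.9, 0.3], "market": [0.1, 1.0]}
ranked = most_similar("computer", toy, k=2)
print(ranked)
print(euclidean_distance(toy["computer"], toy["software"]))
```

Cosine similarity ignores vector length and compares direction only, whereas Euclidean distance is sensitive to both.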

In [23]:
embeddings = model.get_embeddings().to(torch.device('cpu'))

### Computer

Let's see the top 10 words associated with "computer".

In [25]:
utils.get_cosine_similar("computer", vocab._token_to_idx, embeddings)[0:10]

[('the', tensor(0.9999)),
 ('ron', tensor(0.9999)),
 ('raptors', tensor(0.9999)),
 ('backed', tensor(0.9999)),
 ('rated', tensor(0.9999)),
 ('ramadi', tensor(0.9999)),
 ('returning', tensor(0.9999)),
 ('veterans', tensor(0.9999)),
 ('arrest', tensor(0.9999)),
 ('unbeaten', tensor(0.9999))]

### Market

Let's see the top 10 words associated with "market".

In [26]:
utils.get_cosine_similar("market", vocab._token_to_idx, embeddings)[0:10]

[('maria', tensor(0.9999)),
 ('the', tensor(0.9999)),
 ('malicious', tensor(0.9999)),
 ('quarter', tensor(0.9999)),
 ('lay', tensor(0.9999)),
 ('hire', tensor(0.9999)),
 ('rush', tensor(0.9999)),
 ('did', tensor(0.9999)),
 ('passing', tensor(0.9999)),
 ('capriati', tensor(0.9999))]