# Skip-gram in Action

## Colab Setup

You can skip this section if you are not running on Google Colab.

If running with GPUs, sanity check that the GPUs are enabled.

In [None]:
!nvidia-smi

In [1]:
import torch
torch.cuda.is_available()

True

This should be True. If not, debug. (Note: the version of PyTorch I used is not compatible with the CUDA drivers on Colab. Follow these instructions here explicitly.)

In [2]:
!pwd

/content


This should be "/content" on Colab.

If running on Colab, you must first install the package. (You may skip this step if it is already installed.)

In [None]:
!git clone --single-branch --branch colab https://github.com/will-thompson-k/deeplearning-nlp-models.git
%cd deeplearning-nlp-models

In [None]:
!pip install datasets

In [None]:
!python setup.py install

## Imports

Here are the packages we need to import.

In [7]:
from nlpmodels.models import word2vec
from nlpmodels.utils import utils, train
from nlpmodels.utils.elt import skipgram_dataset
from argparse import Namespace
import torch
utils.set_seed_everywhere()

## Hyper-parameters

These are the data processing, skip-gram, and model training hyper-parameters for this run.

In [8]:
args = Namespace(
    # skip gram data hyper-parameters
    context_window_size = 10,
    subsample_t = 10.e-7, # threshold t for sub-sampling frequent words; 10.e-7 == 1e-6 (the paper suggests t around 1e-5)
    # Model hyper-parameters
    embedding_size = 512,
    negative_sample_size= 20, # k examples to be used in negative sampling loss function
    # Training hyper-parameters
    num_epochs=50,
    learning_rate=0.0001,
    batch_size = 4096,
)
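For reference, `subsample_t` is the threshold `t` in the sub-sampling rule from the word2vec paper, which discards each occurrence of a word `w` with probability `1 - sqrt(t / f(w))`, where `f(w)` is the word's relative frequency in the corpus. The sketch below only illustrates that formula; the helper name `discard_probability` is hypothetical, and the implementation inside `nlpmodels` may differ in detail.

```python
import math


def discard_probability(freq: float, t: float) -> float:
    """Probability of dropping one occurrence of a word.

    freq: relative frequency of the word in the corpus (0 < freq <= 1).
    t:    sub-sampling threshold (hypothetical stand-in for args.subsample_t).
    """
    if freq <= 0:
        raise ValueError("freq must be positive")
    # Words rarer than t are never dropped, so clamp at zero.
    return max(0.0, 1.0 - math.sqrt(t / freq))


# Frequent words are dropped aggressively; rare words are always kept.
for freq in (1e-2, 1e-4, 1e-6):
    print(f"f(w) = {freq:.0e} -> P(discard) = {discard_probability(freq, t=1e-6):.3f}")
```

A smaller `t` discards more occurrences of frequent words, which shrinks the training set and shifts weight toward rarer words.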

## Get Data

Call the function that fetches the training data (via Hugging Face's `datasets` package) and returns a dataloader along with a vocabulary (dictionary).

In [9]:
train_dataloader, vocab = skipgram_dataset.SkipGramDataset.get_training_dataloader(args.context_window_size,
                                                                                   args.subsample_t,
                                                                                   args.batch_size)

In [10]:
vocab_size = len(vocab)

print(f"The gist: context_window_size = {args.context_window_size}, "
      f"batch_size = {args.batch_size}, vocab_size = {vocab_size}, "
      f"embedding_size = {args.embedding_size}, k = {args.negative_sample_size}, "
      f"train_size = {len(train_dataloader.dataset)}"
      )

The gist: context_window_size = 10, batch_size = 4096, vocab_size = 61811, embedding_size = 512, k = 20, train_size = 2787170


## Training

Here we build the model and call the trainer.

In [None]:
word_frequencies = torch.from_numpy(vocab.get_word_frequencies())
model = word2vec.SkipGramNSModel(vocab_size, args.embedding_size, args.negative_sample_size, word_frequencies)
trainer = train.Word2VecTrainer(args, model, train_dataloader)
trainer.run()
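To make the role of `negative_sample_size` (k) concrete, here is a minimal sketch of the skip-gram negative-sampling objective as described in the word2vec paper: for each (center, context) pair the loss is `-log σ(u_o · v_c) - Σ_k log σ(-u_k · v_c)`, summed over k sampled negative words. The function name `negative_sampling_loss` and the tensor shapes are assumptions for illustration; this is not the code inside `SkipGramNSModel`.

```python
import torch
import torch.nn.functional as F


def negative_sampling_loss(center: torch.Tensor,
                           context: torch.Tensor,
                           negatives: torch.Tensor) -> torch.Tensor:
    """Mean negative-sampling loss over a batch (illustrative sketch).

    center:    (B, D) embeddings of the center words
    context:   (B, D) embeddings of the true context words
    negatives: (B, K, D) embeddings of K sampled negative words per pair
    """
    # Dot product between each center word and its true context word.
    pos_score = (center * context).sum(dim=1)                          # (B,)
    # Dot products between each center word and its K negatives.
    neg_score = torch.bmm(negatives, center.unsqueeze(2)).squeeze(2)   # (B, K)
    # Push positive scores up and negative scores down.
    loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(dim=1))
    return loss.mean()


torch.manual_seed(0)
B, K, D = 4, 20, 8  # toy sizes; K plays the role of args.negative_sample_size
loss = negative_sampling_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
print(loss.item())
```

In the paper, negatives are drawn from the unigram distribution raised to the 3/4 power, which is presumably why the model constructor takes `word_frequencies`.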

## Examine Similarity of Embeddings

Now that we've trained our embeddings, let's see if the words that are clustered together make any sense.

We will use cosine similarity to find the embeddings that are closest to each other in the embedding space. This is one
similarity metric. Another popular metric is based on Euclidean distance; to use it, check out PyTorch's
`torch.cdist()` function. For approximate nearest-neighbor search, I can't speak highly enough of Spotify's `annoy` package.
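The two metrics can rank neighbors differently, because cosine similarity ignores vector length while Euclidean distance does not. The following is a minimal sketch on a toy embedding matrix; the helper names `top_k_cosine` and `top_k_euclidean` are hypothetical and are not part of `nlpmodels` (the notebook itself uses `utils.get_cosine_similar`).

```python
import torch
import torch.nn.functional as F


def top_k_cosine(query_idx: int, embeddings: torch.Tensor, k: int = 10):
    """Return (index, similarity) pairs of the k rows most cosine-similar to the query row."""
    normed = F.normalize(embeddings, dim=1)   # unit-length rows
    sims = normed @ normed[query_idx]         # cosine similarity to every row
    sims[query_idx] = -float("inf")           # exclude the query itself
    values, indices = torch.topk(sims, k)
    return list(zip(indices.tolist(), values.tolist()))


def top_k_euclidean(query_idx: int, embeddings: torch.Tensor, k: int = 10):
    """Return (index, distance) pairs of the k rows closest to the query row in Euclidean distance."""
    query = embeddings[query_idx].unsqueeze(0)          # (1, D)
    dists = torch.cdist(query, embeddings).squeeze(0)   # (N,)
    dists[query_idx] = float("inf")                     # exclude the query itself
    values, indices = torch.topk(dists, k, largest=False)
    return list(zip(indices.tolist(), values.tolist()))


# Toy data: row 1 points in almost the same direction as row 0 but is much longer;
# row 2 is close to row 0 in space but points in a different direction.
toy = torch.tensor([[1.0, 0.0],
                    [10.0, 0.1],
                    [0.9, 0.5]])
print(top_k_cosine(0, toy, k=1))
print(top_k_euclidean(0, toy, k=1))
```

On this toy matrix the cosine neighbor of row 0 is row 1, while the Euclidean neighbor is row 2.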

In [12]:
embeddings = model.get_embeddings().to(torch.device('cpu'))

In [13]:
embeddings

tensor([[ 3.3240e-04, -1.8160e-04,  8.7463e-05,  ...,  4.8631e-05,
         -2.8457e-04, -6.9950e-04],
        [-1.0018e-06,  7.1809e-04, -7.2560e-04,  ..., -8.2127e-04,
         -3.2868e-04, -9.1119e-05],
        [-5.3387e-04, -2.3233e-04,  7.4901e-04,  ...,  4.8751e-04,
         -6.4552e-04, -1.1948e-04],
        ...,
        [ 7.5621e-04,  3.8125e-04,  3.1393e-04,  ...,  7.3202e-04,
          5.5612e-04,  3.6417e-04],
        [ 8.0143e-02, -7.9639e-02,  7.9455e-02,  ...,  7.9445e-02,
          7.9570e-02, -8.0183e-02],
        [ 8.9229e-02, -8.9240e-02,  8.8531e-02,  ...,  8.8135e-02,
          8.8876e-02, -8.9196e-02]])

### Computer

Let's see the top 10 words associated with "computer".

In [14]:
utils.get_cosine_similar("computer",vocab._token_to_idx,embeddings)[0:10]

[('systems', tensor(0.9948)),
 ('phone', tensor(0.9944)),
 ('online', tensor(0.9920)),
 ('ipod', tensor(0.9917)),
 ('digital', tensor(0.9907)),
 ('server', tensor(0.9885)),
 ('product', tensor(0.9883)),
 ('storage', tensor(0.9869)),
 ('services', tensor(0.9860)),
 ('technology', tensor(0.9856))]

### Market

Let's see the top 10 words associated with "market".

In [15]:
utils.get_cosine_similar("market",vocab._token_to_idx,embeddings)[0:10]

[('expectations', tensor(0.9880)),
 ('income', tensor(0.9880)),
 ('nasdaq', tensor(0.9869)),
 ('analysts', tensor(0.9869)),
 ('october', tensor(0.9856)),
 ('raised', tensor(0.9843)),
 ('awaited', tensor(0.9839)),
 ('low', tensor(0.9835)),
 ('financial', tensor(0.9830)),
 ('global', tensor(0.9824))]