The sources of this material:
* [Kaggle competition for "Women's E-Commerce Clothing Reviews" TF-IDF kernel](https://www.kaggle.com/shivam1600/simple-information-retrieval-using-tf-idf-and-lsa)
* [Information retrival: TF-IDF Ranking](https://github.com/williamscott701/Information-Retrieval/blob/master/2.%20TF-IDF%20Ranking%20-%20Cosine%20Similarity,%20Matching%20Score/TF-IDF.ipynb)
* [YDS word vectors seminar](https://github.com/yandexdataschool/nlp_course/tree/2019/week01_embeddings)
* [Lena Voita's Word Embeddings Lecture](https://drive.google.com/file/d/1y2GKIKBzie7l8iycBO6gTKGiTTfJc4Dr/view)
* [Word2Vec Pytorch implementation](https://github.com/blackredscarf/pytorch-SkipGram)
* [Doc2Vec tutorial](https://github.com/RaRe-Technologies/gensim/blob/ca0dcaa1eca8b1764f6456adac5719309e0d8e6d/docs/notebooks/doc2vec-IMDB.ipynb)

Prerequisite download:

In [None]:
# For Count-based models section
!wget https://raw.githubusercontent.com/dardem/word2vec_seminar/master/Womens%20Clothing%20E-Commerce%20Reviews.csv

In [None]:
# For Word2Vec section
!wget http://mattmahoney.net/dc/text8.zip && unzip text8.zip
!wget https://github.com/blackredscarf/pytorch-SkipGram/raw/master/data_utils.py
!wget https://github.com/blackredscarf/pytorch-SkipGram/raw/master/vector_handle.py
!wget https://github.com/dardem/word2vec_seminar/raw/master/eval.zip && unzip eval.zip

In [3]:
# For Pretrained models examples

# English
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
!wget nlp.stanford.edu/data/wordvecs/glove.6B.zip
!unzip glove.6B.zip

# Russian
!wget http://vectors.nlpl.eu/repository/20/214.zip
!unzip 214.zip -d ru_fasttext_model

--2023-03-31 15:03:25--  https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.81.18, 2620:100:6031:18::a27d:5112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.81.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/obaitrix9jyu84r/quora.txt [following]
--2023-03-31 15:03:25--  https://www.dropbox.com/s/dl/obaitrix9jyu84r/quora.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc2653f1b6f7cce51a03f02f759c.dl.dropboxusercontent.com/cd/0/get/B5Q3lDh5OIzofQyBCR7BIuX4ntYb_-uESPgoqx6eUUl7B3A2BZUi3f0bPPf4spf3P4jR1yy-YqNir0wC9jKdMmiT-eBlf2tMqj8StBp_U3b0hey1VPH1b70irCdeOllM-72ZzXTI-1RvLRp_kjGyblsnSViGoCKn236us1BFygkIVg/file?dl=1# [following]
--2023-03-31 15:03:26--  https://uc2653f1b6f7cce51a03f02f759c.dl.dropboxusercontent.com/cd/0/get/B5Q3lDh5OIzofQyBCR7BIuX4ntYb_-uESPgoqx6eUUl7B3A2BZUi3f0bPPf4spf3P4jR1yy-YqNir0wC9jKd

In [4]:
# For Application examples
!wget http://panchenko.me/slides/nnlp/data/cc.ru.300.vec.zip
!unzip cc.ru.300.vec.zip
!wget http://panchenko.me/slides/nnlp/data/cc.uk.300.vec.zip
!unzip cc.uk.300.vec.zip
!wget http://panchenko.me/slides/nnlp/data/ukr_rus.train.txt
!wget http://panchenko.me/slides/nnlp/data/ukr_rus.test.txt
!wget http://panchenko.me/slides/nnlp/data/fairy_tale.txt

--2023-03-31 15:33:48--  http://panchenko.me/slides/nnlp/data/cc.ru.300.vec.zip
Resolving panchenko.me (panchenko.me)... 130.104.253.4
Connecting to panchenko.me (panchenko.me)|130.104.253.4|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-03-31 15:33:49 ERROR 404: Not Found.

unzip:  cannot find or open cc.ru.300.vec.zip, cc.ru.300.vec.zip.zip or cc.ru.300.vec.zip.ZIP.
--2023-03-31 15:33:49--  http://panchenko.me/slides/nnlp/data/cc.uk.300.vec.zip
Resolving panchenko.me (panchenko.me)... 130.104.253.4
Connecting to panchenko.me (panchenko.me)|130.104.253.4|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-03-31 15:33:49 ERROR 404: Not Found.

unzip:  cannot find or open cc.uk.300.vec.zip, cc.uk.300.vec.zip.zip or cc.uk.300.vec.zip.ZIP.
--2023-03-31 15:33:49--  http://panchenko.me/slides/nnlp/data/ukr_rus.train.txt
Resolving panchenko.me (panchenko.me)... 130.104.253.4
Connecting to panchenko.me (panchenko.me)|130.104.253.4|:80... 

## Motivation for embeddings



<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/Vector-representation-motivation.png" style="width:100%">

Source: https://drive.google.com/file/d/1y2GKIKBzie7l8iycBO6gTKGiTTfJc4Dr/view

## word2vec

`Word2Vec` is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant to each other have differing meanings. For example, `strong` and `powerful` would be close together and `strong` and `Paris` would be relatively far. 

### Main idea

<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/w2v-example1.png" style="width:100%">
<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/w2v-example2.png" style="width:100%">

Source: https://drive.google.com/file/d/1y2GKIKBzie7l8iycBO6gTKGiTTfJc4Dr/view

## Theory

There are two types of word to vec.

### Skipgram

Predicting outside word $o$ from central $c$. We have two embedding mattrix $u$ and $v$.



![](https://i.ibb.co/xgT4k8b/2020-10-02-10-10-21.png)

$$P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum\limits_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}$$.

More formally we need to maximize Likelihood:
$$
L(\theta)=\prod_{t=1}^{T} \prod_{-m \leq j \leq m, j \neq 0} P\left(w_{t+j} \mid w_{t}, \theta\right)
$$

$$
L_{\log}(\theta) = \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log P\left(w_{t+j} \mid w_{t}, \theta\right) = \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} \log \frac{\exp \left(u_{t+j}^{T} v_{t}\right)}{\sum\limits_{w \in V} \exp \left(u_{w}^{T} v_{t}\right)} = \\ \sum_{t=1}^{T} \sum_{-m \leq j \leq m, j \neq 0} u_{t+j}^{T} v_{t} - \log \sum_{w \in V} \exp \left(u_{w}^{T} v_{t}\right)
$$

$$
loss = -L_{\log}(\theta)
$$

Let's count derivative!

**Reminder**

$$\frac{\partial x^T y}{\partial y} = x$$



$$
\frac{\partial L_{log}(\theta)}{\partial v_t} = u_o - \dfrac{1}{\sum\limits_{w \in V}\exp(u_w^T v_t)}\cdot\sum_x \exp(u_x^T v_t) u_x = u_o - \sum_x \frac{\exp(u_x v_t)}{\sum\limits_{w \in V} \exp(u_w v_t)} u_x = \\ = u_0 - \sum_x P(u_x| v_t) u_x
$$

### Implementation

Simplest implementation (without negative sampling):


```python

class VanillaSkipGram(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super(SkipGram, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.output = nn.Linear(embedding_size, vocab_size)

    def forward(self, center_word):
        center_emb = self.embedding(center_word)
        out = self.output(center_emb)
        return out

```

What is the problem with this approach? 

(Hint: look at the formulas above and try to guess which operation there might be computationally bad)

How to solve this? There are several ideas introduced in Mikolov et al. 

1. **Negative Sampling**: Instead of computing the probability of all words in the vocabulary, negative sampling only computes the probabilities of a small number (e.g., 5-20) of negative samples. This reduces the computational cost of computing the loss and speeds up training. Read more [here.](https://www.baeldung.com/cs/nlps-word2vec-negative-sampling)

2. **Subsampling frequent words**: Instead of using all instances of a word in the corpus, some of them are randomly discarded. This can help to remove noise and improve the quality of the learned embeddings, especially for frequent words that are likely to occur in many different contexts. Read more [here.](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)

3. **Hierarchical Softmax**: Hierarchical softmax is an alternative to the standard softmax that uses a binary tree structure to represent the words in the vocabulary. The probability of each word is computed by traversing the tree from the root to a leaf node that represents the word. This reduces the computational complexity of computing the softmax to $O(\log_2{V})$ instead of $O(V)$, where $V$ is the size of the vocabulary. Read more [here.](https://www.ruder.io/word-embeddings-softmax/#:~:text=Hierarchical%20softmax%20(H%2DSoftmax),be%20seen%20in%20Figure%201)





Let's derive our objective with negative sampling applied!

In negative sampling, we sample a few negative words for each target word instead of using all the other words as negative examples. Let's assume we sample $k$ negative words for each target word. We can rewrite the loss function as:

$$L_{\log }(\theta)=\sum_{t=1}^T \sum\limits_{-m \leq j \leq m, j \neq 0}\left[\log \sigma\left(u_{w_{t+j}}^T v_{w_t}\right)+\sum_{i=1}^k \log \sigma\left(-u_{n_i}^T v_{w_t}\right)\right],$$

where $n_i$ denotes the $i^{th}$ negative sample.

We need to compute the gradients of the loss with respect to the input word vector $v_{w_t}$ and the output word vectors $u_{w_{t+j}}$ and $u_{n_i}$.

Let's first consider the gradient with respect to the input vector $v_{w_t}$. We have:

$$\frac{\partial L_{\log }(\theta)}{\partial v_{w_t}}=\sum\limits_{-m \leq j \leq m, j \neq 0}\left(\frac{\partial}{\partial v_{w_t}} \log \sigma\left(u_{w_{t+j}}^T v_{w_t}\right)+\sum_{i=1}^k \frac{\partial}{\partial v_{w_t}} \log \sigma\left(-u_{n_i}^T v_{w_t}\right)\right)$$

Using the chain rule, we can rewrite the first term in the sum as:

$$\frac{\partial}{\partial v_{w_t}} \log \sigma\left(u_{w_{t+j}}^T v_{w_t}\right)=\frac{\partial}{\partial u_{w_{t+j}}} \log \sigma\left(u_{w_{t+j}}^T v_{w_t}\right) \cdot \frac{\partial}{\partial v_{w_t}} u_{w_{t+j}}^T v_{w_t}=\sigma\left(u_{w_{t+j}}^T v_{w_t}\right) u_{w_{t+j}}$$

And the second term as:

$$\begin{aligned}
\frac{\partial}{\partial v_{w_t}} \sum_{i=1}^k \log \sigma\left(-u_{n_i}^T v_{w_t}\right) & = - \sum\limits_{i=1}^{k}\frac{u_{n_i}}{1+\exp{(u^{T}_{n_i}v_{w_i})}}
\end{aligned}$$


Finally, we have $$\frac{\partial L_{\log }(\theta)}{\partial v_{w_t}} = \sum\limits_{-m \leq j \leq m, j \neq 0}\left(\sigma\left(u_{w_{t+j}}^T v_{w_t}\right) u_{w_{t+j}} - \sum\limits_{i=1}^{k}\frac{u_{n_i}}{1+\exp{(u^{T}_{n_i}v_{w_i})}}\right)$$

#### Some additional downloading

Dataset download:

(for more details about the data please see: http://mattmahoney.net/dc/textdata.html, section Relationship of Wikipedia Text to Clean Text)

Supplementary functions for dataset preprocessing download:

In [None]:
import collections
import os
import pickle
import random
import urllib
from io import open
from typing import Dict, List, Tuple

import numpy as np
import torch


def maybe_download(filename: str, expected_bytes: int) -> str:
    """
    Download a file if it does not exist and verify its size.
    :param filename: name of the file to download
    :param expected_bytes: expected size of the file in bytes
    :return: the path to the downloaded file
    """
    if not os.path.exists(filename):
        print("start downloading...")
        url = "http://mattmahoney.net/dc/"
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print("Found and verified", filename)
    else:
        print(statinfo.st_size)
        raise Exception(f"Failed to verify {filename}. Download manually.")
    return filename


def read_own_data(filename: str) -> list:
    """
    Read data from a file.
    :param filename: name of the file to read
    :return: a list of words
    """
    print("reading data...")
    with open(filename, "r", encoding="utf-8") as f:
        data = f.read().split()
    print("corpus size", len(data))
    return data


def build_dataset(
    words: List[str], n_words: int
) -> Tuple[List[int], List[Tuple[int, int]], Dict[str, int], Dict[int, str]]:
    """
    build dataset
    :param words: corpus
    :param n_words: learn most common n_words
    :return:
        - data: [word_index]
        - count: [ [word_index, word_count], ]
        - dictionary: {word_str: word_index}
        - reversed_dictionary: {word_index: word_str}
    """
    count = [["UNK", -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = {}
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = []
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # UNK index is 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary


def dataset_tofile(
    data: List[int],
    count: List[Tuple[int, int]],
    dictionary: Dict[str, int],
    reversed_dictionary: Dict[int, str],
) -> None:
    """
    Writes the dataset to files.
    :param data: A list of word indices
    :param count: A list of tuples containing word index and its count
    - dictionary: A dictionary mapping word strings to word indices
    :param reversed_dictionary: A dictionary mapping word indices to word strings
    """
    pickle.dump(data, open("data/data.list", "wb"))
    pickle.dump(count, open("data/count.list", "wb"))
    pickle.dump(dictionary, open("data/word2index.dict", "wb"))
    pickle.dump(reversed_dictionary, open("data/index2word.dict", "wb"))


def read_fromfile() -> Tuple[
    List[int], List[Tuple[int, int]], Dict[str, int], Dict[int, str]
]:
    """
    Reads the dataset from files.
    :return:
    - data: A list of word indices
    - count: A list of tuples containing word index and its count
    - dictionary: A dictionary mapping word strings to word indices
    - reversed_dictionary: A dictionary mapping word indices to word strings
    """
    data = pickle.load(open("data/data.list", "rb"))
    count = pickle.load(open("data/count.list", "rb"))
    dictionary = pickle.load(open("data/word2index.dict", "rb"))
    reversed_dictionary = pickle.load(open("data/index2word.dict", "rb"))
    return data, count, dictionary, reversed_dictionary


def noise(vocabs: List[int], word_count: List[Tuple[int, int]]) -> List[int]:
    """
    Generates a noise distribution.
    :param vocabs: A list of word indices
    :param word_count: A list of tuples containing word index and its count
    :return: A list of word indices according to their frequency distribution
    """
    Z = 0.001
    unigram_table = []
    num_total_words = sum(c for w, c in word_count)
    for vo in vocabs:
        unigram_table.extend(
            [vo] * int(((word_count[vo][1] / num_total_words) ** 0.75) / Z)
        )

    print("vocabulary size", len(vocabs))
    print("unigram_table size:", len(unigram_table))
    return unigram_table


class DataPipeline:
    def __init__(self, data, vocabs, word_count, data_index=0, use_noise_neg=True):
        self.data = data
        self.data_index = data_index
        self.unigram_table = noise(vocabs, word_count) if use_noise_neg else vocabs

    def get_neg_data(self, batch_size, num, target_inputs):
        """
        sample the negative data. Don't use np.random.choice(), it is very slow.
        :param batch_size: int
        :param num: int
        :param target_inputs: []
        :return:
        """
        neg = np.zeros((num))
        for i in range(batch_size):
            delta = random.sample(self.unigram_table, num)
            while target_inputs[i] in delta:
                delta = random.sample(self.unigram_table, num)
            neg = np.vstack([neg, delta])
        return neg[1 : batch_size + 1]

    def generate_batch(self, batch_size, num_skips, skip_window):
        """
        get the data batch
        :param batch_size:
        :param num_skips:
        :param skip_window:
        :return: target batch and context batch
        """
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window
        batch = np.ndarray(shape=(batch_size), dtype=np.int32)
        labels = np.ndarray(shape=(batch_size), dtype=np.int32)
        span = 2 * skip_window + 1  # [ skip_window, target, skip_window ]
        buffer = collections.deque(maxlen=span)
        for _ in range(span):
            buffer.append(self.data[self.data_index])
            self.data_index = (self.data_index + 1) % len(self.data)
        for i in range(batch_size // num_skips):
            target = skip_window
            targets_to_avoid = [skip_window]
            for j in range(num_skips):
                while target in targets_to_avoid:
                    target = random.randint(0, span - 1)
                targets_to_avoid.append(target)
                batch[i * num_skips + j] = buffer[skip_window]
                labels[i * num_skips + j] = buffer[target]
            buffer.append(self.data[self.data_index])
            self.data_index = (self.data_index + 1) % len(self.data)
        self.data_index = (self.data_index + len(self.data) - span) % len(self.data)
        return batch, labels


def model_to_vector(
    model: torch.nn.Module, emb_layer_name: str = "input_emb"
) -> np.ndarray:
    """
    Get the word embedding weights from the model.

    :param model: The model containing the word embeddings.
    :param emb_layer_name: The name of the embedding layer.
    :return: A numpy array of word embeddings.
    """
    sd = model.state_dict()
    return sd[f"{emb_layer_name}.weight"].cpu().numpy().tolist()


def save_embedding(file_name: str, embeddings: np.ndarray, id2word: dict):
    """
    Save the word embeddings to a text file.

    :param file_name: The name of the file to save the embeddings to.
    :param embeddings: A numpy array of word embeddings.
    :param id2word: A dictionary mapping word indices to words.
    """
    with open(file_name, "w") as fo:
        for idx in range(len(embeddings)):
            word = id2word[idx]
            embed = embeddings[idx]
            embed_list = [str(i) for i in embed]
            line_str = " ".join(embed_list)
            fo.write(f"{word} {line_str}" + "\n")


def nearest(
    model: torch.nn.Module,
    vali_examples: np.ndarray,
    vali_size: int,
    id2word_dict: Dict[int, str],
    top_k: int = 8,
):
    """
    Find the nearest words to the given validation examples.

    :param model: The trained model.
    :param vali_examples: An array of validation examples.
    :param vali_size: The number of validation examples.
    :param id2word_dict: A dictionary mapping word indices to words.
    :param top_k: The number of nearest words to return.
    """
    vali_examples = torch.tensor(vali_examples, dtype=torch.long).cuda()
    vali_emb = model.predict(vali_examples)
    # sim: [batch_size, vocab_size]
    sim = torch.mm(vali_emb, model.input_emb.weight.transpose(0, 1))
    for i in range(vali_size):
        vali_word = id2word_dict[vali_examples[i].item()]
        nearest = (-sim[i, :]).sort()[1][1 : top_k + 1]
        log_str = f"Nearest to {vali_word}:"
        for k in range(top_k):
            close_word = id2word_dict[nearest[k].item()]
            log_str = f"{log_str} {close_word},"
        print(log_str)


Supplementary functions for model evaluation download:

In [None]:
!wget https://github.com/dardem/word2vec_seminar/raw/master/eval.zip && eval.zip

--2022-11-21 17:08:54--  https://github.com/dardem/word2vec_seminar/raw/master/eval.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dardem/word2vec_seminar/master/eval.zip [following]
--2022-11-21 17:08:54--  https://raw.githubusercontent.com/dardem/word2vec_seminar/master/eval.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7054 (6.9K) [application/zip]
Saving to: ‘eval.zip’


2022-11-21 17:08:55 (75.3 MB/s) - ‘eval.zip’ saved [7054/7054]

Archive:  eval.zip
   creating: eval/
  inflating: eval/wordsim.py         
  inflating: eval/read_write.py      
  inflating: eval/ranking.py       

Check the installed files:

In [None]:
!ls

 214.zip	     glove.6B.zip
 214.zip.1	     quora.txt
 data_utils.py	     ru_fasttext_model
 eval		     sample_data
 eval.zip	     text8
 glove.6B.100d.txt   text8.zip
 glove.6B.200d.txt   text8.zip.1
 glove.6B.300d.txt   vector_handle.py
 glove.6B.50d.txt   'Womens Clothing E-Commerce Reviews.csv'


#### SkipGram model

<img src="https://raw.githubusercontent.com/dardem/word2vec_seminar/master/img/word2vec_diagram-1.jpg" style="width:100%">

<img src="https://raw.githubusercontent.com/dardem/word2vec_seminar/master/img/Skip-gram-architecture-2.jpg" style="width:30%">

<img src="https://raw.githubusercontent.com/dardem/word2vec_seminar/master/img/w2v-loss.png" style="width:100%">

source: https://drive.google.com/file/d/1y2GKIKBzie7l8iycBO6gTKGiTTfJc4Dr/view

<img src="https://raw.githubusercontent.com/dardem/word2vec_seminar/master/img/w2v-objective.png" style="width:100%">

In [None]:
import torch
from torch import nn


class SkipGramNeg(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int):
        """
        Initialize SkipGramNeg model.

        :param vocab_size: size of the vocabulary
        :param emb_dim: dimensionality of the embedding 
        """
        super(SkipGramNeg, self).__init__()
        self.input_emb = nn.Embedding(vocab_size, emb_dim)
        self.output_emb = nn.Embedding(vocab_size, emb_dim)
        self.log_sigmoid = nn.LogSigmoid()

        initrange = (2.0 / (vocab_size + emb_dim)) ** 0.5  # Xavier init
        # read more: https://cs230.stanford.edu/section/4/#:~:text=The%20goal%20of%20Xavier%20Initialization,gradient%20from%20exploding%20or%20vanishing. 
        self.input_emb.weight.data.uniform_(-initrange, initrange)
        self.output_emb.weight.data.uniform_(-0, 0)


    def forward(
        self, target_input: torch.Tensor, context: torch.Tensor, neg: torch.Tensor
    ) -> torch.Tensor:
        """
        Compute the loss for the SkipGramNeg model.

        :param target_input: tensor of shape [batch_size] containing target inputs
        :param context: tensor of shape [batch_size] containing context inputs
        :param neg: tensor of shape [batch_size, neg_size] containing negative samples
        :return: tensor of shape [] representing the loss
        """
        # u,v: [batch_size, emb_dim]
        v = self.input_emb(target_input)
        u = self.output_emb(context)
        # positive_val: [batch_size]
        positive_val = self.log_sigmoid(torch.sum(u * v, dim=1)).squeeze()

        # u_hat: [batch_size, neg_size, emb_dim]
        u_hat = self.output_emb(neg)
        # [batch_size, neg_size, emb_dim] x [batch_size, emb_dim, 1] = [batch_size, neg_size, 1]
        # neg_vals: [batch_size, neg_size]
        neg_vals = torch.bmm(u_hat, v.unsqueeze(2)).squeeze(2) # batch matrix-matrix product of matrices
        # neg_val: [batch_size]
        neg_val = self.log_sigmoid(-torch.sum(neg_vals, dim=1)).squeeze()

        loss = positive_val + neg_val
        return -loss.mean()

    def predict(self, inputs):
        return self.input_emb(inputs)

In [None]:
import os
import random
import torch
from torch.optim import SGD
from tqdm.auto import trange

class Word2Vec:
    def __init__(
        self,
        data_path: str,
        vocabulary_size: int,
        embedding_size: int,
        learning_rate: float = 1.0,
    ) -> None:
        """
        Initialize Word2Vec model.

        :param data_path: Path to the input text file.
        :param vocabulary_size: Size of the vocabulary.
        :param embedding_size: Size of the embedding.
        :param learning_rate: Learning rate for the optimizer.
        """
        self.corpus = read_own_data(data_path)

        self.data, self.word_count, self.word2index, self.index2word = build_dataset(
            self.corpus, vocabulary_size
        )
        self.vocabs = list(set(self.data))

        self.model: SkipGramNeg = SkipGramNeg(vocabulary_size, embedding_size).cuda()
        self.model_optim = SGD(self.model.parameters(), lr=learning_rate)

    def train(
        self,
        train_steps: int,
        skip_window: int = 1,
        num_skips: int = 2,
        num_neg: int = 20,
        batch_size: int = 128,
        data_offset: int = 0,
        vali_size: int = 3,
        output_dir: str = "out",
    ) -> None:
        """
        Train the Word2Vec model.

        :param train_steps: Number of training steps.
        :param skip_window: Window size for skip-gram model.
        :param num_skips: Number of times to reuse an input to generate a label.
        :param num_neg: Number of negative samples for negative sampling.
        :param batch_size: Size of each training batch.
        :param data_offset: Starting position in the input data.
        :param vali_size: Number of validation examples to sample.
        :param output_dir: Directory to save the trained model.
        """
        if not os.path.exists(output_dir):
            os.mkdir(output_dir)
        self.outputdir = output_dir

        self.model.to(torch.device("cuda"))

        avg_loss = 0
        pipeline = DataPipeline(self.data, self.vocabs, self.word_count, data_offset)
        vali_examples = random.sample(self.vocabs, vali_size)

        for step in trange(train_steps):
            batch_inputs, batch_labels = pipeline.generate_batch(
                batch_size, num_skips, skip_window
            )
            batch_neg = pipeline.get_neg_data(batch_size, num_neg, batch_inputs)

            batch_inputs = torch.tensor(batch_inputs, dtype=torch.long).cuda()
            batch_labels = torch.tensor(batch_labels, dtype=torch.long).cuda()
            batch_neg = torch.tensor(batch_neg, dtype=torch.long).cuda()

            loss = self.model(
                batch_inputs.to(torch.device("cuda")),
                batch_labels.to(torch.device("cuda")),
                batch_neg.to(torch.device("cuda")),
            )

            self.model_optim.zero_grad()
            loss.backward()
            self.model_optim.step()

            avg_loss += loss.item()

            if step % 2000 == 0 and step > 0:
                avg_loss /= 2000
                print("Average loss at step ", step, ": ", avg_loss)
                avg_loss = 0
            if step % 10000 == 0 and vali_size > 0:
                nearest(self.model, vali_examples, vali_size, self.index2word, top_k=8)
            # checkpoint
            if step % 100000 == 0 and step > 0:
                torch.save(
                    self.model.state_dict(), self.outputdir + "/model_step%d.pt" % step
                )
        # save model at last
        torch.save(
            self.model.state_dict(), self.outputdir + "/model_step%d.pt" % train_steps
        )

    def save_model(self, out_path):
        torch.save(self.model.state_dict(), f"{out_path}/model.pt")

    def get_list_vector(self) -> List[float]:
        sd = self.model.state_dict()
        return sd["input_emb.weight"].tolist()

    def save_vector_txt(self, path_dir: str) -> None:
        embeddings = self.get_list_vector()
        with open(f"{path_dir}/vector.txt", "w") as fo:
            for idx in range(len(embeddings)):
                word = self.index2word[idx]
                embed = embeddings[idx]
                embed_list = [str(i) for i in embed]
                line_str = " ".join(embed_list)
                fo.write(f"{word} {line_str}" + "\n")

    def load_model(self, model_path: str) -> None:
        self.model.load_state_dict(torch.load(model_path))

    
    def vector(self, index:int) -> torch.Tensor:
        self.model.predict(index)

    def most_similar(self, word, top_k=8):
        index = self.word2index[word]
        index = torch.tensor(index, dtype=torch.long).cuda().unsqueeze(0)
        emb = self.model.predict(index)
        sim = torch.mm(emb, self.model.input_emb.weight.transpose(0, 1))
        nearest = (-sim[0]).sort()[1][1 : top_k + 1]
        top_list = []
        for k in range(top_k):
            close_word = self.index2word[nearest[k].item()]
            top_list.append(close_word)
        return top_list


#### Your own Word2Vec model!

Soooo, let's finally build your own Word2Vec model!

In [None]:
# init dataset and model
word2vec = Word2Vec(data_path='text8',
                    vocabulary_size=50000,
                    embedding_size=300)

reading data...
corpus size 17005207


In [None]:
# additional check for output folder
if not os.path.exists('out'):
  os.mkdir('out')

In [None]:
%%time

# train model
word2vec.train(train_steps=200000, #100000, 200000,
               skip_window=5,
               num_skips=2,
               num_neg=20,
               output_dir='out/run-1')


# save vector txt file
word2vec.save_vector_txt(path_dir='out/run-1')

vocabulary size 50000
unigram_table size: 2870


  0%|          | 0/200000 [00:00<?, ?it/s]

Nearest to surreal: forgotten, qatar, coexisted, orientalist, arsenic, preexisting, singer, galleria,
Nearest to hydride: global, cv, spires, brass, fool, lina, guangdong, dated,
Nearest to brownlow: lux, tsr, authorised, elle, rite, almohades, lexicology, prices,
Average loss at step  2000 :  0.9121238741278649
Average loss at step  4000 :  0.8025087597221137
Average loss at step  6000 :  0.78270406255126
Average loss at step  8000 :  0.7669008255451918
Average loss at step  10000 :  0.7435391616523266
Nearest to surreal: zero, one, seven, archie, three, four, eight, five,
Nearest to hydride: zero, seven, nine, coke, one, four, three, gland,
Nearest to brownlow: agave, or, such, be, not, brownlow, as, that,
Average loss at step  12000 :  0.7421490262299776
Average loss at step  14000 :  0.7295556342899799
Average loss at step  16000 :  0.6951937739104033
Average loss at step  18000 :  0.6859835149757564
Average loss at step  20000 :  0.7142104227691889
Nearest to surreal: dasyprocta, 

In [None]:
 !ls ./out/run-1

In [None]:
# Example of extracting word's representation

vector = word2vec.get_list_vector()
print(vector[123])
print(vector[word2vec.word2index['hello']])

[-0.08280034363269806, -0.08368674665689468, -0.0058118379674851894, -0.07497856020927429, 0.0861855149269104, 0.08410929143428802, -0.07177767902612686, 0.10033005475997925, 0.028514793142676353, -0.03580036759376526, 0.008292997255921364, -0.14421528577804565, -0.021862002089619637, -0.09427840262651443, -0.11394796520471573, -0.05866638943552971, -0.07892245799303055, -0.0436282753944397, -0.038860686123371124, -2.3294560378417373e-06, -0.08930206298828125, 0.1010827049612999, -0.000437460868852213, -0.08761536329984665, 0.020901096984744072, 0.07310371100902557, 0.08587057888507843, 0.03580355644226074, 0.04494701325893402, -0.08114765584468842, 0.0046803816221654415, -0.004630394279956818, -0.03269350156188011, 0.041956283152103424, -0.0048202998004853725, -0.04971674457192421, 0.09384263306856155, 0.037126947194337845, -0.02730819769203663, -0.019478222355246544, -0.10872356593608856, -0.14582288265228271, 0.03835826367139816, 0.05003008246421814, 0.06937114149332047, -0.06047366

In [None]:
# get top k similar word
sim_list = word2vec.most_similar('one', top_k=8)
print(sim_list)

['nine', 'zero', 'eight', 'two', 'six', 'five', 'seven', 'four']


In [None]:
# try also for random validation samples and check if the model became better
sim_list = word2vec.most_similar("myself", top_k=8)
print(sim_list)

['if', 'be', 'not', 'do', 'so', 'often', 'are', 'have']


In [None]:
# load pre-train model
# word2vec.load_model('out/run-1/model_step200000.pt')

In [None]:
# some magic for the famous trick
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

mystery_word = (
    np.array(vector[word2vec.word2index["king"]])
    - np.array(vector[word2vec.word2index["man"]])
).reshape(1, -1)

# try with othe random words, e.g. kitty :)
cosine_similarity(
    mystery_word, np.array(vector[word2vec.word2index["queen"]]).reshape(1, -1)
)


array([[0.78697514]])

## Pretrained models

__Word vectors:__ as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fasttext that uses character-level models to train word embeddings. 

The choice is huge, so let's start someplace small: __gensim__ is another nlp library that features many vector-based models incuding word2vec.

### English language

#### Predefined architecture

Train data downloading:

In [None]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

--2023-03-30 11:10:49--  https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.18, 2620:100:6018:18::a27d:312
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/obaitrix9jyu84r/quora.txt [following]
--2023-03-30 11:10:49--  https://www.dropbox.com/s/dl/obaitrix9jyu84r/quora.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uced0c2069ba326520799200e5cd.dl.dropboxusercontent.com/cd/0/get/B5NzlnjNasaphNjgUGp5kMn_uT-ukkB9Do7KeUpqgGgP3SyI2AWmpuE9H8bg_8_WG79dd3vkkg8BnjxW_OIbVNBOPCam_iI5jjGVfndfm0RZSpqAUyCQcaTQiRSZ-D8ZKvVOXNmEIhYDE4Ymq4zDn_8rKRR-upge3UH9ptlvy2d5mA/file?dl=1# [following]
--2023-03-30 11:10:49--  https://uced0c2069ba326520799200e5cd.dl.dropboxusercontent.com/cd/0/get/B5NzlnjNasaphNjgUGp5kMn_uT-ukkB9Do7KeUpqgGgP3SyI2AWmpuE9H8bg_8_WG79dd3vkkg8BnjxW_OIbVNB

In [None]:
import numpy as np

data = list(open("./quora.txt", encoding="utf-8"))
print(data[42])
print(data[20:22])

How does the finance credit score work?

['Who discovered plate tectonics and how?\n', 'Is the Earth the only planet that has life on it?\n']


__Tokenization:__ a typical first step for an nlp task is to split raw data into words.
The text we're working with is in raw format: with all the punctuation and smiles attached to some words, so a simple str.split won't do.

Let's use __`nltk`__ - a library that handles many nlp tasks like tokenization, stemming or part-of-speech tagging.

In [None]:
!pip install pymorphy2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pymorphy2
  Downloading pymorphy2-0.9.1-py3-none-any.whl (55 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m55.5/55.5 KB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting docopt>=0.6
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pymorphy2-dicts-ru<3.0,>=2.4
  Downloading pymorphy2_dicts_ru-2.4.417127.4579844-py2.py3-none-any.whl (8.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.2/8.2 MB[0m [31m37.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting dawg-python>=0.7.1
  Downloading DAWG_Python-0.7.2-py2.py3-none-any.whl (11 kB)
Building wheels for collected packages: docopt
  Building wheel for docopt (setup.py) ... [?25l[?25hdone
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13721 sha256=0acd3839d028116b2c8620888a0f41c8eb98090849b22b2

In [None]:
from nltk.tokenize import WordPunctTokenizer
# !pip install pymorphy2
import pymorphy2
morph = pymorphy2.MorphAnalyzer()
tokenizer = WordPunctTokenizer()
# data = ['я тебя люблю очень сильно', 'ты здание банка китая']
# print(tokenizer.tokenize(data[42]))
# morph.parse(data[0][])

In [None]:
data_tok = [tokenizer.tokenize(sent.lower()) for sent in data]

In [None]:
data_tok[0]
morph.parse(data_tok[1][3])

[Parse(word='ways', tag=OpencorporaTag('LATN'), normal_form='ways', score=1.0, methods_stack=((LatinAnalyzer(score=0.9), 'ways'),))]

In [None]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


Load the model architecture:

In [None]:
from gensim.models import Word2Vec

In [None]:
%%time
en_w2v_model = Word2Vec(min_count=1)
en_w2v_model.build_vocab(data_tok)
en_w2v_model.train(data, total_examples=en_w2v_model.corpus_count, epochs=20)

CPU times: user 16min 31s, sys: 6.49 s, total: 16min 38s
Wall time: 9min 31s


(441400321, 675877020)

In [None]:
# now you can get word vectors !
en_w2v_model.wv.get_vector('word')

array([ 7.3056435e-03,  8.3349934e-03, -2.1576332e-03, -9.4048977e-05,
       -3.8331449e-03,  7.7697132e-03,  6.5034034e-04,  8.0241179e-03,
        3.5500217e-03,  2.3648036e-03,  4.4918335e-03, -1.5495574e-03,
       -2.1903538e-03, -5.0523970e-03,  4.9190316e-03, -3.8146186e-03,
       -6.6211843e-03,  4.3856646e-03, -6.5402986e-05,  1.4451467e-03,
        8.2000699e-03, -8.8243233e-03,  3.5642840e-03, -6.4596846e-03,
       -9.9855112e-03,  8.8587627e-03, -8.7040290e-03,  6.1674272e-03,
       -4.2020869e-03,  6.8458365e-03,  8.1843212e-03, -6.5535353e-03,
        8.4244572e-03, -5.9298789e-03, -2.9107512e-03, -3.6678123e-03,
        1.7893672e-03,  4.8961402e-03,  4.0990366e-03,  3.1442130e-03,
       -7.9993606e-03, -4.7615124e-03,  5.3579807e-03,  7.1888389e-03,
        8.3703818e-03, -1.5936756e-03,  3.0256081e-03,  9.4592134e-03,
       -6.3856770e-03, -5.9466134e-03, -7.4866484e-03,  2.9876507e-03,
       -9.4257053e-03, -2.5619245e-03,  4.3914199e-04,  5.4570688e-03,
      

In [None]:
# or query similar words directly. Go play with it!
en_w2v_model.wv.most_similar('hi')

[('kohler', 0.4700702130794525),
 ('taimur', 0.3958079516887665),
 ('xb250', 0.38997596502304077),
 ('idiots', 0.38746178150177),
 ('use', 0.38728031516075134),
 ('vallabh', 0.3713241517543793),
 ('patellectomy', 0.3661700487136841),
 ('liberator', 0.35739490389823914),
 ('bondi', 0.35669052600860596),
 ('stardust', 0.35603001713752747)]

In [None]:
en_w2v_model.wv.most_similar(positive=['king', 'man'], negative=['woman'])

[('pgdilpoma', 0.43496444821357727),
 ('pandering', 0.40375855565071106),
 ('lincoln', 0.3860081732273102),
 ('vermicelli', 0.37535130977630615),
 ('soundacious', 0.37532466650009155),
 ('naturalistic', 0.3657115697860718),
 ('pedastols', 0.3617115616798401),
 ('antler', 0.36132314801216125),
 ('xs3', 0.35966956615448),
 ('nifty', 0.3554002642631531)]

#### Pretrained weights

Download model based on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download)

In [None]:
! wget nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2022-11-21 19:30:09--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2022-11-21 19:30:09--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2022-11-21 19:30:10--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [None]:
! unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


More about different pretrained corpuses:
[GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="glove.6B.300d.txt", word2vec_output_file="gensim_glove_vectors.txt")

  glove2word2vec(glove_input_file="glove.6B.300d.txt", word2vec_output_file="gensim_glove_vectors.txt")


(400001, 300)

In [None]:
%%time

from gensim.models.keyedvectors import KeyedVectors
en_w2v_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

CPU times: user 1min 6s, sys: 2.42 s, total: 1min 9s
Wall time: 1min 6s


In [None]:
# %%time
# Alternative method for glove model download

# import gensim.downloader as api

# model = api.load('glove-twitter-100')

In [None]:
en_w2v_model.get_vector("language")

array([-6.7832e-01, -2.8658e-01, -2.8904e-01,  1.5099e-01, -4.6720e-01,
       -1.7424e-01, -7.7790e-01,  3.5469e-01,  6.9431e-02, -1.7409e+00,
       -4.8699e-03,  3.2813e-01, -5.5443e-01,  5.1388e-01,  5.3065e-01,
        2.3718e-02,  2.2542e-01,  7.6866e-01,  1.8348e-01,  1.6765e-01,
       -1.5293e-01, -2.7201e-01, -5.3389e-02,  1.0727e+00, -4.6678e-01,
       -2.4596e-01,  1.9205e-01, -7.6138e-02,  3.9775e-02,  1.6546e-01,
        6.4188e-02,  4.1207e-01, -4.1290e-01,  8.8176e-01, -6.5510e-01,
       -1.9994e-01,  2.8036e-01, -8.3058e-01,  1.0374e-02,  2.5017e-01,
       -2.7072e-01, -5.8058e-02,  4.0706e-01, -2.3871e-01,  1.8965e-01,
       -4.7930e-02, -2.0027e-01,  8.7983e-01, -1.5852e-01, -2.8104e-01,
        1.5497e-01, -4.3207e-02,  4.2794e-01, -8.6033e-01, -2.6242e-01,
       -1.0455e-02,  2.3501e-01, -6.6707e-01,  9.1331e-01,  5.2429e-01,
        5.8939e-01,  5.7586e-01,  5.5180e-01,  7.6329e-03, -8.5204e-03,
        3.0554e-01,  7.6697e-01,  5.9108e-01,  7.0538e-01,  1.12

In [None]:
en_w2v_model.most_similar(positive=["queen", "man"], negative=["woman"])

[('king', 0.6552621126174927),
 ('ii', 0.5050469040870667),
 ('prince', 0.491478830575943),
 ('majesty', 0.48908838629722595),
 ('monarch', 0.47834306955337524),
 ('royal', 0.46305179595947266),
 ('elizabeth', 0.45092126727104187),
 ('vi', 0.44612547755241394),
 ('crown', 0.4368758201599121),
 ('brother', 0.43661490082740784)]

In [None]:
# try with your own example
en_w2v_model.most_similar(positive=["physicist", "brain"], negative=["money"])

[('neuroscientist', 0.524850070476532),
 ('mathematician', 0.4939815104007721),
 ('biologist', 0.4928779602050781),
 ('geneticist', 0.4879351854324341),
 ('biochemist', 0.47275030612945557),
 ('scientist', 0.4704717993736267),
 ('chemist', 0.46199890971183777),
 ('astrophysicist', 0.45520147681236267),
 ('physics', 0.45381951332092285),
 ('neurologist', 0.45040658116340637)]

In [None]:
en_w2v_model.most_similar(positive=["python"])

[('monty', 0.6837382316589355),
 ('perl', 0.519283652305603),
 ('cleese', 0.5092198252677917),
 ('pythons', 0.5007115006446838),
 ('php', 0.4942314326763153),
 ('grail', 0.4683017134666443),
 ('scripting', 0.46761268377304077),
 ('skit', 0.4474538266658783),
 ('javascript', 0.4312553107738495),
 ('spamalot', 0.43117913603782654)]

In [None]:
en_w2v_model.most_similar(positive=["phd"])

[('ph.d.', 0.8992077708244324),
 ('ph.d', 0.8668009638786316),
 ('doctoral', 0.8411757349967957),
 ('doctorate', 0.8270341157913208),
 ('dissertation', 0.7371240854263306),
 ('thesis', 0.7319737672805786),
 ('graduate', 0.6834654808044434),
 ('postgraduate', 0.6737526059150696),
 ('b.a.', 0.6614392399787903),
 ('post-graduate', 0.6560631990432739)]

In [None]:
en_w2v_model.most_similar(positive=["phd"], negative=["panini", "coffee", "code", "experiments", "paper", "conference"])

[('step-sister', 0.42128294706344604),
 ('edmore', 0.4122830927371979),
 ('kalwa', 0.4101993143558502),
 ('grammia', 0.4099090099334717),
 ('thất', 0.4093276560306549),
 ('cw96', 0.4002326428890228),
 ('rw96', 0.39980965852737427),
 ('pangle', 0.39888158440589905),
 ('iliyan', 0.39772501587867737),
 ('3.6730', 0.392292320728302)]

**Have fun**: how good are you at word vectors algebra now?


Let's check it: play [Semantic Space Surfer](https://lena-voita.github.io/nlp_course/word_embeddings.html#have_fun). 

### Russian language

One of the main hubs of pretrained models for Russian language is [**RusVectores**](https://rusvectores.org/ru/). The whole list of models is presented [here](https://rusvectores.org/ru/models/). We will also try some examples of usage.

In [None]:
import gensim

In [None]:
# model download. For this example we will use fasttex pretrained model.
# !wget http://vectors.nlpl.eu/repository/20/214.zip

--2022-11-21 19:42:00--  http://vectors.nlpl.eu/repository/20/214.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.181
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1920218982 (1.8G) [application/zip]
Saving to: ‘214.zip’


2022-11-21 19:42:19 (99.8 MB/s) - ‘214.zip’ saved [1920218982/1920218982]



In [None]:
# !unzip 214.zip -d ru_fasttext_model

Archive:  214.zip
  inflating: ru_fasttext_model/meta.json  
  inflating: ru_fasttext_model/model.model  
  inflating: ru_fasttext_model/model.model.vectors_ngrams.npy  
  inflating: ru_fasttext_model/model.model.vectors.npy  
  inflating: ru_fasttext_model/model.model.vectors_vocab.npy  
  inflating: ru_fasttext_model/README  


In [None]:
ru_fasttext_model = gensim.models.KeyedVectors.load('ru_fasttext_model/model.model')

In [None]:
ru_fasttext_model.get_vector("естественный")

array([-3.32608342e-01, -1.26286536e-01, -1.79735586e-01,  2.79385895e-01,
       -3.69018912e-01,  3.44438046e-01, -2.34145205e-02,  5.60232043e-01,
       -3.31771702e-01,  7.50510246e-02, -3.97710502e-02,  9.58546773e-02,
        6.02775812e-01,  2.64807463e-01,  4.67248619e-01,  2.27449253e-01,
       -1.75586492e-02,  3.64083916e-01,  2.85187215e-01,  1.60460010e-01,
       -1.00663744e-01, -2.84378797e-01, -3.49444419e-01,  3.71782854e-02,
       -2.86672674e-02, -2.15512160e-02, -1.13702953e-01, -1.83207437e-01,
       -1.48359165e-01, -3.76394279e-02,  1.14443544e-02,  2.52620071e-01,
       -1.51189208e-01, -2.27908477e-01,  2.39898071e-01, -3.15357924e-01,
        2.69230425e-01, -3.75274599e-01, -1.18599355e-01, -1.66700840e-01,
        1.93800889e-02, -2.19127648e-02,  7.95794204e-02,  2.42556125e-01,
       -3.45300317e-01,  2.60304689e-01, -2.70207196e-01, -1.52698100e-01,
        3.82836431e-01,  2.37352714e-01, -4.83656645e-01, -3.28339636e-02,
        3.38784158e-02,  

In [None]:
ru_fasttext_model.most_similar("ягуар")

[('ягуара', 0.6922987103462219),
 ('ягуаре', 0.658053457736969),
 ('джип', 0.6504993438720703),
 ('ягуары', 0.6424834728240967),
 ('крайслер', 0.6181568503379822),
 ('митсубиши', 0.6164671182632446),
 ('бмв', 0.6137527823448181),
 ('тигуан', 0.6030640602111816),
 ('ситроен', 0.6001013517379761),
 ('хаммер', 0.5977892279624939)]

In [None]:
ru_fasttext_model.most_similar(positive=["учеба", "время"], negative="экзамен")

[('время-то', 0.4860388934612274),
 ('врем', 0.48514583706855774),
 ('время,а', 0.4616581201553345),
 ('десятилетие', 0.45002153515815735),
 ('еда', 0.44088080525398254),
 ('детство', 0.43911275267601013),
 ('времяпровождение', 0.43864935636520386),
 ('продолжительное', 0.4309692084789276),
 ('готовка', 0.4296607971191406),
 ('продолжительная', 0.42860114574432373)]

Again, have fun and check [vectors calculator](https://rusvectores.org/ru/calculator/)!

### Visualization

#### Single words

One way to see if our vectors are any good is to plot them. Thing is, those vectors are in 30D+ space and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use that to plot 1000 most frequent words

In [None]:
words = sorted(en_w2v_model.index_to_key[:1000], 
               key=lambda word: en_w2v_model.index_to_key[:1000].count(word),
               reverse=True)

print(words[::100])

['the', 'so', 'according', 'man', 'troops', 'working', 'together', 'meet', '40', 'either']


In [None]:
# for each word, compute it's vector with model
word_vectors = np.array([en_w2v_model[word] for word in words])

Linear projection: PCA

The simplest linear dimensionality reduction method is __P__rincipial __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">


Under the hood, it attempts to decompose object-feature matrix $X$ into two smaller matrices: $W$ and $\hat W$ minimizing _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;



In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
%%time

# map word vectors onto 2d plane with PCA. Use good old sklearn api (fit, transform)
# after that, normalize vectors to make sure they have zero mean and unit variance
word_vectors_pca = PCA(n_components=2).fit_transform(word_vectors)
word_vectors_pca = StandardScaler().fit_transform(word_vectors_pca)

CPU times: user 22.6 ms, sys: 90 ms, total: 113 ms
Wall time: 75.3 ms


Let's draw it!

In [None]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

output_notebook()


def draw_vectors(
    x,
    y,
    radius=10,
    alpha=0.25,
    color="blue",
    width=600,
    height=400,
    show=True,
    **kwargs
):
    """draws an interactive plot for data points with auxilirary info on hover"""
    if isinstance(color, str):
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({"x": x, "y": y, "color": color, **kwargs})

    fig = pl.figure(active_scroll="wheel_zoom", width=width, height=height)
    fig.scatter("x", "y", size=radius, color="color", alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show:
        pl.show(fig)
    return fig


In [None]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)
# hover a mouse over there and see if you can identify the clusters

#### Phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take __an average__ of vectors for all tokens in the phrase with some weights.

This trick is useful to identify what data are you working with: find if there are any outliers, clusters or other artefacts.

Let's try this new hammer on our data!

In [None]:
def get_phrase_embedding(model, phrase):
    """
    Convert phrase to a vector by aggregating it's word embeddings. See description above.
    """
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. average word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros
    
    vector = np.zeros([model.vector_size], dtype='float32')

    phrase = phrase.lower()
    tokens = tokenizer.tokenize(phrase)
    used_words = 0
    
    for word in tokens:
        if word in model:
            vector += model[word]
            used_words += 1
    
    if used_words > 0:
        vector = vector / used_words
    
    return vector

In [None]:
vector = get_phrase_embedding(en_w2v_model, "I'm very sure. This never happened to me before...")

In [None]:
# let's only consider ~5k phrases for a first run.
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases
phrase_vectors = np.array([get_phrase_embedding(en_w2v_model, phrase) for phrase in chosen_phrases]) #SOLUTION

In [None]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), en_w2v_model.vector_size)

In [None]:
# map vectors into 2d space with pca, tsne or your other method of choice
# don't forget to normalize

phrase_vectors_2d = PCA(n_components=2).fit_transform(phrase_vectors)

phrase_vectors_2d = (phrase_vectors_2d - phrase_vectors_2d.mean(axis=0)) / phrase_vectors_2d.std(axis=0)

In [None]:
draw_vectors(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1],
             phrase=[phrase[:50] for phrase in chosen_phrases],
             radius=20,)

## doc2vec

<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/doc2vec.png" style="width:100%">

The straightforward approach of averaging each of a text's words' word-vectors creates a quick and crude document-vector that can often be useful. However, Le and Mikolov in 2014 introduced the <i>Paragraph Vector</i>, which usually outperforms such simple-averaging.

The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector. Gensim's `Doc2Vec` class implements this algorithm. 

**Paragraph Vector - Distributed Memory (PV-DM)**
This is the Paragraph Vector model analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based an average of both context word-vectors and the full document's doc-vector.

**Paragraph Vector - Distributed Bag of Words (PV-DBOW)**
This is the Paragraph Vector model analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram testing, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)

#### Requirements 

Download the data:



In [None]:
!gdown --id 1T0DhOjZwXRTk-d1RT7MvtvVKQq7Rmkch && unzip data.zip 

Downloading...
From: https://drive.google.com/uc?id=1T0DhOjZwXRTk-d1RT7MvtvVKQq7Rmkch
To: /content/data.zip
100% 1.34M/1.34M [00:00<00:00, 164MB/s]
Archive:  data.zip
   creating: data/
   creating: data/bbc/
 extracting: data/bbc/business.csv.bz2  
 extracting: data/bbc/entertainment.csv.bz2  
 extracting: data/bbc/politics.csv.bz2  
  inflating: data/bbc/README.TXT     
 extracting: data/bbc/sport.csv.bz2  
 extracting: data/bbc/tech.csv.bz2   
  inflating: data/example.csv        
  inflating: data/munge-bbc.sh       


#### Data preparation

We'll use the BBC's dataset.

In [None]:
!python -m spacy download en

2023-03-30 12:19:27.943434: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-30 12:19:30.882401: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2023-03-30 12:19:30.882596: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
[38;5;3m⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline 

In [None]:
import pandas as pd
import spacy
from multiprocessing import Pool

nlp = spacy.load("en_core_web_sm")
pd.set_option("display.max_colwidth", 100)


def tokenize_text(df):
    df["tokens"] = (
        df.text.str.lower()
        .str.strip()
        .apply(
            lambda x: [token.text.strip() for token in nlp(x) if token.text.isalnum()]
        )
    )
    return df


def process_csv(document_set):
    df = pd.read_csv(f"data/bbc/{document_set}.csv.bz2", encoding="latin1")
    df = tokenize_text(df)
    df["group"] = document_set
    return df


with Pool() as pool:
    dfs = pool.map(
        process_csv, ["sport", "business", "politics", "tech", "entertainment"]
    )
bbc_df = pd.concat(dfs)


In [None]:
bbc_df.head()

Unnamed: 0,text,tokens,group
0,Claxton hunting first major medal British hurdler Sarah Claxton is confident she can win her fi...,"[claxton, hunting, first, major, medal, british, hurdler, sarah, claxton, is, confident, she, ca...",sport
1,O'Sullivan could run in Worlds Sonia O'Sullivan has indicated that she would like to participat...,"[could, run, in, worlds, sonia, has, indicated, that, she, would, like, to, participate, in, nex...",sport
2,Greene sets sights on world title Maurice Greene aims to wipe out the pain of losing his Olympi...,"[greene, sets, sights, on, world, title, maurice, greene, aims, to, wipe, out, the, pain, of, lo...",sport
3,IAAF launches fight against drugs The IAAF - athletics' world governing body - has met anti-dop...,"[iaaf, launches, fight, against, drugs, the, iaaf, athletics, world, governing, body, has, met, ...",sport
4,"Dibaba breaks 5,000m world record Ethiopia's Tirunesh Dibaba set a new world record in winning ...","[dibaba, breaks, m, world, record, ethiopia, tirunesh, dibaba, set, a, new, world, record, in, w...",sport


Now we construct the vocabulary based on the data. Each word will be associated to an ID. 

In [None]:
from collections import Counter

class Vocab:
    def __init__(self, all_tokens, min_count=2):
        self.min_count = min_count
        self.freqs = {t:n for t, n in Counter(all_tokens).items() if n >= min_count}
        self.words = sorted(self.freqs.keys())
        self.word2idx = {w: i for i, w in enumerate(self.words)}

Words that appear extremely rarely can harm performance, so we add a simple mechanism to strip those out.

In [None]:
def clean_tokens(df, vocab):
    df["length"] = df.tokens.apply(len)
    df["clean_tokens"] = df.tokens.apply(lambda x: [t for t in x if t in vocab.freqs.keys()])
    df["clean_length"] = df.clean_tokens.apply(len)
    return df

bbc_vocab = Vocab([tok for tokens in bbc_df.tokens for tok in tokens])
bbc_df = clean_tokens(bbc_df, bbc_vocab)

print(f"Dataset comprises {len(bbc_df)} documents and {len(bbc_vocab.words)} unique words")

Dataset comprises 2225 documents and 19065 unique words


#### Implementation

The difficulty with our "the cat _ on the mat" problem is that the missing word could be any one in the vocabulary V and so the network would have |V| outputs for each input e.g. a huge vector containing zero for every word in the vocabulary and some positive number for "sat" if the network was perfectly trained. For calculating loss we need to turn that into a probabilty distribution, i.e. _softmax_ it. Computing the softmax for such a large vector is expensive.

So the trick (one of many possible) we will use is _Noise Contrastive Estimation (NCE)_. We change our "the cat _ on the mat" problem into a multiple choice problem, asking the network to choose between "sat" and some random wrong answers like "hopscotch" and "luxuriated". This is easier to compute the softmax for since it's now a binary classifier (right or wrong answer) and the output is simply of a vector of size 1 + k where k is the number of random incorrect options.

Happily, this alternative problem still learns equally useful word representations. We just need to adjust the examples and the loss function. There is a simplified version of the NCE loss function called _Negative Sampling (NEG)_ that we can use here.

[Notes on Noise Contrastive Estimation and Negative Sampling (C. Dyer)](https://arxiv.org/abs/1410.8251) explains the derivation of the NCE and NEG loss functions.

When we implement the loss function, we assume that the first element in a samples/scores vector is the score for the positive sample and the rest are negative samples. This convention saves us from having to pass around an auxiliary vector indicating which sample was positive.

In [None]:
import torch.nn as nn

class NegativeSampling(nn.Module):
    def __init__(self):
        super(NegativeSampling, self).__init__()
        self.log_sigmoid = nn.LogSigmoid()
    def forward(self, scores):
        batch_size = scores.shape[0]
        n_negative_samples = scores.shape[1] - 1   
        positive = self.log_sigmoid(scores[:,0])
        negatives = torch.sum(self.log_sigmoid(-scores[:,1:]), dim=1)
        return -torch.sum(positive + negatives) / batch_size  # average for batch

loss = NegativeSampling()

It's helpful to play with some values to reassure ourselves that this function does the right thing.

In [None]:
import torch 

# this dummy data uses -1 to 1, but the real model is unconstrained
data = [[[1, -1, -1, -1]],  
        [[0.5, -1, -1, -1]],
        [[0, -1, -1, -1]],
        [[0, 0, 0, 0]],
        [[0, 0, 0, 1]],
        [[0, 1, 1, 1]],
        [[0.5, 1, 1, 1]],
        [[1, 1, 1, 1]]]

loss_df = pd.DataFrame(data, columns=["scores"])
loss_df["loss"] = loss_df.scores.apply(lambda x: loss(torch.FloatTensor([x])))

loss_df

Unnamed: 0,scores,loss
0,"[1, -1, -1, -1]",tensor(1.2530)
1,"[0.5, -1, -1, -1]",tensor(1.4139)
2,"[0, -1, -1, -1]",tensor(1.6329)
3,"[0, 0, 0, 0]",tensor(2.7726)
4,"[0, 0, 0, 1]",tensor(3.3927)
5,"[0, 1, 1, 1]",tensor(4.6329)
6,"[0.5, 1, 1, 1]",tensor(4.4139)
7,"[1, 1, 1, 1]",tensor(4.2530)


Higher scores for the positive sample (always the first element) reduce the loss but higher scores for the negative samples increase the loss. This looks like the right behaviour.

With that in the bag, let's look at creating training data. The general idea is to create a set of examples where each example has:

- doc id
- sample ids - a collection of the target token and some noise tokens
- context ids - tokens before and after the target token

e.g. If our context size was 2, the first example from the above dataset would be:

```
{"doc_id": 0,
 "sample_ids": [word2idx[x] for x in ["week", "random-word-from-vocab", "random-word-from-vocab"...],
 "context_ids": [word2idx[x] for x in ["in", "the", "before", "their"]]}
 ```
 
 The random words are chosen according to a probability distribution:
 
 > a unigram distribution raised to the 3/4rd power, as proposed by T. Mikolov et al. in Distributed Representations of Words and Phrases and their Compositionality

This has the effect of slightly increasing the relative probability of rare words (look at the graph of `y=x^0.75` below and see how the lower end is raised above `y=x`).

In [None]:
import altair as alt
import numpy as np

data = pd.DataFrame(zip(np.arange(0,1,0.01), np.power(np.arange(0,1,0.01), 0.75)), columns=["x", "y"])
alt.Chart(data, title="x^0.75").mark_line().encode(x="x", y="y")

In [None]:
import numpy as np

class NoiseDistribution:
    def __init__(self, vocab):
        self.probs = np.array([vocab.freqs[w] for w in vocab.words])
        self.probs = np.power(self.probs, 0.75)
        self.probs /= np.sum(self.probs)
    def sample(self, n):
        "Returns the indices of n words randomly sampled from the vocabulary."
        return np.random.choice(a=self.probs.shape[0], size=n, p=self.probs)
        

With this distribution, we advance through the documents creating examples. Note that we are always putting the positive sample first in the samples vector, following the convention the loss function expects.

In [None]:
import torch
import numpy as np

def example_generator(df, context_size, noise, n_negative_samples, vocab):
    for doc_id, doc in df.iterrows():
        for i in range(context_size, len(doc.clean_tokens) - context_size):
            positive_sample = vocab.word2idx[doc.clean_tokens[i]]
            sample_ids = noise.sample(n_negative_samples).tolist()
            # Fix a wee bug - ensure negative samples don't accidentally include the positive
            sample_ids = [sample_id if sample_id != positive_sample else -1 for sample_id in sample_ids]
            sample_ids.insert(0, positive_sample)                
            context = doc.clean_tokens[i - context_size:i] + doc.clean_tokens[i + 1:i + context_size + 1]
            context_ids = [vocab.word2idx[w] for w in context]
            yield {"doc_ids": torch.tensor(doc_id),  # we use plural here because it will be batched
                   "sample_ids": torch.tensor(sample_ids), 
                   "context_ids": torch.tensor(context_ids)}

In [None]:
bbc_noise = NoiseDistribution(bbc_vocab)
bbc_examples = list(example_generator(bbc_df, context_size=5, noise=bbc_noise, n_negative_samples=5, vocab=bbc_vocab))

Now we create a PyTorch dataset and DataLoader from our processed data.

In [None]:
from torch.utils.data import Dataset, DataLoader

class NCEDataset(Dataset):
    def __init__(self, examples):
        self.examples = list(examples) 
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, index):
        return self.examples[index]
    
dataset = NCEDataset(bbc_examples)
dataloader = DataLoader(dataset, batch_size=2, drop_last=True, shuffle=True)

It's going to also be useful to have a way to convert batches back to a readable form for debugging, so we add a helper function.

In [None]:
def describe_batch(batch, vocab):
    results = []
    for doc_id, context_ids, sample_ids in zip(batch["doc_ids"], batch["context_ids"], batch["sample_ids"]):
        context = [vocab.words[i] for i in context_ids]
        context.insert(len(context_ids) // 2, "____")
        samples = [vocab.words[i] for i in sample_ids]
        result = {"doc_id": doc_id,
                  "context": " ".join(context), 
                  "context_ids": context_ids, 
                  "samples": samples, 
                  "sample_ids": sample_ids}
        results.append(result)
    return results

describe_batch(next(iter(dataloader)), bbc_vocab)

[{'doc_id': tensor(294),
  'context': 'the musical staged at the ____ edward theatre she watched laura',
  'context_ids': tensor([17140, 11445, 16192,  1508, 17140,  5743, 17142, 15430, 18503,  9915]),
  'samples': ['prince', 'ski', 'peer', 'public', 'suitable', 'rob'],
  'sample_ids': tensor([13256, 15697, 12570, 13502, 16615, 14621])},
 {'doc_id': tensor(235),
  'context': 'to avoid any bloody tune ____ want people to go away',
  'context_ids': tensor([17326,  1653,  1228,  2274, 17680, 18451, 12602, 17326,  7547,  1669]),
  'samples': ['we', 'actions', 'with', 'with', 'heading', 'extreme'],
  'sample_ids': tensor([18531,   635, 18775, 18775,  8062,  6403])}]

Let's jump into creating the model itself. There isn't much to it - we multiply the input paragraph and word matrices by the output layer. Combining the paragraph and word matrices is done by summing here, but it could also be done by concatenating the inputs. The original paper actually found concatenation works better, perhaps because summing loses word order information.

In [None]:
import torch.nn as nn

class DistributedMemory(nn.Module):
    def __init__(self, vec_dim, n_docs, n_words):
        super(DistributedMemory, self).__init__()
        self.paragraph_matrix = nn.Parameter(torch.randn(n_docs, vec_dim))
        self.word_matrix = nn.Parameter(torch.randn(n_words, vec_dim))
        self.outputs = nn.Parameter(torch.zeros(vec_dim, n_words))
    
    def forward(self, doc_ids, context_ids, sample_ids):
        # first add doc ids to context word ids to make the inputs (batch_size, vec_dim)
        inputs = torch.add(self.paragraph_matrix[doc_ids,:],                  
                           torch.sum(self.word_matrix[context_ids,:], dim=1))  
                        #(batch_size,2*context, vec_dim)->(batch_size, vec_dim)

        # select subset of output layer for NCE test
        outputs = self.outputs[:,sample_ids] # (vec_dim, batch_size, n_negative_samples + 1) 

        # then multiply with some munging to make the tensor shapes line up
        return torch.bmm(inputs.unsqueeze(dim=1),                               
                         outputs.permute(1, 0, 2)).squeeze()

Create simple training loop.

In [None]:
from tqdm import tqdm, trange
from torch.optim import Adam
import numpy as np

def train(model, dataloader, epochs=40, lr=1e-3):
    optimizer = Adam(model.parameters(), lr=lr)
    training_losses = []
    for epoch in trange(epochs, desc="Epochs"):
        epoch_losses = []
        for batch in dataloader:
            model.zero_grad()
            logits = model.forward(**batch)
            batch_loss = loss(logits)
            epoch_losses.append(batch_loss.item())
            batch_loss.backward()
            optimizer.step()
        training_losses.append(np.mean(epoch_losses))

    return training_losses

#### Training

Initialize dataset and model.

In [None]:
bbc_dataset = NCEDataset(bbc_examples)
bbc_dataloader = DataLoader(bbc_dataset, batch_size=1024, drop_last=True, shuffle=True)

bbc_model = DistributedMemory(vec_dim=50,
                              n_docs=len(bbc_df),
                              n_words=len(bbc_vocab.words))

In [None]:
bbc_training_losses = train(bbc_model.cuda(), bbc_dataloader, epochs=500, lr=1e-3)

Epochs: 100%|██████████| 160/160 [20:27<00:00,  7.67s/it]


#### Evaluation

In [None]:
alt.Chart(pd.DataFrame(enumerate(bbc_training_losses), columns=["epoch", "training_loss"])).mark_bar().encode(x="epoch", y="training_loss")

Let's take a look at the reduced dimensionality paragraph vectors.

In [None]:
from sklearn.decomposition import PCA

def pca_2d(paragraph_matrix, groups):
    pca = PCA(n_components=2)
    reduced_dims = pca.fit_transform(paragraph_matrix)
    print(f"2-component PCA, explains {sum(pca.explained_variance_):.2f}% of variance")
    df = pd.DataFrame(reduced_dims, columns=["x", "y"])
    df["group"] = groups
    return df

In [None]:
bbc_2d = pca_2d(bbc_model.paragraph_matrix.data.detach().cpu().numpy(), bbc_df.group.to_numpy())
chart = alt.Chart(bbc_2d).mark_point().encode(x="x", y="y", color="group")

chart

2-component PCA, explains 2.70% of variance


These results aren't great, but we can see the beginnings of separation. If we look at just two topics it becomes more obvious.

In [None]:
chart = alt.Chart(bbc_2d[bbc_2d["group"].isin(["sport", "business"])]).mark_point().encode(x="x", y="y", color="group")
chart

Likewise we can see sorting by similarity produces reasonable, but not ideal, results.

In [None]:
from sklearn.preprocessing import normalize

def most_similar(paragraph_matrix, docs_df, index, n=None):
    pm = normalize(paragraph_matrix, norm="l2")  # in a smarter implementation we would cache this somewhere
    sims = np.dot(pm, pm[index,:])
    df = pd.DataFrame(enumerate(sims), columns=["doc_id", "similarity"])
    n = n if n is not None else len(sims)
    return df.merge(docs_df[["text"]].reset_index(drop=True), left_index=True, right_index=True).sort_values(by="similarity", ascending=False)[:n]


In [None]:
most_similar(bbc_model.paragraph_matrix.data.detach().cpu().numpy(), bbc_df, 0, n=10)

Unnamed: 0,doc_id,similarity,text
0,0,1.0,Claxton hunting first major medal British hurdler Sarah Claxton is confident she can win her fi...
575,575,0.545567,EU aiming to fuel development aid European Union finance ministers meet on Thursday to discuss ...
53,53,0.50286,Thanou bullish over drugs hearing Katerina Thanou is confident she and fellow sprinter Kostas K...
84,84,0.499714,"Tulu to appear at Caledonian run Two-time Olympic 10,000 metres champion Derartu Tulu has confi..."
193,193,0.48143,Klinsmann issues Lehmann warning Germany coach Jurgen Klinsmann has warned goalkeeper Jens Lehm...
28,28,0.481314,Isinbayeva heads for Birmingham Olympic pole vault champion Yelena Isinbayeva has confirmed she...
151,151,0.475924,Ferguson fears Milan cutting edge Manchester United manager Sir Alex Ferguson said his side's t...
311,311,0.474377,Wales win in Rome Wales secured their first away win in the RBS Six Nations for nearly four yea...
54,54,0.472789,Holmes is hit by hamstring injury Kelly Holmes has been forced out of this weekend's European I...
87,87,0.466263,Jones files lawsuit against Conte Marion Jones has filed a lawsuit for defamation against Balco...


Next steps:
- look for better hyperparameters, since the training loss remains quite high
- benchmark against `gensim` and Ilenic's PyTorch implementation; it should be very similar to the latter
- implement the inference step for new documents, which freezes the word and output matrices and adds a new column to the paragraph matrix
- use inferred paragraph vectors as the input for a topic classifier; looking at the business/sport plot above it could be quite successful
- try visualization with a better dimensionality reduction algorithm than PCA (I've used [LargeVis](https://arxiv.org/abs/1602.00370) in the past)

Play with pretrained Doc2Vec model by yourself! The pretrained models can be found [here](https://github.com/jhlau/doc2vec).