## Imports

In [1]:
import torch
import torch.nn.functional as F
import sys
from pathlib import Path

## Data loading

This is a small dataset of questions, which makes it a good choice for training Word2Vec.

In [2]:
# Read the file line by line; the context manager closes it afterwards
with open("data/quora.txt", encoding="utf-8") as f:
    data = f.readlines()
data[50]

"What TV shows or books help you read people's body language?\n"

## Example of training and using my word2vec model

In [3]:
# import os
# import sys

# os.chdir(os.path.expanduser("~"))

In [None]:
from word2vec.Word2VecSkipGram import Word2VecSkipGram
from word2vec.Word2VecSkipGram import Word2VecSkipGramModel
from word2vec.Word2VecCBOW import Word2VecCBOWModel
from word2vec.Word2VecCBOW import Word2VecCBOW

### CBOW

In [11]:
# Create the CBOW model
cbow = Word2VecCBOW()

# Set the data: this step stores the text, tokenizes it and
# builds the word_to_index and index_to_word dictionaries
cbow.set_text_before_context_pairs(data)

# Set context groups and model: this step builds groups of the form
# [center_word, context_word_1, ..., context_word_{window_radius * 2}]
# and creates the model for training
cbow.set_context_groups_and_model()

# Subsampling probabilities: this step computes, for each word, the
# probability of dropping it from training. I use the formula from
# the paper:
#
# P(w_i) = 1 - sqrt(t / f(w_i))
#
# where t is a threshold and f(w_i) is the frequency of w_i in the dataset
cbow.subsampling_probabilities()

# Negative sampling probabilities: this step computes, for each word, the
# probability of drawing it as a negative sample. I use the formula from
# the paper:
#
# P(w_i) = (f(w_i) / sum(f(w_j))) ^ (3/4) / Z
#
# where f(w_i) is the frequency of w_i in the dataset and Z is the normalization constant
cbow.negative_sampling_probabilities()  
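
To make the two formulas in the comments above concrete, here is a minimal pure-Python sketch on a toy corpus. It is not the code inside `Word2VecCBOW`; the corpus, the threshold `t = 0.05` and the clipping of negative drop probabilities to 0 are my assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus; the token lists are illustrative, not taken from quora.txt.
corpus = [
    ["what", "is", "the", "best", "way", "to", "learn", "python"],
    ["what", "is", "the", "best", "book", "about", "the", "python", "language"],
]

counts = Counter(token for sentence in corpus for token in sentence)
total = sum(counts.values())

# Relative frequency f(w) of every word in the corpus.
freq = {word: count / total for word, count in counts.items()}

# Subsampling: P_drop(w) = 1 - sqrt(t / f(w)).
# t = 0.05 is an arbitrary threshold for this tiny corpus (assumption),
# and values below 0 are clipped so that rare words are never dropped.
t = 0.05
drop_prob = {word: max(0.0, 1.0 - math.sqrt(t / f)) for word, f in freq.items()}

# Negative sampling: P_neg(w) = f(w)^(3/4) / Z, with Z making the sum equal 1.
weights = {word: f ** 0.75 for word, f in freq.items()}
z = sum(weights.values())
neg_prob = {word: weight / z for word, weight in weights.items()}

print("drop:", drop_prob["the"], drop_prob["learn"])
print("negative:", neg_prob["the"], neg_prob["learn"])
```

Frequent words such as "the" get a higher drop probability than rare ones, while the 3/4 power flattens the negative-sampling distribution, so rare words are drawn more often than their raw frequency would suggest.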

In [12]:
# Let's train the model
cbow.train_model(steps=10001, batch_size=128, negative_number=15)

  0%|          | 4/10001 [00:00<10:03, 16.58it/s]

Step 0, Loss: 11.091211318969727, learning rate: [0.001]


 10%|█         | 1003/10001 [00:51<07:48, 19.21it/s]

Step 1000, Loss: 3.510890483856201, learning rate: [0.001]


 20%|██        | 2004/10001 [01:44<06:44, 19.79it/s]

Step 2000, Loss: 3.0968456268310547, learning rate: [0.001]


 30%|███       | 3004/10001 [02:37<05:57, 19.57it/s]

Step 3000, Loss: 3.0199851989746094, learning rate: [0.001]


 40%|████      | 4004/10001 [03:30<05:26, 18.36it/s]

Step 4000, Loss: 2.773721694946289, learning rate: [0.001]


 50%|█████     | 5003/10001 [04:21<04:09, 20.01it/s]

Step 5000, Loss: 2.8424835205078125, learning rate: [0.0005]


 60%|██████    | 6004/10001 [05:12<03:22, 19.77it/s]

Step 6000, Loss: 2.58914852142334, learning rate: [0.0005]


 70%|███████   | 7004/10001 [06:03<02:31, 19.77it/s]

Step 7000, Loss: 2.610912799835205, learning rate: [0.0005]


 80%|████████  | 8004/10001 [06:54<01:41, 19.68it/s]

Step 8000, Loss: 2.727409839630127, learning rate: [0.0005]


 90%|█████████ | 9004/10001 [07:45<00:53, 18.76it/s]

Step 9000, Loss: 2.714916229248047, learning rate: [0.0005]


100%|██████████| 10001/10001 [08:37<00:00, 19.34it/s]

Step 10000, Loss: 2.436723232269287, learning rate: [0.00025]





The loss barely decreases after the first few thousand steps, but that's normal: if you compare a model trained for 2000 steps with one trained for 10000 steps, the difference is still huge.
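
A quick sanity check on the log above: the Step 0 loss of about 11.09 is what you would expect if the loss is the standard negative-sampling objective summed over one positive and 15 negative terms, with all scores near zero at initialization. Both of these are assumptions on my side, since the loss code is not shown in this notebook. The arithmetic under those assumptions:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


negative_number = 15  # same value as passed to train_model above

# With untrained embeddings every score is roughly 0, so each term
# contributes -log(sigmoid(0)) = log(2).
positive_term = -math.log(sigmoid(0.0))
negative_terms = negative_number * -math.log(sigmoid(-0.0))

initial_loss = positive_term + negative_terms
print(initial_loss)  # 16 * log(2), roughly 11.09
```

So a model that knows nothing starts at roughly 11.09, and everything below that is learned signal.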

In [13]:
_model_parameters = cbow.model.parameters()
embedding_matrix_center = next(
    _model_parameters
).detach()  # Assuming that the first matrix is for center words
embedding_matrix_context = next(
    _model_parameters
).detach()  # Assuming that the second matrix is for context words
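
The cell above relies on the order in which `parameters()` yields the matrices. A less fragile option is to look them up by name with `named_parameters()`. The sketch below uses a hypothetical stand-in model (`TinyWord2Vec` and its attribute names are made up for illustration; the real model's attribute names may differ):

```python
import torch
from torch import nn


class TinyWord2Vec(nn.Module):
    """Hypothetical stand-in for the real model: two embedding tables."""

    def __init__(self, vocab_size: int, dim: int) -> None:
        super().__init__()
        self.center_embeddings = nn.Embedding(vocab_size, dim)
        self.context_embeddings = nn.Embedding(vocab_size, dim)


model = TinyWord2Vec(vocab_size=5, dim=3)

# named_parameters() yields (name, tensor) pairs, so each matrix can be
# picked by name instead of by position.
params = dict(model.named_parameters())
print(list(params))

center = params["center_embeddings.weight"].detach()
context = params["context_embeddings.weight"].detach()
print(center.shape, context.shape)
```

Printing `list(params)` for the real model is a cheap way to check which matrix is which before using it.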

In [14]:
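def find_nearest(words):
    # Look up the center embedding(s) of the query word(s)
    word_vector = cbow.get_center_embeddings_by_words(words)
    # Cosine similarity between every vocabulary embedding and the query
    dists = F.cosine_similarity(embedding_matrix_center, word_vector)
    # argsort is ascending, so the last 10 indices are the most similar words
    index_sorted = torch.argsort(dists)
    top_k = index_sorted[-10:]
    return [(cbow.index_to_word[x], dists[x].item()) for x in top_k.numpy()]


find_nearest(["python"])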

[('finance', 0.7676918506622314),
 ('coding', 0.7689043283462524),
 ('frontend', 0.7759082913398743),
 ('sql', 0.7793285846710205),
 ('django', 0.7799999713897705),
 ('javascript', 0.8302862644195557),
 ('php', 0.8529316186904907),
 ('java', 0.8859761953353882),
 ('c', 0.9090428352355957),
 ('python', 1.0)]

Python ends up close to other programming languages! Success!!!
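
The list above is sorted from least to most similar because `argsort` is ascending. If you prefer the best match first, `torch.topk` does the same job in one call. A minimal sketch with a toy vocabulary (the words and vectors below are invented for illustration, they are not the trained embeddings):

```python
import torch
import torch.nn.functional as F

# Toy vocabulary and 2-dimensional embeddings (invented values).
index_to_word = ["python", "java", "banana", "c"]
embeddings = torch.tensor(
    [
        [1.0, 0.0],  # python
        [0.9, 0.1],  # java
        [0.0, 1.0],  # banana
        [0.8, 0.3],  # c
    ]
)


def nearest(word: str, k: int = 3):
    # Query vector with shape (1, dim) so it broadcasts against the matrix.
    query = embeddings[index_to_word.index(word)].unsqueeze(0)
    sims = F.cosine_similarity(embeddings, query)
    # topk returns the k largest similarities, already in descending order.
    values, indices = torch.topk(sims, k)
    return [(index_to_word[i], v) for i, v in zip(indices.tolist(), values.tolist())]


print(nearest("python"))
```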

### Skip-Gram

In [15]:
# Create the Skip-gram model
skipgram = Word2VecSkipGram()

# Set the data: this step stores the text, tokenizes it and
# builds the word_to_index and index_to_word dictionaries
skipgram.set_text_before_context_pairs(data)

# Set context groups and model: this step builds groups of the form
# [center_word, context_word_1, ..., context_word_{window_radius * 2}]
# and creates the model for training
skipgram.set_context_groups_and_model()

# Subsampling probabilities: this step computes, for each word, the
# probability of dropping it from training. I use the formula from
# the paper:
#
# P(w_i) = 1 - sqrt(t / f(w_i))
#
# where t is a threshold and f(w_i) is the frequency of w_i in the dataset
skipgram.subsampling_probabilities()

# Negative sampling probabilities: this step computes, for each word, the
# probability of drawing it as a negative sample. I use the formula from
# the paper:
#
# P(w_i) = (f(w_i) / sum(f(w_j))) ^ (3/4) / Z
#
# where f(w_i) is the frequency of w_i in the dataset and Z is the normalization constant
skipgram.negative_sampling_probabilities()  
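
The context groups mentioned in the comments above are used differently by the two models: CBOW predicts the center word from all context words at once, while Skip-gram turns each group into separate (center, context) pairs. Here is a rough pure-Python sketch of that idea. The helper `context_groups` is hypothetical, and keeping only positions with a full window is my simplification; the real classes may handle sentence borders differently.

```python
def context_groups(tokens, window_radius):
    """Build [center, context_1, ..., context_{2 * window_radius}] groups.

    Only positions with a full window on both sides are kept (assumption).
    """
    groups = []
    for i in range(window_radius, len(tokens) - window_radius):
        left = tokens[i - window_radius : i]
        right = tokens[i + 1 : i + window_radius + 1]
        groups.append([tokens[i]] + left + right)
    return groups


tokens = ["how", "do", "i", "learn", "python"]
groups = context_groups(tokens, window_radius=1)

# CBOW: one training example per group, (all context words) -> center word.
cbow_examples = [(group[1:], group[0]) for group in groups]

# Skip-gram: one training example per (center word, single context word) pair.
skipgram_pairs = [(group[0], ctx) for group in groups for ctx in group[1:]]

print(groups)
print(cbow_examples)
print(skipgram_pairs)
```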

In [16]:
# Let's train the model
skipgram.train_model(steps=10001, batch_size=128, negative_number=15)

  0%|          | 3/10001 [00:00<14:36, 11.40it/s]

Step 0, Loss: 11.090789794921875, learning rate: [0.001]


 10%|█         | 1004/10001 [00:51<07:36, 19.71it/s]

Step 1000, Loss: 4.32220983505249, learning rate: [0.001]


 20%|██        | 2003/10001 [01:44<06:44, 19.77it/s]

Step 2000, Loss: 3.7768969535827637, learning rate: [0.001]


 30%|███       | 3004/10001 [02:36<06:08, 19.00it/s]

Step 3000, Loss: 3.6793813705444336, learning rate: [0.001]


 40%|████      | 4004/10001 [03:27<05:01, 19.89it/s]

Step 4000, Loss: 3.430278778076172, learning rate: [0.001]


 50%|█████     | 5004/10001 [04:18<04:10, 19.93it/s]

Step 5000, Loss: 3.496466636657715, learning rate: [0.0005]


 60%|██████    | 6003/10001 [05:09<04:05, 16.28it/s]

Step 6000, Loss: 3.4756500720977783, learning rate: [0.0005]


 70%|███████   | 7004/10001 [06:01<02:29, 20.06it/s]

Step 7000, Loss: 3.4848098754882812, learning rate: [0.0005]


 80%|████████  | 8003/10001 [06:51<01:40, 19.94it/s]

Step 8000, Loss: 3.3417301177978516, learning rate: [0.0005]


 90%|█████████ | 9004/10001 [07:42<00:53, 18.71it/s]

Step 9000, Loss: 3.298994779586792, learning rate: [0.0005]


100%|██████████| 10001/10001 [08:33<00:00, 19.48it/s]

Step 10000, Loss: 3.2748324871063232, learning rate: [0.00025]





As with CBOW, the loss barely decreases after the first few thousand steps, but that's normal: if you compare a model trained for 2000 steps with one trained for 10000 steps, the difference is still huge.
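
The log above also shows the learning rate halving at steps 5000 and 10000. That pattern would be consistent with a step schedule such as PyTorch's `StepLR` with `step_size=5000` and `gamma=0.5`, although this is only my guess from the printed values and not confirmed by the training code. A small sketch that reproduces the logged values with a dummy parameter (the SGD optimizer here is an arbitrary choice):

```python
import torch

# Dummy parameter and optimizer, only there to drive the scheduler.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.001)

# Assumed schedule: multiply the learning rate by 0.5 every 5000 steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5000, gamma=0.5)

logged = {}
for step in range(10001):
    if step % 1000 == 0:
        # get_last_lr() returns a list, like the values printed in the log.
        logged[step] = scheduler.get_last_lr()
    optimizer.step()  # no gradients here, so the parameter is left unchanged
    scheduler.step()

for step, lr in logged.items():
    print(f"Step {step}, learning rate: {lr}")
```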

In [17]:
_model_parameters = skipgram.model.parameters()
embedding_matrix_center = next(
    _model_parameters
).detach()  # Assuming that the first matrix is for center words
embedding_matrix_context = next(
    _model_parameters
).detach()  # Assuming that the second matrix is for context words

In [18]:
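def find_nearest(words):
    # Look up the center embedding(s) of the query word(s)
    word_vector = skipgram.get_center_embeddings_by_words(words)
    # Cosine similarity between every vocabulary embedding and the query
    dists = F.cosine_similarity(embedding_matrix_center, word_vector)
    # argsort is ascending, so the last 10 indices are the most similar words
    index_sorted = torch.argsort(dists)
    top_k = index_sorted[-10:]
    return [(skipgram.index_to_word[x], dists[x].item()) for x in top_k.numpy()]


find_nearest(["python"])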

[('cantonese', 0.6946611404418945),
 ('grammar', 0.6997422575950623),
 ('scratch', 0.7020895481109619),
 ('language', 0.7183550596237183),
 ('programming', 0.7199528217315674),
 ('basics', 0.7256700992584229),
 ('javascript', 0.7391355633735657),
 ('java', 0.7444342374801636),
 ('c', 0.7820838689804077),
 ('python', 1.0000001192092896)]

Here too, python ends up next to languages!!! Success!