# Word2Vec Tuning, Training and Saving

In this notebook, we fine-tune a Word2Vec model starting from pre-trained 30-dimensional embeddings, train it on skip-gram pairs until the loss stops improving, and save the resulting embeddings.

## 1. Loading the required libraries

In [1]:
import os
# Move to the project root so that the imports and relative data paths below resolve
os.chdir('..')

import torch
import torch.nn as nn
import numpy as np
from torch.optim import Adam
from models.word2vec import Word2Vec
from torch.utils.data import DataLoader, TensorDataset
from utils.utils import word2vecFineTuning, DataSamplization

dataSamplization = DataSamplization()

Loading saved FastText model...


## 2. Fetching the relevant data

In [2]:
# Load the skip-gram (target, context) word-ID pairs, one pair per line
skipGramWordIDPairs = []
with open('data/skipgramPairs/word_pairs_fromWikiDump.txt', 'r') as f:
    for line in f:
        target, context = line.strip().split()
        skipGramWordIDPairs.append((int(target), int(context)))

# Load the 30-dimensional pre-trained embeddings to fine-tune
embeddings = torch.tensor(np.load('data/modelsSavedLocally/wikipedia/30dim_embeddings_ArraySimple.npy'))
vocab_size = len(embeddings)
embedding_dim = 30

# Wrap the pairs in a DataLoader that yields shuffled (target, context) batches of 64
skipgram_data = torch.tensor(skipGramWordIDPairs, dtype=torch.long)
dataset = TensorDataset(skipgram_data[:, 0], skipgram_data[:, 1])
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
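
The loading loop above assumes that every line of the pairs file holds exactly two whitespace-separated integer IDs; a blank or malformed line would raise an exception. As a minimal sketch, assuming one wants to validate the IDs against the vocabulary size before building the tensor, a hypothetical helper `parse_skipgram_pairs` (not part of the project's `utils`) could look like this:

```python
import io


def parse_skipgram_pairs(lines, vocab_size=None):
    """Parse 'target context' integer ID pairs, skipping blank lines.

    Hypothetical helper for illustration. Raises ValueError on malformed
    lines, or on IDs outside [0, vocab_size) when vocab_size is given.
    """
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected 2 IDs, got {len(fields)}")
        target, context = int(fields[0]), int(fields[1])
        if vocab_size is not None and not (
            0 <= target < vocab_size and 0 <= context < vocab_size
        ):
            raise ValueError(
                f"line {line_number}: ID out of range for vocabulary of size {vocab_size}"
            )
        pairs.append((target, context))
    return pairs


# Toy input standing in for the real pairs file
sample = io.StringIO("0 3\n3 0\n\n2 1\n")
pairs = parse_skipgram_pairs(sample, vocab_size=4)
print(pairs)
```

Checking the IDs up front turns an out-of-range index, which would otherwise surface later as an embedding lookup error during training, into an error that names the offending line.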



## 3. Training the model

In [3]:
# Initialise the model with the pre-trained embeddings and fine-tune it
model = Word2Vec(vocab_size, embedding_dim, embeddings)
model = word2vecFineTuning(model, dataloader, epochs=500, lr=0.001)

# Extract the tuned target embeddings as a NumPy array
fine_tuned_embeddings = model.target_embeddings.weight.detach().cpu().numpy()

# Rebuild the word -> vector dictionary; this assumes the keys of the original
# dictionary are in the same order as the rows of the embedding matrix
dic_before_tuning_embeds_30dim = np.load("data/modelsSavedLocally/wikipedia/30dim_embeddings_DictWithWords.npy", allow_pickle=True).item()
dic_fine_tuned_embeds_30dim = {word: fine_tuned_embeddings[i] for i, word in enumerate(dic_before_tuning_embeds_30dim)}
np.save('data/modelsSavedLocally/wikipedia/Tuned30dim_embeddings_DictWithWords.npy', dic_fine_tuned_embeds_30dim)



Epoch [1/500], Loss: 62.6818
Epoch [25/500], Loss: 18.6224
Epoch [50/500], Loss: 8.4228
Epoch [75/500], Loss: 5.2409
Epoch [100/500], Loss: 4.0894
Epoch [125/500], Loss: 3.5895
Epoch [150/500], Loss: 3.3594
Epoch [175/500], Loss: 3.2479
Epoch [200/500], Loss: 3.1940
Epoch [225/500], Loss: 3.1655
Epoch [250/500], Loss: 3.1509
Epoch [275/500], Loss: 3.1409
Epoch [300/500], Loss: 3.1353
Epoch [325/500], Loss: 3.1309
Epoch [350/500], Loss: 3.1034
Epoch [375/500], Loss: 3.0795
Epoch [400/500], Loss: 3.0769
Epoch [425/500], Loss: 3.0761
Early stopping triggered at epoch 425.
Fine-tuning completed.
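
The log ends with early stopping at epoch 425, once the loss had flattened around 3.08. The stopping rule used inside `word2vecFineTuning` is not shown in this notebook. As a minimal sketch, assuming a patience-based criterion (stop when the loss has not improved by at least `min_delta` for `patience` consecutive checks), it could be written as follows. The class name, both thresholds, and the loss values are hypothetical.

```python
class EarlyStopping:
    """Patience-based early stopping (illustrative; not the project's implementation)."""

    def __init__(self, patience=3, min_delta=1e-3):
        self.patience = patience      # checks tolerated without improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.stale_checks = 0

    def should_stop(self, loss):
        """Record a new loss value and return True when training should stop."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.stale_checks = 0
        else:
            self.stale_checks += 1
        return self.stale_checks >= self.patience


# Made-up loss values that flatten out, for illustration only
losses = [5.0, 4.0, 3.5, 3.4995, 3.4993, 3.4992, 3.4991]
stopper = EarlyStopping(patience=3, min_delta=1e-3)
stopped_at = None
for step, loss in enumerate(losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = step
        break
print(stopped_at)
```

With a rule of this kind, `epochs=500` acts as an upper bound rather than a fixed budget, which is consistent with the run above ending before epoch 500.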


## 4. Conclusion

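In this notebook, we loaded 30-dimensional pre-trained embeddings, fine-tuned them on skip-gram pairs extracted from French Wikipedia using a basic Word2Vec architecture, and saved the tuned embeddings as a word-to-vector dictionary. Training stopped early at epoch 425 with a loss of 3.0761.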
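
Section 3 saves the tuned embeddings by calling `np.save` on a Python dictionary. NumPy stores such an object as a pickled 0-dimensional object array, so reading it back requires `allow_pickle=True` followed by `.item()`. The sketch below shows the round trip on toy data; the words, vectors, and file name are placeholders rather than values from the real model.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the tuned embeddings dictionary (word -> vector)
toy_embeddings = {
    "chat": np.array([1.0, 0.0, 0.0]),
    "chien": np.array([0.8, 0.6, 0.0]),
    "maison": np.array([0.0, 0.0, 1.0]),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")
    # The dict is pickled inside a 0-d object array
    np.save(path, toy_embeddings)
    # allow_pickle=True is required to read it; .item() unwraps the dict
    loaded = np.load(path, allow_pickle=True).item()

print(list(loaded))
print(loaded["chien"])
```

Because loading pickled data can execute arbitrary code, `allow_pickle=True` should only be used on files from a trusted source.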