## Task 2 - Implement Word2Vec (25 Marks)
You are tasked with building a pipeline for training a Word2Vec model using the CBOW (Continuous Bag
of Words) approach FROM SCRATCH in PyTorch. It consists of the following components:

1. You are required to create a Python class named Word2VecDataset that will serve as a custom dataset
for training the Word2Vec model. The implementation should include the following components:

    - The custom implementation should work with PyTorch’s DataLoader to efficiently load the
training data. You can refer to this guide [Tutorial] on creating custom dataset classes in PyTorch.

    - preprocess_data - In this method, you will preprocess the provided corpus and prepare
the CBOW training data for the Word2Vec model.

    - During preprocessing, you must use the WordPieceTokenizer implemented in Task 1 to tokenize
the input text corpus.

2. You are required to create a Python class named Word2VecModel which implements the Word2Vec CBOW
architecture from scratch using PyTorch. After training the model, save the trained model’s
checkpoint for later use.

3. Develop a function named train to manage the entire training process of the Word2Vec model. This
function should include all the training logic.
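
As a quick illustration of what "CBOW training data" means, here is a minimal sketch in plain Python. The helper name `cbow_pairs` and the toy sentence are assumptions made for this example only; the notebook's own implementation is the `generate_cbow_pairs` method further down, which additionally maps tokens to vocabulary indices.

```python
def cbow_pairs(tokens, window_size, pad_token="[PAD]"):
    """Return (context, target) pairs for one tokenized sentence."""
    pairs = []
    for j, target in enumerate(tokens):
        # Up to window_size tokens on each side of position j
        left = tokens[max(0, j - window_size):j]
        right = tokens[j + 1:j + 1 + window_size]
        context = left + right
        if not context:
            # A single-token sentence has no context to predict from
            continue
        # Pad at the end so every context has exactly 2 * window_size entries
        context += [pad_token] * (2 * window_size - len(context))
        pairs.append((context, target))
    return pairs


pairs = cbow_pairs(["i", "feel", "happy", "today"], window_size=1)
for context, target in pairs:
    print(context, "->", target)
```

Because CBOW averages the context embeddings, the order of tokens inside a context (and the position of the padding) does not affect the model input.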

In [119]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import json
import os
import matplotlib.pyplot as plt

from datetime import datetime
from tqdm import tqdm

import sys


In [120]:
sys.path.append(os.path.abspath('../Task1'))

print(os.getcwd())

# changing directories to get the WordPieceTokenizer class from task1
os.chdir('../Task1')

print(os.getcwd())

from task1 import WordPieceTokenizer

tokenizer = WordPieceTokenizer()

os.chdir('../Task2')

print(os.getcwd())


c:\Users\ISHITA\OneDrive\Desktop\CSAM6\NLP\Assignments\Assignment 1\CSE556-NLP\assignment1\Task2
c:\Users\ISHITA\OneDrive\Desktop\CSAM6\NLP\Assignments\Assignment 1\CSE556-NLP\assignment1\Task1
c:\Users\ISHITA\OneDrive\Desktop\CSAM6\NLP\Assignments\Assignment 1\CSE556-NLP\assignment1\Task2


## Word2VecDataset Class

In [121]:
# Word2VecDataset(Dataset) => inherits from the Dataset class of Pytorch
class Word2VecDataset(Dataset):
    
    def __init__(self, window_size, vocabulary_size):
        
        """
        window_size : number of tokens taken on each side of the target as context
        vocabulary_size : vocabulary size requested from the WordPieceTokenizer
        The pad token '[PAD]' fills the context when too few neighbours are available.
        """

        self.pad_token = '[PAD]'
        self.window_size = window_size
        self.tokenizer = WordPieceTokenizer()
        self.vocabulary_size = vocabulary_size

        self.text = None

        # stores unique words in the corpus
        self.vocabulary = None
        # a dictionary mapping words to indices
        self.word_to_idx = None
        # a dictionary mapping indices to words
        self.idx_to_word = None
        # a list that will store training pairs
        self.cbow_pairs = []
        
        self.preprocess_data()

    # uses Word Piece Tokenizer from Task 1 to tokenize input corpus
    def tokenize_txt_file(self, input_file, output_file):
        
        with open(input_file, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        
        results = {}
        
        for idx, line in enumerate(lines):
            sentence = line.strip() 
            tokens = self.tokenizer.tokenize(sentence)
            results[str(idx)] = tokens 

        
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2)

    

    def update_vocabulary(self):

        self.tokenizer.construct_vocabulary("corpus.txt", vocab_size=self.vocabulary_size)

        vocabulary = []

        # Open the file in read mode
        with open('vocabulary_35.txt', 'r') as file:
            
            for line in file:
                word = line.strip()
                if word:  # to avoid adding empty lines
                    vocabulary.append(word)

        self.vocabulary = vocabulary
        print("got and updated vocab")

    def tokenize_corpus(self):
        
        self.tokenize_txt_file("corpus.txt", "tokenized_corpus.json")

        corpus = None

        with open('tokenized_corpus.json', 'r', encoding='utf-8') as f:

            # Load the JSON data
            tokenized_corpus = json.load(f)
            
            # Convert the dictionary into a list of sentences (list of tokenized words)
            corpus = [tokens for tokens in tokenized_corpus.values()]
            
        
        self.text = corpus


    def generate_cbow_pairs(self):

        # loops over all the sentences in the tokenized corpus
        for sentence in self.text:

            for j in range(len(sentence)):
                # Get context words within window
                context_words = (sentence[max(0, j - self.window_size):j] + 
                                sentence[j + 1:min(len(sentence), j + self.window_size + 1)])
                
                if len(context_words) > 0:
                    # Pad context if necessary
                    while len(context_words) < self.window_size * 2:
                        context_words.append(self.pad_token)
                    
                    # Convert context words and target word into numerical indices using self.word_to_idx
                    context_indices = [self.word_to_idx.get(w, self.word_to_idx[self.pad_token]) for w in context_words]
                    target_idx = self.word_to_idx.get(sentence[j], self.word_to_idx[self.pad_token])
                    
                    # Store the (context, target) pairs in self.cbow_pairs
                    self.cbow_pairs.append((context_indices, target_idx))


    # this function tokenizes text, creates the vocabulary and generates CBOW training pairs
    def preprocess_data(self):

        print("updating vocabulary")     
        self.update_vocabulary()

        print("tokenizing corpus")
        self.tokenize_corpus()

        print("updating mapping - word2idx and idx2word")

        # updates the word to index mapping
        self.word_to_idx = {word: idx for idx, word in enumerate(self.vocabulary)}
        # updates the reverse index to word mapping
        self.idx_to_word = {idx: word for word, idx in self.word_to_idx.items()}
        

        print("generating cbow pairs")
        self.generate_cbow_pairs()
    
    # returns the total number of CBOW training pairs
    def __len__(self):
        return len(self.cbow_pairs)
    
    
    # retrieves a single CBOW (context, target) pair
    def __getitem__(self, idx):

        # look up the pair stored at position idx
        context_indices, target_idx = self.cbow_pairs[idx]
        
        # Convert to tensors with explicit types
        context_tensor = torch.tensor(context_indices, dtype=torch.long)
        target_tensor = torch.tensor(target_idx, dtype=torch.long)
        
        return context_tensor, target_tensor


## Word2VecModel class

In [122]:
# Word2VecModel(nn.Module) => Inherits from nn.Module class in Pytorch
class Word2VecModel(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        
        '''
        vocab size : the size of the vocabulary
        embedding dimension : the number of features that represent each word
        '''

        super().__init__()
        
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        
        # creates a special table that stores word representations for each word
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)

        self.output_layer = nn.Linear(embedding_dim, vocab_size)
        

        self._initialize_weights()
    
    # function to initialize weight
    def _initialize_weights(self):
        
        # range of weights
        initrange = 0.5 / self.embedding_dim
        
        # initializing embedding layer
        self.embeddings.weight.data.uniform_(-initrange, initrange)
        
        # initializing output layer
        self.output_layer.weight.data.uniform_(-initrange, initrange)

        # initializing the bias of output layer
        self.output_layer.bias.data.zero_()
    

    # function for forward pass of the model
    def forward(self, x: torch.Tensor):

        """
        x : tensor of context-word indices, shape [batch size, window size*2]
        returns : log-probabilities over the vocabulary, shape [batch size, vocab size]
        """

        # looks up the embedding of every context word
        # size = [batch size, window size*2, embedding dimension]
        embedded = self.embeddings(x)
        
        # averages the context embeddings over the context dimension (dim=1)
        # size = [batch_size, embedding_dim]
        context_embedding = torch.mean(embedded, dim=1)
        
        # passing the average context embedding to output layer
        # output size vector = [batch_size, vocab_size]
        output = self.output_layer(context_embedding)

        # returns log-probabilities (log-softmax), which is what nn.NLLLoss expects
        return F.log_softmax(output, dim=1)
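
The forward pass above can be written out as equations. The symbols are notation introduced here for illustration: $w$ is the window size, $c_1, \dots, c_{2w}$ are the context indices, $t$ is the target index, $E \in \mathbb{R}^{V \times d}$ is the embedding table, and $W \in \mathbb{R}^{V \times d}$, $b \in \mathbb{R}^{V}$ are the weight and bias of the output layer.

```latex
\begin{aligned}
h &= \frac{1}{2w} \sum_{i=1}^{2w} E_{c_i}
    && \text{(mean of the context embeddings)} \\
z &= W h + b
    && \text{(one score per vocabulary entry)} \\
\log p(k \mid c) &= z_k - \log \sum_{j=1}^{V} \exp(z_j)
    && \text{(log-softmax)} \\
\mathcal{L} &= -\log p(t \mid c)
    && \text{(negative log-likelihood of the target)}
\end{aligned}
```

Since the embedding layer is created without a padding index, `[PAD]` entries in the context contribute to the mean $h$ like any other token.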



## Functions for saving and loading checkpoints

In [123]:
def save_checkpoint(model, checkpoint_path, epoch, optimizer, loss, accuracy):

    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
        'accuracy': accuracy,
        'vocab_size': model.vocab_size,
        'embedding_dim': model.embedding_dim,
        'timestamp': datetime.now().isoformat()
    }
    torch.save(checkpoint, checkpoint_path)


def load_checkpoint(model, checkpoint_path, device: str = 'cuda' if torch.cuda.is_available() else 'cpu'):

    checkpoint = torch.load(checkpoint_path, map_location=device)
    
    # Load model state
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    
    return (
        model,
        checkpoint['optimizer_state_dict'],
        checkpoint['epoch'],
        checkpoint['loss']
    )

## Training Function

In [218]:
def train_word2vec(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    num_epochs: int,
    learning_rate: float,
    checkpoint_dir: str = 'checkpoints',
    save_frequency: int = 5,
    device: str = 'cuda' if torch.cuda.is_available() else 'cpu'
):
    
    model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()
    os.makedirs(checkpoint_dir, exist_ok=True)
    
    history = {
        'epoch_losses': [],
        'epoch_accuracies': [],
        'batch_losses': [],
        'batch_accuracies': []
    }
    
    for epoch in range(num_epochs):
        # Linear learning-rate decay is disabled here (formula kept for reference);
        # the learning rate is held constant across epochs
        # current_lr = learning_rate * (1 - epoch / num_epochs)
        current_lr = learning_rate
        for param_group in optimizer.param_groups:
            param_group['lr'] = current_lr
        
        epoch_loss = 0.0
        epoch_correct = 0
        epoch_total = 0
        num_batches = len(train_loader)
        
        progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}')
        model.train()

        for batch_idx, (context, target) in enumerate(progress_bar):
            context, target = context.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(context)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            
            predicted = output.argmax(dim=1)
            correct = (predicted == target).sum().item()
            total = target.size(0)
            
            batch_accuracy = (correct / total) * 100
            batch_loss = loss.item()
            epoch_loss += batch_loss
            epoch_correct += correct
            epoch_total += total
            
            history['batch_losses'].append(batch_loss)
            history['batch_accuracies'].append(batch_accuracy)
            
            progress_bar.set_postfix({
                'batch_loss': f'{batch_loss:.4f}',
                'batch_acc': f'{batch_accuracy:.2f}%',
                'avg_loss': f'{epoch_loss/(batch_idx+1):.4f}',
                'lr': f'{current_lr:.6f}'
            })
        
        avg_epoch_loss = epoch_loss / num_batches
        epoch_accuracy = (epoch_correct / epoch_total) * 100
        history['epoch_losses'].append(avg_epoch_loss)
        history['epoch_accuracies'].append(epoch_accuracy)
        
        print(f'\nEpoch {epoch+1}/{num_epochs}:')
        print(f'Average Loss: {avg_epoch_loss:.4f}')
        print(f'Training Accuracy: {epoch_accuracy:.2f}%')
        
        print("\nValidating model...")
        val_loss, accuracy, cosine_similarity = validate_model(model, val_loader, device)
        print(f"Validation Loss: {val_loss:.4f}")
        print(f"Validation Accuracy: {accuracy:.2f}%")
        print(f"Cosine Similarity: {cosine_similarity}")
        
        if (epoch + 1) % save_frequency == 0:
            checkpoint_path = os.path.join(checkpoint_dir, f'word2vec_checkpoint_epoch_{epoch+1}.pt')
            save_checkpoint(model, checkpoint_path, epoch, optimizer, avg_epoch_loss, epoch_accuracy)
            print(f'Checkpoint saved: {checkpoint_path}')
    
    final_checkpoint_path = os.path.join(checkpoint_dir, 'word2vec_final_model.pt')
    save_checkpoint(model, final_checkpoint_path, num_epochs-1, optimizer, history['epoch_losses'][-1], history['epoch_accuracies'][-1])
    print(f'Final model saved: {final_checkpoint_path}')
    
    return history
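
The commented-out formula in the training loop describes a linear learning-rate decay. The sketch below evaluates that formula on its own; `linear_decay_lr` is a hypothetical helper name used only here. With a base rate of 0.02 and 15 epochs it yields the same `lr` values that appear in the progress bars of the training log (0.020000, 0.018667, ..., 0.001333), which suggests the logged run had the decay enabled.

```python
def linear_decay_lr(base_lr, epoch, num_epochs):
    """Learning rate for a zero-based epoch under linear decay."""
    return base_lr * (1 - epoch / num_epochs)


# Same settings as the notebook: LEARNING_RATE = 0.02, NUM_EPOCHS = 15
schedule = [linear_decay_lr(0.02, epoch, 15) for epoch in range(15)]

for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Because the epoch index starts at 0 and stops at `num_epochs - 1`, the rate never reaches zero with this formula.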


## Evaluation Functions

In [125]:

def validate_model(model, val_loader, device: str = 'cuda' if torch.cuda.is_available() else 'cpu'):
    
    model.eval()
    criterion = nn.NLLLoss()
    total_loss = 0.0
    total_correct = 0
    total_samples = 0
    total_cosine_sim = 0.0
    num_batches = len(val_loader)
    
    # Cosine similarity function
    cos_sim = nn.CosineSimilarity(dim=1)
    
    with torch.no_grad():
        for context, target in val_loader:
            context = context.to(device)
            target = target.to(device)
            
            # Forward pass
            output = model(context)
            loss = criterion(output, target)
            
            # Calculate loss
            total_loss += loss.item()
            
            # Calculate accuracy
            predicted = output.argmax(dim=1)
            total_correct += (predicted == target).sum().item()
            total_samples += target.size(0)
            
            # Calculate cosine similarity
            # Get embeddings for predicted and target words
            predicted_embeddings = model.embeddings(predicted)
            target_embeddings = model.embeddings(target)
            
            # Calculate cosine similarity between predicted and target embeddings
            batch_cosine_sim = cos_sim(predicted_embeddings, target_embeddings).mean()
            total_cosine_sim += batch_cosine_sim.item()
    
    # Calculate average metrics
    avg_loss = total_loss / num_batches
    accuracy = (total_correct / total_samples) * 100
    avg_cosine_sim = total_cosine_sim / num_batches
    
    return avg_loss, accuracy, avg_cosine_sim

def evaluate_model(model, val_loader, device, dataset, BATCH_SIZE) :
    model.eval()
    
    with torch.no_grad():
        for i, (context, target) in enumerate(val_loader):
            context = context.to(device)
            target = target.to(device)
            
            # Get model prediction
            output = model(context)
            predicted_indices = output.argmax(dim=1)
            
            # Convert to words
            for j in range(len(context)):
                context_words = [dataset.idx_to_word[idx.item()] for idx in context[j]]
                true_word = dataset.idx_to_word[target[j].item()]
                predicted_word = dataset.idx_to_word[predicted_indices[j].item()]
                
                print(f"\nPair {i*BATCH_SIZE + j + 1}:")
                print(f"Context: {context_words}")
                print(f"True word: {true_word}")
                print(f"Predicted: {predicted_word}")
                print(f"Correct: {'✓' if true_word == predicted_word else '✗'}")
            
            # Stop after the first 5 batches for brevity
            if i >= 4:
                print("\n... (showing first 5 batches only)")
                break

def plot_training_history(history):
    """Plot both training loss and accuracy over epochs"""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot loss
    ax1.plot(history['epoch_losses'], label='Training Loss')
    ax1.set_title('Training Loss Over Epochs')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.legend()
    ax1.grid(True)
    
    # Plot accuracy
    ax2.plot(history['epoch_accuracies'], label='Training Accuracy', color='green')
    ax2.set_title('Training Accuracy Over Epochs')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy (%)')
    ax2.legend()
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()
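
One detail of `validate_model` worth keeping in mind: it divides the summed batch losses by the number of batches, so a smaller final batch counts as much as a full one. The sketch below contrasts that with a sample-weighted mean. The function names and the toy numbers are assumptions for illustration, not values from this run.

```python
def batch_mean(batch_losses):
    """Average of per-batch mean losses (what validate_model computes)."""
    return sum(batch_losses) / len(batch_losses)


def sample_weighted_mean(batch_losses, batch_sizes):
    """Average loss per sample, weighting each batch by its size."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Toy values: two full batches and one short final batch
losses = [4.0, 4.0, 2.0]
sizes = [256, 256, 64]

print(f"batch mean:           {batch_mean(losses):.4f}")
print(f"sample-weighted mean: {sample_weighted_mean(losses, sizes):.4f}")
```

With many batches and only one short batch the two numbers are usually close, so this matters mostly for small validation sets.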

## Saving Model

In [126]:
def save_model(final_model_dir, model, val_loss, accuracy):
    
    model_path = os.path.join(final_model_dir, 'final_model.pt')
    torch.save({
        'model_state_dict': model.state_dict(),
        'vocab_size': model.vocab_size,
        'embedding_dim': model.embedding_dim,
        'val_loss': val_loss,
        'val_accuracy': accuracy
    }, model_path)


def save_vocabulary(final_model_dir, dataset):  

    vocab_path = os.path.join(final_model_dir, 'vocabulary.json')
    vocab_data = {
        'word2idx': dataset.word_to_idx,
        'idx2word': dataset.idx_to_word
    }
    with open(vocab_path, 'w') as f:
        json.dump(vocab_data, f)
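
A caveat when reading this file back: JSON object keys are always strings, so the integer keys of the index-to-word mapping come back as `'0'`, `'1'`, and so on. The following sketch shows the round trip on a toy vocabulary (the three entries are made up for the example) and one way to restore integer keys.

```python
import json

# Toy mappings, built the same way as in preprocess_data
word_to_idx = {"[PAD]": 0, "##a": 1, "happy": 2}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Serialize and parse again, as save_vocabulary plus a later json.load would
payload = json.dumps({"word2idx": word_to_idx, "idx2word": idx_to_word})
restored = json.loads(payload)

print(list(restored["idx2word"].keys()))  # keys are now strings

# Convert the keys back to integers before using them for lookups
idx_to_word_restored = {int(k): v for k, v in restored["idx2word"].items()}
print(idx_to_word_restored == idx_to_word)
```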

## Main Function

In [216]:
WINDOW_SIZE = 4
EMBEDDING_DIM = 10
BATCH_SIZE = 256
NUM_EPOCHS = 15
LEARNING_RATE = 0.02
TRAIN_SPLIT = 0.8    
VOCAB_SIZE = 8500


In [200]:
# Create dataset
print("Creating dataset...")
dataset = Word2VecDataset(window_size=WINDOW_SIZE, vocabulary_size=VOCAB_SIZE)
    

Creating dataset...
updating vocabulary
got and updated vocab
tokenizing corpus
updating mapping - word2idx and idx2word
generating cbow pairs


In [201]:
def generate_cbow_pairs(dataset):
        cbow_pairs = []
        print(WINDOW_SIZE)
        # loops over all the sentences in the tokenized corpus
        for sentence in dataset.text:

            for j in range(len(sentence)):
                # Get context words within window
                context_words = (sentence[max(0, j - WINDOW_SIZE):j] + 
                                sentence[j + 1:min(len(sentence), j + WINDOW_SIZE + 1)])
                
                if len(context_words) > 0:
                    # Pad context if necessary
                    while len(context_words) < WINDOW_SIZE * 2:
                        context_words.append(dataset.pad_token)
                    
                    # Convert context words and target word into numerical indices using dataset.word_to_idx
                    context_indices = [dataset.word_to_idx.get(w, dataset.word_to_idx[dataset.pad_token]) for w in context_words]
                    target_idx = dataset.word_to_idx.get(sentence[j], dataset.word_to_idx[dataset.pad_token])
                    
                    # Collect the (context, target) pairs in the local cbow_pairs list
                    cbow_pairs.append((context_indices, target_idx))

        
        return cbow_pairs

In [202]:
dataset.cbow_pairs = generate_cbow_pairs(dataset)

print(dataset.cbow_pairs[:5])
print(len(dataset.cbow_pairs))

1
[([7550, 3010], 5634), ([5634, 5395], 7550), ([7550, 643], 5395), ([5395, 1940], 643), ([643, 643], 1940)]
87749


In [203]:
# Split dataset into training and validation
train_size = int(TRAIN_SPLIT * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(
        dataset, [train_size, val_size])

In [204]:
  
# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
    

In [205]:
# Print dataset information
print(f"\nVocabulary Size: {len(dataset.vocabulary)}")
print(f"Total Pairs: {len(dataset)}")
print(f"Training Pairs: {len(train_dataset)}")
print(f"Validation Pairs: {len(val_dataset)}")
print("\nSample vocabulary items:", list(dataset.vocabulary)[:5])
    


Vocabulary Size: 8500
Total Pairs: 87749
Training Pairs: 70199
Validation Pairs: 17550

Sample vocabulary items: ['##a', '##aachan', '##ab', '##abb', '##abbing']
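
The split sizes printed above follow directly from `TRAIN_SPLIT` and the number of pairs. The sketch below redoes that arithmetic and also computes how many batches one epoch takes for two batch sizes. The value 512 is not taken from the notebook; it is included only because the progress bars in the training log show 138 iterations per epoch, which matches a batch size of 512 rather than the 256 in the constants cell, so the logged run may have used a different setting.

```python
import math

total_pairs = 87749   # "Total Pairs" from the output above
train_split = 0.8     # TRAIN_SPLIT

train_size = int(train_split * total_pairs)
val_size = total_pairs - train_size
print(train_size, val_size)

# Batches per epoch when the last, smaller batch is kept
batches = {bs: math.ceil(train_size / bs) for bs in (256, 512)}
for batch_size, num_batches in batches.items():
    print(f"batch_size={batch_size}: {num_batches} batches per epoch")
```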


In [215]:


# Create model and set device
print("\nInitializing model...")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Word2VecModel(
        vocab_size=len(dataset.vocabulary),
        embedding_dim=EMBEDDING_DIM
    ).to(device)
    
# Train model
print("\nStarting training...")

history = train_word2vec(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        num_epochs=NUM_EPOCHS,
        learning_rate=LEARNING_RATE,
        checkpoint_dir='word2vec_checkpoints',
        save_frequency=2,
        device=device
)
    



Initializing model...

Starting training...


Epoch 1/15: 100%|██████████| 138/138 [00:08<00:00, 16.07it/s, batch_loss=3.8772, batch_acc=32.73%, avg_loss=5.3069, lr=0.020000]



Epoch 1/15:
Average Loss: 5.3069
Training Accuracy: 27.79%

Validating model...
Validation Loss: 4.3826
Validation Accuracy: 32.52%
Cosine Similarity: 0.6795270885740008


Epoch 2/15: 100%|██████████| 138/138 [00:08<00:00, 15.48it/s, batch_loss=4.1709, batch_acc=30.91%, avg_loss=4.1191, lr=0.018667]



Epoch 2/15:
Average Loss: 4.1191
Training Accuracy: 33.97%

Validating model...
Validation Loss: 4.0934
Validation Accuracy: 35.13%
Cosine Similarity: 0.5674447655677796
Checkpoint saved: word2vec_checkpoints\word2vec_checkpoint_epoch_2.pt


Epoch 3/15: 100%|██████████| 138/138 [00:10<00:00, 13.40it/s, batch_loss=3.3517, batch_acc=45.45%, avg_loss=3.8211, lr=0.017333]



Epoch 3/15:
Average Loss: 3.8211
Training Accuracy: 36.41%

Validating model...
Validation Loss: 3.9656
Validation Accuracy: 37.23%
Cosine Similarity: 0.5604792901447841


Epoch 4/15: 100%|██████████| 138/138 [00:10<00:00, 12.73it/s, batch_loss=3.6602, batch_acc=32.73%, avg_loss=3.6495, lr=0.016000]



Epoch 4/15:
Average Loss: 3.6495
Training Accuracy: 38.04%

Validating model...
Validation Loss: 3.9351
Validation Accuracy: 37.52%
Cosine Similarity: 0.5456450496401105
Checkpoint saved: word2vec_checkpoints\word2vec_checkpoint_epoch_4.pt


Epoch 5/15: 100%|██████████| 138/138 [00:09<00:00, 14.17it/s, batch_loss=3.1142, batch_acc=49.09%, avg_loss=3.5375, lr=0.014667]



Epoch 5/15:
Average Loss: 3.5375
Training Accuracy: 39.03%

Validating model...
Validation Loss: 3.9364
Validation Accuracy: 37.40%
Cosine Similarity: 0.5323825240135193


Epoch 6/15: 100%|██████████| 138/138 [00:09<00:00, 13.85it/s, batch_loss=2.7780, batch_acc=50.91%, avg_loss=3.4597, lr=0.013333]



Epoch 6/15:
Average Loss: 3.4597
Training Accuracy: 39.60%

Validating model...
Validation Loss: 3.9553
Validation Accuracy: 37.28%
Cosine Similarity: 0.5286958175046104
Checkpoint saved: word2vec_checkpoints\word2vec_checkpoint_epoch_6.pt


Epoch 7/15: 100%|██████████| 138/138 [00:10<00:00, 13.77it/s, batch_loss=3.6717, batch_acc=41.82%, avg_loss=3.4091, lr=0.012000]



Epoch 7/15:
Average Loss: 3.4091
Training Accuracy: 40.03%

Validating model...
Validation Loss: 3.9692
Validation Accuracy: 37.45%
Cosine Similarity: 0.5371948787144252


Epoch 8/15: 100%|██████████| 138/138 [00:09<00:00, 14.12it/s, batch_loss=4.0975, batch_acc=36.36%, avg_loss=3.3693, lr=0.010667]



Epoch 8/15:
Average Loss: 3.3693
Training Accuracy: 40.35%

Validating model...
Validation Loss: 3.9861
Validation Accuracy: 37.40%
Cosine Similarity: 0.5192864239215851
Checkpoint saved: word2vec_checkpoints\word2vec_checkpoint_epoch_8.pt


Epoch 9/15: 100%|██████████| 138/138 [00:09<00:00, 13.92it/s, batch_loss=3.4557, batch_acc=40.00%, avg_loss=3.3304, lr=0.009333]



Epoch 9/15:
Average Loss: 3.3304
Training Accuracy: 40.57%

Validating model...
Validation Loss: 4.0055
Validation Accuracy: 37.53%
Cosine Similarity: 0.521880818264825


Epoch 10/15: 100%|██████████| 138/138 [00:09<00:00, 14.11it/s, batch_loss=3.6378, batch_acc=32.73%, avg_loss=3.3040, lr=0.008000]



Epoch 10/15:
Average Loss: 3.3040
Training Accuracy: 40.80%

Validating model...
Validation Loss: 4.0197
Validation Accuracy: 37.38%
Cosine Similarity: 0.5266713091305324
Checkpoint saved: word2vec_checkpoints\word2vec_checkpoint_epoch_10.pt


Epoch 11/15: 100%|██████████| 138/138 [00:10<00:00, 13.35it/s, batch_loss=2.9832, batch_acc=45.45%, avg_loss=3.2752, lr=0.006667]



Epoch 11/15:
Average Loss: 3.2752
Training Accuracy: 40.99%

Validating model...
Validation Loss: 4.0328
Validation Accuracy: 37.50%
Cosine Similarity: 0.526365966456277


Epoch 12/15: 100%|██████████| 138/138 [00:09<00:00, 14.24it/s, batch_loss=3.3360, batch_acc=38.18%, avg_loss=3.2571, lr=0.005333]



Epoch 12/15:
Average Loss: 3.2571
Training Accuracy: 41.10%

Validating model...
Validation Loss: 4.0440
Validation Accuracy: 37.50%
Cosine Similarity: 0.5253479821341378
Checkpoint saved: word2vec_checkpoints\word2vec_checkpoint_epoch_12.pt


Epoch 13/15: 100%|██████████| 138/138 [00:10<00:00, 13.33it/s, batch_loss=3.3743, batch_acc=41.82%, avg_loss=3.2400, lr=0.004000]



Epoch 13/15:
Average Loss: 3.2400
Training Accuracy: 41.30%

Validating model...
Validation Loss: 4.0508
Validation Accuracy: 37.29%
Cosine Similarity: 0.5185267610209329


Epoch 14/15: 100%|██████████| 138/138 [00:11<00:00, 12.49it/s, batch_loss=2.4642, batch_acc=52.73%, avg_loss=3.2196, lr=0.002667]



Epoch 14/15:
Average Loss: 3.2196
Training Accuracy: 41.37%

Validating model...
Validation Loss: 4.0582
Validation Accuracy: 37.36%
Cosine Similarity: 0.525026034457343
Checkpoint saved: word2vec_checkpoints\word2vec_checkpoint_epoch_14.pt


Epoch 15/15: 100%|██████████| 138/138 [00:10<00:00, 12.88it/s, batch_loss=2.9287, batch_acc=49.09%, avg_loss=3.2100, lr=0.001333]



Epoch 15/15:
Average Loss: 3.2100
Training Accuracy: 41.52%

Validating model...
Validation Loss: 4.0612
Validation Accuracy: 37.36%
Cosine Similarity: 0.5255386199269976
Final model saved: word2vec_checkpoints\word2vec_final_model.pt


In [None]:
# Validate model and get accuracy
print("\nValidating model...")
val_loss, accuracy, cosine_similarity = validate_model(model, val_loader, device)
print(f"Validation Loss: {val_loss:.4f}")
print(f"Validation Accuracy: {accuracy:.2f}%")
    


In [None]:
# Plot training history
plot_training_history(history)
    


In [None]:
# Print validation pairs and predictions
print("\nValidation Pairs vs Predictions:")
print("-" * 50)

evaluate_model(model, val_loader, device, dataset, BATCH_SIZE)
    


In [None]:
# Save final model and vocabulary
print("\nSaving final model and vocabulary...")

final_model_dir = 'final_model'
os.makedirs(final_model_dir, exist_ok=True)
    
# Save vocabulary
save_vocabulary(final_model_dir, dataset)
    
# Save final model state
save_model(final_model_dir, model, val_loss, accuracy)