# **Tutorial: Creating a CBOW Word2Vec Model with PyTorch**
Word embeddings are a fundamental concept in natural language processing (NLP) and machine learning. They represent words as dense vectors in a continuous space, capturing semantic relationships between words. One popular technique for generating word embeddings is the Word2Vec model. In this tutorial, we will create a Continuous Bag of Words (CBOW) Word2Vec model using PyTorch.

## **Prerequisites**
Before we begin, make sure you have the following prerequisites:

Python and PyTorch installed on your machine.
Basic knowledge of PyTorch and NLP concepts.
## **Understanding CBOW Word2Vec**
CBOW (Continuous Bag of Words) is a type of Word2Vec model that aims to predict a target word based on its surrounding context words. It works by training a neural network to predict a target word from the context words in a sliding window. The word embeddings learned during this process capture semantic meanings and relationships between words.

## **Steps to Create a CBOW Word2Vec Model**
We'll break down the process into the following steps:

**1.Data Preprocessing:** Tokenize and preprocess your text data.

**2.Build Vocabulary:** Create a vocabulary from your text data.

**3.Create CBOW Dataset:** Create a CBOW dataset from the tokenized text.

**4.Define CBOW Model:** Define the CBOW model architecture.

**5.Training:** Train the CBOW model using the dataset.

**6.Save the Model:** Save the trained model for later use.

**7.Word Embeddings:** Access and use the word embeddings learned by the model.



In [27]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
from torch.utils.data import Dataset, DataLoader
from nltk.tokenize import word_tokenize
import nltk
import string
import numpy as np

# Sample text for demonstration
corpus = ["I enjoy deep learning",
          "Deep learning is fascinating",
          "Natural language processing is important",
          "PyTorch is a popular deep learning framework",
          "Word embeddings capture semantic meanings"]

# Tokenization and preprocessing
nltk.download('punkt')
translator = str.maketrans('', '', string.punctuation)
# Tokenize and preprocess the text
tokenized_corpus = [word_tokenize(doc.lower().translate(translator)) for doc in corpus]

# Build vocabulary
# Create a vocabulary and reverse vocabulary to map words to indices and vice versa
words = [word for doc in tokenized_corpus for word in doc]
word_count = Counter(words)
vocab = {word: idx for idx, (word, _) in enumerate(word_count.most_common())}
reverse_vocab = {idx: word for word, idx in vocab.items()}
vocab_size = len(vocab)

# Create CBOW dataset
context_window = 2  # Number of surrounding words to consider
data = []
for doc in tokenized_corpus:
    for i, target_word in enumerate(doc):
        context = [doc[j] for j in range(i - context_window, i + context_window + 1)
                   if 0 <= j < len(doc) and j != i]
        data.append((context, target_word))

# Define CBOW dataset class
class CBOWDataset(Dataset):
    def __init__(self, data, vocab):
        self.data = data
        self.vocab = vocab

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        context, target = self.data[idx]
        context_idx = [self.vocab[word] for word in context]
        target_idx = self.vocab[target]
        return context_idx, target_idx

# Hyperparameters
embedding_dim = 100
num_epochs = 100
learning_rate = 0.01

# Create CBOW dataset and DataLoader
cbow_dataset = CBOWDataset(data, vocab)
dataloader = DataLoader(cbow_dataset, batch_size=1, shuffle=True)

# Define the CBOW model
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)  # Correct output size

    def forward(self, x):
        embedded = self.embedding(x)
        embedded_sum = torch.sum(embedded, dim=0)  # Sum along the batch dimension
        out = self.linear(embedded_sum)
        return out

# Initialize model and optimizer
model = CBOW(vocab_size, embedding_dim)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    for context, target in dataloader:
        context = context[0]  # Unpack from list

        optimizer.zero_grad()

        # Convert context and target to PyTorch tensors
        context = torch.LongTensor(context)
        target_word = reverse_vocab[target[0].item()]  # Convert target to word
        target_idx = torch.LongTensor([vocab[target_word]])  # Convert word to index

        output = model(context)

        # Reshape the output and target tensors to match
        output = output.view(1, -1)  # Reshape to (1, vocab_size)
        target_idx = target_idx.view(1)  # Reshape to (1,)

        loss = criterion(output, target_idx)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {total_loss / len(dataloader)}")

# Save the trained model
torch.save(model.state_dict(), 'cbow_word2vec.pth')

# Use the trained model to get word embeddings
word_embeddings = model.embedding.weight.data.numpy()

# Print word embeddings
for word, idx in vocab.items():
    embedding = word_embeddings[idx]
    print(f"Word: {word}, Embedding: {embedding}")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Epoch 1/100, Loss: 3.314073324203491
Epoch 2/100, Loss: 2.4836450719833376
Epoch 3/100, Loss: 1.9116132354736328
Epoch 4/100, Loss: 1.5604651021957396
Epoch 5/100, Loss: 1.3671385204792024
Epoch 6/100, Loss: 1.255575248003006
Epoch 7/100, Loss: 1.1803887385129928
Epoch 8/100, Loss: 1.137332437634468
Epoch 9/100, Loss: 1.105860486626625
Epoch 10/100, Loss: 1.0843769496679305
Epoch 11/100, Loss: 1.0659975409507751
Epoch 12/100, Loss: 1.0485157623887063
Epoch 13/100, Loss: 1.0426976457238197
Epoch 14/100, Loss: 1.0229039108753204
Epoch 15/100, Loss: 1.0205603456497192
Epoch 16/100, Loss: 0.9991889956593514
Epoch 17/100, Loss: 0.9965507072210312
Epoch 18/100, Loss: 0.9894008563458919
Epoch 19/100, Loss: 0.995814810693264
Epoch 20/100, Loss: 0.9882562506198883
Epoch 21/100, Loss: 0.9765190842747689
Epoch 22/100, Loss: 0.9906574709713459
Epoch 23/100, Loss: 0.9932756780087948
Epoch 24/100, Loss: 0.9699992328882218
Epoch 25/100, Loss: 0.9840547174215317
Epoch 26/100, Loss: 0.9764780841767788


## **Explanation of the Code**
**Data Preprocessing:** We start by tokenizing and preprocessing the input text data using the NLTK library. This ensures that the text is cleaned and split into individual words.

**Build Vocabulary:** We build a vocabulary from the preprocessed text data. The vocabulary consists of unique words, each assigned a unique index.

**Create CBOW Dataset:** We create a CBOW dataset by sliding a context window over the tokenized text and pairing the surrounding words with the target word.

**Define CBOW Model:** We define the CBOW model using PyTorch. The model consists of an embedding layer and a linear layer.

**Training:** We train the CBOW model using the created dataset and standard deep learning training techniques.

**Save the Model:** We save the trained model for later use so that you can access the learned word embeddings.

**Word Embeddings:** We access and print the word embeddings learned by the model. These embeddings represent words as dense vectors in a continuous space.