
# Training Word Embedding

## Definition
Training word embeddings means learning a vector representation for each word in a vocabulary such that words with similar meanings have similar vectors. This is typically done with a neural network that learns each word's vector from the words appearing around it (its context).

## Code Implementation
The code sets up and trains a word embedding model using the Skip-gram approach. The embeddings capture the semantic meaning of words based on their context in the training corpus. This is achieved through a series of steps involving data preparation, model definition, and iterative training.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import numpy as np

# 1. Preprocesses the text data to create a vocabulary and tokenizes the sentences

The code starts with a corpus of sentences and tokenizes them to build a vocabulary. Each word is assigned a unique index.

* **Tokenization**: The `tokenize` function lowercases each sentence and splits it on whitespace into a list of words.
* **Building the Vocabulary**: The `build_vocab` function constructs a vocabulary from the tokenized sentences. It assigns a unique index to each word in order of first appearance in the corpus; the word counts are collected but do not affect the indices. This vocabulary is used to convert words into numerical indices.

In [2]:
# Sample text data
corpus = [
    "I love programming",
    "Programming is fun",
    "I love learning new things",
    "Deep learning is a subset of machine learning",
    "PyTorch is a great library for deep learning",
    "Machine learning and deep learning are fascinating",
    "Natural language processing is a part of AI",
    "AI is transforming the world",
    "We can build intelligent systems using AI",
    "Programming requires logical thinking"
]

In [3]:
# Preprocessing
def tokenize(text):
    return text.lower().split()

def build_vocab(corpus):
    tokens = [tokenize(sentence) for sentence in corpus]
    counter = Counter([token for sentence in tokens for token in sentence])
    # Counter keeps insertion order, so indices follow first appearance in the corpus
    vocab = {word: idx for idx, (word, _) in enumerate(counter.items())}
    return vocab

vocab = build_vocab(corpus)
vocab_size = len(vocab)
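
If frequency-ordered indices are wanted instead, one possible variant is sketched below. The function name `build_vocab_by_frequency`, the three-sentence `sample_corpus`, and the reverse mapping `idx_to_word` are illustrative assumptions, not part of the notebook. It relies on `Counter.most_common()`, which sorts words by descending count.

```python
from collections import Counter

def tokenize(text):
    # Same whitespace tokenizer as in the notebook
    return text.lower().split()

def build_vocab_by_frequency(corpus):
    # Hypothetical variant: the most frequent word receives index 0
    counter = Counter(token for sentence in corpus for token in tokenize(sentence))
    return {word: idx for idx, (word, _) in enumerate(counter.most_common())}

# Small illustrative corpus (an assumption, not the notebook's full corpus)
sample_corpus = [
    "I love programming",
    "Programming is fun",
    "Programming requires logical thinking",
]

freq_vocab = build_vocab_by_frequency(sample_corpus)

# Reverse mapping, handy for turning indices back into words
idx_to_word = {idx: word for word, idx in freq_vocab.items()}

print(freq_vocab["programming"])  # 0, since "programming" occurs three times
print(idx_to_word[0])             # programming
```

Either ordering works for training, because the indices only serve as row numbers in the embedding table.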

# 2. Defines a TextDataset class to handle the context words for training

The `TextDataset` class generates (center, context) training pairs. For each word in a sentence (the center word), it pairs that word with every nearby word (a context word) inside a fixed window.

* **Initialization**: The `TextDataset` class takes the corpus and vocabulary as inputs. The `context_size` argument (default 2) sets how many positions to the left and right of the center word count as context. Positions that fall outside the sentence are skipped.

* **Data Storage**: The dataset stores the pairs as tuples of indices, representing the center and context words.

* **Dataset Methods**: The `__len__` method returns the number of pairs, and the `__getitem__` method retrieves a specific pair, converting both indices to tensors.
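
To make the windowing concrete, here is a minimal standalone sketch that produces the same kind of pairs on word strings rather than indices. The helper name `skipgram_pairs` is hypothetical and not part of the notebook.

```python
def skipgram_pairs(tokens, context_size=2):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        # Clamp the window to the sentence boundaries
        start = max(0, center_pos - context_size)
        end = min(len(tokens), center_pos + context_size + 1)
        for context_pos in range(start, end):
            if context_pos != center_pos:  # a word is not its own context
                pairs.append((center, tokens[context_pos]))
    return pairs

pairs = skipgram_pairs(["i", "love", "programming"], context_size=1)
print(pairs)
# [('i', 'love'), ('love', 'i'), ('love', 'programming'), ('programming', 'love')]
```

With `context_size=2`, the same three-word sentence yields six pairs, because every word then sees both of the others.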

In [4]:
class TextDataset(Dataset):
    def __init__(self, corpus, vocab, context_size=2):
        self.data = []
        self.vocab = vocab
        for sentence in corpus:
            tokens = tokenize(sentence)
            indices = [vocab[token] for token in tokens]
            for center_pos in range(len(indices)):
                for context_pos in range(-context_size, context_size + 1):
                    if context_pos == 0 or center_pos + context_pos < 0 or center_pos + context_pos >= len(indices):
                        continue
                    context_word = indices[center_pos + context_pos]
                    self.data.append((indices[center_pos], context_word))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        center_word, context_word = self.data[idx]
        return torch.tensor(center_word, dtype=torch.long), torch.tensor(context_word, dtype=torch.long)

dataset = TextDataset(corpus, vocab)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# 3. Defines a simple word embedding model using PyTorch’s nn.Embedding

The `WordEmbeddingModel` class wraps a single embedding layer (`nn.Embedding`), a trainable lookup table that holds one fixed-size vector (embedding) per word in the vocabulary.

* **Model Definition**: The `WordEmbeddingModel` class initializes the embedding layer with the vocabulary size and the embedding dimension.

* **Forward Method**: The `forward` method takes a tensor of center-word indices and returns the corresponding rows of the embedding layer.
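
The lookup behaviour can be checked in isolation. The sketch below uses a toy table of 5 words with 3 dimensions; these sizes and the variable names are illustrative assumptions, not the notebook's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy lookup table: 5 "words", 3 dimensions each (illustrative sizes)
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

# A batch of three word indices; index 2 appears twice
indices = torch.tensor([0, 2, 2], dtype=torch.long)
vectors = embedding(indices)

print(vectors.shape)                                  # torch.Size([3, 3])
print(torch.equal(vectors[1], vectors[2]))            # True: same index, same row
print(torch.equal(vectors[0], embedding.weight[0]))   # True: output is row 0 of the table
```

The layer performs no computation beyond selecting rows, so all learning happens through gradient updates to `embedding.weight`.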

In [5]:
# Model
class WordEmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(WordEmbeddingModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)

    def forward(self, center_words):
        return self.embeddings(center_words)

embed_size = 10
model = WordEmbeddingModel(vocab_size, embed_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 4. Trains the model using a basic context-based approach

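During training, the model learns embeddings by predicting context words given a center word. Each batch goes through four steps:
* **Forward Pass**: Looks up the embeddings for the center words.
* **Context Prediction**: Scores every vocabulary word as a candidate context word. `torch.matmul(embeddings, model.embeddings.weight.t())` takes the dot product of each center embedding with every row of the embedding matrix, giving one score per vocabulary word. The same embedding matrix serves as both the input and the output representation here.
* **Loss Calculation**: `CrossEntropyLoss` applies a softmax over these scores and penalizes low probability on the actual context word index.
* **Backward Pass**: Computes gradients and updates the model parameters to minimize the loss.

The loss printed every 10 epochs is the sum of the per-batch losses over the epoch, not an average, so its magnitude depends on the number of batches.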
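
The scoring and loss steps can be reproduced on a toy batch, as sketched below. The sizes and index values are illustrative assumptions. The sketch checks that `nn.CrossEntropyLoss` on the raw scores matches a manual negative log-softmax evaluated at the true context index.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_size = 6, 4          # toy sizes, not the notebook's
embedding = nn.Embedding(vocab_size, embed_size)

center_words = torch.tensor([1, 3])    # batch of two center-word indices
context_words = torch.tensor([2, 0])   # their true context-word indices

# Forward pass: look up the center vectors, shape (2, 4)
center_vectors = embedding(center_words)

# Context prediction: dot product with every row of the same table, shape (2, 6)
scores = torch.matmul(center_vectors, embedding.weight.t())

# Loss as computed in the training loop
loss = nn.CrossEntropyLoss()(scores, context_words)

# Equivalent manual computation: mean of -log softmax(score) at the true index
log_probs = F.log_softmax(scores, dim=1)
manual_loss = -log_probs[torch.arange(2), context_words].mean()

print(scores.shape)                       # torch.Size([2, 6])
print(torch.allclose(loss, manual_loss))  # True
```

Because the scores come from the same table that supplies the center vectors, a word's score against itself is its squared norm. The original word2vec Skip-gram formulation keeps separate input and output vectors, which avoids this coupling.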

In [6]:
# Training
num_epochs = 100

for epoch in range(num_epochs):
    total_loss = 0
    for center_words, context_words in dataloader:
        optimizer.zero_grad()
        embeddings = model(center_words)
        outputs = torch.matmul(embeddings, model.embeddings.weight.t())
        loss = criterion(outputs, context_words)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss:.4f}')

Epoch [10/100], Loss: 326.0805
Epoch [20/100], Loss: 267.9783
Epoch [30/100], Loss: 226.5356
Epoch [40/100], Loss: 196.5373
Epoch [50/100], Loss: 174.2668
Epoch [60/100], Loss: 157.4672
Epoch [70/100], Loss: 144.4252
Epoch [80/100], Loss: 134.2227
Epoch [90/100], Loss: 126.3131
Epoch [100/100], Loss: 120.1455


# 5. Tests the model with 10 sample sentences and prints their embeddings

After training, the model can return embeddings for the words of a new sentence, provided those words appeared in the training corpus. Words outside the vocabulary have no learned vector and are skipped, which is why a sentence such as "AI is amazing" yields only two embeddings. The embeddings can be used in downstream tasks, such as similarity measurement or as input features for other models.

* **Tokenization and Indexing**: Each test sentence is tokenized, and the in-vocabulary words are converted to their indices.
* **Embedding Lookup**: The model looks up the embedding of each remaining word.
* **Output**: The embeddings are printed as one row per in-vocabulary word.
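
As an illustration of similarity measurement, the sketch below ranks words by cosine similarity. The helpers `cosine_similarity` and `most_similar`, along with the hand-written `toy_vocab` and `toy_matrix`, are hypothetical stand-ins; no trained values are used.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors, in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, vocab, embedding_matrix, top_k=3):
    """Rank all other vocabulary words by cosine similarity to `word`."""
    query = embedding_matrix[vocab[word]]
    scored = [
        (other, cosine_similarity(query, embedding_matrix[idx]))
        for other, idx in vocab.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Hand-written toy data (an assumption, not learned embeddings)
toy_vocab = {"deep": 0, "machine": 1, "fun": 2}
toy_matrix = np.array([
    [1.0, 0.0],   # deep
    [0.9, 0.1],   # machine: nearly parallel to "deep"
    [0.0, 1.0],   # fun: orthogonal to "deep"
])

neighbors = most_similar("deep", toy_vocab, toy_matrix, top_k=1)
print(neighbors[0][0])  # machine
```

With the notebook's trained model, the matrix could be obtained as `model.embeddings.weight.detach().numpy()` and passed in together with `vocab`. On a corpus this small, the resulting neighbors should not be expected to be meaningful.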

# Warning
This is example code only. For production use, consider the improvements below.

Steps to improve accuracy and reduce loss:
1. Increase the size of the corpus: Use a larger and more diverse set of text data to train the model.
2. Use negative sampling: Implement negative sampling to reduce computational complexity and improve training efficiency.
3. Tune hyperparameters: Experiment with different embedding sizes, learning rates, and context window sizes to find the optimal settings for your specific dataset.
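
A minimal sketch of the negative-sampling idea from step 2 follows. It is not the notebook's model. The class name `SkipGramNegativeSampling`, the separate center and context tables, the toy sizes, and the uniform sampling of negatives are all assumptions made for illustration. Instead of a softmax over the whole vocabulary, each center word is scored against one true context word and a handful of randomly drawn words.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNegativeSampling(nn.Module):
    """Hypothetical skip-gram model trained with a negative-sampling loss."""

    def __init__(self, vocab_size, embed_size):
        super().__init__()
        # Separate tables for center (input) and context (output) words
        self.center_embeddings = nn.Embedding(vocab_size, embed_size)
        self.context_embeddings = nn.Embedding(vocab_size, embed_size)

    def forward(self, center_words, context_words, negative_words):
        center = self.center_embeddings(center_words)        # (B, D)
        positive = self.context_embeddings(context_words)    # (B, D)
        negative = self.context_embeddings(negative_words)   # (B, K, D)

        # Dot product with the true context word, shape (B,)
        positive_score = (center * positive).sum(dim=1)
        # Dot products with the K sampled words, shape (B, K)
        negative_score = torch.bmm(negative, center.unsqueeze(2)).squeeze(2)

        # Push true pairs toward high scores and sampled pairs toward low scores
        loss = -(F.logsigmoid(positive_score)
                 + F.logsigmoid(-negative_score).sum(dim=1))
        return loss.mean()

torch.manual_seed(0)
vocab_size, embed_size, num_negatives = 20, 8, 5   # toy sizes

sgns = SkipGramNegativeSampling(vocab_size, embed_size)
center_words = torch.tensor([1, 4, 7])
context_words = torch.tensor([2, 5, 8])

# Simplification: negatives drawn uniformly, so they may occasionally
# coincide with a true context word
negative_words = torch.randint(0, vocab_size, (3, num_negatives))

loss = sgns(center_words, context_words, negative_words)
loss.backward()
print(loss.dim())  # 0, i.e. a scalar loss
```

Each update touches only `1 + num_negatives` context rows per example rather than the full vocabulary, which is where the efficiency gain comes from. word2vec draws negatives from a smoothed unigram distribution (counts raised to the power 0.75) rather than uniformly; that refinement is left out here.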

In [7]:
# Testing
test_data = [
    "AI is amazing",
    "Learning is a continuous process",
    "Programming opens many doors",
    "Deep learning uses neural networks",
    "Data science is a growing field",
    "PyTorch makes neural network modeling easier",
    "Machine learning is a core component of AI",
    "Understanding AI is important",
    "Technology evolves rapidly",
    "Natural language processing enables communication with machines"
]

# Convert test data to embeddings (words missing from the training vocabulary are skipped)
for sentence in test_data:
    tokens = tokenize(sentence)
    indices = [vocab[token] for token in tokens if token in vocab]
    embeddings = model(torch.tensor(indices, dtype=torch.long))
    print(f'Sentence: "{sentence}"')
    print('Embeddings:', embeddings.detach().numpy())
    print()

Sentence: "AI is amazing"
Embeddings: [[-0.12648243 -1.4020289  -0.8643976  -0.12619703 -0.04087214  0.73555386
   0.72107124 -0.377158    0.24512364  0.35317817]
 [ 0.70755345  0.4047808   0.16402324  0.0699135   0.16389346  0.5835555
   0.9639186  -0.65452385 -0.07893892  0.89540154]]

Sentence: "Learning is a continuous process"
Embeddings: [[-0.02035596  0.07210164  0.69371    -0.03727553  1.5827276   0.27384332
   0.44894344 -0.04693383  0.10514028 -0.9143499 ]
 [ 0.70755345  0.4047808   0.16402324  0.0699135   0.16389346  0.5835555
   0.9639186  -0.65452385 -0.07893892  0.89540154]
 [ 0.4678213   0.40463117  0.8982228  -0.0554723   0.22818246  0.7663856
   1.0892923  -0.4460114  -0.31314376  0.64136964]]

Sentence: "Programming opens many doors"
Embeddings: [[-0.80301255  0.5380913  -0.44144842  0.5302582  -0.3307413   0.7893749
  -0.02127253 -0.44663918  0.83124155  0.9281888 ]]

Sentence: "Deep learning uses neural networks"
Embeddings: [[ 0.02173646  0.23279826  0.26432922  0.