In [None]:
# Copyright (c) 2023 Sophie Katz
#
# This file is part of Language Model.
#
# Language Model is free software: you can redistribute it and/or modify it under
# the terms of the GNU General Public License as published by the Free Software
# Foundation, either version 3 of the License, or (at your option) any later version.
#
# Language Model is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
# PARTICULAR PURPOSE. See the GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along with Language
# Model. If not, see <https://www.gnu.org/licenses/>.

# Writing word embeddings from scratch using PyTorch

This is a minimal implementation of word embedding in PyTorch.

## Resources used

Name | URL
---- | ---
Writing a transformer from scratch | https://www.kaggle.com/code/arunmohan003/transformer-from-scratch-using-pytorch

## Approach

The first step in most NLP applications is turning the input text into vectors, and the first part of that step is creating word embeddings. A number of word embedding models are available, but we will write our own, which will be trained as part of the transformer's training.

## Imports

In [24]:
import torch as T
from torch import nn
import torch.nn.functional as F

_ = T.manual_seed(57)


## Constants

In [30]:
# The embedding vector for a given word has a fixed size that is not necessarily the
# same as the size of the vocabulary. It is usually much smaller.
WORD_EMBEDDING_SIZE = 512

# The number of different words we expect to find in our input.
VOCABULARY_SIZE = 10000

# The length of our input sentence.
SENTENCE_LENGTH = 13
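
To make these sizes concrete, the sketch below counts the learnable parameters of an embedding layer built with the same vocabulary and embedding sizes. The literal values repeat the constants above so the block can run on its own; the names `embedding` and `parameter_count` are illustrative only.

```python
from torch import nn

# Same sizes as VOCABULARY_SIZE and WORD_EMBEDDING_SIZE above.
embedding = nn.Embedding(10000, 512)

# The layer holds one weight matrix with one row per word in the vocabulary.
assert embedding.weight.shape == (10000, 512)

# Total number of learnable values: 10000 * 512.
parameter_count = sum(p.numel() for p in embedding.parameters())
print(parameter_count)  # 5120000
```

Note that the sentence length does not appear here: it only affects the shape of the input and output, not the number of parameters.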

## Network

We will use PyTorch's built-in embedding layer to create our word embedding module. Our module takes as input a tensor of word indices and returns a tensor of word embeddings.

Our tensor of word indices should look something like:

```python
[1, 50, 82, ..., 4, 24, 98]
```

It should be of shape `(sentence_length,)`. Likewise, our tensor of word embeddings should be of shape `(sentence_length, embedding_size)`.
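
As a rough illustration of these shapes, the sketch below runs `nn.Embedding` on a tiny sentence. The toy sizes and variable names are made up for readability and are not the constants used in this notebook.

```python
import torch
from torch import nn

# Hypothetical toy sizes, chosen only to keep things small.
toy_vocabulary_size = 5
toy_embedding_size = 3

embedding = nn.Embedding(toy_vocabulary_size, toy_embedding_size)

# A "sentence" of four word indices, shape (4,). Index 1 appears twice.
indices = torch.tensor([1, 4, 1, 0])

vectors = embedding(indices)

# One embedding vector per input index: (sentence_length, embedding_size).
assert vectors.shape == (4, toy_embedding_size)

# Each output row is the row of the weight matrix selected by the index.
assert torch.equal(vectors, embedding.weight[indices])

# The repeated index 1 produces identical vectors.
assert torch.equal(vectors[0], vectors[2])
```

In other words, the layer acts as a lookup table: the word index picks a row, and the row is the embedding.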

In [42]:
class WordEmbedding(nn.Module):
    """
    A simple word embedding model that takes a sentence of word indices as input and
    returns the embedding of each word.

    This module expects as input a tensor of word indices within the vocabulary of shape
    `(sentence_length,)`. It returns a tensor of word embeddings of shape
    `(sentence_length, embedding_size)`.

    Args:
        vocabulary_size: int
            The number of different words we expect to find in our input.
        embedding_size: int
            The size of the embedding vector for a given word.
    """

    def __init__(self, vocabulary_size: int, embedding_size: int) -> None:
        super().__init__()

        self.vocab_size = vocabulary_size
        self.embedding_size = embedding_size

        # We use PyTorch's built-in embedding layer
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_size)

    def forward(self, sentence: T.Tensor) -> T.Tensor:
        # We expect sentence to be of shape (sentence_length,) and to be a tensor of
        # word indices within the vocabulary.

        assert sentence.ndim == 1

        result = self.embedding(sentence)

        # We expect result to be of shape (sentence_length, embedding_size)
        assert result.ndim == 2
        assert result.size(0) == sentence.size(0)
        assert result.size(1) == self.embedding_size

        return result
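
One way to think about what the embedding layer computes: looking up rows by index gives the same result as multiplying one-hot encoded words by the weight matrix. The sketch below checks this on toy sizes. It uses `nn.Embedding` directly rather than the `WordEmbedding` class so that it runs on its own, and the sizes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Hypothetical toy sizes for illustration.
toy_vocabulary_size = 6
toy_embedding_size = 4

embedding = nn.Embedding(toy_vocabulary_size, toy_embedding_size)
indices = torch.tensor([3, 0, 5])

# The lookup performed by the embedding layer.
looked_up = embedding(indices)

# The same computation written as a matrix product with one-hot rows.
one_hot = F.one_hot(indices, num_classes=toy_vocabulary_size).float()
multiplied = one_hot @ embedding.weight

assert one_hot.shape == (3, toy_vocabulary_size)
assert torch.allclose(looked_up, multiplied)
```

The lookup avoids building the one-hot matrix at all, which matters when the vocabulary has thousands of entries.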

## Trying it out

Let's generate some random data and run our embedding module on it. The module isn't trained, so the embeddings are just its random initial weights, but this illustrates how it would work.

In [43]:
# Create our word embedding module
word_embedding = WordEmbedding(VOCABULARY_SIZE, WORD_EMBEDDING_SIZE)

# Generate a random sequence of word indices
word_indices = T.randint(VOCABULARY_SIZE, (SENTENCE_LENGTH,))

print(f"Word indices: {word_indices}")

# Pass our word indices through the word embedding module to get the embedding matrix
word_embeddings = word_embedding(word_indices)

print(f"Word embeddings shape: {word_embeddings.shape}")

assert word_embeddings.shape == (SENTENCE_LENGTH, WORD_EMBEDDING_SIZE)

print()
print("Word embeddings:")
print(word_embeddings)

Word indices: tensor([6021, 2019, 8742, 6067, 1243,  509, 9686, 7760, 8596, 5982, 3962, 9773,
        8539])
Word embeddings shape: torch.Size([13, 512])

Word embeddings:
tensor([[-1.1089, -1.6561, -0.3991,  ...,  1.1571, -0.0644,  1.8553],
        [-2.2252, -0.2858,  0.0437,  ...,  0.8528, -0.7891,  0.1091],
        [ 0.8597,  0.4533, -0.3149,  ...,  1.0133, -0.6705,  0.2638],
        ...,
        [-0.9042, -1.6502, -0.6856,  ...,  2.2120, -1.4461, -0.7706],
        [-1.2849,  0.1946, -1.6062,  ..., -1.4948, -2.0231, -0.5197],
        [-0.3801, -0.3783, -0.6943,  ..., -0.5142,  0.5159,  1.6972]],
       grad_fn=<EmbeddingBackward0>)
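
The `grad_fn=<EmbeddingBackward0>` in the output above shows that the embeddings are connected to PyTorch's autograd graph, which is what allows them to be trained. The sketch below runs a single optimizer step on a toy embedding layer. The loss (sum of squared embedding values), the plain SGD optimizer, and the toy sizes are placeholders assumed for illustration; they are not the transformer's actual training setup.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy sizes for illustration.
embedding = nn.Embedding(10, 4)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

# A short "sentence" that only uses words 2 and 7.
indices = torch.tensor([2, 7, 2])

# Keep a copy of the weights to compare against after the update.
before = embedding.weight.detach().clone()

# Placeholder loss, assumed only so that there is something to differentiate.
loss = embedding(indices).pow(2).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Find which rows of the weight matrix were changed by the update.
after = embedding.weight.detach()
changed_rows = (before != after).any(dim=1).nonzero().flatten().tolist()
print(changed_rows)  # [2, 7]
```

Only the rows for words that appeared in the input receive a nonzero gradient, so only those embeddings move in this step.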
