In [None]:
# Copyright (c) 2023 Sophie Katz
#
# This file is part of Language Model.
#
# Language Model is free software: you can redistribute it and/or modify it under
# the terms of the GNU General Public License as published by the Free Software
# Foundation, either version 3 of the License, or (at your option) any later version.
#
# Language Model is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
# PARTICULAR PURPOSE. See the GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along with Language
# Model. If not, see <https://www.gnu.org/licenses/>.

# Writing a transformer from scratch using PyTorch

In this notebook, I'm writing a minimal implementation of a transformer from scratch in PyTorch. The goal is to understand the transformer architecture as described in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762).

## Resources used

Name | URL
---- | ---
A video tutorial going through the paper | https://www.youtube.com/watch?v=U0s0f995w1
Tutorial for simple architecture | https://medium.com/the-dl/transformers-from-scratch-in-pytorch-8777e346ca51

## Architecture

The transformer architecture as a whole looks rather intimidating at first. Here's a diagram of the architecture taken from the paper:

![1_j9MmpNZzbBqkWes0GN8IBQ.webp](attachment:1_j9MmpNZzbBqkWes0GN8IBQ.webp)

We can break it down into simpler concepts, though. It's an encoder-decoder architecture at its core, built around two key techniques: **attention** and **positional encoding**.
* See [this notebook](attention_from_scratch.ipynb) for an implementation of attention from scratch. We will also use a PyTorch implementation of attention from [this module](../language_model/models/transformer_from_scratch/attention.py).
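
For reference, the attention operation defined in the paper is scaled dot-product attention. The formula below is taken from the paper; I'm assuming the linked module follows it, since its code isn't shown here. $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the feature count of each key:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
$$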

## Imports

In [1]:
import torch as T
import torch.nn as nn
import torch.nn.functional as F

from language_model.models.transformer_from_scratch.attention import attention

## Positional encoding

In [None]:
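def positional_encoding(
        sequence_length: int,
        feature_count: int,
        device: T.device
) -> T.Tensor:
    # The body is not written yet in this notebook; fail loudly if called.
    raise NotImplementedError("positional_encoding is not implemented yet")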
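
The paper defines positional encodings with sines and cosines of different frequencies: $PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})$. Here's a minimal sketch of one way to compute them, assuming that `feature_count` plays the role of $d_{model}$. The name `sinusoidal_positional_encoding` and the default CPU device are assumptions made for illustration, not necessarily how this notebook's `positional_encoding` will end up being written.

```python
import torch as T


def sinusoidal_positional_encoding(
    sequence_length: int,
    feature_count: int,
    device: T.device = T.device("cpu"),  # assumption: default to CPU
) -> T.Tensor:
    """Return a (sequence_length, feature_count) tensor of sinusoidal encodings."""
    # Column vector of positions 0..sequence_length-1.
    position = T.arange(sequence_length, dtype=T.float32, device=device).unsqueeze(1)
    # Row vector of feature indices 0..feature_count-1.
    index = T.arange(feature_count, device=device).unsqueeze(0)
    # Features 2i and 2i+1 share the exponent 2i / feature_count.
    even_index = index - index % 2
    exponent = even_index.to(T.float32) / feature_count
    angle = position / (10000.0 ** exponent)
    # Even features use sine, odd features use cosine.
    return T.where(index % 2 == 0, T.sin(angle), T.cos(angle))
```

In the paper, this tensor is added to the token embeddings, so it can be broadcast over the batch dimension of a `(batch_size, sequence_length, feature_count)` input.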

## Attention head

Each multi-head attention block is constructed from attention heads. Each head passes its query, key, and value inputs through its own learned linear layer and then applies the attention operation that we've already defined in [this module](../language_model/models/transformer_from_scratch/attention.py).

In [3]:
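class AttentionHead(nn.Module):
    def __init__(
        self,
        input_feature_count: int,
        query_feature_count: int,
        key_feature_count: int,
        value_feature_count: int,
    ) -> None:
        super().__init__()

        # Queries and keys are compared with a dot product, so their projected
        # feature counts must match.
        assert query_feature_count == key_feature_count

        self.input_feature_count = input_feature_count
        self.query_feature_count = query_feature_count
        self.key_feature_count = key_feature_count
        self.value_feature_count = value_feature_count

        # nn.Linear transforms the last (feature) dimension, so its sizes are
        # feature counts, not batch sizes or sequence lengths.
        self.query = nn.Linear(self.input_feature_count, self.query_feature_count)
        self.key = nn.Linear(self.input_feature_count, self.key_feature_count)
        self.value = nn.Linear(self.input_feature_count, self.value_feature_count)

    def forward(self, query: T.Tensor, key: T.Tensor, value: T.Tensor) -> T.Tensor:
        # Each input has shape (batch_size, sequence_length, input_feature_count).
        for tensor in (query, key, value):
            assert tensor.size(-1) == self.input_feature_count
        # Keys and values are paired position by position.
        assert key.size(-2) == value.size(-2)

        return attention(
            self.query(query),
            self.key(key),
            self.value(value),
        )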
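
To see the shapes involved, here's a small self-contained sketch of a single head. It assumes that the linear layers map an input feature count to a per-head feature count, and it uses a hypothetical `attention_sketch` function in place of the imported `attention`, whose implementation isn't shown in this notebook. `AttentionHeadSketch` and its parameter names are illustrative only.

```python
import math

import torch as T
import torch.nn as nn


def attention_sketch(query: T.Tensor, key: T.Tensor, value: T.Tensor) -> T.Tensor:
    # Hypothetical stand-in for the imported `attention` function:
    # scaled dot-product attention as defined in the paper.
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
    # Softmax over key positions, then a weighted average of the values.
    return T.softmax(scores, dim=-1) @ value


class AttentionHeadSketch(nn.Module):
    def __init__(self, input_feature_count: int, head_feature_count: int) -> None:
        super().__init__()
        # nn.Linear transforms the last (feature) dimension of its input.
        self.query = nn.Linear(input_feature_count, head_feature_count)
        self.key = nn.Linear(input_feature_count, head_feature_count)
        self.value = nn.Linear(input_feature_count, head_feature_count)

    def forward(self, query: T.Tensor, key: T.Tensor, value: T.Tensor) -> T.Tensor:
        return attention_sketch(self.query(query), self.key(key), self.value(value))


head = AttentionHeadSketch(input_feature_count=16, head_feature_count=4)
x = T.randn(2, 10, 16)  # (batch_size, sequence_length, input_feature_count)
output = head(x, x, x)  # self-attention: the same tensor plays all three roles
print(output.shape)  # torch.Size([2, 10, 4])
```

The output keeps the batch size and the query sequence length, while the last dimension becomes the per-head feature count. For encoder-decoder attention, the key and value inputs would come from the encoder and could have a different sequence length from the query.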