Transformer From Scratch 
========================

While reading Dan Jurafsky and James H. Martin's [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/), I decided to follow Chapter 8 of the book and implement a Transformer in PyTorch. My goal is a working Transformer that I can train on the guitar dataset. I know some linear algebra, the book gives essentially the entire algorithm in terms of linear algebra, and PyTorch provides convenient yet still informative abstractions for doing linear algebra. I had no reason not to pursue this project on top of what I initially proposed to do.

Attention Layer
---------------

At the heart of the Transformer is the **attention layer**, a mechanism that lets words (tokens) gain contextual meaning from the surrounding words (tokens). It can have multiple **"heads"**, where each head can be thought of as a specialist who asks a particular set of questions about the data. For instance, one head could focus solely on grammar while another focuses on sentiment (although that may not be exactly what happens under the hood).

Each head's job, then, is to ask the right kinds of *questions* to decide which of the previous words it has seen matter most to the current word. To do this, each head consists of three main components: the **Query**, **Key**, and **Value** weight matrices.
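
As a rough illustration of the paragraph above, a single head can be sketched as scaled dot-product attention with a causal mask, i.e. `softmax(Q K^T / sqrt(key_dim) + mask) V`. The weight matrices `W_q`, `W_k`, and `W_v` below are random placeholders chosen for illustration; this is a minimal sketch under those assumptions, not the actual `AttentionLayer` from `myTransformer`.

```python
import math
import torch

torch.manual_seed(0)

N, model_dim, key_dim = 5, 24, 3   # sequence length, embedding size, per-head size
X = torch.rand(N, model_dim)       # one sequence of N token embeddings

# Hypothetical per-head weight matrices (random here, learned in a real model).
W_q = torch.rand(model_dim, key_dim)
W_k = torch.rand(model_dim, key_dim)
W_v = torch.rand(model_dim, key_dim)

Q, K, V = X @ W_q, X @ W_k, X @ W_v   # each has shape (N, key_dim)

# Additive causal mask: 0 where j <= i, -inf where j > i (future tokens).
mask = torch.triu(torch.full((N, N), -torch.inf), diagonal=1)

scores = Q @ K.T / math.sqrt(key_dim) + mask   # (N, N) query-key match scores
weights = torch.softmax(scores, dim=-1)        # each row sums to 1
head_out = weights @ V                         # (N, key_dim) weighted sum of values
```

Each row of `weights` says how much the current token draws from every earlier token, so the head's output is a weighted sum of value vectors.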

<!-- 
    essentially, what it is at the end of the day is weighted sum, but it's obviously lot more complicated than that
    don't forget to write out the equations that I have referenced
    maybe throw in some pictures
    say something about how masking and softmax is used to determine what key's to focus on
    also explain how results from different heads are consolidated at the end
-->
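
One common way to consolidate several heads is to concatenate their outputs along the feature axis and project the result back to `model_dim`. The sketch below assumes that scheme; the stacked weights `W_q`, `W_k`, `W_v` and the output projection `W_o` are hypothetical names, and the shapes are chosen to mirror the `model_dim=24`, `num_heads=4`, `key_dim=3` configuration used in this notebook.

```python
import math
import torch

torch.manual_seed(0)

batch_size, N, model_dim, num_heads, key_dim = 2, 5, 24, 4, 3
X = torch.rand(batch_size, N, model_dim)

# Hypothetical stacked projections: one (model_dim, key_dim) matrix per head.
W_q = torch.rand(num_heads, model_dim, key_dim)
W_k = torch.rand(num_heads, model_dim, key_dim)
W_v = torch.rand(num_heads, model_dim, key_dim)
W_o = torch.rand(num_heads * key_dim, model_dim)   # output projection

# Add a head axis to X so it broadcasts against the per-head weights.
Q = X.unsqueeze(1) @ W_q   # (batch, heads, N, key_dim)
K = X.unsqueeze(1) @ W_k
V = X.unsqueeze(1) @ W_v

mask = torch.triu(torch.full((N, N), -torch.inf), diagonal=1)
scores = Q @ K.transpose(-2, -1) / math.sqrt(key_dim) + mask   # (batch, heads, N, N)
heads = torch.softmax(scores, dim=-1) @ V                       # (batch, heads, N, key_dim)

# Concatenate the heads along the feature axis, then project back to model_dim.
concat = heads.transpose(1, 2).reshape(batch_size, N, num_heads * key_dim)
out = concat @ W_o   # (batch, N, model_dim)
```

The output has the same shape as the input, which is what allows attention layers to be stacked.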

In [8]:
from myTransformer import *
batch_size = 10
N = 10
model_dim = 24
enc_out_dim = 20
num_heads = 4
key_dim = 3

M = 8
X = torch.rand((batch_size, N, model_dim)) # batch of 10 sequences, each with N=10 tokens of dimension model_dim=24
Y = torch.rand((batch_size, M, enc_out_dim)) # stand-in encoder output: M=8 tokens of dimension enc_out_dim=20
# causal mask: 0 where token i may attend to token j (j <= i), -inf for future tokens
mask = torch.tensor([[0 if i>= j else -torch.inf for j in range(N)] for i in range(N)])

multihead_attention = AttentionLayer(model_dim=model_dim, key_dim=key_dim, num_heads=num_heads)
multihead_attention(X, mask=mask).shape
#multihead_attention.to("cuda")
#multihead_attention(X.to("cuda"), Y.to("cuda"), mask=mask.to("cuda"))

torch.Size([10, 10, 24])
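
The list-comprehension mask in the cell above can also be built with `torch.triu`. The sketch below checks that the two constructions agree and shows what the mask does to a softmax over uniform scores; the uniform scores are an assumption made purely for illustration.

```python
import torch

N = 10

# Mask as built in the cell above: 0 on and below the diagonal, -inf above it.
loop_mask = torch.tensor([[0 if i >= j else -torch.inf for j in range(N)] for i in range(N)])

# Equivalent vectorised construction.
triu_mask = torch.triu(torch.full((N, N), -torch.inf), diagonal=1)
assert torch.equal(loop_mask, triu_mask)

# With all scores equal, row i spreads its weight evenly over tokens 0..i
# and gives exactly zero weight to the masked (future) tokens.
weights = torch.softmax(torch.zeros(N, N) + triu_mask, dim=-1)
```

Because `exp(-inf)` is zero, masked positions drop out of the softmax entirely rather than just receiving a small weight.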

In [11]:
encoder_block = TransformerBlock(model_dim=model_dim, key_dim=key_dim, hidden_dim=8, num_heads=num_heads)
decoder_block = TransformerBlock(model_dim=model_dim, key_dim=key_dim, hidden_dim=8, num_heads=num_heads, enc_out_dim=enc_out_dim)
encoder_block(X,mask=mask)
decoder_block(X, Y)

tensor([[[ 0.0000,  0.8284,  0.0026,  ...,  0.1809,  0.8515, -0.1286],
         [ 0.4656,  0.0000,  0.5362,  ...,  0.9967, -0.3205,  0.5626],
         [ 1.1831,  0.5461, -0.5433,  ...,  0.7300, -0.7640,  0.6041],
         ...,
         [ 0.3099,  0.4969,  0.5268,  ...,  0.8927,  0.1966,  0.5895],
         [-0.3152,  1.2625,  0.3371,  ...,  0.5510,  0.3495,  0.2605],
         [ 0.8440,  1.2372, -0.2170,  ...,  0.1685, -0.0063,  1.4719]],

        [[ 0.1905,  0.5394, -0.1269,  ...,  0.6671,  0.0000,  0.3018],
         [ 0.4371,  1.3839,  0.1648,  ...,  0.5378, -0.4256, -0.0802],
         [ 0.4122,  1.2600,  0.4195,  ...,  1.0659,  1.0272,  0.7933],
         ...,
         [ 0.8240,  1.2874, -0.1663,  ...,  0.6100,  0.6708,  0.0000],
         [ 0.2691,  1.2808,  0.5509,  ...,  0.7116,  0.7199,  0.6429],
         [ 0.5359,  0.6229, -0.0000,  ...,  0.3471, -0.1892,  0.7101]],

        [[ 0.9261,  0.0000,  0.0732,  ...,  0.1877,  0.0228,  0.8999],
         [ 0.1428,  1.2533,  0.6701,  ...,  0

In [3]:
stack = TransformerStack(model_dim=model_dim, key_dim=key_dim, hidden_dim=8, num_heads=num_heads, num_stack=9)
stack.state_dict()
stack.train()
stack(X)

tensor([[[-0.5555, -0.7598, -0.2857,  ..., -0.1691, -2.6221,  2.4038],
         [ 0.3930, -1.4053, -0.9376,  ..., -0.0344, -0.7158,  1.3928],
         [ 3.1673, -2.3931,  1.7603,  ...,  0.1908, -2.5963,  0.7405],
         ...,
         [ 0.0000, -1.3410,  3.1878,  ..., -0.4760, -0.8723,  0.7657],
         [ 2.3476, -3.7490,  3.5743,  ...,  0.0000,  0.0000,  0.4210],
         [ 1.5333, -3.4304,  0.0000,  ..., -0.0000,  0.2820,  0.1675]],

        [[-0.0000, -1.8206,  2.8451,  ..., -0.0254, -2.9636,  0.5313],
         [ 0.1950,  0.3373, -0.2977,  ..., -0.0000, -0.0852,  0.9560],
         [ 0.3769, -0.0563,  0.0000,  ..., -0.3701, -0.2993,  3.9175],
         ...,
         [ 0.0067,  0.0478,  0.8875,  ..., -0.2956, -5.1830,  0.4894],
         [ 0.0000,  0.6846,  1.4667,  ..., -0.0000, -1.9956,  3.9785],
         [ 1.0439,  0.5325, -0.3314,  ..., -0.1789, -1.9533,  0.0884]],

        [[-0.1696,  0.5859, -1.1434,  ...,  0.7306,  0.0734,  0.4135],
         [-1.0895, -1.7640,  2.3305,  ...,  3

In [4]:
import math
import torch
from torch import nn

# From https://pytorch-tutorials-preview.netlify.app/beginner/transformer_tutorial.html
# I don't completely understand positional encoding yet, but my intuition is that
# it is analogous to how binary numbers encode integers: lower-order bits flip more frequently
# than higher-order bits, and the sinusoidal waves of different frequencies model this.
# It also takes advantage of the trigonometric addition formulas: the encoding of position p + k
# is a linear function of the encoding of position p, which supposedly
# helps the model figure out relative positioning.
# https://medium.com/thedeephub/positional-encoding-explained-a-deep-dive-into-transformer-pe-65cfe8cfe10b 
class PositionalEncoding(nn.Module):

    def __init__(self, model_dim: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, model_dim, 2) * (-math.log(10000.0) / model_dim))
        pe = torch.zeros(max_len, 1, model_dim)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, X):
        """
        Arguments:
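            X: Tensor, shape ``[seq_len, batch_size, embedding_dim]``
               (sequence-first; the other cells in this notebook use batch-first
               tensors, which would need a transpose before being passed in)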
        """
        X = X + self.pe[:X.size(0)]
        return self.dropout(X)
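
The intuition in the comments above can be checked numerically. The sketch below rebuilds the same sinusoidal table without the batch axis, then verifies two things: the encoding of position `pos + k` is obtained from the encoding of `pos` by rotating each (sin, cos) pair, and the low dimensions change sign more often than the high ones. The helper `shift` is a hypothetical name introduced here for illustration only.

```python
import math
import torch

model_dim, max_len = 24, 50
position = torch.arange(max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, model_dim, 2) * (-math.log(10000.0) / model_dim))
pe = torch.zeros(max_len, model_dim)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

def shift(pe_row, k):
    """Rotate each (sin, cos) pair by k * frequency to get the encoding k steps later."""
    s, c = pe_row[0::2], pe_row[1::2]
    sin_k, cos_k = torch.sin(k * div_term), torch.cos(k * div_term)
    out = torch.empty_like(pe_row)
    out[0::2] = s * cos_k + c * sin_k   # sin(a + b) = sin a cos b + cos a sin b
    out[1::2] = c * cos_k - s * sin_k   # cos(a + b) = cos a cos b - sin a sin b
    return out

pos, k = 7, 5
assert torch.allclose(shift(pe[pos], k), pe[pos + k], atol=1e-5)

# "Low bits flip faster": count sign changes per sine dimension across positions.
sines = pe[:, 0::2]
sign_changes = (sines[1:].sign() != sines[:-1].sign()).sum(dim=0)
assert sign_changes[0] > sign_changes[-1]
```

The rotation depends only on the offset `k`, not on `pos`, which is the property usually credited with making relative positions easy to pick up.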

In [5]:
# What I still need to do to finish the encoder-decoder architecture:
# the only difference in the decoder is the cross-attention layer,
# which is much like the self-attention layer except that it uses both the final
# H of the encoder and that of the decoder for query-key matching; thus the decoder needs to
# take in memory from the encoder.
pos = PositionalEncoding(model_dim=model_dim)
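
The cross-attention idea described in the comments above can be sketched for a single head as follows: queries come from the decoder states, while keys and values come from the encoder output. The projection names `W_q`, `W_k`, and `W_v` are hypothetical, the shapes mirror the `X` and `Y` tensors defined earlier, and this is an assumption-laden sketch rather than the layer in `myTransformer`.

```python
import math
import torch

torch.manual_seed(0)

batch_size, N, M = 10, 10, 8
model_dim, enc_out_dim, key_dim = 24, 20, 3

X = torch.rand(batch_size, N, model_dim)     # decoder-side hidden states
Y = torch.rand(batch_size, M, enc_out_dim)   # encoder output ("memory")

# Hypothetical single-head projections; the names are illustrative only.
W_q = torch.rand(model_dim, key_dim)     # queries come from the decoder
W_k = torch.rand(enc_out_dim, key_dim)   # keys come from the encoder
W_v = torch.rand(enc_out_dim, key_dim)   # values come from the encoder

Q, K, V = X @ W_q, Y @ W_k, Y @ W_v
scores = Q @ K.transpose(-2, -1) / math.sqrt(key_dim)   # (batch, N, M)
weights = torch.softmax(scores, dim=-1)                 # no causal mask: the whole memory is visible
out = weights @ V                                       # (batch, N, key_dim)
```

Note that the score matrix is `N x M` rather than square, since each of the `N` decoder positions attends over all `M` encoder positions.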


In [6]:
import random

import dac
from ICMTSMTGuitarData import *
from jupyter_audio_utils import *
import torch

model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path, weights_only=True).eval()

mono_data2 = ICMTSMTGuitarDataMono(DEFAULT_PEDAL_PROBS)
test_2 = mono_data2[random.randint(0, len(mono_data2)-1)] 

waveform, sr = test_2[0] 
play_audio(*test_2[0])

x = model.preprocess(waveform, sr)

with torch.no_grad():
    z, codes, latents, _, _ = model.encode(x.unsqueeze(dim=0))

audio_tokens = z.transpose(-2,-1) #(batch_size, seq_len, model_dim)
N = audio_tokens.shape[-2] 
model_dim = 32

# Further compress the information (or is this baseless?); the hope is that this distills the features that actually matter
linear = torch.nn.Linear(in_features=audio_tokens.shape[-1], out_features=model_dim)

key_dim = 256
hidden_dim = 512
num_heads=4
print(audio_tokens.shape)

compressed = linear(audio_tokens)
print(compressed.shape)
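# TransformerStack does not accept an `N` argument (passing N=N raised the TypeError shown in the output below)
encoder = TransformerStack(model_dim=model_dim, key_dim=key_dim, hidden_dim=256, num_heads=num_heads, num_stack=1)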
latent = encoder(compressed)

torch.Size([1, 173, 1024])
torch.Size([1, 173, 32])


TypeError: __init__() got an unexpected keyword argument 'N'

In [None]:
reverse = torch.nn.Linear(in_features=model_dim, out_features=audio_tokens.shape[-1])

noise = reverse(latent).transpose(-1, -2) 
y = model.decode(noise)


In [None]:
y_detached = y.detach().squeeze().unsqueeze(dim=0)
play_audio(y_detached, 44100)

In [None]:
from myTransformer import *

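# Blueprint only: not runnable yet (the encoder memory H1 and the language head are still undefined).
class EncoderDecoder(nn.Module):
    def __init__(
        self,
        N,
        model_dim,
        key_dim,
        dac,
        decoder_hidden_dim,
        decoder_num_stack,
        decoder_num_heads,
        decoder_vocab
    ):
        super().__init__()
        self.dac = dac  # pretrained DAC model, intended to be used as a frozen encoder
        self.positional_encoder = PositionalEncoding(model_dim=model_dim)
        # N is not passed here: TransformerStack rejected it with a TypeError earlier.
        # cross_attention is a planned flag; it is not confirmed to exist in myTransformer yet.
        self.decoder_stack = TransformerStack(
            model_dim=model_dim,
            key_dim=key_dim,
            hidden_dim=decoder_hidden_dim,
            num_heads=decoder_num_heads,
            num_stack=decoder_num_stack,
            cross_attention=True
        )
        self.language_head = None  # not implemented yet
        self.decoder_vocab = decoder_vocab
        # Assumption: reuse the causal mask from the attention demo. Registering it as a
        # buffer lets it move with the module across devices without being trained.
        causal_mask = torch.triu(torch.full((N, N), -torch.inf), diagonal=1)
        self.register_buffer("mask", causal_mask)

    def forward(self, X):  # X: (batch_size, N, model_dim)
        # Hypothesis: DAC's compression is good enough that its encoder can be reused as-is.
        # Freeze the DAC model and let my custom decoder adapt to its embeddings.
        # Use the vector codes instead of the z vectors
        # so that the size matches model_dim.
        X = self.dac.encode(X)  # TODO: encode returns a tuple (z, codes, latents, ...); pick one
        X = self.positional_encoder(X)
        H2 = self.decoder_stack(X, H1)  # TODO: H1 (encoder memory) is not defined yet
        Y = self.language_head(H2,...) # not implemented yet

        return Y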
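
The "lock the dac model" idea in the blueprint could be sketched as below: disable gradients on the pretrained part and hand only the decoder's parameters to the optimizer. Both modules here are small `nn.Linear` stand-ins, not the real DAC model or `TransformerStack`, and the loss is a dummy chosen only to make the example runnable.

```python
import torch
from torch import nn

# Hypothetical stand-ins: `frozen_encoder` plays the role of the pretrained DAC model,
# `decoder` the role of the custom decoder stack.
frozen_encoder = nn.Linear(16, 8)
decoder = nn.Linear(8, 4)

# Freeze the encoder: no gradients, and eval mode so layers like dropout are inactive.
frozen_encoder.eval()
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

# Only the decoder's parameters are optimised.
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

x = torch.rand(2, 16)
with torch.no_grad():
    memory = frozen_encoder(x)   # fixed embeddings the decoder must adapt to

loss = decoder(memory).sum()     # dummy loss for illustration
loss.backward()
optimizer.step()
```

After the backward pass, the encoder's parameters have no gradients while the decoder's do, so only the decoder is updated.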