#### Chapter 3: Coding Attention Mechanisms
###### Packages that are being used in this notebook:

In [None]:
from importlib.metadata import version

print("torch version:", version("torch"))

###### This chapter covers attention mechanisms, the engine of LLMs:

#### 3.1 The problem with modeling long sequences
###### No code in this section
###### Translating a text word by word isn't feasible due to the differences in grammatical structures between the source and target languages:

###### Prior to the introduction of transformer models, encoder-decoder RNNs were commonly used for machine translation tasks
###### In this setup, the encoder processes a sequence of tokens from the source language, using a hidden state—a kind of intermediate layer within the neural network—to generate a condensed representation of the entire input sequence:

#### 3.2 Capturing data dependencies with attention mechanisms
###### No code in this section
###### Through an attention mechanism, the text-generating decoder part of the network can selectively access all input tokens, so that some input tokens can count more than others when generating a specific output token
###### Self-attention in transformers is a technique designed to enhance input representations by enabling each position in a sequence to attend to every other position within the same sequence and weigh its relevance

#### 3.3 Attending to different parts of the input with self-attention
##### 3.3.1 A simple self-attention mechanism without trainable weights
###### This section explains a very simplified variant of self-attention, which does not contain any trainable weights
###### This is purely for illustration purposes and NOT the attention mechanism that is used in transformers
###### The next section, section 3.3.2, will extend this simple attention mechanism to implement the real self-attention mechanism
###### Suppose we are given an input sequence $x^{(1)}$ to $x^{(T)}$
######   -> The input is a text (for example, a sentence like "Your journey starts with one step") that has already been converted into token embeddings as described in chapter 2
######   -> For instance, $x^{(1)}$ is a d-dimensional vector representing the word "Your", and so forth
###### Goal: compute context vectors $z^{(i)}$ for each input sequence element $x^{(i)}$ in $x^{(1)}$ to $x^{(T)}$ (where $z^{(i)}$ and $x^{(i)}$ have the same dimension)
###### A context vector $z^{(i)}$ is a weighted sum over the inputs $x^{(1)}$ to $x^{(T)}$
######   The context vector is "context"-specific to a certain input
######   -> Instead of $x^{(i)}$ as a placeholder for an arbitrary input token, let's consider the second input, $x^{(2)}$
######   -> And to continue with a concrete example, instead of the placeholder $z^{(i)}$, we consider the second output context vector, $z^{(2)}$
######   -> The second context vector, $z^{(2)}$, is a weighted sum over all inputs $x^{(1)}$ to $x^{(T)}$ weighted with respect to the second input element, $x^{(2)}$
######   -> The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing $z^{(2)}$
######   -> In short, think of $z^{(2)}$ as a modified version of $x^{(2)}$ that also incorporates information about all other input elements that are relevant to a given task at hand
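###### As a compact summary of the list above, the weighted sum for the second context vector can be sketched as follows, writing the inputs as $x^{(1)}$ to $x^{(T)}$ and the context vector as $z^{(2)}$; the symbol $\alpha_{2j}$ for the attention weight of input $j$ with respect to the second input is an assumed notation that the text has not introduced:

```latex
z^{(2)} = \sum_{j=1}^{T} \alpha_{2j} \, x^{(j)}
```

###### Here $T$ is the sequence length, so $z^{(2)}$ has the same dimension as each input $x^{(j)}$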

###### (Please note that the numbers in this figure are truncated to one digit after the decimal point to reduce visual clutter; similarly, other figures may also contain truncated values)
###### By convention, the unnormalized attention weights are referred to as "attention scores" whereas the normalized attention scores, which sum to 1, are referred to as "attention weights"
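###### A minimal sketch of this distinction, assuming made-up score values and a simple divide-by-sum normalization (in practice, the softmax function is commonly used for this step); the variable names are illustrative only:

```python
import torch

# Hypothetical unnormalized attention scores for one query
# (values are made up purely for illustration)
attn_scores = torch.tensor([1.0, 2.0, 1.0])

# Normalize so that the resulting attention weights sum to 1
attn_weights = attn_scores / attn_scores.sum()

print("Attention scores: ", attn_scores)
print("Attention weights:", attn_weights)
print("Sum of weights:   ", attn_weights.sum())
```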
###### The code below walks through the figure above step by step

###### -> Step 1: compute unnormalized attention scores $\omega$
###### -> Suppose we use the second input token as the query, that is, $q^{(2)} = x^{(2)}$; we then compute the unnormalized attention scores via dot products:
...
###### -> Above, $\omega$ is the Greek letter "omega" used to symbolize the unnormalized attention scores
######    => The subscript "21" in $\omega_{21}$ means that input sequence element 2 was used as a query against input sequence element 1
###### -> Suppose we have the following input sentence that is already embedded as 3-dimensional vectors as described in chapter 2 (we use a very small embedding dimension here for illustration purposes, so that it fits onto the page without line breaks):

In [None]:
import torch

inputs = torch.tensor(
   [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

###### (In this book, we follow the common machine learning and deep learning convention where training examples are represented as rows and feature values as columns; in the case of the tensor shown above, each row represents a word, and each column represents an embedding dimension)
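###### To make the row/column convention concrete, the sketch below rebuilds the same tensor so that it runs on its own and then reads off its shape, one row, and one column; the names `num_tokens` and `embedding_dim` are introduced here for illustration only:

```python
import torch

# Same values as the `inputs` tensor above:
# one row per token, one column per embedding dimension
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your
     [0.55, 0.87, 0.66],  # journey
     [0.57, 0.85, 0.64],  # starts
     [0.22, 0.58, 0.33],  # with
     [0.77, 0.25, 0.10],  # one
     [0.05, 0.80, 0.55]]  # step
)

num_tokens, embedding_dim = inputs.shape
print("Tokens (rows):", num_tokens)
print("Embedding dimensions (columns):", embedding_dim)

# Indexing a row returns the full embedding of one token
print("Embedding of 'journey':", inputs[1])

# Indexing a column returns one embedding dimension across all tokens
print("First dimension of every token:", inputs[:, 0])
```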

###### The primary objective of this section is to demonstrate how the context vector $z^{(2)}$ is calculated using the second input sequence element, $x^{(2)}$, as a query

###### The figure depicts the initial step in this process, which involves calculating the attention scores $\omega$ between $x^{(2)}$ and every input element (including $x^{(2)}$ itself) through a dot product operation

###### We use input sequence element 2, $x^{(2)}$, as an example to compute context vector $z^{(2)}$; later in this section, we will generalize this to compute all context vectors.
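###### As a small illustration of a single attention score, the sketch below computes the dot product between the first input ("Your") and the second input ("journey", acting as the query) in two equivalent ways; the variable names `x_1`, `x_2`, and `omega_21` are chosen here for illustration and the embedding values are copied from the tensor above:

```python
import torch

x_1 = torch.tensor([0.43, 0.15, 0.89])  # "Your"
x_2 = torch.tensor([0.55, 0.87, 0.66])  # "journey", used as the query

# A dot product is the sum of the element-wise products
omega_21_manual = (x_1 * x_2).sum()

# The same computation using torch.dot
omega_21 = torch.dot(x_1, x_2)

print("Element-wise sum:", omega_21_manual)
print("torch.dot:       ", omega_21)  # approximately 0.9544
```

###### A larger dot product means the two vectors point in more similar directions, which is why it can serve as an unnormalized measure of how strongly the query relates to another input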

###### The first step is to compute the unnormalized attention scores by computing the dot product between the query $x^{(2)}$ and every input token, including the query token itself:

In [None]:
query = inputs[1] # 2nd token is the query

attn_scores_2 = torch.empty(inputs.shape[0])
