#### Chapter 3: Coding Attention Mechanisms
###### Packages that are being used in this notebook:

In [None]:
from importlib.metadata import version

print("torch version:", version("torch"))

###### This chapter covers attention mechanisms, the engine of LLMs

#### 3.1 The problem with modeling long sequences
###### No code in this section
###### Translating a text word by word isn't feasible because the source and target languages differ in grammatical structure

###### Prior to the introduction of transformer models, encoder-decoder RNNs were commonly used for machine translation tasks
###### In this setup, the encoder processes a sequence of tokens from the source language and uses a hidden state (a kind of intermediate layer within the neural network) to generate a condensed representation of the entire input sequence

#### 3.2 Capturing data dependencies with attention mechanisms
###### No code in this section
###### Through an attention mechanism, the text-generating decoder segment of the network can selectively access all input tokens, which implies that some input tokens matter more than others when generating a specific output token
###### Self-attention in transformers is a technique designed to enhance input representations by enabling each position in a sequence to engage with and determine the relevance of every other position within the same sequence

#### 3.3 Attending to different parts of the input with self-attention
##### 3.3.1 A simple self-attention mechanism without trainable weights
###### This section explains a very simplified variant of self-attention, which does not contain any trainable weights
###### This is purely for illustration purposes and NOT the attention mechanism that is used in transformers
###### The next section, section 3.3.2, will extend this simple attention mechanism to implement the real self-attention mechanism
###### Suppose we are given an input sequence $x^{(1)}$ to $x^{(T)}$
######   -> The input is a text (for example, a sentence like "Your journey starts with one step") that has already been converted into token embeddings as described in chapter 2
######   -> For instance, $x^{(1)}$ is a d-dimensional vector representing the word "Your", and so forth
###### Goal: compute context vectors $z^{(i)}$ for each input sequence element $x^{(i)}$ in $x^{(1)}$ to $x^{(T)}$ (where $z$ and $x$ have the same dimension)
###### A context vector $z^{(i)}$ is a weighted sum over the inputs $x^{(1)}$ to $x^{(T)}$
######   The context vector is "context"-specific to a certain input
######   -> Instead of $x^{(i)}$ as a placeholder for an arbitrary input token, let's consider the second input, $x^{(2)}$
######   -> And to continue with a concrete example, instead of the placeholder $z^{(i)}$, we consider the second output context vector, $z^{(2)}$
######   -> The second context vector, $z^{(2)}$, is a weighted sum over all inputs $x^{(1)}$ to $x^{(T)}$ weighted with respect to the second input element, $x^{(2)}$
######   -> The attention weights are the weights that determine how much each of the input elements contributes to the weighted sum when computing $z^{(2)}$
######   -> In short, think of $z^{(2)}$ as a modified version of $x^{(2)}$ that also incorporates information about all other input elements that are relevant to a given task at hand
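###### The weighted sum described above can be sketched in a few lines of PyTorch; this is a minimal illustration rather than the definitive implementation
######   -> Here `inputs[1]` plays the role of the second input embedding, $x^{(2)}$, and `context_vec_2` plays the role of the second context vector, $z^{(2)}$
######   -> Assumptions not stated in the text above: the 3-dimensional embedding values are illustrative, the relevance of each input to the second input is scored with a dot product, and the scores are normalized with a softmax so that the attention weights sum to 1

```python
import torch

# Illustrative 3-dimensional embeddings for the six tokens of
# "Your journey starts with one step" (values are assumptions, not learned)
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your     x^(1)
     [0.55, 0.87, 0.66],  # journey  x^(2)
     [0.57, 0.85, 0.64],  # starts   x^(3)
     [0.22, 0.58, 0.33],  # with     x^(4)
     [0.77, 0.25, 0.10],  # one      x^(5)
     [0.05, 0.80, 0.55]]  # step     x^(6)
)

# The second input element is the one we compute a context vector for
query = inputs[1]

# Assumption: score each input against the query with a dot product
scores_2 = inputs @ query  # shape: (6,)

# Assumption: normalize the scores with softmax to get attention weights
weights_2 = torch.softmax(scores_2, dim=0)  # non-negative, sums to 1

# Context vector: sum of all inputs, each scaled by its attention weight
context_vec_2 = weights_2 @ inputs  # shape: (3,), same dimension as the inputs

print("Attention weights:", weights_2)
print("Context vector:", context_vec_2)
```

###### Note that `context_vec_2` has the same dimension as each input embedding, and that no trainable weights are involved in this sketch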