__Embeddings for an LLM__

This notebook is a simple introduction to the embeddings used in large language models (LLMs). LLMs interpret text through self attention, which allows the model to understand a word based on its surrounding context.

In order to understand context, the words in a sentence are first transformed into numerical vectors called embeddings. This notebook covers two key types of embeddings.

__One Hot Encoding__

In One Hot Encoding, each word is converted into a unique vector that contains a single 1 and 0 everywhere else.

For example, the words cat, dog and mouse would be represented as: 

* cat = [1,0,0]
* dog = [0,1,0]
* mouse = [0,0,1]
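
The vectors in the list above can be generated with a few lines of NumPy. This is a minimal sketch, assuming a three-word vocabulary held in a hypothetical `vocab` list. It also shows a limitation of the encoding: the dot product of any two different one-hot vectors is 0, so the vectors carry no information about how related two words are.

```python
import numpy as np

# Hypothetical three-word vocabulary, matching the list above.
vocab = ["cat", "dog", "mouse"]

# Row i of the identity matrix is the one-hot vector for word i.
one_hot = np.eye(len(vocab), dtype=int)

for word, vector in zip(vocab, one_hot):
    print(f"{word} = {vector.tolist()}")

# Any two different one-hot vectors have a dot product of 0,
# so this encoding says nothing about how related two words are.
cat_dot_dog = int(one_hot[0] @ one_hot[1])
print("cat . dog =", cat_dot_dog)
```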



__Dense Embeddings__

Dense embeddings represent each word as a vector of real numbers, which makes it possible to quantify the relationships between words. In self attention, each dense embedding is projected into the following three vectors.

* Q = Query: represents the current word's question. It essentially asks "What am I looking for?"
* K = Key: represents the label of each word. "This is the information I hold."
* V = Value: represents the actual meaning of the word.

This structure allows dense embeddings to provide far more information about a word to an LLM than traditional One Hot Encoding.
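
To make the contrast with One Hot Encoding concrete, the sketch below compares dense vectors using cosine similarity. The three vectors are hand-picked for illustration and are not taken from any trained model; `dense` and `cosine_similarity` are hypothetical names used only in this example.

```python
import numpy as np

# Hand-picked illustrative vectors (assumption: not taken from any trained model).
dense = {
    "cat":   np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.8, 0.9, 0.2]),
    "mouse": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_cat_dog = cosine_similarity(dense["cat"], dense["dog"])
sim_cat_mouse = cosine_similarity(dense["cat"], dense["mouse"])

print(f"cat vs dog:   {sim_cat_dog:.2f}")
print(f"cat vs mouse: {sim_cat_mouse:.2f}")
```

With these made-up numbers, "cat" ends up closer to "dog" than to "mouse". One-hot vectors cannot express this kind of graded similarity.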


The following cells demonstrate how dense embeddings are made through a simple example. First, we will import the packages needed. 

In [302]:
# Packages 
import numpy as np

The following example sentence will be analyzed in this notebook. __"Carlos Alcaraz is a very skilled tennis player and philanthropist"__. 

First, the sentence is broken down into tokens. Here, each token is an individual word obtained by splitting the sentence on whitespace.

In [303]:
sentence = "Carlos Alcaraz is a very skilled tennis player and philanthropist"
tokens = sentence.split()

print("Tokens: \n",tokens)
print(f"\n Number of tokens: {len(tokens)}")

Tokens: 
 ['Carlos', 'Alcaraz', 'is', 'a', 'very', 'skilled', 'tennis', 'player', 'and', 'philanthropist']

 Number of tokens: 10


These tokens are then converted into vectors. In this example, we will create random vectors to represent each word. In LLMs, these embeddings come from a trained model and have hundreds or thousands of dimensions. In this example, each embedding will have only 8 dimensions (`d_model = 8`).

Example: 
- Models such as ChatGPT do not compute these embeddings from scratch for every request. The embeddings were learned during training, so at run time the model only has to look them up.
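
The lookup idea can be sketched as plain row indexing into a table. This is a toy illustration under stated assumptions: `vocab` is a hypothetical six-word vocabulary, and `embedding_table` is filled with random numbers standing in for values that a real model would have learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny vocabulary; real models typically have tens of thousands of entries.
vocab = {"Carlos": 0, "Alcaraz": 1, "is": 2, "a": 3, "tennis": 4, "player": 5}
d_model = 8

# Stand-in for a table learned during training: one row per vocabulary entry.
embedding_table = rng.random((len(vocab), d_model))

# Looking up embeddings is just row indexing, with no per-word computation.
words = ["Carlos", "Alcaraz", "is", "a", "tennis", "player"]
token_ids = np.array([vocab[w] for w in words])
looked_up = embedding_table[token_ids]

print("Token ids:", token_ids)
print("Shape of looked-up embeddings:", looked_up.shape)
```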


The following cell generates random vectors for each word.

In [304]:
d_model = 8
n_tokens = len(tokens)

embeddings = np.random.rand(n_tokens, d_model)
# include the word next to each embedding
for i, token in enumerate(tokens):
    print(f"Embedding for the word '{token}':\n", embeddings[i], "\n")


Embedding for the word 'Carlos':
 [0.34233201 0.70782322 0.58654392 0.45896529 0.19870387 0.2447379
 0.78724402 0.2299572 ] 

Embedding for the word 'Alcaraz':
 [0.80863703 0.35922496 0.75924621 0.53979024 0.55067224 0.28280501
 0.00340947 0.69813302] 

Embedding for the word 'is':
 [0.50789885 0.00332485 0.60721943 0.37638575 0.00801276 0.29489882
 0.9204668  0.53738749] 

Embedding for the word 'a':
 [0.37905786 0.96145295 0.81727328 0.07657175 0.20016525 0.83958062
 0.42651903 0.4139435 ] 

Embedding for the word 'very':
 [0.28268319 0.0743492  0.34933177 0.44806844 0.44347801 0.28984714
 0.4892609  0.97517145] 

Embedding for the word 'skilled':
 [0.11465447 0.69000339 0.2034245  0.23387701 0.43350531 0.52350151
 0.54595507 0.48093439] 

Embedding for the word 'tennis':
 [0.88419703 0.92647658 0.32072877 0.11380039 0.90957333 0.07126183
 0.27856685 0.20602544] 

Embedding for the word 'player':
 [0.96184696 0.5915835  0.36829327 0.22900793 0.35531794 0.79177672
 0.34428773 0.418319

The next step is to create the Query, Key and Value weight matrices (`W_Q`, `W_K` and `W_V`), which project the simple embeddings into Query, Key and Value vectors. At the start of training, these matrices are randomly initialized, and their values are gradually adjusted through training.

In [306]:
# The dimension for our Q, K, and V vectors. It can be different from d_model.
d_k = 6
# These matrices transform the input embeddings into the Q, K, and V spaces.
W_Q = np.random.rand(d_model, d_k)
W_K = np.random.rand(d_model, d_k)
W_V = np.random.rand(d_model, d_k)

In order to derive the Query, Key and Value vectors for each word, the word's embedding is multiplied by each of the weight matrices `W_Q`, `W_K` and `W_V`.


For each word:
- Query Vector = word embedding x W_Q
- Key Vector = word embedding x W_K
- Value Vector = word embedding x W_V

In [307]:
# Project the embeddings into Q, K, and V spaces
Query = embeddings @ W_Q
Key = embeddings @ W_K
Value = embeddings @ W_V

print("Shape of Query matrix:", Query.shape)
print("Query \n", Query)

Shape of Query matrix: (10, 6)
Query 
 [[1.26535811 1.85957245 1.44507856 1.31112859 1.36775768 1.60020967]
 [1.39511514 2.10762303 2.27040751 1.29268469 1.37557501 1.67435142]
 [1.15327443 1.86268832 1.67460328 0.76079086 1.52265387 1.48942019]
 [1.50726343 1.83887757 1.839101   1.85747175 1.14552448 1.65543575]
 [1.1911837  1.97270956 2.16924636 1.05457209 1.24983044 1.36650191]
 [1.17377374 1.62665185 1.70560996 1.46941051 0.85771377 1.23256189]
 [1.27789966 1.73206948 1.65986651 1.29687169 1.16203942 1.56042428]
 [1.06003181 1.41323641 1.41304266 1.10428475 0.95904985 1.28960101]
 [1.22735991 1.81388128 1.9411265  1.12634582 1.22361666 1.50542652]
 [1.738504   2.41037611 2.35744163 1.24566756 1.73405956 2.1560807 ]]
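
As a sanity check on the projection above, the following sketch shows that row `i` of `embeddings @ W_Q` is the same as multiplying the single embedding `embeddings[i]` by `W_Q`. It uses fresh random stand-in values with the same shapes as the notebook's arrays, so the numbers will not match the output printed above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 10, 8, 6

# Stand-in values with the same shapes as the notebook's arrays.
embeddings = rng.random((n_tokens, d_model))
W_Q = rng.random((d_model, d_k))

# Project every token at once: (10, 8) @ (8, 6) -> (10, 6).
Query = embeddings @ W_Q

# Project only the token at index 6: (8,) @ (8, 6) -> (6,).
query_single = embeddings[6] @ W_Q

print("Query shape:", Query.shape)
print("Single query shape:", query_single.shape)
print("Rows match:", np.allclose(Query[6], query_single))
```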


__Self Attention Scores__ 

In order to figure out how much attention a word should pay to every word in the sentence (including itself), we take the dot product of its Query Vector with the Key Vector of each word. A higher score indicates higher relevance.


Query: What is this word looking for ? 

Key: What meaning does this word contain ? 

- Self Attention Scores = Query Vector of the word x Key Vectors of all words
- $\text{Attention Scores} = \frac{Q K^{T}}{\sqrt{d_k}}$
- Note, this product is divided by the square root of the dimensionality of the Key vectors ($d_k$) to prevent the numbers from growing too large.
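
A rough way to see why the scaling helps is to measure how the spread of dot products changes with the vector dimension. The sketch below assumes, purely for illustration, that Query and Key entries are drawn from a standard normal distribution (the notebook itself uses uniform random values). The hypothetical `spread` dictionary stores the standard deviation of the raw and the scaled dot products for each dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000
spread = {}  # d_k -> (std of raw dot products, std of scaled dot products)

for d_k in (4, 64, 256):
    # Assumption for illustration: entries drawn from a standard normal distribution.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))

    raw = np.sum(q * k, axis=1)   # one dot product per sample
    scaled = raw / np.sqrt(d_k)   # the scaling used in the attention formula

    spread[d_k] = (raw.std(), scaled.std())
    print(f"d_k={d_k:>3}  std(raw)={raw.std():6.2f}  std(scaled)={scaled.std():4.2f}")
```

Under this assumption, the raw dot products spread out more as `d_k` grows, while the scaled ones stay at roughly the same scale. Keeping the scores in a moderate range matters because SoftMax turns very large differences into weights that are almost entirely concentrated on one word.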

In [308]:
# Calculate raw scores by multiplying Query vectors with Key vectors
scores = (Query @ Key.T) / np.sqrt(d_k)

print("\nRaw scores for the word 'tennis':\n\n", scores[6])


Raw scores for the word 'tennis':

 [4.86740936 5.77822634 4.51954615 6.09489065 4.96068668 4.80272987
 5.01459764 4.17289351 4.69234047 6.33613172]


The next step is to apply the SoftMax function to turn the raw scores into attention weights, since the raw scores are difficult to interpret.

The SoftMax function scales numbers into weights that add up to 1. Therefore, a weight of 0.6 means that a word dedicates 60% of its attention to another word.


In [309]:
print("Original weights for 'tennis' (row 6):")
print(np.round(scores[6], 2))

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

attention_weights = softmax(scores)

print("\nLet's check the new weights for 'tennis' (row 6):")
print(np.round(attention_weights[6], 2))

# print the new weights for tennis showing the other tokens 
print("\nAttention weights for 'tennis' (row 6) with corresponding tokens:")
for token, weight in zip(tokens, np.round(attention_weights[6], 2)):
    print(f"{token}: {weight}")

Original weights for 'tennis' (row 6):
[4.87 5.78 4.52 6.09 4.96 4.8  5.01 4.17 4.69 6.34]

Let's check the new weights for 'tennis' (row 6):
[0.06 0.15 0.04 0.21 0.07 0.06 0.07 0.03 0.05 0.26]

Attention weights for 'tennis' (row 6) with corresponding tokens:
Carlos: 0.06
Alcaraz: 0.15
is: 0.04
a: 0.21
very: 0.07
skilled: 0.06
tennis: 0.07
player: 0.03
and: 0.05
philanthropist: 0.26
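
The `softmax` function above subtracts the row maximum before exponentiating. This does not change the result, but it avoids numerical overflow when the scores are large. The sketch below compares it with a direct implementation; `softmax_naive`, `softmax_stable` and the `large` scores are hypothetical names and values chosen for this illustration.

```python
import numpy as np

def softmax_naive(x):
    # Direct definition: exponentiate, then normalize.
    e_x = np.exp(x)
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def softmax_stable(x):
    # Same approach as the notebook's softmax: subtract the max first.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

small = np.array([1.0, 2.0, 3.0])
large = np.array([1000.0, 1001.0, 1002.0])  # hypothetical large scores

# Both versions agree on small inputs.
print("Agree on small inputs:", np.allclose(softmax_naive(small), softmax_stable(small)))

# On large inputs the naive version overflows to inf and produces nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)
stable_large = softmax_stable(large)

print("naive :", naive_large)
print("stable:", np.round(stable_large, 3))
```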


Based on the results above, the word 'tennis' would pay the most attention to the words 'philanthropist' and 'a'. Note that this may change when the notebook is rerun, as the vectors are randomly generated each time the notebook is run.

Lastly, we take the weighted sum of all Value vectors, using the attention weights. Therefore, words with higher attention weights contribute more to the final value vector for the current word. This final output is a __context aware__ representation of each word. 

- Value vector represents the actual meaning of the word

In [312]:
# The final output is a weighted sum of the Value vectors
output = attention_weights @ Value

print("\nOriginal embedding for 'tennis':\n", np.round(embeddings[6], 2))
print("\nNew context-aware output vector for 'tennis':\n", np.round(output[6], 2))


Original embedding for 'tennis':
 [0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1]

New context-aware output vector for 'tennis':
 [1.46 1.7  1.85 1.69 1.37 1.91]
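
The weighted sum can also be worked out by hand on a tiny example. The sketch below uses made-up attention weights and Value vectors for three words, and shows that writing out the sum explicitly gives the same result as the matrix product `attention_weights @ Value` used above.

```python
import numpy as np

# Hypothetical attention weights of one word over three words (they sum to 1).
weights = np.array([0.7, 0.2, 0.1])

# Hypothetical Value vectors, one row per word.
values = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

# Weighted sum written out explicitly...
explicit = 0.7 * values[0] + 0.2 * values[1] + 0.1 * values[2]

# ...and as the matrix product used in the notebook.
output_row = weights @ values

print("Explicit sum:  ", explicit)
print("Matrix product:", output_row)
```

The first word has the largest weight, so its Value vector contributes the most to the result.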


__Conclusion & Real Life Applications__ 

This simple mechanism of representing words through context aware vectors is the building block for LLMs. Advanced LLMs use multi-head attention, which doesn't ask just one question (Query) but many questions, such as "What is the grammar related to this word?" and "What is the topic of this word?". This is like having multiple people analyze the same sentence from different perspectives.
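
A minimal sketch of multi-head attention, built from the same steps used in this notebook, might look as follows. The names `attention_head`, `n_heads`, `d_head` and `W_O` are assumptions introduced for this example, and all weights are random stand-ins for values a real model would learn. Giving each head a dimension of `d_model // n_heads` is a common convention, not a requirement.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def attention_head(x, W_Q, W_K, W_V):
    """One self-attention head, following the steps in this notebook."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 10, 8, 2
d_head = d_model // n_heads  # each head works in a smaller space

x = rng.random((n_tokens, d_model))  # stand-in embeddings

# Each head gets its own randomly initialized projection matrices.
head_outputs = [
    attention_head(x, *(rng.random((d_model, d_head)) for _ in range(3)))
    for _ in range(n_heads)
]

# Concatenate the heads, then mix them with an output projection W_O.
W_O = rng.random((n_heads * d_head, d_model))
multi_head_output = np.concatenate(head_outputs, axis=-1) @ W_O

print("Per-head output shape:", head_outputs[0].shape)
print("Multi-head output shape:", multi_head_output.shape)
```

Because each head has its own `W_Q`, `W_K` and `W_V`, each one can learn to ask a different question about the same sentence.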

LLMs like ChatGPT or Gemini stack multiple layers of what we just did. Therefore, the context rich output from layer 1 becomes the input for the second layer and so on. By the time the text has passed through all of the layers, the model has built a rich, layered representation of the relationships between the words. This allows an LLM to generate coherent and relevant responses.
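
The stacking idea can be sketched by feeding the output of one attention step into the next. This is a simplified illustration, not how any particular model is implemented: `attention_layer` and `n_layers` are hypothetical names, the weights are random rather than learned, and the other components of a real transformer layer (residual connections, normalization and feed-forward sublayers) are left out.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def attention_layer(x, rng):
    """Single-head self-attention whose output has the same shape as x."""
    d_model = x.shape[-1]
    # Random stand-ins for weights that a real model would learn.
    W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    weights = softmax((Q @ K.T) / np.sqrt(d_model))
    return weights @ V

rng = np.random.default_rng(0)
x = rng.random((10, 8))  # stand-in embeddings: 10 tokens, d_model = 8

n_layers = 3  # hypothetical depth; production models use many more
hidden = x
for layer in range(n_layers):
    # The output of one layer is the input of the next.
    hidden = attention_layer(hidden, rng)
    print(f"After layer {layer + 1}: shape {hidden.shape}")
```

Keeping the output the same shape as the input is what makes it possible to chain the layers.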