## **What are Embedding Models in LLMs? How do they work?**

The world of generative AI is growing at a rapid pace, and large language models (LLMs) are the engines driving it. We are at the tipping point (or maybe beyond) of another IT revolution.

That being said, it becomes even more crucial to understand the magician's trick that makes these large language models work. 

In this tutorial, we will try to understand the core concepts around "embeddings" that power LLMs.
Embeddings, also called "vector embeddings", allow LLMs to develop a semantic understanding of the textual data they are trained on. In simpler terms, embedding models lay the groundwork for LLMs to perform tasks like sentence completion, similarity search, and question answering.

But before we dive into the concepts of "Vector Embeddings", it is important to establish a baseline understanding of a few terms below:

- What is a Vector? 
- What are Tokens?
- What are Tokenizers?

#### **What is a Vector?**

At the lowest level, machines only work with numeric values. Yet we all know that models like ChatGPT perform decently well on natural human language (like English).

How are they able to do so? 

Natural-language text is converted into arrays of numeric values before it is fed into the complex algorithms that power LLMs. Each of these arrays of numeric values is called a "vector".

An example of a vector: [2.5, 1.7, 3.3, 7.8]  
The above is a vector of size 4 (also called a 4-dimensional vector).


```python
import numpy as np

vector = np.array([2.5, 1.7, 3.3, 7.8])
print(f"Vector: {vector}")
```

```text
Vector: [2.5 1.7 3.3 7.8]
```


#### **What are Tokens?**

We stated above that **"text is converted into numeric arrays called vectors"**.

But what does that mean?

- Does it mean that the whole document is converted into a vector?
- Does it mean that each paragraph is converted into a vector?
- Does it mean that each sentence is converted into a vector?
- Does it mean that each word is converted into a vector?

Answer: **IT DEPENDS!!**

A token is the unit of text that gets converted into a vector. Depending on the approach, that unit could be a document, a paragraph, a sentence, a word, a sub-word, or a single character. Let us look at a few examples below:

Example: Consider the text below.

"Earth is a planet of the solar system. There are 9 planets in the solar system. 
All planets revolve around the sun. Sun is a star."


- Case 1: **Tokenizing the entire paragraph into a single vector.**  
Tokenization: The whole paragraph is one single token.  
Vectorization: A single vector.  
Sample Vector Representation: [3.1, 6.8, 5.4, 8.0, 7.1]

- Case 2: **Tokenizing each sentence into a vector.**  
Tokenization: One token for each sentence (total 4 tokens).  
Vectorization: One vector for each sentence (total 4 vectors).  
Sample Vector Representation: [[1.2, 2.3, 3.8, 7.9, 0.8], [2.5, 3.0, 8.2, 6.6, 4.1], [3.2, 6.5, 8.1, 9.3, 1.4], [1.1, 0.7, 7.2, 3.5, 8.5]]

- Case 3: **Tokenizing each word in the paragraph into a vector.** There are 26 words in the paragraph, ignoring punctuation, and each word is converted into a vector.  
Tokenization: One token for each word in the paragraph (total 26 tokens).  
Vectorization: One vector for each token (total 26 vectors).  
Sample Vector Representation: [[2.1, 3.2, 4.1, 9.8, 7.0], [8.2, 4.2, 7.1, 3.8, 2.0], ... 26 such vectors in total]
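
The three cases above can be reproduced with a few lines of plain Python. This is only a rough sketch: the regular expressions are a naive assumption (they would mishandle abbreviations such as "Dr."), and real tokenizers use more careful rules.

```python
import re

text = (
    "Earth is a planet of the solar system. "
    "There are 9 planets in the solar system. "
    "All planets revolve around the sun. Sun is a star."
)

# Case 1: the whole paragraph is a single token.
paragraph_tokens = [text]

# Case 2: split on the whitespace that follows sentence-ending punctuation.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text.strip())

# Case 3: every run of letters, digits, or underscores is a word token;
# punctuation is dropped.
word_tokens = re.findall(r"\w+", text)

print(len(paragraph_tokens), len(sentence_tokens), len(word_tokens))
# prints: 1 4 26
```

Each token in each list would then be mapped to its own vector, which is where the counts of 1, 4, and 26 vectors come from.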


#### **What are Tokenizers?**

Tokenizers are the components responsible for converting text into tokens (tokenization). Several pre-trained tokenizers are available, and you can also train your own. For the scope of this tutorial, we will use a pre-trained one.

Generally, each tokenizer follows these steps:

1. Break the original text down into tokens. These tokens could be any unit (sentence, word, sub-word, character, etc.) depending on the tokenizer.
2. Assign a token id to each of the tokens created.
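
To make the two steps concrete, here is a toy tokenizer in plain Python. It is only an illustration, not how a pre-trained tokenizer works internally: `toy_tokenize`, `build_vocab`, and `toy_convert_tokens_to_ids` are hypothetical helpers, and the ids are simply assigned in order of first appearance.

```python
import re

def toy_tokenize(text):
    """Step 1: lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(corpus_tokens):
    """Give every distinct token an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"[UNK]": 0}
    for token in corpus_tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

def toy_convert_tokens_to_ids(tokens, vocab):
    """Step 2: look up each token's id, falling back to the [UNK] id."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

# Build the vocabulary from one sentence, then tokenize a different one.
toy_vocab = build_vocab(toy_tokenize("Earth is a planet in the solar system."))
toy_tokens = toy_tokenize("Earth is a star.")
toy_ids = toy_convert_tokens_to_ids(toy_tokens, toy_vocab)

print(toy_tokens)  # ['earth', 'is', 'a', 'star', '.']
print(toy_ids)     # [1, 2, 3, 0, 9]  ("star" is not in the vocabulary, so it maps to 0)
```

A pre-trained tokenizer follows the same two steps, but it ships with a fixed vocabulary learned from a large corpus instead of building one on the fly.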

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Earth is a planet in the solar system.")
print(tokens)
```

```text
['earth', 'is', 'a', 'planet', 'in', 'the', 'solar', 'system', '.']
```


```python
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
```

```text
[3011, 2003, 1037, 4774, 1999, 1996, 5943, 2291, 1012]
```


Great work so far! We now have a basic understanding of vectors and tokens. But you probably have a few questions in mind:

Q. Why are these tokens converted into vectors?  
Q. How are these tokens converted into vectors? Is there a formula for converting tokens into vectors?  
Q. What determines the size (dimensions) of the vector created from these tokens?  

To answer these questions, let's jump to the next topic: **Embedding Models**

#### **What are Embedding Models?**

A language model needs to understand how tokens relate to each other in the context of human language. To capture these semantic relationships, tokens are converted into numerical vectors.

Embedding models are trained on these tokens to develop an "embedding space".

- Before training: the embedding model initializes an N-dimensional vector with random values for each token. (The value of N depends on the embedding model.)
  
- During training: the values of these vectors are updated across iterations. In this process, tokens that are similar or related end up with similar vectors.
  
- After training: the collection of all the vectors corresponding to all the tokens is called the "embedding space".

- The "embedding space" is an encoded representation of the meanings of tokens and the relationships between them.
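
The lifecycle described above can be sketched with a tiny embedding table in plain Python. Everything here is an assumption made for illustration: `tiny_vocab`, `EMBEDDING_DIM`, and especially `nudge_together`, which is a crude stand-in for a training update and not how any real embedding model is trained. Real models adjust the vectors by minimizing a loss over large amounts of text.

```python
import random

EMBEDDING_DIM = 4  # N; a hypothetical small value chosen for readability
tiny_vocab = {"[UNK]": 0, "earth": 1, "planet": 2, "hotdog": 3}

rng = random.Random(0)  # fixed seed so the random start is reproducible

# Before training: one random N-dimensional vector per token id.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(len(tiny_vocab))
]

def embed(token):
    """Look up the vector for a token, falling back to the [UNK] row."""
    return embedding_table[tiny_vocab.get(token, tiny_vocab["[UNK]"])]

def squared_distance(token_a, token_b):
    """Sum of squared differences between the two tokens' vectors."""
    return sum((a - b) ** 2 for a, b in zip(embed(token_a), embed(token_b)))

def nudge_together(token_a, token_b, step=0.1):
    """Stand-in for one training update: move two related vectors toward each other."""
    vec_a, vec_b = embed(token_a), embed(token_b)
    for i in range(EMBEDDING_DIM):
        diff = vec_b[i] - vec_a[i]
        vec_a[i] += step * diff
        vec_b[i] -= step * diff

# During training: related tokens are repeatedly pulled closer together.
before = squared_distance("earth", "planet")
for _ in range(10):
    nudge_together("earth", "planet")
after = squared_distance("earth", "planet")

print(f"distance before: {before:.4f}, after: {after:.4f}")
```

After the loop, the rows of `embedding_table` play the role of the embedding space: "earth" and "planet" sit closer together than they did at the random start, while "hotdog" was never moved.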


As mentioned above, to create a final vector embedding from a token we need to train a model. But similar to pre-trained tokenizers, there are several pre-trained embedding models as well. For this example, we will use an existing pre-trained model.




```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")


print("Tokens: ", tokens)  # ['earth', 'is', 'a', 'planet', 'in', 'the', 'solar', 'system', '.']
print("Token Ids: ", token_ids)  # [3011, 2003, 1037, 4774, 1999, 1996, 5943, 2291, 1012]

# Get the embedding vector for the word "earth" (the first token).
# word_embeddings is BERT's input lookup table: one fixed vector per token id,
# before any context from the surrounding tokens is applied.
example_vector_embedding = model.embeddings.word_embeddings(torch.tensor([token_ids[0]]))

print(example_vector_embedding.shape)
# print(example_vector_embedding)  # a 768-dimensional vector representation of the word "earth"
```

```text
Tokens:  ['earth', 'is', 'a', 'planet', 'in', 'the', 'solar', 'system', '.']
Token Ids:  [3011, 2003, 1037, 4774, 1999, 1996, 5943, 2291, 1012]
torch.Size([1, 768])
```


In an embedding space, tokens that are similar to each other have similar vectors. One common technique for scoring this similarity is "cosine similarity".  

Cosine similarity is the cosine of the angle between two N-dimensional vectors: their dot product divided by the product of their lengths. It ranges from -1 to 1.
- If the value is close to 1, the vectors point in nearly the same direction, which means the tokens are similar.
- If the value is close to -1, the vectors point in nearly opposite directions, which means the tokens are dissimilar.
- If the value is close to 0, the vectors are nearly perpendicular, which means the tokens are unrelated.
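
As a rough sketch, the calculation can be written out by hand in plain Python. The function name `cosine_similarity` and the two-dimensional sample vectors are chosen here only for illustration; they are not embeddings from a real model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Same direction (one vector is a scaled copy of the other): close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Opposite direction: close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

# Perpendicular vectors: 0.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(round(same_direction, 4), round(opposite, 4), round(perpendicular, 4))
```

Because the result depends only on the angle between the vectors, scaling a vector (as in the first pair) does not change its similarity score.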


Let's calculate the cosine similarity score for a few pairs of tokens.

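```python
# Look up the token id for each word, then fetch its vector from BERT's embedding table.
earth_token_id = tokenizer.convert_tokens_to_ids(["earth"])[0]
earth_embedding = model.embeddings.word_embeddings(torch.tensor([earth_token_id]))

planet_token_id = tokenizer.convert_tokens_to_ids(["planet"])[0]
planet_embedding = model.embeddings.word_embeddings(torch.tensor([planet_token_id]))

# Caution: convert_tokens_to_ids maps any string that is not in the vocabulary to the
# [UNK] id. Run tokenizer.tokenize("hotdog") to check whether "hotdog" is a single token;
# if it is split into sub-words, this embedding is the [UNK] vector.
hotdog_token_id = tokenizer.convert_tokens_to_ids(["hotdog"])[0]
hotdog_embedding = model.embeddings.word_embeddings(torch.tensor([hotdog_token_id]))

life_token_id = tokenizer.convert_tokens_to_ids(["life"])[0]
life_embedding = model.embeddings.word_embeddings(torch.tensor([life_token_id]))
```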



#### **Experiment**: Try different pairs of vector embeddings and report their cosine similarity score. What do you observe?

```python
cos = torch.nn.CosineSimilarity(dim=1)

similarity = cos(life_embedding, earth_embedding)
print(similarity)
```

```text
tensor([0.3102], grad_fn=<SumBackward1>)
```


**Congratulations!!!**

With this you have completed the basic understanding of tokens, vectors and embeddings.  

You are all set to level up and learn about "**RAG-based LLMs**".