# Embeddings

This notebook demonstrates how to work with text embeddings using various models. 

Embeddings are numerical representations of text that capture semantic meaning, allowing us to perform operations like similarity comparison and clustering.

We'll explore:
- Loading different embedding models
- Generating embeddings for text
- Calculating similarity between embeddings

In [2]:
# Setup environment
import os
import sys

from dotenv import find_dotenv, load_dotenv

assert find_dotenv(), "no .env file found"
load_dotenv(verbose=True)

# Reload edited modules automatically (%reload_ext is safe to run more than once)
%reload_ext autoreload
%autoreload 2

# `!export` runs in a throwaway subshell and does not affect this kernel,
# so add the working directory to the import path directly.
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())

# Clear any cached genai_tk imports; pop() tolerates modules that were never loaded
for name in ("genai_tk", "genai_tk.core", "genai_tk.core.embeddings_factory"):
    sys.modules.pop(name, None)

In [None]:
SENTENCE_1 = "Tokenization is the process of breaking down a text into individual units."
SENTENCE_2 = "Tokens can be words, phrases, or even individual characters."
SENTENCE_3 = "LangChain provides a standardized way to load and process various types of documents."

# Available Embedding Models

The toolkit provides a factory, `EmbeddingsFactory`, for creating different embedding models.

The available models are defined in [embeddings.py](../python/ai_core/embeddings.py).

Let's list all available models:

In [None]:
from genai_tk.core.embeddings_factory import EmbeddingsFactory, get_embeddings

# from src.ai_core.embeddings_factory import EmbeddingsFactory, get_embeddings

print(EmbeddingsFactory.known_items())

AssertionError: cannot find config file: '/home/tcl/prj/genai-blueprint/notebooks/config/app_conf.yaml'

Let's create an embedding for our first sentence. You can switch models by changing `MODEL_ID`.

We'll use cosine similarity to compare how similar the embeddings are.
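
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. A minimal dependency-free sketch follows; the helper name `cosine` and the toy vectors are illustrative assumptions, not part of the toolkit. The notebook itself uses LangChain's `cosine_similarity`, which operates on lists of vectors rather than on a single pair.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, not real embeddings
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.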


In [None]:
# Generate Embeddings


from langchain_community.utils.math import cosine_similarity

# Try different models by uncommenting one:
# MODEL_ID = "ada_002_azure"
MODEL_ID = None  # Default
embedder = get_embeddings(embeddings_id=MODEL_ID)

# or select by tag from a configuration YAML file:
# azure_embedder = get_embeddings(embeddings_tag="azure")

# Generate embedding for first sentence
vector_1 = embedder.embed_documents([SENTENCE_1])
print(f"{vector_1[0][:20]}...")
print(f"length: {len(vector_1[0])}")

Now let's compare how similar our first sentence is to the other sentences:

In [None]:
# Compare Embeddings

other_vectors = embedder.embed_documents([SENTENCE_2, SENTENCE_3])

result = cosine_similarity(vector_1, other_vectors)
print(result)


# The output is a 1×2 matrix of cosine similarity scores: the first sentence against each of the other two. Scores closer to 1 indicate higher similarity.
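
To turn the scores into a ranking, one option is to sort the candidate sentences by their score against the first sentence. The sketch below uses made-up scores in place of `result` (the real values depend on the model), and `rank_by_similarity` is a hypothetical helper, not part of the toolkit.

```python
def rank_by_similarity(scores: list[float], candidates: list[str]) -> list[tuple[str, float]]:
    """Pair each candidate with its score and sort from most to least similar."""
    if len(scores) != len(candidates):
        raise ValueError("need exactly one score per candidate")
    pairs = [(text, float(score)) for text, score in zip(candidates, scores)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Placeholder scores for illustration only; in the notebook, pass result[0] instead
example_scores = [0.42, 0.87]
example_candidates = ["candidate A", "candidate B"]
ranking = rank_by_similarity(example_scores, example_candidates)
for text, score in ranking:
    print(f"{score:.2f}  {text}")
```

In the notebook this could be called as `rank_by_similarity(result[0], [SENTENCE_2, SENTENCE_3])`, since `result` holds one row per query vector.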

In [None]:
len(vector_1[0])

### Assignment

1. Try different sentences and observe how the similarity scores change
2. Experiment with different embedding models by changing MODEL_ID
3. Explore the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) to compare embedding model performance

Some things to try:
- How do different models handle synonyms?
- What happens with very short vs very long sentences?
- How do the embedding dimensions differ between models?
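
The last question can be explored with a small loop that records the vector length produced by each embedder. The sketch below assumes only that each embedder exposes `embed_documents`, as the one used above does. `ToyEmbedder` is a made-up stand-in so that the example runs without credentials or a configuration file, and `embedding_dimensions` is a hypothetical helper.

```python
import hashlib
from typing import Protocol


class SupportsEmbedDocuments(Protocol):
    """Anything with an embed_documents method, e.g. a LangChain embeddings object."""

    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...


class ToyEmbedder:
    """Made-up stand-in for a real model: hashes words into a fixed number of buckets."""

    def __init__(self, dimensions: int) -> None:
        self.dimensions = dimensions

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            vector = [0.0] * self.dimensions
            for word in text.lower().split():
                digest = hashlib.sha256(word.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "big") % self.dimensions
                vector[bucket] += 1.0  # count words that land in this bucket
            vectors.append(vector)
        return vectors


def embedding_dimensions(
    embedders: dict[str, SupportsEmbedDocuments], sample: str
) -> dict[str, int]:
    """Embed one sample sentence with each embedder and report the vector length."""
    return {name: len(emb.embed_documents([sample])[0]) for name, emb in embedders.items()}


toy_models = {"toy_small": ToyEmbedder(8), "toy_large": ToyEmbedder(32)}
dims = embedding_dimensions(toy_models, "Tokens can be words, phrases, or characters.")
print(dims)
```

With real models, the dictionary could instead be built by calling `get_embeddings(embeddings_id=...)` for each model of interest, assuming the identifiers listed by `EmbeddingsFactory.known_items()` are the ones `get_embeddings` accepts.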