# Lesson 2: Embedding models

In this exercise, we'll explore embedding models.

Most embedding models take text as input and return a list of floating point numbers.

The **dimensions** of a model is how many floating point numbers it returns for each input. The "embedding", or "vector", is that list of floating point numbers.

As with chat completions, the OpenAI SDK has become the standard way to call embedding models.

For this exercise, I have pre-computed the embeddings for a "database" of 166 items of clothing for a shop (see data/clothes.json). This will save you time. 

For local development, you can use either `nomic-embed-text` (v1), which is English-only, or `granite-embedding:278m`, which is multilingual. Nomic has fewer parameters, so it needs less memory to calculate embeddings.

```bash
ollama pull nomic-embed-text
ollama pull granite-embedding:278m
```

In [None]:
# Your first embedding with a model
from openai import OpenAI
import utils

# If you change the environment variables, you need to restart the kernel
api_key = utils.get_api_key()

if utils.MODE == "github":
    model = "text-embedding-3-small"  # A fast, small model
    base_url = "https://models.inference.ai.azure.com"
    # The default is 1536 for text-embedding-3-small. 1024 is not an arbitrary number; it is one of the accepted values (256, 512, 1024, etc.)
    # ada-002 doesn't support variable dimensions
    dimensions = 1024 
elif utils.MODE == "ollama":
    # pick which one works for you
    # model = "nomic-embed-text"  # A comparable open-source model
    model = "granite-embedding:278m"  # a multilingual model
    base_url = utils.get_base_url()
    # Both granite and nomic-embed-text have 768 dimensions. If you request a
    # different number, the model ignores it and still returns 768 dimensions.
    dimensions = 768

# The OpenAI client is a class. The old API used globals, and you might still see code snippets written for it.

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model=model,
        dimensions=dimensions,
    )
    return response.data[0].embedding

First, let's try getting an embedding for a piece of text.

In [None]:
beans_embedding = get_embedding("delicious beans")
print(len(beans_embedding), beans_embedding[:10])

In [None]:
import pandas as pd
from utils.embeddings import cosine_similarity  # See utils/embeddings.py for the cosine similarity function (it's not complicated)

data = pd.read_json("data/clothes.json")
if utils.MODE == "ollama":
    if model == "nomic-embed-text":
        data['embedding'] = data.embedding_nomic
    elif model == "granite-embedding:278m":
        data['embedding'] = data.embedding_granite
    else:
        raise ValueError("I didn't precompute embeddings for that model.")


def search_df(df, product_description, n=3):
    embedding = get_embedding(product_description)
    df['similarities'] = df.embedding.apply(lambda x: cosine_similarity(x, embedding))
    res = df.sort_values('similarities', ascending=False).head(n)
    return res

data

In [None]:
res = search_df(data, 'fishing gear', n=3)

res

In [None]:
# This works for many languages, not just English
# NB: the nomic-embed-text model is barely multilingual, so its results will differ greatly from those of text-embedding-3-small
search_df(data, 'Equipo de pesca', n=3)  # Spanish

In [None]:
search_df(data, '釣りの物' , n=3)  # Japanese

Even though we searched for the same thing in three different languages, the similarity scores (right column) were quite different.
The embedding models are multilingual, but a query in the same language as the stored text will usually score higher.
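
The similarity score in these results is cosine similarity. The course's `utils/embeddings.py` is not reproduced here, so the following is only a minimal pure-Python sketch of how such a function is commonly written. The name `cosine_similarity_sketch` and the toy 3-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity_sketch(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration. Real embeddings have hundreds of dimensions.
same_direction = cosine_similarity_sketch([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_sketch([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity_sketch([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1. The length of a vector does not affect the score, only its direction.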

Computing the similarity against every item gets expensive as the dataset grows, so we can use indexes that cluster similar vectors together to speed up the search.
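
To make the clustering idea concrete, here is a rough sketch of a cluster-based index in pure Python. It is not how any particular vector database works: the function names (`build_index`, `search_index`), the `n_probe` parameter, the hand-picked centroids, and the toy 2-dimensional vectors are all assumptions for illustration. A real index would typically learn its centroids from the data, for example with k-means.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_index(vectors, centroids):
    """Group vector ids into buckets keyed by their most similar centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))
        buckets[nearest].append(vec_id)
    return buckets


def search_index(query, vectors, centroids, buckets, n=3, n_probe=1):
    """Score only the vectors in the n_probe buckets closest to the query."""
    ranked = sorted(
        range(len(centroids)),
        key=lambda i: cosine(query, centroids[i]),
        reverse=True,
    )
    candidates = [vec_id for c in ranked[:n_probe] for vec_id in buckets[c]]
    candidates.sort(key=lambda vec_id: cosine(query, vectors[vec_id]), reverse=True)
    return candidates[:n]


# Toy data, made up for illustration: two vectors near each axis.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
centroids = [[1.0, 0.0], [0.0, 1.0]]  # hand-picked here, not learned

buckets = build_index(vectors, centroids)
top = search_index([0.8, 0.3], vectors, centroids, buckets, n=2)
print(buckets, top)
```

The search only compares the query against the vectors in the closest bucket, so it does less work than scoring every item. The trade-off is that results are approximate: a good match that landed in a different bucket is missed unless `n_probe` is raised.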

# Combining Text Completions and Embeddings to make a RAG bot

We can combine the text-completions (LLM) with the embedding search to find relevant products and include them in the chat.

The same approach works when the retrieved information comes from a knowledge base, a wiki, or an unstructured data store.

The stages are:

1. Get the request from the user
1. Search the embedding index for similar matches
1. Give those matches to the LLM along with the original question or query
1. Ask it to generate a response
1. Give the response back to the user

In [None]:
if utils.MODE == "github":
    chat_model = "gpt-4.1-nano"  # A fast, small model
elif utils.MODE == "ollama":
    chat_model = "llama3.1"  # Llama (the model family) and Ollama (the local runner) are separate projects


def rag_chat(query, n=3):
    # Step 1: Find the products most similar to the query
    matches = search_df(data, query, n=n)
    
    # Merge this into a prompt
    system_prompt = """
    The user has asked about a product, you are a helpful assistant that can give suggestions about products we have. 

    The matching products are:
    """

    for _, match in matches.iterrows():  # iterrows() yields (index, row) pairs
        system_prompt += f"""
        Name: {match['name']}
        Description: {match.description}
        URL: https://www.superpythonshop.com/products/{match.id}

        """

    # Step 2: Call the model with the prompt
    response = client.chat.completions.create(
        model=chat_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        temperature=0.5,
        n=1,
    )

    # Step 3: Return the response
    return response.choices[0].message.content

from IPython.display import display, Markdown

display(Markdown(rag_chat("I need a warm hat for winter")))


# Task

Your next job is to iterate on this prompt to refine it and improve the suggestions. Try different queries and searches to see what it does.

Instructions:

- Edit the cell above and change the system prompt
- Run the cell again to see the results

Try the following:

- Looking for something silly
- Looking for something that doesn't exist
- Starting an argument with it
- Asking a question with errors
- Asking a question in a different language
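
If you want to try several of these ideas in one go, a small helper like the one below can run a batch of queries and collect the answers. This is only a sketch: `run_test_queries`, `fake_chat`, and the sample queries are invented for illustration. In the notebook you would pass `rag_chat` instead of `fake_chat`.

```python
def run_test_queries(chat_fn, queries):
    """Call chat_fn on each query and collect (query, answer) pairs."""
    results = []
    for query in queries:
        try:
            answer = chat_fn(query)
        except Exception as exc:  # keep going so one failure doesn't stop the batch
            answer = f"ERROR: {exc}"
        results.append((query, answer))
    return results


def fake_chat(query):
    """Stand-in for rag_chat so this sketch runs without a model."""
    return f"(model reply to: {query})"


# Sample queries, made up to match the suggestions in the list above.
test_queries = [
    "Do you sell a hat for my goldfish?",    # something silly
    "I want to buy a jetpack",               # something that doesn't exist
    "wram hatt for wintr",                   # a question with errors
    "Necesito un sombrero para el invierno", # a different language
]

results = run_test_queries(fake_chat, test_queries)
for query, answer in results:
    print(f"Q: {query}\nA: {answer}\n")
```

Running the same list after each change to the system prompt makes it easier to see whether an edit improved one case while making another worse.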

