# Embeddings and Dense Vector Search: A Quick Primer

If you come from an NLP background, embeddings are something you might be intimately familiar with - otherwise, you might find the topic a bit...dense. (This attempt at a joke will make more sense later.)

In all seriousness, embeddings are a powerful piece of the NLP puzzle, so let's dive in!

> NOTE: While this notebook is language/NLP-centric, embeddings have uses beyond just text!

#### Why Do We Even Need Embeddings?

In order to fully understand what embeddings are, we first need to understand why we have them!

Machine Learning algorithms, ranging from the very big to the very small, all have one thing in common:

They need numeric inputs.

So we need a process by which to translate the domain we live in, dominated by images, audio, language, and more, into the domain of the machine: Numbers.

Another thing we want to be able to do is capture "semantic information" about words/phrases so that we can use algorithmic approaches to determine if words are closely related or not!

So, we need to come up with a process that does these two things well:

- Convert non-numeric data into numeric data
- Capture potential semantic relationships between individual pieces of data
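
To see why the second requirement matters, here is a minimal sketch using one-hot encoding (the three-word vocabulary is made up for illustration and is not part of this notebook). One-hot encoding satisfies the first requirement, since every word becomes numeric, but not the second: every pair of distinct words ends up equally unrelated.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration.
vocabulary = ["puppy", "dog", "car"]

# One-hot encoding: each word gets a vector with a single 1 at its own index.
identity = np.eye(len(vocabulary))
one_hot = {word: identity[i] for i, word in enumerate(vocabulary)}

# The dot product between any two distinct one-hot vectors is 0, so "puppy"
# looks exactly as unrelated to "dog" as it does to "car".
puppy_dog = float(np.dot(one_hot["puppy"], one_hot["dog"]))
puppy_car = float(np.dot(one_hot["puppy"], one_hot["car"]))
print(puppy_dog, puppy_car)
```

Dense embeddings address this by letting related words sit at nearby coordinates.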

#### How Do Embeddings Capture Semantic Relationships?

In a simplified sense, embeddings map a word or phrase into n-dimensional space with a dense continuous vector, where each dimension in the vector represents some "latent feature" of the data.

This is best represented in a classic example:

![image](https://i.imgur.com/K5eQtmH.png)

As can be seen in the extremely simplified example: The X_1 axis represents age, and the X_2 axis represents hair.

The relationship of "puppy -> dog" reflects the same relationship as "baby -> adult", but dogs are (typically) hairier than humans. However, adults typically have more hair than babies - so they are shifted slightly closer to dogs on the X_2 axis!

Now, this is a simplified and contrived example - but it is *essentially* the mechanism by which embeddings capture semantic information.

In reality, the dimensions don't literally represent concrete concepts like "age" or "hair", but it's a useful way to think about how the semantic relationships are captured.
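
As a rough sketch of the figure above, we can assign made-up coordinates (age on X_1, hair on X_2) to the four words and check that the "puppy -> dog" offset points in nearly the same direction as the "baby -> adult" offset. The numbers are invented for illustration and do not come from any real embedding model.

```python
import numpy as np

# Invented 2D coordinates: [age (X_1), hair (X_2)]. Not real embeddings.
toy_vectors = {
    "puppy": np.array([1.0, 8.0]),
    "dog":   np.array([6.0, 9.0]),
    "baby":  np.array([1.0, 1.0]),
    "adult": np.array([6.0, 3.0]),
}

# The "growing up" offset within each species.
dog_offset = toy_vectors["dog"] - toy_vectors["puppy"]
human_offset = toy_vectors["adult"] - toy_vectors["baby"]

# Cosine of the angle between the two offsets; values near 1 mean the
# offsets point in nearly the same direction.
offset_alignment = np.dot(dog_offset, human_offset) / (
    np.linalg.norm(dog_offset) * np.linalg.norm(human_offset)
)
print(offset_alignment)
```

The two offsets are not identical (adults gain more hair than dogs do in this toy setup), but they are strongly aligned, which is the property the figure is meant to convey.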

Alright, with some background behind us - let's examine how these might help us choose relevant context.

Let's begin with a simple example - looking at how close two embedding vectors are for a pair of phrases.

When we use the term "close" in this notebook - we're referring to a similarity measure called "cosine similarity".

We discussed above that if two embeddings are close, they are semantically similar. Cosine similarity gives us a quick way to measure how similar two vectors are!

Cosine similarity ranges from 1 to -1, with 1 meaning the vectors point in the same direction (very close in meaning), 0 meaning they are unrelated, and -1 meaning they point in exactly opposite directions.
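
For reference, the standard definition of cosine similarity between two vectors (written here as $a$ and $b$, symbols chosen for this sketch) is their dot product divided by the product of their lengths:

```latex
\[
\operatorname{cosine\_similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
\]
```

Because both vectors are divided by their lengths, only their direction matters, not their magnitude.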

Let's implement it with NumPy below.

In [64]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
  return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))
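
A quick sanity check on a few hand-picked vectors (chosen here purely for illustration) can confirm the function behaves as described: vectors pointing the same way score 1, perpendicular vectors score 0, and opposite vectors score -1. The function is repeated so the snippet runs on its own.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
    return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))

base = np.array([1.0, 2.0])

# A scaled copy points in the same direction as the original.
same_direction = cosine_similarity(base, 3 * base)
# [-2, 1] is perpendicular to [1, 2] because their dot product is 0.
perpendicular = cosine_similarity(base, np.array([-2.0, 1.0]))
# Flipping the sign gives a vector pointing the opposite way.
opposite = cosine_similarity(base, -base)

print(same_direction, perpendicular, opposite)
```

Note that scaling a vector does not change its score, since cosine similarity only compares directions. The printed values may differ from exactly 1, 0, and -1 by tiny floating-point rounding.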

Now let's use the `text-embedding-3-small` embedding model (more on that in a second) to embed two sentences. In order to use this embedding model endpoint - we'll need to provide our OpenAI API key!

In [65]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

In [66]:
from aimakerspace.openai_utils.embedding import EmbeddingModel

embedding_model = EmbeddingModel()

Let's define our two sentences:

In [67]:
puppy_sentence = "I love puppies!"
dog_sentence = "I love dogs!"

Now we can convert those into embedding vectors using OpenAI!

In [68]:
puppy_vector = embedding_model.get_embedding(puppy_sentence)
dog_vector = embedding_model.get_embedding(dog_sentence)

Now we can determine how closely they are related using our distance measure!

In [69]:
cosine_similarity(puppy_vector, dog_vector)

np.float64(0.8341482011091341)

Remember, with cosine similarity, a value close to 1 means they're very close!

Let's see what happens if we use a different set of sentences.

In [70]:
puppy_sentence = "I love puppies!"
cat_sentence = "I dislike cats!"

puppy_vector = embedding_model.get_embedding(puppy_sentence)
cat_vector = embedding_model.get_embedding(cat_sentence)

cosine_similarity(puppy_vector, cat_vector)

np.float64(0.3723972998892517)

As you can see - these vectors are further apart - as expected!
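
This comparison is the core idea behind dense vector search: embed a query, embed the candidate texts, and rank the candidates by similarity to the query. Here is a minimal sketch, using small made-up vectors in place of real embeddings so that it runs without an API key. In this notebook, each vector would instead come from `embedding_model.get_embedding`.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
    return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))

# Made-up 3D vectors standing in for real sentence embeddings.
candidate_vectors = {
    "I love dogs!": np.array([0.9, 0.8, 0.1]),
    "I dislike cats!": np.array([0.4, -0.2, 0.7]),
    "The stock market fell today.": np.array([-0.5, 0.1, -0.8]),
}
query_vector = np.array([1.0, 0.7, 0.0])  # stand-in for "I love puppies!"

# Score every candidate against the query, then sort from most to least similar.
ranked = sorted(
    ((cosine_similarity(query_vector, vec), text)
     for text, vec in candidate_vectors.items()),
    reverse=True,
)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

The top-ranked entries are the ones we would pass along as relevant context. A linear scan like this is fine for a handful of texts; larger collections typically rely on a dedicated vector index.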

### Embedding Vector Calculations

One of the ways that embedding vectors can be leveraged, and a fun "proof" that they work the way we expect, can be explored via "vector calculations".

That is to say: If we take the vector for "King", and subtract the vector for "man", and add the vector for "woman" - we should have a vector that is similar to "Queen".

Let's try this out below!

In [71]:
king_vector = np.array(embedding_model.get_embedding("King"))
man_vector = np.array(embedding_model.get_embedding("man"))
woman_vector = np.array(embedding_model.get_embedding("woman"))

vector_calculation_result = king_vector - man_vector + woman_vector

queen_vector = np.array(embedding_model.get_embedding("Queen"))

cosine_similarity(vector_calculation_result, queen_vector)

np.float64(0.7161951721027584)

[Walid] I am going to try a similar exercise and look at the cosine similarity between a sentence in French and the same sentence in English.

In [72]:
i_speak_french_eng = np.array(embedding_model.get_embedding("I speak french"))
i_speak_french_fr = np.array(embedding_model.get_embedding("Je parle français"))

cosine_similarity(i_speak_french_eng, i_speak_french_fr)


np.float64(0.8134808594458636)

Returning to the King - man + woman calculation: with a cosine similarity of about 0.72, the resulting vector is indeed quite close to the "Queen" vector!

> NOTE: The similarity falls short of 1 because the vectors do not *literally* encode information along axes as simple as "man" or "woman".
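
To make the King - man + woman arithmetic concrete without calling the API, here is a small sketch with made-up 3D vectors (they are not real embeddings, and the word list is hypothetical). It computes the same expression and then looks up the nearest remaining word by cosine similarity.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
    return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))

# Made-up 3D vectors, not real embeddings.
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, -0.7, 0.2]),
    "man":   np.array([0.1, 0.9, 0.0]),
    "woman": np.array([0.1, -0.8, 0.1]),
    "apple": np.array([0.0, 0.0, 1.0]),
}

result = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# Exclude the three input words, then pick the remaining word closest to the result.
inputs = {"king", "man", "woman"}
candidates = {word: vec for word, vec in toy_vectors.items() if word not in inputs}
best_word = max(candidates, key=lambda word: cosine_similarity(result, candidates[word]))
print(best_word)  # prints "queen"
```

Even in this toy setup the result is close to, but not identical to, the "queen" vector, which mirrors the note above.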

[Walid] I am going to try to compare two images by having `gpt-4o` describe each one, embedding those text descriptions, and calculating their cosine similarity.

In [73]:
from PIL import Image
from dotenv import load_dotenv

In [74]:
def get_image_embedding(image_path):
    # Open and prepare the image
    image = Image.open(image_path)
    
    # Ask gpt-4o to respond to the image (only the image is sent, with no text prompt)
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{encode_image(image)}"
                        }
                    }
                ]
            }
        ],
        max_tokens=300
    )
    
    # Get the description and convert it to embedding
    description = response.choices[0].message.content
    # Convert the text description to an embedding using our existing embedding model
    return np.array(embedding_model.get_embedding(description))


In [75]:
def encode_image(image):
    import base64
    import io
    
    # Convert PIL Image to bytes
    buffered = io.BytesIO()
    # Convert image to RGB mode if it's not already (JPEG cannot store alpha or palette modes)
    if image.mode != 'RGB':
        image = image.convert('RGB')
    image.save(buffered, format="JPEG")
    
    # Encode to base64
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

In [76]:
# Compare two images
def compare_images(image_path1, image_path2):
    # Get embeddings for both images
    embedding1 = get_image_embedding(image_path1)
    embedding2 = get_image_embedding(image_path2)
    
    # Calculate similarity
    similarity = cosine_similarity(embedding1, embedding2)
    return similarity

I am calculating the cosine similarity between two poodle images.

In [77]:
image1_path = "images/poodle1.jpg"
image2_path = "images/poodle2.png"

In [78]:
similarity_score = compare_images(image1_path, image2_path)
print(f"Similarity score between images: {similarity_score}")

Similarity score between images: 0.6114003143213


I am calculating the cosine similarity between a poodle and a cat.

In [79]:
image1_path = "images/poodle1.jpg"
image2_path = "images/cat.jpg"

In [80]:
similarity_score = compare_images(image1_path, image2_path)
print(f"Similarity score between images: {similarity_score}")

Similarity score between images: 0.28740342245942885


I am calculating the cosine similarity between a poodle and a German Shepherd.

In [81]:
image1_path = "images/poodle1.jpg"
image2_path = "images/germanshepherd.jpg"

In [82]:
similarity_score = compare_images(image1_path, image2_path)
print(f"Similarity score between images: {similarity_score}")

Similarity score between images: 0.36598593994155265


### Conclusion

As you can see - embeddings can help us convert text into a machine-understandable format, which we can leverage for a number of purposes.