<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_06_3_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative Artificial Intelligence
**Module 6: Retrieval-Augmented Generation (RAG)**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 6 Material

* Part 6.1: Introduction to Retrieval-Augmented Generation (RAG) [[Video]](https://www.youtube.com/watch?v=qA52K0K181Q) [[Notebook]](t81_559_class_06_1_rag.ipynb)
* Part 6.2: Introduction to ChromaDB [[Video]](https://www.youtube.com/watch?v=R53lo4sevLQ) [[Notebook]](t81_559_class_06_2_chromadb.ipynb)
* **Part 6.3: Understanding Embeddings** [[Video]](https://www.youtube.com/watch?v=Tq82Gl2ZZNM) [[Notebook]](t81_559_class_06_3_embeddings.ipynb)
* Part 6.4: Question Answering Over Documents [[Video]](https://www.youtube.com/watch?v=hCwL_lW-gP0) [[Notebook]](t81_559_class_06_4_qa.ipynb)
* Part 6.5: Embedding Databases [[Video]](https://www.youtube.com/watch?v=BG2gT4uYxhM) [[Notebook]](t81_559_class_06_5_embed_db.ipynb)


# Google CoLab Instructions

The following code ensures that Google CoLab is running and maps Google Drive if needed.

In [1]:
import os

try:
    from google.colab import drive, userdata
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# OpenAI Secrets
if COLAB:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Install needed libraries in CoLab
if COLAB:
    !pip install langchain langchain_openai pypdf chromadb

Note: using Google CoLab

# 6.3: Understanding Embeddings

An [embedding](https://platform.openai.com/docs/guides/embeddings) is a vector (list) of floating-point numbers. The [distance](https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use) between two vectors measures their relatedness. Small distances suggest high relatedness, and large distances suggest low relatedness.
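To make the idea of distance concrete, here is a minimal sketch that uses three made-up, 3-element vectors. The words and numbers are hypothetical values chosen only for illustration; they do not come from any OpenAI model, and real embeddings have far more elements.

```python
import math

# Hypothetical 3-element "embeddings", invented for this illustration.
toy_vectors = {
    "kitten": [0.90, 0.10, 0.05],
    "cat": [0.85, 0.15, 0.10],
    "truck": [0.05, 0.20, 0.95],
}

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.dist(a, b)

near = euclidean_distance(toy_vectors["kitten"], toy_vectors["cat"])
far = euclidean_distance(toy_vectors["kitten"], toy_vectors["truck"])

print(f"kitten vs. cat:   {near:.3f}")
print(f"kitten vs. truck: {far:.3f}")
```

The related pair ends up much closer together than the unrelated pair, which is the behavior we rely on when comparing real embeddings.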

The two embedding models that you will choose between in OpenAI are as follows:

* text-embedding-3-small
* text-embedding-3-large

Choosing between the two OpenAI embedding models, "text-embedding-3-small" and "text-embedding-3-large," depends on several factors related to your specific use case, including performance requirements, computational resources, and the nature of the tasks you need the embeddings for. Here are some key considerations:

1. **Performance and Accuracy**:
 * **text-embedding-3-large**: Generally, larger models tend to capture more nuanced and complex relationships within the text, leading to better performance in tasks that require a deep understanding of language, such as semantic similarity, sentiment analysis, and more sophisticated NLP tasks.
 * **text-embedding-3-small**: Smaller models may not be as accurate or detailed as larger ones, but they can still perform well on many tasks, especially those with less complexity or when fine-tuned on specific datasets.
2. **Computational Resources:**
 * **text-embedding-3-large**: Requires more computational power and, because its default vectors are longer (3,072 elements versus 1,536), more memory and storage for the resulting embeddings. This matters if you are working in a resource-constrained environment or need to process a large volume of data in real time.
 * **text-embedding-3-small**: Uses resources more efficiently, making it a better choice for applications with limited computational power or when operating at a scale where cost and speed are critical factors.
3. **Latency and Throughput:**
 * **text-embedding-3-large**: Typically, larger models have higher latency due to their complexity, which might impact real-time applications.
 * **text-embedding-3-small**: Lower latency and faster inference times benefit applications requiring quick responses.
4. **Cost:**
 * **text-embedding-3-large**: Likely to incur higher operational costs due to greater computational requirements.
 * **text-embedding-3-small**: More cost-effective, particularly for large-scale deployments.
5. **Use Case Specifics:**
 * **text-embedding-3-large**: More appropriate for applications needing high precision and where detailed semantic understanding is crucial, such as nuanced text analysis or advanced AI research.
 * **text-embedding-3-small**: The better choice for applications where speed, cost, and efficiency are more critical, such as real-time systems, chatbots, or applications with more straightforward text processing needs.

In summary,

* Choose text-embedding-3-large if:
 * You need high accuracy and detailed semantic understanding.
 * You have sufficient computational resources and budget.
 * Latency is not a critical concern.
* Choose text-embedding-3-small if:
 * You require efficient resource usage and lower costs.
 * You need faster inference times.
 * The tasks are less complex, or the environment is more resource-constrained.

Evaluating your requirements and constraints will help you decide which model to use.
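One resource that is easy to estimate ahead of time is the storage needed for the vectors themselves. The sketch below is a rough back-of-the-envelope calculation: `index_size_mib` is a hypothetical helper, the 4 bytes per value is an assumption (32-bit floats), and the figures ignore any overhead added by a vector database. The vector lengths are the defaults OpenAI publishes for each model.

```python
# Default vector lengths for each model.
DEFAULT_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def index_size_mib(num_vectors, dimensions, bytes_per_value=4):
    """Approximate raw storage for the vectors alone, in MiB.

    Assumes each value is stored as a 4-byte float and ignores
    any overhead from the database that holds the vectors.
    """
    return num_vectors * dimensions * bytes_per_value / (1024 ** 2)

for model, dims in DEFAULT_DIMENSIONS.items():
    size = index_size_mib(100_000, dims)
    print(f"{model}: about {size:,.0f} MiB for 100,000 text chunks")
```

Under these assumptions the large model needs twice the storage of the small one for the same number of text chunks.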

## Instantiating an Embedding Model

For this class, I suggest that you use **text-embedding-3-small**. It has all the capabilities that we need and will stretch your credits further. Let's begin by creating a client that utilizes this model.


In [2]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

Let's start with the basics of an embedding model and vectors. An embedding model converts a text string, whether a single word or several paragraphs (up to the model's input limit), into a vector of fixed length. Identical text strings produce essentially identical vectors, and different strings produce different vectors. The difference is not a simple character-by-character comparison, however: two strings that use very different words but share the same meaning can still produce similar vectors.

To begin with, let's look at the vectors for a single word and for a short sentence.

In [3]:
l1 = embeddings_model.embed_query("dog")
l2 = embeddings_model.embed_query("Something that is a bit longer than a word.")

print(type(l1))

<class 'list'>


As you can see, the output is just a regular Python list. These lists are quite long.

In [4]:
print(len(l1))
print(len(l2))

1536
1536


The length of this vector remains the same for every query to the same model, regardless of the length of the input text. The larger version of the model does a better job of creating vectors that differentiate strings. By default it returns longer vectors (3,072 elements), but OpenAI also lets you request shorter vectors from it, so the increase in quality does not strictly require a larger vector length. It's a nuanced concept worth exploring further.

If we display the actual list itself, we can see that it is just a collection of numbers. Here, we display only the first ten elements.

In [5]:
print(l1[:10])

[0.05113774910569191, -0.01870863139629364, -0.004298428073525429, 0.07271610200405121, -0.007174310740083456, -0.014693480916321278, -0.0059395902790129185, 0.005037412978708744, 0.018954960629343987, -0.01090618409216404]


## Comparing Vectors

To compare these vectors, we will use the mathematical capabilities of [NumPy](https://numpy.org/). Mathematics offers several ways to compare vectors; some of the most common are presented here.

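* **Dot Product**: Sums the products of the corresponding components of two vectors. The result equals the cosine of the angle between them multiplied by both magnitudes, so it reflects directional similarity and vector length together. Used in determining orthogonality, projection, and in various applications like computer graphics and machine learning.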

* **Cross Product**: Computes a vector perpendicular to two given vectors in three-dimensional space, useful for finding the normal vector to a plane and in physics for torque and angular momentum calculations.

* **Euclidean Distance**: Calculates the straight-line distance between two vectors, widely used in machine learning for clustering and nearest neighbor algorithms.

* **Cosine Similarity**: Evaluates the cosine of the angle between two vectors, emphasizing the orientation rather than the magnitude, often used in text mining and information retrieval to compare document similarity.

* **Manhattan Distance**: Measures the sum of absolute differences between the components of two vectors, useful in grid-based pathfinding algorithms and some machine learning applications.
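The following sketch computes four of these measures on a pair of small, made-up vectors so the differences are easy to check by hand. It is written in plain Python to keep each formula visible; the helper names (`dot`, `euclidean`, `manhattan`, `cosine`) are illustrative and not part of any library. The cross product is left out because it only applies in three dimensions.

```python
import math

def dot(a, b):
    """Sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute component-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the two magnitudes."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [4.0, 3.0]
b_doubled = [2 * x for x in b]  # same direction as b, twice the length

for label, other in (("b", b), ("b doubled", b_doubled)):
    print(f"a vs. {label}:")
    print(f"  dot product        = {dot(a, other):.2f}")
    print(f"  Euclidean distance = {euclidean(a, other):.2f}")
    print(f"  Manhattan distance = {manhattan(a, other):.2f}")
    print(f"  cosine similarity  = {cosine(a, other):.2f}")
```

Doubling the length of `b` changes the dot product and both distances, while the cosine similarity stays the same because it depends only on direction.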

OpenAI [suggests](https://platform.openai.com/docs/guides/embeddings/frequently-asked-questions) that we use Cosine Similarity to compare their vectors. Cosine similarity depends only on the direction of the vectors, not their magnitude, and it keeps the sign of the individual vector components, which is important information that we do not want to discard.

We will begin by converting the two embeddings from Python lists to NumPy arrays.

In [6]:
import numpy as np

# Convert lists to numpy arrays
vec1 = np.array(l1)
vec2 = np.array(l2)

OpenAI specifies that all their embeddings are normalized to length 1, often called [unit vectors](https://en.wikipedia.org/wiki/Unit_vector). This fact means that:

* Cosine similarity can be computed slightly faster using just a dot product
* Cosine similarity and Euclidean distance will result in identical rankings
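Both points follow from a short identity: for unit vectors, the squared Euclidean distance equals 2 minus twice the dot product, so a larger dot product always means a smaller distance. The sketch below checks this on a few made-up vectors that are normalized by hand; the vectors and the helper names `normalize` and `dot` are assumptions for illustration only.

```python
import math

def normalize(v):
    """Scale a vector so that its length is 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors, normalized so they behave like unit-length embeddings.
query = normalize([1.0, 2.0, 2.0])
candidates = {
    "A": normalize([2.0, 4.0, 3.0]),
    "B": normalize([-1.0, 0.5, 2.0]),
    "C": normalize([3.0, -1.0, 0.0]),
}

for name, vec in candidates.items():
    similarity = dot(query, vec)
    distance = math.dist(query, vec)
    # For unit vectors: distance squared equals 2 - 2 * dot product.
    print(f"{name}: dot={similarity:.4f}  "
          f"distance^2={distance ** 2:.4f}  2-2*dot={2 - 2 * similarity:.4f}")

# Rank by highest similarity and by smallest distance.
by_similarity = sorted(candidates, key=lambda n: dot(query, candidates[n]), reverse=True)
by_distance = sorted(candidates, key=lambda n: math.dist(query, candidates[n]))
print("Same ranking:", by_similarity == by_distance)
```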

Now, let's put this into practice. To verify that these embeddings are indeed of length 1, we can use the following function. It prints the length (norm) of a vector, reports whether it is a unit vector, and then normalizes the vector and prints the length again for comparison.

In [7]:
def analyze_vector_length(vector):
  # Calculate the length (norm) of the vector
  length = np.linalg.norm(vector)

  # Check if the vector is a unit vector
  is_unit_vector = np.isclose(length, 1.0)

  print(f"Vector: {vector}")
  print(f"Length of the vector: {length}")
  print(f"Is the vector a unit vector? {'Yes' if is_unit_vector else 'No'}")

  # Normalize the vector to make it a unit vector
  unit_vector = vector / length

  # Verify the length of the normalized vector
  normalized_length = np.linalg.norm(unit_vector)

  print(f"Normalized vector (unit vector): {unit_vector}")
  print(f"Length of the normalized vector: {normalized_length}")

In [8]:
analyze_vector_length(vec1)

Vector: [ 0.05113775 -0.01870863 -0.00429843 ...  0.02879578  0.00215999
  0.01790806]
Length of the vector: 0.9999999997385218
Is the vector a unit vector? Yes
Normalized vector (unit vector): [ 0.05113775 -0.01870863 -0.00429843 ...  0.02879578  0.00215999
  0.01790806]
Length of the normalized vector: 1.0000000000000002


Next, let's calculate the actual cosine similarity between our two vectors.

In [9]:
import numpy as np

# Convert lists to numpy arrays
vec1 = np.array(l1)
vec2 = np.array(l2)

# Calculate the dot product
dot_product = np.dot(vec1, vec2)

# Calculate the magnitudes (L2 norms)
magnitude_vec1 = np.linalg.norm(vec1)
magnitude_vec2 = np.linalg.norm(vec2)

# Calculate cosine similarity
cosine_similarity = dot_product / (magnitude_vec1 * magnitude_vec2)

print(f"Cosine similarity: {cosine_similarity}")

Cosine similarity: 0.16737720056743507


The denominator above is approximately 1.0, as shown here; the tiny difference comes from limited numeric precision. This is due to the unit vector property. We can therefore simplify our unit vector comparison to just the dot product, as indicated by OpenAI.

In [10]:
print(magnitude_vec1 * magnitude_vec2)

1.0000000478179367


The dot product can be calculated as follows.

In [11]:
print(np.dot(vec1, vec2))

0.16737720857106747


## Evaluating Similarities of Strings

We will begin by creating a simple function to compare two strings.

In [12]:
def compare_str(embeddings_model, text1, text2):
    """
    This function returns the dot product of embeddings for two given text strings.

    Parameters:
    embeddings_model: The embeddings model to use for generating embeddings.
    text1 (str): The first text string.
    text2 (str): The second text string.

    Returns:
    float: The dot product of the embeddings for text1 and text2.
    """
    # Get the embeddings for the two text strings
    embedding1 = embeddings_model.embed_query(text1)
    embedding2 = embeddings_model.embed_query(text2)

    # Convert embeddings to numpy arrays for dot product calculation
    embedding1_array = np.array(embedding1)
    embedding2_array = np.array(embedding2)

    # Calculate and return the dot product
    dot_product = np.dot(embedding1_array, embedding2_array)
    return dot_product

Let's try it with two descriptions of a "lawn mower" that do not use many similar words.

In [13]:
compare_str(embeddings_model,
            "A machine that helps people to cut grass.",
            "Device with blades to cut plants under it.")

0.6298291212454291

Because these embeddings are unit vectors, the dot product equals the cosine similarity, which ranges from -1 to 1. We can interpret it as follows:

* High Similarity: Values close to 1 indicate high similarity.
* Low Similarity: Values close to -1 indicate high dissimilarity.
* Neutral/No Similarity: Values close to 0 indicate no apparent similarity.

So a value of 0.63 means they are reasonably similar. Let's adjust it to compare a lawn mower to an airplane.

In [14]:
compare_str(embeddings_model,
            "A machine that helps people to cut grass.",
            "Vehicle that flys through the air.")

0.2694946993627674

We can see that the similarity is noticeably lower: about 0.27, compared with 0.63 for the two lawn mower descriptions.
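The -1 to 1 scale is easier to picture with two-element unit vectors, where the dot product is simply the cosine of the angle between them. The sketch below is a toy illustration that needs no API key; `unit_vector` and `dot` are hypothetical helpers, and the angles stand in for how far apart two meanings are.

```python
import math

def unit_vector(angle_degrees):
    """Return a 2-element unit vector pointing at the given angle."""
    radians = math.radians(angle_degrees)
    return [math.cos(radians), math.sin(radians)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

reference = unit_vector(0)

# Compare the reference with vectors that point progressively further away.
for angle in (0, 45, 90, 135, 180):
    similarity = dot(reference, unit_vector(angle))
    print(f"{angle:>3} degrees apart: {similarity:+.2f}")
```

Vectors pointing the same way score 1, perpendicular vectors score 0, and vectors pointing in opposite directions score -1. The two comparisons above, at roughly 0.63 and 0.27, both fall between "unrelated" and "identical" on this scale.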