# Embeddings and Vector Databases With ChromaDB

Modern LLMs, while imperfect, can accurately solve a wide range of problems and provide correct answers to many questions. But, due to the limits of their training and the number of text tokens they can process, LLMs aren’t a silver bullet for all tasks.

You wouldn’t expect an LLM to provide relevant responses about topics that don’t appear in its training data. For example, if you asked ChatGPT to summarize information in confidential company documents, then you’d be out of luck. You could show some of these documents to ChatGPT, but there’s a limited number of documents that you can upload before you exceed ChatGPT’s maximum number of tokens. How would you select documents to show ChatGPT?

To address these shortcomings and scale your LLM applications, one great option is to use a vector database like ChromaDB. A vector database allows you to store encoded unstructured objects, like text, as lists of numbers that you can compare to one another. You can, for example, find a collection of documents relevant to a question that you want an LLM to answer.

## Represent Data as Vectors

Before diving into embeddings and vector databases, you should understand what vectors are and what they represent.

You can describe vectors with variable levels of complexity, but one great starting place is to think of a vector as an array of numbers. For example, you could represent vectors using NumPy arrays as follows:

In [None]:
!pip install numpy



In [None]:
import numpy as np

vector1 = np.array([1, 0])
vector2 = np.array([0, 1])

print(vector1)
print(vector2)

[1 0]
[0 1]


In this code block, you import numpy and create two arrays, vector1 and vector2, representing vectors. This is one of the most common and useful ways to work with vectors in Python, and NumPy offers a variety of functionality to manipulate vectors.

You’ve created two NumPy arrays that represent vectors. Now what? It turns out you can do a lot of cool things with vectors, but before continuing on, you’ll need to understand some key definitions and properties:

* **Dimension:** The dimension of a vector is the number of elements that it contains. In the example above, `vector1` and `vector2` are both two-dimensional since they each have two elements. You can only visualize vectors with three dimensions or fewer, but generally, vectors can have any number of dimensions. In fact, as you’ll see later, vectors that encode words and text tend to have hundreds or thousands of dimensions.

* **Magnitude:** The magnitude of a vector is a non-negative number that represents the vector’s size or length. You can also refer to the magnitude of a vector as the norm, and you can denote it with `||v||` or `|v|`. There are many different definitions of magnitude or norm, but the most common is the *Euclidean norm* or *2-norm*. You’ll learn how to compute this later.

* **Unit vector:** A unit vector is a vector with a magnitude of one. In the example above, `vector1` and `vector2` are unit vectors.

* **Direction:** The direction of a vector specifies the line along which the vector points. You can represent direction using angles, unit vectors, or coordinates in different coordinate systems.

* **Dot product (scalar product):** The dot product of two vectors, u and v, is a number given by `u ⋅ v = ||u|| ||v|| cos(θ)`, where `θ` is the angle between the two vectors. Another way to compute the dot product is to do an *element-wise multiplication of u and v and sum the results*. The dot product is one of the most important and widely used vector operations because it measures the similarity between two vectors. You’ll see more of this later on.

* **Orthogonal vectors:** Vectors are *orthogonal* if their dot product is zero, meaning that they’re at a 90 degree angle to each other. You can think of orthogonal vectors as being completely *unrelated* to each other.

* **Dense vector:** A vector is considered dense if most of its elements are non-zero. Later on, you’ll see that words and text are most usefully represented with dense vectors because each dimension encodes meaningful information.

While there are many more definitions and properties to learn, these seven are most important for this tutorial. Note that for the rest of this tutorial, you’ll use `v1`, `v2`, and `v3` to name your vectors.

You first import numpy and create the arrays v1, v2, and v3. Calling v1.shape shows you the dimension of v1.

In [None]:
import numpy as np

v1 = np.array([1, 0])
v2 = np.array([0, 1])
v3 = np.array([np.sqrt(2), np.sqrt(2)])

# Dimension
v1.shape

(2,)

You then see two different ways to compute the magnitude of a NumPy array. The first, `np.sqrt(np.sum(v1**2))`, uses the Euclidean norm that you learned about above. The second computation uses `np.linalg.norm()`, a NumPy function that computes the Euclidean norm of an array by default but can also compute other matrix and vector norms.

In [None]:
# Magnitude
v1_magnitude = np.linalg.norm(v1)
v2_magnitude = np.linalg.norm(v2)
v3_magnitude = np.linalg.norm(v3)

# Same magnitude computed manually from the Euclidean norm definition
test = np.sqrt(np.sum(v1**2))
print(f"***{test}")

print(v1_magnitude)
print(v2_magnitude)
print(v3_magnitude)

***1.0
1.0
1.0
2.0


By default, `np.linalg.norm(v1)` calculates the L2 norm (also known as the *Euclidean* norm) of the vector `v1`. This is the most common type of norm and corresponds to the usual notion of distance in Euclidean space.

Here's how the *L2 norm* is calculated:

1. Square each element of the vector v1.
2. Sum all the squared elements.
3. Take the square root of the sum.

Example:

If `v1 = np.array([3, 4])`, then:

1. Square each element: `[3 ** 2, 4 ** 2] = [9, 16]`
2. Sum the squared elements: `9 + 16 = 25`
3. Take the square root: `sqrt(25) = 5`

Therefore, `np.linalg.norm(v1)` returns `5.0`.
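The three steps above can be sketched directly in NumPy. This is a minimal illustration, and the variable names `squared`, `total`, and `manual_norm` are made up for the example:

```python
import numpy as np

v = np.array([3, 4])

# Step 1: square each element
squared = v ** 2              # array([ 9, 16])

# Step 2: sum the squared elements
total = squared.sum()         # 25

# Step 3: take the square root of the sum
manual_norm = np.sqrt(total)  # 5.0

# np.linalg.norm() computes the same value by default
print(manual_norm, np.linalg.norm(v))
```

Both values should agree, which is why you can rely on `np.linalg.norm()` instead of writing out the steps by hand.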

Lastly, there are two common ways to calculate the dot product between two vectors. An expression like `np.sum(a1 * a2)` first computes the element-wise multiplication between `a1` and `a2` in a vectorized fashion, and then sums the results to produce a single number. A cleaner alternative is the at-operator (`@`), as in `a1 @ a2`. The `@` operator can perform both vector and matrix multiplication, and the syntax is more readable.

In [None]:
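# Dot product
a1 = np.array([3, 0])
a2 = np.array([0, 4])
b1 = np.array([3, 4])

# Element-wise multiplication followed by a sum
print(np.sum(a1 * a2))
# The @ operator computes the same dot product
print(a1 @ a2)

# Magnitude of b1 for comparison
print(np.linalg.norm(b1))

0
0
5.0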

While all of these vector definitions and properties may seem straightforward to compute, you might still be wondering what they actually mean and why they’re important to understand. One way to better understand vectors is to visualize them in two dimensions. In this context, you can represent vectors as arrows, like in the following plot:


<img src="vectors-graph.avif">

The above plot shows the visual representation of the vectors `v1`, `v2`, and `v3` that you worked with in the last example. The tail of each vector arrow always starts at the origin, and the tip is located at the coordinates specified by the vector. As an example, the tip of `v1` lies at (1, 0), and the tip of `v3` lies at roughly (1.414, 1.414). The length of each vector arrow corresponds to the magnitude that you calculated earlier.

From this visual, you can make the following key inferences:

1. `v1` and `v2` are unit vectors because their magnitude, given by the arrow length, is one. `v3` isn’t a unit vector, and its magnitude is two, twice the size of `v1` and `v2`.

2. `v1` and `v2` are orthogonal because they meet at a 90-degree angle. You see this visually but can also verify it computationally by computing the dot product between `v1` and `v2`. By using the dot product definition, `v1 ⋅ v2 = ||v1|| ||v2|| cos(θ)`, you can see that when `θ` is 90 degrees, `cos(θ) = 0` and `v1 ⋅ v2 = 0`. Intuitively, you can think of `v1` and `v2` as being totally unrelated or having nothing to do with each other. This will become important later.

3. `v3` makes a 45 degree angle with both `v1` and `v2`. This means that `v3` will have a non-zero dot product with `v1` and `v2`. This also means that `v3` is equally related to both `v1` and `v2`. In general, the smaller the angle between two vectors, the more they point toward a common direction.
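You can check these three inferences numerically. The sketch below rearranges the dot product definition to recover the angle between two vectors. The helper name `angle_in_degrees()` is hypothetical and only exists for this illustration:

```python
import numpy as np

v1 = np.array([1, 0])
v2 = np.array([0, 1])
v3 = np.array([np.sqrt(2), np.sqrt(2)])

def angle_in_degrees(u, v):
    """Angle between u and v, from u . v = ||u|| ||v|| cos(theta)."""
    cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against tiny floating-point overshoot outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(np.linalg.norm(v1), np.linalg.norm(v2))  # both are unit vectors
print(v1 @ v2)                                  # 0, so v1 and v2 are orthogonal
print(angle_in_degrees(v1, v2))                 # 90 degrees
print(angle_in_degrees(v1, v3))                 # approximately 45 degrees
print(angle_in_degrees(v2, v3))                 # approximately 45 degrees
```

Because of floating-point rounding, the 45-degree results may differ from 45 in the last few decimal places.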

You’ve now seen how vectors are characterized both computationally and visually. With this understanding, you’re ready to take a slightly deeper dive into the idea of vector similarity. If you only take away one thing from this introduction, it should be what follows.

## Vector Similarity

The ability to measure vector similarity is crucial in machine learning and mathematics more broadly. Many vector similarity metrics build on the *dot product*.

One issue with the dot product, when used in isolation, is that it can take on any value and is therefore difficult to interpret in absolute terms. For example, if you know only that the dot product between two vectors is -3, then it’s unclear what that means without more context.

To overcome this shortcoming, one common approach is to use **cosine similarity**, a normalized form of the dot product. You compute cosine similarity by taking the cosine of the angle between two vectors. In essence, you rearrange the cosine definition of the dot product from earlier to solve for `cos(θ)`. The equation for cosine similarity looks like this:

<img src="cosine-similarity-graph.avif" width=380>

Cosine similarity disregards the magnitude of both vectors, forcing the calculation to lie between -1 and 1. This is a really nice property because it gives cosine similarity the following interpretations:

* A value of 1 means the angle between the two vectors is 0 degrees. In other words, the two vectors are similar because they point in the exact same direction. Keep in mind this doesn’t mean that the vectors have the same magnitude.

* A value of 0 means the angle between the two vectors is 90 degrees. In this case, the vectors are orthogonal and unrelated to each other.

* A value of -1 means the angle between the two vectors is 180 degrees. This is an interesting case where the vectors are dissimilar because they point in opposite directions.

In short, a cosine similarity of 1 means the vectors are similar, 0 means the vectors are unrelated, and -1 means the vectors are opposite. Any values in between represent varying degrees of similarity or dissimilarity.
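To see these three cases in numbers, here is a small sketch. The function name `cosine()` and the example vectors are assumptions chosen for illustration:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v: (u . v) / (||u|| ||v||)."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0])

same_direction = 3 * u                  # longer, but points the same way
perpendicular = np.array([-2.0, 1.0])   # u . perpendicular = -2 + 2 = 0
opposite = -u                           # points the opposite way

print(cosine(u, same_direction))  # close to 1
print(cosine(u, perpendicular))   # 0
print(cosine(u, opposite))        # close to -1

# The raw dot product, by contrast, grows with the vectors' magnitudes
print(u @ same_direction)         # 15.0
```

Notice that scaling `u` by three changes the dot product but leaves the cosine similarity at roughly 1, which is what makes the metric easy to interpret.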

You’ve worked with two-dimensional vectors because they’re straightforward to visualize, but keep in mind that everything covered so far applies to vectors of any dimension. In the next section, you’ll use the same cosine similarity calculation to compare vectors in high-dimensional vector spaces.

You now have a feel for what vectors are and how you can assess their similarity. While there are many more vector concepts to learn about, you know enough to speak the language of embeddings and vector databases. In the next section, you’ll see how to convert words and sentences to vectors, a key prerequisite to text-based vector databases.

## Encode Objects in Embeddings

The next step in your journey to understanding and using vector databases like *ChromaDB* is to get a feel for embeddings. **Embeddings** are a way to represent data such as words, text, images, and audio in a numerical format that computational algorithms can more easily process.

More specifically, embeddings are dense vectors that characterize meaningful information about the objects that they encode. The most common kinds of embeddings are word and text embeddings, and that’s what you’ll focus on in this tutorial.

### Word Embeddings
A word embedding is a vector that captures the semantic meaning of a word. Ideally, words that are semantically similar in natural language should have embeddings that are similar to each other in the encoded vector space. Analogously, words that are unrelated or opposite of one another should be further apart in the vector space.

One of the best ways to conceptualize this idea is to plot example word vectors in two dimensions. Take a good look at this scatterplot:

<img src="embedding-graph.avif">

This plot shows hand-crafted word embeddings plotted in two dimensions. Each point indicates where the tip of the word embedding lies. You’ll notice how related words are clustered together, while unrelated words are far from each other.

As an example, the vehicle embeddings are far from the animal embeddings because there’s little semantic similarity between the two. On the other hand, the adjectives with positive connotations are relatively close to the fruits, with the delicious embedding being closest to the fruit embeddings.

Because you’ll usually find the word delicious in contexts relating to food, it makes sense for the delicious embedding to have some similarity with both food embeddings and positive adjective embeddings.

Word embeddings try to capture these semantic relationships for a large vocabulary of words, and as you might imagine, there are a lot of complex relationships to consider. This is why, in practice, word embeddings often require hundreds or thousands of dimensions to account for the complexities of human language.
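As a rough illustration of the clustering idea, the sketch below defines a handful of two-dimensional embeddings by hand and finds each word's closest neighbor by cosine similarity. The coordinates are made up for this example and don't come from any trained model, and `toy_embeddings`, `cosine()`, and `most_similar()` are hypothetical names:

```python
import numpy as np

# Made-up 2D coordinates, chosen by hand purely for illustration
toy_embeddings = {
    "dog": np.array([1.0, 0.2]),
    "cat": np.array([0.9, 0.3]),
    "truck": np.array([-0.8, 0.5]),
    "car": np.array([-0.9, 0.4]),
    "apple": np.array([0.1, -1.0]),
    "delicious": np.array([0.3, -0.8]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar(word):
    """Return the other word whose toy embedding is most similar to word's."""
    query = toy_embeddings[word]
    candidates = {
        other: cosine(query, vec)
        for other, vec in toy_embeddings.items()
        if other != word
    }
    return max(candidates, key=candidates.get)

print(most_similar("dog"))    # cat
print(most_similar("truck"))  # car
print(most_similar("apple"))  # delicious
```

With only two dimensions, you can place a few words sensibly by hand, but it quickly becomes impossible to position a full vocabulary this way.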

> Note: If you’re interested in how word embeddings are created, then check out the Word2vec and GloVe algorithms. These algorithms create static word embeddings like the ones that you’ll use later in this section, but there are other ways to create dynamic embeddings. For example, the model underlying most large language models (LLMs), including ChatGPT, creates word embeddings that change based on the context surrounding the word.

You’re now ready to get started using word vectors in Python. For this, you’ll use the popular `spaCy` library, a general-purpose NLP library. To install `spaCy`, create a virtual environment, activate it, and run the following command:

In [None]:
!python --version

Python 3.10.12


In [None]:
!pip install spacy



After you’ve installed `spaCy`, you’ll also need to download a model that provides word embeddings, among other features. For this tutorial, you’ll want to install the medium or large English model:

In [None]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 15.7 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


spaCy’s `en_core_web_md` model includes 20,000 pre-trained word embeddings. Each of these embeddings is a 300-dimensional vector, capturing semantic information about the corresponding word. This is more than enough for the examples that you’ll see next, but if you have the appetite for more word embeddings, then you can download the `en_core_web_lg` model, which has *514,000* embeddings.

With `spaCy`’s medium or large English model installed, you’re ready to get started using word embeddings. It only takes a few lines of code to look up embeddings:



In [None]:
import spacy




In [None]:
nlp = spacy.load('en_core_web_md')


In [None]:
dog_embedding = nlp.vocab['dog'].vector

In [None]:
dog_embedding.shape

(300,)

You first import `spacy` and load the medium English model into an object called `nlp`. You then look up the embedding for the word dog with `nlp.vocab["dog"].vector` and store it as `dog_embedding`. The embedding is a NumPy array, and `dog_embedding.shape` indicates that it has 300 dimensions.

This is pretty neat! The `nlp.vocab` object allows you to find the word embedding for any word in the model’s vocabulary.

You can now assess the similarity between word embeddings using metrics like cosine similarity. To do this, create a new function called `compute_cosine_similarity`:

In [None]:
import numpy as np

def compute_cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Compute the cosine similarity between two vectors"""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

This function computes the cosine similarity between two NumPy arrays, `u` and `v`, using the definition discussed previously. You can pass word embeddings directly from `spaCy` into `compute_cosine_similarity()` to see how related they are:



In [None]:
dog_embedding = nlp.vocab["dog"].vector
cat_embedding = nlp.vocab["cat"].vector
apple_embedding = nlp.vocab["apple"].vector
tasty_embedding = nlp.vocab["tasty"].vector
delicious_embedding = nlp.vocab["delicious"].vector
truck_embedding = nlp.vocab["truck"].vector

In [None]:
compute_cosine_similarity(dog_embedding, cat_embedding)

0.8220817

In [None]:
compute_cosine_similarity(apple_embedding, tasty_embedding)

0.47355863

In [None]:
compute_cosine_similarity(tasty_embedding, delicious_embedding)

0.84820914

In [None]:
compute_cosine_similarity(delicious_embedding, truck_embedding)

0.0897876

In [None]:
compute_cosine_similarity(dog_embedding, truck_embedding)

0.25462714

In these cells, you reuse the `nlp` object and the `compute_cosine_similarity()` function from earlier. You look up and store embeddings for six common words from the model’s vocabulary. By computing the cosine similarity between these embeddings, you get a sense for how the model views their semantic relationship. Here are some important observations about the similarity scores:

* The cat and dog embeddings have a relatively high cosine similarity. This is likely because cats and dogs are common house pets, and you can find the word dog close to the word cat in English texts.

* The delicious and tasty embeddings also have a high cosine similarity because they have almost the same meaning. However, unlike the dog and cat embeddings, delicious and tasty have similar word embeddings because you can use them interchangeably.

* The apple and tasty embeddings have a moderate cosine similarity near 0.47. This is because tasty is a commonly used adjective to describe an apple. The cosine similarity may not be higher because apple and tasty aren’t always used in the same context. The word tasty can describe any food, not just apples.

* The truck and delicious embeddings have a cosine similarity close to 0. As you might expect, truck and delicious aren’t words that commonly appear in the same context.

Word embeddings are great for capturing the semantic relationships between words, but what if you wanted to take things to the next level and analyze the similarity between sentences or documents? It turns out you can accomplish this with text embeddings, and these are the kinds of embeddings that you’ll most often store in vector databases. More on that in the next section.

### Text Embeddings
Text embeddings encode information about sentences and documents, not just individual words, into vectors. This allows you to compare larger bodies of text to each other just like you did with word vectors. Because they encode more information than a single word embedding, text embeddings are a more powerful representation of information.

Text embeddings are typically the fundamental objects stored in vector databases like `ChromaDB`, and in this section, you’ll learn how to create and compare them.

> Note: The best text embedding models are built using transformers, which leverage a mechanism known as attention. To oversimplify things, the attention mechanism helps create context-specific word embeddings that fuse into text embeddings.

The most efficient way to generate text embeddings is to use pretrained models. These models vary in size, but they’re all typically trained on a large corpus of text, enabling them to pick up on complex semantic relationships. The `SentenceTransformers` library in Python is one of the best tools for this. You can install `sentence-transformers` with the following command:

In [None]:
!pip install sentence-transformers



Generating text embeddings with `SentenceTransformers` is just as straightforward as using word vectors in spaCy. Here’s an example to get you started:





In [None]:
# Import the SentenceTransformer class and load a pretrained model

from sentence_transformers import SentenceTransformer


model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [
    "The canine barked loudly.",
    "The dog made a noisy bark.",
    "He ate a lot of pizza.",
    "He devoured a large quantity of pizza pie.",
]



In [None]:
text_embeddings = model.encode(texts)

You first import the `SentenceTransformer` class and load the `"all-MiniLM-L6-v2"` model into an object called `model`. This is one of the smallest pretrained models available, but it’s a great one to start with.

> Note: The first time you use a model in SentenceTransformers, you’ll automatically download and save it in your environment. The initial download will take a few seconds depending on how large the model is, but after that, the model should load quickly.

Next, you define a list of sentences and call `model.encode(texts)` to create the corresponding text embeddings. Notice that `text_embeddings` is a NumPy array with the shape `(4, 384)`, which means that it has 4 rows and 384 columns. This is because you encoded 4 texts, and `all-MiniLM-L6-v2` generates 384-dimensional embeddings.

Here's a breakdown of what happens behind the scenes:

1. Sentence Transformer Initialization:

    * When you create a SentenceTransformer with `'all-MiniLM-L6-v2'`, you're loading a pre-trained model. This specific model is a compact MiniLM-based model fine-tuned for sentence similarity tasks, and it's smaller and faster than larger alternatives such as `all-mpnet-base-v2`.
    * This pre-trained model already has knowledge of language and relationships between words, acquired through training on a massive dataset.

2. Text Processing:

    * Your input texts are tokenized. This means they are split into individual words or sub-word units.
    * These tokens are then converted into numerical representations that the model can understand.

3. Transformer Encoding:

    * The tokenized input is fed into the transformer model (MiniLM in this case).
    * The transformer processes the input sequence through multiple layers of self-attention and feed-forward networks. This allows it to capture complex relationships between words and understand the overall meaning of the sentence.
    * The output of the transformer is a sequence of contextualized word embeddings. Each word's embedding now represents its meaning within the context of the entire sentence.

4. Pooling:

    * To obtain a single fixed-length sentence embedding, a pooling operation is applied to the sequence of word embeddings. Common pooling methods include:
        * Mean pooling: Averaging all the word embeddings.
        * Max pooling: Taking the maximum value for each dimension across all word embeddings.

5. Output:

    * The `model.encode(texts)` function returns a NumPy array (`text_embeddings`) where each row represents a sentence, and each column represents a dimension in the embedding space. In this case, you'll have a 4x384 array, as you encoded 4 sentences, and `'all-MiniLM-L6-v2'` produces 384-dimensional embeddings.

In essence, the SentenceTransformer model takes your text, breaks it down, understands its meaning using the transformer, and then condenses this understanding into a dense vector representation. This vector captures the semantic essence of the input text.
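The pooling step is the easiest one to illustrate in isolation. The sketch below applies mean and max pooling to a small, made-up matrix of token embeddings. The values and the name `token_embeddings` are assumptions for illustration, and a real implementation typically also ignores padding tokens when averaging:

```python
import numpy as np

# Hypothetical contextualized token embeddings for a 3-token sentence.
# Real models produce hundreds of dimensions; 4 keeps the example readable.
token_embeddings = np.array([
    [0.2, -0.1, 0.5, 0.0],
    [0.4,  0.3, 0.1, 0.2],
    [0.0,  0.1, 0.3, 0.4],
])

# Mean pooling: average each dimension across tokens
mean_pooled = token_embeddings.mean(axis=0)

# Max pooling: take the largest value in each dimension across tokens
max_pooled = token_embeddings.max(axis=0)

print(mean_pooled.shape, max_pooled.shape)  # (4,) (4,)
print(mean_pooled)
print(max_pooled)  # [0.4 0.3 0.5 0.4]
```

Either way, a variable number of token vectors collapses into one fixed-length vector, which is what lets you compare texts of different lengths.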


While all the texts in this example are single sentences, you can encode longer texts up to a maximum sequence length. For example, `all-MiniLM-L6-v2` encodes texts up to 256 word pieces (tokens) by default, and it truncates any text longer than this.

You now have a text embedding for all four texts, and just like with word embeddings, you can compare them using cosine similarity:

In [None]:
text_embeddings_dict = dict(zip(texts, list(text_embeddings)))
text_embeddings_dict

{'The canine barked loudly.': array([ 3.49791571e-02, -6.75668102e-03,  4.14040610e-02,  1.02993436e-01,
         1.22451980e-03, -5.86038232e-02,  6.07008673e-03, -5.44335432e-02,
        -5.31257363e-03, -2.54312195e-02,  3.07495072e-02, -2.83202007e-02,
         3.18731405e-02,  4.32385169e-02,  2.42533647e-02,  2.52667833e-02,
         4.16827984e-02,  1.80074181e-02,  4.40936871e-02, -1.06606603e-01,
         4.04250255e-04,  7.75528774e-02,  4.00948524e-02, -1.38763273e-02,
        -4.20556292e-02, -7.21510686e-03,  2.03238986e-02, -8.29127803e-02,
         2.65511367e-02, -1.30723817e-02,  2.53968723e-02, -9.90436748e-02,
         1.84984896e-02,  1.03864567e-02, -4.68488457e-03, -2.48143300e-02,
         6.21577986e-02,  4.55181152e-02,  9.23830941e-02,  2.91003305e-02,
         3.50559987e-02,  4.20786291e-02, -1.08965877e-02, -8.03956538e-02,
        -9.96238440e-02, -1.85392443e-02, -4.03281748e-02, -5.50757460e-02,
         4.68578301e-02, -7.46965855e-02, -3.26624326e-02, 

In [None]:
dog_text_1 = "The canine barked loudly."
dog_text_2 = "The dog made a noisy bark."
compute_cosine_similarity(text_embeddings_dict[dog_text_1],
                          text_embeddings_dict[dog_text_2])

0.77686167

In [None]:
pizza_text_1 = "He ate a lot of pizza."
pizza_text_2 = "He devoured a large quantity of pizza pie."
compute_cosine_similarity(text_embeddings_dict[pizza_text_1],
                          text_embeddings_dict[pizza_text_2])

0.78713405

In [None]:
compute_cosine_similarity(text_embeddings_dict[dog_text_1],
                          text_embeddings_dict[pizza_text_1])

0.09128271

In the above code, you use `dict()` and `zip()` together to create a dictionary where the keys are the four sentences and the values are their embeddings. This allows you to directly look up the embeddings for each text. You then compute the cosine similarity between a few pairs of texts. Here are some important conclusions:

* The cosine similarity between `"The canine barked loudly"` and `"The dog made a noisy bark"` is relatively high even though the two sentences use different words. The same is true for the similarity between `"He ate a lot of pizza"` and `"He devoured a large quantity of pizza pie"`. Because the text embeddings encode semantic meaning, any pair of related texts should have a high cosine similarity.

* As you might expect, the cosine similarity between `"The canine barked loudly"` and `"He ate a lot of pizza"` is low because the sentences are unrelated to each other.

This example, while straightforward, illustrates a powerful idea that underpins vector databases. That is, you can take a collection of unstructured objects, compute and store their embeddings, and then compare these embeddings to one another or to new embeddings. In this case, the unstructured objects are text, but keep in mind that the same idea can work for other data like images and audio.

Now that you’re up to speed on vectors and embeddings, you’re ready to get started with ChromaDB! In the next section, you’ll learn about vector databases and get a hands-on overview of ChromaDB.

## Get Started With ChromaDB, an Open-Source Vector Database

Now that you understand the mechanisms behind ChromaDB, you’re ready to tackle a real-world scenario. Say you have a library of thousands of documents, and you need a way to search through them.

In particular, you want to be able to make queries that point you to relevant documents. For example, if your query is `find me documents containing financial information`, then you want whatever system you use to point you to a financial document in your library.

How would you design this system? With your knowledge of vectors and embeddings, your first inclination might be to run all of the documents through an embedding algorithm and store the documents and embeddings together. You’d then convert a new query to an embedding and use cosine similarity to find the documents that are most relevant to the query.
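A minimal sketch of that design might look like the following. To keep it self-contained, it uses a placeholder embedding that only counts words. `toy_embed()` and `search()` are hypothetical names, and a real system would swap in a trained embedding model so that texts with different wording but similar meaning still match:

```python
import numpy as np

documents = [
    "quarterly revenue and profit report",
    "employee vacation policy",
    "annual revenue forecast and budget",
]

# Build a vocabulary from the documents. A real system would use a trained
# embedding model instead of word counts; toy_embed() is only a placeholder.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def toy_embed(text):
    """Placeholder embedding: count how often each vocabulary word appears."""
    words = text.lower().split()
    return np.array([words.count(term) for term in vocabulary], dtype=float)

def search(query, n_results=2):
    """Rank every document by cosine similarity to the query (brute force)."""
    query_vec = toy_embed(query)
    if not query_vec.any():
        return []  # no overlap with the vocabulary, so nothing to rank
    doc_matrix = np.array([toy_embed(doc) for doc in documents])
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    scores = doc_matrix @ query_vec / norms
    ranked = np.argsort(scores)[::-1][:n_results]
    return [(documents[i], float(scores[i])) for i in ranked]

for doc, score in search("revenue budget"):
    print(f"{score:.2f}  {doc}")
```

This prints the forecast document first with a score of 0.63, followed by the quarterly report at 0.32. Every query is compared against every document, which is fine for three documents but becomes slow as the library grows.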

While you’re perfectly capable of writing the code for this, you’re sure there has to be something out there to do this for you. Enter vector databases!

### What is a Vector Database?

A vector database is a database that allows you to efficiently store and query embedding data. Vector databases extend the capabilities of traditional relational databases to embeddings. However, the key distinguishing feature of a vector database is that query results aren’t an exact match to the query. Instead, using a specified similarity metric, the vector database returns embeddings that are similar to a query.

As an example use case, suppose you’ve stored company documents in a vector database. This means each document has been embedded and can be compared to other embeddings through a similarity metric like cosine similarity.

The vector database will accept a query like "how much revenue did the company make in Q2 2023" and embed the query. It’ll then compare the embedded query to other embeddings in the vector database and return the documents that have embeddings that are most similar to the query embedding.

In this example, perhaps the most similar document says something like "Company XYZ reported $15 million in revenue for Q2 2023". The vector database identified this document because its embedding was the most similar to the embedding of "how much revenue did the company make in Q2 2023", reflecting how close the two texts are in meaning.

To make this possible, vector databases are equipped with features that balance the speed and accuracy of query results. Here are the core components of a vector database that you should know about:

* **Embedding function**: When using a vector database, oftentimes you’ll store and query data in its raw form, rather than uploading embeddings themselves. Internally, the vector database needs to know how to convert your data to embeddings, and you have to specify an embedding function for this. For text, you can use the embedding functions available in the `SentenceTransformers` library or any other function that maps raw text to vectors.

* **Similarity metric**: To assess embedding similarity, you need a similarity metric like *cosine similarity*, *the dot product*, or *Euclidean distance*. As you learned previously, cosine similarity is a popular choice, but choosing the right similarity metric depends on your application.

* **Indexing**: When you’re dealing with a large number of embeddings, comparing a query embedding to every embedding stored in the database is often too slow. To overcome this, vector databases employ indexing algorithms that group similar embeddings together.

    At query time, the query embedding is compared to a smaller subset of embeddings based on the index. Because the embeddings recommended by the index aren’t guaranteed to have the highest similarity to the query, this is called approximate nearest neighbor search.

* **Metadata**: You can store metadata with each embedding to help give context and make query results more precise. You can filter your embedding searches on metadata much like you would in a relational database. For example, you could store the year that a document was published as metadata and only look for similar documents that were published in a given year.

* **Storage location**: With any kind of database, you need a place to store the data. Vector databases can store embeddings and metadata both in memory and on disk. Keeping data in memory allows for faster reads and writes, while writing to disk is important for persistent storage.

* **CRUD operations**: Most vector databases support create, read, update, and delete (CRUD) operations. This means you can maintain and interact with data like you would in a relational database.
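To make the similarity metric component more concrete, here’s a minimal sketch that computes cosine similarity, the dot product, and Euclidean distance with NumPy. The helper names and the example vectors are made up for illustration, and a real vector database computes these metrics internally:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (ignores vector length)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def dot_product(a, b):
    """Dot product of a and b (sensitive to both angle and length)."""
    return float(np.dot(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between a and b (smaller means closer)."""
    return float(np.linalg.norm(a - b))


# Illustrative vectors: one points the same way as the query, one is orthogonal
query = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])
orthogonal = np.array([0.0, 2.0])

for name, vector in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    print(
        name,
        cosine_similarity(query, vector),
        dot_product(query, vector),
        euclidean_distance(query, vector),
    )
```

Notice that cosine similarity treats `same_direction` as a perfect match even though it’s three times longer than `query`, while Euclidean distance doesn’t. That difference is one reason the right metric depends on your application.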

There’s a whole lot more detail and complexity that you could explore with vector databases, but these core concepts should be enough to get you going. Next up, you’ll get your hands dirty with ChromaDB, one of the most popular and user-friendly vector databases around.

### Meet ChromaDB for LLM Applications

ChromaDB is an open-source vector database designed specifically for LLM applications. ChromaDB offers you both a user-friendly API and impressive performance, making it a great choice for many embedding applications.

To get started, activate your virtual environment and run the following command:

In [None]:
!pip install chromadb


Because you have a grasp on vectors and embeddings, and you understand the motivation behind vector databases, the best way to get started is with an example. For this example, you’ll store ten documents to search over. To illustrate the power of embeddings and semantic search, each document covers a different topic, and you’ll see how well ChromaDB associates your queries with similar documents.

You’ll start by importing dependencies, defining configuration variables, and creating a ChromaDB client:

In [None]:
import chromadb

CHROMA_DATA_PATH = "chroma_data/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_docs"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)



You first import `chromadb`. Next, you specify the location where ChromaDB will store the embeddings on your machine in `CHROMA_DATA_PATH`, the name of the embedding model that you’ll use in `EMBED_MODEL`, and the name of your first collection in `COLLECTION_NAME`.

You then instantiate a `PersistentClient` object that writes your embedding data to `CHROMA_DATA_PATH`. By doing this, you ensure that data will be stored at `CHROMA_DATA_PATH` and persist to new clients. Alternatively, you can use `chromadb.Client()` to instantiate a ChromaDB instance that only writes to memory and doesn’t persist on disk.

Next, you instantiate your embedding function and the ChromaDB collection to store your documents in:

In [None]:
from chromadb.utils import embedding_functions

embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBED_MODEL)


In [None]:
collection = client.create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"}
)

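If you rerun this cell after the collection has already been created, then ChromaDB raises `UniqueConstraintError: Collection demo_docs already exists`. To avoid this, you can call `client.get_or_create_collection()` with the same arguments, which returns the existing collection when there is one.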

You specify an embedding function from the SentenceTransformers library. ChromaDB will use this to embed all your documents and queries. In this example, you’ll continue using the `"all-MiniLM-L6-v2"` model. You then create your first collection.

A collection is the object that stores your embedded documents along with any associated metadata. If you’re familiar with relational databases, then you can think of a collection as a table. In this example, your collection is named `demo_docs`, it uses the `"all-MiniLM-L6-v2"` embedding function that you instantiated, and it uses the cosine distance function as specified by `metadata={"hnsw:space": "cosine"}`.

Setting `metadata={"hnsw:space": "cosine"}` tells ChromaDB to use cosine distance in the collection’s HNSW index. HNSW stands for **Hierarchical Navigable Small World**, an **indexing** algorithm for approximate nearest neighbor (ANN) search. It builds a layered graph over your embeddings so that ChromaDB can find vectors similar to your query vector without exhaustively comparing it to every vector in the collection. This speeds up similarity search considerably, especially for large collections in high-dimensional spaces.

`hnsw:space` is the metadata key that tells the HNSW index which distance metric to use when comparing vectors. Besides `"cosine"`, ChromaDB accepts `"l2"` (squared Euclidean distance, the default) and `"ip"` (inner product).
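To get a feel for why approximate search saves work, here’s a small sketch that compares an exhaustive search with a bucketed one. Note that this is not HNSW: it uses a much simpler coarse-partitioning scheme as a stand-in, and the random embeddings, bucket count, and function names are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 200 random stand-in "embeddings", normalized to unit length so that
# the dot product equals cosine similarity
embeddings = rng.normal(size=(200, 8))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Index step: use the first few vectors as anchors and assign every
# embedding to the bucket of its most similar anchor
n_buckets = 4
anchors = embeddings[:n_buckets]
bucket_of = np.argmax(embeddings @ anchors.T, axis=1)


def exact_search(query):
    """Compare the query to every stored embedding."""
    return int(np.argmax(embeddings @ query))


def approximate_search(query):
    """Compare the query only to embeddings in the closest bucket."""
    bucket = int(np.argmax(anchors @ query))
    candidates = np.flatnonzero(bucket_of == bucket)
    best = candidates[np.argmax(embeddings[candidates] @ query)]
    return int(best), len(candidates)


query = rng.normal(size=8)
query /= np.linalg.norm(query)

exact_id = exact_search(query)
approx_id, n_compared = approximate_search(query)

print(f"Exact search compared {len(embeddings)} embeddings, best ID: {exact_id}")
print(f"Approximate search compared {n_compared} embeddings, best ID: {approx_id}")
```

The approximate search looks at fewer embeddings, but it can miss the true nearest neighbor when that neighbor lives in a different bucket. HNSW makes the same kind of trade-off with a more sophisticated graph structure.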

The last step in setting up your collection is to add documents and metadata:

In [None]:
documents = [
    "The latest iPhone model comes with impressive features and a powerful camera.",
    "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
    "Einstein's theory of relativity revolutionized our understanding of space and time.",
    "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
    "The American Revolution had a profound impact on the birth of the United States as a nation.",
    "Regular exercise and a balanced diet are essential for maintaining good physical health.",
    "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
    "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
    "Startup companies often face challenges in securing funding and scaling their operations.",
    "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
]

genres = [
    "technology",
    "travel",
    "science",
    "food",
    "history",
    "fitness",
    "art",
    "climate change",
    "business",
    "music",
]

collection.add(
    documents=documents,
    ids=[f"id{i}" for i in range(len(documents))],
    metadatas=[{"genre": genre} for genre in genres],
)

In this block, you define a list of ten documents in `documents` and specify the genre of each document in `genres`. You then add the documents and genres using `collection.add()`. Each document in the `documents` argument is embedded and stored in the collection. You also have to define the `ids` argument to uniquely identify each document and embedding in the collection. You accomplish this with a list comprehension that creates a list of ID strings.

The `metadatas` argument is optional, but most of the time, it’s useful to store metadata with your embeddings. In this case, you define a single metadata field, *"genre"*, that records the genre of each document. When you query a document, metadata provides you with additional information that can be helpful to better understand the document’s contents. You can also filter on metadata fields, just like you would in a relational database query.
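If you’d like to see what those list comprehensions produce before handing them to `collection.add()`, the following sketch builds the IDs and metadata for a shortened, made-up list of documents. The placeholder documents and the sanity checks are illustrative additions rather than something that ChromaDB requires you to write:

```python
# Shortened placeholder data, standing in for the ten documents and genres
documents = ["A document about phones.", "A document about beaches.", "A document about physics."]
genres = ["technology", "travel", "science"]

# Same comprehensions as in the collection.add() call
ids = [f"id{i}" for i in range(len(documents))]
metadatas = [{"genre": genre} for genre in genres]

# Optional sanity checks: the three lists line up, and the IDs are unique
assert len(documents) == len(ids) == len(metadatas)
assert len(set(ids)) == len(ids)

for doc_id, document, metadata in zip(ids, documents, metadatas):
    print(doc_id, metadata, document)
```

Because `range()` starts counting at zero, the first document gets the ID `id0`. That’s worth remembering when you look up or update documents by ID later on.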

With documents embedded and stored in a collection, you’re ready to run some semantic queries:

In [None]:
query_results = collection.query(
    query_texts=["Find me some delicious food!"],
    n_results=1,
)

In [None]:
query_results.keys()

dict_keys(['ids', 'embeddings', 'documents', 'uris', 'data', 'metadatas', 'distances', 'included'])

In [None]:
query_results["documents"]

[['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.']]

In [None]:
query_results["ids"]

[['id3']]

In [None]:
query_results["distances"]

[[0.7638264485407227]]

In [None]:
query_results["metadatas"]

[[{'genre': 'food'}]]

In this example, you query the `demo_docs` collection for documents that are most similar to the sentence *Find me some delicious food!*. You accomplish this using `collection.query()`, where you pass your queries in `query_texts` and specify the number of similar documents to find with `n_results`. In this case, you only asked for the single document that’s most similar to your query.

The results returned by `collection.query()` are stored in a dictionary. The keys that you’ll use most often are *ids*, *documents*, *metadatas*, and *distances*. Fields that you didn’t request, such as *embeddings*, are set to `None` by default. Each value is a list with one inner list per query, which is why the results above are nested. In other words, `collection.query()` returns the stored information about the documents that are most similar to your query.

As you can see, the embedding for *Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens* was most similar to the query *Find me some delicious food*. You probably agree that this document is the closest match. You can also see the ID, metadata, and distance associated with the matching document embedding. Here, you’re using cosine distance, which is one minus the cosine similarity between two embeddings.
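Since every field in the result is a list of lists, it can help to flatten the result into one row per match. The sketch below does that for a dictionary shaped like the output above. The dictionary is typed in by hand from that output, and `flatten_results()` is a hypothetical helper rather than part of ChromaDB:

```python
# Hand-copied from the query output above; a real result has more keys
query_texts = ["Find me some delicious food!"]
query_results = {
    "ids": [["id3"]],
    "documents": [[
        "Traditional Italian pizza is famous for its thin crust, "
        "fresh ingredients, and wood-fired ovens."
    ]],
    "metadatas": [[{"genre": "food"}]],
    "distances": [[0.7638264485407227]],
}


def flatten_results(query_texts, results):
    """Return one dictionary per (query, matching document) pair."""
    rows = []
    for query_index, query_text in enumerate(query_texts):
        matches = zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["metadatas"][query_index],
            results["distances"][query_index],
        )
        for doc_id, document, metadata, distance in matches:
            rows.append(
                {
                    "query": query_text,
                    "id": doc_id,
                    "genre": metadata["genre"],
                    "distance": distance,
                    "document": document,
                }
            )
    return rows


rows = flatten_results(query_texts, query_results)
for row in rows:
    print(row["query"], "->", row["id"], row["genre"], round(row["distance"], 3))
```

The outer index selects the query, and the inner index selects the match for that query, so the same helper works unchanged when you pass several queries or ask for several results.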

With `collection.query()`, you’re not limited to single queries or single results:

In [None]:
query_results = collection.query(
    query_texts=["Teach me about history"],
)

In [None]:
query_results["documents"][0]

["Einstein's theory of relativity revolutionized our understanding of space and time.",
 'The American Revolution had a profound impact on the birth of the United States as a nation.',
 "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
 'Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.',
 "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
 'The latest iPhone model comes with impressive features and a powerful camera.',
 "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
 'Regular exercise and a balanced diet are essential for maintaining good physical health.',
 'Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.',
 'Startup companies often face challenges in securing funding and scaling their operations.']

In [None]:
query_results["distances"][0]

[0.6265882786137,
 0.6904192995177738,
 0.8771599648570388,
 0.9187455171695804,
 0.9201708074723751,
 0.9903300023447454,
 1.0029226760096814,
 1.0276568123963041,
 1.0680427566360333,
 1.0846893520811818]

Here, you pass a single query, *Teach me about history*, and leave out `n_results`, so ChromaDB falls back to its default of ten results. Because the collection holds exactly ten documents, the query returns every document, ranked from most to least similar. If you only want some of the fields back, then you can pass an `include` argument, such as `include=["documents", "distances"]`.

Calling `query_results["documents"][0]` shows you the ranked documents for the first, and here only, query in `query_texts`, and `query_results["distances"][0]` contains the corresponding embedding distances. As an example, the cosine distance between *Teach me about history* and *Einstein’s theory of relativity revolutionized our understanding of space and time* is about **0.627**.

You can also pass several queries at once in `query_texts`. In that case, `query_results["documents"][1]` holds the matches for the second query, and `query_results["distances"][1]` holds the corresponding embedding distances. Recall that cosine distance is one minus cosine similarity, so a cosine distance of **0.80** corresponds to a cosine similarity of **0.20**.

**Cosine similarity vs. cosine distance**

* **Cosine similarity**: Measures how closely two vectors point in the same direction. It ranges from -1 (opposite directions) to 1 (same direction).
* **Cosine distance**: Measures the dissimilarity between two vectors. It’s one minus the cosine similarity, so it ranges from 0 to 2.

ChromaDB’s default distance is squared L2, but because you created this collection with `"hnsw:space": "cosine"`, it returns cosine distances. Therefore:

* Lower values mean higher similarity. A distance of 0 means the vectors point in the same direction.
* Higher values mean lower similarity. A distance of 1 means the vectors are orthogonal (no similarity), and values above 1, like the last few distances in the output above, mean that the cosine similarity is negative.

To get the cosine similarity, you subtract the distance from 1.
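As a quick check on the relationship between the two measures, here’s a short sketch that computes cosine distance for a few hand-picked vectors and then converts some of the distances reported above back into similarities. The `cosine_distance()` helper and the example vectors are illustrative and aren’t part of ChromaDB:

```python
import numpy as np


def cosine_distance(a, b):
    """One minus the cosine similarity of a and b."""
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - similarity)


v = np.array([1.0, 2.0])

# Same direction, orthogonal, and opposite direction
identical = cosine_distance(v, v)
orthogonal = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_distance(v, -v)
print(identical, orthogonal, opposite)

# A few distances copied from the query output above
reported_distances = [0.6265882786137, 0.6904192995177738, 1.0846893520811818]
similarities = [1.0 - distance for distance in reported_distances]
print(similarities)
```

The last reported distance is greater than 1, so its cosine similarity comes out slightly negative, which tells you that the startup document is a poor match for a history query.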


> Note: Keep in mind that so-called similar documents returned from a semantic search over embeddings may not actually be relevant to the task that you’re trying to solve. The success of a semantic search is somewhat subjective, and you or your stakeholders might not agree on the quality of the results.
> If there are no relevant documents in your collection for a given query, or your embedding algorithm wasn’t trained on the right or enough data, then your results might be poor. It’s up to you to understand your application, your stakeholders’ expectations, and the limitations of your embedding algorithm and document collection.

Another awesome feature of ChromaDB is the ability to filter queries on metadata. To motivate this, suppose you want to find the single document that’s most related to music history. If you reuse the query *Teach me about history* from before, then the most similar document is *Einstein’s theory of relativity revolutionized our understanding of space and time*, which isn’t the result that you’re looking for. Because you’re particularly interested in music history, you can filter on the `"genre"` metadata field to search over more relevant documents:

In [None]:
collection.query(
    query_texts=["Teach me about history"],
    where={"genre": {"$eq": "music"}},
    n_results=1,
)

In this query, you specify in the `where` argument that you’re only looking for documents with the `"music"` genre. To apply filters, ChromaDB expects a dictionary where the keys are metadata names and the values are dictionaries specifying how to filter. In plain English, you can interpret `{"genre": {"$eq": "music"}}` as *filter the collection where the "genre" metadata field equals "music"*.

Because only one document has the music genre, this query returns the document about Beethoven’s Symphony No. 9.

To make it slightly more difficult, you could filter on both history and music:

In [None]:
query_results = collection.query(
    query_texts=["Teach me about history"],
    where={"genre": {"$in": ["music", "history"]}},
    n_results=2,
)

query_results["documents"]

[['The American Revolution had a profound impact on the birth of the United States as a nation.',
  "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'"]]

In [None]:
query_results["distances"]

[[0.6904192995177738, 1.0029226760096814]]

This query filters the collection down to documents that have either a music or history genre, as specified by `where={"genre": {"$in": ["music", "history"]}}`. As you can see, the American Revolution document is the most similar, with a cosine distance of about 0.69, while the Beethoven document comes second at about 1.00. These were straightforward filtering examples on a single metadata field, but ChromaDB also supports other filtering operations that you might need.
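To build intuition for how the two `where` filters above narrow down the candidates, here’s a pure-Python sketch that applies the same conditions to the genre metadata. The `matches()` function is a hypothetical emulation that only understands `$eq` and `$in`. It isn’t how ChromaDB implements filtering:

```python
def matches(metadata, where):
    """Return True if the metadata satisfies every condition in where.

    Only the $eq and $in operators are emulated in this sketch.
    """
    for field, condition in where.items():
        value = metadata.get(field)
        for operator, operand in condition.items():
            if operator == "$eq":
                if value != operand:
                    return False
            elif operator == "$in":
                if value not in operand:
                    return False
            else:
                raise ValueError(f"Unsupported operator in this sketch: {operator}")
    return True


genres = [
    "technology", "travel", "science", "food", "history",
    "fitness", "art", "climate change", "business", "music",
]
metadatas = [{"genre": genre} for genre in genres]

# Positions of the documents that survive each filter
music_only = [
    i for i, metadata in enumerate(metadatas)
    if matches(metadata, {"genre": {"$eq": "music"}})
]
music_or_history = [
    i for i, metadata in enumerate(metadatas)
    if matches(metadata, {"genre": {"$in": ["music", "history"]}})
]

print(music_only)
print(music_or_history)
```

Only the documents that pass the filter are ranked by distance, which is why the music-only query can return just one document no matter how large you make `n_results`.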

If you want to update existing documents, embeddings, or metadata, then you can use `collection.update()`. This requires you to know the IDs of the data that you want to update. In this example, you’ll update both the documents and metadata for "id1" and "id2":

In [None]:
collection.update(
    ids=["id1", "id2"],
    documents=[
        "The latest iPhone model comes with impressive features and a powerful camera.",
        "Bali has beautiful beaches."
    ],
    metadatas=[{"genre": "tech"}, {"genre": "beaches"}]
)


Here, you overwrite the documents stored under "id1" and "id2", and you also modify their metadata. Because the IDs start at "id0", these two IDs originally held the Bali and Einstein documents. To confirm that your update worked, you can call `collection.get(ids=["id1", "id2"])`, which returns the updated documents and their metadata.

If you’re not sure whether a document exists for an ID, you can use `collection.upsert()`. This works the same way as `collection.update()`, except it’ll insert new documents for IDs that don’t exist.
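The difference between updating and upserting is easy to see with a plain dictionary standing in for a collection. In this toy sketch, `update()` and `upsert()` are hypothetical functions that mimic the behavior described above, and the choice to silently skip unknown IDs in `update()` is an assumption of the sketch:

```python
def update(store, ids, documents):
    """Overwrite documents for IDs that already exist and skip unknown IDs."""
    for doc_id, document in zip(ids, documents):
        if doc_id in store:
            store[doc_id] = document


def upsert(store, ids, documents):
    """Overwrite documents for existing IDs and insert documents for new IDs."""
    for doc_id, document in zip(ids, documents):
        store[doc_id] = document


# A dictionary standing in for a collection, keyed by document ID
store = {
    "id0": "A document about phones.",
    "id1": "A document about Bali.",
}

# "id1" exists, so it's overwritten, while "id99" is skipped
update(store, ["id1", "id99"], ["Bali has beautiful beaches.", "A brand-new document."])
after_update = dict(store)

# With upsert, the unknown "id99" is inserted
upsert(store, ["id99"], ["A brand-new document."])

print(after_update)
print(store)
```

If you’re loading data from a source that may contain both new and previously seen IDs, then upserting saves you from checking which IDs already exist.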

Lastly, if you want to delete any items in the collection, then you can use `collection.delete()`:

In [None]:
collection.delete(ids=["id1", "id2"])