<details><summary style="display:list-item; font-size:16px; color:blue;">Jupyter Help</summary>
    
Having trouble testing your work? Double-check that you have followed the steps below to write, run, save, and test your code!
    
[Click here for a walkthrough GIF of the steps below](https://static-assets.codecademy.com/Courses/ds-python/jupyter-help.gif)

Run all initial cells to import libraries and datasets. Then follow these steps for each question:
    
1. Add your solution to the cell with `## YOUR SOLUTION HERE ##`.
2. Run the cell by selecting the `Run` button or the `Shift`+`Enter` keys.
3. Save your work by selecting the `Save` button, the `command`+`s` keys (Mac), or `control`+`s` keys (Windows).
4. Select the `Test Work` button at the bottom left to test your work.

![Screenshot of the buttons at the top of a Jupyter Notebook. The Run and Save buttons are highlighted](https://static-assets.codecademy.com/Paths/ds-python/jupyter-buttons.png)

</details>

**Setup**

Run the setup cell to import our vector database and instantiate our embedding model.

In [1]:
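import chromadb
from chromadb.utils import embedding_functions
# Chroma's default embedding function wraps the all-MiniLM-L6-v2 model
embedding_function = embedding_functions.DefaultEmbeddingFunction()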

**Embedding Strings**

Now that we've been introduced to the relevant concepts, let's practice performing similarity search of our own.

In the setup cell, we instantiated the `all-MiniLM-L6-v2` embedding model from Sentence Transformers, which is included in the Chroma package by default. This pre-trained language model produces 384-dimensional embeddings, which means each piece of text will be represented by a vector of 384 numbers.

You can read more about this model here: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

Let's start by embedding a simple string of text using this model. Run the code cell below to see how to get the embedding for a string.

In [2]:
embedding = embedding_function(["Welcome to this RAG course!"])
print(embedding[0][:5]) # we only print the first five values

[0.011329571716487408, 0.052544090896844864, 0.07574588060379028, 0.0004700662975665182, -0.0421183742582798]


As we can see, `embedding_function` takes a list of strings and returns a list of embeddings. In the example cell, we only print the first five values of the embedding of "Welcome to this RAG course!" The full embedding, however, is a vector with 384 dimensions.

#### Checkpoint 1/3

Create an embedding for a custom text of your choice using the `embedding_function`. Replace the value of `text_to_embed` with your own text. Remember that the `embedding_function` accepts a single list as an argument.

After generating the embedding, you'll print its length to confirm that it has the expected 384 dimensions. The `embedding_function` will return a list. The embedding will be the first item in that list.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

In [3]:
# Embed a custom text
text_to_embed = "Replace this with your own text"

## YOUR SOLUTION HERE ##
# Generate embedding of your text
my_embedding = embedding_function([text_to_embed])

# Print the length of the embedding to confirm it has 384 dimensions
## YOUR SOLUTION HERE ##
print(len(my_embedding[0]))

384


**Creating and Populating a Chroma Collection**

OK, we now know how to generate embeddings, which we can use to conduct similarity search.
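
To make "similarity search" concrete before we bring in Chroma, here is a minimal sketch in plain Python. The three-dimensional vectors and the names `toy_vectors` and `most_similar` are made up for illustration. Real embeddings from `embedding_function` have 384 dimensions, and a vector database typically uses an index rather than the exhaustive scan shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" keyed by document id
toy_vectors = {
    "id1": [1.0, 0.0, 0.0],
    "id2": [0.0, 1.0, 0.0],
    "id3": [0.7, 0.7, 0.0],
}

def most_similar(query_vector, vectors, n_results=1):
    """Return the ids of the n_results stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: cosine_similarity(query_vector, vectors[doc_id]),
        reverse=True,  # highest similarity first
    )
    return ranked[:n_results]

print(most_similar([0.9, 0.1, 0.0], toy_vectors, n_results=2))  # ['id1', 'id3']
```

The query vector points almost in the same direction as `id1`, so `id1` ranks first, followed by `id3`, which lies between the other two.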

Now, let's explore storing and searching embeddings with Chroma. We'll start by initializing a Chroma client, creating a collection, and adding sample data.

The Chroma client is the primary interface for interacting with the Chroma vector database. We'll use the client to create a *collection*, where we'll store our vectors. 

In the code cell below, we initialize a Chroma client, create a collection, and add two embeddings with their corresponding documents and ids.

Finally, we look at the first few rows of the new collection with the `.peek()` method.

In [4]:
# Initialize the Chroma client
chroma_client = chromadb.Client()

# Create a new collection
collection = chroma_client.create_collection(name="my_test_collection")

my_docs = ["This is a document", "This is another document"]
embeddings = embedding_function(my_docs)

# Add some sample data to the collection
collection.add(
    embeddings=embeddings,
    documents=my_docs,
    ids=["id1", "id2"]
)

# Peek at the first few rows of the collection
collection.peek()

{'ids': ['id1', 'id2'],
 'embeddings': [[-0.06602787971496582,
   0.17150495946407318,
   -0.0005892717163078487,
   0.018142186105251312,
   0.0023983698338270187,
   0.0110479686409235,
   -0.026964809745550156,
   0.02810460887849331,
   0.03565191105008125,
   0.05120042711496353,
   0.0007242555730044842,
   0.074327252805233,
   -0.01497076265513897,
   -0.02220062166452408,
   -0.11353682726621628,
   0.017623422667384148,
   -0.05495397001504898,
   -0.0016642519040033221,
   0.00702937226742506,
   0.0625418946146965,
   0.031576432287693024,
   0.13108083605766296,
   0.007027113810181618,
   -0.013396484777331352,
   0.022983085364103317,
   0.07595545053482056,
   -0.0890459194779396,
   0.018912803381681442,
   0.06100377440452576,
   -0.026860538870096207,
   0.019091319292783737,
   0.05663471668958664,
   0.09586423635482788,
   0.02442280389368534,
   0.02423647791147232,
   0.027704017236828804,
   0.06225651130080223,
   -0.005705961957573891,
   0.042252108454704285

#### Checkpoint 2/3

Create a new collection using `chroma_client.create_collection` and give it a name of your choice. Save this in the given variable `my_collection`.

Next, generate embeddings for the three documents in the `docs` list using the `embedding_function`. Remember: the embedding function accepts a list of strings and returns a list of embeddings.

Finally, pass the embeddings, the docs, and a list of unique ids to their corresponding parameters in `my_collection.add()` and use the `peek()` method to display the first few rows of your new collection.
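
Alongside the embeddings and documents, `my_collection.add()` needs an `ids` list with one unique string per document. A small sketch of building matching ids in plain Python follows; the `id1`, `id2`, ... naming simply mirrors the earlier example and is not required by Chroma.

```python
docs = ["document one", "document two", "document three"]

# One id per document, numbered from 1
ids = [f"id{i}" for i in range(1, len(docs) + 1)]

# Sanity checks: same length as docs, and no duplicates
assert len(ids) == len(docs)
assert len(set(ids)) == len(ids)

print(ids)  # ['id1', 'id2', 'id3']
```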

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

In [5]:
## YOUR SOLUTION HERE ##
# Create a new collection
my_collection = chroma_client.get_or_create_collection(name="my_collection")

# Generate embeddings for the three documents
## YOUR SOLUTION HERE ##
docs = ["document one", "document two", "document three"]
embeddings = embedding_function(docs)

# Add the embeddings and documents to the collection
my_collection.add(
    ## YOUR SOLUTION HERE ##
    embeddings=embeddings,
    documents=docs,
    ids=["id1", "id2", "id3"]
)

# Peek at the first few rows of the new collection
## YOUR SOLUTION HERE ##
my_collection.peek()

{'ids': ['id1', 'id2', 'id3'],
 'embeddings': [[-0.08633355051279068,
   0.08902689814567566,
   0.026804370805621147,
   0.011506963521242142,
   -0.010133574716746807,
   -0.005830519832670689,
   -0.00198702747002244,
   0.05372660979628563,
   0.02667728066444397,
   0.009986907243728638,
   -0.017697615548968315,
   0.057498954236507416,
   0.03698812425136566,
   -0.01948704943060875,
   -0.0031605828553438187,
   -0.00836500059813261,
   -0.0828881487250328,
   -0.029623204842209816,
   0.019875135272741318,
   0.07764454931020737,
   -0.029613057151436806,
   0.043174054473638535,
   0.011861334554851055,
   -0.002964151557534933,
   -0.007586163002997637,
   0.000893469201400876,
   -0.11177945882081985,
   -0.044148076325654984,
   -0.025330925360322,
   -0.06839500367641449,
   0.07826012372970581,
   0.04236732795834541,
   0.020977383479475975,
   0.04261021316051483,
   0.13014808297157288,
   0.016672436147928238,
   0.09599524736404419,
   0.027027077972888947,
   0.013

Note that there's an empty list for `metadatas` revealed when we `peek` at the collection. We'll introduce metadata and metadata filtering in a forthcoming exercise.

For now, let's find out how to search our collection for similar embeddings.

**Semantic Search with Chroma**

Chroma supports several distance metrics for performing semantic search: cosine distance (based on cosine similarity), squared L2 distance, and inner product. Let's create a collection that uses cosine similarity.

In [6]:
# Create a new Chroma collection that uses cosine similarity
cosine_collection = chroma_client.get_or_create_collection(
        name="cosine_collection",
        metadata={"hnsw:space": "cosine"}
    )

In this code block, we create a new Chroma collection called `cosine_collection` using the `get_or_create_collection()` method of the Chroma client, passing a second named argument for metadata.

The `metadata` parameter accepts a dictionary that allows us to configure the behavior of the collection. In this case, we set the `"hnsw:space"` key to `"cosine"`, which tells Chroma to use cosine distance (one minus the cosine similarity) as the metric for the HNSW (Hierarchical Navigable Small World) index that powers the semantic search. With this metric, a smaller distance means a closer match.
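
As a rough sketch of how these metrics differ, the functions below compute each distance in plain Python. The formulas follow Chroma's documented definitions as we understand them (`l2` is the squared Euclidean distance, while `ip` and `cosine` are one minus the inner product and one minus the cosine similarity), so treat them as an assumption and check the Chroma documentation for your version. The function names and sample vectors are illustrative only.

```python
import math

def squared_l2_distance(a, b):
    """Sum of squared differences (assumed to match the "l2" space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a, b):
    """One minus the dot product (assumed to match the "ip" space)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine similarity (assumed to match the "cosine" space)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

a = [1.0, 0.0]
b = [0.0, 2.0]  # perpendicular to a
c = [3.0, 0.0]  # same direction as a, but longer

print(squared_l2_distance(a, b))     # 5.0
print(inner_product_distance(a, b))  # 1.0
print(cosine_distance(a, b))         # 1.0
print(cosine_distance(a, c))         # 0.0
```

Cosine distance ignores vector length, which is why `a` and `c` are at distance 0 even though their squared L2 distance is 4.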

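Now, let's add some sample documents to this collection. Notice that we don't have to generate the embeddings ourselves. Because we didn't specify an embedding function when creating the collection, Chroma embeds the documents with its default, which is the same `all-MiniLM-L6-v2` model we set up earlier.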

To show how similarity search works, we'll add to our new collection the following two statements:

- In this document, we'll talk all about big cats. Tigers, mountain lions, panthers, and other ferocious felines.
- In this document we'll discuss the solar system: moons, planets, and asteroids. We'll also talk about the sun and the stars.

In more naive forms of search, such as keyword matching, retrieving either of these documents would likely require a query that contains words from the document itself. Part of the magic of embedding-based semantic search is that a query can retrieve a document without sharing any of its meaningful words.
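
To see why keyword matching struggles here, the sketch below compares the wildlife query used later in this exercise with both documents. The `content_words` helper and the small hand-picked stopword list are assumptions made for this illustration; they are not part of Chroma.

```python
import re

# A small, hand-picked stopword list for this example only
STOPWORDS = {"i'm", "in", "the", "to", "about", "and", "this", "we'll", "all", "other", "also"}

def content_words(text):
    """Lowercase the text and keep the words that are not stopwords."""
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS

documents = {
    "id1": "In this document, we'll talk all about big cats. Tigers, mountain lions, panthers, and other ferocious felines.",
    "id2": "In this document we'll discuss the solar system: moons, planets, and asteroids. We'll also talk about the sun and the stars.",
}
query = "I'm in the mood to read about wildlife, animals and nature."

for doc_id, text in documents.items():
    shared = content_words(query) & content_words(text)
    print(doc_id, shared)  # both lines print an empty set
```

Once common words are set aside, the query shares no words with either document, so keyword overlap alone cannot tell which one is about animals.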

In [7]:
# Add two documents to the cosine_collection
cosine_collection.add(
    documents=["In this document, we'll talk all about big cats. Tigers, mountain lions, panthers, and other ferocious felines.",
               "In this document we'll discuss the solar system: moons, planets, and asteroids. We'll also talk about the sun and the stars."],
    ids=["id1", "id2"]
)

# Peek at the first few rows of the cosine_collection
cosine_collection.peek()

{'ids': ['id1', 'id2'],
 'embeddings': [[0.01748039945960045,
   0.04280036687850952,
   0.0006748950108885765,
   0.0666121244430542,
   -0.045822445303201675,
   -0.004645896144211292,
   -0.04351050406694412,
   -0.07402431219816208,
   -0.05519437417387962,
   0.04498982056975365,
   -0.030888546258211136,
   -0.06045880913734436,
   -0.059546299278736115,
   0.024731310084462166,
   -0.03388407453894615,
   0.02296214923262596,
   -0.029364561662077904,
   -0.05966084823012352,
   -0.012157085351645947,
   0.05620809271931648,
   0.02865065075457096,
   0.06231977045536041,
   0.006445934996008873,
   0.06030651554465294,
   -0.10825696587562561,
   -0.02892524190247059,
   -0.14275632798671722,
   0.05002284795045853,
   0.007584958802908659,
   -0.04945380985736847,
   -0.10318565368652344,
   -0.010468323715031147,
   0.06755015254020691,
   0.1000571921467781,
   -0.020035549998283386,
   -0.04020274057984352,
   0.07068999111652374,
   -0.0014667754294350743,
   0.06982392072

Now, let's perform a semantic search.

Queries are performed with the `.query()` method. We can pass multiple queries to the `query_texts` argument at once, though in this case we only pass one. The other argument, `n_results`, specifies how many documents are returned for each query.

Notice how the query returns the relevant document, even though the query and the document string share no meaningful words.

In [8]:
# Perform a semantic search on the cosine_collection
cosine_collection.query(query_texts=["I'm in the mood to read about wildlife, animals and nature."], n_results=1)

{'ids': [['id1']],
 'distances': [[0.6165463924407959]],
 'metadatas': [[None]],
 'embeddings': None,
 'documents': [["In this document, we'll talk all about big cats. Tigers, mountain lions, panthers, and other ferocious felines."]],
 'uris': None,
 'data': None}
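
The result is a dictionary of nested lists: the outer list has one entry per query text, and each inner list has one entry per returned document. As a small sketch, the code below walks through a copy of the output above (pasted in as a literal so the snippet runs on its own) and pulls out the id, distance, and document for each result.

```python
# Result copied from the query output above
results = {
    "ids": [["id1"]],
    "distances": [[0.6165463924407959]],
    "metadatas": [[None]],
    "embeddings": None,
    "documents": [["In this document, we'll talk all about big cats. Tigers, mountain lions, panthers, and other ferocious felines."]],
    "uris": None,
    "data": None,
}

# Outer index: which query text; inner index: rank of the result for that query
for query_index, ids_for_query in enumerate(results["ids"]):
    for rank, doc_id in enumerate(ids_for_query):
        distance = results["distances"][query_index][rank]
        document = results["documents"][query_index][rank]
        print(f"query {query_index}, rank {rank}: {doc_id} (distance {distance:.3f})")
        print(document)
```

With a single query and `n_results=1`, the top match is always at `results["ids"][0][0]`.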

#### Checkpoint 3/3

Using the `.add()` method, add the following document strings to the `cosine_collection`.

- The internal combustion engine was a groundbreaking invention that paved the way for the modern automobile.
- The North Pole is among the coldest places on the planet, home to polar bears, seals, and penguins.

Give these documents the ids `"id3"` and `"id4"`, respectively.

Then, perform a query that returns the document with `"id3"` (the one about the internal combustion engine) without using any of the words in the document's string. Specify that you want only one result returned from the query.

Don't forget to run the cell and save the notebook before selecting `Test Work`! Open the `Jupyter Help` toggle at the top of the notebook for more details.

In [9]:
# Add the new documents and IDs to the cosine_collection
cosine_collection.add(
  ## YOUR SOLUTION HERE ##
    documents=["The internal combustion engine was a groundbreaking invention that paved the way for the modern automobile.",
               "The North Pole is among the coldest places on the planet, home to polar bears, seals, and penguins."],
    ids=["id3", "id4"]
)

# Perform a semantic search to find a document about cars, without using any of the words in the original document
## YOUR SOLUTION HERE ##
cosine_collection.query(query_texts=["I want to learn about vehicles."], n_results=1)

{'ids': [['id3']],
 'distances': [[0.7543686628341675]],
 'metadatas': [[None]],
 'embeddings': None,
 'documents': [['The internal combustion engine was a groundbreaking invention that paved the way for the modern automobile.']],
 'uris': None,
 'data': None}