# Basic embedding retrieval with Chroma

This notebook demonstrates the most basic use of Chroma to store and retrieve information using embeddings. This core building block is at the heart of many powerful AI applications.

## What are embeddings?

Embeddings are the AI-native way to represent any kind of data, which makes them a natural fit for AI-powered tools and algorithms. They can represent text and images, and soon audio and video.

To create an embedding, data is fed into an embedding model, which outputs a vector of numbers. The model is trained so that 'similar' data, e.g. text with similar meanings or images with similar content, produces vectors that are nearer to one another than those produced by dissimilar data.
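
As a rough illustration of what "nearer" means, the sketch below compares three hand-made vectors using cosine similarity. The vectors and their names are invented for this example and were not produced by any embedding model; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings", chosen by hand for illustration only
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, car)
print(f"cat vs kitten: {sim_related:.3f}")
print(f"cat vs car:    {sim_unrelated:.3f}")
```

The related pair scores much closer to 1.0 than the unrelated pair, which is the property a trained embedding model aims to produce for real data.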

## Embeddings and retrieval

We can use the similarity property of embeddings to search for and retrieve information. For example, we can find documents relevant to a particular topic, or images similar to a given image. Rather than searching for keywords or tags, we can search by finding data with similar semantic meaning.
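
Searching by similarity can be sketched as a brute-force scan: compute the distance from the query vector to every stored vector and keep the closest ones. The vectors, the document IDs, and the choice of squared Euclidean distance are assumptions made for this example. It shows the idea only and is not how Chroma is implemented; vector databases generally use an index instead of scanning everything.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query, vectors, k=2):
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Made-up toy vectors keyed by hypothetical document IDs
store = {
    "doc-cheese": [0.9, 0.1],
    "doc-yogurt": [0.8, 0.3],
    "doc-wind": [0.1, 0.9],
}
query_vec = [0.85, 0.15]

hits = nearest(query_vec, store, k=2)
for doc_id, distance in hits:
    print(doc_id, round(distance, 4))
```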


In [None]:
# ! pip install -Uq chromadb numpy datasets

In [None]:
! pip install chromadb datasets


## Example Dataset

As a demonstration we use the [SciQ dataset](https://arxiv.org/abs/1707.06209), available from [HuggingFace](https://huggingface.co/datasets/sciq).

Dataset description, from HuggingFace:

> The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

In this notebook, we will demonstrate how to retrieve supporting evidence for a given question.


In [None]:
# Get the SciQ dataset from HuggingFace
from datasets import load_dataset
import pandas as pd

In [None]:
%%time

dataset = load_dataset("sciq", split="train")


CPU times: user 916 ms, sys: 770 ms, total: 1.69 s
Wall time: 4.6 s


In [None]:
%%time

# Filter the dataset to only include questions with a support
dataset = dataset.filter(lambda x: x["support"] != "")

print("Number of questions with support: ", len(dataset))

Number of questions with support:  10481
CPU times: user 110 ms, sys: 5.87 ms, total: 115 ms
Wall time: 114 ms


In [None]:
dataset

Dataset({
    features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
    num_rows: 10481
})

## Loading the data into Chroma

Chroma comes with a built-in embedding model, which makes it simple to load text.
We can load the SciQ dataset into Chroma with just a few lines of code.


In [None]:
# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
import chromadb

client = chromadb.Client()

In [None]:
# Create a new Chroma collection to store the documents. We don't need to specify an embedding function; the default will be used.
collection = client.create_collection("vectordb5")

### Investigate the Data Type

In [None]:
# this is the input arg for documents
type(dataset["support"][:100]), type(dataset["support"][0]), dataset["support"][0]

(list,
 str,
 'Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophiles are often found living in or on the bodies of humans or other animals. The optimal growth temperature of many pathogenic mesophiles is 37°C (98°F), the normal human body temperature. Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine.')

In [None]:
# this is the input arg for metadatas
[{"type": "support"} for _ in range(0, 2)]

[{'type': 'support'}, {'type': 'support'}]

In [None]:
# Look at one question and its support side by side
i = 0
dataset["question"][i], dataset["support"][i]

('What type of organism is commonly used in preparation of foods such as cheese and yogurt?',
 'Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophiles are often found living in or on the bodies of humans or other animals. The optimal growth temperature of many pathogenic mesophiles is 37°C (98°F), the normal human body temperature. Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine.')

In [None]:
%%time

# Embed and store the first 100 questions for this demo.
# Each ID is the row index, so a hit can be mapped back to its support paragraph later.
collection.add(
    ids=[str(i) for i in range(0, 100)],  # IDs are just strings
    documents=dataset["question"][:100],
    metadatas=[{"type": "support"} for _ in range(0, 100)],
)

CPU times: user 18.7 s, sys: 44.8 ms, total: 18.7 s
Wall time: 5.28 s


In [None]:
collection.count()

100
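
The `collection.add` call above takes three parallel lists that must line up one-to-one. A small helper can build them together and check their lengths before anything is sent to Chroma. This is a sketch under assumptions: `build_add_payload` and its `doc_type` parameter are hypothetical names introduced here and are not part of Chroma.

```python
def build_add_payload(texts, doc_type):
    """Build parallel ids/documents/metadatas lists, using the row index as the ID."""
    texts = list(texts)
    payload = {
        "ids": [str(i) for i in range(len(texts))],
        "documents": texts,
        "metadatas": [{"type": doc_type} for _ in texts],
    }
    # The three lists must have the same length, one entry per document
    assert len(payload["ids"]) == len(payload["documents"]) == len(payload["metadatas"])
    return payload

payload = build_add_payload(["first text", "second text"], doc_type="question")
print(payload["ids"])
print(payload["metadatas"])
```

With a real collection, the payload could then be passed along as `collection.add(**payload)`.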

## Querying the data

Once the data is loaded, we can use Chroma to find supporting evidence for a question.
The collection stores the questions, so in this example we retrieve the stored questions closest to the query by embedding distance, then use their IDs to look up the corresponding supports in the dataset.

Chroma handles computing similarity and finding the most relevant results for you, so you can focus on building your application.


In [None]:
user_query = dataset["question"][0]
user_query

'What type of organism is commonly used in preparation of foods such as cheese and yogurt?'

In [None]:
results = collection.query(
    query_texts=user_query,
    n_results=2)

In [None]:
results

{'ids': [['0', '36']],
 'distances': [[0.0, 0.8735413551330566]],
 'metadatas': [[{'type': 'support'}, {'type': 'support'}]],
 'embeddings': None,
 'documents': [['What type of organism is commonly used in preparation of foods such as cheese and yogurt?',
   'Fungus-like protist saprobes play what role in a food chain and are specialized to absorb nutrients from nonliving organic matter, such as dead organisms or their wastes?']],
 'uris': None,
 'data': None}
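
The result above is a dict of parallel nested lists, with one inner list per query. For inspection it can be handy to flatten it into one row per hit. The helper below is a hypothetical convenience function, not part of Chroma, and the sample dict copies the values from the output above with the document text shortened.

```python
def flatten_results(results):
    """Turn a query result dict into a flat list with one row per hit."""
    rows = []
    for q_idx, ids in enumerate(results["ids"]):
        for rank, doc_id in enumerate(ids):
            rows.append({
                "query": q_idx,
                "rank": rank,
                "id": doc_id,
                "distance": results["distances"][q_idx][rank],
                "document": results["documents"][q_idx][rank],
            })
    return rows

# Values copied from the output above; document text is shortened here
sample = {
    "ids": [["0", "36"]],
    "distances": [[0.0, 0.8735413551330566]],
    "documents": [["What type of organism ...", "Fungus-like protist saprobes ..."]],
}

rows = flatten_results(sample)
for row in rows:
    print(row["rank"], row["id"], round(row["distance"], 3))
```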

In [None]:
idx = results["ids"][0]
idx = [int(i) for i in idx]
idx

[0, 36]

In [None]:
results["distances"][0]

[0.0, 0.8735413551330566]

In [None]:
ref = [[dataset["question"][i], dataset["support"][i]] for i in idx]
ref = pd.DataFrame(ref, columns=["question", "support"])
ref["distances"] = results["distances"][0]
ref

| | question | support | distances |
|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | Mesophiles grow best in moderate temperature, ... | 0.0 |
| 1 | Fungus-like protist saprobes play what role in... | Agents of Decomposition The fungus-like protis... | 0.873541 |


We query with the first three questions and display each one alongside its top retrieved document. Because the collection stores the questions themselves, each query retrieves its own question.

In [None]:
%%time

# Query with the first three questions; results["documents"] holds one inner list per query
results = collection.query(
    query_texts=dataset["question"][:3],
    n_results=1)

# Print each question and its top retrieved document
for i, q in enumerate(dataset['question'][:3]):
    print(f"Question: {q}")
    print(f"Retrieved document: {results['documents'][i][0]}")
    print()

Question: What type of organism is commonly used in preparation of foods such as cheese and yogurt?
Retrieved document: What type of organism is commonly used in preparation of foods such as cheese and yogurt?

Question: What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?
Retrieved document: What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?

Question: Changes from a less-ordered state to a more-ordered state (such as a liquid to a solid) are always what?
Retrieved document: Changes from a less-ordered state to a more-ordered state (such as a liquid to a solid) are always what?

CPU times: user 55.2 ms, sys: 269 µs, total: 55.5 ms
Wall time: 36.2 ms


## What's next?

Check out the Chroma documentation to [get started](https://docs.trychroma.com/getting-started) with building your own applications.

The core embedding-based retrieval functionality demonstrated here is at the heart of many powerful AI applications, such as using large language models with Chroma to [chat with your documents](https://github.com/chroma-core/chroma/tree/main/examples/chat_with_your_documents), and providing memory for agents like [BabyAGI](https://github.com/yoheinakajima/babyagi) and [Voyager](https://github.com/MineDojo/Voyager).

Chroma is already integrated with many popular AI application frameworks, including [LangChain](https://python.langchain.com/docs/integrations/vectorstores/chroma) and [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html).

Join our community to learn more and get help with your projects: [Discord](https://discord.gg/MMeYNTmh3x) | [Twitter](https://twitter.com/trychroma)

We are [hiring](https://trychroma.notion.site/careers-chroma-9d017c3007c7478ebd85bad854101497?pvs=4)!