# Basic embedding retrieval with Chroma

This notebook demonstrates the most basic use of Chroma to store and retrieve information using embeddings. This core building block is at the heart of many powerful AI applications.

## What are embeddings?

Embeddings are the AI-native way to represent any kind of data, which makes them a natural fit for AI-powered tools and algorithms. They can represent text and images, and soon audio and video.

To create an embedding, data is fed into an embedding model, which outputs a vector of numbers. The model is trained so that 'similar' data, e.g. text with similar meanings or images with similar content, produces vectors that are nearer to one another than the vectors of dissimilar data.
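As a minimal sketch of this property, the snippet below measures the distance between hand-written 3-dimensional vectors that stand in for real embeddings. The vectors, their names, and the `euclidean_distance` helper are invented for illustration and are not produced by any model.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "embeddings" with made-up values (assumption: a real model
# would output many more dimensions).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(euclidean_distance(cat, kitten))  # small: similar meaning
print(euclidean_distance(cat, car))     # large: unrelated meaning
```

The same comparison works for any vector length, which is why the similarity property carries over unchanged to real model output.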

## Embeddings and retrieval

We can use the similarity property of embeddings to search for and retrieve information. For example, we can find documents relevant to a particular topic, or images similar to a given image. Rather than searching for keywords or tags, we can search by finding data with similar semantic meaning.
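Retrieval by similarity can be sketched as a brute-force nearest-neighbour search: score every stored vector against the query vector and keep the closest few. The document IDs, vectors, and the `nearest` helper below are hypothetical; a vector store would typically use an index rather than scanning every vector.

```python
import math

def nearest(query_vec, stored, k=2):
    """Return the k (doc_id, distance) pairs closest to query_vec."""
    scored = [(doc_id, math.dist(query_vec, vec)) for doc_id, vec in stored.items()]
    # Smaller distance means a closer match, so sort ascending.
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Made-up vectors standing in for stored document embeddings.
stored = {
    "doc-cats": [0.9, 0.1, 0.0],
    "doc-dogs": [0.7, 0.3, 0.1],
    "doc-cars": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]

top = nearest(query, stored, k=2)
print(top)
```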


In [None]:
! pip install chromadb datasets


## Example Dataset

The description below refers to the [SciQ dataset](https://arxiv.org/abs/1707.06209), available from [HuggingFace](https://huggingface.co/datasets/sciq). Note, however, that the code in this notebook loads a different dataset, `eagle0504/youthless-homeless-shelter-web-scrape-dataset-qa-formatted`, which has 131 rows with two columns, `questions` and `answers`.

SciQ dataset description, from HuggingFace:

> The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

In this notebook, we will demonstrate how to retrieve the stored answer for a given question.


In [None]:
# Load a question-answer dataset from the HuggingFace Hub
from datasets import load_dataset
import pandas as pd

In [None]:
%%time

dataset = load_dataset("eagle0504/youthless-homeless-shelter-web-scrape-dataset-qa-formatted")

CPU times: user 913 ms, sys: 1.2 s, total: 2.12 s
Wall time: 4.52 s


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['questions', 'answers'],
        num_rows: 131
    })
})

## Loading the data into Chroma

Chroma comes with a built-in embedding model, which makes it simple to load text.
We can load the dataset's questions into Chroma with just a few lines of code.


In [None]:
# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
import chromadb

client = chromadb.Client()

In [None]:
# Create a new Chroma collection to store the questions. We don't need to specify an embedding function; the default will be used.
collection = client.create_collection("vectordb1")

### Investigate the Data Type

In [None]:
len(dataset["train"])

131

The heavy lifting happens here: curate the stored questions so that they overlap as much as possible with the questions users are likely to ask.

In [None]:
dataset["train"]['questions']

['What is the main focus of the Youth Spirit Artworks program in Berkeley, California?',
 'What challenges do older homeless and low-income youth from BIPOC and LGBTQIA+ communities face, and how does YSA address these challenges?',
 'What is the main mission of the Telegraph Avenue Homeless Youth Drop-In Center?',
 'What is the main goal of providing young people with skills, experience, and confidence?',
 'What were the main goals of Young Aspirations, Young Artists (YaYa) organization in New Orleans?',
 'What role did Sally play in the co-founding of Street Spirit?',
 'What is the name of the group that started it?',
 'Who were some of the important leaders in the Board?',
 'Who played a big role in getting funding for YSA and supporting its vision?',
 "How has YSA's use of art as a tool contributed to meeting life outcomes for young people who have experienced trauma and face financial challenges?",
 'What opportunities does this provide for individuals?',
 'Who is the secretary of

In [None]:
%%time

# Embed and store every question in the dataset (131 in total) for this demo
L = len(dataset["train"]['questions'])
collection.add(
    ids=[str(i) for i in range(0, L)],  # IDs are just strings
    documents=dataset["train"]['questions'], # Enter questions here
    metadatas=[{"type": "support"} for _ in range(0, L)],
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:03<00:00, 22.0MiB/s]


CPU times: user 16.2 s, sys: 684 ms, total: 16.9 s
Wall time: 7.95 s


In [None]:
collection.count()

131

## Querying the data

Once the data is loaded, we can use Chroma to find the stored questions closest to a user query, then look up their answers in the dataset by ID.
In this example, we retrieve the two most relevant results (`n_results=2`) according to embedding distance; a smaller distance means a closer match.

Chroma handles computing similarity and finding the most relevant results for you, so you can focus on building your application.


In [None]:
dataset["train"]['questions'][0]

'What is the main focus of the Youth Spirit Artworks program in Berkeley, California?'

In [None]:
user_query = "What is the main focus of the Youth Spirit Artworks program in Berkeley, California?"
user_query

'What is the main focus of the Youth Spirit Artworks program in Berkeley, California?'

In [None]:
results = collection.query(
    query_texts=user_query,
    n_results=2)

In [None]:
results

{'ids': [['0', '63']],
 'distances': [[0.0, 0.42253151535987854]],
 'metadatas': [[{'type': 'support'}, {'type': 'support'}]],
 'embeddings': None,
 'documents': [['What is the main focus of the Youth Spirit Artworks program in Berkeley, California?',
   "What is the goal of Youth Spirit Artworks' community organizing campaign in the East Bay Area?"]],
 'uris': None,
 'data': None}
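The result is a dict of parallel lists, nested one level per query text, so the hits for the first query sit at index `0` of each field. A small sketch of flattening that structure into one dict per hit follows; `rows_from_results` is a hypothetical helper, not part of Chroma, and the dict literal is copied from the output above with the documents shortened.

```python
def rows_from_results(results, query_index=0):
    """Zip the parallel lists for one query into a list of per-hit dicts."""
    return [
        {"id": id_, "distance": dist, "document": doc, "metadata": meta}
        for id_, dist, doc, meta in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["documents"][query_index],
            results["metadatas"][query_index],
        )
    ]

# Values taken from the query output above (documents abbreviated).
results = {
    "ids": [["0", "63"]],
    "distances": [[0.0, 0.42253151535987854]],
    "metadatas": [[{"type": "support"}, {"type": "support"}]],
    "documents": [["What is the main focus ...", "What is the goal ..."]],
}

for row in rows_from_results(results):
    print(row["id"], round(row["distance"], 3))
```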

In [None]:
idx = results["ids"][0]
idx = [int(i) for i in idx]
idx

[0, 63]

In [None]:
results["distances"][0]

[0.0, 0.42253151535987854]

In [None]:
[dataset["train"]['questions'][i] for i in idx]

['What is the main focus of the Youth Spirit Artworks program in Berkeley, California?',
 "What is the goal of Youth Spirit Artworks' community organizing campaign in the East Bay Area?"]

In [None]:
[dataset["train"]['answers'][i] for i in idx]

['About YSAYouth Spirit Artworks (YSA) is a program in Berkeley, California, that helps homeless andlow-income young people in the San Francisco Bay Area',
 'In response to the dire need for youth housing, Youth Spirit Artworks is engaged in aten-year community organizing campaign to create “100 Homes for 100 Homeless Youth” in theEast Bay Area']

In [None]:
ref = pd.DataFrame(
    {
        "idx": idx,
        "question": [dataset["train"]['questions'][i] for i in idx],
        "answers": [dataset["train"]['answers'][i] for i in idx],
        "distances": results["distances"][0]
    }
)
ref

| | idx | question | answers | distances |
|---|---|---|---|---|
| 0 | 0 | What is the main focus of the Youth Spirit Art... | About YSAYouth Spirit Artworks (YSA) is a prog... | 0.0 |
| 1 | 63 | What is the goal of Youth Spirit Artworks' com... | In response to the dire need for youth housing... | 0.422532 |


In [None]:
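# Distances below this threshold count as a close match.
# Note: the distances returned above are non-negative (0.0 and ~0.42), so a
# negative threshold never matches and the fallback below always returns every
# answer. Use a positive value (e.g. 0.3) to filter in practice.
special_threshold = -0.3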
filtered_ref = ref[ref["distances"] < special_threshold]
if filtered_ref.shape[0] > 0:
    ref_from_db_search = filtered_ref["answers"]
else:
    ref_from_db_search = ref["answers"]

ref_from_db_search

0    About YSAYouth Spirit Artworks (YSA) is a prog...
1    In response to the dire need for youth housing...
Name: answers, dtype: object

The output above shows the answers retrieved for the query question.
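The filter-with-fallback step can also be written without pandas. The sketch below assumes the same rule as the cell above: keep answers whose distance is below the threshold, and return every answer if none qualify. `select_answers` and the placeholder answer strings are hypothetical; the distances are the ones returned by the query.

```python
def select_answers(answers, distances, threshold):
    """Keep answers whose distance is below threshold; if none qualify, keep all."""
    kept = [a for a, d in zip(answers, distances) if d < threshold]
    return kept if kept else list(answers)

answers = ["answer for id 0", "answer for id 63"]  # placeholder strings
distances = [0.0, 0.42253151535987854]             # from the query output

# A positive threshold keeps only the close match.
print(select_answers(answers, distances, threshold=0.3))
# A negative threshold matches nothing, so the fallback returns everything.
print(select_answers(answers, distances, threshold=-0.3))
```

Because these distances are never negative, only a positive threshold actually narrows the results.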

## What's next?

Check out the Chroma documentation to [get started](https://docs.trychroma.com/getting-started) with building your own applications.

The core embeddings-based retrieval functionality demonstrated here is at the heart of many powerful AI applications, like using large language models with Chroma to [chat with your documents](https://github.com/chroma-core/chroma/tree/main/examples/chat_with_your_documents), as well as memory for agents like [BabyAGI](https://github.com/yoheinakajima/babyagi) and [Voyager](https://github.com/MineDojo/Voyager).

Chroma is already integrated with many popular AI application frameworks, including [LangChain](https://python.langchain.com/docs/integrations/vectorstores/chroma) and [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html).

Join our community to learn more and get help with your projects: [Discord](https://discord.gg/MMeYNTmh3x) | [Twitter](https://twitter.com/trychroma)

We are [hiring](https://trychroma.notion.site/careers-chroma-9d017c3007c7478ebd85bad854101497?pvs=4)!