# Chroma: Open-Source AI Application Vector Database

**Embeddings, vector search, document storage, full-text search, metadata filtering, and multi-modal. All in one place. Retrieval that just works. As it should be.**

Chroma gives you the tools to:

* store embeddings and their metadata
* embed documents and queries
* search embeddings

## Table of Contents

1. **Create Client**
2. **Create Collection**
3. **Add Documents to Collection**
4. **Query Collection | Vector Store**
	* 4.1 Search Relevant Documents
	* 4.2 Filter Documents using **"where"** & **"where_document"** Clause
		* 4.2.1 **"where"** Clause
		* 4.2.2 **"where_document"** Clause
	* 4.3 Change Distance Function to Find Relevant Docs
5. **Add More Documents to Collection**
6. **Update Records**
7. **Delete Records**
8. **How to Create Persistent Vector Store?**
9. **Work with Collections**
	* 9.1 Retrieve Existing Collection
	* 9.2 List Collections
	* 9.3 Delete Collection
10. **Custom Embedding Function**
11. **Chroma DB in Client Server Mode**

### Installation

* **pip install chromadb**


In [1]:
import chromadb

print(chromadb.__version__)

0.5.4


## 1. Create Client

In [2]:
chroma_client = chromadb.Client()

chroma_client

<chromadb.api.client.Client at 0x7fcd28d0c9a0>

## 2. Create Collection

* Collections are where you'll store your embeddings, documents, and any additional metadata



In [3]:
collection = chroma_client.create_collection(name="first_collection", 
                                             metadata={"title": "random documents", 
                                                       "description": "This store contains embeddings of random strings."},
                                             )

collection

Collection(id=d1ca2552-9d5e-4871-b47c-8efeeee67ae2, name=first_collection)



* **get_or_create** - If True, return the existing collection when one with this name already exists.
* **embedding_function** - Pass your own embedding model through this parameter of the **create_collection()** method.

In [4]:
collection.metadata

{'title': 'random documents',
 'description': 'This store contains embeddings of random strings.'}

In [5]:
collection.name

'first_collection'

In [6]:
collection.modify(name="in-memory-store", metadata={'description': 'This store contains random strings.',                                                    
                                                    'title': 'random documents'})

In [7]:
collection.name, collection.metadata

('in-memory-store',
 {'description': 'This store contains random strings.',
  'title': 'random documents'})

## 3. Add Documents to Collection

* By default, Chroma uses the **all-MiniLM-L6-v2** (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model to create embeddings of the input documents. The model archive is cached at **'~/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz'**.



In [8]:
docs = [
    "Claude 3.5 is latest Conversational AI Model from Anthropic.",
    "Gemini is latest Conversational AI Model from Google.",
    "Llama-3 is latest Conversational AI Model from Meta.",
    "Mixtral 8x7B is latest Conversational AI Model from Mistral AI.",
    "GPT-4o is latest Conversational AI Model from OpenAI."
    ]

collection.add(
    documents=docs,
    ids=["anthropic", "google", "meta", "mistral", "openai"],
    metadatas=[{"version": 3.5}, {"version": 1.5}, {"version": 3}, {"version": 1.0}, {"version": 4},]
)

* **embeddings**: The embeddings to add. If None, embeddings will be computed based on the documents or images using the **embedding_function** set for the Collection. Optional.
* **metadatas**: The metadata to associate with the embeddings. When querying, you can filter on this metadata. Optional.
* **documents**: The documents to associate with the embeddings. Optional.
* **images**: The images to associate with the embeddings. Optional.
* **uris**: The uris of the images to associate with the embeddings. Optional.
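The lists passed to **add()** are parallel: position *i* in `ids`, `documents`, and `metadatas` all describe the same record. Below is a minimal sketch of a helper that turns row-style records into those lists. The helper name `build_add_payload` and the record keys (`id`, `document`, `metadata`) are hypothetical and not part of Chroma.

```python
from typing import Any


def build_add_payload(records: list[dict[str, Any]]) -> dict[str, list]:
    """Split row-style records into the parallel lists that add() takes."""
    ids = [record["id"] for record in records]
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    return {
        "ids": ids,
        "documents": [record["document"] for record in records],
        "metadatas": [record["metadata"] for record in records],
    }


# Hypothetical row-style input; the keys are this sketch's own convention.
records = [
    {"id": "anthropic",
     "document": "Claude 3.5 is latest Conversational AI Model from Anthropic.",
     "metadata": {"version": 3.5}},
    {"id": "openai",
     "document": "GPT-4o is latest Conversational AI Model from OpenAI.",
     "metadata": {"version": 4}},
]

payload = build_add_payload(records)
print(payload["ids"])
# With a real collection, the payload could be unpacked as keyword arguments:
# collection.add(**payload)
```

Keeping each record together until the last moment makes it harder for the three lists to drift out of alignment.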

In [9]:
collection.count()

5

In [10]:
collection.get(ids=['anthropic', "openai"])

{'ids': ['anthropic', 'openai'],
 'embeddings': None,
 'metadatas': [{'version': 3.5}, {'version': 4}],
 'documents': ['Claude 3.5 is latest Conversational AI Model from Anthropic.',
  'GPT-4o is latest Conversational AI Model from OpenAI.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

* **include** - Which fields to include in the result. The default is ["documents", "metadatas"].

In [11]:
result = collection.get(ids=['anthropic', "openai"], include=["embeddings", "documents"])

result

{'ids': ['anthropic', 'openai'],
 'embeddings': [[-0.025551294907927513,
   -0.048511527478694916,
   -0.02144000306725502,
   0.004685196094214916,
   0.04506189003586769,
   -0.013431379571557045,
   0.04406303912401199,
   0.004853194113820791,
   0.021838100627064705,
   0.02490963228046894,
   -0.020790286362171173,
   -0.023848697543144226,
   -0.09575176984071732,
   0.0055699702352285385,
   0.051716193556785583,
   0.018941761925816536,
   0.007113051600754261,
   -0.024127962067723274,
   -0.06495177745819092,
   -0.03754911199212074,
   0.013603613711893559,
   0.031409505754709244,
   0.06299193948507309,
   0.014542656019330025,
   -0.02865356020629406,
   -0.03570043295621872,
   -0.004308402072638273,
   -0.06764957308769226,
   0.038303352892398834,
   -0.0026582202408462763,
   0.022728418931365013,
   0.07201245427131653,
   0.06977482885122299,
   -0.014278991147875786,
   -0.07476340979337692,
   0.021168643608689308,
   -0.03930078446865082,
   -0.0277602169662714,

In [12]:
len(result["embeddings"][0])

384

In [13]:
result = collection.peek(limit=3) ## Similar to the head() method of a pandas DataFrame.

result

{'ids': ['anthropic', 'google', 'meta'],
 'embeddings': [[-0.025551294907927513,
   -0.048511527478694916,
   -0.02144000306725502,
   0.004685196094214916,
   0.04506189003586769,
   -0.013431379571557045,
   0.04406303912401199,
   0.004853194113820791,
   0.021838100627064705,
   0.02490963228046894,
   -0.020790286362171173,
   -0.023848697543144226,
   -0.09575176984071732,
   0.0055699702352285385,
   0.051716193556785583,
   0.018941761925816536,
   0.007113051600754261,
   -0.024127962067723274,
   -0.06495177745819092,
   -0.03754911199212074,
   0.013603613711893559,
   0.031409505754709244,
   0.06299193948507309,
   0.014542656019330025,
   -0.02865356020629406,
   -0.03570043295621872,
   -0.004308402072638273,
   -0.06764957308769226,
   0.038303352892398834,
   -0.0026582202408462763,
   0.022728418931365013,
   0.07201245427131653,
   0.06977482885122299,
   -0.014278991147875786,
   -0.07476340979337692,
   0.021168643608689308,
   -0.03930078446865082,
   -0.027760216

## 4. Query Collection | Vector Store


### 4.1 Search Relevant Documents

* **query_texts** - Query based on text.
* **query_embeddings** - Query based on embeddings value.

The default distance metric is **l2** (squared Euclidean distance), so a lower score means the document is closer to the query.

In [14]:
from pprint import pprint

results = collection.query(
    query_texts=["Which is latest AI model from Anthropic?",], # Chroma will embed this for you using embedding_function.
    n_results=3
)

pprint(results)

{'data': None,
 'distances': [[0.5246765613555908, 0.922234833240509, 0.9563497304916382]],
 'documents': [['Claude 3.5 is latest Conversational AI Model from Anthropic.',
                'Mixtral 8x7B is latest Conversational AI Model from Mistral '
                'AI.',
                'Llama-3 is latest Conversational AI Model from Meta.']],
 'embeddings': None,
 'ids': [['anthropic', 'mistral', 'meta']],
 'included': ['metadatas', 'documents', 'distances'],
 'metadatas': [[{'version': 3.5}, {'version': 1.0}, {'version': 3}]],
 'uris': None}


In [15]:
results["documents"]

[['Claude 3.5 is latest Conversational AI Model from Anthropic.',
  'Mixtral 8x7B is latest Conversational AI Model from Mistral AI.',
  'Llama-3 is latest Conversational AI Model from Meta.']]
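Every field in the query result is a list of lists, with one inner list per query text. The sketch below flattens the result for a single query into `(id, document, distance)` rows. It works on a plain dictionary copied from the output above, so it runs without a Chroma collection; the helper name `flatten_query_result` is hypothetical.

```python
def flatten_query_result(result: dict, query_index: int = 0) -> list[tuple]:
    """Return (id, document, distance) rows for one query in the result."""
    ids = result["ids"][query_index]
    documents = result["documents"][query_index]
    distances = result["distances"][query_index]
    return list(zip(ids, documents, distances))


# Values copied from the query output shown above.
results = {
    "ids": [["anthropic", "mistral", "meta"]],
    "documents": [[
        "Claude 3.5 is latest Conversational AI Model from Anthropic.",
        "Mixtral 8x7B is latest Conversational AI Model from Mistral AI.",
        "Llama-3 is latest Conversational AI Model from Meta.",
    ]],
    "distances": [[0.5246765613555908, 0.922234833240509, 0.9563497304916382]],
}

rows = flatten_query_result(results)
for doc_id, document, distance in rows:
    print(f"{distance:.3f}  {doc_id:<10}  {document}")
```

When several strings are passed in `query_texts`, the same helper can be called once per `query_index`.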

* **include** - Which fields to include in the result. The default is ["metadatas", "documents", "distances"].

In [16]:
from pprint import pprint

results = collection.query(
    query_texts=["Which is latest AI model from Anthropic?"], # Chroma will embed this for you using embedding_function.
    n_results=3,
    include=["embeddings", "documents", "distances"]
)

pprint(results)

{'data': None,
 'distances': [[0.5246765613555908, 0.922234833240509, 0.9563497304916382]],
 'documents': [['Claude 3.5 is latest Conversational AI Model from Anthropic.',
                'Mixtral 8x7B is latest Conversational AI Model from Mistral '
                'AI.',
                'Llama-3 is latest Conversational AI Model from Meta.']],
 'embeddings': [[[-0.025551294907927513,
                  -0.048511527478694916,
                  -0.02144000306725502,
                  0.004685196094214916,
                  0.04506189003586769,
                  -0.013431379571557045,
                  0.04406303912401199,
                  0.004853194113820791,
                  0.021838100627064705,
                  0.02490963228046894,
                  -0.020790286362171173,
                  -0.023848697543144226,
                  -0.09575176984071732,
                  0.0055699702352285385,
                  0.051716193556785583,
                  0.018941761925816536,
         

### 4.2 Filter Documents using "where" & "where_document" Clause

Both clauses are supported in **query()**, **get()** and **delete()** methods of collection.

#### 4.2.1 "where" Clause

Filtering metadata supports the following operators:

* **\$eq** - equal to (string, int, float)
* **\$ne** - not equal to (string, int, float)
* **\$gt** - greater than (int, float)
* **\$gte** - greater than or equal to (int, float)
* **\$lt** - less than (int, float)
* **\$lte** - less than or equal to (int, float)
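To make the operator semantics concrete, here is a toy evaluator for the comparison operators listed above, applied to plain metadata dictionaries. It is only an illustration of what each operator means, not Chroma's implementation; the function name `matches` is hypothetical, and treating a missing field as a non-match is this sketch's own choice.

```python
import operator

# Map each filter operator to the equivalent Python comparison.
COMPARISONS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata: dict, where: dict) -> bool:
    """Return True if the metadata satisfies every condition in `where`."""
    for field, condition in where.items():
        if field not in metadata:
            return False
        # A bare value such as {"version": 3.5} is shorthand for $eq.
        if not isinstance(condition, dict):
            condition = {"$eq": condition}
        for op_name, expected in condition.items():
            if not COMPARISONS[op_name](metadata[field], expected):
                return False
    return True


# Same ids and versions as the records added to the collection earlier.
metadatas = {
    "anthropic": {"version": 3.5},
    "google": {"version": 1.5},
    "meta": {"version": 3},
    "mistral": {"version": 1.0},
    "openai": {"version": 4},
}

equal_ids = [i for i, m in metadatas.items() if matches(m, {"version": {"$eq": 3.5}})]
greater_ids = [i for i, m in metadatas.items() if matches(m, {"version": {"$gt": 3}})]
print(equal_ids, greater_ids)
```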

In [17]:
from pprint import pprint

results = collection.query(
    query_texts=["Which is latest AI model from Anthropic?"], # Chroma will embed this for you using embedding_function.
    n_results=3,
    where={"version": {"$eq": 3.5}}
)

pprint(results)

{'data': None,
 'distances': [[0.5246765613555908]],
 'documents': [['Claude 3.5 is latest Conversational AI Model from '
                'Anthropic.']],
 'embeddings': None,
 'ids': [['anthropic']],
 'included': ['metadatas', 'documents', 'distances'],
 'metadatas': [[{'version': 3.5}]],
 'uris': None}


In [18]:
from pprint import pprint

results = collection.query(
    query_texts=["Which is latest AI model from Anthropic?"], # Chroma will embed this for you using embedding_function.
    n_results=3,
    where={"version": {"$gt": 3}}
)

pprint(results)

{'data': None,
 'distances': [[0.5246765613555908, 1.0035216808319092]],
 'documents': [['Claude 3.5 is latest Conversational AI Model from Anthropic.',
                'GPT-4o is latest Conversational AI Model from OpenAI.']],
 'embeddings': None,
 'ids': [['anthropic', 'openai']],
 'included': ['metadatas', 'documents', 'distances'],
 'metadatas': [[{'version': 3.5}, {'version': 4}]],
 'uris': None}


The following inclusion operators are supported:

* **\$in** - a value is in predefined list (string, int, float, bool)
* **\$nin** - a value is not in predefined list (string, int, float, bool)

In [19]:
from pprint import pprint

results = collection.query(
    query_texts=["Which is latest AI model from Anthropic?"], # Chroma will embed this for you using embedding_function.
    n_results=3,
    where={"version": {"$in": [3, 4]}}
)

pprint(results)

{'data': None,
 'distances': [[0.9563497304916382, 1.0035216808319092]],
 'documents': [['Llama-3 is latest Conversational AI Model from Meta.',
                'GPT-4o is latest Conversational AI Model from OpenAI.']],
 'embeddings': None,
 'ids': [['meta', 'openai']],
 'included': ['metadatas', 'documents', 'distances'],
 'metadatas': [[{'version': 3}, {'version': 4}]],
 'uris': None}


In [20]:
from pprint import pprint

results = collection.query(
    query_texts=["Which is latest AI model from Anthropic?"], # Chroma will embed this for you using embedding_function.
    n_results=3,
    where={"version": {"$nin": [4, ]}} ## Claude 3.5 was not included because its version is a float.
)

pprint(results)

{'data': None,
 'distances': [[0.9563497304916382]],
 'documents': [['Llama-3 is latest Conversational AI Model from Meta.']],
 'embeddings': None,
 'ids': [['meta']],
 'included': ['metadatas', 'documents', 'distances'],
 'metadatas': [[{'version': 3}]],
 'uris': None}


#### 4.2.2 "where_document" Clause

* **\$contains** - Contains (string)
* **\$not_contains** - Does not contain (string)

In [21]:
from pprint import pprint

results = collection.query(
    query_texts=["Which is latest AI model from Anthropic?"], # Chroma will embed this for you using embedding_function.
    n_results=3,
    where_document={"$contains": "AI"} ## Case Sensitive
)

pprint(results)

{'data': None,
 'distances': [[0.5246765613555908, 0.922234833240509, 0.9563497304916382]],
 'documents': [['Claude 3.5 is latest Conversational AI Model from Anthropic.',
                'Mixtral 8x7B is latest Conversational AI Model from Mistral '
                'AI.',
                'Llama-3 is latest Conversational AI Model from Meta.']],
 'embeddings': None,
 'ids': [['anthropic', 'mistral', 'meta']],
 'included': ['metadatas', 'documents', 'distances'],
 'metadatas': [[{'version': 3.5}, {'version': 1.0}, {'version': 3}]],
 'uris': None}


In [22]:
from pprint import pprint

results = collection.query(
    query_texts=["Which is latest AI model from Anthropic?"], # Chroma will embed this for you using embedding_function.
    n_results=3,
    where_document={"$not_contains": "Llama"} ## Case Sensitive
)

pprint(results)

{'data': None,
 'distances': [[0.5246765613555908, 0.922234833240509, 0.9956796169281006]],
 'documents': [['Claude 3.5 is latest Conversational AI Model from Anthropic.',
                'Mixtral 8x7B is latest Conversational AI Model from Mistral '
                'AI.',
                'Gemini is latest Conversational AI Model from Google.']],
 'embeddings': None,
 'ids': [['anthropic', 'mistral', 'google']],
 'included': ['metadatas', 'documents', 'distances'],
 'metadatas': [[{'version': 3.5}, {'version': 1.0}, {'version': 1.5}]],
 'uris': None}


**NOTE:** There is one major difference between the **"where"** and **"where_document"** clauses. The "where" clause is applied to documents after they have been retrieved using the query embeddings, so the number of results returned may be less than **n_results**. The "where_document" clause is applied before the query embeddings are matched: documents are filtered first, and the search then runs only on the documents that remain.

### 4.3 Change Distance Function to Find Relevant Docs

* The default distance function is **"l2"**, which is the **squared L2 norm**.

<img src="distance_function.jpg"/>

```Python
collection = chroma_client.create_collection(
    name="first_collection", 
    metadata={
        "title": "random documents",
        "description": "This store contains embeddings of random strings.",
        "hnsw:space": "cosine", ## For Distance function.
        },
)
```
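The three values accepted by `hnsw:space` are `"l2"`, `"ip"`, and `"cosine"`. The sketch below computes each distance in plain Python, using the formulas as I understand them from Chroma's documentation (squared L2, one minus the inner product, and one minus the cosine similarity). Treat it as an illustration of the definitions rather than the code Chroma runs; the function names are hypothetical.

```python
import math


def squared_l2(a: list[float], b: list[float]) -> float:
    """'l2': sum of squared differences (no square root is taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: list[float], b: list[float]) -> float:
    """'ip': one minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """'cosine': one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
orthogonal = [0.0, 1.0]
same_direction = [2.0, 0.0]

for name, vector in [("orthogonal", orthogonal), ("same_direction", same_direction)]:
    print(name,
          squared_l2(query, vector),
          inner_product_distance(query, vector),
          cosine_distance(query, vector))
```

Cosine distance ignores vector length, so `same_direction` is at distance zero from the query even though its squared L2 distance is not zero. For all three metrics, a smaller value means a closer match.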

## 5. Add More Documents to Collection

In [23]:
collection.add(
    documents=["Command-R-Plus is latest Conversational AI model from Cohere Inc."],
    ids=["cohere"],
    metadatas=[{"version": 1.5}]
)

In [24]:
collection.count()

6

In [25]:
collection.get(ids=["cohere"])

{'ids': ['cohere'],
 'embeddings': None,
 'metadatas': [{'version': 1.5}],
 'documents': ['Command-R-Plus is latest Conversational AI model from Cohere Inc.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

## 6. Update Records

* Update the embeddings, metadatas, or documents for the provided ids.

In [26]:
collection.update(
    ids=["anthropic"],
    documents=["Claude 3.5-Sonnet is latest Conversational AI Model from Anthropic Inc."]
)

In [27]:
collection.get(ids=["anthropic"])

{'ids': ['anthropic'],
 'embeddings': None,
 'metadatas': [{'version': 3.5}],
 'documents': ['Claude 3.5-Sonnet is latest Conversational AI Model from Anthropic Inc.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [28]:
collection.update(
    ids=["microsoft"],
    documents=["Phi-3 is latest Open-Source Conversational AI Model from Microsoft."]
)

Update of nonexisting embedding ID: microsoft
Update of nonexisting embedding ID: microsoft


In [29]:
collection.get(ids=["microsoft"])

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

* **upsert()** - Updates the record if the id already exists; otherwise adds a new record.

In [30]:
collection.upsert(
    ids=["microsoft"],
    documents=["Phi-3 is latest Open-Source Conversational AI Model from Microsoft."],
    metadatas=[{"version": 3}]
)

In [31]:
collection.count()

7

In [32]:
collection.get(ids=["microsoft"])

{'ids': ['microsoft'],
 'embeddings': None,
 'metadatas': [{'version': 3}],
 'documents': ['Phi-3 is latest Open-Source Conversational AI Model from Microsoft.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}
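The difference between **update()** and **upsert()** seen in this section can be modelled with an ordinary dictionary. This is a toy model of the behaviour observed in the cells above (an update for an unknown id changes nothing, an upsert inserts it), not how Chroma stores records; both functions are hypothetical stand-ins.

```python
def update(store: dict, record_id: str, document: str) -> bool:
    """Overwrite an existing record; leave the store untouched if the id is unknown."""
    if record_id not in store:
        return False
    store[record_id] = document
    return True


def upsert(store: dict, record_id: str, document: str) -> None:
    """Overwrite the record if it exists, otherwise insert it."""
    store[record_id] = document


store = {"anthropic": "Claude 3.5 is latest Conversational AI Model from Anthropic."}
phi3 = "Phi-3 is latest Open-Source Conversational AI Model from Microsoft."

updated = update(store, "microsoft", phi3)   # unknown id: nothing is written
count_after_update = len(store)

upsert(store, "microsoft", phi3)             # unknown id: record is inserted
count_after_upsert = len(store)

print(updated, count_after_update, count_after_upsert)
```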

## 7. Delete Records

In [33]:
collection.delete(ids=["microsoft"])

In [34]:
collection.count()

6

In [35]:
collection.get(ids=["microsoft"])

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}


## 8. How to Create Persistent Vector Store?

### 8.1 Persistent Collection using "PersistentClient" Object

In [36]:
chroma_client = chromadb.PersistentClient(path="./conversational_ai")

chroma_client

<chromadb.api.client.Client at 0x7fcd289112a0>

In [37]:
!ls -l | grep conversational

drwxrwxr-x 2 sunny sunny   4096 Sep 13 07:49 conversational_ai




In [38]:
persistent_collection = chroma_client.create_collection(
                                             name="persistent_collection", 
                                             metadata={"title": "random documents", 
                                                       "description": "This store contains embeddings of random strings."},
                                             get_or_create=True)

persistent_collection.add(
    documents=docs,
    ids=["anthropic", "google", "meta", "mistral", "openai"],
    metadatas=[{"version": 3.5}, {"version": 1.5}, {"version": 3}, {"version": 1.0}, {"version": 4},]
)

persistent_collection.count()

5

In [39]:
collection1 = chroma_client.create_collection(name="persistent_collection", 
                                              metadata={"title": "random documents", 
                                                       "description": "This store contains embeddings of random strings."},
                                              get_or_create=True)

collection1.count()

5

### 8.2 Persistent Collection using "Client" Object

In [40]:
client_settings =  chromadb.Settings(persist_directory="./conversational_ai", is_persistent=True)

chroma_client = chromadb.Client(settings=client_settings)

collection2 = chroma_client.create_collection(name="persistent_collection", 
                                              metadata={"title": "random documents", 
                                                       "description": "This store contains embeddings of random strings."},
                                              get_or_create=True)


collection2.count()

5

## 9. Work with Collections

### 9.1 Retrieve Existing Collection

In [41]:
collection = chroma_client.get_collection("persistent_collection") ## Pass the same embedding_function that was used when the collection was created.

collection.count()

5

In [42]:
collection = chroma_client.get_or_create_collection("persistent_collection")

collection.count()

5

In [43]:
collection = chroma_client.get_or_create_collection("persistent_collection2")

collection.count()

0

### 9.2 List Collections

In [44]:
chroma_client.list_collections()

[Collection(id=8dffe559-f9f6-4b99-9059-534148aa3bb6, name=persistent_collection),
 Collection(id=c64c16f0-01d7-4269-b833-dee09a2b861e, name=persistent_collection2)]

In [45]:
chroma_client.count_collections()

2

### 9.3 Delete Collection

In [46]:
chroma_client.delete_collection(name="persistent_collection2")

In [47]:
chroma_client.list_collections()

[Collection(id=8dffe559-f9f6-4b99-9059-534148aa3bb6, name=persistent_collection)]

In [48]:
chroma_client.count_collections()

1

## 10. Custom Embedding Function

In [49]:
chromadb.utils.embedding_functions.DefaultEmbeddingFunction()

<chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2.ONNXMiniLM_L6_V2 at 0x7fcd2896b7c0>

* all-MiniLM-L6-v2 - https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 
    * Context Window Tokens - 256, Embeddings Length - 384
    
All models are available at this link - [Sentence Transformers Embeddings Models](https://sbert.net/docs/sentence_transformer/pretrained_models.html#original-models)

In [52]:
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction(model_name="all-distilroberta-v1")

embedding_function

<chromadb.utils.embedding_functions.sentence_transformer_embedding_function.SentenceTransformerEmbeddingFunction at 0x7fcc395e4f10>

In [53]:
embeddings = embedding_function.embed_with_retries(docs)

len(embeddings), len(embeddings[0])

(5, 768)

In [54]:
chroma_client = chromadb.PersistentClient(path="./conversational_ai")

collection = chroma_client.create_collection(name="persistent_collection_distilroberta", 
                                             metadata={"title": "random documents", 
                                                       "description": "This store contains embeddings of random strings."},
                                             embedding_function=embedding_function,                                             
                                             get_or_create=True)

collection.add(
    documents=docs,
    ids=["anthropic", "google", "meta", "mistral", "openai"],
    metadatas=[{"version": 3.5}, {"version": 1.5}, {"version": 3}, {"version": 1.0}, {"version": 4},]
)

collection.count()

5

In [55]:
chroma_client.list_collections()

[Collection(id=70d5ecb4-103a-412c-b4c1-e56c569c2036, name=persistent_collection_distilroberta),
 Collection(id=8dffe559-f9f6-4b99-9059-534148aa3bb6, name=persistent_collection)]

In [56]:
result = collection.get(ids=['anthropic', "openai"], include=["embeddings", "documents"])

len(result["embeddings"][0])

768

**chromadb.utils.embedding_functions**

* AmazonBedrockEmbeddingFunction
* CohereEmbeddingFunction
* GoogleGenerativeAiEmbeddingFunction
* GooglePalmEmbeddingFunction
* GoogleVertexEmbeddingFunction
* HuggingFaceEmbeddingFunction
* JinaEmbeddingFunction
* InstructorEmbeddingFunction
* OllamaEmbeddingFunction
* ONNXMiniLM_L6_V2
* OpenAIEmbeddingFunction
* OpenCLIPEmbeddingFunction
* RoboflowEmbeddingFunction
* SentenceTransformerEmbeddingFunction
* Text2VecEmbeddingFunction
* create_langchain_embedding
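If none of the built-in classes fits, an embedding function can also be written by hand. The sketch below is a toy, dependency-free embedder that hashes tokens into a fixed number of buckets; it exists only to show the call shape (a list of strings in, a list of float vectors out) and produces embeddings with no semantic meaning. The class name `HashingEmbeddingFunction` is hypothetical. In real use one would normally subclass `chromadb.EmbeddingFunction`; passing this plain class to **create_collection()** unchanged is an assumption that has not been verified here.

```python
import hashlib
import math


class HashingEmbeddingFunction:
    """Toy embedder: counts tokens per hash bucket, then L2-normalises."""

    def __init__(self, dimensions: int = 16):
        self.dimensions = dimensions

    def __call__(self, input: list[str]) -> list[list[float]]:
        # One vector per input document, in the same order.
        return [self._embed(text) for text in input]

    def _embed(self, text: str) -> list[float]:
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[bucket] += 1.0
        norm = math.sqrt(sum(value * value for value in vector))
        # An empty document has no tokens; return the zero vector unchanged.
        return [value / norm for value in vector] if norm else vector


embedder = HashingEmbeddingFunction(dimensions=16)
vectors = embedder([
    "Claude 3.5 is latest Conversational AI Model from Anthropic.",
    "GPT-4o is latest Conversational AI Model from OpenAI.",
])
print(len(vectors), len(vectors[0]))
```

Whatever function is used, the same one has to be supplied again when the collection is retrieved later, because the stored vectors and the query vectors must come from the same model.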

## 11. Chroma DB in Client Server Mode

Chroma can also be configured to run in client/server mode. In this mode, the Chroma client connects to a Chroma server running in a separate process.

To start the Chroma server, run the following command:

* **chroma run --path /db_path** 

In [61]:
#chroma run --path ./conversational_ai

In [69]:
chroma_client = chromadb.HttpClient(host='localhost', port=8000, ssl=False, headers=None)

In [70]:
collection = chroma_client.get_collection("persistent_collection")

collection.count()

5

In [71]:
result = collection.get(ids=['anthropic', "openai"], include=["embeddings", "documents"])

len(result["embeddings"][0])

384

In [72]:
collection = chroma_client.get_collection("persistent_collection_distilroberta")

collection.count()

5

In [73]:
result = collection.get(ids=['anthropic', "openai"], include=["embeddings", "documents"])

len(result["embeddings"][0])

768

## Summary

In this notebook, I tried to cover the majority of the functionality available through the **chromadb** vector store. If you have any questions or feedback, please feel free to share them in the comments section.