# Chroma Vector Store Quick Reference (LangChain)

## Introduction

Chroma is a vector database for storing and querying embeddings. This quick reference walks through the `langchain_chroma` integration: persisting and loading a database, adding, retrieving, updating, and deleting documents, running similarity and MMR searches, using the store as a retriever, and building a store with the `from_documents` and `from_texts` class methods. The examples embed text with OpenAI's `text-embedding-3-small` model and read the API key from Kaggle secrets.



In [1]:
!pip install -qU langchain-openai
!pip install -qU langchain_community
!pip install -qU langchain_experimental
!pip install -qU "langchain-chroma>=0.1.2"
!pip install -qU chromadb

---

## **1. Database Persistence and Loading**

### **1.1 Saving the Chroma Database**
Pass `persist_directory` when constructing the store to save the Chroma database to disk. Documents added to the store are then written to that directory.

In [3]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from kaggle_secrets import UserSecretsClient

# Fetch API key securely
user_secrets = UserSecretsClient()
my_api_key = user_secrets.get_secret("api-key-openai")

# Initialize OpenAI embeddings
embed = OpenAIEmbeddings(model="text-embedding-3-small", api_key=my_api_key)

# Initialize Chroma with persist_directory
vector_store = Chroma(
    collection_name="my_collection",
    embedding_function=embed,
    persist_directory="./chroma_db"  # Data will be saved here
)

# Create documents
documents = [
    Document(page_content="The quick brown fox jumps over the lazy dog.", metadata={"source": "fable"}),
    Document(page_content="Artificial intelligence is transforming the world.", metadata={"source": "tech"}),
]

# Add documents to the vector store
vector_store.add_documents(documents=documents, ids=["doc1", "doc2"])


### **1.2 Loading the Chroma Database**
Load a previously saved Chroma database from disk.

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from kaggle_secrets import UserSecretsClient

# Fetch API key securely
user_secrets = UserSecretsClient()
my_api_key = user_secrets.get_secret("api-key-openai")

# Initialize OpenAI embeddings
embed = OpenAIEmbeddings(model="text-embedding-3-small", api_key=my_api_key)

# Load the Chroma database from the persist_directory
vector_store = Chroma(
    collection_name="my_collection",
    embedding_function=embed,
    persist_directory="./chroma_db"  # Same directory used to save the database
)

# Perform a similarity search to verify loading
results = vector_store.similarity_search(query="AI", k=2)
for doc in results:
    print(f"Loaded Document: {doc.page_content}")

### **1.3 Checking if a Collection Exists**
Check if a collection exists before loading it.

In [None]:
# Access the internal Chroma client used by langchain_chroma
chroma_client = vector_store._client

# Check if the collection exists
try:
    collection = chroma_client.get_collection(name="my_collection")
    print("Collection exists and is loaded.")
except Exception as e:
    print(f"Collection does not exist or could not be loaded: {e}")

### **1.4 Deleting a Persisted Collection**
Delete the persisted data by removing its directory. This removes every collection stored under that path, not only `my_collection`.

In [None]:
import shutil

# Delete the persisted collection directory
shutil.rmtree("./chroma_db")
print("Persisted collection directory deleted.")
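
`shutil.rmtree` raises `FileNotFoundError` when the directory is already gone, which is easy to hit when a cleanup cell is re-run. A minimal sketch of a guarded version follows. The helper name `remove_persist_dir` is hypothetical, and the demo runs on a temporary directory standing in for `./chroma_db`.

```python
import shutil
import tempfile
from pathlib import Path


def remove_persist_dir(path):
    """Delete a persist directory if it exists; return True when something was removed."""
    target = Path(path)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


# Demo on a throwaway directory standing in for "./chroma_db"
demo_dir = Path(tempfile.mkdtemp()) / "chroma_db"
demo_dir.mkdir()
(demo_dir / "placeholder.bin").write_bytes(b"")  # stand-in for database files

first_call = remove_persist_dir(demo_dir)   # directory exists, so it is removed
second_call = remove_persist_dir(demo_dir)  # already gone, so nothing happens
print(first_call, second_call)
```

Returning a boolean instead of raising keeps the cell safe to re-run.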

---

## **2. Document Operations**

### **2.1 Adding Documents with `add_documents()`**
Add documents to the Chroma vector store.

In [None]:
import os
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Use /kaggle/working/ for persist_directory
persist_directory = "/kaggle/working/chroma_db"

# Ensure the persist_directory exists
os.makedirs(persist_directory, exist_ok=True)

# Initialize OpenAI embeddings
embed = OpenAIEmbeddings(model="text-embedding-3-small", api_key=my_api_key)

# Initialize Chroma with persist_directory
vector_store = Chroma(
    collection_name="my_collection",
    embedding_function=embed,
    persist_directory=persist_directory
)

# Create documents
documents = [
    Document(page_content="The quick brown fox jumps over the lazy dog.", metadata={"source": "fable"}),
    Document(page_content="Artificial intelligence is transforming the world.", metadata={"source": "tech"}),
]

# Add documents to the vector store
try:
    ids = ["doc1", "doc2"]
    added_ids = vector_store.add_documents(documents=documents, ids=ids)
    if added_ids == ids:
        print("Documents added successfully.")
    else:
        print("Failed to add documents. Returned IDs do not match.")
except Exception as e:
    print(f"Error adding documents: {e}")

### **2.2 Adding Texts with `add_texts()`**
This example demonstrates how to add textual data to the Chroma vector store using the `add_texts` method. The provided texts are embedded and stored along with optional metadata and IDs.

In [None]:
# Example of using add_texts
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world."
]

metadatas = [
    {"source": "fable"},
    {"source": "tech"}
]

ids = ["text1", "text2"]

try:
    added_ids = vector_store.add_texts(texts=texts, metadatas=metadatas, ids=ids)
    if added_ids == ids:
        print("Texts added successfully.")
    else:
        print("Failed to add texts. Returned IDs do not match.")
except Exception as e:
    print(f"Error adding texts: {e}")
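
Because `texts`, `metadatas`, and `ids` are matched by position, a misaligned list pairs the wrong metadata or ID with a text. The sketch below checks a batch before it is sent to the store. `check_batch` is a hypothetical helper, not part of `langchain_chroma`, and it runs without a vector store.

```python
def check_batch(texts, metadatas=None, ids=None):
    """Raise ValueError if the parallel lists are misaligned or IDs repeat."""
    if metadatas is not None and len(metadatas) != len(texts):
        raise ValueError(f"{len(texts)} texts but {len(metadatas)} metadatas")
    if ids is not None:
        if len(ids) != len(texts):
            raise ValueError(f"{len(texts)} texts but {len(ids)} ids")
        if len(set(ids)) != len(ids):
            raise ValueError("ids must be unique within a batch")


texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
]
metadatas = [{"source": "fable"}, {"source": "tech"}]

# A well-formed batch passes silently
check_batch(texts, metadatas, ids=["text1", "text2"])

# A repeated ID is reported before anything reaches the store
try:
    check_batch(texts, metadatas, ids=["text1", "text1"])
except ValueError as err:
    problem = str(err)
    print(problem)
```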

### **2.3 Retrieving Documents Using `get` and `get_by_ids`**
This example demonstrates how to retrieve documents from a Chroma vector store using the `get` and `get_by_ids` functions. The `get` function allows filtering by metadata, limiting results, and pagination, while `get_by_ids` retrieves specific documents by their IDs.

In [None]:
# Using `get` function
print("\nUsing `get` function:")
results = vector_store.get(
    ids=["doc1", "text1"],      # Retrieve specific documents by their IDs
    where={"source": "fable"},  # Filter by metadata
    limit=5,                    # Limit the number of results
    offset=0                    # Skip the first N results
)

# Print the results
print("Retrieved Documents:")
for doc_id, document in zip(results["ids"], results["documents"]):
    print(f"ID: {doc_id}, Content: {document}")
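
The `limit` and `offset` arguments can be combined to walk through a large collection one page at a time. The following is a sketch of that loop. `iter_pages` and `fake_get` are hypothetical names, and `fake_get` is an in-memory stand-in for `vector_store.get`, so the example runs without a database or API key.

```python
def iter_pages(get_page, page_size):
    """Yield successive result pages until an empty page comes back."""
    offset = 0
    while True:
        page = get_page(limit=page_size, offset=offset)
        if not page["ids"]:
            break
        yield page
        offset += len(page["ids"])


# Stand-in for vector_store.get: serves slices of an in-memory list
_fake_ids = ["doc1", "doc2", "text1", "text2", "text3"]


def fake_get(limit, offset):
    chunk = _fake_ids[offset:offset + limit]
    return {"ids": chunk, "documents": [f"content of {i}" for i in chunk]}


pages = [page["ids"] for page in iter_pages(fake_get, page_size=2)]
print(pages)
```

With a real store, a wrapper such as `lambda limit, offset: vector_store.get(limit=limit, offset=offset)` could be passed as `get_page`.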

In [None]:
# Using `get_by_ids` function
print("\nUsing `get_by_ids` function:")
document_ids = ["doc2", "text2"]
results = vector_store.get_by_ids(document_ids)

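# Print the results. Read the ID from each returned Document rather than
# zipping with the request list: IDs that are not found are skipped, and the
# return order is not guaranteed to match the request order.
print("Retrieved Documents by IDs:")
for doc in results:
    print(f"ID: {doc.id}, Content: {doc.page_content}")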

### **2.4 Updating Documents**
Update an existing document in the vector store.

In [None]:
# Update a document
updated_document = Document(
    page_content="AI is revolutionizing industries.",
    metadata={"source": "tech"}
)

try:
    vector_store.update_documents(ids=["doc2"], documents=[updated_document])
    updated_doc = vector_store.get(ids=["doc2"])["documents"][0]
    if updated_doc == updated_document.page_content:
        print("Document updated successfully.")
    else:
        print("Failed to update document. Content does not match.")
except Exception as e:
    print(f"Error updating document: {e}")

### **2.5 Deleting Documents**
Delete documents by their IDs.

In [None]:
# Delete a document
try:
    vector_store.delete(ids=["doc1"])
    deleted_doc = vector_store.get(ids=["doc1"])
    if not deleted_doc["documents"]:
        print("Document deleted successfully.")
    else:
        print("Failed to delete document. Document still exists.")
except Exception as e:
    print(f"Error deleting document: {e}")

---

## **3. Search Operations**

1. **Similarity Search**:
   - Ideal for retrieving the most relevant documents based on semantic similarity.
   - Use this when you want straightforward, top-k results.

2. **Similarity Search with Scores**:
   - Provides additional insight into how closely each document matches the query.
   - Useful for ranking or filtering results based on similarity thresholds.

3. **Maximal Marginal Relevance (MMR)**:
   - Balances relevance and diversity in search results.
   - Use this when you want to avoid redundant or overly similar documents.

### **3.1 Similarity Search**
Search for documents similar to a query.

The `similarity_search` method retrieves documents from the vector store that are most similar to the given query. This is useful for finding relevant information based on semantic similarity.

#### **Parameters**:
- `query` (str): The input query to search for.
- `k` (int): The number of documents to return. Defaults to 4.

In [None]:
# Perform a similarity search
query = "What is AI?"
results = vector_store.similarity_search(query, k=2)

# Print results
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")

### **3.2 Similarity Search with Scores**
Search for documents and retrieve similarity scores.

The `similarity_search_with_score` method returns each document together with a score. The score is a distance between the query embedding and the document embedding, so a lower score means a closer match.

#### **Parameters**:
- `query` (str): The input query to search for.
- `k` (int): The number of documents to return. Defaults to 4.

In [None]:
# Perform a similarity search with scores
results = vector_store.similarity_search_with_score(query="AI", k=2)

for doc, score in results:
    print(f"Score: {score}, Content: {doc.page_content}")
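
To make the "lower is better" convention concrete, here is a small sketch that scores two made-up vectors against a query and sorts them ascending. It assumes the score is a squared Euclidean (L2) distance; the metric actually in use depends on how the collection is configured, and the vectors are illustrative only.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query_vec = [1.0, 0.0]
doc_vecs = {
    "close": [0.9, 0.1],  # nearly the same direction as the query
    "far": [0.0, 1.0],    # orthogonal to the query
}

# Sort ascending: the smallest distance is the best match
scored = sorted(
    ((name, squared_l2(query_vec, vec)) for name, vec in doc_vecs.items()),
    key=lambda pair: pair[1],
)
for name, score in scored:
    print(f"{name}: {score:.2f}")
```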

### **3.3 Maximal Marginal Relevance (MMR) Search**
Use MMR to balance similarity and diversity in search results.

The `max_marginal_relevance_search` method optimizes for both similarity to the query and diversity among the selected documents. This is useful when you want to avoid redundant results and ensure a variety of relevant documents.

#### **Parameters**:
- `query` (str): The input query to search for.
- `k` (int): The number of documents to return. Defaults to 4.
- `fetch_k` (int): The number of documents to fetch before applying MMR. Defaults to 20.
- `lambda_mult` (float): A value between 0 and 1 that determines the trade-off between similarity and diversity. Higher values favor similarity, while lower values favor diversity. Defaults to 0.5.

In [None]:
# Perform MMR search
results = vector_store.max_marginal_relevance_search(
    query="AI",
    k=3,
    fetch_k=10,
    lambda_mult=0.5  # Higher values favor similarity, lower values favor diversity
)

for doc in results:
    print(f"MMR Result: {doc.page_content}")
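
The effect of `lambda_mult` is easier to see on a toy example. The sketch below implements the greedy MMR selection rule in plain Python on made-up 2-D vectors. It illustrates the idea and is not LangChain's implementation; `mmr_select` and `cosine` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidates, k=4, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in remaining:
            relevance = cosine(query_vec, candidates[idx])
            # Penalize similarity to anything already selected
            redundancy = max(
                (cosine(candidates[idx], candidates[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query_vec = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # 0: very close to the query
    [1.0, 0.12],  # 1: near-duplicate of 0
    [0.6, 0.8],   # 2: less relevant but different
]

relevance_only = mmr_select(query_vec, candidates, k=2, lambda_mult=1.0)
diversity_weighted = mmr_select(query_vec, candidates, k=2, lambda_mult=0.3)
print(relevance_only, diversity_weighted)
```

With `lambda_mult=1.0` the near-duplicate is picked second because only relevance counts; with `lambda_mult=0.3` the redundancy penalty pushes the selection toward the different vector. In the real method, `candidates` corresponds to the `fetch_k` documents retrieved before re-ranking.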

---

## **4. Store Retriever**

### **4.1 Using Chroma as a Retriever**
Convert the vector store into a retriever for use in LangChain pipelines.

In [None]:
# Create a retriever
retriever = vector_store.as_retriever()

# Use the retriever
query = "What is AI?"
docs = retriever.invoke(query)
for doc in docs:
    print(f"Retrieved Document: {doc.page_content}")

### **4.2 Retrieve More Documents with Higher Diversity (MMR)**
Use the Maximal Marginal Relevance (MMR) algorithm to retrieve documents with a balance of relevance and diversity.

In [None]:
# Create a retriever with MMR
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "lambda_mult": 0.5}
)

# Use the retriever
query = "What is AI?"
docs = retriever.invoke(query)
for doc in docs:
    print(f"Retrieved Document: {doc.page_content}")

### **4.3 Fetch More Documents for MMR but Return Only Top 5**
Fetch a larger pool of documents for MMR to consider but return only the top 5 most relevant and diverse documents.

In [None]:
# Create a retriever with MMR and a larger fetch pool
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 50}
)

# Use the retriever
query = "What is AI?"
docs = retriever.invoke(query)
for doc in docs:
    print(f"Retrieved Document: {doc.page_content}")

### **4.4 Retrieve Documents with a Relevance Score Threshold**
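Keep only documents whose relevance score clears a threshold. The example min-max normalizes the scores within the returned results, so the 0.8 threshold is relative to the best and worst hits for this query rather than an absolute relevance level.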

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Normalize scores to [0, 1]
def normalize_scores(docs_with_scores):
    scores = [score for _, score in docs_with_scores]
    scaler = MinMaxScaler(feature_range=(0, 1))
    normalized_scores = scaler.fit_transform([[score] for score in scores]).flatten()
    return [(doc, score) for (doc, _), score in zip(docs_with_scores, normalized_scores)]

# Fetch documents with relevance scores
query = "What is the color of the sky?"
docs_with_scores = vector_store.similarity_search_with_relevance_scores(query)

# Normalize the scores
normalized_docs_with_scores = normalize_scores(docs_with_scores)

# Filter documents based on the normalized score threshold
score_threshold = 0.8
filtered_docs = [doc for doc, score in normalized_docs_with_scores if score >= score_threshold]

# Print the filtered documents
for doc in filtered_docs:
    print(f"Retrieved Document: {doc.page_content}")
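
One consequence of min-max normalization is worth seeing with numbers: the best hit always maps to 1.0, even when its raw score is low. The sketch below reproduces the normalization in plain Python on made-up scores. `min_max` is a hypothetical helper, and mapping a constant list to all 1.0 is a choice made for this sketch.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all 1.0 (sketch convention)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Made-up relevance scores for four hits, none of them strong
raw = [0.42, 0.40, 0.15, 0.05]
normalized = min_max(raw)

score_threshold = 0.8
kept = [r for r, n in zip(raw, normalized) if n >= score_threshold]
print(kept)
```

Both of the top hits pass the 0.8 threshold despite raw scores near 0.4, so a relative threshold like this ranks results within a query but does not guarantee that any of them is a good match.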

### **4.5 Retrieve Only the Single Most Similar Document**
Retrieve only the single most relevant document to the query.

In [None]:
# Create a retriever to fetch only the top document
retriever = vector_store.as_retriever(search_kwargs={"k": 1})

# Use the retriever
query = "What is AI?"
docs = retriever.invoke(query)
for doc in docs:
    print(f"Retrieved Document: {doc.page_content}")

### **4.6 Filter Documents by Metadata**
Retrieve documents that match specific metadata filters, such as a paper title or publication year.

In [None]:
# Create a retriever with a metadata filter
retriever = vector_store.as_retriever(
    search_kwargs={"filter": {"paper_title": "GPT-4 Technical Report"}}
)

# Use the retriever
query = "What is AI?"
docs = retriever.invoke(query)
for doc in docs:
    print(f"Retrieved Document: {doc.page_content}")
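
The filter above matches a single field. Chroma's filter syntax also supports operators such as `$eq`, `$gte`, and `$and`, which makes it possible to combine the paper title with a publication year. The following sketch builds such a filter as a plain dictionary. `build_filter` is a hypothetical helper and `year` is an assumed metadata field that the sample documents in this guide do not carry.

```python
def build_filter(**conditions):
    """Combine field conditions into one Chroma-style filter dictionary.

    A plain value means equality; a dict such as {"$gte": 2023} is passed through.
    """
    clauses = [
        {field: cond if isinstance(cond, dict) else {"$eq": cond}}
        for field, cond in conditions.items()
    ]
    if not clauses:
        raise ValueError("at least one condition is required")
    # A single condition needs no "$and" wrapper
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


metadata_filter = build_filter(
    paper_title="GPT-4 Technical Report",
    year={"$gte": 2023},  # hypothetical metadata field
)
print(metadata_filter)
```

The resulting dictionary could then be passed as `search_kwargs={"filter": metadata_filter}` when creating the retriever.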

---

## **5. Class Methods**

### **5.1 Creating a Vector Store from Documents**
Create a Chroma vector store directly from a list of documents.

In [None]:
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Initialize OpenAI embeddings
embed = OpenAIEmbeddings(model="text-embedding-3-small", api_key=my_api_key)

# Create documents
documents = [
    Document(page_content="The sun rises in the east.", metadata={"source": "science"}),
    Document(page_content="The moon orbits the Earth.", metadata={"source": "science"}),
]

# Create a Chroma vector store from documents
try:
    vector_store = Chroma.from_documents(
        documents=documents,
        embedding=embed,
        collection_name="science_collection",
        persist_directory="./chroma_db_science"
    )
    
    # Verify success by checking that the collection holds documents
    if vector_store._collection.count() > 0:
        print("Vector store created successfully.")
    else:
        print("Failed to create vector store. Collection is empty.")
except Exception as e:
    print(f"Error creating vector store: {e}")

### **5.2 Creating a Vector Store from Texts**
Create a Chroma vector store directly from raw texts.

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Initialize OpenAI embeddings
embed = OpenAIEmbeddings(model="text-embedding-3-small", api_key=my_api_key)

# Create texts and metadata
texts = ["The sky is blue.", "The grass is green."]
metadatas = [{"source": "nature"}, {"source": "nature"}]

# Create a Chroma vector store from texts
try:
    vector_store = Chroma.from_texts(
        texts=texts,
        embedding=embed,
        metadatas=metadatas,
        collection_name="nature_collection",
        persist_directory="./chroma_db_nature"
    )
    
    # Verify success by checking that the collection holds documents
    if vector_store._collection.count() > 0:
        print("Vector store created successfully.")
    else:
        print("Failed to create vector store. Collection is empty.")
except Exception as e:
    print(f"Error creating vector store: {e}")

## Conclusion

This guide covered the Chroma vector store API in LangChain through practical examples: saving and loading a database, managing documents, running similarity and MMR searches, configuring retrievers, and creating stores from documents or texts. These examples can serve as starting points for integrating Chroma into your own workflows.