# Chroma

- Author: [Gwangwon Jung](https://github.com/pupba)
- Design: []()
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/02-Chroma.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/02-Chroma.ipynb)

## Overview

This tutorial covers how to use the **Chroma Vector Store** with **LangChain**.

`Chroma` is an **open-source AI application database**.

In this tutorial, after learning how to use `langchain-chroma`, we will implement a simple **Text Search** engine using `Chroma`.

![search-example](./assets/02-chroma-with-langchain-flow-search-example.png)

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [What is Chroma?](#what-is-chroma)
- [LangChain Chroma Basic](#langchain-chroma-basic)
- [Manage Store](#manage-store)
- [Query Vector Store](#query-vector-store)


### References

- [Chroma Docs](https://docs.trychroma.com/docs/overview/introduction)
- [Langchain-Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma/)
- [List of VectorStore supported by Langchain](https://python.langchain.com/docs/integrations/vectorstores/)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain-core",
        "langchain-chroma",
        "chromadb",
        "langchain-text-splitters",
        "langchain-huggingface",
        "python-dotenv",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Chroma",
        "HUGGINGFACEHUB_API_TOKEN": "",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

## What is Chroma?

![logo](./assets/02-chroma-with-langchain-chroma-logo.png)

`Chroma` is an **open-source vector database** designed for AI applications.

It specializes in storing high-dimensional vectors and performing fast similarity search, making it ideal for tasks like **semantic search**, **recommendation systems**, and **multimodal search**.

With its **developer-friendly APIs** and seamless integration with frameworks like **LangChain**, `Chroma` is a powerful tool for building scalable, AI-driven solutions.

A key feature of `Chroma` is that it handles **Indexing ([HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world))** and **Embedding ([all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2))** internally when storing data.
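
To make the idea of similarity search concrete, here is a minimal sketch of an exact (brute-force) nearest-neighbor lookup over toy vectors. It is not how `Chroma` works internally: HNSW is an approximate index that avoids comparing the query against every stored vector. The function names and the 3-dimensional "embeddings" are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query: list[float], vectors: dict, k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy 3-dimensional "embeddings" (invented values, not real model output).
toy_vectors = {
    "fox": [0.9, 0.1, 0.0],
    "rose": [0.1, 0.9, 0.1],
    "planet": [0.0, 0.2, 0.9],
}

top_ids = brute_force_search([0.8, 0.3, 0.0], toy_vectors, k=2)
print(top_ids)  # ['fox', 'rose']
```

An index such as HNSW trades a small amount of accuracy for much faster lookups once the collection grows beyond what a full scan can handle.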

## LangChain Chroma Basic

### Select Embedding Model

We load the **Embedding Model** with `langchain_huggingface`.

To use a different model, set `model_name` to another Hugging Face model identifier.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "Alibaba-NLP/gte-base-en-v1.5"

embeddings = HuggingFaceEmbeddings(
    model_name=model_name, model_kwargs={"trust_remote_code": True}
)

### Create VectorDB

The `langchain-chroma` **library** has no `upsert` function and its interface is not uniform with other **Vector DBs**, so this tutorial uses a custom **Python** class instead.

First, load the `ChromaDB` class from **utils/chroma/basic.py**.

In [6]:
from utils.chroma.basic import ChromaDB

vector_store = ChromaDB(embeddings=embeddings)

Connect the `ChromaDB` object to a collection with the following settings.

- **Mode** : `persistent`

- **Persistent Path** : `data/chroma_text` (stored in a `SQLite` DB)

- **collection** : `test`

- **hnsw:space** : `cosine`

In [7]:
configs = {
    "mode": "persistent",
    "persistent_path": "data/chroma_text",
    "collection": "test",
    "hnsw:space": "cosine",
}

vector_store.connect(**configs)
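
The `hnsw:space` setting selects the distance function used to compare vectors. The sketch below spells out the three options (`l2`, `ip`, `cosine`) as the Chroma documentation describes them; the function names are chosen here for illustration and are not part of any library.

```python
import math


def squared_l2(a: list[float], b: list[float]) -> float:
    """`l2`: squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: list[float], b: list[float]) -> float:
    """`ip`: one minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: list[float], b: list[float]) -> float:
    """`cosine`: one minus the cosine similarity (vector length is ignored)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [0.0, 1.0]  # orthogonal to a
c = [2.0, 0.0]  # same direction as a, twice the length

print(cosine_distance(a, c))         # 0.0: same direction
print(cosine_distance(a, b))         # 1.0: orthogonal
print(squared_l2(a, c))              # 1.0: length difference counts
print(inner_product_distance(a, c))  # -1.0
```

With `cosine`, only the direction of the embedding matters, which is a common choice for text embeddings whose magnitudes carry little meaning.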

### Load Text Documents Data

In this tutorial, we will use the text of the fairy tale **The Little Prince**.

To put this data into **Chroma**, we first need to preprocess it.

First, we load the `data/the_little_prince.txt` file, which contains only the text of the fairy tale.


In [8]:
# Pass encoding="utf-8" explicitly; on Windows the default encoding may differ
with open("./data/the_little_prince.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

Second, split the loaded text into chunks with `RecursiveCharacterTextSplitter`.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

split_docs = text_splitter.create_documents([raw_text])

for docs in split_docs[:2]:
    print(f"Content: {docs.page_content}\nMetadata: {docs.metadata}", end="\n\n")

Content: The Little Prince
Written By Antoine de Saiot-Exupery (1900〜1944)
Metadata: {}

Content: [ Antoine de Saiot-Exupery ]
Metadata: {}
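
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window chunker. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its chunks will differ from these. The `chunk_text` helper is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 250 characters of repeating digits, so positions are easy to check.
sample = "".join(str(i % 10) for i in range(250))
chunks = chunk_text(sample, chunk_size=100, chunk_overlap=20)

print([len(c) for c in chunks])            # [100, 100, 90]
print(chunks[0][-20:] == chunks[1][:20])   # True: neighbors share 20 characters
```

The overlap keeps a little shared context between neighboring chunks, so a sentence cut at a boundary is still partly visible in both.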



Preprocess the documents for **Chroma**.

In [10]:
pre_dosc = vector_store.preprocess_documents(
    documents=split_docs,
    source="The Little Prince",
    author="Antoine de Saint-Exupéry",
    chapter=True,
)

In [11]:
pre_dosc[:2]

[Document(metadata={'source': 'The Little Prince', 'author': 'Antoine de Saint-Exupéry', 'chapter': 1, 'id': '57623d83-6989-4288-aef9-4a73a9dd4e15'}, page_content='- we are introduced to the narrator, a pilot, and his ideas about grown-ups'),
 Document(metadata={'source': 'The Little Prince', 'author': 'Antoine de Saint-Exupéry', 'chapter': 1, 'id': 'f2828b20-b00f-4771-80dd-13d3b7e76a87'}, page_content='Once when I was six years old I saw a magnificent picture in a book, called True Stories from')]
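
The implementation of `preprocess_documents` is not shown in this tutorial. The following sketch only illustrates the kind of work such a step might do, based on the output above: attach `source`, `author`, and `chapter` metadata and a generated UUID to each chunk. The `Doc` class, the `attach_metadata` helper, and the chapter-heading pattern are all assumptions made for illustration.

```python
import re
import uuid
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


# Assumption: chapter boundaries appear as chunks such as "[ Chapter 3 ]".
CHAPTER_PATTERN = re.compile(r"\[\s*Chapter\s+(\d+)\s*\]")


def attach_metadata(chunks: list[str], source: str, author: str) -> list[Doc]:
    """Tag each chunk with source, author, current chapter, and a fresh UUID."""
    docs = []
    current_chapter = None
    for text in chunks:
        match = CHAPTER_PATTERN.search(text)
        if match:
            current_chapter = int(match.group(1))
            continue  # the heading itself is not stored as a document
        if current_chapter is None:
            continue  # skip front matter that precedes the first chapter
        docs.append(
            Doc(
                page_content=text,
                metadata={
                    "source": source,
                    "author": author,
                    "chapter": current_chapter,
                    "id": str(uuid.uuid4()),
                },
            )
        )
    return docs


sample_chunks = ["Title page", "[ Chapter 1 ]", "first chunk", "second chunk", "[ Chapter 2 ]", "third chunk"]
sample_docs = attach_metadata(sample_chunks, "The Little Prince", "Antoine de Saint-Exupéry")
print([d.metadata["chapter"] for d in sample_docs])  # [1, 1, 2]
```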

## Manage Store

This section introduces four basic functions.

- `add`

- `upsert(parallel)`

- `query`

- `delete`

### Add

Add new **Documents**.

An error occurs if a document with the same **ID** already exists.

In [12]:
vector_store.add(pre_documents=pre_dosc[:2])

In [13]:
uids = list(vector_store.unique_ids)
uids

['57623d83-6989-4288-aef9-4a73a9dd4e15',
 'f2828b20-b00f-4771-80dd-13d3b7e76a87']

In [14]:
vector_store.chroma.get(ids=uids[0])

{'ids': ['57623d83-6989-4288-aef9-4a73a9dd4e15'],
 'embeddings': None,
 'documents': ['- we are introduced to the narrator, a pilot, and his ideas about grown-ups'],
 'uris': None,
 'data': None,
 'metadatas': [{'author': 'Antoine de Saint-Exupéry',
   'chapter': 1,
   'source': 'The Little Prince'}],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

An error is reported when you try to `add` documents with duplicate `ids`.

In [15]:
vector_store.add(pre_documents=pre_dosc[:2])

Add of existing embedding ID: 57623d83-6989-4288-aef9-4a73a9dd4e15
Add of existing embedding ID: f2828b20-b00f-4771-80dd-13d3b7e76a87
Insert of existing embedding ID: 57623d83-6989-4288-aef9-4a73a9dd4e15
Insert of existing embedding ID: f2828b20-b00f-4771-80dd-13d3b7e76a87


### Upsert(parallel)

`Upsert` will `Update` a document if the same `ID` already exists, or `Add` a new document otherwise.
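
The difference between `add` and `upsert` can be sketched with a plain dictionary keyed by ID. `TinyStore` is a hypothetical toy, not the `ChromaDB` class used in this tutorial. Note that the toy raises an exception on a duplicate `add`, whereas the output above shows Chroma reporting the duplicate IDs as messages.

```python
class TinyStore:
    """Toy key-value store illustrating add vs. upsert semantics."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def add(self, doc_id: str, text: str) -> None:
        """Insert a new document; refuse to overwrite an existing ID."""
        if doc_id in self._docs:
            raise ValueError(f"ID already exists: {doc_id}")
        self._docs[doc_id] = text

    def upsert(self, doc_id: str, text: str) -> None:
        """Update the document if the ID exists, otherwise insert it."""
        self._docs[doc_id] = text

    def get(self, doc_id: str) -> str:
        return self._docs[doc_id]


store = TinyStore()
store.add("a", "Original Content")
store.upsert("a", "Changed Content")  # existing ID: updated in place
store.upsert("b", "New Content")      # unknown ID: added

try:
    store.add("a", "Duplicate")
except ValueError as err:
    print(err)  # ID already exists: a

print(store.get("a"))  # Changed Content
```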

In [16]:
tmp_ids = [docs.metadata["id"] for docs in pre_dosc[:2]]
vector_store.chroma.get(ids=tmp_ids)

{'ids': ['57623d83-6989-4288-aef9-4a73a9dd4e15',
  'f2828b20-b00f-4771-80dd-13d3b7e76a87'],
 'embeddings': None,
 'documents': ['- we are introduced to the narrator, a pilot, and his ideas about grown-ups',
  'Once when I was six years old I saw a magnificent picture in a book, called True Stories from'],
 'uris': None,
 'data': None,
 'metadatas': [{'author': 'Antoine de Saint-Exupéry',
   'chapter': 1,
   'source': 'The Little Prince'},
  {'author': 'Antoine de Saint-Exupéry',
   'chapter': 1,
   'source': 'The Little Prince'}],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [17]:
pre_dosc[0].page_content = "Changed Content"
pre_dosc[0]

Document(metadata={'source': 'The Little Prince', 'author': 'Antoine de Saint-Exupéry', 'chapter': 1, 'id': '57623d83-6989-4288-aef9-4a73a9dd4e15'}, page_content='Changed Content')

In [18]:
vector_store.upsert_documents(
    documents=pre_dosc[:2],
)
tmp_ids = [docs.metadata["id"] for docs in pre_dosc[:2]]
vector_store.chroma.get(ids=tmp_ids)

{'ids': ['57623d83-6989-4288-aef9-4a73a9dd4e15',
  'f2828b20-b00f-4771-80dd-13d3b7e76a87'],
 'embeddings': None,
 'documents': ['Changed Content',
  'Once when I was six years old I saw a magnificent picture in a book, called True Stories from'],
 'uris': None,
 'data': None,
 'metadatas': [{'author': 'Antoine de Saint-Exupéry',
   'chapter': 1,
   'id': '57623d83-6989-4288-aef9-4a73a9dd4e15',
   'source': 'The Little Prince'},
  {'author': 'Antoine de Saint-Exupéry',
   'chapter': 1,
   'id': 'f2828b20-b00f-4771-80dd-13d3b7e76a87',
   'source': 'The Little Prince'}],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [19]:
# parallel upsert
vector_store.upsert_documents_parallel(
    documents=pre_dosc,
    batch_size=32,
    max_workers=10,
)
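
The internals of `upsert_documents_parallel` are not shown here. One common way to implement such a method, sketched below under that assumption, is to split the documents into batches of `batch_size` and submit each batch to a thread pool with `max_workers` threads. `make_batches`, `upsert_parallel`, and `fake_upsert` are hypothetical names, and a dictionary stands in for the vector store.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches of at most batch_size elements."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


def upsert_parallel(items: list, upsert_batch, batch_size: int = 32, max_workers: int = 10) -> int:
    """Run upsert_batch on every batch in a thread pool; return the total written."""
    batches = make_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all batches and re-raises any worker exception.
        counts = list(pool.map(upsert_batch, batches))
    return sum(counts)


fake_store: dict[str, str] = {}


def fake_upsert(batch: list[tuple[str, str]]) -> int:
    """Stand-in for a real batch upsert: write each (id, text) pair."""
    for doc_id, text in batch:
        fake_store[doc_id] = text
    return len(batch)


items = [(f"id-{i}", f"text {i}") for i in range(100)]
written = upsert_parallel(items, fake_upsert, batch_size=32, max_workers=10)
print(written, len(make_batches(items, 32)))  # 100 4
```

Threads help most when each batch spends its time waiting on I/O; a batch size that is too small adds overhead, while one that is too large reduces parallelism.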

## Query Vector Store

There are two ways to **Query** the **LangChain Chroma Vector Store**.

- **Directly** : Query the vector store directly using methods like `similarity_search` or `similarity_search_with_score`.

- **Turning into retriever** : Convert the vector store into a **retriever** object, which can be used in **LangChain** pipelines or chains.

### Query

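This method wraps the search methods of `langchain-chroma`. The parameters below describe the underlying `langchain-chroma` search; the examples in this section pass them to the wrapper as `top_k` and `filters`.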

**Parameters**

- `query:str` - Query text to search for.

- `k:int = DEFAULT_K` - Number of results to return. Defaults to 4.

- `filter: Dict[str, str] | None = None` - Filter by metadata. Defaults to None.

- `where_document: Dict[str, str] | None = None` - Dict used to filter by document content, e.g. `{"$contains": "hello"}`.

- `**kwargs:Any` : Additional keyword arguments to pass to Chroma collection query.


**Returns**
- `List[Document]` - List of documents most similar to the query text. With `cs=True`, as in the cosine similarity example below, each document is paired with a float score; in that example, a higher score means a closer match.

**Simple Search**

In [20]:
docs = vector_store.query(query="Prince", top_k=2)

for doc in docs:
    print("ID:", doc.metadata["id"])
    print("Chapter:", doc.metadata["chapter"])
    print("Page Content:", doc.page_content)
    print()

ID: 62dbbc31-f4be-4d11-bbbc-282e8c1e522b
Chapter: 7
Page Content: prince disturbed my thoughts.

ID: 09f7204d-ccc7-4f2a-b767-440fe3c68ee5
Chapter: 6
Page Content: Oh, little prince! Bit by bit I came to understand the secrets of your sad little life... For a



**Filtering Search**

In [21]:
docs = vector_store.query(query="Prince", top_k=2, filters={"chapter": 20})

for doc in docs:
    print("ID:", doc.metadata["id"])
    print("Chapter:", doc.metadata["chapter"])
    print("Page Content:", doc.page_content)
    print()

ID: b38c0471-f74b-4fa6-a9c9-872b01fd87bb
Chapter: 20
Page Content: snow, the little prince at last came upon a road. And all roads lead to the abodes of men.

ID: 02092b04-eeb3-496e-885c-174e0f864a80
Chapter: 20
Page Content: extinct forever... that doesn‘t make me a very great prince..."



**Cosine Similarity Search**

In [22]:
# Cosine Similarity
results = vector_store.query(query="Prince", top_k=2, cs=True, filters={"chapter": 20})

for doc, score in results:
    print("ID:", doc.metadata["id"])
    print("Chapter:", doc.metadata["chapter"])
    print("Page Content:", doc.page_content)
    print(f"Similarity Score: {round(score,2)*100:.1f}%")
    print()

ID: b38c0471-f74b-4fa6-a9c9-872b01fd87bb
Chapter: 20
Page Content: snow, the little prince at last came upon a road. And all roads lead to the abodes of men.
Similarity Score: 60.0%

ID: 02092b04-eeb3-496e-885c-174e0f864a80
Chapter: 20
Page Content: extinct forever... that doesn‘t make me a very great prince..."
Similarity Score: 54.0%



### as_retriever()

The `as_retriever()` method converts a `VectorStore` object into a `Retriever` object.

A `Retriever` is an interface used in `LangChain` to query a vector store and retrieve relevant documents.

**Parameters**

- `search_type:Optional[str]` - Defines the type of search that the Retriever should perform. Can be `similarity` (default), `mmr`, or `similarity_score_threshold`.

- `search_kwargs:Optional[Dict]` - Keyword arguments to pass to the search function.

    Can include things like:

    `k` : Number of documents to return (Default: 4)

    `score_threshold` : Minimum relevance threshold for `similarity_score_threshold`

    `fetch_k` : Number of documents to pass to the `MMR` algorithm (Default: 20)

    `lambda_mult` : Diversity of results returned by MMR; 1 for minimum diversity and 0 for maximum. (Default: 0.5)

    `filter` : Filter by document metadata


**Returns**

- `VectorStoreRetriever` - Retriever class for VectorStore.
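
The roles of `fetch_k` and `lambda_mult` in `mmr` search can be illustrated with a small pure-Python sketch of maximal marginal relevance. It is a simplified illustration under the usual definition of MMR, not LangChain's actual implementation: the candidates stand for the `fetch_k` nearest vectors, and each step picks the one that best balances relevance to the query against similarity to what is already selected. The vectors are invented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query: list[float], candidates: list[list[float]], k: int = 2, lambda_mult: float = 0.5) -> list[int]:
    """Return indices of k candidates chosen by maximal marginal relevance."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine_similarity(query, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine_similarity(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # 0: most relevant
    [1.0, 0.15],  # 1: near-duplicate of 0
    [0.8, -0.6],  # 2: less relevant, but different from 0
]

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]: relevance only
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]: near-duplicate skipped
```

With `lambda_mult=1.0` the result matches a plain similarity ranking, while lower values penalize candidates that repeat what was already chosen.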


### invoke()

Invoke the retriever to get relevant documents.

Main entry point for synchronous retriever invocations.

**Parameters**

- `input:str` - The query string.
- `config:RunnableConfig | None = None` - Configuration for the retriever. Defaults to None.
- `**kwargs:Any` - Additional arguments to pass to the retriever.


**Returns**

- `List[Document]` : List of relevant documents.

In [23]:
from langchain_chroma import Chroma

client = Chroma(
    collection_name="test",
    persist_directory="data/chroma_text",
    collection_metadata={"hnsw:space": "cosine"},
    embedding_function=embeddings,
)

In [24]:
retriever = client.as_retriever(search_type="similarity", search_kwargs={"k": 2})
docs = retriever.invoke("Prince", filter={"chapter": 5})

for doc in docs:
    print("ID:", doc.id)
    print("Chapter:", doc.metadata["chapter"])
    print("Page Content:", doc.page_content)
    print()

ID: 63c0f702-49d4-411e-beab-e18dbf2543ff
Chapter: 5
Page Content: Indeed, as I learned, there were on the planet where the little prince lived-- as on all planets--

ID: 2d4abb40-f615-4f22-8183-066e8a43f317
Chapter: 5
Page Content: Now there were some terrible seeds on the planet that was the home of the little prince; and these



### Delete

`Delete` documents from the store.

You can restrict which documents are deleted with a metadata filter.

In [25]:
len(vector_store.unique_ids)

1317

In [26]:
len([docs for docs in pre_dosc if docs.metadata["chapter"] == 1])

43

In [27]:
vector_store.delete_by_filter(
    unique_ids=list(vector_store.unique_ids), filters={"chapter": 1}
)

Success Delete 43 Documents


In [28]:
len(vector_store.unique_ids)

1274

In [29]:
vector_store.delete_by_filter(unique_ids=list(vector_store.unique_ids))

Success Delete 1274 Documents


In [25]:
len(vector_store.unique_ids)

0

Finally, release `vector_store`, `embeddings`, and `client`, and clean up the **Hugging Face cache**.

If you created a **vectordb** directory, please **remove** it at the end of this tutorial.

In [30]:
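from huggingface_hub import scan_cache_dir

del embeddings
del vector_store
del client
scan = scan_cache_dir()
# Called without revision hashes, this only builds an empty deletion strategy;
# pass revision hashes and call `.execute()` on the result to actually free space.
scan.delete_revisions()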

DeleteCacheStrategy(expected_freed_size=0, blobs=frozenset(), refs=frozenset(), repos=frozenset(), snapshots=frozenset())
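
To remove the persisted vector store directory, a small helper like the one below can be used. This is a sketch with a hypothetical `remove_directory` function; the path in the commented example is the `persistent_path` used earlier in this tutorial, and the call is left commented out so that nothing is deleted by accident.

```python
import shutil
from pathlib import Path


def remove_directory(path: str) -> bool:
    """Delete a directory tree; return True if something was removed."""
    target = Path(path)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


# Example: uncomment to delete the directory created by this tutorial.
# remove_directory("data/chroma_text")
```

Make sure no client still has the directory open before deleting it.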