# ChromaDB

The following example uses `ChromaDocumentIndex`, a wrapper around `chromadb` (https://docs.trychroma.com/getting-started).

> By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings.

- https://docs.trychroma.com/embeddings

In [1]:
from llm_workflow.base import Document
from llm_workflow.indexes import ChromaDocumentIndex

doc_index = ChromaDocumentIndex()

docs = [
    Document(content="This is a document about basketball.", metadata={'id': 0}),
    Document(content="This is a document about baseball.", metadata={'id': 1}),
    Document(content="This is a document about football.", metadata={'id': 2}),
]
doc_index.add(docs=docs)

In [2]:
results = doc_index.search(value="Give a document about baseball", n_results=1)
print(results)
print(results[0].content)

[Document(content='This is a document about baseball.', metadata={'id': 1, 'distance': 0.14589954912662506})]
This is a document about baseball.
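The `distance` value in the result's metadata reflects how close the stored document's embedding is to the embedding of the search value; a smaller distance means a closer match. The following is a toy sketch of that idea, not what `ChromaDocumentIndex` actually runs: it assumes squared Euclidean distance (the metric used by the index is not stated here), and `squared_l2`, `nearest`, `toy_vectors`, and `toy_query` are hypothetical names with hand-made 2-D vectors standing in for real embeddings.

```python
# Toy illustration only: hand-made 2-D vectors stand in for real embeddings.
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, vectors, n_results=1):
    """Return (index, distance) pairs for the n_results closest vectors."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    # Sort by distance, smallest first, and keep the top n_results.
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


toy_vectors = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]  # hypothetical "embeddings"
toy_query = (0.1, 0.9)                                # hypothetical query vector

top = nearest(toy_query, toy_vectors, n_results=1)
print(top[0][0])  # index of the closest toy vector
```

In this sketch, `n_results` plays the same role as in `doc_index.search`: it caps how many of the closest matches are returned.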


---

# Using an Embeddings Model

We can also supply our own embeddings model. In this example, let's use `OpenAIEmbedding`.

In [3]:
# `load_dotenv` loads the variables defined in the `.env` file as environment variables.
# The project directory must contain a `.env` file with your OpenAI API key in this format:
# OPENAI_API_KEY=sk-...
from dotenv import load_dotenv
load_dotenv()

True
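To confirm that the key was actually loaded, one option is to check the environment directly. This is a minimal sketch using only the standard library; `key_is_set` is a hypothetical name, and the check tests only for presence so that the key itself is never printed.

```python
import os

# Check only whether the variable is present and non-empty; never print the key.
key_is_set = bool(os.environ.get("OPENAI_API_KEY"))
print("OPENAI_API_KEY set:", key_is_set)
```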

In [4]:
from llm_workflow.base import Document
from llm_workflow.indexes import ChromaDocumentIndex
from llm_workflow.models import OpenAIEmbedding

embeddings_model = OpenAIEmbedding(model_name='text-embedding-ada-002')
doc_index = ChromaDocumentIndex(embeddings_model=embeddings_model)

docs = [
    Document(content="This is a document about basketball.", metadata={'id': 0}),
    Document(content="This is a document about baseball.", metadata={'id': 1}),
    Document(content="This is a document about football.", metadata={'id': 2}),
]
doc_index.add(docs=docs)

A `DocumentIndex` object exposes the history and usage of its underlying embeddings model (in this case, the history and usage of `OpenAIEmbedding`).

The cost, tokens, and history below come from the embeddings created for the documents that were added to the document index.

In [5]:
print(f"Cost:   ${doc_index.cost:.6f}")
print(f"Tokens: {doc_index.total_tokens:,}")

Cost:   $0.000002
Tokens: 21


In [6]:
doc_index.history

[EmbeddingRecord(uuid='3e595c4a-6e49-4dbf-ae2d-7bbc2b6949dc', timestamp='2023-07-04 22:56:05.735', metadata={'model_name': 'text-embedding-ada-002'}, total_tokens=21, cost=2.1000000000000002e-06)]
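The record above pairs 21 tokens with a cost of about 2.1e-06 dollars, which is consistent with a flat rate of $0.0001 per 1,000 tokens. The sketch below reproduces that arithmetic. The rate is an assumption inferred from the output above, not a value read from the library, and `ASSUMED_COST_PER_TOKEN` and `estimate_cost` are hypothetical names.

```python
# Assumption: rate inferred from the record above (21 tokens -> ~2.1e-06 dollars),
# i.e. $0.0001 per 1,000 tokens. It is not read from the library.
ASSUMED_COST_PER_TOKEN = 0.0001 / 1_000


def estimate_cost(total_tokens: int) -> float:
    """Estimate the embedding cost in dollars under the assumed flat rate."""
    return total_tokens * ASSUMED_COST_PER_TOKEN


estimated = estimate_cost(21)
print(f"Cost:   ${estimated:.6f}")  # prints "Cost:   $0.000002"
```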

In [7]:
results = doc_index.search(value="Give a document about baseball", n_results=1)
print(results)
print(results[0].content)

[Document(content='This is a document about baseball.', metadata={'id': 1, 'distance': 0.08675364404916763})]
This is a document about baseball.


The cost, tokens, and history below now also include the embedding that was created for the search value above. You can check that the cost and tokens of the records in the `history` property sum to the cost and total tokens shown below.

In [8]:
print(f"Cost:   ${doc_index.cost:.6f}")
print(f"Tokens: {doc_index.total_tokens:,}")

Cost:   $0.000003
Tokens: 26


In [9]:
doc_index.history

[EmbeddingRecord(uuid='3e595c4a-6e49-4dbf-ae2d-7bbc2b6949dc', timestamp='2023-07-04 22:56:05.735', metadata={'model_name': 'text-embedding-ada-002'}, total_tokens=21, cost=2.1000000000000002e-06),
 EmbeddingRecord(uuid='5fdc6a50-15dd-4bbf-9da8-707ed12200c8', timestamp='2023-07-04 22:56:05.943', metadata={'model_name': 'text-embedding-ada-002'}, total_tokens=5, cost=5.000000000000001e-07)]
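The two records above can be summed to confirm the totals reported earlier. This sketch copies the `total_tokens` and `cost` values from the output into a hypothetical stand-in class, `UsageRecord`, so that it runs without the library or an API key. With a live index, the same sums could presumably be taken over `doc_index.history`, since each record shows `total_tokens` and `cost` fields.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Hypothetical stand-in holding the two fields used from each history record."""
    total_tokens: int
    cost: float


# Values copied from the history output above.
history = [
    UsageRecord(total_tokens=21, cost=2.1000000000000002e-06),  # adding the documents
    UsageRecord(total_tokens=5, cost=5.000000000000001e-07),    # embedding the search value
]

summed_tokens = sum(record.total_tokens for record in history)
summed_cost = sum(record.cost for record in history)

print(f"Cost:   ${summed_cost:.6f}")  # prints "Cost:   $0.000003"
print(f"Tokens: {summed_tokens:,}")   # prints "Tokens: 26"
```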

---