## ChromaDB Tutorial

- [Try Chroma](https://docs.trychroma.com/guides/multimodal)
    - [Chroma Guide](https://docs.trychroma.com/guides)

- [Embeddings and Vector Databases With ChromaDB](https://realpython.com/chromadb-vector-database/)
    - intro to vector embeddings
    - querying ChromaDB
    - LLM

- [Chroma DB Tutorial: A Step-By-Step Guide](https://www.datacamp.com/tutorial/chromadb-tutorial-step-by-step-guide)  

- [Chroma Vector Database Tutorial](https://anderfernandez.com/en/blog/chroma-vector-database-tutorial/)
    - deploy with Docker
    - deploy with ClickHouse
    - connect from Python
    - uses S&P 500 company info
    - filter by metadata

- [Getting Started with Chroma DB: A Beginner’s Tutorial](https://medium.com/@pierrelouislet/getting-started-with-chroma-db-a-beginners-tutorial-6efa32300902)
    - brief overview
    - uses Docker

See `readme.md`.

In [1]:
!pip show chromadb

Name: chromadb
Version: 0.5.0
Summary: Chroma.
Home-page: 
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: /home/papagame/anaconda3/envs/vanna/lib/python3.11/site-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pydantic, pypika, PyYAML, requests, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: 


In [3]:
!pwd

/home/papagame/projects/wgong/py4kids


In [2]:
import chromadb
from chromadb.utils import embedding_functions

CHROMA_DATA_PATH = "./db"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_docs"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

### Create collection

In [18]:
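# Optional: pass an explicit embedding function. When it is omitted,
# Chroma uses its built-in default embedding function.
# embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
#     model_name=EMBED_MODEL
# )

# Reuse the collection if it already exists, otherwise create it.
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    # embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},  # distance metric for the HNSW index
)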
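The `hnsw:space` key selects the distance function used by the index, and `"cosine"` asks for cosine distance. As a rough illustration of what that metric measures, here is a plain-Python sketch; `cosine_distance` is a hypothetical helper written for this note, not part of Chroma and not how Chroma computes it internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # perpendicular
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Under this definition the distance ignores vector length: 0 means the same direction, 1 means unrelated (orthogonal), and 2 means opposite.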

### List collections

In [38]:
# Use a separate loop variable so the `collection` handle above is not rebound.
for col in client.list_collections():
    print(f"Collection name: {col.name}")

Collection name: demo_docs


### Add Docs

In [19]:
# sample docs
documents = [
    "The latest iPhone model comes with impressive features and a powerful camera.",
    "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
    "Einstein's theory of relativity revolutionized our understanding of space and time.",
    "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
    "The American Revolution had a profound impact on the birth of the United States as a nation.",
    "Regular exercise and a balanced diet are essential for maintaining good physical health.",
    "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
    "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
    "Startup companies often face challenges in securing funding and scaling their operations.",
    "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
]

genres = [
    "technology",
    "travel",
    "science",
    "food",
    "history",
    "fitness",
    "art",
    "climate change",
    "business",
    "music",
]

collection.add(
    documents=documents,
    ids=[f"id{i}" for i in range(len(documents))],
    metadatas=[{"genre": g} for g in genres]
)

Add of existing embedding ID: id0
Add of existing embedding ID: id1
Add of existing embedding ID: id2
Add of existing embedding ID: id3
Add of existing embedding ID: id4
Add of existing embedding ID: id5
Add of existing embedding ID: id6
Add of existing embedding ID: id7
Add of existing embedding ID: id8
Add of existing embedding ID: id9
Insert of existing embedding ID: id0
Insert of existing embedding ID: id1
Insert of existing embedding ID: id2
Insert of existing embedding ID: id3
Insert of existing embedding ID: id4
Insert of existing embedding ID: id5
Insert of existing embedding ID: id6
Insert of existing embedding ID: id7
Insert of existing embedding ID: id8
Insert of existing embedding ID: id9
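The "Add of existing embedding ID" messages appear when `add` is called with IDs that are already stored, which happens when this cell is re-run against the persistent database. One way to keep re-runs predictable is to build the IDs, documents, and metadata in one place and write them with `collection.upsert`, which overwrites existing IDs. The sketch below only prepares the records; `build_records` is a hypothetical helper, and the two sample sentences are shortened stand-ins for the documents above.

```python
def build_records(documents, genres, prefix="id"):
    """Pair each document with a stable ID and a genre metadata dict."""
    if len(documents) != len(genres):
        raise ValueError("documents and genres must have the same length")
    return {
        "ids": [f"{prefix}{i}" for i in range(len(documents))],
        "documents": list(documents),
        "metadatas": [{"genre": g} for g in genres],
    }


records = build_records(
    ["Pizza is baked in wood-fired ovens.", "Bali has beautiful beaches."],
    ["food", "travel"],
)
print(records["ids"])

# With a Chroma collection in scope, the records could then be written with:
# collection.upsert(**records)
```

The length check catches a mismatch between documents and genres before anything is sent to the database.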


### Query

In [20]:
query_results = collection.query(
    query_texts=["Find me some delicious food!"],
    n_results=2,
)

In [21]:
query_results

{'ids': [['id3', 'id1']],
 'distances': [[1.5276524111861616, 1.6584325895032108]],
 'metadatas': [[{'genre': 'food'}, {'genre': 'travel'}]],
 'embeddings': None,
 'documents': [['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.',
   'Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.']],
 'uris': None,
 'data': None}

In [22]:
query_results = collection.query(
    query_texts=["Teach me about history",
                 "What's going on in the world?"],
    include=["documents", "distances", 'metadatas'],
    n_results=2
)

In [23]:
query_results

{'ids': [['id2', 'id4'], ['id7', 'id2']],
 'distances': [[1.2531765850067067, 1.3808384485195626],
  [1.6005885986562254, 1.776421301382694]],
 'metadatas': [[{'genre': 'science'}, {'genre': 'history'}],
  [{'genre': 'climate change'}, {'genre': 'science'}]],
 'embeddings': None,
 'documents': [["Einstein's theory of relativity revolutionized our understanding of space and time.",
   'The American Revolution had a profound impact on the birth of the United States as a nation.'],
  ["Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
   "Einstein's theory of relativity revolutionized our understanding of space and time."]],
 'uris': None,
 'data': None}
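`query` returns parallel nested lists: the outer index is the query text and the inner index is the rank of the hit. The sketch below regroups such a result into one list of hits per query, which is often easier to read or to load into a table. `flatten_query_results` is a hypothetical helper, and the sample dict is a trimmed copy of the output above with rounded distances and no documents.

```python
def flatten_query_results(query_texts, results):
    """Return one list of hit dicts per query text."""
    optional = (
        ("distances", "distance"),
        ("metadatas", "metadata"),
        ("documents", "document"),
    )
    grouped = []
    for q_idx, query in enumerate(query_texts):
        hits = []
        for rank, doc_id in enumerate(results["ids"][q_idx]):
            hit = {"query": query, "rank": rank, "id": doc_id}
            # Fields left out of `include` come back as None, so skip them.
            for key, name in optional:
                values = results.get(key)
                if values is not None:
                    hit[name] = values[q_idx][rank]
            hits.append(hit)
        grouped.append(hits)
    return grouped


sample = {
    "ids": [["id2", "id4"], ["id7", "id2"]],
    "distances": [[1.25, 1.38], [1.60, 1.78]],
    "metadatas": [
        [{"genre": "science"}, {"genre": "history"}],
        [{"genre": "climate change"}, {"genre": "science"}],
    ],
    "documents": None,
}
queries = ["Teach me about history", "What's going on in the world?"]

for hits in flatten_query_results(queries, sample):
    for hit in hits:
        print(hit["query"], "->", hit["id"], hit["metadata"]["genre"], hit["distance"])
```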

In [24]:
query_results = collection.query(
    query_texts=["Teach me about music history"],
    n_results=1
)
query_results

{'ids': [['id2']],
 'distances': [[1.5251642004851387]],
 'metadatas': [[{'genre': 'science'}]],
 'embeddings': None,
 'documents': [["Einstein's theory of relativity revolutionized our understanding of space and time."]],
 'uris': None,
 'data': None}

In [25]:
query_results = collection.query(
    query_texts=["Teach me about music history"],
    where={"genre": {"$eq": "music"}},
    n_results=1,
)
query_results

{'ids': [['id9']],
 'distances': [[1.6372656175451257]],
 'metadatas': [[{'genre': 'music'}]],
 'embeddings': None,
 'documents': [["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'"]],
 'uris': None,
 'data': None}

In [26]:
query_results = collection.query(
    query_texts=["Teach me about music history"],
    where={"genre": {"$in": ["music", "history"]}},
    n_results=2,
)

query_results

{'ids': [['id9', 'id4']],
 'distances': [[1.6372656175451257, 1.6400827280864978]],
 'metadatas': [[{'genre': 'music'}, {'genre': 'history'}]],
 'embeddings': None,
 'documents': [["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
   'The American Revolution had a profound impact on the birth of the United States as a nation.']],
 'uris': None,
 'data': None}
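The `where` argument restricts the search to records whose metadata satisfies the filter before the nearest neighbours are ranked. To make the semantics of `$eq` and `$in` concrete, here is a small pure-Python predicate that mimics them on plain dicts. It is an illustrative re-implementation under the assumption that only these two operators are needed; `matches` is a hypothetical name and this is not Chroma's own filtering code.

```python
def matches(metadata, where):
    """Return True if `metadata` satisfies a where clause using $eq or $in."""
    for field, condition in where.items():
        value = metadata.get(field)
        # A bare value such as {"genre": "music"} is treated as {"$eq": "music"}.
        if not isinstance(condition, dict):
            condition = {"$eq": condition}
        for op, operand in condition.items():
            if op == "$eq":
                ok = value == operand
            elif op == "$in":
                ok = value in operand
            else:
                raise ValueError(f"unsupported operator in this sketch: {op}")
            if not ok:
                return False
    return True


metadatas = [{"genre": g} for g in ["technology", "history", "music"]]

only_music = [m for m in metadatas if matches(m, {"genre": {"$eq": "music"}})]
music_or_history = [
    m for m in metadatas if matches(m, {"genre": {"$in": ["music", "history"]}})
]
print(only_music)
print(music_or_history)
```

This mirrors the two queries above: `$eq` leaves only the music record as a candidate, while `$in` leaves both the music and history records.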

### Update

In [29]:
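# Caution: the IDs are zero-based, so "id1" held the Bali document and "id2"
# the Einstein document. This update overwrites those two records; the
# original iPhone document remains under "id0".
collection.update(
    ids=["id1", "id2"],
    documents=["The new iPhone is awesome!",
               "Bali has beautiful beaches"],
    metadatas=[{"genre": "tech"}, {"genre": "beaches"}]
)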

query_results = collection.get(ids=["id1", "id2"])
query_results

{'ids': ['id1', 'id2'],
 'embeddings': None,
 'metadatas': [{'genre': 'tech'}, {'genre': 'beaches'}],
 'documents': ['The new iPhone is awesome!', 'Bali has beautiful beaches'],
 'uris': None,
 'data': None}
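Because the IDs were generated with `range(len(documents))`, they start at `id0`. A quick lookup table makes it easy to check which document an ID refers to before updating it. This sketch uses only the first three sample documents, and `id_to_doc` is a name introduced here for illustration.

```python
documents = [
    "The latest iPhone model comes with impressive features and a powerful camera.",
    "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
    "Einstein's theory of relativity revolutionized our understanding of space and time.",
]

# Rebuild the same ID scheme that was used when the documents were added.
id_to_doc = {f"id{i}": doc for i, doc in enumerate(documents)}

for doc_id in ("id1", "id2"):
    print(doc_id, "->", id_to_doc[doc_id])
```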

In [34]:
query_results = collection.query(
    query_texts=["Which phone is great", "where to find beautiful beach"],
    n_results=1,
    include=['metadatas', 'documents'],  # ids are always returned
)
query_results

{'ids': [['id1'], ['id2']],
 'distances': None,
 'metadatas': [[{'genre': 'tech'}], [{'genre': 'beaches'}]],
 'embeddings': None,
 'documents': [['The new iPhone is awesome!'], ['Bali has beautiful beaches']],
 'uris': None,
 'data': None}

In [37]:
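print(collection.count())  # number of records before the delete

collection.delete(ids=["id2"])  # remove a single record by ID

print(collection.count())  # number of records after the delete

# IDs that are not in the collection are simply absent from the result.
collection.get(ids=["id1", "id2", "id3"])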

9
8


{'ids': ['id3'],
 'embeddings': None,
 'metadatas': [{'genre': 'food'}],
 'documents': ['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.'],
 'uris': None,
 'data': None}

### Remove all collections

In [39]:
# Delete every collection in this database; this cannot be undone.
for col in client.list_collections():
    print(f"Removing collection: {col.name}")
    client.delete_collection(name=col.name)

Removing collection: demo_docs


In [41]:
ids = """
83ca2e16-13e9-4cb7-831d-0f47f406a9f0
2fba5c16-790f-4852-a580-63749fde839b
897c7090-6cbf-4e2d-8ffa-2217cdc127c0
3b4962a5-3791-4ff9-b98f-82b164b28cc2
38f8c0b8-a754-4baf-8521-ff44dc3b59f0
67c9a261-0f99-4c9b-b44d-de93cbb8ed5d
e6dd68d3-4bdc-4a76-8315-bb6fee65465e
"""
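# Turn the whitespace-separated IDs into a comma-separated, single-quoted list.
# str.split() already drops blank lines and surrounding whitespace.
print(",".join(f"'{i}'" for i in ids.split()))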

'83ca2e16-13e9-4cb7-831d-0f47f406a9f0','2fba5c16-790f-4852-a580-63749fde839b','897c7090-6cbf-4e2d-8ffa-2217cdc127c0','3b4962a5-3791-4ff9-b98f-82b164b28cc2','38f8c0b8-a754-4baf-8521-ff44dc3b59f0','67c9a261-0f99-4c9b-b44d-de93cbb8ed5d','e6dd68d3-4bdc-4a76-8315-bb6fee65465e'
