## <b>ChromaDB</b>
- https://www.trychroma.com
- We will build a small data store themed on the movie "The Matrix", using collections named after its characters (neo, mr_anderson, trinity) to hold their relevant information.
- This notebook guides you through creating, inspecting, and deleting collections, as well as changing the distance function in ChromaDB.

In [1]:
!pip install chromadb openai -q
!pip install sentence-transformers -q

In [2]:
# Set up a client

import chromadb
client = chromadb.Client()

In [3]:
neo_collection = client.create_collection(name="neo")

In [4]:
# Inspecting a collection
print(neo_collection)

name='neo' id=UUID('94ebf10c-23e8-4fef-99fd-bee3565e7d91') metadata=None tenant='default_tenant' database='default_database'


In [5]:
# Rename the collection and inspect it again
neo_collection.modify(name="mr_anderson")
print(neo_collection)

name='mr_anderson' id=UUID('94ebf10c-23e8-4fef-99fd-bee3565e7d91') metadata=None tenant='default_tenant' database='default_database'


In [6]:
# Counting items
item_count = neo_collection.count()
print(f"# of items in collection: {item_count}")

# of items in collection: 0


In ChromaDB, the distance function determines how the "distance" (dissimilarity) between two items in a collection is calculated. This is crucial for operations such as querying for similar items. The default distance function in ChromaDB is "l2", which stands for Euclidean (L2) distance, the familiar straight-line distance between two points.
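
To make the L2 idea concrete, here is a minimal pure-Python sketch that computes the distance by hand for two toy vectors. It does not call ChromaDB; the helper name `squared_l2` and the vectors are made up for illustration. As far as I know, Chroma's documentation describes the "l2" space as the squared L2 value (no square root), so both forms are printed.

```python
import math

def squared_l2(a, b):
    """Sum of squared coordinate differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy 3-dimensional "embeddings" (illustrative values only)
u = [0.0, 0.0, 0.0]
v = [3.0, 4.0, 0.0]

print(squared_l2(u, v))             # 25.0 (squared L2)
print(math.sqrt(squared_l2(u, v)))  # 5.0  (plain Euclidean distance)
```

Either form ranks neighbours in the same order, because taking the square root does not change which distance is smaller.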

In [7]:
# Get or create a collection, and set its distance function
trinity_collection = client.get_or_create_collection(
    name="trinity", metadata={"hnsw:space": "cosine"}
)
print(trinity_collection)

name='trinity' id=UUID('0f824422-cb0d-489e-b013-b89bad1a6237') metadata={'hnsw:space': 'cosine'} tenant='default_tenant' database='default_database'


We set the distance function to "cosine". Cosine distance is derived from cosine similarity, which compares two vectors by taking the cosine of the angle between them: the closer the directions, the smaller the distance. This can be useful in many domains, including text analysis, where high dimensionality and sparsity are common.
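
As a rough illustration, cosine distance can be computed by hand as one minus the cosine similarity. The sketch below is plain Python, not ChromaDB code; `cosine_distance` and the toy vectors are hypothetical names and values chosen so the arithmetic is easy to check.

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle between a and b); 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length
c = [-4.0, 3.0]  # perpendicular to a

print(cosine_distance(a, b))  # 0.0
print(cosine_distance(a, c))  # 1.0
```

Note that `a` and `b` are far apart in Euclidean terms yet have a cosine distance of zero: only direction matters, not magnitude, which is one reason the measure is popular for text embeddings.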

In [8]:
# Deleting a collection
try:
  client.delete_collection(name="mr_anderson")
  print("Mr. Anderson collection deleted.")
except ValueError as e:
  print(f"Error: {e}")

Mr. Anderson collection deleted.


In [9]:
neo_collection = client.create_collection(name="neo")

In [11]:
# Adding data
# Adding raw documents
neo_collection.add(
    documents = [
        "There is no spoon.",
        "I know kung fu."
    ],
    ids = ["quote1","quote2"]
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:13<00:00, 6.01MiB/s]


In [12]:
item_count = neo_collection.count()
print(f"Count of items in collection: {item_count}")

Count of items in collection: 2


In [13]:
neo_collection.get()

{'ids': ['quote1', 'quote2'],
 'embeddings': None,
 'metadatas': [None, None],
 'documents': ['There is no spoon.', 'I know kung fu.'],
 'uris': None,
 'data': None}

In [14]:
# Take a peek
neo_collection.peek(limit=5)

{'ids': ['quote1', 'quote2'],
 'embeddings': [[0.004506412893533707,
   -0.07763516902923584,
   -0.038877587765455246,
   -0.01235272828489542,
   -0.08395075052976608,
   0.04847825691103935,
   0.027022896334528923,
   -0.08260542154312134,
   0.07585711777210236,
   0.016495604068040848,
   0.034576863050460815,
   -0.0697631761431694,
   -0.012515238486230373,
   -0.05832795426249504,
   -0.0736757442355156,
   -0.12272055447101593,
   -0.0331801138818264,
   -0.10826481133699417,
   -0.010775878094136715,
   0.0024138721637427807,
   0.03132103383541107,
   0.000363168801413849,
   0.057470910251140594,
   -0.01934056729078293,
   0.06213092431426048,
   0.05513307452201843,
   0.019474970176815987,
   -0.06181753799319267,
   -0.025465352460741997,
   0.06344398111104965,
   -0.019318535923957825,
   -0.005409401375800371,
   -0.08224662393331528,
   -0.04527893662452698,
   0.037652596831321716,
   0.007059104740619659,
   0.050451889634132385,
   0.040108174085617065,
   0.050

By default, get returns a dictionary with the ids, metadatas (if provided), and documents of the items in the collection. As the outputs above show, get leaves embeddings as None unless they are requested, while peek includes them. The main difference between the peek and get methods is that get accepts more arguments, whereas peek takes only limit, which is simply the number of results to return.

## Adding document-associated embeddings

In [15]:
morphheus_collection = client.create_collection(name="morpheus")

In [17]:
# Adding document-associated embeddings
morphheus_collection.add(
    documents = [
        "Welcome to the real world.",
        "What if I told you everything you knew was a lie."
    ],
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    ids = ["quote1", "quote2"],
)

In [19]:
morphheus_collection.count()

2

In [20]:
morphheus_collection.get()

{'ids': ['quote1', 'quote2'],
 'embeddings': None,
 'metadatas': [None, None],
 'documents': ['Welcome to the real world.',
  'What if I told you everything you knew was a lie.'],
 'uris': None,
 'data': None}

In [21]:
# Adding embeddings and metadata

In [22]:
# Create the collection
locations_collection = client.create_collection(name="locations")

In [23]:
# Adding embeddings and metadata
locations_collection.add(
    embeddings = [[0.1, 0.2, 0.3],[0.4, 0.5, 0.6]],
    metadatas=[
        {"location": "Machine City", "description": "City inhabitated by machines"},
        {"location": "Zion", "description": "Last human city"},
    ],
    ids=["location1","location2"],
)

In [24]:
locations_collection.count()

2

In [26]:
locations_collection.get()

{'ids': ['location1', 'location2'],
 'embeddings': None,
 'metadatas': [{'description': 'City inhabitated by machines',
   'location': 'Machine City'},
  {'description': 'Last human city', 'location': 'Zion'}],
 'documents': [None, None],
 'uris': None,
 'data': None}

## Query the collection

In [27]:
# Query texts

In [38]:
try:
  client.delete_collection(name="morpheus")
  print("Collection deleted.")
except ValueError as e:
  print(f"Error: {e}")

Collection deleted.


In [39]:
morpheus_collection = client.get_or_create_collection(
    name="morpheus", metadata={"hnsw:space": "cosine"}
)

In [40]:
morpheus_collection.add(
    documents = [
        "This is your last chance, After this, there is no turning back.",
        "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.",
        "You take the red pill, you stay in the Wonderland, and I show you how deep the rabbit hole goes.",
    ],
    ids = ["quote1", "quote2", "quote3"],
)

In [41]:
morpheus_collection.get()

{'ids': ['quote1', 'quote2', 'quote3'],
 'embeddings': None,
 'metadatas': [None, None, None],
 'documents': ['This is your last chance, After this, there is no turning back.',
  'You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.',
  'You take the red pill, you stay in the Wonderland, and I show you how deep the rabbit hole goes.'],
 'uris': None,
 'data': None}

In [42]:
# Querying by a set of query_texts
results = morpheus_collection.query(
    query_texts = ["Take the red pill"],
    n_results=2,
)

print(results)

{'ids': [['quote3', 'quote2']], 'distances': [[0.46854692697525024, 0.523399829864502]], 'metadatas': [[None, None]], 'embeddings': None, 'documents': [['You take the red pill, you stay in the Wonderland, and I show you how deep the rabbit hole goes.', 'You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.']], 'uris': None, 'data': None}
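
The result above is a dictionary of nested lists: the outer list has one entry per query text, and each inner list holds that query's matches in order of increasing distance. A small helper can make this easier to read. This is only a sketch; `flatten_query_result` is a hypothetical name, and `sample` is a hand-copied subset of the printed output rather than a live query.

```python
def flatten_query_result(result, query_index=0):
    """Return (id, distance, document) tuples for one query in a query() result dict."""
    ids = result["ids"][query_index]
    distances = result["distances"][query_index]
    documents = result["documents"][query_index]
    return list(zip(ids, distances, documents))

# Subset of the output printed above, copied by hand for illustration
sample = {
    "ids": [["quote3", "quote2"]],
    "distances": [[0.46854692697525024, 0.523399829864502]],
    "documents": [[
        "You take the red pill, you stay in the Wonderland, and I show you how deep the rabbit hole goes.",
        "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.",
    ]],
}

rows = flatten_query_result(sample)
for doc_id, distance, document in rows:
    print(f"{doc_id}  {distance:.3f}  {document}")
```

With a real collection, the same helper could be applied to the dictionary returned by `morpheus_collection.query(...)`, passing a different `query_index` when several query texts are supplied.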


In [43]:
# Query by ID

In [44]:
# Add the raw documents
trinity_collection.add(
    documents=[
        "Dodge this.",
        "I think they're trying to tell us something.",
        "Neo, no one has ever done this before."
    ],
    ids=["quote1", "quote2", "quote3"],
)

In [45]:
items = trinity_collection.get(ids=["quote2", "quote3"])
print(items)

{'ids': ['quote2', 'quote3'], 'embeddings': None, 'metadatas': [None, None], 'documents': ["I think they're trying to tell us something.", 'Neo, no one has ever done this before.'], 'uris': None, 'data': None}


In [46]:
# Choosing which data is returned from a collection

In [48]:
# Query the collection by text and choose which data is returned
results = morpheus_collection.query(
    query_texts=["take the red pill"],
    n_results=1,
    include=["embeddings", "distances"]
)

print(results)

{'ids': [['quote3']], 'distances': [[0.46854692697525024]], 'metadatas': None, 'embeddings': [[[0.06073591113090515, -0.029532479122281075, -0.01913934014737606, 0.05918867513537407, -0.007075101602822542, 0.0476175956428051, 0.1270502209663391, 0.0202015433460474, 0.06602806597948074, -0.05049310252070427, -0.09023090451955795, 0.031919196248054504, -0.03407925367355347, 0.026690136641263962, 0.012208563275635242, -0.01583772525191307, 0.04909858852624893, -0.013204006478190422, -0.06546106934547424, 0.011001543141901493, 0.014509349130094051, 0.03402267396450043, 0.049754977226257324, 0.04764392226934433, -0.1075998991727829, -0.02754928544163704, -0.0021915032994002104, -0.02136864699423313, -0.15663385391235352, -0.014851683750748634, 0.06334412842988968, 0.01717643067240715, 0.025652745738625526, -0.028411855921149254, 0.01741626113653183, 0.06714940816164017, 0.00031362680601887405, -0.020626556128263474, 0.0715298131108284, 0.04597096890211105, -0.020056113600730896, -0.03658187

In [49]:
# Using where filter

In [65]:
# Create the collection
matrix_collection = client.create_collection(name="matrix")

In [66]:
# Add the raw documents
matrix_collection.add(
    documents=[
        "The Matrix is everywhere, it is all around us.",
        "Unfortunately, no one can be told what the Matrix is",
        "You can see it when you look out your window or when you turn on your television.",
        "You are a plague, Mr. Anderson. You and your kind are a cancer of this planet.",
        "You hear that Mr. Anderson?... That is the sound of inevitability...",
    ],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Agent Smith"},
        {"category": "quote", "speaker": "Agent Smith"},
    ],
    ids=["quote1","quote2","quote3","quote4","quote5"],
)

In [67]:
# Querying with where filters
results = matrix_collection.query(
    query_texts=["What is the Matrix?"],
    where={"speaker": "Morpheus"},
    n_results=2,
)

print(results)

{'ids': [['quote2', 'quote1']], 'distances': [[0.4784316420555115, 0.8001493811607361]], 'metadatas': [[{'category': 'quote', 'speaker': 'Morpheus'}, {'category': 'quote', 'speaker': 'Morpheus'}]], 'embeddings': None, 'documents': [['Unfortunately, no one can be told what the Matrix is', 'The Matrix is everywhere, it is all around us.']], 'uris': None, 'data': None}
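
Conceptually, the where filter narrows the candidates by metadata, and only the remaining items are ranked by distance. The sketch below mimics just the equality case in plain Python; it is not how ChromaDB implements filtering, and `matches` and `records` are hypothetical names used for illustration.

```python
def matches(metadata, where):
    """True when every key in `where` has the same value in `metadata` (equality only)."""
    return all(metadata.get(key) == value for key, value in where.items())

# (id, metadata) pairs mirroring the items added to matrix_collection above
records = [
    ("quote1", {"category": "quote", "speaker": "Morpheus"}),
    ("quote2", {"category": "quote", "speaker": "Morpheus"}),
    ("quote3", {"category": "quote", "speaker": "Morpheus"}),
    ("quote4", {"category": "quote", "speaker": "Agent Smith"}),
    ("quote5", {"category": "quote", "speaker": "Agent Smith"}),
]

kept = [doc_id for doc_id, metadata in records if matches(metadata, {"speaker": "Morpheus"})]
print(kept)  # ['quote1', 'quote2', 'quote3']
```

Only these three items are eligible for the query, which is why both returned results come from Morpheus.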


## Updating Data

In [68]:
# Update items in the collection
matrix_collection.update(
    ids=["quote2"],
    metadatas=[{"category": "quote", "speaker": "Morpheus"}],
    documents=["The Matrix is a system, Neo. That system is our enemy."],
)

In [69]:
items = matrix_collection.get(ids=["quote2"])
print(items)

{'ids': ['quote2'], 'embeddings': None, 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'}], 'documents': ['The Matrix is a system, Neo. That system is our enemy.'], 'uris': None, 'data': None}


In [70]:
# Upsert Operations

In [71]:
matrix_collection.get()

{'ids': ['quote1', 'quote2', 'quote3', 'quote4', 'quote5'],
 'embeddings': None,
 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Agent Smith'},
  {'category': 'quote', 'speaker': 'Agent Smith'}],
 'documents': ['The Matrix is everywhere, it is all around us.',
  'The Matrix is a system, Neo. That system is our enemy.',
  'You can see it when you look out your window or when you turn on your television.',
  'You are a plague, Mr. Anderson. You and your kind are a cancer of this planet.',
  'You hear that Mr. Anderson?... That is the sound of inevitability...'],
 'uris': None,
 'data': None}

In [72]:
# Upsert operation
matrix_collection.upsert(
    ids=["quote2", "quote4"],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Agent Smith"},
    ],
    documents=[
        "You take the blue pill, the story ends, you wake yp in your bed and believe whatever you want to believe.",
        "I'm going to enjoy watching you die, Mr. Anderson."
    ],
)

In [73]:
matrix_collection.get()

{'ids': ['quote1', 'quote2', 'quote3', 'quote4', 'quote5'],
 'embeddings': None,
 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Agent Smith'},
  {'category': 'quote', 'speaker': 'Agent Smith'}],
 'documents': ['The Matrix is everywhere, it is all around us.',
  'You take the blue pill, the story ends, you wake yp in your bed and believe whatever you want to believe.',
  'You can see it when you look out your window or when you turn on your television.',
  "I'm going to enjoy watching you die, Mr. Anderson.",
  'You hear that Mr. Anderson?... That is the sound of inevitability...'],
 'uris': None,
 'data': None}

In [74]:
# Upsert operation
matrix_collection.upsert(
    ids=["quote10"],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
    ],
    documents=[
        "Everything is a matrix"
    ],
)

In [75]:
matrix_collection.get()

{'ids': ['quote1', 'quote10', 'quote2', 'quote3', 'quote4', 'quote5'],
 'embeddings': None,
 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Agent Smith'},
  {'category': 'quote', 'speaker': 'Agent Smith'}],
 'documents': ['The Matrix is everywhere, it is all around us.',
  'Everything is a matrix',
  'You take the blue pill, the story ends, you wake yp in your bed and believe whatever you want to believe.',
  'You can see it when you look out your window or when you turn on your television.',
  "I'm going to enjoy watching you die, Mr. Anderson.",
  'You hear that Mr. Anderson?... That is the sound of inevitability...'],
 'uris': None,
 'data': None}
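
One way to keep add, update, and upsert apart is to picture the collection as a dictionary keyed by id. The sketch below is a mental model only, not ChromaDB code: the three functions are hypothetical stand-ins, and the assumed behaviour (add skips ids that already exist, update skips ids that are missing) is based on what this notebook's outputs suggest and on my reading of the Chroma docs.

```python
def add(store, ids, documents):
    """Insert new ids only; ids that already exist are left untouched (assumed behaviour)."""
    for doc_id, document in zip(ids, documents):
        store.setdefault(doc_id, document)

def update(store, ids, documents):
    """Overwrite existing ids only; unknown ids are skipped (assumed behaviour)."""
    for doc_id, document in zip(ids, documents):
        if doc_id in store:
            store[doc_id] = document

def upsert(store, ids, documents):
    """Overwrite existing ids and insert missing ones."""
    for doc_id, document in zip(ids, documents):
        store[doc_id] = document

store = {"quote1": "A", "quote2": "B"}
add(store, ["quote2", "quote3"], ["B2", "C"])      # quote2 kept as "B", quote3 inserted
update(store, ["quote1", "quote9"], ["A2", "Z"])   # quote1 changed, quote9 skipped
upsert(store, ["quote3", "quote10"], ["C2", "J"])  # quote3 changed, quote10 inserted

print(store)
# {'quote1': 'A2', 'quote2': 'B', 'quote3': 'C2', 'quote10': 'J'}
```

This matches the two upsert calls above: the first replaced the existing quote2 and quote4, and the second created quote10 because that id did not exist yet.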

In [61]:
# Delete by ID

In [76]:
trinity_collection.get()

{'ids': ['quote1', 'quote2', 'quote3'],
 'embeddings': None,
 'metadatas': [None, None, None],
 'documents': ['Dodge this.',
  "I think they're trying to tell us something.",
  'Neo, no one has ever done this before.'],
 'uris': None,
 'data': None}

In [77]:
trinity_collection.delete(ids=["quote3"])

In [78]:
trinity_collection.get()

{'ids': ['quote1', 'quote2'],
 'embeddings': None,
 'metadatas': [None, None],
 'documents': ['Dodge this.', "I think they're trying to tell us something."],
 'uris': None,
 'data': None}

In [79]:
# Delete with 'where' filter

In [82]:
# Add the raw documents
matrix_collection.add(
    documents=[
        "The Matrix is everywhere, it is all around us.",
        "You can see it when you look out your wiindow or when you turn on the television.",
        "You can feel it when you go to work, when you go to church, when you pay your taxes.",
        "It seems that you've been living two lives.",
        "I believe that, as a species, human beings define their reality through misery and suffering",
        "Homan beings are a disease, a cancer of this planet.",
    ],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Agent Smith"},
        {"category": "quote", "speaker": "Agent Smith"},
        {"category": "quote", "speaker": "Agent Smith"},
    ],
    ids=["quote1","quote2","quote3","quote4","quote5","quote6"],
)



In [83]:
matrix_collection.get()

{'ids': ['quote1',
  'quote10',
  'quote2',
  'quote3',
  'quote4',
  'quote5',
  'quote6'],
 'embeddings': None,
 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Agent Smith'},
  {'category': 'quote', 'speaker': 'Agent Smith'},
  {'category': 'quote', 'speaker': 'Agent Smith'}],
 'documents': ['The Matrix is everywhere, it is all around us.',
  'Everything is a matrix',
  'You take the blue pill, the story ends, you wake yp in your bed and believe whatever you want to believe.',
  'You can see it when you look out your window or when you turn on your television.',
  "I'm going to enjoy watching you die, Mr. Anderson.",
  'You hear that Mr. Anderson?... That is the sound of inevitability...',
  'Homan beings are a disease, a cancer of this planet.'],
 'uris': None,
 'data': None}

In [84]:
# Deleting items that match the where filter
matrix_collection.delete(where={"speaker":"Agent Smith"})

In [86]:
item_count = matrix_collection.count()
print(f"Count of items in collection: {item_count}")

Count of items in collection: 4


In [87]:
matrix_collection.get()

{'ids': ['quote1', 'quote10', 'quote2', 'quote3'],
 'embeddings': None,
 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'}],
 'documents': ['The Matrix is everywhere, it is all around us.',
  'Everything is a matrix',
  'You take the blue pill, the story ends, you wake yp in your bed and believe whatever you want to believe.',
  'You can see it when you look out your window or when you turn on your television.'],
 'uris': None,
 'data': None}