### ChromaDB
- https://www.trychroma.com/

*   We will build a small data store for the movie "The Matrix", using collections named after its characters (neo, mr_anderson, trinity) to hold the information relevant to each.

*   This notebook guides you through creating, inspecting, and deleting collections, as well as changing the distance function in ChromaDB.



In [None]:
!pip install chromadb openai -q

In [None]:
# need this to work with embedding
!pip install sentence-transformers -q

In [None]:
# setup a client

import chromadb
client = chromadb.Client()

In [None]:
neo_collection = client.create_collection(name="neo")

In [None]:
# inspecting a collection
print(neo_collection)

In [None]:
# Rename the collection and inspect it again
neo_collection.modify(name="mr_anderson")
print(neo_collection)

In [None]:
# Counting items
item_count = neo_collection.count()
print(f"# of items in collection: {item_count}")

In [None]:
# Distance

In ChromaDB, the distance function determines how the difference between two embeddings in a collection is calculated. This decides which items a query treats as most similar.
The default distance function in ChromaDB is "l2", the squared Euclidean (L2) distance: the sum of the squared differences between corresponding vector components.
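
To make the "l2" setting concrete, here is a minimal sketch in plain Python. It assumes the squared form of the L2 distance, which is how Chroma's documentation describes "l2". The helper name squared_l2 and the example vectors are illustrative and not part of ChromaDB.

```python
# Squared Euclidean (L2) distance between two small example vectors.
# The vectors are illustrative values, not real embeddings.

def squared_l2(a, b):
    """Return the sum of squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))

u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 3.0]

print(squared_l2(u, v))         # 25.0 (3**2 + 4**2 + 0**2)
print(squared_l2(u, v) ** 0.5)  # 5.0  (the plain Euclidean distance)
```

Skipping the square root does not change which items rank as nearest, because the square root preserves order.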

In [None]:
# Get or Create a new collection, and change the distance function
trinity_collection = client.get_or_create_collection(
    name="trinity",
    metadata={"hnsw:space": "cosine"}
)
print(trinity_collection)

We set the distance function to "cosine". Cosine distance is 1 minus the cosine of the angle between two vectors, so it compares their direction and ignores their magnitude. This is useful in many domains, including text analysis, where vectors are high-dimensional and often sparse.
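
As a rough illustration of what the "cosine" setting measures, the sketch below computes 1 minus the cosine similarity in plain Python. The function cosine_distance and the vectors are hypothetical helpers for this example only; ChromaDB performs the equivalent calculation internally.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

short = [3.0, 4.0]
longer = [6.0, 8.0]        # same direction as `short`, twice the length
orthogonal = [-4.0, 3.0]   # perpendicular to `short`

print(cosine_distance(short, longer))      # 0.0
print(cosine_distance(short, orthogonal))  # 1.0
```

The first pair differs only in length, so its cosine distance is zero even though the two points are far apart in Euclidean terms.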

In [None]:
# Deleting a collection
try:
    client.delete_collection(name="mr_anderson")
    print("Mr. Anderson collection deleted.")
except ValueError as e:
    print(f"Error: {e}")

In [None]:
neo_collection = client.create_collection(name="neo")

In [None]:
# Adding data
# Adding raw documents
neo_collection.add(
    documents=[
        "There is no spoon.",
        "I know kung fu."
    ],
    ids=["quote1", "quote2"]
)

In [None]:
item_count = neo_collection.count()
print(f"Count of items in collection: {item_count}")

In [None]:
neo_collection.get()

In [None]:
# Take a peek
neo_collection.peek(limit=5)

By default, the result is a dictionary containing the ids, metadatas (if provided), and documents of the items in the collection. The main difference between the peek and get methods is that get accepts more arguments, whereas peek only takes limit, which is simply the number of results to return.

### Adding document-associated embeddings

In [None]:
morpheus_collection = client.create_collection(name="morpheus")

In [None]:
# Adding document-associated embeddings
morpheus_collection.add(
    documents=[
        "Welcome to the real world.",
        "What if I told you everything you knew was a lie."
    ],
    embeddings=[
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]
    ],
    ids=["quote1", "quote2"],
)

In [None]:
morpheus_collection.count()

In [None]:
morpheus_collection.get()

In [None]:
# adding embeddings and metadata

In [None]:
# Create the collection
locations_collection = client.create_collection(name="locations")

In [None]:
# Adding embeddings and metadata
locations_collection.add(
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    metadatas=[
        {"location": "Machine City", "description": "City inhabited by machines"},
        {"location": "Zion", "description": "Last human city"},
    ],
    ids=["location1", "location2"],
)

In [None]:
locations_collection.count()

In [None]:
locations_collection.get()

### Query the collection

In [None]:
# Query texts

In [None]:
try:
    client.delete_collection(name="morpheus")
    print("Collection deleted.")
except ValueError as e:
    print(f"Error: {e}")

In [None]:
morpheus_collection = client.create_collection(
    name="morpheus",
    metadata={"hnsw:space": "cosine"},
)

In [None]:
morpheus_collection.add(
    documents=[
        "This is your last chance. After this, there is no turning back.",
        "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.",
        "You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes.",
    ],
    ids=["quote1", "quote2", "quote3"],
)

In [None]:
morpheus_collection.get()

In [None]:
# Querying by a set of query_texts
results = morpheus_collection.query(
    query_texts=["Take the red pill"],
    n_results=2,
)

print(results)
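
The printed result is a dictionary whose values are lists of lists, with one inner list per entry in query_texts. The sketch below shows one way to unpack it. The dictionary fake_results is hand-written to mimic that shape, its distance values are invented placeholders rather than real output, and flatten_hits is a hypothetical helper.

```python
# Hand-written stand-in for a query result with one query text.
# The distances are invented placeholders, not real ChromaDB output.
fake_results = {
    "ids": [["quote3", "quote2"]],
    "documents": [[
        "You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes.",
        "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.",
    ]],
    "distances": [[0.2, 0.5]],
}

def flatten_hits(result, query_index=0):
    """Return (id, document, distance) tuples for one query text."""
    return list(zip(
        result["ids"][query_index],
        result["documents"][query_index],
        result["distances"][query_index],
    ))

for doc_id, doc, dist in flatten_hits(fake_results):
    print(f"{doc_id}  {dist:.2f}  {doc[:40]}")
```

With several query texts, pass query_index=1, 2, and so on to read the hits for each one.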

In [None]:
# Query by ID

In [None]:
# Add the raw documents
trinity_collection.add(
    documents=[
        "Dodge this.",
        "I think they're trying to tell us something.",
        "Neo, no one has ever done this before.",
    ],
    ids=["quote1", "quote2", "quote3"],
)

In [None]:
items = trinity_collection.get(ids=["quote2", "quote3"])

print(items)

In [None]:
# Choosing which data is returned from a collection

In [None]:
# Query the collection by text and choose which data is returned
results = morpheus_collection.query(
    query_texts=["take the red pill"],
    n_results=1,
    include=["embeddings", "distances"]
)

print(results)

In [None]:
# Using where filter

In [None]:
# Create the collection
matrix_collection = client.create_collection(name="matrix")

In [None]:
# Add the raw documents
matrix_collection.add(
    documents=[
        "The Matrix is everywhere, it is all around us.",
        "Unfortunately, no one can be told what the Matrix is",
        "You can see it when you look out your window or when you turn on your television.",
        "You are a plague, Mr. Anderson. You and your kind are a cancer of this planet.",
        "You hear that Mr. Anderson?... That is the sound of inevitability...",
    ],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Agent Smith"},
        {"category": "quote", "speaker": "Agent Smith"},
    ],
    ids=["quote1", "quote2", "quote3", "quote4", "quote5"],
)

In [None]:
# Querying with where filters
results = matrix_collection.query(
    query_texts=["What is the Matrix?"],
    where={"speaker": "Morpheus"},
    n_results=2,
)

print(results)

### Updating Data

In [None]:
# Update items in the collection
matrix_collection.update(
    ids=["quote2"],
    metadatas=[{"category": "quote", "speaker": "Morpheus"}],
    documents=["The Matrix is a system, Neo. That system is our enemy."],
)

In [None]:
items = matrix_collection.get(ids=["quote2"])

print(items)

In [None]:
# Upsert Operations

In [None]:
matrix_collection.get()

In [None]:
# Upsert operation: both ids already exist, so their documents and metadata are updated
matrix_collection.upsert(
    ids=["quote2", "quote4"],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Agent Smith"},
    ],
    documents=[
        "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.",
        "I'm going to enjoy watching you die, Mr. Anderson.",
    ],
)

In [None]:
matrix_collection.get()

In [None]:
# Upsert operation: quote10 does not exist yet, so it is inserted as a new item
matrix_collection.upsert(
    ids=["quote10"],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
    ],
    documents=[
        "Everything is a matrix",
    ],
)

In [None]:
# Delete by ID

In [None]:
trinity_collection.get()

In [None]:
trinity_collection.delete(ids=["quote3"])


In [None]:
trinity_collection.get()

In [None]:
# Delete with 'where' filter

In [None]:
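# Upsert the raw documents: several of these ids (quote1 to quote5) already exist,
# so add() would reject or skip them instead of storing the new text
matrix_collection.upsert(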
    documents=[
        "The Matrix is everywhere, it is all around us.",
        "You can see it when you look out your window or when you turn on your television.",
        "You can feel it when you go to work, when you go to church, when you pay your taxes.",
        "It seems that you've been living two lives.",
        "I believe that, as a species, human beings define their reality through misery and suffering",
        "Human beings are a disease, a cancer of this planet.",
    ],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Agent Smith"},
        {"category": "quote", "speaker": "Agent Smith"},
        {"category": "quote", "speaker": "Agent Smith"},
    ],
    ids=["quote1", "quote2", "quote3", "quote4", "quote5", "quote6"],
)

In [None]:
matrix_collection.get()

In [None]:
# Deleting items that match the where filter
matrix_collection.delete(where={"speaker": "Agent Smith"})

In [None]:
item_count = matrix_collection.count()
print(f"Count of items in collection: {item_count}")

In [None]:
matrix_collection.get()

### Using Embedding Functions

In [None]:
from chromadb.utils import embedding_functions

In [None]:
# Initialize OpenAI embedding function
# Read the API key from the environment instead of hardcoding it in the notebook
import os

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002",
)

In [None]:
# Create the collection with the OpenAI embedding function
matrix_collection1 = client.create_collection(
    name="matrix1",
    embedding_function=openai_ef,
)

In [None]:
# Add the raw documents
matrix_collection1.add(
    documents=[
        "The Matrix is all around us.",
        "What you know you can't explain, but you feel it",
        "There is a difference between knowing the path and walking the path",
    ],
    ids=["quote1", "quote2", "quote3"],
)

In [None]:
print(matrix_collection1)

In [None]:
matrix_collection1.get()

In [None]:
# Querying by a set of query_texts
results = matrix_collection1.query(query_texts=["What is the Matrix?"], n_results=2)

print(results)