### ChromaDB
- https://www.trychroma.com/

*   We build a data store for the film "The Matrix". We use characters from the film, such as Neo, Mr. Anderson, and Trinity, to store information relevant to them.

*   We walk through creating, inspecting, and deleting collections, as well as changing the distance function in ChromaDB.



In [1]:
!pip install chromadb openai -q

In [2]:
# needed to work with embeddings
!pip install sentence-transformers -q

In [3]:
# setup a client

import chromadb
client = chromadb.Client()

In [4]:
neo_collection = client.create_collection(name="neo")

In [5]:
# inspecting a collection
print(neo_collection)

Collection(id=178da1ac-94df-4ba0-b9cc-3c6a39476de7, name=neo)


In [6]:
# Rename the collection and inspect it again
neo_collection.modify(name="mr_anderson")
print(neo_collection)

Collection(id=178da1ac-94df-4ba0-b9cc-3c6a39476de7, name=mr_anderson)


In [7]:
# Counting items
item_count = neo_collection.count()
print(f"# of items in collection: {item_count}")

# of items in collection: 0


# Distance

In ChromaDB, the distance function determines how the "distance" (dissimilarity) between two items in a collection is computed. This matters for operations such as searching for similar items. ChromaDB's default distance function is "l2", which is based on Euclidean distance, the ordinary straight-line distance between two points. Chroma's documentation describes "l2" as the squared L2 norm, so the values it reports are squared distances.
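To make the distance idea concrete, here is a minimal sketch in plain Python. It is not Chroma's implementation; the helper names `squared_l2` and `euclidean` are made up for illustration, and the example assumes that "l2" reports the squared form, as Chroma's documentation describes.

```python
import math

# Hypothetical helpers for illustration only; not part of the chromadb API.
def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root taken)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Ordinary straight-line distance: the square root of squared_l2."""
    return math.sqrt(squared_l2(a, b))

u = [0.0, 0.0, 0.0]
v = [1.0, 2.0, 2.0]

print(squared_l2(u, v))  # 9.0
print(euclidean(u, v))   # 3.0
```

Both forms rank neighbours in the same order, because the square root is monotonic; only the reported numbers differ.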

In [8]:
# Get or Create a new collection, and change the distance function
trinity_collection = client.get_or_create_collection(
    name="trinity",
    metadata={"hnsw:space": "cosine"}
)
print(trinity_collection)

Collection(id=17fea18b-ef5e-4e84-b1fe-a81966dda705, name=trinity)


We set the distance function to "cosine". Cosine distance measures how dissimilar two vectors are using the cosine of the angle between them (1 minus the cosine similarity), so it depends on direction rather than magnitude. This is useful in many areas, including text analysis, where high-dimensional and sparse data are common.
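A minimal plain-Python sketch of cosine distance follows. The function name `cosine_distance` is hypothetical and the code is not taken from Chroma; it only illustrates the formula 1 minus cosine similarity.

```python
import math

# Hypothetical helper for illustration only; not part of the chromadb API.
def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 0.0]
print(cosine_distance(a, [3.0, 0.0]))   # same direction, longer vector -> 0.0
print(cosine_distance(a, [0.0, 2.0]))   # orthogonal -> 1.0
print(cosine_distance(a, [-1.0, 0.0]))  # opposite direction -> 2.0
```

Scaling a vector does not change the result, which is why this metric suits text embeddings where direction carries more meaning than length.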

In [9]:
# Deleting a collection
try:
    client.delete_collection(name="mr_anderson")
    print("Mr. Anderson collection deleted.")
except ValueError as e:
    print(f"Error: {e}")

Mr. Anderson collection deleted.


In [10]:
neo_collection = client.create_collection(name="neo")

In [11]:
# Adding data
# Adding raw documents
neo_collection.add(
    documents=[
        "There is no spoon.",
        "I know kung fu."
    ],
    ids=["quote1", "quote2"]
)

In [12]:
item_count = neo_collection.count()
print(f"Count of items in collection: {item_count}")

Count of items in collection: 2


In [13]:
neo_collection.get()

{'ids': ['quote1', 'quote2'],
 'embeddings': None,
 'metadatas': [None, None],
 'documents': ['There is no spoon.', 'I know kung fu.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [14]:
# Take a peek
neo_collection.peek(limit=5)

{'ids': ['quote1', 'quote2'],
 'embeddings': [[0.004506412893533707,
   -0.07763516902923584,
   -0.038877587765455246,
   -0.01235272828489542,
   -0.08395075052976608,
   0.04847825691103935,
   0.027022896334528923,
   -0.08260542154312134,
   0.07585711777210236,
   0.016495604068040848,
   0.034576863050460815,
   -0.0697631761431694,
   -0.012515238486230373,
   -0.05832795426249504,
   -0.0736757442355156,
   -0.12272055447101593,
   -0.0331801138818264,
   -0.10826481133699417,
   -0.010775878094136715,
   0.0024138721637427807,
   0.03132103383541107,
   0.000363168801413849,
   0.057470910251140594,
   -0.01934056729078293,
   0.06213092431426048,
   0.05513307452201843,
   0.019474970176815987,
   -0.06181753799319267,
   -0.025465352460741997,
   0.06344398111104965,
   -0.019318535923957825,
   -0.005409401375800371,
   -0.08224662393331528,
   -0.04527893662452698,
   0.037652596831321716,
   0.007059104740619659,
   0.050451889634132385,
   0.040108174085617065,
   0.050

By default, get returns a dictionary containing the IDs, metadata (if provided), and documents of the items in the collection; as the output above shows, peek also includes the embeddings. The main difference between the peek and get methods is that get accepts more arguments, whereas peek takes only the limit argument, which simply sets the number of items returned.
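The returned dictionary is column-oriented: the ids, metadatas, and documents lists are parallel and share the same positions. As a small sketch, the literal below copies the get output shown earlier for the "neo" collection and turns it into one record per item. The helper `to_rows` is a hypothetical name, not a chromadb function.

```python
# Literal copy of the relevant keys from the get() output shown earlier.
result = {
    "ids": ["quote1", "quote2"],
    "embeddings": None,
    "metadatas": [None, None],
    "documents": ["There is no spoon.", "I know kung fu."],
}

# Hypothetical helper for illustration only; not part of the chromadb API.
def to_rows(result):
    """Turn the column-oriented dictionary into one dict per item."""
    ids = result["ids"]
    # Fields that were not returned are None, so substitute placeholders.
    metadatas = result.get("metadatas") or [None] * len(ids)
    documents = result.get("documents") or [None] * len(ids)
    return [
        {"id": item_id, "metadata": metadata, "document": document}
        for item_id, metadata, document in zip(ids, metadatas, documents)
    ]

rows = to_rows(result)
for row in rows:
    print(row["id"], "->", row["document"])
```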

### Adding document-associated embeddings

In [15]:
morpheus_collection = client.create_collection(name="morpheus")

In [16]:
# Adding document-associated embeddings
morpheus_collection.add(
    documents=[
        "Welcome to the real world.",
        "What if I told you everything you knew was a lie."
    ],
    embeddings=[
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6]
    ],
    ids=["quote1", "quote2"],
)

In [17]:
morpheus_collection.count()

2

In [18]:
morpheus_collection.get()

{'ids': ['quote1', 'quote2'],
 'embeddings': None,
 'metadatas': [None, None],
 'documents': ['Welcome to the real world.',
  'What if I told you everything you knew was a lie.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [19]:
# adding embeddings and metadata

In [20]:
# Create the collection
locations_collection = client.create_collection(name="locations")

In [21]:
# Adding embeddings and metadata
locations_collection.add(
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    metadatas=[
        {"location": "Machine City", "description": "City inhabited by machines"},
        {"location": "Zion", "description": "Last human city"},
    ],
    ids=["location1", "location2"],
)

In [22]:
locations_collection.count()

2

In [23]:
locations_collection.get()

{'ids': ['location1', 'location2'],
 'embeddings': None,
 'metadatas': [{'description': 'City inhabited by machines',
   'location': 'Machine City'},
  {'description': 'Last human city', 'location': 'Zion'}],
 'documents': [None, None],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

### Query the collection

# Query texts

In [24]:
try:
    client.delete_collection(name="morpheus")
    print("Collection deleted.")
except ValueError as e:
    print(f"Error: {e}")

Collection deleted.


In [25]:
morpheus_collection = client.create_collection(
     name="morpheus", metadata={"hnsw:space": "cosine"}
)

In [26]:
morpheus_collection.add(
    documents=[
        "This is your last chance. After this, there is no turning back.",
        "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.",
        "You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes.",
    ],
    ids=["quote1", "quote2", "quote3"],
)

In [27]:
morpheus_collection.get()

{'ids': ['quote1', 'quote2', 'quote3'],
 'embeddings': None,
 'metadatas': [None, None, None],
 'documents': ['This is your last chance. After this, there is no turning back.',
  'You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.',
  'You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [28]:
# Querying by a set of query_texts
results = morpheus_collection.query(
    query_texts=["Take the red pill"],
    n_results=2,
)

print(results)

{'ids': [['quote3', 'quote2']], 'distances': [[0.4833802580833435, 0.523399829864502]], 'metadatas': [[None, None]], 'embeddings': None, 'documents': [['You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes.', 'You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.']], 'uris': None, 'data': None, 'included': ['metadatas', 'documents', 'distances']}
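Note that query results are nested one level deeper than get results: each key holds one inner list per query text. The sketch below copies the relevant keys of the output above into a literal and pairs each hit with its distance. The helper `hits_for_query` is a hypothetical name used only for this illustration.

```python
# Literal copy of the relevant keys from the query output shown above.
results = {
    "ids": [["quote3", "quote2"]],
    "distances": [[0.4833802580833435, 0.523399829864502]],
    "documents": [[
        "You take the red pill, you stay in Wonderland, and I show you how deep the rabbit hole goes.",
        "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.",
    ]],
}

# Hypothetical helper for illustration only; not part of the chromadb API.
def hits_for_query(results, query_index=0):
    """Return (id, distance, document) tuples for one query text."""
    return list(zip(
        results["ids"][query_index],
        results["distances"][query_index],
        results["documents"][query_index],
    ))

hits = hits_for_query(results)
for item_id, distance, document in hits:
    print(f"{item_id}  distance={distance:.4f}  {document[:40]}")
```

The hits are ordered by increasing distance, so the first tuple is the closest match.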


# Query by ID

In [29]:
# Add the raw documents
trinity_collection.add(
    documents=[
        "Dodge this.",
        "I think they're trying to tell us something.",
        "Neo, no one has ever done this before.",
    ],
    ids=["quote1", "quote2", "quote3"],
)

In [30]:
items = trinity_collection.get(ids=["quote2", "quote3"])

print(items)

{'ids': ['quote2', 'quote3'], 'embeddings': None, 'metadatas': [None, None], 'documents': ["I think they're trying to tell us something.", 'Neo, no one has ever done this before.'], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}


# Choosing which data is returned from a collection

In [31]:
# Query the collection by text and choose which data is returned
results = morpheus_collection.query(
    query_texts=["take the red pill"],
    n_results=1,
    include=["embeddings", "distances"]
)

print(results)

{'ids': [['quote3']], 'distances': [[0.4833802580833435]], 'metadatas': None, 'embeddings': [[[0.0612788051366806, -0.02984556369483471, -0.019049806520342827, 0.055405668914318085, -0.007395263761281967, 0.043489884585142136, 0.12292641401290894, 0.019732583314180374, 0.05923592671751976, -0.05111175775527954, -0.09325962513685226, 0.037388864904642105, -0.03443676233291626, 0.02969019114971161, 0.008961701765656471, -0.008797801099717617, 0.04690297320485115, -0.012945846654474735, -0.06328832358121872, 0.007151228375732899, 0.004883265122771263, 0.038365889340639114, 0.055117808282375336, 0.05231316387653351, -0.09759976714849472, -0.02976042963564396, -0.0029865610413253307, -0.017095468938350677, -0.1617484837770462, -0.01317501813173294, 0.0669235959649086, 0.022074854001402855, 0.02148345671594143, -0.02672904171049595, 0.02127642184495926, 0.06961461901664734, 0.01167245302349329, -0.01919923909008503, 0.06758654862642288, 0.04263416305184364, -0.025372076779603958, -0.04455849

# Using where filter

In [32]:
# Create the collection
matrix_collection = client.create_collection(name="matrix")

In [33]:
# Add the raw documents
matrix_collection.add(
    documents=[
        "The Matrix is everywhere, it is all around us.",
        "Unfortunately, no one can be told what the Matrix is",
        "You can see it when you look out your window or when you turn on your television.",
        "You are a plague, Mr. Anderson. You and your kind are a cancer of this planet.",
        "You hear that Mr. Anderson?... That is the sound of inevitability...",
    ],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Agent Smith"},
        {"category": "quote", "speaker": "Agent Smith"},
    ],
    ids=["quote1", "quote2", "quote3", "quote4", "quote5"],
)

In [34]:
# Querying with where filters
results = matrix_collection.query(
    query_texts=["What is the Matrix?"],
    where={"speaker": "Morpheus"},
    n_results=2,
)

print(results)

{'ids': [['quote2', 'quote1']], 'distances': [[0.4784316420555115, 0.8001493811607361]], 'metadatas': [[{'category': 'quote', 'speaker': 'Morpheus'}, {'category': 'quote', 'speaker': 'Morpheus'}]], 'embeddings': None, 'documents': [['Unfortunately, no one can be told what the Matrix is', 'The Matrix is everywhere, it is all around us.']], 'uris': None, 'data': None, 'included': ['metadatas', 'documents', 'distances']}
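Conceptually, a where filter of this form restricts the candidates to items whose metadata has the given key set to the given value. The plain-Python sketch below models only that simple equality case, using the metadata added above. The `matches` helper is hypothetical and says nothing about how Chroma evaluates filters internally.

```python
# IDs and metadata copied from the add() call above.
items = [
    {"id": "quote1", "metadata": {"category": "quote", "speaker": "Morpheus"}},
    {"id": "quote2", "metadata": {"category": "quote", "speaker": "Morpheus"}},
    {"id": "quote3", "metadata": {"category": "quote", "speaker": "Morpheus"}},
    {"id": "quote4", "metadata": {"category": "quote", "speaker": "Agent Smith"}},
    {"id": "quote5", "metadata": {"category": "quote", "speaker": "Agent Smith"}},
]

# Hypothetical helper: models only the {"key": "value"} equality form.
def matches(metadata, where):
    """True when every key in `where` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in where.items())

where = {"speaker": "Morpheus"}
eligible_ids = [item["id"] for item in items if matches(item["metadata"], where)]
print(eligible_ids)  # ['quote1', 'quote2', 'quote3']
```

Only these eligible items can appear in the query results, which is why both hits above are Morpheus quotes.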


### Updating Data

In [35]:
# Update items in the collection
matrix_collection.update(
    ids=["quote2"],
    metadatas=[{"category": "quote", "speaker": "Morpheus"}],
    documents=["The Matrix is a system, Neo. That system is our enemy."],
)

In [36]:
items = matrix_collection.get(ids=["quote2"])

print(items)

{'ids': ['quote2'], 'embeddings': None, 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'}], 'documents': ['The Matrix is a system, Neo. That system is our enemy.'], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}


# Upsert Operations

In [37]:
matrix_collection.get()

{'ids': ['quote1', 'quote2', 'quote3', 'quote4', 'quote5'],
 'embeddings': None,
 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Agent Smith'},
  {'category': 'quote', 'speaker': 'Agent Smith'}],
 'documents': ['The Matrix is everywhere, it is all around us.',
  'The Matrix is a system, Neo. That system is our enemy.',
  'You can see it when you look out your window or when you turn on your television.',
  'You are a plague, Mr. Anderson. You and your kind are a cancer of this planet.',
  'You hear that Mr. Anderson?... That is the sound of inevitability...'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [38]:
# Upsert operation
matrix_collection.upsert(
    ids=["quote2", "quote4"],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Agent Smith"},
    ],
    documents=[
        "You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.",
        "I'm going to enjoy watching you die, Mr. Anderson.",
    ],
)

In [39]:
matrix_collection.get()

{'ids': ['quote1', 'quote2', 'quote3', 'quote4', 'quote5'],
 'embeddings': None,
 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Agent Smith'},
  {'category': 'quote', 'speaker': 'Agent Smith'}],
 'documents': ['The Matrix is everywhere, it is all around us.',
  'You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.',
  'You can see it when you look out your window or when you turn on your television.',
  "I'm going to enjoy watching you die, Mr. Anderson.",
  'You hear that Mr. Anderson?... That is the sound of inevitability...'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [40]:
# Upsert operation
matrix_collection.upsert(
    ids=["quote10"],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
    ],
    documents=[
        "Everything is a matrix",
    ],
)
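The two upsert calls above show both halves of the operation: existing IDs (quote2, quote4) are overwritten, while an unknown ID (quote10) is inserted. A minimal dictionary-based model of that behaviour is sketched below. The `upsert` function here is a hypothetical stand-in, not Chroma's code, and it ignores metadata and embeddings.

```python
# Hypothetical in-memory model for illustration only; not the chromadb API.
def upsert(store, ids, documents):
    """Overwrite entries whose ID exists, insert the rest.

    Returns two lists: the IDs that were updated and the IDs that were inserted.
    """
    updated, inserted = [], []
    for item_id, document in zip(ids, documents):
        (updated if item_id in store else inserted).append(item_id)
        store[item_id] = document
    return updated, inserted

store = {
    "quote2": "The Matrix is a system, Neo. That system is our enemy.",
    "quote4": "You are a plague, Mr. Anderson. You and your kind are a cancer of this planet.",
}

updated, inserted = upsert(
    store,
    ids=["quote4", "quote10"],
    documents=[
        "I'm going to enjoy watching you die, Mr. Anderson.",
        "Everything is a matrix",
    ],
)
print(updated)   # ['quote4']
print(inserted)  # ['quote10']
```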

# Delete by ID

In [41]:
trinity_collection.get()

{'ids': ['quote1', 'quote2', 'quote3'],
 'embeddings': None,
 'metadatas': [None, None, None],
 'documents': ['Dodge this.',
  "I think they're trying to tell us something.",
  'Neo, no one has ever done this before.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [42]:
trinity_collection.delete(ids=["quote3"])


In [43]:
trinity_collection.get()

{'ids': ['quote1', 'quote2'],
 'embeddings': None,
 'metadatas': [None, None],
 'documents': ['Dodge this.', "I think they're trying to tell us something."],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

# Delete with 'where' filter

In [44]:
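# Add the raw documents
# Note: IDs quote1 to quote5 already exist; in the output below their
# documents are unchanged and only quote6 is actually added.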
matrix_collection.add(
    documents=[
        "The Matrix is everywhere, it is all around us.",
        "You can see it when you look out your window or when you turn on your television.",
        "You can feel it when you go to work, when you go to church, when you pay your taxes.",
        "It seems that you've been living two lives.",
        "I believe that, as a species, human beings define their reality through misery and suffering",
        "Human beings are a disease, a cancer of this planet.",
    ],
    metadatas=[
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Morpheus"},
        {"category": "quote", "speaker": "Agent Smith"},
        {"category": "quote", "speaker": "Agent Smith"},
        {"category": "quote", "speaker": "Agent Smith"},
    ],
    ids=["quote1", "quote2", "quote3", "quote4", "quote5", "quote6"],
)



In [45]:
matrix_collection.get()

{'ids': ['quote1',
  'quote10',
  'quote2',
  'quote3',
  'quote4',
  'quote5',
  'quote6'],
 'embeddings': None,
 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Agent Smith'},
  {'category': 'quote', 'speaker': 'Agent Smith'},
  {'category': 'quote', 'speaker': 'Agent Smith'}],
 'documents': ['The Matrix is everywhere, it is all around us.',
  'Everything is a matrix',
  'You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.',
  'You can see it when you look out your window or when you turn on your television.',
  "I'm going to enjoy watching you die, Mr. Anderson.",
  'You hear that Mr. Anderson?... That is the sound of inevitability...',
  'Human beings are a disease, a cancer of this planet.'],
 'uris': None,
 'data': None,
 'included': ['me

In [46]:
# Deleting items that match the where filter
matrix_collection.delete(where={"speaker": "Agent Smith"})

In [47]:
item_count = matrix_collection.count()
print(f"Count of items in collection: {item_count}")

Count of items in collection: 4


In [48]:
matrix_collection.get()

{'ids': ['quote1', 'quote10', 'quote2', 'quote3'],
 'embeddings': None,
 'metadatas': [{'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'},
  {'category': 'quote', 'speaker': 'Morpheus'}],
 'documents': ['The Matrix is everywhere, it is all around us.',
  'Everything is a matrix',
  'You take the blue pill, the story ends, you wake up in your bed and believe whatever you want to believe.',
  'You can see it when you look out your window or when you turn on your television.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

### Using Embedding Functions

In [49]:
from chromadb.utils import embedding_functions

In [59]:
'''# Initialize OpenAI embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-mF14---------------------hOus",
    model_name="text-embedding-ada-002",
)'''

'# Initialize OpenAI embedding function\nopenai_ef = embedding_functions.OpenAIEmbeddingFunction(\n    api_key="sk-mF14---------------------hOus",\n    model_name="text-embedding-ada-002",\n)'

In [51]:
'''# Create the collection with the OpenAI embedding function
matrix_collection1 = client.create_collection(
    name="matrix1",
    embedding_function=openai_ef,
)'''

'# Create the collection with the OpenAI embedding function\nmatrix_collection1 = client.create_collection(\n    name="matrix1",\n    embedding_function=openai_ef,\n)'

In [52]:
#!pip install openai==0.28


In [53]:
# Create the collection
matrix_collection2 = client.create_collection(name="matrix2")

In [54]:
# Add the raw documents
matrix_collection2.add(
    documents=[
        "The Matrix is all around us.",
        "What you know you can't explain, but you feel it",
        "There is a difference between knowing the path and walking the path",
    ],
    ids=["quote1", "quote2", "quote3"],
)

In [55]:
print(matrix_collection2)

Collection(id=add34754-f6f6-432f-a89f-14aa1ef5eb65, name=matrix2)


In [56]:
matrix_collection2.get()

{'ids': ['quote1', 'quote2', 'quote3'],
 'embeddings': None,
 'metadatas': [None, None, None],
 'documents': ['The Matrix is all around us.',
  "What you know you can't explain, but you feel it",
  'There is a difference between knowing the path and walking the path'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [57]:
# Querying by a set of query_texts
results = matrix_collection.query(query_texts=["What is the Matrix?"], n_results=2)

print(results)

{'ids': [['quote10', 'quote1']], 'distances': [[0.5123476982116699, 0.8001493811607361]], 'metadatas': [[{'category': 'quote', 'speaker': 'Morpheus'}, {'category': 'quote', 'speaker': 'Morpheus'}]], 'embeddings': None, 'documents': [['Everything is a matrix', 'The Matrix is everywhere, it is all around us.']], 'uris': None, 'data': None, 'included': ['metadatas', 'documents', 'distances']}


In [58]:
# Querying by a set of query_texts
results = matrix_collection2.query(query_texts=["What is the Matrix?"], n_results=2)

print(results)

{'ids': [['quote1', 'quote2']], 'distances': [[0.660364031791687, 1.7509040832519531]], 'metadatas': [[None, None]], 'embeddings': None, 'documents': [['The Matrix is all around us.', "What you know you can't explain, but you feel it"]], 'uris': None, 'data': None, 'included': ['metadatas', 'documents', 'distances']}
