source: https://www.datacamp.com/tutorial/chromadb-tutorial-step-by-step-guide
author: Abid Ali Awan

In [1]:
import chromadb

client = chromadb.PersistentClient(path="db/")


In [2]:
collection = client.create_collection(name="Students")

In [3]:
student_info = """
Alexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA,
is a member of the programming and chess clubs who enjoys pizza, swimming, and hiking
in her free time in hopes of working at a tech company after graduating from the University of Washington.
"""

club_info = """
The university chess club provides an outlet for students to come together and enjoy playing
the classic strategy game of chess. Members of all skill levels are welcome, from beginners learning
the rules to experienced tournament players. The club typically meets a few times per week to play casual games,
participate in tournaments, analyze famous chess matches, and improve members' skills.
"""

university_info = """
The University of Washington, founded in 1861 in Seattle, is a public research university
with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.
As the flagship institution of the six public universities in Washington state,
UW encompasses over 500 buildings and 20 million square feet of space,
including one of the largest library systems in the world."""

In [4]:
collection.add(
    documents=[student_info, club_info, university_info],
    metadatas=[{"source": "student_info"}, {"source": "club_info"}, {"source": "university_info"}],
    ids=["id1", "id2", "id3"],
)

In [5]:
results = collection.query(query_texts=["What is the student name?"],
                           n_results=2)

In [6]:
results

{'ids': [['id1', 'id2']],
 'distances': [[1.2946666564424738, 1.3954030668049473]],
 'metadatas': [[{'source': 'student_info'}, {'source': 'club_info'}]],
 'embeddings': None,
 'documents': [['\nAlexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA,\nis a member of the programming and chess clubs who enjoys pizza, swimming, and hiking\nin her free time in hopes of working at a tech company after graduating from the University of Washington.\n',
   "\nThe university chess club provides an outlet for students to come together and enjoy playing\nthe classic strategy game of chess. Members of all skill levels are welcome, from beginners learning\nthe rules to experienced tournament players. The club typically meets a few times per week to play casual games,\nparticipate in tournaments, analyze famous chess matches, and improve members' skills.\n"]],
 'uris': None,
 'data': None}
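
Each key in the result maps to a list of lists: the outer list has one entry per query text, and the inner list holds the hits for that query. Below is a minimal sketch of flattening that shape into one record per hit. `flatten_hits` is a hypothetical helper, not part of Chroma, and the documents are shortened placeholders for the texts shown above.

```python
# A dictionary shaped like the query output above (documents shortened).
results = {
    "ids": [["id1", "id2"]],
    "distances": [[1.2946666564424738, 1.3954030668049473]],
    "metadatas": [[{"source": "student_info"}, {"source": "club_info"}]],
    "documents": [["Alexandra Thompson, a 19-year-old ...",
                   "The university chess club ..."]],
}


def flatten_hits(results, query_index=0):
    """Return one dict per hit for the query text at position `query_index`."""
    return [
        {"id": id_, "distance": dist, "source": meta["source"], "document": doc}
        for id_, dist, meta, doc in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]


hits = flatten_hits(results)
for hit in hits:
    print(f'{hit["id"]}  {hit["distance"]:.4f}  {hit["source"]}')
```

Hits are already ordered from smallest to largest distance, so `hits[0]` is the closest match.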

### Embeddings

In [8]:
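import os
from chromadb.utils import embedding_functions

# Read the API key from the environment rather than hard-coding it.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002",
)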

In [9]:
students_embeddings = openai_ef([student_info, club_info, university_info])



In [10]:
print(students_embeddings)

[[-0.012540126219391823, 0.00813133642077446, 0.012488334439694881, -0.04329806938767433, -0.0012462438317015767, 0.024277476593852043, -0.030220603570342064, 0.0015335272764787078, -0.009542667306959629, -0.020600248128175735, 0.028692740947008133, -0.015110301785171032, -0.003832604270428419, -0.016159089282155037, 0.01931839808821678, -0.022814353927969933, 0.01632741279900074, -0.019201865419745445, 0.01464417390525341, -0.02423863299190998, 0.00673295371234417, -0.008772261440753937, 0.007458040956407785, -0.013504751026630402, -0.015317469835281372, -0.006338039878755808, 0.01851562224328518, -0.018049495294690132, 0.01807539165019989, -0.004871680401265621, 0.039310090243816376, -0.015550533309578896, -0.016094349324703217, -0.015395157039165497, -0.02258129045367241, -0.011109373532235622, -0.0025669385213404894, 0.004246939904987812, -0.0011256657307967544, 0.007218502927571535, 0.013634230941534042, 0.0006320236716419458, -0.006516074761748314, 0.008888793177902699, -0.004327
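
Printing raw vectors is hard to read, so a short summary is often more useful. The sketch below reports how many vectors there are, their dimensionality, and their Euclidean norms. `summarize` is a hypothetical helper and the 2-dimensional vectors are toy data. In the notebook you would pass `students_embeddings` instead, and `text-embedding-ada-002` should then report 1536 dimensions.

```python
import math


def summarize(embeddings):
    """Return the count, the distinct dimensions, and the L2 norm of each vector."""
    dims = sorted({len(vec) for vec in embeddings})
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in embeddings]
    return {"count": len(embeddings), "dims": dims, "norms": norms}


# Toy stand-in for the real embeddings.
toy_embeddings = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]
summary = summarize(toy_embeddings)
print(summary["count"], summary["dims"])
print([round(n, 3) for n in summary["norms"]])
```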

Instead of relying on the default embedding model, we will load the precomputed embeddings directly into a collection.

In [11]:
collection2 = client.create_collection(name="Students2")

In [12]:
collection2.add(
    embeddings= students_embeddings,
    documents = [student_info, club_info, university_info],
    metadatas = [{"source":"student_info"}, {"source":"club_info"}, {"source":"university_info"}],
    ids = ["id1", "id2", "id3"]
)
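
To make concrete what a vector store does with precomputed embeddings, here is a brute-force sketch of nearest-neighbour search in plain Python. It is not how Chroma is implemented (Chroma uses an approximate index), and it assumes squared L2 as the distance. The 3-dimensional vectors and the `query` helper are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query(store, query_embedding, n_results=2):
    """Return the n_results closest (id, distance) pairs, nearest first."""
    scored = [(id_, squared_l2(vec, query_embedding)) for id_, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Toy precomputed embeddings keyed by document id.
store = {
    "id1": [1.0, 0.0, 0.0],
    "id2": [0.0, 1.0, 0.0],
    "id3": [0.7, 0.7, 0.0],
}

nearest = query(store, [0.9, 0.1, 0.0])
print([id_ for id_, _ in nearest])
```

The query vector has to come from the same model as the stored vectors. Otherwise the dimensions may not match, and even if they do, the distances carry no meaning.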

There is also a more straightforward method: pass an OpenAI embedding function when creating or accessing the collection. Besides OpenAI, you can use Cohere, Google PaLM, HuggingFace, and Instructor models.

With this setup, adding new text documents runs the OpenAI embedding function, rather than the default model, to convert the text into embeddings.

In [13]:
collection2 = client.get_or_create_collection(name="Students2",embedding_function=openai_ef)

In [14]:
collection2.add(
    documents = [student_info, club_info, university_info],
    metadatas = [{"source": "student info"},{"source": "club info"},{'source':'university info'}],
    ids = ["id1", "id2", "id3"]
)

Add of existing embedding ID: id1
Add of existing embedding ID: id2
Add of existing embedding ID: id3
Insert of existing embedding ID: id1
Insert of existing embedding ID: id2
Insert of existing embedding ID: id3
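
The warnings above indicate that, in the Chroma version used here, `add` skips ids that already exist rather than overwriting them. The toy model below uses a plain dictionary to contrast that behaviour with overwrite-on-conflict semantics. It is a sketch of the idea only, and the `add` and `upsert` functions are hypothetical stand-ins, not Chroma's code.

```python
def add(store, ids, documents):
    """Insert only ids that are not present yet; return the ids that were skipped."""
    skipped = []
    for id_, doc in zip(ids, documents):
        if id_ in store:
            skipped.append(id_)
        else:
            store[id_] = doc
    return skipped


def upsert(store, ids, documents):
    """Insert new ids and overwrite existing ones."""
    for id_, doc in zip(ids, documents):
        store[id_] = doc


store = {}
add(store, ["id1", "id2"], ["first student text", "first club text"])
skipped = add(store, ["id1", "id3"], ["second student text", "university text"])
print(skipped)       # ['id1']
print(store["id1"])  # first student text

upsert(store, ["id1"], ["second student text"])
print(store["id1"])  # second student text
```

Chroma collections also provide an `upsert` method for the overwrite case; `update`, shown later, changes records that already exist.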


Let’s see the difference by running the same query on the new collection.

In [15]:
results = collection2.query(
    query_texts=["What is the student name?"],
    n_results=2
)

results

{'ids': [['id1', 'id3']],
 'distances': [[0.4379204209803942, 0.5050991263340903]],
 'metadatas': [[{'source': 'student_info'}, {'source': 'university_info'}]],
 'embeddings': None,
 'documents': [['\nAlexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA,\nis a member of the programming and chess clubs who enjoys pizza, swimming, and hiking\nin her free time in hopes of working at a tech company after graduating from the University of Washington.\n',
   '\nThe University of Washington, founded in 1861 in Seattle, is a public research university\nwith over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.\nAs the flagship institution of the six public universities in Washington state,\nUW encompasses over 500 buildings and 20 million square feet of space,\nincluding one of the largest library systems in the world.']],
 'uris': None,
 'data': None}
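
Distances from the two collections come from different embedding models, so they are not directly comparable. Within a single collection they can be mapped to cosine similarity under two assumptions that the notebook does not verify: the collection uses squared L2 distance, and the vectors have unit length. In that case $\lVert a-b\rVert^2 = \lVert a\rVert^2 + \lVert b\rVert^2 - 2\,a\cdot b = 2 - 2\cos\theta$. The helper name below is hypothetical.

```python
def cosine_from_squared_l2(distance):
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - distance / 2.0


# Distances reported for the Students2 query above.
for d in (0.4379204209803942, 0.5050991263340903):
    print(f"distance={d:.4f}  cosine={cosine_from_squared_l2(d):.4f}")
```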

### Updating and Removing Data

In [16]:
collection2.update(
    ids=["id1"],
    documents=["Kristiane Carina, a 19-year-old computer science sophomore with a 3.7 GPA"],
    metadatas=[{"source": "student info"}],
)

In [17]:
results = collection2.query(
    query_texts=["What is the student name?"],
    n_results=2
)

results

{'ids': [['id1', 'id3']],
 'distances': [[0.3739192407986011, 0.5050991263340903]],
 'metadatas': [[{'source': 'student info'}, {'source': 'university_info'}]],
 'embeddings': None,
 'documents': [['Kristiane Carina, a 19-year-old computer science sophomore with a 3.7 GPA',
   '\nThe University of Washington, founded in 1861 in Seattle, is a public research university\nwith over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.\nAs the flagship institution of the six public universities in Washington state,\nUW encompasses over 500 buildings and 20 million square feet of space,\nincluding one of the largest library systems in the world.']],
 'uris': None,
 'data': None}
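
The distance for `id1` changed from about 0.4379 to 0.3739 after the update, which suggests that replacing a document also recomputes its embedding. The sketch below models that with a plain dictionary. `toy_embed` is a hypothetical stand-in for a real embedding model and the texts are shortened.

```python
def toy_embed(text):
    """Hypothetical embedding: two crude features of the text."""
    return [float(len(text)), float(text.count(","))]


def update(store, id_, document):
    """Replace the document of an existing id and recompute its embedding."""
    if id_ not in store:
        raise KeyError(id_)
    store[id_] = {"document": document, "embedding": toy_embed(document)}


original = "Alexandra Thompson, a sophomore, plays chess"
store = {"id1": {"document": original, "embedding": toy_embed(original)}}

before = store["id1"]["embedding"]
update(store, "id1", "Kristiane Carina, a sophomore")
after = store["id1"]["embedding"]
print(before != after)
```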

In [18]:
collection2.delete(ids = ['id1'])


results = collection2.query(
    query_texts=["What is the student name?"],
    n_results=2
)

results

{'ids': [['id3', 'id2']],
 'distances': [[0.5053133789363521, 0.5303026621257574]],
 'metadatas': [[{'source': 'university_info'}, {'source': 'club_info'}]],
 'embeddings': None,
 'documents': [['\nThe University of Washington, founded in 1861 in Seattle, is a public research university\nwith over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.\nAs the flagship institution of the six public universities in Washington state,\nUW encompasses over 500 buildings and 20 million square feet of space,\nincluding one of the largest library systems in the world.',
   "\nThe university chess club provides an outlet for students to come together and enjoy playing\nthe classic strategy game of chess. Members of all skill levels are welcome, from beginners learning\nthe rules to experienced tournament players. The club typically meets a few times per week to play casual games,\nparticipate in tournaments, analyze famous chess matches, and improve members' skills.\n"]],
 'uris

### Collection Management


In [19]:
vector_collections = client.create_collection("vectordb")


vector_collections.add(
    documents=["This is Chroma DB CheatSheet",
               "This is Chroma DB Documentation",
               "This document Chroma JS API Docs"],
    metadatas=[{"source": "Chroma Cheatsheet"},
    {"source": "Chroma Doc"},
    {'source':'JS API Doc'}],
    ids=["id1", "id2", "id3"]
)

In [20]:
vector_collections.count()

3

In [21]:
vector_collections.get()

{'ids': ['id1', 'id2', 'id3'],
 'embeddings': None,
 'metadatas': [{'source': 'Chroma Cheatsheet'},
  {'source': 'Chroma Doc'},
  {'source': 'JS API Doc'}],
 'documents': ['This is Chroma DB CheatSheet',
  'This is Chroma DB Documentation',
  'This document Chroma JS API Docs'],
 'uris': None,
 'data': None}

In [22]:
vector_collections.modify(name="chroma_info")

# list all collections
client.list_collections()

[Collection(name=Students),
 Collection(name=chroma_info),
 Collection(name=Students2)]

In [23]:
vector_collections_new = client.get_collection(name="chroma_info")

In [25]:
client.delete_collection(name="chroma_info")
client.list_collections()

[Collection(name=Students), Collection(name=Students2)]

In [None]:
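# Removes every collection and all stored data; this cannot be undone.
# Chroma only permits reset() when the client was created with allow_reset=True in its settings.
client.reset()
client.list_collections()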