# Overview of an Open Source Vector Database
Chroma (ChromaDB) is an open-source embedding database that enables semantic search. It stores documents as dense vector embeddings, produced by transformer-based language models, so that documents can be retrieved by meaning rather than by exact keyword match. This blog post shows how to create and store embeddings in ChromaDB and how to use user queries to retrieve documents that match semantically.

## 1. Installation

In [1]:
!pip install chromadb -q
!pip install sentence-transformers -q


Next, we connect to ChromaDB and create a collection. `chromadb.Client()` starts an in-memory instance, so its contents last only for the current session. By default, ChromaDB uses the Sentence Transformers `all-MiniLM-L6-v2` model to create embeddings.

In [2]:
import chromadb

client = chromadb.Client()
collection = client.create_collection("test_collection")

## 2. Adding Documents
We can add some documents to our collection, along with corresponding metadata and unique IDs.

In [3]:
collection.add(
    documents=["Cat is a domestic animal", "Lamborghini is my favorite car."],
    metadatas=[{"category": "animal"}, {"category": "vehicle"}],
    ids=["id1", "id2"]
)



## 3. Querying
Now, we can query our collection. Let's search for the term "favorite". The returned result should be the document about the car.

In [4]:
results = collection.query(
    query_texts=["favorite"],
    n_results=1
)
print(results)

{'ids': [['id2']], 'distances': [[1.3154857158660889]], 'metadatas': [[{'category': 'vehicle'}]], 'embeddings': None, 'documents': [['Lamborghini is my favorite car.']], 'uris': None, 'data': None}
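
The result is a dictionary of parallel lists with one inner list per query text, which is why every value above is wrapped in two sets of brackets. A minimal sketch of unpacking it into one row per hit, using the printed output as literal input; the helper name `flatten_results` is hypothetical and not part of ChromaDB:

```python
def flatten_results(results):
    """Turn a Chroma-style query result into a flat list of hit dictionaries."""
    rows = []
    # The outer index selects the query; the inner index selects the hit.
    for query_index, hit_ids in enumerate(results["ids"]):
        for hit_index, doc_id in enumerate(hit_ids):
            rows.append({
                "query_index": query_index,
                "id": doc_id,
                "distance": results["distances"][query_index][hit_index],
                "document": results["documents"][query_index][hit_index],
                "metadata": results["metadatas"][query_index][hit_index],
            })
    return rows


# Literal copy of the output printed above.
results = {
    "ids": [["id2"]],
    "distances": [[1.3154857158660889]],
    "metadatas": [[{"category": "vehicle"}]],
    "embeddings": None,
    "documents": [["Lamborghini is my favorite car."]],
    "uris": None,
    "data": None,
}

for row in flatten_results(results):
    print(row["id"], row["distance"], row["document"], row["metadata"])
```

With several query texts or a larger `n_results`, the same loop yields one row per (query, hit) pair.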


## 3. Reading files from a folder
Now, let's add our pet documents to the collection. We start by reading all the text files from the "pets" folder and storing the data in a list.

In [5]:
import os

def read_files_from_folder(folder_path):
    file_data = []

    for file_name in os.listdir(folder_path):
        if file_name.endswith(".txt"):
            with open(os.path.join(folder_path, file_name), 'r', encoding='utf-8') as file:
                content = file.read()
                file_data.append({"file_name": file_name, "content": content})

    return file_data

folder_path = "pets"
file_data = read_files_from_folder(folder_path)
print(file_data)

[{'file_name': 'The Emotional Bond Between Humans and Pets.txt', 'content': 'Pets offer more than just companionship; they provide emotional support, reduce stress, and can even help their owners lead healthier lives. The bond between pets and their owners is strong, and many people consider their pets as part of the family. This bond can be especially important in times of personal or societal stress, providing comfort and consistency.'}, {'file_name': 'Nutrition Needs of Pet Animals.txt', 'content': 'Proper nutrition is vital for the health and wellbeing of pets. Dogs and cats require a balanced diet that includes proteins, carbohydrates, and fats. Some may even have specific dietary needs based on their breed or age. Birds typically thrive on a diet of seeds, fruits, and vegetables, while reptiles have diverse diets ranging from live insects to fresh produce. Fish diets depend greatly on the species, with some needing live food and others subsisting on flakes or pellets.'}, {'file_n
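
`os.listdir` returns entries in arbitrary order, so the position-based IDs assigned in the next step can differ between runs or machines. A hedged variant that sorts the file names and reads them as UTF-8 is sketched below. The function name `read_text_files` and the temporary demo folder are illustrative assumptions, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def read_text_files(folder_path):
    """Read every .txt file in a folder, in sorted file-name order."""
    file_data = []
    for path in sorted(Path(folder_path).glob("*.txt")):
        file_data.append({
            "file_name": path.name,
            "content": path.read_text(encoding="utf-8"),
        })
    return file_data


# Demo on a throwaway folder so the sketch runs without the "pets" data.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    demo_data = read_text_files(tmp)

print([item["file_name"] for item in demo_data])
```

Note that switching to sorted order would change which ID each pet document receives, so the IDs in the outputs shown in this post would no longer line up.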

## 4. Adding File Contents to ChromaDB
Let's create separate lists for documents, metadata, and ids, which we add to our collection.

In [6]:
documents = []
metadatas = []
ids = []

for index, data in enumerate(file_data):
    documents.append(data['content'])
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))

pet_collection = client.create_collection("pet_collection")

pet_collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
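
The three lists passed to `add` are parallel: entry *i* of each list describes the same document, and each ID should be unique within the collection. A small sketch of a pre-flight check that can be run before calling `add`; `validate_records` is a hypothetical helper, not a ChromaDB function, and the sample data is made up.

```python
def validate_records(documents, metadatas, ids):
    """Raise ValueError if the parallel lists are inconsistent."""
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas, and ids must have equal length")
    duplicates = sorted({doc_id for doc_id in ids if ids.count(doc_id) > 1})
    if duplicates:
        raise ValueError(f"duplicate ids: {duplicates}")


# Made-up sample data in the same shape as the lists built above.
demo_documents = ["Dogs need daily walks.", "Cats sleep most of the day."]
demo_metadatas = [{"source": "dogs.txt"}, {"source": "cats.txt"}]
demo_ids = ["1", "2"]

validate_records(demo_documents, demo_metadatas, demo_ids)  # passes silently

try:
    validate_records(demo_documents, demo_metadatas, ["1", "1"])
except ValueError as error:
    print("rejected:", error)
```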


## 5. Performing Semantic Searches
Let's now query the collection for the different kinds of pets people commonly own.

In [7]:
results = pet_collection.query(
    query_texts=["What are the different kinds of pets people commonly own?"],
    n_results=1
)
print(results)

{'ids': [['5']], 'distances': [[0.7325009107589722]], 'metadatas': [[{'source': 'Different Types of Pet Animals.txt'}]], 'embeddings': None, 'documents': [['Pet animals come in all shapes and sizes, each suited to different lifestyles and home environments. Dogs and cats are the most common, known for their companionship and unique personalities. Small mammals like hamsters, guinea pigs, and rabbits are often chosen for their low maintenance needs. Birds offer beauty and song, and reptiles like turtles and lizards can make intriguing pets. Even fish, with their calming presence, can be wonderful pets.']], 'uris': None, 'data': None}


Our query successfully retrieves the most relevant document, which talks about different types of pet animals.

In [8]:
# Let's try another query
results = pet_collection.query(
    query_texts=["Are pets really have a need of diet?"],
    n_results=1
)
print(results)

{'ids': [['2']], 'distances': [[0.5089439153671265]], 'metadatas': [[{'source': 'Nutrition Needs of Pet Animals.txt'}]], 'embeddings': None, 'documents': [['Proper nutrition is vital for the health and wellbeing of pets. Dogs and cats require a balanced diet that includes proteins, carbohydrates, and fats. Some may even have specific dietary needs based on their breed or age. Birds typically thrive on a diet of seeds, fruits, and vegetables, while reptiles have diverse diets ranging from live insects to fresh produce. Fish diets depend greatly on the species, with some needing live food and others subsisting on flakes or pellets.']], 'uris': None, 'data': None}


## 6. Filtering Results
To refine a search further, use the `where_document` parameter to specify a condition that the document text must satisfy.

For example, to find documents about the emotional benefits of owning a pet that also mention reptiles, you could use the following query:

In [9]:
results = pet_collection.query(
    query_texts=["What are the emotional benefits of owning a pet?"],
    n_results=1,
    where_document={"$contains":"reptiles"}
)
print(results)


{'ids': [['5']], 'distances': [[0.837824821472168]], 'metadatas': [[{'source': 'Different Types of Pet Animals.txt'}]], 'embeddings': None, 'documents': [['Pet animals come in all shapes and sizes, each suited to different lifestyles and home environments. Dogs and cats are the most common, known for their companionship and unique personalities. Small mammals like hamsters, guinea pigs, and rabbits are often chosen for their low maintenance needs. Birds offer beauty and song, and reptiles like turtles and lizards can make intriguing pets. Even fish, with their calming presence, can be wonderful pets.']], 'uris': None, 'data': None}


Similarly, the `where` parameter narrows search results based on metadata. Suppose you want information about the emotional benefits of pet ownership, but only from the document about pet training and behaviour. You could do so with the following query:

In [10]:
results = pet_collection.query(
    query_texts=["What are the emotional benefits of owning a pet?"],
    n_results=1,
    where={"source": "Training and Behaviour of Pets.txt"}
)
print(results)


{'ids': [['3']], 'distances': [[0.8881877660751343]], 'metadatas': [[{'source': 'Training and Behaviour of Pets.txt'}]], 'embeddings': None, 'documents': [['Training is essential for a harmonious life with pets, particularly for dogs. It helps pets understand their boundaries and makes cohabitation easier for both pets and owners. Training should be based on positive reinforcement. Understanding pet behavior is also important, as changes in behavior can often be a sign of underlying health issues.']], 'uris': None, 'data': None}
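
Conceptually, both filters restrict the set of candidates before the nearest match is chosen. That is why this query returns the training document (distance about 0.89) even though the same question matched another document more closely in the previous query (distance about 0.84). The toy sketch below mimics that behaviour in plain Python with made-up two-dimensional vectors. It illustrates the idea only and is not how ChromaDB implements filtering internally; every name and number in it is an assumption.

```python
def filtered_query(query_vec, records, n_results=1, where=None, contains=None):
    """Brute-force nearest-neighbour search with optional filters.

    where:    dict of metadata key/value pairs that must all match exactly.
    contains: substring that must appear in the document text.
    """
    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    candidates = []
    for record in records:
        if where and any(record["metadata"].get(k) != v for k, v in where.items()):
            continue  # metadata filter rejects this record
        if contains and contains not in record["document"]:
            continue  # text filter rejects this record
        candidates.append((squared_l2(query_vec, record["vector"]), record["id"]))

    # Smallest distance first, then keep the top n_results.
    return sorted(candidates)[:n_results]


records = [
    {"id": "1", "vector": [0.9, 0.1], "document": "Pets reduce stress.",
     "metadata": {"source": "bond.txt"}},
    {"id": "2", "vector": [0.5, 0.5], "document": "Some reptiles eat insects.",
     "metadata": {"source": "nutrition.txt"}},
    {"id": "3", "vector": [0.1, 0.9], "document": "Training uses positive reinforcement.",
     "metadata": {"source": "training.txt"}},
]
query_vec = [1.0, 0.0]

print(filtered_query(query_vec, records))                        # no filter
print(filtered_query(query_vec, records, contains="reptiles"))   # text filter
print(filtered_query(query_vec, records, where={"source": "training.txt"}))  # metadata filter
```

A filter therefore guarantees where the answer comes from, not that the answer is relevant, so it is worth checking the returned distance as well.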


## 7. Using a different model for embedding
ChromaDB creates embeddings with the Sentence Transformers `all-MiniLM-L6-v2` model by default, but you can supply embeddings from any other model. In this example we use the Sentence Transformers `paraphrase-MiniLM-L3-v2` model.

First, we load the model and create embeddings for our documents.

In [11]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L3-v2')

documents = []
embeddings = []
metadatas = []
ids = []

for index, data in enumerate(file_data):
    documents.append(data['content'])
    embedding = model.encode(data['content']).tolist()
    embeddings.append(embedding)
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))


In [12]:
print(embeddings)

[[-0.19393713772296906, 0.06576558202505112, 0.183914452791214, -0.3133375942707062, -0.04582231864333153, 0.22872565686702728, 0.04865900054574013, -0.17936348915100098, 0.18261446058750153, -0.02524932101368904, -0.03102186694741249, -0.17568105459213257, -0.20327067375183105, -0.22137999534606934, 0.24683110415935516, -0.016986481845378876, 0.07477787882089615, 0.05372237786650658, -0.25089704990386963, 0.3237697184085846, -0.21091125905513763, 0.0678769126534462, -0.13660962879657745, -0.06466995179653168, -0.4310069978237152, -0.148874431848526, -0.1737636774778366, 0.04178356006741524, 0.2163541167974472, -0.08694472908973694, -0.05120129510760307, 0.09658914059400558, 0.0003728560113813728, -0.20701158046722412, -0.0936785563826561, 0.12885144352912903, 0.1944105476140976, -0.10627590864896774, 0.0955352783203125, -0.07952672988176346, -0.192728191614151, 0.17446070909500122, -0.06462385505437851, 0.15284043550491333, -0.452633261680603, 0.26032060384750366, 0.36341404914855957,

Then, we create a new collection and add the documents, embeddings, metadata, and ids to it.

In [13]:
pet_collection_emb = client.create_collection("new_pet_collection_emb")

pet_collection_emb.add(
    documents=documents,
    embeddings=embeddings,
    metadatas=metadatas,
    ids=ids
)

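Because this collection stores embeddings we computed ourselves, a query must be embedded with the same model, so we pass the query embedding instead of the query text. If we passed `query_texts`, Chroma would embed the query with the collection's own embedding function, which here is still the default model, and the resulting distances would not be meaningful. Let's search again for the different kinds of pets people commonly own.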

In [14]:
query = "What are the different kinds of pets people commonly own?"
input_em = model.encode(query).tolist()

results = pet_collection_emb.query(
    query_embeddings=[input_em],
    n_results=1
)
print(results)


{'ids': [['5']], 'distances': [[12.040445327758789]], 'metadatas': [[{'source': 'Different Types of Pet Animals.txt'}]], 'embeddings': None, 'documents': [['Pet animals come in all shapes and sizes, each suited to different lifestyles and home environments. Dogs and cats are the most common, known for their companionship and unique personalities. Small mammals like hamsters, guinea pigs, and rabbits are often chosen for their low maintenance needs. Birds offer beauty and song, and reptiles like turtles and lizards can make intriguing pets. Even fish, with their calming presence, can be wonderful pets.']], 'uris': None, 'data': None}


The results are similar to our previous query, with the same document about different types of pet animals being returned.
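
The distance here (about 12) is much larger than in the earlier query (about 0.73), even though the same document wins. A likely explanation, stated as an assumption rather than something verified in this notebook, is that Chroma's default distance is squared Euclidean (L2) distance and that the `paraphrase-MiniLM-L3-v2` embeddings are not unit-normalized, whereas the default embedding function produces normalized vectors. The sketch below uses made-up vectors to show how vector length inflates squared L2 distance while the angle between the vectors stays the same.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Made-up vectors pointing in similar directions but with large magnitudes.
doc_vec = [3.0, 4.0, 0.0]
query_vec = [4.0, 3.0, 1.0]

raw_distance = squared_l2(doc_vec, query_vec)
unit_distance = squared_l2(normalize(doc_vec), normalize(query_vec))

print("raw squared L2:       ", raw_distance)
print("normalized squared L2:", unit_distance)
# For unit vectors, squared L2 equals 2 - 2 * cosine similarity.
print("2 - 2 * cosine:       ", 2 - 2 * cosine_similarity(doc_vec, query_vec))
```

Distances are therefore only comparable within one collection and one model. If a bounded scale is wanted, the `encode` method of Sentence Transformers accepts `normalize_embeddings=True`; the same setting would then be needed for both documents and queries.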

Finally, let's make a more specific query about what foods are recommended for dogs.

In [15]:
query = "foods that are recommended for dogs?"
input_em = model.encode(query).tolist()

results = pet_collection_emb.query(
    query_embeddings=[input_em],
    n_results=1
)
print(results)

{'ids': [['2']], 'distances': [[17.143932342529297]], 'metadatas': [[{'source': 'Nutrition Needs of Pet Animals.txt'}]], 'embeddings': None, 'documents': [['Proper nutrition is vital for the health and wellbeing of pets. Dogs and cats require a balanced diet that includes proteins, carbohydrates, and fats. Some may even have specific dietary needs based on their breed or age. Birds typically thrive on a diet of seeds, fruits, and vegetables, while reptiles have diverse diets ranging from live insects to fresh produce. Fish diets depend greatly on the species, with some needing live food and others subsisting on flakes or pellets.']], 'uris': None, 'data': None}
