# CRUD Operations on Vector Databases 

## Lab Description:

In this lab, participants will learn how to perform CRUD (Create, Read, Update, Delete) operations on a vector database using ChromaDB. We begin by introducing word embeddings and how they transform text into numerical representations. Next, we create an **in-memory Chroma vector store** and add text data to it. We then perform a similarity search to retrieve relevant information based on embeddings. Finally, we explore how to update and delete elements in the vector store.

## Lab Objectives:

After completing this lab, participants will be able to:

- Understand and apply the concept of word embeddings.
- Create and manage a Chroma in-memory vector store.
- Modify vector data by updating and deleting stored elements.
- Perform CRUD operations (Create, Read, Update, Delete) on a vector database.


## VECTOR EMBEDDINGS:

Vector embeddings are numerical representations of objects such as text, audio, or images. They capture the semantic meaning of these objects and their relationships with one another, which makes it easier to perform similarity searches and classification on a dataset.

So, how exactly are sentences converted into numbers?

First, a sentence is broken down into smaller units called "tokens". Each token is then mapped to a numerical representation. Two common approaches are:

Word Embeddings: Techniques like "Word2Vec" learn one vector per token from the contexts in which the token appears in the training data, so tokens used in similar contexts end up with similar vectors.

Transformers: Models like BERT use deep learning to generate contextual embeddings. Each token's vector reflects its meaning in the surrounding sentence, so the same word can get different vectors in different contexts.
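
The two steps described above (tokenize, then map tokens to vectors) can be sketched with a toy example. Everything here is an assumption made for illustration: the whitespace tokenizer, the hand-picked 3-dimensional `TOY_EMBEDDINGS` table, and the mean pooling used to combine token vectors into one sentence vector. Real models learn their vectors from data and typically use subword tokenizers.

```python
# Toy illustration only: not how Word2Vec or BERT are actually implemented.

def tokenize(sentence: str) -> list[str]:
    """Split a sentence into lowercase, whitespace-separated tokens."""
    return sentence.lower().split()

# Hypothetical embedding table with hand-picked 3-dimensional vectors.
TOY_EMBEDDINGS = {
    "i": [0.1, 0.0, 0.2],
    "love": [0.9, 0.1, 0.0],
    "cats": [0.2, 0.8, 0.1],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for tokens missing from the table

def embed_sentence(sentence: str) -> list[float]:
    """Average the token vectors (mean pooling) into one sentence vector."""
    vectors = [TOY_EMBEDDINGS.get(tok, UNKNOWN) for tok in tokenize(sentence)]
    if not vectors:  # empty input: return the fallback vector
        return list(UNKNOWN)
    dims = len(UNKNOWN)
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

tokens = tokenize("I love cats")
sentence_vector = embed_sentence("I love cats")
print(tokens)
print(sentence_vector)
```

The result is a single fixed-length vector for the sentence, which is the same shape of output that a real embedding model returns, only with far fewer dimensions.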

Let's see how to generate vector embeddings in practice.

Hugging Face hosts many sentence-transformer models, such as "all-MiniLM-L6-v2" and "all-mpnet-base-v2". We can use these models to generate embeddings for our sentences.

In [1]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings #importing the HuggingFaceEmbeddings class from LangChain

In [2]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") #loads the model "all-mpnet-base-v2" from HuggingFace
text = "This is a test document."                                                        #A sample sentence
query_result = embeddings.embed_query(text)                                              #embed_query() takes a string and returns its embedding as a list of floats
print(query_result[:3])                                                                  #prints the first 3 elements of the vector. This model maps the whole sentence to a single vector with 768 values.

[-0.04895174130797386, -0.03986189886927605, -0.021562783047556877]


In [3]:
len(query_result)

768

The first output shows the first 3 elements of the embedding we generated, and `len(query_result)` confirms that the full vector has 768 dimensions.

## VECTOR DATABASES

We need a system to store and query high-dimensional vector embeddings like the one we just generated. This is where a vector database comes into play. A vector database stores large volumes of vector embeddings and indexes them for efficient similarity-based retrieval. Popular options include ChromaDB, FAISS (a similarity-search library), and Weaviate.
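
To make the idea of "store vectors, then retrieve the closest ones" concrete, here is a minimal sketch of a brute-force nearest-neighbour search over an in-memory dictionary. The ids, the 2-dimensional vectors, and the helper names `squared_l2` and `nearest` are all hypothetical. Real vector databases avoid comparing the query against every stored vector by using approximate nearest-neighbour indexes.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical store: document id -> 2-dimensional embedding.
store = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.0, 1.0],
    "doc-c": [0.9, 0.1],
}

def nearest(query, k):
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

top = nearest([1.0, 0.0], k=2)
print(top)
```

This scan costs time proportional to the number of stored vectors for every query, which is exactly the cost that an index is meant to reduce.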

## CREATING A VECTOR DATABASE WITH CHROMADB:

Let's see how we can create a vector database using ChromaDB. Note that the text is first converted into vector embeddings, using the method discussed above, before it is added to the vector DB. We use `sentence-transformers/all-mpnet-base-v2` from Hugging Face as our embedding model. This model is responsible for converting our text collection into vector representations.

<div style="text-align: center;">
    <img src="flow.png" alt="flow" width="800" height="660">
</div>

In [1]:
from langchain_chroma import Chroma  #langchain_chroma is the LangChain integration package for Chroma

In [4]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2") #loading the embedding model
text_collection = [                                                                      #Sample text sets to be added to the vector db
    "I love cats", 
    "I love programming",
    "I love computers", 
    "I like to eat burgers"
]

vector_store = Chroma.from_texts(texts=text_collection, embedding=embeddings)            #creating a vector database from texts, the parameters are the list of text and the embedding model



Now that the vector store is created, we can perform operations like similarity search on the vector database. 

In [5]:
query = "I love machine learning"                                    #The query to perform similarity search on the vector db

result = vector_store.similarity_search_with_score(query=query, k=2) #similarity_search_with_score returns the documents most similar to the query. k=2 returns the 2 most similar docs, each with a distance score; a lower score indicates more similarity
result

[(Document(metadata={}, page_content='I love computers'), 0.8973314762115479),
 (Document(metadata={}, page_content='I love programming'),
  0.9493520259857178)]

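This returns the two sentences most similar to the query, each with a distance score as a float. A lower score means greater similarity. Which distance is reported depends on how the collection is configured: Chroma defaults to squared L2 (Euclidean) distance unless the collection is set up to use another metric, such as cosine distance.

Cosine distance is defined as 1 minus the cosine of the angle between two vectors in a multi-dimensional space, so it depends only on the direction of the vectors and not on their length. In NLP, it is commonly used to determine how similar two text embeddings are, with a smaller cosine distance indicating greater similarity.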
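
As a small worked example, cosine distance can be computed with the standard library alone. The function names and the sample vectors below are chosen for illustration; the sketch assumes both vectors are non-zero and of equal length.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cosine similarity: about 0 for same direction, 2 for opposite."""
    return 1.0 - cosine_similarity(a, b)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 90 degrees apart
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # 180 degrees apart

print(same_direction, orthogonal, opposite)
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` have different lengths but the same direction, so their cosine distance is approximately zero.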

We can also store a piece of text together with its metadata as a document and add it to the vector DB, using the `Document` class from the `langchain_core` library.

A UUID (Universally Unique Identifier) is a 128-bit value used to identify things uniquely. Python's `uuid` module generates these identifiers, and random (version 4) UUIDs are so unlikely to collide that they can be treated as unique in practice. We use them to give each document a unique id, which makes it simple to fetch, update, or delete a specific document later.
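
A quick standalone look at what `uuid4()` produces may help before we use it. The variable names here are illustrative only.

```python
from uuid import UUID, uuid4

# Generate three random (version 4) UUIDs as strings.
ids = [str(uuid4()) for _ in range(3)]
print(ids)

# Each string has the canonical 36-character form: 32 hex digits and 4 hyphens.
lengths = [len(i) for i in ids]

# Parsing a string back into a UUID object lets us inspect its version.
parsed = UUID(ids[0])
print(parsed.version)
```

Because the values are random, the printed ids change on every run, so ids should be stored (as we do with the `uuids` list) rather than regenerated when a document needs to be referenced later.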

In [4]:
from langchain_core.documents import Document
from uuid import uuid4

In [5]:
document1 = Document(
    page_content="I love cats",     #stores the text along with an associated metadata
    metadata={"source":"tweet"},
    id=1                            #you can also add an optional id to distinguish each document especially in a really large dataset
)

document2 = Document(
    page_content="I love programming",     
    metadata={"source":"tweet"},
    id=2
)

document3 = Document(
    page_content="Vector Databases Enhance Search and Retrieval for AI Applications.",     
    metadata={"source":"news"},
    id=3
)

document4 = Document(
    page_content="Machine Learning Drives Innovation and Efficiency Across Sectors.",     
    metadata={"source":"news"},
    id=4
)

document = [                        #all documents are now a single list 
    document1,
    document2,
    document3,
    document4
]

uuids = [str(uuid4()) for _ in range(len(document))]  #Generates a random uuid for each document in our list.
# print(uuids)
# document

We can now create a vector database to store these documents.

In [6]:
print(uuids)

['6b96740a-74d1-48e0-9dbe-75484495caca', '6e7863e1-47c7-4c6c-95c4-bf2e4eb920a8', 'f399256d-7fe5-49ac-bb6f-53fdd16a468b', 'cdd6ddd4-920c-4a6d-9490-01e9523cb9c1']


In [9]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")  #load the embedding model

vector_store = Chroma(
    collection_name="document_collection",    #the name of your collection 
    embedding_function=embeddings             #specify the embedding function 
)

To add our documents to the vector store, we can use the `add_documents()` method.

In [10]:
vector_store.add_documents(documents=document, ids=uuids)  #this will add documents into the vector db with the given embedding.

['6b96740a-74d1-48e0-9dbe-75484495caca',
 '6e7863e1-47c7-4c6c-95c4-bf2e4eb920a8',
 'f399256d-7fe5-49ac-bb6f-53fdd16a468b',
 'cdd6ddd4-920c-4a6d-9490-01e9523cb9c1']

## RETRIEVAL USING SIMILARITY SEARCH

Now that we have the vector database, we can perform similarity searches on it.

In [13]:
query = "Deep learning powers AI by helping machines learn from data."

result = vector_store.similarity_search(
    query=query,              #the query to be searched
    k=2,                      #returns the 2 most similar documents
    filter={"source":"news"}  #filters documents based on the metadata. [optional]
   )

result

[Document(metadata={'source': 'news'}, page_content='Machine Learning Drives Innovation and Efficiency Across Sectors.'),
 Document(metadata={'source': 'news'}, page_content='Vector Databases Enhance Search and Retrieval for AI Applications.')]

This returns the two most similar documents, filtered by the metadata we provided. 
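
Conceptually, a metadata filter restricts the set of candidates and the similarity ranking then applies only to what remains. The sketch below illustrates that idea with plain Python. The records, the 2-dimensional vectors, the `search` helper, and its `where` parameter are hypothetical, and the sketch does not claim to reflect how Chroma implements filtering internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical records with hand-picked 2-dimensional embeddings.
records = [
    {"id": "1", "vector": [0.9, 0.1], "metadata": {"source": "tweet"}},
    {"id": "2", "vector": [0.1, 0.9], "metadata": {"source": "news"}},
    {"id": "3", "vector": [0.6, 0.4], "metadata": {"source": "news"}},
]

def search(query, k, where=None):
    """Keep records whose metadata matches every key in `where`, then rank."""
    candidates = [
        r for r in records
        if where is None
        or all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda r: squared_l2(query, r["vector"]))
    return [r["id"] for r in candidates[:k]]

unfiltered = search([1.0, 0.0], k=2)
news_only = search([1.0, 0.0], k=2, where={"source": "news"})
print(unfiltered, news_only)
```

Without the filter, the closest record is the tweet. With the filter, the tweet is excluded even though it is the nearest vector, which mirrors the filtered result shown above.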

## UPDATE VECTOR DB

We can update the vector database with the update_document() method. It requires the id of the document to be updated and the new document. 

In [14]:
updated_document_1 = Document(            #The new document
    page_content="I really love dogs",
    metadata={"source": "tweet"},
    id=1,
)

vector_store.update_document(document_id=uuids[0], document=updated_document_1)   #The update_document() method updates the vector db. Note how we access the element with uuids[0], which is the id we generated for the first element.

In [12]:
vector_store.get(ids=uuids[0]) #the .get() method fetches the vector db elements by their ids

{'ids': ['95f10471-c84b-4eb6-beae-3f61d13a0df8'],
 'embeddings': None,
 'metadatas': [{'source': 'tweet'}],
 'documents': ['I really love dogs'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

We can now do a similarity search on the updated vector database. 

In [15]:
query = "I love pets." #The query for our new similarity search

result = vector_store.similarity_search(query=query, k=2, filter={"source":"tweet"})  #performs a similarity search with the given query and fetches the 2 most similar documents, with a filter. 
result

[Document(metadata={'source': 'tweet'}, page_content='I really love dogs'),
 Document(metadata={'source': 'tweet'}, page_content='I love programming')]

We can see that the similarity search pulled the updated document. 
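
The reason the search picks up the new text is that an update has to replace the stored vector as well as the stored text. The sketch below shows this with a toy in-memory store. The `toy_embed` function (which simply counts a few keywords), the `add` and `update` helpers, and the id `"id-1"` are hypothetical stand-ins and not part of any library.

```python
def toy_embed(text):
    """Hypothetical stand-in for an embedding model: keyword counts."""
    words = text.lower().split()
    return [float(words.count(w)) for w in ("cats", "dogs", "love")]

store = {}  # id -> {"text", "metadata", "vector"}

def add(doc_id, text, metadata):
    """Create a record, embedding the text at insertion time."""
    store[doc_id] = {"text": text, "metadata": metadata, "vector": toy_embed(text)}

def update(doc_id, text, metadata):
    """Replace an existing record and recompute its vector from the new text."""
    if doc_id not in store:
        raise KeyError(f"no document with id {doc_id!r}")
    store[doc_id] = {"text": text, "metadata": metadata, "vector": toy_embed(text)}

add("id-1", "I love cats", {"source": "tweet"})
before = store["id-1"]["vector"]

update("id-1", "I really love dogs", {"source": "tweet"})
after = store["id-1"]["vector"]

print(before, after)
```

The id stays the same while the text and the vector both change. If only the text were replaced, similarity search would still rank the record by the old embedding.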

## DELETING FROM THE VECTOR DB

The `delete()` method removes elements from the vector DB by their ids.

In [16]:
vector_store.delete(ids=uuids[-1])  #deletes the document whose id is last in uuids (the last document we added)

In [17]:
uuids.pop() #removing the id of the deleted element

'e47753de-cbaa-4d1d-aa9a-30fb9fffe984'

In [18]:
vector_store.get(ids=uuids[-1]) #fetches the document that is now last in our id list

{'ids': ['2bc909ee-83e6-4f89-89ea-5d63c60676d3'],
 'embeddings': None,
 'metadatas': [{'source': 'news'}],
 'documents': ['Vector Databases Enhance Search and Retrieval for AI Applications.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

The document that was previously last has been deleted, so the last remaining id now points to the third document.
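
The bookkeeping in this section (delete by id, then drop that id from our own list) can be sketched with a toy dictionary-backed store. The ids, the `delete` and `get` helpers, and the choice to return empty lists for a missing id are assumptions of this sketch; check the behaviour of the real vector store before relying on it.

```python
# Hypothetical store: document id -> text.
store = {
    "id-1": "I really love dogs",
    "id-2": "I love programming",
    "id-3": "Vector Databases Enhance Search and Retrieval for AI Applications.",
    "id-4": "Machine Learning Drives Innovation and Efficiency Across Sectors.",
}
ids = list(store)  # our own record of the ids, in insertion order

def delete(doc_id):
    """Remove a document by id; deleting a missing id is a no-op here."""
    store.pop(doc_id, None)

def get(doc_id):
    """Return matching ids and documents; empty lists if the id is absent."""
    found = doc_id in store
    return {
        "ids": [doc_id] if found else [],
        "documents": [store[doc_id]] if found else [],
    }

delete(ids[-1])          # remove the last document from the store
removed_id = ids.pop()   # keep our id list in sync with the store

print(get(removed_id))   # the deleted id no longer resolves to a document
print(get(ids[-1]))      # the last remaining id points to the third document
```

Keeping the id list in sync matters: without the `pop()`, `ids[-1]` would still refer to a document that no longer exists.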
