## Install Chroma Vector DB and LangChain wrapper

In [1]:
!pip install onnxruntime




In [2]:
from dotenv import load_dotenv
# load environment variables from the .env file
load_dotenv(r'C:\Users\Rahul\Documents\FREELANCING\1.AppliedSkil\Gen_AI_learning\.env')

True
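
The `True` above indicates that python-dotenv loaded at least one variable, not that a particular key is present. A minimal sketch of an explicit check follows, assuming the `.env` file is meant to define `OPENAI_API_KEY` (the variable `OpenAIEmbeddings` reads by default). The helper name `missing_env_vars` is hypothetical and not part of any library.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Assumption: the .env file is expected to provide OPENAI_API_KEY.
missing = missing_env_vars(['OPENAI_API_KEY'])
if missing:
    print('Missing environment variables:', missing)
else:
    print('All required environment variables are set.')
```

Failing early here is usually easier to debug than an authentication error raised later by the first embedding call.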

## Setup Environment Variables

In [3]:
documents = ['Quantum mechanics describes the behavior of very small particles.',
 'Photosynthesis is the process by which green plants make food using sunlight.',
 "Shakespeare's plays are a testament to English literature.",
 'Artificial Intelligence aims to create machines that can think and learn.',
 'The pyramids of Egypt are historical monuments that have stood for thousands of years.']

### OpenAI Embedding Models

LangChain gives us access to OpenAI embedding models, including the newest ones: the smaller, highly efficient `text-embedding-3-small` model and the larger, more powerful `text-embedding-3-large` model.

In [4]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

## Vector Databases

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded too, and the stored vectors that are 'most similar' to the embedded query are retrieved. A vector database takes care of storing the embedded data and performing the vector search for you.
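
To make the embed, store, and retrieve loop concrete, here is a minimal pure-Python sketch. It is not how Chroma works internally: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the "database" is just a list scanned linearly.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace('.', ' ').replace('?', ' ')
    return Counter(cleaned.split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(store, query, k=1):
    """Embed the query and return the k stored texts most similar to it."""
    query_vec = toy_embed(query)
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

texts = [
    'Photosynthesis is the process by which green plants make food using sunlight.',
    'The pyramids of Egypt are historical monuments that have stood for thousands of years.',
]
# "Indexing": embed each text once and keep the vector next to it.
store = [(text, toy_embed(text)) for text in texts]

best = search(store, 'Do you know about the pyramids?', k=1)
print(best[0][0])
```

A real embedding model captures meaning rather than word overlap, and a real vector database replaces the linear scan with an index, but the overall flow is the same.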

### Chroma Vector DB

[Chroma](https://docs.trychroma.com/getting-started) is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

In [5]:
# delete vector db if exists
# (shutil works on every OS; the `rm` shell command is not available on Windows)
import shutil
shutil.rmtree('./chroma_db', ignore_errors=True)


### Create a Vector DB and persist on disk

Here we initialize a Chroma vector DB client. Because we want the data saved to disk, we pass the directory where it should be stored as `persist_directory`.

In [6]:
from langchain_chroma import Chroma

# create empty vector DB
chroma_db = Chroma(collection_name='search_docs',
                   embedding_function=openai_embed_model,
                   persist_directory="./chroma_db")

ValueError: The onnxruntime python package is not installed. Please install it with `pip install onnxruntime`

We take the sample documents defined earlier:

In [None]:
documents

We create document IDs to uniquely identify each document:

In [None]:
ids = ['doc_'+str(i) for i in range(len(documents))]
ids

We check the vector DB to confirm that it is empty:

In [None]:
chroma_db.get()

### Adding documents to Vector DB

Here we take our texts, pass them through the OpenAI embedding model to get embeddings, and add them to the Chroma vector DB.

If your documents are already in the LangChain `Document` format, you can use `add_documents` instead.

In [None]:
chroma_db.add_texts(texts=documents, ids=ids)

We check the vector DB again to confirm that these documents have been indexed successfully:

In [None]:
chroma_db.get()

Now we run some search queries against the vector DB:

In [None]:
query = 'Tell me about AI'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

In [None]:
query = 'Do you know about the pyramids?'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

In [None]:
query = 'What is Biology?'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs
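
Each result from `similarity_search_with_score` is a `(document, score)` pair. In the LangChain Chroma wrapper the score is a distance, so lower values mean closer matches. The sketch below assumes the collection uses squared Euclidean (L2) distance, which is an assumption about the default configuration and worth verifying for your Chroma version, and that the embeddings have unit length. Under those assumptions the distance equals `2 - 2 * cosine_similarity`, so ranking by distance and ranking by cosine similarity agree. The vectors are made-up three-dimensional examples, not real embeddings.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up unit-length "embeddings" for a query and two documents.
query_vec = normalize([1.0, 2.0, 0.5])
near_vec = normalize([1.0, 1.8, 0.6])
far_vec = normalize([-1.0, 0.2, 3.0])

for name, vec in [('near', near_vec), ('far', far_vec)]:
    distance = squared_l2(query_vec, vec)
    from_cosine = 2 - 2 * cosine_similarity(query_vec, vec)
    # The two columns match for unit-length vectors.
    print(name, round(distance, 4), round(from_cosine, 4))
```

This is also why the 'What is Biology?' query still returns a hit before any biology text is indexed: with `k=1` the nearest stored document is always returned, and only the size of the distance tells you how good the match is.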

### Adding more documents to our Vector DB

You can add new documents to the vector DB at any time, as shown below.

In [20]:
new_documents = [ 'Biology is the study of living organisms and their interactions with the environment.',
 'Music therapy can aid in the mental well-being of individuals.',
 'The Milky Way is just one of billions of galaxies in the universe.',
 'Economic theories help understand the distribution of resources in society.',
 'Yoga is an ancient practice that involves physical postures and meditation.']

In [None]:
new_ids = ['doc_'+str(i+len(ids)) for i in range(len(new_documents))]
new_ids
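
Offsetting by `len(ids)` works here because the IDs are sequential and nothing has been deleted yet. A slightly more defensive sketch derives the next number from the IDs that already exist. The helper name `next_doc_ids` is hypothetical, and the `doc_<number>` naming is simply the convention used in this notebook.

```python
def next_doc_ids(existing_ids, count, prefix='doc_'):
    """Generate `count` new IDs numbered after the highest existing `<prefix><N>` ID."""
    used_numbers = [
        int(doc_id[len(prefix):])
        for doc_id in existing_ids
        if doc_id.startswith(prefix) and doc_id[len(prefix):].isdigit()
    ]
    start = max(used_numbers) + 1 if used_numbers else 0
    return [f'{prefix}{n}' for n in range(start, start + count)]

existing = ['doc_0', 'doc_1', 'doc_2', 'doc_3', 'doc_4']
print(next_doc_ids(existing, 5))
```

In this notebook the existing IDs could be read from the `'ids'` entry of the dictionary returned by `chroma_db.get()`, which avoids collisions even after deletions.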

In [None]:
chroma_db.add_texts(texts=new_documents, ids=new_ids)

In [None]:
chroma_db.get()

In [None]:
query = 'What is Biology?'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

### Updating documents in the Vector DB

When building a real application, you will want to go beyond adding data and also update and delete it.

Chroma has users provide IDs to simplify the bookkeeping. To update documents, use the `update_documents` function as shown below.
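
To illustrate why user-provided IDs simplify bookkeeping, here is a toy ID-keyed store in plain Python. It is only a sketch of the idea, not Chroma's implementation: `ToyDocStore` and its methods are hypothetical, and a real vector store would also recompute the embedding whenever a document's text changes.

```python
class ToyDocStore:
    """Hypothetical in-memory store that keys every document by a caller-supplied ID."""

    def __init__(self):
        self._docs = {}

    def add(self, ids, texts):
        """Store each text under its ID."""
        for doc_id, text in zip(ids, texts):
            self._docs[doc_id] = text

    def update(self, doc_id, text):
        """Replace the text of an existing document; unknown IDs are an error."""
        if doc_id not in self._docs:
            raise KeyError(f'unknown document id: {doc_id}')
        self._docs[doc_id] = text

    def delete(self, ids):
        """Remove documents by ID, ignoring IDs that are not present."""
        for doc_id in ids:
            self._docs.pop(doc_id, None)

    def get(self, ids=None):
        """Return all documents, or only those with the requested IDs."""
        if ids is None:
            return dict(self._docs)
        return {doc_id: self._docs[doc_id] for doc_id in ids if doc_id in self._docs}

store = ToyDocStore()
store.add(['doc_0', 'doc_1'], ['first text', 'second text'])
store.update('doc_1', 'second text, revised')
store.delete(['doc_0'])
print(store.get())
```

Because every operation addresses a document by its ID, there is no need to search for the old text before replacing or removing it.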

In [None]:
chroma_db.get(['doc_3'])

In [None]:
from langchain_core.documents import Document

ids = ['doc_3']
texts = ['AI is known as Artificial Intelligence. Artificial Intelligence aims to create machines that can think and learn.']
# use `doc_id` as the loop variable to avoid shadowing the built-in `id`
documents = [Document(page_content=text, metadata={'doc': doc_id})
                for doc_id, text in zip(ids, texts)]
documents

In [27]:
chroma_db.update_documents(ids=ids,documents=documents)

In [None]:
chroma_db.get(['doc_3'])

In [None]:
query = 'What is AI?'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

### Deleting documents in the Vector DB

Because each document has a user-provided ID, you can delete documents with the `delete` function as shown below.

In [30]:
chroma_db.delete(['doc_9'])

In [None]:
chroma_db.get()

### Load Vector DB from disk

Once you have saved your DB to disk, you can load it at any time, connect to it, and run queries as shown below.

In [None]:
# load from disk
db = Chroma(persist_directory="./chroma_db",
            embedding_function=openai_embed_model,
            collection_name='search_docs')

query = 'What is AI?'
docs = db.similarity_search_with_score(query=query, k=1)
docs
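
Before loading, it can be worth checking that the persist directory actually holds data, since pointing the client at a wrong path may give you a fresh, empty store rather than an error (an assumption about the client's behavior that is worth verifying for your version). A minimal sketch follows; the helper name `has_persisted_db` is hypothetical, and it only checks for a non-empty directory, not for a valid Chroma database.

```python
import tempfile
from pathlib import Path

def has_persisted_db(path):
    """Return True if `path` is an existing directory containing at least one entry."""
    directory = Path(path)
    return directory.is_dir() and any(directory.iterdir())

# Demonstrate on a temporary directory instead of a real database.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = has_persisted_db(tmp)          # directory exists but is empty
    (Path(tmp) / 'placeholder.bin').write_bytes(b'')
    filled_result = has_persisted_db(tmp)         # directory now has one file
    print(empty_result, filled_result)
```

In this notebook the check would be called with `'./chroma_db'`, the same value passed as `persist_directory`.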