# Exploring Vector Database Operations in LangChain

## Install OpenAI and LangChain dependencies

In [None]:
!pip install langchain #==0.3.11
!pip install langchain-openai #==0.2.12
!pip install langchain-community #==0.3.11


## Install Chroma Vector DB and LangChain wrapper

In [None]:
!pip install langchain-chroma


## Enter OpenAI API Key

In [None]:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

In [None]:
documents = ['Quantum mechanics describes the behavior of very small particles.',
 'Photosynthesis is the process by which green plants make food using sunlight.',
 "Shakespeare's plays are a testament to English literature.",
 'Artificial Intelligence aims to create machines that can think and learn.',
 'The pyramids of Egypt are historical monuments that have stood for thousands of years.']

### OpenAI Embedding Models

LangChain gives us access to OpenAI's embedding models, including the newer `text-embedding-3-small` model, which is smaller and highly efficient, and the `text-embedding-3-large` model, which is larger and more powerful.

In [None]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

## Vector Databases

One of the most common ways to store and search unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded in the same way, and the stored vectors that are 'most similar' to the embedded query are retrieved. A vector database takes care of storing the embedded data and performing the vector search for you.
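
To make the embed, store, and search loop concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the `toy_store` and `toy_search` names are made up for illustration; a real embedding model produces far longer vectors, and a real vector database uses indexing structures rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-written "embeddings", one per stored item.
toy_store = {
    "physics": [0.9, 0.1, 0.0],
    "biology": [0.1, 0.9, 0.1],
    "history": [0.0, 0.2, 0.9],
}

def toy_search(query_vector, store, k=1):
    """Return the k keys whose stored vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query_vector, store[name]),
        reverse=True,
    )
    return ranked[:k]

# A query vector that points mostly along the second axis.
nearest = toy_search([0.2, 0.8, 0.0], toy_store, k=1)
print(nearest)
```

The query vector lies closest in direction to the `biology` entry, so that key is ranked first.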

### Chroma Vector DB

[Chroma](https://docs.trychroma.com/getting-started) is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

In [None]:
# delete vector db if exists
!rm -rf ./chroma_db

### Create a Vector DB and persist on disk

Here we initialize a connection to a Chroma vector DB client. Because we also want to save the data to disk, we pass the directory where it should be stored.

In [None]:
from langchain_chroma import Chroma

# create empty vector DB
chroma_db = Chroma(collection_name='search_docs',
                   embedding_function=openai_embed_model,
                   persist_directory="./chroma_db")

We use the sample documents defined earlier:

In [None]:
documents

['Quantum mechanics describes the behavior of very small particles.',
 'Photosynthesis is the process by which green plants make food using sunlight.',
 "Shakespeare's plays are a testament to English literature.",
 'Artificial Intelligence aims to create machines that can think and learn.',
 'The pyramids of Egypt are historical monuments that have stood for thousands of years.']

We create document IDs to uniquely identify each document

In [None]:
ids = ['doc_'+str(i) for i in range(len(documents))]
ids

['doc_0', 'doc_1', 'doc_2', 'doc_3', 'doc_4']

Check the vector DB to confirm that it is empty:

In [None]:
chroma_db.get()

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}

### Adding documents to Vector DB

Here we take our texts, pass them through the OpenAI embedder to get embeddings, and add them to the Chroma vector DB.

If your documents are in the LangChain `Document` format, you can use `add_documents` instead.

In [None]:
chroma_db.add_texts(texts=documents, ids=ids)

['doc_0', 'doc_1', 'doc_2', 'doc_3', 'doc_4']

We check our vector DB again to confirm that these documents have been indexed successfully:

In [None]:
chroma_db.get()

{'ids': ['doc_0', 'doc_1', 'doc_2', 'doc_3', 'doc_4'],
 'embeddings': None,
 'documents': ['Quantum mechanics describes the behavior of very small particles.',
  'Photosynthesis is the process by which green plants make food using sunlight.',
  "Shakespeare's plays are a testament to English literature.",
  'Artificial Intelligence aims to create machines that can think and learn.',
  'The pyramids of Egypt are historical monuments that have stood for thousands of years.'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [None, None, None, None, None]}

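Now we run some search queries against our vector DB. `similarity_search_with_score` returns each matching document together with a distance score, where a lower score means a closer match.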

In [None]:
query = 'Tell me about AI'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

[(Document(id='doc_3', metadata={}, page_content='Artificial Intelligence aims to create machines that can think and learn.'),
  0.8804225921630859)]

In [None]:
query = 'Do you know about the pyramids?'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

[(Document(id='doc_4', metadata={}, page_content='The pyramids of Egypt are historical monuments that have stood for thousands of years.'),
  0.8855744004249573)]

In [None]:
query = 'What is Biology?'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

[(Document(id='doc_1', metadata={}, page_content='Photosynthesis is the process by which green plants make food using sunlight.'),
  1.5097941160202026)]
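
The scores above are easier to read once converted to cosine similarities. The sketch below rests on two assumptions that the notebook does not state: that the collection uses squared Euclidean (L2) distance, which we believe is Chroma's default, and that the embedding vectors have unit length. Under those assumptions, distance `d` and cosine similarity are related by `cos = 1 - d / 2`. The helper names are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(distance):
    """Valid only for unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - distance / 2.0

# Check the identity on two small unit vectors.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
cosine_direct = sum(x * y for x, y in zip(a, b))
cosine_via_distance = cosine_from_squared_l2(squared_l2(a, b))

# Distances reported by the three queries above.
reported = {
    "Tell me about AI": 0.8804225921630859,
    "Do you know about the pyramids?": 0.8855744004249573,
    "What is Biology?": 1.5097941160202026,
}
implied = {query: cosine_from_squared_l2(d) for query, d in reported.items()}
for query, cosine in implied.items():
    print(f"{query!r}: implied cosine similarity {cosine:.3f}")
```

If the assumptions hold, the 'What is Biology?' query has a much lower implied similarity than the other two, which fits the fact that no document about biology has been stored yet.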

### Adding more documents to our Vector DB

You can add new documents to the vector DB at any time, as shown below.

In [None]:
new_documents = [ 'Biology is the study of living organisms and their interactions with the environment.',
 'Music therapy can aid in the mental well-being of individuals.',
 'The Milky Way is just one of billions of galaxies in the universe.',
 'Economic theories help understand the distribution of resources in society.',
 'Yoga is an ancient practice that involves physical postures and meditation.']

In [None]:
new_ids = ['doc_'+str(i+len(ids)) for i in range(len(new_documents))]
new_ids

['doc_5', 'doc_6', 'doc_7', 'doc_8', 'doc_9']
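
Computing new IDs from `len(ids)` works only while `ids` still holds the full original list. As a slightly more defensive sketch, the hypothetical helper below derives the next index from whatever IDs already exist, for example the `'ids'` list returned by `chroma_db.get()`. The `doc_` prefix follows this notebook's naming, and the function name is our own.

```python
def next_doc_ids(existing_ids, count, prefix="doc_"):
    """Generate `count` new IDs that continue after the highest existing index."""
    used = [
        int(doc_id[len(prefix):])
        for doc_id in existing_ids
        if doc_id.startswith(prefix) and doc_id[len(prefix):].isdigit()
    ]
    start = max(used) + 1 if used else 0
    return [f"{prefix}{n}" for n in range(start, start + count)]

existing = ["doc_0", "doc_1", "doc_2", "doc_3", "doc_4"]
more_ids = next_doc_ids(existing, 5)
print(more_ids)
```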

In [None]:
chroma_db.add_texts(texts=new_documents, ids=new_ids)

['doc_5', 'doc_6', 'doc_7', 'doc_8', 'doc_9']

In [None]:
chroma_db.get()

{'ids': ['doc_0',
  'doc_1',
  'doc_2',
  'doc_3',
  'doc_4',
  'doc_5',
  'doc_6',
  'doc_7',
  'doc_8',
  'doc_9'],
 'embeddings': None,
 'documents': ['Quantum mechanics describes the behavior of very small particles.',
  'Photosynthesis is the process by which green plants make food using sunlight.',
  "Shakespeare's plays are a testament to English literature.",
  'Artificial Intelligence aims to create machines that can think and learn.',
  'The pyramids of Egypt are historical monuments that have stood for thousands of years.',
  'Biology is the study of living organisms and their interactions with the environment.',
  'Music therapy can aid in the mental well-being of individuals.',
  'The Milky Way is just one of billions of galaxies in the universe.',
  'Economic theories help understand the distribution of resources in society.',
  'Yoga is an ancient practice that involves physical postures and meditation.'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': 

In [None]:
query = 'What is Biology?'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

[(Document(id='doc_5', metadata={}, page_content='Biology is the study of living organisms and their interactions with the environment.'),
  0.6786499619483948)]

### Updating documents in the Vector DB

When building toward a real application, you will want to go beyond adding data and also update and delete it.

Chroma relies on user-provided IDs to simplify this bookkeeping. To update a document, pass its ID and the new content to the `update_documents` function, as shown below.
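
The ID-based bookkeeping described above can be pictured as a dictionary keyed by document ID. The `ToyStore` class below is a hypothetical in-memory stand-in and is not how Chroma is implemented. It only shows why stable IDs make updates and deletes straightforward.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a vector store, keyed by document ID."""

    def __init__(self):
        self.records = {}  # doc ID -> {"text": ..., "metadata": ...}

    def add(self, ids, texts):
        for doc_id, text in zip(ids, texts):
            self.records[doc_id] = {"text": text, "metadata": None}

    def update(self, doc_id, text, metadata=None):
        if doc_id not in self.records:
            raise KeyError(f"unknown document ID: {doc_id}")
        # A real store would also recompute the embedding for the new text here.
        self.records[doc_id] = {"text": text, "metadata": metadata}

    def delete(self, ids):
        for doc_id in ids:
            self.records.pop(doc_id, None)  # ignore IDs that are not present

store = ToyStore()
store.add(["doc_0", "doc_1"], ["first text", "second text"])
store.update("doc_1", "second text, revised", metadata={"doc": "doc_1"})
store.delete(["doc_0"])
print(store.records)
```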

In [None]:
chroma_db.get(['doc_3'])

{'ids': ['doc_3'],
 'embeddings': None,
 'documents': ['Artificial Intelligence aims to create machines that can think and learn.'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [None]}

In [None]:
from langchain_core.documents import Document

ids = ['doc_3']
texts = ['AI is known as Artificial Intelligence. Artificial Intelligence aims to create machines that can think and learn.']
# wrap each text in a Document and record its ID in the metadata
documents = [Document(page_content=text, metadata={'doc': doc_id})
                for doc_id, text in zip(ids, texts)]
documents

[Document(metadata={'doc': 'doc_3'}, page_content='AI is known as Artificial Intelligence. Artificial Intelligence aims to create machines that can think and learn.')]

In [None]:
chroma_db.update_documents(ids=ids,documents=documents)

In [None]:
chroma_db.get(['doc_3'])

{'ids': ['doc_3'],
 'embeddings': None,
 'documents': ['AI is known as Artificial Intelligence. Artificial Intelligence aims to create machines that can think and learn.'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [{'doc': 'doc_3'}]}

In [None]:
query = 'What is AI?'
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

[(Document(id='doc_3', metadata={'doc': 'doc_3'}, page_content='AI is known as Artificial Intelligence. Artificial Intelligence aims to create machines that can think and learn.'),
  0.563372015953064)]

### Deleting documents in the Vector DB

The same user-provided IDs are used for deletion. To delete documents, pass their IDs to the `delete` function, as shown below.

In [None]:
chroma_db.delete(['doc_9'])

In [None]:
chroma_db.get()

{'ids': ['doc_0',
  'doc_1',
  'doc_2',
  'doc_3',
  'doc_4',
  'doc_5',
  'doc_6',
  'doc_7',
  'doc_8'],
 'embeddings': None,
 'documents': ['Quantum mechanics describes the behavior of very small particles.',
  'Photosynthesis is the process by which green plants make food using sunlight.',
  "Shakespeare's plays are a testament to English literature.",
  'AI is known as Artificial Intelligence. Artificial Intelligence aims to create machines that can think and learn.',
  'The pyramids of Egypt are historical monuments that have stood for thousands of years.',
  'Biology is the study of living organisms and their interactions with the environment.',
  'Music therapy can aid in the mental well-being of individuals.',
  'The Milky Way is just one of billions of galaxies in the universe.',
  'Economic theories help understand the distribution of resources in society.'],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': [None,
  None,
  None,
  {'doc':

### Load Vector DB from disk

Once you have saved your DB to disk, you can load it at any time by pointing Chroma at the same directory and collection name, and then run queries as shown below.

In [None]:
# load from disk
db = Chroma(persist_directory="./chroma_db",
            embedding_function=openai_embed_model,
            collection_name='search_docs')

query = 'What is AI?'
docs = db.similarity_search_with_score(query=query, k=1)
docs

[(Document(id='doc_3', metadata={'doc': 'doc_3'}, page_content='AI is known as Artificial Intelligence. Artificial Intelligence aims to create machines that can think and learn.'),
  0.5634348392486572)]
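
Before loading, it can help to confirm that the persist directory actually holds data. The sketch below assumes that pointing Chroma at a missing or empty directory yields a new, empty collection rather than an error, which is worth verifying for your Chroma version. The `has_persisted_db` helper is hypothetical, and the temporary directory and placeholder file only stand in for a real `./chroma_db` folder.

```python
import os
import tempfile

def has_persisted_db(path):
    """True if `path` is an existing directory that contains at least one entry."""
    return os.path.isdir(path) and len(os.listdir(path)) > 0

with tempfile.TemporaryDirectory() as tmp:
    empty_check = has_persisted_db(tmp)  # directory exists but holds nothing
    # Stand-in for the files a vector DB would write when persisting.
    open(os.path.join(tmp, "placeholder.bin"), "w").close()
    filled_check = has_persisted_db(tmp)

# The temporary directory has been removed, so this path no longer exists.
missing_check = has_persisted_db(os.path.join(tmp, "does_not_exist"))
print(empty_check, filled_check, missing_check)
```

In the notebook, a call such as `has_persisted_db("./chroma_db")` could guard the load step so that an accidental typo in the path does not silently produce empty search results.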