# Vector stores and retrievers
These abstractions support retrieving data from vector databases and other sources.
Important: they matter when an application fetches data to be reasoned over as part of model inference, as in retrieval-augmented generation (RAG).

Concepts:
- Documents
- Vector stores
- Retrievers

In [1]:
import getpass
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

## Documents
A `Document` represents a unit of text and its associated metadata. It has two attributes:
- `page_content`: a string representing the content.
- `metadata`: a dict containing arbitrary metadata.

In [2]:
from langchain_core.documents import Document

documents = [
  Document(
    page_content="Dogs are great companions, known for their loyalty and friendliness.",
    metadata={"source": "mammal-pets-doc"},
  ),
  Document(
    page_content="Cats are independent pets that often enjoy their own space.",
    metadata={"source": "mammal-pets-doc"},
  ),
  Document(
    page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
    metadata={"source": "fish-pets-doc"},
  ),
  Document(
    page_content="Parrots are intelligent birds capable of mimicking human speech.",
    metadata={"source": "bird-pets-doc"},
  ),
  Document(
    page_content="Rabbits are social animals that need plenty of space to hop around.",
    metadata={"source": "mammal-pets-doc"},
  ),
]

## Vector Stores
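Vector search is a common way to store and search unstructured data: text is stored as numeric vectors (embeddings), and a query is embedded the same way so that similar vectors can be found.
In LangChain, `VectorStore` objects contain methods for adding text and `Document` objects to the store and for querying them.
They are initialized with an embedding model, which determines how text is translated into numeric vectors.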
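
To make the idea concrete, here is a minimal pure-Python sketch of what a vector store does. It is not how Chroma or LangChain implement it: `toy_embed`, `cosine_similarity`, and `ToyVectorStore` are hypothetical names, and the bag-of-words "embedding" is an assumption standing in for a real embedding model.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace(".", "").replace(",", "").split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Stores (text, vector) pairs and ranks them against a query vector."""

    def __init__(self) -> None:
        self._entries = []  # list of (text, vector) tuples

    def add_texts(self, texts: list[str]) -> None:
        # Embed each text once, at insertion time.
        for text in texts:
            self._entries.append((text, toy_embed(text)))

    def similarity_search(self, query: str, k: int = 2) -> list[str]:
        # Embed the query the same way, then sort stored texts by similarity.
        query_vector = toy_embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "Cats are independent pets that often enjoy their own space.",
    "Parrots are intelligent birds capable of mimicking human speech.",
])
top_hits = store.similarity_search("independent pets", k=1)
print(top_hits)
```

A real store swaps the word counts for dense model embeddings and the linear scan for an index, but the add-then-rank flow is the same.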

In [7]:
# Here we will demonstrate usage of LangChain VectorStores using Chroma, which includes an in-memory implementation.
from langchain_chroma import Chroma
from langchain_community.embeddings import OllamaEmbeddings

# Create a VectorStore from the documents using OllamaEmbeddings
vectorstore = Chroma.from_documents(
    documents,
    embedding=OllamaEmbeddings(model="llama3"),
)

A vector store can be queried:
- Synchronously and asynchronously;
- By string query and by vector;
- With and without returning similarity scores;
- By similarity and by maximum marginal relevance (which balances similarity to the query against diversity in the retrieved results).
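
The maximum marginal relevance idea in the last bullet can be sketched in plain Python. This is an illustrative greedy version under stated assumptions, not the library's implementation: `cosine` and `mmr_select` are hypothetical helpers, and `lambda_mult` is the assumed name for the weight that trades relevance (1.0) against diversity (0.0).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, penalising similarity to earlier picks."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the closest match among already selected items.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
candidate_vecs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]  # the first two are near-duplicates

pure_similarity = mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=1.0)
diverse = mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.3)
print(pure_similarity, diverse)
```

With `lambda_mult=1.0` the two near-duplicates are returned; lowering it makes the second pick skip the duplicate in favour of the less similar but more distinct vector.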

In [8]:
# e.g. similarity to a string query:
vectorstore.similarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]

In [9]:
# async query
await vectorstore.asimilarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]

In [10]:
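# Return documents with scores. Scores are provider-specific; Chroma returns a
# distance here, so a lower score means a closer match.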
vectorstore.similarity_search_with_score("cat")

[(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
  44496.98828125),
 (Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
  45515.8515625),
 (Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
  47526.62890625),
 (Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'}),
  47820.3671875)]
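
The scores above increase down the list, which is what a distance looks like. As a rough illustration, assuming the score is something like a squared Euclidean distance (the metric actually used depends on how the collection is configured), ranking by distance looks like this; `squared_l2` and the sample vectors are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query_vector = [1.0, 2.0]
stored_vectors = {"near": [1.0, 3.0], "far": [4.0, 6.0]}

# Sort ascending: the smallest distance is the best match.
scored = sorted(
    ((name, squared_l2(query_vector, vec)) for name, vec in stored_vectors.items()),
    key=lambda pair: pair[1],
)
print(scored)  # [('near', 1.0), ('far', 25.0)]
```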

In [11]:
# Return documents based on similarity to an embedded query:
embedding = OllamaEmbeddings(model="llama3").embed_query("cat")

vectorstore.similarity_search_by_vector(embedding)

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]

## Retrievers
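LangChain `VectorStore` objects do not subclass `Runnable`, so they cannot be used directly in LangChain Expression Language (LCEL) chains.
LangChain retrievers are Runnables, so they implement a standard set of methods (such as `invoke` and `batch`) and can be composed into LCEL chains.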
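
To show what being a Runnable buys you, here is a minimal sketch of a wrapper that gives a plain search function `invoke` and `batch` methods with a pre-bound `k`. `SimpleRetriever` and `keyword_search` are hypothetical and only mimic the shape of the interface; they are not LangChain classes.

```python
from typing import Callable


class SimpleRetriever:
    """Hypothetical wrapper that exposes a search function via invoke/batch."""

    def __init__(self, search: Callable[..., list[str]], **bound_kwargs) -> None:
        self._search = search
        self._bound_kwargs = bound_kwargs  # e.g. k=1, fixed for every call

    def invoke(self, query: str) -> list[str]:
        return self._search(query, **self._bound_kwargs)

    def batch(self, queries: list[str]) -> list[list[str]]:
        # One result list per query, in the same order as the inputs.
        return [self.invoke(query) for query in queries]


def keyword_search(query: str, k: int = 4) -> list[str]:
    """Toy search: return up to k texts that contain the query string."""
    corpus = ["cats enjoy their own space", "dogs are loyal", "cats nap often"]
    return [text for text in corpus if query in text][:k]


retriever_sketch = SimpleRetriever(keyword_search, k=1)
batch_results = retriever_sketch.batch(["cats", "dogs"])
print(batch_results)
```

Because every component shares the same call methods, a chain can pass inputs from one step to the next without knowing what each step does internally.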

In [12]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)  # select top result

retriever.batch(["cat", "shark"])

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})]]

In [13]:
# Alternatively, build a retriever directly from the vector store:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(["cat", "shark"])

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})]]

In [14]:
# Example: a simple RAG chain. First, load a local model through Ollama.
from langchain_community.llms import Ollama
model = Ollama(model="llama3")

In [15]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

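# The dict runs the retriever on the input to fill `context` and passes the
# input through unchanged as `question`; the result feeds the prompt, then the model.
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | model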

In [17]:
response = rag_chain.invoke("tell me about cats")

print(response)

According to the provided document, cats are independent pets that often enjoy their own space.
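
The data flow of the chain above can be traced in plain Python, stopping just before the model call. This is a simplified sketch: `fake_retriever`, `build_inputs`, and `format_prompt` are hypothetical helpers, the retriever returns a fixed string instead of `Document` objects, and joining the context with newlines is an assumption about formatting.

```python
TEMPLATE = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""


def fake_retriever(question: str) -> list[str]:
    """Hypothetical retriever that always returns the same single passage."""
    return ["Cats are independent pets that often enjoy their own space."]


def build_inputs(question: str) -> dict:
    # Mirrors the dict step: the same input goes to the retriever and is also
    # passed through unchanged as the question.
    return {"context": fake_retriever(question), "question": question}


def format_prompt(inputs: dict) -> str:
    # Mirrors the prompt step: fill the template with both values.
    return TEMPLATE.format(
        question=inputs["question"],
        context="\n".join(inputs["context"]),
    )


prompt_text = format_prompt(build_inputs("tell me about cats"))
print(prompt_text)
```

The printed text is what a model would receive; the final step of the chain sends it to the model and returns the answer.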
