# Vector Stores and Retrievers

This notebook will familiarize you with LangChain's vector store and retriever abstractions. These abstractions are designed to support retrieval of data, from (vector) databases and other sources, for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in the case of retrieval-augmented generation (RAG).

We will cover:

* Documents
* Vector Stores
* Retrievers


In [None]:
!pip install langchain
!pip install langchain-chroma
!pip install langchain-groq
!pip install langchain-huggingface
!pip install python-dotenv

# Documents

LangChain implements a Document abstraction, which is intended to represent a unit of text and its associated metadata. It has two attributes:

* page_content: a string representing the content;
* metadata: a dict containing arbitrary metadata. The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.

Let's generate some sample documents:


In [3]:
from langchain_core.documents import Document
documents=[
    Document(
        page_content="Dogs are great companions, known from their loyalty and friendliness.",
        metadata={"source":"mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source":"mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for begineers,requiring relatively simple care.",
        metadata={"source":"fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech",
        metadata={"source":"bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social  animals that need plenty of space to hop around.",
        metadata={"source":"mammal-pets-doc"},
    ),
]

In [4]:
documents

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known from their loyalty and friendliness.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': 'fish-pets-doc'}, page_content='Goldfish are popular pets for begineers,requiring relatively simple care.'),
 Document(metadata={'source': 'bird-pets-doc'}, page_content='Parrots are intelligent birds capable of mimicking human speech'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social  animals that need plenty of space to hop around.')]

In [10]:
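import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq

# Load GROQ_API_KEY and HF_TOKEN from a local .env file
load_dotenv()

groq_api_key=os.getenv("GROQ_API_KEY")

# Re-export the Hugging Face token only if it is set; assigning None to os.environ raises a TypeError
hf_token=os.getenv("HF_TOKEN")
if hf_token:
    os.environ["HF_TOKEN"]=hf_token

llm=ChatGroq(groq_api_key=groq_api_key,model="Llama3-8b-8192")
llm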

ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x0000017F1CC9FA00>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x0000017F1E90C9A0>, model_name='Llama3-8b-8192', groq_api_key=SecretStr('**********'))

In [12]:
from langchain_huggingface import HuggingFaceEmbeddings


In [None]:
embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [14]:
# Vector store: embed each document's text as a vector and store those vectors in the Chroma vector database
from langchain_chroma import Chroma
vectorstore=Chroma.from_documents(documents,embedding=embedding)

In [15]:
vectorstore

<langchain_chroma.vectorstores.Chroma at 0x17f3deaada0>

In [18]:
vectorstore.similarity_search("cats")

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social  animals that need plenty of space to hop around.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known from their loyalty and friendliness.'),
 Document(metadata={'source': 'bird-pets-doc'}, page_content='Parrots are intelligent birds capable of mimicking human speech')]

In [19]:
# Async query
await vectorstore.asimilarity_search("cats")
# `await` still waits for the result, but the event loop is free to run other tasks in the meantime.

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social  animals that need plenty of space to hop around.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known from their loyalty and friendliness.'),
 Document(metadata={'source': 'bird-pets-doc'}, page_content='Parrots are intelligent birds capable of mimicking human speech')]
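To make the behaviour of `await` more concrete, here is a small sketch that uses only Python's `asyncio`. The `fake_search` coroutine is a hypothetical stand-in for an async search call (it is not part of LangChain), and the sleep merely simulates I/O latency. Awaiting does wait for each result; the benefit is that several searches can be in flight at the same time.

```python
import asyncio


async def fake_search(query: str, delay: float = 0.2) -> list[str]:
    """Hypothetical stand-in for an async vector store search."""
    await asyncio.sleep(delay)  # simulates waiting on I/O
    return [f"result for {query}"]


async def run_concurrently(queries: list[str]) -> list[list[str]]:
    # gather starts all searches together and waits until every one has finished
    return await asyncio.gather(*(fake_search(q) for q in queries))


results = asyncio.run(run_concurrently(["cats", "dogs", "parrots"]))
print(results)
```

This form suits a plain script. Inside a notebook an event loop is already running, so `await run_concurrently([...])` would be used in place of `asyncio.run(...)`.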

In [20]:
vectorstore.similarity_search_with_score("cats") # the score is a distance, so the lowest score marks the nearest document

[(Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
  0.8780325651168823),
 (Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social  animals that need plenty of space to hop around.'),
  1.4694515466690063),
 (Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known from their loyalty and friendliness.'),
  1.5009397268295288),
 (Document(metadata={'source': 'bird-pets-doc'}, page_content='Parrots are intelligent birds capable of mimicking human speech'),
  1.5552324056625366)]
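The ranking above can be illustrated with a minimal sketch, assuming a Euclidean-style distance (the exact metric depends on how the Chroma collection is configured). The three-dimensional vectors below are made up for illustration and are not real embeddings; the point is only that results are sorted by ascending distance to the query vector.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up 3-dimensional "embeddings"; real models produce hundreds of dimensions
toy_vectors = {
    "cats doc": [0.9, 0.1, 0.0],
    "dogs doc": [0.6, 0.5, 0.1],
    "goldfish doc": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # pretend this is the embedding of "cats"

# Score every document, then sort ascending: smallest distance first
scored = sorted(
    ((name, squared_l2(query_vector, vec)) for name, vec in toy_vectors.items()),
    key=lambda pair: pair[1],
)
for name, score in scored:
    print(f"{name}: {score:.2f}")
```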

# Retrievers

LangChain vector store objects do not subclass Runnable, and so cannot immediately be integrated into LangChain Expression Language (LCEL) chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations) and are designed to be incorporated in LCEL chains.

We can create a simple version of this ourselves, without subclassing Retriever. Once we choose which method we wish to use to retrieve documents, we can easily wrap it in a runnable. Below we will build one around the similarity_search method.


In [22]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

retriever=RunnableLambda(vectorstore.similarity_search).bind(k=1) # k=1 returns only the single nearest document

In [23]:
retriever.batch(["cat","dog"])

[[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.')],
 [Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known from their loyalty and friendliness.')]]
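As a rough plain-Python analogy for the cell above (not LangChain's implementation), binding an argument resembles `functools.partial`, and batching resembles mapping a callable over a list of inputs. The `toy_search` function and its tiny corpus are hypothetical.

```python
from functools import partial


def toy_search(query: str, k: int = 4) -> list[str]:
    """Hypothetical search function: returns up to k fake hits for a query."""
    corpus = ["cats doc", "dogs doc", "rabbits doc", "parrots doc"]
    # Rank documents that mention the query first, then keep the top k
    ranked = sorted(corpus, key=lambda doc: query not in doc)
    return ranked[:k]


# Roughly what .bind(k=1) does: fix a keyword argument ahead of time
top1_search = partial(toy_search, k=1)


def batch(func, inputs: list[str]) -> list[list[str]]:
    # Roughly what .batch() does: apply the callable to every input
    return [func(item) for item in inputs]


batch_results = batch(top1_search, ["cat", "dog"])
print(batch_results)
```

The sketch processes inputs one after another; a real Runnable may process a batch concurrently.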

Vector stores implement an as_retriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify which methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:

In [24]:
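# Assign the result; otherwise the batch call below would still use the RunnableLambda retriever defined earlier
retriever=vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k":1}
)
retriever.batch(["cat","dogs"])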

[[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.')],
 [Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known from their loyalty and friendliness.')]]
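Conceptually, search_type selects which store method is called and search_kwargs are forwarded to it as keyword arguments. The following is a toy sketch of that idea under stated assumptions: `ToyStore`, `ToyRetriever`, and the `longest_first` strategy are all hypothetical and only illustrate the dispatch pattern, not LangChain's actual code.

```python
class ToyStore:
    """Hypothetical store with two ways of searching a list of strings."""

    def __init__(self, docs: list[str]) -> None:
        self.docs = docs

    def similarity_search(self, query: str, k: int = 4) -> list[str]:
        # Crude stand-in for similarity: substring match
        return [d for d in self.docs if query in d][:k]

    def longest_first_search(self, query: str, k: int = 4) -> list[str]:
        # A second, made-up strategy so there is something to choose between
        hits = [d for d in self.docs if query in d]
        return sorted(hits, key=len, reverse=True)[:k]


class ToyRetriever:
    """Picks a store method from search_type and forwards search_kwargs."""

    def __init__(self, store: ToyStore, search_type: str, search_kwargs: dict) -> None:
        methods = {
            "similarity": store.similarity_search,
            "longest_first": store.longest_first_search,
        }
        if search_type not in methods:
            raise ValueError(f"unknown search_type: {search_type}")
        self._search = methods[search_type]
        self._kwargs = search_kwargs

    def invoke(self, query: str) -> list[str]:
        return self._search(query, **self._kwargs)


store = ToyStore(["cats nap", "cats chase mice", "dogs bark"])
toy_retriever = ToyRetriever(store, search_type="similarity", search_kwargs={"k": 1})
print(toy_retriever.invoke("cats"))
```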

In [25]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message="""
Answer this question using the provided context only.

{question}

Context:
{context}
"""
prompt=ChatPromptTemplate.from_messages([("human",message)])

rag_chain={"context":retriever,"question":RunnablePassthrough()}|prompt|llm
response=rag_chain.invoke("tell me about dogs")
print(response.content)

According to the provided content, dogs are great companions, known for their loyalty and friendliness.
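The data flow of the chain above can be sketched in plain Python: the dict step turns one input string into two keys, and the prompt step fills the template from them. This is only an analogy under stated assumptions; `toy_rag_retriever`, `TEMPLATE`, and `build_prompt` are hypothetical names, the word-overlap retrieval is a stand-in for a real vector search, and the final model call is omitted.

```python
TEMPLATE = (
    "Answer this question using the provided context only.\n\n"
    "{question}\n\n"
    "Context:\n{context}"
)


def toy_rag_retriever(question: str) -> list[str]:
    """Hypothetical retriever: keeps documents sharing a word with the question."""
    corpus = [
        "Dogs are great companions.",
        "Cats are independent pets.",
    ]
    question_words = set(question.lower().split())
    return [
        doc for doc in corpus
        if question_words & set(doc.lower().rstrip(".").split())
    ]


def build_prompt(question: str) -> str:
    # Step 1: the dict fans the single input out into two keys
    inputs = {
        # plays the role of the "context": retriever branch (joined for readability)
        "context": "\n".join(toy_rag_retriever(question)),
        # plays the role of RunnablePassthrough(): the input is forwarded unchanged
        "question": question,
    }
    # Step 2: the template fills its placeholders from that dict
    return TEMPLATE.format(**inputs)


rag_prompt = build_prompt("tell me about dogs")
print(rag_prompt)
```

In the real chain, the formatted prompt is then passed to the model, which produces the answer shown above.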
