# Vector store as Retriever

Vector stores index unstructured data as embeddings and retrieve it efficiently by similarity. A vector store can be used as a retriever by calling its `as_retriever()` method.


In [39]:
# load the sample data
import pandas as pd
data = pd.read_csv("sample_text_data.csv")

In [40]:
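# these integrations live in langchain_community; the langchain.* paths are deprecated
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore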

In [41]:
# embedding model
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

In [42]:
embed_model = HuggingFaceEmbeddings(model_name = "thenlper/gte-base")

In [43]:
# creating langchain-faiss vector store
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore

In [44]:
EMBED_DIMS = 768

In [45]:
langchain_index = faiss.IndexFlatL2(EMBED_DIMS)

In [46]:
vector_store = FAISS(embedding_function = embed_model,
                    index = langchain_index,
                    docstore = InMemoryDocstore(),
                    index_to_docstore_id = {})
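
`faiss.IndexFlatL2` performs exact, brute-force search and reports squared Euclidean distances. The block below is a minimal pure-Python sketch of that idea, not the faiss implementation; the function names and the 3-dimensional toy vectors are hypothetical and exist only for illustration.

```python
# Toy stand-in for exact (brute-force) L2 search; not the faiss implementation.

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(index_vectors, query, k=4):
    """Return (position, distance) pairs for the k vectors closest to query."""
    scored = [(pos, squared_l2(vec, query)) for pos, vec in enumerate(index_vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Hypothetical 3-dimensional "embeddings"; real gte-base vectors have 768 dims.
toy_vectors = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
]
toy_query = [1.0, 0.0, 0.0]

nearest = flat_l2_search(toy_vectors, toy_query, k=2)
print(nearest)
```

Every stored vector is compared with the query, so results are exact but the cost grows linearly with the number of documents.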

In [47]:
from langchain_core.documents import Document

all_documents = [ Document(
    page_content = sentence,
    metadata = {"source": "raw_data"})
                 for sentence in data['0'] ]

In [20]:
vector_store.add_documents(documents = all_documents,
                          ids = list(range(len(all_documents))))

[0, 1, 2, ..., 183, 184]


In [21]:
vector_retriever = vector_store.as_retriever()

In [24]:
vector_retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x144af4c10>, search_kwargs={})

In [25]:
vector_retriever.invoke("who plays football")

[Document(id='28', metadata={'source': 'raw_data'}, page_content='An American footballer is wearing a red and white strip.A man wearing a red football uniform and gray gloves looks to the left.'),
 Document(id='1320', metadata={'source': 'raw_data'}, page_content='A football player kicks the ball.An Oklahoma football player attempts to kick the ball.'),
 Document(id='143', metadata={'source': 'raw_data'}, page_content='A group of football players running down the field.Football players are on the field.'),
 Document(id='1013', metadata={'source': 'raw_data'}, page_content='Kids playing tag football.Young boys play touch football.')]

# BM25 Retriever

BM25 is a traditional keyword-based retriever. It is a bag-of-words ranking function that scores documents by the query terms they contain, regardless of where those terms appear in the document. It also takes inverse document frequency (IDF) into account, so rare terms contribute more to the score than common ones.

The LangChain integration of BM25 uses the `rank_bm25` package.
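
To make the scoring concrete, here is a minimal sketch of a textbook BM25 formula in plain Python. It is an illustration only: the tokenizer, the non-negative IDF variant, and the defaults `k1=1.5` and `b=0.75` are assumptions of this sketch, and `rank_bm25` may differ in its exact IDF formula and preprocessing.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in docs against query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a larger IDF weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates via k1 and is length-normalised via b.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical mini-corpus, loosely modelled on the sample sentences.
toy_docs = [
    "a football player kicks the ball",
    "a dog plays in the snow",
    "kids play touch football",
]
toy_scores = bm25_scores("who plays football", toy_docs)
print(toy_scores)
```

Because matching is purely lexical, the dog sentence scores well on the exact token "plays" even though it has nothing to do with football.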

In [33]:
from langchain_community.retrievers import BM25Retriever

In [29]:
bm25_retriever = BM25Retriever.from_documents(all_documents)

In [30]:
bm25_retriever.invoke("who plays football")

[Document(metadata={'source': 'raw_data'}, page_content='A shirtless man playing football on a lawn.A shirtless man catches a football'),
 Document(metadata={'source': 'raw_data'}, page_content='A football player kicks the ball.An Oklahoma football player attempts to kick the ball.'),
 Document(metadata={'source': 'raw_data'}, page_content='Two boys in a field kicking a soccer ball.Two football players bang into each other during a football game.'),
 Document(metadata={'source': 'raw_data'}, page_content='A dog plays in the water.a dog plays in the snow.')]

# Ensemble Retrievers 

It is possible to combine multiple retrievers using ensembling. This is particularly useful when you have multiple retrievers that are good at finding different types of relevant documents.
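
`EnsembleRetriever` merges the ranked lists from its retrievers with weighted Reciprocal Rank Fusion. The sketch below shows the idea in plain Python; the function name, the constant `c=60`, and the short document labels are assumptions made for illustration, not the library's code.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists; each item earns weight / (rank + c) per list."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, item in enumerate(ranking, start=1):
            scores[item] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical labels standing in for retrieved documents.
keyword_ranking = ["shirtless", "kicks", "soccer", "dog"]
vector_ranking = ["footballer", "kicks", "players", "kids"]

fused = weighted_rrf([keyword_ranking, vector_ranking], weights=[0.3, 0.7])
print(fused)
```

An item returned by both retrievers accumulates score from each list, so it can outrank items that only one retriever placed first.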

In [32]:
from langchain.retrievers import EnsembleRetriever

In [36]:
ensemble_retriever = EnsembleRetriever(
    retrievers = [bm25_retriever, vector_retriever], weights = [0.3, 0.7])

In [37]:
ensemble_retriever.invoke("who plays football")

[Document(metadata={'source': 'raw_data'}, page_content='A football player kicks the ball.An Oklahoma football player attempts to kick the ball.'),
 Document(id='28', metadata={'source': 'raw_data'}, page_content='An American footballer is wearing a red and white strip.A man wearing a red football uniform and gray gloves looks to the left.'),
 Document(id='143', metadata={'source': 'raw_data'}, page_content='A group of football players running down the field.Football players are on the field.'),
 Document(id='1013', metadata={'source': 'raw_data'}, page_content='Kids playing tag football.Young boys play touch football.'),
 Document(metadata={'source': 'raw_data'}, page_content='A shirtless man playing football on a lawn.A shirtless man catches a football'),
 Document(metadata={'source': 'raw_data'}, page_content='Two boys in a field kicking a soccer ball.Two football players bang into each other during a football game.'),
 Document(metadata={'source': 'raw_data'}, page_content='A dog p

# Source document retention

Retrievers use some kind of index to make documents easily searchable. Indexing can include a transformation step (e.g., vector stores often split documents into chunks). Whatever transformation is used, it can be very useful to retain a link between the transformed document and the original, so that the retriever can return the original document.

LangChain has two different retrievers that can be used to address this challenge. The Multi-Vector retriever allows the user to use any document transformation (e.g., use an LLM to write a summary of the document) for indexing while retaining linkage to the source document. The ParentDocument retriever links document chunks from a text-splitter transformation for indexing while retaining linkage to the source document.

The `ParentDocumentRetriever` balances two competing needs: small chunks produce embeddings that reflect their meaning accurately, while larger documents preserve context. It splits documents and stores the small chunks; during retrieval, it first fetches the small chunks, then looks up their parent ids and returns the larger parent documents.
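
The chunk-to-parent lookup can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the fixed-width splitter, the word-overlap score, and all names and sample texts are hypothetical, and the splitter assumes `chunk_size` is larger than `chunk_overlap`.

```python
def split_into_chunks(text, chunk_size, chunk_overlap):
    """Fixed-width character chunks that overlap by chunk_overlap characters."""
    step = chunk_size - chunk_overlap  # assumes chunk_size > chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def build_child_index(parents, chunk_size=40, chunk_overlap=10):
    """Return (chunk, parent_id) pairs; the parent id is the link to keep."""
    return [
        (chunk, parent_id)
        for parent_id, text in parents.items()
        for chunk in split_into_chunks(text, chunk_size, chunk_overlap)
    ]

def word_overlap(query, chunk):
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve_parent(query, child_index, parents):
    """Find the best-matching chunk, then return its whole parent document."""
    best_chunk, parent_id = max(
        child_index, key=lambda pair: word_overlap(query, pair[0])
    )
    return best_chunk, parents[parent_id]

# Hypothetical two-page document.
toy_parents = {
    "page-1": "Welcome letter. Your policy has been issued. Please read all terms carefully.",
    "page-2": "Premium payment is due every year. Late payment may lapse the policy.",
}
toy_index = build_child_index(toy_parents)
chunk, parent = retrieve_parent("premium payment due", toy_index, toy_parents)
print(repr(chunk))
print(parent)
```

The search runs over the small chunks, but the caller receives the full parent text, which is the behaviour the retriever is designed to provide.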

In [50]:
# for this we will use a PDF so that we can observe the difference
from langchain_community.document_loaders import PyPDFLoader

pdf_file = "~/Desktop/langchain_concepts/hdfc_policy_doc.pdf"

In [52]:
pdf_loader = PyPDFLoader(pdf_file)
# will load each page as a document, with lazy loading
document_pages = []
async for page in pdf_loader.alazy_load():
    document_pages.append(page)

In [53]:
len(document_pages)

48

Now we can give `ParentDocumentRetriever` a text splitter, which breaks each document into overlapping chunks, and a vector store, which embeds those chunks for retrieval. The retriever keeps track of the id of the original document that each chunk came from.

In [64]:
from langchain.retrievers import ParentDocumentRetriever

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain.storage import InMemoryStore

from langchain_community.vectorstores import FAISS

In [81]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 200,
                                               chunk_overlap = 20)

In [82]:
EMBED_DIMS = 768
embed_model = HuggingFaceEmbeddings(model_name = "thenlper/gte-base")

In [83]:
faiss_index = faiss.IndexFlatL2(EMBED_DIMS)
vector_store = FAISS(embedding_function = embed_model,
                    index = faiss_index,
                    docstore = InMemoryDocstore(),
                    index_to_docstore_id = {})

In [84]:
retriever = ParentDocumentRetriever(
    vectorstore = vector_store,
    docstore = InMemoryStore(),
    child_splitter = text_splitter
)

In [85]:
retriever.add_documents(document_pages, ids = None)

Now we can compare the two. Searching the vector store directly returns only the most similar chunk, whereas the parent document retriever returns the entire document to which the most similar chunk belongs.

In [86]:
vector_store.similarity_search("hdfc insurance", k = 1)

[Document(id='5d1fcdcf-18bb-43f6-9e3c-b2665ad3e69e', metadata={'producer': 'Microsoft® Word LTSC', 'creator': 'Microsoft® Word LTSC', 'creationdate': '2025-03-10T17:06:03+05:30', 'author': 'harinis', 'moddate': '2025-03-10T17:06:03+05:30', 'source': '/Users/sarmehta/Desktop/langchain_concepts/hdfc_policy_doc.pdf', 'total_pages': 48, 'page': 0, 'page_label': '1', 'doc_id': 'c03615e4-daa8-462b-8fbf-14b05a66007d'}, page_content='IST) or email us at service@hdfclife.com. Please quote your Policy number in all correspondence. Our postal')]

In [88]:
retriever.invoke("hdfc insurance")[0]

Document(metadata={'producer': 'Microsoft® Word LTSC', 'creator': 'Microsoft® Word LTSC', 'creationdate': '2025-03-10T17:06:03+05:30', 'author': 'harinis', 'moddate': '2025-03-10T17:06:03+05:30', 'source': '/Users/sarmehta/Desktop/langchain_concepts/hdfc_policy_doc.pdf', 'total_pages': 48, 'page': 0, 'page_label': '1'}, page_content='Page 1 of 48 \n \n \nPart A \n(Welcome Letter) \n<<Date>> \n<<Policyholder’s Name>>  \n<<Policyholder’s Address>> \n<<Policyholder’s Contact Number>> \n \nSub: Your Policy no. <<>>-HDFC Life Sampoorn Nivesh Plus \n \nDear <<Policyholder’s Name>>,  \n \nWe are glad to inform you that your proposal has been accepted and the HDFC Life Sampoorn  Nivesh Plus \nPolicy (“Policy”) being this Policy, has been issued. We have made every effort to design your Policy in a \nsimple format. We have highlighted items of importance so that you may recognize them easily. \n \nPolicy document: \nAs evidence of the insurance contract between HDFC Life Insurance Company Limit