# Build a semantic search engine

### Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:

- Documents and document loaders;
- Text splitters;
- Embeddings;
- Vector stores and retrievers.


In [20]:
%%capture --no-stderr
%pip install -qU langchain-community pypdf langchain-ollama langchain_unstructured

### Documents and Document Loaders

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

- page_content: a string representing the content;
- metadata: a dict containing arbitrary metadata;
- id: (optional) a string identifier for the document.

The `metadata` attribute can capture information about the source of the document, its relationship to other documents, and other information.

Note that an individual `Document` object often represents a chunk of a larger document.


In [None]:
from langchain_core.documents import Document

# generate sample docs
docs = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    )
]
# docs

In [21]:
# loading docs
# from langchain_community.document_loaders import PyPDFLoader
file_path='./data/zomato-q1fy25.pdf'
# loader=PyPDFLoader(file_path=file_path) # PyPDFLoader loads one Document object per PDF page.
# docs=loader.load()
# len(docs) # 581

# f"{docs[0].page_content[:200]}\n" # page_content
# f"{docs[0].metadata}" # metadata
import getpass
import os

if "UNSTRUCTURED_API_KEY" not in os.environ:
    os.environ["UNSTRUCTURED_API_KEY"] = getpass.getpass("Unstructured API Key:")

from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(
    file_path=file_path,
    strategy="hi_res",
    partition_via_api=True,
    coordinates=True,
)
docs = []
for doc in loader.lazy_load():
    docs.append(doc)

INFO: Preparing to split document for partition.
INFO: Starting page number set to 1
INFO: Allow failed set to 0
INFO: Concurrency level set to 5
INFO: Splitting pages 1 to 28 (28 total)
INFO: Determined optimal split size of 6 pages.
INFO: Partitioning 4 files with 6 page(s) each.
INFO: Partitioning 1 file with 4 page(s).
INFO: Partitioning set #1 (pages 1-6).
INFO: Partitioning set #2 (pages 7-12).
INFO: Partitioning set #3 (pages 13-18).
INFO: Partitioning set #4 (pages 19-24).
INFO: Partitioning set #5 (pages 25-28).
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Su

In [22]:
# splitting docs
"""
For both information retrieval and downstream question-answering 
purposes, a page may be too coarse a representation. Our goal 
in the end will be to retrieve Document objects that answer an 
input query, and further splitting our PDF will help ensure that 
the meanings of relevant portions of the document are not "washed out" 
by surrounding text.

We can use text splitters for this purpose.
"""

from langchain_text_splitters import RecursiveCharacterTextSplitter

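text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # maximum characters per chunk
    chunk_overlap=200,  # characters shared between consecutive chunks
    add_start_index=True,  # record each chunk's offset in metadata["start_index"]
)

all_splits = text_splitter.split_documents(docs)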

len(all_splits)

335
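
To make the `chunk_size`, `chunk_overlap`, and `add_start_index` settings concrete, here is a minimal sketch of fixed-width chunking with overlap in plain Python. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break on separators such as paragraph and line boundaries. `chunk_text` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[Dict[str, object]]:
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[Dict[str, object]] = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.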

In [23]:
# embeddings
"""
Vector search is a common way to store and search 
over unstructured data (such as unstructured text). 
The idea is to store numeric vectors that are associated 
with the text. Given a query, we can embed it as a vector 
of the same dimension and use vector similarity metrics 
(such as cosine similarity) to identify related text.
"""

from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="llama3.2:latest")

vec_1=embeddings.embed_query(all_splits[0].page_content)
vec_2=embeddings.embed_query(all_splits[1].page_content)

assert len(vec_1) == len(vec_2)
f"Generated vecs of length {len(vec_1)}\n"
# vec_1[:10]

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


'Generated vecs of length 3072\n'
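
The cosine similarity mentioned above can be written in a few lines of plain Python. This is a sketch for intuition only: `cosine_similarity` is a hypothetical helper, and the two-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # antiparallel vectors
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not magnitude, which is why the first pair scores (approximately) 1 even though one vector is twice as long as the other.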

In [24]:
# vectorstore
# Armed with a model for generating text embeddings, 
# we can next store them in a special data structure 
# that supports efficient similarity search.
from langchain_core.vectorstores import InMemoryVectorStore

vec_store = InMemoryVectorStore(embedding=embeddings)  # init vectorstore w/ embedding model
ids = vec_store.add_documents(documents=all_splits)
len(ids)

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


335
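
Conceptually, the simplest structure that supports similarity search is a list of (text, vector) pairs that is scanned in full for each query. The sketch below illustrates that idea with hand-written 2-D vectors standing in for real embeddings. `ToyVectorStore` and its methods are hypothetical names, and this is not a description of how `InMemoryVectorStore` is implemented.

```python
import math
from typing import List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyVectorStore:
    """Brute-force store: keeps every vector and scores all of them per query."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, List[float]]] = []

    def add(self, text: str, vector: Sequence[float]) -> None:
        self._rows.append((text, list(vector)))

    def search(self, query_vector: Sequence[float], k: int = 4) -> List[Tuple[str, float]]:
        scored = [(text, _cosine(query_vector, vec)) for text, vec in self._rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        return scored[:k]


store = ToyVectorStore()
store.add("dogs", [1.0, 0.0])
store.add("cats", [0.8, 0.6])
store.add("stocks", [0.0, 1.0])

hits = store.search([1.0, 0.1], k=2)
for text, score in hits:
    print(f"{text}: {score:.3f}")
```

A full scan costs time proportional to the number of stored vectors per query, which is fine for a few hundred chunks; larger collections typically rely on an approximate nearest-neighbour index instead.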

In [None]:
# querying
"""
Once we've instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

Synchronously and asynchronously;
By string query and by vector;
With and without returning similarity scores;
By similarity and by maximum marginal relevance (which balances similarity to the query against diversity among the retrieved results).
"""

results = vec_store.similarity_search(
    "25,000 unique SKUs"
)
results[0]

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


"Doc:Deepinder: Today, Zomato and Blinkit are our two large consumer businesses and both of them serve customers' needs at home. However, we also have one of India’s largest ‘going-out’ businesses. Our dining-out business which helps our customers discover restaurants when they want to go out and dine at restaurants. This dining-out business is now operating at a run-rate of $500m+ annualised GOV and is already proﬁtable."

In [37]:
# different vectorsearch queries
results = vec_store.similarity_search_with_score("Blinkit?")
doc, score = results[0]
# A bare f-string in the middle of a cell is evaluated and discarded, so print it.
print(f"Score: {score}")
print(f"Doc:{doc.page_content}")

embedding = embeddings.embed_query("How is zomato performing against competitors?")
results = vec_store.similarity_search_by_vector(embedding)
results[0]

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


Document(id='f299860a-30e2-4cf6-8111-7002a485947e', metadata={'source': './data/zomato-q1fy25.pdf', 'coordinates': {'points': [[130.0000083333333, 1069.017333984375], [130.0000083333333, 1263.828125], [1512.6430320583331, 1263.828125], [1512.6430320583331, 1069.017333984375]], 'system': 'PixelSpace', 'layout_width': 1656, 'layout_height': 2339}, 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 6, 'parent_id': '9e23952d215bb433a3f9c13b20b971fa', 'filename': 'zomato-q1fy25.pdf', 'category': 'NarrativeText', 'element_id': '7fe99e400d7cff82bd3be308b7ccbde6', 'start_index': 0}, page_content="Deepinder: Today, Zomato and Blinkit are our two large consumer businesses and both of them serve customers' needs at home. However, we also have one of India’s largest ‘going-out’ businesses. Our dining-out business which helps our customers discover restaurants when they want to go out and dine at restaurants. This dining-out business is now operating at a run-rate of $500m+ annuali
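
Maximum marginal relevance, listed among the query options earlier, can be sketched in plain Python as a greedy loop: at each step, pick the candidate whose relevance to the query is high and whose similarity to the already selected results is low. This follows the textbook formulation and makes no claim about the details of LangChain's implementation. The function `mmr`, the parameter name `lambda_mult`, and the toy vectors are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k documents chosen greedily by marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favour diversity.
    """
    relevance = [_cosine(query_vec, vec) for vec in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []

    def score(i: int) -> float:
        # Redundancy is the similarity to the closest already-selected document.
        redundancy = max((_cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs_2d = [
    [0.9, 0.1],   # most relevant
    [0.9, 0.12],  # near-duplicate of the first
    [0.7, -0.7],  # less relevant, but different
]
diverse = mmr(query, docs_2d, k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs_2d, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=0.5` the near-duplicate is skipped in favour of the less similar third vector, whereas `lambda_mult=1.0` reduces to ordinary similarity ranking.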

In [40]:
# retrievers
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain

@chain
def retriever(query: str) -> List[Document]:
    return vec_store.similarity_search(query)

retriever.batch([
    "Can you share some data points that can help us appreciate assortment category expansion on the Blinkit platform over the past couple of years?",
    "Any ESG updates?"
])

# vectorstore implements as_retriever
retriever = vec_store.as_retriever(
    search_type="similarity",  # the keyword is search_type, not searchType
    search_kwargs={"k": 1},  # return only the top match per query
)
retriever.batch([
    "Can you share some data points that can help us appreciate assortment category expansion on the Blinkit platform over the past couple of years?",
    "Any ESG updates?"
])

INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO: HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


[[Document(id='8489ad70-5e75-4b61-b291-2d893de36852', metadata={'source': './data/zomato-q1fy25.pdf', 'coordinates': {'points': [[122.37303161621094, 1104.950927734375], [122.37303161621094, 1204.6910400390625], [1449.06201171875, 1204.6910400390625], [1449.06201171875, 1104.950927734375]], 'system': 'PixelSpace', 'layout_width': 1656, 'layout_height': 2339}, 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 5, 'filename': 'zomato-q1fy25.pdf', 'category': 'Title', 'element_id': 'a3960af8dcbd496b2b99a7d0cec717c4', 'start_index': 0}, page_content='Q7. Can you share some data points that can help us appreciate assortment/ category expansion on the Blinkit platform over the past couple of years?')],
 [Document(id='f299860a-30e2-4cf6-8111-7002a485947e', metadata={'source': './data/zomato-q1fy25.pdf', 'coordinates': {'points': [[130.0000083333333, 1069.017333984375], [130.0000083333333, 1263.828125], [1512.6430320583331, 1263.828125], [1512.6430320583331, 1069.017333984375]