## Semantic Search (RAG)

This guide focuses on retrieval of text data. We will cover the following concepts:

- Documents and document loaders
- Text splitters
- Embeddings
- Vector stores and retrievers


### Documents

In [1]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

### PDF Loader

In [2]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "C:/Users/asus/OneDrive/Desktop/pia/langchain_tutorial/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

len(docs)

107

In [3]:
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

{'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'C:/Users/asus/OneDrive/Desktop/pia/langchain_tutorial/nke-10k-2023.pdf', 'total_pages': 107, 'page': 0, 'page_label': '1'}


### Splitting

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

516
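
To make `chunk_size`, `chunk_overlap`, and `add_start_index` concrete, here is a minimal sketch of a fixed-size sliding window over a string. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to break on separators such as paragraphs and line breaks, so its chunk boundaries will differ. The function name `split_with_overlap` and the sample text are made up for illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Slide a fixed-size window over `text`; return (start_index, chunk) pairs."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghij" * 3  # 30 characters of stand-in text
for start_index, chunk in split_with_overlap(sample, chunk_size=12, chunk_overlap=4):
    print(start_index, repr(chunk))
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk. The recorded start offset plays the same role as the `start_index` metadata field.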

### Embedding Models

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [6]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 768

[0.047472208738327026, 0.02167578600347042, -0.009018083103001118, 0.00535670667886734, 0.02555767446756363, -0.010230272077023983, -0.00841403380036354, 0.039303917437791824, 0.021570531651377678, -0.024095429107546806]
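
Embedding vectors are useful because texts with similar meaning should map to nearby vectors. One common way to measure closeness is cosine similarity. The sketch below uses only the standard library and toy 3-dimensional vectors; `cosine_similarity` is a helper written for this note, not a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

In the notebook, the same helper could be called as `cosine_similarity(vector_1, vector_2)` to compare the first two chunks.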


## Vector Stores

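A vector store holds `Document` objects together with their embeddings and supports similarity queries over them. LangChain offers integrations with many backends, including FAISS, MongoDB, and Qdrant.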

In [7]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

In [8]:
ids = vector_store.add_documents(documents=all_splits)

In [9]:
results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)


Score: 0.8137387523451609

page_content='Table of Contents
YEAR ENDED MAY 31,
(Dollars in millions) 2023 2022 2021
REVENUES
North America $ 21,608 $ 18,353 $ 17,179 
Europe, Middle East & Africa 13,418 12,479 11,456 
Greater China 7,248 7,547 8,290 
Asia Pacific & Latin America 6,431 5,955 5,343 
Global Brand Divisions 58 102 25 
Total NIKE Brand 48,763 44,436 42,293 
Converse 2,427 2,346 2,205 
Corporate 27 (72) 40 
TOTAL NIKE, INC. REVENUES $ 51,217 $ 46,710 $ 44,538 
EARNINGS BEFORE INTEREST AND TAXES
North America $ 5,454 $ 5,114 $ 5,089 
Europe, Middle East & Africa 3,531 3,293 2,435 
Greater China 2,283 2,365 3,243 
Asia Pacific & Latin America 1,932 1,896 1,530 
Global Brand Divisions (4,841) (4,262) (3,656)
Converse 676 669 543 
Corporate (2,840) (2,219) (2,261)
Interest expense (income), net (6) 205 262 
TOTAL NIKE, INC. INCOME BEFORE INCOME TAXES $ 6,201 $ 6,651 $ 6,661 
ADDITIONS TO PROPERTY, PLANT AND EQUIPMENT
North America $ 283 $ 146 $ 98 
Europe, Middle East & Africa 21
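
Conceptually, an in-memory vector store can be as simple as a list of (text, vector) pairs plus a brute-force ranking step. The sketch below illustrates that idea under stated assumptions: `ToyVectorStore`, `toy_embed`, and the tiny vocabulary are invented for this note, and the method name only mirrors the LangChain call used above. It is not the `InMemoryVectorStore` implementation.

```python
import math

VOCAB = ["revenue", "distribution", "centers", "incorporated"]


def toy_embed(text):
    """Hypothetical stand-in embedding: word counts over a tiny fixed vocabulary."""
    words = text.lower().replace("?", "").replace(".", "").split()
    return [float(words.count(term)) for term in VOCAB]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0  # treat zero vectors as "no similarity"


class ToyVectorStore:
    """Brute-force in-memory store: keeps (text, vector) pairs in a list."""

    def __init__(self, embed):
        self.embed = embed  # any function mapping str -> list[float]
        self.entries = []

    def add_texts(self, texts):
        for text in texts:
            self.entries.append((text, self.embed(text)))

    def similarity_search_with_score(self, query, k=4):
        query_vector = self.embed(query)
        scored = [(text, _cosine(query_vector, vec)) for text, vec in self.entries]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
        return scored[:k]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "Total revenue by region",
    "Locations of distribution centers",
    "Where the company was incorporated",
])
results = store.similarity_search_with_score("How many distribution centers are there?", k=1)
print(results)
```

A real store swaps `toy_embed` for a learned embedding model. At larger scale it also replaces the linear scan with an approximate nearest-neighbor index.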

### Retrievers

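Retrievers are LangChain runnables, so it helps to first look at the Runnable interface and the LangChain Expression Language (LCEL):
- Abstraction: a runnable encapsulates any callable entity, such as a language model, prompt template, retriever, or custom function.
- Standard interface: every runnable implements the same set of methods (e.g., synchronous and asynchronous invoke and batch operations).
- Chainable: runnables can be composed into pipelines with the `|` operator.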

In [10]:
from langchain_core.runnables import RunnableLambda

def add_one(x):
    return x + 1

def multiply_by_two(x):
    return x * 2

runnable = RunnableLambda(add_one) | RunnableLambda(multiply_by_two)
result = runnable.invoke(3)  # Outputs 8
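
The `|` syntax is ordinary Python operator overloading. As a rough sketch of the idea (not LangChain's actual `Runnable` implementation), a hypothetical `Step` class can define `__or__` so that `a | b` returns a new object that feeds the output of `a` into `b`:

```python
class Step:
    """Wraps a one-argument function and supports `|` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def batch(self, values):
        # Apply the same step to each input independently.
        return [self.invoke(value) for value in values]

    def __or__(self, other):
        # `a | b` builds a new Step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


def add_one(x):
    return x + 1


def multiply_by_two(x):
    return x * 2


pipeline = Step(add_one) | Step(multiply_by_two)
print(pipeline.invoke(3))         # (3 + 1) * 2
print(pipeline.batch([1, 2, 3]))  # one result per input
```

Because the composed object exposes the same `invoke` and `batch` methods as its parts, it can itself be piped into further steps.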

In [11]:
import os

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

# Read the API key from the environment rather than hard-coding it.
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-lite",
    google_api_key=os.getenv("GOOGLE_API_KEY"),
)

prompt = ChatPromptTemplate.from_template("tell me a joke about {topic}")

chain = prompt | llm | StrOutputParser()

chain.invoke({"topic": "bears"})

"Why don't bears like fast food?\n\nBecause they can't bear to wait!"

In [12]:
analysis_prompt = ChatPromptTemplate.from_template("is this a funny joke? {joke}")

composed_chain = {"joke": chain} | analysis_prompt | llm | StrOutputParser()

composed_chain.invoke({"topic": "bears"})

'Yes, that\'s a classic and generally considered a funny joke! It\'s a pun, which is a type of joke that plays on the different meanings of a word or the fact that there are words that sound alike but have different meanings. In this case, "bear feet" sounds like "bare feet" (feet without shoes).\n\nHere\'s why it works:\n\n*   **Unexpected twist:** The setup leads you to expect a logical reason, but the punchline provides a silly, pun-based answer.\n*   **Wordplay:** The humor comes from the clever use of language.\n*   **Simplicity:** It\'s short, easy to understand, and doesn\'t require any complex knowledge.\n*   **Relatability:** It\'s a common joke that many people have heard, and the familiarity makes it fun.\n\nSo, yes, it\'s a funny joke!'
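
In `composed_chain`, the leading `{"joke": chain}` is a plain dict. LCEL coerces it into a `RunnableParallel`, which passes the same input to every value and returns a dict of results. That dict is what fills the `{joke}` slot of `analysis_prompt`. The pure-Python sketch below mimics the data flow with stand-in functions (`run_parallel`, `make_joke`, and `fill_analysis_prompt` are invented names, and no LLM is called).

```python
def run_parallel(steps, value):
    """Call every function in `steps` with the same input; collect results by key.

    This sketch runs the functions one after another; the name only echoes
    RunnableParallel, whose real implementation may run branches concurrently.
    """
    return {key: func(value) for key, func in steps.items()}


def make_joke(inputs):
    # Stand-in for `chain`: a real chain would call the LLM here.
    return f"a joke about {inputs['topic']}"


def fill_analysis_prompt(inputs):
    # Stand-in for the prompt template: fills the {joke} slot.
    return f"is this a funny joke? {inputs['joke']}"


filled = fill_analysis_prompt(run_parallel({"joke": make_joke}, {"topic": "bears"}))
print(filled)
```

Note that `composed_chain` runs `chain` again rather than reusing the earlier result. That is presumably why the analysis above discusses a different pun from the joke printed in the previous cell.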

In [13]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(id='3b021906-1339-4f19-98ba-6c6380ea75ae', metadata={'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creator': 'EDGAR Filing HTML Converter', 'creationdate': '2023-07-20T16:22:00-04:00', 'title': '0000320187-23-000039', 'author': 'EDGAR Online, a division of Donnelley Financial Solutions', 'subject': 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31', 'keywords': '0000320187-23-000039; ; 10-K', 'moddate': '2023-07-20T16:22:08-04:00', 'source': 'C:/Users/asus/OneDrive/Desktop/pia/langchain_tutorial/nke-10k-2023.pdf', 'total_pages': 107, 'page': 26, 'page_label': '27', 'start_index': 804}, page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our\nwholesale, NIKE Direct and merchandising strategies in the region, among other functions.\nIn the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis