# semantic search

Semantic search is an important step in RAG (retrieval-augmented generation).

Q: What does RAG do?

A:

- A RAG program combines a given question with retrieved context into a prompt for an LLM.

- RAG is a way to improve the prompt before talking to the LLM.
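
As a rough illustration of the first bullet, the sketch below assembles a prompt from a question and some already-retrieved context. The function name `build_prompt` and the template wording are assumptions made for this example; they are not part of LangChain.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved context and the user's question into one prompt string."""
    # Number each chunk so the individual pieces of context stay distinguishable.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    question="What did I have for breakfast?",
    context_chunks=["I had pancakes and scrambled eggs for breakfast this morning."],
)
print(prompt)
```

The retrieval step covered in the rest of this notebook is what supplies `context_chunks`.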

In [1]:
import dotenv
import os
os.getcwd()

'/home/ruchirich/Documents/repositories/hello-langchain/langchain'

In [2]:
dotenv.load_dotenv()

True

# embedding models

An embedding model turns text into vectors.

In [3]:
from langchain_ollama import OllamaEmbeddings

In [4]:
embeddings_model = OllamaEmbeddings(model="llama3.2")

In [5]:
embeddings = embeddings_model.embed_documents(
    texts=[
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)

In [6]:
len(embeddings)

5

In [7]:
len(embeddings[0])

3072

In [8]:
query_embedding = embeddings_model.embed_query(text="What is the meaning of life?")

In [9]:
len(query_embedding)

3072

# math basics

A refresher on cosine similarity: the dot product of two vectors divided by the product of their norms.

In [10]:
import numpy as np

In [11]:
def dot_product(v1, v2):
    return np.dot(v1, v2)

In [12]:
def cosine_similarity(v1, v2):
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)

    v1_dot_v2 = dot_product(v1, v2)

    cosine = v1_dot_v2 / (norm_v1 * norm_v2)

    return cosine

In [13]:
v1 = np.array(embeddings)

In [14]:
v2 = np.array(query_embedding)

In [15]:
dot_product(v1, v2)

array([0.40480284, 0.53689173, 0.64684796, 0.25815021, 0.54585133])

In [16]:
cosine_similarity(v1, v2)

array([0.18103333, 0.24010528, 0.2892792 , 0.11544828, 0.24411213])
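
One caveat about the call above: `v1` is a 2-D array, and `np.linalg.norm` without an `axis` argument returns a single norm for the whole matrix (the Frobenius norm) rather than one norm per row. The result is then the dot products divided by one shared constant, which preserves the ranking but is in general not the per-document cosine similarity. A minimal row-wise sketch follows; the name `cosine_similarity_rows` and the toy vectors are made up for this example.

```python
import numpy as np


def cosine_similarity_rows(matrix, vector):
    """Cosine similarity between each row of `matrix` and a single `vector`."""
    matrix = np.asarray(matrix, dtype=float)
    vector = np.asarray(vector, dtype=float)
    row_norms = np.linalg.norm(matrix, axis=1)  # one norm per row
    vector_norm = np.linalg.norm(vector)
    return (matrix @ vector) / (row_norms * vector_norm)


docs = np.array([[3.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])

# Approximately [1.0, 0.0, 0.7071]: same direction, orthogonal, 45 degrees apart.
print(cosine_similarity_rows(docs, query))
```

When every embedding already has unit length, the row-wise cosine similarity and the plain dot product coincide.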

# vector store

A vector store keeps embedded vectors and searches them by similarity.

In [17]:
from langchain_core.vectorstores import InMemoryVectorStore

In [18]:
vector_store = InMemoryVectorStore(embedding=embeddings_model)

# Document

A document is the source of truth for RAG.

The `Document` class holds a piece of text (`page_content`) together with its `metadata`.

In [19]:
from langchain_core.documents import Document

In [20]:
document_1 = Document(
    page_content="I had chocalate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

documents = [document_1, document_2]

In [21]:
vector_store.add_documents(documents=documents, ids=['id1', 'id2'])

['id1', 'id2']

In [26]:
query = "I need a recipe for meals?"
docs = vector_store.similarity_search(query, k=1)

In [27]:
docs

[Document(id='id1', metadata={'source': 'tweet'}, page_content='I had chocalate chip pancakes and scrambled eggs for breakfast this morning.')]
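
To make the mechanics concrete, here is a toy stand-in for what a similarity search does: embed everything, score each stored vector against the query, and return the top `k`. The `ToyVectorStore` class and the word-count `toy_embed` function are hypothetical and exist only for this sketch; a real store such as `InMemoryVectorStore` delegates embedding to the model it was given.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical embedding: a bag-of-words count vector (not a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self):
        self._items = []  # list of (id, text, vector) tuples

    def add_texts(self, texts, ids):
        for doc_id, text in zip(ids, texts):
            self._items.append((doc_id, text, toy_embed(text)))
        return list(ids)

    def similarity_search(self, query, k=1):
        query_vector = toy_embed(query)
        # Rank stored items by similarity to the query, most similar first.
        ranked = sorted(
            self._items,
            key=lambda item: cosine(item[2], query_vector),
            reverse=True,
        )
        return [(doc_id, text) for doc_id, text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts(
    ["pancakes and eggs for breakfast", "cloudy weather forecast for tomorrow"],
    ids=["id1", "id2"],
)
print(store.similarity_search("what is the weather tomorrow", k=1))
```

Word counts only match on shared words, which is why a learned embedding model is used in practice: it can relate a query about recipes to a text about pancakes even without overlapping vocabulary.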

# Practice

## Document loader

In [28]:
from langchain_community.document_loaders import PyPDFLoader

In [29]:
file_path = "10-k_AAPL.pdf"

loader = PyPDFLoader(file_path)
docs = loader.load()

In [30]:
len(docs) # 121 pages converted to 121 documents

121

In [31]:
print(f"{docs[0].page_content[:1000]}\n")
print(docs[0].metadata)

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☒  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended September 28, 2024
or
☐  TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from              to             .
Commission File Number: 001-36743
Apple Inc.
(Exact name of Registrant as specified in its charter)
California 94-2404110
(State or other jurisdictionof incorporation or organization) (I.R.S. Employer Identification No.)
One Apple Park Way
Cupertino, California 95014
(Address of principal executive offices) (Zip Code)
(408) 996-1010
(Registrant’s telephone number, including area code)
Securities registered pursuant to Section 12(b) of the Act:
Title of each class Trading symbol(s) Name of each exchange on which registered
Common Stock, $0.00001 par value per shareAAPL The Nasdaq Stock Market LLC
0.000% Notes du

## Splitting

A whole page is too coarse a unit for semantic search.

We want to retrieve small, specific chunks that are relevant to the input query, so each page is split further.

In [32]:
from langchain_text_splitters import RecursiveCharacterTextSplitter # recommended text splitter for generic text use cases

In [33]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True  # store each chunk's character offset in metadata["start_index"]
)

In [34]:
all_splits = text_splitter.split_documents(documents=docs)
len(all_splits)

549
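
The interplay of `chunk_size` and `chunk_overlap` can be seen in a simplified fixed-window splitter. This is a sketch only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, which the hypothetical `split_with_overlap` helper below does not attempt.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int):
    """Return a list of (start_index, chunk) pairs from a fixed-size sliding window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in split_with_overlap(text, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
```

Each chunk begins with the last four characters of the previous one, and the recorded offset plays the role of the `start_index` metadata that appears in the search results later in this notebook.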

Work in progress, following https://python.langchain.com/docs/tutorials/retrievers/#embeddings

## Embedding model

In [40]:
vector_1 = embeddings_model.embed_query(all_splits[0].page_content)
vector_2 = embeddings_model.embed_query(all_splits[1].page_content)

In [42]:
print(len(vector_1))
print(vector_1[:10])

3072
[0.024086894, -0.0043432284, -0.00025116676, -0.026937498, 0.011013382, -0.029059304, 0.014889716, -0.014015334, 0.011201528, -0.0095463265]


## vector store

In [43]:
vector_store = InMemoryVectorStore(
    embedding=embeddings_model
)

In [44]:
# index the documents
ids = vector_store.add_documents(
    documents=all_splits
)

In [48]:
results = vector_store.similarity_search(
    "When was Apple incorporated"
)

print(results[0])

page_content='Events of Default
Each of the following events are defined in the Indentures as an “event of default” (whatever the reason for such event of default and whether or not itwill be voluntary or involuntary or be effected by operation of law or pursuant to any judgment, decree or order of any court or any order, rule or regulation of any
administrative or governmental body) with respect to the debt securities of any series:
(1)    default in the payment of any installment of interest on any debt securities of such series for 30 days after becoming due;
(2)    default in the payment of principal of or premium, if any, on any debt securities of such series when it becomes due and payable at its stated
maturity, upon optional redemption, upon declaration or otherwise;
(3)    default in the performance, or breach, of any covenant or agreement of ours in the applicable Indenture with respect to the debt securities of such' metadata={'source': '10-k_AAPL.pdf', 'page': 68, 'page_lab

In [52]:
results = vector_store.similarity_search_with_score(
    query="How does Apple comply with data privacy regulations?"
)

doc, score = results[0]
print(score)

print(doc)

0.5288635709496877
page_content='The Company’s global operations are subject to complex and changing laws and regulations on subjects, including antitrust; privacy, data security and data
localization; consumer protection; advertising, sales, billing and e-commerce; financial services and technology; product liability; intellectual property ownershipand infringement; digital platforms; machine learning and artificial intelligence; internet, telecommunications and mobile communications; media, television, filmand digital content; availability of third-party software applications and services; labor and employment; anticorruption; import, export and trade; foreign
exchange controls and cash repatriation restrictions; anti–money laundering; foreign ownership and investment; tax; and environmental, health and safety,including electronic waste, recycling, product design and climate change.' metadata={'source': '10-k_AAPL.pdf', 'page': 15, 'page_label': '16', 'start_index': 3546}
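
The score returned alongside each document can be used to discard weak matches. The sketch below assumes that a higher score means a closer match, as with cosine similarity; other vector stores may return a distance where lower is better, so check the store's documentation. The `filter_by_score` helper, the threshold, and the sample pairs are hypothetical.

```python
def filter_by_score(results, min_score):
    """Keep (item, score) pairs whose score is at least `min_score`, best first."""
    kept = [(item, score) for item, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up (text, score) pairs standing in for similarity_search_with_score output.
sample_results = [("chunk A", 0.53), ("chunk B", 0.31), ("chunk C", 0.47)]
print(filter_by_score(sample_results, min_score=0.4))
```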


In [54]:
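# Q: What difference does it make to pass a pre-embedded query instead of a string?
# A: similarity_search embeds the query string with the store's embedding model and then
#    searches by vector, so for the same text and model the two calls should rank alike.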

embedded_query = embeddings_model.embed_query("What are some environmental initiatives at Apple?")

results = vector_store.similarity_search_by_vector(
    embedding=embedded_query
)

print(results[0])

page_content='The Company’s ability to compete successfully depends heavily on ensuring the continuing and timely introduction of innovative new products, services and
technologies to the marketplace. The Company designs and develops nearly the entire solution for its products, including the hardware, operating system,numerous software applications and related services. Principal competitive factors important to the Company include price, product and service features
(including security features), relative price and performance, product and service quality and reliability, design innovation, a strong third-party software andaccessories ecosystem, marketing and distribution capability, service and support, and corporate reputation.
The Company is focused on expanding its market opportunities related to smartphones, personal computers, tablets, wearables and accessories, and services.' metadata={'source': '10-k_AAPL.pdf', 'page': 5, 'page_label': '6', 'start_index': 0}


In [55]:
# A vector store is not a Runnable; as_retriever() wraps it in a retriever, which is:

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k":1}
)

retriever.batch([
    "How many stores does Apple have?",
    "What is the revenue projection for the next year?"
])

[[Document(id='b5032bc9-3932-435c-87d1-237a4e48c6b3', metadata={'source': '10-k_AAPL.pdf', 'page': 4, 'page_label': '5', 'start_index': 0}, page_content='Services\nAdvertising\nThe Company’s advertising services include third-party licensing arrangements and the Company’s own advertising platforms.\nAppleCare\nThe Company offers a portfolio of fee-based service and support products under the AppleCare brand. The offerings provide priority access to Apple technical\nsupport, access to the global Apple authorized service network for repair and replacement services, and in many cases additional coverage for instances ofaccidental damage or theft and loss, depending on the country and type of product.\nCloud Services\nThe Company’s cloud services store and keep customers’ content up-to-date and available across multiple Apple devices and Windows personal computers.\nDigital Content\nThe Company operates various platforms, including the App Store, that allow customers to discover and downlo