<a href="https://colab.research.google.com/github/sudhirshahu51/RAG/blob/main/pdf_query_with_chromadb_llamaindex_with__cross_reranker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [29]:
!pip install llama-index langchain chromadb pypdf llama-index-llms-groq --quiet

# 1. **Text Extraction from PDF file**

In [2]:
from pypdf import PdfReader

reader = PdfReader('/content/drive/MyDrive/Colab Notebooks/Data/microsoft_annual_report_2022.pdf')
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]
len(pdf_texts)

90

In [3]:
#for text in pdf_texts[:5]:
print(pdf_texts[0])

1 
Dear shareholders, colleagues, customers, and partners:  
We are living through a period of historic economic, societal, and geopolitical change. The world in 2022 looks nothing like 
the world in 2019. As I write this, inflation is at a 40 -year high, supply chains are stretched, and the war in Ukraine is 
ongoing. At the same time, we are entering a technological era with the potential to power awesome advancements 
across every sector of our economy and society. As the world’s largest software company, this places us at a historic 
intersection of opportunity and responsibility to the world around us.  
Our mission to empower every person and every organization on the planet to achieve more has never been more 
urgent or more necessary. For all the uncertainty in the world, one thing is clear: People and organizations in every 
industry are increasingly looking to digital technology to overcome today’s challenges and emerge stronger. And no 
company is better positioned to help t

# 2. **Splitting Chunks**

In [4]:
# Note: sentence- and token-based splitting can also be done using LlamaIndex, or via the LangChain wrapper in LlamaIndex.
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [5]:
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

#print(word_wrap(character_split_texts[10]))
print(f"\nTotal chunks: {len(character_split_texts)}")


Total chunks: 344
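To make the recursive strategy concrete, here is a minimal pure-Python sketch of the idea behind the separator list above: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. `recursive_split` is a hypothetical helper written for illustration. It is not LangChain's implementation, which also handles overlap and other details.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters.

    Illustrative only: tries each separator in order, coarsest first.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate          # keep growing the current chunk
            continue
        if current:
            chunks.append(current)       # flush what has been collected so far
        if len(part) <= chunk_size:
            current = part
        else:
            # This piece alone is too long: retry it with finer separators
            chunks.extend(recursive_split(part, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
demo_chunks = recursive_split(sample, ["\n\n", "\n", ". ", " ", ""], chunk_size=10)
print(demo_chunks)
```

The paragraph break is used first, and only the second paragraph, which is still longer than 10 characters, gets split again on spaces.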


In [6]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

#print(word_wrap(token_split_texts[10]))
print(f"\nTotal chunks: {len(token_split_texts)}")


Total chunks: 349


# **3. Create embeddings**

In [7]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction


In [8]:
embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))

[array([ 4.25627008e-02,  3.32118124e-02,  3.03400941e-02, -3.48665677e-02,
        6.84165210e-02, -8.09090883e-02, -1.54743837e-02, -1.45092339e-03,
       -1.67444516e-02,  6.77077174e-02, -5.05413823e-02, -4.91953641e-02,
        5.13999313e-02,  9.19273049e-02, -7.17783868e-02,  3.95196825e-02,
       -1.28335450e-02, -2.49474961e-02, -4.62286212e-02, -2.43575163e-02,
        3.39496508e-02,  2.55024210e-02,  2.73171309e-02, -4.12621116e-03,
       -3.63383405e-02,  3.69089260e-03, -2.74304301e-02,  4.79671173e-03,
       -2.88962368e-02, -1.88706759e-02,  3.66662778e-02,  2.56958883e-02,
        3.13128494e-02, -6.39343634e-02,  5.39440550e-02,  8.22534934e-02,
       -4.17568274e-02, -6.99579250e-03, -2.34860424e-02, -3.07479482e-02,
       -2.97919312e-03, -7.79093876e-02,  9.35312267e-03,  3.16284667e-03,
       -2.22570375e-02, -1.82946529e-02, -9.61245876e-03, -3.15068923e-02,
       -5.51966717e-03, -3.27030383e-02,  1.68029770e-01, -4.74596955e-02,
       -5.00168316e-02, 

In [10]:
from tqdm import tqdm

In [11]:
#progress_bar = tqdm(total=len(token_split_texts)) may be used

chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("microsoft_annual_report_20221", embedding_function=embedding_function)

ids = [str(i) for i in tqdm(range(len(token_split_texts)))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

100%|██████████| 349/349 [00:00<00:00, 837900.46it/s]


349

In [12]:
query = "What was the total revenue?"

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]
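Conceptually, the query call embeds the query text with the collection's embedding function and returns the `n_results` stored chunks whose embeddings are closest to it. The sketch below shows that idea as a brute-force search over made-up 3-dimensional vectors. The names `squared_l2`, `top_k`, and `toy_store` are hypothetical, and the squared Euclidean distance is an assumption for illustration. A real vector store typically uses an approximate index rather than scanning every vector.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, store, k):
    """Return the k (doc_id, distance) pairs closest to query_vec."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 3-dimensional "embeddings" (made up for illustration)
toy_store = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.0, 1.0, 0.0],
    "2": [0.9, 0.1, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(nearest)
```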

In [13]:
import textwrap

In [14]:
for document in retrieved_documents:
    print(textwrap.fill(document))
    print('\n')

74 note 13 — unearned revenue unearned revenue by segment was as
follows : ( in millions ) june 30, 2022 2021 productivity and business
processes $ 24, 558 $ 22, 120 intelligent cloud 19, 371 17, 710 more
personal computing 4, 479 4, 311 total $ 48, 408 $ 44, 141 changes in
unearned revenue were as follows : ( in millions ) year ended june 30,
2022 balance, beginning of period $ 44, 141 deferral of revenue 110,
455 recognition of unearned revenue ( 106, 188 ) balance, end of
period $ 48, 408 revenue allocated to remaining performance
obligations, which includes unearned revenue and amounts that will be
invoiced and recognized as revenue in future periods, was $ 193
billion as of june 30, 2022, of which $ 189 billion is related to the
commercial portion of revenue. we expect to recognize approximately 45
% of this revenue over the next 12 months and the remainder
thereafter. note 14 — leases


that are not sold separately. • we tested the mathematical accuracy of
management ’ s calculat

# 4. Calling LLM using GROQ API

In [15]:
from google.colab import userdata
from llama_index.llms.groq import Groq
from llama_index.core.llms import ChatMessage
groq_key = userdata.get('GROQ_API')
llm = Groq(model="llama3-70b-8192", api_key=groq_key) #https://docs.llamaindex.ai/en/stable/examples/llm/groq/#streaming

In [16]:
response = llm.complete("Explain the importance of low latency LLMs")

In [17]:
print(response)

Low-latency Large Language Models (LLMs) are crucial in various applications where real-time or near-real-time processing is essential. Here are some reasons why low-latency LLMs are important:

1. **Real-time Conversational AI**: In conversational AI, such as chatbots, voice assistants, and customer service platforms, low-latency LLMs enable rapid response times, making interactions feel more natural and human-like. This is particularly important in applications where users expect immediate responses, such as in customer support or virtual assistants.
2. **Live Streaming and Broadcasting**: Low-latency LLMs are essential for live streaming and broadcasting applications, such as real-time transcription, subtitles, and closed captions. This ensures that viewers receive accurate and timely information, enhancing their overall experience.
3. **Gaming and Interactive Systems**: In gaming and interactive systems, low-latency LLMs enable faster processing of player inputs, allowing for more 

In [18]:
response = llm.stream_complete("Explain the importance of low latency LLMs")

In [19]:
for r in response:
    print(r.delta, end="")

Low-latency Large Language Models (LLMs) are crucial in various applications where real-time or near-real-time processing is essential. Here are some reasons why low-latency LLMs are important:

1. **Real-time Conversational AI**: In conversational AI, such as chatbots, voice assistants, and customer service platforms, low-latency LLMs enable rapid response times, making interactions feel more natural and human-like. This is particularly important in applications where users expect immediate responses, such as in customer support or virtual assistants.
2. **Live Streaming and Broadcasting**: Low-latency LLMs are essential for live streaming and broadcasting applications, such as real-time transcription, live subtitles, and automated commentary. This enables viewers to receive accurate and timely information, enhancing their overall experience.
3. **Gaming and Interactive Systems**: In gaming and interactive systems, low-latency LLMs can improve the responsiveness of NPCs (non-player ch

In [20]:
def rag(query, retrieved_documents, llm):
    # Join the retrieved chunks into one context block for the prompt
    information = "\n\n".join(retrieved_documents)

    messages = [
        ChatMessage(
            role="system",
            content="You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information.",
        ),
        ChatMessage(role="user", content=f"Question: {query}. \n Information: {information}"),
    ]

    # Stream the chat response so tokens can be printed as they arrive
    return llm.stream_chat(messages)


In [21]:
output = rag(query=query, retrieved_documents=retrieved_documents, llm = llm)

for r in output:
    print(r.delta, end="")

The total revenue was $198,270 million.

# 5. Query Expansion

In [22]:
!pip install umap-learn --quiet

In [23]:
#!pip uninstall umap


In [24]:
#import umap
import umap.umap_ as umap
import numpy as np
from tqdm import tqdm

In [25]:
embeddings = chroma_collection.get(include=['embeddings'])['embeddings']
umap_transform = umap.UMAP(random_state=0, transform_seed=0).fit(embeddings)





*   Expansion with generated answers



In [26]:
def augment_query_generated(query, llm):
    messages = [
    ChatMessage(
        role="system", content="You are a helpful expert financial research assistant. Provide an example answer to the given question, that might be found in a document like an annual report. "
    ),
    ChatMessage(role="user", content=f"Question: {query}"),
    ]

    resp = llm.chat(messages)
    return resp

In [27]:
original_query = "Was there significant turnover in the executive team?"
hypothetical_answer = augment_query_generated(original_query, llm)

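# Note: formatting the ChatResponse directly yields "assistant: ..." (visible in the output below);
# hypothetical_answer.message.content holds the answer text alone.
joint_query = f"{original_query} {hypothetical_answer}"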
print(textwrap.fill(joint_query))

Was there significant turnover in the executive team? assistant:
Here's an example answer that might be found in an annual report:
**Executive Team Changes**  During the fiscal year, our company
experienced some changes in its executive leadership team. In January,
our Chief Operating Officer (COO), John Smith, announced his
retirement after 10 years of service to the company. John played a
crucial role in shaping our operational strategy and will be deeply
missed. We are grateful for his contributions and wish him all the
best in his future endeavors.  In February, we appointed Jane Doe as
our new COO. Jane brings over 15 years of experience in the industry
and has a proven track record of driving operational excellence and
growth. We are excited to have her on board and are confident that she
will continue to build on the strong foundation established by her
predecessor.  Additionally, in June, our Chief Financial Officer
(CFO), Michael Johnson, left the company to pursue another
opp

In [28]:
results = chroma_collection.query(query_texts=joint_query, n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]

for doc in retrieved_documents:
    print(textwrap.fill(doc))
    print('')

89 directors and executive officers of microsoft corporation directors
satya nadella chairman and chief executive officer, microsoft
corporation sandra e. peterson 2, 3 operating partner, clayton,
dubilier & rice, llc john w. stanton 1, 4 founder and chairman,
trilogy partnerships reid g. hoffman 4 general partner, greylock
partners penny s. pritzker 4 founder and chairman, psp partners, llc
john w. thompson 3, 4 lead independent director, microsoft corporation
hugh f. johnston 1 vice chairman and executive vice president and
chief financial officer, pepsico, inc. carlos a. rodriguez 1 chief
executive officer, adp, inc. emma n. walmsley 2, 4 chief executive
officer, gsk, plc teri l. list 1, 3 former executive vice president
and chief financial officer, gap, inc. charles w. scharf 2, 3 chief
executive officer and president, wells fargo & company padmasree
warrior 2 founder, president and chief executive

officer, fable group inc. board committees 1. audit committee 2.
compensation commi


*   Expansion with multiple queries




In [30]:
def augment_multiple_query(query, llm):

    messages = [
    ChatMessage(
        role="system", content="You are a helpful expert financial research assistant. Your users are asking questions about an annual report. "
            "Suggest up to five additional related questions to help them find the information they need, for the provided question. "
            "Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic."
            "Make sure they are complete questions, and that they are related to the original question."
            "Output one question per line. Do not number the questions."
    ),
    ChatMessage(role="user", content=query ),
    ]
    resp = llm.chat(messages)
    return resp

In [31]:
original_query = "What were the most important factors that contributed to increases in revenue?"
aug_queries = augment_multiple_query(original_query, llm)
print(aug_queries)



assistant: What was the total revenue for the year?
Were there any new product or service launches?
Did the company experience any changes in pricing?
Were there any significant acquisitions or divestitures?
Did the company expand into new markets or geographies?


In [32]:
augmented_queries = aug_queries.message.blocks[0].text.split("\n")
queries = [original_query] + augmented_queries
results = chroma_collection.query(query_texts=queries, n_results=5, include=['documents', 'embeddings'])

retrieved_documents = results['documents']

# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)

for i, documents in enumerate(retrieved_documents):
    print(f"Query: {queries[i]}")
    print('')
    print("Results:")
    for doc in documents:
        print(textwrap.fill(doc))
        print('')
    print('-'*100)

Query: What were the most important factors that contributed to increases in revenue?

Results:
engineering, gaming, and linkedin. • sales and marketing expenses
increased $ 1. 7 billion or 8 % driven by investments in commercial
sales and linkedin. sales and marketing included a favorable foreign
currency impact of 2 %. • general and administrative expenses
increased $ 793 million or 16 % driven by investments in corporate
functions. operating income increased $ 13. 5 billion or 19 % driven
by growth across each of our segments. current year net income and
diluted eps were positively impacted by the net tax benefit related to
the transfer of intangible properties, which resulted in an increase
to net income and diluted eps of $ 3. 3 billion and $ 0. 44,
respectively. prior year net income and diluted eps were positively
impacted by the net tax benefit related to the india supreme court
decision on withholding taxes, which resulted in an increase to net
income and diluted eps of $ 620 

In [33]:
original_query_embedding = embedding_function([original_query])
augmented_query_embeddings = embedding_function(augmented_queries)

# 6. Re-ranking the long tail

In [34]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [35]:
results = chroma_collection.query(query_texts=queries, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents']

In [36]:
%%timeit -n100 -r3
# Deduplicate the retrieved documents (timing only: names assigned in a %%timeit cell are not kept)
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)

unique_documents = list(unique_documents)

The slowest run took 4.56 times longer than the fastest. This could mean that an intermediate result is being cached.
24 µs ± 11.8 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)


In [37]:
#%%timeit -n100 -r3
#faster
flatList = sum(retrieved_documents, [])
unique_documents = list(set(flatList))  # a list, so score indices can be mapped back to documents
#flatList = list(np.concatenate(lst)) #Slow
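A `set` discards the retrieval order, and its iteration order for strings can differ between interpreter sessions because of hash randomization, so score indices may not line up across runs. An order-preserving alternative is sketched below on made-up data. `toy_retrieved` is a hypothetical stand-in for the nested list of retrieved chunks.

```python
from itertools import chain

# Toy stand-in for the nested results: one list of chunks per query
toy_retrieved = [
    ["chunk A", "chunk B", "chunk C"],
    ["chunk B", "chunk D"],
    ["chunk A", "chunk E"],
]

# chain.from_iterable flattens the lists lazily;
# dict.fromkeys keeps only the first occurrence of each chunk, in order
toy_unique = list(dict.fromkeys(chain.from_iterable(toy_retrieved)))
print(toy_unique)
```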

In [38]:
pairs = []
for doc in unique_documents:
    pairs.append([original_query, doc])

In [39]:
scores = cross_encoder.predict(pairs)  # one relevance score per (query, document) pair; higher means more relevant

In [40]:
print("Scores:")
for score in scores:
    print(score)

Scores:
-7.73989
-11.079268
-11.040518
-11.116722
-7.445792
-10.886526
-10.743448
-10.859211
-11.121365
-11.007851
-10.985082
-11.233842
-11.022106
-11.142147
-11.1664505
-7.503585
-11.05914
-4.6518903
-10.017991
-6.902089
-9.316702
-7.917177
-10.822462
-10.973504
-9.4284725
-11.11678
-4.3417664
-10.947717
-10.088552
-10.311901
-11.072828
-11.024439
-11.248009
-1.1369962
-11.141821
-9.768026
-11.04459
-10.187727
-4.8184843
-10.814183
-11.074697
-11.137989
-11.103973
-11.157962
-11.078432
-10.882423
-8.505108
-3.768155
-5.2747493
-10.119463


In [41]:
print("New Ordering:")
for o in np.argsort(scores)[::-1]:
    print(o)

New Ordering:
33
47
26
17
38
48
19
4
15
0
21
46
20
24
35
18
28
49
37
29
6
39
22
7
45
5
27
23
10
9
12
31
2
36
16
30
40
44
1
42
3
25
8
41
34
13
43
14
11
32
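The ordering printed above is a list of indices into the deduplicated documents, sorted from highest to lowest score. A minimal sketch of turning such scores into a top-k list for the prompt follows. `rerank`, `toy_docs`, and `toy_scores` are hypothetical, and the numbers are made up rather than taken from the run above.

```python
def rerank(documents, scores, top_k):
    """Return the top_k (score, document) pairs, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(scores[i], documents[i]) for i in order[:top_k]]


# Made-up documents and scores, standing in for a list of deduplicated
# chunks and the cross-encoder output
toy_docs = ["doc a", "doc b", "doc c", "doc d"]
toy_scores = [-7.7, -1.1, -11.0, -4.3]

best = rerank(toy_docs, toy_scores, top_k=2)
print(best)
```

Only the few highest-scoring chunks would then be passed to the LLM, which keeps the prompt short while still drawing on the wider candidate pool.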


# 7. Apply Embedding Adaptors to further improve the results
From the 5th tutorial.

# 8. Applying Hybrid Search

[link](https://superlinked.com/vectorhub/articles/optimizing-rag-with-hybrid-search-reranking)
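Hybrid search combines a keyword ranking (for example BM25) with the vector ranking. One common way to merge two ranked lists is reciprocal rank fusion (RRF). The sketch below is a generic illustration with made-up document IDs, not necessarily the method used in the linked article. The function name and the constant `k=60` are assumptions chosen for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs; each list contributes 1 / (k + rank)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)


# Made-up rankings: one from keyword search, one from vector search
keyword_ranking = ["d3", "d1", "d2"]
vector_ranking = ["d1", "d4", "d3"]

fused_order = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused_order)
```

Documents that appear near the top of both lists rise to the front, while documents found by only one retriever still stay in the candidate pool.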

In [42]:
!pip install llama-index llama-index-retrievers-bm25 --quiet