# Pipeline
The pipeline implemented in the Streamlit app was developed and tested in this Jupyter notebook.
The pipeline consists of the following steps:
1. PDF Processing:
   - Load PDF using PyMuPDF
   - Extract text from each page
   - Clean text by removing unnecessary newlines
   - Split text into paragraphs

2. Text Embedding:
   - Use a SentenceTransformer model to generate an embedding for each paragraph. Sentence-Transformers needs no text pre-processing; I tried a little pre-processing to improve the results, but it actually lowered their quality.
   - Normalize embeddings for better similarity comparison

3. Semantic Search:
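   - Encode the user query using the same SentenceTransformer model
   - Calculate the similarity between the query and paragraph embeddings (a dot product, since the embeddings are normalized)
   - Get the top 7 most similar paragraphs as candidates for re-ranking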

4. BM25 Re-ranking:
   - Apply the BM25 ranking algorithm to the top 7 paragraphs, scoring them on term frequency and document length
   - Fuse the BM25 ranking with the semantic ranking using Reciprocal Rank Fusion (RRF)
   - Return the top 3 most relevant paragraphs as final results

Since I have no quantitative metric for the accuracy and quality of the results, all I can share is my observation. From what I have observed, 1-2 of the 3 returned results are consistently accurate and of high quality. They may not be at rank 1, but they were present for every query I tried.
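
One possible way to put a number on this observation is sketched below. It assumes a small hand-labelled set of queries, which does not exist in this notebook, so the query names, paragraph ids and labels are hypothetical. It computes the hit rate at 3 (the fraction of queries with at least one relevant paragraph in the top 3) and the mean reciprocal rank.

```python
def hit_rate_and_mrr(results, relevant, k=3):
    """results: {query: ranked paragraph ids}; relevant: {query: set of relevant ids}."""
    hits, rr_sum = 0, 0.0
    for query, ranked in results.items():
        # 1-based rank of the first relevant paragraph within the top k, if any.
        first = next(
            (r for r, pid in enumerate(ranked[:k], start=1) if pid in relevant[query]),
            None,
        )
        if first is not None:
            hits += 1
            rr_sum += 1.0 / first
    n = len(results)
    return hits / n, rr_sum / n

# Hypothetical pipeline outputs and labels, for illustration only.
results = {"q1": [140, 134, 77], "q2": [12, 7, 3]}
relevant = {"q1": {134}, "q2": {99}}
hit_rate, mrr = hit_rate_and_mrr(results, relevant)
```

Here `q1` has its first relevant paragraph at rank 2 and `q2` has none, so the hit rate is 0.5 and the mean reciprocal rank is 0.25.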

In [1]:
import pymupdf

## PDF Processing

In [2]:
# Working on a sample PDF, the book Brief Answers to the Big Questions. The pipeline was also tested on a RAG research paper.
path="Samples/BOOK1.pdf"

pdf = pymupdf.open(path)

In [3]:
# Sample text from the pdf file.
pdf[5].get_text()

'Foreword\nEddie Redmayne\nThe first time I met Stephen Hawking, I was struck by his extraordinary\npower and his vulnerability. The determined look in his eyes coupled with\nthe immobile body was familiar to me from my research—I had recently\nbeen engaged to play the role of Stephen in The Theory of Everything\nand had spent several months studying his work and the nature of his\ndisability, attempting to understand how to use my body to express the\npassage of motor neurone disease over time.\nAnd yet when I finally met Stephen, the icon, this scientist of\nphenomenal talent, whose main communication was through a\ncomputerised voice along with a pair of exceptionally expressive\neyebrows, I was floored. I tend to get nervous in silences and talk too\nmuch whereas Stephen absolutely understood the power of silence, the\npower of feeling like you are being scrutinised. Flustered, I chose to talk\nto him about how our birthdays were only days apart, putting us in the\nsame zodiacal si

In [4]:
import re

In [5]:
# Cleaning the text by removing the unnecessary newlines unless preceded by a full stop.
def clean_text(text):
    text = re.sub(r'(?<!\.)\n', ' ', text)  # Remove \n unless preceded by a full stop
    return text

# Getting the total text from the pdf file.
total_text=""
for page in pdf:
    text=page.get_text()
    
    total_text+=" "+clean_text(text)
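
To make the cleaning rule concrete, here is a small sketch on a made-up string (the sample text is an assumption, not taken from the PDF). It shows that only newlines preceded by a full stop survive, so those are the only places where the later split on `'\n'` can start a new paragraph.

```python
import re

def clean_text(text):
    # Same rule as the notebook: a newline is kept only when preceded by a full stop.
    return re.sub(r'(?<!\.)\n', ' ', text)

sample = "The first time I met\nStephen Hawking.\nAnd yet when I\nfinally met him?\nI was floored."
cleaned = clean_text(sample)
paragraphs_demo = cleaned.split('\n')
```

`paragraphs_demo` holds two entries. The newline after the question mark is removed like any other, and any line that happens to end in a full stop becomes a paragraph boundary, so the resulting "paragraphs" are an approximation of the book's real ones.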

In [6]:
print(total_text)

   Copyright © 2018 by Spacetime Publications Limited Foreword copyright © 2018 by Eddie Redmayne Introduction copyright © 2018 by Kip Thorne Afterword copyright © 2018 by Lucy Hawking All rights reserved.
Published in the United States by Bantam Books, an imprint of Random House, a division of Penguin Random House LLC, New York.
BANTAM BOOKS and the HOUSE colophon are registered trademarks of Penguin Random House LLC.
Published in the United Kingdom by John Murray (Publishers), a Hachette UK Company.
Photograph of the adult Stephen Hawking © Andre Pattenden Hardback ISBN 9781984819192 Ebook ISBN 9781984819208 randomhousebooks.com Text design by Craig Burgess, adapted for ebook Cover design: Dan Rembert Cover image: © Shutterstock v5.3.2 ep  Contents Cover Title Page Copyright A Note from the Publisher Foreword: Eddie Redmayne An Introduction: Kip Thorne Why We Must Ask the Big Questions Chapter 1: Is There a God? Chapter 2: How Did It All Begin? Chapter 3: Is There Other Intelligent L

In [7]:
# Splitting the text into paragraphs.
paragraphs=total_text.split('\n')
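
Splitting on `'\n'` can leave entries with leading spaces (pages are joined with a space) or entries that are empty. A minimal sketch of an optional tidy-up step follows; `tidy_paragraphs` and `min_chars` are hypothetical names, and the counts and indices shown in this notebook were produced without this step.

```python
def tidy_paragraphs(paragraphs, min_chars=1):
    """Strip surrounding whitespace and drop paragraphs shorter than min_chars."""
    stripped = (p.strip() for p in paragraphs)
    return [p for p in stripped if len(p) >= min_chars]

# Made-up text that mimics the shape of the cleaned PDF text.
raw = "   Copyright 2018.\n \nPublished in the United States.\n".split('\n')
tidy = tidy_paragraphs(raw)
```

Raising `min_chars` would also drop very short fragments such as stray headings, at the cost of possibly losing short but meaningful sentences.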

In [8]:
import textwrap

# Checking the paragraphs
for i, p in enumerate(paragraphs):
    wrapped_text=textwrap.fill(p,width=100)

    print("------------------------------------------------------------------------------------------------------------------------------------------------------")
    print(wrapped_text)
    print("------------------------------------------------------------------------------------------------------------------------------------------------------")

------------------------------------------------------------------------------------------------------------------------------------------------------
   Copyright © 2018 by Spacetime Publications Limited Foreword copyright © 2018 by Eddie Redmayne
Introduction copyright © 2018 by Kip Thorne Afterword copyright © 2018 by Lucy Hawking All rights
reserved.
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Published in the United States by Bantam Books, an imprint of Random House, a division of Penguin
Random House LLC, New York.
------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------

## Text Embedding

In [9]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load BERT-based embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate normalized embeddings for each chunk (device='cuda' requires a GPU)
chunk_embeddings = embedding_model.encode(paragraphs, convert_to_numpy=True,normalize_embeddings=True,device='cuda')

print(f"Generated {len(chunk_embeddings)} embeddings with shape {chunk_embeddings.shape}")

Generated 539 embeddings with shape (539, 384)


In [10]:
chunk_embeddings[0]

array([ 4.86012641e-03, -1.47548234e-02,  4.21559215e-02,  2.54215803e-02,
       -7.34864874e-03, -1.26005793e-02, -5.17367870e-02,  8.59142654e-03,
       -2.42775176e-02, -4.34214324e-02,  3.29339877e-03, -2.79288702e-02,
       -5.97392321e-02, -1.01349363e-02, -8.01654458e-02, -1.59024708e-02,
        3.07004713e-03, -2.44494043e-02, -3.22255343e-02,  2.52496656e-02,
        9.49325338e-02,  3.21525373e-02,  3.67122330e-02,  9.92837027e-02,
        3.62107903e-02,  1.30142095e-02, -2.59261075e-02, -1.09681301e-02,
       -4.09123637e-02, -1.90260902e-03,  3.55388112e-02,  3.52474004e-02,
       -3.06959096e-02, -4.33857441e-02, -1.30853439e-02,  5.85099570e-02,
        1.68124065e-02, -5.21628447e-02, -1.47797866e-02, -3.42935510e-02,
        1.38933826e-02, -3.16030271e-02,  2.37071365e-02,  5.42017035e-02,
       -6.97503015e-02,  4.13712393e-03,  7.31109157e-02, -6.68011233e-02,
       -3.94878611e-02,  8.79251361e-02,  5.45731571e-04, -5.37547376e-03,
       -1.98059697e-02, -

## Semantic Search

In [11]:
# Sample query from my favorite character of the book
query="Who created the universe?"

query_embedded= embedding_model.encode(query, convert_to_numpy=True,normalize_embeddings=True,device='cuda')

In [12]:
query_embedded.shape

(384,)

In [13]:
# Calculating the cosine similarity between the query and each chunk (a dot product suffices because the embeddings are normalized).
simmilarities=np.dot(chunk_embeddings,query_embedded)
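
The reason a plain dot product is enough here can be checked on random stand-in vectors. The arrays below are made up and only mimic the shapes of the embeddings; the sketch compares the dot product of unit-length vectors with the full cosine similarity formula.

```python
import numpy as np

rng = np.random.default_rng(0)
chunks = rng.normal(size=(4, 8))  # stand-in for the chunk embeddings
query = rng.normal(size=8)        # stand-in for the query embedding

# Scale every vector to unit length, which is what normalize_embeddings=True does.
chunks_unit = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

# Dot product of unit vectors ...
dot_scores = chunks_unit @ query_unit
# ... versus cosine similarity computed on the raw vectors.
cosine_scores = (chunks @ query) / (np.linalg.norm(chunks, axis=1) * np.linalg.norm(query))
```

The two score arrays agree up to floating-point error, so ranking by the dot product is the same as ranking by cosine similarity once the embeddings are normalized.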

In [14]:
simmilarities.shape

(539,)

In [15]:
simmilarities

array([ 0.29099476,  0.1535826 ,  0.12490446,  0.09485378,  0.3456967 ,
        0.26615998,  0.16044319,  0.19552839,  0.22492844,  0.13995838,
        0.144224  ,  0.23749012,  0.12966302,  0.46957904,  0.12631112,
        0.40001404,  0.2620515 ,  0.2108422 ,  0.29225713,  0.16360223,
        0.27764612,  0.32818604,  0.2578235 ,  0.3256113 ,  0.24044564,
        0.08010814,  0.32374406,  0.50771654,  0.30315098,  0.2931148 ,
        0.24177194,  0.1947309 ,  0.32078367,  0.32410628,  0.46876892,
        0.31457177,  0.2941826 ,  0.22646265,  0.31390405,  0.23118597,
        0.28295404,  0.41737056,  0.18488567,  0.23196565,  0.1892666 ,
        0.36661443,  0.34084255,  0.53740287,  0.43632144,  0.3650121 ,
        0.27395874,  0.4507491 ,  0.06766124,  0.04208345,  0.16855718,
        0.38742873,  0.12461039,  0.13460855,  0.5692712 ,  0.31339437,
        0.0058302 , -0.02264979,  0.03683634,  0.09808662,  0.06519544,
        0.342071  ,  0.41778758, -0.04079486, -0.00218364, -0.00

In [16]:
# Getting the indices of the top 7 most similar chunks, best first.
top_7_idx=np.argsort(simmilarities,axis=0)[-7:][::-1].tolist()
top_7_idx

[140, 134, 77, 113, 149, 122, 135]
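
The slicing used to pick the best indices can be followed on a tiny made-up score array. The second half of the sketch shows `np.argpartition` as an alternative that avoids sorting the whole array; whether that matters for a few hundred paragraphs is doubtful, so treat it as an option rather than a recommendation.

```python
import numpy as np

scores = np.array([0.29, 0.15, 0.47, 0.09, 0.51, 0.40])  # made-up similarities
k = 3

# argsort is ascending: take the last k indices, then reverse so the best comes first.
top_k_idx = np.argsort(scores)[-k:][::-1].tolist()

# Alternative: argpartition finds the k largest in arbitrary order, then only those are sorted.
part = np.argpartition(scores, -k)[-k:]
top_k_fast = part[np.argsort(scores[part])[::-1]].tolist()
```

Both variants return `[4, 2, 5]` for this array, the positions of 0.51, 0.47 and 0.40.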

In [17]:
# Collecting the top 7 most similar paragraphs, in order of decreasing similarity.
filtered_paragraphs = [paragraphs[i] for i in top_7_idx]
filtered_paragraphs

['In the last hundred years, we have made spectacular advances in our understanding of the universe. We now know the laws that govern what happens in all but the most extreme conditions, like the origin of the universe, or black holes. The role played by time at the beginning of the universe is, I believe, the final key to removing the need for a grand designer and revealing how the universe created itself.',
 'Since we know the universe itself was once very small—perhaps smaller than a proton—this means something quite remarkable. It means the universe itself, in all its mind-boggling vastness and complexity, could simply have popped into existence without violating the known laws of nature. From that moment on, vast amounts of energy were released as space itself expanded—a place to store all the negative energy needed to balance the books. But of course the critical question is raised again: did God create the quantum laws that allowed the Big Bang to occur? In a nutshell, do we nee

## BM25 Re-ranking
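
As background for this step, the sketch below is a simplified, self-contained version of Okapi BM25. It is not the code of `rank_bm25`: the IDF variant, the `k1` and `b` values and the `tokenize` helper are assumptions chosen for illustration. The tokenizer lowercases and strips punctuation, because plain whitespace splitting turns the query word `universe?` into a token that does not equal `universe`.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep alphanumeric runs, so "universe?" matches "universe".
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a simplified Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed, non-negative IDF (an assumption, not rank_bm25's exact formula).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "The universe created itself.",
    "Black holes are not so black.",
    "Who created the laws of the universe?",
]
scores = bm25_scores("Who created the universe?", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The second document shares no term with the query and scores zero, while the third matches all four query terms and ranks first.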

In [18]:
import numpy as np
from rank_bm25 import BM25Okapi

# Reciprocal Rank Fusion Function
def reciprocal_rank_fusion(rankings, k=60):
    fused_scores = {}
    
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
            fused_scores[doc_id] += 1 / (k + rank + 1)  # RRF formula

    # Sort by fused scores (higher is better)
    sorted_docs = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in sorted_docs]


# Creating BM25 object from the filtered paragraphs
tokenized_paragraphs = [paragraph.split() for paragraph in filtered_paragraphs]
bm25 = BM25Okapi(tokenized_paragraphs)

# Getting BM25 scores for the query
tokenized_query = query.split()
bm25_scores = bm25.get_scores(tokenized_query)

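# Ranking documents based on BM25 (higher score = better rank)
bm25_ranking = np.argsort(bm25_scores)[::-1].tolist()  # Sorting in descending order

# Getting semantic search ranking: filtered_paragraphs is already sorted by
# similarity, so positions 0-4 are the semantic top 5. Positions 5 and 6
# receive a score from the BM25 ranking only.
semantic_ranking = list(range(5))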

# Apply RRF on BM25 + Semantic Search rankings
fused_ranking = reciprocal_rank_fusion([semantic_ranking, bm25_ranking])

# Getting top 3 refined results
top_3_results = [filtered_paragraphs[i] for i in fused_ranking[:3]]

top_3_results


['In the last hundred years, we have made spectacular advances in our understanding of the universe. We now know the laws that govern what happens in all but the most extreme conditions, like the origin of the universe, or black holes. The role played by time at the beginning of the universe is, I believe, the final key to removing the need for a grand designer and revealing how the universe created itself.',
 'I think the universe was spontaneously created out of nothing, according to the laws of science. The basic assumption of science is scientific determinism. The laws of science determine the evolution of the universe, given its state at one time. These laws may, or may not, have been decreed by God, but he cannot intervene to break the laws, or they would not be laws. That leaves God with the freedom to choose the initial state of the universe, but even here it seems there may be laws. So God would have no freedom at all.',
 'Since we know the universe itself was once very small—
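
To see how Reciprocal Rank Fusion combines the two rankings, here is a worked toy example. The function applies the same scoring rule as the cell above, and the two three-item rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Same rule as the notebook: each list adds 1 / (k + rank + 1) for every document.
    fused_scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    # Highest fused score first.
    return sorted(fused_scores, key=fused_scores.get, reverse=True)

semantic_ranking = [0, 1, 2]  # toy ranking: doc 0 is best by similarity
bm25_ranking = [2, 0, 1]      # toy ranking: doc 2 is best by BM25
fused = reciprocal_rank_fusion([semantic_ranking, bm25_ranking])
```

With `k=60`, doc 0 scores 1/61 + 1/62, doc 2 scores 1/63 + 1/61 and doc 1 scores 1/62 + 1/63, so the fused order is `[0, 2, 1]`. A document ranked well by both lists beats one that is first in one list and last in the other.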