This code creates a RAG system to answer user questions based on the provided PDF.

Dependencies:
- farm-haystack: The Haystack framework for document retrieval and question-answering.
- sentence-transformers: A library for generating sentence embeddings using transformer models.
- faiss: A library for similarity search and clustering of dense vectors.

The code consists of the following components:

1. **Document Indexing:**
   - Utilizes the `FAISSDocumentStore` to store and index PDF documents.
   - Text extraction is performed using a `TextConverter`, followed by preprocessing steps such as cleaning whitespace, headers, and footers.
   - The processed documents are then indexed using a specified FAISS index.

2. **Document Retrieval:**
   - Implements an `EmbeddingRetriever` that utilizes sentence embeddings generated by the "multi-qa-mpnet-base-dot-v1" model from Hugging Face.
   - The retriever is used to find the most relevant documents for a given query.

3. **Question-Answering:**
   - Utilizes a `FARMReader` with a pre-trained reading-comprehension model (e.g., "deepset/roberta-base-squad2"). FARM stands for Framework for Adapting Representation Models.
   - The retriever and reader components are combined in a pipeline to generate answers to user queries.

4. **Querying and Answer Retrieval:**
   - User queries are provided in the variable `query`.
   - The query is preprocessed, and the retrieval and reading pipelines are executed to obtain relevant answers.
   - The final answer to the query is extracted and printed.

Note: Ensure the necessary dependencies are installed before running the code.


Install the required dependencies using the commands below (uncomment them as needed).

In [1]:
# %%bash

# pip install --upgrade pip
# pip install farm-haystack[colab,inference]
# !pip install farm-haystack[faiss]

# !pip install pdfminer.six
# pip install pymupdf pypdf2
# import nltk
# nltk.download('punkt')

Let's import the packages required for this experiment.

In [2]:
import os
from haystack.nodes import EmbeddingRetriever
from haystack.nodes import FARMReader
from haystack import Pipeline
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import TextConverter, PreProcessor
from haystack.pipelines import ExtractiveQAPipeline
from haystack.schema import Document
from haystack.utils import print_answers
import requests

import fitz  # PyMuPDF
import PyPDF2

Let's download the required PDF file.

In [3]:
doc_dir = "data"
pdf_path = doc_dir + "/pdf_file/downloaded_file.pdf"
url = "https://assets.openstax.org/oscms-prodcms/media/documents/ConceptsofBiology-WEB.pdf?_gl=1*m935pz*_ga*NjEzODkzMDg5LjE3MDg5ODI2MzI.*_ga_T746F8B0QC*MTcwOTI0NDQ3My4yLjEuMTcwOTI0NDQ4Ni40Ny4wLjA."

In [4]:
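os.makedirs(os.path.dirname(pdf_path), exist_ok=True)  # make sure the target folder exists
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
with open(pdf_path, "wb") as pdf_file:
    pdf_file.write(response.content)
print("Downloaded the PDF file successfully.")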

Downloaded the PDF file successfully.


Extract content from the PDF for indexing. For now, we will use only chapters 9 and 10.

In [5]:

# def extract_structured_text(pdf_path, start_page, end_page):
#     try:
#         structured_text = []

#         # Using PyMuPDF (MuPDF) for layout analysis
#         with fitz.open(pdf_path) as pdf_document:
#             for page_number in range(start_page - 1, end_page):  # Adjusting for 0-based indexing
#                 page = pdf_document[page_number]
#                 blocks = page.get_text("blocks")
#                 paragraphs = [" ".join([block[4] for block in blocks])]
#                 structured_text.extend(paragraphs)

#         return structured_text
#     except Exception as e:
#         print(f"Error extracting structured text from PDF: {e}")
#         return None

# start_page = 210
# end_page = 258

# structured_text_chapter1_2 = extract_structured_text(pdf_path, start_page, end_page)

# if structured_text_chapter1_2:
#     print("Structured text extracted from pdf successfully:")
#     for paragraph in structured_text_chapter1_2[:5]:  # Displaying the first 5 paragraphs as a sample
#         print(paragraph)
# else:
#     print("Extraction of structured text from pdf failed.")


In [6]:
import fitz

def extract_text_page_wise(pdf_path,start_page,end_page):
    doc = fitz.open(pdf_path)
    extracted_text = []

    for page_number in range(start_page - 1, end_page):
        page = doc[page_number]
        text = page.get_text("text")
        extracted_text.append(text)

    doc.close()

    return extracted_text

start_page = 210
end_page = 258
text_per_page = extract_text_page_wise(pdf_path,start_page,end_page)

# Print text from each page
for page_number, text in enumerate(text_per_page):
    print(f"Page {page_number + 1}:\n{text}\n")
    break

Page 1:
196
8 • Critical Thinking Questions
Access for free at openstax.org
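
The extraction cell above maps 1-based, inclusive page numbers onto PyMuPDF's 0-based page indices with `range(start_page - 1, end_page)`. A minimal sketch of that mapping follows. `page_indices` is a hypothetical helper, not part of the notebook, and the bounds checks are an added assumption about how invalid ranges should be handled.

```python
def page_indices(start_page, end_page, page_count=None):
    """Return 0-based indices for the 1-based, inclusive range [start_page, end_page]."""
    if start_page < 1 or end_page < start_page:
        raise ValueError("expected 1 <= start_page <= end_page")
    # page_count is optional; when given, reject ranges that run past the document
    if page_count is not None and end_page > page_count:
        raise ValueError("end_page exceeds the number of pages in the document")
    return list(range(start_page - 1, end_page))


indices = page_indices(210, 258)
print(indices[0], indices[-1], len(indices))  # 209 257 49
```

The 49 indices agree with the 49 files reported by the indexing progress output further down.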




Save the extracted page texts to files, which will be used for indexing.

In [7]:
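raw_files_dir = "/content/data/raw_files"
os.makedirs(raw_files_dir, exist_ok=True)  # create the output folder if it is missing
for i, text in enumerate(text_per_page):
    filename = f"{raw_files_dir}/document_{i+1}.txt"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)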
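
One detail worth keeping in mind: `os.listdir` returns names in arbitrary order, and a name such as `document_10.txt` sorts before `document_2.txt` lexicographically. The sketch below is one possible way to keep page order recoverable, assuming zero-padded file names. `save_pages` and the temporary directory are illustrative and not part of the notebook.

```python
import os
import tempfile


def save_pages(pages, out_dir):
    """Write each page's text to out_dir using zero-padded, sortable file names."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, text in enumerate(pages, start=1):
        path = os.path.join(out_dir, f"document_{i:03d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp_dir:
    pages = [f"text of page {n}" for n in range(1, 13)]
    save_pages(pages, tmp_dir)
    # sorted() makes the listing deterministic and matches page order
    names = sorted(os.listdir(tmp_dir))
    print(names[0], names[-1])  # document_001.txt document_012.txt
```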

Create a FAISS document store with a SQLite backend and a Flat index factory.

In [8]:
document_store = FAISSDocumentStore(sql_url="sqlite:///", faiss_index_factory_str="Flat")
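
Conceptually, this store keeps two things side by side: the document text and metadata in a SQL database, and the embedding vectors in a FAISS index, where a `Flat` index compares the query against every stored vector (exact search). The sketch below imitates that layout with the standard library only; it is not how Haystack or FAISS are implemented. The table layout, the hand-written 3-dimensional vectors, the third sample sentence, and `flat_search` are all made up for illustration.

```python
import sqlite3

# Text lives in SQL (an in-memory SQLite database is assumed for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (vector_id INTEGER PRIMARY KEY, content TEXT)")
docs = [
    (0, "DNA replicates by a semi-conservative method."),
    (1, "Translation begins at the initiating AUG on the mRNA."),
    (2, "Mendel studied inheritance in pea plants."),
]
conn.executemany("INSERT INTO document VALUES (?, ?)", docs)

# Vectors live in a flat list; the list position is the vector id (toy values).
vectors = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.2, 0.9],
]


def flat_search(query_vector, top_k=2):
    """Score every stored vector by dot product and return the top_k (id, score) pairs."""
    scores = [
        (vector_id, sum(q * v for q, v in zip(query_vector, vector)))
        for vector_id, vector in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


hits = flat_search([1.0, 0.2, 0.0])
for vector_id, score in hits:
    # Look the text up in SQL using the id returned by the vector search
    (content,) = conn.execute(
        "SELECT content FROM document WHERE vector_id = ?", (vector_id,)
    ).fetchone()
    print(f"{score:.2f}  {content}")
```

Exhaustive scoring like this is simple and exact, which suits a collection of this size; approximate index types trade some accuracy for speed on much larger collections.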

In [9]:
# from haystack.utils import clean_wiki_text, convert_files_to_docs

# # Convert files to dicts
# docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# # Now, let's write the dicts containing documents to our DB.
# document_store.write_documents(docs)


Build an indexing pipeline with text conversion and pre-processing components to prepare documents for indexing.

In [10]:

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)

# Index the extracted text files located in raw_files_dir.

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

files_to_index = [raw_files_dir + "/" + f for f in os.listdir(raw_files_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)



Converting files: 100%|██████████| 49/49 [00:00<00:00, 810.28it/s]
Preprocessing: 100%|██████████| 49/49 [00:00<00:00, 524.65docs/s]
Writing Documents: 10000it [00:00, 31676.40it/s]


{'documents': [<Document: {'content': 'transcription; mRNA is used to synthesize proteins by\nthe process of translation. The genetic code is the\ncorrespondence between the three-nucleotide mRNA\ncodon and an amino acid. The genetic code is\n“translated” by the tRNA molecules, which associate a\nspecific codon with a specific amino acid. The genetic\ncode is degenerate because 64 triplet codons in mRNA\nspecify only 20 amino acids and three stop codons.\nThis means that more than one codon corresponds to\nan amino acid. Almost every species on the planet uses\nthe same genetic code.\nThe players in translation include the mRNA template,\nribosomes, tRNAs, and various enzymatic factors. The\nsmall ribosomal subunit binds to the mRNA template.\nTranslation begins at the initiating AUG on the mRNA.\nThe formation of bonds occurs between sequential\namino acids specified by the mRNA template according\nto the genetic code. The ribosome accepts charged\ntRNAs, and as it steps along the mRN
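
The `split_length=200` and `split_overlap=20` settings in the cell above describe a sliding window over words. A rough sketch of that idea is shown here. It is a simplification, not Haystack's implementation: it ignores `split_respect_sentence_boundary` and the cleaning options, and `split_words` is a hypothetical name.

```python
def split_words(text, split_length=200, split_overlap=20):
    """Split text into word windows of split_length that overlap by split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text
        if start + split_length >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(450))
chunks = split_words(sample)
print([len(chunk.split()) for chunk in chunks])  # [200, 200, 90]
```

Each chunk repeats the last 20 words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk most of the time.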

In [11]:
# Instantiate an embedding retriever using a pre-trained model from Hugging Face
# (sentence-transformers/multi-qa-mpnet-base-dot-v1).
retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1"
)
# Update document embeddings in the FAISS document store using the retriever.
document_store.update_embeddings(retriever)

# Create a FARM-based reader using a pre-trained model from Hugging Face (deepset/roberta-base-squad2).
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

# Build a querying pipeline with retriever and reader components.
querying_pipeline = Pipeline()
querying_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
querying_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Updating Embedding:   0%|          | 0/126 [00:00<?, ? docs/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Documents Processed: 10000 docs [03:16, 50.87 docs/s]


In [12]:
# pipe = ExtractiveQAPipeline(reader, retriever)

In [13]:
# Define a sample query and preprocess it for the querying pipeline.
query="DNA replicates by which of the methods?"
query_dict = {"text": query}
my_doc = Document(content=query_dict["text"])
query_preprocessed = PreProcessor().process(my_doc)[0].to_dict()["content"]
query_preprocessed


'DNA replicates by which of the methods?'

In [14]:
query_preprocessed

'DNA replicates by which of the methods?'

In [15]:
# Execute the querying pipeline with the specified parameters to retrieve relevant answers.

prediction = querying_pipeline.run(
    query_preprocessed, params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 2}}
)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Inferencing Samples: 100%|██████████| 1/1 [00:08<00:00,  8.54s/ Batches]
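
The `params` dictionary in the call above is keyed by node name, so `top_k` for the retriever and `top_k` for the reader are set independently: the retriever passes on five documents and the reader returns two answers. The sketch below mimics only that routing idea with plain functions. `toy_retriever`, `toy_reader`, `run_nodes`, and the word-overlap scoring are stand-ins invented for illustration and do not reflect how Haystack's nodes work internally.

```python
def word_overlap(a, b):
    """Count the distinct lowercase words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def toy_retriever(state, top_k=10):
    """Keep the top_k passages that share the most words with the query."""
    ranked = sorted(
        state["corpus"], key=lambda p: word_overlap(state["query"], p), reverse=True
    )
    return {**state, "documents": ranked[:top_k]}


def toy_reader(state, top_k=1):
    """Return the top_k retrieved passages as 'answers' (a real reader extracts spans)."""
    return {**state, "answers": state["documents"][:top_k]}


def run_nodes(state, nodes, params):
    """Run nodes in order, giving each node only the params filed under its name."""
    for name, node in nodes:
        state = node(state, **params.get(name, {}))
    return state


corpus = [
    "DNA replicates by a semi-conservative method",
    "Replication occurs in three main stages",
    "The genetic code is degenerate",
    "Translation begins at the initiating AUG",
]
result = run_nodes(
    {"query": "DNA replicates by which method", "corpus": corpus},
    nodes=[("Retriever", toy_retriever), ("Reader", toy_reader)],
    params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 2}},
)
print(len(result["documents"]), len(result["answers"]))  # 3 2
```

A larger retriever `top_k` gives the reader more text to search at the cost of slower inference, while the reader `top_k` only controls how many candidate answers are returned.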


In [16]:
# Print the answers from the prediction results.

print_answers(prediction, details="minimum")

'Query: DNA replicates by which of the methods?'
'Answers:'
[   {   'answer': 'semi-conservative',
        'context': '9.2 DNA Replication\n'
                   'DNA replicates by a semi-conservative method in\n'
                   'which each of the two parental DNA strands act as a\n'
                   'template for new DNA to be syn'},
    {   'answer': 'initiation, elongation, and termination',
        'context': 'mes and other proteins. It occurs in three main stages: '
                   'initiation, elongation, and termination.\n'
                   'Recall that eukaryotic DNA is bound to proteins known'}]


In [17]:
prediction["answers"][0].to_dict()["answer"]

'semi-conservative'

Create an `AnswerPredictor` class that wraps query preprocessing and prediction.

In [18]:
class AnswerPredictor:
    def __init__(self, qry_pipeline):
        self.qry_pipeline = qry_pipeline

    def preprocess_query(self, query):
        # Preprocess the query using the provided code
        my_doc = Document(content=query)
        query_preprocessed = PreProcessor().process(my_doc)[0].to_dict()["content"]
        return query_preprocessed

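    def predict_answer(self, query_preprocessed, retriever_params=None, reader_params=None):
        # Avoid mutable default arguments; fall back to the default top_k values
        retriever_params = retriever_params or {"top_k": 5}
        reader_params = reader_params or {"top_k": 2}
        # Run the querying pipeline
        prediction = self.qry_pipeline.run(
            query_preprocessed, params={"Retriever": retriever_params, "Reader": reader_params}
        )
        # Return the top answer, or None when the reader found nothing
        answers = prediction["answers"]
        return answers[0].to_dict()["answer"] if answers else None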


if __name__ == "__main__":

    # Reuse the retriever-reader pipeline built earlier
    predictor = AnswerPredictor(qry_pipeline=querying_pipeline)

    # Get user input for the query
    user_query = input("Enter your question: ")

    # Preprocess the query
    preprocessed_query = predictor.preprocess_query(user_query)

    # Predict the answer
    predicted_answer = predictor.predict_answer(preprocessed_query)

    # Display the result
    print("\n\nPredicted Answer:", predicted_answer)

Enter your question: DNA replicates by which of the methods?



Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Inferencing Samples: 100%|██████████| 1/1 [00:08<00:00,  8.88s/ Batches]



Predicted Answer: semi-conservative





Other experiments that are not included in this notebook:

1. InMemoryDocumentStore
2. BM25Retriever