# BioRAG: A Retrieval-Augmented Agent for Scientific Literature Analysis

## Extraction and Initial Document Creation - Indexing

In [None]:
## Loading a document
from langchain_community.document_loaders import UnstructuredPDFLoader

file_name = "cells-11-02650-v2.pdf"
loader = UnstructuredPDFLoader(file_name,
                               languages=["eng"],  # replaces the deprecated ocr_languages kwarg
                               mode="elements",
                               strategy="hi_res",
                               infer_table_structure=True)
docs = loader.load()


In [None]:
# Observe the document objects

for d in docs:
    print(d.page_content)
    print("\n"+"---"*20+"\n" + str(d.metadata)+"\n" +"---"*20 +"\n")

f cells

------------------------------------------------------------
{'source': 'cells-11-02650-v2.pdf', 'detection_class_prob': 0.5524067878723145, 'coordinates': {'points': ((np.float64(98.8648681640625), np.float64(136.2649383544922)), (np.float64(98.8648681640625), np.float64(230.7023162841797)), (np.float64(331.4766540527344), np.float64(230.7023162841797)), (np.float64(331.4766540527344), np.float64(136.2649383544922))), 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2339}, 'last_modified': '2025-12-13T08:33:42', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'filename': 'cells-11-02650-v2.pdf', 'category': 'Image', 'element_id': 'b8486a48b8e0c0ee8de406208835b402'}
------------------------------------------------------------



------------------------------------------------------------
{'source': 'cells-11-02650-v2.pdf', 'coordinates': {'points': ((np.float64(1435.8290244444443), np.float64(155.81900222222203)), (np.float64(1435.82902444

In [None]:
cats = set()
for do in docs:
    cats.add(do.metadata.get('category'))
print(cats)

{'FigureCaption', 'Image', 'Table', 'ListItem', 'NarrativeText', 'Title', 'UncategorizedText', 'Header'}


### Key Notes:
Only four categories are kept for indexing: Title, NarrativeText, Table, and FigureCaption.

In [None]:
Docs = docs.copy()
Docs = [doc for doc in Docs if doc.metadata.get('category') not in ['Image', 'Header', 'ListItem', 'UncategorizedText']]
print(f"Number of filtered docs: {len(Docs)}")

Number of filtered docs: 139


In [None]:
cats = set()
for do in Docs:
    cats.add(do.metadata.get('category'))
print(cats)

{'Table', 'NarrativeText', 'Title', 'FigureCaption'}
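
The filter above is a deny-list, so any category it does not name would pass through. A minimal sketch of the allow-list alternative is shown below. It assumes a small `Doc` dataclass as a hypothetical stand-in for the LangChain `Document`, so that it runs without any third-party package; the sample elements are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# The four categories named in the key note above
KEEP_CATEGORIES = {"Title", "NarrativeText", "Table", "FigureCaption"}


def keep_categories(docs, allowed=KEEP_CATEGORIES):
    """Return only the elements whose 'category' metadata is in `allowed`."""
    return [d for d in docs if d.metadata.get("category") in allowed]


# Made-up sample elements, not taken from the paper
sample = [
    Doc("1. Introduction", {"category": "Title"}),
    Doc("f cells", {"category": "Image"}),
    Doc("Some body text", {"category": "NarrativeText"}),
    Doc("Page footer", {"category": "Footer"}),  # not named in the deny-list
]

kept = keep_categories(sample)
print([d.metadata["category"] for d in kept])
```

With an allow-list, a category that only appears in a different PDF is dropped by default instead of leaking into the index.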


## Creating functions to clean texts

In [None]:
# Search for unnecessary narrative texts like "18 of 22"
s= set()
for do in Docs:
    if do.metadata.get('category') == 'NarrativeText':
        s.add(len(do.page_content))
        print(do.page_content if len(do.page_content)<150 else "", end="\n")

Article

Academic Editor: Ohad Medalia


Received: 28 July 2022 Accepted: 23 August 2022 Published: 25 August 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional afﬁl- iations.


Cells 2022, 11, 2650. https://doi.org/10.3390/cells11172650
https://www.mdpi.com/journal/cells


2.1. Electrical Stimulation (ES) Chamber C-Pace EM 100






2.2. Calculation and Simulation of ES


2.3. Current Measurement

2.4. Characterization of AC-Stimulated Liquid


For all Controls, the electrodes were placed in the medium, and the power switch was not returned on (no ES).
2.5. Cell Culture

2.6. AC-Activated Medium and Online Monitoring of Cell Adhesion


2.7. AC stimulation of Cells and Initial Adhesion

2.8. Cell Morphology and Spreading



2.9. Calcium Mobilization and AC stimulation


2.10. Cellular Reactive Oxygen Species (ROS)



2.11. Statistics


3.1. Characterization of AC Electrical Stimulation (ES) Parameter


8 of 22
Table 1. Ind

In [None]:
# Remove boilerplate narrative texts
import re
from langchain_core.documents import Document

# re.VERBOSE ignores whitespace and treats '#' as a comment inside the pattern
BOILERPLATE_PATTERN = r"""
\b\d+\s*of\s*\d+\b | # Matches '18 of 22'
\b[A-Za-z]+\s+\d{4},\s*\d+,\s*\d+\b # Matches 'Cells 2022, 11, 2650'
"""

boilerplate_regex = re.compile(BOILERPLATE_PATTERN, re.VERBOSE)

def remove_boilerplate(Docs: list[Document]) -> list[Document]:
    # Build a new list rather than calling Docs.remove() inside the loop:
    # removing while iterating skips the element that follows each removal.
    return [
        doc for doc in Docs
        if not (doc.metadata.get('category') == 'NarrativeText'
                and boilerplate_regex.fullmatch(doc.page_content.strip()))
    ]
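
Because the pattern is applied with `fullmatch`, it only flags elements that consist of nothing but the boilerplate. The sketch below recompiles the same pattern and runs it on a few strings taken from the cell output above. The helper name `is_boilerplate` is an assumption introduced for illustration.

```python
import re

BOILERPLATE_PATTERN = r"""
\b\d+\s*of\s*\d+\b |                   # page counters such as '18 of 22'
\b[A-Za-z]+\s+\d{4},\s*\d+,\s*\d+\b    # journal stamps such as 'Cells 2022, 11, 2650'
"""
boilerplate_regex = re.compile(BOILERPLATE_PATTERN, re.VERBOSE)


def is_boilerplate(text: str) -> bool:
    """True only when the whole stripped string is boilerplate."""
    return boilerplate_regex.fullmatch(text.strip()) is not None


samples = [
    "8 of 22",
    "Cells 2022, 11, 2650",
    "Cells 2022, 11, 2650. https://doi.org/10.3390/cells11172650",
    "2.5. Cell Culture",
]
results = {s: is_boilerplate(s) for s in samples}
for text, flag in results.items():
    print(f"{flag!s:5}  {text}")
```

The third sample carries a DOI after the journal stamp, so `fullmatch` rejects it. Lines like that are caught by the prefix list in `clean_narrative_text` instead.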
            

In [None]:
# Remove unnecessary narrative text such as citation and publisher info

def clean_narrative_text(Docs: list[Document]) -> list[Document]:
    text_start_with = ['Cells 2022, 11, 2650', 'Article', 'Academic Editor','Received:', 'Citation',
                       'Accepted:' , 'Published:',"Publisher’s Note:", "G check for updates", "Susanne Staehlke 1",
                       "Institutional Review Board Statement: Not applic",'Informed Consent Statement: Not applicable',
                       'Conﬂicts of Interest:', 'https:', 'Copyright: © ', 'Acknowledgments:', 'Funding:','Data Availability Statement','References']
    docs = [doc for doc in Docs if not any(doc.page_content.strip().startswith(text) for text in text_start_with)]
    return docs

In [None]:
Docs = remove_boilerplate(Docs)
Docs = clean_narrative_text(Docs)
# Print only the Title elements; every other element prints as an empty line
for doc in Docs:
    print(doc.page_content if doc.metadata.get('category') == 'Title' else "")

Pulsed Electrical Stimulation Affects Osteoblast Adhesion and Calcium Ion Signaling


1. Introduction



2. Materials and Methods






































3. Results






























3.4.4. Cell Morphology










4. Discussion



















5. Conclusions






In [None]:
from langchain.chat_models import init_chat_model
from langchain.messages import SystemMessage,HumanMessage,AIMessage
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from dotenv import load_dotenv
import os

# Loading the environment variables, such as openAI API key. This will keep my API key secret.

load_dotenv()

model = init_chat_model("openai:gpt-4o-mini", 
                        temperature = 0.8, 
                        max_tokens = 2000, 
                        max_retries = 2,
                        timeout = 60)

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser, StrOutputParser

def generate_table_summary(txt_as_html: str, caption: str) -> str:

    # defining a role for model
    Role = """You are an expert biomedical research assistant who specializes in analyzing scientific tables.
            Your ONLY task is to generate a written summary of information contained in any table provided in 100 words.

            You must follow these strict rules:

                You must NOT generate, recreate, or display any table, in any format (including Markdown, text-based rows, bullet-formatted tables, or aligned columns).
    
                You must respond ONLY with a narrative summary, written in clear and concise sentences.
    
                Your summary must highlight key trends, important values, significant differences, and notable insights present in the original table.
    
                You must NOT list every value. Focus only on meaningful findings and overall patterns.
    
                Output must be free of table structures, headings, or column-like formatting."""

    # setting up the parser. The summary will be in string only so, using string output parser
    #parser = PydanticOutputParser(pydantic_object=tablesummary)

    parser = StrOutputParser()
    # defining prompt
    table_template = ChatPromptTemplate([
        ("system", Role),
        ("human", "Please generate summary, Table content: \n{table_content}\nCaption: {table_caption}")
    ],
    )

    
    # defining a chain
    chain = table_template | model | parser
    response = chain.invoke({"table_content": txt_as_html, "table_caption": caption})
    return response

### Key Notes:
The two cleaning functions defined earlier, `remove_boilerplate` and `clean_narrative_text`, clean up the NarrativeText and Title elements.

## Strategy: Parent-child chunking for Parent Document Retriever

In [None]:
# First, check which metadata keys are present
metadata_keys = set()
for doc in Docs:
    for key in doc.metadata.keys():
        metadata_keys.add(key)
print(metadata_keys)

{'coordinates', 'links', 'parent_id', 'text_as_html', 'last_modified', 'is_extracted', 'filename', 'filetype', 'element_id', 'source', 'languages', 'category', 'detection_class_prob', 'page_number'}


In [None]:
# Filtering the metadata keys
def filter_metadata_keys(Docs: list[Document], not_allowed_keys: list[str]) -> list[Document]:
    filtered_docs = []
    for doc in Docs:
        filtered_metadata = {key: value for key, value in doc.metadata.items() if key not in not_allowed_keys}
        filtered_doc = Document(page_content=doc.page_content, metadata=filtered_metadata)
        filtered_docs.append(filtered_doc)
    return filtered_docs

filter_docs = filter_metadata_keys(Docs, not_allowed_keys=['coordinates', 'languages'])


In [None]:
# Check which metadata keys remain after filtering
metadata_keys = set()
for doc in filter_docs:
    for key in doc.metadata.keys():
        metadata_keys.add(key)
print(metadata_keys)

{'links', 'parent_id', 'text_as_html', 'last_modified', 'is_extracted', 'filename', 'filetype', 'element_id', 'source', 'category', 'detection_class_prob', 'page_number'}


### Generating parent chunks
> Generating parent chunks by dividing the document into sub-section-wise chunks

In [None]:
## Creating sub-section wise chunks from cleaned documents
from uuid import uuid4
import re

def Create_parent_chunks(Docs: list[Document]) -> list[Document]:
    new_docs = []
    new_content = ""
    section_header = ""
    subsection_header = ""
    Paper_title = "Pulsed Electrical Stimulation Affects Osteoblast Adhesion and Calcium Ion Signaling"
    metadata = {}
    caption = ""  # most recent table caption; stays empty until one is seen
    current_chunk_page = None
    section_pattern = r"^(\d+\.)+\s*(.+)$"  # numbered headings such as "2.5. Cell Culture"
    for doc in Docs:
        metadata_copy = doc.metadata.copy()
        
        content = doc.page_content
        current_page = metadata_copy.get('page_number', None)
        doc_category = doc.metadata.get('category', '')
        is_section_header = (doc_category == 'Title')

        if is_section_header:
            if new_content:
                chunk_metadata = doc.metadata.copy()
                chunk_metadata['section_header'] = section_header
                
                if current_chunk_page is not None:
                    chunk_metadata['start_page'] = current_chunk_page
                chunk_metadata['end_page'] = current_page
                chunk_metadata['paper_title'] = Paper_title
                document_id = str(uuid4())
                chunk_metadata['document_id'] = document_id
                new_docs.append(Document(page_content=new_content.strip(), metadata=chunk_metadata))

            if content.strip() != Paper_title:
                section_header = content.strip()
                new_content = ""
                metadata = metadata_copy
                current_chunk_page = current_page
            continue

        # Now we will add the category wise content, unless it is a section header
        if doc_category in ['NarrativeText', 'Table', 'FigureCaption']:
            current_page = doc.metadata.get('page_number', None)
            chunk_metadata = doc.metadata.copy()
            chunk_metadata['section_header'] = section_header
            chunk_metadata['subsection_header'] = subsection_header
            if re.match(section_pattern, content.strip()):
                subsection_header = content.strip()
            new_content += "\n" + content.strip()+ "\n"

        # Remember the most recent table caption so it can accompany the table summary
        if doc_category in ("NarrativeText", "FigureCaption") and content.startswith("Table"):
            caption = content.strip()

        # Append an LLM-written summary for every table that carries HTML markup
        if doc_category == "Table" and metadata_copy.get('text_as_html'):
            txt_html = metadata_copy.get('text_as_html')
            summary = generate_table_summary(txt_html, caption)
            new_content += "\n Table Summary: \n" + summary.replace("Assistant: ", "") + "\n"
                
        metadata.update(chunk_metadata)

    # After processing all documents, check if there's remaining content to be added
    if new_content:
        chunk_metadata = metadata.copy()
        chunk_metadata['section_header'] = section_header
        if current_chunk_page is not None:
            chunk_metadata['start_page'] = current_chunk_page
        chunk_metadata['end_page'] = current_page
        chunk_metadata['paper_title'] = Paper_title
        document_id = str(uuid4())
        chunk_metadata['document_id'] = document_id
        new_docs.append(Document(page_content=new_content.strip(), metadata=chunk_metadata))
    return new_docs
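
In the retrieval output further down, a chunk from "5. Conclusions" still carries the subsection header "3.4.5. Reactive Oxygen Species (ROS) Production", which suggests the subsection is never reset when a new section starts. The sketch below illustrates one possible way to track both levels on plain strings. The helper names `heading_depth` and `track_headers` are assumptions, and the pattern requires whitespace after the numbering, which is slightly stricter than `section_pattern` above.

```python
import re

# One or more "digits." groups, then whitespace, then the heading text
HEADING = re.compile(r"^((?:\d+\.)+)\s+(.+)$")


def heading_depth(text: str) -> int:
    """Return 1 for '3. Results', 3 for '3.4.5. ...', and 0 for non-headings."""
    match = HEADING.match(text.strip())
    return match.group(1).count(".") if match else 0


def track_headers(lines):
    """Pair each body line with the section and subsection it belongs to."""
    section, subsection = "", ""
    tagged = []
    for line in lines:
        depth = heading_depth(line)
        if depth == 1:
            # A new top-level section clears the previous subsection
            section, subsection = line.strip(), ""
        elif depth >= 2:
            subsection = line.strip()
        else:
            tagged.append((section, subsection, line))
    return tagged


lines = [
    "3. Results",
    "3.4.5. Reactive Oxygen Species (ROS) Production",
    "ROS text",
    "5. Conclusions",
    "Closing text",
]
tagged = track_headers(lines)
for row in tagged:
    print(row)
```

Resetting the subsection at every depth-1 heading keeps the metadata of a later section from inheriting a header that belongs to an earlier one.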



In [None]:
new_docs = Create_parent_chunks(filter_docs)
for doc in new_docs:
    print(doc.page_content)
    print("\n"+"---"*20+"\n" + str(doc.metadata)+"\n" +"---"*20 +"\n")

Abstract: An extensive research ﬁeld in regenerative medicine is electrical stimulation (ES) and its impact on tissue and cells. The mechanism of action of ES, particularly the role of electrical parameters like intensity, frequency, and duration of the electric ﬁeld, is not yet fully understood. Human MG-63 osteoblasts were electrically stimulated for 10 min with a commercially available multi-channel system (IonOptix). We generated alternating current (AC) electrical ﬁelds with a voltage of 1 or 5 V and frequencies of 7.9 or 20 Hz, respectively. To exclude liquid-mediated effects, we characterized the AC-stimulated culture medium. AC stimulation did not change the medium’s pH, temperature, and oxygen content. The H2O2 level was comparable with the unstimulated samples except at 5 V_7.9 Hz, where a signiﬁcant increase in H2O2 was found within the ﬁrst 30 min. Pulsed electrical stimulation was beneﬁcial for the process of attachment and initial adhesion of suspended osteoblasts. At the

In [None]:
for doc in new_docs:
    print(doc.metadata.get('document_id'))


1878c6e0-4a58-467e-aa73-b4903b8c96dc
f1ba21fc-73fd-46f5-b372-fb6fe1501b10
9bea5f9e-f290-4e69-9708-f3438211636a
f748e9ba-a518-4862-b4a3-f256aac2917c
d88a7df0-b347-4734-bffa-f60b301c50f3
e5d30f91-0bca-4a57-a880-93fb97a2c031
afb903fe-91e7-4149-b578-320ead27db03


### Creating parent and child chunks

In [None]:
import tiktoken
token_encoding = "cl100k_base"
encoding = tiktoken.get_encoding(token_encoding)

def token_length(text: str) -> int:
    # Reuse the module-level encoding instead of looking it up on every call
    return len(encoding.encode(text))

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.store.memory import InMemoryStore
from langchain_classic.retrievers import ParentDocumentRetriever
from langchain_core.documents import Document

parent_splitter = RecursiveCharacterTextSplitter(chunk_size = 3500, 
                                                 chunk_overlap =0,
                                                separators=["\n\n"],
                                                 length_function = token_length,
                                                 add_start_index = True)

child_splitter = RecursiveCharacterTextSplitter(chunk_size = 400, 
                                                chunk_overlap = 200,
                                                separators=["\n\n", "\n", "."],
                                                add_start_index = True)
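
The child splitter uses `chunk_size = 400` with `chunk_overlap = 200`, so consecutive child chunks can share up to half of their text. The sketch below is a simplified fixed-window illustration of what overlap means. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also respects the separators, and the function name `sliding_windows` is an assumption.

```python
def sliding_windows(text: str, size: int, overlap: int):
    """Cut `text` into windows of `size` characters that share `overlap` characters.

    Returns (start_index, window) pairs, similar in spirit to add_start_index.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, text[start:start + size]))
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows


windows = sliding_windows("abcdefghij", size=4, overlap=2)
print(windows)
```

A large overlap raises the number of child chunks to embed, which is one trade-off to weigh against better recall at chunk boundaries. Note also that the child splitter above measures length in characters by default, while the parent splitter measures it in tokens through `token_length`.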





In [None]:
parent_chunks = parent_splitter.split_documents(new_docs)
child_chunks = child_splitter.split_documents(new_docs)
print(f"Number of parent chunks: {len(parent_chunks)}")
print(f"Number of child chunks: {len(child_chunks)}")
token_length(child_chunks[0].page_content)

Number of parent chunks: 10
Number of child chunks: 257


58

In [None]:
# Defining the embedding model and vector store

from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
from langchain_chroma import Chroma

load_dotenv()

embeddings = OpenAIEmbeddings(model="text-embedding-3-large",)

vector_store = Chroma(embedding_function=embeddings,
                      collection_name="biorag_indexed_docs",  # Name of the collection inside the vector store
                      persist_directory="./chroma_db/",  # Directory on disk where the vector store is persisted
                      )




In [None]:
# Defining the orchestrator to manage the parent-child relationship between documents
from langchain_core.stores import InMemoryStore

# Two different stores for two jobs:
# vector_store holds embeddings of the child chunks (fast similarity search),
# docstore holds the full parent documents (rich context).
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vector_store,    # Indexes the child chunks
    docstore=docstore,           # Holds the parent documents, keyed by ID
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
    search_kwargs={"k": 5}       # Number of child chunks fetched per query
)

retriever.add_documents(new_docs)

In [None]:
# Retrieve documents relevant to the query

query = "Compare the simulated electric field outcomes with the measured current peak values."
relevant_docs = retriever.invoke(query)
print(len(relevant_docs))

2
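
The retriever searches with `k = 5` but only two documents come back. A likely explanation is that several of the five child hits belong to the same parent, and each parent is returned once. The sketch below mimics that lookup with plain dictionaries. The function name `parents_for_hits` and the sample data are assumptions; `doc_id` is used as the linking key on the assumption that the retriever's default `id_key` is in effect.

```python
def parents_for_hits(child_hits, parent_store, id_key="doc_id"):
    """Map child hits to their parents, keeping first-seen order without duplicates."""
    ordered_ids = []
    for hit in child_hits:
        parent_id = hit.get(id_key)
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Skip IDs that are missing from the store instead of failing
    return [parent_store[pid] for pid in ordered_ids if pid in parent_store]


# Made-up data: five child hits that point at only two parents
child_hits = [
    {"doc_id": "p1", "text": "child a"},
    {"doc_id": "p2", "text": "child b"},
    {"doc_id": "p1", "text": "child c"},
    {"doc_id": "p1", "text": "child d"},
    {"doc_id": "p2", "text": "child e"},
]
parent_store = {"p1": "Parent one", "p2": "Parent two"}

parents = parents_for_hits(child_hits, parent_store)
print(len(parents))
```

Since the parent store here lives in memory while the Chroma collection is persisted on disk, child vectors from an earlier session could point at parent IDs that no longer exist, which is why the sketch skips missing IDs.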


In [None]:
# Retrieve documents relevant to the query

query = "What are the key findings?"
relevant_docs = retriever.invoke(query)
print(relevant_docs)

[Document(metadata={'source': 'cells-11-02650-v2.pdf', 'detection_class_prob': 0.9470230937004089, 'is_extracted': 'true', 'last_modified': '2025-12-13T08:33:42', 'filetype': 'application/pdf', 'page_number': 19, 'filename': 'cells-11-02650-v2.pdf', 'parent_id': 'fa631b702d35563385205e6e468c8500', 'category': 'NarrativeText', 'element_id': 'c0f4541538a2cec1d73446f49db8584c', 'section_header': '5. Conclusions', 'subsection_header': '3.4.5. Reactive Oxygen Species (ROS) Production', 'start_page': 19, 'end_page': 19, 'paper_title': 'Pulsed Electrical Stimulation Affects Osteoblast Adhesion and Calcium Ion Signaling', 'document_id': 'afb903fe-91e7-4149-b578-320ead27db03', 'start_index': 0}, page_content='Precisely characterizing the electric ﬁelds and their inﬂuence on the environment is crucial to comparing and evaluating in vitro stimulation studies. The liquid DMEM electrically stimulated with 1 V (7.9 and 20 Hz) did not change their pH, temperature, H2O2, and O2 content characteristics

### Generation

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()
template =  ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an **Expert Scientific Literature Analyst**. Your primary goal is to provide concise, factual, and strictly evidence-based answers."
            "\n\n--- INSTRUCTIONS ---\n"
            "1. **STRICT CONSTRAINT:** You MUST answer the user's question *ONLY* using the text provided in the 'CONTEXT' section below.\n"
            "2. **Evidence Requirement:** If the answer is found, you MUST cite the relevant source document by including the content of the document's 'source' metadata tag (e.g., [Source: document_name.pdf, page 5]).\n"
            "3. **Uncertainty:** If the answer cannot be found in the provided CONTEXT, you MUST state: 'I apologize, the necessary information could not be found in the available document sections.'\n"
            "4. **Format:** Output your answer in a clear, well-structured format (e.g., bullet points, numbered lists, or short paragraphs)."
            "\n\n--- CONTEXT ---\n"
            "{context}" 
        ),
        ("user", "{question}"),
    ]
)

In [None]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

parallel_chain = (
    {
    "context": retriever,
    "question": RunnablePassthrough()
    }   
)

chain = parallel_chain | template | model | parser
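
In the chain above the retriever output is passed to `{context}` as is, so the prompt receives the list of `Document` objects rendered as a string, metadata included. A minimal sketch of a formatter that keeps only the text plus a citation tag is shown below. The `Doc` dataclass is a hypothetical stand-in for the LangChain `Document`, `format_docs` is an assumed helper name, and the `start_page` and `end_page` keys are the ones written by `Create_parent_chunks`.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Render each document as '[Source: file, page N]' followed by its text."""
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        start = doc.metadata.get("start_page", "?")
        end = doc.metadata.get("end_page", start)
        pages = f"page {start}" if start == end else f"pages {start}-{end}"
        blocks.append(f"[Source: {source}, {pages}]\n{doc.page_content}")
    return "\n\n".join(blocks)


# Made-up documents for illustration
docs = [
    Doc("Conclusion text", {"source": "cells-11-02650-v2.pdf", "start_page": 19, "end_page": 19}),
    Doc("Methods text", {"source": "cells-11-02650-v2.pdf", "start_page": 3, "end_page": 6}),
]
context = format_docs(docs)
print(context)
```

One way to use it would be to replace the `"context": retriever` entry with `"context": retriever | format_docs`, relying on LCEL coercing a plain function into a runnable. That wiring is untested here and should be treated as a suggestion.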

In [None]:
response = chain.invoke("how is Calcium Ion Mobilization?")
print(response)

Calcium ion mobilization during AC (alternating current) stimulation in osteoblasts (MG-63s) is characterized by the following findings:

- **Initial Response**: The intracellular calcium ion (Ca2+) level increases significantly during the first 10 minutes of AC stimulation. This increase in Ca2+ levels is independent of the specific stimulation parameters used but is noted to be particularly pronounced at a frequency of 20 Hz.
  
- **Basal Ca2+ Level Enhancement**: After electrical stimulation, the basal intracellular Ca2+ ion level is enhanced in cells that were pre-stimulated with the higher frequency of 20 Hz. This indicates that the cells are activated during the AC stimulation.

- **Calcium Mobilization Upon Additional Stimulus**: Following the initial AC stimulation, when ATP (adenosine triphosphate) is introduced, there is a significant mobilization of calcium ions in the pre-stimulated cells, indicating a strong response to the ATP stimulation.

- **Mechanism of Mobilization**

In [None]:
response = chain.invoke("Why did the authors specifically select the frequencies 7.9 Hz and 20 Hz, and how do these frequencies relate to physiological electrical signals in bone?")
print(response)

The authors specifically selected the frequencies of 7.9 Hz and 20 Hz based on previous studies that reported positive influences on osteoblasts at these frequencies. They aimed to investigate the effects of these frequencies in the context of electrical stimulation (ES) on osteoblast behavior.

The physiological relevance of these frequencies is tied to their relationship with natural physiological electrical signals in bone. The document notes that electrical stimulation plays a role in various physiological processes, including those in bone, where electric fields can modulate cell functions to accelerate wound healing and bone regeneration. The choice of these frequencies was informed by the understanding that they could effectively mimic or enhance the natural electrical signaling that occurs during bone maintenance and repair [Source: cells-11-02650-v2.pdf, page 19].


In [None]:
response = chain.invoke("How do the authors distinguish between direct electrical effects on cells and indirect effects mediated through changes in the culture medium? ")
print(response)

The authors distinguish between direct electrical effects on cells and indirect effects mediated through changes in the culture medium through the following methods:

1. **Experimental Design**: 
   - They conducted experiments with both direct electrical stimulation of the cells and stimulation of the culture medium alone to analyze the effects on cell behavior. This included using AC-activated medium for cell adhesion and spreading experiments.

2. **Monitoring Parameters**:
   - They monitored parameters such as temperature, pH, oxygen levels, and hydrogen peroxide (H2O2) content in the acellular medium that was subjected to electrical stimulation. They aimed to ensure that any change in cell behavior could not be attributed solely to changes in the medium.

3. **Control Experiments**: 
   - Control experiments were performed where cells were subjected to the same conditions without electrical stimulation to compare results and determine the influence of electric fields directly on 