**Before using this Colab, please save a copy to your own Google Drive:
Click on “File” > “Save a copy in Drive”**

# **AI Assisted Literature Review Part II RAG/LLM**
# *A. Download Research Papers of interest*
# *B. Demo: Pre-process the downloaded file*
# *C. Demo: Query your newly created RAG/LLM*


This Colab notebook downloads scientific papers, converts them to Markdown, extracts metadata, builds a searchable vector database, and answers questions interactively using Retrieval-Augmented Generation (RAG) with an LLM served by Groq. Answers draw on the passages retrieved from the processed documents.

### **WORKFLOW:**
* Install the necessary libraries
* Set the Groq API key
* Download research papers using [pygetpapers](https://github.com/petermr/pygetpapers)
* Parse XML files to Markdown and extract metadata
* Create the vector database
* Execute the pipeline for processing and retrieval
* Query your newly created RAG/LLM
* (Optional) Convert the Q&A log to PDF

### **Step 1: Install dependencies**
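* **pygetpapers:** Command-line tool for downloading papers from repositories such as Europe PMC
* **lxml:** XML parser used to read the downloaded full-text files
* **pymupdf:** PDF text extraction (imported as `fitz`)
* **langchain** (with **langchain-community**, **langchain-huggingface**, and **langchain-groq**): Framework for developing LLM-powered applications
* **chromadb:** Vector store for storing and querying embeddings
* **sentence-transformers:** For embedding sentences using transformer models
* **markdown2** and **weasyprint:** Convert the Q&A log from Markdown to PDF (optional Step 8)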



In [None]:
# Install dependencies
!pip install pygetpapers
!pip install lxml
!pip install langchain chromadb sentence-transformers
!pip install -U langchain-huggingface
!pip install -U langchain-community langchain-groq
!pip install pymupdf markdown2 weasyprint


### **Step 2: Set Groq API Key**
Groq’s LPU (Language Processing Unit) hardware enables real-time, low-latency responses from LLMs—ideal for interactive applications.

**Instructions:**
* Go to https://console.groq.com/
* Create an account if you don’t have one
* Generate an API key
* Paste it when prompted below (the input is hidden)
* Press **Enter** once done

In [None]:
#  Set API Key
import os, getpass
os.environ["GROQ_API_KEY"] = getpass.getpass("🔐 Enter your Groq API Key: ")

🔐 Enter your Groq API Key: ··········
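Before moving on, it can help to confirm that the key was actually stored, without echoing it. The sketch below is one possible check; `require_env` is a hypothetical helper name and not part of any Groq library.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; re-run the API key cell above.")
    return value

# Example: report only the key length so the secret never appears in the output.
# key = require_env("GROQ_API_KEY")
# print(f"GROQ_API_KEY loaded ({len(key)} characters)")
```

Failing early here is cheaper than discovering a missing key only when the first question is sent to the model.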


### **Step 3: Download Research Papers**
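Use *pygetpapers* to fetch research articles from Europe PMC that match a query. The command below searches for the phrase "Climate change", downloads the full-text XML of 2 matching papers, and saves them under `/content/data_climate`.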

In [None]:
# Download papers from EuropePMC
!pygetpapers --query '"Climate change"' --xml --limit 2 --output /content/data_climate --save_query

INFO: Total Hits are 251138
2it [00:00, 32140.26it/s]
INFO: Saving XML files to /content/data_climate/*/fulltext.xml
100% 2/2 [00:01<00:00,  1.14it/s]
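The log above indicates that each paper is saved as `fulltext.xml` inside its own sub-folder of the output directory. A small sketch for listing what was downloaded, assuming that layout (`find_fulltext_files` is a hypothetical helper name):

```python
from pathlib import Path

def find_fulltext_files(root):
    """Return the fulltext.xml files found one level below `root`, sorted by path."""
    return sorted(Path(root).glob("*/fulltext.xml"))

# Example (path taken from the download command above):
# for xml_file in find_fulltext_files("/content/data_climate"):
#     print(xml_file.parent.name, xml_file.stat().st_size, "bytes")
```

An empty result usually means the download failed or a different output folder was used.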


### **Step 4: Parse XML Files to Markdown and Extract Metadata**
Convert scientific articles (downloaded in XML format) into clean Markdown format and extract essential metadata like title, authors, and DOI.
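For PDFs, where no structured metadata is available, the DOI has to be located in the raw text. The sketch below illustrates that idea with the same pattern used in the PDF branch of the cell below, plus one added assumption that is not in the original code: trailing sentence punctuation is trimmed from the match. `find_doi_url` is a hypothetical helper name.

```python
import re

DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)

def find_doi_url(text):
    """Return the first DOI in `text` as a https://doi.org/ URL, or "" if none is found."""
    match = DOI_PATTERN.search(text)
    if not match:
        return ""
    # Heuristic: drop punctuation that the pattern captures at the end of a sentence.
    # Rare DOIs that legitimately end in ")" would be shortened by this step.
    doi = match.group(0).rstrip(".,;:)")
    return f"https://doi.org/{doi}"

print(find_doi_url("Published at https://doi.org/10.1371/journal.pgph.0004897."))
```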

In [None]:
#  Parse XMLs to Markdown and extract metadata
import pathlib
import re
from lxml import etree
import fitz  # PyMuPDF
from datetime import datetime

def sanitize_filename(name):
    # Replace characters that are not allowed in file names (including the backslash)
    return re.sub(r'[\\/:"*?<>|]+', "_", name)

def extract_text_from_pdf(pdf_path):
    try:
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            text += page.get_text()
        return text.strip()
    except Exception as e:
        print(f"❌ Error extracting PDF {pdf_path.name}: {e}")
        return ""

def parse_xml_to_markdown_with_metadata(xml_path):
    try:
        with open(xml_path, 'rb') as f:
            tree = etree.parse(f)

        metadata = {
            "title": "",
            "authors": [],
            "doi": "",
        }

        title_elem = tree.find(".//article-title")
        if title_elem is not None:
            full_title = title_elem.xpath("string()").strip()
            metadata["title"] = full_title if full_title else xml_path.stem
        else:
            metadata["title"] = xml_path.stem

        doi_elem = tree.find(".//article-id[@pub-id-type='doi']")
        if doi_elem is not None and doi_elem.text:
            metadata["doi"] = "https://doi.org/" + doi_elem.text.strip()

        authors = []
        for contrib in tree.findall(".//contrib[@contrib-type='author']"):
            name = contrib.find('name')
            if name is not None:
                given = name.findtext('given-names', default='')
                surname = name.findtext('surname', default='')
                full_name = f"{given} {surname}".strip()
                if full_name:
                    authors.append(full_name)

        metadata["authors"] = ", ".join(authors)

        sections = tree.xpath('//body//sec')
        text_parts = []

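        for sec in sections:
            title = sec.findtext('title')
            if title:
                text_parts.append(f"### {title.strip()}")
            paragraphs = sec.findall('p')
            for p in paragraphs:
                # itertext() keeps the text that follows inline tags (e.g. <xref>, <italic>);
                # p.text alone stops at the first child element and truncates the paragraph.
                para_text = "".join(p.itertext()).strip()
                if para_text:
                    text_parts.append(para_text)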

        markdown_text = "\n\n".join(text_parts)
        return markdown_text, metadata

    except Exception as e:
        print(f"❌ Error parsing {xml_path.name}: {e}")
        return None

def process_input_path(input_path, output_dir):
    input_path = pathlib.Path(input_path)
    output_path = pathlib.Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    metadata_records = []

    if input_path.is_file():
        # pygetpapers saves every paper as <folder>/fulltext.xml, so build the output
        # name from the parent folder, stem, and extension. Otherwise all papers would
        # overwrite the same fulltext_final.md file.
        extension = input_path.suffix.lstrip(".").lower()
        final_name = sanitize_filename(f"{input_path.parent.name}_{input_path.stem}_{extension}") + "_final.md"
        final_path = output_path / final_name

        if extension == "xml":
            result = parse_xml_to_markdown_with_metadata(input_path)
            if result:
                raw_text, metadata = result
                if raw_text.strip():
                    final_path.write_text(raw_text, encoding="utf-8")
                    metadata["filename"] = final_path.name
                    metadata_records.append((final_path, metadata))

        elif extension == "pdf":
            text = extract_text_from_pdf(input_path)
            if text:
                # Use the first reasonably long line near the top as the title
                first_lines = text.split('\n')[:3]
                title_candidate = next((line.strip() for line in first_lines if len(line.strip()) > 10), input_path.stem)
                title_candidate = title_candidate.replace("_", " ").strip().title()

                # Try to extract DOI using regex
                doi_match = re.search(r"(10\.\d{4,9}/[-._;()/:A-Z0-9]+)", text, re.I)
                doi = f"https://doi.org/{doi_match.group(1)}" if doi_match else ""

                final_path.write_text(text, encoding="utf-8")

                metadata = {
                    "title": title_candidate,
                    "authors": "Unknown",
                    "doi": doi,
                    "filename": final_path.name
                }

                metadata_records.append((final_path, metadata))

    elif input_path.is_dir():
        # rglob("*") also yields sub-directories; recurse into files only so that
        # each paper is processed exactly once.
        for file in sorted(input_path.rglob("*")):
            if file.is_file():
                metadata_records += process_input_path(file, output_dir)

    return metadata_records
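Paragraphs in article XML often contain inline elements such as citations or italics, and an element's `.text` attribute holds only the text before the first child element. The sketch below shows the difference on a made-up snippet. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs (an assumption made for portability; `lxml` elements expose the same `.text` and `.itertext()` members).

```python
import xml.etree.ElementTree as ET

# Made-up paragraph with two inline elements, for illustration only
snippet = (
    "<sec><title>Results</title>"
    "<p>Heat stress <xref>[3]</xref> raised risk in <italic>all</italic> groups.</p>"
    "</sec>"
)
sec = ET.fromstring(snippet)
p = sec.find("p")

only_leading_text = (p.text or "").strip()      # text before the first child element
full_text = "".join(p.itertext()).strip()       # text of the paragraph and all descendants

print(only_leading_text)  # Heat stress
print(full_text)          # Heat stress [3] raised risk in all groups.
```

When extracted paragraphs look cut off in the Markdown output, this is a likely place to check first.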

### **Step 5: Create Vector Database**
Split the Markdown documents into chunks, embed each chunk, and store the vectors in a Chroma database so that the passages most relevant to a question can be retrieved.
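Conceptually, retrieval compares the embedding of a question with the embedding of every stored chunk and returns the closest ones. The toy sketch below shows that idea in plain Python. The 3-dimensional vectors, the chunk labels, and the `top_k` helper are all illustrative assumptions; real embeddings have hundreds of dimensions, and the index and distance metric used by Chroma may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=3):
    """Return the k labels whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [label for label, _ in ranked[:k]]

# Hand-made stand-ins for (chunk, embedding) pairs
store = [
    ("chunk about nutrition", [0.9, 0.1, 0.0]),
    ("chunk about floods",    [0.1, 0.9, 0.1]),
    ("chunk about methods",   [0.0, 0.2, 0.9]),
]
print(top_k([0.8, 0.2, 0.1], store, k=2))
# ['chunk about nutrition', 'chunk about floods']
```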

In [None]:
#  Load and Chunk Documents with Metadata
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma  # from the langchain-community package installed in Step 1
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings as HFEmbeddings


def load_markdown_documents_with_metadata(metadata_records):
    documents = []
    for md_path, metadata in metadata_records:
        text = md_path.read_text(encoding="utf-8")
        if text.strip():
            doc = Document(page_content=text, metadata=metadata)
            documents.append(doc)
    return documents

def hybrid_chunking(documents, threshold=3000):
    chunks = []
    for doc in documents:
        if len(doc.page_content.strip()) <= threshold:
            chunks.append(doc)
        else:
            splitter = RecursiveCharacterTextSplitter(chunk_size=1800, chunk_overlap=300)
            split_docs = splitter.split_documents([doc])
            for chunk in split_docs:
                chunk.metadata.update(doc.metadata)
            chunks.extend(split_docs)
    return chunks

def create_vector_database(chunks):
    embeddings = HFEmbeddings(model_name="all-mpnet-base-v2")
    vector_db = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name="scientific_rag_db",
        persist_directory="/content/db"
    )
    return vector_db

def create_retrieval_chain(vector_db):
    llm = ChatGroq(
        model="llama3-70b-8192",
        temperature=0.2,
        max_tokens=512,
        api_key=os.environ.get("GROQ_API_KEY")
    )
    prompt_template = PromptTemplate.from_template(
        '''You are a helpful research paper assistant. Use the following context to answer scientific questions. Use your own knowledge only if relevant.

Context:
{context}

Question: {question}

Answer:'''
    )
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_db.as_retriever(search_kwargs={"k": 3}),
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt_template}
    )
    return qa_chain
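To make the `chunk_size=1800` and `chunk_overlap=300` settings concrete, the sketch below splits a string into fixed-size windows, each starting 1500 characters after the previous one. It is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` also tries to break on paragraph and sentence separators, so its real chunk boundaries will differ. `sliding_window_chunks` is a hypothetical helper name.

```python
def sliding_window_chunks(text, chunk_size=1800, chunk_overlap=300):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "x" * 4000
print([len(c) for c in sliding_window_chunks(sample)])
# [1800, 1800, 1000]
```

The overlap means that a sentence falling on a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.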

### **Step 6: Execute pipeline for Processing and Retrieval**
Runs the processing pipeline end to end: converts the downloaded papers to Markdown, chunks them, builds the vector database, and sets up the question-answering chain.

In [None]:
user_path = input("📂 Enter path to a PDF/XML file or folder (e.g., /content/data_climate): ").strip()
markdown_dir = "/content/markdowns"
os.makedirs(markdown_dir, exist_ok=True)

metadata_records = process_input_path(user_path, markdown_dir)
docs = load_markdown_documents_with_metadata(metadata_records)
chunks = hybrid_chunking(docs)
vector_db = create_vector_database(chunks)
qa_chain = create_retrieval_chain(vector_db)
print(" RAG System Ready.")

📂 Enter path to a PDF/XML file or folder (e.g., /content/data_climate): /content/data_climate
 RAG System Ready.
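Before querying, a quick count of chunks per paper can confirm that every document made it into the index. The sketch below assumes each chunk exposes a `metadata` dict with a `title` key, as the `Document` objects built in Step 5 do. `SimpleNamespace` objects stand in for real chunks so that the example runs on its own, and `chunks_per_title` is a hypothetical helper name.

```python
from collections import Counter
from types import SimpleNamespace

def chunks_per_title(chunks):
    """Count how many chunks carry each title in their metadata."""
    return Counter(chunk.metadata.get("title", "(no title)") for chunk in chunks)

# Stand-in chunks; in the notebook, pass the `chunks` list returned by hybrid_chunking().
demo_chunks = [
    SimpleNamespace(metadata={"title": "Paper A"}),
    SimpleNamespace(metadata={"title": "Paper A"}),
    SimpleNamespace(metadata={"title": "Paper B"}),
]
for title, count in chunks_per_title(demo_chunks).most_common():
    print(f"{count:3d}  {title}")
```

A paper that is missing from this summary, or that has far fewer chunks than expected, points to a parsing problem in Step 4.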


### **Step 7: Query your newly created RAG/LLM**
Ask scientific questions and get answers based on the documents stored in the vector database. Type `quit` to end the session.

---
### Examples of questions to ask

* Why? What? How?
* What are the key findings of the paper?

In [None]:
from IPython.display import Markdown, display

qa_log_md = "/content/QA_Log.md"
with open(qa_log_md, "w", encoding="utf-8") as log_file:
    log_file.write(f"# Q&A Log - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")

    while True:
        query = input("🧠 Ask a scientific question (or type 'quit'): ").strip()
        if query.lower() == "quit":
            print(f" Q&A Markdown log saved to {qa_log_md}")
            break

        result = qa_chain.invoke({"query": query})
        answer = result.get("result", "")

        # Get top 2 sources
        source_lines = []
        top_sources = result.get("source_documents", [])[:2]
        for doc in top_sources:
            title = doc.metadata.get("title", "").strip()
            if not title or title.lower() in ["untitled", "fulltext"]:
                title = doc.metadata.get("filename", "").replace("_final.md", "").strip()

            doi = doc.metadata.get("doi", "").strip()
            source_lines.append(f"- [{title}]({doi})" if doi else f"- {title}")

        sources_md = "\n".join(source_lines)

        # Display Answer + Sources in Markdown format
        display(Markdown(f"### Answer:\n\n{answer}\n\n**Sources:**\n{sources_md}"))

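        # Save to Q&A Markdown log, reusing the handle opened above. Reopening the
        # file in append mode here would let the buffered header overwrite the
        # start of the log when the outer handle is finally closed.
        log_file.write(f"### Question:\n{query}\n\n")
        log_file.write(f"### Answer:\n{answer}\n\n")
        if sources_md:
            log_file.write("**Sources:**\n" + sources_md + "\n\n")
        log_file.flush()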


🧠 Ask a scientific question (or type 'quit'): what is the effect of climate on nutritional status?


### Answer:

Based on the provided context, climate change events can have a negative impact on nutritional status, particularly among vulnerable populations such as children and adults. The effects of climate change on nutritional status can be influenced by personal and socio-demographic, economic, and environmental factors.

While the context does not provide a direct answer to the question, it suggests that climate change can lead to poor nutritional status, which can further exacerbate health problems. The references provided also support this notion, highlighting the impact of climate change on child health and nutrition.

For example, Reference 3 (Helldén et al., 2021) mentions the importance of considering the effects of climate change on child health, including nutrition. Similarly, Reference 1 (Bhutta et al., 2019) emphasizes the need for paediatricians to address the impacts of climate change on child health, which likely includes nutritional status.

In summary, while the context does not provide a direct answer, it implies that climate change can have a negative impact on nutritional status, particularly among vulnerable populations.

**Sources:**
- [Effect of climate change on the health and nutritional status of children and their families in Africa: Scoping review](https://doi.org/10.1371/journal.pgph.0004897)
- [Plos Global Public Health | Https://Doi.Org/10.1371/Journal.Pgph.0004897  July 14, 2025](https://doi.org/10.1371/journal.pgph.0004897)

🧠 Ask a scientific question (or type 'quit'): how does climate changes health of a person?


### Answer:

Based on the provided context, climate change affects human health in various ways, including:

1. **Disasters**: Climate change is associated with disasters like droughts, floods, temperature changes, and changing vector patterns, which can lead to a range of health problems.
2. **Child health**: Climate change can magnify existing vulnerabilities in children, leading to an estimated 88% of the disease burden. The effects of climate change on child health travel through many different pathways and vary significantly across geographical locations.
3. **Pregnant women**: Extreme weather changes during pregnancy have been associated with an increased risk of preterm birth, partly attributable to water scarcity, which can have implications for the health of the neonate and the development of the child.
4. **Malnutrition**: Climate change can exacerbate malnutrition, which is a leading factor in child morbidity and mortality.
5. **Mortality**: Weather variability has been reported to increase the risk of overall mortality in children, particularly in infants.

Overall, climate change can have far-reaching consequences for human health, particularly for vulnerable populations such as children and pregnant women.

**Sources:**
- [Plos Global Public Health | Https://Doi.Org/10.1371/Journal.Pgph.0004897  July 14, 2025](https://doi.org/10.1371/journal.pgph.0004897)
- [Navigating parenthood in a climate change era: determinants of childbearing intentions in Iran](https://doi.org/10.1038/s41598-025-11708-1)

🧠 Ask a scientific question (or type 'quit'): effects of climate change in africa


### Answer:

Based on the provided context, the effects of climate change in Africa include:

1. Malnutrition
2. Infectious diseases
3. Respiratory diseases in children and adults
4. Adverse pregnancy and birth outcomes
5. High child and maternal morbidity and mortality
6. Mental health problems

These health conditions are associated with various climatic change phenomena or events, such as:

1. High temperatures
2. Drought
3. Floods
4. Wildfires
5. Air pollution

Additionally, the review highlights that personal and socio-demographic, economic, and environmental factors increase the risk of people to the effects of climate change events on poor nutritional and health status.

**Sources:**
- [Navigating Parenthood In A Climate](https://doi.org/10.1038/s41598-025-11708-1)
- [Plos Global Public Health | Https://Doi.Org/10.1371/Journal.Pgph.0004897  July 14, 2025](https://doi.org/10.1371/journal.pgph.0004897)

🧠 Ask a scientific question (or type 'quit'): quit
 Q&A Markdown log saved to /content/QA_Log.md
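In the first sample answer above, both listed sources point to the same DOI, because several retrieved chunks can come from one paper. One possible refinement is to keep only the first entry per DOI (or per title when no DOI is present). The sketch below works on plain metadata dicts; `unique_sources` is a hypothetical helper name and the DOIs in the demo data are made up.

```python
def unique_sources(metadatas, limit=2):
    """Return up to `limit` Markdown source lines, skipping repeated DOIs or titles."""
    seen = set()
    lines = []
    for meta in metadatas:
        title = (meta.get("title") or meta.get("filename") or "Untitled").strip()
        doi = (meta.get("doi") or "").strip()
        key = doi.lower() or title.lower()  # prefer the DOI as the identity of a paper
        if key in seen:
            continue
        seen.add(key)
        lines.append(f"- [{title}]({doi})" if doi else f"- {title}")
        if len(lines) == limit:
            break
    return lines

# Made-up metadata for illustration
demo = [
    {"title": "Paper A", "doi": "https://doi.org/10.1000/a"},
    {"title": "Paper A (PDF)", "doi": "https://doi.org/10.1000/a"},
    {"title": "Paper B", "doi": ""},
]
print("\n".join(unique_sources(demo)))
```

In the query loop, this could be called with `[doc.metadata for doc in result.get("source_documents", [])]` in place of slicing the first two documents.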


### **(Optional) Step 8: Convert Q&A Log to PDF**

In [None]:
import markdown2
from weasyprint import HTML
from IPython.display import FileLink, display

qa_log_pdf = "/content/QA_Log.pdf"
html_text = markdown2.markdown_path(qa_log_md)
HTML(string=html_text).write_pdf(qa_log_pdf)

print(f"✅ PDF Q&A log saved at: {qa_log_pdf}")
display(FileLink(qa_log_pdf))


✅ PDF Q&A log saved at: /content/QA_Log.pdf
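The conversion can print a large number of `DEBUG` and `INFO` lines from the `fontTools` loggers while fonts are being subset. If that output is not wanted, raising the level of the parent logger with the standard `logging` module before calling `write_pdf` should quiet it. A minimal sketch:

```python
import logging

# Child loggers such as fontTools.subset and fontTools.ttLib.ttFont inherit this level
# unless they have been given a level of their own.
logging.getLogger("fontTools").setLevel(logging.WARNING)
```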


In this Colab notebook, we built a scientific RAG (Retrieval-Augmented Generation) pipeline that extracts information from research papers in XML or PDF format; the demo run uses papers on climate change. We converted the papers to Markdown, embedded the chunks using all-mpnet-base-v2, and connected them to an LLM (LLaMA3-70B via Groq). The assistant can answer questions about topics such as study locations, methodologies, and research findings, and it lists the source papers so that each answer can be checked against the literature.

### **References:**
- Garg A, Smith-Unna R D and Murray-Rust P, pygetpapers: A Python library for automated retrieval of scientific literature, Journal of Open Source Software, 7(75) (2022) 4451. https://doi.org/10.21105/joss.04451

- [groqcloud](https://groq.com/groqcloud/)
