## **Query Search with RAG/LLM in Google Colab**

This Colab notebook downloads scientific papers, extracts their text and metadata, builds a searchable vector database, and answers questions interactively using Retrieval-Augmented Generation (RAG) with an LLM served by Groq. Each answer is based on the processed documents and is listed together with its source papers.

### **Step 1: Install dependencies**
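* **pygetpapers:** Command-line tool for downloading papers from repositories such as Europe PMC
* **lxml:** XML parser used to read the downloaded full-text files
* **pymupdf4llm:** Lightweight PDF processing for LLMs
* **langchain:** Framework for developing LLM-powered applications
* **chromadb:** Vector store for storing and querying embeddings
* **sentence-transformers:** For embedding sentences using transformer models
* **langchain-community, langchain-groq:** LangChain integrations for Chroma, Hugging Face embeddings, and Groq-hosted models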



In [1]:
# Install dependencies
!pip install pygetpapers
!pip install lxml
!pip install pymupdf4llm langchain chromadb sentence-transformers
!pip install -U langchain-community langchain-groq
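
After installation it can help to confirm which versions actually ended up in the runtime, since LangChain import paths change between releases. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an illustrative assumption, not part of the notebook.

```python
from importlib import metadata

# Distribution names installed by the cell above
PACKAGES = [
    "pygetpapers", "lxml", "pymupdf4llm", "langchain", "chromadb",
    "sentence-transformers", "langchain-community", "langchain-groq",
]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` points to a failed `pip install` line above.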


### **Step 2: Set Groq API Key**
Groq’s LPU (Language Processing Unit) hardware serves LLMs with low latency, which suits interactive applications such as the question-answering loop in this notebook.

**Instructions:**
* Go to https://console.groq.com/
* Create an account if you don’t have one
* Generate an API key
* Copy and paste it when prompted below

In [None]:
#  Set API Key
import os, getpass
os.environ["GROQ_API_KEY"] = getpass.getpass("🔐 Enter your Groq API Key: ")

🔐 Enter your Groq API Key: ··········
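
Before building the chain, it may be worth checking that the key was actually stored, without echoing it in full. This is a small sketch; `get_required_env` and `mask_secret` are hypothetical helper names introduced here for illustration.

```python
import os

def get_required_env(name):
    """Return the environment variable's value, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; rerun the API key cell.")
    return value

def mask_secret(value, visible=4):
    """Hide all but the last `visible` characters of a secret."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]

# Usage in the notebook (raises RuntimeError if the key was left empty):
# print(mask_secret(get_required_env("GROQ_API_KEY")))
```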


### **Step 3: Download Research Papers**
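Use *pygetpapers* to fetch up to 15 full-text articles from Europe PMC that match the query "phytochemical". Each article is saved as XML in its own subfolder of the output directory.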

In [None]:
# Download papers from EuropePMC
!pygetpapers --query '"phytochemical"' --xml --limit 15 --output /content/data_phyto --save_query

INFO: Total Hits are 69697
15it [00:00, 68609.12it/s]
INFO: Saving XML files to /content/data_phyto/*/fulltext.xml
100% 15/15 [00:14<00:00,  1.06it/s]
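
Not every hit comes with usable full text, so a quick inventory of the download folder can be useful before parsing. The sketch below assumes the layout shown in the log (`<output>/<article folder>/fulltext.xml`); the function name `summarize_downloads` and the `PMC_A`/`PMC_B`/`PMC_C` folder names in the demo are placeholders, not real article IDs.

```python
import pathlib
import tempfile

def summarize_downloads(root):
    """Group article folders by whether fulltext.xml is present, empty, or missing."""
    root = pathlib.Path(root)
    summary = {"with_fulltext": [], "empty_fulltext": [], "missing_fulltext": []}
    if not root.is_dir():
        return summary
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        xml_file = folder / "fulltext.xml"
        if not xml_file.exists():
            summary["missing_fulltext"].append(folder.name)
        elif xml_file.stat().st_size == 0:
            summary["empty_fulltext"].append(folder.name)
        else:
            summary["with_fulltext"].append(folder.name)
    return summary

# Demo on a throwaway directory with placeholder folder names
with tempfile.TemporaryDirectory() as tmp:
    demo_root = pathlib.Path(tmp)
    (demo_root / "PMC_A").mkdir()
    (demo_root / "PMC_A" / "fulltext.xml").write_text("<article/>", encoding="utf-8")
    (demo_root / "PMC_B").mkdir()
    (demo_root / "PMC_B" / "fulltext.xml").touch()  # empty file
    (demo_root / "PMC_C").mkdir()                   # no fulltext.xml at all
    demo_summary = summarize_downloads(demo_root)

print(demo_summary)
```

In the notebook, the call would be `summarize_downloads("/content/data_phyto")`.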


### **Step 4: Parse XML Files to Markdown and Extract Metadata**
Convert scientific articles (downloaded in XML format) into clean Markdown format and extract essential metadata like title, authors, and DOI.

In [None]:
#  Parse XMLs to Markdown and extract metadata
import pathlib
import re
from lxml import etree

def sanitize_filename(name):
    return re.sub(r'[\/:"*?<>|]+', "_", name)

def parse_xml_to_markdown_with_metadata(xml_path):
    try:
        with open(xml_path, 'rb') as f:
            tree = etree.parse(f)

        metadata = {
            "title": "",
            "authors": "",  # filled in below as a comma-separated string
            "doi": "",
        }

        # Extract title
        title_elem = tree.find(".//article-title")
        if title_elem is not None:
            full_title = title_elem.xpath("string()")  # ✅ gets entire text including inside nested tags
            metadata["title"] = full_title.strip()


        # Extract DOI
        doi_elem = tree.find(".//article-id[@pub-id-type='doi']")
        if doi_elem is not None and doi_elem.text:
            metadata["doi"] = "https://doi.org/" + doi_elem.text.strip()

        # Extract authors
        authors = []
        for contrib in tree.findall(".//contrib[@contrib-type='author']"):
            name = contrib.find('name')
            if name is not None:
                given = name.findtext('given-names', default='')
                surname = name.findtext('surname', default='')
                full_name = f"{given} {surname}".strip()
                if full_name:
                    authors.append(full_name)

        metadata["authors"] = ", ".join(authors)

        # Extract body content
        sections = tree.xpath('//body//sec')
        text_parts = []

        for sec in sections:
            title = sec.findtext('title')
            if title:
                text_parts.append(f"### {title.strip()}")
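            paragraphs = sec.findall('p')
            for p in paragraphs:
                # itertext() also collects text inside and after inline tags
                # (e.g. <italic>, <xref>); p.text stops at the first child element
                paragraph_text = " ".join("".join(p.itertext()).split())
                if paragraph_text:
                    text_parts.append(paragraph_text)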

        markdown_text = "\n\n".join(text_parts)
        return markdown_text, metadata

    except Exception as e:
        print(f" Error parsing {xml_path.name}: {e}")
        return None  #  Safe fallback to prevent unpacking error
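
Paragraphs in article XML usually contain inline elements such as `<italic>` or `<xref>`, which affects how much text an element's `.text` attribute returns. The sketch below illustrates the difference between `.text` and `itertext()` on a made-up paragraph. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra installs; `lxml.etree` elements expose the same `.text` and `itertext()` interface.

```python
import xml.etree.ElementTree as ET

# Made-up paragraph with two inline elements
snippet = (
    "<p>Extracts of <italic>Psidium guajava</italic> "
    "showed activity <xref>[1]</xref>.</p>"
)
p = ET.fromstring(snippet)

# .text holds only the text before the first child element
text_only = p.text.strip()

# itertext() walks the whole subtree; split/join normalizes whitespace
full_text = " ".join("".join(p.itertext()).split())

print(text_only)  # Extracts of
print(full_text)  # Extracts of Psidium guajava showed activity [1].
```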

### **Step 5: Process and Convert Scientific XMLs to Markdown**
Automate the batch conversion of multiple scientific XML files to Markdown and collect their metadata.

In [None]:
def process_scientific_xmls(data_directory, output_directory):
    data_path = pathlib.Path(data_directory)
    output_path = pathlib.Path(output_directory)
    output_path.mkdir(parents=True, exist_ok=True)

    metadata_records = []

    xml_files = list(data_path.glob("**/fulltext.xml"))
    for xml_file in xml_files:
        print(f" Processing {xml_file.name} ...")

        #  Skip empty XML files
        if xml_file.stat().st_size == 0:
            print(f" Skipped: {xml_file.name} (Empty file)")
            continue

        #  Safe call and unpack
        result = parse_xml_to_markdown_with_metadata(xml_file)
        if result is None:
            continue
        raw_text, metadata = result

        #  Save Markdown
        sanitized_name = sanitize_filename(xml_file.parent.name)
        final_filename = output_path / f"{sanitized_name}_final.md"

        if raw_text.strip():
            final_filename.write_text(raw_text, encoding="utf-8")
            print(f" Saved: {final_filename.name}")
            metadata["filename"] = final_filename.name
            metadata_records.append((final_filename, metadata))
        else:
            print(f" Skipped: {xml_file.name} (No extractable content)")

    return metadata_records
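
The returned records are a list of `(markdown_path, metadata)` tuples. To inspect them outside the notebook, one option is to dump the metadata dictionaries to a JSON file. This is a sketch only: `export_metadata` is a hypothetical helper, and the record in the demo is placeholder data rather than a real paper.

```python
import json
import pathlib
import tempfile

def export_metadata(metadata_records, output_file):
    """Write the metadata dicts from (path, metadata) tuples to a JSON file."""
    rows = [metadata for _path, metadata in metadata_records]
    pathlib.Path(output_file).write_text(
        json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return len(rows)

# Demo with a single placeholder record
demo_records = [
    (
        pathlib.Path("paper_a_final.md"),
        {"title": "Example title A", "authors": "A. Author", "doi": "",
         "filename": "paper_a_final.md"},
    )
]
with tempfile.TemporaryDirectory() as tmp:
    out_file = pathlib.Path(tmp) / "metadata.json"
    written = export_metadata(demo_records, out_file)
    reloaded = json.loads(out_file.read_text(encoding="utf-8"))

print(written, reloaded[0]["title"])
```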


### **Step 6: Load, Chunk, and Store Documents in a Vector Database**
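Load the Markdown files as documents, split any document longer than 3,000 characters into overlapping chunks (1,800 characters with a 300-character overlap), embed the chunks with the `all-mpnet-base-v2` model, and store them in a Chroma vector database. A retrieval chain then passes the three best-matching chunks to the Groq-hosted LLM as context for each question.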

In [None]:
#  Load and Chunk Documents with Metadata
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
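# Import from langchain_community (installed in Step 1); the langchain.* paths are deprecated
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings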
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain_groq import ChatGroq

def load_markdown_documents_with_metadata(metadata_records):
    documents = []
    for md_path, metadata in metadata_records:
        text = md_path.read_text(encoding="utf-8")
        if not text.strip():
            continue
        doc = Document(page_content=text, metadata=metadata)
        documents.append(doc)
    return documents

def hybrid_chunking(documents, threshold=3000):
    chunks = []
    for doc in documents:
        if len(doc.page_content.strip()) <= threshold:
            chunks.append(doc)
        else:
            splitter = RecursiveCharacterTextSplitter(chunk_size=1800, chunk_overlap=300)
            split_docs = splitter.split_documents([doc])
            for chunk in split_docs:
                chunk.metadata.update(doc.metadata)
            chunks.extend(split_docs)
    return chunks

def create_vector_database(chunks):
    embeddings = HuggingFaceEmbeddings(model_name="all-mpnet-base-v2")
    vector_db = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name="scientific_rag_xml",
        persist_directory="/content/db"
    )
    return vector_db

def create_retrieval_chain_with_groq(vector_db):
    llm = ChatGroq(
        model="llama3-70b-8192",
        temperature=0.2,
        max_tokens=512,
        api_key=os.environ.get("GROQ_API_KEY")
    )
    prompt_template = PromptTemplate.from_template(
        '''You are a very good research paper assistant. Use this context to answer the following question. You can use your own knowledge if asked general biology- or chemistry-related questions.

Context:
{context}

Question: {question}

Answer:'''
    )
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_db.as_retriever(search_kwargs={"k": 3}),
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt_template}
    )
    return qa_chain
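
To make the chunking parameters concrete, here is a simplified sketch of the size arithmetic behind `hybrid_chunking`. It is an assumption-laden stand-in: it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph boundaries, so real chunks are usually shorter than `chunk_size` and their boundaries differ. The function names are illustrative.

```python
def sliding_window_chunks(text, chunk_size=1800, overlap=300):
    """Cut text into fixed-size windows; consecutive windows share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

def hybrid_chunks(text, threshold=3000, chunk_size=1800, overlap=300):
    """Keep short texts whole; split only texts longer than `threshold`."""
    if len(text.strip()) <= threshold:
        return [text]
    return sliding_window_chunks(text, chunk_size, overlap)

short_text = "x" * 3000
long_text = "".join(str(i % 10) for i in range(4000))

print(len(hybrid_chunks(short_text)))               # 1
print([len(c) for c in hybrid_chunks(long_text)])   # [1800, 1800, 1000]
```

With these settings each new window starts 1,500 characters after the previous one, so a 4,000-character document yields windows starting at offsets 0, 1,500, and 3,000.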

### **Step 7: Execute Full Pipeline for Document Processing and Retrieval**
Run the pipeline on the papers downloaded in Step 3: convert the XML files to Markdown, chunk the documents, build the vector database, and set up the question-answering chain.

In [None]:
# ✅ Execute Full Pipeline
xml_dir = "/content/data_phyto"  # folder of XML papers downloaded in Step 3
markdown_dir = "/content/markdowns"
os.makedirs(markdown_dir, exist_ok=True)

metadata_records = process_scientific_xmls(xml_dir, markdown_dir)
docs = load_markdown_documents_with_metadata(metadata_records)
chunks = hybrid_chunking(docs)
vector_db = create_vector_database(chunks)
qa_chain = create_retrieval_chain_with_groq(vector_db)
print(" RAG System Ready.")

📄 Processing fulltext.xml ...
✅ Saved: PMC11859777_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11850848_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11945817_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11892241_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11983554_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11876561_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11940872_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11819868_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11993562_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11792193_final.md
📄 Processing fulltext.xml ...
⚠️ Skipped: fulltext.xml (No extractable content)
📄 Processing fulltext.xml ...
⚠️ Skipped: fulltext.xml (Empty file)
📄 Processing fulltext.xml ...
✅ Saved: PMC11975133_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11987653_final.md
📄 Processing fulltext.xml ...
✅ Saved: PMC11964270_final.md


🚀 RAG System Ready.
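
Conceptually, each question goes through two steps: the retriever ranks stored chunks by similarity to the question and keeps the top `k`, and the "stuff" chain pastes those chunks into the prompt's `{context}` slot. The toy sketch below mimics this with hand-written 2-D vectors and cosine similarity. These are assumptions for illustration only: the real pipeline uses `all-mpnet-base-v2` embeddings, and the distance metric Chroma applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_chunks, k=3):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _vec in ranked[:k]]

def build_prompt(question, chunks):
    """'Stuff' all retrieved chunks into a single context block."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

# Toy index: (chunk text, made-up 2-D embedding)
indexed = [
    ("chunk about flavonoids", [1.0, 0.0]),
    ("chunk about oak populations", [0.0, 1.0]),
    ("chunk about alkaloids", [0.9, 0.1]),
    ("chunk about fruit morphometry", [0.1, 0.9]),
]
query_vec = [1.0, 0.05]  # made-up embedding of the question

selected = top_k(query_vec, indexed, k=2)
prompt = build_prompt("Which compound classes are discussed?", selected)
print(prompt)
```

Because all selected chunks go into a single prompt, `k` and the chunk size together bound how much context the LLM receives per question.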


### **Step 8: Query the Retrieval System**
Ask scientific questions in a loop and get answers based on the documents stored in the vector database. Each answer is followed by the papers its context was retrieved from. Type `quit` to stop.

In [None]:
import re
from IPython.display import Markdown, display

#  Ask Questions
while True:
    query = input("🧠 Ask a scientific question (or type 'quit'): ").strip()
    if query.lower() == "quit":
        break

    result = qa_chain.invoke(query)
    answer = result.get("result", "")

    # Display answer with wrapped formatting
    display(Markdown(f"###  Answer:\n\n{answer}"))

    # Format sources as Markdown links to their DOIs
    source_lines = []
    for doc in result['source_documents']:
        title = doc.metadata.get("title", "Untitled")
        title = re.sub(r"[()]", "", title)  # Remove parentheses that break markdown
        doi = doc.metadata.get("doi", "")
        source_lines.append(f"- [{title}]({doi})" if doi else f"- {title}")

    display(Markdown("**📚 Sources:**\n" + "\n".join(source_lines)))


###  Answer:

Phytochemicals are bioactive compounds produced by plants, which can be found in various parts of the plant, such as leaves, twigs, roots, fruits, and flowers. These compounds are responsible for the plant's defense mechanisms, growth, and development, and can also have beneficial effects on human health. Phytochemicals can be classified into different classes, including alkaloids, flavonoids, phenolics, terpenoids, and glycosides, among others. They have been reported to possess various biological activities, such as antimicrobial, antioxidant, anti-inflammatory, and anticancer properties, making them a valuable source for the development of new medicines and therapeutic agents.

**📚 Sources:**
- [Phytochemical Analysis and Allelopathic Potential of an Aggressive Encroacher Shrub, Euryops floribundus Asteraceae](https://doi.org/10.3390/plants14040601)
- [Qualitative phytochemical profiling, and in vitro antimicrobial and antioxidant activity of Psidium guajava Guava](https://doi.org/10.1371/journal.pone.0321190)
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)

###  Answer:

Based on the provided context, the chemical compounds mentioned in the paper are:

1. Flavonoids (FV)
2. Alkaloids (AL)
3. LT (likely a type of flavonoid or alkaloid, but exact identity not specified)
4. TG (likely a type of glycoside, but exact identity not specified)
5. RT (likely a type of triterpene, but exact identity not specified)
6. QT (likely a type of quinone, but exact identity not specified)
7. KF (likely a type of flavonoid or alkaloid, but exact identity not specified)

Additionally, the paper mentions the use of the following chemicals in the experimental procedures:

1. MeOH (methanol)
2. n-hexane
3. Chloroform
4. Magnesium sulfate
5. Silica
6. Helium (99.99%)

Please note that the exact identities of LT, TG, RT, QT, and KF are not specified in the provided context. If you need further clarification or have additional information, I'd be happy to help!

**📚 Sources:**
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)
- [Qualitative phytochemical profiling, and in vitro antimicrobial and antioxidant activity of Psidium guajava Guava](https://doi.org/10.1371/journal.pone.0321190)

###  Answer:

Based on the provided context, the locations mentioned in the paper are:

1. Iran
2. India
3. Yemen
4. Saudi Arabia
5. Egypt

**📚 Sources:**
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)
- [Qualitative phytochemical profiling, and in vitro antimicrobial and antioxidant activity of Psidium guajava Guava](https://doi.org/10.1371/journal.pone.0321190)
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)

###  Answer:

Antioxidant activity refers to the ability of a substance to neutralize or counteract the effects of free radicals, which are unstable molecules that can cause oxidative stress and damage to cells. In the context of the research paper, antioxidant activity is measured through various assays, such as the DPPH radical method, total antioxidant capacity, radical scavenging ability, and ferric reducing-antioxidant power (FRAP). These assays assess the ability of the extracts from the Iranian oak populations to reduce or neutralize free radicals, thereby protecting against oxidative damage.

**📚 Sources:**
- [Phytochemical variation, phenolic compounds and antioxidant activity of wild populations of Iranian oak](https://doi.org/10.1038/s41598-025-90991-4)
- [Diversity of Phytochemical Content, Antioxidant Activity, and Fruit Morphometry of Three Mallow, Malva Species Malvaceae](https://doi.org/10.3390/plants14060930)
- [Phytochemical variation, phenolic compounds and antioxidant activity of wild populations of Iranian oak](https://doi.org/10.1038/s41598-025-90991-4)

###  Answer:

Some examples of phenolic compounds include:

* Carotol
* Elemicin
* Limonene
* α-Pinene
* Flavonoids (such as quercetin, kaempferol, and isorhapontigenin)
* Tannins
* Phenolic acids (such as gallic acid, caffeic acid, and ferulic acid)

Note: Phenolic compounds are a large and diverse group of compounds, and this is not an exhaustive list.

**📚 Sources:**
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)
- [Morpho-phytochemical screening and biological assessments of aerial parts of Iranian populations of wild carrot Daucus carota L. subsp. carota](https://doi.org/10.1038/s41598-025-96965-w)

###  Answer:

Here is an overview of the research paper:

**Title:** Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek (Trigonella foenum graecum): A Comprehensive Analysis Based on Statistical Evaluation

**Objective:** To investigate the correlation between the phytochemical composition of fenugreek seeds from diverse origins and their biological activities, including cytotoxicity, antibacterial, antifungal, and α-amylase inhibition.

**Methodology:**

* Fenugreek seed samples from five different origins were collected and extracted in three different solvents.
* The extracts were analyzed for their phytochemical composition, including flavonoids, phenolics, alkaloids, saponins, and flavonoids glycosides.
* The biological activities of the extracts were evaluated using cell culture studies, including cytotoxicity, antibacterial, antifungal, and α-amylase inhibition assays.
* Statistical models, including Pearson's analysis and principal component analysis (PCA), were applied to link the significant correlations and paired differences among the biological activities and phytochemicals.

**Key Findings:**

* A significant correlation was found between the phytochemical composition of fenugreek seeds and their biological activities.
* The flavonoids and alkaloids were found to be the key phytochemical classes responsible for the biological activities of fenugreek seeds.
* The extracts with higher amounts of flavonoids and alkaloids showed higher cytotoxicity, antibacterial, and α-amylase inhibitory activities.
* The PCA analysis revealed that the cytotoxicity and α-amylase activities were loaded alongside the flavonoids and alkaloids, while the antimicrobial activity was loaded in a separate cluster.

**Conclusion:**
This study provides a comprehensive analysis of the correlation between the phytochemical composition of fenugreek seeds and their biological activities. The findings suggest that the flavonoids and alkaloids are the key phytochemical classes responsible for the biological activities of fenugreek seeds, and that the statistical models can be used to predict the biological activities of fenugreek extracts based on their phytochemical composition.

**📚 Sources:**
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)
- [Exploring the Role of Phytochemical Classes in the Biological Activities of Fenugreek Trigonella feonum graecum: A Comprehensive Analysis Based on Statistical Evaluation](https://doi.org/10.3390/foods14060933)
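
Several of the source lists above repeat the same paper, because more than one of the three retrieved chunks can come from a single document. One possible refinement, sketched here under the assumption that each chunk's metadata carries the `title` and `doi` keys set in Step 4, is to deduplicate sources before display. `unique_sources` is a hypothetical helper, not part of the notebook.

```python
def unique_sources(metadata_list):
    """Build Markdown source lines, keeping only the first entry per DOI (or title)."""
    seen = set()
    lines = []
    for meta in metadata_list:
        title = meta.get("title") or "Untitled"
        doi = meta.get("doi", "")
        key = doi or title  # fall back to the title when no DOI is available
        if key in seen:
            continue
        seen.add(key)
        lines.append(f"- [{title}]({doi})" if doi else f"- {title}")
    return lines

# Metadata as it would appear on three retrieved chunks (titles shortened here)
retrieved = [
    {"title": "Fenugreek phytochemical classes", "doi": "https://doi.org/10.3390/foods14060933"},
    {"title": "Fenugreek phytochemical classes", "doi": "https://doi.org/10.3390/foods14060933"},
    {"title": "Psidium guajava profiling", "doi": "https://doi.org/10.1371/journal.pone.0321190"},
]
source_lines = unique_sources(retrieved)
print("\n".join(source_lines))
```

In the query loop, the input would be `[doc.metadata for doc in result["source_documents"]]`.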