## NOFO Proposal RAG System to Support the Development of Digital Health Test Beds

# Section 0 Environment Setup

In this section I install and configure all Python libraries required for my RAG pipeline:
- sentence-transformers for generating dense embeddings of PDF text.
- faiss-cpu for efficient similarity search over embeddings (my vector store).
- openai for calling the LLM that will generate proposal text.
- python-dotenv for securely loading my API key from a `.env` file.
- PyPDF2 for extracting text from the NOFO, Application Guide, and research papers.

This lets anyone grading or re-running my notebook install the same set of libraries. The install command does not pin versions, so the resolved versions may differ from the ones I used to build and evaluate the system.
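
Since the `pip install` command in this section leaves versions unpinned, one option is to record which versions actually got installed. The snippet below is a minimal sketch using the standard-library `importlib.metadata`. The helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip install command.
PACKAGES = ["sentence-transformers", "faiss-cpu", "openai", "python-dotenv", "PyPDF2"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    # Lines of the form name==version can be pasted into a requirements file.
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Saving this output next to the notebook gives a reader the versions needed to pin the install later.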


In [2]:
!pip install --quiet sentence-transformers faiss-cpu openai python-dotenv PyPDF2


# Section 1 Project Configuration & Paths


- Import core Python libraries (paths, typing, NumPy).
- Import FAISS and SentenceTransformer for embedding + retrieval.
- Import PdfReader to read text from all PDFs.
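- Load my OPENAI_API_KEY (and an optional base URL) from the `.env` file or `config.json` and initialize the OpenAI client.
- Define call_llm, a thin wrapper around the chat completions API with an NIH-specific system message.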
- Define PROJECT_DIR, PAPERS_DIR, and file paths to:
  - The NOFO (`NOFO.pdf`) corresponding to PAR-25-136.
  - The “How to Apply – Application Guide” PDF.
- Print checks that confirm these paths exist.



In [None]:
import os
from pathlib import Path
from typing import List, Dict, Any, Optional
import json

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from PyPDF2 import PdfReader
from dotenv import load_dotenv
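from openai import OpenAI

# Load environment variables from .env (if the file exists)
load_dotenv()

# Values in config.json, when present, override the environment
config_path = Path("config.json")
if config_path.exists():
    with open(config_path, "r", encoding="utf-8") as file:
        config = json.load(file)
    # os.environ only accepts strings, so skip missing keys
    if config.get("API_KEY"):
        os.environ["OPENAI_API_KEY"] = config["API_KEY"]
    if config.get("OPENAI_API_BASE"):
        os.environ["OPENAI_BASE_URL"] = config["OPENAI_API_BASE"]

api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or config.json.")

# Initialize the OpenAI client AFTER loading env vars;
# it also reads OPENAI_BASE_URL from the environment if set
client = OpenAI(api_key=api_key)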

def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    """
    Simple wrapper to call an OpenAI chat model.

    - Uses a strong system message that enforces NIH / NOFO rules.
    - Returns the assistant's text only.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert NIH grant writer and implementation scientist. "
                    "You strictly follow NIH 'How to Apply – Application Guide Research (R)' "
                    "instructions and the specific NOFO requirements for "
                    "'PAR-25-136: Laboratories to Optimize Digital Health (R01 Clinical Trial Required)'. "
                    "You must follow ALL applicable instructions you see in the policy context. "
                    "Do not fabricate citations. Only use information that is provided in the NOFO, "
                    "Application Guide, and retrieved evidence corpus."
                ),
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    # message.content can be None (e.g., on a refusal), so fall back to ""
    return (response.choices[0].message.content or "").strip()


PROJECT_DIR = Path.home() / "Desktop" / "digital_health_RAG"
PAPERS_DIR = Path.home() / "Desktop" / "Papers"

NOFO_PATH = PROJECT_DIR / "NOFO.pdf"
GUIDE_PATH = PROJECT_DIR / "How to Apply – Application Guide _ Grants & Funding.pdf"

print("Project dir:", PROJECT_DIR)
print("Papers dir:", PAPERS_DIR)
print("NOFO exists?", NOFO_PATH.exists())
print("Guide exists?", GUIDE_PATH.exists())
print("OPENAI_API_KEY set?", os.getenv("OPENAI_API_KEY") is not None)


# Section 2 PDF Ingestion & Chunking

Here I implement reusable utilities to turn raw PDFs into text chunks that can be fed into the RAG pipeline:

- load_pdf_text(path): reads all pages from a PDF and concatenates the extracted text.
- load_papers_from_folder(folder): iterates over all PDFs in my local `Papers` directory and loads them as documents with metadata.
- chunk_text(...): splits long documents into overlapping text chunks (e.g., 1200 characters with 200-character overlap), attaching metadata such as source filename and chunk index.

Chunking is important for the RAG system because it:
- Keeps each piece of text small enough for the embedding model and LLM context window.
- Preserves local context via overlap, which improves retrieval quality.
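
To make the overlap arithmetic concrete, the following sketch computes only the character spans that this kind of chunking produces, without touching any PDF. The helper `chunk_starts` is a hypothetical name introduced for illustration. It assumes the window advances by `chunk_size - overlap` characters each step.

```python
def chunk_starts(text_len: int, chunk_size: int = 1200, overlap: int = 200) -> list:
    """Start offsets of each chunk for a document of text_len characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window moves each iteration
    return list(range(0, text_len, step))


text_len = 3000
starts = chunk_starts(text_len)
# Each chunk covers [start, start + chunk_size), clipped at the end of the text.
spans = [(s, min(s + 1200, text_len)) for s in starts]
print(spans)  # [(0, 1200), (1000, 2200), (2000, 3000)]
```

Each span shares its first 200 characters with the end of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.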


In [60]:
from typing import List, Dict, Any, Optional

def load_pdf_text(path: Path) -> str:
    """Read all text from a PDF file using PyPDF2."""
    reader = PdfReader(str(path))
    pages = []
    for page in reader.pages:
        try:
            pages.append(page.extract_text() or "")
        except Exception as e:
            print(f"Warning: could not read page in {path.name}: {e}")
    return "\n\n".join(pages)


def load_papers_from_folder(folder: Path) -> List[Dict[str, Any]]:
    """Load all PDFs in a folder as a list of documents."""
    docs = []
    for pdf_path in folder.glob("*.pdf"):
        text = load_pdf_text(pdf_path)
        if not text.strip():
            print(f"Warning: {pdf_path.name} seems empty.")
            continue
        docs.append(
            {
                "id": pdf_path.name,
                "text": text,
                "metadata": {"source": "paper", "filename": pdf_path.name},
            }
        )
    print(f"Loaded {len(docs)} PDFs from {folder}")
    return docs


def chunk_text(
    text: str,
    chunk_size: int = 1200,
    overlap: int = 200,
    source_id: str = "",
    extra_metadata: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """
    Simple character-based chunking with overlap.
    Returns list of {"id", "text", "metadata"}.
    """
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would make the step <= 0 and loop forever
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if extra_metadata is None:
        extra_metadata = {}

    chunks = []
    text_len = len(text)
    step = chunk_size - overlap  # guaranteed > 0 by the check above
    start = 0

    while start < text_len:
        end = start + chunk_size
        chunk = text[start:end]

        chunk_id = f"{source_id}_chunk_{len(chunks)}"
        metadata = {"source_id": source_id, "chunk_index": len(chunks)}
        metadata.update(extra_metadata)

        chunks.append({"id": chunk_id, "text": chunk, "metadata": metadata})

        # Advance by the step so consecutive chunks share `overlap` characters
        start += step

    return chunks


# Section 3 Vector Store for Semantic Retrieval

In this section I define a lightweight vector store abstraction around FAISS:

- SimpleVectorStore wraps:
  - A SentenceTransformer embedding model.
  - A FAISS IndexFlatL2 for nearest-neighbor search.
  - Lists of texts and their associated metadata.

Key methods:
- add_documents(docs): embeds all text chunks and adds them to the FAISS index.
- search(query, k): retrieves the top-k most similar chunks for a user query, ranked by L2 distance (a lower score means a closer match).
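
To make the ranking concrete, here is a dependency-free sketch of what an exhaustive ("flat") L2 index does conceptually. It compares the query against every stored vector and keeps the k smallest squared distances. The function `flat_l2_search` and the toy 2-D vectors are illustrative assumptions; the actual store delegates this work to FAISS and operates on sentence embeddings.

```python
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> List[Tuple[float, int]]:
    """Return up to k (squared_distance, index) pairs, closest first."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()  # smallest distance first; ties broken by index
    return scored[:k]


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
hits = flat_l2_search(vectors, query=[1.0, 1.0], k=2)
print(hits)  # [(1.0, 1), (2.0, 0)]
```

Because the score is a distance rather than a similarity, results should be read smallest-first. This sketch simply returns fewer than k hits when the store is small, whereas FAISS pads the result with index -1 in that case.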


In [45]:

class SimpleVectorStore:
    def __init__(self, embed_model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(embed_model_name)
        self.index = None  # FAISS index
        self.embeddings = None  # numpy array
        self.texts: List[str] = []
        self.metadatas: List[Dict[str, Any]] = []

    def add_documents(self, docs: List[Dict[str, Any]]):
        """
        docs: list of {"id", "text", "metadata"}
        """
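        if not docs:
            # Encoding an empty list yields no 2-D array to size the index from
            print("No documents to add.")
            return

        new_texts = [d["text"] for d in docs]
        new_metas = [d["metadata"] for d in docs]

        print(f"Embedding {len(new_texts)} chunks...")
        new_embs = self.model.encode(new_texts, show_progress_bar=True, convert_to_numpy=True)
        # FAISS expects contiguous float32 arrays
        new_embs = np.ascontiguousarray(new_embs, dtype=np.float32)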

        if self.index is None:
            d = new_embs.shape[1]
            self.index = faiss.IndexFlatL2(d)
            self.embeddings = new_embs
        else:
            self.embeddings = np.vstack([self.embeddings, new_embs])

        self.index.add(new_embs)
        self.texts.extend(new_texts)
        self.metadatas.extend(new_metas)

    def search(self, query: str, k: int = 5) -> List[Dict[str, Any]]:
        """Return top-k chunks most similar to the query."""
        if self.index is None:
            raise ValueError("Vector store is empty. Add documents first.")

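        q_emb = self.model.encode([query], convert_to_numpy=True)
        q_emb = np.ascontiguousarray(q_emb, dtype=np.float32)
        distances, indices = self.index.search(q_emb, k)

        results = []
        for dist, idx in zip(distances[0], indices[0]):
            # FAISS pads with -1 when fewer than k vectors are indexed;
            # without this check, texts[-1] would silently return the last chunk
            if idx < 0:
                continue
            results.append(
                {
                    "text": self.texts[int(idx)],
                    "metadata": self.metadatas[int(idx)],
                    "score": float(dist),  # squared L2 distance: lower is closer
                }
            )
        return results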


# Section 4 Reading NOFO & Application Guide (Policy Corpus)


Here I:

1. Load the full text of:
   - The PAR-25-136 NOFO document.
   - The NIH “How to Apply – Application Guide” (Research (R) instructions).
2. Chunk each document using chunk_text, tagging chunks as type="policy".
3. Build policy_store, a vector store containing all NOFO + Application Guide chunks.

This policy corpus is used later to:
- Ensure my generated proposals follow the NOFO’s scientific scope (digital health test beds, mental health, health disparities, etc.).
- Enforce formatting and structural rules from the Application Guide (e.g., Specific Aims, Research Strategy with Significance/Innovation/Approach).


In [None]:

# Load NOFO and Application Guide text
nofo_text = load_pdf_text(NOFO_PATH)
guide_text = load_pdf_text(GUIDE_PATH)

policy_docs = [
    {"id": "NOFO", "text": nofo_text, "metadata": {"source": "NOFO", "type": "policy"}},
    {"id": "HOW_TO_APPLY_GUIDE", "text": guide_text, "metadata": {"source": "guide", "type": "policy"}},
]

# Chunk both policy documents
policy_chunks = []
for doc in policy_docs:
    chunks = chunk_text(
        doc["text"],
        chunk_size=1200,
        overlap=200,
        source_id=doc["id"],
        extra_metadata={"source": doc["metadata"]["source"], "type": "policy"},
    )
    policy_chunks.extend(chunks)

print(f"Total policy chunks: {len(policy_chunks)}")

policy_store = SimpleVectorStore()
policy_store.add_documents(policy_chunks)


# Section 5 Analyzing Past Research Papers (Evidence Corpus)

In this section I:

1. Call load_papers_from_folder(PAPERS_DIR) to load all PDFs from my local repository of previous work and research ideas.
2. Use chunk_text to break each paper into overlapping text chunks, attaching metadata such as filename and type="research_paper".
3. Build research_store, a vector store that contains all research-evidence chunks.



In [None]:

research_docs = load_papers_from_folder(PAPERS_DIR)

# Chunk each paper
research_chunks = []
for doc in research_docs:
    chunks = chunk_text(
        doc["text"],
        chunk_size=1200,
        overlap=200,
        source_id=doc["id"],
        extra_metadata={"filename": doc["metadata"]["filename"], "type": "research_paper"},
    )
    research_chunks.extend(chunks)

print(f"Total research chunks: {len(research_chunks)}")

# Build the vector store
research_store = SimpleVectorStore()
research_store.add_documents(research_chunks)


# Section 6 Retrieval Functions (Linking Policy + Evidence to the LLM)

This section defines helper functions for the RAG retrieval step:

- retrieve_context(goal_description, k_research, k_policy):
  - Uses research_store to retrieve the most relevant research paper chunks.
  - Uses policy_store to retrieve the most relevant NOFO/Application Guide chunks.
- format_context_for_prompt(results):
  - Formats the retrieved chunks into a structured text block with labels
    (e.g., [POLICY 1], [RESEARCH 2]) that will be inserted into the LLM prompt.


In [63]:

def retrieve_context(
    goal_description: str,
    k_research: int = 8,
    k_policy: int = 8,
) -> Dict[str, List[Dict[str, Any]]]:
    """
    Given a high-level goal/prompt, pull top chunks from research + policy stores.
    """
    research_results = research_store.search(goal_description, k=k_research)
    policy_results = policy_store.search(goal_description, k=k_policy)

    return {
        "research": research_results,
        "policy": policy_results,
    }


def format_context_for_prompt(results: Dict[str, List[Dict[str, Any]]]) -> str:
    """
    Turn retrieved chunks into a compact context block for the LLM.
    """
    lines = []

    lines.append("POLICY & INSTRUCTIONS (NOFO + How to Apply Guide)")
    for i, r in enumerate(results["policy"], start=1):
        src = r["metadata"].get("source", "")
        lines.append(f"[POLICY {i} | source={src}]")
        lines.append(r["text"].strip())
        lines.append("")

    lines.append("SCIENTIFIC EVIDENCE & BACKGROUND (Research Papers)")
    for i, r in enumerate(results["research"], start=1):
        fname = r["metadata"].get("filename", "")
        lines.append(f"[RESEARCH {i} | file={fname}]")
        lines.append(r["text"].strip())
        lines.append("")

    return "\n".join(lines)


# Section 7 Goal Definition & RAG Retrieval

In this section I:

- Define a **high-level goal description** that reflects the scientific and technical focus of the NOFO on digital mental health test beds.
- Use my RAG retrieval pipeline to:
  - Search the **policy corpus** (NOFO + Application Guide) for key requirements and constraints.
  - Search the **evidence corpus** (past research papers) for relevant prior work and methods.
- Format the retrieved content into a compact context block that will be fed into the LLM in later sections.


In [None]:

goal_description = """
Design and evaluate a digital health test bed to optimize real-world
deployment of AI-enabled tools for behavioral and mental health.
The system should support diverse patient populations, integrate
with existing clinical workflows, and generate evaluable outcomes
aligned with PAR-25-136: Laboratories to Optimize Digital Health
(R01 Clinical Trial Required).
""".strip()

# Number of chunks to pull from each corpus
K_POLICY = 8
K_RESEARCH = 8

results = retrieve_context(
    goal_description=goal_description,
    k_research=K_RESEARCH,
    k_policy=K_POLICY,
)

context_block = format_context_for_prompt(results)

print("Goal Description")
print(goal_description)
print("\nRetrieved Context (Policy + Research) \n")
print(context_block[:4000], "...\n")  # truncate for display


# Section 8 Generating Research Ideas (LLM Processing – Step 1)

In this section I:

- Use the **retrieved policy + evidence context** to generate structured research ideas.
- Ask the LLM to propose several candidate project concepts that:
  - Satisfy key NOFO requirements.
  - Leverage insights from my prior research papers.
  - Are feasible for an R01 digital health test bed.


In [None]:

def generate_research_ideas(goal: str, context: str, n_ideas: int = 5) -> str:
    """
    Use the RAG context to generate several candidate research ideas
    that respond to the NOFO and leverage prior work.
    Returns a formatted string of ideas.
    """
    prompt = f"""
You are an NIH implementation scientist and grant writer.

You are helping to design a **digital health test bed** grant proposal
for PAR-25-136: Laboratories to Optimize Digital Health (R01 Clinical Trial Required).

First, carefully read the following **GOAL DESCRIPTION** and **RETRIEVED CONTEXT**.

GOAL DESCRIPTION:
\"\"\"{goal}\"\"\"

RETRIEVED CONTEXT (Policy + Research):
\"\"\"{context}\"\"\"

TASK:
1. Propose exactly {n_ideas} distinct, numbered **research ideas** that:
   - Align with the NOFO’s intent, eligibility, and review criteria.
   - Use or extend methods / findings that appear in the research corpus.
   - Are realistic for an R01 test bed (multi-site or robust single-site, pragmatic design).
2. For each idea, provide:
   - A short title.
   - A 3–5 sentence description.
   - 2–3 bullet points on how it meets NOFO priorities (e.g., digital mental health, health disparities, implementation focus).

Return your answer in a clearly formatted, numbered list.
"""
    ideas_text = call_llm(prompt)
    return ideas_text


research_ideas = generate_research_ideas(goal_description, context_block, n_ideas=5)

print("Generated Research Ideas \n")
print(research_ideas)


# Section 9 Drafting a Technical Approach (LLM Processing – Step 2)

In this section I:

- Select and refine one of the research ideas as the **primary project concept**.
- Ask the LLM to draft a structured **Technical Approach** section that:
  - Is tailored to PAR-25-136 and NIH R01 conventions.
  - Uses the RAG context from the NOFO, Application Guide, and prior research.
  - Is organized into standard NIH-style headings (e.g., Overview, Significance, Innovation, Approach).


In [None]:
chosen_idea_number = 1

prompt_for_approach = f"""
You previously generated the following **candidate research ideas**:

\"\"\"{research_ideas}\"\"\"

Focus on **Idea {chosen_idea_number}** as the primary project concept.

You also have the following:
- GOAL DESCRIPTION:
\"\"\"{goal_description}\"\"\"

- RETRIEVED CONTEXT (Policy + Research):
\"\"\"{context_block}\"\"\"

TASK:
Draft a detailed but concise **Technical Approach** for a 5-page NIH-style
proposal responding to PAR-25-136: Laboratories to Optimize Digital Health (R01 Clinical Trial Required).

Requirements:
- Assume this section will later be edited and formatted into a ≤5-page PDF.
- Use clear headings and subheadings such as:
  - Overview / Project Summary
  - Significance and Innovation
  - Test Bed Design and Sites
  - Participants and Recruitment
  - Digital Health Intervention / Platform
  - Study Design and Methods
  - Data Collection and Outcomes
  - Analysis Plan
  - Implementation and Scalability
  - Human Subjects Protections and Data Security (high-level)
  - Timeline and Milestones
- Explicitly reference alignment with NOFO priorities (e.g., digital mental health, test bed focus, health disparities, implementation).
- Integrate insights from the prior research corpus when relevant (e.g., study designs, measures, analytic strategies).

Write in professional NIH grant language, but keep the draft readable enough that a human can later trim and reformat.
"""

draft_technical_approach = call_llm(prompt_for_approach)

print("Draft Technical Approach (Version 1) \n")
print(draft_technical_approach)


# Section 10 LLM Evaluation and Iterative Refinement

In this section I:

- Ask the LLM to **critically evaluate** the draft technical approach against:
  - NOFO requirements and review criteria.
  - Clarity, coherence, feasibility, and alignment with the digital health test bed concept.
- Request **specific revision suggestions**, including missing elements or misalignments.
- Generate a **revised version** of the technical approach that incorporates this feedback.
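
The critique-then-revise pattern described above can be sketched independently of any particular model. In the example below, `critique_and_revise` and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in so the sketch runs offline. In the notebook the real LLM wrapper would take its place, with much richer prompts.

```python
from typing import Callable, Dict


def critique_and_revise(
    draft: str,
    llm: Callable[[str], str],
    rounds: int = 1,
) -> Dict[str, str]:
    """Alternate review and revision calls, keeping every intermediate text."""
    history = {"v1": draft}
    current = draft
    for i in range(rounds):
        review = llm(f"REVIEW the following draft:\n{current}")
        current = llm(
            f"REVISE the draft using this review:\n{review}\n\nDRAFT:\n{current}"
        )
        history[f"review_{i + 1}"] = review
        history[f"v{i + 2}"] = current
    return history


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call (assumption for this sketch)."""
    return "revised text" if prompt.startswith("REVISE") else "add more detail"


history = critique_and_revise("first draft", fake_llm, rounds=1)
print(list(history))  # ['v1', 'review_1', 'v2']
```

Keeping every version and review in one dictionary makes it straightforward to compare drafts during the later human evaluation step.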

In [None]:

def evaluate_proposal(technical_approach: str, goal: str, context: str) -> str:
    """
    Ask the LLM to critique the draft technical approach and identify
    strengths, weaknesses, and specific revision suggestions.
    """
    prompt = f"""
You are an NIH study section reviewer specializing in digital mental health.

Review the following draft **Technical Approach** for an R01 proposal
responding to PAR-25-136: Laboratories to Optimize Digital Health (R01 Clinical Trial Required).

GOAL DESCRIPTION:
\"\"\"{goal}\"\"\"

RETRIEVED CONTEXT (Policy + Research):
\"\"\"{context}\"\"\"

DRAFT TECHNICAL APPROACH:
\"\"\"{technical_approach}\"\"\"

TASK:
Provide a structured review with the following sections:

1. **Overall Assessment** (1–2 paragraphs).
2. **Strengths** (bullet points).
3. **Weaknesses / Gaps** (bullet points).
4. **Specific Recommendations for Revision**:
   - Missing elements relative to the NOFO or Application Guide.
   - Areas that need greater methodological detail or clarity.
   - Potential concerns about feasibility, innovation, or significance.
5. **Priority Revisions**:
   - A short list (3–5 items) of the most important changes to address before submission.

Write in a constructive, reviewer-style tone.
"""
    review_text = call_llm(prompt)
    return review_text


proposal_review = evaluate_proposal(
    technical_approach=draft_technical_approach,
    goal=goal_description,
    context=context_block,
)

print("LLM Review of Draft Technical Approach\n")
print(proposal_review)


In [None]:
prompt_for_revision = f"""
You are the same NIH grant writer who drafted the previous technical approach.

You just received the following **review and recommendations**:

\"\"\"{proposal_review}\"\"\"

Original DRAFT TECHNICAL APPROACH:
\"\"\"{draft_technical_approach}\"\"\"

TASK:
Produce a **revised Technical Approach (Version 2)** that:

- Incorporates the most important reviewer suggestions.
- Improves alignment with PAR-25-136 and NIH R01 expectations.
- Clarifies methods, study design, and implementation details where requested.
- Keeps the structure suitable for a ≤5-page Technical Approach section.

Do NOT write a summary of the changes.  
Instead, output only the **revised Technical Approach**, fully written out with clear headings and subheadings.
"""

revised_technical_approach = call_llm(prompt_for_revision)

print("Revised Technical Approach (Version 2) \n")
print(revised_technical_approach)


# Section 11 Saving Proposal Outputs for Human Evaluation

In this final technical step I:

- Save the **goal description**, **retrieved context**, **generated research ideas**, both versions of the **LLM-generated technical approach**, and the **review** to text files.
- Use these files as the raw materials that I will manually edit and format into the final ≤5-page PDF Technical Approach required by the assignment.

This corresponds to the **"Human Evaluation"** and **"Final Proposal"** steps in the workflow, where I:
- Review the LLM output.
- Make any necessary conceptual, methodological, or formatting changes.
- Export the final version as a PDF for submission.


In [None]:
OUTPUT_DIR = PROJECT_DIR / "outputs"
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

files_to_write = {
    "goal_description.txt": goal_description,
    "retrieved_context.txt": context_block,
    "research_ideas.txt": research_ideas,
    "draft_technical_approach_v1.txt": draft_technical_approach,
    "proposal_review.txt": proposal_review,
    "revised_technical_approach_v2.txt": revised_technical_approach,
}

for fname, text in files_to_write.items():
    out_path = OUTPUT_DIR / fname
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Wrote: {out_path}")
