
# Notebook Map: Relevance Evaluation Pipeline with Few-Shot + Agentic Enhancements

This table of contents provides a structured overview of the notebook, describing each section's purpose and how it fits into the workflow.

---

## 1. Quick Reference
- Overview of semantic versioning, few-shot prompting, and agentic conflict resolver.

## 2. Imports and Configuration
- Load required libraries and define configuration constants (e.g., few-shot parameters, log paths).

## 3. Core Utility Functions
- `verify_decision`: Ensures model decisions are consistent.
- `calculate_rule_confidence`: Computes rule-based confidence from criteria.
- `get_next_prompt_version`: Auto-increments semantic prompt version.
- `rebuild_few_shot_pool`: Builds balanced few-shot example set from log.
- `agentic_conflict_resolver`: Resolves discrepancies between model and rule evaluations.

## 4. Data Preparation
- Load PDF research papers from `data/raw`.
- Truncate text to fit LLM context window.

## 5. Few-Shot Prompt Building
- Retrieve high-confidence examples from log.
- Prepend examples to base relevance prompt.

## 6. Main Evaluation Loop
- Iterate through PDFs.
- Evaluate relevance using LLM.
- Apply rule-based scoring and hybrid confidence calculation.
- Flag documents for review when model vs. rule confidence diverges.

## 7. Logging and Versioning
- Append results to `prompt_evaluation_log.json`.
- Add `prompt_version`, `decision_source`, and `agentic_resolution` where applicable.

## 8. Visualization
- Display confidence distribution, relevance drift, and flagged discrepancy trends.

## 9. Enhancements (Appended)
- Additional functions and logging improvements appended at the end for optional use.

---


> **Note:** The following section explains core functionality and workflow.

<font size=10>**End-Term / Final Project**</font>

<font size=6>**AI for Research Proposal Automation**</font>

### **Business Problem - Create an AI system which will help you writing the research proposal aligning with the NOFO Document**
   



Meet Dr. Ian McCulloh, a seasoned research advisor and a leading voice in interdisciplinary science. Over the years, his lab has explored everything from AI for counterterrorism to social network analysis in neuroscience. His publication portfolio is vast, rich, and... chaotic.

When the National Institute of Mental Health released a new NOFO (Notice of Funding Opportunity) seeking innovative digital health solutions for mental health equity, Dr. Ian saw an opportunity. But there was a problem: despite his extensive work, none of his existing research was directly aligned with digital mental health interventions. And with NIH deadlines looming, manually identifying relevant angles and generating a competitive proposal would be a massive lift.

Dr. Ian wished for a smart assistant—one that could digest his past work, interpret the NOFO’s intent, spark new research directions, and even help draft proposal sections.

**The Challenge:**

Organizations and researchers often maintain large archives of publications and prior work. When responding to competitive grants—especially highly specific ones like NIH NOFOs—it becomes extremely difficult and time-consuming to:

1. Align past work with a new funding call.
2. Extract relevant expertise from unrelated projects.
3. Ideate novel, fundable research proposals tailored to complex criteria.
4. Generate high-quality text for grant submission that satisfies technical and scientific review criteria.

The manual effort to sift through dense research documents, match them to nuanced funding criteria, and write compelling, compliant proposals is labor-intensive, inconsistent, and prone to missed opportunities.

> **Note:** The following section explains core functionality and workflow.

### **The Case Study Approach**

**Objective**
1. Develop a generative AI-powered system using LLMs to automate and optimize the creation of NIH research proposals.
2. The tool will identify relevant prior research, generate aligned project ideas, and draft high-quality proposal content tailored to specific NOFO requirements.

**Given workflow:**

```mermaid
flowchart TD
    A[Read NOFO Document] --> B[Analyze Research Papers]
    B --> C[Filter Papers by Topic]
    C --> D[Generate Research Ideas]
    D --> E[Upload ideas to LLM]
    E --> F[Generate Proposal]
    F --> G[LLM Evaluation]
    G --> H{Meets criteria?}
    H -- NO --> F
    H -- YES --> I[Human Review]
    I --> J{Approved?}
    J -- NO --> F
    J -- YES --> K[Final Proposal]
```

**Enhanced workflow A based on conversations with ChatGPT and Claude:**

```mermaid
flowchart TD
    A[Read NOFO Document] --> B[Extract Key Requirements & Evaluation Criteria]
    B --> C[Multi-Stage Paper Processing<br>(PyPDF → OCR)]
    C --> C1[Table Extraction]
    C --> C2[Figure Extraction (OCR + Captioning)]
    C1 --> D
    C2 --> D
    D[Hybrid Indexing & Filtering<br>(BM25 + Embeddings + Metadata)]
    D --> E[Agentic Research Synthesis<br>(Research Analyst + Proposal Writer + Compliance Checker)]
    E --> F[Generate Proposal Blueprint + Draft]
    F --> G[Multi-Criteria Evaluation<br/>(RAG + LLM-as-Judge + Guardrails)]
    G --> H{Score ≥ Threshold?}
    H -- NO --> I[Targeted Refinement Loop<br/>(Weakness-Specific Prompts)]
    I --> F
    H -- YES --> J[Caching + Checkpointing of Results]
    J --> K[Human Review Interface]
    K --> L{Approved?}
    L -- NO --> M[Capture Feedback & Return to Refinement]
    M --> F
    L -- YES --> N[Final Proposal + Deliverables]
    
    subgraph "Agentic Components"
        E1[Research Analyst Agent]
        E2[Proposal Writer Agent]
        E3[Compliance Checker Agent]
        E1 <--> E2
        E2 <--> E3
        E3 <--> E1
    end
```

**Enhanced workflow B based iterations:**
```mermaid
flowchart TD
    %% ================================
    %% DATA INGESTION & PREP
    %% ================================
    A[NOFO PDF]:::doc --> A1[Topic Extraction (LLM 3–8 words)]
    B[112 Research Papers (PDFs)]:::doc --> B1[Pre-process PDFs<br/>(clean, dedupe, normalize)]
    B1 --> B2[Chunk Papers] --> B3[Embed Chunks → Vector DB (FAISS)]
    B2 --> B4[Keyword Index (BM25)]
    
    %% ================================
    %% FEW-SHOT & CONFIG
    %% ================================
    C[Few-shot Examples<br/>(from prior logs)]:::meta --> C1[Build Prompt Prefix]
    A1 --> C1
    C1 --> P0[Prompt(s) for Relevance & Evaluation]:::meta
    
    %% ================================
    %% PHASE 1: HYBRID RETRIEVAL
    %% ================================
    subgraph PH1[Phase 1 — Hybrid Retrieval & Paper Screening]
      direction TB
      B3 --> D1[Hybrid Search: BM25 + Cosine<br/>(overfetch top chunks)]
      B4 --> D1
      D1 --> D2[Group by Paper → Best Hybrid Score]
      D2 --> D3[Chunk Count Normalization]
      D3 --> D4[Rank Papers (normalized)]
      D4 --> D5[Select Top N for Phase 2]
    end
    
    %% ================================
    %% PHASE 2: LLM RELEVANCE
    %% ================================
    subgraph PH2[Phase 2 — LLM Relevance (Nuanced)]
      direction TB
      D5 --> E1[Fetch all chunks for Top N]
      E1 --> E2[Cap top-k chunks per paper (e.g., 10)]
      E2 --> E3[LLM Relevance Scoring per Paper<br/>(confidence 0–100)]
      E3 --> E4[Trim false positives; Final Ranked Set]
    end
    
    %% ================================
    %% IDEATION → BLUEPRINT → DRAFT
    %% ================================
    subgraph GEN[Proposal Generation]
      direction TB
      E4 --> G1[Proposal Ideation (5 ideas)<br/>NOFO + Priorities + Matched Papers]
      G1 --> G2[Select 1 Idea + Fetch Details]
      G2 --> G3[Proposal Blueprint<br/>(sections, objectives, methods)]
      G3 --> G4[Draft Proposal from Template]
    end
    
    %% ================================
    %% EVALUATION & REVIEW
    %% ================================
    subgraph EVAL[Evaluation & Review]
      direction TB
      G4 --> H1[Evaluate vs NOFO Criteria<br/>(LLM-as-judge JSON)]
      H1 --> H2[Human Review & Refinement]
      H2 --> H3[Summary & Recommendations]
    end
    
    %% ================================
    %% LOGGING / VERSIONING / VIS
    %% ================================
    subgraph LOG[Logging, Versioning, Visualization]
      direction TB
      D1 --> L1[Log retrieval & scores]
      E3 --> L1
      G4 --> L2[Versioned Outputs]
      H1 --> L3[Visualization (optional)]
    end

    classDef doc fill:#eef,stroke:#88a,stroke-width:1.5px;
    classDef meta fill:#efe,stroke:#7a7,stroke-width:1.5px;
```

> **Note:** The following section explains core functionality and workflow.

## **Setup - [2 Marks]**
---
<font color=Red>**Note:**</font> *1 marks is awarded for the Embedding Model configuration and 1 mark for the LLM Configuration.*

## Configuration and Setup

In [1]:
# Install required packages with progress and output displayed

# Encountered multiple conflicts between packages and within codespace core packages. Ended up installing all packages via the .venv

# DISPLAY FINAL REQUIREMENTS.TXT for final file

In [2]:
# Import required libraries for core functionality
import os
import warnings
api_key = os.getenv("OPENAI_API_KEY")
base_url = os.getenv("OPENAI_BASE_URL")
warnings.filterwarnings('ignore')

In [3]:
# Define the LLM Model - Use `gpt-4o-mini` Model
from langchain_openai import ChatOpenAI
import os
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_BASE_URL")  # optional; only if using non-default
)

In [4]:
# ------------------------------------------------------------
# FEW-SHOT AND LOGGING CONFIG
# ------------------------------------------------------------
# These constants control how many examples are retrieved and the minimum confidence threshold.
# Modify here if you want more or fewer few-shot examples or to change the confidence cutoff.
FEW_SHOT_MAX_EXAMPLES = 4         # Total examples (balanced between relevant/irrelevant if possible)
# Minimum confidence threshold for including examples in few-shot prompting
MIN_CONFIDENCE_FOR_FEWSHOT = 70   # Minimum hybrid confidence (%) to consider for few-shot retrieval

# JSON log path
# Path to the cleaned JSON log file where prompt evaluation iterations are stored
LOG_PATH = "prompt_evaluation_log_cleaned.json"

# PDF Pre-Processing

In [5]:
# PDF Cleaning Step: Remove non-visual annotations (comments, links, form fields)
# Keeps images, diagrams, and visible callouts intact

# Import required libraries for core functionality
import os
import fitz  # PyMuPDF
import json  # <-- NEW: Required to write annotation logs

# Create a global dictionary to store removed annotations
annotation_log = {}  # <-- NEW: Accumulates logs of all removed annotations

# Initial standardization step to remove annotations for parsing
def clean_pdf_annotations(input_path, output_path):
    """
    Strips non-visual annotations (comments, form fields, links) from a PDF
    while preserving visible images and diagrams.
    Also logs removed annotations to a global dictionary.
    """
    doc = fitz.open(input_path)
    removed_annots = []  # <-- NEW: Stores removed annotations for this PDF

    for page in doc:
        # Iterate over all annotations (not images)
        annot = page.first_annot
        while annot:
            next_annot = annot.next  # store reference to next annotation
            
            # Try to extract meaningful annotation content
            try:
                annot_info = annot.info  # Dictionary of annotation metadata
                content = annot_info.get("content", "").strip()
                subtype = annot_info.get("subtype", "").strip()
                if content:
                    removed_annots.append(f"{subtype}: {content}")
                else:
                    removed_annots.append(f"{subtype}: [no content]")
            except Exception as e:
                # Fallback if annotation metadata is inaccessible
                removed_annots.append("Unknown annotation (could not extract content)")

            # Remove annotation object (highlights, comments, links)
            page.delete_annot(annot)
            annot = next_annot

    # Save cleaned PDF
    doc.save(output_path, garbage=4, deflate=True)
    doc.close()

    # Add entry to annotation log using the input filename as key
    annotation_log[os.path.basename(input_path)] = removed_annots  # <-- NEW: Log entries keyed by file

# Clean NOFO file
input_pdf = "../data/NOFO.pdf"
cleaned_pdf = "../data/NOFO_cleaned.pdf"
clean_pdf_annotations(input_pdf, cleaned_pdf)
print(f"Cleaned PDF saved to: {cleaned_pdf}")

# Get de-annotated NOFO doc content using PyPDFLoader for evaluation step
from langchain.document_loaders import PyPDFLoader
pdf_file = "../data/NOFO_cleaned.pdf"
pdf_loader = PyPDFLoader(pdf_file)
NOFO_pdf = pdf_loader.load()

# Prepare output folder for de-annotated research papers
os.makedirs("data/raw", exist_ok=True)

# Set variables for de-annotating the research paper PDF collection 
source_dir = "../content"
output_dir = "data/raw"

# Loop through content folder, de-annotate each PDF, and save to a 'clean' output directory
for file_name in os.listdir(source_dir):
    if file_name.lower().endswith(".pdf"):
        input_pdf = os.path.join(source_dir, file_name)
        cleaned_pdf = os.path.join(output_dir, file_name.replace(".pdf", "_cleaned.pdf"))
        print(f"Cleaning annotations for: {file_name}")
        clean_pdf_annotations(input_pdf, cleaned_pdf)
        print(f"Cleaned PDF saved to: {cleaned_pdf}")

# Write annotation log to disk after all PDFs are processed
log_path = "annotation_log.json"  # <-- NEW: File to store the annotation log
with open(log_path, "w", encoding="utf-8") as log_file:
    json.dump(annotation_log, log_file, indent=2, ensure_ascii=False)  # <-- NEW: Write log to file
print(f"Annotation removal log written to: {log_path}")  # <-- NEW: Confirm log creation

print("All research PDFs cleaned and saved in data/raw/")


Cleaned PDF saved to: ../data/NOFO_cleaned.pdf
Cleaning annotations for: cycon-final-draft.pdf
Cleaned PDF saved to: data/raw/cycon-final-draft_cleaned.pdf
Cleaning annotations for: Chat GPT Bias final w copyright.pdf
Cleaned PDF saved to: data/raw/Chat GPT Bias final w copyright_cleaned.pdf
Cleaning annotations for: Genetic_Algorithms_for_Prompt_Optimization.pdf
Cleaned PDF saved to: data/raw/Genetic_Algorithms_for_Prompt_Optimization_cleaned.pdf
Cleaning annotations for: DIVERSE_LLM_Dataset___IEEE_Big_Data.pdf
Cleaned PDF saved to: data/raw/DIVERSE_LLM_Dataset___IEEE_Big_Data_cleaned.pdf
Cleaning annotations for: Hashtag_Revival.pdf
Cleaned PDF saved to: data/raw/Hashtag_Revival_cleaned.pdf
Cleaning annotations for: FBI_Recruit_Hire_Final.pdf
Cleaned PDF saved to: data/raw/FBI_Recruit_Hire_Final_cleaned.pdf
Cleaning annotations for: Benson_MA491_NLP.pdf
Cleaned PDF saved to: data/raw/Benson_MA491_NLP_cleaned.pdf
Cleaning annotations for: Extreme Cohesion Darknet 20190815.pdf
Cleaned 

# Extract, Clean, and Chunk Text

In [6]:
# ---------------------------------------------------------------
# Function to split cleaned text into 3000-token chunks with overlap for RAG
# ---------------------------------------------------------------
# This function breaks long text into overlapping token-based chunks for use in
# Retrieval-Augmented Generation (RAG) pipelines. Overlapping chunks help
# preserve context continuity across boundaries, improving answer quality.

import tiktoken  # OpenAI tokenizer library for counting and managing tokens

# ---------------------------------------------------------------
# Load tokenizer for the target model
# ---------------------------------------------------------------
# `tiktoken` provides tokenization rules tailored to specific OpenAI models.
# Here we select the encoding used by gpt-4o-mini to ensure our token counting
# aligns with how the model actually interprets input.
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

# ---------------------------------------------------------------
# Define chunking function
# ---------------------------------------------------------------
# Inputs:
# - text: full string to be split into chunks
# - chunk_size: max number of tokens per chunk (default 3000)
# - overlap: number of tokens to repeat from the previous chunk (default 200)
# This overlap preserves some context from earlier chunks in each new chunk.

def chunk_text(text, chunk_size=3000, overlap=200):
    # Convert text into a list of token IDs using the tokenizer
    tokens = encoding.encode(text)

    # Initialize an empty list to store the final chunks
    chunks = []

    # Step through the token list in increments of (chunk_size - overlap)
    # This ensures that each new chunk shares `overlap` tokens with the previous one
    for i in range(0, len(tokens), chunk_size - overlap):
        # Slice the token list to get a window of `chunk_size` tokens
        chunk_tokens = tokens[i:i+chunk_size]

        # Decode the token slice back into text and add it to the list of chunks
        chunks.append(encoding.decode(chunk_tokens))

    # Return the full list of overlapping text chunks
    return chunks


In [8]:
# ---------------------------------------------------------------
# Function to clean extracted text by:
# - removing headers/footers
# - removing noise
# - fixing multi-column layout issues
# ---------------------------------------------------------------
# This function is useful for preprocessing text extracted from PDFs
# (e.g., via OCR or PDF parsers), which often contain artifacts such as
# page numbers, repeating headers/footers, hyphenated line breaks,
# and broken column layouts.

import re  # Regular expressions for pattern matching and substitution

def clean_extracted_text(text):
    """Remove noise (page numbers, headers, footers), merge hyphenated words,
    and flatten potential two-column layouts."""

    # ---------------------------------------------------------------
    # 1. Remove page numbering and common artifacts
    # ---------------------------------------------------------------
    # These patterns often appear in academic papers, reports, and government documents.
    # Removing them improves the quality of downstream embedding and summarization.

    text = re.sub(r'\bPage \d+\b', '', text, flags=re.IGNORECASE)  # Remove 'Page X'
    text = re.sub(r'\d+ of \d+', '', text, flags=re.IGNORECASE)    # Remove 'X of Y' style page counts

    # ---------------------------------------------------------------
    # 2. Identify and remove repeating headers/footers
    # ---------------------------------------------------------------
    # Strategy: count how many times each line occurs.
    # Merge two-column text by pairing lines
    merged_lines = []
    lines = text.split('\n')
    for i in range(0, len(lines), 2):
        if i+1 < len(lines):
            merged_lines.append(lines[i] + " " + lines[i+1])
        else:
            merged_lines.append(lines[i])
    return "\n".join(merged_lines)

In [9]:
# ---------------------------------------------------------------
# Function to extract, clean, and chunk research paper PDFs
# ---------------------------------------------------------------
# This function performs a full preprocessing pipeline for PDF documents,
# including text extraction (via PyPDF), cleaning (removing headers/footers, noise),
# and token-based chunking for use in downstream RAG pipelines.

from pypdf import PdfReader  # PyPDF is used for reading PDF documents and extracting text

# Additional imports for patching
from pdf2image import convert_from_path  # Convert PDF pages to images
import pytesseract  # OCR engine to extract text from images
import os
from datetime import datetime

def process_pdf_multistage(file_path):
    # Initialize an empty string to collect the full text from the PDF
    content = ""

    # Extract filename for metadata
    filename = os.path.basename(file_path)
    author = None
    creation_date = None
    num_pages = None

    try:
        # ---------------------------------------------------------------
        # 1. Attempt to load and parse the PDF
        # ---------------------------------------------------------------
        reader = PdfReader(file_path)  # Create a PdfReader object from the file path

        # Get basic document metadata (if available)
        meta = reader.metadata or {}
        author = meta.get('/Author', None)

        # Iterate through each page in the PDF
        for page in reader.pages:
            # Extract text from the page; if extraction fails or returns None, use an empty string
            page_text = page.extract_text() or ""

            # Append the page's text to the full document content
            content += page_text

    except Exception as e:
        # ---------------------------------------------------------------
        # 2. Handle extraction failures gracefully
        # ---------------------------------------------------------------
        # If any exception is raised during PDF reading or parsing,
        # log the error and allow the function to continue (returning empty chunks).
        print(f"PyPDF extraction failed: {e}")
        print("Falling back to OCR...")

        try:
            # Convert PDF pages to images using pdf2image
            images = convert_from_path(file_path)
            ocr_text_list = []

            for i, img in enumerate(images):
                # Run OCR on each image page using pytesseract
                page_text = pytesseract.image_to_string(img)
                ocr_text_list.append(page_text)

            # Combine all OCR'd page text into one document
            content = "\n".join(ocr_text_list)

        except Exception as ocr_error:
            print(f"OCR fallback also failed: {ocr_error}")
            return []

    # ---------------------------------------------------------------
    # 3. Clean the raw extracted text
    # ---------------------------------------------------------------
    # Use a dedicated cleaning function to:
    # - Remove headers, footers, and page numbers
    # - Merge hyphenated line breaks
    # - Flatten multi-column layouts
    # This improves the quality of embeddings and downstream retrieval.
    cleaned_text = clean_extracted_text(content)

    # ---------------------------------------------------------------
    # 4. Chunk the cleaned text into token-bounded segments
    # ---------------------------------------------------------------
    # Break the cleaned document into overlapping token chunks (e.g., 3000 tokens with 200-token overlap),
    # ensuring context continuity across chunks. This is critical for performance in RAG.
    chunks = chunk_text(cleaned_text)

    def safe_str(obj):
        """
        Convert a potentially non-serializable object (e.g., PyPDF's IndirectObject)
        into a JSON-compatible Python string or None.

        This is especially useful when working with metadata fields extracted from PDFs,
        where objects may be wrapped in non-primitive types (like PyPDF2.generic.IndirectObject),
        which the `json` module cannot serialize directly.

        Returns:
            - `str(obj)` if the object can be stringified without error
            - `None` if string conversion fails
        """
        try:
            # Attempt to cast the object to a string (e.g., IndirectObject → str)
            # This is usually sufficient for basic metadata like author, title, date, etc.
            return str(obj)
        
        except Exception:
            # If casting to string fails (e.g., object is not readable or triggers an exception),
            # return None instead, making the output JSON-safe.
            return None

    # ---------------------------------------------------------------
    # 4.1 Attach file-level metadata to each chunk
    # ---------------------------------------------------------------
    # This metadata can help with filtering, attribution, and retrieval analysis.
    chunks_with_metadata = [
        {
            "text": chunk,
            "metadata": {
                "source_file": safe_str(filename),
                "author": safe_str(author),
                "creation_date": safe_str(creation_date),
                "num_pages": safe_str(num_pages),
                "chunk_index": i
            }
        }
        for i, chunk in enumerate(chunks)
    ]

    # ---------------------------------------------------------------
    # 5. Return the processed chunks
    # ---------------------------------------------------------------
    # The final output is a list of text chunks, ready for embedding, storage, or retrieval.
    return chunks_with_metadata


In [10]:
# Extract, clean, chunk, and store raw chunks for all research paper PDFs

# ---------------------------------------------------------------
# 1. Import necessary libraries
# ---------------------------------------------------------------
import os              # Used for file path manipulation and directory handling
import json            # Used to save the final result as a JSON file
from glob import glob  # Used to match all PDF files in a directory

# ---------------------------------------------------------------
# 2. Set input/output paths
# ---------------------------------------------------------------

# Folder containing raw research paper PDFs (to be processed)
pdf_folder = "data/raw"

# Output file to save cleaned + chunked results
output_json_path = "data/cleaned_chunked_papers.json"

# ---------------------------------------------------------------
# 3. Initialize storage for processed results
# ---------------------------------------------------------------

# This list will store the result for each paper.
# Each element is a dictionary with:
#   - 'id': PDF filename
#   - 'chunks': list of cleaned and tokenized text chunks from that PDF
all_chunks = []

# ---------------------------------------------------------------
# 4. Loop through all PDF files in the target folder
# ---------------------------------------------------------------

# `glob` finds all .pdf files in the specified folder
for pdf_path in glob(os.path.join(pdf_folder, "*.pdf")):
    doc_name = os.path.basename(pdf_path)  # Extract just the filename (used as a unique ID)
    print(f"Processing: {pdf_path}")       # Log the file being processed
    
    try:
        # ---------------------------------------------------------------
        # Attempt to extract, clean, and chunk the PDF content
        # ---------------------------------------------------------------
        # `process_pdf_multistage()` is your custom pipeline that:
        #   1. Extracts text using PyPDF (and optionally OCR if needed)
        #   2. Cleans the text (removes noise, merges hyphenated lines, etc.)
        #   3. Chunks the cleaned text into token-bounded segments
        chunks = process_pdf_multistage(pdf_path)

        # ---------------------------------------------------------------
        # Append the processed result to the `all_chunks` list
        # ---------------------------------------------------------------
        # Each record contains the filename (as ID) and a list of chunks
        all_chunks.append({
            "id": doc_name,
            "chunks": chunks
        })

    except Exception as e:
        # ---------------------------------------------------------------
        # If anything goes wrong during processing, catch the error
        # ---------------------------------------------------------------
        print(f"Error processing {pdf_path}: {e}")  # Log the error for debugging

# ---------------------------------------------------------------
# 5. Save all processed results to a JSON file
# ---------------------------------------------------------------

# Write the list of all processed documents to a single JSON file
# - `indent=2` for human-readable formatting
# - `ensure_ascii=False` allows Unicode characters (like symbols or accents)
with open(output_json_path, "w", encoding="utf-8") as f:
    json.dump(all_chunks, f, indent=2, ensure_ascii=False)

# Final confirmation message
print(f"Saved cleaned + chunked text for {len(all_chunks)} PDFs to {output_json_path}")


Processing: data/raw/AAAI IAA CV_cleaned.pdf
Processing: data/raw/Sim of Decon_cleaned.pdf
Processing: data/raw/BotBuster___AAAI_cleaned.pdf
Processing: data/raw/Political_Networks_Conference_cleaned.pdf
Processing: data/raw/EmergencyResponseAI_cleaned.pdf
Processing: data/raw/FSS-19_paper_137_cleaned.pdf
Processing: data/raw/DIVERSE_LLM_Dataset___IEEE_Big_Data_cleaned.pdf
Processing: data/raw/Clustering_Analysis_of_Website_Usage_on_Twitter_during_the_COVID_19_Pandemic_cleaned.pdf
Processing: data/raw/Cohort_Optimization_Methods_SNAMS_2021_working_draft (4)_cleaned.pdf
Processing: data/raw/Lead-Azide_cleaned.pdf
Processing: data/raw/Knowing the Terrain_cleaned.pdf
Processing: data/raw/Leadership of Data Annotation 20180304v2_cleaned.pdf
Processing: data/raw/A_Complex_Network_Approach_to_Find_Latent_Terorrist_Communities_cleaned.pdf
Processing: data/raw/Designed Networks_cleaned.pdf
Processing: data/raw/Organizational risk using network analysis_cleaned.pdf
Processing: data/raw/LongNetV

## Pre-compute and Store Embeddings for RAG-enabled Tasks

In [22]:
# Check torch version
import os
os.environ["TRANSFORMERS_NO_AUDIO"] = "1"

try:
    import safetensors
    import safetensors.torch
    import torch
    import transformers
    import sqlite3
    print("Sqlite version:", sqlite3.sqlite_version)
    print("Transformers version:", transformers.__version__)
    print("Torch version:", torch.__version__)
    print("Torch path:", torch.__file__)
    print("Transformers version:", transformers.__version__)
    print("Transformers path:", transformers.__file__)
    print("uint64 exists:", hasattr(torch, "uint64"))
    print("safetensors version:", safetensors.__version__)
    print("safetensors.torch available")
except Exception as e:
    print("Transformers import failed:", e)

Sqlite version: 3.34.1
Transformers version: 4.37.2
Torch version: 2.3.0+cpu
Torch path: /workspaces/genai_capstone/.venv/lib/python3.10/site-packages/torch/__init__.py
Transformers version: 4.37.2
Transformers path: /workspaces/genai_capstone/.venv/lib/python3.10/site-packages/transformers/__init__.py
uint64 exists: True
safetensors version: 0.6.1
safetensors.torch available


In [None]:
# Import HuggingFace embedding wrapper from LangChain
from langchain_huggingface import HuggingFaceEmbeddings
import os

# ---------------------------------------------------------------
# 1. Initialize embedding model
# ---------------------------------------------------------------
# This wraps a HuggingFace model (MiniLM) so it can be used with LangChain.
# MiniLM is a lightweight transformer model that produces sentence embeddings.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# ---------------------------------------------------------------
# 2. Set the persistent storage directory for Chroma
# ---------------------------------------------------------------
# This is where Chroma will store the vector index on disk.
# The directory is placed *outside the repo* to avoid accidentally committing large files to Git.
persist_dir = "/workspaces/chroma_storage/chroma_embeddings"

# Create the directory if it doesn’t exist (idempotent)
os.makedirs(persist_dir, exist_ok=True)

# ---------------------------------------------------------------
# 3. Prepare LangChain Document objects
# ---------------------------------------------------------------
# LangChain expects documents in a specific format: each one must be a Document object
# containing `page_content` (the raw text) and optional `metadata`.
# Here, we pair each text chunk with its corresponding paper ID.
from langchain_core.documents import Document

# ---------------------------------------------------------------
# ⬇️ PATCHED SECTION: Load and flatten all_chunks + chunk_to_paper from saved JSON
# ---------------------------------------------------------------
import json

# Path to the preprocessed chunked papers JSON file
chunked_data_path = "data/cleaned_chunked_papers.json"

# Load the saved chunked results from disk
with open(chunked_data_path, "r", encoding="utf-8") as f:
    saved_papers = json.load(f)

# Initialize flat lists to store chunk text and paper ID metadata
all_chunks_flat = []       # Each entry will be a string (chunk text)
chunk_to_paper = []        # Each entry will be the paper ID (filename)

# Flatten all chunks from all papers into a single list
for paper in saved_papers:
    paper_id = paper["id"]
    for chunk in paper["chunks"]:
        all_chunks_flat.append(chunk["text"])        # Extract chunk text
        chunk_to_paper.append(paper_id)              # Track source paper ID

# ---------------------------------------------------------------
# ✅ UPDATED: Build Document objects from flattened chunks
# ---------------------------------------------------------------
docs = [
    Document(page_content=chunk, metadata={"paper_id": paper_id})
    for chunk, paper_id in zip(all_chunks_flat, chunk_to_paper)
]

# ---------------------------------------------------------------
# 4. Create FAISS vectorstore (patched to replace Chroma)
# ---------------------------------------------------------------
# 🔁 PREVIOUSLY USED: Chroma (commented out due to unsupported sqlite3 version)
# vectorstore = Chroma.from_documents(
#     documents=docs,
#     embedding=embedding_model,
#     collection_name="research_chunks",
#     persist_directory=persist_dir,
#     client_settings={"is_persistent": True}
# )

# ✅ PATCHED: Use FAISS instead of Chroma
# ---------------------------------------------------------------
# WHY FAISS?
# - FAISS (Facebook AI Similarity Search) is a fast and widely used vector indexing library
# - FAISS does NOT rely on sqlite3 or any external DB, making it ideal for CPU-only or restricted environments
# - It supports fast nearest-neighbor searches in memory
# ---------------------------------------------------------------
# PROS:
# - No sqlite dependency, cross-platform compatible
# - Fast, well-tested, and LangChain compatible
# - Simple and lightweight
# CONS:
# - Does not include built-in persistent metadata store like Chroma
# - If you want disk persistence, you must manually serialize the FAISS index
# ---------------------------------------------------------------
from langchain_community.vectorstores import FAISS  # ✅ Import FAISS from LangChain

# Build FAISS vectorstore from documents and embedding model (lives in memory initially)
vectorstore = FAISS.from_documents(docs, embedding_model)

# ---------------------------------------------------------------
# ✅ NEW: Persist the FAISS index to disk for later reloading
# ---------------------------------------------------------------
# Unlike Chroma (which persists automatically when `persist_directory` is set),
# FAISS requires an explicit save call. This ensures that a later cell can
# successfully call `FAISS.load_local(...)` without recomputing embeddings.
#
# What gets written:
#   - <save_dir>/index.faiss : the raw FAISS vector index (binary)
#   - <save_dir>/index.pkl   : LangChain docstore + metadata (pickled)
#
# Notes:
# - The directory is created if it doesn't exist.
# - Use the SAME `save_dir` when calling `FAISS.load_local(...)` later.
# - If you change embedding models between save/load, retrieval quality will degrade.
faiss_index_dir = "data/faiss_index"

# Ensure the save directory exists (idempotent)
os.makedirs(faiss_index_dir, exist_ok=True)

# Save the FAISS index + docstore metadata to disk
vectorstore.save_local(faiss_index_dir)

print(f"✅ FAISS index saved to: {faiss_index_dir} (files: index.faiss, index.pkl)")

# ---------------------------------------------------------------
# ✅ NEW STEP: Save flattened document list for later rehydration
# ---------------------------------------------------------------
# Even though FAISS index can be persisted, we save the LangChain documents list
# so we can rebuild the FAISS vectorstore later if needed.
import pickle

vectorstore_doc_path = "data/vectorstore_docs.pkl"

# Serialize the docs list to disk
with open(vectorstore_doc_path, "wb") as f:
    pickle.dump(docs, f)

print(f"✅ Document list saved to {vectorstore_doc_path} for future vectorstore reloading.")

# ---------------------------------------------------------------
# 5. Confirm FAISS index object was created
# ---------------------------------------------------------------
# Unlike Chroma, FAISS doesn't persist by default. We can inspect the index manually if needed.
# You can optionally save the FAISS index to disk using:
# vectorstore.save_local("data/faiss_index")
print("✅ FAISS vectorstore created in memory (Chroma disabled due to sqlite version).")

# (Optional) Tiny sanity check: list the saved files so the next cell can load them confidently.
try:
    print("📁 Saved FAISS files:", os.listdir(faiss_index_dir))
except Exception as e:
    print(f"⚠️ Could not list {faiss_index_dir}: {e}")



✅ FAISS index saved to: data/faiss_index (files: index.faiss, index.pkl)
✅ Document list saved to data/vectorstore_docs.pkl for future vectorstore reloading.
✅ FAISS vectorstore created in memory (Chroma disabled due to sqlite version).
📁 Saved FAISS files: ['index.faiss', 'index.pkl']


In [34]:
# Function to reuse vectorstore later if needed

def rehydrate_faiss_vectorstore(
    save_dir: str = "data/faiss_index",
    docs_pickle_path: str = "data/vectorstore_docs.pkl",
    model_name: str = "all-MiniLM-L6-v2",
    embedding_kwargs: dict | None = None,
    allow_dangerous_deserialization: bool = True,
):
    """
    Rehydrate a FAISS vectorstore that was previously saved with `vectorstore.save_local(save_dir)`.

    What this does:
    1) Rebuilds the SAME embedding function you used before (defaults to MiniLM).
    2) Loads the FAISS index from `save_dir` using LangChain's `FAISS.load_local`.
    3) Optionally reloads your original `docs` list from a pickle file (if present).
       - This is useful for provenance, debugging, exporting, or rebuilding indices again later.

    Parameters
    ----------
    save_dir : str
        Directory that contains the saved FAISS index files. Use the exact path you passed to `save_local`.
    docs_pickle_path : str
        Path to the pickle file where you saved the `docs` (Document list). If missing, we’ll warn and return None.
    model_name : str
        Name of the HuggingFace sentence embedding model to rebuild.
        Must match what you used to CREATE the index, otherwise search quality will degrade.
    embedding_kwargs : dict | None
        Extra kwargs for HuggingFaceEmbeddings (e.g., {"model_kwargs": {"device": "cpu"}}).
        Keep this consistent with the original build for reproducibility.
    allow_dangerous_deserialization : bool
        LangChain’s FAISS loader uses pickle internally for metadata. This flag must be True to load.
        Only set to True for trusted artifacts you created yourself.

    Returns
    -------
    vectorstore : langchain_community.vectorstores.faiss.FAISS
        The loaded FAISS vectorstore, ready for `.similarity_search(...)`, etc.
    docs : list[langchain_core.documents.Document] | None
        The reloaded `docs` list if `docs_pickle_path` exists, else None.

    Usage
    -----
    # 1) Save (during build time):
    # vectorstore.save_local("data/faiss_index")

    # 2) Rehydrate later:
    # vs, docs = rehydrate_faiss_vectorstore(
    #     save_dir="data/faiss_index",
    #     docs_pickle_path="data/vectorstore_docs.pkl",
    #     model_name="all-MiniLM-L6-v2"
    # )
    # results = vs.similarity_search("your query", k=5)

    Notes
    -----
    - If you change embedding models between save/load, vector dimensions won’t match queries,
      and retrieval quality will tank (even if it doesn’t crash). Keep the SAME model+settings.
    - If you moved machines or containers, ensure the same or compatible versions of:
        * langchain, langchain-community, faiss, sentence-transformers, transformers
      Exact matches are ideal for reproducibility.
    """

    # -----------------------------
    # 0) Imports (kept inside for portability)
    # -----------------------------
    import os
    import pickle
    from langchain_huggingface import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    # -----------------------------
    # 1) Rebuild the embedding function
    # -----------------------------
    # IMPORTANT: Use the same model + kwargs as when you created the index.
    embedding_kwargs = embedding_kwargs or {}
    embeddings = HuggingFaceEmbeddings(model_name=model_name, **embedding_kwargs)

    # -----------------------------
    # 2) Load the FAISS index
    # -----------------------------
    if not os.path.isdir(save_dir):
        raise FileNotFoundError(
            f"FAISS index directory not found: {save_dir}\n"
            "Make sure you previously called `vectorstore.save_local(save_dir)` with this path."
        )

    vectorstore = FAISS.load_local(
        save_dir,
        embeddings=embeddings,
        allow_dangerous_deserialization=allow_dangerous_deserialization,
    )

    # -----------------------------
    # 3) Try to reload original Document list (optional but handy)
    # -----------------------------
    docs = None
    if os.path.exists(docs_pickle_path):
        try:
            with open(docs_pickle_path, "rb") as f:
                docs = pickle.load(f)
        except Exception as e:
            print(f"⚠️  Warning: Failed to load docs from {docs_pickle_path}: {e}")
    else:
        print(f"ℹ️  No pickled docs found at {docs_pickle_path}. Returning vectorstore only.")

    # -----------------------------
    # 4) Smoke test (optional): ensure the index is queryable
    # -----------------------------
    try:
        _ = vectorstore.similarity_search("health check", k=1)
    except Exception as e:
        print("⚠️  Vectorstore loaded but test query failed. Check model compatibility and versions.")
        print(f"Details: {e}")

    print("✅ FAISS vectorstore rehydrated successfully.")
    return vectorstore, docs


In [35]:
# ---------------------------------------------------------------
# Generic query to test retrieval from FAISS
# ---------------------------------------------------------------
# This block performs a semantic search query over text chunks that were already
# embedded and indexed into a FAISS vector store. It shows how to *reload* a
# previously saved FAISS index (no re-embedding required) and retrieve the
# top-k most relevant chunks for a natural-language query.
#
# Why FAISS here (quick recap):
# - We switched from Chroma to FAISS because the container’s SQLite (3.34.1)
#   is below Chroma’s minimum requirement (3.35.0). FAISS avoids SQLite entirely.
# - Functionally, you still get fast nearest-neighbor search over your embeddings.
# - Persistence works via FAISS’s own `save_local`/`load_local` helpers in LangChain.
# ---------------------------------------------------------------

# ---------------------------------------------------------------
# Reload vectorstore (no need to re-embed)
# ---------------------------------------------------------------
# We reload the FAISS vectorstore from disk using the same directory path that
# we used during initial indexing when calling `vectorstore.save_local(<dir>)`.
# Re-creating the SAME embedding function is important to ensure query vectors
# live in the same space as the stored vectors.
#
# NOTE:
# - `faiss_index_dir` should match the directory you passed to `save_local`.
# - `allow_dangerous_deserialization=True` is required by LangChain to unpickle
#   metadata. Only load indices you trust (i.e., ones you saved yourself).
# ---------------------------------------------------------------
from langchain_community.vectorstores import FAISS

faiss_index_dir = "data/faiss_index"  # <-- must match your earlier `vectorstore.save_local(...)`

# (Re)create the SAME embedding function used during indexing.
# If you already have `embedding_model` in scope (MiniLM), we reuse it.
# If not, uncomment the two lines below to rebuild it.
# from langchain_huggingface import HuggingFaceEmbeddings
# embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Load the FAISS index from disk. This does NOT recompute embeddings.
vectorstore = FAISS.load_local(
    faiss_index_dir,
    embeddings=embedding_model,
    allow_dangerous_deserialization=True,  # required due to pickle-based metadata
)

# ---------------------------------------------------------------
# Set up the query topic (can be static or dynamic)
# ---------------------------------------------------------------
# This string will be embedded with the SAME model as your documents (MiniLM),
# and FAISS will retrieve the most similar vectors by inner product / L2 (as used by LangChain).
# In production, this could come from a NOFO, user input, or an upstream pipeline.
# ---------------------------------------------------------------
priority_topic = "mental health"  # Example query topic, e.g., extracted from a NOFO
query = priority_topic  # Alias for clarity — makes it easy to swap in a different source later

# ---------------------------------------------------------------
# Run similarity search
# ---------------------------------------------------------------
# FAISS performs a nearest-neighbor search in vector space using the underlying
# index built earlier. `k=5` returns the 5 most similar chunks.
# Each result is a LangChain `Document` with `.page_content` and `.metadata`.
# ---------------------------------------------------------------
results = vectorstore.similarity_search(query, k=5)

# ---------------------------------------------------------------
# Display results
# ---------------------------------------------------------------
# For each result, we print:
# - The `paper_id` from metadata (so you know which PDF it came from)
# - A 250-character snippet of the text for quick inspection
# This is handy for validating that the retrieval matches your intent.
# ---------------------------------------------------------------
for r in results:
    # Be defensive: metadata keys can vary. Use `.get()` to avoid KeyErrors.
    source_id = r.metadata.get("paper_id", r.metadata.get("source_file", "unknown_source"))
    preview = (r.page_content or "")[:250].replace("\n", " ")
    print(f"{source_id}:\n{preview}...\n")


Social Media Mental Health Final_cleaned.pdf:
ASONAM ’23, November 6-9, 2023, Kusadasi, Turkey © 2023 Association for Computing Machinery.  ACM ISBN 979-8-4007-0409-3/23/11. . . $15.00 https://doi.org/10.1145/3625007.3627490  Fragile Minds: Exploring the Link Between Social  Media and Young Adul...

Food Addiction 20231222 v3_cleaned.pdf:
	but	also	influence	the	quality	of	healthcare	provided	to	obese	patients.		 7		 Discussion Obesity	 adversely	 affects	 health	 involving	 multiple	 systems	 at	 multiple	 levels:		endocrine,	 environmental,	 gastrointestinal,	 genomic,	 immunologic,...

Food Addiction 20231222 v3_cleaned.pdf:
	 1		 Viewpoint 	Ian	McCulloh,	Ph.D.	ian@brainrisefoundation.org,	imccull4@jhu.edu		Michael	Oler,	M.D.	mike@brainrisefoundation.org		Anna	McCulloh.	amccul16@jhu.edu			The	Brain	Rise	Foundation	14749	Walcott	Ave	Orlando,	FL	32827		Johns	Hopkins	Bloomb...

NeuroCogInfluence_cleaned.pdf:
 persuasiveness of public narratives.   Journal of personality and social psyc

> **Note:** The following section explains core functionality and workflow.

## **Step 1: Topic Extraction - [3 Marks]**

> **Read the NOFO doc and identify the topic for which the funding is to be given.**
---
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt and 1 mark for the successful completion of this section, including debugging or modifying the code if necessary.*
   

**TASK:** Write an LLM prompt to extract the Topic for what the funding is been provided, from the NOFO document, Ask the LLM to respond back with the topic name only and nothing else.

In [30]:
# Topic extraction prompt
topic_extraction_prompt = f"""
You are a research grant specialist with expertise in analyzing NIH funding announcements and extracting key research priorities.

Your task: Analyze this NOFO document from the National Institute of Mental Health (NIMH) to identify the PRIMARY funding topic.

The document may describe multiple research areas, objectives, and priorities. Extract the single overarching topic that encompasses the main focus of this funding opportunity.

Return ONLY the primary topic in 3-8 words. No explanations, descriptions, or additional text.

Document:
{NOFO_pdf[0].page_content}
"""

In [32]:
# Finding the topic for which the Funding is been given
topic_extraction = llm.invoke(topic_extraction_prompt)
topic = topic_extraction.content
print(topic)

Digital mental health interventions


## Few-shot Prompt Setup for Assessing Relevance

> **Note:** The following section explains core functionality and workflow.

## **Step 2: Research Paper Relevance Assessment - [3 Marks]**
> **Analyze all the Research Papers and filter out the research papers based on the topic of NOFO**
---
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt and 1 mark for the successful completion of this section, including debugging or modifying the code if necessary.*

> **Note:** The following section explains core functionality and workflow.

**TASK:** Write an Prompt which can be used to analyze the relevance of the provided research paper in relation to the topic outlined in the NOFO (Notice of Funding Opportunity) document. Determine whether the research aligns with the goals, objectives, and funding criteria specified in the NOFO. Additionally, assess whether the research paper can be used to support or develop a viable project idea that fits within the scope of the funding opportunity.

<br>

**Note:** If the paper does **not** significantly relate to the topic—by domain, method, theory, or application ask the LLM to return: **"PAPER NOT RELATED TO TOPIC"**


<br>

Ask the LLM to respond in the below specified structure:

```
### Output Format:
"summary": "<summary of the paper under 300 words, or return: PAPER NOT RELATED TO TOPIC>"

```

In [52]:
# ------------------------------------------------------------
# RELEVANCE PROMPT (revised to match instructor's required output)
# ------------------------------------------------------------
relevance_prompt_a = f"""
You are a research grant specialist evaluating whether a research paper is relevant to the NIH NOFO topic: {topic}.

EVALUATION CRITERIA:
Determine if the paper relates to digital mental health interventions through ANY of:
- Direct focus on digital/technology-based mental health solutions
- Mental health conditions, treatments, or outcomes (even if not digital)
- Digital health technologies that could be applied to mental health
- Intervention design, implementation, or evaluation methodologies
- User engagement, adherence, or behavior change in health contexts
- Relevant populations, settings, or delivery mechanisms

DECISION:
- If the paper has NO reasonable connection to the topic area: 
  return exactly: PAPER NOT RELATED TO TOPIC
- If the paper has ANY potential relevance:
  return a <300 word summary highlighting key findings, methods, or insights

OUTPUT FORMAT (return ONLY valid JSON):
{{
  "summary": "<summary under 300 words OR exactly: PAPER NOT RELATED TO TOPIC>"
  "relevance_confidence": "high|medium|low"
}}

### Paper content:
"""

In [37]:
# ------------------------------------------------------------
# FEW-SHOT RETRIEVAL FUNCTION (updated to emit summary-only examples)
# ------------------------------------------------------------
def get_few_shot_examples(
    json_path,
    max_examples=4,                 # total examples to include
    min_confidence=70               # minimum confidence threshold (used only if log has old shape)
):
    """
    Retrieve few-shot examples for prompt building, normalized to the REQUIRED output:
      { "summary": "<summary or PAPER NOT RELATED TO TOPIC>" }

    We adapt older log formats if needed by extracting/deriving a summary.
    """
    import os, json, random

    def _coerce_to_summary_only(example_obj):
        """
        Accepts prior example objects (possibly with older fields) and returns
        a JSON string containing ONLY the 'summary' field per spec.
        """
        # If already summary-only JSON string, pass through.
        if isinstance(example_obj, str):
            # Try to detect if it's already JSON with "summary"; if not, wrap it.
            try:
                data = json.loads(example_obj)
                if isinstance(data, dict) and "summary" in data:
                    return json.dumps({"summary": data["summary"]}, ensure_ascii=False)
            except Exception:
                # treat as raw summary text
                pass
            return json.dumps({"summary": example_obj}, ensure_ascii=False)

        # If dict-like, try to derive summary:
        if isinstance(example_obj, dict):
            # Preferred: already has 'summary'
            if "summary" in example_obj:
                return json.dumps({"summary": example_obj["summary"]}, ensure_ascii=False)

            # If older shape (criteria + decision), infer:
            decision = example_obj.get("decision", "")
            if isinstance(decision, str) and "NOT RELATED" in decision.upper():
                return json.dumps({"summary": "PAPER NOT RELATED TO TOPIC"}, ensure_ascii=False)

            # Fallback: try to shrink any 'llm_reasoning' field to <=300 words
            reasoning = example_obj.get("llm_reasoning") or example_obj.get("reasoning")
            if isinstance(reasoning, str) and reasoning.strip():
                # naive trim; in practice we had model generate the summary already
                return json.dumps({"summary": reasoning[:2000]}, ensure_ascii=False)

        # Last resort: unrelated
        return json.dumps({"summary": "PAPER NOT RELATED TO TOPIC"}, ensure_ascii=False)

    examples = []

    # 1) Try to load from log
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            try:
                data = json.load(f)
            except json.JSONDecodeError:
                data = []

        relevant, irrelevant = [], []

        for iteration in data:
            for doc in iteration.get("relevant_documents", []):
                # Old logs may have confidences and long reasoning; coerce
                hybrid_conf = max(doc.get("model_confidence", 0) or 0, doc.get("rule_confidence", 0) or 0)
                if hybrid_conf >= min_confidence:
                    title = doc.get("title", "Untitled")
                    reasoning_json = _coerce_to_summary_only({
                        "summary": doc.get("summary") or doc.get("llm_reasoning") or ""
                    })
                    relevant.append((title, reasoning_json))

            for doc in iteration.get("irrelevant_documents", []):
                # 'doc' may be a title or dict; normalize
                title = doc if isinstance(doc, str) else doc.get("title", "Untitled")
                reasoning_json = json.dumps({"summary": "PAPER NOT RELATED TO TOPIC"})
                irrelevant.append((title, reasoning_json))

        half = max_examples // 2
        random.shuffle(relevant)
        random.shuffle(irrelevant)
        examples = relevant[:half] + irrelevant[:half]

    # 2) Fallback seed examples (already in summary-only format)
    if not examples:
        print("No high-confidence examples found. Using fallback seed examples.")
        examples = [
            (
                "Digital CBT for Adolescents",
                json.dumps({
                    "summary": "A randomized study of a mobile CBT app for adolescents shows clinically meaningful reductions in anxiety/depression versus control and provides implementation insights for school-based deployment."
                }, ensure_ascii=False)
            ),
            (
                "Oncology Drug Delivery Review",
                json.dumps({
                    "summary": "PAPER NOT RELATED TO TOPIC"
                })
            )
        ][:max_examples]

    return examples


In [38]:
# ------------------------------------------------------------
# FUNCTION: build_prompt_with_examples (unchanged behavior, new output format)
# ------------------------------------------------------------
def build_prompt_with_examples(topic, base_prompt, examples):
    """
    Build a few-shot prompt for the relevance task.
    Few-shot examples are *already* normalized to summary-only JSON outputs.
    """
    examples_str = "\n\n".join(
        [f"Example ({title}):\n{reasoning}" for title, reasoning in examples]
    )

    prompt = f"""
You are a research grant specialist evaluating research papers for relevance to NIH NOFO objectives: {topic}.

Below are examples of prior evaluations for context (note: each returns ONLY a JSON object with a 'summary' field):
{examples_str}

Now evaluate the following paper using the SAME OUTPUT FORMAT:

{base_prompt}
"""
    return prompt


In [44]:
# =========================
# BOOTSTRAP / SAFETY NET
# Place this ABOVE PHASE 1 and run it once per session
# =========================

# Tokenizer + MAX_TOKENS (define if missing)
try:
    MAX_TOKENS  # noqa: F821
except NameError:
    MAX_TOKENS = 300_000  # safety ceiling for prompt+context

try:
    encoding  # noqa: F821
except NameError:
    import tiktoken
    # Use your target model’s encoding (you used gpt-4o-mini elsewhere)
    encoding = tiktoken.encoding_for_model("gpt-4o-mini")

In [51]:
# =========================
# LIVE BUDGET GUARDRAIL
# =========================

from dataclasses import dataclass

# --- Pricing table per 1K tokens (USD). Adjust if you switch models. ---
PRICING = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.00060},
    "gpt-4o":      {"input": 0.00250, "output": 0.01000},
    # Add others if needed...
}

# --- Your daily budget from the course: ---
DAILY_BUDGET_USD = 4.00
STOP_MARGIN = 0.05   # stop if predicted total would exceed budget - margin

# Global trackers (idempotent definitions)
try: total_input_tokens
except NameError: total_input_tokens = 0
try: total_output_tokens
except NameError: total_output_tokens = 0
try: total_cost_usd
except NameError: total_cost_usd = 0.0

@dataclass
class CostBreakdown:
    prompt_tokens: int
    completion_tokens: int
    input_cost: float
    output_cost: float
    total_cost: float

def get_model_prices(model_name: str):
    # Default to 4o-mini if unknown (prevents KeyError)
    p = PRICING.get(model_name, PRICING["gpt-4o-mini"])
    return p["input"], p["output"]

def estimate_cost(model_name: str, prompt_tokens: int, completion_tokens: int) -> CostBreakdown:
    in_p_per_1k, out_p_per_1k = get_model_prices(model_name)
    input_cost  = (prompt_tokens    / 1000.0) * in_p_per_1k
    output_cost = (completion_tokens / 1000.0) * out_p_per_1k
    return CostBreakdown(prompt_tokens, completion_tokens, input_cost, output_cost, input_cost + output_cost)

def will_exceed_budget(predicted_increment_usd: float) -> bool:
    return (total_cost_usd + predicted_increment_usd) >= (DAILY_BUDGET_USD - STOP_MARGIN)

def invoke_with_budget_guardrail(llm, prompt: str, model_name: str = "gpt-4o-mini", trace_info: dict | None = None):
    """
    Wraps llm.invoke to:
      1) Daily reset check (UTC)
      2) Pre-check cost estimate (budget guard)
      3) Call model
      4) Use real usage if available; else fallback estimate
      5) Update counters
      6) LOG each call (CSV + JSONL) with context from `trace_info`
    """
    global total_input_tokens, total_output_tokens, total_cost_usd
    _maybe_daily_reset()  # <-- new

    # --- Pre-flight rough estimate ---
    approx_prompt_tokens = len(encoding.encode(prompt)) if "encoding" in globals() else int(len(prompt.split()) * 1.3)
    approx_completion_tokens = 300
    pre_cost = estimate_cost(model_name, approx_prompt_tokens, approx_completion_tokens)
    if will_exceed_budget(pre_cost.total_cost):
        raise RuntimeError(
            f"[BUDGET GUARD] Aborting: this request is predicted to exceed the daily cap. "
            f"(current=${total_cost_usd:.4f}, +${pre_cost.total_cost:.4f} >= ${DAILY_BUDGET_USD:.2f})"
        )

    # --- Call the model ---
    resp = llm.invoke(prompt)

    # --- Pull usage (prefer real numbers) ---
    prompt_tokens = approx_prompt_tokens
    completion_tokens = approx_completion_tokens
    meta = getattr(resp, "response_metadata", {}) or {}
    usage = meta.get("token_usage") or meta.get("usage") or {}
    if usage:
        prompt_tokens = int(usage.get("prompt_tokens", approx_prompt_tokens) or 0)
        completion_tokens = int(usage.get("completion_tokens", approx_completion_tokens) or 0)

    inc = estimate_cost(model_name, prompt_tokens, completion_tokens)

    # Final guard with real usage
    if will_exceed_budget(inc.total_cost):
        raise RuntimeError(
            f"[BUDGET GUARD] Aborting post-call: usage would exceed cap. "
            f"(current=${total_cost_usd:.4f}, +${inc.total_cost:.4f} >= ${DAILY_BUDGET_USD:.2f})"
        )

    # --- Update counters ---
    total_input_tokens += prompt_tokens
    total_output_tokens += completion_tokens
    total_cost_usd += inc.total_cost

    # --- Build log row ---
    ctx = trace_info or {}  # optional metadata about the call (paper_id, phase, etc.)
    log_row = {
        "ts_utc": _now_iso(),
        "model": model_name,
        "paper_id": ctx.get("paper_id"),
        "phase": ctx.get("phase"),                 # e.g., "relevance_eval" / "summary"
        "batch": ctx.get("batch"),
        "iteration": ctx.get("iteration"),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "input_cost_usd": round(inc.input_cost, 6),
        "output_cost_usd": round(inc.output_cost, 6),
        "increment_cost_usd": round(inc.total_cost, 6),
        "run_total_cost_usd": round(total_cost_usd, 6),
        "run_total_prompt_tokens": total_input_tokens,
        "run_total_completion_tokens": total_output_tokens,
        "daily_budget_usd": DAILY_BUDGET_USD,
    }

    # --- Persist logs (CSV + JSONL) ---
    _append_csv(COST_LOG_CSV, log_row)
    _append_jsonl(COST_LOG_JSONL, log_row)

    print(f"[COST] +${inc.total_cost:.4f} (in={prompt_tokens}, out={completion_tokens}) "
          f"→ run=${total_cost_usd:.4f} | logged.")

    return resp


# =========================
# COST LOGGING + DAILY RESET
# =========================
import os, csv, json, datetime
from dataclasses import asdict

# Where to store logs
COST_LOG_JSONL = "logs/cost_usage.ndjson"   # one JSON object per line
COST_LOG_CSV   = "logs/cost_usage.csv"      # tabular
os.makedirs("logs", exist_ok=True)

# Track the current UTC day to auto-reset counters
try:
    _BILLING_DAY_UTC
except NameError:
    _BILLING_DAY_UTC = datetime.datetime.utcnow().date()

def _ensure_csv_header(path: str, fieldnames: list[str]):
    """Create CSV with header if it doesn't exist yet."""
    if not os.path.exists(path):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()

def _append_csv(path: str, row: dict):
    """Append one row to CSV."""
    fieldnames = list(row.keys())
    _ensure_csv_header(path, fieldnames)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writerow(row)

def _append_jsonl(path: str, obj: dict):
    """Append one JSON object per line (NDJSON)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(obj, ensure_ascii=False) + "\n")

def _maybe_daily_reset():
    """
    If UTC day changed, reset the in-memory counters so the guardrail
    enforces the $4/day limit per UTC day.
    """
    global _BILLING_DAY_UTC, total_input_tokens, total_output_tokens, total_cost_usd
    today = datetime.datetime.utcnow().date()
    if today != _BILLING_DAY_UTC:
        _BILLING_DAY_UTC = today
        total_input_tokens = 0
        total_output_tokens = 0
        total_cost_usd = 0.0
        print(f"[BUDGET] New UTC day detected → counters reset ({today.isoformat()}).")

def _now_iso():
    return datetime.datetime.utcnow().replace(microsecond=0).isoformat() + "Z"

In [None]:
# ------------------------------------------------------------
# SUMMARIZATION HELPER (must be defined before first use)
# ------------------------------------------------------------
def summarize_text(paper_text: str, max_words: int = 300) -> str:
    """
    Summarizes a chunk or full paper to ~max_words words.
    Prefers the LLM path if `llm` is available; otherwise, falls back to a
    simple heuristic (first N words after whitespace normalization).

    This function is placed BEFORE it's first use (e.g., in PHASE 2 early-summary),
    to avoid NameError.
    """
    # Try LLM path if present
    if "llm" in globals() and llm is not None:
        summary_prompt = f"""
        Summarize the following research paper into ~{max_words} words,
        focusing on digital mental health interventions, methods, and outcomes.

        Return plain text only (no markdown):

        {paper_text}
        """
        try:
            summary_response = invoke_with_budget_guardrail(
                llm,
                summary_prompt,
                model_name="gpt-4o-mini",
                trace_info={
                    "paper_id": paper_id,
                    "phase": "relevance_eval",
                    "batch": batch_start // BATCH_SIZE + 1,
                    "iteration": progress_cnt,
                },
            )

            text = (summary_response.content or "").strip()
            if text:
                return text
        except Exception as e:
            print(f"[summarize_text] LLM summarization failed -> fallback. Error: {e}")

    # Heuristic fallback: whitespace normalize then take first N words
    # Keeps you unblocked if the LLM client isn't initialized yet.
    compact = " ".join((paper_text or "").split())
    words = compact.split()
    return " ".join(words[:max_words])

In [61]:
# ---------- SAFETY SHIM: ensure logger exists BEFORE we call it ----------
if "log_prompt_iteration" not in globals():
    import json, os
    from datetime import datetime

    def log_prompt_iteration(
        json_path,
        prompt,
        relevant_docs_with_reasoning,
        irrelevant_docs,
    ):
        """
        Minimal logger so the pipeline won't crash if the full logger cell
        hasn't run yet. Writes/updates a JSON list at `json_path`.
        """
        # Load existing log (if any)
        if os.path.exists(json_path):
            try:
                with open(json_path, "r", encoding="utf-8") as f:
                    data = json.load(f)
                if not isinstance(data, list):
                    data = []
            except Exception:
                data = []
        else:
            data = []

        iteration_id = len(data) + 1
        from datetime import datetime
        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")


        entry = {
            "iteration_id": iteration_id,
            "timestamp": timestamp,
            "prompt": prompt,
            "relevant_documents": relevant_docs_with_reasoning,
            "irrelevant_documents": irrelevant_docs,
        }

        data.append(entry)
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)

        print(f"Logged iteration {iteration_id} to {json_path}")


```mermaid
%% --- CHUNK ANALYSIS START ---
flowchart TD
    A[All 112 Research Papers] --> B[Split into Chunks per Paper]
    B --> C[Embed Chunks into Vector DB + Build BM25 Index]
    
    %% PHASE 1
    C --> D[Phase 1 Retrieval: Overfetch Top Chunks (BM25 + Cosine Hybrid)]
    D --> E[Group Retrieved Chunks by Paper]
    E --> F[For Each Paper: Get Best Hybrid Score]
    F --> G[Count Chunks Seen per Paper]
    G --> H[Normalize Score: best_score / (1 + log1p(chunk_count))]
    H --> I[Sort Papers by Normalized Score (Lower = Better)]
    I --> J[Select Top N Papers for Phase 2]
    
    %% PHASE 2
    J --> K[Retrieve All Chunks for Top N Papers]
    K --> L[Cap Chunks per Paper (e.g., Max 10)]
    L --> M[Send Capped Chunks to LLM for Full Relevance Eval]
    M --> N[Final Ranked & Filtered Papers + Summaries]
%% --- END CHUNK ANALYSIS ---
```

In [1]:
# --- Few-shot setup ---

# ==============================================================
# SHARED UTILITIES: LoRA loader + tiny prompt optimizer scaffold
# Place this once near your imports/utilities.
# ==============================================================

import os, json, statistics as stats
from dataclasses import dataclass
from typing import List, Dict, Any, Callable

# ---- LoRA adapter loader (PEFT) ----
def maybe_load_lora(model, adapter_path: str, enabled: bool):
    """
    Load a LoRA adapter into a model if enabled and present.
    Safe no-op if the adapter folder is missing or PEFT is unavailable.
    """
    if not enabled:
        return model
    if not os.path.isdir(adapter_path):
        print(f"[LoRA] Adapter not found at {adapter_path}. Running base model.")
        return model
    try:
        from peft import PeftModel
        model = PeftModel.from_pretrained(model, adapter_path)
        print(f"[LoRA] Loaded adapter from: {adapter_path}")
    except Exception as e:
        print(f"[LoRA] Failed to load adapter: {e}")
    return model

# ---- Minimal multi-agent prompt optimizer (proposer/evaluator/coord) ----
@dataclass
class Variant:
    name: str
    system: str
    user: str
    meta: Dict[str, Any]

@dataclass
class EvalResult:
    name: str
    overall: float
    details: Dict[str, Any]

def propose_variants(base_system: str, base_user: str, n: int) -> List[Variant]:
    """
    Proposer agent: emits n small variations that encourage different emphases.
    You can enrich this later with your prompt-pattern library.
    """
    variants = []
    knobs = [
        "emphasize conceptual equivalence over keyword overlap",
        "penalize keyword-only matches lacking conceptual tie-in",
        "reward explicit linkage to NOFO topic constraints",
        "require citing paper passages (low-temp extraction)",
        "downweight very short abstracts with no methods detail",
        "require a 3-point justification checklist before scoring"
    ]
    for i in range(n):
        sys_append = f"\n- Additional rule: {knobs[i % len(knobs)]}."
        variants.append(
            Variant(
                name=f"v{i+1}",
                system=base_system + sys_append,
                user=base_user,
                meta={"rule": knobs[i % len(knobs)]}
            )
        )
    return variants

def evaluate_variants(variants: List[Variant], devset: List[Dict[str, Any]],
                      llm_call: Callable[[str,str], str],
                      judge_call: Callable[[Dict[str,Any], str], float]) -> List[EvalResult]:
    """
    Evaluator agent: runs each variant on a small calibration set and returns mean scores.
    - llm_call(system, user) -> model text
    - judge_call(example, output) -> numeric score (0..1 or 0..100)
    """
    results = []
    for v in variants:
        scores = []
        case_notes = []
        for ex in devset:
            out = llm_call(v.system, v.user.format(**ex)) if "{"
            in v.user else llm_call(v.system, v.user + "\n\n" + ex.get("content",""))
            s = judge_call(ex, out)
            scores.append(s)
            case_notes.append({"id": ex.get("id"), "score": s})
        mean = float(sum(scores) / max(1, len(scores)))
        results.append(EvalResult(name=v.name, overall=mean, details={"cases": case_notes, "meta": v.meta}))
    return results

def coordinate_prompt_optimization(base_system: str, base_user: str, devset: List[Dict[str,Any]],
                                   llm_call: Callable[[str,str], str],
                                   judge_call: Callable[[Dict[str,Any], str], float],
                                   max_generations: int = 3, n_variants: int = 6, early_stop_delta: float = 0.5):
    """
    Coordinator: iterates proposer → evaluator; stops when improvement < delta.
    Returns (best_system, best_user, history).
    """
    history = []
    best_overall = -1e9
    best_system, best_user = base_system, base_user
    for gen in range(max_generations):
        variants = propose_variants(best_system, best_user, n_variants)
        results = evaluate_variants(variants, devset, llm_call, judge_call)
        results.sort(key=lambda r: r.overall, reverse=True)
        history.append({"gen": gen+1, "results": [r.__dict__ for r in results]})
        lift = results[0].overall - best_overall
        print(f"[PromptOpt] Gen {gen+1}: best={results[0].name} score={results[0].overall:.3f} (Δ={lift:.3f})")
        if lift < early_stop_delta:
            print("[PromptOpt] Early stop: minimal improvement.")
            break
        best_overall = results[0].overall
        # adopt best variant’s system (user stays the same for grading prompts)
        best_system = results[0].details.get("system", best_system) if "system" in results[0].details else variants[0].system
    return best_system, best_user, history


LOG_PATH = "prompt_evaluation_log_cleaned.json"
few_shot_examples = get_few_shot_examples(LOG_PATH)
prompt_with_examples = build_prompt_with_examples(topic, relevance_prompt_a, few_shot_examples)

# Imports (kept)
import os
import json
import random
import tiktoken
from datetime import datetime
# --- Patch: avoid shadowing the datetime module ---
# Some helpers (e.g., invoke_with_budget_guardrail) call datetime.datetime.now().
# But `from datetime import datetime` above binds `datetime` to the *class*,
# breaking references to `datetime.datetime`. Rebind the name to the module if needed.
import datetime as _datetime_module
if not hasattr(datetime, "now"):          # If datetime is the class, it has .now but not .datetime attr
    datetime = _datetime_module           # Rebind to module so datetime.datetime is valid
elif not hasattr(datetime, "datetime"):   # Extra guard: ensure module-like attr is present
    datetime = _datetime_module
import re
import matplotlib.pyplot as plt
import time
import numpy as np

# ------------------------------------------------------------
# CONFIGURATION (kept)
# ------------------------------------------------------------
TEST_MODE = True
DISCREPANCY_THRESHOLD = 20  # (kept for backwards-compat; no longer used in FAISS flow)

FAST_MODE = True
TOP_K_PAPERS = 60

CHUNK_OVERFETCH_FACTOR = 5
CHUNK_CAP_PER_PAPER = 10
LONG_PAPER_THRESHOLD = 30
TOKEN_LIMIT_BEFORE_SUMMARY = 100000

BATCH_SIZE = 10
BATCH_DELAY = 3

INPUT_COST_PER_1K = 0.00015
OUTPUT_COST_PER_1K = 0.0006

# --- Hybrid & Agent Settings ---
LLM_RECHECK_MIN_CONFIDENCE = 60        # %; if Phase 2 judge < this, treat as not relevant
AGENTIC_EXPANSION_MAX_TERMS = 12       # request up to N new terms
AGENTIC_TOP_SNIPPETS_PER_PAPER = 2     # snippets per top paper for the agent

# --- Retrieval toggles & weights (A/B friendly) ---
# Options: "faiss" (cosine only), "hybrid" (cosine+bm25), "hybrid_tiebreak" (cosine primary, bm25 as tie-breaker)
RETRIEVAL_MODE = "hybrid"
HYBRID_WEIGHTS = {"cos": 0.7, "bm25": 0.3}    # soften BM25 so it nudges, not dominates
TIEBREAK_WINDOW_MULTIPLIER = 2                # how big a window (in papers) to consider for tie-breaking
# --------------------------------------------------

total_input_tokens = 0
total_output_tokens = 0
total_cost_usd = 0.0

prior_classification = {"relevant": [], "irrelevant": [], "unknown": []}

# ------------------------------------------------------------
# FAISS INTEGRATION (replaces Chroma — no sqlite dependency)
# ------------------------------------------------------------
# We reload the FAISS index that you saved earlier with vectorstore.save_local("data/faiss_index").
# IMPORTANT: FAISS 'similarity_search_with_score' returns (doc, distance);
# lower distance = *more* similar. We handle ranking accordingly below.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
faiss_index_dir = "data/faiss_index"

vectorstore = FAISS.load_local(
    faiss_index_dir,
    embeddings=embedding_model,
    allow_dangerous_deserialization=True,
)

# ============================================================
# PHASE 1: PAPER-LEVEL PRE-FILTER (OVERFETCH + RANKING)
# ============================================================

print(f"[PHASE 1] Overfetching chunks for paper-level scoring (factor={CHUNK_OVERFETCH_FACTOR})")

# Step 1: Retrieve top-N chunks by similarity (overfetch)
overfetch_k = TOP_K_PAPERS * CHUNK_OVERFETCH_FACTOR
retrieved_chunks = vectorstore.similarity_search_with_score(topic, k=overfetch_k)  # -> list[(Document, distance)]

# --- Hybrid BM25 + Embedding Cosine per-chunk ---
# ------------------------------------------------------------------------------------
# OVERVIEW (added):
# We combine a lexical scorer (BM25) with a semantic scorer (embedding-based cosine via FAISS).
# - BM25 captures *exact/term-frequency* matches. It is strong when the query contains
#   must-have keywords (e.g., "EMA", "randomized controlled trial", "wearable sensor").
# - Embedding cosine captures *semantic* similarity. It helps when papers use different
#   phrasing/synonyms for the same idea that BM25 might miss.
# How they work together in this notebook:
#   1) Here, we compute BM25 scores for each retrieved chunk. We also compute a second
#      BM25 "soft boost" using your rubric/criteria text (relevance_prompt_a). We
#      min–max normalize both and blend them (weighted sum) to get a single BM25 score
#      per chunk (`bm25_raw`).
#   2) Downstream (outside this block), we convert FAISS distances to a cosine-like
#      similarity (`cos_sim = 1/(1+distance)`) and average it with the normalized BM25
#      score: `hybrid = 0.5 * (bm25_norm + cos_sim)`.
# This hybrid balances *recall* (semantic) and *precision on keywords* (BM25), and the
# rubric boost keeps Phase 1 aligned with Phase 2’s judging criteria—without being a
# hard filter in Phase 1.
# ------------------------------------------------------------------------------------

# Build a BM25 index over the *retrieved* chunk texts and blend with cosine (from FAISS distance)
from rank_bm25 import BM25Okapi

# 1) Prepare a BM25 corpus from the overfetched candidate chunks.
#    - We use the chunk text already returned by FAISS as our "documents".
#    - `or ""` avoids None; BM25Okapi expects lists of strings.
chunk_texts = [doc.page_content or "" for (doc, _d) in retrieved_chunks]

# 2) Very simple tokenization (whitespace split). This is intentionally lightweight to
#    keep parity with your current preprocessing; you can swap in a smarter tokenizer
#    later (e.g., regex, spaCy) without changing the surrounding logic.
tokenized = [t.split() for t in chunk_texts]

# 3) Initialize BM25 with the tokenized candidate set.
bm25 = BM25Okapi(tokenized)

# 4) Optional "soft boost" from rubric criteria:
#    - `relevance_prompt_a` encodes what *matters* per the NOFO (Phase 2 rubric).
#    - We score the same chunks against these criteria tokens and then blend that
#      score with the topic-only score below. This gently nudges ranking toward
#      rubric-aligned chunks without excluding anything at this stage.
criteria_text = (relevance_prompt_a or "").strip()
criteria_tokens = criteria_text.split() if criteria_text else []

# 5) Compute raw BM25 scores for (a) the topic and (b) the rubric criteria.
#    - Both arrays are the length of `retrieved_chunks`, aligned by index.
bm25_topic_raw = bm25.get_scores(topic.split())
bm25_crit_raw  = bm25.get_scores(criteria_tokens) if criteria_tokens else np.zeros_like(bm25_topic_raw)

# 6) Helper: min–max normalization to [0,1] per candidate set.
#    - This prevents one scorer from dominating due to scale differences.
#    - The small epsilon (1e-9) guards against division by zero when all values are equal.
def _mm(x):
    x = np.array(x, dtype=float)
    mn, mx = x.min(), x.max()
    return (x - mn) / (mx - mn + 1e-9)

# 7) Normalize both BM25 channels to [0,1].
bm25_topic = _mm(bm25_topic_raw)
bm25_crit  = _mm(bm25_crit_raw) if criteria_tokens else np.zeros_like(bm25_topic)

# 8) Blend topic vs. criteria with explicit weights.
#    - W_TOPIC high preserves broad recall on the main query.
#    - W_CRITERIA injects rubric awareness as a *soft* signal.
W_TOPIC, W_CRITERIA = 0.7, 0.3  # keep recall strong on topic
bm25_raw = W_TOPIC * bm25_topic + W_CRITERIA * bm25_crit  # blended BM25 per chunk

# ------------------------------------------------------------------------------------
# ORIGINAL (topic-only) BM25 overwrite (DEFUNCT):
# The next two lines would *override* the blended `bm25_raw` with topic-only scores,
# nullifying the rubric soft boost we just computed. Per your rules, we DO NOT delete
# them; we comment them out and explain why.
# ------------------------------------------------------------------------------------
# # BM25 scores for the topic
# bm25_raw = bm25.get_scores(topic.split())  # DEFUNCT: would discard rubric blending above.

# 9) Final min–max normalization of the (now blended) BM25 to [0,1] for fusion.
#    - Downstream we combine `bm25_norm` with `cos_sim` (converted from FAISS distance)
#      as `hybrid = 0.5 * (bm25_norm + cos_sim)`.
bm25_min, bm25_max = float(np.min(bm25_raw)), float(np.max(bm25_raw))
bm25_norm = [(s - bm25_min) / (bm25_max - bm25_min + 1e-9) for s in bm25_raw]

# Convert FAISS distance to a [0,1] similarity (smaller distance => larger similarity)
def dist_to_sim(d: float) -> float:
    return 1.0 / (1.0 + float(d))

# Score helpers (use RETRIEVAL_MODE + HYBRID_WEIGHTS)
def chunk_score(bm25_val: float, distance: float) -> float:
    """Return the per-chunk score used for ranking.
    Larger is better (cosine-like similarity / normalized BM25)."""
    cos = 1.0 / (1.0 + float(distance))
    if RETRIEVAL_MODE == "faiss":
        return cos
    elif RETRIEVAL_MODE == "hybrid":
        return HYBRID_WEIGHTS["cos"] * cos + HYBRID_WEIGHTS["bm25"] * float(bm25_val)
    elif RETRIEVAL_MODE == "hybrid_tiebreak":
        # Primary key is cosine; BM25 used later as secondary tie-breaker at paper level.
        return cos
    else:
        return cos  # safe default

paper_chunk_counts = {}       # {paper_id: count}
paper_chunk_hybrids = {}      # {paper_id: [hybrid_chunk_scores]}

paper_bm25_best = {}          # {paper_id: best bm25_norm}  # for hybrid_tiebreak secondary key

for (doc, distance), b_s in zip(retrieved_chunks, bm25_norm):
    pid = doc.metadata.get("paper_id", "Unknown_Paper")
    # DEFUNCT fixed 50/50 fusion (kept for traceability):
    # cos_sim = dist_to_sim(distance)
    # hybrid = 0.5 * (b_s + cos_sim)

    score = chunk_score(b_s, distance)       # honors RETRIEVAL_MODE + HYBRID_WEIGHTS
    paper_chunk_counts[pid] = paper_chunk_counts.get(pid, 0) + 1
    paper_chunk_hybrids.setdefault(pid, []).append(score)

    # Track best BM25 per paper for later tie-breaking (only used in 'hybrid_tiebreak')
    if b_s is not None:
        prev = paper_bm25_best.get(pid, 0.0)
        if b_s > prev:
            paper_bm25_best[pid] = float(b_s)

# Per-paper: use BEST hybrid chunk, then normalize by chunk count to avoid big-paper bias
paper_hybrid_best = {pid: max(sc_list) for pid, sc_list in paper_chunk_hybrids.items()}
paper_norm_score = {
    pid: best / (1.0 + np.log1p(paper_chunk_counts.get(pid, 1)))   # smooth penalty by chunk count
    for pid, best in paper_hybrid_best.items()
}

# Step 2: Aggregate distances per paper_id
# --- NEW: accumulate distances across both retrieval passes ---
paper_distances = {}  # {paper_id: [distances]}  # unified, first + expanded pass
# --------------------------------------------------------------
# NOTE: In FAISS, *smaller* distance is better. We will rank by MIN distance per paper.
paper_scores = {}  # {paper_id: [distances]}
for doc, distance in retrieved_chunks:
    pid = doc.metadata.get("paper_id", "Unknown_Paper")
    paper_scores.setdefault(pid, []).append(distance)       # kept
    paper_distances.setdefault(pid, []).append(distance)    # NEW: unified

# Step 3: Rank papers by **min distance** (primary) and print **mean** for debugging

# ranked_papers = sorted(
#     paper_scores.items(),
#     key=lambda x: min(x[1])  # smaller distance = closer
# )

# --- New: rank by normalized per-paper score (higher is better) ---
ranked_papers = sorted(paper_norm_score.items(), key=lambda x: x[1], reverse=True)

# Optional tie-break: if RETRIEVAL_MODE == "hybrid_tiebreak",
# refine ordering for the top window using best BM25 as a secondary key.
if RETRIEVAL_MODE == "hybrid_tiebreak":
    window = max(1, TIEBREAK_WINDOW_MULTIPLIER * TOP_K_PAPERS)
    head = ranked_papers[:window]
    tail = ranked_papers[window:]
    # Within the head, sort by (norm_score primary, bm25_best secondary)
    head = sorted(
        head,
        key=lambda x: (x[1], paper_bm25_best.get(x[0], 0.0)),
        reverse=True
    )
    ranked_papers = head + tail

# Step 4: Select top-K unique papers
top_paper_ids = [pid for pid, _ in ranked_papers[:TOP_K_PAPERS]]

# --- Debug: Show Phase 1 normalization details for top-ranked papers ---
print("[PHASE 1] Normalization preview for top papers (higher score = more relevant after penalty):")
for pid, norm_score in ranked_papers[:TOP_K_PAPERS]:
    chunks_seen = paper_chunk_counts.get(pid, 0)
    divisor = 1.0 + np.log1p(chunks_seen)
    best_hybrid = paper_hybrid_best.get(pid, float("nan"))
    print(
        f" - {pid}\n"
        f"    best_hybrid_score = {best_hybrid:.4f}\n"
        f"    chunks_seen       = {chunks_seen}\n"
        f"    divisor           = {divisor:.3f}  # penalty for chunk count\n"
        f"    normalized_score  = {norm_score:.4f}"
    )

# --- Agentic query expansion (2nd pass) ---
try:
    # Collect seed snippets from the already-retrieved chunks for top papers
    by_pid_chunks = {}
    for doc, dist in retrieved_chunks:
        pid = doc.metadata.get("paper_id", "Unknown_Paper")
        by_pid_chunks.setdefault(pid, []).append((dist, doc.page_content or ""))

    seed_snippets = []
    for pid in top_paper_ids:
        # take the closest snippets for each top paper
        for dist, txt in sorted(by_pid_chunks.get(pid, []), key=lambda x: x[0])[:AGENTIC_TOP_SNIPPETS_PER_PAPER]:
            seed_snippets.append(txt[:500])

    # ----------------------------------------------------------------------
    # ORIGINAL PROMPT (now commented out): This placed the bullet list *outside*
    # the f-string, so "SNIPPETS:" appeared but the bullets were appended after
    # the string literal. It also lacked CRITERIA guidance and had no guard for
    # empty snippets or None criteria. Keeping for traceability.
    # ----------------------------------------------------------------------
    # expansion_prompt = f"""
    # You are assisting literature retrieval for an NIH NOFO with topic:
    # "{topic}"
    #
    # From the following snippets of already-retrieved papers, identify 5–{AGENTIC_EXPANSION_MAX_TERMS} domain terms,
    # phrases, or synonyms that are likely important BUT NOT explicitly present in the query above.
    # Prefer multi-word phrases when meaningful. Return ONLY a comma-separated list (no numbering, no quotes).
    #
    # SNIPPETS:
    #
    # - """ + "\n- ".join(seed_snippets[:80])

    # ----------------------------------------------------------------------
    # PATCHED PROMPT (active):
    # - Ensures SNIPPETS bullet list is rendered *directly under* the SNIPPETS header.
    # - Adds CRITERIA guidance (from relevance_prompt_a) *after* the evidence.
    # - Guards against None/empty: shows "(no snippets available)" when no snippets;
    #   uses empty string for criteria if relevance_prompt_a is None.
    # - Extensively commented per user rules.
    # ----------------------------------------------------------------------
    crit_text = (relevance_prompt_a or "").strip()  # Safe: if None, becomes ""
    # Clean and cap snippets (keep existing 80-cap); drop empties/whitespace-only
    snips = [s.strip() for s in seed_snippets[:80] if s and s.strip()]
    snip_block = ("- " + "\n- ".join(snips)) if snips else "(no snippets available)"

    expansion_prompt = (
        f"""You are assisting literature retrieval for an NIH NOFO with topic:
"{topic}"

From the following snippets of already-retrieved papers, identify 5–{AGENTIC_EXPANSION_MAX_TERMS} domain terms,
phrases, or synonyms that are likely important BUT NOT explicitly present in the query above.
Prefer multi-word phrases when meaningful. Return ONLY a comma-separated list (no numbering, no quotes).

SNIPPETS:
{snip_block}

CRITERIA (use these to guide which concepts are missing):
{crit_text}
"""
    )
    # ----------------------------------------------------------------------

    exp = invoke_with_budget_guardrail(
        llm, expansion_prompt, model_name="gpt-4o-mini",
        trace_info={"phase": "agentic_query_expansion"}
    )
    raw_terms = exp.content or ""
    new_terms = [t.strip() for t in re.split(r"[,\n;]", raw_terms) if t.strip()]
    new_terms = [t for t in new_terms if t.lower() not in topic.lower()]

    if new_terms:
        expanded_query = topic + " " + " ".join(new_terms[:AGENTIC_EXPANSION_MAX_TERMS])
        print(f"[AGENT] Expanded query with {len(new_terms)} terms -> rerunning retrieval...")

        # Rerun retrieval with expanded query
        retrieved_chunks_exp = vectorstore.similarity_search_with_score(expanded_query, k=overfetch_k)
        # NEW: record distances from expanded pass too
        for doc, dist in retrieved_chunks_exp:
            pid = doc.metadata.get("paper_id", "Unknown_Paper")
            paper_distances.setdefault(pid, []).append(dist)

        # Recompute BM25 on the expanded candidate set and hybrid scores
        exp_texts = [doc.page_content or "" for (doc, _d) in retrieved_chunks_exp]
        exp_tokenized = [t.split() for t in exp_texts]
        bm25_exp = BM25Okapi(exp_tokenized)
        bm25_raw_exp = bm25_exp.get_scores(expanded_query.split())
        bmin_e, bmax_e = float(np.min(bm25_raw_exp)), float(np.max(bm25_raw_exp))
        bm25_norm_exp = [(s - bmin_e) / (bmax_e - bmin_e + 1e-9) for s in bm25_raw_exp]

        # Update normalized per-paper hybrid scores with the expanded pass (merge + dedupe)
        paper_norm_score_exp = dict(paper_norm_score)  # start with first-pass scores
        # (also reuse best hybrid we already tracked)
        for (doc, dist), b_s in zip(retrieved_chunks_exp, bm25_norm_exp):
            pid = doc.metadata.get("paper_id", "Unknown_Paper")
            h = 0.5 * (b_s + dist_to_sim(dist))
            # refresh best hybrid and chunk counts
            best = max(h, paper_hybrid_best.get(pid, 0.0))
            cnt = paper_chunk_counts.get(pid, 0) + 1
            paper_norm_score_exp[pid] = max(paper_norm_score_exp.get(pid, 0.0), best / (1.0 + np.log1p(cnt)))

        # Re-rank and take top-K
        ranked_papers = sorted(paper_norm_score_exp.items(), key=lambda x: x[1], reverse=True)
        top_paper_ids = [pid for pid, _ in ranked_papers[:TOP_K_PAPERS]]
    else:
        print("[AGENT] No new terms produced; using original ranking.")
except Exception as e:
    print(f"[AGENT] Expansion skipped due to error: {e}")

print("[PHASE 1] Selecting top papers based on normalized per-paper score",
      "(best_chunk_score penalized by chunk_count; higher = better).")

# Debug: Print ranking summary
print(f"Selected top {len(top_paper_ids)} papers for Phase 2 evaluation:")
for pid in top_paper_ids:
    distances = paper_distances.get(pid, [])
    if not distances:
        print(f" - {pid}: (no distance stats available from retrieval passes)")
        continue
    print(f" - {pid}: best(min)={min(distances):.4f}, avg={np.mean(distances):.4f}, chunks={len(distances)}")

# Percentile summary (of best distances per paper)
best_dists = [min(dlist) for dlist in paper_distances.values()]

if best_dists:
    percentiles = [25, 50, 75, 90]
    print("\n[PHASE 1] Distance Percentile Summary (lower = more similar):")
    for p in percentiles:
        val = np.percentile(best_dists, p)
        print(f"  {p}th percentile: {val:.4f}")
    print(f"  Min (best): {min(best_dists):.4f}")
    print(f"  Max (worst): {max(best_dists):.4f}")
else:
    print("[PHASE 1] No distances available to compute percentiles.")

# ============================================================
# PHASE 2: FULL-CHUNK RETRIEVAL WITH CAP + EARLY SUMMARY
# ============================================================

print("[PHASE 2] Retrieving all chunks (bulk) and filtering for top papers...")

# Retrieve many chunks at once
# NOTE: FAISS returns (doc, distance). Smaller distance = better.
all_chunks = vectorstore.similarity_search_with_score(topic, k=9999)

paper_chunk_data = {}  # {paper_id: [(distance, text), ...]}

for doc, distance in all_chunks:
    pid = doc.metadata.get("paper_id", "Unknown_Paper")
    if pid in top_paper_ids:
        paper_chunk_data.setdefault(pid, []).append((distance, doc.page_content))

# Build per-paper text with cap and optional early summaries
paper_chunks = {}
paper_chunk_counts = {}

for pid, chunks in paper_chunk_data.items():
    # Sort by ascending distance (most similar first)
    sorted_chunks = sorted(chunks, key=lambda x: x[0])
    capped = sorted_chunks[:CHUNK_CAP_PER_PAPER]
    paper_chunk_counts[pid] = len(sorted_chunks)

    if len(sorted_chunks) > LONG_PAPER_THRESHOLD:
        print(f"[EARLY SUMMARY] Paper '{pid}' exceeds {LONG_PAPER_THRESHOLD} chunks → summarizing chunks.")
        chunk_summaries = []
        for _, chunk_text in capped:
            summary = summarize_text(chunk_text)
            chunk_summaries.append(summary)
        paper_chunks[pid] = "\n".join(chunk_summaries)
    else:
        paper_chunks[pid] = "\n".join([text for _, text in capped])

aggregated_papers = list(paper_chunks.items())
print(f"[PHASE 2] Aggregated {len(aggregated_papers)} papers for full relevance evaluation.")

print("[PHASE 2] Chunk count per selected paper (post-aggregation with cap):")
for pid, total in paper_chunk_counts.items():
    print(f" - {pid}: {min(total, CHUNK_CAP_PER_PAPER)} chunks used (original {total})")

# Optional warning for overrepresentation (kept)
WARNING_THRESHOLD_PERCENT = 25
total_chunks_used = sum(min(count, CHUNK_CAP_PER_PAPER) for count in paper_chunk_counts.values())
for pid, count in paper_chunk_counts.items():
    used = min(count, CHUNK_CAP_PER_PAPER)
    percent = (used / total_chunks_used) * 100 if total_chunks_used else 0
    if percent > WARNING_THRESHOLD_PERCENT:
        print(f"*** WARNING: Paper '{pid}' contributes {percent:.1f}% of post-cap chunks ({used}/{total_chunks_used}). ***")

# ------------------------------------------------------------
# MAIN LOOP (updated to parse summary-only JSON)
# ------------------------------------------------------------
documents = []
irrelevant_docs_list = []
progress_cnt = 1
relevant_papers_count = 0
irrelevant_papers_count = 0
total_files = len(aggregated_papers)

for batch_start in range(0, total_files, BATCH_SIZE):
    batch = aggregated_papers[batch_start: batch_start + BATCH_SIZE]
    print(f"\nProcessing batch {batch_start//BATCH_SIZE + 1} "
          f"({len(batch)} papers) out of {total_files} total papers...")

    for paper_id, paper_text in batch:
        try:
            # Token budget check
            token_estimate = len(encoding.encode(paper_text))
            if token_estimate > TOKEN_LIMIT_BEFORE_SUMMARY:
                print(f"[TOKEN GUARD] Paper '{paper_id}' estimated {token_estimate} tokens → auto-summarizing.")
                paper_text = summarize_text(paper_text)

            # Build final prompt (few-shot prefix + paper text appended to base prompt)
            available_tokens = MAX_TOKENS - len(encoding.encode(prompt_with_examples))
            truncated_text = encoding.decode(encoding.encode(paper_text)[:available_tokens])
            full_prompt = prompt_with_examples + truncated_text

            # Cost tracking
            token_count = len(encoding.encode(full_prompt))
            total_input_tokens += token_count
            total_output_tokens += int(token_count * 0.1)
            total_cost_usd = (
                (total_input_tokens / 1000) * INPUT_COST_PER_1K +
                (total_output_tokens / 1000) * OUTPUT_COST_PER_1K
            )
            print(f"[Token Count] {paper_id}: {token_count} tokens "
                  f"(Estimated running cost: ${total_cost_usd:.4f})")

            # Call LLM
            response = invoke_with_budget_guardrail(
                llm,
                full_prompt,
                model_name="gpt-4o-mini",
                trace_info={
                    "paper_id": paper_id,
                    "phase": "relevance_eval",
                    "batch": batch_start // BATCH_SIZE + 1,
                    "iteration": progress_cnt,
                },
            )

            progress_cnt += 1

            # --- Parse STRICT summary-only JSON per instructor spec ---
            parsed = {}
            try:
                parsed = json.loads(response.content)
            except json.JSONDecodeError:
                m = re.search(r"\{.*\}", response.content, re.DOTALL)
                parsed = json.loads(m.group(0)) if m else {"summary": "PAPER NOT RELATED TO TOPIC"}

            summary_text = (parsed or {}).get("summary", "").strip()

            # --- LLM relevance re-check (0–100%) ---
            try:
                judge_prompt = f"""
            Rate relevance 0–100 using this rubric:
            - Coverage of the NOFO's primary topic (40%)
            - Alignment with these criteria (60% total), score each 0–100 and weight equally:
            {relevance_prompt_a}

            Return ONLY a single integer (0–100). If evidence is weak or speculative, score lower.

            TOPIC:
            {topic}

            PAPER SUMMARY:
            {summary_text}

            """
                judge_resp = invoke_with_budget_guardrail(
                    llm, judge_prompt, model_name="gpt-4o-mini",
                    trace_info={"phase": "relevance_recheck", "paper_id": paper_id}
                )
                m = re.search(r"\b(\d{1,3})\b", (judge_resp.content or ""))
                llm_conf = max(0, min(100, int(m.group(1)))) if m else 50
                # Map numeric judge score to an INTERNAL label (not part of model output)
                if llm_conf >= 85:
                    internal_conf_label = "high"
                elif llm_conf >= 70:
                    internal_conf_label = "medium"
                else:
                    internal_conf_label = "low"

            except Exception as e:
                print(f"[PHASE 2] LLM re-check failed; defaulting to 50%. Error: {e}")
                llm_conf = 50

            # Trim likely false positives
            if llm_conf < LLM_RECHECK_MIN_CONFIDENCE:
                irrelevant_papers_count += 1
                irrelevant_docs_list.append(paper_id)
                continue

            # Decision: relevant vs not
            if summary_text.upper() == "PAPER NOT RELATED TO TOPIC" or not summary_text:
                irrelevant_papers_count += 1
                irrelevant_docs_list.append(paper_id)
                continue

            # Store result (confidence fields now N/A; we keep placeholders for log schema compatibility)
            documents.append({
                'title': paper_id,
                'file_path': "(from FAISS index)",
                'llm_reasoning': json.dumps({"summary": summary_text}, ensure_ascii=False),  # OUTPUT stays single-field JSON
                'model_confidence': None,      # kept for schema compatibility
                'rule_confidence': None,
                'confidence_discrepancy': None,
                'flagged_for_review': False,
                'internal_relevance_confidence': internal_conf_label  # INTERNAL ONLY; not part of returned JSON
            })

            relevant_papers_count += 1

        except Exception as e:
            print(f"!!! Error processing {paper_id}: {str(e)}")

    print(f"Batch {batch_start//BATCH_SIZE + 1} complete. Sleeping {BATCH_DELAY} seconds...")
    time.sleep(BATCH_DELAY)

# ------------------------------------------------------------
# SUMMARY OUTPUT (kept; confidence fields will be None)
# ------------------------------------------------------------
print("=" * 50)
print(f"Relevant Papers: {relevant_papers_count}/{total_files}")
print(f"Irrelevant Papers: {irrelevant_papers_count}/{total_files}")
print(f"Estimated Total Input Tokens: {total_input_tokens}")
print(f"Estimated Total Output Tokens: {total_output_tokens}")
print(f"Estimated Total Cost: ${total_cost_usd:.4f}")
print("=" * 50)

print("\nList of relevant papers:")
for doc in documents:
    print(f"\nTitle: {doc['title']}")
    print(f"Reasoning (summary only; truncated): {doc['llm_reasoning'][:300]}...")

# ------------------------------------------------------------
# LOGGING (kept; now logs only summary JSON in 'reasoning')
# ------------------------------------------------------------
relevant_docs_with_reasoning = [
    {
        "title": d['title'],
        "reasoning": d['llm_reasoning'],  # contains {"summary": "..."} per spec
        "model_confidence": d['model_confidence'],
        "rule_confidence": d['rule_confidence'],
        "confidence_discrepancy": d['confidence_discrepancy'],
        "flagged_for_review": d['flagged_for_review']
    }
    for d in documents
]

log_prompt_iteration(
    json_path="prompt_evaluation_log_cleaned.json",
    prompt=prompt_with_examples,
    relevant_docs_with_reasoning=relevant_docs_with_reasoning,
    irrelevant_docs=irrelevant_docs_list
)


SyntaxError: expected 'else' after 'if' expression (790476105.py, line 84)

In [None]:
# Suggested interim step from Claude -- Need to think through and also add agentic step here
# Use this AFTER collecting relevant papers
proposal_alignment_prompt = """
Given this collection of relevant research papers and the NOFO requirements, 
evaluate how each paper could contribute to a fundable proposal:

ALIGNMENT DIMENSIONS:
1. Methodological Alignment
   - Does it provide validated research methods?
   - Clinical trial designs or evaluation frameworks?
   
2. Theoretical Contribution
   - Relevant theoretical frameworks?
   - Evidence base for intervention design?
   
3. Practical Application
   - Direct implementation pathways?
   - Technology solutions or components?
   
4. Gap Identification
   - What problems does it highlight?
   - What opportunities for innovation?

[Continue with specific NIMH priorities and NOFO requirements...]
"""

> **Note:** The following section explains core functionality and workflow.

## **Step 3: Proposal Ideation Based on Filtered Research - [4 marks]**
> **Use the filtered papers, to generate ideas for the Reseach Proposal.**
---
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt, 1 mark for the Generating Idea and 1 mark for fetching file path of chosen idea along with successful completion of this section, including debugging or modifying the code if necessary.*

> **Note:** The following section explains core functionality and workflow.

**TASK:** Write an Prompt which can be used to generate 5 ideas for the Research Proposal, each idea should consist:

1. **Idea X:** [Concise Title of the Project Idea]  \n
2. **Description:** [Brief and targeted description summarizing the objectives, innovative elements, scientific rationale, and anticipated impact.]  \n
3. **Citation:** [Author(s), Year or Paper Title]  \n
4. **NOFO Alignment:** [List two or more specific NOFO requirements that this idea directly addresses]  \n
5. **File Path of the Research Paper:** [Exact file path, ending in .pdf]

- Use the Delimiter `---` for defining the structure of the sample outputs in the prompt





> **Note:** The following section explains core functionality and workflow.

#### Generating 5 Ideas

In [None]:
# Note to self: Be sure to add additional details from page linked in the NOFO pdf
# Also need to include constraints, e.g., "Digital health test beds that leverage well-established 
# digital health platforms to optimize evidence-based digital mental health interventions"

# Suggested criteria from Claude:


gen_idea_prompt = f"""


<WRITE YOUR PROMPT HERE>


"""

In [None]:
ideas = invoke_with_budget_guardrail(
    llm,
    gen_idea_prompt,
    model_name="gpt-4o-mini",
    trace_info={
        "paper_id": paper_id,
        "phase": "relevance_eval",
        "batch": batch_start // BATCH_SIZE + 1,
        "iteration": progress_cnt,
    },
)

In [None]:
from IPython.display import Markdown, display
display(Markdown(ideas.content))

In [None]:
# For consideration if extracted text is not clean enough
# Add post-extraction GPT-enabled noise removal step
# to remove additional noise from chunks

# Too resource intensive for full data set. Add later if needed.

# import json
# from openai import OpenAI

# # Initialize OpenAI client
# client = OpenAI()

# def semantic_clean_text(raw_text):
#     prompt = f"""
# You are a document cleaner. Extract ONLY the main body text from the following academic or technical document:
# - Remove page numbers, headers/footers
# - Remove title page, author affiliations, figure/table captions
# - Remove references/bibliography sections
# - Keep abstracts, introductions, main sections, and conclusions

# Document:
# \"\"\"{raw_text}\"\"\"

# Return only the cleaned text.
# """
#     response = client.responses.create(
#         model="gpt-4o-mini",
#         input=prompt,
#         max_output_tokens=4000
#     )
#     return response.output_text

# # --- Ingest cleaned + chunked data and post-process with GPT ---
# input_path = "data/cleaned_chunked_papers.json"
# output_path = "data/cleaned_gpt.json"

# # Load chunked data
# with open(input_path, "r", encoding="utf-8") as f:
#     chunked_data = json.load(f)

# # Prepare list for GPT-processed results
# gpt_cleaned_data = []

# # Loop through each document
# for record in chunked_data:
#     doc_id = record["id"]
#     gpt_chunks = []

#     print(f"Post-processing (GPT cleanup): {doc_id}")

#     # Apply GPT cleaning to each chunk
#     for chunk in record["chunks"]:
#         cleaned_chunk = semantic_clean_text(chunk)
#         gpt_chunks.append(cleaned_chunk)

#     # Store result
#     gpt_cleaned_data.append({
#         "id": doc_id,
#         "chunks": gpt_chunks
#     })

# # Save GPT-cleaned data
# with open(output_path, "w", encoding="utf-8") as f:
#     json.dump(gpt_cleaned_data, f, indent=2, ensure_ascii=False)

# print(f"Saved GPT post-processed chunks to {output_path}")


> **Note:** The following section explains core functionality and workflow.

#### Choosing 1 Idea and fetching details

In [None]:
# Modify the idea_number for choosing the different idea
idea_number = 5   # change the number if you wish to choose and generate the research proposal for another idea
chosen_idea = ideas.content.split("---")[idea_number]

In [None]:
# Import required libraries for core functionality
import re

# Use a regular expression to find the file path of the research paper

pattern = r"File Path of the Research Paper:\*\*\s*(.+?)\n"
# If you are unable to extract the file path successfully using this pattern, use the `ChatGPT` or any other LLM to find the pattern that works for you, simply provide the LLM the sample response of your whole ideas and ask the LLM to generate the regex patterm for extracting the "File Path of the Research Paper"

match = re.search(pattern, chosen_idea)

if match:
  idea_generated_from_research_paper = match.group(1).strip()
  print("Filepath : ", idea_generated_from_research_paper)
else:
  print("File Path of the Research Paper not found in the chosen idea.")

> **Note:** The following section explains core functionality and workflow.

## **Step 4: Proposal Blueprint Preparation - [3 Marks]**

> **Select appropriate research ideas for the proposal and supply 'Sample Research Proposals' as templates to the LLM to support the generation of the final proposal.**
---   
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt and 1 mark for the successful completion of this section, including debugging or modifying the code if necessary.*

**TASK:** Write an Prompt which can be used to generate the Research Proposal.

The prompt should be able to craft a research proposal based on the sample research proposal template, using one of the ideas generated above. The proposal should include references to the actual research papers from which the ideas are derived and should align well with the NOFO documents.

In [None]:
# Here we need to add the full papers instead of the summary
# Load PDF files and extract content using PyPDFLoader
chosen_idea_rp = PyPDFLoader(idea_generated_from_research_paper, mode="single").load()

# Loading the sample research proposal template
# Load PDF files and extract content using PyPDFLoader
research_proposal_template = PyPDFLoader(" <Path of Research Proposal Template> ", mode="single").load()

In [None]:
import json
import os
from pypdf import PdfReader
import camelot
import pytesseract
from pdf2image import convert_from_path
import tiktoken

# --- Tokenization setup ---
encoding = tiktoken.encoding_for_model("gpt-4o-mini")
MAX_TOKENS = 127500          # total model context window
EXTRACTION_BUDGET = 100000   # reserve ~20% for prompts/response

def count_tokens(text):
    """Count tokens using tiktoken encoding."""
    return len(encoding.encode(text))

# --- Load matching papers from JSON log ---
def load_matched_papers(json_path, pdf_folder="content"):
    """
    Extract list of relevant document file paths from the latest JSON iteration.
    """
    with open(json_path, "r") as f:
        data = json.load(f)
    
    # Take the last iteration's relevant_documents
    last_iteration = data[-1]
    relevant_docs = last_iteration.get("relevant_documents", [])
    
    # Build file paths for each relevant doc (assumes they exist in pdf_folder)
    file_paths = []
    for doc in relevant_docs:
        title = doc["title"]
        pdf_path = os.path.join(pdf_folder, title)
        if os.path.exists(pdf_path):
            file_paths.append(pdf_path)
        else:
            print(f"Warning: {pdf_path} not found. Skipping.")
    return file_paths

# --- Stage 1 & 2: Text + Table extraction ---
def extract_text_and_tables(file_path, token_budget):
    """Extract text and tables within token budget."""
    content = ""
    token_count = 0

    # Stage 1: PyPDF text extraction
    try:
        reader = PdfReader(file_path)
        for page in reader.pages:
            page_text = page.extract_text() or ""
            token_count += count_tokens(page_text)
            if token_count > token_budget:
                print(f"Token budget reached during text extraction: {file_path}")
                break
            content += page_text
    except Exception as e:
        print(f"PyPDF extraction failed: {e}")

    # Stage 2: Table extraction (Camelot)
    # try:
    #     tables = camelot.read_pdf(file_path, pages='all')
    #     for table in tables:
    #         table_text = "\n[Table Extracted]\n" + table.df.to_string()
    #         token_count += count_tokens(table_text)
    #         if token_count > token_budget:
    #             print(f"Token budget reached during table extraction: {file_path}")
    #             break
    #         content += table_text
    # except Exception:
    #     pass

    return content, token_count

# --- Stage 3: OCR extraction ---
# def extract_ocr(file_path, token_budget, current_tokens=0):
#     """Extract OCR text (figures/scanned pages) within remaining token budget."""
#     content = ""
#     token_count = current_tokens

#     try:
#         images = convert_from_path(file_path)
#         for image in images:
#             ocr_text = pytesseract.image_to_string(image)
#             token_count += count_tokens(ocr_text)
#             if token_count > token_budget:
#                 print(f"Token budget reached during OCR extraction: {file_path}")
#                 break
#             content += "\n[OCR Extracted]\n" + ocr_text
#     except Exception:
#         pass

    return content

# --- Process all matched papers ---
def process_matched_papers(json_path, pdf_folder="content"):
    """
    Load matched papers from JSON and process them using multi-stage extraction:
    Pass 1: Text + Tables
    Pass 2: OCR (Figures)
    Returns dict mapping filename -> combined extracted content.
    """
    matched_files = load_matched_papers(json_path, pdf_folder)
    text_table_data = {}
    token_usage = {}

    for file_path in matched_files:
        print(f"Extracting text/tables: {os.path.basename(file_path)}")
        content, tokens_used = extract_text_and_tables(file_path, EXTRACTION_BUDGET)
        text_table_data[os.path.basename(file_path)] = content
        token_usage[os.path.basename(file_path)] = tokens_used

    # Return text_table_data directly
    return text_table_data

    # Pass 2: Extract OCR for all files (if budget allows)
    # for file_path in matched_files:
    #     filename = os.path.basename(file_path)
    #     remaining_budget = EXTRACTION_BUDGET - token_usage.get(filename, 0)
    #     if remaining_budget > 0:
    #         print(f"Extracting OCR: {filename} (remaining budget: {remaining_budget})")
    #         ocr_content = extract_ocr(file_path, EXTRACTION_BUDGET, token_usage[filename])
    #         results[filename] = text_table_data[filename] + ocr_content
    #     else:
    #         print(f"Skipping OCR for {filename} (no remaining token budget)")
    #         results[filename] = text_table_data[filename]

# Example usage:
# matched_content = process_matched_papers("/mnt/data/prompt_evaluation_log_cleaned.json", pdf_folder="../content")
# print(matched_content.keys())


In [None]:
matched_content = process_matched_papers("prompt_evaluation_log_cleaned.json", pdf_folder="data/raw")

In [None]:
print(matched_content)

In [None]:
research_proposal_template_prompt = f"""


<WRITE YOUR PROMPT HERE>


"""

In [None]:
research_plan = invoke_with_budget_guardrail(
    llm,
    research_proposal_template_prompt,
    model_name="gpt-4o-mini",
    trace_info={
        "paper_id": paper_id,
        "phase": "relevance_eval",
        "batch": batch_start // BATCH_SIZE + 1,
        "iteration": progress_cnt,
    },
)

In [None]:
display(Markdown(research_plan.content))

In [None]:
# @title **Optional Part - Creating a PDF of the Research Proposal**
# The code in this cell block is used for printing out the output in the PDF format
from markdown_pdf import MarkdownPdf, Section

pdf = MarkdownPdf()
pdf.add_section(Section(research_plan.content))
pdf.save("Reseach Proposal First Draft.pdf")

> **Note:** The following section explains core functionality and workflow.

## **Step 5: Proposal Evaluation Against NOFO Criteria - [3 Marks]**
> **Use the LLM to evaluate the generated proposal (LLM-as-Judge) and assess its alignment with the NOFO criteria.**
   

---
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt and 1 mark for the successful completion of this section, including debugging or modifying the code if necessary.*

**TASK:** Write an Prompt which can be used to evaluate the Research Proposal based on:
1. **Innovation**
2. **Significance**
3. **Approach**
4. **Investigator Expertise**

- Ask the LLM to rate on each of the criteria from **1 (Poor)** to **5 (Excellent)**
- Ask the LLM to provide the resonse in the json format
```JSON
name: Innovation
    justification: "<Justification>"
    score: <1-5>
    strengths: "<Strength 1>"
    weaknesses: "<Weakness 1>"
    recommendations: "<Recommendation 1>"
```



In [None]:
evaluation_prompt = f'''


<WRITE YOUR PROMPT HERE>


'''

In [None]:
# Call the LLM with the prepared prompt and truncated paper content
eval_response = invoke_with_budget_guardrail(
    llm,
    evaluation_prompt,
    model_name="gpt-4o-mini",
    trace_info={
        "paper_id": paper_id,
        "phase": "relevance_eval",
        "batch": batch_start // BATCH_SIZE + 1,
        "iteration": progress_cnt,
    },
)


In [None]:
# Import required libraries for core functionality
import json
json_resp = json.loads(eval_response.content[7:-3])

In [None]:
for key, value in json_resp.items():
  print(f"---\n{key}:")
  if isinstance(value, list):
    for item in value:
      for k, v in item.items():
        print(f"  {k}: {v}")
      print("="*50)
  elif isinstance(value, dict):
    for k, v in value.items():
      print(f"  {k}: {v}")
  else:
    print(f"  {value}")

> **Note:** The following section explains core functionality and workflow.

## **Step 6: Human Review and Refinement of Proposal**
> **Perform Human Evaluation of the generated Proposal. Edit or Modify the proposal as necessary.**

In [None]:
display(Markdown(research_plan.content))

# **Step 7: Summary and Recommendation - [2 Marks]**


Based on the projects, learners are expected to share their observations, key learnings, and insights related to this business use case, including the challenges they encountered.

Additionally, they should recommend or explain any changes that could improve the project, along with suggesting additional steps that could be taken for further enhancement.



In [None]:

# --- Enhanced PDF Processing (Commenting original PyPDF-only approach) ---
# Original starter code (commented for traceability):
# Load PDF files and extract content using PyPDFLoader
# docs = PyPDFLoader(file_path, mode="single").load()

# New Implementation: Multi-stage parsing (PyPDF → Camelot/Tabula → OCR fallback)
# Purpose: Capture text, tables, and figures from diverse PDF formats (Mermaid C node, Rubric Step 2).

from PyPDF2 import PdfReader
# Import required libraries for core functionality
import camelot
import pytesseract
from pdf2image import convert_from_path

def process_pdf_multistage(file_path):
    """
    Multi-stage pipeline for extracting text, tables, and figures from PDFs.
    Stages:
    1. PyPDF (text)
    2. Camelot/Tabula (tables)
    3. OCR (scanned pages/figures)
    """
    content = ""

    # Stage 1: PyPDF text extraction
    try:
        reader = PdfReader(file_path)
        for page in reader.pages:
            content += page.extract_text() or ""
    except Exception as e:
        print(f"PyPDF extraction failed: {e}")

    # Stage 2: Table extraction (Camelot)
    try:
        tables = camelot.read_pdf(file_path, pages='all')
        for table in tables:
            content += "\n[Table Extracted]\n" + table.df.to_string()
    except Exception:
        pass

    # Stage 3: OCR fallback for scanned pages or figures
    try:
        images = convert_from_path(file_path)
        for image in images:
            text = pytesseract.image_to_string(image)
            content += "\n[OCR Extracted]\n" + text
    except Exception:
        pass

    return content


In [None]:

# --- Hybrid Retrieval (BM25 + Embeddings) ---
# Original code used either BM25 OR embeddings; this combines both (Mermaid D node, Rubric Step 2).

from rank_bm25 import BM25Okapi
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

def hybrid_retrieval_setup(docs_text):
    """
    Creates BM25 and embedding indexes for hybrid search.
    """
    # BM25 Index
    tokenized_corpus = [doc.split(" ") for doc in docs_text]
    bm25 = BM25Okapi(tokenized_corpus)

    # Embedding Index
    embed_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vectorstore = Chroma.from_texts(docs_text, embed_model)

    return bm25, vectorstore


In [None]:

# --- Agentic Components (Research Analyst, Proposal Writer, Compliance Checker) ---
# Implements multi-agent workflow (Mermaid E subgraph, Rubric Step 3-4).

from langchain.agents import initialize_agent, Tool

def analyze_papers(query):
    return "Synthesis of relevant papers"

def check_compliance(proposal):
    return "Compliance report"

tools = [
    Tool(name="Research Analyst", func=analyze_papers, description="Synthesizes relevant papers."),
    Tool(name="Compliance Checker", func=check_compliance, description="Ensures NOFO alignment.")
]

# Initialize agent with zero-shot reasoning and tools
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)


In [None]:

# --- Agentic Components (Research Analyst, Proposal Writer, Compliance Checker) ---
# Implements multi-agent workflow (Mermaid E subgraph, Rubric Step 3-4).

from langchain.agents import initialize_agent, Tool

def analyze_papers(query):
    return "Synthesis of relevant papers"

def check_compliance(proposal):
    return "Compliance report"

tools = [
    Tool(name="Research Analyst", func=analyze_papers, description="Synthesizes relevant papers."),
    Tool(name="Compliance Checker", func=check_compliance, description="Ensures NOFO alignment.")
]

# Initialize agent with zero-shot reasoning and tools
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)


In [None]:

# --- Multi-Criteria Evaluation with Guardrails ---
# Original evaluation only scored NIH criteria; now adds guardrail flags (Mermaid G node, Rubric Step 5).

evaluation_prompt = f"""
Evaluate the proposal on:
1. Innovation
2. Significance
3. Approach
4. Investigator Expertise

Return JSON:
{{
  "criteria": [
    {{
      "name": "Innovation",
      "score": 1-5,
      "strengths": "...",
      "weaknesses": "...",
      "recommendations": "..."
    }},
    ...
  ],
  "overall_score": 1-5,
  "guardrail_flags": ["hallucination risk", "compliance gap"]
}}
"""


In [None]:

# --- Caching Intermediate Steps ---
# Saves embeddings, filtered papers, and draft proposals for reuse (Mermaid J node, Rubric Step 7).

# Import required libraries for core functionality
import pickle

def save_checkpoint(data, name):
    with open(f"checkpoint_{name}.pkl", "wb") as f:
        pickle.dump(data, f)

def load_checkpoint(name):
    try:
        with open(f"checkpoint_{name}.pkl", "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


> **Note:** The following section explains core functionality and workflow.


# Quick Reference: Few-Shot + Agentic Enhancements

This section provides details about the few-shot pool, semantic versioning, and agentic conflict resolver integrated into this workflow.

---

## Key Features

**Semantic Versioning**
- Automatically increments version numbers (`v2-fewshot`, `v3-agentic`) based on features used.
- Few-shot only → `-fewshot`
- Few-shot + agentic resolver → `-agentic`

**Few-Shot Pool**
- Derived from cleaned log (`prompt_evaluation_log_cleaned.json`).
- Filters examples with ≥80% hybrid confidence.
- Balances relevant/irrelevant examples 50/50 and ensures diversity.

**Agentic Conflict Resolver**
- Activates when model vs. rule confidence differs by >20%.
- Produces reconciled decision and rationale logged under `agentic_resolution`.

**Enhanced Logging Fields**
- `decision_source`: hybrid (model + rule)
- `hybrid_confidence`: average of model and rule confidence
- `agentic_resolution`: reconciliation result (if applicable)
- `prompt_version`: auto-generated semantic version


In [None]:

# ------------------------------------------------------------
# VERSION TRACKING + FEW-SHOT REBUILDER + AGENTIC RESOLVER
# ------------------------------------------------------------

# Function: Determine the next semantic version string for the prompt
def get_next_prompt_version(log_path, agentic_enabled=False):
    """
    Determine next semantic version based on last logged version.
    Increments number, adds suffix based on features used.
    """
# Import required libraries for core functionality
    import os, json, re
    version_num = 1
    if os.path.exists(log_path):
        with open(log_path, "r", encoding="utf-8") as f:
            try:
                data = json.load(f)
            except json.JSONDecodeError:
                data = []
        # Extract last version number
        for entry in reversed(data):
            if "prompt_version" in entry:
                match = re.match(r"v(\d+)", entry["prompt_version"])
                if match:
                    version_num = int(match.group(1)) + 1
                break

    suffix = "-agentic" if agentic_enabled else "-fewshot"
    return f"v{version_num}{suffix}"


# Function: Build balanced high-confidence few-shot example pool from the log
def rebuild_few_shot_pool(cleaned_log_path, min_conf=80, max_examples=4):
    """
    Build balanced high-confidence few-shot pool from cleaned log.
    Balances relevant and irrelevant, ensures diversity.
    """
# Import required libraries for core functionality
    import json, random
    with open(cleaned_log_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    relevant, irrelevant = [], []
    for iteration in data:
        for doc in iteration.get("relevant_documents", []):
            hybrid_conf = max(doc.get("model_confidence", 0), doc.get("rule_confidence", 0))
            if hybrid_conf >= min_conf:
                relevant.append((doc["title"], doc["reasoning"]))
        for doc in iteration.get("irrelevant_documents", []):
            irrelevant.append((doc, "PAPER NOT RELATED TO TOPIC"))

    # Shuffle and balance
    half = max_examples // 2
    random.shuffle(relevant)
    random.shuffle(irrelevant)
    return relevant[:half] + irrelevant[:half]


# Function: Resolve discrepancies between model and rule confidences using agentic logic
def agentic_conflict_resolver(doc_title, reasoning_json, model_conf, rule_conf):
    """
    Agentic layer to reconcile conflicts:
    - Triggered when discrepancy exceeds threshold
    - Returns reconciled decision and rationale
    """
    rationale = []
    if abs(model_conf - rule_conf) > 20:
        if rule_conf > model_conf:
            final_decision = "RELEVANT" if rule_conf >= 50 else "PAPER NOT RELATED TO TOPIC"
            rationale.append("Rule confidence higher; prioritizing deterministic criteria.")
        else:
            final_decision = "RELEVANT" if model_conf >= 50 else "PAPER NOT RELATED TO TOPIC"
            rationale.append("Model confidence higher; prioritizing LLM interpretation.")
    else:
        final_decision = "RELEVANT" if (model_conf + rule_conf) / 2 >= 50 else "PAPER NOT RELATED TO TOPIC"
        rationale.append("Confidences close; hybrid average used for decision.")

    return {
        "final_decision": final_decision,
        "rationale": " ".join(rationale)
    }


In [None]:

# ------------------------------------------------------------
# ENHANCED LOGGING WITH SEMANTIC VERSIONING AND AGENTIC RESOLUTION
# ------------------------------------------------------------

# Ensure this cell is run AFTER document processing and building relevant_docs_with_reasoning

# Define constants for few-shot
# Define configuration for few-shot example retrieval (number of examples)
FEW_SHOT_MAX_EXAMPLES = 4
# Minimum confidence threshold for including examples in few-shot prompting
MIN_CONFIDENCE_FOR_FEWSHOT = 70
# Path to the cleaned JSON log file where prompt evaluation iterations are stored
LOG_PATH = "prompt_evaluation_log_cleaned.json"

# Determine prompt version
# Path to the cleaned JSON log file where prompt evaluation iterations are stored
current_version = get_next_prompt_version(LOG_PATH, agentic_enabled=any(doc.get('flagged_for_review', False) for doc in relevant_docs_with_reasoning))

# Add decision source and hybrid confidence
for doc in relevant_docs_with_reasoning:
    doc["decision_source"] = "hybrid"
    doc["hybrid_confidence"] = (doc["model_confidence"] + doc["rule_confidence"]) / 2

# Add agentic resolution for flagged docs
for doc in relevant_docs_with_reasoning:
    if doc.get("flagged_for_review"):
        resolution = agentic_conflict_resolver(
            doc_title=doc["title"],
            reasoning_json=doc["reasoning"],
            model_conf=doc["model_confidence"],
            rule_conf=doc["rule_confidence"]
        )
        doc["agentic_resolution"] = resolution

# Append prompt_version to log
# Path to the cleaned JSON log file where prompt evaluation iterations are stored
with open(LOG_PATH, "r+", encoding="utf-8") as f:
    data = json.load(f)
    if data:
        data[-1]["prompt_version"] = current_version
    f.seek(0)
    json.dump(data, f, indent=2, ensure_ascii=False)
    f.truncate()

print(f"Logged with prompt version: {current_version}")



# Additional enhancements proposed by Claude

Your flowchart shows a well-structured approach to the RFP response generation system. Here are several improvements I'd recommend to enhance the robustness and effectiveness of your solution:

1. Enhanced RFP Requirements Extraction
After step B, add a sub-process for:

Requirement Categorization: Classify requirements into mandatory vs. optional, technical vs. administrative
Scoring Rubric Extraction: Specifically parse how proposals will be evaluated
Budget Constraints Analysis: Extract funding limits and cost-effectiveness criteria
Timeline Extraction: Identify key dates and milestone requirements

2. Improved Paper Processing Pipeline
Between steps C and D, consider adding:

Citation Network Analysis: Map relationships between papers to identify influential work
Method/Innovation Extraction: Specifically extract methodologies and novel approaches
Results/Outcomes Extraction: Capture quantitative results and impact metrics
Quality Assessment: Add a paper quality scoring mechanism (impact factor, recency, relevance)

3. Enhanced Retrieval and Ranking
Expand step D with:

Multi-Query Generation: Generate multiple search queries from different RFP aspects
Cross-Reference Validation: Verify that selected papers actually support proposed innovations
Diversity Scoring: Ensure selected papers cover different aspects of the RFP
Gap Analysis: Identify what the RFP asks for that isn't well-covered in existing research

4. Strengthened Agentic Architecture
Add these specialized agents to your existing three:

Innovation Synthesizer Agent: Combines findings from multiple papers into novel approaches
Budget Estimator Agent: Ensures proposals are financially realistic
Risk Assessment Agent: Identifies potential implementation challenges
Competitive Analysis Agent: Positions your proposal against likely competitors

5. Improved Evaluation and Refinement
Enhance the evaluation loop (G-I) with:

Specific Weakness Detection: Not just overall score, but identify specific weak sections
Competitive Benchmarking: Compare against successful past proposals if available
Consistency Checking: Ensure all sections align and support each other
Technical Feasibility Validation: Verify proposed solutions are implementable

6. Additional Process Improvements
Consider these architectural enhancements:
flowchart LR
    subgraph "Knowledge Management"
        KB1[Domain Ontology]
        KB2[Success Patterns DB]
        KB3[Common Pitfalls DB]
    end
    
    subgraph "Feedback Loops"
        FL1[Real-time Agent Collaboration]
        FL2[Iterative Improvement Tracking]
        FL3[Version Control System]
    end
7. Quality Assurance Additions

Plagiarism Detection: Ensure generated content is original
Fact Verification: Cross-check claims against source papers
Readability Analysis: Ensure proposal meets target audience expectations
Compliance Validation: Automated checks against all RFP requirements

8. Output Enhancement
For the final deliverables (step N), consider generating:

Executive Summary: One-page overview for quick review
Technical Appendix: Detailed methodology descriptions
Budget Justification: Line-by-line cost explanations
Risk Mitigation Plan: Addressing identified challenges
Evaluation Metrics: How success will be measured

9. Monitoring and Logging
Add throughout the pipeline:

Decision Logging: Track why papers were selected/rejected
Agent Reasoning Traces: Understand how proposals were generated
Performance Metrics: Time taken, resources used, quality scores
Error Handling: Graceful degradation if components fail

10. Advanced Features
Consider these stretch goals:

Multi-RFP Learning: Learn from multiple RFPs to improve over time
Collaborative Filtering: If multiple users, learn from collective behavior
Adaptive Prompting: Adjust prompts based on intermediate results
Uncertainty Quantification: Flag areas where the system is less confident
