
# Notebook Map: Relevance Evaluation Pipeline with Few-Shot + Agentic Enhancements

This table of contents provides a structured overview of the notebook, describing each section's purpose and how it fits into the workflow.

---

## 1. Quick Reference
- Overview of semantic versioning, few-shot prompting, and agentic conflict resolver.

## 2. Imports and Configuration
- Load required libraries and define configuration constants (e.g., few-shot parameters, log paths).

## 3. Core Utility Functions
- `verify_decision`: Ensures model decisions are consistent.
- `calculate_rule_confidence`: Computes rule-based confidence from criteria.
- `get_next_prompt_version`: Auto-increments semantic prompt version.
- `rebuild_few_shot_pool`: Builds balanced few-shot example set from log.
- `agentic_conflict_resolver`: Resolves discrepancies between model and rule evaluations.
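
As a rough illustration of the auto-increment idea, the sketch below bumps the patch component of a `MAJOR.MINOR.PATCH` string and finds the highest version already in a log. The helper names `bump_prompt_version` and `latest_version`, the patch-only policy, and the `prompt_version` record layout are assumptions for illustration; the notebook's own `get_next_prompt_version` may behave differently.

```python
def bump_prompt_version(current: str) -> str:
    """Return the next patch version for a 'MAJOR.MINOR.PATCH' string (hypothetical helper)."""
    major, minor, patch = (int(part) for part in current.lstrip("v").split("."))
    return f"{major}.{minor}.{patch + 1}"


def latest_version(log_entries: list[dict], default: str = "1.0.0") -> str:
    """Pick the highest prompt_version found in the log, or a default if none is recorded."""
    versions = [e["prompt_version"] for e in log_entries if e.get("prompt_version")]
    if not versions:
        return default
    # Compare numerically so that 1.0.10 sorts after 1.0.9
    return max(versions, key=lambda v: tuple(int(p) for p in v.lstrip("v").split(".")))
```

Comparing versions as integer tuples rather than strings avoids the common mistake of ranking `1.0.9` above `1.0.10`.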

## 4. Data Preparation
- Load PDF research papers from `data/raw`.
- Truncate text to fit LLM context window.

## 5. Few-Shot Prompt Building
- Retrieve high-confidence examples from log.
- Prepend examples to base relevance prompt.
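
One way to picture this step is sketched below: filter log entries by a confidence cutoff, balance relevant and irrelevant examples where possible, and prepend them to the base prompt. The function names and the record fields `text`, `decision`, and `hybrid_confidence` are assumptions made for this sketch, not the notebook's actual log schema.

```python
def select_few_shot(log_entries, max_examples=4, min_confidence=70):
    """Pick up to max_examples high-confidence entries, balanced across decisions where possible."""
    eligible = [e for e in log_entries if e.get("hybrid_confidence", 0) >= min_confidence]
    eligible.sort(key=lambda e: e["hybrid_confidence"], reverse=True)

    relevant = [e for e in eligible if e["decision"] == "relevant"]
    irrelevant = [e for e in eligible if e["decision"] != "relevant"]

    half = max_examples // 2
    picked = relevant[:half] + irrelevant[:half]

    # Fill any remaining slots from whichever class still has examples
    leftovers = relevant[half:] + irrelevant[half:]
    leftovers.sort(key=lambda e: e["hybrid_confidence"], reverse=True)
    picked += leftovers[: max_examples - len(picked)]
    return picked


def build_prompt(base_prompt, examples):
    """Prepend formatted examples to the base relevance prompt."""
    blocks = [
        f"Example {i}:\nText: {e['text']}\nDecision: {e['decision']}"
        for i, e in enumerate(examples, start=1)
    ]
    return "\n\n".join(blocks + [base_prompt])
```

The defaults mirror the role of the few-shot configuration constants (maximum examples and minimum confidence), so the cutoff can be tuned in one place.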

## 6. Main Evaluation Loop
- Iterate through PDFs.
- Evaluate relevance using LLM.
- Apply rule-based scoring and hybrid confidence calculation.
- Flag documents for review when model vs. rule confidence diverges.
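
The hybrid scoring and review flag can be sketched as a weighted blend plus a gap check. The weight, the 25-point gap, and both function names are assumptions for illustration only; the notebook's actual formula is not shown in this section.

```python
def hybrid_confidence(model_conf, rule_conf, model_weight=0.6):
    """Weighted blend of model and rule confidence, both on a 0-100 scale."""
    return model_weight * model_conf + (1 - model_weight) * rule_conf


def needs_review(model_conf, rule_conf, max_gap=25):
    """Flag a document when the two confidence sources disagree by more than max_gap points."""
    return abs(model_conf - rule_conf) > max_gap
```

Keeping the flag separate from the blended score means a document can have a moderate hybrid confidence and still be routed to review when its two sources disagree sharply.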

## 7. Logging and Versioning
- Append results to `prompt_evaluation_log.json`.
- Add `prompt_version`, `decision_source`, and `agentic_resolution` where applicable.
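
A minimal sketch of the append step follows, assuming the log file holds a single JSON array of records. The function name `append_log_entry` is hypothetical.

```python
import json
import os


def append_log_entry(log_path, entry):
    """Append one evaluation record to a JSON array file, creating the file if needed."""
    records = []
    if os.path.exists(log_path):
        with open(log_path, "r", encoding="utf-8") as f:
            records = json.load(f)

    records.append(entry)

    # Rewrite the whole array; fine for small logs, costly for very large ones
    with open(log_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    return len(records)
```

Rewriting the full array keeps the file valid JSON after every call. For large logs, a line-delimited format would avoid re-serializing earlier records.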

## 8. Visualization
- Display confidence distribution, relevance drift, and flagged discrepancy trends.

## 9. Enhancements (Appended)
- Additional functions and logging improvements appended at the end for optional use.

---


> **Note:** The following section explains core functionality and workflow.

<font size=10>**End-Term / Final Project**</font>

<font size=6>**AI for Research Proposal Automation**</font>

### **Business Problem - Create an AI system that helps you write a research proposal aligned with the NOFO document**
   



Meet Dr. Ian McCulloh, a seasoned research advisor and a leading voice in interdisciplinary science. Over the years, his lab has explored everything from AI for counterterrorism to social network analysis in neuroscience. His publication portfolio is vast, rich, and... chaotic.

When the National Institute of Mental Health released a new NOFO (Notice of Funding Opportunity) seeking innovative digital health solutions for mental health equity, Dr. Ian saw an opportunity. But there was a problem: despite his extensive work, none of his existing research was directly aligned with digital mental health interventions. And with NIH deadlines looming, manually identifying relevant angles and generating a competitive proposal would be a massive lift.

Dr. Ian wished for a smart assistant—one that could digest his past work, interpret the NOFO’s intent, spark new research directions, and even help draft proposal sections.

**The Challenge:**

Organizations and researchers often maintain large archives of publications and prior work. When responding to competitive grants—especially highly specific ones like NIH NOFOs—it becomes extremely difficult and time-consuming to:

1. Align past work with a new funding call.
2. Extract relevant expertise from unrelated projects.
3. Ideate novel, fundable research proposals tailored to complex criteria.
4. Generate high-quality text for grant submission that satisfies technical and scientific review criteria.

The manual effort to sift through dense research documents, match them to nuanced funding criteria, and write compelling, compliant proposals is labor-intensive, inconsistent, and prone to missed opportunities.


### **The Case Study Approach**

**Objective**
1. Develop a generative AI-powered system using LLMs to automate and optimize the creation of NIH research proposals.
2. The tool will identify relevant prior research, generate aligned project ideas, and draft high-quality proposal content tailored to specific NOFO requirements.

**Given workflow:**

```mermaid
flowchart TD
    A[Read NOFO Document] --> B[Analyze Research Papers]
    B --> C[Filter Papers by Topic]
    C --> D[Generate Research Ideas]
    D --> E[Upload ideas to LLM]
    E --> F[Generate Proposal]
    F --> G[LLM Evaluation]
    G --> H{Meets criteria?}
    H -- NO --> F
    H -- YES --> I[Human Review]
    I --> J{Approved?}
    J -- NO --> F
    J -- YES --> K[Final Proposal]
```

**Enhanced workflow based on conversations with ChatGPT and Claude:**

```mermaid
flowchart TD
    A[Read NOFO Document] --> B[Extract Key Requirements & Evaluation Criteria]
    B --> C["Multi-Stage Paper Processing<br/>(PyPDF → OCR)"]
    C --> C1[Table Extraction]
    C --> C2["Figure Extraction (OCR + Captioning)"]
    C1 --> D
    C2 --> D
    D["Hybrid Indexing & Filtering<br/>(BM25 + Embeddings + Metadata)"]
    D --> E["Agentic Research Synthesis<br/>(Research Analyst + Proposal Writer + Compliance Checker)"]
    E --> F[Generate Proposal Blueprint + Draft]
    F --> G["Multi-Criteria Evaluation<br/>(RAG + LLM-as-Judge + Guardrails)"]
    G --> H{Score ≥ Threshold?}
    H -- NO --> I["Targeted Refinement Loop<br/>(Weakness-Specific Prompts)"]
    I --> F
    H -- YES --> J[Caching + Checkpointing of Results]
    J --> K[Human Review Interface]
    K --> L{Approved?}
    L -- NO --> M[Capture Feedback & Return to Refinement]
    M --> F
    L -- YES --> N[Final Proposal + Deliverables]
    
    subgraph "Agentic Components"
        E1[Research Analyst Agent]
        E2[Proposal Writer Agent]
        E3[Compliance Checker Agent]
        E1 <--> E2
        E2 <--> E3
        E3 <--> E1
    end
```
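
The evaluate-then-refine loop in the diagram can be sketched as below. The `evaluate` and `refine` callables stand in for LLM calls and are hypothetical, as are the default threshold and the round cap. The cap is an addition of this sketch, meant to keep a draft that never clears the threshold from looping forever.

```python
from typing import Callable


def refine_until_threshold(
    draft: str,
    evaluate: Callable[[str], tuple[float, list[str]]],
    refine: Callable[[str, list[str]], str],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> tuple[str, float, int]:
    """Loop draft -> evaluate -> refine until the score clears the threshold or rounds run out."""
    score, weaknesses = evaluate(draft)
    rounds = 0
    while score < threshold and rounds < max_rounds:
        # Rewrite the draft using the weaknesses reported by the evaluator
        draft = refine(draft, weaknesses)
        score, weaknesses = evaluate(draft)
        rounds += 1
    return draft, score, rounds
```

When the loop exits on `max_rounds` rather than on the score, the returned score is still below the threshold, which is a natural signal to hand the draft to human review with the outstanding weaknesses attached.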


## **Setup - [2 Marks]**
---
<font color=Red>**Note:**</font> *1 mark is awarded for the Embedding Model configuration and 1 mark for the LLM configuration.*

## Configuration and Setup

In [None]:
# Install required packages with progress and output displayed

# Encountered multiple conflicts between packages and with codespace. Ended up installing all packages via the .venv

# DISPLAY FINAL REQUIREMENTS.TXT for final file

In [2]:
# Import required libraries for core functionality
import os
import warnings
api_key = os.getenv("OPENAI_API_KEY")
base_url = os.getenv("OPENAI_BASE_URL")
warnings.filterwarnings('ignore')

In [3]:
# Define the LLM Model - Use `gpt-4o-mini` Model
from langchain_openai import ChatOpenAI
import os
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_BASE_URL")  # optional; only if using non-default
)

In [4]:
# ------------------------------------------------------------
# FEW-SHOT AND LOGGING CONFIG
# ------------------------------------------------------------
# These constants control how many examples are retrieved and the minimum confidence threshold.
# Modify here if you want more or fewer few-shot examples or to change the confidence cutoff.
FEW_SHOT_MAX_EXAMPLES = 4         # Total examples (balanced between relevant/irrelevant if possible)
# Minimum confidence threshold for including examples in few-shot prompting
MIN_CONFIDENCE_FOR_FEWSHOT = 70   # Minimum hybrid confidence (%) to consider for few-shot retrieval

# JSON log path
# Path to the cleaned JSON log file where prompt evaluation iterations are stored
LOG_PATH = "prompt_evaluation_log_cleaned.json"

# PDF Pre-Processing

In [5]:
# PDF Cleaning Step: Remove non-visual annotations (comments, links, form fields)
# Keeps images, diagrams, and visible callouts intact

# Import required libraries for core functionality
import os
import fitz  # PyMuPDF
import json  # <-- NEW: Required to write annotation logs

# Create a global dictionary to store removed annotations
annotation_log = {}  # <-- NEW: Accumulates logs of all removed annotations

# Initial standardization step to remove annotations for parsing
def clean_pdf_annotations(input_path, output_path):
    """
    Strips non-visual annotations (comments, form fields, links) from a PDF
    while preserving visible images and diagrams.
    Also logs removed annotations to a global dictionary.
    """
    doc = fitz.open(input_path)
    removed_annots = []  # <-- NEW: Stores removed annotations for this PDF

    for page in doc:
        # Iterate over all annotations (not images)
        annot = page.first_annot
        while annot:
            next_annot = annot.next  # store reference to next annotation
            
            # Try to extract meaningful annotation content
            try:
                annot_info = annot.info  # Dictionary of annotation metadata
                content = (annot_info.get("content") or "").strip()
                # The subtype is exposed via annot.type, e.g. (8, 'Highlight');
                # annot.info has no "subtype" key
                subtype = annot.type[1]
                if content:
                    removed_annots.append(f"{subtype}: {content}")
                else:
                    removed_annots.append(f"{subtype}: [no content]")
            except Exception:
                # Fallback if annotation metadata is inaccessible
                removed_annots.append("Unknown annotation (could not extract content)")

            # Remove annotation object (e.g., highlights, comments).
            # Note: links and form fields are separate object types in PyMuPDF
            # and are not visited by this loop.
            page.delete_annot(annot)
            annot = next_annot

    # Save cleaned PDF
    doc.save(output_path, garbage=4, deflate=True)
    doc.close()

    # Add entry to annotation log using the input filename as key
    annotation_log[os.path.basename(input_path)] = removed_annots  # <-- NEW: Log entries keyed by file

# Clean NOFO file
input_pdf = "../data/NOFO.pdf"
cleaned_pdf = "../data/NOFO_cleaned.pdf"
clean_pdf_annotations(input_pdf, cleaned_pdf)
print(f"Cleaned PDF saved to: {cleaned_pdf}")

# Get de-annotated NOFO doc content using PyPDFLoader for evaluation step
from langchain_community.document_loaders import PyPDFLoader
pdf_file = "../data/NOFO_cleaned.pdf"
pdf_loader = PyPDFLoader(pdf_file)
NOFO_pdf = pdf_loader.load()

# Prepare output folder for de-annotated research papers
os.makedirs("data/raw", exist_ok=True)

# Set variables for de-annotating the research paper PDF collection 
source_dir = "../content"
output_dir = "data/raw"

# Loop through content folder, de-annotate each PDF, and save to a 'clean' output directory
for file_name in os.listdir(source_dir):
    if file_name.lower().endswith(".pdf"):
        input_pdf = os.path.join(source_dir, file_name)
        # splitext handles any extension casing (.pdf, .PDF) that passed the check above
        cleaned_pdf = os.path.join(output_dir, os.path.splitext(file_name)[0] + "_cleaned.pdf")
        print(f"Cleaning annotations for: {file_name}")
        clean_pdf_annotations(input_pdf, cleaned_pdf)
        print(f"Cleaned PDF saved to: {cleaned_pdf}")

# Write annotation log to disk after all PDFs are processed
log_path = "annotation_log.json"  # <-- NEW: File to store the annotation log
with open(log_path, "w", encoding="utf-8") as log_file:
    json.dump(annotation_log, log_file, indent=2, ensure_ascii=False)  # <-- NEW: Write log to file
print(f"Annotation removal log written to: {log_path}")  # <-- NEW: Confirm log creation

print("All research PDFs cleaned and saved in data/raw/")


Cleaned PDF saved to: ../data/NOFO_cleaned.pdf
Cleaning annotations for: cycon-final-draft.pdf
Cleaned PDF saved to: data/raw/cycon-final-draft_cleaned.pdf
Cleaning annotations for: Chat GPT Bias final w copyright.pdf
Cleaned PDF saved to: data/raw/Chat GPT Bias final w copyright_cleaned.pdf
Cleaning annotations for: Genetic_Algorithms_for_Prompt_Optimization.pdf
Cleaned PDF saved to: data/raw/Genetic_Algorithms_for_Prompt_Optimization_cleaned.pdf
Cleaning annotations for: DIVERSE_LLM_Dataset___IEEE_Big_Data.pdf
Cleaned PDF saved to: data/raw/DIVERSE_LLM_Dataset___IEEE_Big_Data_cleaned.pdf
Cleaning annotations for: Hashtag_Revival.pdf
Cleaned PDF saved to: data/raw/Hashtag_Revival_cleaned.pdf
Cleaning annotations for: FBI_Recruit_Hire_Final.pdf
Cleaned PDF saved to: data/raw/FBI_Recruit_Hire_Final_cleaned.pdf
Cleaning annotations for: Benson_MA491_NLP.pdf
Cleaned PDF saved to: data/raw/Benson_MA491_NLP_cleaned.pdf
Cleaning annotations for: Extreme Cohesion Darknet 20190815.pdf
Cleaned 

# Extract and Chunk Text

In [2]:
# ---------------------------------------------------------------
# Function to split cleaned text into 3000-token chunks with overlap for RAG
# ---------------------------------------------------------------
# This function breaks long text into overlapping token-based chunks for use in
# Retrieval-Augmented Generation (RAG) pipelines. Overlapping chunks help
# preserve context continuity across boundaries, improving answer quality.

import tiktoken  # OpenAI tokenizer library for counting and managing tokens

# ---------------------------------------------------------------
# Load tokenizer for the target model
# ---------------------------------------------------------------
# `tiktoken` provides tokenization rules tailored to specific OpenAI models.
# Here we select the encoding used by gpt-4o-mini to ensure our token counting
# aligns with how the model actually interprets input.
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

# ---------------------------------------------------------------
# Define chunking function
# ---------------------------------------------------------------
# Inputs:
# - text: full string to be split into chunks
# - chunk_size: max number of tokens per chunk (default 3000)
# - overlap: number of tokens to repeat from the previous chunk (default 200)
# This overlap preserves some context from earlier chunks in each new chunk.

def chunk_text(text, chunk_size=3000, overlap=200):
    # Convert text into a list of token IDs using the tokenizer
    tokens = encoding.encode(text)

    # Initialize an empty list to store the final chunks
    chunks = []

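    # Step through the token list in increments of (chunk_size - overlap)
    # This ensures that each new chunk shares `overlap` tokens with the previous one
    step = chunk_size - overlap
    if step <= 0:
        # A non-positive step would make range() raise or never advance
        raise ValueError("overlap must be smaller than chunk_size")

    for i in range(0, len(tokens), step):
        # Slice the token list to get a window of `chunk_size` tokens
        chunk_tokens = tokens[i:i+chunk_size]

        # Decode the token slice back into text and add it to the list of chunks
        chunks.append(encoding.decode(chunk_tokens))

        # Stop once a window reaches the end; a further window would only repeat the tail
        if i + chunk_size >= len(tokens):
            break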

    # Return the full list of overlapping text chunks
    return chunks
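
To make the window arithmetic concrete, the sketch below computes only the `(start, end)` token indices of each window, so it runs without `tiktoken`. The helper name `window_bounds` is hypothetical, and it stops as soon as a window reaches the end of the input.

```python
def window_bounds(n_tokens, chunk_size=3000, overlap=200):
    """Return (start, end) index pairs for overlapping windows over n_tokens tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    step = chunk_size - overlap
    bounds = []
    for start in range(0, n_tokens, step):
        end = min(start + chunk_size, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break  # the last window already covers the tail
    return bounds


print(window_bounds(7000))
# [(0, 3000), (2800, 5800), (5600, 7000)]
```

Each window starts 2,800 tokens after the previous one, so consecutive windows share exactly 200 tokens.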


In [3]:
# ---------------------------------------------------------------
# Function to clean extracted text by:
# - removing headers/footers
# - removing noise
# - fixing multi-column layout issues
# ---------------------------------------------------------------
# This function is useful for preprocessing text extracted from PDFs
# (e.g., via OCR or PDF parsers), which often contain artifacts such as
# page numbers, repeating headers/footers, hyphenated line breaks,
# and broken column layouts.

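import re  # Regular expressions for pattern matching and substitution
from collections import Counter  # Counts repeated lines to detect headers/footers

def clean_extracted_text(text):
    """Remove noise (page numbers, headers, footers), merge hyphenated words,
    and flatten potential two-column layouts."""

    # ---------------------------------------------------------------
    # 1. Remove page numbering and common artifacts
    # ---------------------------------------------------------------
    # These patterns often appear in academic papers, reports, and government documents.
    # Removing them improves the quality of downstream embedding and summarization.

    text = re.sub(r'\bPage \d+\b', '', text, flags=re.IGNORECASE)  # Remove 'Page X'
    text = re.sub(r'\d+ of \d+', '', text, flags=re.IGNORECASE)    # Remove 'X of Y' style page counts

    # Rejoin words hyphenated across a line break ("inter-\nvention" -> "intervention")
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)

    # ---------------------------------------------------------------
    # 2. Identify and remove repeating headers/footers
    # ---------------------------------------------------------------
    # Strategy: count how many times each line occurs and drop short lines that
    # repeat often. ASSUMPTION: the thresholds (3+ repeats, under 80 characters)
    # are illustrative guesses and should be tuned for the corpus.
    lines = text.split('\n')
    line_counts = Counter(line.strip() for line in lines if line.strip())
    lines = [
        line for line in lines
        if not (line_counts[line.strip()] >= 3 and len(line.strip()) < 80)
    ]

    # ---------------------------------------------------------------
    # 3. Merge two-column text by pairing lines
    # ---------------------------------------------------------------
    # Rough heuristic: joins each pair of consecutive lines into one line.
    merged_lines = []
    for i in range(0, len(lines), 2):
        if i+1 < len(lines):
            merged_lines.append(lines[i] + " " + lines[i+1])
        else:
            merged_lines.append(lines[i])
    return "\n".join(merged_lines)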

In [41]:
# ---------------------------------------------------------------
# Function to extract, clean, and chunk research paper PDFs
# ---------------------------------------------------------------
# This function performs a full preprocessing pipeline for PDF documents,
# including text extraction (via PyPDF), cleaning (removing headers/footers, noise),
# and token-based chunking for use in downstream RAG pipelines.

from pypdf import PdfReader  # PyPDF is used for reading PDF documents and extracting text

# Additional imports for patching
from pdf2image import convert_from_path  # Convert PDF pages to images
import pytesseract  # OCR engine to extract text from images
import os
from datetime import datetime

def process_pdf_multistage(file_path):
    # Initialize an empty string to collect the full text from the PDF
    content = ""

    # Extract filename for metadata
    filename = os.path.basename(file_path)
    author = None
    creation_date = None
    num_pages = None

    try:
        # ---------------------------------------------------------------
        # 1. Attempt to load and parse the PDF
        # ---------------------------------------------------------------
        reader = PdfReader(file_path)  # Create a PdfReader object from the file path

        # Get basic document metadata (if available)
        meta = reader.metadata or {}
        author = meta.get('/Author', None)
        creation_date = meta.get('/CreationDate', None)
        num_pages = len(reader.pages)

        # Iterate through each page in the PDF
        for page in reader.pages:
            # Extract text from the page; if extraction returns None, use an empty string
            page_text = page.extract_text() or ""

            # Append the page's text, separated by a newline so pages do not run together
            content += page_text + "\n"

    except Exception as e:
        # If PDF reading or parsing raises, log the error and discard any partial text
        # so the OCR fallback below processes the whole document.
        print(f"PyPDF extraction failed: {e}")
        content = ""

    # ---------------------------------------------------------------
    # 2. Fall back to OCR when PyPDF failed or found no text
    # ---------------------------------------------------------------
    # Scanned PDFs typically yield empty text without raising, so check the content too.
    if not content.strip():
        print("Falling back to OCR...")

        try:
            # Convert PDF pages to images using pdf2image
            images = convert_from_path(file_path)
            ocr_text_list = []

            for img in images:
                # Run OCR on each image page using pytesseract
                page_text = pytesseract.image_to_string(img)
                ocr_text_list.append(page_text)

            # Combine all OCR'd page text into one document
            content = "\n".join(ocr_text_list)

        except Exception as ocr_error:
            print(f"OCR fallback also failed: {ocr_error}")
            return []

    # ---------------------------------------------------------------
    # 3. Clean the raw extracted text
    # ---------------------------------------------------------------
    # Use a dedicated cleaning function to:
    # - Remove headers, footers, and page numbers
    # - Merge hyphenated line breaks
    # - Flatten multi-column layouts
    # This improves the quality of embeddings and downstream retrieval.
    cleaned_text = clean_extracted_text(content)

    # ---------------------------------------------------------------
    # 4. Chunk the cleaned text into token-bounded segments
    # ---------------------------------------------------------------
    # Break the cleaned document into overlapping token chunks (e.g., 3000 tokens with 200-token overlap),
    # ensuring context continuity across chunks. This is critical for performance in RAG.
    chunks = chunk_text(cleaned_text)

    def safe_str(obj):
        """
        Convert a potentially non-serializable object (e.g., PyPDF's IndirectObject)
        into a JSON-compatible Python string or None.

        This is especially useful when working with metadata fields extracted from PDFs,
        where objects may be wrapped in non-primitive types (like PyPDF2.generic.IndirectObject),
        which the `json` module cannot serialize directly.

        Returns:
            - `str(obj)` if the object can be stringified without error
            - `None` if string conversion fails
        """
        if obj is None:
            # Keep missing values as None; str(None) would store the string "None"
            return None

        try:
            # Attempt to cast the object to a string (e.g., IndirectObject → str)
            # This is usually sufficient for basic metadata like author, title, date, etc.
            return str(obj)

        except Exception:
            # If casting to string fails (e.g., object is not readable or triggers an exception),
            # return None instead, making the output JSON-safe.
            return None

    # ---------------------------------------------------------------
    # 4.1 Attach file-level metadata to each chunk
    # ---------------------------------------------------------------
    # This metadata can help with filtering, attribution, and retrieval analysis.
    chunks_with_metadata = [
        {
            "text": chunk,
            "metadata": {
                "source_file": safe_str(filename),
                "author": safe_str(author),
                "creation_date": safe_str(creation_date),
                "num_pages": safe_str(num_pages),
                "chunk_index": i
            }
        }
        for i, chunk in enumerate(chunks)
    ]

    # ---------------------------------------------------------------
    # 5. Return the processed chunks
    # ---------------------------------------------------------------
    # The final output is a list of text chunks, ready for embedding, storage, or retrieval.
    return chunks_with_metadata


In [4]:
# Extract, clean, chunk, and store raw chunks for all research paper PDFs

# ---------------------------------------------------------------
# 1. Import necessary libraries
# ---------------------------------------------------------------
import os              # Used for file path manipulation and directory handling
import json            # Used to save the final result as a JSON file
from glob import glob  # Used to match all PDF files in a directory

# ---------------------------------------------------------------
# 2. Set input/output paths
# ---------------------------------------------------------------

# Folder containing raw research paper PDFs (to be processed)
pdf_folder = "data/raw"

# Output file to save cleaned + chunked results
output_json_path = "data/cleaned_chunked_papers.json"

# ---------------------------------------------------------------
# 3. Initialize storage for processed results
# ---------------------------------------------------------------

# This list will store the result for each paper.
# Each element is a dictionary with:
#   - 'id': PDF filename
#   - 'chunks': list of {"text", "metadata"} dicts, one per cleaned text chunk from that PDF
all_chunks = []

# ---------------------------------------------------------------
# 4. Loop through all PDF files in the target folder
# ---------------------------------------------------------------

# `glob` finds all .pdf files in the specified folder
for pdf_path in glob(os.path.join(pdf_folder, "*.pdf")):
    doc_name = os.path.basename(pdf_path)  # Extract just the filename (used as a unique ID)
    print(f"Processing: {pdf_path}")       # Log the file being processed
    
    try:
        # ---------------------------------------------------------------
        # Attempt to extract, clean, and chunk the PDF content
        # ---------------------------------------------------------------
        # `process_pdf_multistage()` is your custom pipeline that:
        #   1. Extracts text using PyPDF (and optionally OCR if needed)
        #   2. Cleans the text (removes noise, merges hyphenated lines, etc.)
        #   3. Chunks the cleaned text into token-bounded segments
        chunks = process_pdf_multistage(pdf_path)

        # ---------------------------------------------------------------
        # Append the processed result to the `all_chunks` list
        # ---------------------------------------------------------------
        # Each record contains the filename (as ID) and a list of chunks
        all_chunks.append({
            "id": doc_name,
            "chunks": chunks
        })

    except Exception as e:
        # ---------------------------------------------------------------
        # If anything goes wrong during processing, catch the error
        # ---------------------------------------------------------------
        print(f"Error processing {pdf_path}: {e}")  # Log the error for debugging

# ---------------------------------------------------------------
# 5. Save all processed results to a JSON file
# ---------------------------------------------------------------

# Write the list of all processed documents to a single JSON file
# - `indent=2` for human-readable formatting
# - `ensure_ascii=False` allows Unicode characters (like symbols or accents)
with open(output_json_path, "w", encoding="utf-8") as f:
    json.dump(all_chunks, f, indent=2, ensure_ascii=False)

# Final confirmation message
print(f"Saved cleaned + chunked text for {len(all_chunks)} PDFs to {output_json_path}")


Processing: data/raw/AAAI IAA CV_cleaned.pdf
Error processing data/raw/AAAI IAA CV_cleaned.pdf: name 'process_pdf_multistage' is not defined
Processing: data/raw/Sim of Decon_cleaned.pdf
Error processing data/raw/Sim of Decon_cleaned.pdf: name 'process_pdf_multistage' is not defined
Processing: data/raw/BotBuster___AAAI_cleaned.pdf
Error processing data/raw/BotBuster___AAAI_cleaned.pdf: name 'process_pdf_multistage' is not defined
Processing: data/raw/Political_Networks_Conference_cleaned.pdf
Error processing data/raw/Political_Networks_Conference_cleaned.pdf: name 'process_pdf_multistage' is not defined
Processing: data/raw/EmergencyResponseAI_cleaned.pdf
Error processing data/raw/EmergencyResponseAI_cleaned.pdf: name 'process_pdf_multistage' is not defined
Processing: data/raw/FSS-19_paper_137_cleaned.pdf
Error processing data/raw/FSS-19_paper_137_cleaned.pdf: name 'process_pdf_multistage' is not defined
Processing: data/raw/DIVERSE_LLM_Dataset___IEEE_Big_Data_cleaned.pdf
Error proce

## Pre-compute and Store Embeddings for RAG-enabled Tasks

In [1]:
# Check torch version
import torch
import transformers

print("Torch version:", torch.__version__)
print("Torch path:", torch.__file__)
print("Transformers version:", transformers.__version__)
print("Transformers path:", transformers.__file__)
print("uint64 exists:", hasattr(torch, "uint64"))

  from .autonotebook import tqdm as notebook_tqdm


ImportError: cannot import name 'dummy_essentia_and_librosa_and_pretty_midi_and_scipy_and_torch_objects' from 'transformers.utils' (/workspaces/genai_capstone/.venv/lib/python3.10/site-packages/transformers/utils/__init__.py)

In [5]:
# Import Chroma vectorstore and HuggingFace embedding wrapper from LangChain
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
import os

# ---------------------------------------------------------------
# 1. Initialize embedding model
# ---------------------------------------------------------------
# This wraps a HuggingFace model (MiniLM) so it can be used with LangChain.
# MiniLM is a lightweight transformer model that produces sentence embeddings.
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# ---------------------------------------------------------------
# 2. Set the persistent storage directory for Chroma
# ---------------------------------------------------------------
# This is where Chroma will store the vector index on disk.
# The directory is placed *outside the repo* to avoid accidentally committing large files to Git.
persist_dir = "/workspaces/chroma_storage/chroma_embeddings"

# Create the directory if it doesn’t exist (idempotent)
os.makedirs(persist_dir, exist_ok=True)

# ---------------------------------------------------------------
# 3. Prepare LangChain Document objects
# ---------------------------------------------------------------
# LangChain expects documents in a specific format: each one must be a Document object
# containing `page_content` (the raw text) and optional `metadata`.
# Here, we flatten the per-paper records and pair each text chunk with its paper ID.
from langchain_core.documents import Document

docs = [
    Document(
        page_content=chunk["text"],
        metadata={"paper_id": paper["id"], "chunk_index": chunk["metadata"]["chunk_index"]},
    )
    for paper in all_chunks
    for chunk in paper["chunks"]
]
# - `all_chunks`: list of {"id": filename, "chunks": [...]} records built in the chunking step
# - each chunk is a {"text": ..., "metadata": ...} dict; only non-null fields are passed on,
#   since Chroma metadata values should be str, int, float, or bool

# ---------------------------------------------------------------
# 4. Create and persist Chroma vectorstore
# ---------------------------------------------------------------
# This creates a Chroma vector index and embeds all documents using the MiniLM model.
# The index will be saved to `persist_dir`, allowing it to be reloaded later without recomputation.
vectorstore = Chroma.from_documents(
    documents=docs,                     # List of Document objects with chunk text + metadata
    embedding=embedding_model,          # Embedding function used to vectorize each document
    collection_name="research_chunks",  # Name of the collection (can be queried later)
    persist_directory=persist_dir,      # Filesystem path where vectors and metadata are stored
)

# ---------------------------------------------------------------
# 5. Confirm that embeddings were stored
# ---------------------------------------------------------------
# Show the files written by Chroma in the storage directory.
# This confirms that the persistent index exists and can be reloaded later.
print(os.listdir(persist_dir))

# Final confirmation message
print("Embeddings precomputed and stored in ChromaDB.")

RuntimeError: Failed to import transformers.models.bert.modeling_bert because of the following error (look up to see its traceback):
module 'torch' has no attribute 'uint64'

In [None]:
# ---------------------------------------------------------------
# Generic query to test retrieval from ChromaDB
# ---------------------------------------------------------------
# This block is used to perform a semantic search query over previously embedded and stored text chunks.
# It demonstrates how to reload a persisted Chroma vectorstore and retrieve the top-k most relevant documents
# based on a natural language query.

# ---------------------------------------------------------------
# Reload vectorstore (no need to re-embed)
# ---------------------------------------------------------------
# We reload the Chroma vectorstore from disk using the same collection name and persistence directory
# that were used during initial indexing. This allows the session to access precomputed embeddings and metadata
# without recomputing or reprocessing the original documents.

vectorstore = Chroma(
    collection_name="research_chunks",     # Must match the name used in .from_documents()
    embedding_function=embedding_model,    # Same embedding function as before (MiniLM)
    persist_directory=persist_dir          # Path where vector index and metadata were stored
)

# ---------------------------------------------------------------
# Set up the query topic (can be static or dynamic)
# ---------------------------------------------------------------
# This is the search string used to query the vectorstore. The embedding model will convert this query
# into a vector, and Chroma will use vector similarity to retrieve the most relevant chunks.
# In this example, the priority topic is a simple string, but in production it could be generated
# dynamically from a NOFO or user input.

priority_topic = "mental health"  # Example query topic, e.g., extracted from a NOFO

query = priority_topic  # Alias for clarity — supports swapping in a different query object

# ---------------------------------------------------------------
# Run similarity search
# ---------------------------------------------------------------
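# Chroma performs a nearest-neighbor search in vector space using the collection's distance metric
# (squared L2 by default; cosine only if the collection was created with "hnsw:space": "cosine").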
# The result is a list of the top-k most similar documents (k=5 in this case).
# Each result is a Document object containing both content and metadata.

results = vectorstore.similarity_search(query, k=5)  # Return top 5 most relevant chunks

# ---------------------------------------------------------------
# Display results
# ---------------------------------------------------------------
# For each result, we print:
# - The paper ID from metadata (to identify the source)
# - A 250-character snippet of the matched chunk (for preview)
# This is useful for debugging or validating whether the retrieval matches user expectations.

for r in results:
    print(f"{r.metadata['paper_id']}:\n{r.page_content[:250]}...\n")
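
Because each result is a chunk rather than a whole paper, it can help to roll chunk hits up to the paper level. The sketch below assumes distance-style scores (lower means closer), which is how `similarity_search_with_score` is documented for the LangChain Chroma wrapper; verify this against your installed version. The helper name `rank_papers_by_best_chunk` and the sample hits are hypothetical.

```python
from collections import defaultdict

def rank_papers_by_best_chunk(hits, top_k=3):
    """Group (paper_id, distance) chunk hits and rank papers by their closest chunk."""
    best = defaultdict(lambda: float("inf"))
    for paper_id, distance in hits:
        best[paper_id] = min(best[paper_id], distance)
    # Assumption: scores are distances, so smaller = closer and we sort ascending.
    return sorted(best.items(), key=lambda item: item[1])[:top_k]

# Hypothetical hits, as if unpacked from (doc.metadata["paper_id"], score) pairs
sample_hits = [("paper_A", 0.42), ("paper_B", 0.31), ("paper_A", 0.18), ("paper_C", 0.77)]
ranked = rank_papers_by_best_chunk(sample_hits, top_k=2)
print(ranked)
```

If the scores are similarities instead (higher means closer), the same grouping applies with `max` and a descending sort.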


## **Step 1: Topic Extraction - [3 Marks]**

> **Read the NOFO doc and identify the topic for which the funding is to be given.**
---
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt and 1 mark for the successful completion of this section, including debugging or modifying the code if necessary.*
   

**TASK:** Write an LLM prompt that extracts the topic for which funding is being provided from the NOFO document. Ask the LLM to respond with the topic name only and nothing else.

In [None]:
# Topic extraction prompt
topic_extraction_prompt = f"""
You are a research grant specialist with expertise in analyzing NIH funding announcements and extracting key research priorities.

Your task: Analyze this NOFO document from the National Institute of Mental Health (NIMH) to identify the PRIMARY funding topic.

The document may describe multiple research areas, objectives, and priorities. Extract the single overarching topic that encompasses the main focus of this funding opportunity.

Return ONLY the primary topic in 3-8 words. No explanations, descriptions, or additional text.

Document:
{NOFO_pdf[0].page_content}
"""

In [None]:
# Find the topic for which the funding is being given
topic_extraction = llm.invoke(topic_extraction_prompt)
topic = topic_extraction.content
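
Even when asked for the topic name only, a model may add a label, quotes, or a trailing explanation. The following is a minimal sketch of a cleanup step; `clean_topic` is a hypothetical helper (not part of the notebook), and the 8-word cap simply mirrors the "3-8 words" instruction in the prompt.

```python
import re

def clean_topic(raw, max_words=8):
    """Normalize a one-line topic returned by an LLM (hypothetical helper)."""
    stripped = raw.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    # Drop a leading label such as "Topic:" or "Primary topic:".
    first_line = re.sub(r"^(primary\s+)?topic\s*:\s*", "", first_line, flags=re.IGNORECASE)
    # Remove surrounding quotes, asterisks, spaces, and a trailing period.
    first_line = first_line.strip(" \"'*.")
    return " ".join(first_line.split()[:max_words])

print(clean_topic('Topic: "Digital Mental Health Interventions."\nExplanation: ...'))
```

In the notebook this could be applied as `topic = clean_topic(topic_extraction.content)`, assuming the response object exposes `.content` as it does above.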

## Few-shot Prompt Setup for Assessing Relevance

In [None]:
json_path = "prompt_evaluation_log_cleaned.json"

# ------------------------------------------------------------
# FEW-SHOT FALLBACK EXAMPLES
# ------------------------------------------------------------
FALLBACK_EXAMPLES = [
    (
        "Digital CBT for Adolescents",
        """{
  "criteria_results": {
    "domain_relevance": "YES - focuses on mental health digital interventions",
    "methodological_alignment": "YES - randomized controlled trial design",
    "theoretical_connection": "NO - lacks explicit framework reference",
    "practical_application": "YES - informs deployment in youth settings"
  },
  "decision": "RELEVANT",
  "confidence": "85",
  "summary": "This study evaluates a mobile CBT app for adolescents, showing significant reduction in anxiety and depression symptoms compared to control. It highlights engagement strategies relevant to NOFO objectives."
}"""
    ),
    (
        "Oncology Drug Delivery Review",
        """{
  "criteria_results": {
    "domain_relevance": "NO - focuses on oncology drug mechanisms",
    "methodological_alignment": "NO",
    "theoretical_connection": "NO",
    "practical_application": "NO"
  },
  "decision": "PAPER NOT RELATED TO TOPIC",
  "confidence": "0",
  "summary": null
}"""
    )
]


In [None]:
# ------------------------------------------------------------
# FEW-SHOT RETRIEVAL + PROMPT BUILDER
# ------------------------------------------------------------
def get_few_shot_examples(
    json_path,
    max_examples=FEW_SHOT_MAX_EXAMPLES,        # Maximum number of few-shot examples to return
    min_confidence=MIN_CONFIDENCE_FOR_FEWSHOT  # Minimum confidence for a logged example to qualify
):
    """
    Pulls balanced high-confidence examples from log or uses fallback if none found.
    """
    # Standard-library imports used only inside this helper
    import os, json, random

    examples = []

    # Attempt to pull from log
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            try:
                data = json.load(f)
            except json.JSONDecodeError:
                data = []

        relevant_examples, irrelevant_examples = [], []

        for iteration in data:
            for doc in iteration.get("relevant_documents", []):
                # Logged confidences may be null; treat missing or None values as 0
                hybrid_conf = max(doc.get("model_confidence") or 0, doc.get("rule_confidence") or 0)
                if hybrid_conf >= min_confidence:
                    relevant_examples.append((doc["title"], doc["reasoning"]))

            for doc in iteration.get("irrelevant_documents", []):
                irrelevant_examples.append((doc, "PAPER NOT RELATED TO TOPIC"))

        # Balance relevant and irrelevant (half and half)
        half = max_examples // 2
        random.shuffle(relevant_examples)
        random.shuffle(irrelevant_examples)
        selected_relevant = relevant_examples[:half]
        selected_irrelevant = irrelevant_examples[:half]
        examples = selected_relevant + selected_irrelevant

    # Fallback if no examples found
    if not examples:
        print("No high-confidence examples found. Using fallback seed examples.")
        examples = FALLBACK_EXAMPLES[:max_examples]

    return examples


def build_prompt_with_examples(topic, base_prompt, examples):
    """
    Prepend few-shot examples (from log or fallback) to the base prompt.
    """
    if not examples:
        return base_prompt

    examples_str = "\n\n".join(
        [f"Example ({title}):\n{reasoning}" for title, reasoning in examples]
    )

    return f"""
You are a research grant specialist evaluating research papers for relevance to NIH NOFO objectives: {topic}.

Below are examples of prior evaluations for context:
{examples_str}

Now evaluate the following paper using the same structure and logic:

{base_prompt}
"""


## **Step 2: Research Paper Relevance Assessment - [3 Marks]**
> **Analyze all the Research Papers and filter out the research papers based on the topic of NOFO**
---
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt and 1 mark for the successful completion of this section, including debugging or modifying the code if necessary.*


**TASK:** Write a prompt that analyzes the relevance of the provided research paper to the topic outlined in the NOFO (Notice of Funding Opportunity) document. Determine whether the research aligns with the goals, objectives, and funding criteria specified in the NOFO. Additionally, assess whether the research paper can be used to support or develop a viable project idea that fits within the scope of the funding opportunity.

<br>

**Note:** If the paper does **not** significantly relate to the topic (by domain, method, theory, or application), ask the LLM to return: **"PAPER NOT RELATED TO TOPIC"**


<br>

Ask the LLM to respond in the below specified structure:

```
### Output Format:
"summary": "<summary of the paper under 300 words, or return: PAPER NOT RELATED TO TOPIC>"

```
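
Whatever structure the prompt requests, the response still has to be parsed defensively, since models sometimes wrap JSON in extra prose. The sketch below assumes the model is asked for a JSON object with `criteria_results`, `decision`, `confidence`, and `summary` keys (the layout used by the relevance prompt in this notebook); `parse_evaluation` and `REQUIRED_KEYS` are hypothetical names, not part of the notebook.

```python
import json
import re

REQUIRED_KEYS = {"criteria_results", "decision", "confidence", "summary"}

def parse_evaluation(raw_text):
    """Return the evaluation dict, or None if no valid JSON object is found."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span when the model adds extra prose.
        match = re.search(r"\{.*\}", raw_text, re.DOTALL)
        if not match:
            return None
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return None
    # Confidence may arrive as "85" or "85%"; keep the first run of digits.
    digits = re.search(r"\d+", str(parsed["confidence"]))
    parsed["confidence"] = int(digits.group(0)) if digits else None
    return parsed

sample = (
    'Here is my evaluation:\n'
    '{"criteria_results": {"domain_relevance": "YES - on topic"}, '
    '"decision": "RELEVANT", "confidence": "85%", "summary": "Short summary."}'
)
result = parse_evaluation(sample)
print(result["decision"], result["confidence"])
```

Returning `None` for unparseable responses lets the caller decide whether to retry, skip, or treat the paper as irrelevant.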

In [None]:
relevance_prompt_a = f"""
You are a research grant specialist evaluating research papers for relevance to NIH NOFO objectives: {topic}.

Evaluate the paper step-by-step against these criteria:
1. Domain relevance (mental health, digital health, intervention effectiveness)
2. Methodological alignment (clinical trials, user engagement studies, technology development)
3. Theoretical connection (frameworks, evidence, insights for intervention design/implementation)
4. Practical application (supports development or testing of digital mental health interventions)

Instructions:
- For EACH criterion, respond YES or NO and justify briefly.
- A paper is RELEVANT if at least ONE criterion is YES.
- Assign a confidence score (0–100%) to the RELEVANT decision, based on how strongly the paper meets the criteria (higher = more confident relevance).
- If RELEVANT: provide a <300-word summary focused on digital mental health intervention insights.
- If NOT RELEVANT: return exactly "PAPER NOT RELATED TO TOPIC".

Output format (JSON):
{{
  "criteria_results": {{
    "domain_relevance": "YES/NO - justification",
    "methodological_alignment": "YES/NO - justification",
    "theoretical_connection": "YES/NO - justification",
    "practical_application": "YES/NO - justification"
  }},
  "decision": "RELEVANT" or "PAPER NOT RELATED TO TOPIC",
  "confidence": "<integer between 0 and 100>",
  "summary": "<summary text or null>"
}}

### Paper content:
"""

In [None]:
# ------------------------------------------------------------
# FEW-SHOT RETRIEVAL FUNCTION (needed for few_shot_examples)
# ------------------------------------------------------------
def get_few_shot_examples(
    json_path,
    max_examples=4,                 # total examples to include
    min_confidence=70               # minimum confidence threshold
):
    """
    Retrieve few-shot examples for prompt building:
    - Pulls from prior log entries if available
    - Falls back to hardcoded seed examples if log is empty

    Why this matters:
    - Few-shot examples improve LLM reasoning by showing "good answers"
    - Ensures consistency in relevance classification over multiple runs
    """

    import os, json, random

    examples = []

    # -------------------------------
    # 1. Attempt to load from log
    # -------------------------------
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            try:
                data = json.load(f)
            except json.JSONDecodeError:
                data = []

        relevant_examples, irrelevant_examples = [], []

        # Loop through prior iterations
        for iteration in data:
            # Pull relevant docs with sufficient confidence
            for doc in iteration.get("relevant_documents", []):
                # Logged confidences may be null; treat missing or None values as 0
                hybrid_conf = max(doc.get("model_confidence") or 0, doc.get("rule_confidence") or 0)
                if hybrid_conf >= min_confidence:
                    relevant_examples.append((doc["title"], doc["reasoning"]))

            # Irrelevant docs (fallback reasoning text)
            for doc in iteration.get("irrelevant_documents", []):
                irrelevant_examples.append((doc, "PAPER NOT RELATED TO TOPIC"))

        # Randomly select balanced examples
        half = max_examples // 2
        random.shuffle(relevant_examples)
        random.shuffle(irrelevant_examples)
        selected_relevant = relevant_examples[:half]
        selected_irrelevant = irrelevant_examples[:half]
        examples = selected_relevant + selected_irrelevant

    # -------------------------------
    # 2. Fallback if log is empty
    # -------------------------------
    if not examples:
        print("No high-confidence examples found. Using fallback seed examples.")
        examples = [
            (
                "Digital CBT for Adolescents",
                """{
  "criteria_results": {
    "domain_relevance": "YES - focuses on mental health digital interventions",
    "methodological_alignment": "YES - randomized controlled trial design",
    "theoretical_connection": "NO - lacks explicit framework reference",
    "practical_application": "YES - informs deployment in youth settings"
  },
  "decision": "RELEVANT",
  "confidence": "85",
  "summary": "This study evaluates a mobile CBT app for adolescents, showing significant reduction in anxiety and depression symptoms compared to control."
}"""
            ),
            (
                "Oncology Drug Delivery Review",
                """{
  "criteria_results": {
    "domain_relevance": "NO - focuses on oncology drug mechanisms",
    "methodological_alignment": "NO",
    "theoretical_connection": "NO",
    "practical_application": "NO"
  },
  "decision": "PAPER NOT RELATED TO TOPIC",
  "confidence": "0",
  "summary": null
}"""
            )
        ][:max_examples]

    return examples
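
The balanced selection step above relies on the global `random` state, so the chosen examples (and therefore the prompt) change from run to run. As a sketch of one way to make the selection reproducible, the hypothetical helper `pick_balanced` below uses a local, seeded generator; the pools shown are placeholder data.

```python
import random

def pick_balanced(relevant, irrelevant, max_examples=4, seed=None):
    """Pick up to max_examples items, at most half from each pool."""
    rng = random.Random(seed)  # local RNG: does not disturb the global random state
    half = max_examples // 2
    chosen_rel = rng.sample(relevant, min(half, len(relevant)))
    chosen_irr = rng.sample(irrelevant, min(half, len(irrelevant)))
    return chosen_rel + chosen_irr

# Placeholder pools in the same (title, reasoning) shape used by the notebook
relevant_pool = [("Paper A", "..."), ("Paper B", "..."), ("Paper C", "...")]
irrelevant_pool = [("Paper X", "PAPER NOT RELATED TO TOPIC")]

first = pick_balanced(relevant_pool, irrelevant_pool, max_examples=4, seed=42)
second = pick_balanced(relevant_pool, irrelevant_pool, max_examples=4, seed=42)
print(len(first), first == second)
```

With a fixed seed, two runs over the same log produce the same few-shot prefix, which makes prompt versions easier to compare.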


In [None]:
# ------------------------------------------------------------
# FUNCTION: build_prompt_with_examples
# ------------------------------------------------------------
def build_prompt_with_examples(topic, base_prompt, examples):
    """
    Build a few-shot prompt for relevance classification.

    How it works:
    - Prepends example evaluations (few-shot) before the actual task prompt
    - Provides LLM with context: "Here is how similar papers were judged"
    - Ensures consistent reasoning across multiple runs

    Args:
        topic (str): The NOFO topic (priority research area)
        base_prompt (str): The evaluation instructions prompt
        examples (list of tuples): Few-shot examples [(title, reasoning), ...]

    Returns:
        str: Full prompt with examples + evaluation instructions
    """

    # Format few-shot examples (each example = title + reasoning text)
    examples_str = "\n\n".join(
        [f"Example ({title}):\n{reasoning}" for title, reasoning in examples]
    )

    # Assemble final prompt
    prompt = f"""
You are a research grant specialist evaluating research papers for relevance to NIH NOFO objectives: {topic}.

Below are examples of prior evaluations for context:
{examples_str}

Now evaluate the following paper using the same structure and logic:

{base_prompt}
"""
    return prompt


In [None]:
# --- Few-shot setup ---
# Retrieve few-shot examples (previous evaluations) and build prompt prefix
few_shot_examples = get_few_shot_examples(LOG_PATH)
prompt_with_examples = build_prompt_with_examples(topic, relevance_prompt_a, few_shot_examples)

# Import required libraries
import os
import json
import random
import tiktoken
from datetime import datetime
import re
import matplotlib.pyplot as plt
import time  # NEW: for batch delay control
import numpy as np  # NEW: for percentile calculation

# ------------------------------------------------------------
# CONFIGURATION SECTION
# ------------------------------------------------------------
TEST_MODE = True                  # If True, process only subset of data (not used in Chroma flow)
DISCREPANCY_THRESHOLD = 20        # Threshold for flagging model vs rule confidence mismatch

# Toggle modes
FAST_MODE = True                  # True = embedding-prioritized retrieval (top-K); False = evaluate all papers
TOP_K_PAPERS = 50                 # Target number of papers to evaluate after pre-filtering

# NEW CONFIG: Overfetch factor for Phase 1
# ------------------------------------------------------------
# Retrieve MORE chunks than TOP_K_PAPERS to ensure enough unique papers after aggregation.
# Example: 50 papers * factor 3 = 150 chunks
CHUNK_OVERFETCH_FACTOR = 3

# NEW CONFIG: Chunk cap and long-paper handling
CHUNK_CAP_PER_PAPER = 10          # Cap top-N chunks per paper (by similarity) to reduce bias & token load
LONG_PAPER_THRESHOLD = 30         # If a paper has >30 chunks, summarize chunks before aggregation
TOKEN_LIMIT_BEFORE_SUMMARY = 100000  # If estimated tokens exceed this, auto-summarize

# Batch processing settings
BATCH_SIZE = 10                   # Process papers in groups of 10
BATCH_DELAY = 3                   # Delay between batches (seconds) to avoid rate-limit errors

# Token cost estimation (OpenAI GPT-4o-mini pricing as of Aug 2025)
INPUT_COST_PER_1K = 0.00015       # $ per 1K input tokens
OUTPUT_COST_PER_1K = 0.0006       # $ per 1K output tokens

# Initialize cost tracking variables
total_input_tokens = 0            # Cumulative input tokens sent to API
total_output_tokens = 0           # Approximate output tokens (assume 10% of input)
total_cost_usd = 0.0              # Running cost estimate

prior_classification = {
    "relevant": [],
    "irrelevant": [],
    "unknown": []
}

# ------------------------------------------------------------
# LOGGING FUNCTION (unchanged)
# ------------------------------------------------------------
def log_prompt_iteration(
    json_path,
    prompt,
    relevant_docs_with_reasoning,
    irrelevant_docs,
):
    """
    Appends classification results to master JSON log for auditability and trend tracking.
    """
    # Load the existing log once (a missing or corrupt file falls back to an empty list)
    data = []
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            try:
                data = json.load(f)
            except json.JSONDecodeError:
                data = []

    iteration_id = len(data) + 1
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    entry = {
        "iteration_id": iteration_id,
        "timestamp": timestamp,
        "prompt": prompt,
        "relevant_documents": relevant_docs_with_reasoning,
        "irrelevant_documents": irrelevant_docs
    }

    data.append(entry)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    print(f"Logged iteration {iteration_id} to {json_path}")

# ------------------------------------------------------------
# SELF-CHECK FUNCTION (unchanged)
# ------------------------------------------------------------
def verify_decision(llm, reasoning_output):
    """
    Performs secondary pass: verifies relevance decision from reasoning text.
    Only returns YES/NO (binary) to catch contradictory outputs.
    """
    verification_prompt = f"""
You are verifying the relevance decision based on the following evaluation:

{reasoning_output}

Only answer with 'YES' if the decision should be considered relevant, or 'NO' if not relevant.
    """
    verification_response = llm.invoke(verification_prompt)
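    # Match on the start of the answer so a stray "YES" later in the reply is not counted
    return verification_response.content.strip().strip("'\"").upper().startswith("YES")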

# ------------------------------------------------------------
# RULE-DERIVED CONFIDENCE FUNCTION (unchanged)
# ------------------------------------------------------------
def calculate_rule_confidence(criteria_results):
    """
    Calculates deterministic confidence (0-95%) based on count of YES answers
    in criteria evaluations (domain, methods, theory, application).
    """
    yes_count = sum(1 for v in criteria_results.values() if v.upper().startswith("YES"))
    if yes_count == 0:
        return 0
    elif yes_count == 1:
        return 50
    elif yes_count == 2:
        return 70
    elif yes_count == 3:
        return 85
    else:
        return 95

# ------------------------------------------------------------
# CHROMADB INTEGRATION
# ------------------------------------------------------------
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Load same embedding model used for precomputing
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
persist_dir = "./chroma_embeddings"

# Connect to persisted Chroma vectorstore (contains chunks + metadata)
vectorstore = Chroma(
    collection_name="research_chunks",
    embedding_function=embedding_model,
    persist_directory=persist_dir
)

# ============================================================
# PHASE 1: PAPER-LEVEL PRE-FILTER (OVERFETCH + RANKING)
# ============================================================

print(f"[PHASE 1] Overfetching chunks for paper-level scoring (factor={CHUNK_OVERFETCH_FACTOR})")

# Step 1: Retrieve top-N chunks by similarity (overfetch)
overfetch_k = TOP_K_PAPERS * CHUNK_OVERFETCH_FACTOR
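# NOTE: Chroma's similarity_search_with_score returns a distance (lower = closer), which would
# invert the max-score ranking below, so request normalized relevance scores (higher = closer).
retrieved_chunks = vectorstore.similarity_search_with_relevance_scores(topic, k=overfetch_k)  # returns (doc, score)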

# Step 2: Aggregate scores per paper_id
paper_scores = {}         # {paper_id: [scores]}
for doc, score in retrieved_chunks:
    pid = doc.metadata.get("paper_id", "Unknown_Paper")
    if pid not in paper_scores:
        paper_scores[pid] = []
    paper_scores[pid].append(score)

# Step 3: Rank papers by **max score** (primary) and print **average** for debugging
ranked_papers = sorted(
    paper_scores.items(),
    key=lambda x: max(x[1]),  # still using max for ranking
    reverse=True
)

# Step 4: Select top-K unique papers
top_paper_ids = [pid for pid, _ in ranked_papers[:TOP_K_PAPERS]]

# Debug: Print ranking summary
print(f"Selected top {len(top_paper_ids)} papers for Phase 2 evaluation:")
for pid in top_paper_ids:
    print(f" - {pid}: max={max(paper_scores[pid]):.2f}, avg={sum(paper_scores[pid])/len(paper_scores[pid]):.2f}, chunks={len(paper_scores[pid])}")

# ------------------------------------------------------------
# NEW: Percentile Statistics Printout for Debugging
# ------------------------------------------------------------
scores_max = [max(scores) for scores in paper_scores.values()]
if scores_max:
    percentiles = [25, 50, 75, 90]
    print("\n[PHASE 1] Score Percentile Summary (max chunk scores per paper):")
    for p in percentiles:
        val = np.percentile(scores_max, p)
        print(f"  {p}th percentile: {val:.2f}")
    print(f"  Min score: {min(scores_max):.2f}")
    print(f"  Max score: {max(scores_max):.2f}")
else:
    print("[PHASE 1] No scores available to compute percentiles.")

# ============================================================
# PHASE 2: FULL-CHUNK RETRIEVAL WITH CAP + EARLY SUMMARY
# ============================================================

print("[PHASE 2] Retrieving all chunks (bulk) and filtering for top papers...")

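# Retrieve all chunks at once (faster than per-paper search).
# Relevance scores (higher = closer) keep the descending sort below meaningful;
# similarity_search_with_score would return distances (lower = closer) instead.
all_chunks = vectorstore.similarity_search_with_relevance_scores(topic, k=9999)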

# Organize chunks by paper_id and sort by score (so we can cap top-N)
paper_chunk_data = {}  # {paper_id: [(score, text), ...]}

for doc, score in all_chunks:
    pid = doc.metadata.get("paper_id", "Unknown_Paper")
    if pid in top_paper_ids:
        paper_chunk_data.setdefault(pid, []).append((score, doc.page_content))

# Build aggregated papers with **chunk cap** and **early summary for long papers**
paper_chunks = {}
paper_chunk_counts = {}

for pid, chunks in paper_chunk_data.items():
    # Sort chunks by descending similarity score (keep top most relevant)
    sorted_chunks = sorted(chunks, key=lambda x: x[0], reverse=True)

    # Cap to top-N chunks (prevents extremely long papers from dominating)
    capped_chunks = sorted_chunks[:CHUNK_CAP_PER_PAPER]

    # Count total chunks for debug
    paper_chunk_counts[pid] = len(sorted_chunks)

    # If paper has > LONG_PAPER_THRESHOLD chunks → summarize early
    if len(sorted_chunks) > LONG_PAPER_THRESHOLD:
        print(f"[EARLY SUMMARY] Paper '{pid}' exceeds {LONG_PAPER_THRESHOLD} chunks → summarizing chunks before aggregation.")
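        # Summarize each chunk first, then combine summaries.
        # NOTE: summarize_text() is defined further down in this cell; on a fresh kernel,
        # run its definition before this loop or this call raises NameError.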
        chunk_summaries = []
        for _, chunk_text in capped_chunks:
            summary = summarize_text(chunk_text)  # reuse summarization helper
            chunk_summaries.append(summary)
        combined_summary = "\n".join(chunk_summaries)
        paper_chunks[pid] = combined_summary
    else:
        # Normal case: concatenate capped chunks directly
        combined_text = "\n".join([text for _, text in capped_chunks])
        paper_chunks[pid] = combined_text

# Final aggregated papers for downstream evaluation
aggregated_papers = list(paper_chunks.items())
print(f"[PHASE 2] Aggregated {len(aggregated_papers)} papers for full relevance evaluation.")

# Debug print: show capped chunk count (original vs capped)
print("[PHASE 2] Chunk count per selected paper (post-aggregation with cap):")
for pid, total_chunks in paper_chunk_counts.items():
    print(f" - {pid}: {min(total_chunks, CHUNK_CAP_PER_PAPER)} chunks used (original {total_chunks})")

# ------------------------------------------------------------
# NEW: Automatic Warning for Overrepresented Papers
# ------------------------------------------------------------
# This block calculates the percentage of total chunks contributed by each paper
# (AFTER capping) and issues a warning if any paper contributes more than X% of total chunks.
# This helps detect potential bias where one very long paper dominates context.

WARNING_THRESHOLD_PERCENT = 25  # e.g., warn if >25% of chunks come from a single paper

# Total chunks after capping (sum of min(total, cap))
total_chunks_used = sum(min(count, CHUNK_CAP_PER_PAPER) for count in paper_chunk_counts.values())

for pid, count in paper_chunk_counts.items():
    used_chunks = min(count, CHUNK_CAP_PER_PAPER)
    percent = (used_chunks / total_chunks_used) * 100 if total_chunks_used > 0 else 0
    if percent > WARNING_THRESHOLD_PERCENT:
        print(f"*** WARNING: Paper '{pid}' contributes {percent:.1f}% of total chunks "
              f"({used_chunks}/{total_chunks_used}) → may indicate overrepresentation. ***")

# ------------------------------------------------------------
# TOKENIZER SETUP
# ------------------------------------------------------------
encoding = tiktoken.encoding_for_model("gpt-4o-mini")
MAX_TOKENS = 120000  # Safety ceiling for prompt+context (gpt-4o-mini's context window is 128K tokens)

# ------------------------------------------------------------
# SUMMARIZATION HELPER
# ------------------------------------------------------------
def summarize_text(paper_text):
    """
    Summarizes full paper to ~300 words focusing on digital mental health interventions.
    Reduces token load while preserving conceptual content for classification.
    """
    summary_prompt = f"""
    Summarize the following research paper into ~300 words, focusing on
    digital mental health interventions, methods, and outcomes:

    {paper_text}
    """
    summary_response = llm.invoke(summary_prompt)
    return summary_response.content.strip()

# ------------------------------------------------------------
# MAIN LOOP: BATCH PROCESSING + COST TRACKING
# ------------------------------------------------------------
documents = []
irrelevant_docs_list = []
progress_cnt = 1
relevant_papers_count = 0
irrelevant_papers_count = 0
total_files = len(aggregated_papers)

for batch_start in range(0, total_files, BATCH_SIZE):
    batch = aggregated_papers[batch_start: batch_start + BATCH_SIZE]
    print(f"\nProcessing batch {batch_start//BATCH_SIZE + 1} "
          f"({len(batch)} papers) out of {total_files} total papers...")

    for paper_id, paper_text in batch:
        try:
            # --- Dynamic Token Budget Check ---
            # Estimate token count BEFORE building full prompt
            token_estimate = len(encoding.encode(paper_text))
            if token_estimate > TOKEN_LIMIT_BEFORE_SUMMARY:
                print(f"[TOKEN GUARD] Paper '{paper_id}' estimated {token_estimate} tokens → auto-summarizing.")

            # --- Summarize paper (every paper is summarized exactly once, regardless of size) ---
            summarized_text = summarize_text(paper_text)

            # --- Build relevance prompt ---
            available_tokens = MAX_TOKENS - len(encoding.encode(prompt_with_examples))
            truncated_text = encoding.decode(encoding.encode(summarized_text)[:available_tokens])
            full_prompt = prompt_with_examples + truncated_text

            # --- Token count + cost estimation ---
            token_count = len(encoding.encode(full_prompt))
            total_input_tokens += token_count
            total_output_tokens += int(token_count * 0.1)  # estimate 10% output
            total_cost_usd = (
                (total_input_tokens / 1000) * INPUT_COST_PER_1K +
                (total_output_tokens / 1000) * OUTPUT_COST_PER_1K
            )
            print(f"[Token Count] {paper_id}: {token_count} tokens "
                  f"(Estimated running cost: ${total_cost_usd:.4f})")

            # --- LLM relevance classification ---
            response = llm.invoke(full_prompt)
            print(f"Successfully processed paper {progress_cnt}/{total_files} ({paper_id})")
            progress_cnt += 1

            # --- Self-check verification ---
            is_relevant = verify_decision(llm, response.content)

            if not is_relevant or "PAPER NOT RELATED TO TOPIC" in response.content:
                irrelevant_papers_count += 1
                irrelevant_docs_list.append(paper_id)
                continue

            # --- Parse LLM JSON output ---
            try:
                parsed_json = json.loads(response.content)
            except json.JSONDecodeError:
                json_match = re.search(r"\{.*\}", response.content, re.DOTALL)
                parsed_json = json.loads(json_match.group(0)) if json_match else {}

            # --- Confidence scoring (model + rule) ---
            model_confidence = int(parsed_json.get("confidence", 0)) if parsed_json else None
            rule_confidence = calculate_rule_confidence(parsed_json["criteria_results"]) \
                if "criteria_results" in parsed_json else 0
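            # Compare against None explicitly so a legitimate model confidence of 0 is still checked
            discrepancy = abs(model_confidence - rule_confidence) if model_confidence is not None else None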
            flagged = discrepancy > DISCREPANCY_THRESHOLD if discrepancy is not None else False

            # --- Store result ---
            documents.append({
                'title': paper_id,
                'file_path': "(from ChromaDB)",
                'llm_reasoning': response.content,
                'model_confidence': model_confidence,
                'rule_confidence': rule_confidence,
                'confidence_discrepancy': discrepancy,
                'flagged_for_review': flagged
            })
            relevant_papers_count += 1

        except Exception as e:
            print(f"!!! Error processing {paper_id}: {str(e)}")

    # --- Delay between batches ---
    print(f"Batch {batch_start//BATCH_SIZE + 1} complete. Sleeping {BATCH_DELAY} seconds...")
    time.sleep(BATCH_DELAY)

# ------------------------------------------------------------
# SUMMARY OUTPUT
# ------------------------------------------------------------
print("=" * 50)
print(f"Relevant Papers: {relevant_papers_count}/{total_files}")
print(f"Irrelevant Papers: {irrelevant_papers_count}/{total_files}")
print(f"Estimated Total Input Tokens: {total_input_tokens}")
print(f"Estimated Total Output Tokens: {total_output_tokens}")
print(f"Estimated Total Cost: ${total_cost_usd:.4f}")
print("=" * 50)

print("\nList of relevant papers:")
for doc in documents:
    print(f"\nTitle: {doc['title']}")
    print(f"Model Confidence: {doc['model_confidence']}")
    print(f"Rule Confidence: {doc['rule_confidence']}")
    print(f"Discrepancy: {doc['confidence_discrepancy']} (Flagged: {doc['flagged_for_review']})")
    print(f"Reasoning (truncated): {doc['llm_reasoning'][:500]}...")

# ------------------------------------------------------------
# LOGGING
# ------------------------------------------------------------
relevant_docs_with_reasoning = [
    {
        "title": doc['title'],
        "reasoning": doc['llm_reasoning'],
        "model_confidence": doc['model_confidence'],
        "rule_confidence": doc['rule_confidence'],
        "confidence_discrepancy": doc['confidence_discrepancy'],
        "flagged_for_review": doc['flagged_for_review']
    }
    for doc in documents
]

log_prompt_iteration(
    json_path="prompt_evaluation_log_cleaned.json",
    prompt=prompt_with_examples,
    relevant_docs_with_reasoning=relevant_docs_with_reasoning,
    irrelevant_docs=irrelevant_docs_list,
)
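
To make the hybrid confidence check concrete, here is a small worked example. It is a sketch that reuses the YES-count mapping from `calculate_rule_confidence` and the default `DISCREPANCY_THRESHOLD` of 20; `flag_discrepancy` and `RULE_CONFIDENCE_BY_YES_COUNT` are hypothetical names introduced only for illustration.

```python
# Same mapping as calculate_rule_confidence: number of YES criteria -> confidence
RULE_CONFIDENCE_BY_YES_COUNT = {0: 0, 1: 50, 2: 70, 3: 85, 4: 95}

def flag_discrepancy(criteria_results, model_confidence, threshold=20):
    """Return (rule_confidence, discrepancy, flagged) for one evaluation."""
    yes_count = sum(
        1 for v in criteria_results.values()
        if str(v).strip().upper().startswith("YES")
    )
    rule_confidence = RULE_CONFIDENCE_BY_YES_COUNT[min(yes_count, 4)]
    if model_confidence is None:
        return rule_confidence, None, False
    discrepancy = abs(model_confidence - rule_confidence)
    return rule_confidence, discrepancy, discrepancy > threshold

criteria = {
    "domain_relevance": "YES - mental health focus",
    "methodological_alignment": "NO",
    "theoretical_connection": "NO",
    "practical_application": "YES - informs deployment",
}
# Two YES answers give a rule confidence of 70.
print(flag_discrepancy(criteria, model_confidence=95))  # |95 - 70| = 25 > 20, so flagged
print(flag_discrepancy(criteria, model_confidence=0))   # a confidence of 0 is still compared
```

A model that reports 95% confidence while meeting only two of four criteria is exactly the kind of case the review flag is meant to surface.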


## **Step 3: Proposal Ideation Based on Filtered Research - [4 marks]**
> **Use the filtered papers to generate ideas for the Research Proposal.**
---
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt, 1 mark for generating the ideas, and 1 mark for fetching the file path of the chosen idea, along with successful completion of this section, including debugging or modifying the code if necessary.*


**TASK:** Write a prompt that can be used to generate 5 ideas for the Research Proposal. Each idea should consist of:

1. **Idea X:** [Concise Title of the Project Idea]  \n
2. **Description:** [Brief and targeted description summarizing the objectives, innovative elements, scientific rationale, and anticipated impact.]  \n
3. **Citation:** [Author(s), Year or Paper Title]  \n
4. **NOFO Alignment:** [List two or more specific NOFO requirements that this idea directly addresses]  \n
5. **File Path of the Research Paper:** [Exact file path, ending in .pdf]

- Use the Delimiter `---` for defining the structure of the sample outputs in the prompt
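
The five-field structure and the `---` delimiter described above can be assembled programmatically and pasted into the prompt as the sample output format. The following is only a sketch: `IDEA_FIELDS` and `build_output_format` are hypothetical names that do not appear elsewhere in this notebook, and the placeholder wording is copied from the task list.

```python
# Hypothetical helper: renders the expected output layout for N ideas,
# with each idea wrapped in '---' delimiter lines.
IDEA_FIELDS = [
    ("Idea {n}", "[Concise Title of the Project Idea]"),
    ("Description", "[Brief and targeted description summarizing the objectives, "
                    "innovative elements, scientific rationale, and anticipated impact.]"),
    ("Citation", "[Author(s), Year or Paper Title]"),
    ("NOFO Alignment", "[List two or more specific NOFO requirements that this idea directly addresses]"),
    ("File Path of the Research Paper", "[Exact file path, ending in .pdf]"),
]


def build_output_format(num_ideas: int = 5) -> str:
    """Return a delimiter-separated template showing how each idea should be laid out."""
    blocks = []
    for n in range(1, num_ideas + 1):
        lines = [f"**{label.format(n=n)}:** {placeholder}" for label, placeholder in IDEA_FIELDS]
        blocks.append("\n".join(lines))
    # A leading and trailing delimiter means idea k sits at index k after split("---")
    return "---\n" + "\n---\n".join(blocks) + "\n---"


output_format = build_output_format()
print(output_format)
```

With a leading delimiter in the template, a response that follows it splits so that idea `k` sits at index `k`, which is what the later `ideas.content.split("---")[idea_number]` lookup assumes.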






#### Generating 5 Ideas

In [None]:
# Note to self: Be sure to add additional details from page linked in the NOFO pdf
# Also need to include constraints, e.g., "Digital health test beds that leverage well-established 
# digital health platforms to optimize evidence-based digital mental health interventions"

gen_idea_prompt = f"""


<WRITE YOUR PROMPT HERE>


"""

In [None]:
ideas = llm.invoke(gen_idea_prompt)

In [None]:
from IPython.display import Markdown, display
display(Markdown(ideas.content))

In [None]:
# For consideration if extracted text is not clean enough
# Add post-extraction GPT-enabled noise removal step
# to remove additional noise from chunks

# Too resource intensive for full data set. Add later if needed.

# import json
# from openai import OpenAI

# # Initialize OpenAI client
# client = OpenAI()

# def semantic_clean_text(raw_text):
#     prompt = f"""
# You are a document cleaner. Extract ONLY the main body text from the following academic or technical document:
# - Remove page numbers, headers/footers
# - Remove title page, author affiliations, figure/table captions
# - Remove references/bibliography sections
# - Keep abstracts, introductions, main sections, and conclusions

# Document:
# \"\"\"{raw_text}\"\"\"

# Return only the cleaned text.
# """
#     response = client.responses.create(
#         model="gpt-4o-mini",
#         input=prompt,
#         max_output_tokens=4000
#     )
#     return response.output_text

# # --- Ingest cleaned + chunked data and post-process with GPT ---
# input_path = "data/cleaned_chunked_papers.json"
# output_path = "data/cleaned_gpt.json"

# # Load chunked data
# with open(input_path, "r", encoding="utf-8") as f:
#     chunked_data = json.load(f)

# # Prepare list for GPT-processed results
# gpt_cleaned_data = []

# # Loop through each document
# for record in chunked_data:
#     doc_id = record["id"]
#     gpt_chunks = []

#     print(f"Post-processing (GPT cleanup): {doc_id}")

#     # Apply GPT cleaning to each chunk
#     for chunk in record["chunks"]:
#         cleaned_chunk = semantic_clean_text(chunk)
#         gpt_chunks.append(cleaned_chunk)

#     # Store result
#     gpt_cleaned_data.append({
#         "id": doc_id,
#         "chunks": gpt_chunks
#     })

# # Save GPT-cleaned data
# with open(output_path, "w", encoding="utf-8") as f:
#     json.dump(gpt_cleaned_data, f, indent=2, ensure_ascii=False)

# print(f"Saved GPT post-processed chunks to {output_path}")



#### Choosing 1 Idea and fetching details

In [None]:
# Modify the idea_number for choosing the different idea
idea_number = 5   # change the number if you wish to choose and generate the research proposal for another idea
chosen_idea = ideas.content.split("---")[idea_number]

In [None]:
# Import required libraries for core functionality
import re

# Use a regular expression to find the file path of the research paper.
# The path may be the last line of the idea, so accept either a newline or end of string.
pattern = r"File Path of the Research Paper:\*\*\s*(.+?)\s*(?:\n|$)"
# If this pattern does not extract the file path, give an LLM a sample of your generated
# ideas and ask it for a regex pattern that captures "File Path of the Research Paper".

match = re.search(pattern, chosen_idea)

if match:
    idea_generated_from_research_paper = match.group(1).strip()
    print("Filepath : ", idea_generated_from_research_paper)
else:
    print("File Path of the Research Paper not found in the chosen idea.")
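
Selecting an idea by position in `split("---")` depends on exactly where the model places its delimiters, so an off-by-one or an `IndexError` is easy to hit. A more defensive variant is sketched below. The helper names (`split_ideas`, `extract_file_path`) and the sample text and file paths are illustrative assumptions, not output from this notebook.

```python
import re
from typing import List, Optional

# Same label as the notebook's pattern; tolerates a path on the final line
FILE_PATH_PATTERN = re.compile(r"File Path of the Research Paper:\*\*\s*(.+?)\s*(?:\n|$)")


def split_ideas(raw_text: str) -> List[str]:
    """Split on lines that contain only '---' and drop empty chunks."""
    chunks = re.split(r"^\s*---\s*$", raw_text, flags=re.MULTILINE)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def extract_file_path(idea_text: str) -> Optional[str]:
    """Return the file path listed in one idea, or None if it is missing."""
    match = FILE_PATH_PATTERN.search(idea_text)
    return match.group(1) if match else None


# Hypothetical LLM output used only to exercise the helpers
sample_output = """---
**Idea 1:** Example title A
**File Path of the Research Paper:** data/raw/paper_a.pdf
---
**Idea 2:** Example title B
**File Path of the Research Paper:** data/raw/paper_b.pdf
---"""

parsed_ideas = split_ideas(sample_output)
selected_idea = parsed_ideas[1]  # zero-based index: the second idea
print(extract_file_path(selected_idea))
```

Because empty chunks are dropped, the list index is always zero-based regardless of whether the response begins or ends with a delimiter.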


## **Step 4: Proposal Blueprint Preparation - [3 Marks]**

> **Select appropriate research ideas for the proposal and supply 'Sample Research Proposals' as templates to the LLM to support the generation of the final proposal.**
---   
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt and 1 mark for the successful completion of this section, including debugging or modifying the code if necessary.*

**TASK:** Write a prompt that can be used to generate the Research Proposal.

The prompt should be able to craft a research proposal based on the sample research proposal template, using one of the ideas generated above. The proposal should include references to the actual research papers from which the ideas are derived and should align well with the NOFO documents.

In [None]:
# Here we need to add the full papers instead of the summary
# Load PDF files and extract content using PyPDFLoader
chosen_idea_rp = PyPDFLoader(idea_generated_from_research_paper, mode="single").load()

# Loading the sample research proposal template
# Load PDF files and extract content using PyPDFLoader
research_proposal_template = PyPDFLoader(" <Path of Research Proposal Template> ", mode="single").load()

In [None]:
import json
import os
from pypdf import PdfReader
import camelot
import pytesseract
from pdf2image import convert_from_path
import tiktoken

# --- Tokenization setup ---
encoding = tiktoken.encoding_for_model("gpt-4o-mini")
MAX_TOKENS = 127500          # total model context window
EXTRACTION_BUDGET = 100000   # reserve ~20% for prompts/response

def count_tokens(text):
    """Count tokens using tiktoken encoding."""
    return len(encoding.encode(text))

# --- Load matching papers from JSON log ---
def load_matched_papers(json_path, pdf_folder="content"):
    """
    Extract list of relevant document file paths from the latest JSON iteration.
    """
    with open(json_path, "r") as f:
        data = json.load(f)
    
    # Take the last iteration's relevant_documents
    last_iteration = data[-1]
    relevant_docs = last_iteration.get("relevant_documents", [])
    
    # Build file paths for each relevant doc (assumes they exist in pdf_folder)
    file_paths = []
    for doc in relevant_docs:
        title = doc["title"]
        pdf_path = os.path.join(pdf_folder, title)
        if os.path.exists(pdf_path):
            file_paths.append(pdf_path)
        else:
            print(f"Warning: {pdf_path} not found. Skipping.")
    return file_paths

# --- Stage 1 & 2: Text + Table extraction ---
def extract_text_and_tables(file_path, token_budget):
    """Extract text and tables within token budget."""
    content = ""
    token_count = 0

    # Stage 1: PyPDF text extraction
    try:
        reader = PdfReader(file_path)
        for page in reader.pages:
            page_text = page.extract_text() or ""
            page_tokens = count_tokens(page_text)
            # Stop before adding a page that would push the total over the budget
            if token_count + page_tokens > token_budget:
                print(f"Token budget reached during text extraction: {file_path}")
                break
            content += page_text
            token_count += page_tokens
    except Exception as e:
        print(f"PyPDF extraction failed: {e}")

    # Stage 2: Table extraction (Camelot)
    # try:
    #     tables = camelot.read_pdf(file_path, pages='all')
    #     for table in tables:
    #         table_text = "\n[Table Extracted]\n" + table.df.to_string()
    #         token_count += count_tokens(table_text)
    #         if token_count > token_budget:
    #             print(f"Token budget reached during table extraction: {file_path}")
    #             break
    #         content += table_text
    # except Exception:
    #     pass

    return content, token_count

# --- Stage 3: OCR extraction ---
# def extract_ocr(file_path, token_budget, current_tokens=0):
#     """Extract OCR text (figures/scanned pages) within remaining token budget."""
#     content = ""
#     token_count = current_tokens

#     try:
#         images = convert_from_path(file_path)
#         for image in images:
#             ocr_text = pytesseract.image_to_string(image)
#             token_count += count_tokens(ocr_text)
#             if token_count > token_budget:
#                 print(f"Token budget reached during OCR extraction: {file_path}")
#                 break
#             content += "\n[OCR Extracted]\n" + ocr_text
#     except Exception:
#         pass

#     return content

# --- Process all matched papers ---
def process_matched_papers(json_path, pdf_folder="content"):
    """
    Load matched papers from JSON and extract their content within the token budget.
    Pass 1: Text (plus tables when the Camelot stage is enabled)
    Pass 2: OCR (figures) is currently disabled; see the commented-out block below.
    Returns dict mapping filename -> extracted content.
    """
    matched_files = load_matched_papers(json_path, pdf_folder)
    text_table_data = {}
    token_usage = {}

    for file_path in matched_files:
        print(f"Extracting text/tables: {os.path.basename(file_path)}")
        content, tokens_used = extract_text_and_tables(file_path, EXTRACTION_BUDGET)
        text_table_data[os.path.basename(file_path)] = content
        token_usage[os.path.basename(file_path)] = tokens_used

    # Return text_table_data directly
    return text_table_data

    # Pass 2: Extract OCR for all files (if budget allows)
    # for file_path in matched_files:
    #     filename = os.path.basename(file_path)
    #     remaining_budget = EXTRACTION_BUDGET - token_usage.get(filename, 0)
    #     if remaining_budget > 0:
    #         print(f"Extracting OCR: {filename} (remaining budget: {remaining_budget})")
    #         ocr_content = extract_ocr(file_path, EXTRACTION_BUDGET, token_usage[filename])
    #         results[filename] = text_table_data[filename] + ocr_content
    #     else:
    #         print(f"Skipping OCR for {filename} (no remaining token budget)")
    #         results[filename] = text_table_data[filename]

# Example usage:
# matched_content = process_matched_papers("/mnt/data/prompt_evaluation_log_cleaned.json", pdf_folder="../content")
# print(matched_content.keys())
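
The budget check in `extract_text_and_tables` can be illustrated without any PDF or tokenizer dependency. In the sketch below, `accumulate_within_budget` is a hypothetical helper and the whitespace word count is only a stand-in for `tiktoken`; the point is the order of operations: measure the page, check the budget, then append.

```python
from typing import Callable, Iterable, Tuple


def accumulate_within_budget(
    pages: Iterable[str],
    token_budget: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> Tuple[str, int]:
    """Concatenate pages until the next page would exceed the token budget."""
    parts = []
    used = 0
    for page_text in pages:
        page_tokens = count_tokens(page_text)
        if used + page_tokens > token_budget:
            break  # the page that does not fit is neither added nor counted
        parts.append(page_text)
        used += page_tokens
    return "".join(parts), used


# Three toy "pages" of 3, 2, and 4 whitespace-delimited tokens
pages = ["one two three ", "four five ", "six seven eight nine "]
text, used = accumulate_within_budget(pages, token_budget=6)
print(repr(text), used)
```

Checking before appending keeps the reported token count equal to what was actually kept, so any later "remaining budget" arithmetic stays accurate.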


In [None]:
matched_content = process_matched_papers("prompt_evaluation_log_cleaned.json", pdf_folder="data/raw")

In [None]:
print(matched_content)

In [None]:
research_proposal_template_prompt = f"""


<WRITE YOUR PROMPT HERE>


"""

In [None]:
research_plan = llm.invoke(research_proposal_template_prompt)

In [None]:
display(Markdown(research_plan.content))

In [None]:
# @title **Optional Part - Creating a PDF of the Research Proposal**
# The code in this cell block is used for printing out the output in the PDF format
from markdown_pdf import MarkdownPdf, Section

pdf = MarkdownPdf()
pdf.add_section(Section(research_plan.content))
pdf.save("Research Proposal First Draft.pdf")


## **Step 5: Proposal Evaluation Against NOFO Criteria - [3 Marks]**
> **Use the LLM to evaluate the generated proposal (LLM-as-Judge) and assess its alignment with the NOFO criteria.**
   

---
<font color=Red>**Note:**</font> *2 marks are awarded for the prompt and 1 mark for the successful completion of this section, including debugging or modifying the code if necessary.*

**TASK:** Write a prompt that can be used to evaluate the Research Proposal based on:
1. **Innovation**
2. **Significance**
3. **Approach**
4. **Investigator Expertise**

- Ask the LLM to rate on each of the criteria from **1 (Poor)** to **5 (Excellent)**
- Ask the LLM to provide the response in JSON format
```JSON
{
  "name": "Innovation",
  "justification": "<Justification>",
  "score": <1-5>,
  "strengths": "<Strength 1>",
  "weaknesses": "<Weakness 1>",
  "recommendations": "<Recommendation 1>"
}
```
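
Once the judge returns JSON in this shape, it is worth checking each criterion before relying on the scores. The sketch below assumes one object per criterion with the six keys shown above. `REQUIRED_KEYS`, `validate_criterion`, and the sample response are illustrative and not part of the notebook.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = {"name", "justification", "score", "strengths", "weaknesses", "recommendations"}


def validate_criterion(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one criterion object (empty if it is valid)."""
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    score = entry.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 5:
        problems.append(f"score must be an integer from 1 to 5, got {score!r}")
    return problems


# Hypothetical judge output used only to exercise the validator
sample_response = """
{
  "name": "Innovation",
  "justification": "Combines two established methods in a new setting.",
  "score": 4,
  "strengths": "Clear novelty claim.",
  "weaknesses": "Limited pilot data.",
  "recommendations": "Add a feasibility study."
}
"""

criterion = json.loads(sample_response)
print(validate_criterion(criterion))
print(validate_criterion({"name": "Approach", "score": 7}))
```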



In [None]:
evaluation_prompt = f'''


<WRITE YOUR PROMPT HERE>


'''

In [None]:
# Call the LLM (acting as judge) with the evaluation prompt
eval_response = llm.invoke(evaluation_prompt)

In [None]:
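# Parse the judge's JSON, tolerating an optional Markdown code fence around it
import json
import re
fenced = re.search(r"```(?:json)?\s*(.*?)```", eval_response.content, flags=re.DOTALL | re.IGNORECASE)
json_resp = json.loads(fenced.group(1) if fenced else eval_response.content)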

In [None]:
for key, value in json_resp.items():
  print(f"---\n{key}:")
  if isinstance(value, list):
    for item in value:
      for k, v in item.items():
        print(f"  {k}: {v}")
      print("="*50)
  elif isinstance(value, dict):
    for k, v in value.items():
      print(f"  {k}: {v}")
  else:
    print(f"  {value}")


## **Step 6: Human Review and Refinement of Proposal**
> **Perform Human Evaluation of the generated Proposal. Edit or Modify the proposal as necessary.**

In [None]:
display(Markdown(research_plan.content))

## **Step 7: Summary and Recommendation - [2 Marks]**


Based on the projects, learners are expected to share their observations, key learnings, and insights related to this business use case, including the challenges they encountered.

Additionally, they should recommend or explain any changes that could improve the project, along with suggesting additional steps that could be taken for further enhancement.



In [None]:

# --- Enhanced PDF Processing (Commenting original PyPDF-only approach) ---
# Original starter code (commented for traceability):
# Load PDF files and extract content using PyPDFLoader
# docs = PyPDFLoader(file_path, mode="single").load()

# New Implementation: Multi-stage parsing (PyPDF → Camelot/Tabula → OCR fallback)
# Purpose: Capture text, tables, and figures from diverse PDF formats (Mermaid C node, Rubric Step 2).

# Import required libraries for core functionality
from pypdf import PdfReader  # same package as the earlier cell; PyPDF2 is its deprecated predecessor
import camelot
import pytesseract
from pdf2image import convert_from_path

def process_pdf_multistage(file_path):
    """
    Multi-stage pipeline for extracting text, tables, and figures from PDFs.
    Stages:
    1. PyPDF (text)
    2. Camelot/Tabula (tables)
    3. OCR (scanned pages/figures)
    """
    content = ""

    # Stage 1: PyPDF text extraction
    try:
        reader = PdfReader(file_path)
        for page in reader.pages:
            content += page.extract_text() or ""
    except Exception as e:
        print(f"PyPDF extraction failed: {e}")

    # Stage 2: Table extraction (Camelot)
    try:
        tables = camelot.read_pdf(file_path, pages='all')
        for table in tables:
            content += "\n[Table Extracted]\n" + table.df.to_string()
    except Exception as e:
        print(f"Table extraction failed: {e}")

    # Stage 3: OCR for scanned pages or figures (runs on every page, so it may repeat Stage 1 text)
    try:
        images = convert_from_path(file_path)
        for image in images:
            text = pytesseract.image_to_string(image)
            content += "\n[OCR Extracted]\n" + text
    except Exception as e:
        print(f"OCR extraction failed: {e}")

    return content


In [None]:

# --- Hybrid Retrieval (BM25 + Embeddings) ---
# Original code used either BM25 OR embeddings; this combines both (Mermaid D node, Rubric Step 2).

from rank_bm25 import BM25Okapi
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

def hybrid_retrieval_setup(docs_text):
    """
    Creates BM25 and embedding indexes for hybrid search.
    """
    # BM25 Index
    tokenized_corpus = [doc.split(" ") for doc in docs_text]
    bm25 = BM25Okapi(tokenized_corpus)

    # Embedding Index
    embed_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    vectorstore = Chroma.from_texts(docs_text, embed_model)

    return bm25, vectorstore
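
`hybrid_retrieval_setup` builds the two indexes but does not yet merge their results. One common way to combine a keyword ranking with an embedding ranking is reciprocal rank fusion (RRF). The sketch below is an assumed technique, not something the notebook implements. The document IDs are made up, and the two input rankings stand in for whatever order the BM25 scores and the vector-store search would produce.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for the same query from the two retrievers
bm25_ranking = ["paper_b", "paper_a", "paper_c"]
embedding_ranking = ["paper_a", "paper_c", "paper_b"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
```

RRF only needs rank positions, so it sidesteps the fact that BM25 scores and embedding similarities are on different scales.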


In [None]:

# --- Agentic Components (Research Analyst, Proposal Writer, Compliance Checker) ---
# Implements multi-agent workflow (Mermaid E subgraph, Rubric Step 3-4).

from langchain.agents import initialize_agent, Tool

def analyze_papers(query):
    return "Synthesis of relevant papers"

def check_compliance(proposal):
    return "Compliance report"

tools = [
    Tool(name="Research Analyst", func=analyze_papers, description="Synthesizes relevant papers."),
    Tool(name="Compliance Checker", func=check_compliance, description="Ensures NOFO alignment.")
]

# Initialize agent with zero-shot reasoning and tools
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)


In [None]:

# --- Multi-Criteria Evaluation with Guardrails ---
# Original evaluation only scored NIH criteria; now adds guardrail flags (Mermaid G node, Rubric Step 5).

evaluation_prompt = f"""
Evaluate the proposal on:
1. Innovation
2. Significance
3. Approach
4. Investigator Expertise

Return JSON:
{{
  "criteria": [
    {{
      "name": "Innovation",
      "score": 1-5,
      "strengths": "...",
      "weaknesses": "...",
      "recommendations": "..."
    }},
    ...
  ],
  "overall_score": 1-5,
  "guardrail_flags": ["hallucination risk", "compliance gap"]
}}
"""


In [None]:

# --- Caching Intermediate Steps ---
# Saves embeddings, filtered papers, and draft proposals for reuse (Mermaid J node, Rubric Step 7).

# Import required libraries for core functionality
import pickle

def save_checkpoint(data, name):
    with open(f"checkpoint_{name}.pkl", "wb") as f:
        pickle.dump(data, f)

def load_checkpoint(name):
    try:
        with open(f"checkpoint_{name}.pkl", "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None
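
A typical way to use such checkpoints is a load-or-compute wrapper, so that an expensive step runs only when no cached result exists. The sketch below is self-contained for illustration: `load_or_compute`, the `cache_dir` parameter, and the stand-in `expensive_filter` function are assumptions rather than part of the notebook. As with any pickle-based cache, only load files you created yourself.

```python
import os
import pickle
import tempfile


def load_or_compute(name, compute_fn, cache_dir="."):
    """Return the cached result for `name` if present; otherwise compute and cache it."""
    path = os.path.join(cache_dir, f"checkpoint_{name}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_filter():
    """Stand-in for a costly step such as LLM-based relevance filtering."""
    calls.append("run")
    return ["paper_a.pdf", "paper_b.pdf"]


with tempfile.TemporaryDirectory() as tmp_dir:
    first = load_or_compute("filtered_papers", expensive_filter, cache_dir=tmp_dir)
    second = load_or_compute("filtered_papers", expensive_filter, cache_dir=tmp_dir)

print(first == second, len(calls))
```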




# Quick Reference: Few-Shot + Agentic Enhancements

This section provides details about the few-shot pool, semantic versioning, and agentic conflict resolver integrated into this workflow.

---

## Key Features

**Semantic Versioning**
- Increments the version number on each logged run and appends a suffix for the features used (e.g., `v2-fewshot`, `v3-agentic`).
- Few-shot only → `-fewshot`
- Few-shot + agentic resolver → `-agentic`

**Few-Shot Pool**
- Derived from cleaned log (`prompt_evaluation_log_cleaned.json`).
- Keeps relevant examples with ≥80% hybrid confidence (the default `min_conf`).
- Balances relevant/irrelevant examples 50/50 and shuffles each group so the selection varies between runs.

**Agentic Conflict Resolver**
- Activates when model and rule confidence differ by more than 20 percentage points.
- Produces reconciled decision and rationale logged under `agentic_resolution`.

**Enhanced Logging Fields**
- `decision_source`: hybrid (model + rule)
- `hybrid_confidence`: average of model and rule confidence
- `agentic_resolution`: reconciliation result (if applicable)
- `prompt_version`: auto-generated semantic version
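
As a quick worked example of the fields above, the sketch below computes the hybrid confidence and the review flag for two documents. `summarize_confidence` is a hypothetical helper and the confidence values are invented; the 20-point threshold mirrors the one stated for the conflict resolver.

```python
def summarize_confidence(model_conf, rule_conf, discrepancy_threshold=20.0):
    """Compute hybrid confidence and whether the model/rule gap warrants review."""
    hybrid = (model_conf + rule_conf) / 2
    discrepancy = abs(model_conf - rule_conf)
    return {
        "hybrid_confidence": hybrid,
        "confidence_discrepancy": discrepancy,
        # Strictly greater than the threshold, matching the resolver's trigger
        "flagged_for_review": discrepancy > discrepancy_threshold,
    }


divergent = summarize_confidence(model_conf=90, rule_conf=60)
aligned = summarize_confidence(model_conf=85, rule_conf=80)
print(divergent)
print(aligned)
```

In the first case the 30-point gap flags the document even though its hybrid confidence (75.0) is moderate; in the second, the 5-point gap leaves it unflagged with a hybrid confidence of 82.5.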


In [None]:

# ------------------------------------------------------------
# VERSION TRACKING + FEW-SHOT REBUILDER + AGENTIC RESOLVER
# ------------------------------------------------------------

# Function: Determine the next semantic version string for the prompt
def get_next_prompt_version(log_path, agentic_enabled=False):
    """
    Determine next semantic version based on last logged version.
    Increments number, adds suffix based on features used.
    """
    # Import required libraries for core functionality
    import os, json, re
    version_num = 1
    if os.path.exists(log_path):
        with open(log_path, "r", encoding="utf-8") as f:
            try:
                data = json.load(f)
            except json.JSONDecodeError:
                data = []
        # Extract last version number
        for entry in reversed(data):
            if "prompt_version" in entry:
                match = re.match(r"v(\d+)", entry["prompt_version"])
                if match:
                    version_num = int(match.group(1)) + 1
                break

    suffix = "-agentic" if agentic_enabled else "-fewshot"
    return f"v{version_num}{suffix}"


# Function: Build balanced high-confidence few-shot example pool from the log
def rebuild_few_shot_pool(cleaned_log_path, min_conf=80, max_examples=4):
    """
    Build balanced high-confidence few-shot pool from cleaned log.
    Balances relevant and irrelevant, ensures diversity.
    """
    # Import required libraries for core functionality
    import json, random
    with open(cleaned_log_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    relevant, irrelevant = [], []
    for iteration in data:
        for doc in iteration.get("relevant_documents", []):
            # Hybrid confidence is the average of model and rule confidence
            hybrid_conf = (doc.get("model_confidence", 0) + doc.get("rule_confidence", 0)) / 2
            if hybrid_conf >= min_conf:
                relevant.append((doc["title"], doc["reasoning"]))
        for doc in iteration.get("irrelevant_documents", []):
            irrelevant.append((doc, "PAPER NOT RELATED TO TOPIC"))

    # Shuffle and balance
    half = max_examples // 2
    random.shuffle(relevant)
    random.shuffle(irrelevant)
    return relevant[:half] + irrelevant[:half]


# Function: Resolve discrepancies between model and rule confidences using agentic logic
def agentic_conflict_resolver(doc_title, reasoning_json, model_conf, rule_conf):
    """
    Agentic layer to reconcile conflicts:
    - Triggered when discrepancy exceeds threshold
    - Returns reconciled decision and rationale
    """
    rationale = []
    if abs(model_conf - rule_conf) > 20:
        if rule_conf > model_conf:
            final_decision = "RELEVANT" if rule_conf >= 50 else "PAPER NOT RELATED TO TOPIC"
            rationale.append("Rule confidence higher; prioritizing deterministic criteria.")
        else:
            final_decision = "RELEVANT" if model_conf >= 50 else "PAPER NOT RELATED TO TOPIC"
            rationale.append("Model confidence higher; prioritizing LLM interpretation.")
    else:
        final_decision = "RELEVANT" if (model_conf + rule_conf) / 2 >= 50 else "PAPER NOT RELATED TO TOPIC"
        rationale.append("Confidences close; hybrid average used for decision.")

    return {
        "final_decision": final_decision,
        "rationale": " ".join(rationale)
    }


In [None]:

# ------------------------------------------------------------
# ENHANCED LOGGING WITH SEMANTIC VERSIONING AND AGENTIC RESOLUTION
# ------------------------------------------------------------

# Ensure this cell is run AFTER document processing and building relevant_docs_with_reasoning

# Configuration for few-shot example retrieval
# Maximum number of examples to include in the few-shot prompt
FEW_SHOT_MAX_EXAMPLES = 4
# Minimum confidence threshold for including examples in few-shot prompting
MIN_CONFIDENCE_FOR_FEWSHOT = 70
# Path to the cleaned JSON log file where prompt evaluation iterations are stored
LOG_PATH = "prompt_evaluation_log_cleaned.json"

# Determine prompt version: use the agentic suffix if any document was flagged for review
agentic_enabled = any(doc.get("flagged_for_review", False) for doc in relevant_docs_with_reasoning)
current_version = get_next_prompt_version(LOG_PATH, agentic_enabled=agentic_enabled)

# Add decision source and hybrid confidence
for doc in relevant_docs_with_reasoning:
    doc["decision_source"] = "hybrid"
    doc["hybrid_confidence"] = (doc["model_confidence"] + doc["rule_confidence"]) / 2

# Add agentic resolution for flagged docs
for doc in relevant_docs_with_reasoning:
    if doc.get("flagged_for_review"):
        resolution = agentic_conflict_resolver(
            doc_title=doc["title"],
            reasoning_json=doc["reasoning"],
            model_conf=doc["model_confidence"],
            rule_conf=doc["rule_confidence"]
        )
        doc["agentic_resolution"] = resolution

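# Write prompt_version and the enriched document records to the latest log entry
with open(LOG_PATH, "r+", encoding="utf-8") as f:
    data = json.load(f)
    if data:
        data[-1]["prompt_version"] = current_version
        # Assumes the log stores documents under "relevant_documents" (the key read elsewhere
        # in this notebook), so decision_source, hybrid_confidence, and agentic_resolution persist
        data[-1]["relevant_documents"] = relevant_docs_with_reasoning
    f.seek(0)
    json.dump(data, f, indent=2, ensure_ascii=False)
    f.truncate()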

print(f"Logged with prompt version: {current_version}")



# Additional enhancements proposed by Claude

Your flowchart shows a well-structured approach to the RFP response generation system. Here are several improvements I'd recommend to enhance the robustness and effectiveness of your solution:

**1. Enhanced RFP Requirements Extraction**

After step B, add a sub-process for:

- **Requirement Categorization:** Classify requirements into mandatory vs. optional, technical vs. administrative
- **Scoring Rubric Extraction:** Specifically parse how proposals will be evaluated
- **Budget Constraints Analysis:** Extract funding limits and cost-effectiveness criteria
- **Timeline Extraction:** Identify key dates and milestone requirements

**2. Improved Paper Processing Pipeline**

Between steps C and D, consider adding:

- **Citation Network Analysis:** Map relationships between papers to identify influential work
- **Method/Innovation Extraction:** Specifically extract methodologies and novel approaches
- **Results/Outcomes Extraction:** Capture quantitative results and impact metrics
- **Quality Assessment:** Add a paper quality scoring mechanism (impact factor, recency, relevance)

**3. Enhanced Retrieval and Ranking**

Expand step D with:

- **Multi-Query Generation:** Generate multiple search queries from different RFP aspects
- **Cross-Reference Validation:** Verify that selected papers actually support proposed innovations
- **Diversity Scoring:** Ensure selected papers cover different aspects of the RFP
- **Gap Analysis:** Identify what the RFP asks for that isn't well-covered in existing research

**4. Strengthened Agentic Architecture**

Add these specialized agents to your existing three:

- **Innovation Synthesizer Agent:** Combines findings from multiple papers into novel approaches
- **Budget Estimator Agent:** Ensures proposals are financially realistic
- **Risk Assessment Agent:** Identifies potential implementation challenges
- **Competitive Analysis Agent:** Positions your proposal against likely competitors

**5. Improved Evaluation and Refinement**

Enhance the evaluation loop (G-I) with:

- **Specific Weakness Detection:** Not just overall score, but identify specific weak sections
- **Competitive Benchmarking:** Compare against successful past proposals if available
- **Consistency Checking:** Ensure all sections align and support each other
- **Technical Feasibility Validation:** Verify proposed solutions are implementable

**6. Additional Process Improvements**

Consider these architectural enhancements:

flowchart LR
    subgraph "Knowledge Management"
        KB1[Domain Ontology]
        KB2[Success Patterns DB]
        KB3[Common Pitfalls DB]
    end
    
    subgraph "Feedback Loops"
        FL1[Real-time Agent Collaboration]
        FL2[Iterative Improvement Tracking]
        FL3[Version Control System]
    end

**7. Quality Assurance Additions**

- **Plagiarism Detection:** Ensure generated content is original
- **Fact Verification:** Cross-check claims against source papers
- **Readability Analysis:** Ensure proposal meets target audience expectations
- **Compliance Validation:** Automated checks against all RFP requirements

**8. Output Enhancement**

For the final deliverables (step N), consider generating:

- **Executive Summary:** One-page overview for quick review
- **Technical Appendix:** Detailed methodology descriptions
- **Budget Justification:** Line-by-line cost explanations
- **Risk Mitigation Plan:** Addressing identified challenges
- **Evaluation Metrics:** How success will be measured

**9. Monitoring and Logging**

Add throughout the pipeline:

- **Decision Logging:** Track why papers were selected/rejected
- **Agent Reasoning Traces:** Understand how proposals were generated
- **Performance Metrics:** Time taken, resources used, quality scores
- **Error Handling:** Graceful degradation if components fail

**10. Advanced Features**

Consider these stretch goals:

- **Multi-RFP Learning:** Learn from multiple RFPs to improve over time
- **Collaborative Filtering:** If multiple users, learn from collective behavior
- **Adaptive Prompting:** Adjust prompts based on intermediate results
- **Uncertainty Quantification:** Flag areas where the system is less confident
