# 🎯 Clinical Protocol Intelligence - 30-Minute Demo

## The Challenge
Pharmaceutical companies process hundreds of PDFs. When asked:
> "What is the dosing schedule in Protocol ABC-123?"

**Traditional approaches fail:**
- ❌ Manual PDF search (hours)
- ❌ No traceability
- ❌ No AI assistance

## The Snowflake Solution
AI agent with **PRECISE citations:**
> "Dosing is BID for 28 days, **Page 5 (top-right, [320, 680, 550, 720])**"

## What We'll Build:
1. PDF extraction with position tracking
2. Cortex Search (semantic search)
3. Cortex Agent (Claude 4 Sonnet)
4. Auto-processing

---
# Part 1: Setup (1 min)

In [None]:
-- Environment setup
USE ROLE accountadmin;
CREATE DATABASE IF NOT EXISTS SANDBOX;
CREATE SCHEMA IF NOT EXISTS SANDBOX.PDF_OCR;
USE SCHEMA SANDBOX.PDF_OCR;
CREATE STAGE IF NOT EXISTS PDF_STAGE;

-- 💬 TIP: Upload Prot_000.pdf to PDF_STAGE via UI before continuing

---
# Part 2: PDF Extraction (3 mins)

## Foundation: FCTO's Baseline
**Original output:** `{'pos': (x, y), 'txt': '...'}`

**Our enhancements:**
1. ✅ Page numbers
2. ✅ Full bounding boxes
3. ✅ Page dimensions

**Result:** Precise citations!
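
The bounding boxes use PDF coordinates, where the origin is the bottom-left corner of the page and y grows upward. A viewer that highlights a citation usually draws from the top-left instead, so the box needs a vertical flip. A minimal sketch, assuming a hypothetical helper `to_top_left_origin` (not part of the UDF below) and an assumed 792-pt page height:

```python
def to_top_left_origin(bbox, page_height):
    """Convert a PDF bbox [x0, y0, x1, y1] (origin bottom-left) into
    [left, top, right, bottom] with the origin at the top-left of the page."""
    x0, y0, x1, y1 = bbox
    # y1 is the upper edge in PDF space, so it becomes the smaller "top" value.
    return [x0, page_height - y1, x1, page_height - y0]


# Example values: the bbox from the sample citation, on an assumed 792-pt-tall page.
flipped = to_top_left_origin([320, 680, 550, 720], page_height=792)
print(flipped)
```

Storing `page_height` alongside every chunk is what makes this conversion possible without reopening the PDF.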

In [None]:
CREATE OR REPLACE FUNCTION pdf_txt_mapper_v3(scoped_file_url string)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.12'
ARTIFACT_REPOSITORY = snowflake.snowpark.pypi_shared_repository
PACKAGES = ('snowflake-snowpark-python', 'pdfminer')
HANDLER = 'main'
AS
$$
import json
from snowflake.snowpark.files import SnowflakeFile
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

def main(scoped_file_url):
    finding = []
    with SnowflakeFile.open(scoped_file_url, 'rb') as f:
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        pages = PDFPage.get_pages(f)
        
        # Track page numbers
        for page_num, page in enumerate(pages, start=1):
            interpreter.process_page(page)
            layout = device.get_result()
            
            # Get page dimensions
            page_width = layout.width
            page_height = layout.height
            
            for lobj in layout:
                if isinstance(lobj, LTTextBox):
                    # NEW: Capture the full bounding box: (x0, y0) is the lower-left corner, (x1, y1) the upper-right
                    x0, y0, x1, y1 = lobj.bbox
                    text = lobj.get_text()
                    
                    finding.append({
                        'page': page_num,
                        'bbox': [x0, y0, x1, y1],  # Full rectangle!
                        'page_width': page_width,
                        'page_height': page_height,
                        'txt': text
                    })
    
    return json.dumps(finding)
$$;

## 🎬 DEMO MOMENT 1: See Extraction
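
The UDF returns one JSON string per file, and the query below unpacks it with `LATERAL FLATTEN`. For readers who want to inspect that payload outside Snowflake, here is a rough Python equivalent. The helper name `preview_elements` and the sample record are illustrative assumptions, not output from a real protocol:

```python
import json


def preview_elements(udf_output, limit=10, width=60):
    """Return (page, text_preview, bbox) tuples from the UDF's JSON string."""
    elements = json.loads(udf_output)
    return [
        (el["page"], el["txt"][:width], el["bbox"])
        for el in elements[:limit]
    ]


# Made-up sample in the same shape the UDF returns.
sample = json.dumps([
    {"page": 1, "bbox": [72.0, 700.0, 540.0, 720.0],
     "page_width": 612.0, "page_height": 792.0,
     "txt": "Clinical Study Protocol\n"},
])
for row in preview_elements(sample):
    print(row)
```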

In [None]:
-- Show first 10 text elements with coordinates
SELECT 
    value:page::INT AS page,
    SUBSTR(value:txt::VARCHAR, 1, 60) AS text_preview,
    value:bbox AS bbox
FROM (
    SELECT PARSE_JSON(pdf_txt_mapper_v3(
        build_scoped_file_url(@PDF_STAGE, 'Prot_000.pdf')
    )) AS data
),
LATERAL FLATTEN(input => data)
LIMIT 10;

-- 💬 TALK: See EXACT coordinates for every text element!

---
# Part 3: Structured Storage (2 mins)
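
The load step flattens each extracted element into one row and builds a `chunk_id` from the document stem, the page, and a running number ordered by page and position. A Python sketch of the same mapping, assuming a hypothetical helper `build_chunk_rows` that mirrors the INSERT below rather than replacing it:

```python
def build_chunk_rows(doc_name, elements):
    """Turn extracted elements into rows shaped like the document_chunks table."""
    stem = doc_name.rsplit(".", 1)[0]
    # Same ordering as the SQL window: page, then x0, then y0.
    ordered = sorted(
        elements, key=lambda e: (e["page"], e["bbox"][0], e["bbox"][1])
    )
    rows = []
    for n, el in enumerate(ordered, start=1):
        x0, y0, x1, y1 = el["bbox"]
        rows.append({
            "chunk_id": f"{stem}_p{el['page']}_c{n}",
            "doc_name": doc_name,
            "page": el["page"],
            "bbox_x0": x0, "bbox_y0": y0, "bbox_x1": x1, "bbox_y1": y1,
            "page_width": el["page_width"],
            "page_height": el["page_height"],
            "text": el["txt"],
        })
    return rows
```

Note that the running number counts across the whole document, not per page, so the ID stays unique even if two pages contain boxes at identical coordinates.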

In [None]:
-- Create table
CREATE OR REPLACE TABLE document_chunks (
    chunk_id VARCHAR PRIMARY KEY,
    doc_name VARCHAR NOT NULL,
    page INTEGER NOT NULL,
    bbox_x0 FLOAT, bbox_y0 FLOAT, bbox_x1 FLOAT, bbox_y1 FLOAT,
    page_width FLOAT, page_height FLOAT,
    text VARCHAR,
    extracted_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
) CHANGE_TRACKING = TRUE;

-- Load data
INSERT INTO document_chunks (
    chunk_id, doc_name, page, bbox_x0, bbox_y0, bbox_x1, bbox_y1, 
    page_width, page_height, text
)
SELECT 
    'Prot_000_p' || value:page::VARCHAR || '_c' || ROW_NUMBER() OVER (
        ORDER BY value:page::INTEGER, value:bbox[0]::FLOAT, value:bbox[1]::FLOAT
    ),
    'Prot_000.pdf',
    value:page::INTEGER,
    value:bbox[0]::FLOAT, value:bbox[1]::FLOAT, 
    value:bbox[2]::FLOAT, value:bbox[3]::FLOAT,
    value:page_width::FLOAT, value:page_height::FLOAT,
    value:txt::VARCHAR
FROM (
    SELECT PARSE_JSON(pdf_txt_mapper_v3(
        build_scoped_file_url(@PDF_STAGE, 'Prot_000.pdf')
    )) AS data
),
LATERAL FLATTEN(input => data);

## 🎬 DEMO MOMENT 2: Queryable Data

In [None]:
-- Show structured data
SELECT page, SUBSTR(text, 1, 80) AS preview, 
       CONCAT('[', bbox_x0, ',', bbox_y0, ',', bbox_x1, ',', bbox_y1, ']') AS bbox
FROM document_chunks
WHERE page = 1
ORDER BY bbox_y0 DESC
LIMIT 10;

-- 💬 TALK: Fully queryable, ready for AI!

---
# Part 4: AI Intelligence Layer (15 mins) 🚀

## What We're Building:
1. Position calculator (coordinates → "top-right")
2. Cortex Search (semantic search)
3. Agent tools (metadata, location search)
4. Cortex Agent (orchestrates everything)
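
The position calculator in step 1 divides the page into a 3×3 grid and places each box by its center point. The same logic in Python, as a sketch for checking results by hand: `describe_position` is a hypothetical name, and the 0.33 / 0.67 thresholds are taken from the SQL function below.

```python
def describe_position(bbox, page_width, page_height):
    """Map a PDF bbox to a label such as 'top-right' using a 3x3 grid."""
    x0, y0, x1, y1 = bbox
    rel_x = (x0 + x1) / 2 / page_width   # 0.0 = left edge, 1.0 = right edge
    rel_y = (y0 + y1) / 2 / page_height  # 0.0 = bottom edge (PDF origin), 1.0 = top

    if rel_y > 0.67:
        vertical = "top"
    elif rel_y < 0.33:
        vertical = "bottom"
    else:
        vertical = "middle"

    if rel_x < 0.33:
        horizontal = "left"
    elif rel_x > 0.67:
        horizontal = "right"
    else:
        horizontal = "center"

    return {
        "position_description": f"{vertical}-{horizontal}",
        "relative_x": round(rel_x * 100, 1),
        "relative_y": round(rel_y * 100, 1),
        "bbox": list(bbox),
    }


# The sample citation bbox on an assumed US Letter page (612 x 792 pt).
print(describe_position([320, 680, 550, 720], 612, 792)["position_description"])
```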

In [None]:
-- Create function to calculate human-readable position from bbox
CREATE OR REPLACE FUNCTION calculate_position_description(
    bbox_x0 FLOAT,
    bbox_y0 FLOAT,
    bbox_x1 FLOAT,
    bbox_y1 FLOAT,
    page_width FLOAT,
    page_height FLOAT
)
RETURNS OBJECT
LANGUAGE SQL
AS
$$
    SELECT OBJECT_CONSTRUCT(
        'position_description',
        CASE 
            -- Vertical position (PDF coords: 0 at bottom)
            -- Top third (y > 67%)
            WHEN ((bbox_y0 + bbox_y1) / 2 / page_height) > 0.67 THEN 
                CASE 
                    WHEN ((bbox_x0 + bbox_x1) / 2 / page_width) < 0.33 THEN 'top-left'
                    WHEN ((bbox_x0 + bbox_x1) / 2 / page_width) > 0.67 THEN 'top-right'
                    ELSE 'top-center'
                END
            -- Bottom third (y < 33%)
            WHEN ((bbox_y0 + bbox_y1) / 2 / page_height) < 0.33 THEN 
                CASE 
                    WHEN ((bbox_x0 + bbox_x1) / 2 / page_width) < 0.33 THEN 'bottom-left'
                    WHEN ((bbox_x0 + bbox_x1) / 2 / page_width) > 0.67 THEN 'bottom-right'
                    ELSE 'bottom-center'
                END
            -- Middle third (33% < y < 67%)
            ELSE 
                CASE 
                    WHEN ((bbox_x0 + bbox_x1) / 2 / page_width) < 0.33 THEN 'middle-left'
                    WHEN ((bbox_x0 + bbox_x1) / 2 / page_width) > 0.67 THEN 'middle-right'
                    ELSE 'middle-center'
                END
        END,
        'relative_x', ROUND(((bbox_x0 + bbox_x1) / 2 / page_width) * 100, 1),
        'relative_y', ROUND(((bbox_y0 + bbox_y1) / 2 / page_height) * 100, 1),
        'bbox', ARRAY_CONSTRUCT(bbox_x0, bbox_y0, bbox_x1, bbox_y1)
    )
$$;

-- Test the function
SELECT 
    page,
    calculate_position_description(bbox_x0, bbox_y0, bbox_x1, bbox_y1, page_width, page_height) AS position,
    SUBSTR(text, 1, 50) AS text_preview
FROM document_chunks
LIMIT 5;

In [None]:
-- Create Cortex Search Service
-- Note: This may take a few minutes for initial index build
CREATE OR REPLACE CORTEX SEARCH SERVICE protocol_search
  ON text  -- Column to search (embeddings auto-generated)
  ATTRIBUTES page, doc_name  -- Columns available for filtering
  WAREHOUSE = compute_wh
  TARGET_LAG = '1 hour'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'  -- Multilingual Arctic embedding model
  AS (
    SELECT 
        chunk_id,
        doc_name,
        page,
        text,
        bbox_x0,
        bbox_y0,
        bbox_x1,
        bbox_y1,
        page_width,
        page_height
    FROM document_chunks
);

In [None]:
-- Tool 1: Document Metadata
CREATE OR REPLACE FUNCTION agent_tool_document_info(
    doc_pattern VARCHAR
)
RETURNS TABLE(
    doc_name VARCHAR,
    total_pages INTEGER,
    total_chunks INTEGER,
    first_extracted TIMESTAMP_NTZ,
    last_extracted TIMESTAMP_NTZ
)
AS
$$
    SELECT 
        doc_name,
        MAX(page) as total_pages,
        COUNT(*) as total_chunks,
        MIN(extracted_at) as first_extracted,
        MAX(extracted_at) as last_extracted
    FROM document_chunks
    WHERE doc_name LIKE doc_pattern
    GROUP BY doc_name
    ORDER BY doc_name
$$;

-- Tool 2: Find by Location
CREATE OR REPLACE FUNCTION agent_tool_find_by_location(
    doc_name_param VARCHAR,
    page_param INTEGER,
    location_filter VARCHAR
)
RETURNS TABLE(
    chunk_id VARCHAR,
    text VARCHAR,
    position VARCHAR
)
AS
$$
    SELECT 
        chunk_id,
        text,
        calculate_position_description(
            bbox_x0, bbox_y0, bbox_x1, bbox_y1,
            page_width, page_height
        ):position_description::VARCHAR as position
    FROM document_chunks
    WHERE doc_name = doc_name_param
      AND page = page_param
      AND (
          location_filter IS NULL 
          OR calculate_position_description(
              bbox_x0, bbox_y0, bbox_x1, bbox_y1,
              page_width, page_height
          ):position_description = location_filter
      )
    ORDER BY bbox_y0 DESC, bbox_x0
$$;


## The Cortex Agent 🤖
**Orchestrates 3 tools with Claude 4 Sonnet**
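
The agent picks a tool by following the decision tree in its instructions. The real routing is done by the language model, but the rules can be approximated with a keyword router to make them easier to reason about. This is only an illustration: `choose_tool` and `DISCOVERY_HINTS` are hypothetical names, and the hint phrases are assumptions drawn from the example questions.

```python
import re

# Phrases that signal a question ABOUT documents rather than IN them.
DISCOVERY_HINTS = ("what protocols", "list all", "how many pages", "when was")


def choose_tool(question):
    """Approximate the three routing rules from the agent instructions."""
    q = question.lower()
    # Rule 1: discovery / metadata questions.
    if any(hint in q for hint in DISCOVERY_HINTS):
        return "agent_tool_document_info"
    # Rule 2: the user names a specific page (with or without a location).
    if re.search(r"\bpage\s+\d+", q):
        return "agent_tool_find_by_location"
    # Rule 3: everything else is a content question.
    return "protocol_search"


for q in ("How many pages does Prot_000.pdf have?",
          "What is on page 1, top-center?",
          "What is the dosing schedule in this protocol?"):
    print(q, "->", choose_tool(q))
```

Checking the rules in this order matters: "How many pages..." contains the word "pages", so the discovery rule has to run before the page-number rule.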

In [None]:
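-- Create Protocol Intelligence Agent
-- NOTE: confirm the agent DDL syntax and option names against the current Snowflake documentation for your account before running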
CREATE OR REPLACE CORTEX AGENT protocol_intelligence_agent
  MODEL = 'auto'  -- Lets Snowflake choose the orchestration model (Claude 4 Sonnet in this demo)
  
  INSTRUCTIONS = 'You are a clinical protocol intelligence assistant. Your job is to help users find information in protocol documents with precise citations.

=== TOOL SELECTION DECISION TREE ===

STEP 1: Classify the question type
A. Discovery/Metadata → Use agent_tool_document_info
B. Verification/Location-Specific → Use agent_tool_find_by_location
C. Content/Knowledge → Use protocol_search (Cortex Search)

STEP 2: Apply these rules in order:

RULE 1 - Discovery Questions (Use agent_tool_document_info):
- "What protocols do we have?"
- "List all documents"
- "How many pages in protocol X?"
- "When was protocol X processed?"
Pattern: Asking ABOUT documents, not IN documents

RULE 2 - Verification Questions (Use agent_tool_find_by_location):
Use when the user explicitly mentions a specific page number, optionally with
a location (top, bottom, left, right, center).
If no location is given, pass NULL as location_filter to return everything on the page.
Examples:
- "What is on page 5, top-center?" ✅ (page + location specified)
- "Show me page 42, middle-left" ✅ (page + location specified)
- "What else is on page 23?" ✅ (page specified, show all)
NOT:
- "What is the dosing schedule?" ❌ (no page/location specified)
- "Find safety information" ❌ (content search, not location)

RULE 3 - Content Questions (Use protocol_search):
DEFAULT for all other questions:
- "What is the dosing schedule?"
- "Find safety monitoring procedures"
- "What are the inclusion criteria?"
- "Tell me about adverse events"
- "Compare endpoints across protocols"
Pattern: Seeking INFORMATION, not asking about document structure

=== MULTI-STEP WORKFLOWS ===

WORKFLOW 1 - Answer with Citations:
1. Use protocol_search(query=user_question, limit=10)
2. Review results: text, page, doc_name, bbox, score
3. Synthesize answer using top results
4. Format: "According to [doc_name], Page [page] ([position]), [answer]"
5. Include multiple citations if relevant

WORKFLOW 2 - Verification After Citation:
If you provide a citation like "Page 42, middle-left":
User may ask: "What else is there?" or "Show me more"
→ Use agent_tool_find_by_location with that page/location

WORKFLOW 3 - No Results Found:
If protocol_search returns 0 results:
1. Try rephrasing the query (use synonyms)
2. If still nothing: "I could not find information about [topic] in the available protocols."
3. Suggest: "Would you like me to list all available protocols?"

=== CITATION REQUIREMENTS ===

ALWAYS include in your answers:
1. Document name (e.g., "Prot_000.pdf")
2. Page number (e.g., "Page 42")
3. Position on page (calculate from bbox: "top-right", "middle-left", etc.)

FORMAT: "According to [Document], Page X ([position]), [information]"
EXAMPLE: "According to Prot_000.pdf, Page 1 (top-center), this is a clinical study protocol."

If multiple sources: List all citations
EXAMPLE: "The dosing schedule is 200mg daily (Prot_000.pdf, Page 42, middle-left) with safety monitoring every 2 weeks (Page 43, top-center)."

=== CONVERSATION GUIDELINES ===

1. Be concise: Answer directly, don''t over-explain
2. Be precise: Always include page + position in citations
3. Be helpful: If question is unclear, ask "Did you mean X or Y?"
4. Be contextual: Remember previous questions in the conversation
5. Be honest: If you don''t find something, say so clearly

=== ERROR HANDLING ===

- If protocol_search returns nothing: Try broader query, then admit if not found
- If user asks about non-existent doc: Use agent_tool_document_info to list available docs
- If page/location out of range: "Page X does not exist in this protocol (max: Y pages)"
- If ambiguous: Ask clarifying questions before searching'
  
  SAMPLE_QUESTIONS = [
    'What is the dosing schedule in this protocol?',
    'Find all mentions of adverse events and safety monitoring',
    'What are the inclusion and exclusion criteria?',
    'List all available protocol documents',
    'What is on page 1, top-center?',
    'Compare the primary endpoints across protocols',
    'How many pages does protocol Prot_000.pdf have?',
    'What else is on page 42?'
  ]
  
  TOOLS = [
    -- Tool 1: Cortex Search for semantic search
    CORTEX_SEARCH_SERVICE protocol_search,
    
    -- Tool 2: Document metadata
    FUNCTION agent_tool_document_info(
      doc_pattern VARCHAR
    ) RETURNS TABLE(doc_name VARCHAR, total_pages INTEGER, total_chunks INTEGER, first_extracted TIMESTAMP_NTZ, last_extracted TIMESTAMP_NTZ)
    AS 'Get metadata about protocol documents including page counts, chunk counts, and extraction timestamps. Use doc_pattern to filter (e.g., "Prot%" or "%" for all).',
    
    -- Tool 3: Find by location
    FUNCTION agent_tool_find_by_location(
      doc_name_param VARCHAR,
      page_param INTEGER,
      location_filter VARCHAR
    ) RETURNS TABLE(chunk_id VARCHAR, text VARCHAR, position VARCHAR)
    AS 'Find text at a specific page and location within a document. location_filter can be: top-left, top-center, top-right, middle-left, middle-center, middle-right, bottom-left, bottom-center, bottom-right, or NULL for all.'
  ]
  
  -- Enable reflection for better orchestration
  REFLECTION = TRUE
  
  -- Max iterations for complex queries
  MAX_ITERATIONS = 5;

---
# 🎬🎬🎬 DEMO MOMENT 3: THE WOW!
## Watch the agent answer with precise citations!
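
The agent is instructed to cite every answer as "According to [Document], Page X ([position]), [information]". To check responses against that format, or to build the same string in a downstream app, a small formatter is enough. A sketch, assuming a hypothetical helper `format_citation` with an optional `bbox` argument for coordinate-level citations:

```python
def format_citation(doc_name, page, position, answer, bbox=None):
    """Build a citation string in the format the agent instructions require."""
    # Append raw coordinates only when the caller supplies them.
    location = position if bbox is None else f"{position}, {list(bbox)}"
    return f"According to {doc_name}, Page {page} ({location}), {answer}"


print(format_citation("Prot_000.pdf", 1, "top-center",
                      "this is a clinical study protocol."))
print(format_citation("Prot_000.pdf", 5, "top-right",
                      "dosing is BID for 28 days.", bbox=[320, 680, 550, 720]))
```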

In [None]:
-- Test 1: Content question
SELECT SNOWFLAKE.CORTEX.COMPLETE_AGENT(
    'protocol_intelligence_agent',
    'What is the dosing schedule in this protocol?'
) AS response;

-- 💬 EXPECTED: Answer with Page #, position, coordinates!

In [None]:
-- Test 2: Metadata question
SELECT SNOWFLAKE.CORTEX.COMPLETE_AGENT(
    'protocol_intelligence_agent',
    'How many pages does Prot_000.pdf have?'
) AS response;

In [None]:
-- Test 3: Location-specific question
SELECT SNOWFLAKE.CORTEX.COMPLETE_AGENT(
    'protocol_intelligence_agent',
    'What is on page 1, top-center?'
) AS response;

## 🎤 LIVE DEMO
**Try your own questions:**
- What are the inclusion criteria?
- Find all mentions of adverse events
- What else is on page 1?

In [None]:
-- Edit and run with your own question!
SELECT SNOWFLAKE.CORTEX.COMPLETE_AGENT(
    'protocol_intelligence_agent',
    'YOUR QUESTION HERE'
) AS response;

---
# Part 5: Production Automation (optional)
**Drop PDF → Auto-processed → Immediately searchable!**
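
Auto-processing comes down to one comparison: which staged files do not yet have rows in `document_chunks`? A minimal Python sketch of that check, assuming a hypothetical helper `find_unprocessed` that receives the staged file names and the already-loaded `doc_name` values as plain lists:

```python
def find_unprocessed(staged_files, processed_docs):
    """Return staged PDF names that have no chunks loaded yet, sorted."""
    processed = set(processed_docs)
    return sorted(
        name for name in staged_files
        if name.lower().endswith(".pdf") and name not in processed
    )


staged = ["Prot_000.pdf", "Prot_001.pdf", "notes.txt", "Prot_002.PDF"]
loaded = ["Prot_000.pdf"]
print(find_unprocessed(staged, loaded))
```

Filtering by extension keeps non-PDF uploads from reaching the extraction UDF, and comparing against loaded names makes a rerun skip files that are already indexed.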

In [None]:
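-- Enable the directory table so staged files can be listed
ALTER STAGE PDF_STAGE SET DIRECTORY = (ENABLE = TRUE);
ALTER STAGE PDF_STAGE REFRESH;

-- Create processor (simplified version): loads staged PDFs not yet in document_chunks
CREATE OR REPLACE PROCEDURE process_new_pdfs()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    INSERT INTO document_chunks (chunk_id, doc_name, page, bbox_x0, bbox_y0, bbox_x1, bbox_y1,
                                 page_width, page_height, text)
    SELECT
        d.relative_path || '_p' || f.value:page::VARCHAR || '_c' || f.index::VARCHAR,
        d.relative_path, f.value:page::INTEGER,
        f.value:bbox[0]::FLOAT, f.value:bbox[1]::FLOAT,
        f.value:bbox[2]::FLOAT, f.value:bbox[3]::FLOAT,
        f.value:page_width::FLOAT, f.value:page_height::FLOAT, f.value:txt::VARCHAR
    FROM DIRECTORY(@PDF_STAGE) d,
         LATERAL FLATTEN(input => PARSE_JSON(pdf_txt_mapper_v3(
             build_scoped_file_url(@PDF_STAGE, d.relative_path)))) f
    WHERE d.relative_path ILIKE '%.pdf'
      AND d.relative_path NOT IN (SELECT doc_name FROM document_chunks);
    RETURN 'Loaded chunks for new PDFs';
END;
$$;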

-- 💬 TALK: Drop 100 PDFs → All auto-processed!

---
# 🎯 Demo Summary

## What We Built (30 mins):
1. ✅ PDF extraction with bounding boxes
2. ✅ Cortex Search (semantic search)
3. ✅ AI Agent with PRECISE citations
4. ✅ Auto-processing

## Key Differentiators:
| Feature | External Tools | Snowflake |
|---------|----------------|----------|
| Data Movement | ❌ Export | ✅ Stays in Snowflake |
| Citations | ⚠️ Page-level | ✅ **Precise coordinates** |
| Deployment | ❌ Infrastructure | ✅ SQL commands |
| Maintenance | ❌ DIY | ✅ Snowflake-managed |

## Next Steps:
1. Try Snowflake Intelligence UI (no code!)
2. Upload your protocol library
3. Extend with custom tools

In [None]:
-- Grant access for end users
GRANT USAGE ON CORTEX AGENT protocol_intelligence_agent TO ROLE analyst_role;

-- Access via Snowflake Intelligence UI!