# Phase 0: PDF OCR with Position Tracking - Baseline

## Overview
This notebook implements the **baseline solution** provided by the Snowflake FCTO for extracting text from PDFs while capturing position information.

### What This Does:
- Extracts text from PDF documents stored in Snowflake stages
- Captures the **x,y coordinates** of each text box on the page
- Returns structured data: `{pos: (x,y), txt: text}`

### Customer Requirement This Addresses:
✅ **Document Intelligence - positioning capability** - knows where text appears on the page

### What's Missing (Future Phases):
- ❌ Page numbers
- ❌ Section detection
- ❌ Better chunking
- ❌ LLM integration
- ❌ Citation system

---


## Step 1: Environment Setup

Set up the Snowflake environment with appropriate roles and context.


In [None]:
-- Use administrative role to grant permissions
USE ROLE accountadmin;


In [None]:
-- Grant access to PyPI packages (needed for pdfminer library)
GRANT DATABASE ROLE SNOWFLAKE.PYPI_REPOSITORY_USER TO ROLE accountadmin;


## Step 2: Database and Schema Setup

Create the PDF_OCR schema in the SANDBOX database for this project.


In [None]:
-- Create the PDF_OCR schema if it doesn't exist
CREATE SCHEMA IF NOT EXISTS SANDBOX.PDF_OCR
COMMENT = 'Schema for PDF OCR with position tracking solution';


In [None]:
-- Set database and schema context
USE DATABASE SANDBOX;
USE SCHEMA PDF_OCR;


## Step 3: Create Stage for PDF Storage

Stages in Snowflake are locations where data files are stored. We'll create an internal stage to hold our PDF documents.


In [None]:
-- Create internal stage for PDF files
CREATE STAGE IF NOT EXISTS PDF_STAGE
COMMENT = 'Stage for storing clinical protocol PDFs and other documents';


In [None]:
-- Verify stage was created
SHOW STAGES LIKE 'PDF_STAGE';


## Step 4: Create PDF Text Mapper UDF

This User-Defined Function (UDF) is the core of our solution. Let's break down what it does:

### Technology Stack:
- **Language:** Python 3.12
- **Library:** `pdfminer` - A PDF parsing library that reads a PDF's embedded text layer (it does not run OCR on scanned images)
- **Snowflake Integration:** Uses `SnowflakeFile` to read directly from stages

### How It Works:
1. Opens the PDF file from the Snowflake stage
2. Iterates through each page
3. Extracts text boxes (`LTTextBox` objects) from the page layout
4. Captures the **bounding box coordinates** (bbox) - specifically:
   - `bbox[0]` = x-coordinate (left)
   - `bbox[3]` = y-coordinate (top)
5. Returns an array of objects: `{pos: (x,y), txt: text}`

### Input:
- `scoped_file_url`: A Snowflake-generated URL pointing to a file in a stage

### Output:
- VARCHAR containing the Python string representation (`str()`) of a list of text boxes with positions; note that this is not valid JSON
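
The coordinate selection in step 4 can be tried without Snowflake or pdfminer. The sketch below uses a hypothetical `FakeTextBox` stand-in (not a pdfminer class) that mimics only the `bbox` attribute and `get_text()` method the UDF relies on; the sample coordinates are made up:

```python
from dataclasses import dataclass

@dataclass
class FakeTextBox:
    """Hypothetical stand-in for a pdfminer text box (illustration only)."""
    bbox: tuple  # (x0, y0, x1, y1): bottom-left and top-right corners
    text: str

    def get_text(self):
        return self.text

def to_record(box):
    # Keep the left edge (bbox[0]) and the top edge (bbox[3]),
    # i.e. the top-left corner of the box.
    return {'pos': (box.bbox[0], box.bbox[3]), 'txt': box.get_text()}

title = FakeTextBox(bbox=(54.0, 700.0, 300.0, 720.3), text='CLINICAL PROTOCOL\n')
record = to_record(title)
print(record)
```

Only one corner survives this step, so the width and height of the box are lost in the baseline output.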


In [None]:
CREATE OR REPLACE FUNCTION pdf_txt_mapper(scoped_file_url string)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.12'
ARTIFACT_REPOSITORY = snowflake.snowpark.pypi_shared_repository
PACKAGES = ('snowflake-snowpark-python', 'pdfminer')
HANDLER = 'main'
AS
$$
from snowflake.snowpark.files import SnowflakeFile
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

def main(scoped_file_url):
    finding = []
    with SnowflakeFile.open(scoped_file_url, 'rb') as f:
        # Initialize PDF processing components
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()  # Layout analysis parameters
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        pages = PDFPage.get_pages(f)
        
        # Process each page
        for page in pages:
            interpreter.process_page(page)
            layout = device.get_result()
            
            # Extract text boxes from the page
            for lobj in layout:
                if isinstance(lobj, LTTextBox):
                    # bbox = (x0, y0, x1, y1) where (x0,y0) is bottom-left, (x1,y1) is top-right
                    x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
                    finding += [{'pos': (x, y), 'txt': text}]
    
    return str(finding)
$$;


In [None]:
-- Verify function was created
SHOW FUNCTIONS LIKE 'pdf_txt_mapper';


## Step 5: Upload PDF to Stage

### Instructions:

**Option 1: Using Snowflake Web UI**
1. Navigate to Data → Databases → SANDBOX → PDF_OCR → Stages
2. Click on the `PDF_STAGE` stage
3. Click "+ Files" button in the top right
4. Upload your PDF file (e.g., `Prot_000.pdf`)

**Option 2: Using SnowSQL CLI**
```bash
# Start a SnowSQL session; the two statements below are run at its prompt
snowsql -a <account> -u <username>
USE SCHEMA SANDBOX.PDF_OCR;
PUT file:///path/to/your/file.pdf @PDF_STAGE AUTO_COMPRESS=FALSE;
```

**Option 3: Using Python Snowpark**
```python
session.file.put("Prot_000.pdf", "@PDF_STAGE", auto_compress=False)
```

Let's verify the file after upload:


In [None]:
-- List files in the PDF stage
LIST @PDF_STAGE;


## Step 6: Test the PDF Text Mapper

Now let's test our function with the uploaded PDF.

### What to Expect:
- The function will return a VARCHAR (string representation of a Python list)
- Each element will be: `{'pos': (x, y), 'txt': 'extracted text'}`
- The output will be **very long** for multi-page documents
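
Because the baseline UDF returns `str(finding)`, the result is a Python literal rather than JSON (single quotes, tuple syntax). A minimal sketch of reading such a string back on the client side, using a hard-coded sample in place of real UDF output:

```python
import ast
import json

# Sample shaped like the UDF's return value (not real output).
raw = "[{'pos': (54.0, 720.3), 'txt': 'CLINICAL PROTOCOL\\n'}]"

# json.loads rejects the single-quoted keys and the tuple syntax.
try:
    json.loads(raw)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False

# ast.literal_eval evaluates Python literals only (no arbitrary code).
boxes = ast.literal_eval(raw)
x, y = boxes[0]['pos']
print(is_valid_json, x, y)
```

This works for inspection in a notebook, but SQL functions such as `PARSE_JSON()` cannot consume this format.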

### Note on `build_scoped_file_url()`:
This Snowflake function generates a temporary, scoped URL that allows the UDF to securely access the staged file.


In [None]:
-- Test with the clinical protocol PDF
-- This will return the full extracted text with positions
SELECT pdf_txt_mapper(build_scoped_file_url(@PDF_STAGE, 'Prot_000.pdf')) AS extracted_data;


## Step 7: Analyze the Output

Let's get some basic statistics about what was extracted.


In [None]:
-- Get the length of the output (the CTE runs the UDF once instead of twice)
WITH extracted AS (
    SELECT pdf_txt_mapper(build_scoped_file_url(@PDF_STAGE, 'Prot_000.pdf')) AS data
)
SELECT 
    LENGTH(data) AS output_length_chars,
    LENGTH(data) / 1024 AS output_length_kb  -- approximate: counts characters, not bytes
FROM extracted;


## Phase 0 Summary

### ✅ What We've Accomplished:
1. Set up Snowflake environment with proper roles and permissions
2. Created a stage for storing PDF documents
3. Deployed the FCTO's baseline PDF text mapper UDF
4. Extracted text from a clinical protocol PDF with position information

### 📊 Current Output Format:
```python
[{'pos': (54.0, 720.3), 'txt': 'CLINICAL PROTOCOL\n'}, 
 {'pos': (72.0, 680.1), 'txt': 'Study Title: ...\n'},
 ...]
```

### 🎯 What This Gives Us:
- ✅ Text extraction from PDFs
- ✅ X,Y coordinates for each text box
- ✅ Snowflake-native processing (no external services)

### ⚠️ Current Limitations:
- ❌ No page number information
- ❌ No section/hierarchy detection
- ❌ Text boxes may be too granular or broken
- ❌ Output is a string, not structured data we can query
- ❌ No way to answer "Where did this info come from?"

---

## Next Steps: Phase 1
In the next phase, we'll enhance this solution to:
1. **Add page numbers** to each text box
2. Store results in a **queryable table** instead of a string
3. Add a **unique chunk ID** for each text box

This will enable queries like:
```sql
SELECT * FROM document_chunks 
WHERE page = 5 
AND text ILIKE '%medication%';
```


## Troubleshooting

### Common Issues:

**1. Permission Error on PyPI:**
```
Error: Access denied for database role SNOWFLAKE.PYPI_REPOSITORY_USER
```
**Solution:** Make sure you ran the GRANT command as ACCOUNTADMIN

**2. File Not Found:**
```
Error: File 'Prot_000.pdf' does not exist
```
**Solution:** Verify the file was uploaded with `LIST @PDF_STAGE;`

**3. Function Takes Too Long:**
- Large PDFs (100+ pages) can take 30-60 seconds
- This is normal for the initial processing
- Consider processing in batches for very large documents

**4. Memory Issues:**
- For very large PDFs (500+ pages), you may need to increase warehouse size
- Or split the PDF into smaller chunks before processing
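
One way to picture the batching suggestion is to split the page range into fixed-size groups and process each group in a separate call. The helper below is a hypothetical sketch, not part of the notebook's UDF. It produces zero-based page indices, on the assumption that they would be passed to the `pagenos` argument of `PDFPage.get_pages` in the installed pdfminer version:

```python
def page_batches(total_pages, batch_size):
    """Yield sets of zero-based page indices, batch_size pages at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield set(range(start, min(start + batch_size, total_pages)))

batches = list(page_batches(total_pages=5, batch_size=2))
print(batches)
```

Each batch could then be extracted and loaded independently, which keeps the size of any single UDF result bounded.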


---

# Phase 1: Add Page Numbers & Structured Storage

## What We're Adding

In Phase 1, we'll enhance the baseline solution with:
1. **Page number tracking** - Know which page each text box came from
2. **Table storage** - Store results in a queryable table (not VARCHAR)
3. **Chunk IDs** - Unique identifiers for each text box
4. **Timestamps** - Track when documents were processed

### Benefits:
- ✅ Query specific pages: `WHERE page = 5`
- ✅ Search across documents: `WHERE text ILIKE '%medication%'`
- ✅ Audit trail: When was this document processed?
- ✅ Compare multiple PDFs in the same table


## Step 1: Create Document Chunks Table

This table will store the extracted text with metadata:
- `chunk_id`: Unique identifier (e.g., 'Prot_000_p5_c42')
- `doc_name`: Source PDF filename
- `page`: Page number (1-indexed)
- `x, y`: Position coordinates
- `text`: Extracted text content
- `extracted_at`: Timestamp of extraction
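
A sketch of how an ID in this format could be assembled. The helper name `make_chunk_id` is hypothetical, and dropping the file extension is an assumption based on the example value above:

```python
from pathlib import Path

def make_chunk_id(doc_name, page, chunk_index):
    """Build an ID such as 'Prot_000_p5_c42' from its parts."""
    stem = Path(doc_name).stem  # 'Prot_000.pdf' -> 'Prot_000'
    return f"{stem}_p{page}_c{chunk_index}"

chunk_id = make_chunk_id('Prot_000.pdf', page=5, chunk_index=42)
print(chunk_id)
```

Including the document name in the ID keeps chunks from different PDFs distinct when they share one table.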


In [None]:
CREATE OR REPLACE TABLE document_chunks (
    chunk_id VARCHAR PRIMARY KEY,
    doc_name VARCHAR NOT NULL,
    page INTEGER NOT NULL,
    x FLOAT,
    y FLOAT,
    text VARCHAR,
    extracted_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);


In [None]:
-- Verify table was created
DESC TABLE document_chunks;


## Step 2: Enhanced UDF with Page Numbers

Now we'll create an **enhanced version** of the UDF that tracks page numbers.

### Key Changes:
1. `enumerate(pages, start=1)` - Track page numbers starting from 1
2. `'page': page_num` - Include page number in output
3. Returns JSON with page information

### Output Format:
```json
[{"page": 1, "pos": [54.0, 720.3], "txt": "CLINICAL PROTOCOL\n"},
 {"page": 1, "pos": [72.0, 680.1], "txt": "Study Title: ...\n"},
 {"page": 2, "pos": [54.0, 720.3], "txt": "Section 1: ...\n"}]
```
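
The page-tracking change can be tried locally with plain lists standing in for parsed pages. The nested lists below are made-up sample data, not pdfminer objects:

```python
import json

# Each inner list holds (x, y, text) tuples for one page (sample data).
pages = [
    [(54.0, 720.3, 'CLINICAL PROTOCOL\n'), (72.0, 680.1, 'Study Title: ...\n')],
    [(54.0, 720.3, 'Section 1: ...\n')],
]

finding = []
# start=1 makes page numbers 1-indexed, matching what a PDF viewer shows.
for page_num, boxes in enumerate(pages, start=1):
    for x, y, text in boxes:
        finding.append({'page': page_num, 'pos': [x, y], 'txt': text})

payload = json.dumps(finding)   # valid JSON text
restored = json.loads(payload)  # round-trips without loss
print(restored[-1]['page'])
```

Unlike the `str()` output of the baseline, this payload can be handed straight to `PARSE_JSON()`.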


In [None]:
CREATE OR REPLACE FUNCTION pdf_txt_mapper_v2(scoped_file_url string)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.12'
ARTIFACT_REPOSITORY = snowflake.snowpark.pypi_shared_repository
PACKAGES = ('snowflake-snowpark-python', 'pdfminer')
HANDLER = 'main'
AS
$$
import json
from snowflake.snowpark.files import SnowflakeFile
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

def main(scoped_file_url):
    finding = []
    with SnowflakeFile.open(scoped_file_url, 'rb') as f:
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        pages = PDFPage.get_pages(f)
        
        # Track page numbers with enumerate
        for page_num, page in enumerate(pages, start=1):
            interpreter.process_page(page)
            layout = device.get_result()
            
            for lobj in layout:
                if isinstance(lobj, LTTextBox):
                    x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()
                    # json.dumps() writes tuples as arrays too; a list just makes the JSON shape explicit
                    finding.append({
                        'page': page_num,
                        'pos': [x, y],
                        'txt': text
                    })
    
    # Return valid JSON using json.dumps()
    return json.dumps(finding)
$$;


In [None]:
-- Verify the enhanced function was created
SHOW FUNCTIONS LIKE 'pdf_txt_mapper_v2';


## Step 3: Test Enhanced UDF

Let's test the new UDF to verify it now includes page numbers.


In [None]:
-- Test the enhanced UDF - should now include page numbers
SELECT pdf_txt_mapper_v2(build_scoped_file_url(@PDF_STAGE, 'Prot_000.pdf')) AS extracted_data_with_pages;


## Step 4: Parse and Load Data into Table

Now we'll parse the JSON output and load it into our `document_chunks` table.

We'll use Snowflake's JSON parsing functions:
- `PARSE_JSON()` - Parse the VARCHAR into JSON
- `FLATTEN()` - Convert JSON array into rows
- `:` path notation with `::` casts (e.g. `value:page::INTEGER`) - Extract and type specific fields from JSON objects
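
As a mental model, `PARSE_JSON` plus `FLATTEN` behaves much like `json.loads` followed by a loop that emits one row per array element. The Python sketch below mirrors that idea on a small hand-written sample; it is an analogy, not a description of how Snowflake executes the query:

```python
import json

# Hand-written sample in the shape pdf_txt_mapper_v2 returns.
udf_output = '[{"page": 1, "pos": [54.0, 720.3], "txt": "CLINICAL PROTOCOL\\n"}]'

rows = []
for value in json.loads(udf_output):     # one row per array element
    rows.append({
        'page': int(value['page']),      # like value:page::INTEGER
        'x': float(value['pos'][0]),     # like value:pos[0]::FLOAT
        'y': float(value['pos'][1]),     # like value:pos[1]::FLOAT
        'text': str(value['txt']),       # like value:txt::VARCHAR
    })

print(rows)
```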


In [None]:
-- Parse JSON and insert into table
INSERT INTO document_chunks (chunk_id, doc_name, page, x, y, text)
SELECT 
    'Prot_000_p' || value:page || '_c' || ROW_NUMBER() OVER (ORDER BY value:page, value:pos[0], value:pos[1]) AS chunk_id,
    'Prot_000.pdf' AS doc_name,
    value:page::INTEGER AS page,
    value:pos[0]::FLOAT AS x,
    value:pos[1]::FLOAT AS y,
    value:txt::VARCHAR AS text
FROM (
    SELECT PARSE_JSON(pdf_txt_mapper_v2(build_scoped_file_url(@PDF_STAGE, 'Prot_000.pdf'))) AS parsed_data
),
LATERAL FLATTEN(input => parsed_data) AS f;


## Step 5: Query the Results!

Now we can query the extracted data using SQL. This is the **power of Phase 1** - structured, queryable data!


In [None]:
-- How many text chunks were extracted?
SELECT COUNT(*) AS total_chunks FROM document_chunks;


In [None]:
-- How many chunks per page?
SELECT 
    page,
    COUNT(*) AS chunks_on_page
FROM document_chunks
GROUP BY page
ORDER BY page
LIMIT 20;


In [None]:
-- Search for mentions of 'medication' or 'drug'
SELECT 
    chunk_id,
    page,
    SUBSTR(text, 1, 100) AS text_preview
FROM document_chunks
WHERE text ILIKE '%medication%'
   OR text ILIKE '%drug%'
ORDER BY page
LIMIT 10;


In [None]:
-- Get all text from a specific page (e.g., page 5)
SELECT 
    chunk_id,
    x,
    y,
    text
FROM document_chunks
WHERE page = 5
ORDER BY y DESC, x;


## Phase 1 Summary

### ✅ What We've Accomplished:
1. Created `document_chunks` table for structured storage
2. Enhanced UDF (`pdf_txt_mapper_v2`) with page number tracking
3. Parsed JSON output and loaded into queryable table
4. Demonstrated SQL queries on extracted text

### 📊 New Capabilities:
```sql
-- Query by page
SELECT * FROM document_chunks WHERE page = 5;

-- Search for keywords
SELECT * FROM document_chunks WHERE text ILIKE '%medication%';

-- Count chunks per page
SELECT page, COUNT(*) FROM document_chunks GROUP BY page;
```

### 🎯 What This Gives Us:
- ✅ **Page numbers** - Know which page every text box came from
- ✅ **Queryable data** - Use SQL instead of parsing strings
- ✅ **Chunk IDs** - Unique identifiers for traceability
- ✅ **Timestamps** - Track when documents were processed
- ✅ **Citation foundation** - Can now answer "This is on page 5"

### ⚠️ Still Missing (Future Phases):
- ❌ Full bounding boxes (only have x,y corner) → Phase 2
- ❌ Font information (size, bold/italic) → Phase 3
- ❌ Section detection (headers, hierarchy) → Phase 4
- ❌ Smart chunking (semantic boundaries) → Phase 5
- ❌ LLM integration with citations → Phase 6

---

## Next Steps: Phase 2
In Phase 2, we'll capture **full bounding boxes** (x0, y0, x1, y1) instead of just (x, y). This will enable:
- Highlighting text in PDF viewers
- Detecting multi-column layouts
- Calculating text height/width
- More accurate positioning for citations


---

# Phase 2: Full Bounding Boxes

## What We're Adding

In Phase 2, we'll enhance the solution to capture **complete rectangles** instead of just corner points:
1. **Full bounding boxes** - (x0, y0, x1, y1) instead of just (x, y)
2. **Page dimensions** - Width and height of each page
3. **Text dimensions** - Calculate width and height of text boxes
4. **Visual highlighting** - Enable PDF viewer highlighting

### Benefits:
- ✅ Draw rectangles around extracted text in PDF viewers
- ✅ Calculate relative positions (% from top/left)
- ✅ Detect multi-column layouts
- ✅ Measure text width and height
- ✅ Enable visual highlighting in Streamlit apps


## Step 1: Update Table Schema

We'll alter the existing table to add full bounding box columns.


In [None]:
-- Add bounding box columns to existing table
ALTER TABLE document_chunks ADD COLUMN IF NOT EXISTS bbox_x0 FLOAT;
ALTER TABLE document_chunks ADD COLUMN IF NOT EXISTS bbox_y0 FLOAT;
ALTER TABLE document_chunks ADD COLUMN IF NOT EXISTS bbox_x1 FLOAT;
ALTER TABLE document_chunks ADD COLUMN IF NOT EXISTS bbox_y1 FLOAT;
ALTER TABLE document_chunks ADD COLUMN IF NOT EXISTS page_width FLOAT;
ALTER TABLE document_chunks ADD COLUMN IF NOT EXISTS page_height FLOAT;


In [None]:
-- Verify new columns were added
DESC TABLE document_chunks;


## Step 2: Enhanced UDF with Full Bounding Boxes

Now we'll create a new version of the UDF that captures the **complete bounding box**.

### Key Changes:
1. `x0, y0, x1, y1 = lobj.bbox` - Capture all four coordinates (two opposite corners)
2. `layout.width, layout.height` - Capture page dimensions
3. Returns complete rectangle coordinates

### Bounding Box Explained:
```
(x0, y1)  ┌──────────────┐
          │   Text Box   │
          └──────────────┘  (x1, y0)
```
- `x0, y0` = Bottom-left corner
- `x1, y1` = Top-right corner
- PDF coordinates start at bottom-left (0,0)
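
Many viewers and image libraries place the origin at the top-left instead. A minimal sketch of the conversion, assuming a hypothetical helper name `to_top_left_rect` and coordinates in PDF points:

```python
def to_top_left_rect(bbox, page_height):
    """Convert a PDF bbox (origin bottom-left) to (left, top, width, height)
    measured from the top-left corner of the page."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1 - x0, y1 - y0)

rect = to_top_left_rect((54.0, 700.0, 300.0, 720.0), page_height=792.0)
print(rect)
```

The conversion needs the page height, which is one reason to store page dimensions next to each bbox.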


In [None]:
CREATE OR REPLACE FUNCTION pdf_txt_mapper_v3(scoped_file_url string)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.12'
ARTIFACT_REPOSITORY = snowflake.snowpark.pypi_shared_repository
PACKAGES = ('snowflake-snowpark-python', 'pdfminer')
HANDLER = 'main'
AS
$$
import json
from snowflake.snowpark.files import SnowflakeFile
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator

def main(scoped_file_url):
    finding = []
    with SnowflakeFile.open(scoped_file_url, 'rb') as f:
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        pages = PDFPage.get_pages(f)
        
        # Track page numbers
        for page_num, page in enumerate(pages, start=1):
            interpreter.process_page(page)
            layout = device.get_result()
            
            # Get page dimensions
            page_width = layout.width
            page_height = layout.height
            
            for lobj in layout:
                if isinstance(lobj, LTTextBox):
                    # NEW: Capture the full bounding box (bottom-left and top-right corners)
                    x0, y0, x1, y1 = lobj.bbox
                    text = lobj.get_text()
                    
                    finding.append({
                        'page': page_num,
                        'bbox': [x0, y0, x1, y1],  # Full rectangle!
                        'page_width': page_width,
                        'page_height': page_height,
                        'txt': text
                    })
    
    return json.dumps(finding)
$$;


In [None]:
-- Verify the enhanced function was created
SHOW FUNCTIONS LIKE 'pdf_txt_mapper_v3';


## Step 3: Test Enhanced UDF

Let's test the new UDF to verify it captures full bounding boxes.


In [None]:
-- Test the enhanced UDF - should now include full bounding boxes
SELECT pdf_txt_mapper_v3(build_scoped_file_url(@PDF_STAGE, 'Prot_000.pdf')) AS extracted_data_with_bbox;


## Step 4: Clear Old Data and Load with Full Bbox

We'll truncate the table and reload with the enhanced data including full bounding boxes.


In [None]:
-- Clear existing data (optional - comment out if you want to keep Phase 1 data)
TRUNCATE TABLE document_chunks;


In [None]:
-- Parse JSON and insert with full bounding box data
INSERT INTO document_chunks (
    chunk_id, doc_name, page, 
    x, y,  -- Keep old columns for backward compatibility
    bbox_x0, bbox_y0, bbox_x1, bbox_y1,  -- New: Full bbox
    page_width, page_height,              -- New: Page dimensions
    text
)
SELECT 
    'Prot_000_p' || value:page || '_c' || ROW_NUMBER() OVER (ORDER BY value:page, value:bbox[0], value:bbox[1]) AS chunk_id,
    'Prot_000.pdf' AS doc_name,
    value:page::INTEGER AS page,
    value:bbox[0]::FLOAT AS x,          -- Top-left x (for compatibility)
    value:bbox[3]::FLOAT AS y,          -- Top-left y (for compatibility)
    value:bbox[0]::FLOAT AS bbox_x0,    -- Bottom-left x
    value:bbox[1]::FLOAT AS bbox_y0,    -- Bottom-left y
    value:bbox[2]::FLOAT AS bbox_x1,    -- Top-right x
    value:bbox[3]::FLOAT AS bbox_y1,    -- Top-right y
    value:page_width::FLOAT AS page_width,
    value:page_height::FLOAT AS page_height,
    value:txt::VARCHAR AS text
FROM (
    SELECT PARSE_JSON(pdf_txt_mapper_v3(build_scoped_file_url(@PDF_STAGE, 'Prot_000.pdf'))) AS parsed_data
),
LATERAL FLATTEN(input => parsed_data) AS f;


## Step 5: Query with Bounding Box Data

Now we can use the full bounding box information for advanced queries.


In [None]:
-- Calculate text box dimensions
SELECT 
    chunk_id,
    page,
    (bbox_x1 - bbox_x0) AS width,
    (bbox_y1 - bbox_y0) AS height,
    SUBSTR(text, 1, 50) AS text_preview
FROM document_chunks
ORDER BY height DESC
LIMIT 10;


In [None]:
-- Calculate relative positions (useful for detecting headers)
SELECT 
    chunk_id,
    page,
    ROUND((bbox_x0 / page_width) * 100, 1) AS left_percent,
    ROUND((bbox_y0 / page_height) * 100, 1) AS bottom_percent,
    SUBSTR(text, 1, 50) AS text_preview
FROM document_chunks
WHERE (bbox_y0 / page_height) > 0.8  -- Top 20% of page (likely headers)
ORDER BY page
LIMIT 10;


In [None]:
-- Detect multi-column layouts
-- (COLUMN is a reserved word in Snowflake, so the alias is column_side)
SELECT 
    page,
    CASE 
        WHEN bbox_x0 < page_width/2 THEN 'LEFT_COLUMN'
        ELSE 'RIGHT_COLUMN'
    END AS column_side,
    COUNT(*) AS text_boxes
FROM document_chunks
GROUP BY page, column_side
ORDER BY page;
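
The midpoint test above labels a full-width paragraph as left-column text because only its left edge is checked. A hedged variant, sketched in Python with a hypothetical `classify_column` helper, treats any box that straddles the page midline as full width. This is a heuristic for illustration, not part of the notebook's queries:

```python
def classify_column(x0, x1, page_width):
    """Classify a text box by where it sits relative to the page midline."""
    mid = page_width / 2
    if x0 < mid < x1:
        return 'FULL_WIDTH'    # box crosses the midline
    return 'LEFT_COLUMN' if x1 <= mid else 'RIGHT_COLUMN'

labels = [
    classify_column(54.0, 290.0, 612.0),   # left half only
    classify_column(320.0, 558.0, 612.0),  # right half only
    classify_column(54.0, 558.0, 612.0),   # spans both halves
]
print(labels)
```

The same rule could be expressed in SQL with an extra `WHEN` branch that compares both `bbox_x0` and `bbox_x1` to `page_width/2`.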


In [None]:
-- Get citations with full bbox for visual highlighting
SELECT 
    chunk_id,
    page,
    bbox_x0,
    bbox_y0,
    bbox_x1,
    bbox_y1,
    SUBSTR(text, 1, 100) AS text_preview
FROM document_chunks
WHERE text ILIKE '%medication%'
ORDER BY page
LIMIT 5;


## Phase 2 Summary

### ✅ What We've Accomplished:
1. Added full bounding box columns to `document_chunks` table
2. Created enhanced UDF (`pdf_txt_mapper_v3`) that captures complete rectangles
3. Loaded data with full bbox coordinates (x0, y0, x1, y1)
4. Added page dimensions (width, height)
5. Demonstrated advanced queries using bbox data

### 📊 New Capabilities:
```sql
-- Calculate text dimensions
SELECT (bbox_x1 - bbox_x0) AS width, (bbox_y1 - bbox_y0) AS height FROM document_chunks;

-- Find headers (top of page)
SELECT * FROM document_chunks WHERE (bbox_y0 / page_height) > 0.8;

-- Detect columns
SELECT CASE WHEN bbox_x0 < page_width/2 THEN 'LEFT' ELSE 'RIGHT' END AS column_side FROM document_chunks;
```

### 🎯 What This Enables:
- ✅ **Visual highlighting** in PDF viewers (Streamlit app!)
- ✅ **Text dimensions** for header detection
- ✅ **Relative positioning** for layout analysis
- ✅ **Column detection** for multi-column documents
- ✅ **Precise citations** with exact rectangles

### 💡 Use with Streamlit App:
The `streamlit_pdf_viewer.py` app can now:
1. Query chunks with full bbox data
2. Draw highlight rectangles on PDF pages
3. Show exact location visually
4. Enable "click to highlight" functionality
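
The app's code is not shown in this notebook, so the following is only a sketch of the coordinate math such a viewer would need: scaling a stored bbox from PDF points to pixels on a rendered page image and flipping the y-axis. The function name and the assumption that the page is rendered at a uniform scale are both illustrative:

```python
def bbox_to_pixels(bbox, page_width, page_height, image_width):
    """Map a PDF bbox to (left, top, right, bottom) pixel coordinates
    on an image of the page rendered image_width pixels wide."""
    scale = image_width / page_width          # pixels per PDF point
    x0, y0, x1, y1 = bbox
    return (
        round(x0 * scale),
        round((page_height - y1) * scale),    # flip: image origin is top-left
        round(x1 * scale),
        round((page_height - y0) * scale),
    )

pixels = bbox_to_pixels((54.0, 700.0, 300.0, 720.0), 612.0, 792.0, image_width=1224)
print(pixels)
```

The inputs map directly to the `bbox_x0`, `bbox_y0`, `bbox_x1`, `bbox_y1`, `page_width`, and `page_height` columns of `document_chunks`.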

### ⚠️ Still Missing (Future Phases):
- ❌ Font information (size, bold/italic) → Phase 3
- ❌ Section detection (headers, hierarchy) → Phase 4
- ❌ Smart chunking (semantic boundaries) → Phase 5
- ❌ LLM integration with citations → Phase 6

---

## Next Steps: Phase 3
In Phase 3, we'll extract **font information** (name, size, bold/italic) to automatically detect headers and section boundaries.
