# 📝 Digitize Notes Pipeline - Complete Processing Notebook

This notebook demonstrates the full end-to-end pipeline for digitizing handwritten and printed notes.

## Pipeline Steps:
1. **Image Loading & Preprocessing** - Load, detect document corners, perspective correct
2. **Text Detection & Layout Analysis** - Find text regions, lines, paragraphs
3. **OCR/HTR Recognition** - Extract text with bounding boxes and confidence
4. **Text Chunking** - Split into semantic chunks for embedding
5. **Embedding Generation** - Create vector embeddings for search
6. **Document Assembly & Export** - Assemble page metadata, assess OCR quality, and save results to disk (vector database indexing is a next step)

## Prerequisites:
```bash
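pip install opencv-python numpy pillow pytesseract transformers sentence-transformers
pip install pdf2image qdrant-client easyocr matplotlib scikit-learn
# pytesseract is only a wrapper: the Tesseract OCR binary must also be installed on the system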
```

In [None]:
# Standard imports
import sys
import os

# Add project root to path
project_root = os.path.abspath(os.path.join(os.getcwd(), '../..'))
sys.path.insert(0, project_root)
sys.path.insert(0, os.path.join(project_root, 'backend/ai-service'))

import cv2
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from PIL import Image
import json
from dataclasses import dataclass, asdict
from typing import List, Dict, Any, Tuple, Optional
import uuid
import time

# Set display options
%matplotlib inline
plt.rcParams['figure.figsize'] = [15, 10]

print(f"✅ Project root: {project_root}")

---
## 1️⃣ Image Loading & Preprocessing

Load a sample image and apply our enhancement pipeline:
- Document corner detection
- Perspective correction
- Deskewing
- Noise reduction
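
Perspective correction needs the four detected corners in a consistent order before they can be mapped onto the output rectangle. The block below is a minimal sketch of one common ordering heuristic. `order_corners` is a hypothetical helper for illustration, not part of `ImageEnhancer`, and it assumes the page is not rotated by close to 45°.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def order_corners(points: List[Point]) -> List[Point]:
    """Return corners as [top-left, top-right, bottom-right, bottom-left].

    Image coordinates: x grows rightward, y grows downward.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four corner points")
    # Top-left has the smallest x + y, bottom-right the largest.
    top_left = min(points, key=lambda p: p[0] + p[1])
    bottom_right = max(points, key=lambda p: p[0] + p[1])
    # Top-right has the largest x - y, bottom-left the smallest.
    top_right = max(points, key=lambda p: p[0] - p[1])
    bottom_left = min(points, key=lambda p: p[0] - p[1])
    return [top_left, top_right, bottom_right, bottom_left]

# Corners of a slightly skewed 600x800 page, given in arbitrary order.
corners = [(590, 780), (10, 20), (20, 790), (570, 10)]
ordered = order_corners(corners)
```

With the corners in this order, they can be paired one-to-one with the destination rectangle passed to `cv2.getPerspectiveTransform`.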

In [None]:
# Import our image enhancer
from app.services.image_enhancer import ImageEnhancer, A4_SIZE

# Initialize enhancer
enhancer = ImageEnhancer(
    output_size=A4_SIZE,
    target_brightness=200.0,
    denoise_strength=8,
    sharpen_strength=1.2
)

print("✅ ImageEnhancer initialized")
print(f"   Output size: {A4_SIZE[0]}x{A4_SIZE[1]} (300 DPI A4)")

In [None]:
# Load a sample image (update path to your test image)
# You can use any image of handwritten notes
sample_image_path = "../sample_notes.jpg"  # UPDATE THIS PATH

# Create a synthetic test image if no sample exists
if not os.path.exists(sample_image_path):
    print("⚠️ No sample image found. Creating synthetic test image...")
    
    # Create a white page with some text-like lines
    test_img = np.ones((800, 600, 3), dtype=np.uint8) * 255
    
    # Add some "handwritten" lines (dark strokes)
    for y in range(100, 700, 50):
        x_start = 50 + np.random.randint(-10, 10)
        x_end = 550 + np.random.randint(-30, 30)
        thickness = np.random.randint(1, 3)
        cv2.line(test_img, (x_start, y), (x_end, y), (30, 30, 30), thickness)
        
        # Add some variation (simulate handwriting)
        for _ in range(3):
            cx = np.random.randint(x_start, x_end)
            cy = y + np.random.randint(-5, 5)
            cv2.circle(test_img, (cx, cy), 2, (30, 30, 30), -1)
    
    # Add some perspective distortion
    h, w = test_img.shape[:2]
    pts1 = np.float32([[0, 0], [w, 0], [0, h], [w, h]])
    pts2 = np.float32([[10, 20], [w-30, 10], [20, h-10], [w-10, h-20]])
    M = cv2.getPerspectiveTransform(pts1, pts2)
    test_img = cv2.warpPerspective(test_img, M, (w, h))
    
    image = test_img
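    print("✅ Created synthetic test image")
else:
    image = cv2.imread(sample_image_path)
    # cv2.imread returns None (instead of raising) when the file cannot be read or decoded
    if image is None:
        raise ValueError(f"Could not read image: {sample_image_path}")
    print(f"✅ Loaded image from: {sample_image_path}")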

print(f"   Shape: {image.shape}")

# Display original
plt.figure(figsize=(10, 8))
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.title("Original Image")
plt.axis('off')
plt.show()

In [None]:
# Apply full enhancement pipeline
result = enhancer.enhance(
    image,
    crop_document=True,
    apply_binarization=False  # Set True for pure B/W
)

print("📊 Enhancement Results:")
print(f"   Corners detected: {result.corners_detected}")
print(f"   Rotation angle: {result.rotation_angle:.2f}°")
print(f"\n   Metrics BEFORE:")
print(f"     Brightness: {result.metrics_before.brightness:.1f}")
print(f"     Contrast: {result.metrics_before.contrast:.1f}")
print(f"     Sharpness: {result.metrics_before.sharpness:.1f}")
print(f"\n   Metrics AFTER:")
print(f"     Brightness: {result.metrics_after.brightness:.1f}")
print(f"     Contrast: {result.metrics_after.contrast:.1f}")
print(f"     Sharpness: {result.metrics_after.sharpness:.1f}")

# Display comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 8))
axes[0].imshow(cv2.cvtColor(result.original, cv2.COLOR_BGR2RGB))
axes[0].set_title("Original")
axes[0].axis('off')

axes[1].imshow(cv2.cvtColor(result.enhanced, cv2.COLOR_BGR2RGB))
axes[1].set_title("Enhanced")
axes[1].axis('off')

axes[2].imshow(cv2.cvtColor(result.thumbnail, cv2.COLOR_BGR2RGB))
axes[2].set_title("Thumbnail")
axes[2].axis('off')

plt.tight_layout()
plt.show()

---
## 2️⃣ Text Detection & Layout Analysis

Detect text regions and extract line-level bounding boxes.
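
A simple way to see what line detection does is a horizontal projection profile: count ink pixels per row and group consecutive non-empty rows into lines. The sketch below is an illustrative alternative to morphological dilation, and it works on a plain nested list rather than an OpenCV image. `find_line_bands` and `min_ink` are hypothetical names.

```python
from typing import List, Tuple

def find_line_bands(binary: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (y_start, y_end_exclusive) for each run of rows containing ink.

    `binary` is a 2D grid where 1 marks an ink pixel and 0 marks background.
    """
    bands = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                 # a new text line begins on this row
        elif not has_ink and start is not None:
            bands.append((start, y))  # the line ended on the previous row
            start = None
    if start is not None:             # a line touching the bottom edge
        bands.append((start, len(binary)))
    return bands

page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
bands = find_line_bands(page)
```

Projection profiles assume roughly horizontal text, which is one reason deskewing happens before detection.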

In [None]:
@dataclass
class TextLine:
    """Detected text line with bounding box"""
    bbox: Tuple[int, int, int, int]  # x, y, w, h
    confidence: float = 0.0
    text: str = ""
    line_index: int = 0

@dataclass
class TextBlock:
    """A block of text (paragraph or section)"""
    lines: List[TextLine]
    bbox: Tuple[int, int, int, int]
    block_type: str = "paragraph"  # paragraph, header, footer, table

def detect_text_lines(image: np.ndarray, min_area: int = 100) -> List[TextLine]:
    """
    Detect text lines using morphological operations
    
    Returns list of TextLine with bounding boxes
    """
    # Convert to grayscale
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image.copy()
    
    # Threshold to get binary image
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    
    # Dilate horizontally to connect characters in a line
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 2))
    dilated = cv2.dilate(binary, kernel, iterations=2)
    
    # Find contours
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    lines = []
    for i, cnt in enumerate(contours):
        x, y, w, h = cv2.boundingRect(cnt)
        area = w * h
        
        # Filter by area
        if area < min_area:
            continue
        
        # Filter by aspect ratio (lines are wider than tall)
        if h > w * 0.5:  # Skip very tall/square regions
            continue
            
        lines.append(TextLine(
            bbox=(x, y, w, h),
            line_index=i
        ))
    
    # Sort by y position (top to bottom)
    lines.sort(key=lambda l: l.bbox[1])
    
    # Update line indices
    for i, line in enumerate(lines):
        line.line_index = i
    
    return lines

# Detect lines in enhanced image
enhanced_img = result.enhanced
detected_lines = detect_text_lines(enhanced_img)

print(f"✅ Detected {len(detected_lines)} text lines")
for i, line in enumerate(detected_lines[:5]):
    x, y, w, h = line.bbox
    print(f"   Line {i}: bbox=({x}, {y}, {w}x{h})")

In [None]:
# Visualize detected lines
def visualize_lines(image: np.ndarray, lines: List[TextLine], title: str = "Detected Lines"):
    """Draw bounding boxes around detected text lines"""
    vis = image.copy()
    
    for i, line in enumerate(lines):
        x, y, w, h = line.bbox
        
        # Color based on confidence, in BGR order (green=high, yellow=medium, orange=low)
        if line.confidence > 0.8:
            color = (0, 255, 0)  # Green
        elif line.confidence > 0.5:
            color = (0, 255, 255)  # Yellow
        else:
            color = (0, 165, 255)  # Orange
        
        cv2.rectangle(vis, (x, y), (x+w, y+h), color, 2)
        cv2.putText(vis, str(i), (x, y-5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
    
    plt.figure(figsize=(12, 10))
    plt.imshow(cv2.cvtColor(vis, cv2.COLOR_BGR2RGB))
    plt.title(f"{title} ({len(lines)} lines)")
    plt.axis('off')
    plt.show()

visualize_lines(enhanced_img, detected_lines)

---
## 3️⃣ OCR/HTR Recognition

Extract text from detected lines using our OCR service (TrOCR + Tesseract fallback).
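
The fallback idea can be sketched independently of any OCR backend: try the primary recognizer, and switch to the secondary one if it raises or returns low confidence. Everything in the block below is hypothetical (`recognize_with_fallback`, `min_confidence`, and the stub recognizers). It does not describe how `OCRService` is implemented.

```python
from typing import Callable, Tuple

# A recognizer takes image bytes and returns (text, confidence in [0, 1]).
Recognizer = Callable[[bytes], Tuple[str, float]]

def recognize_with_fallback(
    image: bytes,
    primary: Recognizer,
    fallback: Recognizer,
    min_confidence: float = 0.5,
) -> Tuple[str, float, str]:
    """Return (text, confidence, source), where source is 'primary' or 'fallback'."""
    try:
        text, conf = primary(image)
        if conf >= min_confidence:
            return text, conf, "primary"
    except Exception:
        pass  # e.g. network error or missing API key; fall through to the fallback
    text, conf = fallback(image)
    return text, conf, "fallback"

# Stub recognizers standing in for a remote model and a local engine.
def flaky_remote(image: bytes) -> Tuple[str, float]:
    raise ConnectionError("API unavailable")

def local_engine(image: bytes) -> Tuple[str, float]:
    return "hello world", 0.72

result_text, result_conf, source = recognize_with_fallback(b"", flaky_remote, local_engine)
```

Returning the source alongside the text makes it possible to report which model produced each result.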

In [None]:
# Import OCR service
from app.services.ocr_service import OCRService, OCRResult

# Initialize OCR (will use HF API if key available, otherwise Tesseract)
ocr = OCRService(
    use_api=True,  # Try HF API first
    fallback_to_local=True,
    line_height_threshold=50
)

print("✅ OCR Service initialized")
print(f"   HF API available: {ocr.hf_api_available}")
print(f"   Tesseract available: {ocr.tesseract_available}")

In [None]:
# Extract text from whole image
start_time = time.time()
ocr_result = ocr.extract_text(enhanced_img, preprocess=True)
elapsed = time.time() - start_time

print(f"📄 OCR Results:")
print(f"   Processing time: {elapsed:.2f}s")
print(f"   Model used: {ocr_result.model_used}")
print(f"   Confidence: {ocr_result.confidence:.2%}")
print(f"   Lines detected: {len(ocr_result.lines)}")
print(f"\n📝 Extracted Text:")
print("-" * 50)
print(ocr_result.text[:500])  # slicing already handles text shorter than 500 chars
print("-" * 50)

In [None]:
# Extract text per line for more detailed results
@dataclass
class LineOCRResult:
    """OCR result for a single line"""
    line_index: int
    bbox: Tuple[int, int, int, int]
    text: str
    confidence: float
    model_used: str

def ocr_lines(image: np.ndarray, lines: List[TextLine], ocr_service: OCRService) -> List[LineOCRResult]:
    """Run OCR on each detected line"""
    results = []
    
    for line in lines:
        x, y, w, h = line.bbox
        
        # Crop line region with some padding
        pad = 5
        y1 = max(0, y - pad)
        y2 = min(image.shape[0], y + h + pad)
        x1 = max(0, x - pad)
        x2 = min(image.shape[1], x + w + pad)
        
        line_img = image[y1:y2, x1:x2]
        
        if line_img.size == 0:
            continue
        
        # OCR the line
        try:
            result = ocr_service.extract_text(line_img, preprocess=False)
            results.append(LineOCRResult(
                line_index=line.line_index,
                bbox=line.bbox,
                text=result.text.strip(),
                confidence=result.confidence,
                model_used=result.model_used
            ))
        except Exception as e:
            print(f"   ⚠️ Error on line {line.line_index}: {e}")
    
    return results

# Run per-line OCR (limit to first 5 lines for demo)
print("🔍 Running per-line OCR...")
line_results = ocr_lines(enhanced_img, detected_lines[:5], ocr)

print(f"\n📋 Per-Line Results:")
for lr in line_results:
    conf_emoji = "🟢" if lr.confidence > 0.8 else ("🟡" if lr.confidence > 0.5 else "🔴")
    print(f"   Line {lr.line_index}: {conf_emoji} [{lr.confidence:.0%}] '{lr.text[:50]}...' ({lr.model_used})")

---
## 4️⃣ Text Chunking

Split extracted text into overlapping chunks suitable for embedding.
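
Overlapping chunking can be illustrated with a sliding character window: each chunk starts `chunk_size - overlap` characters after the previous one. This is a minimal sketch under that assumption. `NotesEmbeddingService.chunk_text` may split on sentence or word boundaries instead, and `sliding_chunks` is a hypothetical name.

```python
from typing import List, Tuple

def sliding_chunks(text: str, chunk_size: int = 400, overlap: int = 100) -> List[Tuple[int, int, str]]:
    """Return (start_char, end_char, chunk_text) triples covering `text`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    pieces = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        pieces.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window already reaches the end of the text
        start += step
    return pieces

sample = "x" * 1000
spans = [(s, e) for s, e, _ in sliding_chunks(sample)]
```

With the defaults, 1000 characters yield three chunks, and each neighboring pair shares 100 characters so that a sentence cut at a boundary still appears whole in one of them.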

In [None]:
# Import notes embedding service (for chunking)
from app.services.notes_embedding import NotesEmbeddingService, TextChunk

# Initialize (skip Qdrant connection for now)
embedding_service = NotesEmbeddingService(
    chunk_size=400,
    chunk_overlap=100
)

print("✅ Embedding service initialized")
print(f"   Chunk size: 400 chars")
print(f"   Overlap: 100 chars")

In [None]:
# Chunk the extracted text
doc_id = f"doc_{uuid.uuid4().hex[:8]}"
page_id = f"{doc_id}_p01"
job_id = f"job_{uuid.uuid4().hex[:8]}"

chunks = embedding_service.chunk_text(
    text=ocr_result.text,
    page_id=page_id,
    job_id=job_id,
    metadata={
        "document_id": doc_id,
        "ocr_confidence": ocr_result.confidence,
        "model_used": ocr_result.model_used
    }
)

print(f"✅ Created {len(chunks)} chunks from {len(ocr_result.text)} characters")
print(f"\n📦 Chunk Details:")
for i, chunk in enumerate(chunks[:3]):
    print(f"\n   Chunk {i}:")
    print(f"     ID: {chunk.chunk_index}")
    print(f"     Chars: {chunk.start_char}-{chunk.end_char}")
    print(f"     Text: '{chunk.text[:100]}...'")

---
## 5️⃣ Embedding Generation

Generate vector embeddings for semantic search.
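
Semantic search compares embeddings by cosine similarity: the dot product of two vectors divided by the product of their lengths. The block below is a small pure-Python sketch with toy 3-dimensional vectors, whereas real sentence embeddings typically have several hundred dimensions. `cosine` is a hypothetical helper name.

```python
import math
from typing import Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])       # unrelated vectors
opposite = cosine([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])        # opposing vectors
```

The score depends only on direction, not magnitude, and ranges from -1 to 1.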

In [None]:
# Generate embeddings for chunks
texts = [chunk.text for chunk in chunks]

start_time = time.time()
embeddings = embedding_service.generate_embeddings(texts)
elapsed = time.time() - start_time

print(f"✅ Generated {len(embeddings)} embeddings")
print(f"   Time: {elapsed:.2f}s")
# Use len() rather than truthiness: a multi-element NumPy array raises on `if embeddings`
print(f"   Embedding dimension: {len(embeddings[0]) if len(embeddings) > 0 else 0}")

# Show embedding stats
if len(embeddings) > 0:
    emb_array = np.array(embeddings)
    print(f"\n📊 Embedding Statistics:")
    print(f"   Mean: {emb_array.mean():.4f}")
    print(f"   Std: {emb_array.std():.4f}")
    print(f"   Min: {emb_array.min():.4f}")
    print(f"   Max: {emb_array.max():.4f}")

In [None]:
# Visualize embedding similarity matrix
if len(embeddings) > 1:
    from sklearn.metrics.pairwise import cosine_similarity
    
    sim_matrix = cosine_similarity(embeddings)
    
    plt.figure(figsize=(8, 6))
    plt.imshow(sim_matrix, cmap='viridis', vmin=0, vmax=1)
    plt.colorbar(label='Cosine Similarity')
    plt.title('Chunk Embedding Similarity Matrix')
    plt.xlabel('Chunk Index')
    plt.ylabel('Chunk Index')
    plt.show()

---
## 6️⃣ Complete Pipeline Summary

Assemble the full document metadata structure.

In [None]:
# Build complete document structure
document = {
    "document_id": doc_id,
    "job_id": job_id,
    "status": "processed",
    "processing_time_sec": 0,
    "pages": [
        {
            "page_id": page_id,
            "page_index": 0,
            "width": enhanced_img.shape[1],
            "height": enhanced_img.shape[0],
            "ocr_confidence_avg": ocr_result.confidence,
            "lines": [
                {
                    "line_index": lr.line_index,
                    "bbox": lr.bbox,
                    "text": lr.text,
                    "confidence": lr.confidence
                }
                for lr in line_results
            ],
            "chunks": [
                {
                    "chunk_id": f"{page_id}_c{chunk.chunk_index:04d}",
                    "chunk_index": chunk.chunk_index,
                    "text": chunk.text,
                    "char_range": [chunk.start_char, chunk.end_char],
                    "embedding_dim": len(embeddings[i]) if i < len(embeddings) else 0
                }
                for i, chunk in enumerate(chunks)
            ]
        }
    ]
}

print("📄 Complete Document Structure:")
print(json.dumps(document, indent=2, default=str)[:2000])

---
## 🔍 Bonus: Semantic Search Demo

The demo below ranks chunks in memory with cosine similarity, so it does not need a running Qdrant instance.
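
Top-k retrieval reduces to scoring every chunk against the query and keeping the best few. The block below is a minimal sketch using precomputed scores and `heapq.nlargest`. The scores and chunk texts are made-up illustration data, and `top_k` is a hypothetical helper.

```python
import heapq
from typing import List, Sequence, Tuple

def top_k(scores: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return up to k (index, score) pairs, highest score first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

chunk_texts = ["intro", "photosynthesis notes", "grocery list", "cell biology"]
scores = [0.12, 0.91, 0.05, 0.67]  # hypothetical similarities to a biology query

ranked = top_k(scores, k=2)
best_texts = [chunk_texts[i] for i, _ in ranked]
```

A vector database performs the same ranking, but with an index that avoids scoring every stored vector.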

In [None]:
# Semantic search (in-memory; no Qdrant required)
def demo_semantic_search(query: str, embeddings: List, chunks: List[TextChunk], top_k: int = 3):
    """Simple in-memory semantic search demo"""
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Generate query embedding
    query_emb = embedding_service.generate_embeddings([query])
    
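    # Explicit length checks: `not array` raises for multi-element NumPy arrays
    if query_emb is None or len(query_emb) == 0 or embeddings is None or len(embeddings) == 0: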
        print("No embeddings available")
        return
    
    # Calculate similarities
    sims = cosine_similarity(query_emb, embeddings)[0]
    
    # Get top-k
    top_indices = np.argsort(sims)[::-1][:top_k]
    
    print(f"🔍 Query: '{query}'")
    print(f"\n📋 Top {top_k} Results:")
    for rank, idx in enumerate(top_indices):
        print(f"\n   {rank+1}. Score: {sims[idx]:.4f}")
        print(f"      Text: '{chunks[idx].text[:100]}...'")

# Demo search
if chunks and len(embeddings) > 0:
    demo_semantic_search("what is the main topic", embeddings, chunks)

---
## 📊 Quality Metrics & Thresholds

Evaluate OCR quality and flag pages needing review.
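
The review rule used in this section flags a page when the average confidence falls below the threshold or when more than 30% of lines do. A worked example with made-up confidences shows how a page with a good average can still be flagged. `needs_review` here is a standalone restatement of that rule for illustration, and `max_low_fraction` is a hypothetical parameter name.

```python
from statistics import mean
from typing import Sequence

def needs_review(confidences: Sequence[float], threshold: float = 0.7,
                 max_low_fraction: float = 0.3) -> bool:
    """True if the page should go to human review."""
    if not confidences:
        return True  # nothing recognized: always review
    low = sum(1 for c in confidences if c < threshold)
    return mean(confidences) < threshold or low > len(confidences) * max_low_fraction

# Average is 0.80, but 2 of 5 lines (40%) are below 0.7.
mixed_page = [0.99, 0.98, 0.97, 0.55, 0.51]
# Average is 0.85 and only 1 of 5 lines (20%) is below 0.7.
clean_page = [0.95, 0.92, 0.90, 0.88, 0.60]

flag_mixed = needs_review(mixed_page)
flag_clean = needs_review(clean_page)
```

The second condition catches pages where a few very confident lines mask several unreadable ones.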

In [None]:
@dataclass
class QualityReport:
    """Quality assessment for a processed page"""
    page_id: str
    avg_confidence: float
    min_confidence: float
    max_confidence: float
    lines_below_threshold: int
    needs_review: bool
    recommended_action: str

def assess_quality(
    line_results: List[LineOCRResult],
    page_id: str,
    confidence_threshold: float = 0.7
) -> QualityReport:
    """Assess OCR quality and determine if human review needed"""
    
    if not line_results:
        return QualityReport(
            page_id=page_id,
            avg_confidence=0.0,
            min_confidence=0.0,
            max_confidence=0.0,
            lines_below_threshold=0,
            needs_review=True,
            recommended_action="No text detected - manual review required"
        )
    
    confidences = [lr.confidence for lr in line_results]
    # Cast NumPy scalars to built-in types so asdict(report) stays JSON-serializable
    avg_conf = float(np.mean(confidences))
    min_conf = float(np.min(confidences))
    max_conf = float(np.max(confidences))
    lines_below = sum(1 for c in confidences if c < confidence_threshold)
    
    # Determine if review needed
    needs_review = bool(avg_conf < confidence_threshold or lines_below > len(confidences) * 0.3)
    
    # Recommend action
    if avg_conf >= 0.9:
        action = "High quality - ready for indexing"
    elif avg_conf >= 0.7:
        action = "Good quality - minor review recommended"
    elif avg_conf >= 0.5:
        action = "Medium quality - review low-confidence lines"
    else:
        action = "Low quality - consider reprocessing or manual transcription"
    
    return QualityReport(
        page_id=page_id,
        avg_confidence=avg_conf,
        min_confidence=min_conf,
        max_confidence=max_conf,
        lines_below_threshold=lines_below,
        needs_review=needs_review,
        recommended_action=action
    )

# Generate quality report
quality = assess_quality(line_results, page_id)

print("📊 Quality Report:")
print(f"   Page: {quality.page_id}")
print(f"   Avg Confidence: {quality.avg_confidence:.1%}")
print(f"   Range: {quality.min_confidence:.1%} - {quality.max_confidence:.1%}")
print(f"   Lines below threshold: {quality.lines_below_threshold}")
print(f"   Needs review: {'⚠️ YES' if quality.needs_review else '✅ NO'}")
print(f"   Recommendation: {quality.recommended_action}")

---
## 💾 Export Results

Save processed results for further analysis.
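
`json.dump` only accepts plain Python types, so dataclass reports are converted with `asdict` first. The block below is a small self-contained round-trip sketch, using a hypothetical `PageSummary` dataclass and an in-memory string instead of a file.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PageSummary:
    page_id: str
    avg_confidence: float
    needs_review: bool

summary = PageSummary(page_id="doc_demo_p01", avg_confidence=0.83, needs_review=False)

# asdict turns the dataclass into a plain dict that json can serialize.
encoded = json.dumps(asdict(summary), indent=2)

# Reading the JSON back gives a dict with the same fields and values.
decoded = json.loads(encoded)
restored = PageSummary(**decoded)
```

Fields holding NumPy scalars such as `numpy.bool_` are not accepted by `json` and need casting to built-in types first.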

In [None]:
# Save processed results
output_dir = Path("./output")
output_dir.mkdir(exist_ok=True)

# Save enhanced image
cv2.imwrite(str(output_dir / f"{page_id}_enhanced.png"), result.enhanced)
cv2.imwrite(str(output_dir / f"{page_id}_thumbnail.png"), result.thumbnail)

# Save document JSON
with open(output_dir / f"{doc_id}_metadata.json", 'w') as f:
    json.dump(document, f, indent=2, default=str)

# Save quality report
with open(output_dir / f"{page_id}_quality.json", 'w') as f:
    json.dump(asdict(quality), f, indent=2)

print(f"✅ Results saved to: {output_dir.absolute()}")
print(f"   - {page_id}_enhanced.png")
print(f"   - {page_id}_thumbnail.png")
print(f"   - {doc_id}_metadata.json")
print(f"   - {page_id}_quality.json")

---
## 🎯 Summary

This notebook demonstrated the complete Digitize Notes pipeline:

| Step | Component | Output |
|------|-----------|--------|
| 1 | Image Enhancement | Perspective-corrected, denoised image |
| 2 | Line Detection | Bounding boxes for text regions |
| 3 | OCR/HTR | Extracted text with confidence scores |
| 4 | Chunking | Semantic chunks for embedding |
| 5 | Embedding | Vector representations for search |
| 6 | Quality Assessment | Review flags and recommendations |

### Next Steps:
- Integrate with Qdrant for vector storage
- Add PDF splitting for multi-page documents
- Implement human correction endpoints
- Build UI for review workflow