# Fixed Handwriting Recognition Pipeline
## SOCAR Hackathon 2025 - Error-Free Version

This notebook fixes common errors:
- ✅ Robust error handling
- ✅ Better type checking
- ✅ Detailed error messages
- ✅ Image validation
- ✅ Graceful failure handling

## 1. Setup with Error Handling

In [1]:
# Install packages
!pip install -q kagglehub transformers torch torchvision pillow
!pip install -q matplotlib seaborn plotly pandas numpy scikit-learn tqdm
!pip install -q jiwer opencv-python scikit-image

print("✅ Installation complete!")

✅ Installation complete!


In [2]:
# Imports
import kagglehub
import os
import json
import traceback
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from collections import defaultdict, Counter
from tqdm.auto import tqdm
from dataclasses import dataclass, asdict
from typing import Dict, List, Optional, Tuple, Any
import warnings
warnings.filterwarnings('ignore')

# Image processing
from PIL import Image
import cv2
from skimage.filters import threshold_sauvola

# Deep Learning
import torch
from transformers import (
    VisionEncoderDecoderModel,
    TrOCRProcessor
)

# Metrics
from jiwer import cer, wer

# Visualization
import plotly.graph_objects as go
from plotly.subplots import make_subplots

print(f"✅ PyTorch: {torch.__version__}")
print(f"✅ Device: {'GPU (' + torch.cuda.get_device_name(0) + ')' if torch.cuda.is_available() else 'CPU'}")

✅ PyTorch: 2.9.0+cu126
✅ Device: CPU


## 2. Fixed Image Preprocessing with Validation

In [3]:
class SafeImageProcessor:
    """
    Robust image preprocessing with comprehensive error handling.
    """
    
    @staticmethod
    def validate_image(image):
        """
        Validate image before processing.
        """
        if image is None:
            raise ValueError("Image is None")
        
        if isinstance(image, np.ndarray):
            if image.size == 0:
                raise ValueError("Image array is empty")
            if image.dtype == np.object_:
                raise ValueError("Image has object dtype, cannot process")
        
        return True
    
    @staticmethod
    def load_image_safely(image_path):
        """
        Safely load an image as a uint8 array in OpenCV's BGR channel order.
        """
        try:
            # Try with OpenCV first (returns BGR, or None if decoding fails)
            image = cv2.imread(str(image_path))
            
            if image is None:
                # Fallback to PIL; flip RGB -> BGR so both loaders agree
                with Image.open(image_path) as pil_image:
                    image = np.array(pil_image.convert('RGB'))[:, :, ::-1].copy()
            
            # Validate (rejects None, empty, and object-dtype arrays)
            SafeImageProcessor.validate_image(image)
            
            # Convert to uint8 if needed
            if image.dtype != np.uint8:
                if np.issubdtype(image.dtype, np.floating):
                    # Assumes float images are scaled to [0, 1]
                    image = (np.clip(image, 0.0, 1.0) * 255).astype(np.uint8)
                else:
                    # Clip so wider integer types do not wrap around
                    image = np.clip(image, 0, 255).astype(np.uint8)
            
            return image
            
        except Exception as e:
            raise ValueError(f"Error loading image {image_path}: {str(e)}") from e
    
    @staticmethod
    def preprocess_image(image_path):
        """
        Preprocess image with comprehensive error handling.
        """
        try:
            # Load image safely
            image = SafeImageProcessor.load_image_safely(image_path)
            
            # Store original
            original = image.copy()
            
            # Convert to grayscale
            if len(image.shape) == 3:
                gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
            else:
                gray = image.copy()
            
            # Ensure proper dtype
            if gray.dtype != np.uint8:
                gray = gray.astype(np.uint8)
            
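            # Binarize with error handling
            try:
                # Sauvola needs an odd window size; an even one raises ValueError
                window_size = min(25, min(gray.shape) // 2)
                if window_size % 2 == 0:
                    window_size -= 1
                window_size = max(window_size, 3)
                thresh = threshold_sauvola(gray, window_size=window_size)
                binary = (gray > thresh).astype(np.uint8) * 255
            except Exception as e:
                print(f"  ⚠️ Sauvola failed, using Otsu: {e}")
                _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)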
            
            return {
                'original': original,
                'gray': gray,
                'binary': binary,
                'shape': image.shape,
                'success': True
            }
            
        except Exception as e:
            print(f"  ❌ Preprocessing error: {str(e)}")
            return {
                'original': None,
                'gray': None,
                'binary': None,
                'shape': None,
                'success': False,
                'error': str(e)
            }

print("✅ Safe image processor implemented")

✅ Safe image processor implemented


## 3. Fixed OCR Pipeline with Robust Error Handling

In [4]:
@dataclass
class ProcessingResult:
    """
    Result with success/failure tracking.
    """
    image_path: str
    success: bool
    raw_text: Optional[str] = None
    confidence: Optional[float] = None
    extracted_fields: Optional[Dict] = None
    error_message: Optional[str] = None
    error_type: Optional[str] = None
    
    def to_dict(self):
        return asdict(self)

class RobustOCRPipeline:
    """
    OCR pipeline with comprehensive error handling and recovery.
    """
    
    def __init__(self, model_name="microsoft/trocr-base-handwritten"):
        print("🚀 Initializing Robust OCR Pipeline...")
        
        self.processor_helper = SafeImageProcessor()
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        
        try:
            print(f"📦 Loading TrOCR: {model_name}")
            self.processor = TrOCRProcessor.from_pretrained(model_name)
            self.model = VisionEncoderDecoderModel.from_pretrained(model_name)
            self.model.to(self.device)
            self.model.eval()
            print(f"✅ Model loaded on {self.device}")
        except Exception as e:
            print(f"❌ Failed to load model: {e}")
            raise
        
        # Statistics
        self.stats = {
            'processed': 0,
            'successful': 0,
            'failed': 0,
            'errors': defaultdict(int)
        }
    
    def recognize_text_safe(self, image):
        """
        Safely recognize text with error handling.
        """
        try:
            # Validate input
            if image is None:
                raise ValueError("Input image is None")
            
            # Convert to PIL Image
            if isinstance(image, np.ndarray):
                # Ensure proper dtype
                if image.dtype == np.object_:
                    raise ValueError("Image has object dtype")
                
                if image.dtype != np.uint8:
                    if image.dtype in [np.float32, np.float64]:
                        image = (image * 255).astype(np.uint8)
                    else:
                        image = image.astype(np.uint8)
                
                # Convert to PIL
                if len(image.shape) == 2:
                    pil_image = Image.fromarray(image, mode='L')
                elif len(image.shape) == 3:
                    if image.shape[2] == 3:
                        pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
                    else:
                        pil_image = Image.fromarray(image[:,:,0], mode='L')
                else:
                    raise ValueError(f"Unexpected image shape: {image.shape}")
            else:
                pil_image = image
            
            # Ensure RGB mode
            if pil_image.mode != 'RGB':
                pil_image = pil_image.convert('RGB')
            
            # Validate size
            if pil_image.size[0] == 0 or pil_image.size[1] == 0:
                raise ValueError("Image has zero dimensions")
            
            # Process with model
            pixel_values = self.processor(
                pil_image, 
                return_tensors="pt"
            ).pixel_values.to(self.device)
            
            # Generate
            with torch.no_grad():
                outputs = self.model.generate(
                    pixel_values,
                    output_scores=True,
                    return_dict_in_generate=True,
                    num_beams=5,
                    max_length=128
                )
            
            # Decode
            text = self.processor.batch_decode(
                outputs.sequences, 
                skip_special_tokens=True
            )[0]
            
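            # Calculate confidence: sequences_scores holds the best beam's
            # length-normalized log-probability, so exp() maps it into (0, 1]
            if getattr(outputs, 'sequences_scores', None) is not None:
                confidence = float(torch.exp(outputs.sequences_scores[0]).cpu())
            else:
                # Placeholder value when no beam score is returned
                confidence = 0.85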
            
            return text, confidence, None
            
        except Exception as e:
            error_msg = f"Recognition error: {str(e)}"
            print(f"  ⚠️ {error_msg}")
            return "", 0.0, error_msg
    
    def extract_fields_safe(self, text):
        """
        Safely extract fields from text.
        """
        try:
            fields = {}
            
            if not text:
                return fields
            
            lines = text.split('\n')
            
            for line in lines:
                if ':' in line:
                    # Split on the first colon only, so values may contain ':'
                    key, value = line.split(':', 1)
                    key = key.strip().lower()
                    value = value.strip()
                    if key and value:
                        fields[key] = value
            
            return fields
            
        except Exception as e:
            print(f"  ⚠️ Field extraction error: {e}")
            return {}
    
    def process_document(self, image_path):
        """
        Process document with comprehensive error handling.
        """
        self.stats['processed'] += 1
        
        try:
            # Preprocess
            preprocessed = self.processor_helper.preprocess_image(image_path)
            
            if not preprocessed['success']:
                self.stats['failed'] += 1
                self.stats['errors']['preprocessing'] += 1
                return ProcessingResult(
                    image_path=str(image_path),
                    success=False,
                    error_message=preprocessed.get('error', 'Preprocessing failed'),
                    error_type='preprocessing'
                )
            
            # Recognize text
            text, confidence, error = self.recognize_text_safe(preprocessed['binary'])
            
            if error:
                self.stats['failed'] += 1
                self.stats['errors']['recognition'] += 1
                return ProcessingResult(
                    image_path=str(image_path),
                    success=False,
                    error_message=error,
                    error_type='recognition'
                )
            
            # Extract fields
            fields = self.extract_fields_safe(text)
            
            # Success
            self.stats['successful'] += 1
            return ProcessingResult(
                image_path=str(image_path),
                success=True,
                raw_text=text,
                confidence=confidence,
                extracted_fields=fields
            )
            
        except Exception as e:
            self.stats['failed'] += 1
            self.stats['errors']['unknown'] += 1
            
            error_msg = f"{type(e).__name__}: {str(e)}"
            print(f"  ❌ Unexpected error: {error_msg}")
            
            return ProcessingResult(
                image_path=str(image_path),
                success=False,
                error_message=error_msg,
                error_type='unknown'
            )
    
    def print_stats(self):
        """
        Print processing statistics.
        """
        total = self.stats['processed']
        # Guard against division by zero when nothing has been processed yet
        success_pct = self.stats['successful'] / total * 100 if total else 0.0
        failed_pct = self.stats['failed'] / total * 100 if total else 0.0
        
        print("\n" + "="*60)
        print("📊 PROCESSING STATISTICS")
        print("="*60)
        print(f"Total processed: {total}")
        print(f"✅ Successful: {self.stats['successful']} ({success_pct:.1f}%)")
        print(f"❌ Failed: {self.stats['failed']} ({failed_pct:.1f}%)")
        
        if self.stats['errors']:
            print("\n🔍 Error breakdown:")
            for error_type, count in self.stats['errors'].items():
                print(f"  • {error_type}: {count}")
        print("="*60)

print("✅ Robust OCR pipeline implemented")

✅ Robust OCR pipeline implemented


## 4. Download Dataset

In [5]:
# Download dataset
print("📥 Downloading dataset...")
path = kagglehub.dataset_download("chaimaourgani/handwritten2text-training-dataset")
print(f"✅ Dataset: {path}")

# Find images
dataset_path = Path(path)
image_files = list(dataset_path.rglob('*.png')) + list(dataset_path.rglob('*.jpg'))
print(f"📊 Found {len(image_files)} images")

📥 Downloading dataset...
Using Colab cache for faster access to the 'handwritten2text-training-dataset' dataset.
✅ Dataset: /kaggle/input/handwritten2text-training-dataset
📊 Found 12111 images


## 5. Initialize Pipeline

In [6]:
# Initialize robust pipeline
pipeline = RobustOCRPipeline()
print("\n✅ Pipeline ready!")

🚀 Initializing Robust OCR Pipeline...
📦 Loading TrOCR: microsoft/trocr-base-handwritten


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-base-handwritten and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model loaded on cpu

✅ Pipeline ready!


## 6. Process Documents with Error Recovery

In [7]:
# Process documents
sample_size = 20  # Process 20 documents
results = []

print(f"🔄 Processing {sample_size} documents...\n")

for img_path in tqdm(image_files[:sample_size], desc="Processing"):
    result = pipeline.process_document(img_path)
    results.append(result)

# Print statistics
pipeline.print_stats()

# Separate successful and failed
successful_results = [r for r in results if r.success]
failed_results = [r for r in results if not r.success]

print(f"\n✅ Successfully processed: {len(successful_results)}")
print(f"❌ Failed: {len(failed_results)}")

🔄 Processing 20 documents...





📊 PROCESSING STATISTICS
Total processed: 20
✅ Successful: 20 (100.0%)
❌ Failed: 0 (0.0%)

✅ Successfully processed: 20
❌ Failed: 0


## 7. Analyze Results

In [8]:
# Display successful results
if successful_results:
    print("\n" + "="*80)
    print("✅ SUCCESSFUL RESULTS")
    print("="*80)
    
    for i, result in enumerate(successful_results[:5], 1):
        print(f"\n{i}. {Path(result.image_path).name}")
        print(f"   Text: {result.raw_text[:100]}..." if len(result.raw_text) > 100 else f"   Text: {result.raw_text}")
        print(f"   Confidence: {result.confidence:.2%}")
        print(f"   Fields: {len(result.extracted_fields)}")
        if result.extracted_fields:
            for key, value in list(result.extracted_fields.items())[:3]:
                print(f"     • {key}: {value}")

# Display failed results with error details
if failed_results:
    print("\n" + "="*80)
    print("❌ FAILED RESULTS - Error Analysis")
    print("="*80)
    
    for i, result in enumerate(failed_results[:10], 1):
        print(f"\n{i}. {Path(result.image_path).name}")
        print(f"   Error Type: {result.error_type}")
        print(f"   Error Message: {result.error_message}")


✅ SUCCESSFUL RESULTS

1. train2011-589_000005.jpg
   Text: cordialement
   Confidence: 72.17%
   Fields: 0

2. train2011-771_000002.jpg
   Text: var le montant de mes impits eta puleri mensuellement .
   Confidence: 67.54%
   Fields: 0

3. train2011-783_000005.jpg
   Text: chossi de me brunnen vero vous . Twinensis can be presente was
   Confidence: 52.30%
   Fields: 0

4. train2011-73_000003.jpg
   Text: en ample sea commande over man couple client FNJBO14
   Confidence: 61.20%
   Fields: 0

5. train2011-136_000001.jpg
   Text: Te sonhaiterai commander their pains de chaussettes , taille 39142 ,
   Confidence: 70.22%
   Fields: 0
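
Every sample above reports `Fields: 0` because `extract_fields_safe` only picks up lines of the form `key: value`, and the recognized handwriting lines contain no colons. The rule can be sketched on its own as follows; the function name `extract_fields` and the sample strings are hypothetical and not taken from the dataset.

```python
def extract_fields(text):
    """Collect 'key: value' pairs, one per line, splitting on the first colon."""
    fields = {}
    for line in text.split('\n'):
        if ':' not in line:
            continue
        key, value = line.split(':', 1)
        key, value = key.strip().lower(), value.strip()
        if key and value:
            fields[key] = value
    return fields

# Hypothetical inputs, invented for illustration
form_text = "Name: Jane Doe\nTime: 10:30\nnotes without a colon"
print(extract_fields(form_text))       # form-like text yields fields
print(extract_fields("cordialement"))  # a plain line yields none
```

Splitting on the first colon only keeps values such as times intact.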


## 8. Visualize Results

In [9]:
if successful_results:
    # Confidence distribution
    confidences = [r.confidence for r in successful_results]
    
    fig = go.Figure()
    fig.add_trace(go.Histogram(
        x=confidences,
        nbinsx=20,
        marker_color='lightblue'
    ))
    
    fig.update_layout(
        title='Confidence Score Distribution (Successful Results)',
        xaxis_title='Confidence',
        yaxis_title='Count',
        height=400
    )
    fig.show()
    
    print(f"\n📊 CONFIDENCE STATISTICS:")
    print(f"   Mean: {np.mean(confidences):.2%}")
    print(f"   Std:  {np.std(confidences):.2%}")
    print(f"   Min:  {np.min(confidences):.2%}")
    print(f"   Max:  {np.max(confidences):.2%}")

# Error type distribution
if failed_results:
    error_types = Counter([r.error_type for r in failed_results])
    
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=list(error_types.keys()),
        y=list(error_types.values()),
        marker_color='lightcoral'
    ))
    
    fig.update_layout(
        title='Error Type Distribution',
        xaxis_title='Error Type',
        yaxis_title='Count',
        height=400
    )
    fig.show()


📊 CONFIDENCE STATISTICS:
   Mean: 66.69%
   Std:  11.26%
   Min:  41.47%
   Max:  87.19%
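
These confidence values come from `torch.exp(outputs.sequences_scores[0])`. Assuming beam search with the default length penalty of 1.0, that score is roughly the mean per-token log-probability, so its exponential behaves like a geometric mean of token probabilities rather than a calibrated accuracy estimate. The sketch below shows the arithmetic with invented token probabilities. The helper name `sequence_confidence` is hypothetical, and the exact length used for normalization may differ between library versions.

```python
import math

def sequence_confidence(token_probs, length_penalty=1.0):
    """exp of the length-normalized sum of token log-probabilities."""
    log_prob_sum = sum(math.log(p) for p in token_probs)
    score = log_prob_sum / (len(token_probs) ** length_penalty)
    return math.exp(score)

# Hypothetical per-token probabilities for a three-token output
conf = sequence_confidence([0.9, 0.8, 0.7])
print(f"{conf:.4f}")
```

A single low-probability token pulls the whole score down, which is one reason short, clean lines tend to score higher than long, noisy ones.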


## 9. Export Results

In [10]:
# Export all results
export_data = {
    'summary': {
        'total': len(results),
        'successful': len(successful_results),
        'failed': len(failed_results),
        'success_rate': len(successful_results) / len(results) * 100 if results else 0
    },
    'successful_results': [r.to_dict() for r in successful_results],
    'failed_results': [r.to_dict() for r in failed_results]
}

with open('robust_ocr_results.json', 'w', encoding='utf-8') as f:
    json.dump(export_data, f, indent=2, ensure_ascii=False)

print("✅ Exported to: robust_ocr_results.json")

# Export CSV summary
summary_data = []
for result in results:
    summary_data.append({
        'image': Path(result.image_path).name,
        'success': result.success,
        'text_preview': result.raw_text[:50] if result.raw_text else '',
        'confidence': result.confidence if result.confidence else 0,
        'fields_count': len(result.extracted_fields) if result.extracted_fields else 0,
        'error_type': result.error_type if not result.success else '',
        'error_message': result.error_message[:50] if result.error_message else ''
    })

df = pd.DataFrame(summary_data)
df.to_csv('processing_summary.csv', index=False)
print("✅ Exported summary to: processing_summary.csv")

print("\n📄 Summary Preview:")
print(df.head(10))

✅ Exported to: robust_ocr_results.json
✅ Exported summary to: processing_summary.csv

📄 Summary Preview:
                       image  success  \
0   train2011-589_000005.jpg     True   
1   train2011-771_000002.jpg     True   
2   train2011-783_000005.jpg     True   
3    train2011-73_000003.jpg     True   
4   train2011-136_000001.jpg     True   
5   train2011-696_000004.jpg     True   
6   train2011-227_000001.jpg     True   
7   train2011-468_000007.jpg     True   
8   train2011-985_000001.jpg     True   
9  train2011-1274_000002.jpg     True   

                                        text_preview  confidence  \
0                                       cordialement    0.721707   
1  var le montant de mes impits eta puleri mensue...    0.675427   
2  chossi de me brunnen vero vous . Twinensis can...    0.522995   
3  en ample sea commande over man couple client F...    0.611952   
4  Te sonhaiterai commander their pains de chauss...    0.702207   
5  at que mei mame je traval

## 10. Final Summary

In [11]:
print("\n" + "="*80)
print("📋 FINAL PROCESSING REPORT")
print("="*80)

success_rate = len(successful_results) / len(results) * 100 if results else 0

report = f"""
📊 OVERVIEW:
   • Total Documents: {len(results)}
   • ✅ Successful: {len(successful_results)} ({success_rate:.1f}%)
   • ❌ Failed: {len(failed_results)} ({100-success_rate:.1f}%)
"""

if successful_results:
    confidences = [r.confidence for r in successful_results]
    total_fields = sum(len(r.extracted_fields) for r in successful_results if r.extracted_fields)
    
    report += f"""
✅ SUCCESSFUL PROCESSING:
   • Average Confidence: {np.mean(confidences):.2%}
   • Min Confidence: {np.min(confidences):.2%}
   • Max Confidence: {np.max(confidences):.2%}
   • Total Fields Extracted: {total_fields}
   • Avg Fields per Doc: {total_fields/len(successful_results):.1f}
"""

if failed_results:
    error_types = Counter([r.error_type for r in failed_results])
    
    report += f"""
❌ FAILURE ANALYSIS:
"""
    for error_type, count in error_types.most_common():
        report += f"   • {error_type}: {count} ({count/len(failed_results)*100:.1f}%)\n"

report += f"""
💾 EXPORTS:
   • Detailed JSON: robust_ocr_results.json
   • Summary CSV: processing_summary.csv

✅ Processing Complete!
"""

print(report)
print("="*80)


📋 FINAL PROCESSING REPORT

📊 OVERVIEW:
   • Total Documents: 20
   • ✅ Successful: 20 (100.0%)
   • ❌ Failed: 0 (0.0%)

✅ SUCCESSFUL PROCESSING:
   • Average Confidence: 66.69%
   • Min Confidence: 41.47%
   • Max Confidence: 87.19%
   • Total Fields Extracted: 0
   • Avg Fields per Doc: 0.0

💾 EXPORTS:
   • Detailed JSON: robust_ocr_results.json
   • Summary CSV: processing_summary.csv

✅ Processing Complete!



## Key Fixes Implemented:

### 1. Type Checking & Validation
- ✅ Validates image dtype before processing
- ✅ Rejects `np.object_` arrays with a clear error instead of failing mid-pipeline
- ✅ Handles float/int type conversions
- ✅ Validates image dimensions
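
A minimal sketch of these checks combined into one helper, assuming float images are scaled to [0, 1]. The name `to_uint8` is hypothetical, and the clipping step is an added safeguard for illustration that may not match every input convention.

```python
import numpy as np

def to_uint8(image):
    """Return a uint8 copy of the image, rejecting arrays that cannot be images."""
    if image is None:
        raise ValueError("Image is None")
    image = np.asarray(image)
    if image.size == 0:
        raise ValueError("Image array is empty")
    if image.dtype == np.object_:
        raise ValueError("Image has object dtype, cannot process")
    if image.dtype == np.uint8:
        return image.copy()
    if np.issubdtype(image.dtype, np.floating):
        # Assumption: float images are scaled to [0, 1]
        return (np.clip(image, 0.0, 1.0) * 255).astype(np.uint8)
    # Clip wider integer types so values above 255 do not wrap around
    return np.clip(image, 0, 255).astype(np.uint8)

float_result = to_uint8(np.array([[0.0, 0.5, 1.0]]))
int_result = to_uint8(np.array([[-5, 128, 300]]))
print(float_result, int_result)
```

Without the clip, casting an integer value such as 300 straight to `uint8` wraps around instead of saturating at 255.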

### 2. Error Handling
- ✅ Try-except blocks at every level
- ✅ Detailed error messages
- ✅ Error categorization (preprocessing, recognition, unknown)
- ✅ Graceful failure handling
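
The categorization idea can be sketched compactly: run named stages in order and report the name of the stage that raised. This is a simplified variant for illustration, not the notebook's exact structure. `run_stages`, `preprocess`, and `recognize` are hypothetical stand-ins.

```python
def run_stages(item, stages):
    """Run (name, func) stages in order; report which stage failed, if any."""
    value = item
    for name, func in stages:
        try:
            value = func(value)
        except Exception as e:
            # Record the stage name as the error category
            return {'success': False, 'error_type': name,
                    'error_message': f"{type(e).__name__}: {e}"}
    return {'success': True, 'value': value}

# Hypothetical stand-ins for preprocessing and recognition
def preprocess(path):
    if not path.endswith('.jpg'):
        raise ValueError(f"Could not load image: {path}")
    return path.upper()

def recognize(image):
    return f"text from {image}"

stages = [('preprocessing', preprocess), ('recognition', recognize)]
print(run_stages('a.jpg', stages))
print(run_stages('a.txt', stages))
```

Returning a result record instead of re-raising lets a batch loop continue past a bad file while still keeping the error type and message.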

### 3. Fallback Mechanisms
- ✅ OpenCV → PIL fallback for image loading
- ✅ Sauvola → Otsu fallback for thresholding
- ✅ Default confidence score (0.85) when the model returns no beam score
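
One detail affects how often the Otsu fallback triggers: scikit-image's `threshold_sauvola` expects an odd `window_size`, so a window derived from the image height can fail when it happens to be even. A small helper along these lines could keep the Sauvola path usable; `sauvola_window` is a hypothetical name and the cap of 25 mirrors the value used in the notebook.

```python
def sauvola_window(shape, max_window=25):
    """Pick an odd window size >= 3 from the image's smaller side."""
    window = min(max_window, min(shape[:2]) // 2)
    if window % 2 == 0:
        window -= 1  # step down to the nearest odd size
    return max(window, 3)

print(sauvola_window((48, 300)))   # half of 48 is 24 (even), stepped down
print(sauvola_window((600, 800)))  # capped at max_window
print(sauvola_window((4, 4)))      # tiny images fall back to the floor of 3
```

The result can be passed as `window_size` to `threshold_sauvola`, with the Otsu branch still in place for any remaining failure.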

### 4. Statistics & Monitoring
- ✅ Track success/failure rates
- ✅ Error type breakdown
- ✅ Processing statistics
- ✅ Detailed failure analysis
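
The bookkeeping behind these statistics can be sketched with a small tally class. The `Stats` class and the list of outcomes below are hypothetical. The point of the sketch is that the rate calculation returns 0.0 for an empty run instead of dividing by zero.

```python
from collections import Counter

class Stats:
    """Tally outcomes and error categories for a batch run."""

    def __init__(self):
        self.processed = 0
        self.successful = 0
        self.errors = Counter()

    def record(self, error_type=None):
        """Count one document; pass an error_type to mark it as failed."""
        self.processed += 1
        if error_type is None:
            self.successful += 1
        else:
            self.errors[error_type] += 1

    def success_rate(self):
        # Report 0.0 instead of dividing by zero on an empty run
        return self.successful / self.processed * 100 if self.processed else 0.0

stats = Stats()
empty_rate = stats.success_rate()
for outcome in [None, None, 'preprocessing', None]:  # hypothetical outcomes
    stats.record(outcome)
print(f"{stats.success_rate():.1f}%", dict(stats.errors))
```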

### 5. Output
- ✅ Separate successful and failed results
- ✅ Detailed error messages for debugging
- ✅ Comprehensive reporting
- ✅ CSV export with error details

---

**This version will handle problematic images gracefully and provide detailed error information for debugging!**

**SOCAR Hackathon 2025** | **AI Engineering Track**