# OCR Processing Script for Swedish RSA PDFs - Improved Version

## Overview

This improved OCR script processes Swedish municipal RSA (Risk and Vulnerability Analysis) PDF documents that failed standard text extraction. It uses Tesseract OCR with Swedish language support and outputs data in a format directly compatible with the readtext R package.


**Class and Method Documentation:**
```python
from dataclasses import dataclass


@dataclass
class DocumentMetadata:
    """Metadata extracted from RSA document filename.

    Attributes:
        filename: Original PDF filename
        municipality: Swedish municipality name
        year: Publication year (4-digit string or 'unknown')
        is_masked: Whether document is marked as 'Maskad' (redacted)
    """

    filename: str
    municipality: str
    year: str
    is_masked: bool

    @classmethod
    def from_filename(cls, filename: str) -> "DocumentMetadata":
        """Parse metadata from RSA document filename.

        Expected format: RSA [municipality] [year] [Maskad].pdf

        Args:
            filename: PDF filename to parse

        Returns:
            DocumentMetadata instance with extracted fields

        Examples:
            >>> DocumentMetadata.from_filename("RSA Ale 2015 Maskad.pdf")
            DocumentMetadata(filename='RSA Ale 2015 Maskad.pdf',
                           municipality='Ale', year='2015', is_masked=True)
        """
```
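
The parsing rule described in the docstring can be sketched roughly as follows. This is a minimal illustration, not the script's actual implementation: it assumes whitespace-separated filename parts, a case-insensitive `Maskad` marker, and a fallback municipality of `'unknown'` (only the `'unknown'` year is documented above).

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DocumentMetadata:
    filename: str
    municipality: str
    year: str
    is_masked: bool

    @classmethod
    def from_filename(cls, filename: str) -> DocumentMetadata:
        # Work on the name without the ".pdf" extension.
        tokens = Path(filename).stem.split()
        # Drop the leading "RSA" prefix if present.
        if tokens and tokens[0].upper() == "RSA":
            tokens = tokens[1:]
        # A trailing "Maskad" marks a redacted document.
        is_masked = bool(tokens) and tokens[-1].lower() == "maskad"
        if is_masked:
            tokens = tokens[:-1]
        # A trailing 4-digit token is taken as the publication year.
        year = "unknown"
        if tokens and re.fullmatch(r"\d{4}", tokens[-1]):
            year = tokens[-1]
            tokens = tokens[:-1]
        # Whatever remains is the municipality (may contain spaces).
        # The 'unknown' fallback here is an assumption of this sketch.
        municipality = " ".join(tokens) or "unknown"
        return cls(filename, municipality, year, is_masked)
```

Parsing from the end of the name lets multi-word municipalities such as "Upplands Väsby" pass through without special handling.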

## Installation

### System Dependencies

**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-swe poppler-utils
```

**macOS:**
```bash
brew install tesseract tesseract-lang poppler
```

### Python Dependencies

```bash
pip install pytesseract pdf2image pillow pandas tqdm pyarrow
```

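**Note:** Swedish language data (`swe`) is **required**. It is provided by `tesseract-ocr-swe` on Ubuntu/Debian and by `tesseract-lang` on Homebrew. The script will fail if it is not installed; run the script with `--check-only` to verify the setup.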
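
To confirm the Swedish data is visible to Tesseract independently of the script, something along these lines may help. It is a hedged sketch that shells out to `tesseract --list-langs`; the helper names `parse_tesseract_languages` and `swedish_available` are made up for this example and are not part of the script.

```python
from __future__ import annotations

import shutil
import subprocess


def parse_tesseract_languages(listing: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    codes = set()
    for line in listing.splitlines():
        line = line.strip()
        # Skip blanks and the "List of available languages ..." header line.
        if not line or line.lower().startswith("list of available"):
            continue
        codes.add(line)
    return codes


def swedish_available() -> bool:
    """Return True if the tesseract binary reports the `swe` language."""
    if shutil.which("tesseract") is None:
        return False  # tesseract is not on PATH at all
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Combine both streams in case the listing is written to stderr.
    return "swe" in parse_tesseract_languages(proc.stdout + "\n" + proc.stderr)


if __name__ == "__main__":
    print("Swedish language data found:", swedish_available())
```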

## Usage

### Command Line

```bash
# Process all PDFs
python ocr_swedish_pdfs_improved.py -i ./pdfs -o ./output

# Process specific files
python ocr_swedish_pdfs_improved.py -i ./pdfs -o ./output -f failed_files.txt

# Check installation
python ocr_swedish_pdfs_improved.py --check-only

# Higher DPI for better quality (slower)
python ocr_swedish_pdfs_improved.py -i ./pdfs -o ./output --dpi 400
```

### Jupyter Lab Usage (Recommended)

```python
from pathlib import Path
from ocr_swedish_pdfs_improved import run_pipeline, ProcessingConfig

# Configure processing
config = ProcessingConfig(
    language="swe+eng",  # Swedish + English for mixed content
    dpi=300,
    min_text_length=500,
    workers=1,  # Use 1 for Jupyter (multiprocessing can cause issues)
)

# Run the pipeline
results, summary = run_pipeline(
    input_dir=Path("/path/to/pdfs"),
    output_dir=Path("./ocr_output"),
    config=config,
)

# Check results
print(f"Processed: {summary.total_files}")
print(f"Successful: {summary.successful}")
print(f"Failed: {summary.failed_short + summary.failed_other}")

# Access individual results
for result in results:
    if result.status.is_success():
        print(f"{result.file}: {result.word_count} words")
```

### Process Specific Files from List

```python
from pathlib import Path
from ocr_swedish_pdfs_improved import run_pipeline, ProcessingConfig

results, summary = run_pipeline(
    input_dir=Path("/path/to/pdfs"),
    output_dir=Path("./ocr_output"),
    file_list=Path("failed_files.txt"),  # One filename per line
    config=ProcessingConfig(workers=1),
)
```
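
Before starting a long run, it can be worth checking that every name in the list actually exists in the input directory. The sketch below assumes the one-filename-per-line format noted above and additionally assumes that blank lines should be ignored; `read_file_list` is a hypothetical helper, not a function exported by the script.

```python
from __future__ import annotations

from pathlib import Path


def read_file_list(list_path: Path, input_dir: Path) -> tuple[list[Path], list[str]]:
    """Split the names in list_path into existing paths and missing names."""
    found: list[Path] = []
    missing: list[str] = []
    for raw in list_path.read_text(encoding="utf-8").splitlines():
        name = raw.strip()
        if not name:
            continue  # ignore blank lines (assumption of this sketch)
        candidate = input_dir / name
        if candidate.is_file():
            found.append(candidate)
        else:
            missing.append(name)
    return found, missing
```

Reading the list as UTF-8 matters here because municipality names often contain å, ä, or ö.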

## Output Files

| File | Description |
|------|-------------|
| `ocr_readtext_format.parquet` | **Primary output** - matches readtext format (file, text columns) |
| `ocr_full_results.parquet` | All metadata and processing details |
| `ocr_results_summary.csv` | Summary without text content |
| `still_failed_files.txt` | Files that failed OCR |
| `*.txt` | Individual text files (optional) |

## Integration with Stanza Pipeline

```python
import pandas as pd

# Load the documents that readtext extracted successfully.
# If they are stored as readtext_success.rds, convert to Parquet first:
# import pyreadr
# readtext_df = pyreadr.read_r('readtext_success.rds')[None]
# readtext_df.to_parquet('readtext_success.parquet')

readtext_df = pd.read_parquet('readtext_success.parquet')
ocr_df = pd.read_parquet('ocr_output/ocr_readtext_format.parquet')

# Combine
combined_df = pd.concat([readtext_df, ocr_df], ignore_index=True)

print(f"Total documents: {len(combined_df)}")

# Continue with existing Stanza pipeline...
```

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────┐
│                        CLI / API                            │
│  main() ←→ run_pipeline() ←→ create_argument_parser()       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Configuration Layer                       │
│  ProcessingConfig │ Constants │ DependencyChecker           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Discovery Layer                           │
│  discover_pdf_files() │ DocumentMetadata.from_filename()    │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Processing Layer                          │
│  BatchProcessor │ process_single_document │ OCREngine       │
│  ImagePreprocessor (Strategy Pattern)                       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Output Layer                            │
│  OutputWriter │ ProcessingSummary │ OCRResult               │
└─────────────────────────────────────────────────────────────┘
```

## Performance Considerations

- **Parallel Processing**: Use `--workers N` for multi-core systems
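- **DPI Trade-off**: Higher DPI gives better quality but slower processing; the rendered pixel count grows with the square of the DPI, so 400 DPI produces roughly 1.8 times as many pixels per page as 300 DPI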
- **Memory**: Each worker loads entire PDFs into memory; adjust workers accordingly
- **Disk I/O**: Parquet output is faster than CSV for large datasets
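
The DPI and memory points above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes A4 pages rendered as uncompressed RGB images and, following the memory note, that a worker holds all pages of one PDF at once. `page_bytes` and `suggest_workers` are illustrative helpers, not part of the script, and real memory use will differ.

```python
A4_INCHES = (8.27, 11.69)  # A4 page size in inches (210 x 297 mm)


def page_bytes(dpi: int, channels: int = 3, size_in=A4_INCHES) -> int:
    """Uncompressed size in bytes of one rendered page at the given DPI."""
    width_px = round(size_in[0] * dpi)
    height_px = round(size_in[1] * dpi)
    return width_px * height_px * channels


def suggest_workers(cpu_count: int, memory_budget_bytes: int,
                    pages_per_pdf: int, dpi: int) -> int:
    """Cap the worker count by both CPU cores and a rough memory budget."""
    per_worker = page_bytes(dpi) * pages_per_pdf
    by_memory = max(1, memory_budget_bytes // per_worker)
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    for dpi in (300, 400):
        print(dpi, "DPI:", round(page_bytes(dpi) / 1e6, 1), "MB per page")
    # Example: 8 cores, 8 GB budget, 50-page documents at 300 DPI.
    print("workers:", suggest_workers(8, 8 * 10**9, 50, 300))
```

Under these assumptions an RGB A4 page is roughly 26 MB at 300 DPI and 46 MB at 400 DPI, so a long document at high DPI can dominate the memory budget well before the CPU count does.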

## Extending the Script

### Custom Preprocessing

```python
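from PIL import ImageEnhance

# These names are assumed to be importable from the script module.
from ocr_swedish_pdfs_improved import (
    CompositePreprocessor,
    GrayscalePreprocessor,
    ImagePreprocessor,
    OCREngine,
    ProcessingConfig,
    ThresholdPreprocessor,
)


class ContrastEnhancer(ImagePreprocessor):
    """Boost image contrast before thresholding."""

    def __init__(self, factor: float = 1.5):
        self.factor = factor

    def process(self, image):
        enhancer = ImageEnhance.Contrast(image)
        return enhancer.enhance(self.factor)

# Use custom preprocessor: steps run in the order listed
config = ProcessingConfig(workers=1)
preprocessor = CompositePreprocessor([
    GrayscalePreprocessor(),
    ContrastEnhancer(factor=2.0),
    ThresholdPreprocessor(threshold=140),
])
engine = OCREngine(config, preprocessor=preprocessor)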
```

### Custom Output Format

```python
class CustomOutputWriter(OutputWriter):
    def write_all(self, results):
        paths = super().write_all(results)
        # Add a custom format; _write_custom_format is a method you implement yourself
        paths['custom'] = self._write_custom_format(results)
        return paths
```
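
As one possible body for such a custom writer, the sketch below writes results as JSON Lines. It is self-contained and uses a stand-in `ResultStub` class rather than the script's own result type; the `file` and `word_count` attributes mirror the usage example earlier, while `text` is an assumed attribute based on the readtext column names.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable


@dataclass
class ResultStub:
    """Stand-in for the script's result objects (fields are assumptions)."""
    file: str
    text: str
    word_count: int


def write_jsonl(results: Iterable[ResultStub], path: Path) -> Path:
    """Write one JSON object per line and return the output path."""
    with path.open("w", encoding="utf-8") as fh:
        for r in results:
            record = {"file": r.file, "text": r.text, "word_count": r.word_count}
            # ensure_ascii=False keeps å, ä, ö readable in the output file.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Inside a subclass, a `_write_custom_format` method could delegate to a function like this and return the resulting path.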
