# German Document OCR with Qwen2-VL

This cookbook demonstrates how to use Qwen2-VL for Optical Character Recognition (OCR) on **German documents**. German business documents have unique characteristics that require special attention:

- **Umlauts** (ä, ö, ü) and **Eszett** (ß)
- **German date formats** (DD.MM.YYYY)
- **Currency formatting** (1.234,56 €)
- **Specific document types**: Rechnungen (invoices), Formulare (forms), Ausweise (IDs)

## Use Cases Covered

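1. **Invoice OCR** (Rechnungserkennung) with structured JSON output
2. **Form Processing** (Formularverarbeitung) with structured JSON output
3. **ID Document Extraction** (Ausweiserkennung) with structured JSON output
4. **Full-Page OCR** for plain text extraction
5. **Batch Processing** of multiple documents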

---

**Author:** [Keyvan Hardani](https://keyvan.ai) - AI Engineer specializing in German Document Intelligence

**Related Project:** [German-OCR](https://github.com/Keyvanhardani/German-OCR) - Specialized OCR for German documents

## Setup

First, install the required dependencies:

In [None]:
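!pip install transformers torch qwen-vl-utils accelerate pillow requests -q
# Optional, only needed for attn_implementation="flash_attention_2":
# !pip install flash-attn --no-build-isolation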

In [None]:
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import json
import re
from PIL import Image
import requests
from io import BytesIO

## Load the Model

We use `Qwen2-VL-7B-Instruct`, which can read multilingual text in images, including German.

In [None]:
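# Load model and processor
model_name = "Qwen/Qwen2-VL-7B-Instruct"

# flash_attention_2 needs the optional flash-attn package and a supported GPU;
# otherwise fall back to PyTorch's built-in SDPA attention.
import importlib.util

attn_implementation = (
    "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation=attn_implementation,
    device_map="auto"
)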

processor = AutoProcessor.from_pretrained(model_name)

print(f"Model loaded: {model_name}")

## Helper Functions

In [None]:
def load_image(image_path_or_url: str) -> Image.Image:
    """Load image from local path or URL."""
    if image_path_or_url.startswith(('http://', 'https://')):
        # A timeout prevents the notebook from hanging on an unreachable host
        response = requests.get(image_path_or_url, timeout=30)
        response.raise_for_status()  # fail early on 4xx/5xx responses
        return Image.open(BytesIO(response.content)).convert('RGB')
    return Image.open(image_path_or_url).convert('RGB')


def extract_json(text: str) -> dict:
    """Extract JSON from model response (handles markdown code blocks)."""
    # Try to find JSON in markdown code block
    json_match = re.search(r'```(?:json)?\s*([\s\S]*?)```', text)
    if json_match:
        text = json_match.group(1)
    
    # Clean and parse
    text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"raw_text": text, "parse_error": True}


def run_ocr(image_source: str, prompt: str, max_tokens: int = 2048) -> str:
    """Run OCR inference on an image with a custom prompt."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_source},
                {"type": "text", "text": prompt}
            ]
        }
    ]
    
    # Prepare inputs
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt"
    ).to(model.device)
    
    # Generate
    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=False
        )
    
    # Decode
    generated_ids_trimmed = [
        out_ids[len(in_ids):] 
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    
    return processor.batch_decode(
        generated_ids_trimmed, 
        skip_special_tokens=True, 
        clean_up_tokenization_spaces=False
    )[0]
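
`extract_json` above handles a fenced block or a bare JSON string. Models sometimes wrap the object in explanatory text instead. The following is a minimal fallback sketch using only the standard library; the name `find_first_json_object` is illustrative and not part of the original helpers.

```python
import json


def find_first_json_object(text: str):
    """Return the first decodable JSON object in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


print(find_first_json_object('Hier ist das Ergebnis: {"vorname": "Max"} Ende'))
```

One possible use is to call this only when `extract_json` reports `parse_error`, so the stricter path stays the default.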

---

## 1. German Invoice OCR (Rechnungserkennung)

German invoices ("Rechnungen") typically contain the fields below. Most of them are mandatory invoice content under § 14 UStG; bank details are customary but not legally required:

- **Rechnungsnummer** (Invoice number)
- **Rechnungsdatum** (Invoice date)
- **Steuernummer / USt-IdNr.** (Tax ID / VAT number)
- **Nettobetrag, MwSt., Bruttobetrag** (Net, VAT, Gross amounts)
- **IBAN / BIC** (Bank details)
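
OCR can confuse similar glyphs (for example `0` and `O`), so bank details are worth checking after extraction. The sketch below applies the standard IBAN mod-97 checksum to German IBANs using only the standard library. The function name `is_valid_german_iban` is an illustrative assumption, not part of this notebook's helpers.

```python
import re


def is_valid_german_iban(iban: str) -> bool:
    """Check format and mod-97 checksum of a German IBAN."""
    compact = re.sub(r"\s+", "", iban).upper()
    # German IBANs have 22 characters: "DE", 2 check digits, 18-digit BBAN.
    if not re.fullmatch(r"DE\d{20}", compact):
        return False
    # Move the first four characters to the end, then map letters to
    # numbers (A=10 ... Z=35) before taking the remainder modulo 97.
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(is_valid_german_iban("DE89 3704 0044 0532 0130 00"))
```

A failed check does not say which character is wrong, so it is best used as a trigger for manual review rather than for automatic correction.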

In [None]:
# German Invoice Extraction Prompt
GERMAN_INVOICE_PROMPT = """
Analysiere diese deutsche Rechnung und extrahiere alle relevanten Informationen.

Gib die Daten als JSON mit folgender Struktur zurück:

```json
{
    "rechnungsnummer": "string",
    "rechnungsdatum": "DD.MM.YYYY",
    "lieferdatum": "DD.MM.YYYY oder null",
    "absender": {
        "firma": "string",
        "adresse": "string",
        "steuernummer": "string oder null",
        "ust_idnr": "string oder null"
    },
    "empfaenger": {
        "name": "string",
        "adresse": "string"
    },
    "positionen": [
        {
            "beschreibung": "string",
            "menge": number,
            "einzelpreis": number,
            "gesamtpreis": number
        }
    ],
    "betraege": {
        "netto": number,
        "mwst_satz": number,
        "mwst_betrag": number,
        "brutto": number
    },
    "zahlungsinformationen": {
        "iban": "string oder null",
        "bic": "string oder null",
        "zahlungsziel": "string oder null"
    }
}
```

Wichtig:
- Behalte das deutsche Datumsformat (DD.MM.YYYY)
- Konvertiere Beträge zu Zahlen (1.234,56 € → 1234.56)
- Setze fehlende Felder auf null
"""

# Example usage (replace with your invoice image)
# invoice_image = "path/to/german_invoice.jpg"
# result = run_ocr(invoice_image, GERMAN_INVOICE_PROMPT)
# invoice_data = extract_json(result)
# print(json.dumps(invoice_data, indent=2, ensure_ascii=False))

### Example: Process a Sample Invoice

In [None]:
# Demo with a sample German invoice
# You can replace this URL with your own invoice image

sample_invoice_url = "YOUR_INVOICE_IMAGE_URL_HERE"

# Uncomment to run:
# result = run_ocr(sample_invoice_url, GERMAN_INVOICE_PROMPT)
# invoice_data = extract_json(result)
# 
# print("=" * 50)
# print("EXTRAHIERTE RECHNUNGSDATEN")
# print("=" * 50)
# print(json.dumps(invoice_data, indent=2, ensure_ascii=False))

---

## 2. German Form Processing (Formularverarbeitung)

German forms often include:
- Checkboxes (Ankreuzfelder)
- Handwritten entries
- Structured fields with labels

In [None]:
# German Form Extraction Prompt
GERMAN_FORM_PROMPT = """
Analysiere dieses deutsche Formular und extrahiere alle ausgefüllten Felder.

Gib die Daten als JSON zurück mit:
- Feldname als Schlüssel
- Eingetragener Wert als Wert
- Bei Ankreuzfeldern: true/false
- Bei leeren Feldern: null

Beispiel:
```json
{
    "vorname": "Max",
    "nachname": "Mustermann",
    "geburtsdatum": "15.03.1985",
    "geschlecht_maennlich": true,
    "geschlecht_weiblich": false,
    "telefon": null
}
```

Erkenne auch handschriftliche Einträge.
"""

# Example usage:
# form_image = "path/to/german_form.jpg"
# result = run_ocr(form_image, GERMAN_FORM_PROMPT)
# form_data = extract_json(result)
# print(json.dumps(form_data, indent=2, ensure_ascii=False))

---

## 3. ID Document Extraction (Ausweiserkennung)

Extract information from German ID cards (Personalausweis) and passports.

**Note:** Always handle personal data according to GDPR (DSGVO) regulations!

In [None]:
# German ID Extraction Prompt
GERMAN_ID_PROMPT = """
Extrahiere die Informationen aus diesem deutschen Ausweisdokument.

Gib die Daten als JSON zurück:

```json
{
    "dokumenttyp": "Personalausweis/Reisepass/Führerschein",
    "nachname": "string",
    "vorname": "string",
    "geburtsdatum": "DD.MM.YYYY",
    "geburtsort": "string",
    "nationalitaet": "string",
    "ausweisnummer": "string",
    "gueltig_bis": "DD.MM.YYYY",
    "ausstellende_behoerde": "string oder null"
}
```

Hinweis: Achte auf korrekte Umlaute (ä, ö, ü, ß).
"""

# Example usage:
# id_image = "path/to/german_id.jpg"
# result = run_ocr(id_image, GERMAN_ID_PROMPT)
# id_data = extract_json(result)
# print(json.dumps(id_data, indent=2, ensure_ascii=False))
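
Extracted ID data can be sanity-checked before it is stored. The sketch below assumes the field names from `GERMAN_ID_PROMPT` (`geburtsdatum`, `gueltig_bis`) and the DD.MM.YYYY format; `check_id_dates` is a hypothetical helper, and the rules are plausibility checks only, not a verification of the document's authenticity.

```python
from datetime import date, datetime


def check_id_dates(id_data: dict, today: date = None) -> list:
    """Return a list of problems found in the date fields (empty if plausible)."""
    today = today or date.today()
    try:
        born = datetime.strptime(id_data["geburtsdatum"], "%d.%m.%Y").date()
        valid_until = datetime.strptime(id_data["gueltig_bis"], "%d.%m.%Y").date()
    except (KeyError, TypeError, ValueError) as exc:
        # Missing key, null value, or a date that does not exist (e.g. 31.02.).
        return [f"unreadable date field: {exc!r}"]
    problems = []
    if born >= today:
        problems.append("geburtsdatum is not in the past")
    if valid_until <= born:
        problems.append("gueltig_bis is not after geburtsdatum")
    if valid_until < today:
        problems.append("document has expired")
    return problems
```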

---

## 4. Full-Page German OCR

For general German text extraction without structured output:

In [None]:
# Simple German OCR Prompt
GERMAN_OCR_SIMPLE = """
Extrahiere den gesamten Text aus diesem Bild.

Regeln:
- Behalte die ursprüngliche Formatierung bei (Absätze, Listen)
- Achte auf korrekte deutsche Zeichen (ä, ö, ü, ß)
- Erkenne Tabellen und formatiere sie lesbar
- Gib nur den extrahierten Text zurück, keine Erklärungen
"""

# Example usage:
# document_image = "path/to/german_document.jpg"
# extracted_text = run_ocr(document_image, GERMAN_OCR_SIMPLE)
# print(extracted_text)

---

## 5. Batch Processing Multiple Documents

In [None]:
def process_german_documents(image_paths: list, document_type: str = "invoice") -> list:
    """
    Process multiple German documents in batch.
    
    Args:
        image_paths: List of image file paths or URLs
        document_type: One of 'invoice', 'form', 'id', 'general'
    
    Returns:
        List of extracted data dictionaries
    """
    prompts = {
        "invoice": GERMAN_INVOICE_PROMPT,
        "form": GERMAN_FORM_PROMPT,
        "id": GERMAN_ID_PROMPT,
        "general": GERMAN_OCR_SIMPLE
    }
    
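    # Reject unknown types instead of silently falling back to plain OCR,
    # which would then be passed through extract_json and fail to parse.
    if document_type not in prompts:
        raise ValueError(f"Unknown document_type: {document_type!r}")
    prompt = prompts[document_type]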
    results = []
    
    for i, image_path in enumerate(image_paths):
        print(f"Processing document {i+1}/{len(image_paths)}: {image_path}")
        try:
            result = run_ocr(image_path, prompt)
            if document_type != "general":
                result = extract_json(result)
            results.append({"file": image_path, "data": result, "success": True})
        except Exception as e:
            results.append({"file": image_path, "error": str(e), "success": False})
    
    return results

# Example usage:
# documents = ["invoice1.jpg", "invoice2.jpg", "invoice3.jpg"]
# results = process_german_documents(documents, document_type="invoice")
# for r in results:
#     print(json.dumps(r, indent=2, ensure_ascii=False))
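
Batch results are often easier to audit when persisted. The following sketch writes the list returned by `process_german_documents` as JSON Lines and reports success counts. The helper name `save_results_jsonl` and the output file name are assumptions for illustration.

```python
import json
from pathlib import Path


def save_results_jsonl(results: list, output_path: str) -> dict:
    """Write one JSON object per line and return success/failure counts."""
    path = Path(output_path)
    with path.open("w", encoding="utf-8") as handle:
        for entry in results:
            # ensure_ascii=False keeps ä, ö, ü and ß readable in the file.
            handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
    succeeded = sum(1 for entry in results if entry.get("success"))
    return {
        "total": len(results),
        "succeeded": succeeded,
        "failed": len(results) - succeeded,
    }


# Example usage:
# summary = save_results_jsonl(results, "ocr_results.jsonl")
# print(summary)
```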

---

## Best Practices for German Document OCR

### 1. Image Quality
- **Resolution**: Minimum 150 DPI, ideally 300 DPI
- **Lighting**: Even lighting, avoid shadows
- **Orientation**: Correct rotation before processing
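
To relate the DPI guidance to image sizes, the arithmetic can be sketched as follows. It assumes the image shows a full A4 page (210 × 297 mm); both function names are illustrative.

```python
MM_PER_INCH = 25.4


def pixels_for_dpi(width_mm: float, height_mm: float, dpi: int) -> tuple:
    """Pixel dimensions of a page with the given physical size at a given DPI."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def estimated_dpi(pixel_width: int, page_width_mm: float = 210.0) -> float:
    """Estimate scan resolution from pixel width, assuming a full-page image."""
    return pixel_width / (page_width_mm / MM_PER_INCH)


print(pixels_for_dpi(210, 297, 300))  # A4 at 300 DPI
print(pixels_for_dpi(210, 297, 150))  # A4 at 150 DPI
```

A scan whose estimated DPI falls well below 150 is a candidate for rescanning rather than upscaling, since upscaling does not restore lost detail.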

### 2. German-Specific Considerations
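- Always check that Umlauts (ä, ö, ü) and Eszett (ß) survive extraction and are not replaced by a, o, u, or ss
- German dates use the DD.MM.YYYY format
- Currency uses a period as thousands separator and a comma as decimal separator (1.234,56 €)
- VAT IDs (USt-IdNr.) follow a fixed pattern: `DE` followed by 9 digits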
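
The conventions above can be checked in code after extraction. This is a minimal sketch using only the standard library; the helper names are assumptions, and the VAT ID check covers the format only, not whether the number is actually registered.

```python
import re
from datetime import date, datetime
from decimal import Decimal


def parse_german_date(text: str) -> date:
    """Parse a DD.MM.YYYY string; raises ValueError for impossible dates."""
    return datetime.strptime(text.strip(), "%d.%m.%Y").date()


def parse_german_amount(text: str) -> Decimal:
    """Convert amounts such as '1.234,56 €' to Decimal('1234.56')."""
    cleaned = text.replace("€", "").replace("EUR", "")
    cleaned = cleaned.replace("\u00a0", "").replace(" ", "")
    # Period is the thousands separator, comma the decimal separator.
    return Decimal(cleaned.replace(".", "").replace(",", "."))


def looks_like_ust_idnr(text: str) -> bool:
    """Format check only: 'DE' followed by exactly nine digits."""
    return re.fullmatch(r"DE\d{9}", text.replace(" ", "").upper()) is not None


print(parse_german_date("15.03.1985"))
print(parse_german_amount("1.234,56 €"))
print(looks_like_ust_idnr("DE123456789"))
```

`Decimal` is used instead of `float` to avoid rounding surprises when amounts are compared later.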

### 3. GDPR/DSGVO Compliance
- Process personal data locally when possible
- Implement data minimization
- Log access to sensitive documents
- Delete processed images after extraction
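
Access logging and deletion after extraction could be combined in a small context manager, sketched below. The logger name, the `user` parameter, and the function name are assumptions. Note that `unlink` removes the file entry but is not a secure erase, so this alone may not satisfy every deletion requirement.

```python
import logging
from contextlib import contextmanager
from pathlib import Path

logger = logging.getLogger("document_access")


@contextmanager
def processed_then_deleted(image_path: str, user: str):
    """Log access to a document image and delete the file when the block exits."""
    path = Path(image_path)
    # Log only the file name and user, never the extracted personal data.
    logger.info("access file=%s user=%s", path.name, user)
    try:
        yield path
    finally:
        # Runs even if processing raised an exception.
        path.unlink(missing_ok=True)
        logger.info("deleted file=%s", path.name)


# Example usage:
# with processed_then_deleted("path/to/german_id.jpg", user="clerk42") as path:
#     id_data = extract_json(run_ocr(str(path), GERMAN_ID_PROMPT))
```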

### 4. Error Handling
- Validate extracted dates and amounts
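- Derive confidence signals yourself: `run_ocr` returns plain text without a score, so use checks such as the `parse_error` flag from `extract_json` or inconsistent totals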
- Manual review for low-confidence extractions
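
One way to validate amounts is to check that the extracted totals agree with each other. The sketch below assumes the `betraege` structure from `GERMAN_INVOICE_PROMPT` and that `mwst_satz` is given in percent (for example 19); the function name and the one-cent tolerance are illustrative choices.

```python
from decimal import Decimal


def check_invoice_totals(betraege: dict, tolerance: str = "0.01") -> list:
    """Return a list of problems; an empty list means the totals are consistent."""
    try:
        netto = Decimal(str(betraege["netto"]))
        satz = Decimal(str(betraege["mwst_satz"]))
        mwst = Decimal(str(betraege["mwst_betrag"]))
        brutto = Decimal(str(betraege["brutto"]))
    except (KeyError, TypeError, ArithmeticError) as exc:
        # Missing key, or a value such as null that is not a number.
        return [f"missing or non-numeric amount: {exc!r}"]
    tol = Decimal(tolerance)
    problems = []
    if abs(netto + mwst - brutto) > tol:
        problems.append("netto + mwst_betrag does not equal brutto")
    if abs(netto * satz / 100 - mwst) > tol:
        problems.append("mwst_betrag does not match netto * mwst_satz")
    return problems
```

Invoices with a non-empty problem list are natural candidates for the manual review step. Invoices with several VAT rates would need a per-rate check, which this sketch does not cover.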

---

## Resources

- **German-OCR Project**: [github.com/Keyvanhardani/German-OCR](https://github.com/Keyvanhardani/German-OCR) - Specialized OCR fine-tuned for German documents
- **Qwen2-VL Documentation**: [Hugging Face Model Card](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
- **DSGVO Guidelines**: [GDPR compliance for document processing](https://gdpr.eu/)

---

*This cookbook was created by [Keyvan Hardani](https://keyvan.ai) to help the German-speaking community leverage Qwen2-VL for document processing.*