# OCR Evaluation

Compare the OCR output (`original`) against the ground truth (`corrected`).

Methods
- CER (Character Error Rate)
- WER (Word Error Rate)
- Segment Accuracy (CER-based)
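
For reference, the usual edit-distance-based definitions of these metrics can be written as below. The symbols are notation introduced here for illustration: S, D and I are the minimum numbers of substitutions, deletions and insertions needed to turn the OCR output into the ground truth, counted over characters (index c) or words (index w), N is the length of the ground truth, and τ is the threshold passed as `tau` to `evaluate_segments` in step 4.

```latex
\mathrm{CER} = \frac{S_c + D_c + I_c}{N_c}, \qquad
\mathrm{WER} = \frac{S_w + D_w + I_w}{N_w}, \qquad
\text{segment correct} \iff \mathrm{CER} \le \tau
```

Both rates can exceed 1 when the OCR output is much longer than the ground truth, because insertions are counted against the ground-truth length.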

Requirements: Python 3.9+

pip install jiwer editdistance


### 1) Parse the MD files (split into segments such as titles and equations; images are ignored)

In [1]:
import re
from pathlib import Path


In [2]:
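# Each segment has the form:
#   <|ref|>TYPE<|/ref|><|det|>[[BBOX]]<|/det|> followed by CONTENT
# Caveat: the trailing \s* also consumes the newline before the next tag,
# so a segment with empty content (e.g. an image) absorbs the segment
# that follows it.
SEGMENT_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|>\s*"          # group 1: segment type
    r"<\|det\|>\[\[(.*?)\]\]<\|/det\|>\s*"  # group 2: bounding box
    r"(.*?)(?=\n<\|ref\||\Z)",              # group 3: content up to the next tag
    re.S
)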


In [5]:
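def parse_md(path):
    """Return the non-image segments of one MD file as dicts with "type" and "text"."""
    text = Path(path).read_text(encoding="utf-8")
    segments = []

    # The bounding box is not needed for the text comparison
    for seg_type, _bbox, content in SEGMENT_RE.findall(text):
        if seg_type == "image":
            continue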

        segments.append({
            "type": seg_type,
            "text": content.strip()
        })

    return segments
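
A quick way to sanity-check the pattern is to run it on a small inline sample. The sketch below assumes that image segments carry no content line, which the source does not show. `sample`, `segment_types` and `SEGMENT_RE_ALT` are hypothetical names introduced for illustration. With the pattern from the cell above, the text segment after the empty image ends up inside the image's content and is therefore dropped together with it. The variant pattern ends the content at the next `<|ref|>` tag instead.

```python
import re

# Pattern as defined in the notebook cell above.
SEGMENT_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|>\s*"
    r"<\|det\|>\[\[(.*?)\]\]<\|/det\|>\s*"
    r"(.*?)(?=\n<\|ref\||\Z)",
    re.S,
)

# Hypothetical variant: content ends at the next <|ref|> tag, even if the
# whitespace in front of that tag has already been consumed by \s*.
SEGMENT_RE_ALT = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|>\s*"
    r"<\|det\|>\[\[(.*?)\]\]<\|/det\|>\s*"
    r"(.*?)(?=<\|ref\|>|\Z)",
    re.S,
)

# Made-up sample: a title, an image without content, and a text block.
sample = (
    "<|ref|>title<|/ref|><|det|>[[10, 20, 300, 60]]<|/det|>\n"
    "# Introduction\n"
    "\n"
    "<|ref|>image<|/ref|><|det|>[[10, 80, 300, 200]]<|/det|>\n"
    "\n"
    "<|ref|>text<|/ref|><|det|>[[10, 220, 300, 400]]<|/det|>\n"
    "First paragraph.\n"
)


def segment_types(pattern, text):
    """List the segment types in the order the pattern finds them."""
    return [seg_type for seg_type, _bbox, _content in pattern.findall(text)]


print(segment_types(SEGMENT_RE, sample))      # ['title', 'image']
print(segment_types(SEGMENT_RE_ALT, sample))  # ['title', 'image', 'text']
```

Switching patterns changes which segments are evaluated, so the figures reported at the end of this notebook would need to be recomputed.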


### 2) Normalize (case, diacritics, whitespace)

In [6]:
import unicodedata


In [7]:
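def normalize(s):
    s = s.lower()
    # Decompose characters, then drop combining marks (e.g. "ü" -> "u")
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    # Collapse every run of whitespace into a single space
    s = re.sub(r"\s+", " ", s)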
    return s.strip()
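
To make the effect of the normalization concrete, here is a small sketch that repeats the function above and applies it to a few made-up strings. One consequence worth knowing for German text: because diacritics are stripped, an OCR error that drops an umlaut is not counted as an error.

```python
import re
import unicodedata


def normalize(s):
    """Same steps as the notebook cell above."""
    s = s.lower()
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    s = re.sub(r"\s+", " ", s)
    return s.strip()


# Case, umlauts and whitespace are folded; "ß" has no decomposition and stays.
print(normalize("  Über\n die   Größe "))  # uber die große

# "\ufb01" is the "fi" ligature; NFKD expands it to the two letters "fi".
print(normalize("\ufb01nal  Résumé"))      # final resume

# A missing umlaut is invisible after normalization.
print(normalize("Bär") == normalize("Bar"))  # True
```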


### 3) CER & WER

In [8]:
import editdistance
from jiwer import wer


In [9]:
def cer(gt, ocr):
    # Edit distance relative to the ground-truth length; max(1, ...) avoids division by zero
    return editdistance.eval(gt, ocr) / max(1, len(gt))
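
To see what these two numbers mean without installing anything, the following sketch computes both rates with a plain-Python Levenshtein distance. It is a dependency-free stand-in for illustration, not the implementation used by `editdistance` or `jiwer`. `levenshtein`, `cer_sketch` and `wer_sketch` are hypothetical names, and splitting words on whitespace is an assumption that may differ from `jiwer`'s tokenization.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Works on any two sequences, so it covers strings (characters) and
    lists of words alike.
    """
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer_sketch(gt, ocr):
    """Character error rate relative to the ground-truth length."""
    return levenshtein(gt, ocr) / max(1, len(gt))


def wer_sketch(gt, ocr):
    """Word error rate on whitespace-separated tokens."""
    gt_words = gt.split()
    return levenshtein(gt_words, ocr.split()) / max(1, len(gt_words))


print(cer_sketch("kitten", "sitting"))                # 0.5 (3 edits / 6 characters)
print(wer_sketch("the cat sat", "the cat sit down"))  # 2 edits / 3 words
```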


### 4) Compare segments

In [10]:
def evaluate_segments(gt_segments, ocr_segments, tau=0.1):
    results = []

    for gt, ocr in zip(gt_segments, ocr_segments):

        # Handle equations separately: exact match after normalization
        if gt["type"] == "equation":
            results.append({
                "type": "equation",
                "exact_match":
                    normalize(gt["text"]) == normalize(ocr["text"])
            })
            continue

        gt_text = normalize(gt["text"])
        ocr_text = normalize(ocr["text"])

        c = cer(gt_text, ocr_text)
        w = wer(gt_text, ocr_text)

        results.append({
            "type": "text",
            "cer": c,
            "wer": w,
            "segment_correct": c <= tau
        })

    return results
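
Because `zip` pairs segments purely by position and stops at the shorter list, a single missing or extra segment in one file shifts every later comparison without raising an error. A minimal sketch of a check that could be run before evaluating a file pair is shown below. `alignment_issues` is a hypothetical helper, not part of the notebook, and it only relies on the `"type"` key produced by `parse_md`.

```python
def alignment_issues(gt_segments, ocr_segments):
    """Describe mismatches between two position-aligned segment lists."""
    issues = []
    if len(gt_segments) != len(ocr_segments):
        issues.append(
            f"segment count differs: {len(gt_segments)} vs {len(ocr_segments)}"
        )
    # Compare types pairwise for the positions both lists have in common.
    for i, (gt, ocr) in enumerate(zip(gt_segments, ocr_segments)):
        if gt["type"] != ocr["type"]:
            issues.append(f"segment {i}: type {gt['type']!r} vs {ocr['type']!r}")
    return issues


# Made-up example: the OCR file is missing the title segment.
gt = [{"type": "title", "text": "Intro"}, {"type": "text", "text": "Hello"}]
ocr = [{"type": "text", "text": "Hello"}]

for issue in alignment_issues(gt, ocr):
    print(issue)
```

An empty list means the two files have the same number of segments with the same types in the same order. It does not prove that the contents belong together.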


### 5) Run over all MD files

In [11]:
def run_evaluation(original_dir, corrected_dir):
    all_results = []

    # Pair files by name; ground-truth files without an OCR counterpart are skipped
    for gt_file in Path(corrected_dir).glob("*.md"):
        ocr_file = Path(original_dir) / gt_file.name
        if not ocr_file.exists():
            continue

        gt_segments = parse_md(gt_file)
        ocr_segments = parse_md(ocr_file)

        all_results.extend(
            evaluate_segments(gt_segments, ocr_segments)
        )

    return all_results


### 6) Summary of results

In [12]:
import statistics

In [13]:
def summarize(results):
    text = [r for r in results if r["type"] == "text"]
    eq = [r for r in results if r["type"] == "equation"]

    return {
        "CER_mean": statistics.mean(r["cer"] for r in text),
        "CER_median": statistics.median(r["cer"] for r in text),
        "WER_mean": statistics.mean(r["wer"] for r in text),
        "Segment_Accuracy":
            sum(r["segment_correct"] for r in text) / len(text),
        "Equation_Accuracy":
            sum(r["exact_match"] for r in eq) / len(eq) if eq else None
    }
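
Note that `CER_mean` is an unweighted mean over segments, so a short heading with a few errors counts as much as a long paragraph. The sketch below uses made-up numbers to show how this differs from a length-weighted rate over all characters. `segments`, `macro_cer` and `micro_cer` are hypothetical names, and the notebook itself does not store edit counts or lengths per segment.

```python
import statistics

# Hypothetical per-segment data: (edit distance, ground-truth length).
segments = [(0, 200), (0, 150), (5, 10)]
tau = 0.1  # same default threshold as evaluate_segments

per_segment_cer = [errors / max(1, length) for errors, length in segments]

# Unweighted mean and median over segments, as computed by summarize().
macro_cer = statistics.mean(per_segment_cer)
median_cer = statistics.median(per_segment_cer)

# Length-weighted alternative: all errors divided by all characters.
micro_cer = sum(e for e, _ in segments) / sum(n for _, n in segments)

# Share of segments whose CER stays within the threshold.
segment_accuracy = sum(c <= tau for c in per_segment_cer) / len(per_segment_cer)

print(f"macro CER:        {macro_cer:.4f}")
print(f"median CER:       {median_cer:.4f}")
print(f"micro CER:        {micro_cer:.4f}")
print(f"segment accuracy: {segment_accuracy:.4f}")
```

In this example the single short segment pushes the unweighted mean to about 0.17, while the length-weighted rate stays near 0.014 and the median is 0.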


In [16]:
results = run_evaluation("original", "corrected")
summary = summarize(results)

for k, v in summary.items():
    # Equation_Accuracy is None when there are no equation segments
    print(f"{k}: {v:.8f}" if v is not None else f"{k}: n/a")


CER_mean: 0.01358545
CER_median: 0.00000000
WER_mean: 0.03922622
Segment_Accuracy: 0.95953757
Equation_Accuracy: 0.96634615


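- CER_mean ≈ 1.36 % → excellent OCR quality: on average about 1.4 character errors per 100 ground-truth characters (mean of the per-segment CER)
- CER_median = 0.0 → at least half of the text segments are perfect after normalization
- WER_mean ≈ 3.9 % → very good readability: about 4 word errors per 100 ground-truth words
- Segment_Accuracy ≈ 96 % → almost all text blocks are usable (CER ≤ 0.1, the default `tau`)
- Equation_Accuracy ≈ 97 % → strong for mathematical content (exact match after normalization)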