# Measure the contributions of each part of the OCR pipeline

## Possible OCR pipelines (actions are green)

![ocr_flow](assets/ocr_flow.drawio.jpg)

#### There are 4 groups of actions in the full pipeline (green boxes). The purpose of this notebook is to test whether each of these actions really helps the OCR results and, if so, by how much.

- Which of the 4 image processing pipelines improve OCR performance?
- Two OCR engines: `Tesseract` & `EasyOCR`. `Tesseract` is the current leader among open source OCR engines. Does adding `EasyOCR` improve the results?
- The `combine text` function is only needed if we stick with the ensemble approach, i.e. if we use more than one image processing pipeline or more than one OCR engine.
- The `clean text` function corrects misspellings and common OCR errors with punctuation, spacing, etc. We want to measure its efficacy.

## Comparison strategy

#### We're doing ablations on the OCR pipeline.
- How well do `Tesseract` and `EasyOCR` perform on their own, without image pre-processing? I'll also try each engine fed directly into the `clean text` function.
- How much does each image pre-processing step help the OCR process, and which steps work well with which OCR engine? I'm going to try various combinations of these.
- Can I whittle this down to one or zero image pre-processing pipelines and one OCR engine? If so, then this would allow me to drop the `combine text` step.
- How much does the `clean text` step help?
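
The ablation space described above can be sketched as follows. This is an illustration only, not the implementation of `comp.pipeline_engine_combos()`: the function name `ablation_combos`, the `max_size` parameter, and the reading of `""` as "no pre-processing" are assumptions made for this example.

```python
from itertools import combinations, product

# Names taken from the notebook's output; "" is assumed to mean "no pre-processing".
PIPELINES = ["", "deskew", "binarize", "denoise"]
ENGINES = ["tesseract", "easyocr"]


def ablation_combos(pipelines=PIPELINES, engines=ENGINES, max_size=None):
    """Hypothetical helper: list every ensemble of (pipeline, engine) pairs.

    Each ensemble is paired with a flag that says whether the text cleaning
    step runs afterwards.
    """
    pairs = list(product(pipelines, engines))
    max_size = max_size or len(pairs)

    combos = []
    for size in range(1, max_size + 1):
        for subset in combinations(pairs, size):
            for clean in (False, True):
                combos.append((subset, clean))
    return combos


# Limit ensembles to at most two (pipeline, engine) pairs to keep the list small.
COMBOS = ablation_combos(max_size=2)
print(len(COMBOS))
```

An ensemble of size one with `clean=False` corresponds to a bare engine run, so the single-engine baselines fall out of the same enumeration.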

#### Scoring
- I'll compare the output of each ablation sequence against an expert-derived gold standard.
- We score with Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one sequence into the other, which is the mismatch count of a best-case pairwise alignment. Lower scores are better.
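
For reference, here is a minimal dynamic-programming sketch of Levenshtein distance. It is not necessarily the routine that `ocr_compare` uses internally; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # prev[j] is the distance between the first i-1 chars of a and first j of b.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(
                min(
                    prev[j] + 1,  # delete ch_a
                    curr[j - 1] + 1,  # insert ch_b
                    prev[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A score of 0 means the OCR text matches the gold standard exactly.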

## Setup

In [11]:
from collections import defaultdict
from itertools import groupby
from pathlib import Path
from types import SimpleNamespace

from tqdm import tqdm

from digi_leap.pylib import consts
from digi_leap.pylib.db import db
from digi_leap.pylib.ocr import ocr_compare as comp

## Constants

In [2]:
ARGS = SimpleNamespace(
    database=Path(consts.DATA_DIR / "sernec" / "sernec.sqlite"),
    gold_set="test_gold_set",
    score_set="notebook_scores",
    notes="",
    csv_path=consts.DATA_DIR / "sernec" / "gold_std_2022-06-28_sample.csv",
)

## Save a gold standard to the database

In [3]:
# comp.insert_gold_std(ARGS.csv_path, ARGS.database, ARGS.gold_set)

## Get a gold standard from the database

In [4]:
GOLD_STD = comp.select_gold_std(ARGS.database, ARGS.gold_set)
# GOLD_STD[:2]

## Score the image processing and OCR engine combinations

In [5]:
def calculate_scores(args, gold_std):
    """Score the simple OCR runs and the OCR ensembles for every gold standard."""
    combos = comp.pipeline_engine_combos()

    scores = []

    for gold in tqdm(gold_std):
        images = comp.process_images(gold)
        all_frags = comp.get_ocr_fragments(images, gold)

        scores += comp.score_simple_ocr(args, gold, images)
        scores += comp.score_ocr_ensembles(args, gold, all_frags, combos)

    return scores


# SCORES = calculate_scores(ARGS, GOLD_STD)

## Insert scores

In [9]:
# comp.insert_scores(ARGS, SCORES)

## Load scores

In [7]:
def select_scores(args):
    with db.connect(args.database) as cxn:
        results = db.execute(
            cxn, "select * from ocr_scores where score_set = ?", (args.score_set,)
        )
        scores = list(results)
    return scores


SCORES = select_scores(ARGS)

## Peek at the scores

In [10]:
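def peek_scores(scores):
    # groupby() only merges consecutive rows, so order the rows by label first
    by_label = sorted(scores, key=lambda s: s["label_id"])

    for label_id, group in groupby(by_label, key=lambda s: s["label_id"]):
        print("=" * 80)
        print(label_id)
        # Lowest (best) score first; ties go to the shorter action string
        ranked = sorted(group, key=lambda s: (s["score"], len(s["actions"])))
        for score in ranked:
            print(f"{score['score']:4d}  {score['actions']}")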


# peek_scores(SCORES)
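
One caveat about `itertools.groupby`: it only merges consecutive items that share a key, so the rows need to be ordered by `label_id` before grouping. The query that loads the scores has no `order by` clause, so that ordering is not guaranteed. A small sketch with made-up rows (the helper `group_sizes` is hypothetical) shows the difference:

```python
from itertools import groupby

# Made-up rows: label 1 appears twice, but not next to itself.
rows = [
    {"label_id": 1, "score": 10},
    {"label_id": 2, "score": 7},
    {"label_id": 1, "score": 4},
]


def group_sizes(rows, presort):
    """Return (label_id, row count) for each group that groupby() yields."""
    if presort:
        rows = sorted(rows, key=lambda r: r["label_id"])
    grouped = groupby(rows, key=lambda r: r["label_id"])
    return [(label_id, len(list(group))) for label_id, group in grouped]


UNSORTED = group_sizes(rows, presort=False)
SORTED = group_sizes(rows, presort=True)

print(UNSORTED)  # label 1 is split into two groups
print(SORTED)  # one group per label
```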

## Tally the best pipelines

In [18]:
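def best_pipelines_tally(scores):
    """Sum each action sequence's place over all labels; lower totals are better."""
    tally = defaultdict(int)

    # groupby() only merges consecutive rows, so order the rows by label first
    by_label = sorted(scores, key=lambda s: s["label_id"])

    for _, group in groupby(by_label, key=lambda s: s["label_id"]):
        ranked = sorted(group, key=lambda s: (s["score"], len(s["actions"])))
        prev_score = -1
        place = 0
        for score in ranked:
            # Dense ranking: tied scores share the same place
            if score["score"] != prev_score:
                place += 1
                prev_score = score["score"]
            tally[score["actions"]] += place

    # Order by total place, then by the length of the action string
    totals = sorted((v, len(k), k) for k, v in tally.items())

    for total, _, actions in totals:
        print(f"{total:4d}  {actions}")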


best_pipelines_tally(SCORES)

  14  ('', 'tesseract') ('deskew', 'tesseract') ('binarize', 'tesseract') ('denoise', 'tesseract') post_process
  15  ('', 'tesseract') ('binarize', 'tesseract') ('denoise', 'tesseract') post_process
  15  ('', 'easyocr') ('', 'tesseract') ('deskew', 'easyocr') ('deskew', 'tesseract') post_process
  15  ('deskew', 'easyocr') ('deskew', 'tesseract') ('binarize', 'easyocr') ('binarize', 'tesseract') post_process
  17  ('deskew', 'tesseract') ('binarize', 'tesseract') ('denoise', 'tesseract') post_process
  22  ('binarize', 'easyocr') ('binarize', 'tesseract') ('denoise', 'easyocr') ('denoise', 'tesseract') post_process
  23  ('', 'tesseract') ('deskew', 'tesseract') ('binarize', 'tesseract') post_process
  24  ('binarize', 'tesseract') ('denoise', 'tesseract')
  24  ('deskew', 'easyocr') ('deskew', 'tesseract') post_process
  24  ('binarize', 'easyocr') ('binarize', 'tesseract') post_process
  25  ('binarize', 'tesseract') ('denoise', 'tesseract') post_process
  25  ('deskew', 'easyocr')
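
The first column above is the sum of each action sequence's place over all labels, where tied scores within a label share a place (dense ranking), so a lower total is better. A toy sketch with made-up scores shows how the totals accumulate. The names `dense_rank_tally` and `TOY` are hypothetical, and the input is a plain dict rather than database rows.

```python
from collections import defaultdict


def dense_rank_tally(scores_by_label):
    """Sum each action sequence's dense rank over all labels."""
    tally = defaultdict(int)
    for scores in scores_by_label.values():
        place, prev = 0, None
        # Lowest score (smallest edit distance) ranks first.
        for actions, score in sorted(scores.items(), key=lambda kv: kv[1]):
            if score != prev:  # a new score value opens the next place
                place += 1
                prev = score
            tally[actions] += place
    return dict(tally)


# Made-up Levenshtein scores for three action sequences on two labels.
TOY = {
    "label_1": {"A": 3, "B": 3, "C": 9},
    "label_2": {"A": 5, "B": 2, "C": 2},
}

TALLY = dense_rank_tally(TOY)
print(TALLY)  # {'A': 3, 'B': 2, 'C': 3}
```

Here `B` ties for first place on both labels, so it gets the lowest total. The tally rewards consistent placement rather than the size of the score differences.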