# Measure the contribution of each part of the OCR pipeline

## Possible OCR pipelines (actions are green)

![ocr_flow](assets/ocr_flow.drawio.jpg)

#### There are 4 groups of actions in the full pipeline (green boxes). The purpose of this notebook is to test whether each of these actions really helps the OCR results and, if so, by how much.

- **Find the single best ensemble.**
- Which of the 4 image processing pipelines improve OCR performance?
- Two OCR engines: `Tesseract` & `EasyOCR`. `Tesseract` is the current leader among open source OCR engines. Does adding `EasyOCR` improve the results?
- The `combine text` function is only needed if we stick with the ensemble approach, i.e. if we use more than one image processing pipeline or more than one OCR engine.
- The `clean text` function corrects misspellings and common OCR errors with punctuation, spacing, etc. We want to measure its efficacy.

## Comparison strategy

#### We're doing ablations on the OCR pipeline.
- How well do `Tesseract` and `EasyOCR` perform on their own, without image pre-processing? I'll also try each engine grafted directly onto the `clean text` function.
- How much does each of the image pre-processing steps help the OCR process, and which ones work well with which OCR engine? I'm going to try various combinations of these.
- Can I whittle this down to one or zero image pre-processing pipelines and one OCR engine? If so, then this would allow me to drop the `combine text` step.
- How much does the `clean text` step help?
- Note that `EasyOCR` uses a fair bit of GPU resources, which makes it difficult to parallelize. If we can remove it, the OCR pipeline will speed up significantly.
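
As a rough illustration of how large the ablation space is, the sketch below enumerates candidate pipelines. It is an assumption about the search space, not the enumeration used by `ocr_compare`: the action names are copied from the score output later in this notebook, while `PREPROCESSORS`, `ENGINES`, and `candidate_pipelines` are hypothetical names introduced here.

```python
from itertools import combinations

# Action names as they appear in the score output; "" means no pre-processing.
PREPROCESSORS = ["", "deskew", "binarize", "denoise"]
ENGINES = ["tesseract", "easyocr"]


def candidate_pipelines(preprocessors, engines):
    """Yield every non-empty set of (pre-process, engine) steps, with and without cleanup."""
    steps = [[p, e] for p in preprocessors for e in engines]
    for size in range(1, len(steps) + 1):
        for combo in combinations(steps, size):
            actions = [list(step) for step in combo]
            yield actions  # OCR only
            yield actions + [["post_process"]]  # OCR followed by text cleanup


pipelines = list(candidate_pipelines(PREPROCESSORS, ENGINES))
print(len(pipelines))  # 510 = (2**8 - 1) step subsets, each with and without post_process
```

Scoring every subset is exhaustive but grows exponentially with the number of steps, so a real run might restrict itself to a hand-picked list of pipelines instead.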

#### Scoring
- I'll score each ablation sequence against an expert-derived gold standard.
- I am using Levenshtein distance as the score: the minimum number of single-character insertions, deletions, and substitutions needed to turn one sequence into the other. Lower is better, and 0 is an exact match.
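
To make the metric concrete, here is a minimal pure-Python sketch of Levenshtein distance using the standard dynamic-programming recurrence. The `levenshtein` function below is illustrative only; the notebook's `Scorer` may well rely on a library implementation instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # Distance from a[:i] to the empty string is i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(
                min(
                    prev[j] + 1,  # Delete ch_a
                    curr[j - 1] + 1,  # Insert ch_b
                    prev[j - 1] + cost,  # Substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


# Two typical OCR confusions (e -> c and l -> 1) cost one edit each.
print(levenshtein("Quercus alba", "Qucrcus a1ba"))  # 2
```

Keeping only two rows of the table holds memory at O(len(b)) while the running time stays O(len(a) * len(b)).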

## Setup

In [1]:
import json
from IPython.display import display, HTML
from collections import defaultdict, namedtuple
from colorama import Back, Fore, Style
from itertools import groupby
from pathlib import Path
from types import SimpleNamespace

from digi_leap.pylib import consts
from digi_leap.pylib.ocr import ocr_compare as compare

In [2]:
display(HTML("<style>pre { white-space: pre !important; }</style>"))

In [3]:
GOLD_STD_PATH = consts.DATA_DIR / "sernec" / "gold_std_2022-06-28"

ARGS = SimpleNamespace(
    database=Path(consts.DATA_DIR / "sernec" / "sernec.sqlite"),
    gold_set="gold_set_2022-06-28",
    score_set="scores_2022-06-28",
    char_set="default",
    notes="",
    csv_path=GOLD_STD_PATH / "gold_std_2022-06-28.csv",
)

## Gold standard

In [4]:
# Save a new gold standard to a database

# compare.insert_gold_std(ARGS.csv_path, ARGS.database, ARGS.gold_set)

In [5]:
# Read a gold standard from the database

GOLD_STD = compare.select_gold_std(ARGS.database, ARGS.gold_set)
GOLD_DICT = {g["gold_id"]: g for g in GOLD_STD}

## OCR scores

In [6]:
scorer = compare.Scorer(ARGS)

In [7]:
# Calculate new scores

# SCORES = scorer.calculate(GOLD_STD)
# scorer.insert_scores(SCORES)

In [8]:
SCORES = scorer.select_scores()

## Examine scores

In [9]:
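def peek_scores(scores):
    # groupby only merges consecutive rows, so scores must already be ordered by label_id
    grouped_scores = groupby(scores, key=lambda s: s["label_id"])

    for label_id, label_scores in grouped_scores:
        print("=" * 80)
        print(label_id)
        # Best (lowest) score first; ties go to the pipeline with fewer actions
        label_scores = sorted(
            label_scores, key=lambda s: (s["score"], len(json.loads(s["actions"])))
        )
        for score in label_scores:
            print(f"{score['score']:4d}  {score['actions']}")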

# peek_scores(SCORES)
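
The grouping above depends on a detail of `itertools.groupby`: it only merges consecutive items that share a key. The sketch below uses made-up rows (not real scores) to show what happens when the input is not ordered by `label_id`. Whether `select_scores` returns rows in that order is not shown in this notebook, so treat the ordering as an assumption worth checking.

```python
from itertools import groupby

# Hypothetical rows; label 1 appears twice but not consecutively.
rows = [
    {"label_id": 1, "score": 9},
    {"label_id": 2, "score": 4},
    {"label_id": 1, "score": 3},
]


def group_sizes(records):
    """Return (label_id, row count) for each group that groupby produces."""
    by_label = groupby(records, key=lambda r: r["label_id"])
    return [(label_id, len(list(group))) for label_id, group in by_label]


unsorted_groups = group_sizes(rows)
sorted_groups = group_sizes(sorted(rows, key=lambda r: r["label_id"]))

print(unsorted_groups)  # [(1, 1), (2, 1), (1, 1)]  label 1 is split in two
print(sorted_groups)  # [(1, 2), (2, 1)]
```

Sorting by the grouping key first (or ordering the rows in the database query) guarantees one group per label.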

In [10]:
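def msa_top_scores(scores, gold_std, line_align):
    # groupby only merges consecutive rows, so scores must already be ordered by label_id
    grouped_scores = groupby(scores, key=lambda s: s["label_id"])

    for label_id, label_scores in grouped_scores:
        label_scores = list(label_scores)
        gold = gold_std[label_scores[0]["gold_id"]]
        min_score = min(s["score"] for s in label_scores)
        # Align the gold text with every OCR text that ties for the best score
        top = [gold["gold_text"]]
        top += [s["score_text"] for s in label_scores if s["score"] == min_score]
        top = [" ".join(ln.split()) for ln in top]  # Normalize whitespace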

        print(f"{label_id}  {min_score}")

        aligned = line_align.align(top)

        rows = len(aligned)
        cols = len(aligned[0])
        colored = [list(a) for a in aligned]

        # Highlight every column where the aligned texts disagree
        for col in range(cols):
            col_chars = [aligned[row][col] for row in range(rows)]
            if len(set(col_chars)) > 1:
                for row in range(rows):
                    colored[row][col] = (
                        Back.LIGHTGREEN_EX
                        + Fore.WHITE
                        + colored[row][col]
                        + Style.RESET_ALL
                    )

        # for ln in colored:
        #     print("".join(ln))
        # print()


msa_top_scores(SCORES, GOLD_DICT, scorer.line_align)

71341  9
71540  8
71640  12
71844  18
71943  1
72143  0
72241  3
72540  23
72744  10
73042  2
73240  3
73540  18
73644  1
73743  1
73840  6
74243  3
74445  68
74541  8
74643  3
74841  163
74941  10
75340  5
75440  0
75640  0
75845  2
76142  2
76241  2
76340  1
76442  0
76641  1
76742  0
76840  7
77243  5
77342  164
77440  0
77641  0
77845  7
78241  8
78343  14
78445  1
78542  8
78642  0
78743  46
79141  3
79342  5
79441  1
79541  0
79843  0
80341  4
80542  14
80740  4
81041  0
81440  1
81744  76
82041  1
82141  0
82442  0
82641  2
82842  14
83140  0
83340  2
83544  1
83645  0
83840  7
83940  1
84141  6
84241  9
84540  4
85142  0
85240  2
85740  5
85842  4
85942  0
86040  3
86840  1
87241  14
87540  0
87740  2
87843  0
87940  0
88042  0
88240  3
88541  1
88640  4
88743  0
88943  7
89042  1
89140  1
89340  5
90242  4
90341  2
90440  120
90641  5
91141  2
91240  12
91641  4
92040  4
92642  15
93040  0
93142  5
93240  14
93641  2
93748  0
93842  1
93944  55
94040  3
94140  5
94241  10
9454

In [11]:
PipelineScore = namedtuple("PipelineScore", "score pipeline")


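def scores_by_pipeline(scores, gold_std):
    # gold_std is currently unused
    # Sum the Levenshtein distances per pipeline over all labels
    tally = defaultdict(int)

    for score in scores:
        tally[score["actions"]] += score["score"]

    # Decode each JSON action list, dropping empty pipelines, and
    # sort by total score, then by number of actions
    tally = [(v, len(a), a) for k, v in tally.items() if (a := json.loads(k))]
    tally = sorted(tally)
    return [PipelineScore(t[0], t[2]) for t in tally]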


summed = scores_by_pipeline(SCORES, GOLD_DICT)

for sum_ in summed:
    print(sum_)

PipelineScore(score=3589, pipeline=[['deskew', 'easyocr'], ['deskew', 'tesseract'], ['binarize', 'easyocr'], ['binarize', 'tesseract'], ['post_process']])
PipelineScore(score=3743, pipeline=[['deskew', 'easyocr'], ['deskew', 'tesseract'], ['binarize', 'easyocr'], ['binarize', 'tesseract']])
PipelineScore(score=4109, pipeline=[['deskew', 'tesseract'], ['binarize', 'tesseract'], ['denoise', 'tesseract']])
PipelineScore(score=4167, pipeline=[['deskew', 'tesseract'], ['binarize', 'tesseract'], ['denoise', 'tesseract'], ['post_process']])
PipelineScore(score=4299, pipeline=[['deskew', 'easyocr'], ['deskew', 'tesseract'], ['binarize', 'easyocr'], ['binarize', 'tesseract'], ['denoise', 'easyocr'], ['denoise', 'tesseract'], ['post_process']])
PipelineScore(score=4638, pipeline=[['deskew', 'easyocr'], ['deskew', 'tesseract'], ['binarize', 'easyocr'], ['binarize', 'tesseract'], ['denoise', 'easyocr'], ['denoise', 'tesseract']])
PipelineScore(score=4692, pipeline=[['', 'tesseract'], ['binarize', 