# Compare OCR methods

## The OCR pipeline

![ocr_flow](../assets/ocr_flow.jpg)

#### There are 4 steps in the full pipeline. The purpose of this notebook is to test whether each of these steps really helps the OCR results and, if so, by how much.
1. Given a label image, run it through 3 image processing pipelines that commonly improve OCR results.
    1. Deskew the image: Gently blur the image, scale it to an OCR-friendly size, orient it upright, and then fine-tune the orientation.
    1. Binarize the image: Do everything in the deskew pipeline and then convert the result to a black and white image using the Sauvola algorithm.
    1. Denoise the image: Do everything in the binarize pipeline and then try to remove speckles and small holes by merging small islands of pixels with the surrounding values.
1. We pass each modified image from step 1 through 2 OCR engines, yielding 6 sets of OCR results. Both engines have their advantages and disadvantages.
    1. Tesseract: This is the OCR engine most people are familiar with.
    1. EasyOCR: A more recent OCR engine.
1. Now we combine the 6 sets of OCR results into 1 by doing:
    1. A multiple sequence alignment on the lines of text using a character "visual" similarity matrix for scoring replacements and gaps.
    1. We then derive a consensus sequence from that multiple sequence alignment.
1. Finally, we clean up the text with various techniques.
    1. We use a spell checker to fix lingering misspellings.
    1. We replace common OCR character errors, such as mixed-up or dropped punctuation.
    1. OCR engines have trouble with spaces, so we try to correct them by checking whether adding or removing a space improves the text output.
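
The consensus step in item 3 can be illustrated with a much simpler stand-in. The sketch below assumes the lines have already been aligned to equal length with a gap symbol, and it picks each column's most common character by plain majority vote. It does not use the character "visual" similarity matrix described above, and the `consensus` function and `GAP` symbol are hypothetical names, not part of `digi_leap`.

```python
from collections import Counter

GAP = "-"  # hypothetical gap symbol; the real pipeline may use another one


def consensus(aligned, gap=GAP):
    """Derive a consensus string from equal-length, pre-aligned strings.

    Each column votes for its most common character. Columns where the
    gap wins are dropped from the output.
    """
    if not aligned:
        return ""

    width = len(aligned[0])
    if any(len(row) != width for row in aligned):
        raise ValueError("All aligned rows must have the same length")

    chars = []
    for column in zip(*aligned):
        winner, _ = Counter(column).most_common(1)[0]
        if winner != gap:
            chars.append(winner)
    return "".join(chars)


# Three hypothetical OCR readings of the same line, already aligned
rows = ["Quercus alba", "Qu-rcus a1ba", "Quercus alba"]
print(consensus(rows))
```

With more OCR results per line, a single engine's mistake in any one column is more likely to be outvoted, which is the intuition behind combining all 6 result sets.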

## Comparison strategy

#### We're doing ablations on the OCR pipeline.
- How well do Tesseract and EasyOCR perform on their own?
- How much does each of the image processing steps (deskew, binarize, denoise) help the OCR process?
- Does the multiple sequence alignment (MSA) improve OCR results and do all image processing pipelines help with the MSA?
- How much does the text cleaning step help?
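
As a rough illustration of the ablation grid, the sketch below enumerates (pipeline, engine) pairs with `itertools.product`. The pipeline and engine names are taken from the labels printed later in this notebook, where an empty string stands for no image processing. The helper names `single_pairs` and `paired_engines` are hypothetical, and the actual contents and order of `comp.pipeline_engine_combos()` may differ.

```python
from itertools import product

# Names as they appear in the printed scores; "" means no image processing
PIPELINES = ["", "deskew", "binarize", "denoise"]
ENGINES = ["easyocr", "tesseract"]


def single_pairs():
    """Every (pipeline, engine) pair evaluated on its own."""
    return list(product(PIPELINES, ENGINES))


def paired_engines():
    """For each pipeline, both engines combined into one group."""
    return [tuple((p, e) for e in ENGINES) for p in PIPELINES]


for pair in single_pairs():
    print(pair)

for group in paired_engines():
    print(" ".join(str(t) for t in group))
```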

#### To score sequences, we use expert-derived text as a gold standard and score each ablation result against it using Levenshtein distance. Lower scores are better.
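
For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The function below is a minimal pure-Python sketch of the standard dynamic programming approach. It is not necessarily the implementation that `ocr_compare` uses, and the name `levenshtein` is chosen only for this example.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # Keep the shorter string in the inner loop to limit row size
    if len(a) < len(b):
        a, b = b, a

    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))

    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,  # delete char_a
                    curr[j - 1] + 1,  # insert char_b
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr

    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A score of 0 means the OCR text matches the gold standard exactly, and each additional point is one more character-level error.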

## Setup

In [1]:
import sys

sys.path.append("..")

In [2]:
from itertools import chain
from pathlib import Path
from types import SimpleNamespace

import pandas as pd
from tqdm import tqdm

from digi_leap.pylib import consts
from digi_leap.pylib.db import db
from digi_leap.pylib.label_builder import label_builder as builder
from digi_leap.pylib.ocr import ocr_runner as runner
from digi_leap.pylib.ocr import ocr_compare as comp

## Constants

In [3]:
ARGS = SimpleNamespace(
    database=Path(consts.DATA_DIR / "sernec" / "sernec.sqlite"),
    gold_set="test_gold_set",
    score_set="notebook_scores",
    notes="",
)

CSV_PATH = consts.DATA_DIR / "sernec" / "gold_std_2022-06-28_sample.csv"

## Save a gold standard to the database

In [4]:
# comp.new_gold_std(CSV_PATH, ARGS.database, ARGS.gold_set)

## Get a gold standard

In [5]:
GOLD_STD = comp.get_gold_std(ARGS.database, ARGS.gold_set)
# GOLD_STD[:2]

## Score the image processing OCR engine combinations

In [6]:
def get_scores(args, gold_std):
    """Score OCR results for each gold standard label."""
    combos = comp.pipeline_engine_combos()

    scores = []

    for gold in tqdm(gold_std):
        images = comp.process_images(gold)
        all_frags = comp.get_ocr_fragments(images, gold)

        scores += comp.simple_ocr(args, gold, images)

        for combo in combos:
            # Pool the OCR fragments from every member of this combination
            frags = list(chain.from_iterable(all_frags[t] for t in combo))
            actions = " ".join(str(t) for t in combo)

            # Score the text built from the fragments, before post-processing
            text = builder.build_label_text(
                frags, comp.spell_well(), comp.line_align()
            )
            scores.append(comp.new_score_rec(args, gold, text, actions))

            # Score the same text again after post-processing
            text = builder.post_process_text(text, comp.spell_well())
            scores.append(
                comp.new_score_rec(args, gold, text, f"{actions} post_process")
            )

    return scores

In [7]:
SCORES = get_scores(ARGS, GOLD_STD)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:52<00:00, 28.24s/it]


In [8]:
prev = ""
for score in SCORES:
    if prev != score["label_id"]:
        prev = score["label_id"]
        print()
        print(score["label_id"])
    print(f"{score['levenshtein']:4d}  {score['actions']}")


71341
  28  ('', 'easyocr')
  21  ('', 'easyocr') post_process
  26  ('', 'tesseract')
  26  ('', 'tesseract') post_process
  36  ('deskew', 'easyocr')
  27  ('deskew', 'easyocr') post_process
  10  ('deskew', 'tesseract')
  10  ('deskew', 'tesseract') post_process
  36  ('binarize', 'easyocr')
  30  ('binarize', 'easyocr') post_process
  13  ('binarize', 'tesseract')
  14  ('binarize', 'tesseract') post_process
  46  ('denoise', 'easyocr')
  42  ('denoise', 'easyocr') post_process
  19  ('denoise', 'tesseract')
  19  ('denoise', 'tesseract') post_process
  33  ('', 'easyocr') ('', 'tesseract')
  30  ('', 'easyocr') ('', 'tesseract') post_process
  15  ('deskew', 'easyocr') ('deskew', 'tesseract')
  12  ('deskew', 'easyocr') ('deskew', 'tesseract') post_process
  24  ('binarize', 'easyocr') ('binarize', 'tesseract')
  18  ('binarize', 'easyocr') ('binarize', 'tesseract') post_process
  30  ('denoise', 'easyocr') ('denoise', 'tesseract')
  23  ('denoise', 'easyocr') ('denoise', 'tesser