# OCR Lab

## Main goals of this lab
- Call Tesseract to run OCR on an image
- Preprocess the source image to improve OCR performance
- Evaluate the performance with metrics so that models can be compared

Bonus:
- Use Tesseract to locate words in the image

## Prerequisites
- Install the dependencies from the parent folder
- Have Tesseract installed on your computer
- Put the ``carolinems.traineddata`` model file in your Tesseract ``tessdata`` folder, then check that ``tesseract --list-langs`` lists ``carolinems``

In [None]:
import cv2  # cv2 (OpenCV) is a library specialized in image processing
import numpy as np
import pytesseract  # pytesseract is our Python interface to Tesseract
from matplotlib import pyplot as plt
from pathlib import Path

# Open the file as is and try OCR

First, we run Tesseract on the unmodified image and look at the result.

In [None]:
image_dir = Path('..') / 'demo_data'
image_path = image_dir / 'demo_image.jpeg'

In [None]:
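def open_image(image):
    # cv2.imread expects a string path and returns None if the file cannot be read
    img = cv2.imread(str(image))
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image}")
    return img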

In [None]:
img = open_image(image_path)
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # OpenCV loads images as BGR, matplotlib expects RGB

In [None]:
# https://github.com/madmaze/pytesseract
# The lang parameter selects the Tesseract model to use
text = pytesseract.image_to_string(img, lang="carolinems")
print("The text is:\n", text)

Are we happy with the result?

# What preprocessing can be done?

- Image preprocessing
  * grayscaling
  * thresholding
  * dilating
  * eroding
  * opening
  * canny edge detection
  * noise removal
  * template matching
- Orientation (deskewing)
- Segmentation

This list is not exhaustive and can easily be extended.

## Thresholding

### Reminder on images
An image is a 2D matrix of pixels. A pixel generally has three components, Red, Green and Blue (RGB), and any color can be obtained by combining them. In a standard 8-bit image, each component takes a value from 0 to 255. https://rgbcolorpicker.com
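
As a small illustration of this representation, the sketch below builds a tiny color image by hand with NumPy. The array name `tiny` and its pixel values are made up for the example; they are not part of the lab data.

```python
import numpy as np

# A hypothetical 2x3 color image: 2 rows, 3 columns, 3 components (R, G, B).
tiny = np.zeros((2, 3, 3), dtype=np.uint8)
tiny[0, 0] = (255, 0, 0)      # pure red
tiny[0, 1] = (0, 255, 0)      # pure green
tiny[0, 2] = (0, 0, 255)      # pure blue
tiny[1, :] = (255, 255, 255)  # a white row

print(tiny.shape)  # (rows, columns, components)
print(tiny.dtype)  # uint8, so every component lies between 0 and 255
print(tiny[0, 0])  # the three components of the top-left pixel
```

Note that OpenCV's `cv2.imread` stores the components in Blue, Green, Red (BGR) order rather than RGB, which matters when displaying the image with matplotlib.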


Thresholding is one of the first preprocessing steps to apply. It consists of:
- first: converting the image to grayscale
- second: converting the grayscale image to pure black and white (no gray) by applying a threshold to the gray intensity.
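
To make the two steps concrete, here is a minimal NumPy sketch. It assumes RGB channel order and a fixed threshold of 127, both chosen only for illustration; the helper names `to_gray` and `to_black_and_white` are hypothetical. The lab itself relies on OpenCV with Otsu's method, which picks the threshold automatically.

```python
import numpy as np

def to_gray(rgb):
    """Weighted sum of the R, G, B components (ITU-R BT.601 weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.round(rgb.astype(np.float64) @ weights).astype(np.uint8)

def to_black_and_white(gray, threshold=127):
    """Pixels brighter than the threshold become 255, all others 0."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# Made-up 2x2 image: light pixels in the left column, dark ones in the right.
sample_rgb = np.array([[[250, 250, 250], [30, 30, 30]],
                       [[200, 180, 160], [90, 60, 40]]], dtype=np.uint8)
sample_gray = to_gray(sample_rgb)
sample_bw = to_black_and_white(sample_gray)
print(sample_gray)
print(sample_bw)
```

After the second step only the values 0 and 255 remain, which is what "black and white (and no gray)" means in practice.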

In [None]:
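# get grayscale image (OpenCV loads images in BGR order)
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# binarize a grayscale image; with THRESH_OTSU the threshold is computed
# automatically and the 0 passed as threshold is ignored
def thresholding(image):
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]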

In [None]:
image = open_image(image_path)
gray = ... # grey scaled image
thresh = ... # thresholded image

In [None]:
# Plot the two images to compare
fig = plt.figure(figsize=(13,13))
ax = []

ax.append( fig.add_subplot(1, 2, 1) )
ax[-1].set_title('Raw image')
rgb_img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # convert BGR to RGB for matplotlib
plt.imshow(rgb_img)

ax.append( fig.add_subplot(1, 2, 2) )
ax[-1].set_title('After thresholding')
plt.imshow(thresh, cmap='gray')


In [None]:
# Apply tesseract again on the black and white image
text=...
print("The text is :\n",text)

## Other preprocessings

In [None]:
# ...
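
As one possible direction, the sketch below shows what eroding, dilating and opening do to a black and white image, written in plain NumPy with a fixed 3x3 neighbourhood. The function names and the sample array are hypothetical; in the lab you would more likely use OpenCV's own `cv2.erode`, `cv2.dilate` and `cv2.morphologyEx`.

```python
import numpy as np

def _shifted_views(padded, height, width):
    # The nine 3x3-neighbourhood shifts of the padded image.
    return [padded[dy:dy + height, dx:dx + width]
            for dy in range(3) for dx in range(3)]

def dilate3x3(binary):
    """A pixel becomes white if any pixel in its 3x3 neighbourhood is white."""
    h, w = binary.shape
    padded = np.pad(binary, 1, mode="constant", constant_values=0)
    return np.max(_shifted_views(padded, h, w), axis=0)

def erode3x3(binary):
    """A pixel stays white only if its whole 3x3 neighbourhood is white."""
    h, w = binary.shape
    padded = np.pad(binary, 1, mode="constant", constant_values=255)
    return np.min(_shifted_views(padded, h, w), axis=0)

def open3x3(binary):
    """Opening = erosion followed by dilation; it removes small specks."""
    return dilate3x3(erode3x3(binary))

noisy = np.zeros((7, 7), dtype=np.uint8)
noisy[2:5, 2:5] = 255  # a 3x3 white "stroke"
noisy[0, 6] = 255      # an isolated speck of noise
opened = open3x3(noisy)
print(opened)
```

In this made-up example the opening removes the isolated speck while the 3x3 stroke survives unchanged.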

# OCR Evaluation

To evaluate the performance of the OCR, we need a ``ground truth`` containing the expected text.
The ``jiwer`` library (https://jitsi.github.io/jiwer/usage/) can then compute performance metrics, such as the word error rate (WER) and the character error rate (CER), by comparing the two texts.
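
To give an intuition for what these metrics measure, here is a small pure-Python sketch based on the edit (Levenshtein) distance. It is an illustration only, not the `jiwer` implementation; the function names and sample strings are hypothetical, and `jiwer` may normalize the text differently, so its values can differ.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (lists or strings)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    """Character-level edits divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)

reference_text = "the quick brown fox"
ocr_text = "the quick brovvn fox"  # a typical OCR confusion: "w" read as "vv"
print(word_error_rate(reference_text, ocr_text))
print(char_error_rate(reference_text, ocr_text))
```

Here one word out of four is wrong, so the WER is 0.25, while only two character edits are needed over 19 reference characters, so the CER is much lower. This sketch does not handle an empty reference.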

In [None]:
with open(image_dir / "ground_truth.txt") as file:
    ground_truth = file.readlines()

ground_truth = " ".join([ i.strip() for i in ground_truth])

In [None]:
import jiwer

def score(ground_truth, predicted):
    return {
        "wer": ...,
        "cer": ...
    }

In [None]:
# Get OCR output using Pytesseract
lang = "carolinems"

custom_config = r'--oem 3 --psm 1'
print('-----------------------------------------')
print('ORIGINAL IMAGE')
print('-----------------------------------------')
print(score(ground_truth, pytesseract.image_to_string(image, config=custom_config, lang=lang)))

print('-----------------------------------------')
print('IMAGE WITH THRESHOLDING')
print('-----------------------------------------')
print(score(ground_truth, pytesseract.image_to_string(thresh, config=custom_config, lang=lang)))

# Tesseract parameters

Tesseract can do a lot for us, but it needs to be tuned to fit our use case.


In [2]:
!tesseract --help-extra

Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-fonts-table [options...] [configfile...]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  --dpi VALUE           Specify DPI for input image.
  --loglevel LEVEL      Specify logging level. LEVEL can be
                        ALL, TRACE, DEBUG, INFO, WARN, ERROR, FATAL or OFF.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engi

In [None]:
lang = "carolinems"

for psm in ...:
    custom_config = ...
    print(f"PSM: {psm}", score(ground_truth, pytesseract.image_to_string(thresh, config=custom_config, lang=lang)))

## PSM: page segmentation modes

The page segmentation mode controls how Tesseract segments the image into lines of text. It has a huge impact on Tesseract's ability to detect text.
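
The mode is passed to Tesseract through the `--psm` option of the configuration string. The sketch below only shows how such strings could be built; the helper `build_config` and the list of candidate modes are arbitrary choices made for illustration, and ``tesseract --help-psm`` gives the full list of modes available in your installation.

```python
def build_config(psm, oem=3):
    """Return a Tesseract option string such as '--oem 3 --psm 6'."""
    return f"--oem {oem} --psm {psm}"

candidate_psms = (3, 4, 6, 11)  # arbitrary picks, see `tesseract --help-psm`
configs = [build_config(psm) for psm in candidate_psms]
for config in configs:
    print(config)
```

Each of these strings can then be passed as the `config` argument of `pytesseract.image_to_string`, in the same way as `custom_config` earlier in this lab.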

# Word localisation

Tesseract can do more than output raw text. Here we use it to find the coordinates of words:

In [None]:
from pytesseract import Output

lang = "carolinems"
custom_config = r'--oem 3 --psm 1'

# thresh is single-channel: convert a copy to 3 channels so the boxes can be drawn in color
image = cv2.cvtColor(thresh, cv2.COLOR_GRAY2BGR)

d = pytesseract.image_to_data(thresh, config=custom_config, output_type=Output.DICT, lang=lang)

n_boxes = len(d['text'])
for i in range(n_boxes):
    # condition to only pick boxes with a confidence > 40%
    # (conf may be a string or a float depending on the pytesseract version)
    if float(d['conf'][i]) > 40:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        image = cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

plt.figure(figsize=(16,12))
plt.imshow(image)
plt.show()
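
The dictionary returned by `image_to_data` holds one list per field (`text`, `conf`, `left`, `top`, `width`, `height`, among others), all of the same length. The sketch below filters such a dictionary without calling Tesseract. The helper `confident_boxes` is hypothetical and the sample values are made up to mimic the layout, not real OCR output.

```python
def confident_boxes(data, min_conf=40.0):
    """Return (text, left, top, width, height) for words above min_conf."""
    boxes = []
    for i, word in enumerate(data["text"]):
        # conf may arrive as int, float or string depending on the version
        if word.strip() and float(data["conf"][i]) > min_conf:
            boxes.append((word, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes

# Hand-written sample mimicking the dictionary layout; values are made up.
sample = {
    "text": ["", "In", "principio", "erat"],
    "conf": ["-1", "91.5", "38.2", "77.0"],
    "left": [0, 12, 60, 190],
    "top": [0, 10, 10, 12],
    "width": [400, 40, 120, 70],
    "height": [300, 22, 24, 20],
}
kept = confident_boxes(sample)
print(kept)
```

Entries with an empty `text` describe layout elements such as pages, blocks or lines rather than words, which is why the sketch skips them in addition to applying the confidence filter.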