SapirKro/HebHTR

Hebrew Handwritten Text Recognizer (OCR)

Hebrew Handwritten Text Recognizer, based on machine learning. Implemented with TensorFlow and OpenCV.
Model is based on Harald Scheidl's SimpleHTR model [1], and CTC-WordBeam algorithm [2].

Getting Started

Prerequisites

Currently, HebHTR is supported only on Linux. I've tested it on Ubuntu 18.04.

To run HebHTR you first need to compile Harald Scheidl's CTC-WordBeam. To do so, clone the CTC-WordBeam repository, go to its cpp/proj/ directory, and run the script ./buildTF.sh.
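The steps above can be sketched as a shell session (the repository URL is assumed from Harald Scheidl's GitHub profile; verify it before running):

```shell
# Clone Harald Scheidl's CTC-WordBeamSearch (URL assumed, not stated in this README)
git clone https://github.com/githubharald/CTCWordBeamSearch.git
cd CTCWordBeamSearch/cpp/proj

# Build the custom TensorFlow operation used by the word beam decoder
./buildTF.sh
```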

Quick Start

from HebHTR import *

# Create new HebHTR object.
img = HebHTR('example.png')

# Infer words from image.
text = img.imgToWords(iterations=5, decoder_type='word_beam')

Result:

About the Model

As mentioned, the model was written by Harald Scheidl. It is designed to decode text from images containing a single word, and I trained it on a Hebrew words dataset.

The model receives an input image of shape 128×32, binary colored. It has 5 CNN layers and 2 RNN layers, and words are finally decoded with the CTC-WordBeam algorithm.
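The shape flow from the 128×32 input to the decoder can be sketched with simple arithmetic. The per-layer pool sizes below are taken from Scheidl's published SimpleHTR code, not from this repository, so treat them as an assumption:

```python
# Shape flow through the CNN stage of a SimpleHTR-style model.
# Pool sizes are assumed from Scheidl's published SimpleHTR configuration.
pool_sizes = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2)]  # (width, height) per CNN layer

width, height = 128, 32  # binary input image
for pw, ph in pool_sizes:
    width //= pw    # pooling along the image width
    height //= ph   # pooling along the image height

# The CNN output is a sequence of `width` feature vectors (height collapsed to 1);
# the RNN processes that sequence and CTC decoding maps it to a word.
print(width, height)  # -> 32 1
```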


A much more detailed explanation can be found in Harald's article [1].

All words predicted by this model should fit its input format, i.e. binary-colored images of size 128×32. Therefore, HebHTR normalizes each image to binary color. It then resizes the image (without distortion) until it reaches either a width of 128 or a height of 32. Finally, the image is copied into a (white) target image of size 128×32.

The following figure demonstrates this process:

About the Dataset

I've created a dataset of around 100,000 Hebrew words. Around 50,000 of them are real words, taken from students' scanned exams. Segmentation of those words was done using one of my previous works, which can be found here.
I cleaned and labeled this data manually. The other 50,000 words were generated artificially, also by me. The word list used to create the artificial words is taken from MILA's Hebrew stopwords lexicon [3]. Overall, the dataset contains 25 different handwriting styles. It also contains digits and punctuation characters.

All words in the dataset were encoded into black and white (binary).
For example:

About the Corpus

The corpus used by the Word Beam decoder contains around 500,000 unique Hebrew words. I created it from MILA's Arutz 7 corpus [4], TheMarker corpus [5], and HaKnesset corpus [6].

Available Functions

imgToWords

imgToWords(iterations=5, decoder_type='word_beam')

Converts a text-based image to text.

Parameters:

  • iterations (int): Number of dilation iterations performed on the image. The image is dilated to find the contours of its words. Default value is 5.

  • decoder_type (string): Which decoder to use when inferring a word. There are two decoding options:

    • 'word_beam' - CTC word beam algorithm.
    • 'best_path' - Determined by taking the model's most likely character at each position.

    Word beam decoding yields significantly better results.
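The difference is easiest to see on the simpler of the two: best-path decoding takes the most likely character at each time step, collapses repeats, and drops CTC blanks, with no dictionary constraint. A toy sketch (the charset and probability matrix are invented for illustration):

```python
import numpy as np

def best_path_decode(probs, charset, blank=0):
    """Toy best-path (greedy) CTC decoding: most likely character per
    time step, collapse repeats, then drop blanks. Word beam search,
    by contrast, constrains the output to dictionary words."""
    best = np.argmax(probs, axis=1)  # most likely index per time step
    collapsed = [k for i, k in enumerate(best) if i == 0 or k != best[i - 1]]
    return "".join(charset[k] for k in collapsed if k != blank)

# Invented 4-time-step output distribution over [blank, 'a', 'b'].
probs = np.array([
    [0.1, 0.8, 0.1],   # 'a'
    [0.1, 0.7, 0.2],   # 'a' (repeat, collapsed)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.2, 0.7],   # 'b'
])
print(best_path_decode(probs, ["-", "a", "b"]))  # -> "ab"
```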

Returns

  • Text decoded by the model from the image (string).

Example usage of this function:

from HebHTR import *

# Create new HebHTR object.
img = HebHTR('example.png')

# Infer words from image.
text = img.imgToWords(iterations=5, decoder_type='word_beam')

Result:


Requirements

  • TensorFlow 1.12.0
  • NumPy 1.16.4
  • OpenCV

References

[1] Harald Scheidl's SimpleHTR model
[2] Harald Scheidl's CTC-WordBeam algorithm
[3] The MILA Hebrew Lexicon
[4] MILA's Arutz 7 corpus
[5] MILA's TheMarker corpus
[6] MILA's HaKnesset corpus
