Jochre is an OCR package based on supervised machine learning techniques. It has been applied to several languages, including Yiddish, Occitan and Alsatian.
Using Jochre involves four phases:
- Annotation - Annotating a training/evaluation corpus using JochreWeb
- Training - Constructing the OCR model
- Evaluation - Evaluating the accuracy of the OCR model
- Analysis - Using an existing model to analyse new scanned pages
Annotation requires the JochreWeb application.
Training and evaluation require a Jochre database constructed using JochreWeb.
Analysis requires a model constructed during training, but no longer requires the database used to construct the model.
During analysis (and evaluation), Jochre performs the following steps:
- Segmentation: break up the images into paragraphs, rows, groups (representing words) and shapes (representing letters). This uses ad-hoc statistical algorithms.
- Guessing: apply the model to guess the n most probable words for each group (this list is known as the "beam").
- Post-processing: use a lexicon to rerank the words in the beam, and select the most likely analysis.
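
To make the guessing and post-processing steps more concrete, here is a minimal Python sketch of a beam being reranked with a lexicon. It is an illustration only and does not use Jochre's API: the `Candidate` class, the `top_n` and `rerank` functions, the `unknown_penalty` parameter and all probabilities are hypothetical. In particular, multiplying the score of out-of-lexicon words by a fixed penalty is an assumed scheme, chosen for simplicity; Jochre's actual scoring may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One guessed word for a group, with a made-up model probability."""
    word: str
    probability: float


def top_n(candidates, n):
    """Guessing: keep the n most probable candidates (the beam)."""
    return sorted(candidates, key=lambda c: c.probability, reverse=True)[:n]


def rerank(beam, lexicon, unknown_penalty=0.5):
    """Post-processing: reorder the beam using a lexicon.

    Assumed scheme: words missing from the lexicon have their score
    multiplied by unknown_penalty before sorting.
    """
    def adjusted(candidate):
        if candidate.word.lower() in lexicon:
            return candidate.probability
        return candidate.probability * unknown_penalty

    return sorted(beam, key=adjusted, reverse=True)


# Hypothetical guesses for a single group (word image).
candidates = [
    Candidate("tbe", 0.40),
    Candidate("the", 0.35),
    Candidate("tho", 0.15),
    Candidate("thc", 0.10),
]
lexicon = {"the", "tho"}

beam = top_n(candidates, n=3)      # "thc" falls outside the beam
best = rerank(beam, lexicon)[0]    # lexicon knowledge promotes "the"
print(best.word)
```

In this toy example the raw model prefers the non-word "tbe", but the lexicon-based reranking selects "the", which is the kind of correction the post-processing step is meant to provide.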
See Installation for Jochre installation instructions.