Various documents related to Tesseract OCR

The Fourth Annual Test of OCR Accuracy

Publication Year: 1995

An Overview of the Tesseract OCR Engine

Publication Year: 2007

The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy[1], is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier.

Hybrid Page Layout Analysis via Tab-Stop Detection

Published at the 2009 10th International Conference on Document Analysis and Recognition.

(source)

A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading-order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at https://github.com/tesseract-ocr/tesseract.
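The idea of locating tab-stops bottom-up can be illustrated with a deliberately simplified sketch. It is not the algorithm from the paper or the Tesseract C++ implementation: it only assumes that the left edges of text lines have already been measured in pixels, and it treats any x-position shared by several lines as a tab-stop candidate. The function name `find_tab_stops` and the parameters `tolerance` and `min_support` are hypothetical.

```python
def find_tab_stops(left_edges, tolerance=3, min_support=3):
    """Return candidate tab-stop x-positions from text-line left edges.

    Simplified illustration only (an assumption, not the paper's method):
    sorted edges are chained into a cluster while each consecutive gap is
    at most `tolerance` pixels, and a cluster counts as a tab-stop when at
    least `min_support` lines support it.
    """
    clusters = []
    for x in sorted(left_edges):
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)   # aligns with the current cluster
        else:
            clusters.append([x])     # starts a new cluster
    # Report the mean position of every sufficiently supported cluster.
    return [round(sum(c) / len(c)) for c in clusters if len(c) >= min_support]


# Hypothetical left edges: two aligned groups plus two stray lines.
left_edges = [50, 51, 49, 50, 120, 300, 301, 299, 410]
tab_stops = find_tab_stops(left_edges)
print(tab_stops)
```

In this toy input the stray edges at 120 and 410 are discarded because only one line supports each of them; a column layout could then be hypothesized from the surviving positions.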

Combined Orientation and Script Detection using the Tesseract OCR Engine

(source)

Publication Year: 2009

This paper proposes a simple but effective algorithm to estimate the script and dominant page orientation of the text contained in an image. A candidate set of shape classes for each script is generated using synthetically rendered text and used to train a fast shape classifier. At run time, the classifier is applied independently to connected components in the image for each possible orientation of the component, and the accumulated confidence scores are used to determine the best estimate of page orientation and script. Results demonstrate the effectiveness of the approach on a dataset of 1846 documents containing a diverse set of images in 14 scripts and any of four possible page orientations.
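The accumulation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the shape classifier is assumed to have already produced, for each connected component, a confidence for each (orientation, script) hypothesis, and the confidences are simply summed. The function name and the toy scores are hypothetical.

```python
from collections import defaultdict


def estimate_orientation_and_script(component_scores):
    """Pick the (orientation, script) pair with the highest summed confidence.

    `component_scores` holds one dict per connected component, mapping
    (orientation_in_degrees, script_name) to a classifier confidence.
    Plain summation is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for scores in component_scores:
        for hypothesis, confidence in scores.items():
            totals[hypothesis] += confidence
    if not totals:
        return None  # no components, so no estimate
    return max(totals, key=totals.get)


# Hypothetical confidences for three connected components.
components = [
    {(0, "Latin"): 0.9, (180, "Latin"): 0.4, (0, "Cyrillic"): 0.3},
    {(0, "Latin"): 0.2, (180, "Latin"): 0.6, (0, "Cyrillic"): 0.5},
    {(0, "Latin"): 0.8, (180, "Latin"): 0.1, (0, "Cyrillic"): 0.2},
]
best = estimate_orientation_and_script(components)
print(best)
```

The second component on its own favours the upside-down hypothesis, but the evidence accumulated over all components still selects upright Latin text, which is the point of voting across the page rather than trusting any single component.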

Adapting the Tesseract Open Source OCR Engine for Multilingual OCR

(source)

Publication Year: 2009

We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although change was required to various modules, including physical layout analysis, and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.

©ACM, 2009. This is the authors’ version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the International Workshop on Multilingual OCR 2009, Barcelona, Spain July 25, 2009.
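Word and character error rates such as those quoted in the abstract above are commonly computed from the edit distance between the OCR output and a ground-truth transcription. The sketch below uses that common definition; the exact scoring procedure used in the paper is not stated here, so treat the details (whitespace tokenization, equal cost for all edit types) as assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the length of the reference."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def word_error_rate(ref, hyp):
    return error_rate(ref.split(), hyp.split())


def char_error_rate(ref, hyp):
    return error_rate(list(ref), list(hyp))


truth = "the quick brown fox"
ocr = "the quick brovvn fox"   # hypothetical OCR output: "w" read as "vv"
print(word_error_rate(truth, ocr), char_error_rate(truth, ocr))
```

A word-level rate suits languages with whitespace-delimited words, while a character-level rate is the natural choice for a script such as Simplified Chinese, which is consistent with the abstract reporting the two kinds of figure separately.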

Table detection in heterogeneous documents

(source)

Publication Year: 2010

Detecting tables in document images is important since not only do tables contain important information, but also most of the layout analysis methods fail in the presence of tables in the document image. Existing approaches for table detection mainly focus on detecting tables in single columns of text and do not work reliably on documents with varying layouts. This paper presents a practical algorithm for table detection that works with high accuracy on documents with varying layouts (company reports, newspaper articles, magazine pages, ...). An open source implementation of the algorithm is provided as part of the Tesseract OCR engine. Evaluation of the algorithm on document images from the publicly available UNLV dataset shows competitive performance in comparison to the table detection module of a commercial OCR system.

Comparison to other implementations:

Learning to Detect Tables in Scanned Document Images using Line Information, at the 2017 14th IAPR International Conference on Document Analysis and Recognition
Table Detection using Deep Learning

Limits on the Application of Frequency-based Language Models to OCR

(source)

Publication Year: 2011

Although large language models are used in speech recognition and machine translation applications, OCR systems are “far behind” in their use of language models. The reason for this is not the laggardness of the OCR community, but the fact that, at high accuracies, a frequency-based language model can do more damage than good, unless carefully applied. This paper presents an analysis of this discrepancy with the help of the Google Books n-gram Corpus, and concludes that noisy-channel models that closely model the underlying classifier and segmentation errors are required.
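A toy noisy-channel decision illustrates how a frequency-based language model can do more harm than good at high accuracy. The sketch picks the candidate word w that maximizes P(observed | w) · P(w). All probabilities, both channel functions, and the example words are hypothetical values chosen for illustration; they are not taken from the paper or from the Google Books n-gram Corpus.

```python
import math


def best_candidate(observed, candidates, word_prob, channel_prob):
    """Return argmax over w of log P(observed | w) + log P(w)."""
    return max(
        candidates,
        key=lambda w: math.log(channel_prob(observed, w)) + math.log(word_prob[w]),
    )


# Hypothetical unigram frequencies: "modern" is far more common than "modem".
word_prob = {"modern": 1e-4, "modem": 1e-6}


def crude_channel(observed, word):
    # Assumes the classifier is wrong fairly often, whatever the evidence.
    return 0.9 if observed == word else 0.1


def careful_channel(observed, word):
    # Assumes this particular confusion is rare for the classifier.
    return 0.999 if observed == word else 0.001


candidates = ["modem", "modern"]
print(best_candidate("modem", candidates, word_prob, crude_channel))
print(best_candidate("modem", candidates, word_prob, careful_channel))
```

With the crude channel, the word frequencies overwhelm a correct reading of "modem" and replace it with "modern"; with a channel that reflects how rarely the classifier actually makes that confusion, the correct reading survives. This is the sense in which the channel model has to track the underlying classifier and segmentation errors closely.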

Improving Book OCR by Adaptive Language and Image Models

(source)

Publication Year: 2012

In order to cope with the vast diversity of book content and typefaces, it is important for OCR systems to leverage the strong consistency within a book but adapt to variations across books. In this work, we describe a system that combines two parallel correction paths using document-specific image and language models. Each model adapts to shapes and vocabularies within a book to identify inconsistencies as correction hypotheses, but relies on the other for effective cross-validation. Using the open source Tesseract engine as baseline, results on a large dataset of scanned books demonstrate that word error rates can be reduced by 25% using this approach.

Tesseract Blends Old and New OCR Technology: Tutorial2 at DAS2016 – Document Analysis Systems