HOCR to EOCR Converter

This proof of concept converts .hocr to .eocr. Note that this converter only supports the words (ocrx_word), lines (ocr_line), paragraphs (ocr_par) and pages ( ocr_page) from the .hocr content.

The EOCR File Format

A .eocr is a format that Zuva DocAI recognizes as "pre-OCR'd" content - in other words, Zuva DocAI will not run the document through the OCR engine if a .eocr is provided.

There are two main components of a .eocr file:

The header;
The body (i.e. an instance of the Document message from recognition_results_pb2.proto)

In order, the .eocr contains the above components:

The header;
The sha1 digest of the body
The body, g-zip compressed

Using recognition_results_pb2.proto

This .proto contains the Protobuf messages. These messages can be used to create instance objects in your programming language of choice (for example, an instance of a EOCR Character) To do this, you will need to use the protoc command compile the .proto for your programming language.

The following can be used to create the Python classes which represent the .proto file, since this proof of concept was written in Python:

protoc -I=. --python_out=. recognition_results.proto

Additional information about Protobuf can be found in the following link: https://developers.google.com/protocol-buffers/docs/pythontutorial#compiling-your-protocol-buffers

For convenience, a copy of recognition_results_pb2.py is provided in this repository.

Usage

The following is an example of how to use the HOCRToEOCRConverter class:

from HOCRToEOCRConverter import HOCRToEOCRConverter

converter = HOCRToEOCRConverter()
converter.hocr_folder = ''  # The folder that contains the list of .hocr for each page OCR'd out of the source file
converter.set_document_md5(b'')  # The source file's md5 .digest()
converter.start()
converter.export('')  # The file path (including file name) of the resultant .eocr

This script can be used in conjunction with the Zuva DocAI Python Wrapper sample code, where you can take resultant .eocr content and submit it to Zuva via file.create.

Troubleshooting

On MacOS, if you encounter the error

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

You may need to install lxml:

pip3 install lxml

Licensing

The code is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
out/CANADAGOOS-F1Securiti-2152017		out/CANADAGOOS-F1Securiti-2152017
.gitignore		.gitignore
CANADAGOOS-F1Securiti-2152017.PDF		CANADAGOOS-F1Securiti-2152017.PDF
CANADAGOOS-F1Securiti-2152017.eocr		CANADAGOOS-F1Securiti-2152017.eocr
CANADAGOOS-F1Securiti-2152017.hocr		CANADAGOOS-F1Securiti-2152017.hocr
HOCRToEOCRConverter.py		HOCRToEOCRConverter.py
LICENSE		LICENSE
README.md		README.md
doOcr.sh		doOcr.sh
eocr_helper.py		eocr_helper.py
hocr_helper.py		hocr_helper.py
poc.py		poc.py
recognition_results.proto		recognition_results.proto
recognition_results_pb2.py		recognition_results_pb2.py

License

zuvaai/hocr-to-eocr-converter

Folders and files

Latest commit

History

Repository files navigation

HOCR to EOCR Converter

The EOCR File Format

Using recognition_results_pb2.proto

Usage

Troubleshooting

Licensing

About

Resources

License

Stars

Watchers

Forks

Languages