Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-tools

About

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and the OCR-related information coexist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.
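
To make the idea of embedding concrete, here is a minimal sketch (Python 3, standard library only) that reads bounding boxes back out of an hOCR fragment. It assumes the common hOCR convention of storing properties such as bbox in the title attribute, separated by semicolons. The fragment, the parse_title helper and the BBoxCollector class are made up for illustration and are not part of hocr-tools.

```python
from html.parser import HTMLParser

# Hypothetical one-line hOCR fragment, made up for illustration.
SAMPLE = (
    '<span class="ocr_line" title="bbox 10 20 300 60; baseline 0 -5">'
    '<span class="ocrx_word" title="bbox 10 20 120 60; x_wconf 93">Hello</span> '
    '<span class="ocrx_word" title="bbox 130 20 300 60; x_wconf 88">world</span>'
    '</span>'
)


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class BBoxCollector(HTMLParser):
    """Collect (class, bbox) pairs for every element that carries a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        props = parse_title(attrs.get("title") or "")
        if "bbox" in props:
            bbox = tuple(int(v) for v in props["bbox"])
            self.boxes.append((attrs.get("class"), bbox))


collector = BBoxCollector()
collector.feed(SAMPLE)
for css_class, bbox in collector.boxes:
    print(css_class, bbox)
```

Because the layout data lives in ordinary attributes, a browser shows only the text "Hello world" while a tool like the sketch above can still recover the geometry.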

There is a Public Specification for the hOCR Format.

About the code

Each command line program is self-contained; if you have Python 2.7 with the required packages installed, it should just work. (Unfortunately, that means some code duplication; we may revisit this issue in later revisions.)

Installation

System-wide with pip

You can install hocr-tools along with its dependencies from PyPI:

sudo pip install hocr-tools

System-wide from source

On a Debian/Ubuntu system, install the dependencies from packages:

sudo apt-get install python-lxml python-reportlab python-pil \
  python-beautifulsoup python-numpy python-scipy python-matplotlib

Or, to fetch dependencies from the cheese shop:

sudo pip install -r requirements.txt  # basic

Then install the dist:

sudo python setup.py install

virtualenv

Once

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Subsequently

source venv/bin/activate
./hocr-...

Available Programs

Included command line programs:

hocr-check

hocr-check file.html

Perform consistency checks on the hOCR file.

hocr-combine

hocr-combine file1.html [file2.html ...]

Combine the OCR pages contained in each HTML file into a single document. The document metadata is taken from the first file.
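
The combining step can be pictured with the following sketch. It is not the hocr-combine implementation: it assumes well-formed XHTML input without namespaces so that the standard library XML parser can read it, and the two sample documents and the combine function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two made-up single-page documents; real hOCR files carry far more markup.
FIRST = (
    "<html><head><title>Book</title></head><body>"
    "<div class='ocr_page' title='bbox 0 0 100 100'>page one</div>"
    "</body></html>"
)
SECOND = (
    "<html><head><title>ignored</title></head><body>"
    "<div class='ocr_page' title='bbox 0 0 100 100'>page two</div>"
    "</body></html>"
)


def combine(documents):
    """Append the ocr_page elements of later documents to the first one."""
    base = ET.fromstring(documents[0])
    body = base.find("body")
    for text in documents[1:]:
        other = ET.fromstring(text)
        for page in other.iter("div"):
            if page.get("class") == "ocr_page":
                body.append(page)
    return base  # head (metadata) still comes from the first document


combined = combine([FIRST, SECOND])
pages = [d.text for d in combined.iter("div") if d.get("class") == "ocr_page"]
print(combined.findtext("head/title"), pages)
```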

hocr-cut

hocr-cut [-h] [-d] [file.html]

Cut a page (horizontally) into two pages in the middle such that most of the bounding boxes are separated cleanly, e.g. when cutting double pages or double columns.
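
One way to think about choosing the cut position is sketched below. This is an illustrative heuristic, not necessarily what hocr-cut does: it searches a window around the page middle for the vertical line that crosses the fewest bounding boxes. The function names, the search parameter and the sample boxes are assumptions.

```python
def count_crossings(boxes, x):
    """Number of boxes (x0, y0, x1, y1) that a vertical line at x would split."""
    return sum(1 for x0, _, x1, _ in boxes if x0 < x < x1)


def best_cut(boxes, page_width, search=0.1):
    """Pick the x position near the page middle that splits the fewest boxes.

    `search` is the fraction of the page width examined on each side of the
    middle; ties are resolved in favour of the position closest to the middle.
    """
    middle = page_width // 2
    radius = int(page_width * search)
    candidates = range(middle - radius, middle + radius + 1)
    return min(candidates, key=lambda x: (count_crossings(boxes, x), abs(x - middle)))


# Hypothetical double page: left column ends at x=480, right one starts at x=530.
boxes = [
    (50, 100, 480, 140),
    (50, 150, 470, 190),
    (530, 100, 950, 140),
    (530, 150, 940, 190),
]
cut = best_cut(boxes, page_width=1000)
left = [b for b in boxes if b[2] <= cut]
right = [b for b in boxes if b[0] >= cut]
print(cut, len(left), len(right))
```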

hocr-eval-lines

hocr-eval-lines [-v] true-lines.txt hocr-actual.html

Evaluate hOCR output against ASCII ground truth. This evaluation method requires that the line breaks in true-lines.txt and the ocr_line elements in hocr-actual.html agree (most ASCII output from OCR systems satisfies this requirement).
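
The requirement that line breaks agree means lines can be compared pairwise. The sketch below illustrates that idea with a plain Levenshtein distance; the exact error measure used by hocr-eval-lines may differ, and both function names and the sample lines are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def evaluate_lines(true_lines, ocr_lines):
    """Return (total errors, total ground-truth characters) for paired lines."""
    if len(true_lines) != len(ocr_lines):
        raise ValueError("line counts differ; the line breaks must agree")
    errors = sum(edit_distance(t, o) for t, o in zip(true_lines, ocr_lines))
    chars = sum(len(t) for t in true_lines)
    return errors, chars


truth = ["The quick brown fox", "jumps over the lazy dog"]
ocr = ["The qu1ck brown fox", "jumps ovr the lazy dog"]
errors, chars = evaluate_lines(truth, ocr)
print(errors, chars)
```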

hocr-eval-geom

hocr-eval-geom [-e element-name] [-o overlap-threshold] hocr-truth hocr-actual

Compare the segmentations at the level of the element name (default: ocr_line). Computes undersegmentation, oversegmentation, and missegmentation.
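
As a rough illustration of the three measures, the sketch below counts them from bounding boxes alone. The definitions are assumptions made for this example and may not match hocr-eval-geom exactly: two boxes "match" when their intersection covers at least the threshold fraction of the smaller box, a true element matched by several actual elements counts as oversegmented, an actual element matching several true elements counts as undersegmented, and a true element with no match counts as missed.

```python
def area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def overlaps(a, b, threshold):
    """True if the intersection covers at least `threshold` of the smaller box."""
    smaller = min(area(a), area(b))
    return smaller > 0 and area(intersection(a, b)) / smaller >= threshold


def segmentation_counts(truth, actual, threshold=0.5):
    """Return (oversegmented, undersegmented, missed) under the assumed definitions."""
    over = under = missed = 0
    for t in truth:
        hits = sum(overlaps(t, a, threshold) for a in actual)
        over += hits > 1      # one true element split into several
        missed += hits == 0   # true element with no counterpart
    for a in actual:
        hits = sum(overlaps(t, a, threshold) for t in truth)
        under += hits > 1     # several true elements merged into one
    return over, under, missed


# Made-up boxes: line 1 is split in two, lines 2 and 3 are merged, line 4 is lost.
truth = [(0, 0, 100, 20), (0, 30, 100, 50), (0, 60, 100, 80), (0, 90, 100, 110)]
actual = [(0, 0, 48, 20), (52, 0, 100, 20), (0, 30, 100, 80)]
print(segmentation_counts(truth, actual))
```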

hocr-eval

hocr-eval hocr-true.html hocr-actual.html

Evaluate the actual OCR with respect to the ground truth. This outputs the number of OCR errors due to incorrect segmentation and the number of OCR errors due to character recognition errors.

It works by aligning segmentation components geometrically, and for each segmentation component that can be aligned, computing the string edit distance of the text the segmentation component contains.

hocr-extract-g1000

Extract lines from the Google 1000 book sample.

hocr-extract-images

hocr-extract-images [-b BASENAME] [-p PATTERN] [-e ELEMENT] [-P PADDING] [file]

Extract the images and texts within all the ocr_line elements within the hOCR file. The BASENAME is the image directory, the default pattern is line-%03d.png, the default element is ocr_line and there is no extra padding by default.
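
The pattern and padding options can be illustrated with two small helpers. Both are hypothetical and only sketch the idea: the pattern is treated as a printf-style format applied to a running index (whether counting starts at 0 or 1 is not stated here), and padding is assumed to grow the bounding box on all four sides, clipped to the page.

```python
def output_name(pattern, index):
    """Expand a printf-style pattern such as line-%03d.png for one element."""
    return pattern % index


def padded_bbox(bbox, padding, page_size):
    """Grow (x0, y0, x1, y1) by `padding` pixels on every side, clipped to the page."""
    x0, y0, x1, y1 = bbox
    width, height = page_size
    return (
        max(0, x0 - padding),
        max(0, y0 - padding),
        min(width, x1 + padding),
        min(height, y1 + padding),
    )


print(output_name("line-%03d.png", 7))
print(padded_bbox((5, 20, 300, 60), padding=10, page_size=(305, 800)))
```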

hocr-lines

hocr-lines [FILE]

Extract the text within all the ocr_line elements within the hOCR file given by FILE. If called without any file, hocr-lines reads hOCR data from stdin.
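
A minimal sketch of this kind of extraction, using only the Python 3 standard library, might look as follows. It is not the hocr-lines source: the LineText class and the sample markup are made up, and the depth counting assumes that every tag inside a line is explicitly closed.

```python
from html.parser import HTMLParser

SAMPLE = """
<div class="ocr_page" title="bbox 0 0 600 800">
  <span class="ocr_line" title="bbox 10 20 300 60">
    <span class="ocrx_word">Hello</span> <span class="ocrx_word">world</span>
  </span>
  <span class="ocr_line" title="bbox 10 70 300 110">second line</span>
</div>
"""


class LineText(HTMLParser):
    """Collect the text of every element whose class includes ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0      # nesting depth inside the current ocr_line
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif "ocr_line" in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Normalise whitespace before storing the finished line.
                self.lines.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


parser = LineText()
parser.feed(SAMPLE)  # sys.stdin.read() could be fed here instead
print(parser.lines)
```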

hocr-merge-dc

hocr-merge-dc dc.xml hocr.html > hocr-new.html

Merges the Dublin Core metadata into the hOCR file by encoding the data in its header.

hocr-pdf

hocr-pdf <imgdir> > out.pdf
hocr-pdf --savefile out.pdf <imgdir>

Create a searchable PDF from a directory of hOCR and JPEG files. Each JPEG file and its corresponding hOCR file must have the same name, differing only in their file extension. All of these files should lie in one directory, which is passed as the argument to the command. For example, hocr-pdf . > out.pdf processes the current directory and saves the output as out.pdf. Alternatively, hocr-pdf . --savefile out.pdf avoids routing the output through the terminal.
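
Before running hocr-pdf it can help to check which files in a directory actually pair up. The helper below is a hypothetical sketch for that check, not part of hocr-tools; the .jpg and .hocr extensions are assumptions exposed as parameters, so adjust them to your files.

```python
import os
import tempfile


def paired_pages(directory, image_ext=".jpg", hocr_ext=".hocr"):
    """Return sorted (image, hocr) path pairs that share a base name."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(name)
        if ext.lower() != image_ext:
            continue
        hocr = os.path.join(directory, stem + hocr_ext)
        if os.path.exists(hocr):
            pairs.append((os.path.join(directory, name), hocr))
    return pairs


# Demonstrate on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page-001.jpg", "page-001.hocr", "page-002.jpg", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [tuple(os.path.basename(p) for p in pair) for pair in paired_pages(tmp)]
    print(found)
```

In this made-up directory, page-002.jpg has no hOCR counterpart and is therefore not reported as a pair.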

hocr-split

hocr-split file.html pattern

Split a multipage hOCR file into hOCR files containing one page each. The pattern should be something like "base-%03d.html".

hocr-wordfreq

hocr-wordfreq [-h] [-i] [-n MAX] [-s] [-y] [file.html]

Outputs a list of the most frequent words in an hOCR file with their number of occurrences. If called without any file, hocr-wordfreq reads hOCR data (for example from hocr-combine) from stdin.

By default, the first 10 words are shown, but any number can be requested with -n. Use -i to ignore case, -s to split on spaces only (words will then also contain punctuation), and -y to try to dehyphenate the text (rejoin words separated by a hyphen at a line break) before analysis.
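
The effect of these options can be sketched on plain text as follows. This is an approximation for illustration, not the hocr-wordfreq code: the function and its keyword arguments are hypothetical stand-ins for -n, -i, -s and -y, the default tokenisation is assumed to be runs of word characters, and the spaces-only mode here splits on any whitespace.

```python
import re
from collections import Counter


def word_frequencies(text, max_words=10, ignore_case=False,
                     spaces_only=False, dehyphenate=False):
    """Return the most frequent words as (word, count) pairs."""
    if dehyphenate:
        # Rejoin words split by a hyphen at a line break: "exam-\nple" -> "example".
        text = re.sub(r"-\s*\n\s*", "", text)
    if ignore_case:
        text = text.lower()
    if spaces_only:
        words = text.split()  # punctuation stays attached to the words
    else:
        words = re.findall(r"\w+", text)
    return Counter(words).most_common(max_words)


text = "The cat saw the other cat. The exam-\nple ends."
print(word_frequencies(text, max_words=2, ignore_case=True, dehyphenate=True))
print(word_frequencies(text, max_words=1, spaces_only=True))
```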

Unit tests

The unit tests are written using the tsht framework.

Running the full test suite:

./test/tsht

Running a single test:

./test/tsht <path-to/unit-test.tsht>

e.g.

./test/tsht test/hocr-pdf/test-hocr-pdf.tsht

Writing a test

Please see the documentation in the tsht repository and take a look at the existing unit tests.

  1. Create a new directory under ./test
  2. Copy any test assets (images, hOCR files...) to this directory
  3. Create a file <name-of-your-test>.tsht starting from this template:
#!/usr/bin/env tsht

# adjust to the number of your tests
plan 1

# write your tests here
exec_ok "hocr-foo" "-x" "foo"

# remove any temporary files
# rm some-generated-file