RFC: Add option to include images in hOCR output #3710

MerlijnWajer · 2022-01-05T13:15:17Z

Written in collaboration with Aram (see commit author).

This pull request adds support for outputting ocr_photo elements in the hOCR renderer. ocr_photo is "Something that requires JPEG or PNG to be represented well" per the specification. ocr_image would be for SVG content. Since the two are hard to distinguish, let's just use ocr_photo unconditionally.

There are a few open questions/concerns that I can think of:

- Do we want to render ocrx_word elements inside ocr_photo? I think per the hOCR specification this is allowed, but in our testing Tesseract seems to just "find" a word of the same bounding box with nothing but spaces as characters.
- If we do not want to have words inside ocr_photo elements, should we skip generating the ocr_line when writing hOCR even if the hocr images option is turned off, since we know the element is detected as an image block type with (apparently) no actual content?
- If we want to skip adding ocrx_word elements, is the goto acceptable, or would you rather see the code rewritten without the goto?
- Do we want to default to having this turned on?
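For illustration, here is a minimal sketch of the kind of element the renderer would emit for an image block. It is not the code from this PR: `BBox` and `PhotoElement` are hypothetical names, and the id scheme simply mirrors the `block_<page>_<n>` ids seen in Tesseract's existing hOCR output.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for a block's bounding box; not a Tesseract type.
struct BBox {
  int left, top, right, bottom;
};

// Builds an empty hOCR ocr_photo element for one image block.
// No ocr_par, ocr_line or ocrx_word children are written.
std::string PhotoElement(int page_id, int block_id, const BBox &box) {
  // A bounding box must not be inverted.
  assert(box.left <= box.right && box.top <= box.bottom);
  std::ostringstream out;
  out << "<div class='ocr_photo' id='block_" << page_id << '_' << block_id
      << "' title=\"bbox " << box.left << ' ' << box.top << ' ' << box.right
      << ' ' << box.bottom << "\"></div>";
  return out.str();
}
```

Whether such an element should stay empty or may contain word elements is exactly the first open question above.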

stweil

Great that you addressed this highly desired feature!

I have some smaller comments. Please note that we plan to release 5.0.1 this week, and I would postpone reviewing and testing of the new feature until that release has been published.

src/api/hocrrenderer.cpp

Signed-off-by: Merlijn Wajer <merlijn@wizzup.org>

MerlijnWajer · 2022-01-05T14:06:12Z

Thinking about this some more, I think I recall having some images where Tesseract thought the entire page was a photo, but also still found (legitimate) text on it. I'll find such an example -- if my recollection is correct then we definitely do want to add text elements inside the ocr_photo elements.

MerlijnWajer · 2022-01-05T14:36:25Z

Here is one such image: https://archive.org/~merlijn/tesseract-images/hocr-images/sim_canadian-medical-association-journal_1963-03-16_88_11_0003.jpg

There are also two files in that directory ( https://archive.org/~merlijn/tesseract-images/hocr-images/ ): "skipword.hocr" was produced with the code from this PR as of writing, and "noskipword.hocr" was produced with skipword always set to false.

It looks like tesseract's iterator is clever enough to not place the majority of the text under the floating image, so that would argue for indeed skipping text contents in ocr_photo elements, but would love to get some more feedback or test cases for it.

kba · 2022-01-05T14:54:45Z

> If we do not want to have words inside ocr_photo elements, should we skip generating the ocr_line when writing hOCR even if the hocr images option is turned off, since we know the element is detected as an image block type with (apparently) no actual content?

While the hOCR spec is vague in this regard, I would still strongly argue against ocr_line in ocr_photo. There might be legitimate applications (like text in an advert in a newspaper), but generating a "placeholder" `ocr_line` for every image is not one of them.

wollmers · 2022-01-05T15:03:12Z

@MerlijnWajer

> Do we want to render ocrx_word elements inside ocr_photo? I think per the hOCR specification this is allowed, but Tesseract seems to just "find" a word of the same bounding box with nothing but spaces as characters, in our testing.

If I understand the code of the PR correctly, it just writes an HTML element of type ocr_photo with a bounding box. In that case I vote for allowing ocrx_word to overlap with ocr_photo, as long as we only have rectangular bounding boxes and not polygons as in PAGE-XML. On skewed or warped pages bounding boxes can overlap. Or the illustration is not rectangular and the text flows around it.
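The overlap case can be made concrete with a small sketch. `BBox`, `Overlaps` and `Contains` are hypothetical helpers, not Tesseract API; they only show that with axis-aligned rectangles a word box can intersect a photo box without lying inside it.

```cpp
// Hypothetical axis-aligned box in page pixel coordinates.
struct BBox {
  int left, top, right, bottom;
};

// True if the two boxes share any area (boxes that only touch do not count).
bool Overlaps(const BBox &a, const BBox &b) {
  return a.left < b.right && b.left < a.right &&
         a.top < b.bottom && b.top < a.bottom;
}

// True if inner lies completely inside outer.
bool Contains(const BBox &outer, const BBox &inner) {
  return outer.left <= inner.left && outer.top <= inner.top &&
         inner.right <= outer.right && inner.bottom <= outer.bottom;
}
```

A word whose box overlaps a photo box but is not contained in it is the situation described above for skewed pages or text flowing around an illustration.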

> Do we want to default to having this turned on?

I would like to have as many non-text areas classified and tagged as possible: tables, formulae, decorative elements (separators), drawings, copper engravings (whole page), woodcuts. Even if it's not very reliable, the heuristic guess of Tesseract ("there is something, but it's not typical text") is helpful.

MerlijnWajer · 2022-01-06T12:41:07Z

> @MerlijnWajer
>
> Do we want to render ocrx_word elements inside ocr_photo? I think per the hOCR specification this is allowed, but Tesseract seems to just "find" a word of the same bounding box with nothing but spaces as characters, in our testing.
>
> If I understand the code of the PR, it just writes an HTML element of type ocr_photo with a bounding-box. Then I vote for overlapping ocrx_word with ocr_photo if we have only rectangular bounding-boxes and not polygons like Page-XML. On skewed or warped pages bounding-boxes can overlap. Or the illustration is not rectangular and the text flows around the illustration.

Ok, note that currently we only seem to output a few spaces for these areas, and nothing else. I think it depends mostly on how the Tesseract page iterator works. It might be that the current behaviour (before this PR) of outputting an ocr_line block (with this PR, ocr_photo) with just some spaces is in fact a bug.

> Do we want to default to having this turned on?
>
> I would like to have as much non-text areas classified and tagged as possible. Tables, formulae, decorative elements (separators), drawings, copper engravings (whole page), wood cuts. Even if it's not very reliable, the heuristic guess of Tesseract ("there is something, but it's not typical text") is helpful.

Right, there are other block types like PT_NOISE that we could map to ocr_noise, and others like PT_HORZ_LINE and PT_VERT_LINE. I also agree we ought to output as much as we can, because heuristics in Tesseract can improve over time.
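A rough sketch of such a mapping is below. `BlockKind` is a hypothetical enum that only mirrors a few of Tesseract's PolyBlockType values so the example stays self-contained. Mapping the line types to ocr_separator is an assumption for illustration; only the image and noise mappings are proposed in this thread.

```cpp
#include <string>

// Hypothetical mirror of a few PolyBlockType values (PT_FLOWING_TEXT,
// PT_FLOWING_IMAGE, PT_HEADING_IMAGE, PT_PULLOUT_IMAGE, PT_HORZ_LINE,
// PT_VERT_LINE, PT_NOISE). Not the real Tesseract enum.
enum class BlockKind {
  FlowingText,
  FlowingImage,
  HeadingImage,
  PulloutImage,
  HorzLine,
  VertLine,
  Noise
};

// Returns the hOCR class to use for the block's outer element.
std::string HocrClassFor(BlockKind kind) {
  switch (kind) {
    case BlockKind::FlowingImage:
    case BlockKind::HeadingImage:
    case BlockKind::PulloutImage:
      return "ocr_photo";      // all image kinds, as proposed in this PR
    case BlockKind::HorzLine:
    case BlockKind::VertLine:
      return "ocr_separator";  // assumption: lines treated as separators
    case BlockKind::Noise:
      return "ocr_noise";
    case BlockKind::FlowingText:
      break;
  }
  return "ocr_carea";          // text blocks keep the existing class
}
```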

wollmers · 2022-01-07T08:13:02Z

@MerlijnWajer

> Ok, note that currently we only seem to output a few spaces for these areas, and nothing else. I think it depends mostly on how the Tesseract page iterator works. It might be that the current behaviour (before this PR) of outputting an ocr_line block (with this PR ocr_photo) with just some spaces is in fact a bug.

AFAIR it depends on the combination of --psm and --oem. A photo (also called halftone) is easy to detect in most cases, because it has sharp edges against the background and it has characteristic features (density histogram, size, proportion). This can simply be written into hOCR as ocr_photo with a low rate of false negatives.

There are other cases which Tesseract does not seem to detect, like large, extremely letterspaced letters, or single glyphs like page numbers. Then Tesseract either outputs nothing (not a text area, case 1), or tries OCR without a recognition result (case 2) and outputs a word element with only a space in it.
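Case 2 can be recognised on the output side with a simple check. This is only a sketch: `IsBlankWord` is a hypothetical helper, not part of Tesseract, and it only looks at ASCII whitespace, so non-ASCII space characters encoded in UTF-8 would not be treated as blank.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True if the recognised text of a word is empty or consists only of
// ASCII whitespace, i.e. the word carries no recognition result.
bool IsBlankWord(const std::string &text) {
  return std::all_of(text.begin(), text.end(), [](unsigned char c) {
    return std::isspace(c) != 0;
  });
}
```

A renderer could use such a test to decide whether a word element is worth writing at all.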

stweil · 2022-01-07T21:04:43Z

I get no text for test/testing/8087_054.3G.tif with the current patch. So more tuning is needed.

stweil · 2022-01-07T23:11:45Z

> Tesseract thought the entire page was a photo

And it is even right, it is some kind of photo (or a scan), so the hOCR output may look like this:

--- /tmp/noimg.hocr	2022-01-08 00:07:04.559688193 +0100
+++ /tmp/img.hocr	2022-01-08 00:06:20.446583605 +0100
@@ -1116,13 +1116,7 @@
      </span>
     </p>
    </div>
-   <div class='ocr_carea' id='block_1_96' title="bbox 0 0 3450 5034">
-    <p class='ocr_par' id='par_1_100' lang='eng' title="bbox 0 0 3450 5034">
-     <span class='ocr_line' id='line_1_158' title="bbox 0 0 3450 5034; baseline 0 0; x_size 2517; x_descenders -1258.5; x_ascenders 1258.5">
-      <span class='ocrx_word' id='word_1_401' title='bbox 0 0 3450 5034; x_wconf 95'> </span>
-     </span>
-    </p>
-   </div>
+   <div class='ocr_photo' id='block_1_96' title="bbox 0 0 3450 5034"></div>
   </div>
  </body>
 </html>

stweil · 2022-01-15T21:55:13Z

In https://github.com/stweil/tesseract/tree/output-formats I have an experimental implementation which works better for me and which also handles ALTO and text output.

stweil · 2022-01-16T13:18:27Z

@MerlijnWajer, in pull request #3723 I handle image and line regions unconditionally. I tested it on a set of 41998 images. It produced the same text results, but correctly wrote image and line regions instead of text regions with words consisting of blanks only. Maybe you want to test that on a larger number of images.

MerlijnWajer · 2022-01-16T13:22:38Z

> @MerlijnWajer, in pull request #3723 I handle image and line regions unconditionally. I tested it on a set of 41998 images. It produced the same text results, but correctly wrote image and line regions instead of text regions with words consisting of blanks only. Maybe you want to test that on a larger number of images.

Looks good, thanks for picking this up and improving it. Not writing text regions with only blank words I think is another clear improvement. I will try to do a few test runs with the code coming week (and can provide a sign off if useful), but don't block on me if you feel it's ready to be merged.

stweil · 2022-02-02T13:58:34Z

Meanwhile, more than two weeks have passed. Are there new results from testing this pull request and also #3723?

MerlijnWajer · 2022-02-02T14:04:06Z

I'll build PR #3723 today, try it out and report back today, thanks for the nudge. The way I understood it, #3723 replaces this PR, correct?

stweil · 2022-02-02T14:29:32Z

Yes, that's right.

amitdo · 2022-02-16T14:00:08Z

#3723 was merged.

@MerlijnWajer, thanks for your effort.

amitdo · 2022-02-16T14:33:50Z

@wollmers,

Regarding polygons, the API has a BlockPolygon() method. Currently, none of the output formats use this method.

stweil · 2022-02-16T15:49:16Z

OCR-D uses BlockPolygon (see https://github.com/OCR-D/ocrd_tesserocr/blob/master/ocrd_tesserocr/recognize.py#L506).

I think Tesseract's hOCR and ALTO output could be enhanced to use polygons, too.

wollmers · 2022-02-17T09:14:32Z

https://github.com/OCR-D/ocrd_tesserocr/blob/master/ocrd_tesserocr/recognize.py#L506

Polygons for the geometric hulls of areas, and polylines for e.g. baselines, are more precise and flexible. We can always convert or interpolate to a bbox or a straight line, but not in the other direction. Best would be Bezier curves, which are supported by SVG, and ImageMagick understands the SVG syntax for Bezier curves. Potrace also returns SVG. But that's maybe overkill in 99% of the cases, and the math is too complex for an average developer.
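The one-way conversion from polygon to bbox is just a min/max over the vertices, as in the sketch below. `Point`, `BBox` and `BoundingBox` are hypothetical names for illustration, not Tesseract or OCR-D API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point {
  int x, y;
};

struct BBox {
  int left, top, right, bottom;
};

// Returns the smallest axis-aligned box that encloses all polygon vertices.
// The polygon's shape is lost; it cannot be recovered from the box.
BBox BoundingBox(const std::vector<Point> &polygon) {
  assert(!polygon.empty());
  BBox box{polygon[0].x, polygon[0].y, polygon[0].x, polygon[0].y};
  for (const Point &p : polygon) {
    box.left = std::min(box.left, p.x);
    box.top = std::min(box.top, p.y);
    box.right = std::max(box.right, p.x);
    box.bottom = std::max(box.bottom, p.y);
  }
  return box;
}
```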

stweil requested changes Jan 5, 2022


Added option to include images in hOCR output

dafe4ff

Signed-off-by: Merlijn Wajer <merlijn@wizzup.org>

MerlijnWajer force-pushed the hocr-images branch from 4b662b9 to dafe4ff Compare January 5, 2022 13:59

stweil added this to To do: New features for release 5 in Tesseract next Jan 6, 2022

stweil mentioned this pull request Jan 7, 2022

Plans for tesseract 5.x.y #3673

Open

stweil moved this from To do: New features for release 5 to In progress in Tesseract next Jan 8, 2022

amitdo closed this Feb 16, 2022

stweil removed this from In progress in Tesseract next Feb 16, 2022

MerlijnWajer mentioned this pull request Feb 20, 2022

Usefulness of MRC for decent quality compression of scanned book pages with illustrations internetarchive/archive-pdf-tools#33

Open
