RFC: Add option to include images in hOCR output #3710

Closed

@MerlijnWajer MerlijnWajer commented Jan 5, 2022

Written in collaboration with Aram (see commit author).

This pull request adds support for outputting ocr_photo elements in the hOCR renderer. Per the specification, ocr_photo is "Something that requires JPEG or PNG to be represented well"; ocr_image would be for SVG content. Since the two are hard to distinguish, let's just use ocr_photo unconditionally.
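For illustration only, here is a minimal Python sketch of the kind of element this adds. The function name `ocr_photo_div` is hypothetical and this is not the C++ code from the PR; the markup and `block_<page>_<n>` id pattern simply follow the hOCR excerpt quoted later in this thread.

```python
def ocr_photo_div(page: int, block: int, bbox: tuple[int, int, int, int]) -> str:
    """Format an empty hOCR ocr_photo element for one image block.

    Hypothetical helper: the id pattern and attribute quoting are copied
    from the hOCR excerpt in this thread, not from the renderer source.
    """
    left, top, right, bottom = bbox
    return (
        f"<div class='ocr_photo' id='block_{page}_{block}' "
        f'title="bbox {left} {top} {right} {bottom}"></div>'
    )


print(ocr_photo_div(1, 96, (0, 0, 3450, 5034)))
```

The element is deliberately empty: whether words should be nested inside it is one of the open questions listed in this description.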

There are a few open questions/concerns that I can think of:

  1. Do we want to render ocrx_word elements inside ocr_photo? I think per the hOCR specification this is allowed, but Tesseract seems to just "find" a word of the same bounding box with nothing but spaces as characters, in our testing.
  2. If we do not want to have words inside ocr_photo elements, should we skip generating the ocr_line when writing hOCR even if the hocr images option is turned off, since we know the element is detected as an image block type with (apparently) no actual content?
  3. If we want to skip adding ocrx_word elements, is the goto acceptable, or would you rather see the code rewritten without the goto?
  4. Do we want to default to having this turned on?

Contributor

@stweil stweil left a comment

Great that you addressed this highly desired feature!

I have some minor comments. Please note that we plan to release 5.0.1 this week, and I would postpone reviewing and testing the new feature until that release has been published.

src/api/hocrrenderer.cpp (four review comments, now outdated and resolved)
Signed-off-by: Merlijn Wajer <merlijn@wizzup.org>
@MerlijnWajer
Contributor Author

Thinking about this some more, I think I recall having some images where Tesseract thought the entire page was a photo, but also still found (legitimate) text on it. I'll find such an example -- if my recollection is correct then we definitely do want to add text elements inside the ocr_photo elements.

@MerlijnWajer
Contributor Author

MerlijnWajer commented Jan 5, 2022

Here is one such image: https://archive.org/~merlijn/tesseract-images/hocr-images/sim_canadian-medical-association-journal_1963-03-16_88_11_0003.jpg

There are also two files in that directory ( https://archive.org/~merlijn/tesseract-images/hocr-images/ ): "skipword.hocr" was produced with the code from this PR as of writing, and "noskipword.hocr" was produced with skipword always set to false.

It looks like Tesseract's iterator is clever enough not to place the majority of the text under the floating image, which would argue for skipping text content in ocr_photo elements, but I would love to get more feedback or test cases for it.

@kba

kba commented Jan 5, 2022

> If we do not want to have words inside ocr_photo elements, should we skip generating the ocr_line when writing hOCR even if the hocr images option is turned off, since we know the element is detected as an image block type with (apparently) no actual content?

While the hOCR spec is vague in this regard, I would still strongly argue against ocr_line in ocr_photo. There might be legitimate applications (like text in an advert in a newspaper), but generating a "placeholder" `ocr_line` for every image is not one of them.

@wollmers

wollmers commented Jan 5, 2022

@MerlijnWajer

> 1. Do we want to render ocrx_word elements inside ocr_photo? I think per the hOCR specification this is allowed, but Tesseract seems to just "find" a word of the same bounding box with nothing but spaces as characters, in our testing.

If I understand the code of the PR, it just writes an HTML element of type ocr_photo with a bounding box. In that case I vote for allowing ocrx_word to overlap with ocr_photo, as long as we only have rectangular bounding boxes and not polygons as in PAGE-XML. On skewed or warped pages, bounding boxes can overlap. Or the illustration is not rectangular and the text flows around it.

> 4. Do we want to default to having this turned on?

I would like to have as many non-text areas classified and tagged as possible: tables, formulae, decorative elements (separators), drawings, copper engravings (whole page), woodcuts. Even if it's not very reliable, Tesseract's heuristic guess ("there is something, but it's not typical text") is helpful.

@MerlijnWajer
Contributor Author

> @MerlijnWajer
>
> > 1. Do we want to render ocrx_word elements inside ocr_photo? I think per the hOCR specification this is allowed, but Tesseract seems to just "find" a word of the same bounding box with nothing but spaces as characters, in our testing.
>
> If I understand the code of the PR, it just writes an HTML element of type ocr_photo with a bounding box. In that case I vote for allowing ocrx_word to overlap with ocr_photo, as long as we only have rectangular bounding boxes and not polygons as in PAGE-XML. On skewed or warped pages, bounding boxes can overlap. Or the illustration is not rectangular and the text flows around it.

Ok, note that currently we only seem to output a few spaces for these areas, and nothing else. I think it depends mostly on how the Tesseract page iterator works. It might be that the current behaviour (before this PR) of outputting an ocr_line block (with this PR, ocr_photo) containing just some spaces is in fact a bug.

> > 4. Do we want to default to having this turned on?
>
> I would like to have as many non-text areas classified and tagged as possible: tables, formulae, decorative elements (separators), drawings, copper engravings (whole page), woodcuts. Even if it's not very reliable, Tesseract's heuristic guess ("there is something, but it's not typical text") is helpful.

Right, there are other block types such as PT_NOISE, which we could map to ocr_noise, and others like PT_HORZ_LINE and PT_VERT_LINE. I also agree we ought to output as much as we can, because heuristics in Tesseract can improve over time.
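A rough Python sketch of such a mapping is shown below. The block type names follow Tesseract's PolyBlockType enum, but the table itself is an assumption made for illustration: in particular, using the hOCR class ocr_separator for line blocks is a guess, and the helper name `hocr_class_for` is hypothetical.

```python
# Keys are names from Tesseract's PolyBlockType enum. The hOCR class chosen
# for each one is an illustrative assumption, not the renderer's behaviour.
HOCR_CLASS_BY_BLOCK_TYPE = {
    "PT_FLOWING_IMAGE": "ocr_photo",
    "PT_HEADING_IMAGE": "ocr_photo",
    "PT_PULLOUT_IMAGE": "ocr_photo",
    "PT_HORZ_LINE": "ocr_separator",
    "PT_VERT_LINE": "ocr_separator",
    "PT_NOISE": "ocr_noise",
}


def hocr_class_for(block_type: str) -> str:
    """Return the hOCR class for a block type, defaulting to a text area."""
    return HOCR_CLASS_BY_BLOCK_TYPE.get(block_type, "ocr_carea")


for name in ("PT_NOISE", "PT_VERT_LINE", "PT_FLOWING_TEXT"):
    print(name, "->", hocr_class_for(name))
```

Keeping the mapping in one table would make it easy to tag further block types later, as the heuristics improve.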

@stweil stweil added this to To do: New features for release 5 in Tesseract next Jan 6, 2022
@wollmers

wollmers commented Jan 7, 2022

@MerlijnWajer

> Ok, note that currently we only seem to output a few spaces for these areas, and nothing else. I think it depends mostly on how the Tesseract page iterator works. It might be that the current behaviour (before this PR) of outputting an ocr_line block (with this PR, ocr_photo) containing just some spaces is in fact a bug.

AFAIR it depends on the combination of --psm and --oem. A photo (also called a halftone) is easy to detect in most cases, because it has sharp edges against the background and characteristic features (density histogram, size, proportion). It can simply be written into hOCR as ocr_photo, with a low rate of false negatives.

There are other cases that Tesseract does not seem to detect, such as large, extremely letterspaced letters or single glyphs like page numbers. Tesseract then either outputs nothing (not a text area, case 1), or attempts OCR without a recognition result (case 2) and outputs a word element with only a space in it.
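To find such blank-only blocks in existing output, something like the following Python sketch could be used. It assumes a well-formed, namespace-free hOCR fragment (a full hOCR file usually declares the XHTML namespace, which would change the tag names), and `blank_only_blocks` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def has_class(elem, name):
    """True if the element's class attribute contains the given class."""
    return name in elem.get("class", "").split()


def blank_only_blocks(fragment: str) -> list[str]:
    """Return ids of ocr_carea blocks whose words contain only whitespace."""
    # Wrap the fragment so that several top-level blocks parse as one tree.
    root = ET.fromstring(f"<root>{fragment}</root>")
    blank_ids = []
    for block in root.iter("div"):
        if not has_class(block, "ocr_carea"):
            continue
        words = [e for e in block.iter("span") if has_class(e, "ocrx_word")]
        if words and all(not "".join(w.itertext()).strip() for w in words):
            blank_ids.append(block.get("id"))
    return blank_ids


SAMPLE = """
<div class='ocr_carea' id='block_1_96' title="bbox 0 0 3450 5034">
 <p class='ocr_par' id='par_1_100' lang='eng' title="bbox 0 0 3450 5034">
  <span class='ocr_line' id='line_1_158' title="bbox 0 0 3450 5034">
   <span class='ocrx_word' id='word_1_401' title='bbox 0 0 3450 5034; x_wconf 95'> </span>
  </span>
 </p>
</div>
"""

print(blank_only_blocks(SAMPLE))
```

A block with no word elements at all is not reported, so the check only flags case 2 above.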

@stweil
Contributor

stweil commented Jan 7, 2022

I get no text for test/testing/8087_054.3G.tif with the current patch. So more tuning is needed.

@stweil
Contributor

stweil commented Jan 7, 2022

> Tesseract thought the entire page was a photo

And it is even right: it is some kind of photo (or a scan), so the hOCR output may look like this:

--- /tmp/noimg.hocr	2022-01-08 00:07:04.559688193 +0100
+++ /tmp/img.hocr	2022-01-08 00:06:20.446583605 +0100
@@ -1116,13 +1116,7 @@
      </span>
     </p>
    </div>
-   <div class='ocr_carea' id='block_1_96' title="bbox 0 0 3450 5034">
-    <p class='ocr_par' id='par_1_100' lang='eng' title="bbox 0 0 3450 5034">
-     <span class='ocr_line' id='line_1_158' title="bbox 0 0 3450 5034; baseline 0 0; x_size 2517; x_descenders -1258.5; x_ascenders 1258.5">
-      <span class='ocrx_word' id='word_1_401' title='bbox 0 0 3450 5034; x_wconf 95'> </span>
-     </span>
-    </p>
-   </div>
+   <div class='ocr_photo' id='block_1_96' title="bbox 0 0 3450 5034"></div>
   </div>
  </body>
 </html>

@stweil stweil moved this from To do: New features for release 5 to In progress in Tesseract next Jan 8, 2022
@stweil
Contributor

stweil commented Jan 15, 2022

In https://github.com/stweil/tesseract/tree/output-formats I have an experimental implementation which works better for me and which also handles ALTO and text output.

@stweil
Contributor

stweil commented Jan 16, 2022

@MerlijnWajer, in pull request #3723 I handle image and line regions unconditionally. I tested it on a set of 41998 images. It produced the same text results, but correctly wrote image and line regions instead of text regions with words consisting of blanks only. Maybe you want to test that on a larger number of images.
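One way to run this kind of regression check is sketched below in Python. It is not the procedure used for the 41998 images, only an assumption of how "same text results" could be verified: words that are blank are ignored, so replacing a blank-only text region with an image region does not count as a difference. The helper names are hypothetical, and the input is assumed to be a namespace-free hOCR fragment.

```python
import xml.etree.ElementTree as ET


def recognized_words(fragment: str) -> list[str]:
    """Return the non-blank ocrx_word texts of an hOCR fragment, in order."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    words = []
    for span in root.iter("span"):
        if "ocrx_word" in span.get("class", "").split():
            text = "".join(span.itertext()).strip()
            if text:
                words.append(text)
    return words


def same_text(old: str, new: str) -> bool:
    """True if both fragments contain the same sequence of non-blank words."""
    return recognized_words(old) == recognized_words(new)


OLD = (
    "<div class='ocr_carea'><span class='ocr_line'>"
    "<span class='ocrx_word'>Hello</span></span></div>"
    "<div class='ocr_carea'><span class='ocrx_word'> </span></div>"
)
NEW = (
    "<div class='ocr_carea'><span class='ocr_line'>"
    "<span class='ocrx_word'>Hello</span></span></div>"
    "<div class='ocr_photo'></div>"
)

print(same_text(OLD, NEW))
```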

@MerlijnWajer
Contributor Author

> @MerlijnWajer, in pull request #3723 I handle image and line regions unconditionally. I tested it on a set of 41998 images. It produced the same text results, but correctly wrote image and line regions instead of text regions with words consisting of blanks only. Maybe you want to test that on a larger number of images.

Looks good, thanks for picking this up and improving it. Not writing text regions with only blank words is, I think, another clear improvement. I will try to do a few test runs with the code this coming week (and can provide a sign-off if useful), but don't block on me if you feel it's ready to be merged.

@stweil
Contributor

stweil commented Feb 2, 2022

Meanwhile, more than two weeks have passed. Are there new results from testing this pull request and #3723?

@MerlijnWajer
Contributor Author

I'll build PR #3723 today, try it out, and report back. Thanks for the nudge. As I understand it, #3723 replaces this PR, correct?

@stweil
Contributor

stweil commented Feb 2, 2022

Yes, that's right.

@amitdo
Collaborator

amitdo commented Feb 16, 2022

#3723 was merged.

@MerlijnWajer, thanks for your effort.

@amitdo amitdo closed this Feb 16, 2022
@amitdo
Collaborator

amitdo commented Feb 16, 2022

@wollmers,

Regarding polygons, the API has a BlockPolygon() method. Currently, none of the output formats use it.

@stweil
Contributor

stweil commented Feb 16, 2022

OCR-D uses BlockPolygon (see https://github.com/OCR-D/ocrd_tesserocr/blob/master/ocrd_tesserocr/recognize.py#L506).

I think Tesseract's hOCR and ALTO output could be enhanced to use polygons, too.

@stweil stweil removed this from In progress in Tesseract next Feb 16, 2022
@wollmers

https://github.com/OCR-D/ocrd_tesserocr/blob/master/ocrd_tesserocr/recognize.py#L506

A polygon for the geometric hull of an area and a polyline for, e.g., a baseline are more precise and flexible. We can always convert or interpolate to a bbox or a straight line, but not in the other direction. Bezier curves would be best: SVG supports them, and ImageMagick understands the SVG syntax for Bezier curves. Potrace also returns SVG. But that may be overkill in 99% of cases, and the math is too complex for an average developer.
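The one-way nature of that conversion is easy to see in a small Python sketch. The function name `polygon_to_bbox` and the sample coordinates are made up for illustration; the result uses the left, top, right, bottom order of an hOCR bbox.

```python
def polygon_to_bbox(points):
    """Return (left, top, right, bottom) of the box enclosing a polygon.

    The conversion is lossy: many different polygons share the same box,
    so the original outline cannot be recovered from the result.
    """
    if not points:
        raise ValueError("polygon needs at least one point")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# A skewed quadrilateral and its axis-aligned bounding box.
print(polygon_to_bbox([(10, 40), (120, 20), (200, 90), (60, 150)]))
# → (10, 20, 200, 150)
```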

5 participants