How to unhide ocr layer created by Adobe? #3013

nissansz · 2024-01-11T08:28:43Z

I have attached a file with an Adobe OCR layer.
I want to extract the following information:
text line box, font info, font name, alignment, etc., so that I can imitate it when adding my own result.

JorjMcKie (Maintainer) · 2024-01-11T08:38:59Z

First of all, thanks for your understanding!

You can do something like this:

import fitz  # PyMuPDF

doc = fitz.open("2.pdf")
page = doc[0]
# TEXTFLAGS_TEXT leaves out image blocks, so every block has "lines"
blocks = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
spans = []
for b in blocks:
    for l in b["lines"]:
        spans.extend(l["spans"])  # collect every text span on the page

print(spans[0])
# {'size': 10.199999809265137, 'flags': 4, 'font': 'Times-Roman', 'color': 0, 'ascender': 1.0529999732971191, 'descender': -0.2809999883174896, 'text': 'ICS 35. 040 ', 'origin': (54.16999816894531, 53.26995849609375), 'bbox': (54.16999816894531, 42.52935791015625, 109.11921691894531, 56.13615798950195)}
print(spans[-1])
# {'size': 28.48898696899414, 'flags': 4, 'font': 'HiddenHorzOCR', 'color': 0, 'ascender': 1.0, 'descender': -0.20000000298023224, 'text': '中华人民共和国水利部发布', 'origin': (140.1699981689453, 777.5999755859375), 'bbox': (140.1699981689453, 746.8800048828125, 457.2100830078125, 783.7439575195312)}

Text alignment is not information that is stored in any PDF. All you can get is a variant of the above that delivers the text character by character ("rawdict" instead of "dict"). You can then use the individual character bboxes to rewrite the text character by character.
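As a rough sketch of the "rawdict" variant: in that output each span carries a "chars" list instead of a "text" string, and each character has its own "c", "origin" and "bbox". The helper names `iter_chars` and `chars_from_pdf` below are made up for illustration; they are not part of PyMuPDF.

```python
def iter_chars(rawdict):
    """Yield (char, bbox, font, size) for each character of a "rawdict" page dict."""
    for block in rawdict["blocks"]:
        # image blocks (if present) have no "lines" key, so skip them
        for line in block.get("lines", []):
            for span in line["spans"]:
                for ch in span["chars"]:
                    yield ch["c"], tuple(ch["bbox"]), span["font"], span["size"]


def chars_from_pdf(path, page_number=0):
    """Hypothetical convenience wrapper: read one page and list its characters."""
    import fitz  # PyMuPDF, imported lazily so iter_chars works without it

    doc = fitz.open(path)
    page = doc[page_number]
    raw = page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT)
    chars = list(iter_chars(raw))
    doc.close()
    return chars
```

Calling `chars_from_pdf("2.pdf")` would give one tuple per character; the bboxes can then serve as targets when rewriting the text.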

nissansz (Author) · 2024-01-11T08:52:18Z

Regarding the TextWriter morph parameter:
is calculating text_length/rect_width the only way to use morph for autofit?

How do I correctly calculate text_length in pixels?
My calculated result seems wrong; the written text line does not match the rectangle.
The font is the Windows font deng.ttf.

fontsize = 15

textwidth = fitz.get_text_length(text, fontname='china-s', fontsize=fontsize, encoding='utf-8')
# pivot = fitz.Point(rect.x0, rect.y0 + rect.height / 2)  # middle of left side
# pos = pivot
mat = fitz.Matrix(textwidth / rect.width, 1.0)
# mat = fitz.Matrix(1.0, 1.0)
# mat = fitz.Matrix(rect.width / textwidth, 1.0)
# mat = fitz.Matrix(1 / 0.7, 1)  # 2 = double the width
tw.write_text(page, opacity=None, color=color, morph=(pos, mat), overlay=True, oc=0, render_mode=0)

3_overlay_tw.pdf

result is as below

1 reply

JorjMcKie Jan 11, 2024
Maintainer

Multiple comments:

TextWriter never uses Base-14 fonts; it always embeds font files (see the documentation!).
For Courier, Helvetica and Times-Roman it uses font files with the same metrics as their Base-14 equivalents, but not for CJK. So you cannot correctly measure the text length the way you do.
In your use case, you seem to be interested in one-line outputs only. Therefore, it makes no sense to use the fill_textbox method. Instead, make a separate TextWriter for each line and use TextWriter.append(). There is a TextWriter property last_point (see the documentation), which gives you the end position after each append. Then compute the difference to the start point and you have the effective text length, whatever font may have been used.
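A minimal sketch of that measurement, assuming one TextWriter per line. `appended_length`, `horizontal_scale` and `demo` are hypothetical helper names, and the rectangle, start point and "helv" font in `demo` are placeholders (for the font in question one would presumably use `fitz.Font(fontfile="deng.ttf")`). To squeeze or stretch a line into a rectangle, the horizontal morph factor is target width divided by measured length.

```python
def appended_length(writer, start, text, **append_kwargs):
    """Append one line to a TextWriter and return its effective horizontal length.

    `start` is the insertion point; the length is the x distance between it
    and the writer's last_point after the append.
    """
    writer.append(start, text, **append_kwargs)
    return writer.last_point.x - start[0]


def horizontal_scale(text_length, target_width):
    """Horizontal factor that maps text of text_length onto target_width."""
    return target_width / text_length


def demo(text="Hello world", fontsize=15):
    """Measure one line, then write it morphed to fit a placeholder rectangle."""
    import fitz  # PyMuPDF, imported lazily

    doc = fitz.open()                        # new empty PDF
    page = doc.new_page()
    rect = fitz.Rect(72, 80, 200, 110)       # placeholder target rectangle
    start = fitz.Point(rect.x0, rect.y1 - 8)  # placeholder baseline start
    font = fitz.Font("helv")                 # placeholder font
    tw = fitz.TextWriter(page.rect)          # one writer for this single line
    length = appended_length(tw, start, text, font=font, fontsize=fontsize)
    mat = fitz.Matrix(horizontal_scale(length, rect.width), 1.0)
    tw.write_text(page, morph=(start, mat))  # scale about the start point
    return doc, length
```

Because the length is read back from the writer itself, it reflects whichever font file was actually used for the text.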

nissansz (Author) · 2024-01-11T09:17:16Z

Thank you.
I found a solution.
It seems to be working well.

https://stackoverflow.com/questions/75189505/pymupdf-get-optimal-font-size-given-a-rectangle
fs = rect.width / font.text_length(text, fontsize=1)
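The one-liner works because text length scales linearly with font size: measure the text at fontsize 1, then divide the available width by that. A small sketch of the same idea follows; `fit_fontsize` is a hypothetical helper that takes any measuring callable with the shape of `Font.text_length(text, fontsize=...)`.

```python
def fit_fontsize(text_length, text, width):
    """Return the font size at which `text` is exactly `width` wide.

    `text_length` is a callable like fitz.Font.text_length, i.e.
    text_length(text, fontsize=...) -> width of the text at that size.
    """
    unit_width = text_length(text, fontsize=1)  # width at fontsize 1
    if unit_width <= 0:
        raise ValueError("text has no measurable width")
    return width / unit_width


def demo(text, rect_width, fontfile=None):
    """Hypothetical usage with PyMuPDF; fontfile could be e.g. deng.ttf."""
    import fitz  # PyMuPDF, imported lazily

    font = fitz.Font(fontfile=fontfile) if fontfile else fitz.Font("helv")
    return fit_fontsize(font.text_length, text, rect_width)
```

This only fits the width; if the rectangle height also matters, the resulting size would still need to be checked against it.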
