-
First of all, thanks for your understanding! You can do something like this:

import fitz

doc = fitz.open("2.pdf")
page = doc[0]
# extract the page text as a dictionary of blocks, lines and spans
blocks = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]
spans = []
for b in blocks:
    for l in b["lines"]:
        spans.extend(l["spans"])

spans[0]
{'size': 10.199999809265137, 'flags': 4, 'font': 'Times-Roman', 'color': 0, 'ascender': 1.0529999732971191, 'descender': -0.2809999883174896, 'text': 'ICS 35. 040 ', 'origin': (54.16999816894531, 53.26995849609375), 'bbox': (54.16999816894531, 42.52935791015625, 109.11921691894531, 56.13615798950195)}
spans[-1]
{'size': 28.48898696899414, 'flags': 4, 'font': 'HiddenHorzOCR', 'color': 0, 'ascender': 1.0, 'descender': -0.20000000298023224, 'text': '中华人民共和国水利部发布', 'origin': (140.1699981689453, 777.5999755859375), 'bbox': (140.1699981689453, 746.8800048828125, 457.2100830078125, 783.7439575195312)}

Text alignment is not information that is stored in any PDF. All you can have is a variant of the above that delivers the text by character ("rawdict" instead of "dict"). You can then use the delivered single-character bboxes to rewrite the text character by character.
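As a minimal sketch of the "rawdict" variant mentioned above: in that format each span carries a "chars" list instead of a "text" string, and every entry has its own "c" (the character) and "bbox". The helper names `iter_chars` and `chars_on_page` are made up for illustration and are not part of PyMuPDF.

```python
def iter_chars(page_dict):
    """Yield (char, bbox, font, size) for each character of a "rawdict" result."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so skip them defensively.
        for line in block.get("lines", []):
            for span in line["spans"]:
                for ch in span["chars"]:
                    yield ch["c"], ch["bbox"], span["font"], span["size"]


def chars_on_page(path, page_number=0):
    """Hypothetical helper: open a PDF and list the characters of one page."""
    import fitz  # PyMuPDF, imported here so iter_chars stays usable without it

    doc = fitz.open(path)
    try:
        page = doc[page_number]
        raw = page.get_text("rawdict", flags=fitz.TEXTFLAGS_TEXT)
        return list(iter_chars(raw))
    finally:
        doc.close()
```

Calling `chars_on_page("2.pdf")` should give one tuple per character. Each bbox can then serve as the target position when the text is rewritten character by character.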
-
For TextWriter morph, how do I correctly calculate text_length in pixels?
fontsize = 15
-
Thank you. https://stackoverflow.com/questions/75189505/pymupdf-get-optimal-font-size-given-a-rectangle
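For readers who cannot follow the link, here is a rough sketch of the general idea of fitting a font size to a given width. It is not taken from the linked answer. It assumes that text width grows linearly with font size, and `fit_fontsize` and its `measure` parameter are hypothetical names. Note that PyMuPDF measures in points (1/72 inch) rather than pixels.

```python
def fit_fontsize(text, rect_width, measure, max_size=None):
    """Return the largest font size at which `text` fits into `rect_width`.

    `measure(text, fontsize)` must return the text width in points.
    Assumes the width scales linearly with the font size.
    """
    unit_width = measure(text, 1.0)  # width at font size 1
    if unit_width <= 0:
        raise ValueError("text has no measurable width")
    size = rect_width / unit_width
    if max_size is not None:
        size = min(size, max_size)  # optional upper bound
    return size
```

With PyMuPDF, `measure` could be built on `fitz.Font.text_length`, for example `font = fitz.Font("helv")` followed by `fit_fontsize("Hello", rect.width, lambda t, s: font.text_length(t, fontsize=s))`, where `rect` is the target rectangle.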
-
2.pdf
I have attached a file with an Adobe OCR layer.
I want to get the following information:
text line box, font info, font name, alignment, etc., so that I can imitate them when adding my own result.