
/uXXXXX instead of a single character in extracted text for some pdfs #2273

Open
@equaeghe

Description


I am trying to get a reasonably reliable estimate of the number of visible (non-whitespace, non-metadata) characters in PDF files. For this, I use pypdf's extract_text function.
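To make the counting concrete, here is a minimal sketch of the kind of count I mean. The helper name count_visual_chars is only illustrative (it is not part of pypdf), and it assumes that "visible" simply means "not whitespace" in the extracted string:

```python
def count_visual_chars(text: str) -> int:
    """Count the non-whitespace characters (code points) in extracted text.

    Hypothetical helper for illustration; not a pypdf function.
    """
    return sum(1 for ch in text if not ch.isspace())


if __name__ == "__main__":
    # The same visible text, once with the real character and once with
    # the escaped form that shows up in the derived PDF.
    print(count_visual_chars("with\U0001D6FC=1,"))
    print(count_visual_chars("with/u1D6FC=1,"))
```

With such a count, every escaped symbol is counted as seven characters instead of one, which is why the two files disagree.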

I came across a case where visually identical text yields different character counts. I have an original LaTeX-produced PDF and a derived version of it that was processed by some Adobe software. In the text extracted from the derived version, some characters of the original are replaced by /uXXXXX strings, mainly math symbols. For example, where the original gives 𝛼, the derived version gives the string /u1D6FC (U+1D6FC is indeed the Unicode code point of the mathematical italic small alpha).

I assume the difference comes from how the two files encode these Unicode characters internally. Since I want to use pypdf to get a reasonably reliable estimate of the number of visible characters, I think the correct behaviour here would be for pypdf to interpret /u1D6FC as 𝛼 at the appropriate point in its text extraction pipeline, and likewise for all other such characters.
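As a stopgap on the caller's side, the mapping I have in mind could be sketched as a post-processing step on the extracted string. This is not how pypdf works internally, and the helper name decode_u_escapes is made up for this example. It assumes every escape is "/u" followed by exactly five uppercase hex digits, as in the samples from this issue; a document that legitimately contains such a string as text would be altered by it.

```python
import re

# Assumption: escapes look like "/u" + exactly five uppercase hex digits
# (e.g. "/u1D6FC"). A fixed length is needed because the escape can be
# followed directly by an ordinary digit, as in "/u1D4341" (U+1D434 then "1").
_U_ESCAPE = re.compile(r"/u([0-9A-F]{5})")


def decode_u_escapes(text: str) -> str:
    """Replace each /uXXXXX escape with the character it names.

    Hypothetical post-processing helper; not part of pypdf.
    """
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    print(decode_u_escapes("with/u1D6FC=1,"))
    print(decode_u_escapes("/u1D4341/u1D4342Y"))
```

The fixed five-digit width is a deliberate simplification: a greedy variable-length match would swallow the digit that follows the symbol.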

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.1.57-gentoo-a-x86_64-AMD_Ryzen_7_PRO_4750U_with_Radeon_Graphics-with-glibc2.37

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('cryptography', '41.0.4'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

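import difflib

import pypdf

# Extract the text of the first page from both versions of the document.
original = pypdf.PdfReader("original.pdf").pages[0].extract_text()
derived = pypdf.PdfReader("derived.pdf").pages[0].extract_text()

# Diff the whitespace-separated tokens. lineterm="" stops the header lines
# from carrying their own newline, so no blank lines need to be stripped.
diff = difflib.unified_diff(
    original.split(), derived.split(),
    fromfile="original", tofile="derived", n=0, lineterm="",
)
print("\n".join(diff))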

Output:

--- original
+++ derived
@@ -52 +52 @@
-𝐴1𝐴2Y
+/u1D4341/u1D4342Y
@@ -92,3 +92,3 @@
-for𝐴2,
-with𝛼=1,
-𝛽=0.5,𝑞=25
+for/u1D4342,
+with/u1D6FC=1,
+/u1D6FD=0.5,/u1D45E=25

Test pdfs:
