Description
I am trying to get a somewhat reliable estimate of the number of visual (non-whitespace, non-metadata) characters in PDF files. For this, I use the extract_text function.
I stumbled across a situation where visually identical text gives rise to different character counts. Namely, I have an original LaTeX-produced PDF and a derived version of it that was processed by some Adobe software. After investigating, it turns out that in the derived version, some characters from the original are replaced by /uXXXXX strings; this occurs mainly for math symbols. For example, where the original has 𝛼, the derived version has the string /u1D6FC (and indeed U+1D6FC is the mathematical italic small alpha in Unicode).
I assume this difference is due to some underlying difference in how the Unicode character is encoded. Since I would like to use pypdf to get a somewhat reliable estimate of the number of visual characters, I think the correct thing for pypdf to do in this case would be to interpret /u1D6FC as 𝛼 at the appropriate point in its text extraction pipeline, and similarly for all other such Unicode characters.
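In the meantime, I am normalizing the extracted text myself. Below is a minimal sketch of that workaround (resolve_glyph_names is my own helper, not a pypdf API); it assumes the extracted text never legitimately contains a substring of the form /u followed by 4-6 hex digits, and it has to guess where a glyph name ends, because the extraction output concatenates the name with the text that follows it (e.g. /u1D4341 in the output below is u1D434 followed by a literal 1):

import re

GLYPH_NAME = re.compile(r"/u([0-9A-Fa-f]{4,6})")

def resolve_glyph_names(text: str) -> str:
    # Glyph names of the form uXXXX..uXXXXXX name a Unicode code point
    # with 4-6 hex digits (Adobe glyph naming convention).
    def repl(match):
        digits = match.group(1)
        # Try the longest hex run first and shorten it while it exceeds
        # the Unicode range, keeping the surplus digits as literal text.
        for end in range(len(digits), 3, -1):
            code = int(digits[:end], 16)
            if code <= 0x10FFFF:
                return chr(code) + digits[end:]
        return match.group(0)
    return GLYPH_NAME.sub(repl, text)

For example, resolve_glyph_names("/u1D6FC=1") returns "𝛼=1".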
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.1.57-gentoo-a-x86_64-AMD_Ryzen_7_PRO_4750U_with_Radeon_Graphics-with-glibc2.37
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('cryptography', '41.0.4'), PIL=10.0.1
Code + PDF
This is a minimal, complete example that shows the issue:
import pypdf
import difflib

original = pypdf.PdfReader("original.pdf").pages[0].extract_text()
derived = pypdf.PdfReader("derived.pdf").pages[0].extract_text()

print(
    "\n".join(
        difflib.unified_diff(
            original.split(), derived.split(),
            fromfile="original", tofile="derived", n=0,
        )
    ).replace("\n\n", "\n")
)
Output:
--- original
+++ derived
@@ -52 +52 @@
-𝐴1𝐴2Y
+/u1D4341/u1D4342Y
@@ -92,3 +92,3 @@
-for𝐴2,
-with𝛼=1,
-𝛽=0.5,𝑞=25
+for/u1D4342,
+with/u1D6FC=1,
+/u1D6FD=0.5,/u1D45E=25
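Applying the workaround sketched above to the derived text resolves the /uXXXXX tokens back to the original characters, so the diff should become empty and the character counts should agree again:

normalized = resolve_glyph_names(derived)
# With the tokens resolved, both extractions should yield the same
# visual-character count (whitespace stripped before counting).
print(len("".join(original.split())), len("".join(normalized.split())))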
Test PDFs: