Description
I am trying to get a somewhat reliable estimate of the number of visual (non-whitespace, non-metadata) characters in PDF files. For this, I use the extract_text function.
I stumbled across a situation where visually identical text gives rise to different character counts. Namely, I have an original LaTeX-produced PDF and a derived version of it that was processed by some Adobe software. After investigating, it turns out that in the derived version, some characters from the original are replaced by /uXXXXX strings; this occurs mainly for math symbols. For example, where the original has 𝛼, the derived version has the string /u1D6FC (and indeed U+1D6FC is the mathematical italic small alpha in Unicode).
I assume this difference is due to some underlying difference in how the Unicode character is encoded. Since I would like to use pypdf to get a somewhat reliable estimate of the number of visual characters, I think the correct thing for pypdf to do in this case would be to interpret /u1D6FC as 𝛼 at the appropriate point in its text extraction pipeline, and similarly for all other such Unicode characters.
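In the meantime, I am normalizing the extracted text myself. Below is a minimal sketch of that workaround (resolve_glyph_names is my own helper, not a pypdf API); it assumes the extracted text never legitimately contains a substring of the form /u followed by 4-6 hex digits, and it has to guess where a glyph name ends, because the extraction output concatenates the name with the text that follows it (e.g. /u1D4341 in the output below is u1D434 followed by a literal 1):

import re

GLYPH_NAME = re.compile(r"/u([0-9A-Fa-f]{4,6})")

def resolve_glyph_names(text: str) -> str:
    # Glyph names of the form uXXXX..uXXXXXX name a Unicode code point
    # with 4-6 hex digits (Adobe glyph naming convention).
    def repl(match):
        digits = match.group(1)
        # Try the longest hex run first and shorten it while it exceeds
        # the Unicode range, keeping the surplus digits as literal text.
        for end in range(len(digits), 3, -1):
            code = int(digits[:end], 16)
            if code <= 0x10FFFF:
                return chr(code) + digits[end:]
        return match.group(0)
    return GLYPH_NAME.sub(repl, text)

For example, resolve_glyph_names("/u1D6FC=1") returns "𝛼=1".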
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.1.57-gentoo-a-x86_64-AMD_Ryzen_7_PRO_4750U_with_Radeon_Graphics-with-glibc2.37
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.15.5, crypt_provider=('cryptography', '41.0.4'), PIL=10.0.1
Code + PDF
This is a minimal, complete example that shows the issue:
import pypdf
import difflib

original = pypdf.PdfReader("original.pdf").pages[0].extract_text()
derived = pypdf.PdfReader("derived.pdf").pages[0].extract_text()

print(
    "\n".join(
        difflib.unified_diff(
            original.split(), derived.split(),
            fromfile="original", tofile="derived", n=0,
        )
    ).replace("\n\n", "\n")
)
Output:
--- original
+++ derived
@@ -52 +52 @@
-𝐴1𝐴2Y
+/u1D4341/u1D4342Y
@@ -92,3 +92,3 @@
-for𝐴2,
-with𝛼=1,
-𝛽=0.5,𝑞=25
+for/u1D4342,
+with/u1D6FC=1,
+/u1D6FD=0.5,/u1D45E=25
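Applying the workaround sketched above to the derived text resolves the /uXXXXX tokens back to the original characters, so the diff should become empty and the character counts should agree again:

normalized = resolve_glyph_names(derived)
# With the tokens resolved, both extractions should yield the same
# visual-character count (whitespace stripped before counting).
print(len("".join(original.split())), len("".join(normalized.split())))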
Test PDFs: