Uninterpretable Type3 Fonts and Excessive Layout Mode Text Output

The Type3 font specification in the PDF 1.7 standard allows producers to execute arbitrary glyph drawing commands on a per-character-code basis. This feature can be used to render non-text PDF content (e.g. charts and graphs) using the standard PDF text operators (Td, Tj, etc.). It also allows producers to render text content *visually* without providing any mechanism for translating that content back to a true encoded character. For example, the drawing commands associated with character code 65 ("A") could be used to draw a "Z", a unicorn, a fire-breathing dragon, or an "A". In such situations, extracting text in layout mode can produce massively inflated output, putting users at risk of out-of-memory (OOM) exceptions.
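As a rough illustration of how a caller might spot such fonts before extracting, the sketch below scans a page's font resource dictionary for Type3 fonts that carry no `/ToUnicode` map. The helper names (`_resolve`, `type3_fonts_without_tounicode`) are hypothetical and not part of pypdf. The check is only a heuristic: a Type3 font without `/ToUnicode` can still be interpretable if its `/Encoding` uses standard glyph names, so a hit is a warning sign, not proof.

```python
from collections.abc import Mapping


def _resolve(obj):
    """Follow an indirect reference if the object exposes get_object().

    Plain Python values (used in the demo below) are returned unchanged.
    """
    get_object = getattr(obj, "get_object", None)
    return get_object() if callable(get_object) else obj


def type3_fonts_without_tounicode(fonts: Mapping) -> list[str]:
    """Return resource names of Type3 fonts that lack a /ToUnicode entry.

    `fonts` is assumed to be a mapping of font resource name -> font
    dictionary, e.g. what page["/Resources"]["/Font"] yields in pypdf.
    """
    flagged = []
    for name, font in fonts.items():
        font = _resolve(font)
        subtype = _resolve(font.get("/Subtype"))
        if subtype == "/Type3" and "/ToUnicode" not in font:
            flagged.append(str(name))
    return flagged


# Demo with plain dicts standing in for PDF font dictionaries.
demo_fonts = {
    "/F1": {"/Subtype": "/Type3", "/CharProcs": {}},
    "/F2": {"/Subtype": "/Type3", "/ToUnicode": "stream placeholder"},
    "/F3": {"/Subtype": "/TrueType"},
}
print(type3_fonts_without_tounicode(demo_fonts))
```

With a real document, the mapping would presumably come from the page's `/Resources` dictionary. Pages that have no `/Font` entry would need to be handled separately.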

## Environment

```cmd
(pdfextnew) C:\Users\samha\pdf-extractor>python -m platform
Windows-10-10.0.22631-SP0
Python 3.10.6 | packaged by conda-forge | (main, Oct 24 2022, 16:02:16) [MSC v.1916 64 bit (AMD64)] on win32

(pdfextnew) C:\Users\samha\pdf-extractor>python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.2.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=9.5.0
```

## Code + PDF

```python-repl
>>> from pypdf import PdfReader
>>> r = PdfReader('c:/users/samha/downloads/UninterpretableType3Font.pdf')
>>> layout_output = r.pages[0].extract_text(extraction_mode="layout")
>>> print(len(layout_output))
9947317
```

[UninterpretableType3Font.pdf](https://github.com/user-attachments/files/18551904/UninterpretableType3Font.pdf)

## Traceback

This issue does not directly raise an exception. However, `layout_output` in the sample above contains no information of value (only long runs of spaces interspersed with an occasional named reference to a CharProcs entry). Because it consumes nearly 10 MB of memory for a single page, user pipelines that process many pages simultaneously are at risk of OOM exceptions.
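Until the library handles this case, a pipeline could guard itself by inspecting the extracted string before keeping it. The following is a minimal sketch, not a pypdf feature: `summarize_layout_output` and its threshold parameters are hypothetical, and the default values are arbitrary starting points that would need tuning against real documents.

```python
def summarize_layout_output(
    text: str,
    max_chars: int = 1_000_000,       # assumed size budget per page
    min_content_ratio: float = 0.01,  # assumed floor for non-whitespace share
) -> dict:
    """Report whether extracted text is oversized and/or mostly whitespace."""
    total = len(text)
    # Count characters that carry content, i.e. anything but whitespace.
    content = sum(1 for ch in text if not ch.isspace())
    ratio = content / total if total else 1.0
    return {
        "total_chars": total,
        "content_chars": content,
        "too_large": total > max_chars,
        "mostly_blank": ratio < min_content_ratio,
    }


# Demo: a long run of spaces with one stray glyph name, loosely
# imitating the output described above at a much smaller scale.
inflated = " " * 2000 + "/g1"
report = summarize_layout_output(inflated, max_chars=1000)
print(report)
```

A caller might discard or truncate a page's output when both `too_large` and `mostly_blank` are set, and fall back to the default (non-layout) extraction mode for that page.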

