Uninterpretable Type3 Fonts and Excessive Layout Mode Text Output #3081

@shartzog

Description

The Type3 font specification in the PDF 1.7 standard allows producers to execute arbitrary glyph drawing commands on a per-character-code basis. This feature can be used to render non-text PDF content (e.g. charts and graphs) using the standard PDF text operators (Td, Tj, etc.). It also allows producers to render text content visually without providing any mechanism for translating that content back to a true encoded character. For example, the drawing commands associated with character code 65 ("A") could be used to draw a "Z" or a unicorn or a fire-breathing dragon or an "A". In such situations, extracting text in layout mode can produce massively inflated output, putting users at risk of OOM exceptions.
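As a rough illustration of how a caller might spot such pages before requesting layout mode, the sketch below lists the font resources on a page whose `/Subtype` is `/Type3`. The helper names `_resolve` and `type3_font_names` are hypothetical and not part of pypdf. The function only assumes a dict-like page whose `/Resources` entry holds a `/Font` dictionary, as laid out in the PDF specification. It does not look at resources inherited from parent page tree nodes or at fonts used inside form XObjects.

```python
from typing import Any, Mapping


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object exposes get_object().

    Plain Python values (used in tests) are returned unchanged.
    """
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def type3_font_names(page: Mapping[str, Any]) -> list[str]:
    """Return the resource names of fonts declared as /Type3 on this page.

    Hypothetical helper: checks only the page's own /Resources dictionary.
    """
    resources = _resolve(page.get("/Resources")) or {}
    fonts = _resolve(resources.get("/Font")) or {}
    names = []
    for name, font in fonts.items():
        font = _resolve(font)
        if font.get("/Subtype") == "/Type3":
            names.append(str(name))
    return names
```

Since a pypdf page object is dict-like, a caller could presumably pass `r.pages[0]` to `type3_font_names` and fall back to plain extraction (or skip the page) when the result is non-empty. A Type3 font is not necessarily uninterpretable, so this is a coarse filter at best.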

Environment

(pdfextnew) C:\Users\samha\pdf-extractor>python -m platform
Windows-10-10.0.22631-SP0
Python 3.10.6 | packaged by conda-forge | (main, Oct 24 2022, 16:02:16) [MSC v.1916 64 bit (AMD64)] on win32

(pdfextnew) C:\Users\samha\pdf-extractor>python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.2.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=9.5.0

Code + PDF

>>> from pypdf import PdfReader
>>> r = PdfReader('c:/users/samha/downloads/UninterpretableType3Font.pdf')
>>> layout_output = r.pages[0].extract_text(extraction_mode="layout")
>>> print(len(layout_output))
9947317

UninterpretableType3Font.pdf

Traceback

This issue does not directly result in any exception. However, the contents of layout_output in the sample above contain no information of value (only long strings of spaces interspersed with an occasional named reference to a CharProcs entry). Because that single page's output consumes nearly 10 MB of memory, user pipelines that process many pages simultaneously are put at risk of OOM exceptions.
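Until the extraction itself is bounded, a pipeline could at least avoid retaining such output. The following is a minimal sketch of a post-extraction check that flags text which is both very long and almost entirely whitespace. The function names and both thresholds (`max_chars`, `min_ratio`) are assumptions chosen for illustration, not values taken from pypdf.

```python
def non_whitespace_ratio(text: str) -> float:
    """Fraction of characters in text that are not whitespace (0.0 if empty)."""
    if not text:
        return 0.0
    visible = sum(1 for ch in text if not ch.isspace())
    return visible / len(text)


def looks_inflated(text: str, max_chars: int = 1_000_000,
                   min_ratio: float = 0.01) -> bool:
    """Heuristic: True when text is very long and nearly all whitespace.

    Both thresholds are illustrative assumptions and would need tuning
    against real documents.
    """
    return len(text) > max_chars and non_whitespace_ratio(text) < min_ratio
```

This check runs after the string has already been built, so it does not prevent the peak allocation. It only lets a caller discard the result early instead of holding it alongside the output of other pages.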

Metadata

Labels: workflow-text-extraction (from a user's perspective, text extraction is the affected feature/workflow)
