Description
The Type3 font specification in the PDF 1.7 standard allows producers to execute arbitrary glyph drawing commands on a per character code basis. This feature can be used to render non-text PDF content (e.g. charts and graphs) using the standard PDF text operators (Td, Tj, etc). It also allows producers to render text content visually without providing any mechanism for translating said content back to a true encoded character. For example, the drawing commands associated with character code 65 ("A") could be used to draw a "Z" or a unicorn or a fire breathing dragon or an "A". In such situations, extracting text in layout mode can result in massively inflated outputs, putting users at risk of OOM exceptions.
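As a rough illustration of how a caller might spot pages at risk, the sketch below lists the fonts in a page's resource dictionary whose /Subtype is /Type3. The helper name `type3_font_names` is hypothetical and not part of pypdf. It assumes the page behaves like a nested dictionary (as pypdf page objects do) and that the font resources sit directly on the page; fonts inherited from a parent node or used only inside form XObjects would be missed.

```python
def type3_font_names(page):
    """Return the resource names of fonts on `page` declared as /Type3.

    `page` is any mapping shaped like a PDF page dictionary, e.g. an entry
    of `PdfReader(...).pages` (assumption: resources are set on the page itself).
    """
    if "/Resources" not in page:
        return []
    resources = page["/Resources"]
    if "/Font" not in resources:
        return []
    fonts = resources["/Font"]
    names = []
    for name in fonts:
        font = fonts[name]
        # A Type3 font draws each glyph with arbitrary content-stream operators.
        if "/Subtype" in font and font["/Subtype"] == "/Type3":
            names.append(name)
    return names
```

A non-empty result does not prove the text is uninterpretable (a Type3 font may still carry a usable encoding or /ToUnicode map); it only flags the page for closer handling.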
Environment
(pdfextnew) C:\Users\samha\pdf-extractor>python -m platform
Windows-10-10.0.22631-SP0
Python 3.10.6 | packaged by conda-forge | (main, Oct 24 2022, 16:02:16) [MSC v.1916 64 bit (AMD64)] on win32
(pdfextnew) C:\Users\samha\pdf-extractor>python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.2.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=9.5.0
Code + PDF
>>> from pypdf import PdfReader
>>> r = PdfReader('c:/users/samha/downloads/UninterpretableType3Font.pdf')
>>> layout_output = r.pages[0].extract_text(extraction_mode="layout")
>>> print(len(layout_output))
9947317
Traceback
This issue does not directly result in any exception. However, the contents of layout_output in the sample above will contain no information of value (only long strings of spaces interspersed with an occasional named reference to a CharProcs entry), and due to its consumption of nearly 10MB of memory, user pipelines that process many pages simultaneously are put at risk of OOM exceptions.
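Until extraction itself handles this case, a pipeline could at least detect and discard such output before holding many pages in memory. The following is a minimal sketch of that idea, not a fix in pypdf: the function name `looks_inflated` and both thresholds are assumptions chosen for illustration. Because it runs after the string has been built, it does not avoid the peak allocation for a single page.

```python
def looks_inflated(text, max_chars=1_000_000, min_content_ratio=0.01):
    """Heuristically flag extracted text that is huge and almost all whitespace.

    Both thresholds are illustrative assumptions and should be tuned per corpus.
    """
    if len(text) <= max_chars:
        return False
    # Count characters that carry content, i.e. anything that is not whitespace.
    content = sum(1 for ch in text if not ch.isspace())
    return content / len(text) < min_content_ratio
```

In the reproduction above, a caller might check `looks_inflated(layout_output)` and drop the result, or retry the page with a different extraction mode, rather than keep the string.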