Closed
Labels: is-feature (a feature request), workflow-images (from a user's perspective, image handling is the affected feature/workflow)
Description
Explanation
I found an example for the /JBIG2Decode filter :-)
Code Example
from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("New.Jersey.Coinbase.staking.securities.charges.2023-0606_Coinbase-Penalty-and-C-D.pdf")
page = reader.pages[0]

for img in page.images:
    print(img.name)
gives
pypdf==3.12.2
Traceback (most recent call last):
  File "/home/moose/Downloads/pyissue/main.py", line 8, in <module>
    for img in page.images:
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2604, in __iter__
    yield self[i]
          ~~~~^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2600, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 522, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/filters.py", line 844, in _xobj_to_image
    data = x_object_obj.get_data() # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/generic/_data_structures.py", line 919, in get_data
    decoded._data = decode_stream_data(self)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/filters.py", line 634, in decode_stream_data
    raise NotImplementedError(f"unsupported filter {filter_
MartinThoma commented on Jul 20, 2023
PDF found in #1983
pubpub-zz commented on Jul 20, 2023
from #951
Here is pdfminer implementation:
https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/jbig2.py
ItDoesntWorkScan.pdf
MartinThoma commented on Mar 6, 2024
#2502 (comment) - here we have another example
stefan6419846 commented on Apr 14, 2024
This might need a general design decision, if I am not mistaken: Pillow does not seem to support JBIG2, while our implementation currently assumes that all images can be loaded as PIL.Image.Image (pdfminer.six does not use Pillow for saving images).
AFAIK there is only jbig2dec, which would have to be run in a subprocess to get a "good" image format from the JBIG2 image embedded inside the PDF file (after adding the missing bytes, specifically "the JBIG2 file header, end-of-page segments, and end-of-file segment", which are not part of the XObject according to section 7.4.7 of the PDF 2.0 spec), although this might cause issues with masks etc. (jbig2dec itself is subject to AGPL-3.0-or-later and, with its strong copyleft effect (including SaaS), is rather unlikely to become part of Pillow.) The alternative would be to parse the essential aspects like the pixel data from the JBIG2 image ourselves.
mdecaro commented on Jun 1, 2024
Many platforms support a standalone 'jbig2dec' tool (e.g. on a Mac you can brew install jbig2dec) - can you farm the decoding out to that routine? I was going to try, but can't seem to get the raw unfiltered image bytes from the page object. Probably my ignorance... (will keep digging)
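A minimal sketch of what farming the decoding out to a standalone jbig2dec could look like. It assumes that the jbig2dec executable is on the PATH, that the input is already a complete standalone JBIG2 file, and that `-o` selects the output file with the PNG format following from the `.png` extension (worth checking against `jbig2dec --help` for your build). The helper names are hypothetical and not part of pypdf.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_jbig2dec_command(jbig2_path: Path, png_path: Path) -> List[str]:
    """Build the argument list for jbig2dec (assumed CLI: -o <output> <input>)."""
    return ["jbig2dec", "-o", str(png_path), str(jbig2_path)]


def decode_with_jbig2dec(jbig2_path: Path, png_path: Path) -> Path:
    """Run jbig2dec in a subprocess and return the path of the written image."""
    if shutil.which("jbig2dec") is None:
        raise RuntimeError("jbig2dec executable not found on PATH")
    # check=True raises CalledProcessError if jbig2dec exits with an error code.
    subprocess.run(
        build_jbig2dec_command(jbig2_path, png_path),
        check=True,
        capture_output=True,
    )
    return png_path
```

The resulting PNG could then be opened with Pillow, which would keep the current assumption that every image ends up as a PIL.Image.Image.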
pubpub-zz commented on Jun 1, 2024
The XObject is a ContentStream. You should be able to access the data with .get_data()
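The wrapping step discussed in this thread (adding the JBIG2 file header, an end-of-page segment, and an end-of-file segment around the embedded stream) could look roughly like the sketch below. The byte layout follows the JBIG2 file format as far as I understand it (sequential organisation, a single page). The function names, the `globals_data` parameter (for an optional /JBIG2Globals stream), and the `next_segment_number` default are assumptions for illustration, not pypdf API.

```python
import struct

# 8-byte ID string that starts every standalone JBIG2 file.
JBIG2_FILE_ID = b"\x97JB2\r\n\x1a\n"
SEGMENT_TYPE_END_OF_PAGE = 49
SEGMENT_TYPE_END_OF_FILE = 51


def _empty_segment(number: int, segment_type: int, page: int) -> bytes:
    """Segment header with no referred-to segments and zero data length."""
    # Layout: segment number (4 bytes), flags/type (1), referred-to count (1),
    # page association (1), data length (4).
    return struct.pack(">IBBBI", number, segment_type, 0, page, 0)


def wrap_embedded_jbig2(
    page_data: bytes,
    globals_data: bytes = b"",
    next_segment_number: int = 0xFFFF0000,
) -> bytes:
    """Turn an embedded JBIG2 stream into a standalone one-page JBIG2 file.

    next_segment_number is ASSUMED to be unused by the embedded segments;
    a robust implementation would parse the stream to find the highest number.
    """
    # Flags 0x01: sequential organisation, page count known; then page count = 1.
    header = JBIG2_FILE_ID + struct.pack(">BI", 0x01, 1)
    end_of_page = _empty_segment(next_segment_number, SEGMENT_TYPE_END_OF_PAGE, 1)
    end_of_file = _empty_segment(next_segment_number + 1, SEGMENT_TYPE_END_OF_FILE, 0)
    # Global segments, if any, come before the page segments.
    return header + globals_data + page_data + end_of_page + end_of_file
```

Note that `page_data` has to be the still-encoded stream contents: as the traceback in the description shows, get_data() itself runs through decode_stream_data and raises for /JBIG2Decode.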
stefan6419846 commented on Jun 2, 2024
Given the data, you should still have a look at the PDF specification on the filter (for PDF 2.0/ISO 32000-2:2020, this is section 7.4.7), especially as jbig2dec probably expects the header and footer to be present, which are omitted within PDF files.
mdecaro commented on Jun 2, 2024