NotImplementedError: unsupported filter /JBIG2Decode #1989

@MartinThoma
Member

Explanation

I found an example for the /JBIG2Decode filter :-)

Code Example

PDF: https://github.com/py-pdf/pypdf/files/12090692/New.Jersey.Coinbase.staking.securities.charges.2023-0606_Coinbase-Penalty-and-C-D.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("New.Jersey.Coinbase.staking.securities.charges.2023-0606_Coinbase-Penalty-and-C-D.pdf")

page = reader.pages[0]
for img in page.images:
    print(img.name)

gives

pypdf==3.12.2
Traceback (most recent call last):
  File "/home/moose/Downloads/pyissue/main.py", line 8, in <module>
    for img in page.images:
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2604, in __iter__
    yield self[i]
          ~~~~^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 2600, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/_page.py", line 522, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/filters.py", line 844, in _xobj_to_image
    data = x_object_obj.get_data()  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/generic/_data_structures.py", line 919, in get_data
    decoded._data = decode_stream_data(self)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/moose/Github/py-pdf/pypdf/pypdf/filters.py", line 634, in decode_stream_data
    raise NotImplementedError(f"unsupported filter {filter_
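
Until the filter is supported, one possible workaround is to skip the images that cannot be decoded instead of letting the loop abort. This is a minimal sketch, assuming `page.images` offers `keys()` and lookup by key (as recent pypdf versions appear to). `collect_decodable_images` and `FakeImages` are hypothetical names used only for illustration:

```python
def collect_decodable_images(images):
    """Split an image collection into decoded images and skipped keys.

    `images` is anything with keys() and item lookup by key. Each key is
    looked up separately because a plain `for img in images` loop stops at
    the first NotImplementedError, as the traceback above shows.
    """
    decoded = []
    skipped = []
    for key in images.keys():
        try:
            decoded.append(images[key])
        except NotImplementedError as exc:
            # For example: "unsupported filter /JBIG2Decode"
            skipped.append((key, str(exc)))
    return decoded, skipped


class FakeImages:
    """Stand-in for page.images so the sketch runs without a PDF file."""

    def __init__(self, entries):
        self._entries = entries

    def keys(self):
        return list(self._entries)

    def __getitem__(self, key):
        value = self._entries[key]
        if isinstance(value, Exception):
            raise value
        return value


images = FakeImages({
    "/Im0": "decoded image 0",
    "/Im1": NotImplementedError("unsupported filter /JBIG2Decode"),
})
decoded, skipped = collect_decodable_images(images)
print(decoded)
print(skipped)
```

With a real file, `images` would be `reader.pages[0].images`; whether key-based access behaves this way should be checked against the installed pypdf version.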

Activity

added the workflow-images label (From a users perspective, image handling is the affected feature/workflow) on Jul 20, 2023
self-assigned this on Jul 20, 2023
MartinThoma commented on Jul 20, 2023

@MartinThoma
Member, Author

PDF found in #1983

pubpub-zz commented on Jul 20, 2023

@pubpub-zz
Collaborator

MartinThoma commented on Mar 6, 2024

@MartinThoma
Member, Author

#2502 (comment) - here we have another example

stefan6419846 commented on Apr 14, 2024

@stefan6419846
Collaborator

This might need a general design decision if I am not mistaken: Pillow does not seem to support JBIG2, while our implementation currently assumes that all images can be loaded as PIL.Image.Image (pdfminer.six does not use Pillow for saving images).

AFAIK there is only jbig2dec, which would have to be run in a subprocess to get a "good" image format from the JBIG2 image embedded in the PDF file. This requires first adding the missing bytes, specifically "the JBIG2 file header, end-of-page segments, and end-of-file segment", which are not part of the XObject according to section 7.4.7 of the PDF 2.0 spec, and it might still cause issues with masks etc. (jbig2dec itself is licensed under AGPL-3.0-or-later, and with its strong copyleft effect (including SaaS) it is rather unlikely to become part of Pillow.) The alternative would be to parse the essential aspects, like the pixel data, from the JBIG2 image ourselves.
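
A sketch of the subprocess approach, assuming the installed jbig2dec accepts the `--embedded`, `--format` and `--output` options (check `jbig2dec --help` for your build). In embedded mode the tool is expected to take the optional globals stream followed by the page stream, which would avoid rebuilding the file header by hand. The function names are hypothetical, and masks are not handled:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_jbig2dec_command(image_path, globals_path=None, binary="jbig2dec"):
    """Return the argument list for decoding an embedded JBIG2 stream to PNG."""
    command = [binary, "--embedded", "--format", "png", "--output", "-"]
    if globals_path is not None:
        # The globals stream (from /JBIG2Globals) goes before the page stream.
        command.append(str(globals_path))
    command.append(str(image_path))
    return command


def decode_embedded_jbig2(data, globals_data=None, binary="jbig2dec"):
    """Write the stream(s) to temporary files and return PNG bytes from stdout."""
    if shutil.which(binary) is None:
        raise RuntimeError(f"{binary} was not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        globals_path = None
        if globals_data is not None:
            globals_path = tmp_dir / "globals.jbig2"
            globals_path.write_bytes(globals_data)
        image_path = tmp_dir / "image.jbig2"
        image_path.write_bytes(data)
        result = subprocess.run(
            build_jbig2dec_command(image_path, globals_path, binary),
            capture_output=True,
            check=True,  # raise CalledProcessError if jbig2dec fails
        )
    return result.stdout


print(build_jbig2dec_command("image.jbig2", "globals.jbig2"))
```

Keeping the binary optional and failing with a clear error when it is missing leaves the tool a runtime dependency rather than something bundled with the library.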

mdecaro commented on Jun 1, 2024

@mdecaro

Many platforms support a standalone 'jbig2dec' functionality (e.g. on Mac can brew install jbig2dec) - can you farm that functionality out to that routine? I was going to try, but can't seem to get the raw unfiltered image bytes from the page object. Prob my ignorance... (will keep digging)

pubpub-zz commented on Jun 1, 2024

@pubpub-zz
Collaborator

> Many platforms support a standalone 'jbig2dec' functionality (e.g. on Mac can brew install jbig2dec) - can you farm that functionality out to that routine? I was going to try, but can't seem to get the raw unfiltered image bytes from the page object. Prob my ignorance... (will keep digging)

The XObject is a ContentStream. You should be able to access the data with .get_data()
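
Note that, per the traceback in the issue description, `get_data()` runs the stream filters, so that call is where the `NotImplementedError` surfaces for /JBIG2Decode. Below is a rough sketch of reading the still-encoded bytes instead. It assumes that pypdf keeps them in the private `_data` attribute of the stream object (an internal detail that may change) and that the page's /Resources are set on the page itself rather than inherited. `raw_image_streams` is a hypothetical helper:

```python
def raw_image_streams(page):
    """Yield (name, filter, raw_bytes) for each image XObject on a page.

    Works on any dict-like page. With pypdf, item access via [] is expected
    to resolve indirect references, which is why [] is used instead of get().
    """
    if "/Resources" not in page:
        return
    resources = page["/Resources"]
    if "/XObject" not in resources:
        return
    xobjects = resources["/XObject"]
    for name in xobjects:
        xobj = xobjects[name]
        if "/Subtype" not in xobj or xobj["/Subtype"] != "/Image":
            continue  # skip form XObjects and anything else
        filter_name = xobj["/Filter"] if "/Filter" in xobj else None
        # Assumption: the undecoded stream bytes live in the private `_data`.
        yield name, filter_name, getattr(xobj, "_data", b"")


# Usage with pypdf (not run here):
# from pypdf import PdfReader
# reader = PdfReader("example.pdf")
# for name, filter_name, raw in raw_image_streams(reader.pages[0]):
#     print(name, filter_name, len(raw))
```

The filter value may also be an array when several filters are chained; the sketch returns it unchanged.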

stefan6419846 commented on Jun 2, 2024

@stefan6419846
Collaborator

Given the data, you should still have a look at the PDF specification on the filter (for PDF 2.0/ISO 32000-2:2020, this is section 7.4.7), especially as jbig2dec probably expects the header and footer to be present, which are omitted within PDF files.
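
To illustrate the missing header: a standalone JBIG2 file starts with a fixed 8-byte ID string, a flags byte and, when the page count is known, a 4-byte page count. The sketch below only checks for and builds that file header. It does not add the end-of-page and end-of-file segments, which need valid segment numbers and are left out here. The function names are hypothetical:

```python
import struct

# Fixed ID string that opens every standalone JBIG2 file.
JBIG2_FILE_ID = b"\x97JB2\r\n\x1a\n"

# Flags byte: bit 0 set = sequential organisation,
# bit 1 clear = the number of pages is known and follows the flags.
FLAG_SEQUENTIAL = 0x01


def has_jbig2_file_header(data: bytes) -> bool:
    """Return True if the bytes start like a standalone JBIG2 file."""
    return data.startswith(JBIG2_FILE_ID)


def build_jbig2_file_header(page_count: int = 1) -> bytes:
    """Build the 13-byte file header: ID, flags, big-endian page count."""
    return JBIG2_FILE_ID + struct.pack(">BL", FLAG_SEQUENTIAL, page_count)


header = build_jbig2_file_header()
print(len(header), has_jbig2_file_header(header))
```

A stream taken from a PDF should fail the `has_jbig2_file_header` check, which is a quick way to tell an embedded stream from a standalone file before handing it to a decoder.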

mdecaro commented on Jun 2, 2024

@mdecaro

Metadata

Assignees: No one assigned
Labels: is-feature (A feature request), workflow-images (From a users perspective, image handling is the affected feature/workflow)
Participants: @MartinThoma, @pubpub-zz, @mdecaro, @mgberg, @stefan6419846