How to remove an image from PDF? #338

Closed
chest3x opened this issue Aug 2, 2019 · 7 comments

@chest3x chest3x commented Aug 2, 2019

Hi,

First of all, thanks for a great project.

I am looking for a way to remove specific images (not all of them) from a PDF.
I may also want to replace them with text, but that seems doable, as the location of the image is already exposed.

Is there a way to do that using PyMuPDF?

Thanks.

@chest3x chest3x added the question label Aug 2, 2019

@JorjMcKie JorjMcKie commented Aug 2, 2019

That is possible, but it is not easy, and it depends heavily on a few things:

  • which software inserted the image
  • how many pages use the image (could be more than one!)
  • how much coding effort you are willing to invest

An image leaves its mark in more than one place:

  1. in the /Resources object used by the page object(s)
  2. in the /Contents object(s) of every page using it.

In comparison, just suppressing the display of the image (without physically removing the image object from the PDF) is fairly doable:

>>> import fitz
>>> doc=fitz.open("PyMuPDF.pdf")
>>> doc.getPageImageList(0)  # images on page 0
[[270, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode']]
>>> # 270 is the image object xref
>>> # the page references it via name 'Im1'
>>> doc[0]._getContents()
[274]
>>> # xref 274 is the only /Contents object of the page (could be more than one)
>>> c = doc._getXrefStream(274) # read the stream source
>>> c.find(b"/Im1 Do") # try find the image display command
217
>>> cnew = c.replace(b"/Im1 Do", b"") # remove it
>>> doc._updateStream(274, cnew) # replace the page's /Contents object
>>> 

Now the image should no longer be shown on that page.
Possible complications:

  • There could be more than one /Contents object, not just the one-element list [274].
  • The command /Im1 Do could contain an arbitrary number of spaces or \n, and the two tokens could even be split across two separate /Contents objects (not to be expected, however).
  • If the same image appears on several pages, the xref 270 remains the same, but the name Im1 could be different.
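
The first and third complications, and the whitespace part of the second, can be handled with a little pure Python. This is only a sketch: the helper names names_for_xref and strip_image_invocations are hypothetical, the row layout is assumed to be the one shown in the getPageImageList output above, and the regular expression does not try to skip string literals or inline image data that happen to contain the same bytes.

```python
import re

def names_for_xref(image_list, xref):
    """Return the resource names under which `xref` is used on one page.

    `image_list` is assumed to have the row layout shown above:
    [xref, smask, width, height, bpc, colorspace, alt_cs, name, filter].
    """
    return [row[7] for row in image_list if row[0] == xref]

def strip_image_invocations(stream, name):
    """Remove every '/<name> Do' command from one content stream.

    `stream` is bytes. Any amount of whitespace (spaces, newlines) is
    allowed between the name and the operator. Returns (new_stream, count).
    """
    pattern = re.compile(
        rb"/" + re.escape(name.encode()) + rb"\s+Do(?=\s|$)"
    )
    return pattern.subn(b"", stream)
```

To cover several /Contents objects, the same call would be applied to the stream of every xref in the list returned by doc[pno]._getContents(), and each changed stream written back with doc._updateStream as in the session above. The name has to be looked up per page, because it can differ from page to page.
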
@JorjMcKie JorjMcKie self-assigned this Aug 2, 2019

@chest3x chest3x commented Aug 7, 2019

Thank you for your answer.

Is it also possible to extract the location (bbox) of the image I am 'removing'?

Currently I am playing around with two approaches:

  • page.getText("dict")['blocks'] - here I can extract the bbox of the image, but I cannot get the image reference
  • doc.getPageImageList(0) - here I can get the image reference, but not the bbox

@JorjMcKie JorjMcKie commented Aug 7, 2019

Your comment is spot on:
Unfortunately, this is not possible right now :-(.
This is the reason:
The page.getText() methods work for all supported document types, not just for PDFs, which are the only ones that contain cross reference numbers (xrefs).

The only thing you can do is try some heuristics:
The image information in page.getText("dict")['blocks'] contains the bbox as location information, but also the original image information: width, height, type (extension), bpi,...
doc.getPageImageList also provides some image information (not exactly the same, however).
If you compare this information, you may get a reliable cross-identification in many cases.
If you do doc.extractImage(xref) with the xref you find in doc.getPageImageList, the returned dictionary contains even more complete image information to match with that of page.getText("dict")['blocks'].

In principle, displaying an image in a PDF is coded like this:
In the page's /Contents source we will find

a b c d e f cm % 'cm' = concatenate matrix, a,b,c,d,e,f are matrix elements (floats)
...            % any number of other PDF commands
/Im1 Do        % display an image named 'Im1'
...

In the page's object definition we will find

>>> doc = fitz.open("PyMuPDF.pdf")
>>> page=doc[0]
>>> page.xref
264
>>> print(doc._getXrefString(264))
<<
  /Type /Page
  /Contents 269 0 R
  /Resources 268 0 R
  /MediaBox [ 0 0 612 792 ]
  /Parent 273 0 R
>>
>>> print(doc._getXrefString(268)) # print resources object
<<
  /Font <<
    /F38 271 0 R
    /F39 272 0 R
  >>
  /XObject <<
    /Im1 265 0 R
  >>
  /ProcSet [ /PDF /Text /ImageC ]
>>
>>> doc.getPageImageList(0) # compare with this output:
[[265, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode']]
>>> # now look at /Contents source of the page:
>>> cont = doc._getXrefStream(269).decode() # decode bytes to string
>>> print(cont[:500])
...
q
1 0 0 1 72 710.536 cm
[]0 d 0 J 0.996 w 0 0 m 468 0 l S
Q
0 g 0 G
0 g 0 G
q
195.75 0 0 86.25 344.25 616.814 cm % concatenate matrix
/Im1 Do                            % display image
Q
BT
...
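
The matrix in front of /Im1 Do already determines where the image lands: a PDF image occupies the unit square, and the matrix maps that square onto the page. The sketch below works this out for the values in the listing above. The helper name image_rect is hypothetical, and the code assumes an unrotated page whose /MediaBox starts at (0, 0), and that the shown matrix is the only one in effect at the Do command (the earlier cm sits inside its own q ... Q pair). The y axis is flipped because PDF measures from the bottom-left corner, while PyMuPDF rectangles are measured from the top-left.

```python
def image_rect(matrix, page_height):
    """Map the unit square through a PDF matrix (a, b, c, d, e, f).

    Returns (x0, y0, x1, y1) with the origin at the top-left of the page.
    """
    a, b, c, d, e, f = matrix
    # Corners (0,0), (1,0), (0,1), (1,1) transformed by
    # x' = a*x + c*y + e and y' = b*x + d*y + f.
    xs = [e, a + e, c + e, a + c + e]
    ys = [f, b + f, d + f, b + d + f]
    # Flip y: PDF counts upwards from the bottom of the page.
    return (min(xs), page_height - max(ys), max(xs), page_height - min(ys))

# Values from the listing above; 792 is the /MediaBox height of the page.
print(image_rect((195.75, 0, 0, 86.25, 344.25, 616.814), 792))
```

For these numbers the result is approximately (344.25, 88.936, 540.0, 175.186).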

To do the suggested heuristic comparison, try this:

>>> # applying brute force ...
>>> img = doc.extractImage(265)
>>> img.keys()
dict_keys(['ext', 'smask', 'width', 'height', 'colorspace', 'xres', 'yres', 'cs-name', 'image'])
>>> img["cs-name"]
'DeviceRGB'
>>> img["ext"]
'jpeg'
>>> blks = page.getText("dict")["blocks"]
>>> blks[0].keys()
dict_keys(['type', 'bbox', 'width', 'height', 'ext', 'image'])
>>> blks[0]["image"] == img["image"]
True
>>> # voilà, found a match!
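
The same comparison can be run over every image block and every candidate image at once. This is a sketch only: match_blocks_to_xrefs is a hypothetical helper that works on plain dictionaries shaped like the ones printed above, and it relies purely on binary equality of the image bytes, so images that the two methods deliver in different encodings will not be matched.

```python
def match_blocks_to_xrefs(blocks, extracted):
    """Pair image blocks with xrefs by comparing the raw image bytes.

    blocks:    list of dicts shaped like page.getText("dict")["blocks"]
    extracted: dict mapping xref -> dict shaped like doc.extractImage(xref)
    Returns a dict mapping xref -> list of bboxes where that image appears.
    """
    matches = {}
    for block in blocks:
        if block.get("type") != 1:  # type 1 marks an image block
            continue
        for xref, img in extracted.items():
            if block["image"] == img["image"]:
                matches.setdefault(xref, []).append(block["bbox"])
    return matches
```

With PyMuPDF, the second argument could be built as {row[0]: doc.extractImage(row[0]) for row in doc.getPageImageList(0)}. An image that is shown more than once on the page ends up with several bboxes under the same xref.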

@chest3x chest3x commented Aug 9, 2019

Thanks a lot, this helped me to solve the problem.

@chest3x chest3x closed this Aug 9, 2019

@JorjMcKie JorjMcKie commented Aug 13, 2019

In the meantime, I have developed a function that extracts the bbox of images on PDF pages without using page.getText("dict").
It does this by parsing the PDF commands that define a page's layout (/Contents and similar PDF objects).

The bbox calculator is pure Python, but uses PyMuPDF. It currently supports only images inserted with a rotation that is an integer multiple of 90 degrees.
I am working on extending the solution to other rotations as well.
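
To give an idea of what parsing the page's commands can involve, here is a rough sketch. It is not the code from the ZIP below, and the names concat and image_ctms are hypothetical. It walks through one content stream, keeps track of the current transformation matrix across q, Q and cm, and records the matrix in effect at every Do command. It assumes the stream is plain, whitespace-separated operators, and it ignores string literals, inline images and nested Form XObjects.

```python
import re

IDENTITY = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)

def concat(m, ctm):
    """Return m x ctm, which is what the 'cm' operator does with operand m."""
    a, b, c, d, e, f = m
    A, B, C, D, E, F = ctm
    return (a * A + b * C, a * B + b * D,
            c * A + d * C, c * B + d * D,
            e * A + f * C + E, e * B + f * D + F)

def image_ctms(stream):
    """List (name, matrix) for every '/name Do' in a content stream (bytes)."""
    # Drop '%' comments, then split into whitespace-separated tokens.
    text = re.sub(rb"%[^\r\n]*", b"", stream).decode("latin-1")
    tokens = text.split()
    ctm, stack, found = IDENTITY, [], []
    for i, tok in enumerate(tokens):
        if tok == "q":                  # save graphics state
            stack.append(ctm)
        elif tok == "Q" and stack:      # restore graphics state
            ctm = stack.pop()
        elif tok == "cm" and i >= 6:    # six numbers precede the operator
            ctm = concat(tuple(float(t) for t in tokens[i - 6:i]), ctm)
        elif tok == "Do" and i >= 1:    # the XObject name precedes 'Do'
            found.append((tokens[i - 1].lstrip("/"), ctm))
    return found
```

Each recorded matrix maps the unit square of the image onto the page, so a bbox follows from transforming the four corners and converting to top-left coordinates.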

The following ZIP contains this function (get_image_bbox.py) and a test script that can be used as a front end for testing.

Maybe you are interested in trying it out.

get-bbox.zip

So, this function helps solve your problem in the following way:
Given a PDF page n and the list of images on it like this:

>>> import fitz
>>> from get_image_bbox import get_image_bbox
>>> doc=fitz.open("PyMuPDF.pdf")
>>> page=doc[0]
>>> imglist=doc.getPageImageList(page.number)
>>> imglist
[[266, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode']]
>>> bbox = get_image_bbox(page, imglist[0])
>>> print(bbox)
Rect(344.25, 88.93597412109375, 540.0, 175.18597412109375)
>>> # just as a cross check:
>>> blocks = [b["bbox"] for b in page.getText("dict")["blocks"] if b["type"] == 1]
>>> blocks[0]
(344.25, 88.93597412109375, 540.0, 175.18597412109375)
>>> # as can be seen: (practically) the same rectangle

@JorjMcKie JorjMcKie commented Aug 13, 2019

Advantages over my previous suggestion:

  • much faster
  • does not need the (usually) large memory areas allocated by getText and extractImage, because neither is used
  • supports a wider range of images, because it does not rely on binary equality of the image streams delivered by getText and extractImage.

@JorjMcKie JorjMcKie commented Aug 13, 2019

Well, in the above case, it was even exactly the same rectangle.
That need not always be so, because of rounding issues.
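
Because of this, a cross check between the two rectangles is safer with a small tolerance than with ==. A minimal sketch, in which the helper name rects_almost_equal and the default tolerance of 0.001 are assumptions, and in which both rectangles are assumed to be convertible with tuple(...):

```python
import math

def rects_almost_equal(r1, r2, tol=1e-3):
    """True if two rectangles agree coordinate by coordinate within tol."""
    r1, r2 = tuple(r1), tuple(r2)
    return len(r1) == len(r2) and all(
        math.isclose(a, b, abs_tol=tol) for a, b in zip(r1, r2)
    )
```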
