Question / Comment: How to crop white margins around the page #617

Closed
sant527 opened this issue Aug 25, 2020 · 7 comments

@sant527
sant527 commented Aug 25, 2020

I often need to crop the white margins on a few sides of a page, and sometimes on all sides.

I am using pdf-crop-margins to crop only the bottom margin with the following command:

pdf-crop-margins -v -p4 100 0 100 100 test.pdf

Here, -p4 sets the percentage of each margin to keep, in the order left, bottom, right, top:
100 means leave the margin as it is (do not crop)
0 means crop to the edge of the text

Can something similar be done using PyMuPDF?

@JorjMcKie
Collaborator

Sure. You can set the CropBox property. This attribute's dimensions initially equal those of the (unrotated) page.
So to set the visible part of the page to some rectangle r, do page.setCropBox(r).

@sant527
Author

sant527 commented Aug 26, 2020

What is r here? I saw in the source code that the parameter is called rect.

Can you give an example?

@JorjMcKie
Collaborator

r is an object of class fitz.Rect, which represents a rectangle defined by its top-left and bottom-right points (i.e. its diagonal). It can therefore be defined as fitz.Rect(x0, y0, x1, y1), where the top-left is fitz.Point(x0, y0) and the bottom-right is fitz.Point(x1, y1).
For a page, there exists the rectangle page.rect. For example, an A4 page has page.rect = fitz.Rect(0, 0, 595, 842).
Omitting e.g. a 50-point border around such a rectangle can be achieved by

  1. fitz.Rect(50, 50, 595 - 50, 842 - 50), or
  2. page.rect + (50, 50, -50, -50)

The unit is the point: 72 points equal one inch, so you can also calculate in inches or centimeters and convert.

You can algebraically add / subtract rectangles: r1 + r2 adds the respective coordinates. Here r2 can also be a 4-tuple if the left operand r1 is a fitz.Rect (as in example 2 above).

So the shortest form to omit that border in this example is executing page.setCropBox(page.rect + (50, 50, -50, -50)).
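
The border arithmetic above can be wrapped in a small helper. This is only a sketch under stated assumptions: the names cm_to_points and crop_uniform_border are hypothetical (they are not part of PyMuPDF), page is assumed to be an unrotated Page that has not been cropped before, and set_cropbox is the snake_case spelling of the setCropBox call used later in this thread.

```python
POINTS_PER_INCH = 72.0  # PDF page coordinates use 72 units per inch
CM_PER_INCH = 2.54


def cm_to_points(cm):
    """Convert centimeters to PDF page units (72 per inch)."""
    return cm / CM_PER_INCH * POINTS_PER_INCH


def crop_uniform_border(page, border):
    """Hide a border of `border` units on every side of `page`.

    Hypothetical helper: assumes `page` is an unrotated PyMuPDF Page
    whose crop box has not been changed yet.
    """
    # Move the top-left corner inwards and the bottom-right corner inwards.
    new_box = page.rect + (border, border, -border, -border)
    page.set_cropbox(new_box)
    return new_box


# Usage (assumes PyMuPDF is installed and "input.pdf" exists):
#   import fitz
#   doc = fitz.open("input.pdf")
#   for page in doc:
#       crop_uniform_border(page, cm_to_points(1.5))
#   doc.save("cropped.pdf")
```

Passing the page in, rather than opening a file inside the helper, keeps the function usable for a single page or for a loop over a whole document.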

@sant527
Author

sant527 commented Aug 26, 2020

Thank you very much for the elaborate answer.

JorjMcKie added the "resolved fixed / implemented / answered" label on Aug 26, 2020

@StevenClontz
StevenClontz commented Sep 29, 2023

We're looking at dropping pdf-crop-margins as we already need PyMuPDF for other functionality. I think I understand that page.setCropBox(r) crops the page to the rectangle r. Is there any way to automatically compute r to be the smallest rectangle containing all the content on a page (e.g. so we automatically detect and crop out margins)?

@JorjMcKie
Collaborator

Yes, page.set_cropbox() (with page being a Page object) sets the visible part of a page.

It does not physically delete the parts that become invisible. Setting the crop box to a different rectangle later can make them visible again.

To compute the smallest rectangle that contains everything the page has to show, use page.get_bboxlog() as in the following code snippet:

rect = fitz.EMPTY_RECT()  # start with the standard empty rectangle
for item in page.get_bboxlog():
    rect |= item[1]  # join this bbox into the result
# rect now wraps all page content

The advantage is that no text, image, or other content needs to be extracted to do this.

An item of page.get_bboxlog() looks like this: (type, (x0, y0, x1, y1)). The "type" can be "fill-text", "fill-image" and more, indicating the object type. The second element is the bounding box as a tuple.
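
Combining the two steps (content box, then crop box) might look like the sketch below. It is not an official PyMuPDF recipe: content_bbox, crop_rect and autocrop are hypothetical names, the union is computed with plain min/max instead of the |= operator so the helpers also work on bare tuples, and the retain argument mimics the left, bottom, right, top percentages of pdf-crop-margins -p4 as described in the opening question. It assumes an unrotated page that has not been cropped before, and that set_cropbox accepts a 4-tuple as a rectangle (wrap it in fitz.Rect(...) if not).

```python
def content_bbox(bbox_items):
    """Return the smallest (x0, y0, x1, y1) enclosing all logged boxes.

    `bbox_items` is the list returned by page.get_bboxlog(); each item is
    (type, (x0, y0, x1, y1)). Returns None if nothing usable was logged.
    """
    boxes = [item[1] for item in bbox_items]
    # Skip degenerate boxes with no width or no height.
    boxes = [b for b in boxes if b[2] > b[0] and b[3] > b[1]]
    if not boxes:
        return None
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def crop_rect(page_rect, content, retain=(0, 0, 0, 0)):
    """Compute a crop rectangle between the content box and the page edge.

    `retain` holds percentages (left, bottom, right, top) of each margin
    to keep: 100 leaves that side untouched, 0 crops to the content.
    The y axis points downwards, so "top" is y0 and "bottom" is y1.
    """
    px0, py0, px1, py1 = page_rect
    # Clip the content box to the page so the result stays on the page.
    cx0, cy0 = max(content[0], px0), max(content[1], py0)
    cx1, cy1 = min(content[2], px1), min(content[3], py1)
    left, bottom, right, top = (r / 100.0 for r in retain)
    return (cx0 - left * (cx0 - px0),
            cy0 - top * (cy0 - py0),
            cx1 + right * (px1 - cx1),
            cy1 + bottom * (py1 - cy1))


def autocrop(page, retain=(0, 0, 0, 0)):
    """Crop `page` to its content; returns the new box or None if blank."""
    content = content_bbox(page.get_bboxlog())
    if content is None:
        return None  # nothing drawn on the page: leave it unchanged
    rect = crop_rect(tuple(page.rect), content, retain)
    page.set_cropbox(rect)
    return rect
```

With retain=(100, 0, 100, 100) only the bottom margin is cropped, which corresponds to the pdf-crop-margins call in the opening question. The default retain=(0, 0, 0, 0) crops all four sides to the content.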

@StevenClontz

Thanks @JorjMcKie we'll check this out. :-)
