-
Notifications
You must be signed in to change notification settings - Fork 159
Version 0.2.0 #315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Version 0.2.0 #315
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR bumps the package version from 0.1.9 to 0.2.0, indicating a minor version release with significant enhancements. The main improvement is a refactored reading order computation algorithm for better multi-column document layout handling.
- Enhanced stripe clustering algorithm with vector-aware gap detection
- Updated function signatures to accept page context and vector data
- Simplified import structure by removing unnecessary try/except block
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pymupdf4llm/setup.py | Version bump to 0.2.0 |
| pymupdf4llm/pymupdf4llm/versions_file.py | Version bump to 0.2.0 |
| pdf4llm/setup.py | Version bump to 0.2.0 |
| pymupdf4llm/pymupdf4llm/init.py | Removed try/except for pymupdf.layout import |
| pymupdf4llm/pymupdf4llm/helpers/utils.py | Major refactoring of reading order computation with new vector-aware stripe clustering, updated function signatures, and improved documentation |
| pymupdf4llm/pymupdf4llm/helpers/document_layout.py | Updated function call to match new signature |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| vectors = [ | ||
| pymupdf.Rect(b["bbox"]) | ||
| for b in blocks | ||
| if b["bbox"][3] - b["bbox"][1] >= min_bbox_height and b["bbox"] in joined_boxes |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition b[\"bbox\"] in joined_boxes is checking if a tuple/list is 'in' a Rect object, which will always return False. This should use joined_boxes.contains(b[\"bbox\"]) or pymupdf.Rect(b[\"bbox\"]) in joined_boxes to properly check if the bbox is contained within the joined_boxes rectangle.
| if b["bbox"][3] - b["bbox"][1] >= min_bbox_height and b["bbox"] in joined_boxes | |
| if b["bbox"][3] - b["bbox"][1] >= min_bbox_height and pymupdf.Rect(b["bbox"]) in joined_boxes |
| if box[0] - prev_right > HORIZONTAL_GAP: | ||
| columns.append(sorted(current_column, key=lambda b: b[1])) | ||
| current_column = [box] | ||
| else: | ||
| current_column.append(box) | ||
|
|
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sorting key changed from b[3] (bottom y-coordinate) to b[1] (top y-coordinate). While this may be intentional for top-to-bottom ordering within columns, this is inconsistent with the comment on line 249 'Sort top to bottom' which uses b[3] (bottom coordinate). Consider adding a comment explaining why columns are sorted by top coordinate while stripes use bottom coordinate.
| if box[0] - prev_right > HORIZONTAL_GAP: | |
| columns.append(sorted(current_column, key=lambda b: b[1])) | |
| current_column = [box] | |
| else: | |
| current_column.append(box) | |
| if box[0] - prev_right > HORIZONTAL_GAP: | |
| # Note: We sort boxes within each column by their top y-coordinate (b[1]) for top-to-bottom reading order. | |
| # This differs from stripes, which are sorted by bottom y-coordinate (b[3]). | |
| # The use of b[1] here is intentional to ensure columns are read from top to bottom. | |
| columns.append(sorted(current_column, key=lambda b: b[1])) | |
| current_column = [box] | |
| else: | |
| current_column.append(box) | |
| # As above, sort the last column by top y-coordinate (b[1]). |
| Args: | ||
| boxes (list): List of bounding boxes, each defined as (x0, y0, x1, y1). | ||
| boxes (list): List of bounding boxes. |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for the boxes parameter was simplified but lost important information. The original 'each defined as (x0, y0, x1, y1)' described the expected structure. This should be restored or enhanced to 'List of bounding boxes, each defined as (x0, y0, x1, y1, class)' to match the actual usage in the code.
| boxes (list): List of bounding boxes. | |
| boxes (list): List of bounding boxes, each defined as (x0, y0, x1, y1, class). |
|
|
||
|
|
||
| def cluster_stripes(boxes, vertical_gap: float = 12): | ||
| def cluster_stripes(boxes, joined_boxes, vectors, vertical_gap=12): |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function signature added new parameters joined_boxes and vectors, but the docstring's Args section does not document these parameters. Add documentation for joined_boxes (the bounding rectangle of all boxes) and vectors (list of vector rectangles to consider during stripe division).
|
|
||
|
|
||
| def compute_reading_order(boxes, vertical_gap: float = 12): | ||
| def compute_reading_order(boxes, joined_boxes, vectors, vertical_gap=12): |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function signature added new parameters joined_boxes and vectors, but the docstring's Args section does not document these parameters. Add documentation for these new parameters to explain their purpose.
|
|
||
|
|
||
| def find_reading_order(boxes, vertical_gap: float = 36) -> list: | ||
| def find_reading_order(page_rect, blocks, boxes, vertical_gap: float = 12) -> list: |
Copilot
AI
Nov 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function signature changed significantly with new parameters page_rect and blocks, and the default vertical_gap changed from 36 to 12, but the docstring's Args section does not document these new parameters or explain the change in default value. Update the documentation to include page_rect and blocks parameters.
No description provided.