Skip to content

Conversation

@JorjMcKie
Copy link
Collaborator

No description provided.

@JorjMcKie JorjMcKie requested a review from Copilot November 10, 2025 19:20
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR bumps the package version from 0.1.9 to 0.2.0, indicating a minor version release with significant enhancements. The main improvement is a refactored reading order computation algorithm for better multi-column document layout handling.

  • Enhanced stripe clustering algorithm with vector-aware gap detection
  • Updated function signatures to accept page context and vector data
  • Simplified import structure by removing unnecessary try/except block

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pymupdf4llm/setup.py Version bump to 0.2.0
pymupdf4llm/pymupdf4llm/versions_file.py Version bump to 0.2.0
pdf4llm/setup.py Version bump to 0.2.0
pymupdf4llm/pymupdf4llm/init.py Removed try/except for pymupdf.layout import
pymupdf4llm/pymupdf4llm/helpers/utils.py Major refactoring of reading order computation with new vector-aware stripe clustering, updated function signatures, and improved documentation
pymupdf4llm/pymupdf4llm/helpers/document_layout.py Updated function call to match new signature

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

vectors = [
pymupdf.Rect(b["bbox"])
for b in blocks
if b["bbox"][3] - b["bbox"][1] >= min_bbox_height and b["bbox"] in joined_boxes
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition b[\"bbox\"] in joined_boxes is checking if a tuple/list is 'in' a Rect object, which will always return False. This should use joined_boxes.contains(b[\"bbox\"]) or pymupdf.Rect(b[\"bbox\"]) in joined_boxes to properly check if the bbox is contained within the joined_boxes rectangle.

Suggested change
if b["bbox"][3] - b["bbox"][1] >= min_bbox_height and b["bbox"] in joined_boxes
if b["bbox"][3] - b["bbox"][1] >= min_bbox_height and pymupdf.Rect(b["bbox"]) in joined_boxes

Copilot uses AI. Check for mistakes.
Comment on lines +312 to 317
if box[0] - prev_right > HORIZONTAL_GAP:
columns.append(sorted(current_column, key=lambda b: b[1]))
current_column = [box]
else:
current_column.append(box)

Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sorting key changed from b[3] (bottom y-coordinate) to b[1] (top y-coordinate). While this may be intentional for top-to-bottom ordering within columns, this is inconsistent with the comment on line 249 'Sort top to bottom' which uses b[3] (bottom coordinate). Consider adding a comment explaining why columns are sorted by top coordinate while stripes use bottom coordinate.

Suggested change
if box[0] - prev_right > HORIZONTAL_GAP:
columns.append(sorted(current_column, key=lambda b: b[1]))
current_column = [box]
else:
current_column.append(box)
if box[0] - prev_right > HORIZONTAL_GAP:
# Note: We sort boxes within each column by their top y-coordinate (b[1]) for top-to-bottom reading order.
# This differs from stripes, which are sorted by bottom y-coordinate (b[3]).
# The use of b[1] here is intentional to ensure columns are read from top to bottom.
columns.append(sorted(current_column, key=lambda b: b[1]))
current_column = [box]
else:
current_column.append(box)
# As above, sort the last column by top y-coordinate (b[1]).

Copilot uses AI. Check for mistakes.
Args:
boxes (list): List of bounding boxes, each defined as (x0, y0, x1, y1).
boxes (list): List of bounding boxes.
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for the boxes parameter was simplified but lost important information. The original 'each defined as (x0, y0, x1, y1)' described the expected structure. This should be restored or enhanced to 'List of bounding boxes, each defined as (x0, y0, x1, y1, class)' to match the actual usage in the code.

Suggested change
boxes (list): List of bounding boxes.
boxes (list): List of bounding boxes, each defined as (x0, y0, x1, y1, class).

Copilot uses AI. Check for mistakes.


def cluster_stripes(boxes, vertical_gap: float = 12):
def cluster_stripes(boxes, joined_boxes, vectors, vertical_gap=12):
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function signature added new parameters joined_boxes and vectors, but the docstring's Args section does not document these parameters. Add documentation for joined_boxes (the bounding rectangle of all boxes) and vectors (list of vector rectangles to consider during stripe division).

Copilot uses AI. Check for mistakes.


def compute_reading_order(boxes, vertical_gap: float = 12):
def compute_reading_order(boxes, joined_boxes, vectors, vertical_gap=12):
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function signature added new parameters joined_boxes and vectors, but the docstring's Args section does not document these parameters. Add documentation for these new parameters to explain their purpose.

Copilot uses AI. Check for mistakes.


def find_reading_order(boxes, vertical_gap: float = 36) -> list:
def find_reading_order(page_rect, blocks, boxes, vertical_gap: float = 12) -> list:
Copy link

Copilot AI Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function signature changed significantly with new parameters page_rect and blocks, and the default vertical_gap changed from 36 to 12, but the docstring's Args section does not document these new parameters or explain the change in default value. Update the documentation to include page_rect and blocks parameters.

Copilot uses AI. Check for mistakes.
@JorjMcKie JorjMcKie merged commit c9229b7 into main Nov 10, 2025
6 checks passed
@JorjMcKie JorjMcKie deleted the v0.2.0 branch November 10, 2025 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants