get_toc(simple=False) return 'to' point coordinate is not based on top-left origin #3413

charosen · 2024-04-25T06:53:31Z

Description of the bug

I have a PDF with outlines (titles) and content like this:

1.1 Hello World

1.1.1. first step to hello world

content

I want to extract all the outline entries (titles) and their coordinates on the page.

When I use get_toc(simple=False), fitz returns a TOC list:

[[1,
  '1.1 Hello world',
  1,
  {'kind': 4,
   'xref': 41631,
   'page': 0,
   'to': Point(0.0, 761.8583),
   'zoom': 0.0,
   'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25969',
   'collapse': True,
   'color': (0.0, 0.0, 0.0)}],
 [2,
  '1.1.1 first step to hello world',
  1,
  {'kind': 4,
   'xref': 41632,
   'page': 0,
   'to': Point(0.0, 731.8583),
   'zoom': 0.0,
   'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25972',
   'collapse': True,
   'color': (0.0, 0.0, 0.0)}],
...
]

The returned 'to' points are based on a bottom-left origin, not a top-left one: '1.1 Hello world' is above '1.1.1 first step to hello world' on the page, yet the y-value of Point(0.0, 761.8583) is greater than that of Point(0.0, 731.8583).

These look like PDF coordinates, not (Py)MuPDF coordinates.

How can I convert these TOC 'to' points to top-left coordinates?

How to reproduce the bug

import fitz

document = fitz.open('mypdf.pdf')

toc = document.get_toc(simple=False)

toc results:

[[1,
  '1.1 Hello world',
  1,
  {'kind': 4,
   'xref': 41631,
   'page': 0,
   **'to': Point(0.0, 761.8583),**
   'zoom': 0.0,
   'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25969',
   'collapse': True,
   'color': (0.0, 0.0, 0.0)}],
 [2,
  '1.1.1 first step to hello world',
  1,
  {'kind': 4,
   'xref': 41632,
   'page': 0,
   **'to': Point(0.0, 731.8583),**
   'zoom': 0.0,
   'nameddest': '_OPENTOPIC_TOC_PROCESSING_d13321e25972',
   'collapse': True,
   'color': (0.0, 0.0, 0.0)}],
...
]

PyMuPDF version

1.24.1

Operating system

Linux

Python version

3.9

JorjMcKie · 2024-04-25T06:58:46Z

Maintainer

You did not provide the reproducing file.

charosen · 2024-04-25T07:04:38Z

Author

You did not provide the reproducing file.

Sorry, I could not upload the PDF file for some reason.

However, it is pretty clear that the 'to' point in the TOC is based on a bottom-left origin, not a top-left origin.

I simply want to convert the 'to' points to top-left coordinates.

JorjMcKie · 2024-04-25T07:14:25Z

Maintainer

It is not at all clear:
What are we even looking at? Where do the "**" come from?
The TOC entries seem to point to named destinations - are there errors in the PDF? Or in our code?
Did the PDF creator want to point to the bottom left point 🤷‍♂️?
Have you tried to look at the PDF's names dictionary?

Again: without the file in question we are already wasting time.

JorjMcKie · 2024-04-25T07:20:17Z

Maintainer

Maybe you simply had a question and just wanted to know how to do coordinate transformation?
In that case you shouldn't have submitted an error report but a post in Discussions.

charosen · 2024-04-25T07:26:45Z

Author

It is not at all clear:
What are we even looking at? Where do the "**" come from?
The TOC entries seem to point to named destinations - are there errors in the PDF? Or in our code?
Did the PDF creator want to point to the bottom left point 🤷‍♂️?
Have you tried to look at the PDF's names dictionary?

Again: without the file in question we are already wasting time.

Sorry for the "**" signs; I just wanted bold text, and I have already deleted them.

My question is:

get_toc(simple=False) returns Point(0.0, 761.8583) for "1.1 Hello World" and Point(0.0, 731.8583) for "1.1.1 first step to hello world".

"1.1 Hello World" is above "1.1.1 first step to hello world", yet the y-value of Point(0.0, 761.8583) is greater than that of Point(0.0, 731.8583), so these points are not in PyMuPDF's top-left coordinates.

JorjMcKie · 2024-04-25T07:58:33Z

Maintainer

OK - to make some progress, I am transferring this thread to Discussions, and we can continue there.

charosen Apr 25, 2024
Author

OK, sorry for mistakenly submitting this as a bug.

charosen Apr 25, 2024
Author

I want to know:

How do I convert the get_toc "to" points into top-left coordinates, in other words, convert PDF coordinates to (Py)MuPDF coordinates?
Is there any attribute in the get_toc result that indicates whether the "to" point is in top-left or bottom-left coordinates? When should I do the conversion?

JorjMcKie · 2024-04-25T08:09:44Z

Maintainer

The two TOC entries obviously point to named destinations. You can extract (but not set / update) a PDF's defined symbolic names as a Python dictionary via names = doc.resolve_names(). As documented, the "to" values there are in PDF coordinates - (0, 0) is bottom-left. This is also the "to" value in the TOC link.
You need the page object to transform this point value to MuPDF coordinates like this: fitz.Point(0.0, 731.8583) * page.transformation_matrix.

JorjMcKie Apr 25, 2024
Maintainer

This does a matrix multiplication point * matrix, resulting of course in a point.
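
To make the multiplication concrete, here is a minimal pure-Python sketch of what point * matrix computes. It assumes an unrotated page whose boxes start at (0, 0), in which case the transformation matrix amounts to a vertical flip, and a hypothetical page height of 842 points (A4), since the thread does not state the real one. The helpers flip_matrix and apply_matrix are illustrative names, not PyMuPDF functions; on a real page the matrix should come from page.transformation_matrix rather than being built by hand.

```python
def flip_matrix(page_height):
    """Matrix (a, b, c, d, e, f) that flips the y-axis of a page of the given height."""
    return (1.0, 0.0, 0.0, -1.0, 0.0, page_height)


def apply_matrix(point, matrix):
    """Compute point * matrix for a 2D point and a six-element matrix."""
    x, y = point
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# Assumed page height in points (A4); the thread does not state the real one.
PAGE_HEIGHT = 842.0

for y_pdf in (761.8583, 731.8583):
    x_top, y_top = apply_matrix((0.0, y_pdf), flip_matrix(PAGE_HEIGHT))
    print(f"PDF y={y_pdf} -> top-left y={y_top:.4f}")
```

With this assumed height, the two 'to' values map to roughly 80.14 and 110.14, so the first heading ends up above the second, as expected in a top-left system.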

charosen Apr 25, 2024
Author

For my second question:

Based on the documentation:

When the destination is a named destination, in other words kind == 4, I should do the conversion?
When the destination points to a place in this document, kind == 1, I should not do it?

JorjMcKie Apr 25, 2024
Maintainer

correct
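
The rule confirmed here could be wrapped in a small helper like the one below. This is only a sketch: toc_top_left is a hypothetical name, the constant 4 is the named-destination kind quoted in this thread, and it assumes each named entry carries a 0-based 'page' and a 'to' point, as in the output posted at the top. Entries of any other kind are passed through unchanged.

```python
LINK_NAMED = 4  # kind value for named destinations, as discussed in this thread


def toc_top_left(doc):
    """Return (level, title, point) tuples with 'to' in top-left coordinates.

    `doc` is expected to be an opened PyMuPDF document.
    """
    result = []
    for level, title, _page_number, dest in doc.get_toc(simple=False):
        point = dest.get("to")
        page_index = dest.get("page", -1)
        if dest.get("kind") == LINK_NAMED and point is not None and page_index >= 0:
            # Named destinations carry PDF coordinates (origin bottom-left),
            # so map them through the target page's transformation matrix.
            page = doc[page_index]
            point = point * page.transformation_matrix
        result.append((level, title, point))
    return result
```

With PyMuPDF installed, this would be called as toc_top_left(fitz.open('mypdf.pdf')).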

charosen Apr 25, 2024
Author

Thanks for patiently guiding me!
I will fix my code now.

Thanks again!

JorjMcKie Apr 25, 2024
Maintainer

you are welcome
