RuntimeError: cycle in structure tree #2749

Closed
benisraelnir opened this issue Oct 19, 2023 · 7 comments

@benisraelnir

Hi. In the newest version, 1.23.5, I am getting this error when reading specific PDFs (a link to one of them is in the code below).

c:\Workspace\ikido-data-science\venv\Lib\site-packages\fitz\fitz.py in ?(self, clip, flags, matrix)
   6002     def _get_textpage(self, clip=None, flags=0, matrix=None):
-> 6003         val = _fitz.Page__get_textpage(self, clip, flags, matrix)
   6004         val.thisown = True
   6005 
   6006         return val

RuntimeError: cycle in structure tree

To Reproduce

This is my code:

import io
import requests
import fitz  # PyMuPDF

pdf_link = 'https://app.ikido.tech/api/datasheet/b738958aeedfcc7efee127e5fea0a6b483e4022ac562c16473ab89af7ef0cd9444f6c1c884a398362c56863ce9ea6cbedc9a005f44facaa990c567a9d08ddd95/AVXC-S-A0014478402-1.pdf'
# Download the PDF and open it from memory.
response = requests.get(pdf_link, timeout=20)
filestream = io.BytesIO(response.content)
with fitz.open(stream=filestream, filetype="pdf") as doc:
    doc[0].get_text()  # raises RuntimeError: cycle in structure tree

Configuration

  • Both on Windows and Linux

@JorjMcKie
Collaborator

Sorry for the tardy response!
Confirming: The PDF indeed contains a loop in the definition of its structure tree. So the diagnosis (a recent fix in MuPDF) is correct and the exception is justifiable.

It might be subject to interpretation, whether downgrading this problem to a warning would make sense though.
We are discussing this.

As a circumvention, put text extraction in a try/except clause.
In fact, ignoring the structure tree altogether also helps, and text extraction may then succeed. Therefore, you could do this:

text = []
for page in doc:
    try:
        text.append(page.get_text())
    except RuntimeError:  # make a temporary PDF with the problem page
        temp = fitz.open()
        temp.insert_pdf(doc, from_page=page.number, to_page=page.number)
        text.append(temp[0].get_text())
        temp.close()

This will work in your case.
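
The snippet above can be wrapped into a reusable function. This is only a minimal sketch, not an official PyMuPDF helper: the name `extract_page_texts` and the `new_pdf` parameter are hypothetical and exist only for illustration. It assumes `doc` is an already opened `fitz.Document`, and it closes the temporary PDF even if the fallback extraction fails.

```python
def extract_page_texts(doc, new_pdf=None):
    """Return one text string per page of `doc`.

    If a page raises RuntimeError (for example "cycle in structure tree"),
    the page is copied into a temporary PDF and extraction is retried there.

    `new_pdf` (hypothetical parameter) is a zero-argument callable returning
    an empty PDF. It defaults to fitz.open.
    """
    if new_pdf is None:
        import fitz  # PyMuPDF, imported lazily so the helper can be stubbed
        new_pdf = fitz.open

    texts = []
    for page in doc:
        try:
            texts.append(page.get_text())
        except RuntimeError:
            # Copy only the problem page into a fresh PDF and retry there.
            temp = new_pdf()
            try:
                temp.insert_pdf(doc, from_page=page.number, to_page=page.number)
                texts.append(temp[0].get_text())
            finally:
                temp.close()
    return texts
```

With a real file, one would call it as `extract_page_texts(fitz.open("some.pdf"))`, where the file name is a placeholder.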

@JorjMcKie
Collaborator

Update from the MuPDF developers:
In a future MuPDF version, a cyclic Structure Tree will be disabled or ignored when processing the PDF's contents, in effect leading to the same result as my circumvention above.

@JorjMcKie added the "postpone" (postpone to a future version) label Nov 3, 2023
@julian-smith-artifex-com
Collaborator

With the latest PyMuPDF and MuPDF, the test case runs OK, with a warning "MuPDF error: cycle in structure tree".

#2548 tests the same issue.
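
Since the cycle is now reported as a warning rather than an exception, a caller who still wants to know about it could inspect the collected MuPDF warnings after extraction. The following is a rough sketch under assumptions: the helper names are hypothetical, and the check relies only on the message containing the phrase "cycle in structure tree", as seen in this thread. `fitz.TOOLS.mupdf_warnings()` is used the same way as in tests/test_2548.py.

```python
def mentions_structure_cycle(warnings_text):
    """Return True if the MuPDF warnings text reports a structure tree cycle."""
    return "cycle in structure tree" in warnings_text


def get_text_and_cycle_flag(page):
    """Extract text from `page` and report whether MuPDF warned about a cycle.

    Hypothetical helper: `page` is assumed to be a fitz.Page.
    """
    import fitz  # PyMuPDF, imported lazily

    fitz.TOOLS.mupdf_warnings(reset=True)  # discard earlier warnings
    text = page.get_text()
    warnings_text = fitz.TOOLS.mupdf_warnings()
    return text, mentions_structure_cycle(warnings_text)
```

A substring check is used because the exact wording of the warning differs between the versions quoted in this thread.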

@julian-smith-artifex-com
Collaborator

tests/test_2548.py:test_2548() has been extended to check for the new behaviour in PyMuPDF-1.23.7, so marking this as fixed in next release.

@julian-smith-artifex-com added the "Fixed in next release" label and removed the "postpone" (postpone to a future version) label Nov 17, 2023
@dvzrv

dvzrv commented Nov 30, 2023

Hi! When building pymupdf 1.23.6 against mupdf 1.23.7 I get a failing test:

=================================== FAILURES ===================================
__________________________________ test_2548 ___________________________________

    def test_2548():
        """Text extraction should fail because of PDF structure cycle.

        Old MuPDF version did not detect the loop.
        """
        print(f'test_2548(): {fitz.mupdf_version_tuple=}')
        if fitz.mupdf_version_tuple < (1, 23, 4):
            print(f'test_2548(): Not testing #2548 because infinite hang before mupdf-1.23.4.')
            return
        fitz.TOOLS.mupdf_warnings(reset=True)
        doc = fitz.open(f'{root}/tests/resources/test_2548.pdf')
        e = False
        for page in doc:
            try:
                _ = page.get_text()
            except Exception as ee:
                print(f'test_2548: {ee=}')
                if hasattr(fitz, 'mupdf'):
                    # Rebased.
                    expected = "RuntimeError('code=2: cycle in structure tree')"
                else:
                    # Classic.
                    expected = "RuntimeError('cycle in structure tree')"
                assert repr(ee) == expected, f'Expected {expected=} but got {repr(ee)=}.'
                e = True
        wt = fitz.TOOLS.mupdf_warnings()
        print(f'test_2548(): {wt=}')
        if fitz.mupdf_version_tuple < (1, 24, 0):
>           assert e
E           assert False

tests/test_2548.py:35: AssertionError
----------------------------- Captured stdout call -----------------------------
test_2548(): fitz.mupdf_version_tuple=(1, 23, 7)
test_2548(): wt='structure tree broken, assume tree is missing: cycle in structure tree'
=========================== short test summary info ============================
FAILED tests/test_2548.py::test_2548 - assert False
================= 1 failed, 157 passed, 1 deselected in 4.90s ==================

Can you point me to where this test is fixed? For rebuild purposes I will have to disable it for now.

@julian-smith-artifex-com
Collaborator

Releases of PyMuPDF are only tested with a specific MuPDF, and are not tested or updated to work with later MuPDF releases.

MuPDF often changes behaviour between its releases, so some test failures with later MuPDF releases are to be expected. In particular, PyMuPDF-1.23.6 was only tested with MuPDF-1.23.5.

If you want to use MuPDF-1.23.7, you'll have to wait for our next release, PyMuPDF-1.23.7, which I'm hoping to make today or tomorrow.

[Or you could try the latest PyMuPDF from git, which usually (but not always) works with the latest MuPDF from git (master and the current release branch).]

@julian-smith-artifex-com
Collaborator

Fixed in 1.23.7.
