Skip to content

Missing background color in HTML output #459

@corryms1993

Description

@corryms1993

Hi there,

I've noticed that when running text extraction there are elements appearing at the bottom of the output HTML rather than in their correctly ordered position within the input PDF.

Example PDF: (https://www.cqc.org.uk/sites/default/files/new_reports/AAAG0004.pdf)

image

For example, for the above image, the text: "This section is primarily information for the provider" appears at the top of the PDF. However in the output, the value appears in the following location:

image

I know that elements within a PDF aren't ordered but I know PyMuPDF handles columns well so thought this seemed strange. Does anyone know if there is a parameter that can be modified to alter the ordering or have any other tips for correcting this?

Hopefully someone is able to determine the cause of the problem.

Configuration

Windows 10 x64
Python 3.6
PyMuPDF version 1.16.11 (have tried older versions too)

Many thanks,

Corry

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions