Closed
Labels
not a bug (user error / unable to reproduce)
Description
Describe the bug (mandatory)
I am trying to extract text from a PDF file using doc.getPageText(0).
Most of the Unicode characters in the output are replacement characters (U+FFFD).
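To put a number on "most", a rough check like the one below could be run on the extracted text. This is only a sketch: `replacement_ratio` is a hypothetical helper, and the sample string is made up for illustration. In practice the input would be whatever doc.getPageText(0) returns.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(1 for ch in visible if ch == REPLACEMENT_CHAR)
    return bad / len(visible)


# Made-up sample; replace with the text extracted from the page.
sample = "Invoice \ufffd\ufffd\ufffd\ufffd 2022"
print(f"{replacement_ratio(sample):.0%} of visible characters are U+FFFD")
```

A ratio close to 1.0 would suggest that the extractor could not map the glyphs to Unicode at all, rather than that a few characters were lost.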
The output of pdfinfo is as follows:
Title: DownloadFile.aspx
Author: xyz
Creator: PScript5.dll Version 5.2.2
Producer: GPL Ghostscript 9.06
CreationDate: Thu Mar 10 03:38:19 2022 +0545
ModDate: Thu Mar 10 03:38:19 2022 +0545
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 612 x 792 pts (letter)
Page rot: 0
File size: 57369 bytes
Optimized: no
PDF version: 1.4
The output of pdffonts is below:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no      26  0
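In pdffonts output, the "uni" column reports whether a font carries a ToUnicode map. The only font here is a Type 3 font with a custom encoding and no such map, which is one plausible reason a text extractor would fall back to replacement characters. The sketch below flags such rows; `FontRow` and `parse_pdffonts_row` are hypothetical helpers, and the parser assumes the encoding is a single token and reads the columns from the right.

```python
from typing import NamedTuple


class FontRow(NamedTuple):
    name_and_type: str   # name and type are kept together; both may contain spaces
    encoding: str
    embedded: bool
    subset: bool
    has_tounicode: bool
    object_id: str


def parse_pdffonts_row(line: str) -> FontRow:
    """Parse one data row of pdffonts output, reading columns from the right."""
    tokens = line.split()
    if len(tokens) < 8:
        raise ValueError(f"not a pdffonts data row: {line!r}")
    emb, sub, uni = tokens[-5], tokens[-4], tokens[-3]
    return FontRow(
        name_and_type=" ".join(tokens[:-6]),
        encoding=tokens[-6],  # assumption: encoding is a single token
        embedded=(emb == "yes"),
        subset=(sub == "yes"),
        has_tounicode=(uni == "yes"),
        object_id=f"{tokens[-2]} {tokens[-1]}",
    )


row = parse_pdffonts_row("[none] Type 3 Custom yes no no 26 0")
if not row.has_tounicode:
    print(f"Font {row.name_and_type!r} (object {row.object_id}) has no ToUnicode map")
```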
I am using PyMuPDF version 1.17.7.
To Reproduce (mandatory)
I am sorry, but I don't think I will be able to share the PDF file.
Expected behavior (optional)
Your configuration (mandatory)
- Ubuntu 20.04
- Python: 3.8.0
- PyMuPDF: 1.17.7 installed using pip
For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).
3.8.0 (default, Feb 24 2022, 18:39:11)
[GCC 9.3.0]
linux
PyMuPDF 1.17.7: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-09-14 06:33:06.
Built for Python 3.8 on linux (64-bit).