Reading PDF text from a digital document returns only the Unicode replacement character (\ufffd) #1646

@igaurab

Please provide all mandatory information!

Describe the bug (mandatory)

I am trying to read text from a PDF file. Here is the output of doc.getPageText(0):

[image: screenshot of the doc.getPageText(0) output]

Most of the characters in that output are Unicode replacement characters (U+FFFD).
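To put a number on "most", the share of replacement characters in the extracted text can be measured. The following is only a minimal sketch: the helper names `replacement_ratio` and `page_replacement_ratio` are made up for illustration, and `doc.getPageText` is the method name from the PyMuPDF version reported in this issue (newer releases spell it `get_page_text`).

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character


def replacement_ratio(text):
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def page_replacement_ratio(path, page_number=0):
    """Hypothetical helper: extract one page with PyMuPDF and measure it."""
    import fitz  # PyMuPDF; imported lazily so replacement_ratio works without it

    doc = fitz.open(path)
    try:
        # getPageText is the spelling used in PyMuPDF 1.17.x (assumption:
        # adjust to get_page_text on newer releases).
        return replacement_ratio(doc.getPageText(page_number))
    finally:
        doc.close()
```

A ratio close to 1.0 would suggest that the extractor could not map the page's glyphs to Unicode at all, rather than that a few characters failed to decode.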

The pdfinfo output is as follows:

Title:          DownloadFile.aspx
Author:         xyz
Creator:        PScript5.dll Version 5.2.2
Producer:       GPL Ghostscript 9.06
CreationDate:   Thu Mar 10 03:38:19 2022 +0545
ModDate:        Thu Mar 10 03:38:19 2022 +0545
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      612 x 792 pts (letter)
Page rot:       0
File size:      57369 bytes
Optimized:      no
PDF version:    1.4

Output from pdffonts is below:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no      26  0
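In this table, the `uni` column reports whether a font carries a ToUnicode map; here the only font is a Type 3 font with a custom encoding and `uni` set to `no`, which may be why no real characters come back. As a rough sketch, the table can be parsed to flag such fonts automatically. The function names below are hypothetical, and the parser assumes the fixed-width layout shown above, with a dashed rule line directly under the header.

```python
import re


def parse_pdffonts(output):
    """Parse a pdffonts-style fixed-width table into a list of dicts.

    Assumes the layout shown in this issue: a header line, then a rule
    line of dash runs whose positions mark the column boundaries.
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    rule_index = next(
        (i for i, ln in enumerate(lines)
         if i > 0 and "-" in ln and re.fullmatch(r"[- ]+", ln)),
        None,
    )
    if rule_index is None:
        return []

    # Each run of dashes gives the (start, end) span of one column.
    spans = [m.span() for m in re.finditer(r"-+", lines[rule_index])]
    header = lines[rule_index - 1]
    names = [header[a:b].strip() for a, b in spans]

    rows = []
    for ln in lines[rule_index + 1:]:
        cells = [ln[a:b].strip() for a, b in spans[:-1]]
        # Let the last column run to the end of the line.
        cells.append(ln[spans[-1][0]:].strip())
        rows.append(dict(zip(names, cells)))
    return rows


def fonts_without_unicode_map(rows):
    """Return the rows whose 'uni' column is 'no'."""
    return [row for row in rows if row.get("uni") == "no"]
```

Applied to the table above, this would report the single `[none]` Type 3 font as lacking a ToUnicode map.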

I am using PyMuPDF version 1.17.7.

To Reproduce (mandatory)

I am sorry, but I don't think I will be able to share the PDF file.

Expected behavior (optional)

Your configuration (mandatory)

  • Ubuntu 20.04
  • Python: 3.8.0
  • PyMuPDF: 1.17.7 installed using pip

For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).

3.8.0 (default, Feb 24 2022, 18:39:11) 
[GCC 9.3.0] 
 linux 
 
PyMuPDF 1.17.7: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-09-14 06:33:06.
Built for Python 3.8 on linux (64-bit).

Labels

not a bug (user error / unable to reproduce)
