BUG: Ignore UTF-8 decode errors #1865

Merged 3 commits on Jun 3, 2023


talibhmukadam
Contributor

Problem
Some PDFs contain Latin characters, and when pypdf tries to read them, it throws the following exception:

    text = page.extract_text()
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/_page.py", line 1851, in extract_text
    return self._extract_text(
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/_page.py", line 1356, in _extract_text
    content = ContentStream(content, pdf, "bytes")
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 877, in __init__
    self.__parse_content_stream(stream_bytes)
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 943, in __parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/generic/_data_structures.py", line 1053, in read_object
    return read_string_from_stream(stream, forced_encoding)
  File "/Users/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pypdf/generic/_utils.py", line 107, in read_string_from_stream
    msg = rf"Unexpected escaped string: {tok.decode('utf8')}"
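
The last frame of the traceback decodes `tok` as strict UTF-8 while building a message. The following is a minimal sketch of that failure and of one way to tolerate it, assuming the fix amounts to passing an error handler to `bytes.decode`. The helper names `strict_message` and `lenient_message` are hypothetical and are not part of pypdf; the byte `0x83` is taken from the error reports in this thread.

```python
# Hypothetical stand-in for the token read from the content stream.
# 0x83 is not a valid UTF-8 start byte.
tok = b"\x83"


def strict_message(tok: bytes) -> str:
    """Mirror the failing line: a strict decode raises on invalid bytes."""
    return f"Unexpected escaped string: {tok.decode('utf8')}"


def lenient_message(tok: bytes) -> str:
    """Assumed shape of the fix: undecodable bytes are dropped."""
    return f"Unexpected escaped string: {tok.decode('utf8', errors='ignore')}"


try:
    strict_message(tok)
except UnicodeDecodeError as exc:
    strict_error = str(exc)  # the same kind of error users reported

lenient = lenient_message(tok)  # no exception; the bad byte is skipped
```

With `errors="ignore"` the invalid byte simply disappears from the message. `errors="replace"` would instead insert U+FFFD, which keeps a visible hint that something was dropped.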

@talibhmukadam talibhmukadam changed the title ignore decode errors BUG : Ignore UTF-8 decode errors Jun 1, 2023
@codecov

codecov bot commented Jun 1, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (81a58da) 93.42% compared to head (ed2319e) 93.42%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1865   +/-   ##
=======================================
  Coverage   93.42%   93.42%           
=======================================
  Files          34       34           
  Lines        6634     6634           
  Branches     1303     1303           
=======================================
  Hits         6198     6198           
  Misses        284      284           
  Partials      152      152           
Impacted Files Coverage Δ
pypdf/generic/_utils.py 100.00% <100.00%> (ø)



@tasfiqul-ghani tasfiqul-ghani left a comment

I am facing this issue: 'utf-8' codec can't decode byte 0x83 in position 0: invalid start byte
This PR will fix it.

@tasfiqul-ghani

@MartinThoma Please merge the PR. It will fix a major issue. For some PDFs we are getting this error:
'utf-8' codec can't decode byte 0x83 in position 0: invalid start byte

@pubpub-zz
Collaborator

@tasfiqul-ghani
Can you share a PDF that shows the issue?

Review comment on pypdf/generic/_utils.py (outdated, resolved)
Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
@MartinThoma MartinThoma changed the title BUG : Ignore UTF-8 decode errors BUG: Ignore UTF-8 decode errors Jun 3, 2023
@MartinThoma MartinThoma added soon PRs that are almost ready to be merged, issues that get solved pretty soon is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF labels Jun 3, 2023
@MartinThoma MartinThoma merged commit 686adec into py-pdf:main Jun 3, 2023
12 checks passed
@MartinThoma
Member

Thank you for your contribution @talibhmukadam 🙏

If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

@pubpub-zz
Collaborator

@talibhmukadam / @tasfiqul-ghani
I'm still interested in a sample file. I would like to check that nothing else is hidden behind this 🤔

@talibhmukadam
Contributor Author

@MartinThoma, @pubpub-zz thank you for reviewing and merging the fix so fast. If I may ask, what would be a possible ETA for releasing a new version of pypdf with this fix? I am sorry to rush you, but it is blocking us from releasing a new feature 😢

@MartinThoma, yes, please feel free to add me to the contributors list. 😄

@pubpub-zz, I would love to share the PDF file, but the few files that triggered the error are resumes containing PII, which my organization won't allow me to share. I hope you understand.
If I can reproduce this error on any other PDF file, I will definitely share it with you.

@MartinThoma
Member

I'm creating a release at the moment. It will be on PyPI in less than 2 hours.

However, if you want to make sure that the fix stays in pypdf, we need to get a sample file. Otherwise it could happen in future that another change breaks it again (but I also understand the PII restrictions 😢 )
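
Until a shareable file turns up, one possible stopgap is a test built from synthetic bytes. This is only a sketch under stated assumptions: `describe_token` is a hypothetical stand-in for the message-building step and not pypdf's API, and a test like this does not replace a regression test against a real PDF.

```python
def describe_token(tok: bytes) -> str:
    """Hypothetical stand-in for the message-building step; not pypdf's API."""
    return f"Unexpected escaped string: {tok.decode('utf-8', errors='ignore')}"


def test_describe_token_never_raises() -> None:
    # Try every possible single byte, including invalid UTF-8 start
    # bytes such as the 0x83 reported in this thread.
    for value in range(256):
        message = describe_token(bytes([value]))
        assert message.startswith("Unexpected escaped string: ")


# Run directly so the sketch works without a test runner.
test_describe_token_never_raises()
```

A test runner such as pytest would collect `test_describe_token_never_raises` by name; the direct call at the end is only there so the snippet runs standalone.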

@MartinThoma
Member

"Yes, please feel free to add me to the contributor's list. 😄"

Which name should I use, and should I link to a profile (e.g. your GitHub profile)?

MartinThoma added a commit that referenced this pull request Jun 4, 2023
Deprecations (DEP)
-  Deprecate PdfMerger (#1866)

Bug Fixes (BUG)
-  Ignore UTF-8 decode errors (#1865)

Robustness (ROB)
-  Handle missing /Type entry in Page tree (#1859)

[Full Changelog](3.9.0...3.9.1)
@pubpub-zz
Collaborator

pubpub-zz commented Jun 4, 2023

@talibhmukadam / @tasfiqul-ghani
Would you agree to share the file privately? If so, please email it to @MartinThoma (info@martin-thoma.de)
