improved ExtractText(3) #969
TODO: add some missing encodings
There are two minor Flake8 issues:
Do you prefer to fix them yourself or should I do it? (also as a general question) |
@MartinThoma, can you have a look, please? |
@pubpub-zz I might be sleepy-dumb, but I don't see what you mean. I think there are only two minor stylistic / mypy adjustments you need to make: #971 |
I'll have a more detailed look tomorrow at all the goodness you're bringing to PyPDF2 this time :-) |
Oh, if you're worried about the code coverage: that's not so bad. It's certainly not a blocker for getting your improvements merged. I will run various tests (especially https://github.com/py-pdf/benchmarks ) to check that things have improved. I can live with coverage dropping a bit (and I will have a more detailed look at the places which are not covered) |
@MartinThoma |
@MartinThoma
|
Oh damn. That sounds as if it's related to #646. I'll have a closer look tomorrow |
I still have some work to do to fix text extraction with "paper rotated" |
Codecov Report
@@            Coverage Diff             @@
##             main     #969      +/-   ##
==========================================
+ Coverage   84.25%   84.42%   +0.16%
==========================================
  Files          18       18
  Lines        4115     4179      +64
  Branches      868      887      +19
==========================================
+ Hits         3467     3528      +61
- Misses        465      468       +3
  Partials      183      183
|
@pubpub-zz I've added the test back, without any adjustment. It works: #971 |
@pubpub-zz I cannot answer that. It would be ok for me to have the behavior as in the specs, which also means that I just re-ran the benchmark and the results look very similar. |
@MartinThoma, |
@pubpub-zz I love how committed you are to improving PyPDF2, but please don't feel pressured because I said that I wanted to make a release today. It's unpaid, so it should be fun. If it doesn't work today or even for some weeks, that would be fine 🤗 |
Delivered just before dinner, and I've mowed the lawn 😁 |
@MartinThoma, can you have a look? I do not understand: there is an error in test_utils, but I did not change it. |
That is a merge with main which went wrong. You need to adjust the "ids" parameter to match the number of tests. I think there is currently an "11", but it should be "8" (in the range function) |
I've set the ids because the auto-generated id is just built from all of the parameters, which was extremely long
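For illustration, here is a minimal sketch of how the ids can be derived from the parameter list so that the two cannot get out of sync after a merge. The names `CASES`, `IDS` and `pair_ids` and the parameter values are made up for this example; the real parameters in test_utils are not shown in this thread.

```python
# Hypothetical parameter sets; the real ones in test_utils are not shown here.
CASES = [
    ("a", 1),
    ("bb", 2),
    ("ccc", 3),
]

# Derived from CASES, so the count cannot drift the way a hard-coded
# range(11) can when only 8 parameter sets remain after a merge.
IDS = [f"case-{i}" for i in range(len(CASES))]


def pair_ids(cases, ids):
    """Pair each id with its parameter set, rejecting a count mismatch."""
    if len(ids) != len(cases):
        raise ValueError(f"{len(cases)} parameter sets, but {len(ids)} ids")
    return dict(zip(ids, cases))
```

With pytest, the same two lists would be passed as `@pytest.mark.parametrize(("text", "length"), CASES, ids=IDS)`; pytest itself refuses to collect the test when the number of ids differs from the number of parameter sets.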
Good job 😁👍 I was just making burgers for my girlfriend and we will now have a relaxed evening 😊 |
@pubpub-zz I've updated the PR so that the tests run. It was weird that they didn't succeed ... apparently, the tests ran on the code as if the automatic merge had already been applied. The automatic merge didn't adjust the ids range: 0ba91aa I'll try to go through the PR this evening / tonight :-) |
@pubpub-zz Looks good to me! I would squash-commit with the following text:
Does that represent the changes well to users? |
Besides the two typos I've just commented on, there is one robustness change I would make: The
I would just wrap it in a try-except
import logging
logger = logging.getLogger(__name__)
...
while a <= b:
    sq = fmt2 % c
    key = unhexlify(fmt % a).decode(
        "charmap" if map_dict[-1] == 1 else "utf-16-be"
    )
    unhexlified = unhexlify(sq)
    try:
        decoded = unhexlified.decode("utf-16-be")
    except UnicodeDecodeError as exc:
        # Skip this entry instead of aborting the whole cmap
        logger.warning("UnicodeDecodeError when parsing cmap: %s", exc)
        a += 1
        c += 1
        continue
    map_dict[key] = decoded
    int_entry.append(a)
    a += 1
    c += 1
|
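As a self-contained sketch of the suggestion above (not the actual PyPDF2 code): the function `decode_range` and its parameters are hypothetical and only mimic the loop over a range of source codes `first..last` mapped to consecutive target codes starting at `target`. It assumes 4-digit hex codes and utf-16-be on both sides.

```python
import logging
from binascii import unhexlify

logger = logging.getLogger(__name__)


def decode_range(first: int, last: int, target: int) -> dict:
    """Map codes first..last to consecutive targets, skipping bad entries."""
    mapping = {}
    for offset in range(last - first + 1):
        src = unhexlify("%04X" % (first + offset))
        dst = unhexlify("%04X" % (target + offset))
        try:
            key = src.decode("utf-16-be")
            value = dst.decode("utf-16-be")
        except UnicodeDecodeError as exc:
            # A lone surrogate (e.g. D800) cannot be decoded on its own;
            # log it and carry on with the next entry.
            logger.warning("UnicodeDecodeError when parsing cmap: %s", exc)
            continue
        mapping[key] = value
    return mapping
```

For example, `decode_range(0x41, 0x43, 0x61)` maps "A", "B", "C" to "a", "b", "c", while a range whose targets run into the surrogate block only loses the undecodable entries instead of failing the whole extraction.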
Co-authored-by: Martin Thoma <info@martin-thoma.de>
under analysis |
|
…end of data use surrogatepass in _cmap and _page
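To illustrate what the `surrogatepass` error handler mentioned in this commit title does, here is a small standalone sketch using only the standard library. The helper name `decode_utf16be_lenient` is made up for this example; how exactly _cmap and _page use the handler is not shown in this thread.

```python
def decode_utf16be_lenient(data: bytes) -> str:
    """Decode utf-16-be, letting lone surrogates through instead of raising."""
    return data.decode("utf-16-be", "surrogatepass")


high = b"\xd8\x3d"  # high surrogate of U+1F600, on its own
low = b"\xde\x00"   # matching low surrogate

# Strict decoding of a lone high surrogate fails ("unexpected end of data").
try:
    high.decode("utf-16-be")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With surrogatepass the lone surrogate is kept as a code point,
# and it round-trips back to the same two bytes.
lone = decode_utf16be_lenient(high)
round_trip = lone.encode("utf-16-be", "surrogatepass")

# A complete surrogate pair decodes normally either way.
pair = decode_utf16be_lenient(high + low)
```

The trade-off is that the resulting string may contain lone surrogates, which cannot be encoded to UTF-8 without the same handler.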
@MartinThoma |
The 2.2.0 release improves text extraction again via (#969):
* Improvements around /Encoding / /ToUnicode
* Extraction of CMaps improved
* Fallback for font def missing
* Support for /Identity-H and /Identity-V: utf-16-be
* Support for /GB-EUC-H / /GB-EUC-V / GBp/c-EUC-H / /GBpc-EUC-V (beta release for evaluation)
* Arabic (for evaluation)
* Whitespace extraction improvements
Those changes should mainly improve the text extraction for non-ASCII alphabets, e.g. Russian / Chinese / Japanese / Korean / Arabic.
Full Changelog: 2.1.1...2.2.0
New corrections for extract_text()
fixes extraction in cmap
#953
#431
#242
#591 / #954 should be good, but there are doubts on Arabic