Fix encoding detection on PHP 8.1 #182

Merged on Jun 8, 2022 (2 commits)

come-nc (Contributor) commented Apr 25, 2022

Use mb_check_encoding to detect encoding as mb_detect_encoding is misbehaving under PHP 8.1.
Also use mb_convert_encoding instead of utf8_encode as it’s getting deprecated in PHP 8.2.
Fixes #181
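
A minimal sketch of the approach described above, not the actual patch. The function name `normalizeToUtf8` is hypothetical, and the sketch assumes the input is either valid UTF-8 or ISO-8859-1 and that the mbstring extension is loaded.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: return $value as UTF-8.
 * Assumption: input is either valid UTF-8 or ISO-8859-1.
 */
function normalizeToUtf8(string $value): string
{
    // mb_check_encoding validates the bytes against one named encoding
    // instead of guessing among several candidates.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }

    // Stand-in for utf8_encode(), which is deprecated as of PHP 8.2.
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// Already UTF-8: returned unchanged.
var_dump(normalizeToUtf8("Du\u{0161}an") === "Du\u{0161}an");

// A lone 0xE9 byte is not valid UTF-8, so it is converted from ISO-8859-1.
var_dump(normalizeToUtf8("caf\xE9") === "caf\u{00E9}");
```

Checking UTF-8 first means any byte sequence that is well-formed UTF-8 is never re-encoded, which is the case the PR description is concerned with.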

staabm (Member) commented Apr 25, 2022

do we need additional test-coverage for this change?

codecov bot commented Apr 25, 2022

Codecov Report

Merging #182 (42e1411) into master (315f592) will decrease coverage by 0.15%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##             master     #182      +/-   ##
============================================
- Coverage     89.75%   89.60%   -0.16%     
  Complexity      262      262              
============================================
  Files            15       15              
  Lines           898      885      -13     
============================================
- Hits            806      793      -13     
  Misses           92       92              

Impacted Files       Coverage Δ
lib/functions.php    95.65% <100.00%> (-0.07%) ⬇️
lib/Client.php       84.50% <0.00%> (-0.67%) ⬇️
lib/Response.php     96.77% <0.00%> (-0.20%) ⬇️


come-nc (Contributor, Author) commented Apr 25, 2022

> do we need additional test-coverage for this change?

If the current coverage did not catch the bug, yes.

Here are some example strings that could be used in the test: https://3v4l.org/RrjlE
You can see that both 'Dušan' and 'Živko' (stolen from https://en.wikipedia.org/wiki/Slavic_names) are messed up by the function.
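
To illustrate how such names can end up messed up, here is a sketch of one likely failure mode: a UTF-8 string is mistaken for ISO-8859-1 and converted a second time. This is an assumption about the mechanism, not a reproduction of the linked 3v4l.org snippet, and the helper name `misreadAsLatin1` is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: convert $utf8 as if it were ISO-8859-1,
 * which is what happens when detection picks the wrong encoding.
 */
function misreadAsLatin1(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

$names = ["Du\u{0161}an", "\u{017D}ivko"]; // 'Dušan' and 'Živko' as UTF-8

foreach ($names as $name) {
    $mangled = misreadAsLatin1($name);

    // Each two-byte character is split into two separate characters.
    printf(
        "%s (%d chars) becomes %s (%d chars); input is valid UTF-8: %s\n",
        $name,
        mb_strlen($name, 'UTF-8'),
        $mangled,
        mb_strlen($mangled, 'UTF-8'),
        var_export(mb_check_encoding($name, 'UTF-8'), true)
    );
}
```

Strings like these are useful test inputs because they are valid UTF-8 yet change visibly if they are ever re-encoded.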

staabm (Member) commented Apr 25, 2022

please investigate whether we need such a test and add it, if required.

Signed-off-by: Côme Chilliet <come.chilliet@nextcloud.com>
come-nc (Contributor, Author) commented Apr 26, 2022

> please investigate whether we need such a test and add it, if required.

Test added.


    switch ($encoding) {
        case 'ISO-8859-1':
            $path = utf8_encode($path);

Review comment (Member):

Since utf8_encode() is deprecated as of PHP 8.2, this is a step in the right direction.

👍

staabm requested a review from phil-davis on April 26, 2022 09:34

PVince81 commented Jun 7, 2022

@DeepDiver1975 @phil-davis any objections to merging this? Thanks 😄

come-nc (Contributor, Author) commented Jun 9, 2022

DeepDiver1975 (Member) commented

> Should I open PR on these repos as well?

yes please

phil-davis (Contributor) commented Jun 24, 2022

Just a note for information: there are inherent limits to this kind of automated "educated guess" encoding detection.
For example, hex C2A3 is the UTF-8 encoding of the UK pound symbol £.
But in ISO-8859-1 that is 2 code points: C2 is Â and A3 happens to be the UK pound symbol £.

So if any software is presented with C2A3 as an "encoded string" and no other metadata about how to interpret it, there is no way to know whether it is meant to represent Â£ or just £.

There are plenty of other examples. Have a play at https://dencode.com/en/string/hex: try putting in the hex for UTF-8 (https://en.wikipedia.org/wiki/UTF-8) code points, and look for ones that match sets of ISO-8859-1 code points representing a sequence of characters that could also be a valid, reasonable combination in a file name, for example.

CEA9 is UTF-8 Greek Omega Ω and ISO-8859-1 Î©.

Any automated algorithm has to have a heuristic that "guesses" that Greek Omega is more common than the combination Î©.
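
The ambiguity described above can be checked directly. The following sketch, which assumes the mbstring extension is loaded, shows that the two byte sequences from this comment pass validation in both encodings, so validity alone cannot settle which one was intended.

```php
<?php

declare(strict_types=1);

// Byte sequences from the comment above, written as hex escapes.
$samples = [
    'C2A3' => "\xC2\xA3",
    'CEA9' => "\xCE\xA9",
];

foreach ($samples as $hex => $bytes) {
    // Both checks pass: the bytes are well-formed in either encoding.
    $validUtf8   = mb_check_encoding($bytes, 'UTF-8');
    $validLatin1 = mb_check_encoding($bytes, 'ISO-8859-1');

    // Reading the same bytes as ISO-8859-1 yields two characters, not one.
    $asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

    printf(
        "%s: valid UTF-8=%s (reads as %s), valid ISO-8859-1=%s (reads as %s)\n",
        $hex,
        var_export($validUtf8, true),
        $bytes,
        var_export($validLatin1, true),
        $asLatin1
    );
}
```

A check that tries UTF-8 first therefore encodes exactly the heuristic mentioned here: whenever both readings are possible, the UTF-8 one wins.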

Linked issue: UTF-8 encoding detection fails on PHP 8.1