Improve auto detection of encodings/character sets in absence of ECI info #336

Open
axxel opened this issue May 20, 2022 · 0 comments
axxel commented May 20, 2022

@gitlost's comment in #334 about guessing the language made me rethink the dumb approach I mentioned there and first invest a little time in a web search, which turned up this: https://github.com/google/compact_enc_det. It has been suggested here and seems to be used inside Google Chrome. It has a compatible license, so I gave it a 'quick' try and hacked it into my build, and it does indeed detect the correct encoding of the mentioned upstream sample. It breaks 2 unit tests, though, because it guesses something different from the old approach (GB2312 instead of ShiftJIS). The same problem would likely occur with any other self-made 'improvement' or with any other third-party library.
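To illustrate why two detectors can legitimately disagree between ShiftJIS and GB2312, here is a minimal sketch that checks only the byte structure of each encoding. It is neither the heuristic currently used in this project nor what compact_enc_det does internally; the function names `looksLikeShiftJIS` and `looksLikeGB2312` are hypothetical, and the byte ranges are the usual lead/trail ranges for Shift_JIS and for GB2312 in its EUC-CN form.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: true if `bytes` is structurally valid Shift_JIS,
// i.e. made only of ASCII, half-width katakana (0xA1-0xDF) and
// well-formed lead/trail byte pairs.
bool looksLikeShiftJIS(const std::string& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        auto b = static_cast<std::uint8_t>(bytes[i]);
        if (b <= 0x7F || (b >= 0xA1 && b <= 0xDF))
            continue; // single-byte character
        bool isLead = (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
        if (!isLead || i + 1 >= bytes.size())
            return false;
        auto t = static_cast<std::uint8_t>(bytes[++i]);
        if (t < 0x40 || t == 0x7F || t > 0xFC)
            return false; // invalid trail byte
    }
    return true;
}

// Hypothetical helper: true if `bytes` is structurally valid GB2312
// (EUC-CN form): ASCII or pairs with lead 0xA1-0xF7 and trail 0xA1-0xFE.
bool looksLikeGB2312(const std::string& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        auto b = static_cast<std::uint8_t>(bytes[i]);
        if (b <= 0x7F)
            continue; // ASCII
        if (b < 0xA1 || b > 0xF7 || i + 1 >= bytes.size())
            return false;
        auto t = static_cast<std::uint8_t>(bytes[++i]);
        if (t < 0xA1 || t > 0xFE)
            return false; // invalid trail byte
    }
    return true;
}
```

A byte string such as `"\xB1\xB2\xE0\xA1"` passes both checks: as ShiftJIS it reads as two half-width katakana followed by one double-byte character, as GB2312 it reads as two double-byte characters. For input like that, the choice between the two comes down to statistics or other heuristics, so a different detector can pick a different but equally well-formed answer.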

One annoying problem is that the provided CMake scripts are uncooperative in the FetchContent context and need patches before they build, so I would need to fork the library.

So this looks promising but would likely introduce regressions for end users (while also likely fixing other issues...).
