Improve auto detection of encodings/character sets in absence of ECI info #336

axxel · 2022-05-20T09:43:25Z

@gitlost's comment in #334 about guessing the language made me rethink the dumb approach I mentioned there and first invest a little time in a web search, which turned up this: https://github.com/google/compact_enc_det. It has been suggested here and seems to be used inside Google Chrome. It has a compatible license, so I gave it a 'quick' try and hacked it into my build, and it does indeed detect the correct encoding of the mentioned upstream sample. It breaks 2 unit tests, though, because it guesses something different from the old approach (GB2312 instead of ShiftJIS). The same problem would likely occur with any other self-made 'improvement' or any other third-party library.
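To illustrate why two detectors can legitimately disagree between GB2312 and ShiftJIS, here is a minimal Python sketch. It uses only the standard library codecs, not compact_enc_det or the library's own detection code, and the two-byte input is a made-up example chosen because it happens to be valid in both encodings.

```python
# Illustrative input only: these two bytes are not taken from the failing unit tests.
data = b"\xb1\xb2"

# Decoded as Shift_JIS, each byte is a single-byte half-width katakana character.
as_shift_jis = data.decode("shift_jis")

# Decoded as GB2312, the same two bytes form one double-byte Chinese character.
as_gb2312 = data.decode("gb2312")

# Both decodes succeed, so the bytes alone cannot tell the encodings apart.
print(repr(as_shift_jis), len(as_shift_jis))
print(repr(as_gb2312), len(as_gb2312))
```

Without ECI info, a detector has to fall back on statistics or heuristics for input like this, so a different heuristic can flip the result for short or ambiguous payloads.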

One annoying problem is that the provided CMake scripts are uncooperative in a FetchContent context and need patches to build, so I would need to fork it.

So this looks promising, but it would likely introduce regressions for end users (while also likely fixing other issues...).


This was referenced May 20, 2022

How to improve binary data support? (Community feedback requested) #334

Closed

update CMakeLists.txt to accomodate the needs of the community google/compact_enc_det#22

Open

axxel mentioned this issue Jan 1, 2023

How do I set the CharacterSet in iOS wrapper ? #448

Open
