Possibility to support Chinese codecs? #34
Comments
I don't think it would be possible to support GB* without a huge loss in accuracy. The problem is that most sequences of bytes can be decoded as GB18030, for example, regardless of whether they're actually intended to be GB18030. (Are you sure the text in that example is meant to be Chinese at all?) The thing that makes ftfy possible is that most sequences of bytes aren't valid UTF-8, so when you can decode something as UTF-8, it's a strong signal that it's the right thing to do. At one point I looked into trying to support the Japanese encoding Shift-JIS. Even though it has fewer valid sequences than the GB* encodings, I was getting too many false positives on likely sequences of bytes.
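A minimal sketch of that asymmetry, using random bytes as a stand-in for arbitrary data (the trial count and chunk size here are arbitrary choices, not anything from ftfy):

# Random byte strings are almost never valid UTF-8, but a large share of
# them decode without error as GB18030, so a successful GB18030 decode
# carries almost no signal.
import os

trials = 10000
utf8_ok = gb18030_ok = 0
for _ in range(trials):
    chunk = os.urandom(8)
    try:
        chunk.decode('utf-8')
        utf8_ok += 1
    except UnicodeDecodeError:
        pass
    try:
        chunk.decode('gb18030')
        gb18030_ok += 1
    except UnicodeDecodeError:
        pass

print('%d of %d chunks decode as UTF-8' % (utf8_ok, trials))
print('%d of %d chunks decode as GB18030' % (gb18030_ok, trials))

Typically UTF-8 accepts well under one percent of the chunks, while GB18030 accepts a sizeable fraction of them.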
Sure, I indeed did not look at what kind of byte sequences the GB* codecs produce; if you say it's not feasible due to the false-positive rate, then it's not an option. Yes, the text in question was meant to be Chinese; the problem was explicitly constrained to text that was either English or Chinese.
This string is an interesting puzzle. I'm coming to the conclusion that it's not actually GB*: it seems to be Chinese in triple-UTF-8 with some bytes missing.
See, to me any Chinese character looks like any other Chinese character, and I made the incorrect assumption that by running the sloppy-windows-1252 result through a GB* codec I'd get something approaching valid text. The missing bytes are probably due to un-printable bytes not having been copied into the question; the OP didn't use … Something like this then?

>>> import ftfy.bad_codecs  # importing this registers the 'sloppy-windows-1252' codec
>>> print u'袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥ÂÂå•â€'.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'ignore')
袋dcx与朋们
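Generalizing that round trip (the function name below is mine, not part of ftfy): keep peeling off one layer of "UTF-8 bytes read as windows-1252" until the text stops changing, and stop as soon as characters outside windows-1252 appear, since those can't have come through a cp1252 misreading.

import ftfy.bad_codecs  # importing this registers the 'sloppy-windows-1252' codec

def peel_utf8_layers(text):
    while True:
        try:
            raw = text.encode('sloppy-windows-1252')
        except UnicodeEncodeError:
            # Characters outside windows-1252 (e.g. recovered Chinese)
            # mean there is no further layer to undo.
            return text
        fixed = raw.decode('utf-8', 'ignore')
        if fixed == text:
            return text
        text = fixed

On mixed strings like the example above, which already contain recovered Chinese next to leftover mojibake, this stops immediately; handling such partially fixable text is exactly the hard part.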
Based on this Stack Overflow question, I looked into support for Chinese character encodings.
The GB* series of codecs are, like UTF-8, variable-width encodings. The example in the question reads:

袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥ÂÂå•â€

which can be decoded using GB* encodings to varying degrees of success.
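For illustration, a minimal sketch of such an attempt, assuming the raw bytes are first recovered via ftfy's sloppy-windows-1252 codec (the 'ignore' on the encode drops the characters that never fit in windows-1252):

import ftfy.bad_codecs  # importing this registers the 'sloppy-windows-1252' codec

mojibake = u'袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥ÂÂå•â€'
raw = mojibake.encode('sloppy-windows-1252', 'ignore')
for codec in ('gb2312', 'gbk', 'gb18030'):
    # 'replace' keeps the decoder going past bytes that don't fit the codec
    print('%s: %s' % (codec, raw.decode(codec, 'replace')))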
Unfortunately I do not know which one of these is closest to the original, but that doesn't matter all that much. What would be needed is an analysis of how GB* encodings pushed through the CP1252 / Latin-1 sieve can be distinguished from UTF-8 mojibake, and how they could then be handled in fix_one_step_and_explain(). Is supporting these codecs feasible?