Simplified Chinese #3

Open
dan2468 opened this issue Jan 22, 2018 · 1 comment

Comments

dan2468 commented Jan 22, 2018

Should ftfy be able to fix “ÓůÓĂČíĽţ(YYRJ)”?
(It should be a person’s name in a Chinese script.)

Contributor

rspeer commented Jan 23, 2018

The text is "御用软件(YYRJ)", right? (That's what you get by encoding the garbled text as Windows-1250 and decoding the resulting bytes as GBK.)
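
For anyone who wants to check that round trip by hand, here is a minimal sketch using only Python's built-in codecs. The helper name `undo_mojibake` is made up for illustration and is not part of ftfy; the sketch also assumes you already know which two encodings were mixed up, which is exactly the part that is hard to detect automatically.

```python
def undo_mojibake(garbled: str, wrong: str = "cp1250", right: str = "gbk") -> str:
    """Re-encode with the codec that was wrongly used to decode the bytes,
    then decode those bytes with the codec that was actually intended.

    Hypothetical helper for illustration only; not an ftfy function.
    """
    return garbled.encode(wrong).decode(right)


original = "御用软件(YYRJ)"

# Reproduce the damage: GBK bytes misread as Windows-1250 (cp1250 in Python).
garbled = original.encode("gbk").decode("cp1250")
print(garbled)

# Reverse it: back to the raw bytes via cp1250, then decode as GBK.
print(undo_mojibake(garbled))
```

The ASCII part, `(YYRJ)`, survives untouched because both encodings agree on the ASCII range; only the bytes above 0x7F are misread.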

This is a similar case to #4, but because GBK is a multi-byte character set, it is at least conceivable that the ftfy library could deal with it.

The problem is the decoding as Windows-1250, the Eastern European encoding that's giving you letters like ů. It often creates a mess of ambiguity (as it does in #4) by being too similar to ISO-8859-2. I don't think ftfy will ever be able to disentangle Windows-1250 from arbitrary other encodings for that reason. Do you have any control over your data source that's decoding text from numerous different languages as if it were Windows-1250?
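
To get a rough feel for how close Windows-1250 and ISO-8859-2 are, the sketch below compares how Python's built-in `cp1250` and `iso8859_2` codecs decode each byte above 0x7F. The function name `differing_bytes` is hypothetical, and this only illustrates the overlap; it is not how ftfy makes its decisions.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list:
    """Return the byte values in 0x80..0xFF that the two codecs decode differently.

    Bytes that a codec leaves undefined are decoded to U+FFFD via errors="replace".
    """
    diffs = []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        char_a = raw.decode(codec_a, errors="replace")
        char_b = raw.decode(codec_b, errors="replace")
        if char_a != char_b:
            diffs.append(value)
    return diffs


diffs = differing_bytes("cp1250", "iso8859_2")
print(f"{128 - len(diffs)} of 128 high bytes decode identically")
print("bytes that differ:", " ".join(f"{value:02X}" for value in diffs))
```

Text that only uses the shared bytes looks the same under either encoding, so a heuristic has very little evidence for choosing between them.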
