Support CP949 (Windows-949) #10
Comments
@puzzlet Nowhere in that Google blog post do I see anything about this. They specifically use EUC-KR as a label for their graph, not CP949.
@sigmavirus24 Web programmers who work with Korean encodings often use the names EUC-KR and CP949 interchangeably. Only EUC-KR is recognized by some standards, while CP949 is the de facto encoding in which most pages are written, even when they declare themselves as EUC-KR.
According to https://en.wikipedia.org/wiki/Code_page_949, it isn't a standard recognized by IANA. I don't see a reason to rename anything. Creating a set of docs and noting that a page encoded with CP949 will be detected as EUC-KR is fine with me.
I just said that CP949 has extra characters defined. There are tons of live webpages containing those characters, and as a result they fail to be detected as EUC-KR. The test case at ceecb4a is one example.
To detect CP949 successfully, we need a new state machine that accepts the newly introduced byte sequences.
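As a rough illustration of why a separate state machine is needed, here is a minimal sketch of a structural check on double-byte sequences. The byte ranges are the commonly documented ones (EUC-KR: lead and trail both 0xA1-0xFE; CP949: lead 0x81-0xFE, trail 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE). The function names are hypothetical, and this is a simplification, not the actual tables in mbcssm.py.

```python
def is_euckr_pair(lead: int, trail: int) -> bool:
    # EUC-KR double-byte characters: both bytes fall in 0xA1-0xFE.
    return 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE


def is_cp949_pair(lead: int, trail: int) -> bool:
    # CP949 widens the lead range and allows ASCII-letter and
    # 0x81-0xA0 trail bytes in addition to the EUC-KR ones.
    trail_ok = (
        0x41 <= trail <= 0x5A
        or 0x61 <= trail <= 0x7A
        or 0x81 <= trail <= 0xFE
    )
    return 0x81 <= lead <= 0xFE and trail_ok


def is_structurally_valid(data: bytes, is_pair) -> bool:
    # Walk the buffer: ASCII bytes stand alone, anything else must
    # start a two-byte sequence accepted by `is_pair`.
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
            continue
        if i + 1 >= len(data) or not is_pair(data[i], data[i + 1]):
            return False
        i += 2
    return True


# The first syllable of this sample is outside KS X 1001 (illustrative choice).
sample = "똠방각하".encode("cp949")
print(is_structurally_valid(sample, is_cp949_pair))
print(is_structurally_valid(sample, is_euckr_pair))
```

A check that only knows the EUC-KR ranges rejects the sample at its first character, which is the behaviour described above for pages that use the extra characters.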
In the interest of being entirely transparent, I'm not just going to move EUC-KR. I'm going to keep EUC-KR and add an entirely separate set of tools for CP949, since it is fairly prominent. I'm swamped right now, so if you want to submit a pull request, that'd be great; otherwise I'll be hacking away at this slowly and methodically.
Side note: if anyone drops by here looking for CP949 detection support, please note that it was implemented in chardet/chardet in December 2013.
CP949 is a superset of EUC-KR (Korean) with extra characters defined. Almost all webpages that declare themselves as EUC-KR can safely be assumed to be in CP949, as they potentially are, since CP949 has been the default encoding of the Korean version of MS Windows.
Here are the usage stats on the web, according to Google: http://googleblog.blogspot.kr/2010/01/unicode-nearing-50-of-web.html
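The superset relationship can be checked with Python's built-in `cp949` and `euc_kr` codecs. This is a minimal sketch: the helper name `decodes_as` and the two sample syllables are my own choices for illustration, not part of the library under discussion.

```python
def decodes_as(data: bytes, codec: str) -> bool:
    """Return True if `data` decodes cleanly with the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# "가" is in KS X 1001, so EUC-KR and CP949 share its byte sequence.
common = "가".encode("cp949")

# "똠" is one of the Hangul syllables that CP949 adds on top of EUC-KR.
extended = "똠".encode("cp949")

print(decodes_as(common, "euc_kr"))    # shared character
print(decodes_as(extended, "euc_kr"))  # CP949-only byte sequence
print(decodes_as(extended, "cp949"))
```

A page containing even one such extended byte sequence is not valid EUC-KR in the strict sense, even if its header says `EUC-KR`.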
We can support this by:
- renaming EUCKR-related classes and constants to CP949
- updating the state machine in mbcssm.py
The frequency table should be the same, since the supplemented characters are the most infrequent.