Support CP949 (Windows-949) #10

Closed
puzzlet opened this issue Jan 24, 2013 · 6 comments

Comments

@puzzlet
Collaborator

puzzlet commented Jan 24, 2013

CP949 is a superset of EUC-KR (Korean) with extra characters defined. Almost all web pages that declare themselves as EUC-KR can safely be assumed to be in CP949, which they may well be, since CP949 has been the default code page of the Korean version of MS Windows.
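The superset relationship can be illustrated with Python's built-in `cp949` and `euc_kr` codecs. This is only a sketch: the sample strings are chosen for illustration and are not from the issue. It assumes that the syllable 똠 is one of the CP949 additions that has no plain two-byte EUC-KR code, so its CP949 bytes fall outside the 0xA1-0xFE range that EUC-KR uses.

```python
# Text made of common syllables encodes to the same bytes in both codecs.
common = "한국어"
same_bytes = common.encode("euc_kr") == common.encode("cp949")

# A syllable assumed to exist only in the CP949 extension area.
extra = "똠"
encoded = extra.encode("cp949")

# EUC-KR double-byte codes use 0xA1-0xFE for both bytes; an extension
# character has at least one byte below that range.
outside_euc_kr_range = any(b < 0xA1 for b in encoded)

# Decoding the CP949 bytes with the stricter EUC-KR codec should fail.
try:
    encoded.decode("euc_kr")
    decodes_as_euc_kr = True
except UnicodeDecodeError:
    decodes_as_euc_kr = False

print("same bytes for common text:", same_bytes)
print("extension bytes:", encoded.hex())
print("byte outside EUC-KR range:", outside_euc_kr_range)
print("decodes as EUC-KR:", decodes_as_euc_kr)
```

A page containing even one such character is no longer valid EUC-KR at the byte level, which is why a detector that only knows EUC-KR byte sequences can reject it.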

Here are the usage stats on the web, according to Google: http://googleblog.blogspot.kr/2010/01/unicode-nearing-50-of-web.html

We can support this by:

  • renaming EUCKR-related classes and constants to CP949
  • and patching the byte-sequence state machine in mbcssm.py

The frequency table can stay the same, since the additional characters are the least frequent ones.

@ghost ghost assigned puzzlet Jan 24, 2013
@puzzlet puzzlet mentioned this issue Jan 24, 2013
puzzlet added a commit that referenced this issue Jan 24, 2013
puzzlet added a commit that referenced this issue Jan 24, 2013
@sigmavirus24
Member

@puzzlet nowhere in that Google blog post do I see anything about this. They specifically use EUC-KR as a label for their graph, not CP949.

sigmavirus24 added a commit that referenced this issue Jan 24, 2013
sigmavirus24 added a commit that referenced this issue Jan 24, 2013
sigmavirus24 added a commit that referenced this issue Jan 24, 2013
@puzzlet
Collaborator Author

puzzlet commented Jan 24, 2013

@sigmavirus24 Web programmers who work with Korean encodings often use the names EUC-KR and CP949 interchangeably. Only EUC-KR is recognized by some standards, while CP949 is the de facto encoding in which most of the pages are written, even when they say they're encoded in EUC-KR.

@sigmavirus24
Member

Per https://en.wikipedia.org/wiki/Code_page_949, it isn't a standard recognized by IANA. I don't see a reason to rename anything. Creating a set of docs and noting that a page encoded with CP949 will be detected as EUC-KR is fine with me.

@puzzlet
Collaborator Author

puzzlet commented Jan 24, 2013

I just said that CP949 has extra characters defined. There are tons of live web pages containing those, which as a result fail to be detected as EUC-KR. The test case at ceecb4a is one example.

$ charade tests/CP949/ricanet.com.xml
tests/CP949/ricanet.com.xml: ISO-8859-2 with confidence 0.2285323490602884

To detect CP949 successfully, we need a new state machine that accepts the additional byte sequences.
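As a rough illustration of what "accepting the additional byte sequences" means, here is a minimal structural validator. It is not the state machine table from mbcssm.py; the function name `is_structurally_valid` is hypothetical, and the CP949 ranges are a simplification (lead byte 0x81-0xFE, trail byte 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE) that ignores unassigned code points.

```python
def _in_ranges(byte, ranges):
    """Return True if byte falls inside any (low, high) inclusive range."""
    return any(low <= byte <= high for low, high in ranges)


# Simplified byte ranges; an assumption for illustration only.
RANGES = {
    "EUC-KR": {"lead": [(0xA1, 0xFE)], "trail": [(0xA1, 0xFE)]},
    "CP949": {
        "lead": [(0x81, 0xFE)],
        "trail": [(0x41, 0x5A), (0x61, 0x7A), (0x81, 0xFE)],
    },
}


def is_structurally_valid(data, encoding):
    """Two-state check: either at a character start or expecting a trail byte."""
    ranges = RANGES[encoding]
    expecting_trail = False
    for byte in data:
        if expecting_trail:
            if not _in_ranges(byte, ranges["trail"]):
                return False  # illegal trail byte
            expecting_trail = False
        elif byte < 0x80:
            continue  # ASCII is a complete character on its own
        elif _in_ranges(byte, ranges["lead"]):
            expecting_trail = True
        else:
            return False  # illegal lead byte
    return not expecting_trail  # a dangling lead byte is an error


sample = "똠방각하".encode("cp949")  # first syllable is a CP949 extension
plain = "한국어".encode("euc_kr")

print(is_structurally_valid(sample, "CP949"))
print(is_structurally_valid(sample, "EUC-KR"))
print(is_structurally_valid(plain, "EUC-KR"))
```

With the narrower EUC-KR ranges, the sample is rejected at its first extension character, so the EUC-KR prober would be ruled out before any frequency analysis runs.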

@puzzlet puzzlet reopened this Jan 24, 2013
@sigmavirus24
Member

In the interest of being entirely transparent: I'm not going to simply rename EUC-KR. I'm going to keep EUC-KR and add an entirely separate set of tools for CP949, since it is fairly prominent. I'm swamped right now, so if you want to submit a pull request, that'd be great; otherwise I'll be hacking away at this slowly and methodically.

@dalguji

dalguji commented Jan 11, 2015

Side note: if anyone drops by here looking for CP949 encoding detection support, please note that it was implemented in chardet/chardet in Dec 2013.

@sv24-archive sv24-archive locked and limited conversation to collaborators Jan 11, 2015