Support GB18030 #11

Closed
puzzlet opened this issue Jan 24, 2013 · 10 comments
@puzzlet
Collaborator

puzzlet commented Jan 24, 2013

GB18030 is a superset of the Chinese encoding GB2312, which charade already supports.

Like #10, we can support this by:

  • renaming GB2312-related classes and constants to GB18030
  • and patching the byte-sequence state machine in mbcssm.py
@ghost ghost assigned puzzlet Jan 24, 2013
@sigmavirus24
Member

This one, however, I will gladly rename, with a note that GB2312 is superseded by GB18030, since it's actually an official standard.

@sigmavirus24
Member

We will probably also need to add to this to make sure GB18030 is covered as well, since it is an expansion of GB2312, right? I don't mind replacing GB2312 in this case, but I want to make sure we cover GB18030 properly.

@puzzlet
Collaborator Author

puzzlet commented Jan 31, 2013

The module currently seems to handle this encoding incorrectly: it identifies strings containing non-GB2312 characters as GB2312:

# from http://lifesinger.github.com/lab/2009/loadtime/test_hubble.html
>>> x = u'(人工の測量,1~2センチメートルの誤差があるかもしれない 尺码标注仅供参考,不能作为退货理由)'
>>> x.encode('GB2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u6e2c' in position 4: illegal multibyte sequence
>>> import charade; charade.detect(x.encode('GB18030'))
{'confidence': 0.99, 'encoding': 'GB2312'}
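
For anyone reproducing this on Python 3, here is a minimal sketch of the same encodability check using only the standard codecs. It leaves charade out, since detector output may differ between versions, and `encodable` is a made-up helper name, not part of any library.

```python
# Python 3 sketch: which standard codecs can represent the sample text?
# `encodable` is a hypothetical helper, not part of charade.
text = '(人工の測量,1~2センチメートルの誤差があるかもしれない 尺码标注仅供参考,不能作为退货理由)'


def encodable(s, codec):
    """Return True if every character of s can be encoded with codec."""
    try:
        s.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


gb2312_ok = encodable(text, 'gb2312')    # fails on U+6E2C, as in the traceback
gb18030_ok = encodable(text, 'gb18030')  # GB18030 covers all of Unicode
print(gb2312_ok, gb18030_ok)
```

So a correct detector should never report GB2312 for these bytes, because no GB2312 encoder could have produced them.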

But it's not a correct GB18030 detector either, as it rules out some unusual but valid byte sequences:

# from http://zh.wikipedia.org/zh-cn/%C4%90
>>> x = u'Đ, đ(d-stroke)是标准越南语跟标准克罗地亚语和波斯尼亚语所使用的字母,但两者所表示的音位完全不同。它是越南语字母表的第 7 个字 母、克罗地亚语字母表的第 8 个字母。'
>>> charade.detect(x.encode('GB18030'))
{'confidence': 0.22369399198761675, 'encoding': 'ISO-8859-2'}
>>> charade.detect(x[4:].encode('GB18030'))
{'confidence': 0.99, 'encoding': 'GB2312'}
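
One likely reason for the difference, stated here as an assumption rather than something confirmed in this thread: Đ and đ have no two-byte code in GB18030 and are encoded as four-byte sequences whose second and fourth bytes are ASCII digits (0x30-0x39). A state machine built for GB2312 would presumably treat such a trail byte as an error, and `x[4:]` happens to drop both letters. A small Python 3 sketch to inspect the bytes:

```python
# Python 3 sketch: inspect how GB18030 encodes the leading characters.
for ch in 'Đđ是':
    encoded = ch.encode('gb18030')
    # Print the code point, the sequence length, and the bytes in hex.
    print('U+%04X' % ord(ch), len(encoded), encoded.hex())

d_stroke = 'Đ'.encode('gb18030')  # expected to be a four-byte sequence
shi = '是'.encode('gb18030')       # an ordinary two-byte sequence
```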

@ghost ghost assigned puzzlet Jan 31, 2013
@sigmavirus24
Member

Thanks for the extra information @puzzlet. I'm hoping to have some time for charade this weekend or next.

@sigmavirus24
Member

@puzzlet there's no chance you have an accurate frequency table for this, is there?

@puzzlet
Collaborator Author

puzzlet commented Feb 26, 2013

@sigmavirus24 no; but I'm pretty sure it's not very different from the GB2312 table, which would be a good starting point. (and I wonder how accurate original chardet's tables used to be?)

@sigmavirus24
Member

(I wonder the same thing. But just because I don't implicitly trust many things.) How do you propose adding support for the valid characters you mentioned were causing issues with the GB2312 encoding?

(Also, sorry for taking so long to get around to this. I'm still really busy, but I've worked through other projects and want to make this better.)

@puzzlet
Collaborator Author

puzzlet commented Feb 26, 2013

For the behaviour of the detector, we can add GB18030 as a new, separate encoding as we did in #13.

The issue described above is a bug in the byte-sequence state machine (mbcssm.py): we need to examine the table for GB2312 and add a new one for GB18030.
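
To make the target concrete, here is a rough sketch of the byte structure a GB18030 state machine has to accept: one byte (0x00-0x7F), two bytes (lead 0x81-0xFE, trail 0x40-0x7E or 0x80-0xFE), or four bytes (0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39). This is a plain loop, not the table format used in mbcssm.py, and `split_gb18030` is a hypothetical name. It checks structure only, not whether each sequence maps to an assigned character.

```python
def split_gb18030(data):
    """Split bytes into GB18030 sequences, or raise ValueError if malformed.

    Structure check only: it does not verify that a sequence is assigned.
    """
    sequences = []
    i = 0
    n = len(data)
    while i < n:
        b1 = data[i]
        if b1 <= 0x7F:
            length = 1  # single-byte (ASCII) range
        elif 0x81 <= b1 <= 0xFE and i + 1 < n:
            b2 = data[i + 1]
            if 0x40 <= b2 <= 0xFE and b2 != 0x7F:
                length = 2  # two-byte sequence
            elif (0x30 <= b2 <= 0x39 and i + 3 < n
                  and 0x81 <= data[i + 2] <= 0xFE
                  and 0x30 <= data[i + 3] <= 0x39):
                length = 4  # four-byte sequence with digit trail bytes
            else:
                raise ValueError('invalid or truncated sequence at offset %d' % i)
        else:
            raise ValueError('invalid or truncated sequence at offset %d' % i)
        sequences.append(data[i:i + length])
        i += length
    return sequences


lengths = [len(s) for s in split_gb18030('Đ是a'.encode('gb18030'))]
print(lengths)
```

A table-driven version would presumably fold the same ranges into byte classes and state transitions.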

@sigmavirus24
Member

Sounds like a good start.

@sigmavirus24
Member

It's depressing that, as late as September of last year, Mozilla still didn't have a table for GB18030. Here's everything they have: https://mxr.mozilla.org/mozilla-central/source/extensions/universalchardet/src/base/

I might write something that strips the C++ and replaces everything with Python, giving us a simple way to convert their *.tab files to our familiar *.py files.
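
A rough sketch of what such a converter could look like, assuming a *.tab file is C source holding one brace-delimited array of decimal integers with optional comments. That layout is an assumption, and `c_table_to_python` and the sample input are made up for illustration. Hex literals and multiple arrays per file are not handled.

```python
import re


def c_table_to_python(c_source, name):
    """Turn the first brace-delimited integer array in c_source into Python source."""
    # Drop /* ... */ and // ... comments so numbers inside them are ignored.
    text = re.sub(r'/\*.*?\*/', '', c_source, flags=re.S)
    text = re.sub(r'//[^\n]*', '', text)
    match = re.search(r'\{(.*?)\}', text, flags=re.S)
    if match is None:
        raise ValueError('no brace-delimited array found')
    values = [int(tok) for tok in re.findall(r'-?\d+', match.group(1))]
    # Emit a tuple literal, 16 values per line.
    lines = ['%s = (' % name]
    for start in range(0, len(values), 16):
        chunk = values[start:start + 16]
        lines.append('    ' + ', '.join(str(v) for v in chunk) + ',')
    lines.append(')')
    return '\n'.join(lines) + '\n'


# Hypothetical input, not copied from any Mozilla file.
sample = 'static const int16_t Demo[] = {\n1, 2, // note 99\n3, /* 7 */ 4\n};\n'
converted = c_table_to_python(sample, 'DEMO')
print(converted)
```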
