Support GB18030 #11

puzzlet · 2013-01-24T08:59:20Z

GB18030 is a superset of Chinese encoding GB2312, which charade already supports.

Like #10, we can support this by:

renaming GB2312-related classes and constants to GB18030
and patching the byte-sequence state machine in mbcssm.py


sigmavirus24 · 2013-01-24T14:32:17Z

This one, however, I will gladly rename, and I'll make a note that GB2312 is superseded by GB18030, since it's actually an official standard.

sigmavirus24 · 2013-01-25T05:17:10Z

We will probably also need to add to this to make sure GB18030 is covered as well, since it is an expansion of GB2312, right? I don't mind replacing GB2312 in this case, but I want to make sure we cover GB18030 properly.

puzzlet · 2013-01-31T15:32:03Z

The module currently seems to have flaws with the encoding, as it incorrectly identifies strings with non-GB2312 characters as GB2312:

# from http://lifesinger.github.com/lab/2009/loadtime/test_hubble.html
>>> x = u'(人工の測量,1～2センチメートルの誤差があるかもしれない 尺码标注仅供参考，不能作为退货理由）'
>>> x.encode('GB2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gb2312' codec can't encode character u'\u6e2c' in position 4: illegal multibyte sequence
>>> import charade; charade.detect(x.encode('GB18030'))
{'confidence': 0.99, 'encoding': 'GB2312'}
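
A minimal Python 3 sketch of why this input is not really GB2312, assuming only the standard-library codecs (the sessions in this comment are Python 2). It checks that the character the `gb2312` codec rejected still encodes to two bytes under GB18030, and that those bytes fall outside the GB2312 (EUC-CN) range of lead 0xA1-0xF7 and trail 0xA1-0xFE. The variable names are illustrative only.

```python
# Python 3 sketch; uses only the standard-library codecs.
ch = "測"  # U+6E2C, the character the 'gb2312' codec rejected above

# Can the strict GB2312 codec encode it at all?
try:
    ch.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

# GB18030 still encodes it, as a two-byte sequence.
raw = ch.encode("gb18030")
lead, trail = raw[0], raw[1]

# GB2312 (EUC-CN) double-byte characters use lead 0xA1-0xF7, trail 0xA1-0xFE.
in_euc_cn_range = 0xA1 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE

print(in_gb2312, len(raw), in_euc_cn_range)
```

A detector that only accepts the EUC-CN byte ranges should therefore not report plain GB2312 for this string.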

But it's not a correct GB18030 detector either, as it rules out some unusual but valid byte sequences:

# from http://zh.wikipedia.org/zh-cn/%C4%90
>>> x = u'Đ, đ（d-stroke）是标准越南语跟标准克罗地亚语和波斯尼亚语所使用的字母，但两者所表示的音位完全不同。它是越南语字母表的第 7 个字 母、克罗地亚语字母表的第 8 个字母。'
>>> charade.detect(x.encode('GB18030'))
{'confidence': 0.22369399198761675, 'encoding': 'ISO-8859-2'}
>>> charade.detect(x[4:].encode('GB18030'))
{'confidence': 0.99, 'encoding': 'GB2312'}
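
The likely culprit in the first four characters is the four-byte form of GB18030, which the two-byte GB2312 rules do not allow. A small Python 3 sketch, assuming only the standard-library `gb18030` codec, that checks the shape of the bytes produced for `Đ`:

```python
# Python 3 sketch; uses only the standard-library codecs.
raw = "Đ".encode("gb18030")  # U+0110, not available as a two-byte code

# GB18030 four-byte layout: bytes 1 and 3 in 0x81-0xFE, bytes 2 and 4 in 0x30-0x39.
is_four_byte_form = (
    len(raw) == 4
    and 0x81 <= raw[0] <= 0xFE
    and 0x30 <= raw[1] <= 0x39
    and 0x81 <= raw[2] <= 0xFE
    and 0x30 <= raw[3] <= 0x39
)

print(raw.hex(), is_four_byte_form)
```

The second and fourth bytes are ASCII digits, which a state machine written for GB2312 would treat as an illegal trail byte.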

sigmavirus24 · 2013-01-31T15:42:41Z

Thanks for the extra information @puzzlet. I'm hoping to have some time for charade this weekend or next.

sigmavirus24 · 2013-02-26T03:53:48Z

@puzzlet there's no chance you have an accurate frequency table for this, do you?

puzzlet · 2013-02-26T04:09:49Z

@sigmavirus24 No, but I'm pretty sure it's not very different from the GB2312 table, which would be a good starting point. (And I wonder how accurate the original chardet's tables were.)

sigmavirus24 · 2013-02-26T04:13:07Z

(I wonder the same thing, but only because I don't implicitly trust many things.) How do you propose adding support for the valid characters you mentioned were causing issues with the GB2312 encoding?

(Also, sorry for taking so long to get around to this. I'm still really busy, but I've worked through other projects and want to make this better.)

puzzlet · 2013-02-26T04:22:55Z

For the behaviour of the detector, we can add GB18030 as a new, separate encoding, as we did in #13.

The issue described above is a bug in the byte-sequence state machine (mbcssm.py): we need to examine the table for GB2312 and add a new one for GB18030.
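
For reference, the byte-level rules a GB18030 table would need to accept can be sketched in plain Python 3. This is not the mbcssm.py table format, and `gb18030_char_lengths` is a hypothetical helper written only for illustration. It assumes the usual GB18030 layout: one byte (0x00-0x7F), two bytes (lead 0x81-0xFE, trail 0x40-0xFE except 0x7F), or four bytes (0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39).

```python
def gb18030_char_lengths(data):
    """Return the byte length of each character in `data`, or None if the
    bytes do not follow the GB18030 sequence structure.

    Hypothetical helper: it checks structure only, not whether each code
    point is actually assigned.
    """
    lengths = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:
            step = 1  # single-byte (ASCII) character
        elif 0x81 <= b0 <= 0xFE and i + 1 < n:
            b1 = data[i + 1]
            if 0x40 <= b1 <= 0xFE and b1 != 0x7F:
                step = 2  # two-byte character (GB2312/GBK area)
            elif (0x30 <= b1 <= 0x39 and i + 3 < n
                  and 0x81 <= data[i + 2] <= 0xFE
                  and 0x30 <= data[i + 3] <= 0x39):
                step = 4  # four-byte character
            else:
                return None  # illegal second byte or truncated sequence
        else:
            return None  # illegal lead byte or truncated sequence
        lengths.append(step)
        i += step
    return lengths


sample = "Đ中a".encode("gb18030")
print(gb18030_char_lengths(sample))
```

A real table for mbcssm.py would express the same transitions as byte classes and states, but the accepted sequences should match these rules.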

sigmavirus24 · 2013-02-26T04:41:34Z

Sounds like a good start.

sigmavirus24 · 2013-03-21T16:12:52Z

The fact that, as late as September of last year, Mozilla still didn't have a table for GB18030 is depressing. Here's everything they have: https://mxr.mozilla.org/mozilla-central/source/extensions/universalchardet/src/base/

I might write something to quickly strip the C++ and replace everything with Python, so we have a simple way of converting their *.tab files to our familiar *.py files.
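
A rough sketch of what such a converter could look like. It assumes a *.tab file boils down to a single C array initializer of decimal integers with C/C++ comments, which may not hold for every file in that directory. `tab_to_py` and the sample input are hypothetical.

```python
import re


def tab_to_py(tab_source, name):
    """Turn a C array initializer of decimal integers into Python source
    that defines a tuple called `name`.

    Hypothetical helper; assumes one `{ ... }` initializer per input.
    """
    # Drop /* ... */ and // ... comments.
    text = re.sub(r"/\*.*?\*/", "", tab_source, flags=re.DOTALL)
    text = re.sub(r"//[^\n]*", "", text)

    # Keep only what sits between the outermost braces.
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no array initializer found")
    body = text[start + 1:end]

    # Standalone decimal integers only; digits inside identifiers are skipped.
    values = [int(tok) for tok in re.findall(r"\b\d+\b", body)]

    lines = ["%s = (" % name]
    for i in range(0, len(values), 8):
        chunk = ", ".join(str(v) for v in values[i:i + 8])
        lines.append("    %s," % chunk)
    lines.append(")")
    return "\n".join(lines) + "\n"


sample = """
static const PRInt16 Demo[] = {
  1, 2, /* a block comment */ 3, // a line comment
  4
};
"""
print(tab_to_py(sample, "DEMO"))
```

Tables built from macros or hex literals would need extra handling, so the output should be checked against the original file by hand.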

ghost assigned puzzlet Jan 24, 2013

sigmavirus24 mentioned this issue Jan 24, 2013

Create some documentation #12

Closed

ghost assigned puzzlet Jan 31, 2013

sigmavirus24 closed this as completed Dec 29, 2014
