Possibility to support Chinese codecs? #34
Comments
I don't think it would be possible to support GB* without a huge loss in accuracy. The problem is that most sequences of bytes can be decoded as GB18030, for example, regardless of whether they're actually intended to be GB18030. (Are you sure the text in that example is meant to be Chinese at all?) The thing that makes ftfy possible is that most sequences of bytes aren't valid UTF-8, so when you can decode something as UTF-8, it's a strong signal that it's the right thing to do. At one point I looked into trying to support the Japanese encoding Shift-JIS. Even though it has fewer valid sequences than the GB* encodings, I was getting too many false positives on likely sequences of bytes.
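A minimal sketch of that asymmetry, using random bytes as a stand-in for arbitrary data (the trial count and chunk size here are arbitrary choices, not anything from ftfy):

# Random byte strings are almost never valid UTF-8, but a large share of
# them decode without error as GB18030, so a successful GB18030 decode
# carries almost no signal.
import os

trials = 10000
utf8_ok = gb18030_ok = 0
for _ in range(trials):
    chunk = os.urandom(8)
    try:
        chunk.decode('utf-8')
        utf8_ok += 1
    except UnicodeDecodeError:
        pass
    try:
        chunk.decode('gb18030')
        gb18030_ok += 1
    except UnicodeDecodeError:
        pass

print('%d of %d chunks decode as UTF-8' % (utf8_ok, trials))
print('%d of %d chunks decode as GB18030' % (gb18030_ok, trials))

Typically UTF-8 accepts well under one percent of the chunks, while GB18030 accepts a sizeable fraction of them.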
Sure, I indeed did not look at what kind of byte sequences the GB* codecs produce; if you say it's not feasible due to the false-positive rate, then it's not an option. Yes, the text in question was meant to be Chinese; the problem was explicitly constrained to text that was either English or Chinese.
This string is an interesting puzzle. I'm coming to the conclusion that it's not actually GB*: it seems to be Chinese in triple-UTF-8 with some bytes missing.
See, to me any Chinese character looks like any other Chinese character, and I made the incorrect assumption that by running the sloppy-windows-1252 result through a GB* codec I'd get something approaching valid text. The missing bytes are probably due to un-printable bytes not having been copied into the question; the OP didn't use … Something like this then?

>>> import ftfy.bad_codecs  # importing this registers the 'sloppy-windows-1252' codec
>>> print u'袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥ÂÂå•â€'.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'ignore')
袋dcx与朋们
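Generalizing that round trip (the function name below is mine, not part of ftfy): keep peeling off one layer of "UTF-8 bytes read as windows-1252" until the text stops changing, and stop as soon as characters outside windows-1252 appear, since those can't have come through a cp1252 misreading.

import ftfy.bad_codecs  # importing this registers the 'sloppy-windows-1252' codec

def peel_utf8_layers(text):
    while True:
        try:
            raw = text.encode('sloppy-windows-1252')
        except UnicodeEncodeError:
            # Characters outside windows-1252 (e.g. recovered Chinese)
            # mean there is no further layer to undo.
            return text
        fixed = raw.decode('utf-8', 'ignore')
        if fixed == text:
            return text
        text = fixed

On mixed strings like the example above, which already contain recovered Chinese next to leftover mojibake, this stops immediately; handling such partially fixable text is exactly the hard part.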
Based on this Stack Overflow question, I looked into support for Chinese character encodings.
The GB* series of codecs are, like UTF-8, variable-width encodings. The example in the question reads:

袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥ÂÂå•â€

which can be decoded using GB* encodings to varying degrees of success.
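For illustration, a minimal sketch of such an attempt, assuming the raw bytes are first recovered via ftfy's sloppy-windows-1252 codec (the 'ignore' on the encode drops the characters that never fit in windows-1252):

import ftfy.bad_codecs  # importing this registers the 'sloppy-windows-1252' codec

mojibake = u'袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥ÂÂå•â€'
raw = mojibake.encode('sloppy-windows-1252', 'ignore')
for codec in ('gb2312', 'gbk', 'gb18030'):
    # 'replace' keeps the decoder going past bytes that don't fit the codec
    print('%s: %s' % (codec, raw.decode(codec, 'replace')))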
Unfortunately I do not know which one of these is closest to the original, but that doesn't matter all that much. What would be needed is an analysis of how GB* encodings pushed through the CP1252 / Latin-1 sieve can be distinguished from UTF-8 mojibake, and how they could then be handled in fix_one_step_and_explain(). Is supporting these codecs feasible?