Possibility to support Chinese codecs? #34

Closed
mjpieters opened this issue Mar 16, 2015 · 5 comments

@mjpieters

Based on this Stack Overflow question I looked into support for Chinese character encodings.

The GB* series of codecs are, like UTF-8, variable-width encodings. The example in the question reads:

袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€

which can be decoded using GB* encodings to varying degrees of success:

>>> print text.encode('windows-1252').decode('gb2312', 'replace')
猫垄�⑩dcx盲赂沤忙��姑ヂ姑ぢ宦р得ヂ�⑩�
>>> print text.encode('windows-1252').decode('gbk', 'replace')
猫垄鈥姑�⑩dcx盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р得ヂ氓鈥⑩�
>>> print text.encode('windows-1252').decode('gb18030', 'replace')
猫垄鈥姑⑩dcx盲赂沤忙艙鈥姑ヂ鈥姑ぢ宦р得ヂ氓鈥⑩�
>>> print text.encode('windows-1252').decode('big5', 'replace')
癡瞽嘔阬Tdcx瓣繡鬚疆��嘔氐嘔刈鄞珍把腕氐倦疇

Unfortunately I do not know which one of these is closest to the original, but that doesn't matter all that much. What would be needed is an analysis of how GB* encodings pushed through the CP1252 / Latin-1 sieve can be distinguished from UTF-8 mojibake and handled in fix_one_step_and_explain().

Is supporting these codecs feasible?

@rspeer
Owner

rspeer commented Mar 16, 2015

I don't think it would be possible to support GB* without a huge loss in accuracy. The problem is that most sequences of bytes can be decoded as GB18030, for example, regardless of whether they're actually intended to be GB18030. (Are you sure the text in that example is meant to be Chinese at all?)

The thing that makes ftfy possible is that most sequences of bytes aren't valid UTF-8, so when you can decode something as UTF-8, it's a strong signal that it's the right thing to do.

At one point I looked into trying to support the Japanese encoding Shift-JIS. Even though it has fewer valid sequences than the GB* encodings, I was getting too many false positives on likely sequences of bytes.
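The asymmetry described above can be checked with a rough experiment. This is only a sketch, not part of ftfy: it assumes Python 3, the helper name `decodable_fraction` is hypothetical, and it samples short random runs of high bytes (0x80 to 0xFF) as a crude stand-in for the bytes that mojibake produces. It counts how often each codec accepts such a run without error.

```python
import random


def decodable_fraction(codec, trials=2000, length=6, seed=0):
    """Fraction of random high-byte strings that decode cleanly with `codec`.

    Hypothetical helper for illustration only. The byte distribution
    (uniform over 0x80-0xFF) is an assumption, not a model of real text.
    """
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        data = bytes(rng.randrange(0x80, 0x100) for _ in range(length))
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        accepted += 1
    return accepted / trials


if __name__ == "__main__":
    for codec in ("utf-8", "shift_jis", "gbk", "gb18030"):
        print(codec, decodable_fraction(codec))
```

If the GB* codecs accept a large share of arbitrary byte runs while UTF-8 accepts almost none, then a successful GB* decode says little about whether the text was really GB*, which is the false-positive problem in a nutshell.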

@rspeer rspeer closed this as completed Mar 16, 2015
@mjpieters
Author

Sure, I indeed did not look at what kind of byte sequences GB* codecs produce; if you say it's not feasible due to the false-positive rates, then it's not an option.

Yes, the text in question was meant to be Chinese; the problem was explicitly constrained to text that was either English or Chinese only.

@rspeer
Owner

rspeer commented Mar 16, 2015

This string is an interesting puzzle. I'm coming to the conclusion that it's not actually GB* - it seems to be Chinese in triple-UTF-8 with some bytes missing.

@mjpieters
Author

See, to me any Chinese character looks like any other Chinese character, and I made the incorrect assumption that by using GB* on the sloppy-cp-1252 result I'd get something approaching valid text.

The missing bytes are probably due to unprintable bytes not having been copied into the question; the OP didn't use repr() here after all.

Something like this then?

>>> print u'袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€'.encode('sloppy-windows-1252').decode('utf8', 'ignore').encode('sloppy-windows-1252').decode('utf8', 'ignore')
袋dcx与朋们
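For readers on Python 3, here is a minimal sketch of how repeated UTF-8 mojibake arises and how it is undone when no bytes have been lost. The function names `garble` and `ungarble` are hypothetical, the sample string is made up rather than recovered from the question, and `latin-1` stands in for ftfy's `sloppy-windows-1252` codec because it maps all 256 byte values and needs no third-party import.

```python
def garble(text, rounds=2, codec="latin-1"):
    """Simulate mojibake: encode as UTF-8, then misread the bytes as `codec`."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode(codec)
    return text


def ungarble(text, rounds=2, codec="latin-1"):
    """Reverse `garble`: recover the bytes via `codec`, then decode as UTF-8."""
    for _ in range(rounds):
        text = text.encode(codec).decode("utf-8")
    return text


if __name__ == "__main__":
    original = "你好，世界"  # made-up sample, not the text from the question
    mangled = garble(original, rounds=3)
    print(repr(mangled))  # repr() keeps unprintable characters visible
    print(ungarble(mangled, rounds=3))
```

With a complete string the strict decode succeeds at every round. Once characters have been dropped, as in the pasted example, the strict decode raises an error and something like the `'ignore'` handler used above is needed, at the cost of losing whatever was damaged.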

@Memtiras

Memtiras commented Sep 8, 2022

[binary data; no readable text]


3 participants