Detection error when encountering full-width characters #56

joewong826 · 2016-06-30T04:00:32Z

langid mistakes full-width English text such as 'ｈｅｌｌｏ　ｗｏｒｌｄ' for CJK text:
>>> import langid
>>> langid.classify('ｈｅｌｌｏ　ｗｏｒｌｄ')
('zh', 0.9339664571825803)


saffsd · 2016-07-05T23:10:15Z

Thanks for reporting this! Unfortunately there is no easy fix: langid.py's training data didn't contain any "full-width" English text. If this is an issue for you in a real use case, here are two possible options:

detect and pre-process "full-width" text into normal text
re-train langid.py with "full-width" text
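
For the first option, here is a minimal sketch of what the pre-processing could look like, assuming Python 3. The helper name `to_halfwidth` is made up for illustration and is not part of langid.py. It only folds the full-width ASCII variants (U+FF01 to U+FF5E) and the ideographic space (U+3000) and leaves every other character alone, so genuine CJK text should pass through unchanged.

```python
# Full-width ASCII variants sit exactly 0xFEE0 above their ASCII counterparts:
# U+FF01..U+FF5E  ->  U+0021..U+007E
_FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
# The ideographic (full-width) space is a separate code point.
_FULLWIDTH_TO_ASCII[0x3000] = 0x20


def to_halfwidth(text):
    """Hypothetical helper: fold full-width ASCII variants to plain ASCII."""
    return text.translate(_FULLWIDTH_TO_ASCII)


print(to_halfwidth('ｈｅｌｌｏ　ｗｏｒｌｄ'))  # hello world
```

You would then call `langid.classify(to_halfwidth(text))` instead of passing the raw text. `unicodedata.normalize('NFKC', text)` from the standard library also folds these characters, but it rewrites many other compatibility characters as well, which may or may not be what you want.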

saffsd closed this as completed Jul 5, 2016
