Detection error when encounter full-width characters #56

Closed
joewong826 opened this issue Jun 30, 2016 · 1 comment

Comments

@joewong826

langid mistakes full-width English text such as 'ｈｅｌｌｏ ｗｏｒｌｄ' for CJK text.
>>> import langid
>>> langid.classify('ｈｅｌｌｏ ｗｏｒｌｄ')
('zh', 0.9339664571825803)

@saffsd
Owner

saffsd commented Jul 5, 2016

Thanks for reporting this! Unfortunately there is no easy fix for this - langid.py training data didn't contain any "full-width" English text. If this is an issue for you in a real use case, here are possible options:

  1. detect and pre-process "full-width" text into normal text
  2. re-train langid.py with "full-width" text
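
Option 1 could look roughly like the sketch below. It is not part of langid.py: the helper name `to_halfwidth` and the table `_FULLWIDTH_TO_ASCII` are made up for illustration, and the code assumes Python 3 strings. It maps only the full-width ASCII block (U+FF01 to U+FF5E) and the ideographic space (U+3000) onto their plain ASCII counterparts, so genuine CJK characters pass through unchanged.

```python
# Hypothetical pre-processing helper; not part of langid.py.
# Full-width ASCII forms sit at a fixed offset (0xFEE0) from plain ASCII.
_FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
_FULLWIDTH_TO_ASCII[0x3000] = 0x20  # ideographic space -> plain space


def to_halfwidth(text):
    """Replace full-width ASCII characters with their normal-width forms."""
    return text.translate(_FULLWIDTH_TO_ASCII)


# Intended use (requires langid to be installed):
#   import langid
#   langid.classify(to_halfwidth(text))
```

`unicodedata.normalize('NFKC', text)` from the standard library performs the same folding, but it also rewrites other compatibility characters (half-width katakana, ligatures, superscripts), which may or may not be wanted before classification.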

@saffsd saffsd closed this as completed Jul 5, 2016