When I input strings like 'hello world hello world hello world', langid can't identify it as English text.
>>> import langid
>>> langid.classify('hello world hello world hello world')
('af', 0.683057652874482)
Thanks for getting in touch! This is an interesting one!
>>> hello world
(array([1426, 1428, 2273, 3948]),)
[1 1 1 1]
('en', -23.719746112823486)
>>> hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[1 2 2 2 2]
('en', -62.565943241119385)
>>> hello world hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[2 3 3 3 3]
('af', -100.6344223022461)
>>> ld
(array([1339]),)
[1]
('en', 2.9972290992736816)
The issue is that, in the training data, the pattern "ld " must be more strongly associated with Afrikaans than with English, especially when considered together with the other patterns in "hello world".
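To make that concrete, here is a minimal toy sketch of how a linear, count-based scorer can flip its answer as the same text is repeated. It is not langid's actual model: the feature strings, the weights in `WEIGHTS`, and the per-language offsets in `BIAS` are all invented for illustration, chosen only so that the toy reproduces the same pattern as the output above.

```python
# Toy count-based linear scorer (naive-Bayes style).
# ASSUMPTION: every number below is made up for illustration;
# none of it comes from langid's trained model.
BIAS = {"en": 5.0, "af": -5.0}  # hypothetical per-language offset

WEIGHTS = {
    # hypothetical per-occurrence weights for five substring features
    "en": {"ld ": -4.5, "hel": 0.25, "llo": 0.25, "wor": 0.25, "rld": 0.25},
    "af": {"ld ": 4.5, "hel": -0.25, "llo": -0.25, "wor": -0.25, "rld": -0.25},
}


def score(text, lang):
    """Offset plus (feature count * weight), summed over all features."""
    total = BIAS[lang]
    for feature, weight in WEIGHTS[lang].items():
        total += text.count(feature) * weight
    return total


def classify(text):
    """Return the language with the highest score."""
    return max(WEIGHTS, key=lambda lang: score(text, lang))


if __name__ == "__main__":
    samples = [
        "hello world",
        "hello world hello world",
        "hello world hello world hello world",
        "ld ",
    ]
    for sample in samples:
        scores = {lang: score(sample, lang) for lang in WEIGHTS}
        print(repr(sample), classify(sample), scores)
```

With these invented weights, "ld " on its own still comes out as 'en' because the offset outweighs a single occurrence. Each repetition of "hello world" adds another "ld " at the word boundary, though, and the contribution of that feature grows with its count while the offset stays fixed, so the third repetition tips the result to 'af'.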
Unfortunately, there's no easy fix for this. Is this a problem in a real use case for you?
Not yet. But my code using langid might process millions of texts, and I cannot guarantee that no extreme cases like this one will come up.
That said, I have to admit such cases may never come up in practice. If there's no easy fix, then not fixing it is fine. Thank you for your patience!