Repetition of words causes detection error #55

joewong826 · 2016-06-30T03:56:51Z

When I input strings like 'hello world hello world hello world', langid can't identify it as English text.
>>> import langid
>>> langid.classify('hello world hello world hello world')
('af', 0.683057652874482)


saffsd · 2016-07-05T23:06:55Z

Thanks for getting in touch! This is an interesting one!

>>> hello world
(array([1426, 1428, 2273, 3948]),)
[1 1 1 1]
('en', -23.719746112823486)
>>> hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[1 2 2 2 2]
('en', -62.565943241119385)
>>> hello world hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[2 3 3 3 3]
('af', -100.6344223022461)
>>> ld 
(array([1339]),)
[1]
('en', 2.9972290992736816)

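The issue is that in the training data, the pattern "ld " must be more strongly associated with Afrikaans than with English, especially when considered together with the other patterns in "hello world". As the counts above show, each repetition adds another occurrence of "ld ", so its contribution to the total score keeps growing.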
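To make the mechanism concrete, here is a minimal sketch of a count-based (multinomial naive Bayes style) scorer. Everything in it is an assumption for illustration: the feature list, the prior, and the log-likelihood values are invented and are not langid's real parameters. They are chosen only so that the toy shows the same pattern as the output above, where the "ld " feature appears only between repetitions and its count grows with each one.

```python
# Toy count-based scorer. All features and numbers are hypothetical,
# picked for illustration only; they are NOT langid's trained model.
FEATURES = ["he", "llo", "wor", "rld", "ld "]

# Invented log-priors: this toy model starts out favouring English.
LOG_PRIOR = {"en": 0.0, "af": -2.25}

# Invented per-feature log-likelihoods. Four features lean slightly
# towards English; "ld " leans strongly towards Afrikaans.
LOG_LIKELIHOOD = {
    "he":  {"en": -1.0, "af": -1.125},
    "llo": {"en": -1.0, "af": -1.125},
    "wor": {"en": -1.0, "af": -1.125},
    "rld": {"en": -1.0, "af": -1.125},
    "ld ": {"en": -3.0, "af": -1.0},
}


def count_features(text):
    """Count (non-overlapping) occurrences of each feature in the text."""
    return {f: text.count(f) for f in FEATURES}


def score(text):
    """Return prior + sum(count * log-likelihood) for each language."""
    counts = count_features(text)
    return {
        lang: LOG_PRIOR[lang]
        + sum(counts[f] * LOG_LIKELIHOOD[f][lang] for f in FEATURES)
        for lang in LOG_PRIOR
    }


def classify(text):
    """Return the best-scoring language and its score."""
    scores = score(text)
    best = max(scores, key=scores.get)
    return best, scores[best]


for reps in (1, 2, 3):
    text = " ".join(["hello world"] * reps)
    print(repr(text), count_features(text), classify(text))
```

In this toy model the prior and the four English-leaning features win for one or two repetitions, but every extra repetition adds one more "ld " count, and the accumulated penalty eventually outweighs them. Whether the real model flips for exactly this reason is an assumption; the sketch only shows that a linear score over counts can behave this way.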

Unfortunately, there's no easy fix for this. Is this a problem in a real use case for you?

joewong826 · 2016-07-08T09:14:13Z

Not yet. But my code using langid may process millions of texts, and I can't guarantee that no extreme cases like this one will show up.
That said, I have to admit such cases may never occur. If there's no easy fix, then leaving it as is is fine. Thank you for your patience!
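For anyone worried about the same edge case, one possible workaround is to collapse repeated tokens before classification, so that repetition cannot inflate feature counts. This is only a sketch under assumptions: `collapse_repeats` is a hypothetical helper, not part of langid, and it throws away frequency information, which may hurt accuracy on ordinary text.

```python
def collapse_repeats(text):
    """Keep only the first occurrence of each whitespace-separated token.

    Hypothetical preprocessing helper (not part of langid). Comparison is
    case-insensitive; the original casing of the first occurrence is kept.
    """
    seen = set()
    kept = []
    for token in text.split():
        key = token.lower()
        if key not in seen:
            seen.add(key)
            kept.append(token)
    return " ".join(kept)


print(collapse_repeats("hello world hello world hello world"))
```

The collapsed string could then be passed to `langid.classify` in place of the raw input. For the example in this thread that reduces the input to 'hello world', which the output earlier in the thread shows being identified as 'en'.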
