Reduce toot size limit for language detection #9760

Open
wants to merge 1 commit into base: master

Conversation

@stevenroose

stevenroose commented Jan 8, 2019

Improves #9550

@ThibG

Collaborator

ThibG commented Jan 8, 2019

The issue is that automatic detection for such short texts is completely unreliable :/

@faho


faho commented Jan 13, 2019

> The issue is that automatic detection for such short texts is completely unreliable :/

Is there any data about that?

Because, anecdotally, the 140-character threshold is high enough that detection basically never kicks in, and Google Translate at least (no idea whether it actually uses CLD3 under the hood) can detect the language of much shorter texts.

@Gargron

Member

Gargron commented Jan 13, 2019

Google Translate doesn't use CLD3; it uses a machine-learning model trained on a very large dataset. That's not something you can replicate on a small standalone Mastodon server.

@faho


faho commented Jan 13, 2019

> Google Translate doesn't use CLD3

Sure, but it shows that it's possible to reliably detect languages in texts shorter than 140 characters.

So I'd like to know if anyone knows why 140 was picked, and whether CLD3 could maybe handle shorter texts.

Maybe 20 is too short, but 40 or 60 would work?
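
(For anyone who wants to check empirically, a rough sketch like the following would do. It uses the gcld3 Python bindings rather than the cld3 Ruby gem Mastodon actually uses, and the sample texts and lengths are purely illustrative:)

```python
import gcld3

# CLD3 neural language identifier; the byte limits are the bindings' own knobs.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

samples = [
    "Die Katze schläft den ganzen Tag auf dem Sofa.",  # ~46 chars, German
    "¿Dónde está la biblioteca?",                      # ~26 chars, Spanish
    "die",                                             # single word, ambiguous
]

for text in samples:
    result = detector.FindLanguage(text=text)
    print(f"{len(text):3d} chars -> {result.language}"
          f" (probability={result.probability:.2f}, reliable={result.is_reliable})")
```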

@Gargron

Member

Gargron commented Jan 13, 2019

140 was picked arbitrarily after some back-and-forth testing, as documented in the screenshot here: #8010. I'm not claiming it's a scientific number, but the fewer characters there are, the more wildly inaccurate CLD3 is.

@faho


faho commented Jan 13, 2019

> 140 was picked arbitrarily after some back-and-forth testing, as documented in the screenshot here:

That screenshot shows:

  • It's worthless for single words, which is obvious, since e.g. "die" is both English and German ("die Bart, die"), and that will happen frequently.

  • It gives you a "probability" score (which I'm guessing also feeds the "reliable?" bool) that is worthless for single words, but not all that bad for longer sentences. For the two longer tests it's < 50% (and the longest is a correct guess). Those tests are 51 and 24 characters, a far cry from the 140 that is used currently!

So it doesn't seem all that awful to use a sliding scale? I.e. for anything under 40 characters, or anything marked unreliable, don't trust the detection.
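
(In rough Python pseudocode, the rule I have in mind would be something like this; the function name and the exact cutoffs are placeholders, not Mastodon's actual code:)

```python
def trusted_language(text: str, language: str, probability: float, is_reliable: bool,
                     min_chars: int = 40, min_probability: float = 0.5):
    """Return the detected language only when the guess looks trustworthy.

    Sliding-scale idea: for texts under `min_chars`, or when CLD3 itself flags
    the result as unreliable or low-probability, return None so the caller can
    fall back to the account's default language instead of the guess.
    """
    if len(text.strip()) < min_chars:
        return None
    if not is_reliable or probability < min_probability:
        return None
    return language
```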

@nightpool

Collaborator

nightpool commented Jan 13, 2019

If I remember correctly, CLD used to carry a disclaimer that it wasn't supposed to be used on texts smaller than some size N (I believe around 300-500). We're already using it at 140, and I wouldn't want to go any lower.
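
(For what it's worth, the CLD3 bindings expose a minimum-size knob of their own, counted in bytes rather than characters; if I'm reading the sources right, anything under it just comes back as "und" and unreliable. A sketch with the gcld3 Python bindings, again only illustrative:)

```python
import gcld3

# Refuse to guess below 140 bytes. Note this is a byte count: CJK, Cyrillic and
# other multi-byte scripts reach 140 bytes with far fewer than 140 characters,
# so a byte threshold and a character threshold are not interchangeable.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=140, max_num_bytes=1000)

result = detector.FindLanguage(text="Short toot, well under the cutoff.")
print(result.language, result.is_reliable)  # expected: "und" False
```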

@stevenroose


stevenroose commented Jan 16, 2019

Are there any alternative language detection systems?

Of course, I still consider #9550 to be the better way to go here.
