Misidentifies empty content and missing languages #6

caged · 2018-04-30T18:54:28Z

👋 Hi, thanks for this neat little Python library!

I've been tinkering with it for a bit and noticed a couple of things that you might already be aware of. If you pass content from a language the classifier doesn't know, or if you pass something like null or an empty string, you will get a misidentification. Here are some examples:

echo '' | guesslang
# The source code is written in Shell

echo '""' | guesslang
# The source code is written in Python

# This file is written in Assembly 
cat fasm.asm | guesslang
# The source code is written in Python

A few questions:

Have you thought about returning null for guesses that don't meet a certain threshold?
Have you thought about returning the probability that a particular guess is correct and letting clients/consumers determine if the threshold is high enough to proceed?


yoeo · 2018-05-07T10:28:02Z

Hello @caged

I'm happy to see that you liked playing with this library.

You have raised some interesting points here:

Have you thought about returning null for guesses that don't meet a certain threshold?

At first I tried setting arbitrary thresholds (at least 10 words, or requiring the difference between the languages' probabilities to be bigger than a given value, etc.), with no success.

Now I'm thinking about implementing an anomaly detection system: detect whether a text is written in a known language or not (https://en.m.wikipedia.org/wiki/One-class_classification).

When the guess_language method is called with an abnormal/unknown text, it will return None.

And if you have another solution in mind, feel free to share it.

Have you thought about returning the probability that a particular guess is correct and letting clients/consumers determine if the threshold is high enough to proceed?

That's a nice idea 👍, I didn't think about that.

By the way, I'm already using the probabilities to build the list of probable languages, so it will be quite simple to expose them to the consumers:

guesslang/guesslang/guesser.py

Line 87 in a9468b4

proba = next(self._classifier.predict_proba(input_fn=input_fn))
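A minimal sketch of what exposing those probabilities could look like, so that the consumer applies its own cutoff. The rank_languages helper, the language list, the probability values and the MIN_PROBABILITY cutoff are all hypothetical and only for illustration; this is not the code in guesser.py.

```python
from typing import List, Sequence, Tuple


def rank_languages(
    languages: Sequence[str], proba: Sequence[float]
) -> List[Tuple[str, float]]:
    """Pair each language with its probability, most probable first.

    Hypothetical helper: `languages` and `proba` are assumed to be
    aligned, i.e. proba[i] is the probability of languages[i].
    """
    if len(languages) != len(proba):
        raise ValueError("languages and proba must have the same length")
    return sorted(zip(languages, proba), key=lambda pair: pair[1], reverse=True)


# Consumer side: decide locally whether the best guess is good enough.
ranked = rank_languages(["Python", "Shell", "C"], [0.55, 0.35, 0.10])
best_language, best_probability = ranked[0]

MIN_PROBABILITY = 0.8  # hypothetical cutoff chosen by the caller
accepted = best_language if best_probability >= MIN_PROBABILITY else None
```

With these made-up values the best guess is Python at 0.55, which is below the caller's cutoff, so accepted ends up as None.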

I'll keep you posted.

Thank you.

yoeo · 2020-06-14T13:19:53Z

Hi @caged,

I've made a few changes to Guesslang about this issue:

Empty and blank source code is now detected.
Prediction probabilities are given by the guess.probabilities(source_code) function.
guess.language_name(source_code) returns None when the detected language probability doesn't reach a certain threshold: threshold < 2 * stdev(all_probabilities)
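As a rough illustration of the blank-input check and the reliability rule listed above, here is a self-contained sketch. It is only one reading of the rule: the comment mentions 2 * stdev(all_probabilities), and the sketch additionally assumes the cutoff is measured from the mean of the probabilities, so that a flat distribution is rejected. The reliable_language function and its arguments are hypothetical and are not Guesslang's implementation.

```python
from statistics import mean, pstdev
from typing import Mapping, Optional


def reliable_language(
    source_code: str, probabilities: Mapping[str, float]
) -> Optional[str]:
    """Return the most probable language, or None if the guess is unreliable.

    Hypothetical helper, not part of Guesslang. Assumed rule: the top
    probability must exceed mean + 2 * (population) standard deviation
    of all the probabilities.
    """
    # Empty or whitespace-only input is never classified.
    if not source_code.strip():
        return None
    if not probabilities:
        return None

    values = list(probabilities.values())
    threshold = mean(values) + 2 * pstdev(values)

    language, top_probability = max(
        probabilities.items(), key=lambda item: item[1]
    )
    return language if top_probability > threshold else None
```

Under this assumption, a distribution where one language clearly dominates passes the check, while a near-flat distribution, or blank input, yields None.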

Thank you.

yoeo closed this as completed May 7, 2018

yoeo reopened this May 7, 2018

yoeo self-assigned this May 7, 2018

yoeo added the enhancement label May 10, 2018

yoeo mentioned this issue Jun 13, 2020

V2.0.0a1 #18

Merged

yoeo closed this as completed Jun 14, 2020
