Misidentifies empty content and missing languages #6

Closed
caged opened this issue Apr 30, 2018 · 2 comments

caged commented Apr 30, 2018

👋 Hi, thanks for this neat little Python library!

I've been tinkering with it for a bit and noticed a couple of things that you might already be aware of. If you pass content written in a language the classifier doesn't know, or something like null or an empty string, you get a misidentification. Here are some examples:

echo '' | guesslang
# The source code is written in Shell
echo '""' | guesslang
# The source code is written in Python
# This file is written in Assembly 
cat fasm.asm | guesslang
# The source code is written in Python

A few questions:

  • Have you thought about returning null for guesses that don't meet a certain threshold?
  • Have you thought about returning the probability that a particular guess is correct and letting clients/consumers determine if the threshold is high enough to proceed?
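
The second option can be sketched in a few lines. This is only an illustration of the idea, not Guesslang's API: the function names `best_guess` and `guess_or_none`, the dictionary input, and the default threshold of 0.5 are all assumptions made for the example.

```python
from typing import Mapping, Optional, Tuple


def best_guess(probabilities: Mapping[str, float]) -> Tuple[Optional[str], float]:
    """Return the most probable language and its probability.

    `probabilities` maps language names to probabilities (hypothetical shape).
    An empty mapping yields (None, 0.0).
    """
    if not probabilities:
        return None, 0.0
    language = max(probabilities, key=probabilities.get)
    return language, probabilities[language]


def guess_or_none(probabilities: Mapping[str, float],
                  threshold: float = 0.5) -> Optional[str]:
    """Return the best guess only if it reaches the caller's threshold."""
    language, probability = best_guess(probabilities)
    return language if probability >= threshold else None
```

With something like this, the library only has to report probabilities, and each consumer decides how confident a guess must be before acting on it.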

yoeo (Owner) commented May 7, 2018

Hello @caged

I'm happy to see that you liked playing with this library.

You have raised some interesting points here:

Have you thought about returning null for guesses that don't meet a certain threshold?

At first I tried setting arbitrary thresholds (at least 10 words, or a minimum difference between the language probabilities, etc.), with no success.

Now I'm thinking about implementing an anomaly detection system that detects whether a text is written in a known language or not (https://en.m.wikipedia.org/wiki/One-class_classification).

When the guess_language method is called with an abnormal/unknown text, it will return None.

If you have another solution in mind, feel free to share it.

Have you thought about returning the probability that a particular guess is correct and letting clients/consumers determine if the threshold is high enough to proceed?

That's a nice idea 👍, I hadn't thought about that.

By the way, I'm already using the probabilities to build the list of probable languages, so it will be quite simple to expose them to consumers:

proba = next(self._classifier.predict_proba(input_fn=input_fn))
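
As a rough sketch of what exposing them could look like, the raw probability vector can be paired with the language names and sorted. The function name `rank_languages` and the assumption that the classifier's outputs are in the same order as a known list of language names are hypothetical; this is not the library's code.

```python
from typing import List, Sequence, Tuple


def rank_languages(languages: Sequence[str],
                   proba: Sequence[float]) -> List[Tuple[str, float]]:
    """Pair each language with its probability, most probable first.

    Assumes `proba[i]` is the probability of `languages[i]`.
    """
    if len(languages) != len(proba):
        raise ValueError("languages and proba must have the same length")
    return sorted(zip(languages, proba), key=lambda pair: pair[1], reverse=True)
```

A consumer could then read the top entry as the guess and inspect the remaining entries to see how close the alternatives are.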

I'll keep you posted.

Thank you.

yoeo closed this as completed May 7, 2018
yoeo reopened this May 7, 2018
yoeo self-assigned this May 7, 2018
yoeo mentioned this issue Jun 13, 2020

yoeo (Owner) commented Jun 14, 2020

Hi @caged,

I've made a few changes to Guesslang regarding this issue:

  • Empty and blank source code is now detected.
  • Prediction probabilities are given by the guess.probabilities(source_code) function.
  • guess.language_name(source_code) returns None when the detected language probability doesn't reach a certain threshold: threshold < 2 * stdev(all_probabilities)
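
The threshold in the last bullet is stated tersely, so here is one possible reading of it, sketched with the standard library: the top probability counts as reliable only when it lies more than two standard deviations above the mean of all probabilities. The function name `is_reliable`, the use of the mean as the baseline, and the choice of population standard deviation are assumptions, not a copy of Guesslang's implementation.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_reliable(probabilities: Sequence[float]) -> bool:
    """Check whether the top probability stands out from the rest.

    Interpretation (assumed): reliable if
    max(p) > mean(p) + 2 * stdev(p).
    """
    if len(probabilities) < 2:
        return False
    threshold = mean(probabilities) + 2 * pstdev(probabilities)
    return max(probabilities) > threshold
```

Under this reading, a flat distribution (the kind an empty or unknown input tends to produce) never passes, while a single dominant class does. Note that the rule only makes sense with enough classes: with five or fewer, no distribution can place its maximum more than two population standard deviations above the mean.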

Thank you.

yoeo closed this as completed Jun 14, 2020