Confidence-based dispatch #421

Merged
merged 8 commits into from
Feb 12, 2021

Conversation

gcampax
Contributor

@gcampax gcampax commented Jan 12, 2021

This PR implements getting confidence scores out of genienlp, and using them to recognize likely errors and out-of-domain commands.
Well, at least that's the goal of the PR. I haven't got that far yet, but I will, before this PR is tagged for review.

@s-jse if you're interested, this is how I worked around passing the new flags to genienlp server. It's basically what I suggested in the other PR, except I always enable MC dropout. If you have a way to avoid MC dropout when dropout features are not useful, that's great.

@jgd5 tagging you so you're aware that there is work happening at the boundary of genie-toolkit and genienlp, which will impact the kf serving work.

@gcampax gcampax added the enhancement and dialogue-agent labels Jan 12, 2021
@gcampax gcampax added this to the Almond 2.0 milestone Jan 12, 2021
lib/prediction/predictor.ts (review thread, outdated and resolved)
Thanks to MC dropout and calibration, genienlp can now output
confidence scores also when not doing beam search
Move all the code responsible for converting natural language
to ThingTalk and deciding what to do with the ThingTalk code
to the DialogueLoop class, leaving Conversation to only take
care of maintaining history and dispatching to multiple web
clients for the same conversation.
@gcampax gcampax marked this pull request as ready for review February 5, 2021 23:24
With hardcoded thresholds of 0.5 confidence, we either handle the
command, ask the user for confirmation, or report some kind of
error.
@gcampax
Contributor Author

gcampax commented Feb 5, 2021

Ok, this is the latest version of this work.

@s-jse I would like to have:

  • a score called "ignore", such that, if the score is >= 0.5, the sentence should be ignored entirely (marked as "junk" in the OOD set)
  • a score called "in_domain", such that, if the score is <0.5, the sentence should be treated as OOD (this corresponds to all sentences in the OOD set)
  • a score called "confidence", such that, if the score is <0.25, the sentence is an automatic parse failure, and if the score is <0.5, the sentence will require additional confirmation.

Does that sound reasonable to you?
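
For illustration, the three proposed scores could be combined as in the sketch below. The type and function names (ProposedScores, Action, decide) are hypothetical and not part of genie-toolkit or genienlp, and the order of the checks is an assumption, since the proposal does not state one.

```typescript
// Hypothetical types for the three scores proposed above.
interface ProposedScores {
    ignore: number;      // >= 0.5: junk, ignore the sentence entirely
    in_domain: number;   // < 0.5: treat the sentence as out-of-domain (OOD)
    confidence: number;  // < 0.25: parse failure; < 0.5: needs confirmation
}

type Action = 'ignore' | 'ood' | 'failed' | 'confirm' | 'execute';

// Assumed order: junk first, then OOD, then parse confidence.
function decide(s: ProposedScores): Action {
    if (s.ignore >= 0.5)
        return 'ignore';
    if (s.in_domain < 0.5)
        return 'ood';
    if (s.confidence < 0.25)
        return 'failed';
    if (s.confidence < 0.5)
        return 'confirm';
    return 'execute';
}
```

With these cutoffs, a sentence with ignore 0.1, in_domain 0.9 and confidence 0.4 would be sent back to the user for confirmation.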

@gcampax gcampax requested review from sileix and s-jse February 5, 2021 23:27
@s-jse
Member

s-jse commented Feb 11, 2021

OK stanford-oval/genie-k8s#40 will train 4 calibrators, which means at runtime you will have these 4 scores:

  1. is_correct: above threshold means it is correct
  2. is_probably_correct: above threshold means it is probably correct
  3. is_ood: above threshold means it is OOD
  4. is_junk: above threshold means it is junk

There are four additional inputs to all train pipelines. In each one you can specify, for example, --threshold 0.5 --recall 0.9, which shifts the scores (so they are no longer in [0, 1]) so that if you let everything above 0.5 in, you get 0.9 recall. Similarly, to set the precision to a specific number, you can use --threshold 0.5 --precision 0.9.

Both is_correct and is_probably_correct have the same calibrator, but with potentially different thresholds, which you will set using the above commands. For example if you want one to be high-recall and one to be high-precision, you can.
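
To make the shifting idea concrete, here is a minimal sketch of how a constant offset could be chosen so that a fixed threshold yields a target recall. This is not the genienlp implementation; offsetForRecall and the sample scores are hypothetical, and the only assumption is that recall is measured on a set of known-positive examples.

```typescript
// Hypothetical helper: returns the constant to add to every raw calibrator
// score so that a fraction `recall` of the known-positive examples ends up
// at or above `threshold`.
function offsetForRecall(positiveScores: number[], threshold: number, recall: number): number {
    if (positiveScores.length === 0)
        throw new Error('need at least one positive example');
    const sorted = [...positiveScores].sort((a, b) => b - a); // descending
    // Number of positives that must pass; the epsilon guards against
    // floating-point noise in recall * length.
    const keep = Math.min(sorted.length,
        Math.max(1, Math.ceil(recall * sorted.length - 1e-9)));
    const cutoff = sorted[keep - 1]; // lowest raw score that must still pass
    return threshold - cutoff;
}

// Made-up raw scores of ten positive examples.
const positives = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.3, 0.2];
const offset = offsetForRecall(positives, 0.5, 0.9);
const shifted = positives.map((s) => s + offset);
const passing = shifted.filter((s) => s >= 0.5).length;
```

In this example the offset is about 0.2, nine of the ten positives pass the 0.5 threshold, and the largest shifted score exceeds 1, which is why the shifted scores can no longer be read as values in [0, 1].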

@gcampax
Contributor Author

gcampax commented Feb 11, 2021

@s-jse please double check that the math in LocalParserClient corresponds to your definition

@rayslxu this PR interacts quite a bit with your work on transcript logs, because it moves around how commands are parsed and refactors DialogueLoop a bit. I merged everything, but please check that the expected log is correct at the end and see if you still need fixes.

// convert is_correct and is_probably_correct scores into
// a single scale such that >0.5 is correct and >0.25 is
// probably correct
const score = (c.score.is_correct ?? 1) > 0.5 ? (c.score.is_correct ?? 1) :
Member

L229-L239: I don't understand why you are doing calculations with these numbers.

Member

These are not probability scores. As I said above, they are not guaranteed to be in [0, 1].

Member

@s-jse s-jse Feb 11, 2021

These are meant to be used as follows:

if (is_junk > threshold) {
  // junk, so ignore
} else if (is_ood > threshold) {
  // OOD, so send to different backends
} else if (is_probably_correct < threshold) {
  // parsing error
} else if (is_correct < threshold) {
  // probably correct, but confirm with user to be sure
} else {
  // correct
}

Contributor Author

Well, the API that we provide to clients has one confidence number per candidate parse, in the range [0, 1], and a set of intent confidence scores, each in the range [0,1] and summing to 1.
So please adjust them here or in genienlp.
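
One possible way to do such an adjustment is sketched below: squash the unbounded margins through a sigmoid so that the result lands on the single scale described in the code comment (>0.5 correct, >0.25 probably correct). This is only an illustration, not the math that LocalParserClient actually uses; toApiConfidence, sigmoid and the default threshold of 0.5 are assumptions.

```typescript
function sigmoid(x: number): number {
    return 1 / (1 + Math.exp(-x));
}

// Hypothetical mapping from two unbounded calibrator scores to one
// confidence value in [0, 1]. Ranges are the mathematical ones;
// floating-point rounding can hit the endpoints.
function toApiConfidence(isCorrect: number, isProbablyCorrect: number, threshold = 0.5): number {
    if (isCorrect > threshold) {
        // correct: sigmoid of a positive margin lies in (0.5, 1)
        return sigmoid(isCorrect - threshold);
    }
    if (isProbablyCorrect > threshold) {
        // probably correct: rescale (0.5, 1) into (0.25, 0.5)
        return 0.25 + 0.25 * (2 * sigmoid(isProbablyCorrect - threshold) - 1);
    }
    // parse failure: sigmoid of a non-positive margin lies in (0, 0.5],
    // halved into (0, 0.25]
    return 0.5 * sigmoid(isProbablyCorrect - threshold);
}
```

The mapping is monotonic within each band, so a larger margin over the threshold always yields a larger confidence, but the output is still not a calibrated probability.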

These scores cannot be treated as probabilities, but the API treats
them as such.
@gcampax gcampax merged commit 2111a8f into next Feb 12, 2021