Confidence-based dispatch #421
Conversation
force-pushed from 6b258de to 00c1c52
Thanks to MC dropout and calibration, genienlp can now output confidence scores even when not doing beam search.
Move all the code responsible for converting natural language to ThingTalk, and for deciding what to do with the ThingTalk code, to the DialogueLoop class, leaving Conversation to only maintain the history and dispatch to multiple web clients for the same conversation.
force-pushed from 00c1c52 to 00d52ef
With hardcoded confidence thresholds of 0.5, we either handle the command, ask the user for confirmation, or report some kind of error.
force-pushed from 00d52ef to e1348d6
OK, this is the latest version of this work. @s-jse I would like to have:
Does that sound reasonable to you?
OK stanford-oval/genie-k8s#40 will train 4 calibrators, which means at runtime you will have these 4 scores:
There are four additional inputs to all train pipelines. In each one, you can specify, for example, Both
@s-jse please double-check that the math in LocalParserClient corresponds to your definition.
@rayslxu this PR interacts quite a bit with your work on transcript logs, because it moves around how commands are parsed and refactors DialogueLoop a bit. I merged everything, but please check that the expected log is correct at the end, and see if you still need fixes.
lib/prediction/localparserclient.ts
Outdated
// convert is_correct and is_probably_correct scores into
// a single scale such that >0.5 is correct and >0.25 is
// probably correct
const score = (c.score.is_correct ?? 1) > 0.5 ? (c.score.is_correct ?? 1) :
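The snippet above is truncated before its fallback branch. As a sketch only (not the PR's actual code), a mapping with the commented behavior, assuming both raw scores lie in [0, 1] and default to 1 when absent, could look like:

```typescript
// Sketch: collapse is_correct and is_probably_correct onto a single
// scale such that score > 0.5 means "correct" and score > 0.25 means
// "probably correct". Hypothetical helper, not the code in the PR.
function combineScores(is_correct: number | undefined,
                       is_probably_correct: number | undefined): number {
    const correct = is_correct ?? 1;
    const probablyCorrect = is_probably_correct ?? 1;
    if (correct > 0.5)
        return correct;              // already on the "correct" side of the scale
    // otherwise rescale "probably correct" into (0, 0.5], so that
    // probablyCorrect > 0.5 maps to a combined score > 0.25
    return probablyCorrect * 0.5;
}
```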
L229-L239: I don't understand why you are doing calculations with these numbers.
These are not probability scores. As I said above, they are not guaranteed to be in [0, 1].
These are meant to be used as follows:
if (is_junk > threshold) {
    // junk, so ignore
} else if (is_ood > threshold) {
    // OOD, so send to a different backend
} else if (is_probably_correct < threshold) {
    // parsing error
} else if (is_correct < threshold) {
    // probably correct, but confirm with the user to be sure
} else {
    // correct
}
Well, the API that we provide to clients has one confidence number per candidate parse, in the range [0, 1], and a set of intent confidence scores, each in the range [0,1] and summing to 1.
So please adjust them here or in genienlp.
These scores cannot be treated as probabilities, but the API treats them as such.
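One way to reconcile unbounded calibrator outputs with that API contract (each score in [0, 1], and the intent scores summing to 1) would be a softmax normalization. This is an illustrative sketch of the general technique, not what genienlp actually does:

```typescript
// Sketch: map unbounded intent scores to values in [0, 1] that sum to 1,
// using a softmax. Subtracting the maximum first avoids overflow in exp().
function softmax(scores: number[]): number[] {
    const max = Math.max(...scores);
    const exps = scores.map(s => Math.exp(s - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sum);
}
```

A caveat with this approach: softmax is shift-invariant but not scale-invariant, so if the calibrators are retrained and their output ranges change, the resulting "probabilities" shift too, which is part of why raw calibrator scores are hard to expose directly through a probability-shaped API.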
This PR implements getting confidence scores out of genienlp, and using them to recognize likely errors and out-of-domain commands.
Well, at least that's the goal of the PR. I haven't got that far yet, but I will, before this PR is tagged for review.
@s-jse if you're interested, this is how I worked around passing the new flags to genienlp server. It's basically what I suggested in the other PR, except that I always enable MC dropout. If you have a way to avoid MC dropout when dropout features are not useful, that's great.
@jgd5 tagging you so you're aware that there is work happening at the boundary of genie-toolkit and genienlp, which will impact the KFServing work.