Confidence-based dispatch #421

gcampax · 2021-01-12T01:57:40Z

This PR implements getting confidence scores out of genienlp, and using them to recognize likely errors and out-of-domain commands.
Well, at least that's the goal of the PR. I haven't got that far yet, but I will, before this PR is tagged for review.

@s-jse if you're interested, this is how I worked around passing the new flags to genienlp server. It's basically what I suggested in the other PR, except I always enable MC dropout. If you have a way to avoid MC dropout when dropout features are not useful, that's great.

@jgd5 tagging you so you're aware that there is work happening at the boundary of genie-toolkit and genienlp, which will impact the kf serving work.

lib/prediction/predictor.ts

Thanks to MC dropout and calibration, genienlp can now output confidence scores also when not doing beam search

Move all the code responsible for converting natural language to ThingTalk and deciding what to do with the ThingTalk code to the DialogueLoop class, leaving Conversation to only take care of maintaining history and dispatching to multiple web clients for the same conversation.

With hardcoded thresholds of 0.5 confidence, we either handle the command, ask the user for confirmation, or report some kind of error.

gcampax · 2021-02-05T23:27:36Z

Ok, this is the latest version of this work.

@s-jse I would like to have:

a score called "ignore", such that, if the score is >= 0.5, the sentence should be ignored entirely (marked as "junk" in the OOD set)
a score called "in_domain", such that, if the score is <0.5, the sentence should be treated as OOD (this corresponds to all sentences in the OOD set)
a score called "confidence", such that, if the score is <0.25, the sentence is an automatic parse failure, and if the score is <0.5, the sentence will require additional confirmation.

Does that sound reasonable to you?
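To make the proposal concrete, here is a minimal sketch of how the three scores could drive dispatch. The `classify` function, the `Outcome` names, and the order of the checks are assumptions for illustration only, not the code in this PR; the score names and cutoffs are the ones listed above.

```typescript
// Hypothetical outcome labels; the real DialogueLoop may name these differently.
type Outcome = "ignore" | "out_of_domain" | "parse_failure" | "confirm" | "handle";

// The three scores requested above, each assumed to lie in [0, 1].
interface ProposedScores {
    ignore: number;
    in_domain: number;
    confidence: number;
}

// Assumed ordering: junk first, then OOD, then the two confidence bands.
function classify(s: ProposedScores): Outcome {
    if (s.ignore >= 0.5) return "ignore";           // junk, drop entirely
    if (s.in_domain < 0.5) return "out_of_domain";  // treat as OOD
    if (s.confidence < 0.25) return "parse_failure"; // automatic parse failure
    if (s.confidence < 0.5) return "confirm";        // ask the user to confirm
    return "handle";
}
```

For example, `classify({ ignore: 0.1, in_domain: 0.9, confidence: 0.4 })` falls in the confirmation band.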

s-jse · 2021-02-11T06:41:03Z

OK stanford-oval/genie-k8s#40 will train 4 calibrators, which means at runtime you will have these 4 scores:

is_correct: above threshold means it is correct
is_probably_correct: above threshold means it is probably correct
is_ood: above threshold means it is OOD
is_junk: above threshold means it is junk

There are four additional inputs to all train pipelines. In each one you can specify, for example, --threshold 0.5 --recall 0.9, which will shift the scores (so they are no longer in [0, 1]) so that if you let everything above 0.5 in, you will have 0.9 recall. Similarly, to set the precision to a specific number, you can use --threshold 0.5 --precision 0.9.

Both is_correct and is_probably_correct have the same calibrator, but with potentially different thresholds, which you will set using the above commands. For example if you want one to be high-recall and one to be high-precision, you can.
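As a rough illustration of what "shifting the scores" can mean, the sketch below picks an additive offset so that a fixed threshold yields a target recall on a validation set. This is an assumed mechanism, not the genienlp implementation: `offsetForRecall`, the additive shift, and the toy scores are all hypothetical.

```typescript
// Returns an offset such that (raw + offset >= threshold) accepts at least
// `recall` of the positive validation examples. Assumption: the shift is a
// simple additive constant; the real calibrator may work differently.
function offsetForRecall(positiveScores: number[], threshold: number, recall: number): number {
    if (positiveScores.length === 0) throw new Error("need at least one positive example");
    if (recall <= 0 || recall > 1) throw new Error("recall must be in (0, 1]");
    const sorted = [...positiveScores].sort((a, b) => b - a); // descending
    // Number of positives that must end up at or above the threshold
    // (the small epsilon guards against floating-point round-up).
    const needed = Math.max(1, Math.ceil(recall * sorted.length - 1e-9));
    const cutoff = sorted[needed - 1]; // lowest raw score that must still pass
    return threshold - cutoff;
}

// Toy raw scores for examples whose true label is positive.
const validationPositives = [2.0, 1.0, -1.0, -3.0];
const offset = offsetForRecall(validationPositives, 0.5, 0.75);
const shifted = validationPositives.map((s) => s + offset);
const accepted = shifted.filter((s) => s >= 0.5).length;
```

With these toy numbers the shifted scores land well outside [0, 1], which is why they should only be compared against the threshold and not read as probabilities.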

gcampax · 2021-02-11T20:07:16Z

@s-jse please double check that the math in LocalParserClient corresponds to your definition

@rayslxu this PR interacts quite a bit with your work on transcript logs, because it moves around how commands are parsed and refactors DialogueLoop a bit. I merged everything, but please check that the expected log is correct at the end and see if you still need fixes.

s-jse · 2021-02-11T20:16:40Z

lib/prediction/localparserclient.ts

+                // convert is_correct and is_probably_correct scores into
+                // a single scale such that >0.5 is correct and >0.25 is
+                // probably correct
+                const score = (c.score.is_correct ?? 1) > 0.5 ? (c.score.is_correct ?? 1) :


L229-L239: I don't understand why you are doing calculations with these numbers.

These are not probability scores. As I said above, they are not guaranteed to be in [0, 1].

These are meant to be used as follows:

if (is_junk > threshold) { is junk, so ignore; }
elif (is_ood > threshold) { is OOD, so send to different backends; }
elif (is_probably_correct < threshold) { parsing error; }
elif (is_correct < threshold) { probably correct, but confirm with user to be sure; }
else { correct; }
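The cascade above could be written in TypeScript roughly as follows. The `dispatch` function, the `Decision` labels, and the single shared `threshold` parameter (defaulting to the 0.5 used in the earlier example) are assumptions for illustration; only the four score names and the order of the checks come from the comment.

```typescript
// The four calibrator outputs. They are shifted scores, not probabilities,
// so they are only ever compared against a threshold.
interface CalibratorScores {
    is_junk: number;
    is_ood: number;
    is_probably_correct: number;
    is_correct: number;
}

// Hypothetical decision labels.
type Decision = "ignore" | "out_of_domain" | "parse_error" | "confirm" | "execute";

function dispatch(s: CalibratorScores, threshold = 0.5): Decision {
    if (s.is_junk > threshold) return "ignore";
    if (s.is_ood > threshold) return "out_of_domain"; // send to a different backend
    if (s.is_probably_correct < threshold) return "parse_error";
    if (s.is_correct < threshold) return "confirm";    // probably correct, ask the user
    return "execute";
}
```

Because no arithmetic is done on the scores, this works even when they fall outside [0, 1].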

Well, the API that we provide to clients has one confidence number per candidate parse, in the range [0, 1], and a set of intent confidence scores, each in the range [0,1] and summing to 1.
So please adjust them here or in genienlp.

These scores cannot be treated as probabilities, but the API treats them like such.
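One way to reconcile the two views is to threshold the raw junk/OOD scores first and expose only discrete values through the API, so that the intent scores stay in [0, 1] and sum to 1. The sketch below is an assumption about what that could look like: the `IntentScores` field names and the `discreteIntent` helper are hypothetical and not taken from the PR's diff.

```typescript
// Hypothetical shape of the intent scores returned to clients:
// each value in [0, 1], all three summing to 1.
interface IntentScores {
    command: number;
    ignore: number;
    other: number;
}

// Turn shifted (non-probability) calibrator scores into a one-hot intent.
function discreteIntent(isJunk: number, isOod: number, threshold = 0.5): IntentScores {
    if (isJunk > threshold) return { command: 0, ignore: 1, other: 0 };
    if (isOod > threshold) return { command: 0, ignore: 0, other: 1 };
    return { command: 1, ignore: 0, other: 0 };
}
```

The trade-off is that clients lose any graded notion of intent confidence, but they never see a value that looks like a probability and is not one.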

gcampax added the enhancement and dialogue-agent labels Jan 12, 2021

gcampax added this to the Almond 2.0 milestone Jan 12, 2021

s-jse reviewed Jan 12, 2021

gcampax force-pushed the wip/confidence branch from 6b258de to 00c1c52 Compare January 12, 2021 04:43

gcampax self-assigned this Feb 4, 2021

gcampax added 3 commits February 5, 2021 10:18

Fix genie server

ed37b30

Predictor: add support for confidence scores

d50bbb0

gcampax force-pushed the wip/confidence branch from 00c1c52 to 00d52ef Compare February 5, 2021 23:23

gcampax marked this pull request as ready for review February 5, 2021 23:24

dialogue-loop: implement confidence-based dispatch

e1348d6

gcampax force-pushed the wip/confidence branch from 00d52ef to e1348d6 Compare February 5, 2021 23:25

gcampax requested review from sileix and s-jse February 5, 2021 23:27

gcampax added 2 commits February 11, 2021 12:00

Merge remote-tracking branch 'origin/next' into wip/confidence

b65df8c

Update confidence computation to the final naming

d76980f

s-jse reviewed Feb 11, 2021

gcampax added 2 commits February 11, 2021 17:17

prediction: use discrete scores for junk/OOD

c12bc33

Merge remote-tracking branch 'origin/next' into wip/confidence

2f70c68

gcampax merged commit 2111a8f into next Feb 12, 2021

This was referenced Feb 12, 2021

update kf-server json parse #467

Closed

Implement confidence based dispatch #417

Closed
