Hello,

Could you please explain the evaluation setting for sub-task 4? The default setting only computes the BLEU score on the last turn of each dialogue (eval_model.py line 992, args.single_round_evaluation), whereas the score drops by 3-4 points when all dialogue turns are scored, and the original GPT-2 baseline appears to evaluate all turns.
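For clarity, here is a minimal sketch of the two settings I am comparing. The helper name and data layout are hypothetical and not taken from eval_model.py; it just illustrates last-turn-only scoring versus scoring every turn:

```python
from nltk.translate.bleu_score import corpus_bleu

def evaluate_bleu(dialogues, single_round=True):
    """Corpus BLEU over generated responses.

    `dialogues` is assumed to be a list of dialogues, each a list of
    (reference_tokens, hypothesis_tokens) pairs, one pair per turn.
    With single_round=True only the final turn of each dialogue is
    scored (mirroring args.single_round_evaluation); otherwise every
    turn contributes to the score.
    """
    references, hypotheses = [], []
    for turns in dialogues:
        scored_turns = turns[-1:] if single_round else turns
        for ref_tokens, hyp_tokens in scored_turns:
            references.append([ref_tokens])  # corpus_bleu expects a list of references per hypothesis
            hypotheses.append(hyp_tokens)
    return corpus_bleu(references, hypotheses)
```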
Besides, do the best scores of the 4 sub-tasks come from a single checkpoint, or is each derived from a different checkpoint?
Thank you.