Hello,

Could you please explain the evaluation setting for sub-task 4? The default setting only computes the BLEU score on the last turn of each dialogue (eval_model.py line 992, args.single_round_evaluation), whereas the score drops by 3-4 points when all dialogue turns are scored, and the original GPT-2 baseline appears to evaluate all turns.
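For clarity, here is a minimal sketch of the two settings I am comparing. The helper name and data layout are hypothetical and not taken from eval_model.py; it just illustrates last-turn-only scoring versus scoring every turn:

```python
from nltk.translate.bleu_score import corpus_bleu

def evaluate_bleu(dialogues, single_round=True):
    """Corpus BLEU over generated responses.

    `dialogues` is assumed to be a list of dialogues, each a list of
    (reference_tokens, hypothesis_tokens) pairs, one pair per turn.
    With single_round=True only the final turn of each dialogue is
    scored (mirroring args.single_round_evaluation); otherwise every
    turn contributes to the score.
    """
    references, hypotheses = [], []
    for turns in dialogues:
        scored_turns = turns[-1:] if single_round else turns
        for ref_tokens, hyp_tokens in scored_turns:
            references.append([ref_tokens])  # corpus_bleu expects a list of references per hypothesis
            hypotheses.append(hyp_tokens)
    return corpus_bleu(references, hypotheses)
```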
Besides, do the best scores of the 4 sub-tasks come from a single checkpoint, or is each derived from a different checkpoint?
Thank you.