
Issues reproducing Table 1 results on the Commonsense Conversation Dataset (CCD) #13

Open
Silin159 opened this issue Mar 16, 2023 · 1 comment

Comments

@Silin159

Hi, I tried to use your script (ccd.sh) to reproduce the Table 1 results on the Commonsense Conversation Dataset, but my reproduced results (BLEU: 0.154, ROUGE-L: 6.38) are far below the reported values (BLEU: 1.02, ROUGE-L: 8.59). Could you check whether the hyperparameters in ccd.sh are the ones you actually used? It would also help if you could provide the evaluation scripts for computing BLEU and ROUGE-L (if I run it correctly, the inference scripts currently only save the test outputs, with no metric results). Besides, are any model checkpoints or test outputs available?
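
For reference, a minimal sketch of the kind of evaluation script being asked for, assuming sacrebleu and rouge_score as the scorers and one-sentence-per-line hypothesis/reference files (`preds.txt` and `refs.txt` are hypothetical names, not files the repo produces):

```python
# Minimal sketch: corpus BLEU + average sentence ROUGE-L over saved test outputs.
# File names and scorer choices are assumptions, not from the repo.
from sacrebleu.metrics import BLEU
from rouge_score import rouge_scorer

with open("preds.txt") as f:
    hyps = [line.strip() for line in f]
with open("refs.txt") as f:
    refs = [line.strip() for line in f]

# Corpus-level BLEU; sacrebleu expects a list of reference streams.
bleu = BLEU().corpus_score(hyps, [refs])
print(f"BLEU: {bleu.score:.2f}")

# Sentence-level ROUGE-L F1, averaged over the test set.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(scorer.score(r, h)["rougeL"].fmeasure
              for r, h in zip(refs, hyps)) / len(hyps)
print(f"ROUGE-L: {100 * rouge_l:.2f}")
```

Note that whether the paper's numbers are on a 0-1 or 0-100 scale would also need to be confirmed when comparing.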

@Yuanhy1997
Owner

I think you can select the checkpoint around 100,000 training steps using the validation data. The number in the paper is out of date; the new result is a bit lower, 0.84 BLEU. BTW, CCD is a pretty bizarre dataset in that models easily overfit the training data, and the outputs actually require commonsense knowledge. (DiffuSeq also only achieved a BLEU of around 1, which means the outputs barely correlate with the labels.)
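
A rough sketch of the validation-based checkpoint selection described above; `decode()` here is a hypothetical wrapper around the repo's inference script, not an actual function in the codebase:

```python
# Sketch: pick the checkpoint with the best validation BLEU.
# decode(path, sources) -> list[str] is assumed, not part of the repo.
from sacrebleu.metrics import BLEU

def select_checkpoint(checkpoints, val_sources, val_refs, decode):
    best_path, best_bleu = None, float("-inf")
    for path in checkpoints:
        hyps = decode(path, val_sources)          # run inference with this checkpoint
        score = BLEU().corpus_score(hyps, [val_refs]).score
        if score > best_bleu:
            best_path, best_bleu = path, score
    return best_path, best_bleu
```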
