Train set and test set ranking distribution difference #16

Open
Hannibal046 opened this issue May 17, 2022 · 7 comments

@Hannibal046

Hi, since the model used for CNN/DM is facebook/bart-large-cnn, the model has actually been fine-tuned on the CNN/DM training set. Considering neural models' remarkable capacity for memorization, the candidates generated on the training set for the evaluation model should be nearly perfect. Do I understand this correctly? How do you avoid this in order to generate useful data for ranking? And is PEGASUS also fine-tuned on CNN/DM before generating candidate summaries? Thanks.

Hannibal046 changed the title from "Overfitting problem" to "Train set and test set ranking distribution difference" on May 17, 2022
@Hannibal046
Author

I checked the distribution of the provided data and found that the train set and test set show the same distribution. How is this achieved with a PLM fine-tuned on CNN/DM?
[screenshot of a notebook in the SimCLS repo comparing the train and test score distributions]

@yixinL7
Owner

yixinL7 commented May 17, 2022

Good questions :)

Considering neural models' remarkable capacity for memorization, the candidates generated on the training set for the evaluation model should be nearly perfect.

That's not exactly true because for BART and other models the checkpoint is selected based on their performance on the evaluation set, and if the model is overfitting too much on the training set it would not perform well on the evaluation set.

How do you avoid this in order to generate useful data for ranking?

We found diverse beam search to be very useful in terms of generating diverse data. Please refer to https://github.com/yixinL7/BRIO/blob/main/gen_candidate.py.
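For concreteness, here is a minimal sketch of diverse beam search with the Hugging Face transformers API; the beam size, diversity penalty, and length settings below are illustrative and may differ from what gen_candidate.py actually uses.

```python
# Minimal sketch of diverse beam search candidate generation (settings are illustrative).
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "..."  # one CNN/DM source document
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")

# 16 beams split into 16 groups -> 16 diverse candidate summaries per article.
candidate_ids = model.generate(
    inputs["input_ids"],
    num_beams=16,
    num_beam_groups=16,
    diversity_penalty=1.0,
    num_return_sequences=16,
    max_length=140,
    min_length=55,
    no_repeat_ngram_size=3,
)
candidates = tokenizer.batch_decode(candidate_ids, skip_special_tokens=True)
```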

And is PEGASUS also fine-tuned on CNN/DM before generating candidate summaries?

It is only fine-tuned on XSum.

and I found that the train set and test set show the same distribution. How is this achieved with a PLM fine-tuned on CNN/DM?

Firstly, having similar ROUGE scores doesn't necessarily mean the data distribution is the same. For example, if you calculate the extractive oracle performance on the training set and test set on CNN/DM, you will find the score is higher on the test set.
Second, as I mentioned, the checkpoint (facebook/bart-large-cnn) is probably not overfitting too much on the training set.
Also, sampling 16 outputs using diverse beam search may help to mitigate the effect of overfitting. Consider this: if the model had perfect performance on the training set, it would mean p_{model}(reference_summary) = 1, which may actually make the other candidate summaries much worse.
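As a side note, here is a rough sketch (not this repo's code) of the greedy extractive-oracle check mentioned above; averaging it over train and test examples should show the gap. The sentence budget and the choice of metrics are assumptions.

```python
# Rough sketch of a greedy extractive oracle (assumes the rouge_score package).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def greedy_oracle(doc_sents, reference, max_sents=3):
    """Greedily pick document sentences that maximize mean ROUGE-1/2 F1 vs. the reference."""
    selected, best_score = [], 0.0
    while len(selected) < max_sents:
        best_sent, best_new_score = None, best_score
        for sent in doc_sents:
            if sent in selected:
                continue
            scores = scorer.score(reference, " ".join(selected + [sent]))
            new_score = (scores["rouge1"].fmeasure + scores["rouge2"].fmeasure) / 2
            if new_score > best_new_score:
                best_sent, best_new_score = sent, new_score
        if best_sent is None:  # no remaining sentence improves the score
            break
        selected.append(best_sent)
        best_score = best_new_score
    return best_score
```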

@Hannibal046
Author

Hi, thanks for the reply. But I am still a little confused.

That's not exactly true because for BART and other models the checkpoint is selected based on their performance on the evaluation set, and if the model is overfitting too much on the training set it would not perform well on the evaluation set.

This is true. But when the model is used to generate candidates on the training set, it has already seen the ground-truth summaries during training (p_{model}(reference_summary) = 1, as you mentioned), so how can the average max ROUGE score on the training set be almost equivalent to that on the test set?

Also, diverse beam search may mitigate the problem to some extent, but what I would expect is something like this for the ROUGE scores:

diverse_beam_search_max_train > beam_search_train >>> diverse_beam_search_max_test > beam_search_test

And as you recommended in #14, I checked the paper SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization, and it indeed uses some special tricks for this mismatch problem:
[screenshot of the relevant section of the SummaReranker paper (arXiv 2203.06569)]

@yixinL7
Owner

yixinL7 commented May 18, 2022

I'd like to emphasize my point that if the model is overfitting too much on the training set it would not perform well on the evaluation set. So it's possible that the selected checkpoint doesn't really overfit the training data.
It's really an empirical question in the end. So I'd recommend using the pre-trained model (facebook/bart-large-cnn) with the original generation method (beam search) to generate outputs on both the test set and the training set, and evaluating whether your assumption is correct :)
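If it helps, here is a rough sketch of that experiment (not from this repo) using the Hugging Face datasets and transformers libraries; the sample size and beam-search settings are assumptions.

```python
# Rough sketch: compare beam-search ROUGE of facebook/bart-large-cnn on train vs. test samples.
import torch
from datasets import load_dataset
from rouge_score import rouge_scorer
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to(device)
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def avg_rouge(split, n=200):
    data = load_dataset("cnn_dailymail", "3.0.0", split=split).shuffle(seed=0).select(range(n))
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for ex in data:
        inputs = tokenizer(ex["article"], max_length=1024, truncation=True,
                           return_tensors="pt").to(device)
        ids = model.generate(inputs["input_ids"], num_beams=4,
                             max_length=140, min_length=55, no_repeat_ngram_size=3)
        summary = tokenizer.decode(ids[0], skip_special_tokens=True)
        for name, score in scorer.score(ex["highlights"], summary).items():
            totals[name] += score.fmeasure
    return {name: total / n for name, total in totals.items()}

print("train:", avg_rouge("train"))
print("test: ", avg_rouge("test"))
```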

@Hannibal046
Author

Hi, sorry for taking your time. But I think this may be not an overfitting problem but a memorization problem. If the train set and validation set give the same results with respect to some metric, what is the point of the validation set? I think the purpose of the validation set is to test the model on data drawn from the same distribution as the training data, but not literally the same data, given the model's memorization capacity.

I admit this is an empirical problem, and thanks so much for providing the reranking data and the generation scripts. But given the large dataset (280k examples), the large model (bart-large), and the large beam size (16), I can't test it myself in a short time.

So just to be clear, the whole process of SimCLS on CNN/DM is as follows (correct me if I'm wrong):

  1. Fine-tune BART-large on the CNN/DM training set and pick the best checkpoint according to its performance on the validation set (facebook/bart-large-cnn).
  2. Use that checkpoint to generate candidate summaries on the train, validation, and test sets of CNN/DM with diverse beam search.
  3. Use the generated data to train a reranking model on the train set, picking the best checkpoint according to its performance on the validation set.
  4. Use the trained reranking model to select the best candidate for each test-set example as the final result (see the sketch after this list).
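To make step 4 concrete, here is a rough sketch of candidate selection at inference time in the spirit of SimCLS (encode the source document and each candidate, score candidates by cosine similarity, keep the top one); the checkpoint name and pooling choice below are assumptions, not the repo's exact code.

```python
# Rough sketch of reranking at inference time (checkpoint and pooling are assumptions).
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")  # in practice, the trained reranker

def embed(text):
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs).last_hidden_state[:, 0]  # first-token (<s>) embedding
    return torch.nn.functional.normalize(out, dim=-1)

def select_best(document, candidates):
    """Return the candidate whose embedding is most similar to the document embedding."""
    doc_emb = embed(document)
    scores = [torch.matmul(doc_emb, embed(c).T).item() for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

# best_summary = select_best(article_text, candidate_summaries)
```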

@Hannibal046
Author

Hannibal046 commented May 18, 2022

Hi, I used the bart-large-cnn checkpoint to evaluate the full test set and 2,000 random examples from the train set, and it gives almost identical results. This really surprises me. Training on the train set and also testing on the train set, isn't this 100% label leakage? I am so confused...

@Hannibal046
Author

I have to admit this surprises me a lot, because in my previous experience training a transformer model from scratch on translation or summarization tasks, the BLEU or ROUGE scores on the training set show a totally different distribution from those on the test set. This is actually an interesting problem. I guess it may be a phenomenon unique to large PLMs; I am verifying this with a vanilla transformer and bart_base, and I will let you know if there is any progress. Thanks again for your detailed explanation!
