
Difficulties to reproduce XSUM results with BART #1971

Closed
astariul opened this issue Apr 7, 2020 · 26 comments

@astariul
Contributor

astariul commented Apr 7, 2020

I'm trying to reproduce the results of BART on the XSUM dataset.

I followed the README, didn't apply any preprocessing to the XSUM data, and used beam=6, lenpen=1.0, max_len_b=60, min_len=10 for generation.
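
A minimal sketch of such an evaluation loop with these settings, roughly following the README (the checkpoint directory, data paths, and batch size are placeholders):

import torch
from fairseq.models.bart import BARTModel

# Load the released XSUM checkpoint (paths are placeholders).
bart = BARTModel.from_pretrained(
    'bart.large.xsum',
    checkpoint_file='model.pt',
    data_name_or_path='xsum-bin',
)
bart.cuda()
bart.eval()
bart.half()

batch_size = 32
with open('xsum/test.source') as src, open('xsum/test.hypo', 'w') as out:
    sources = [line.strip() for line in src]
    for i in range(0, len(sources), batch_size):
        batch = sources[i:i + batch_size]
        with torch.no_grad():
            # Same generation settings as above.
            hypos = bart.sample(batch, beam=6, lenpen=1.0, max_len_b=60, min_len=10)
        for h in hypos:
            out.write(h + '\n')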

I got the following results:

1 ROUGE-1 Average_F: 0.43809 (95%-conf.int. 0.43543 - 0.44078)
---------------------------------------------
1 ROUGE-2 Average_F: 0.20327 (95%-conf.int. 0.20052 - 0.20598)
---------------------------------------------
1 ROUGE-L Average_F: 0.34652 (95%-conf.int. 0.34382 - 0.34941)

which is a bit lower than the reported results:
[image: reported ROUGE scores from the BART paper]


For the CNN/DM dataset, there were a few details to add in the data preprocessing step; I'm wondering if I missed similar details for the XSUM dataset.

Adding the missing preprocessing steps led to score improvements, so I suspect it's the same issue for the XSUM dataset. Does anyone know where I can find a detailed explanation of how to preprocess the XSUM dataset?

@ngoyal2707 @yinhanliu

@yinhanliu

I got the raw text from one of the XSUM authors. If you can get it from them, you should get a better number. I am not sure how to revert their released (tokenized) data back to raw text.

@yinhanliu

Hi @colanim, can you do me a favor? After this line
https://github.com/pytorch/fairseq/blob/966436403e5e927e3e7d5b389dad6ef06aaa7e03/fairseq/sequence_generator.py#L281
can you add:
if step == 0:
    lprobs[:, self.bos] = 1000
else:
    lprobs[:, self.bos] = -math.inf
Let me know the ROUGE on your data with these lines.
Thanks a lot!
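
In case self.bos is not defined in your copy of sequence_generator.py, a rough sketch of what I mean (the BOS index can be stored in __init__ from the target dictionary; the surrounding lines may differ between fairseq versions):

import math

# In SequenceGenerator.__init__, next to the existing self.pad / self.unk / self.eos
# assignments, keep the BOS index around (self.bos does not exist by default):
self.bos = tgt_dict.bos()

# Then, in the decoding loop, right after lprobs is computed (around the linked line):
if step == 0:
    lprobs[:, self.bos] = 1000        # strongly favor <s> as the first generated token
else:
    lprobs[:, self.bos] = -math.inf   # never generate <s> after the first step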

@astariul
Contributor Author

astariul commented Apr 8, 2020

@yinhanliu thanks for the details.

After modifying the code as you mention, my score is about 1 point higher:

1 ROUGE-1 Average_F: 0.44628 (95%-conf.int. 0.44348 - 0.44919)
---------------------------------------------
1 ROUGE-2 Average_F: 0.21263 (95%-conf.int. 0.20981 - 0.21558)
---------------------------------------------
1 ROUGE-L Average_F: 0.36099 (95%-conf.int. 0.35794 - 0.36411)

That's great!


But it's still about 1 point below the paper's results.

I asked for the raw XSUM dataset; I will update this issue when I receive it (the author hasn't responded yet).

In the meantime, any idea where this 1-point difference might come from?

@yinhanliu

@colanim
I think there is a bug in the code. ROUGE doesn't work on my side currently, so can you help me try the things below and see the result?

https://github.com/pytorch/fairseq/blob/d37529ed234ea9173ed35f6797a51a85378ecfca/fairseq/tasks/fairseq_task.py#L350
can you add **kwargs
and do the same at line 352

and at
https://github.com/pytorch/fairseq/blob/d37529ed234ea9173ed35f6797a51a85378ecfca/fairseq/models/bart/hub_interface.py#L124
add
bos_token=self.task.source_dictionary.bos()

Let me know how it goes.
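
Roughly, the idea (sketch only; line numbers and the surrounding code may differ between commits) is to let bos_token flow from the hub interface through the task into the generator, so decoding starts from <s>:

# fairseq/tasks/fairseq_task.py -- forward extra generation kwargs
# (torch is already imported in this module):
def inference_step(self, generator, models, sample, prefix_tokens=None, **kwargs):
    with torch.no_grad():
        return generator.generate(models, sample, prefix_tokens=prefix_tokens, **kwargs)

# fairseq/models/bart/hub_interface.py -- in generate(), around the linked line,
# pass the BOS index when calling the task:
translations = self.task.inference_step(
    generator,
    [self.model],
    sample,
    bos_token=self.task.source_dictionary.bos(),
)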

@astariul
Contributor Author

astariul commented Apr 8, 2020

@yinhanliu thanks for your help.

Here are my results after applying the changes you mentioned:

1 ROUGE-1 Average_F: 0.44610 (95%-conf.int. 0.44316 - 0.44902)
---------------------------------------------
1 ROUGE-2 Average_F: 0.21318 (95%-conf.int. 0.21023 - 0.21619)
---------------------------------------------
1 ROUGE-L Average_F: 0.36227 (95%-conf.int. 0.35930 - 0.36535)

@yinhanliu

Thank you!
Let's see what the XSUM author says.

@astariul
Contributor Author

@yinhanliu According to the author of XSUM, the dataset at the provided link is the same as the one you used.

I followed the same train/val/test split as the one provided by the author. I didn't apply any additional processing. I applied BPE encoding + binarization with the exact same parameters as for CNN/DM.

@yinhanliu

Thanks so much for letting me know. I will work on this shortly. The released model is actually supposed to work better than the one in the paper.

@yinhanliu

@colanim I figured it out. In the original paper, we added BOS to each src and tgt during fine-tuning, but we didn't do so when we open-sourced the code. Setting prepend-bos to True in the translation task can improve the result when you fine-tune.
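
Roughly (sketch only, since there is no command-line flag for this at the moment): in the translation task's load_dataset, forward prepend_bos=True to load_langpair_dataset so <s> is prepended to both sides; the argument list below is abbreviated to the relevant parts:

# fairseq/tasks/translation.py -- in TranslationTask.load_dataset() (sketch;
# only the prepend_bos change is the point here):
self.datasets[split] = load_langpair_dataset(
    data_path, split,
    src, self.src_dict, tgt, self.tgt_dict,
    combine=combine,
    dataset_impl=self.args.dataset_impl,
    upsample_primary=self.args.upsample_primary,
    left_pad_source=self.args.left_pad_source,
    left_pad_target=self.args.left_pad_target,
    max_source_positions=self.args.max_source_positions,
    max_target_positions=self.args.max_target_positions,
    prepend_bos=True,  # add <s> to both src and tgt, as in the paper's fine-tuning
)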

@astariul
Contributor Author

@yinhanliu Thanks for your answer.

I see. But I didn't fine-tune BART myself; I used your already fine-tuned checkpoint on XSUM.

So I'm only doing evaluation. Should I modify any parameters in the evaluation script?

@yinhanliu

I tuned it incorrectly (I didn't add BOS when I fine-tuned it). @ngoyal2707 can you double-check? I think the current code doesn't have an option to prepend BOS.

@astariul
Contributor Author

@yinhanliu thanks for the fast answer!

Do you plan to release a fixed checkpoint?

@zsquaredz

Hi @colanim, can you do me a favor? After this line
https://github.com/pytorch/fairseq/blob/966436403e5e927e3e7d5b389dad6ef06aaa7e03/fairseq/sequence_generator.py#L281

can you add:
if step == 0:
    lprobs[:, self.bos] = 1000
else:
    lprobs[:, self.bos] = -math.inf
Let me know the ROUGE on your data with these lines.
Thanks a lot!

Hi, I tried adding the above code to sequence_generator.py and it gives me an error because self.bos does not exist. Do I have to define it manually?

@zsquaredz

I'm trying to reproduce the results of BART on the XSUM dataset.

I followed the README, didn't apply any preprocessing to the XSUM data, and used beam=6, lenpen=1.0, max_len_b=60, min_len=10 for generation.

I got the following results:

1 ROUGE-1 Average_F: 0.43809 (95%-conf.int. 0.43543 - 0.44078)
---------------------------------------------
1 ROUGE-2 Average_F: 0.20327 (95%-conf.int. 0.20052 - 0.20598)
---------------------------------------------
1 ROUGE-L Average_F: 0.34652 (95%-conf.int. 0.34382 - 0.34941)

which is a bit lower than the reported results:
[image: reported ROUGE scores from the BART paper]

For the CNN/DM dataset, there were a few details to add in the data preprocessing step; I'm wondering if I missed similar details for the XSUM dataset.

Adding the missing preprocessing steps led to score improvements, so I suspect it's the same issue for the XSUM dataset. Does anyone know where I can find a detailed explanation of how to preprocess the XSUM dataset?

@ngoyal2707 @yinhanliu

Hi @colanim, when you say without any preprocessing, do you mean even without lowercasing the text? I am also trying to reproduce the results for XSum using the uploaded checkpoint, but my ROUGE scores are lower than yours (ROUGE-1 41, ROUGE-2 17, ROUGE-L 32).

@astariul
Contributor Author

Yes, I didn't apply any other processing, just the raw datasets and the checkpoint given by the author.

@zsquaredz

zsquaredz commented May 17, 2020

@colanim Thanks, I managed to get similar results using the raw dataset. I am also wondering how you used the following piece of code (provided by the author above) to further boost ROUGE:

if step == 0:
    lprobs[:, self.bos] = 1000
else:
    lprobs[:, self.bos] = -math.inf

It seems that self.bos is not defined in the code.

@astariul
Contributor Author

@zsquaredz I can't access my code right now, but I think you can't access self.bos directly; you need to "find" it yourself. The author also mentioned writing:

bos_token=self.task.source_dictionary.bos()

Can you try this?

@zsquaredz

@colanim Got it, thanks for the suggestion.

@astariul
Contributor Author

astariul commented Jul 1, 2020

Any update on this?
Can anyone reproduce the results on the XSUM dataset using the checkpoint provided by the authors?

@monologue1107

@colanim I figured it out. In the original paper, we added BOS to each src and tgt during fine-tuning, but we didn't do so when we open-sourced the code. Setting prepend-bos to True in the translation task can improve the result when you fine-tune.

Hi, can you specify which place should be revised so that I can try to fix it?

@shirley-wu

shirley-wu commented Feb 1, 2021

Hi @colanim, can you do me a favor? After this line
https://github.com/pytorch/fairseq/blob/966436403e5e927e3e7d5b389dad6ef06aaa7e03/fairseq/sequence_generator.py#L281

can you add:
if step == 0:
    lprobs[:, self.bos] = 1000
else:
    lprobs[:, self.bos] = -math.inf
Let me know the ROUGE on your data with these lines.
Thanks a lot!

Hi @yinhanliu, I'm trying to reproduce the results too. I tried this code, and it indeed improved the ROUGE scores, but I'm confused about why it works. I'm using fairseq v0.10.2. Here is what I've tried:

  1. I downloaded the raw dataset from http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz; for documents with multiple lines, I concatenate them into a single line separated by spaces (see the sketch after this list); I apply no preprocessing steps; I use beam=6, lenpen=1.0, max_len_b=60, min_len=10 for generation. I get R1 / R2 / RL = 44.33 / 20.98 / 35.24.
  2. I tested with your code and got a better result: R1 / R2 / RL = 45.23 / 21.92 / 36.71.
  3. I further tested this code and found that it's lprobs[:, self.bos] = 1000 that makes the difference; lprobs[:, self.bos] = -math.inf has no influence on the results.
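
For reference, the line-joining in step 1 is nothing more than the sketch below; the directory layout and the .document extension are placeholders for however you arrange the extracted raw files:

import glob

def flatten_document(text: str) -> str:
    # Join a multi-line document into a single line separated by spaces.
    return ' '.join(line.strip() for line in text.splitlines() if line.strip())

# 'xsum_raw/test/*.document' is a hypothetical layout for the extracted raw files.
with open('xsum/test.source', 'w', encoding='utf-8') as fout:
    for path in sorted(glob.glob('xsum_raw/test/*.document')):
        with open(path, encoding='utf-8') as f:
            fout.write(flatten_document(f.read()) + '\n')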

However, this is very confusing, because BART already forces the prefix token to be BOS here. I don't understand why setting the score to 1000 can make any difference. My observation is that, with lprobs[:, self.bos] = 1000, the generations seem to be shorter.

Could you help me understand why it works?

@stale

stale bot commented Jul 21, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Jul 21, 2021
@Ricardokevins

So, how can we reproduce the BART results? I'm still confused >.<

@stale stale bot removed the stale label Nov 10, 2021
@Ricardokevins

Ricardokevins commented Nov 10, 2021

@yinhanliu According to the author of XSUM, the dataset at the provided link is the same as the one you used.

I followed the same train/val/test split as the one provided by the author. I didn't apply any additional processing. I applied BPE encoding + binarization with the exact same parameters as for CNN/DM.

Hey, thank you for your issue.
I'd like to know: after downloading the data from this link, what did you do to further process it?
Did you write a script to split the data yourself, without using the Stanford CoreNLP toolkit and extracting data from CoreNLP XML files?
Thank you for any suggestions ~

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Mar 2, 2022
@stale

stale bot commented Apr 18, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

@stale stale bot closed this as completed Apr 18, 2022