
Difficulties to reproduce XSUM results with BART #1971

Closed
astariul opened this issue Apr 7, 2020 · 26 comments

@astariul
Contributor

astariul commented Apr 7, 2020

I'm trying to reproduce the results of BART on the XSUM dataset.

I followed the README, didn't apply any preprocessing to the XSUM data, and used beam=6, lenpen=1.0, max_len_b=60, min_len=10 for generation.
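
A minimal sketch of such an evaluation loop with these settings, roughly following the README (the checkpoint directory, data paths, and batch size are placeholders):

import torch
from fairseq.models.bart import BARTModel

# Load the released XSUM checkpoint (paths are placeholders).
bart = BARTModel.from_pretrained(
    'bart.large.xsum',
    checkpoint_file='model.pt',
    data_name_or_path='xsum-bin',
)
bart.cuda()
bart.eval()
bart.half()

batch_size = 32
with open('xsum/test.source') as src, open('xsum/test.hypo', 'w') as out:
    sources = [line.strip() for line in src]
    for i in range(0, len(sources), batch_size):
        batch = sources[i:i + batch_size]
        with torch.no_grad():
            # Same generation settings as above.
            hypos = bart.sample(batch, beam=6, lenpen=1.0, max_len_b=60, min_len=10)
        for h in hypos:
            out.write(h + '\n')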

I got the following results:

1 ROUGE-1 Average_F: 0.43809 (95%-conf.int. 0.43543 - 0.44078)
---------------------------------------------
1 ROUGE-2 Average_F: 0.20327 (95%-conf.int. 0.20052 - 0.20598)
---------------------------------------------
1 ROUGE-L Average_F: 0.34652 (95%-conf.int. 0.34382 - 0.34941)

which is a bit lower than the reported results:
[image: reported ROUGE scores from the BART paper]


For the CNN/DM dataset, there were a few details to add in the data preprocessing step; I'm wondering if I missed similar details for the XSUM dataset.

Adding the missing preprocessing steps led to score improvements, so I suspect it's the same issue for the XSUM dataset. Does anyone know where I can find a detailed explanation of how to preprocess the XSUM dataset?

@ngoyal2707 @yinhanliu

@yinhanliu

I got the raw text from one of the XSUM authors. If you can get it from them, you should get a better number. I am not sure how to revert their released (tokenized) data back to raw text.

@yinhanliu

Hi @colanim, can you do me a favor? After this line
https://github.com/pytorch/fairseq/blob/966436403e5e927e3e7d5b389dad6ef06aaa7e03/fairseq/sequence_generator.py#L281
can you add:
if step == 0:
    lprobs[:, self.bos] = 1000
else:
    lprobs[:, self.bos] = -math.inf
Let me know the ROUGE on your data with these lines.
Thanks a lot!
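
In case self.bos is not defined in your copy of sequence_generator.py, a rough sketch of what I mean (the BOS index can be stored in __init__ from the target dictionary; the surrounding lines may differ between fairseq versions):

import math

# In SequenceGenerator.__init__, next to the existing self.pad / self.unk / self.eos
# assignments, keep the BOS index around (self.bos does not exist by default):
self.bos = tgt_dict.bos()

# Then, in the decoding loop, right after lprobs is computed (around the linked line):
if step == 0:
    lprobs[:, self.bos] = 1000        # strongly favor <s> as the first generated token
else:
    lprobs[:, self.bos] = -math.inf   # never generate <s> after the first step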

@astariul
Contributor Author

astariul commented Apr 8, 2020

@yinhanliu thanks for the details.

After modifying the code as you mention, my score is about 1 point higher:

1 ROUGE-1 Average_F: 0.44628 (95%-conf.int. 0.44348 - 0.44919)
---------------------------------------------
1 ROUGE-2 Average_F: 0.21263 (95%-conf.int. 0.20981 - 0.21558)
---------------------------------------------
1 ROUGE-L Average_F: 0.36099 (95%-conf.int. 0.35794 - 0.36411)

That's great!


But it's still about 1 point below the paper's results.

I asked for the raw XSUM dataset; I will update this issue when I receive it (the author hasn't responded yet).

In the meantime, any idea where this 1-point difference might come from?

@yinhanliu

@colanim
I think there is a bug in the code. ROUGE doesn't work on my side currently, so can you help me try the things below and see the result?

https://github.com/pytorch/fairseq/blob/d37529ed234ea9173ed35f6797a51a85378ecfca/fairseq/tasks/fairseq_task.py#L350
can you add **kwargs
and do the same at line 352

and at
https://github.com/pytorch/fairseq/blob/d37529ed234ea9173ed35f6797a51a85378ecfca/fairseq/models/bart/hub_interface.py#L124
add
bos_token=self.task.source_dictionary.bos()

Let me know how it goes.
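
Roughly, the idea (sketch only; line numbers and the surrounding code may differ between commits) is to let bos_token flow from the hub interface through the task into the generator, so decoding starts from <s>:

# fairseq/tasks/fairseq_task.py -- forward extra generation kwargs
# (torch is already imported in this module):
def inference_step(self, generator, models, sample, prefix_tokens=None, **kwargs):
    with torch.no_grad():
        return generator.generate(models, sample, prefix_tokens=prefix_tokens, **kwargs)

# fairseq/models/bart/hub_interface.py -- in generate(), around the linked line,
# pass the BOS index when calling the task:
translations = self.task.inference_step(
    generator,
    [self.model],
    sample,
    bos_token=self.task.source_dictionary.bos(),
)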

@astariul
Contributor Author

astariul commented Apr 8, 2020

@yinhanliu thanks for your help.

Here are my results after applying the changes you mentioned:

1 ROUGE-1 Average_F: 0.44610 (95%-conf.int. 0.44316 - 0.44902)
---------------------------------------------
1 ROUGE-2 Average_F: 0.21318 (95%-conf.int. 0.21023 - 0.21619)
---------------------------------------------
1 ROUGE-L Average_F: 0.36227 (95%-conf.int. 0.35930 - 0.36535)

@yinhanliu

Thank you!
Let's see what the XSUM author says.

@astariul
Contributor Author

@yinhanliu According to the author of XSUM, the dataset at the provided link is the same as the one you used.

I followed the same train/val/test split as the one provided by the author. I didn't apply any additional processing. I applied BPE encoding + binarization with the exact same parameters as for CNN/DM.

@yinhanliu

Thanks so much for letting me know. I will work on this shortly. The released model is actually supposed to work better than the one in the paper.

@yinhanliu

@colanim I figured it out. In the original paper, we added BOS to each src and tgt during fine-tuning, but we didn't do so when we open-sourced the code. Setting prepend-bos to True in the translation task can improve the result when you fine-tune.
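
Roughly (sketch only, since there is no command-line flag for this at the moment): in the translation task's load_dataset, forward prepend_bos=True to load_langpair_dataset so <s> is prepended to both sides; the argument list below is abbreviated to the relevant parts:

# fairseq/tasks/translation.py -- in TranslationTask.load_dataset() (sketch;
# only the prepend_bos change is the point here):
self.datasets[split] = load_langpair_dataset(
    data_path, split,
    src, self.src_dict, tgt, self.tgt_dict,
    combine=combine,
    dataset_impl=self.args.dataset_impl,
    upsample_primary=self.args.upsample_primary,
    left_pad_source=self.args.left_pad_source,
    left_pad_target=self.args.left_pad_target,
    max_source_positions=self.args.max_source_positions,
    max_target_positions=self.args.max_target_positions,
    prepend_bos=True,  # add <s> to both src and tgt, as in the paper's fine-tuning
)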

@astariul
Contributor Author

@yinhanliu Thanks for your answer.

I see. But I didn't fine-tune BART myself; I used your already fine-tuned checkpoint on XSUM.

So I'm only doing evaluation. Should I modify any parameters in the evaluation script?

@yinhanliu

I tuned it incorrectly (I didn't add BOS when I fine-tuned it). @ngoyal2707 can you double-check? I think the current code doesn't have an option to prepend BOS.

@astariul
Contributor Author

@yinhanliu thanks for the fast answer!

Do you plan to release a fixed checkpoint?

@zsquaredz

Hi @colanim, can you do me a favor? After this line
https://github.com/pytorch/fairseq/blob/966436403e5e927e3e7d5b389dad6ef06aaa7e03/fairseq/sequence_generator.py#L281

can you add:
if step == 0:
    lprobs[:, self.bos] = 1000
else:
    lprobs[:, self.bos] = -math.inf
Let me know the ROUGE on your data with these lines.
Thanks a lot!

Hi, I tried adding the above code to sequence_generator.py and it gives me an error because self.bos does not exist. Do I have to define it manually?

@zsquaredz

I'm trying to reproduce the results of BART on the XSUM dataset.

I followed the README, didn't apply any preprocessing to the XSUM data, and used beam=6, lenpen=1.0, max_len_b=60, min_len=10 for generation.

I got the following results:

1 ROUGE-1 Average_F: 0.43809 (95%-conf.int. 0.43543 - 0.44078)
---------------------------------------------
1 ROUGE-2 Average_F: 0.20327 (95%-conf.int. 0.20052 - 0.20598)
---------------------------------------------
1 ROUGE-L Average_F: 0.34652 (95%-conf.int. 0.34382 - 0.34941)

which is a bit lower than the reported results:
[image: reported ROUGE scores from the BART paper]

For the CNN/DM dataset, there were a few details to add in the data preprocessing step; I'm wondering if I missed similar details for the XSUM dataset.

Adding the missing preprocessing steps led to score improvements, so I suspect it's the same issue for the XSUM dataset. Does anyone know where I can find a detailed explanation of how to preprocess the XSUM dataset?

@ngoyal2707 @yinhanliu

Hi @colanim, when you say without any preprocessing, do you mean even without lowercasing the text? I am also trying to reproduce the results for XSum using the uploaded checkpoint, but my ROUGE scores are lower than yours (ROUGE-1 41, ROUGE-2 17, ROUGE-L 32).

@astariul
Contributor Author

Yes, I didn't apply any other processing, just the raw datasets and the checkpoint given by the author.

@zsquaredz

zsquaredz commented May 17, 2020

@colanim Thanks, I managed to get similar results using the raw dataset. I am also wondering how you used the following piece of code (provided by the author above) to further boost ROUGE:

if step == 0:
    lprobs[:, self.bos] = 1000
else:
    lprobs[:, self.bos] = -math.inf

It seems that self.bos is not defined in the code.

@astariul
Contributor Author

@zsquaredz I can't access my code right now, but I think you can't access self.bos directly; you need to "find" it yourself. The author also mentioned writing:

bos_token=self.task.source_dictionary.bos()

Can you try this?

@zsquaredz

@colanim Got it, thanks for the suggestion.

@astariul
Contributor Author

astariul commented Jul 1, 2020

Any update on this?
Can anyone reproduce the results on the XSUM dataset using the checkpoint provided by the authors?

@monologue1107

@colanim I figured it out. In the original paper, we added BOS to each src and tgt during fine-tuning, but we didn't do so when we open-sourced the code. Setting prepend-bos to True in the translation task can improve the result when you fine-tune.

Hi, can you specify which place should be revised so that I can try to fix it?

@shirley-wu

shirley-wu commented Feb 1, 2021

Hi @colanim, can you do me a favor? After this line
https://github.com/pytorch/fairseq/blob/966436403e5e927e3e7d5b389dad6ef06aaa7e03/fairseq/sequence_generator.py#L281

can you add:
if step == 0:
    lprobs[:, self.bos] = 1000
else:
    lprobs[:, self.bos] = -math.inf
Let me know the ROUGE on your data with these lines.
Thanks a lot!

Hi @yinhanliu, I'm trying to reproduce the results too. I tried this code, and it indeed improved the ROUGE scores, but I'm confused about why it works. I'm using fairseq v0.10.2. Here is what I've tried:

  1. I downloaded the raw dataset from http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz; for documents with multiple lines, I concatenate them into a single line separated by spaces (see the sketch after this list); I apply no preprocessing steps; I use beam=6, lenpen=1.0, max_len_b=60, min_len=10 for generation. I get R1 / R2 / RL = 44.33 / 20.98 / 35.24.
  2. I tested with your code and got a better result: R1 / R2 / RL = 45.23 / 21.92 / 36.71.
  3. I further tested this code and found that it's lprobs[:, self.bos] = 1000 that makes the difference; lprobs[:, self.bos] = -math.inf has no influence on the results.
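
For reference, the line-joining in step 1 is nothing more than the sketch below; the directory layout and the .document extension are placeholders for however you arrange the extracted raw files:

import glob

def flatten_document(text: str) -> str:
    # Join a multi-line document into a single line separated by spaces.
    return ' '.join(line.strip() for line in text.splitlines() if line.strip())

# 'xsum_raw/test/*.document' is a hypothetical layout for the extracted raw files.
with open('xsum/test.source', 'w', encoding='utf-8') as fout:
    for path in sorted(glob.glob('xsum_raw/test/*.document')):
        with open(path, encoding='utf-8') as f:
            fout.write(flatten_document(f.read()) + '\n')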

However, this is very confusing, because BART already forces the prefix token to be BOS here. I don't understand why setting the score to 1000 can make any difference. My observation is that, with lprobs[:, self.bos] = 1000, the generations seem to be shorter.

Could you help me understand why it works?

@stale

stale bot commented Jul 21, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Jul 21, 2021
@Ricardokevins

So, how can we reproduce the BART results? I'm still confused >.<

@stale stale bot removed the stale label Nov 10, 2021
@Ricardokevins

Ricardokevins commented Nov 10, 2021

@yinhanliu According to the author of XSUM, the dataset at the provided link is the same as the one you used.

I followed the same train/val/test split as the one provided by the author. I didn't apply any additional processing. I applied BPE encoding + binarization with the exact same parameters as for CNN/DM.

Hey, thank you for your issue.
I'd like to know: after downloading the data from this link, what did you do to further process it?
Did you write a script to split the data yourself, without using the Stanford CoreNLP toolkit and extracting data from CoreNLP XML files?
Thank you for any suggestions ~

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

@stale stale bot added the stale label Mar 2, 2022
@stale

stale bot commented Apr 18, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

@stale stale bot closed this as completed Apr 18, 2022