Unable to replicate paper numbers on SQuAD using HF checkpoint #7

Closed
yyw1999 opened this issue Jan 3, 2022 · 3 comments


yyw1999 commented Jan 3, 2022

Hello! Thank you for releasing the paper and code for TaCL. I'm running into an issue where I'm unable to replicate the numbers in the paper using the released HF checkpoint cambridgeltl/tacl-bert-base-uncased (to be precise, I did no pretraining on my side). Here are my results on SQuAD v1 and v2, obtained with the exact package versions listed in requirements.txt, the default HF QA scripts, and 8 Tesla K80s:

[Screenshot: SQuAD v1 results]

[Screenshot: SQuAD v2 results]

The EM/F1 numbers (80.8/87.96 for v1 and 70.81/74.05 for v2) are lower than both the BERT and TaCL numbers in the paper, and I don't think the drop is due to a difference in hardware configurations. I wonder if the HF repo has changed in the meantime, so that TaCL's performance with the current version (obtained via git clone) is lower on the QA benchmarks. Please advise. Thank you!
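For reference, this is roughly the invocation I used (a sketch of the stock HF run_qa.py example with its README-suggested hyperparameters; the exact flags and output path here are illustrative, not a record of my run):

```bash
# Hypothetical 8-GPU fine-tuning run, mirroring the stock transformers
# question-answering example (not necessarily this repo's exact script).
python -m torch.distributed.launch --nproc_per_node=8 run_qa.py \
  --model_name_or_path cambridgeltl/tacl-bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./tacl_squad_v1
# For SQuAD v2, swap in: --dataset_name squad_v2 --version_2_with_negative
```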

yxuansu (Owner) commented Jan 3, 2022


Thank you for your interest in our work. The results you got are quite strange; we are not sure why they differ from ours. We re-ran the experiment on our side and got the same results as listed in the paper. Our hardware configuration is:
(1) Ubuntu 16.04.4 LTS; (2) NVIDIA-SMI 430.26; (3) Driver Version: 430.26; (4) CUDA Version: 10.2; (5) GeForce GTX 1080 (12GB).

One potential reason for the discrepancy might be the number of GPUs. We use a single GPU for fine-tuning, which takes ~2 hours for SQuAD 1.1 and ~2.5 hours for SQuAD 2.0. We also did not use fp16 mixed precision in our experiments. Please refer to the following figures for our results on SQuAD.

SQuAD 1.1:
[Screenshot: SQuAD 1.1 results]

SQuAD 2.0:
[Screenshot: SQuAD 2.0 results]

Could you try re-running the experiments for both TaCL and BERT on a single GPU with the same scripts provided here? Let's see your updated results :)
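For example, something along these lines should pin the run to one GPU without fp16 (a sketch using the stock run_qa.py flags; please adapt it to the exact scripts linked above):

```bash
# Restrict the run to a single GPU and keep full fp32 precision
# (i.e. do NOT pass --fp16). Hypothetical invocation mirroring the
# stock transformers question-answering example.
CUDA_VISIBLE_DEVICES=0 python run_qa.py \
  --model_name_or_path cambridgeltl/tacl-bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir ./tacl_squad_v1_single_gpu
```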

yyw1999 (Author) commented Jan 4, 2022

Thank you for the quick response! Rerunning the SQuAD V1 experiment on 1 GPU results in slightly worse numbers unfortunately (78.87 EM / 86.45 F1). I didn't use fp16 training at all. Did you rerun the experiments on your end with a fresh copy of transformers? Maybe the version change could account for the discrepancy. Thanks again!

yxuansu (Owner) commented Jan 4, 2022


Hi,

I just reran the experiments by cloning the latest version of the huggingface repo and using the same scripts here. I could not replicate your results; mine are still the same as reported in the paper. Please refer to the screenshots below and note the timestamps for reference.

SQuAD 1.1:
[Screenshot: SQuAD 1.1 results, 2022-01-04 17:11:46]

SQuAD 2.0:
[Screenshot: SQuAD 2.0 results, 2022-01-04 17:12:12]

Maybe you can try installing the latest version of huggingface transformers (rather than the versions pinned in the repo's requirements.txt) and see the results? (Please follow the same steps as described here.)
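For example (assuming a standard pip environment):

```bash
# Upgrade to the latest released transformers/datasets instead of the
# versions pinned in requirements.txt.
pip install --upgrade transformers datasets
```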

I hope this helps :)

yxuansu closed this as completed Jan 7, 2022