
Enquiry about your implementation #10

Closed
riven314 opened this issue Jul 9, 2020 · 2 comments

Comments

@riven314

riven314 commented Jul 9, 2020

Thanks for your great work!

I have a few enquiries about your implementation:

  1. Were you able to reproduce the paper's results (or approximately similar ones) with your implementation?
  2. An ordinary Transformer requires multiple GPUs to train from scratch. Is it possible to train your Linformer implementation from scratch with only a single GPU (8 GB / 11 GB)?

Thanks,
Alex Lau

@tatp22
Owner

tatp22 commented Jul 9, 2020

Hi @riven314!

  1. I could in theory reproduce the results, but my hardware isn't as powerful as what FB has. As stated in section 5.1 of the paper:

     "We first compare the pretraining performance of our proposed architecture against RoBERTa (Liu et al., 2019), which is based on the Transformer. Following Devlin et al. (2019), we use Book Corpus (Zhu et al., 2015) plus English Wikipedia as our pretraining set (3300M words). All models are pretrained with the masked-language-modeling (MLM) objective, and the training for all experiments are parallelized across 64 Tesla V100 GPUs with 250k updates."

I unfortunately don't have 64 V100 GPUs at my disposal, so reproducing the results would take far longer for me than it did for the authors. That said, if someone reading this does have the resources, they can run the experiments themselves.

  2. Yes, it is entirely possible to train it on one GPU, although it will take longer. It depends on the task you are training for, of course, but as long as a single training example fits in GPU memory, you can work with a small batch size (accumulating gradients across batches, as sketched below) and reach the same results as if you had multiple GPUs. In fact, the main reason for using multiple GPUs in the first place is just to speed up training, whether through model or data parallelism.
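For later readers: the "small batch size, same results" point can be made concrete with gradient accumulation, which sums gradients over several micro-batches before each optimizer step so that a single GPU simulates a larger effective batch. Here is a minimal PyTorch sketch; the toy model, random data, and batch sizes are placeholders, not part of this repo:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins (assumptions, not from this repo): swap in a Linformer
# model and a real dataset; the accumulation logic stays the same.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

data = TensorDataset(torch.randn(512, 128), torch.randint(0, 10, (512,)))
loader = DataLoader(data, batch_size=8)  # micro-batch that fits in GPU memory

accum_steps = 8  # 8 micro-batches of 8 -> effective batch size 64

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Scale the loss so accumulated gradients average over the effective batch.
    loss = loss_fn(model(x.to(device)), y.to(device)) / accum_steps
    loss.backward()  # gradients sum into .grad across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # one update per effective batch of 64
        optimizer.zero_grad()
```

With this scheme, per-step memory is set by the micro-batch (and sequence length), not the effective batch, which is why an 8 GB/11 GB card mainly limits how large a micro-batch you can fit rather than whether training is possible at all.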

Hope this helps!

@riven314
Author

Thanks!
