
A major refactor to sacrifice some performance for flexibility and simplicity #304

zhuzilin opened this issue Feb 28, 2022 · 1 comment

@zhuzilin (Collaborator)

Currently, PatrickStar can train the largest pretrained models with the lowest hardware requirements, i.e. GPU and CPU memory. However, this comes at a price: the design is specialized for the vanilla BERT and GPT structures and the Adam optimizer. This makes it hard for users to apply PatrickStar to their latest research projects, and hard for us to handle the edge cases needed for compatibility with popular NLP repos.

Therefore, we have decided to refactor PatrickStar to make it simple and flexible. After all, rather than breaking records, we would prefer to build a handy tool for the NLP community :)

Here are some of the changes we are making now:

  • Stop reusing the parameter chunks for gradients: This will increase memory usage, but it lets PatrickStar support more network structures.
  • Stop managing the optimizer states with chunks: This should allow us to use PyTorch native or third-party optimizers directly (see the sketch after this list).
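To illustrate what the second item aims to enable, here is a minimal sketch in plain PyTorch. The PatrickStar wrapping is omitted, and how the refactored API will expose the model's parameters is an assumption on our side, not the final design; the point is only that, once optimizer states are no longer chunk-managed, any `torch.optim` or third-party optimizer can be constructed on the parameters as usual:

```python
import torch

# `model` stands in for any network wrapped by PatrickStar; the exact way the
# refactored API exposes it is an assumption here, not the finalized design.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# With optimizer states no longer managed by chunks, a native PyTorch
# optimizer (or a third-party one) can be plugged in directly.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

for step in range(10):
    x = torch.randn(8, 1024)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```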

We will try to make the new design as performant and efficient as the old one. However, if what you need is the extreme performance reported in the paper, please refer to release v0.4.6.

@zhuzilin pinned this issue Feb 28, 2022
@Jack47 commented Feb 28, 2022

Sounds cool, we really need these two features!
