How to understand this note: "note: Since Deepspeed-ZeRO can process multiple generate streams in parallel its throughput can be further divided by 8 or 16 ..." #99
HuipengXu opened this issue on Jun 14, 2023 · 1 comment
Why is it said that only ds_zero is currently running world_size streams on world_size GPUs? Shouldn't accelerate and ds-inference be doing the same, since they also use multiprocessing?
Hey, ds-inference is also doing world_size streams.
However, accelerate is only doing 1 stream, since we are just using the naive pipeline-parallelism capability from accelerate.
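To make the stream-per-rank idea concrete, here is a minimal sketch of what "world_size streams on world_size GPUs" means for ZeRO. This is not this repo's benchmark code; `generate_rank_local_stream` and `all_prompts` are illustrative names, and `model` is assumed to already be wrapped with DeepSpeed ZeRO-3:

```python
# Minimal sketch, assuming a model already wrapped with DeepSpeed ZeRO-3:
# every rank runs generate() in lockstep (ZeRO's parameter all-gathers are
# collective ops), but each rank can feed a *different* slice of the
# requests, so the group serves world_size independent streams at once.
import torch.distributed as dist

def generate_rank_local_stream(model, tokenizer, all_prompts, max_new_tokens=100):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    # Disjoint slice per rank -> world_size parallel streams.
    my_prompts = all_prompts[rank::world_size]
    inputs = tokenizer(my_prompts, return_tensors="pt", padding=True).to(rank)
    return model.generate(**inputs, max_new_tokens=max_new_tokens)
```

Aggregate throughput is then roughly per-stream throughput × world_size, which is why the note in the title says ZeRO's effective throughput can be further divided by 8 or 16.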
A more efficient approach for pipeline parallelism would be overlapping microbatches in the forward pass (no backward pass is needed for inference); see the schedule sketch below.
For example, check the pipelining figure in the Megatron-LM paper. This would be more efficient when serving. I think implementing it would require multiple processes. But you might still get better throughput using DS-inference.
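As a rough illustration of that overlap (purely conceptual, not tied to any library): once the pipeline is full, every stage is busy with a different microbatch at each step, which the schedule below makes visible.

```python
# Forward-only pipeline schedule (GPipe/Megatron style, no backward pass):
# at step t, stage s works on microbatch t - s. After num_stages - 1
# warmup steps the pipeline is full and all stages are busy simultaneously.
def forward_schedule(num_stages, num_microbatches):
    for t in range(num_stages + num_microbatches - 1):
        yield [(s, t - s) for s in range(num_stages)
               if 0 <= t - s < num_microbatches]

for step, work in enumerate(forward_schedule(num_stages=4, num_microbatches=4)):
    print(step, work)   # e.g. step 3 -> [(0, 3), (1, 2), (2, 1), (3, 0)]
```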
Also, if you are really interested in serving models, I would suggest using text-gen-inference. It does dynamic batching and is much more efficient.
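For reference, the core idea behind that dynamic (continuous) batching is roughly the sketch below. This is not text-generation-inference's actual code; `decode_step` is a hypothetical stand-in for one generation step that produces a token for every active request and returns the requests that just finished:

```python
from collections import deque

# Sketch of dynamic batching: new requests join the running batch as soon
# as a slot frees up, instead of waiting for the whole static batch to
# finish the way naive batched generate() does.
def serve(decode_step, queue: deque, max_batch_size: int):
    active = []
    while queue or active:
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())   # admit waiting requests
        finished = decode_step(active)       # one token for each request
        active = [r for r in active if r not in finished]
```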