
How to understand this note: "note: Since Deepspeed-ZeRO can process multiple generate streams in parallel its throughput can be further divided by 8 or 16 ..." #99

HuipengXu opened this issue Jun 14, 2023 · 1 comment

Comments

@HuipengXu

Why is it said that only ds_zero is currently running world_size streams on world_size GPUs? Shouldn't accelerate and ds-inference be doing the same, since they also use multiprocessing?

@mayank31398
Collaborator

Hey, ds-inference is also running world_size streams.
However, accelerate is only running 1 stream, since we are just using the naive pipeline parallelism capability from accelerate.
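
To make the "world_size streams" point concrete, here is a minimal sketch (not this repo's actual code) of how ZeRO inference can drive one independent generate stream per rank; the model name, config values, and prompt list are placeholders.

```python
# Minimal sketch of per-rank generate streams with DeepSpeed-ZeRO: every rank
# holds only a shard of the weights but runs a full forward pass, so each of
# the world_size ranks can serve its own stream of requests.
# Launch with: deepspeed --num_gpus <world_size> this_script.py
import deepspeed
import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoTokenizer

deepspeed.init_distributed()
rank, world_size = dist.get_rank(), dist.get_world_size()

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
ds_config = {"train_micro_batch_size_per_gpu": 1,
             "zero_optimization": {"stage": 3}}
engine = deepspeed.initialize(model=model, config=ds_config)[0]

# Each rank takes its own slice of the request pool: world_size independent
# streams run in parallel, which is the world_size factor the quoted note
# divides throughput by.
all_prompts = ["Hello, my name is", "The capital of France is",
               "DeepSpeed is", "In machine learning,"]
prompts = all_prompts[rank::world_size]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(rank)
outputs = engine.module.generate(**inputs, max_new_tokens=20)
print(rank, tokenizer.batch_decode(outputs, skip_special_tokens=True))
```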
A more efficient approach for pipeline parallelism would be to overlap microbatches in the forward pass (no backward pass is needed for inference).

For example, see the figure below from the Megatron-LM paper; a toy sketch of the same scheduling idea follows it. This would be more efficient when serving, and it will likely require multiple processes to implement. But you might still get better throughput using DS-inference.
[Figure: pipeline-parallel schedule from the Megatron-LM paper, showing microbatches overlapped across pipeline stages.]
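
For intuition, here is a toy forward-only pipeline using plain Python threads standing in for GPUs/processes; it illustrates the overlapping-microbatches scheduling idea only, and is not DeepSpeed or Megatron code.

```python
import queue
import threading

def stage(fn, in_q, out_q):
    """Run one pipeline stage: consume microbatches, forward results downstream."""
    while True:
        item = in_q.get()
        if item is None:          # shutdown signal propagates down the pipe
            out_q.put(None)
            return
        idx, x = item
        out_q.put((idx, fn(x)))

# In a real deployment each stage would be part of the model on its own GPU;
# here two trivial functions stand in for the two model halves.
q0, q1, q_out = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x + 1, q0, q1)).start()
threading.Thread(target=stage, args=(lambda x: x * 2, q1, q_out)).start()

# Feed 4 microbatches back-to-back: stage 1 works on microbatch i while
# stage 0 is already processing microbatch i+1, so the stages overlap
# instead of handling one request end-to-end at a time.
for i in range(4):
    q0.put((i, i))
q0.put(None)

results = []
while (item := q_out.get()) is not None:
    results.append(item)
print(results)   # [(0, 2), (1, 4), (2, 6), (3, 8)]
```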

Also, if you are really interested in exploring model serving, I would suggest using text-generation-inference. It does dynamic batching and is much more efficient.
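
As a rough illustration of what dynamic (continuous) batching means, here is a simplified sketch; the request format and batch cap are made up, and the real text-generation-inference server is far more sophisticated.

```python
import queue

# Fake request objects: each wants a different number of generated tokens.
pending = queue.Queue()
for i in range(5):
    pending.put({"id": i, "generated": 0, "max_new_tokens": 3 + i})

active = []          # requests currently in the decode batch
MAX_BATCH = 32       # made-up cap on batch size

def decode_step(batch):
    # Stand-in for one forward pass that emits one token per active request.
    for req in batch:
        req["generated"] += 1

while active or not pending.empty():
    # Admit newly arrived requests into the running batch every step,
    # instead of waiting for the whole current batch to finish.
    while not pending.empty() and len(active) < MAX_BATCH:
        active.append(pending.get())
    decode_step(active)
    # Retire finished requests immediately; their slots free up next step.
    done = [r for r in active if r["generated"] >= r["max_new_tokens"]]
    active = [r for r in active if r["generated"] < r["max_new_tokens"]]
    for r in done:
        print(f"request {r['id']} finished after {r['generated']} tokens")
```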
