Add pipeline parallel #1060
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1060
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 09946d6 with merge base 19a47e7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
# Use ModuleDict so that each layer can be assigned its layer ID in the original model
self.layers = nn.ModuleDict()
for layer_id in range(self.layers_per_stage * stage_idx, self.layers_per_stage * (stage_idx + 1)):
    self.layers[str(layer_id)] = TransformerBlock(config)
this is pretty clever!
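A minimal sketch of the pattern, with a stand-in block in place of the real `TransformerBlock`: keying the stage-local `ModuleDict` by the global layer index keeps `state_dict` keys aligned with the un-pipelined model's layer numbering, which should make mapping weights from the original checkpoint onto each stage straightforward.

```python
import torch.nn as nn

# Stand-in for TransformerBlock, just to keep the sketch self-contained.
class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ffn = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.ffn(x)

n_layers, n_stages, dim = 8, 4, 16
layers_per_stage = n_layers // n_stages

def build_stage_layers(stage_idx):
    # Key each block by its global layer id (as a string, since ModuleDict
    # keys must be strings), mirroring the loop in the PR.
    layers = nn.ModuleDict()
    start = layers_per_stage * stage_idx
    for layer_id in range(start, start + layers_per_stage):
        layers[str(layer_id)] = Block(dim)
    return layers

stage2 = build_stage_layers(2)
print(list(stage2.keys()))               # ['4', '5']
print(list(stage2.state_dict().keys()))  # ['4.ffn.weight', '4.ffn.bias', '5.ffn.weight', '5.ffn.bias']
```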
@@ -67,7 +75,7 @@ def setup_caches(self, max_batch_size, max_seq_length):
         max_seq_length = find_multiple(max_seq_length, 8)
         self.max_seq_length = max_seq_length
         self.max_batch_size = max_batch_size
-        for b in self.layers:
+        for b in self.layers.values():
block instead of b for clarity?
from original model.
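One detail behind the `.values()` change above: iterating an `nn.ModuleDict` directly yields its string keys (like a plain dict), not the modules, so the cache-setup loop needs `.values()`. A tiny illustration:

```python
import torch.nn as nn

layers = nn.ModuleDict({"0": nn.Linear(4, 4), "1": nn.Linear(4, 4)})

print([b for b in layers])                          # ['0', '1']  -- keys, not modules
print([type(b).__name__ for b in layers.values()])  # ['Linear', 'Linear']
```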
dist_run.py (Outdated)
# Model config
config = TransformerArgs.from_name("Transformer-2-7b-chat-hf")
print(config)

# Construct a device mesh with available devices (multi-host or single host)
-device_mesh = dist.init_device_mesh("cuda", (2,), mesh_dim_names=("tp",))
+device_mesh = dist.init_device_mesh("cuda", (2, 2), mesh_dim_names=("pp", "tp"))
if this file expands in the future, maybe better to functionalize these different stages, a la
create_device_mesh(mesh_shape=(2,2))
setup_model
create_pipeline_stage
etc.
It's clear atm what's happening, but if this file becomes larger/more dynamic then it would make sense in the future.
Sure, makes sense.
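A rough sketch of that factoring (the helper names follow the suggestion above and are hypothetical, not the actual dist_run.py API; run under torchrun with 4 ranks):

```python
# Hypothetical refactor of dist_run.py's setup steps into helpers.
# Run with: torchrun --nproc-per-node 4 <file>.py
import torch.distributed as dist


def create_device_mesh(mesh_shape=(2, 2)):
    # Outer dim = pipeline parallel, inner dim = tensor parallel.
    return dist.init_device_mesh("cuda", mesh_shape, mesh_dim_names=("pp", "tp"))


def setup_model(config, pp_rank, pp_degree, device):
    # Hypothetical: build only this pipeline stage's slice of the model
    # (the ModuleDict trick above), then apply TP over the "tp" submesh.
    ...


def create_pipeline_stage(stage_module, pp_rank, pp_degree, device, example_inputs):
    # Hypothetical: wrap the stage module in a PipelineStage (see the sketch further down).
    ...


def main():
    mesh = create_device_mesh((2, 2))
    pp_mesh, tp_mesh = mesh["pp"], mesh["tp"]  # 1-D submeshes, selected by dim name
    print(f"pp rank {pp_mesh.get_local_rank()}, tp rank {tp_mesh.get_local_rank()}")


if __name__ == "__main__":
    main()
```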
looks great - the fancy work with module dict for assigning layers to stage is very clever.
82a1c44 to 09946d6
mb_ids = torch.randint(0, config.vocab_size, (mb_size, seqlen), device=device)
activation = torch.rand(mb_size, seqlen, dim, device=device)
does PipelineStage require that these are materialized on CUDA for them to be input_args?
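For reference, a sketch of how those example tensors feed the stage constructor (placeholder sizes and a stand-in stage module; the `input_args` parameter is assumed from the `torch.distributed.pipelining.PipelineStage` variant this PR targets). Only the shapes/dtypes of the examples should matter for setting up the stage's communication buffers, which is exactly why it is worth checking whether meta tensors would do instead of CUDA-materialized ones:

```python
# Sketch only: assumes torchrun has started the job and the default process
# group is initialized; names below are placeholders for the PR's real values.
import torch
import torch.nn as nn
from torch.distributed.pipelining import PipelineStage

mb_size, seqlen, dim, vocab_size = 1, 128, 4096, 32000  # placeholder sizes
pp_rank, pp_degree = 0, 2                                # placeholder stage layout
device = torch.device("cuda", 0)
stage_module = nn.Identity().to(device)                  # stand-in for the real stage slice

# Example microbatch inputs, materialized on the stage's CUDA device as in the PR.
mb_ids = torch.randint(0, vocab_size, (mb_size, seqlen), device=device)
activation = torch.rand(mb_size, seqlen, dim, device=device)

# First stage consumes token ids; later stages consume the previous stage's activation.
example_args = (mb_ids,) if pp_rank == 0 else (activation,)

stage = PipelineStage(
    stage_module,
    pp_rank,             # this stage's index
    pp_degree,           # total number of stages
    device,
    input_args=example_args,
)
```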
Stack from ghstack (oldest at bottom):
PP + TP now working.
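A condensed sketch of how the two parallelisms compose on the 2-D mesh (the tiny stage module and TP plan here are illustrative, not the actual torchchat model or plan):

```python
# Run with: torchrun --nproc-per-node 4 <file>.py   (2 pp x 2 tp ranks)
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

mesh = dist.init_device_mesh("cuda", (2, 2), mesh_dim_names=("pp", "tp"))
pp_mesh, tp_mesh = mesh["pp"], mesh["tp"]

device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())

# Illustrative stage: two linears standing in for one pipeline stage's layers.
stage_module = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256)).to(device)

# Tensor parallel *within* the stage, over the "tp" submesh.
parallelize_module(
    stage_module,
    tp_mesh,
    {"0": ColwiseParallel(), "2": RowwiseParallel()},
)

# Pipeline parallel *across* stages runs over the "pp" dim: each pp rank holds
# a different stage_module and wraps it in a PipelineStage driven by a schedule
# (see the PipelineStage sketch above).
```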