Add accelerator API to RPC distributed examples: ddp_rpc, parameter_server, rnn #1371
Conversation
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
- ddp_rpc
- parameter_server
- rnn

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
✅ Deploy Preview for pytorch-examples-preview canceled.
failing CI
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
I added numpy to the requirements.txt files
still failing :D
- Added a function to verify minimum GPU count before execution.
- Updated HybridModel initialization to use rank instead of device.
- Ensured proper cleanup of the process group to avoid resource leaks.
- Added exit message if insufficient GPUs are detected.

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
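The GPU-count guard described in the commit message could look roughly like this. This is a hypothetical sketch, not the PR's actual code; `verify_min_gpus` is an illustrative name, and it assumes a PyTorch build where `torch.cuda.device_count()` is available:

```python
def verify_min_gpus(required=2):
    """Return True if at least `required` GPUs are visible (illustrative helper)."""
    try:
        import torch
    except ImportError:
        # torch not installed; treat as no GPUs available
        return False
    return torch.cuda.device_count() >= required


if not verify_min_gpus(2):
    # DDP in this example shards the model across two devices
    print("Exiting: this example requires at least 2 GPUs.")
```

The early exit keeps the example from failing later with a less obvious NCCL or device-placement error.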
Hi @soumith, the DDP step needs two GPUs. Fix:
```diff
@@ -1 +1,2 @@
-torch>=1.6.0
+torch>=2.7.1
```
why 2.7.1 instead of 2.7? Were there any fixes in 2.7.1 that we need in this example?
There are not; rolling back to 2.7.0.
```python
else:
    device = torch.device("cpu")
backend = torch.distributed.get_default_backend_for_device(device)
torch.accelerator.device_index(rank)
```
How does setting `torch.accelerator.device_index(rank)` work for "cpu"? Why is it not under a `torch.accelerator.is_available()` check?
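One way to address this review comment, keeping the device-index assignment under an accelerator check, might look like the sketch below. This is a hedged illustration, not the PR's code: the helper name and `rank` parameter are made up, and it assumes a recent PyTorch that provides the `torch.accelerator` API and `torch.distributed.get_default_backend_for_device`:

```python
def pick_device_and_backend(rank):
    """Illustrative: choose a device and collective backend for this rank."""
    try:
        import torch
    except ImportError:
        return None, None  # torch not installed

    acc = getattr(torch, "accelerator", None)
    if acc is not None and acc.is_available():
        # Only touch the accelerator device index when one actually exists
        device = acc.current_accelerator()
        if hasattr(acc, "set_device_index"):
            acc.set_device_index(rank)
    else:
        device = torch.device("cpu")

    get_backend = getattr(
        torch.distributed, "get_default_backend_for_device", None
    )
    backend = get_backend(device) if get_backend else None
    return device, backend
```

Guarding the index assignment this way avoids calling accelerator-only functions on a CPU-only host, which is the concern the reviewer raises.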
Remove CPU execution option since DDP requires 2 GPUs for this example.
- Remove CPU execution option since DDP requires 2 GPUs for this example.
- Refine README.md for DDP RPC example clarity and detail

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
thanks!
Add accelerator API to RPC distributed examples:
CC: @soumith