
Conversation

jafraustro
Contributor

Add accelerator API to RPC distributed examples:

  • ddp_rpc
  • parameter_server
  • rnn

CC: @soumith

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
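For context, a minimal sketch of the accelerator-API pattern the examples move to, assuming a per-process rank; the setup_device helper name is illustrative, not code from this PR:

```python
import torch
import torch.distributed as dist

def setup_device(rank: int):
    # Pick the available accelerator (cuda, xpu, ...) or fall back to CPU.
    if torch.accelerator.is_available():
        device = torch.device(torch.accelerator.current_accelerator().type, rank)
    else:
        device = torch.device("cpu")
    # Ask torch.distributed which backend matches the device
    # (e.g. nccl for cuda, gloo for cpu).
    backend = dist.get_default_backend_for_device(device)
    return device, backend
```

The same pattern appears, with per-example variations, in ddp_rpc, parameter_server, and rnn.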

netlify bot commented Jul 14, 2025

Deploy Preview for pytorch-examples-preview canceled.

🔨 Latest commit: e044bc7
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-examples-preview/deploys/68815375610bf80008748783

jafraustro marked this pull request as ready for review July 14, 2025 16:34
Member

soumith commented Jul 15, 2025

failing CI

@jafraustro
Contributor Author

I added numpy to the requirements.txt files.

jafraustro closed this Jul 15, 2025
jafraustro reopened this Jul 15, 2025
Member

soumith commented Jul 16, 2025

still failing :D

- Added a function to verify minimum GPU count before execution.
- Updated HybridModel initialization to use rank instead of device.
- Ensured proper cleanup of the process group to avoid resource leaks.
- Added exit message if insufficient GPUs are detected.

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
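A minimal sketch of the GPU-count guard this commit describes; the function name follows the commit message, but the body and exit message are assumptions rather than the merged code:

```python
import sys
import torch

def verify_min_gpu_count(min_gpus: int = 2) -> bool:
    # The hybrid DDP + RPC example shards work across two trainer GPUs,
    # so require at least that many accelerator devices.
    has_gpu = torch.accelerator.is_available()
    gpu_count = torch.accelerator.device_count() if has_gpu else 0
    return has_gpu and gpu_count >= min_gpus

if __name__ == "__main__":
    if not verify_min_gpu_count(min_gpus=2):
        print("This example requires at least 2 GPUs; exiting.")
        sys.exit(0)
```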
@jafraustro
Contributor Author

Hi @soumith,

The DDP step needs two GPUs.

Fix:

  • Added verify_min_gpu_count() function to check for sufficient GPU resources.
  • Updated the HybridModel class to use rank-based device assignment instead of generic device handling, improving device placement consistency across distributed processes.
  • Implemented proper cleanup by adding dist.destroy_process_group() calls for the trainer processes (see the sketch below).
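A rough sketch of the rank-based placement and cleanup listed above, assuming a standard DDP setup; the layer sizes and the _run_trainer helper are placeholders, not the example's real model:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class HybridModel(nn.Module):
    def __init__(self, rank: int):
        super().__init__()
        # Place the dense part of the model on the accelerator owned by this rank.
        self.device = torch.device(torch.accelerator.current_accelerator().type, rank)
        self.fc = DDP(nn.Linear(16, 8).to(self.device), device_ids=[rank])

    def forward(self, x):
        return self.fc(x.to(self.device))

def _run_trainer(rank: int, world_size: int):
    device = torch.device(torch.accelerator.current_accelerator().type, rank)
    backend = dist.get_default_backend_for_device(device)
    # MASTER_ADDR / MASTER_PORT are assumed to be set in the environment.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    model = HybridModel(rank)
    # ... training loop elided ...
    # Tear down this trainer's process group so no resources leak on exit.
    dist.destroy_process_group()
```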

@@ -1 +1,2 @@
-torch>=1.6.0
+torch>=2.7.1
Contributor


why 2.7.1 instead of 2.7? Were there any fixes in 2.7.1 which we need in this example?

Contributor Author


There are not; rolling back to 2.7.0.
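With the rollback, the example's requirements.txt would read roughly as follows (assumed final content; the numpy line comes from the earlier CI fix):

```
torch>=2.7.0
numpy
```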

    else:
        device = torch.device("cpu")
    backend = torch.distributed.get_default_backend_for_device(device)
    torch.accelerator.device_index(rank)
Contributor


How does setting torch.accelerator.device_index(rank) work for "cpu"? And why is it not under torch.accelerator.is_available()?

Contributor Author


Removed the CPU execution option, since DDP requires 2 GPUs for this example.

- Remove CPU execution option since DDP requires 2 GPUs for this example.
- Refine README.md for DDP RPC example clarity and detail

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
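For reference, a minimal sketch of how the device setup reads once the CPU branch is gone (assumed shape of the merged code, not an exact quote):

```python
# An accelerator is mandatory here: the example exits earlier
# when fewer than 2 GPUs are detected, so there is no CPU fallback.
device = torch.accelerator.current_accelerator()
backend = torch.distributed.get_default_backend_for_device(device)
```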
soumith merged commit e9a4e75 into pytorch:main Jul 27, 2025
8 checks passed
Member

soumith commented Jul 27, 2025

thanks!
