
Conversation

jafraustro
Contributor

Add accelerator API to RPC distributed examples:

  • ddp_rpc
  • parameter_server
  • rnn

CC: @soumith

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
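For context, a minimal sketch of the accelerator-API pattern the examples move to, assuming a per-process rank; the setup_device helper name is illustrative, not code from this PR:

```python
import torch
import torch.distributed as dist

def setup_device(rank: int):
    # Pick the available accelerator (cuda, xpu, ...) or fall back to CPU.
    if torch.accelerator.is_available():
        device = torch.device(torch.accelerator.current_accelerator().type, rank)
    else:
        device = torch.device("cpu")
    # Ask torch.distributed which backend matches the device
    # (e.g. nccl for cuda, gloo for cpu).
    backend = dist.get_default_backend_for_device(device)
    return device, backend
```

The same pattern appears, with per-example variations, in ddp_rpc, parameter_server, and rnn.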

netlify bot commented Jul 14, 2025

Deploy Preview for pytorch-examples-preview canceled.

🔨 Latest commit: e044bc7
🔍 Latest deploy log: https://app.netlify.com/projects/pytorch-examples-preview/deploys/68815375610bf80008748783

jafraustro marked this pull request as ready for review July 14, 2025 16:34
Member

soumith commented Jul 15, 2025

failing CI

@jafraustro
Contributor Author

I added numpy to the requirements.txt files.

jafraustro closed this Jul 15, 2025
jafraustro reopened this Jul 15, 2025
Member

soumith commented Jul 16, 2025

still failing :D

- Added a function to verify minimum GPU count before execution.
- Updated HybridModel initialization to use rank instead of device.
- Ensured proper cleanup of the process group to avoid resource leaks.
- Added exit message if insufficient GPUs are detected.

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
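A minimal sketch of the GPU-count guard this commit describes; the function name follows the commit message, but the body and exit message are assumptions rather than the merged code:

```python
import sys
import torch

def verify_min_gpu_count(min_gpus: int = 2) -> bool:
    # The hybrid DDP + RPC example shards work across two trainer GPUs,
    # so require at least that many accelerator devices.
    has_gpu = torch.accelerator.is_available()
    gpu_count = torch.accelerator.device_count() if has_gpu else 0
    return has_gpu and gpu_count >= min_gpus

if __name__ == "__main__":
    if not verify_min_gpu_count(min_gpus=2):
        print("This example requires at least 2 GPUs; exiting.")
        sys.exit(0)
```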
@jafraustro
Contributor Author

Hi @soumith,

The DDP step needs two GPUs.

Fix:

  • Added verify_min_gpu_count() function to check for sufficient GPU resources.
  • Updated the HybridModel class to use rank-based device assignment instead of generic device handling, improving device placement consistency across distributed processes.
  • Implemented proper cleanup by adding dist.destroy_process_group() calls for the trainer processes (see the sketch below).
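A rough sketch of the rank-based placement and cleanup listed above, assuming a standard DDP setup; the layer sizes and the _run_trainer helper are placeholders, not the example's real model:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class HybridModel(nn.Module):
    def __init__(self, rank: int):
        super().__init__()
        # Place the dense part of the model on the accelerator owned by this rank.
        self.device = torch.device(torch.accelerator.current_accelerator().type, rank)
        self.fc = DDP(nn.Linear(16, 8).to(self.device), device_ids=[rank])

    def forward(self, x):
        return self.fc(x.to(self.device))

def _run_trainer(rank: int, world_size: int):
    device = torch.device(torch.accelerator.current_accelerator().type, rank)
    backend = dist.get_default_backend_for_device(device)
    # MASTER_ADDR / MASTER_PORT are assumed to be set in the environment.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    model = HybridModel(rank)
    # ... training loop elided ...
    # Tear down this trainer's process group so no resources leak on exit.
    dist.destroy_process_group()
```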

@@ -1 +1,2 @@
-torch>=1.6.0
+torch>=2.7.1
Contributor


why 2.7.1 instead of 2.7? Were there any fixes in 2.7.1 which we need in this example?

Contributor Author


There are not; rolling back to 2.7.0.
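With the rollback, the example's requirements.txt would read roughly as follows (assumed final content; the numpy line comes from the earlier CI fix):

```
torch>=2.7.0
numpy
```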

    else:
        device = torch.device("cpu")
    backend = torch.distributed.get_default_backend_for_device(device)
    torch.accelerator.device_index(rank)
Contributor


How does setting torch.accelerator.device_index(rank) work for "cpu"? And why is it not under torch.accelerator.is_available()?

Contributor Author


Removed the CPU execution option, since DDP requires 2 GPUs for this example.

- Remove CPU execution option since DDP requires 2 GPUs for this example.
- Refine README.md for DDP RPC example clarity and detail

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
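For reference, a minimal sketch of how the device setup reads once the CPU branch is gone (assumed shape of the merged code, not an exact quote):

```python
# An accelerator is mandatory here: the example exits earlier
# when fewer than 2 GPUs are detected, so there is no CPU fallback.
device = torch.accelerator.current_accelerator()
backend = torch.distributed.get_default_backend_for_device(device)
```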
soumith merged commit e9a4e75 into pytorch:main Jul 27, 2025
8 checks passed
Member

soumith commented Jul 27, 2025

thanks!
