
[DTensor] High CPU memory usage/slowdown for aten._foreach_addcdiv_.ScalarList #123457

Closed
awgu opened this issue Apr 5, 2024 · 1 comment

Labels
module: dtensor distributed tensor tag

@awgu
Contributor

awgu commented Apr 5, 2024

Repro script (internal only): P1206409663

For `aten._foreach_addcdiv_.ScalarList`:

```
Using 291 tensors
NCCL version 2.19.3+cuda12.0
Running optimizer!
Ran optimizer in 25.393 seconds
Peak start: 2956176
Peak end: 81930936
```

For `aten._foreach_addcdiv_.Scalar`:

```
Using 291 tensors
NCCL version 2.19.3+cuda12.0
Running optimizer!
Ran optimizer in 0.047 seconds
Peak start: 2951704
Peak end: 2955256
```

Some observations:

  • The issue is specific to the `ScalarList` overload; the `Scalar` overload is unaffected.
  • The overhead scales with tensor size: with tiny tensors there is no measurable issue.
  • The issue cannot be reproduced with `TwoTensor` in place of `DTensor`, i.e. it is not common to all tensor wrapper subclasses.
cc @wanchaol @XilunWu @tianyu-l @chauhang

@awgu
Contributor Author

awgu commented Apr 5, 2024

Replaced with a single-GPU repro: #123461

@awgu awgu closed this as completed Apr 5, 2024