Documentation for DDP-related environment variables #128204

@GuWei007

📚 The doc issue

I found the following environment variables in the PyTorch source code. Is there any documentation that describes when each of them should be used?
TORCH_NCCL_BLOCKING_WAIT
TORCH_NCCL_ASYNC_ERROR_HANDLING
TORCH_NCCL_DUMP_ON_TIMEOUT
TORCH_NCCL_DESYNC_DEBUG
TORCH_NCCL_ENABLE_TIMING
TORCH_NCCL_ENABLE_MONITORING
TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC
TORCH_NCCL_TRACE_BUFFER_SIZE
TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC
TORCH_NCCL_COORD_CHECK_MILSEC
TORCH_NCCL_ABORT_IN_DESTROY_PG
TORCH_NCCL_AVOID_RECORD_STREAMS
TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK
TORCH_NCCL_DEBUG_INFO_PIPE_FILE
TORCH_NCCL_DEBUG_INFO_TEMP_FILE
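
For anyone triaging this, here is a minimal sketch (standard library only) for checking which of the variables listed above are set in a given process. The helper name `report_env` is hypothetical and not part of PyTorch, and the sketch makes no claim about what any variable does or which values it accepts.

```python
import os

# Names copied verbatim from the list in this issue.
DDP_ENV_VARS = [
    "TORCH_NCCL_BLOCKING_WAIT",
    "TORCH_NCCL_ASYNC_ERROR_HANDLING",
    "TORCH_NCCL_DUMP_ON_TIMEOUT",
    "TORCH_NCCL_DESYNC_DEBUG",
    "TORCH_NCCL_ENABLE_TIMING",
    "TORCH_NCCL_ENABLE_MONITORING",
    "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC",
    "TORCH_NCCL_TRACE_BUFFER_SIZE",
    "TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC",
    "TORCH_NCCL_COORD_CHECK_MILSEC",
    "TORCH_NCCL_ABORT_IN_DESTROY_PG",
    "TORCH_NCCL_AVOID_RECORD_STREAMS",
    "TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK",
    "TORCH_NCCL_DEBUG_INFO_PIPE_FILE",
    "TORCH_NCCL_DEBUG_INFO_TEMP_FILE",
]


def report_env(names, environ=None):
    """Hypothetical helper: map each name to its value, or None if unset."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in names}


if __name__ == "__main__":
    # Print one line per variable so the output can be pasted into a bug report.
    for name, value in report_env(DDP_ENV_VARS).items():
        print(f"{name}={'<unset>' if value is None else value}")
```

Running this on each rank before the job starts shows which settings are actually in effect, which may help when attaching details to this issue.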

Suggest a potential alternative/fix

No response

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

Labels

oncall: distributed, triaged
