
Conversation

vanbasten23 (Collaborator) commented Dec 4, 2023

Currently we create a coordinator service even when running a single process on CUDA. I observed that when the single process runs for a long time (>1h), there is a chance the coordinator service crashes. A single-process job doesn't need the coordinator service at all.
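
For context, a minimal sketch of the idea (the XlaCoordinator class name, the make_unique call, and the world-size guard are assumptions for illustration, patterned on the review snippet later in this thread, not the exact diff):

    // Sketch only: create the coordinator (and the distributed runtime
    // client it hands out) only when more than one process is involved;
    // a single-process job skips it entirely.
    std::shared_ptr<xla::DistributedRuntimeClient> distributed_client = nullptr;
    if (global_world_size > 1) {
      coordinator_ = std::make_unique<XlaCoordinator>(
          global_process_rank, global_world_size, master_addr, port);
      distributed_client = coordinator_->GetClient();
    }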

Test plan:

  • PJRT_DEVICE=CUDA GPU_NUM_DEVICES=1 python pytorch/xla/test/pjrt/test_runtime_gpu.py
  • PJRT_DEVICE=CUDA GPU_NUM_DEVICES=2 python pytorch/xla/test/pjrt/test_runtime_gpu.py
  • PJRT_DEVICE=CUDA torchrun --nnodes 1 --nproc-per-node 2 pytorch/xla/test/pjrt/test_torchrun.py
  • PJRT_DEVICE=CUDA python pytorch/xla/test/test_operations.py MpDecoratorTest.test_mp_decorator

vanbasten23 changed the title from "Not creating the coordinator servie for single process." to "Not creating a coordinator service for the single processing job." on Dec 4, 2023

jonb377 (Collaborator) commented Dec 5, 2023

@vanbasten23 Is the crash only in the case of single-process workloads? I wonder if it's a more general issue that we should handle. Agree that we don't need a coordinator for distributed kv store in single-process though.

vanbasten23 force-pushed the notCreateCoordinatorServiceWhenNumNodesIsOne branch from 269554a to 4c856b3 on December 6, 2023 23:45
vanbasten23 (Collaborator, Author) replied:

> @vanbasten23 Is the crash only in the case of single-process workloads? I wonder if it's a more general issue that we should handle. Agree that we don't need a coordinator for distributed kv store in single-process though.

It fails in the persistent cache test: https://gist.github.com/vanbasten23/2cb90b2f72a40ef965b965bc12bc5ded. I've addressed your comment; let me rerun.

jonb377 (Collaborator) commented Dec 7, 2023

> @vanbasten23 Is the crash only in the case of single-process workloads? I wonder if it's a more general issue that we should handle. Agree that we don't need a coordinator for distributed kv store in single-process though.

> It fails in the persistent cache test: https://gist.github.com/vanbasten23/2cb90b2f72a40ef965b965bc12bc5ded. I've addressed your comment; let me rerun.

@vanbasten23 I've merged the persistent cache change. If you want to test the single-device test with this fix, you can rebase on master and modify the single-device test to run on GPU here

An inline review thread was opened on the following code from the change:

        global_process_rank, global_world_size, master_addr, port);
    std::shared_ptr<xla::DistributedRuntimeClient> distributed_client =
        coordinator_->GetClient();
    if (distributed_client != nullptr) {

Review comment (Collaborator):
I wonder if we should be doing an XLA_CHECK(distributed_client != nullptr) here - is there a case where we want the ComputationClient creation to succeed without a DistributedRuntimeClient?
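
For reference, a minimal sketch of what that suggestion could look like (using torch_xla's XLA_CHECK macro; the exact message and placement are illustrative):

    // Sketch only: fail fast if the coordinator did not return a
    // DistributedRuntimeClient, rather than continuing without one.
    std::shared_ptr<xla::DistributedRuntimeClient> distributed_client =
        coordinator_->GetClient();
    XLA_CHECK(distributed_client != nullptr)
        << "Failed to get distributed runtime client from the coordinator";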

vanbasten23 (Collaborator, Author) replied:
I like the idea!

will-cromar (Collaborator) commented Dec 7, 2023:
Isn't this case already checked in GetClient?

vanbasten23 force-pushed the notCreateCoordinatorServiceWhenNumNodesIsOne branch from 5235c90 to 8b0f9f5 on December 7, 2023 06:12
jonb377 (Collaborator) left a comment:

LGTM, thanks!

JackCaoG (Collaborator) commented Dec 7, 2023

Do we need this in 2.2?

vanbasten23 (Collaborator, Author) replied:

> Do we need this in 2.2?

We don't need this for 2.2. This PR is more of an optimization.
