
[Train] Add accelerator ids to workers and share neuron_cores by default #39091

Merged
merged 18 commits on Sep 29, 2023

Conversation

chappidim
Contributor

@chappidim chappidim commented Aug 30, 2023

Why are these changes needed?

This change enables Ray Train workers to share Neuron cores with other workers on the same node by retrieving the accelerator (neuron_core) IDs from each worker's runtime context.
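
A minimal sketch of this mechanism, for illustration only; the Worker actor and helper names below are hypothetical (not the PR's code), and it assumes a node whose Ray resources include neuron_cores:

import os

import ray

ray.init()

@ray.remote(resources={"neuron_cores": 1})
class Worker:
    def report_neuron_core_ids(self):
        # get_resource_ids() maps resource name -> list of IDs assigned to
        # this worker process by Ray.
        ids = ray.get_runtime_context().get_resource_ids()
        return ids.get("neuron_cores", [])

    def set_visible_cores(self, core_ids):
        # The Neuron runtime reads NEURON_RT_VISIBLE_CORES at initialization.
        os.environ["NEURON_RT_VISIBLE_CORES"] = ",".join(str(c) for c in core_ids)

workers = [Worker.remote() for _ in range(2)]
# Union of the core IDs assigned to the workers on this node (single-node assumption).
all_cores = sorted({c for w in workers for c in ray.get(w.report_neuron_core_ids.remote())})
ray.get([w.set_visible_cores.remote(all_cores) for w in workers])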

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Manual tests

Manual testing

  • Created a two-node cluster of trn1.32xl instances and ran a simple all_reduce function (a hedged sketch of the test setup follows the logs below)

workers=2, neuron_cores=32 ✅

{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '60277', 'TORCHELASTIC_RUN_ID': '40a4e995-d7c7-4c34-bc1c-1770ce22b051', 'LOCAL_RANK': '0', 'RANK': '1', 'LOCAL_WORLD_SIZE': '1', 'WORLD_SIZE': '2', 'GROUP_RANK': '1', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '1', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '2', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_LOCAL_WORKER': 'c_localservice:1', 'XRT_SHARD_WORLD_SIZE': '2', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '1', 'XRT_SHARD_LOCAL_ORDINAL': '0', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:1', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:32795', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:39449', 'TPU_MESH_CONTROLLER_PORT': '39449', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '1', 'NEURON_GLOBAL_DEVICE_COUNT': '2'}
{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '60277', 'TORCHELASTIC_RUN_ID': '40a4e995-d7c7-4c34-bc1c-1770ce22b051', 'LOCAL_RANK': '0', 'RANK': '0', 'LOCAL_WORLD_SIZE': '1', 'WORLD_SIZE': '2', 'GROUP_RANK': '0', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '2', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_TPU_CONFIG': 'c_localservice;0;10.0.143.151:36447|c_localservice;1;10.0.136.216:55947', 'TPU_NUM_DEVICES': '1', 'XRT_LOCAL_WORKER': 'c_localservice:0', 'XRT_SHARD_WORLD_SIZE': '2', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '0', 'XRT_SHARD_LOCAL_ORDINAL': '0', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:0', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:32795', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:39449', 'TPU_MESH_CONTROLLER_PORT': '39449', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '0', 'NEURON_GLOBAL_DEVICE_COUNT': '2'}

workers=64, neuron_cores=1 ✅

{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '57283', 'TORCHELASTIC_RUN_ID': '320f5fa7-651e-4c72-9472-fe96d0bc1cb1', 'LOCAL_RANK': '0', 'RANK': '32', 'LOCAL_WORLD_SIZE': '32', 'WORLD_SIZE': '64', 'GROUP_RANK': '1', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '32', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '64', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_LOCAL_WORKER': 'c_localservice:1', 'XRT_SHARD_WORLD_SIZE': '64', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '32', 'XRT_SHARD_LOCAL_ORDINAL': '0', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:32', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:55059', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:43253', 'TPU_MESH_CONTROLLER_PORT': '43253', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '32', 'NEURON_GLOBAL_DEVICE_COUNT': '64'}
{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '57283', 'TORCHELASTIC_RUN_ID': '320f5fa7-651e-4c72-9472-fe96d0bc1cb1', 'LOCAL_RANK': '1', 'RANK': '33', 'LOCAL_WORLD_SIZE': '32', 'WORLD_SIZE': '64', 'GROUP_RANK': '1', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '33', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '64', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_LOCAL_WORKER': 'c_localservice:1', 'XRT_SHARD_WORLD_SIZE': '64', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '33', 'XRT_SHARD_LOCAL_ORDINAL': '1', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:33', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:55059', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:43253', 'TPU_MESH_CONTROLLER_PORT': '43253', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '33', 'NEURON_GLOBAL_DEVICE_COUNT': '64'}
{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '57283', 'TORCHELASTIC_RUN_ID': '320f5fa7-651e-4c72-9472-fe96d0bc1cb1', 'LOCAL_RANK': '0', 'RANK': '0', 'LOCAL_WORLD_SIZE': '32', 'WORLD_SIZE': '64', 'GROUP_RANK': '0', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '64', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_TPU_CONFIG': 'c_localservice;0;10.0.143.151:53763|c_localservice;1;10.0.136.216:48277', 'TPU_NUM_DEVICES': '32', 'XRT_LOCAL_WORKER': 'c_localservice:0', 'XRT_SHARD_WORLD_SIZE': '64', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '0', 'XRT_SHARD_LOCAL_ORDINAL': '0', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:0', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:55059', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:43253', 'TPU_MESH_CONTROLLER_PORT': '43253', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '0', 'NEURON_GLOBAL_DEVICE_COUNT': '64'}
{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '57283', 'TORCHELASTIC_RUN_ID': '320f5fa7-651e-4c72-9472-fe96d0bc1cb1', 'LOCAL_RANK': '3', 'RANK': '3', 'LOCAL_WORLD_SIZE': '32', 'WORLD_SIZE': '64', 'GROUP_RANK': '0', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '3', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '64', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_LOCAL_WORKER': 'c_localservice:0', 'XRT_SHARD_WORLD_SIZE': '64', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '3', 'XRT_SHARD_LOCAL_ORDINAL': '3', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:3', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:55059', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:43253', 'TPU_MESH_CONTROLLER_PORT': '43253', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '3', 'NEURON_GLOBAL_DEVICE_COUNT': '64'}
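
The test script itself isn't included in the PR, so the following is a hedged reconstruction of the setup that produces environment dumps like the ones above; the worker count shown and the body of the training function are assumptions:

import os

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # The env dumps above come from printing each worker's environment;
    # the actual test also ran a simple all_reduce, omitted here.
    print(dict(os.environ))

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=2,
        # A trn1.32xlarge exposes 32 NeuronCores; with sharing enabled each
        # worker sees every core on its node via NEURON_RT_VISIBLE_CORES.
        resources_per_worker={"neuron_cores": 32},
    ),
)
trainer.fit()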

… default

Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
@chappidim chappidim changed the title [Train] Add accelerator ids to worker metadata, share neuron_cores by… [Train] Add accelerator ids to workers and share neuron_cores by default Aug 30, 2023
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
@chappidim
Contributor Author

cc: @matthewdeng

@matthewdeng matthewdeng self-assigned this Aug 30, 2023
@matthewdeng (Contributor) left a comment

This is awesome! At a high level the logic makes sense. I'm wondering if we can use this opportunity to make this a little more generic to support even more accelerator types.

For example, support for TPUs was just merged today, and I think it would be great if we could establish a pattern here that allows us to easily extend this to TPUs with minimal boilerplate code. 😄

python/ray/train/_internal/worker_group.py (outdated review thread, resolved)
python/ray/train/_internal/worker_group.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
@matthewdeng (Contributor) left a comment

Mostly just some minor comments around forward compatibility, I think this should be good to go soon 🙂

python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
Comment on lines 157 to 165
if self._num_gpus_per_worker > 0 and share_cuda_visible_devices_enabled:
    self._share_cuda_visible_devices()
elif self._additional_resources_per_worker:
    for (
        accelerator,
        env_var,
    ) in SUPPORTED_ACCELERATOR_DEVICES_TO_ENV_VAR.items():
        if self._share_accelerator_devices_enabled(accelerator):
            self._share_resource_ids(accelerator, env_var)
Contributor

One thing I am thinking about is unifying the logic between GPUs and the other resources. That might be simpler, since that's how it's set up in the WorkerGroup now, but it does not need to be done in this PR.

Contributor Author

I intentionally kept them separate for now.

python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
python/ray/train/constants.py (outdated review thread, resolved)
python/ray/train/tests/test_worker_group.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
python/ray/train/tests/test_backend.py (outdated review thread, resolved)
python/ray/train/_internal/worker_group.py (outdated review thread, resolved)
@matthewdeng (Contributor) left a comment

A few final touches. Thanks!

python/ray/train/constants.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
Comment on lines 89 to 90
# 2_CPUs, 1_GPU
assert len(gpu_and_accelerator_ids) == 3
Contributor

It's unclear to me what the purpose of this comment is. Also I'm not sure if we should assert the length of this, as it may change if we add more resource types in the future?

Contributor Author

The list doesn't change because these resources are fixed at the test fixture level (e.g., ray_start_2_cpus_and_gpus). The comment was left over from my unit-test debugging.

Contributor

Oh what is the output of this? I tried running a super naive script on my laptop that prints out ray.get_runtime_context().get_resource_ids(), and it showed up as {'GPU': [], 'neuron_cores': [], 'TPU': []}.
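
For reference, a script along these lines (the exact one isn't shown; running the call inside a task is an assumption here) reproduces that output:

import ray

ray.init()

@ray.remote
def show_resource_ids():
    # Maps resource name -> list of IDs assigned to this worker process.
    return ray.get_runtime_context().get_resource_ids()

print(ray.get(show_resource_ids.remote()))
# On a machine with no accelerators this prints something like:
# {'GPU': [], 'neuron_cores': [], 'TPU': []}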

Contributor Author

You are right. I was under the impression (I got lost in the refactors) that it returns CPU/GPU/neuron_cores; the third key is actually TPU, which was added recently. Fixed the unit tests by removing the length check.

python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
@jjyao
Contributor

jjyao commented Sep 26, 2023

@matthewdeng besides this PR, what other changes are needed to support different accelerators? For example, ScalingConfig currently has a use_gpu flag; do we need to change that?

@matthewdeng
Contributor

@jjyao right now this can be used by setting ScalingConfig.resources_per_worker and ignoring the use_gpu flag. In the future we may be able to further extend the API to be more friendly for different accelerator types.

See #39130 for more info

@woshiyyya
Member

woshiyyya commented Sep 26, 2023

@jjyao Users can do something like ScalingConfig(num_workers=8, resources_per_worker={"neuron_cores": 1}) with the current API. But as we incorporate more accelerators, the use_gpu flag would be a bit confusing.

Possibly we can update the API to something like ScalingConfig(accelerator="gpu/neuron/tpu").
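
For illustration, here is the current usage described above, with the proposed accelerator parameter shown only as a comment (it is a suggestion in this thread, not an existing argument):

from ray.train import ScalingConfig

# Works with the current API: request NeuronCores per worker and leave use_gpu unset.
scaling_config = ScalingConfig(
    num_workers=8,
    resources_per_worker={"neuron_cores": 1},
)

# Proposed (not implemented): ScalingConfig(num_workers=8, accelerator="neuron")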

Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
python/ray/train/constants.py (outdated review thread, resolved)
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Comment on lines 357 to 360
has_resource_requested = (
    self._additional_resources_per_worker.get(resource_name, None) is not None
)
return bool(env_integer(enable_sharing_env, has_resource_requested))
Contributor

Oops, I was looking over this again, should we change this logic so that it only returns True if both are True? I don't think it makes sense to share resources if has_resource_requested is False?

Contributor Author

This was my thinking:

  1. If the user requests additional resources with neuron_cores, share by default (when no TRAIN_ENABLE_SHARE_NEURON_CORES_ACCELERATOR is set).
  2. If the user passes the env var (TRAIN_ENABLE_SHARE_NEURON_CORES_ACCELERATOR), use its integer value: 1/True, 0/False.

I agree that with this behavior the env var would need to be named something like DISABLE_SHARE_xxx. Fixed it. Good catch.

Contributor

Oh, I was thinking of just making this consistent with the GPU logic:

            if self._num_gpus_per_worker > 0 and share_cuda_visible_devices_enabled:
                self._share_cuda_visible_devices()
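
A minimal standalone sketch of the check as discussed (share only when the resource was actually requested and sharing hasn't been disabled); the function name and signature are illustrative, not the PR's final code:

import os

def share_accelerator_enabled(additional_resources: dict, resource_name: str, enable_sharing_env: str) -> bool:
    # True only when the accelerator resource was requested per worker...
    has_resource_requested = additional_resources.get(resource_name, 0) > 0
    # ...and sharing is enabled (defaults to enabled when the env var is unset;
    # setting it to "0" disables sharing).
    share_enabled = bool(int(os.environ.get(enable_sharing_env, "1")))
    return has_resource_requested and share_enabled

# Example:
# share_accelerator_enabled({"neuron_cores": 1}, "neuron_cores",
#                           "TRAIN_ENABLE_SHARE_NEURON_CORES_ACCELERATOR")  -> True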

Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
@matthewdeng (Contributor) left a comment

Thanks so much!

@matthewdeng matthewdeng merged commit 58cf2a9 into ray-project:master Sep 29, 2023
41 of 43 checks passed
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…ult (ray-project#39091)

Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Development

Successfully merging this pull request may close these issues.

[Train] Share neuron_cores between train workers
4 participants