
[Train] Add accelerator ids to workers and share neuron_cores by default #39091

Merged
merged 18 commits on Sep 29, 2023

Conversation

chappidim
Contributor

@chappidim chappidim commented Aug 30, 2023

Why are these changes needed?

This change enables Ray Train workers to share Neuron cores with other workers on the same node by retrieving the accelerator (neuron_core) IDs from each worker's runtime context.
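
A minimal sketch of this mechanism, for illustration only; the Worker actor and helper names below are hypothetical (not the PR's code), and it assumes a node whose Ray resources include neuron_cores:

import os

import ray

ray.init()

@ray.remote(resources={"neuron_cores": 1})
class Worker:
    def report_neuron_core_ids(self):
        # get_resource_ids() maps resource name -> list of IDs assigned to
        # this worker process by Ray.
        ids = ray.get_runtime_context().get_resource_ids()
        return ids.get("neuron_cores", [])

    def set_visible_cores(self, core_ids):
        # The Neuron runtime reads NEURON_RT_VISIBLE_CORES at initialization.
        os.environ["NEURON_RT_VISIBLE_CORES"] = ",".join(str(c) for c in core_ids)

workers = [Worker.remote() for _ in range(2)]
# Union of the core IDs assigned to the workers on this node (single-node assumption).
all_cores = sorted({c for w in workers for c in ray.get(w.report_neuron_core_ids.remote())})
ray.get([w.set_visible_cores.remote(all_cores) for w in workers])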

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Manual tests

Manual testing

  • Created a two-node cluster of trn1.32xl instances and ran a simple all_reduce function (a hedged sketch of the test setup follows the logs below)

workers=2, neuron_cores=32 ✅

{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '60277', 'TORCHELASTIC_RUN_ID': '40a4e995-d7c7-4c34-bc1c-1770ce22b051', 'LOCAL_RANK': '0', 'RANK': '1', 'LOCAL_WORLD_SIZE': '1', 'WORLD_SIZE': '2', 'GROUP_RANK': '1', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '1', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '2', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_LOCAL_WORKER': 'c_localservice:1', 'XRT_SHARD_WORLD_SIZE': '2', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '1', 'XRT_SHARD_LOCAL_ORDINAL': '0', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:1', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:32795', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:39449', 'TPU_MESH_CONTROLLER_PORT': '39449', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '1', 'NEURON_GLOBAL_DEVICE_COUNT': '2'}
{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '60277', 'TORCHELASTIC_RUN_ID': '40a4e995-d7c7-4c34-bc1c-1770ce22b051', 'LOCAL_RANK': '0', 'RANK': '0', 'LOCAL_WORLD_SIZE': '1', 'WORLD_SIZE': '2', 'GROUP_RANK': '0', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '2', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_TPU_CONFIG': 'c_localservice;0;10.0.143.151:36447|c_localservice;1;10.0.136.216:55947', 'TPU_NUM_DEVICES': '1', 'XRT_LOCAL_WORKER': 'c_localservice:0', 'XRT_SHARD_WORLD_SIZE': '2', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '0', 'XRT_SHARD_LOCAL_ORDINAL': '0', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:0', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:32795', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:39449', 'TPU_MESH_CONTROLLER_PORT': '39449', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '0', 'NEURON_GLOBAL_DEVICE_COUNT': '2'}

workers=64, neuron_cores=1 ✅

{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '57283', 'TORCHELASTIC_RUN_ID': '320f5fa7-651e-4c72-9472-fe96d0bc1cb1', 'LOCAL_RANK': '0', 'RANK': '32', 'LOCAL_WORLD_SIZE': '32', 'WORLD_SIZE': '64', 'GROUP_RANK': '1', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '32', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '64', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_LOCAL_WORKER': 'c_localservice:1', 'XRT_SHARD_WORLD_SIZE': '64', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '32', 'XRT_SHARD_LOCAL_ORDINAL': '0', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:32', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:55059', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:43253', 'TPU_MESH_CONTROLLER_PORT': '43253', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '32', 'NEURON_GLOBAL_DEVICE_COUNT': '64'}
{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '57283', 'TORCHELASTIC_RUN_ID': '320f5fa7-651e-4c72-9472-fe96d0bc1cb1', 'LOCAL_RANK': '1', 'RANK': '33', 'LOCAL_WORLD_SIZE': '32', 'WORLD_SIZE': '64', 'GROUP_RANK': '1', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '33', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '64', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_LOCAL_WORKER': 'c_localservice:1', 'XRT_SHARD_WORLD_SIZE': '64', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '33', 'XRT_SHARD_LOCAL_ORDINAL': '1', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:33', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:55059', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:43253', 'TPU_MESH_CONTROLLER_PORT': '43253', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '33', 'NEURON_GLOBAL_DEVICE_COUNT': '64'}
{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '57283', 'TORCHELASTIC_RUN_ID': '320f5fa7-651e-4c72-9472-fe96d0bc1cb1', 'LOCAL_RANK': '0', 'RANK': '0', 'LOCAL_WORLD_SIZE': '32', 'WORLD_SIZE': '64', 'GROUP_RANK': '0', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '0', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '64', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_TPU_CONFIG': 'c_localservice;0;10.0.143.151:53763|c_localservice;1;10.0.136.216:48277', 'TPU_NUM_DEVICES': '32', 'XRT_LOCAL_WORKER': 'c_localservice:0', 'XRT_SHARD_WORLD_SIZE': '64', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '0', 'XRT_SHARD_LOCAL_ORDINAL': '0', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:0', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:55059', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:43253', 'TPU_MESH_CONTROLLER_PORT': '43253', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '0', 'NEURON_GLOBAL_DEVICE_COUNT': '64'}
{'NEURON_RT_VISIBLE_CORES': '0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31', 'MASTER_ADDR': '10.0.143.151', 'MASTER_PORT': '57283', 'TORCHELASTIC_RUN_ID': '320f5fa7-651e-4c72-9472-fe96d0bc1cb1', 'LOCAL_RANK': '3', 'RANK': '3', 'LOCAL_WORLD_SIZE': '32', 'WORLD_SIZE': '64', 'GROUP_RANK': '0', 'GROUP_WORLD_SIZE': '2.0', 'ROLE_RANK': '3', 'ROLE_NAME': 'default', 'ROLE_WORLD_SIZE': '64', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'FI_PROVIDER': 'efa', 'FI_EFA_USE_DEVICE_RDMA': '1', 'FI_EFA_FORK_SAFE': '1', 'TPU_LIBRARY_PATH': '/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.8/site-packages/libneuronxla/libtpu.so', 'NEURON_RT_ROOT_COMM_ID': '10.0.143.151:62182', 'TF_NUM_INTEROP_THREADS': '512', 'TF_NUM_INTRAOP_THREADS': '128', 'XLA_TRANSFER_SCALAR_ASYNC': '1', 'XLA_IR_SHAPE_CACHE_SIZE': '20480', 'TF_CPP_MIN_LOG_LEVEL': '1', 'GRPC_VERBOSITY': 'ERROR', 'ALLOW_MULTIPLE_LIBTPU_LOAD': '1', 'TPU_ML_PLATFORM': 'PyTorch/XLA', 'TF_GRPC_DEFAULT_OPTIONS': 'grpc.keepalive_time_ms=60000,grpc.keepalive_timeout_ms=14400000,grpc.http2.max_pings_without_data=0,grpc.http2.min_ping_interval_without_data_ms=300000', 'XLA_FLAGS': ' --xla_cpu_enable_fast_math=false --xla_gpu_simplify_all_fp_conversions=false', 'DISABLE_NUMERIC_CC_TOKEN': 'true', 'XRT_LOCAL_WORKER': 'c_localservice:0', 'XRT_SHARD_WORLD_SIZE': '64', 'XRT_HOST_WORLD_SIZE': '2', 'XRT_SHARD_ORDINAL': '3', 'XRT_SHARD_LOCAL_ORDINAL': '3', 'XRT_MULTI_PROCESSING_DEVICE': 'TPU:3', 'XRT_START_LOCAL_SERVER': '0', 'XRT_MESH_SERVICE_ADDRESS': '10.0.143.151:55059', 'TPU_MESH_CONTROLLER_ADDRESS': '10.0.143.151:43253', 'TPU_MESH_CONTROLLER_PORT': '43253', 'NEURON_USE_LOAD_COLLECTIVES': '1', 'NEURON_GLOBAL_DEVICE_ID': '3', 'NEURON_GLOBAL_DEVICE_COUNT': '64'}
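
The test script itself isn't included in the PR, so the following is a hedged reconstruction of the setup that produces environment dumps like the ones above; the worker count shown and the body of the training function are assumptions:

import os

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # The env dumps above come from printing each worker's environment;
    # the actual test also ran a simple all_reduce, omitted here.
    print(dict(os.environ))

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=2,
        # A trn1.32xlarge exposes 32 NeuronCores; with sharing enabled each
        # worker sees every core on its node via NEURON_RT_VISIBLE_CORES.
        resources_per_worker={"neuron_cores": 32},
    ),
)
trainer.fit()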

… default

Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
@chappidim chappidim changed the title [Train] Add accelerator ids to worker metadata, share neuron_cores by… [Train] Add accelerator ids to workers and share neuron_cores by default Aug 30, 2023
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
@chappidim
Contributor Author

cc: @matthewdeng

@matthewdeng matthewdeng self-assigned this Aug 30, 2023
@matthewdeng (Contributor) left a comment

This is awesome! At a high level the logic makes sense. I'm wondering if we can use this opportunity to make this a little more generic to support even more accelerator types.

For example, support for TPUs was just merged today, and I think it would be great if we could establish a pattern here that allows us to easily extend this to TPUs with minimal boilerplate code. 😄

python/ray/train/_internal/worker_group.py (outdated review thread, resolved)
python/ray/train/_internal/worker_group.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
@matthewdeng (Contributor) left a comment

Mostly just some minor comments around forward compatibility, I think this should be good to go soon 🙂

python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
Comment on lines 157 to 165
if self._num_gpus_per_worker > 0 and share_cuda_visible_devices_enabled:
    self._share_cuda_visible_devices()
elif self._additional_resources_per_worker:
    for (
        accelerator,
        env_var,
    ) in SUPPORTED_ACCELERATOR_DEVICES_TO_ENV_VAR.items():
        if self._share_accelerator_devices_enabled(accelerator):
            self._share_resource_ids(accelerator, env_var)
Contributor

One thing I am thinking about is unifying the logic between GPUs and the other resources. That might be simpler, since that's how it's set up in the WorkerGroup now, but it does not need to be done in this PR.

Contributor Author

I intentionally kept them separate for now.

python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
python/ray/train/constants.py (outdated review thread, resolved)
python/ray/train/tests/test_worker_group.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
python/ray/train/tests/test_backend.py (outdated review thread, resolved)
python/ray/train/_internal/worker_group.py (outdated review thread, resolved)
@matthewdeng (Contributor) left a comment

A few final touches. Thanks!

python/ray/train/constants.py (outdated review thread, resolved)
python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
Comment on lines 89 to 90
# 2_CPUs, 1_GPU
assert len(gpu_and_accelerator_ids) == 3
Contributor

It's unclear to me what the purpose of this comment is. Also I'm not sure if we should assert the length of this, as it may change if we add more resource types in the future?

Contributor Author

The list doesn't change because these resources are fixed at the test fixture level (e.g., ray_start_2_cpus_and_gpus). The comment was left over from my unit-test debugging.

Contributor

Oh what is the output of this? I tried running a super naive script on my laptop that prints out ray.get_runtime_context().get_resource_ids(), and it showed up as {'GPU': [], 'neuron_cores': [], 'TPU': []}.
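
For reference, a script along these lines (the exact one isn't shown; running the call inside a task is an assumption here) reproduces that output:

import ray

ray.init()

@ray.remote
def show_resource_ids():
    # Maps resource name -> list of IDs assigned to this worker process.
    return ray.get_runtime_context().get_resource_ids()

print(ray.get(show_resource_ids.remote()))
# On a machine with no accelerators this prints something like:
# {'GPU': [], 'neuron_cores': [], 'TPU': []}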

Contributor Author

You are right. I was under the impression (I got lost in the refactors) that it returns CPU/GPU/neuron_cores; the third key is actually TPU, which was added recently. Fixed the unit tests by removing the length check.

python/ray/train/_internal/backend_executor.py (outdated review thread, resolved)
@jjyao
Contributor

jjyao commented Sep 26, 2023

@matthewdeng besides this PR, what other changes are needed to support different accelerators? For example, ScalingConfig currently has a use_gpu flag; do we need to change that?

@matthewdeng
Contributor

@jjyao right now this can be used by setting ScalingConfig.resources_per_worker and ignoring the use_gpu flag. In the future we may be able to further extend the API to be more friendly for different accelerator types.

See #39130 for more info

@woshiyyya
Member

woshiyyya commented Sep 26, 2023

@jjyao Users can do something like ScalingConfig(num_workers=8, resources_per_worker={"neuron_cores": 1}) with the current API. But as we incorporate more accelerators, the use_gpu flag would be a bit confusing.

Possibly we can update the API to something like ScalingConfig(accelerator="gpu/neuron/tpu").
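
For illustration, here is the current usage described above, with the proposed accelerator parameter shown only as a comment (it is a suggestion in this thread, not an existing argument):

from ray.train import ScalingConfig

# Works with the current API: request NeuronCores per worker and leave use_gpu unset.
scaling_config = ScalingConfig(
    num_workers=8,
    resources_per_worker={"neuron_cores": 1},
)

# Proposed (not implemented): ScalingConfig(num_workers=8, accelerator="neuron")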

Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
python/ray/train/constants.py (outdated review thread, resolved)
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Comment on lines 357 to 360
has_resource_requested = (
    self._additional_resources_per_worker.get(resource_name, None) is not None
)
return bool(env_integer(enable_sharing_env, has_resource_requested))
Contributor

Oops, I was looking over this again, should we change this logic so that it only returns True if both are True? I don't think it makes sense to share resources if has_resource_requested is False?

Contributor Author

This was my thinking:

  1. If the user requests additional resources with neuron_cores, share by default (when no TRAIN_ENABLE_SHARE_NEURON_CORES_ACCELERATOR is set).
  2. If the user passes the env var (TRAIN_ENABLE_SHARE_NEURON_CORES_ACCELERATOR), use its integer value: 1/True, 0/False.

I agree that with this behavior the env var would need to be named something like DISABLE_SHARE_xxx. Fixed it. Good catch.

Contributor

Oh, I was thinking of just making this consistent with the GPU logic:

            if self._num_gpus_per_worker > 0 and share_cuda_visible_devices_enabled:
                self._share_cuda_visible_devices()
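
A minimal standalone sketch of the check as discussed (share only when the resource was actually requested and sharing hasn't been disabled); the function name and signature are illustrative, not the PR's final code:

import os

def share_accelerator_enabled(additional_resources: dict, resource_name: str, enable_sharing_env: str) -> bool:
    # True only when the accelerator resource was requested per worker...
    has_resource_requested = additional_resources.get(resource_name, 0) > 0
    # ...and sharing is enabled (defaults to enabled when the env var is unset;
    # setting it to "0" disables sharing).
    share_enabled = bool(int(os.environ.get(enable_sharing_env, "1")))
    return has_resource_requested and share_enabled

# Example:
# share_accelerator_enabled({"neuron_cores": 1}, "neuron_cores",
#                           "TRAIN_ENABLE_SHARE_NEURON_CORES_ACCELERATOR")  -> True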

Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
@matthewdeng (Contributor) left a comment

Thanks so much!

@matthewdeng matthewdeng merged commit 58cf2a9 into ray-project:master Sep 29, 2023
41 of 43 checks passed
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…ult (ray-project#39091)

Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Development

Successfully merging this pull request may close these issues.

[Train] Share neuron_cores between train workers
4 participants