
Enable bucketized all-reduce for gradients #7216

Merged — 11 commits merged into master on Jun 14, 2024

Conversation

@amithrm (Collaborator) commented Jun 7, 2024

This PR adds bucketing (aka coalescing) to all-reduce, to increase DMA utilization and reduce DMA overhead associated with small data transfers.

Replaces #6417.
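A minimal usage sketch for reviewers, assuming the bucketized path is gated by the ALLREDUCE_GRADIENTS_BUCKET_SIZE_MB environment variable discussed below and that gradients flow through the existing xm.optimizer_step / xm.reduce_gradients entry points; the 50 MB value is only illustrative:

# Hedged sketch: enabling bucketized gradient all-reduce via the env var
# discussed in this PR. The exact gating behavior is an assumption; the
# 50 (MB) bucket size is only illustrative.
import os
os.environ["ALLREDUCE_GRADIENTS_BUCKET_SIZE_MB"] = "50"

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = model(torch.randn(8, 1024, device=device)).sum()
loss.backward()

# xm.optimizer_step() reduces gradients across replicas before stepping;
# with the env var set, gradients would be coalesced into ~50 MB buckets
# instead of being all-reduced one small tensor at a time.
xm.optimizer_step(optimizer)
xm.mark_step()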

@jeffhataws (Collaborator)

Something is wrong with the build, @JackCaoG. Maybe it is a one-off? I don't know how to restart the run, though.

@jeffhataws (Collaborator) left a review comment

(removed)

import torch_xla.distributed.xla_backend
import torch.distributed as dist


Collaborator

Do you need to set the ALLREDUCE_GRADIENTS_BUCKET_SIZE_MB envvar for this test?

Collaborator

Also, do you need to add this test to run_tests.sh?

Collaborator

Yeah, this test is not being run in CI.

Collaborator (Author)

We don't need the env flag; the test runs the bucketized version directly.
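For context, the coalescing primitive that bucketing builds on already exists: xm.all_reduce accepts a list of tensors and reduces them in one coalesced operation. Below is a minimal sketch contrasting the per-tensor and coalesced paths; it is not this PR's test, just the usual xmp.spawn pattern.

# Minimal sketch (not this PR's test): per-tensor vs. coalesced all-reduce.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
  device = xm.xla_device()
  grads = [torch.ones(n, device=device) for n in (3, 7, 1024)]

  # Per-tensor path: one all-reduce per gradient, i.e. many small transfers.
  reduced_individually = [xm.all_reduce(xm.REDUCE_SUM, g.clone()) for g in grads]

  # Coalesced path: a single in-place all-reduce over the whole list; the
  # bucketing in this PR groups gradients up to a size cap and uses this form.
  xm.all_reduce(xm.REDUCE_SUM, grads)

  xm.mark_step()
  for a, b in zip(grads, reduced_individually):
    assert torch.allclose(a.cpu(), b.cpu())


if __name__ == '__main__':
  xmp.spawn(_mp_fn)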

@JackCaoG (Collaborator) commented Jun 7, 2024

Hmm, this is weird:

Attempt 1 of 5 failed with error: Unexpected token '<', "<!DOCTYPE "... is not valid JSON. Retrying request in 3000 ms...
Attempt 2 of 5 failed with error: Unexpected token '<', "<!DOCTYPE "... is not valid JSON. Retrying request in 4669 ms...
Attempt 3 of 5 failed with error: Unexpected token '<', "<!DOCTYPE "... is not valid JSON. Retrying request in 7065 ms...
Attempt 4 of 5 failed with error: Unexpected token '<', "<!DOCTYPE "... is not valid JSON. Retrying request in 13950 ms...
Error: Failed to FinalizeArtifact: Failed to make request after 5 attempts: Unexpected token '<', "<!DOCTYPE "... is not valid JSON

Let me rerun. If it still fails I will ask someone on our end to take a look. Sorry that CI gave you guys so much trouble.

@JackCaoG (Collaborator) commented Jun 7, 2024

The CPU failures look real:

2024-06-07T21:55:07.5781389Z ======================================================================
2024-06-07T21:55:07.5782463Z ERROR: test_all_reduce_no_op_with_one_replica (__main__.TestExperimentalPjrtMultiCpu)
2024-06-07T21:55:07.5783741Z TestExperimentalPjrtMultiCpu.test_all_reduce_no_op_with_one_replica
2024-06-07T21:55:07.5784883Z ----------------------------------------------------------------------
2024-06-07T21:55:07.5787364Z concurrent.futures.process._RemoteTraceback: 
2024-06-07T21:55:07.5788147Z """
2024-06-07T21:55:07.5788624Z Traceback (most recent call last):
2024-06-07T21:55:07.5789762Z   File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
2024-06-07T21:55:07.5790972Z     r = call_item.fn(*call_item.args, **call_item.kwargs)
2024-06-07T21:55:07.5792188Z   File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
2024-06-07T21:55:07.5793238Z     return [fn(*args) for args in chunk]
2024-06-07T21:55:07.5794366Z   File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
2024-06-07T21:55:07.5795412Z     return [fn(*args) for args in chunk]
2024-06-07T21:55:07.5796704Z   File "/usr/local/lib/python3.10/site-packages/torch_xla/runtime.py", line 95, in wrapper
2024-06-07T21:55:07.5797661Z     return fn(*args, **kwargs)
2024-06-07T21:55:07.5798915Z   File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 78, in _run_thread_per_device
2024-06-07T21:55:07.5800088Z     replica_results = list(
2024-06-07T21:55:07.5801068Z   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
2024-06-07T21:55:07.5802133Z     yield _result_or_cancel(fs.pop())
2024-06-07T21:55:07.5803199Z   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
2024-06-07T21:55:07.5804274Z     return fut.result(timeout)
2024-06-07T21:55:07.5805210Z   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 458, in result
2024-06-07T21:55:07.5806190Z     return self.__get_result()
2024-06-07T21:55:07.5807167Z   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
2024-06-07T21:55:07.5808175Z     raise self._exception
2024-06-07T21:55:07.5809053Z   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-06-07T21:55:07.5810078Z     result = self.fn(*self.args, **self.kwargs)
2024-06-07T21:55:07.5811369Z   File "/usr/local/lib/python3.10/site-packages/torch_xla/_internal/pjrt.py", line 71, in _thread_fn
2024-06-07T21:55:07.5812441Z     return fn()
2024-06-07T21:55:07.5813878Z   File "/__w/xla/xla/pytorch/xla/test/pjrt/test_runtime_multi_cpu.py", line 136, in _all_reduce_hlo
2024-06-07T21:55:07.5815107Z     return torch_xla._XLAC._get_xla_tensors_hlo([reduced])
2024-06-07T21:55:07.5816583Z RuntimeError: Error while lowering: [UNKNOWN_SCALAR[]] xla::device_data, xla_shape=f32[3,3]***1,0***, dynamic_dims: (), device=CPU:0
2024-06-07T21:55:07.5818214Z Error: ./torch_xla/csrc/runtime/pjrt_computation_client.h:185 : Check failed: HasValue() 
2024-06-07T21:55:07.5819213Z *** Begin stack trace ***
2024-06-07T21:55:07.5819799Z 	tsl::CurrentStackTrace()
2024-06-07T21:55:07.5820587Z 	torch_xla::runtime::PjRtComputationClient::PjRtData::GetHandle()
2024-06-07T21:55:07.5822634Z 	torch_xla::LoweringContext::GetParameter(std::shared_ptr<torch::lazy::BackendData> const&, std::unordered_set<unsigned int, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<unsigned int> > const&)
2024-06-07T21:55:07.5824574Z 	torch_xla::DeviceData::Lower(torch_xla::LoweringContext*) const
2024-06-07T21:55:07.5825586Z 	torch_xla::LoweringContext::LowerNode(torch::lazy::Node const*)
2024-06-07T21:55:07.5826615Z 	torch_xla::LoweringContext::GetOutputOp(torch::lazy::Output const&)
2024-06-07T21:55:07.5827919Z 	torch_xla::LoweringContext::AddResult(torch::lazy::Output const&)
2024-06-07T21:55:07.5829336Z 	torch_xla::DumpUtil::ToHlo(c10::ArrayRef<torch::lazy::Value>, torch::lazy::BackendDevice const&, torch_xla::EmitMode)
2024-06-07T21:55:07.5832459Z 	torch_xla::XLAGraphExecutor::DumpHloComputation(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > > const&, torch_xla::EmitMode)
2024-06-07T21:55:07.5834849Z 	
2024-06-07T21:55:07.5835264Z 	
2024-06-07T21:55:07.5835675Z 	
2024-06-07T21:55:07.5836135Z 	_PyObject_MakeTpCall
2024-06-07T21:55:07.5836690Z 	_PyEval_EvalFrameDefault
2024-06-07T21:55:07.5837242Z 	
2024-06-07T21:55:07.5837653Z 	
2024-06-07T21:55:07.5838182Z 	_PyEval_EvalFrameDefault
2024-06-07T21:55:07.5838725Z 	
2024-06-07T21:55:07.5839171Z 	_PyEval_EvalFrameDefault
2024-06-07T21:55:07.5839704Z 	
2024-06-07T21:55:07.5840157Z 	_PyEval_EvalFrameDefault
2024-06-07T21:55:07.5840717Z 	
2024-06-07T21:55:07.5841162Z 	_PyEval_EvalFrameDefault
2024-06-07T21:55:07.5841715Z 	
2024-06-07T21:55:07.5842160Z 	_PyEval_EvalFrameDefault
2024-06-07T21:55:07.5842693Z 	
2024-06-07T21:55:07.5843135Z 	_PyEval_EvalFrameDefault
2024-06-07T21:55:07.5843670Z 	
2024-06-07T21:55:07.5844077Z 	
2024-06-07T21:55:07.5844488Z 	
2024-06-07T21:55:07.5844901Z 	
2024-06-07T21:55:07.5845335Z 	
2024-06-07T21:55:07.5845742Z 	
2024-06-07T21:55:07.5846154Z 	clone
2024-06-07T21:55:07.5846616Z *** End stack trace ***
2024-06-07T21:55:07.5847298Z buffer with shape f32[3,3] on device CPU:0 is deleted

Seems like one of the buffers has been aliased (hence the buffer is deleted) but is being referenced again.
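Roughly what the failing path does, reconstructed from the traceback above (a sketch, not the actual test source; only _get_xla_tensors_hlo is taken from the traceback, the rest is inferred):

# Sketch reconstructed from the traceback above, not the actual test code.
import torch
import torch_xla
import torch_xla.core.xla_model as xm


def _all_reduce_hlo():
  device = xm.xla_device()
  x = torch.randn(3, 3, device=device)       # the f32[3,3] device_data in the error
  reduced = xm.all_reduce(xm.REDUCE_SUM, x)  # with one replica this should be a no-op
  # If the all-reduce aliased/donated x's buffer, lowering the graph here
  # would reference a deleted buffer, which matches the failure above.
  return torch_xla._XLAC._get_xla_tensors_hlo([reduced])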

@JackCaoG (Collaborator) commented Jun 7, 2024

CI might be glitching. You should add your test to test.sh with the correct env var and rerun.

@jeffhataws (Collaborator)

@JackCaoG, the error in the GPU run for torch_mp_op is not very clear. Do you know why it is failing?

@JackCaoG (Collaborator)

It is not relevant; we can ignore it. However, the test in this PR is not being run.

@jeffhataws (Collaborator)

Hi @JackCaoG, seems like there's some CI infra issue?

@JackCaoG (Collaborator)

Yeah, GitHub Actions is glitching; this affects all GitHub projects.

@jeffhataws changed the title from "Bucketing gradients" to "Enable bucketized all-reduce for gradients" on Jun 14, 2024
@JackCaoG merged commit 28f9887 into master on Jun 14, 2024
21 of 22 checks passed
@ManfeiBai (Collaborator) commented Oct 2, 2024

Hi, I saw we wanted to backport this PR into release 2.4, as mentioned in #7242, but it looks like it was not backported into release 2.4; please correct me if I'm wrong.

Now that we are preparing the 2.5 release, would you mind describing more about this PR's feature, its use case, and what inspired it? @amithrm

@jeffhataws (Collaborator)

@ManfeiBai thanks for checking. Yeah, we can drop the 2.4 backport request.

This change adds bucketing to all-reduce so that it avoids the small-tensor all-reduces that are inefficient for DMAs. The bucketing aggregates/coalesces small tensors until a specified size is reached, then issues one all-reduce on the aggregate. This feature is already part of all-gather/reduce-scatter.
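A rough sketch of the bucketing idea (illustrative only, not the code this PR adds; the helper name and the 50 MB cap are just for the sketch, and dtype/group handling is omitted):

# Illustrative bucketing sketch, not the PR's implementation: accumulate
# gradients until a size cap is hit, then issue one coalesced all-reduce
# (xm.all_reduce over a list) per bucket instead of one per tensor.
import torch
import torch_xla.core.xla_model as xm


def all_reduce_gradients_bucketized(gradients, bucket_cap_mb=50):
  cap_bytes = bucket_cap_mb * 1024 * 1024
  bucket, bucket_bytes = [], 0

  def flush():
    if bucket:
      # One in-place coalesced all-reduce for the whole bucket.
      xm.all_reduce(xm.REDUCE_SUM, bucket)
      bucket.clear()

  for grad in gradients:
    nbytes = grad.numel() * grad.element_size()
    if bucket and bucket_bytes + nbytes > cap_bytes:
      flush()
      bucket_bytes = 0
    bucket.append(grad)
    bucket_bytes += nbytes
  flush()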
