Enable bucketized all-reduce for gradients #7216
Conversation
Something is wrong with the build @JackCaoG. Maybe it is a one-off? I don't know how to restart the run, though.
import torch_xla.distributed.xla_backend
import torch.distributed as dist
Do you need to set the ALLREDUCE_GRADIENTS_BUCKET_SIZE_MB envvar for this test?
Also, do you need to add this test to run_tests.sh?
Yeah, this test is not being run in CI.
We don't need the env flag; the test runs the bucketized version directly.
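For context, a minimal sketch of what calling the bucketized version directly (bypassing the ALLREDUCE_GRADIENTS_BUCKET_SIZE_MB env var) might look like in such a test. The helper name, its signature, and the 50 MB cap are assumptions about this PR's API, not details confirmed in this thread:

import torch
import torch_xla.core.xla_model as xm

# Hypothetical direct invocation of the bucketized gradient all-reduce.
device = xm.xla_device()
grads = [torch.randn(128, 128, device=device) for _ in range(8)]  # stand-in gradients
xm.all_reduce_bucketized_gradients(
    grads,
    scale=1.0,         # no rescaling of the reduced values
    groups=None,       # all replicas participate
    pin_layout=True,   # keep tensor layouts consistent across replicas
    bucket_cap_mb=50)  # coalesce tensors up to ~50 MB per all-reduce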
Hmm, this is weird.
Let me rerun; if it still fails, I will ask someone on our end to take a look. Sorry that CI gave you guys so much trouble.
CPU failures look real.
Seems like one of the buffers has been aliased (hence the buffer is
CI might be glitching. You should add your test to run_tests.sh.
@JackCaoG the error in the GPU run for torch_mp_op is not very clear. Do you know why it is failing?
It is not relevant; we can ignore it. However, the test in this PR is not being run.
Hi @JackCaoG, seems like there's some CI infra issue?
Yeah, GitHub Actions is glitching; this affects all GitHub projects.
Force-pushed from b1bf93c to f332a8f (compare).
Hi, I saw we wanted to backport this PR into release 2.4 (mentioned in #7242), but it looks like it was not backported into 2.4; please correct me if I'm wrong. Now that we are preparing the 2.5 release, would you mind describing more context about this PR's feature/use-case, or what it was inspired by? @amithrm
@ManfeiBai thanks for checking. Yeah, we can drop the 2.4 backport request. This change adds bucketing to all-reduce so that it prevents small-tensor all-reduces, which are inefficient for DMAs. The bucketing aggregates/coalesces small tensors up to a specified size and performs one all-reduce on the aggregate. This feature is already part of all-gather/reduce-scatter.
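To illustrate the bucketing idea in the comment above, here is a minimal hand-written sketch, not the PR's actual implementation; the helper names and the 50 MB default cap are illustrative, and it assumes all tensors share a dtype:

import torch
import torch_xla.core.xla_model as xm

def _flush_bucket(bucket):
    # Coalesce the small tensors into one flat buffer, issue a single
    # all-reduce over it, then copy the reduced slices back in place.
    flat = torch.cat([t.reshape(-1) for t in bucket])
    flat = xm.all_reduce(xm.REDUCE_SUM, flat)
    offset = 0
    for t in bucket:
        n = t.numel()
        t.copy_(flat[offset:offset + n].reshape(t.shape))
        offset += n

def bucketized_all_reduce(tensors, bucket_cap_mb=50):
    # Aggregate tensors until the bucket reaches the size cap, then flush.
    cap_bytes = bucket_cap_mb * 1024 * 1024
    bucket, bucket_bytes = [], 0
    for t in tensors:
        bucket.append(t)
        bucket_bytes += t.numel() * t.element_size()
        if bucket_bytes >= cap_bytes:
            _flush_bucket(bucket)
            bucket, bucket_bytes = [], 0
    if bucket:
        _flush_bucket(bucket)

Each flush issues one large transfer instead of one per tensor, which is the DMA-utilization win described above.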
This PR adds bucketing (aka coalescing) to all-reduce to increase DMA utilization and reduce the DMA overhead associated with small data transfers.
Replaces #6417.
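A hedged usage sketch for enabling the feature in a training step. The env var name comes from the review thread above; the 50 MB value, and the assumption that xm.optimizer_step() routes its gradient all-reduce through the bucketized path when the variable is set, are illustrative rather than confirmed here:

import os
os.environ["ALLREDUCE_GRADIENTS_BUCKET_SIZE_MB"] = "50"  # enable ~50 MB buckets (assumed semantics)

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

loss = model(torch.randn(32, 1024, device=device)).sum()
loss.backward()
# optimizer_step() all-reduces gradients across replicas before stepping;
# with the env var set, that all-reduce is assumed to take the bucketized path.
xm.optimizer_step(optimizer)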