Use dest_offsets directly in LoadPlanner #7155
Any reason this has not been merged? For context, this also fixed an issue we hit when saving a distributed checkpoint on an 8x16 and then loading it on another topology.
@alanwaketan Do you have a sec to review? I'll also add @JackCaoG
Does this change need to go into the 2.4 release (branch cut was yesterday)?
Yes, we should cherry-pick this. I'll open a PR.
LGTM. Sorry for missing this PR.
In create_read_items_for_chunk_list, the ReadItem's `dest_offsets` are set to index into the local shard. Despite this, we performed a translation from global to local indices. This caused issues restoring checkpoints with padded tensors, i.e. where the global tensor size is not evenly divisible by the shard size. We should instead rely on `dest_offsets` directly to narrow the local shards.
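To illustrate the idea, here is a minimal 1-D sketch in plain Python, not the actual planner code. `ReadItemSketch` and `apply_read_item` are hypothetical names introduced for this example; only the field names `dest_offsets`, `storage_offsets`, and `lengths` mirror the real `ReadItem`. The scenario is assumed: a global tensor of length 10 is sharded over 4 devices with shard size 3, so the last shard covers global indices 9 to 11 and only index 9 holds real data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReadItemSketch:
    """Hypothetical stand-in for a planner ReadItem (1-D only)."""
    dest_offsets: Tuple[int, ...]     # offsets into the LOCAL shard
    storage_offsets: Tuple[int, ...]  # offsets into the saved chunk
    lengths: Tuple[int, ...]          # number of elements to copy


def apply_read_item(local_shard: List[float],
                    chunk: List[float],
                    item: ReadItemSketch) -> None:
    """Copy the requested slice of `chunk` into `local_shard`.

    dest_offsets is used as-is to narrow the local shard; no
    global-to-local translation is applied.
    """
    d = item.dest_offsets[0]
    s = item.storage_offsets[0]
    n = item.lengths[0]
    local_shard[d:d + n] = chunk[s:s + n]


# Assumed scenario: the last shard holds global [9, 12), padded past 10.
local_shard = [0.0, 0.0, 0.0]
# A saved chunk covering global [8, 10), e.g. from a different topology.
chunk = [8.0, 9.0]
# Overlap is global index 9: local offset 0, chunk offset 1, length 1.
item = ReadItemSketch(dest_offsets=(0,), storage_offsets=(1,), lengths=(1,))
apply_read_item(local_shard, chunk, item)
```

Because the offsets are already local, the padded tail of the shard is simply never written, which is the behavior one would want when the global size is not evenly divisible by the shard size.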