Add utility functions for distributed checkpointing #5128
Conversation
cc @yashs97
-  // Clamp the end of the slice to the tensor shape to accurately reflect
+  // Clamp the slice bounds to the tensor shape to accurately reflect
   // the shard size without padding.
   int start = std::min(n_j * shard_shape[j], tensor_shape[j]);
Did you catch a bug, or is this more of a safeguard?
If n_j * shard_shape[j] is greater than tensor_shape[j], then (n_j + 1) * shard_shape[j] is certainly larger than tensor_shape[j] as well. In that scenario, start and end are both equal to tensor_shape[j], and that slice seems meaningless.
I would call this a latent bug, but it wasn't breaking anything because torch indexing handles negative-length slices as though they were empty. It just breaks the expectation that stop - start reflects the size of the unpadded shard, which we rely on in distributed checkpointing.
You're right - these index slices will end up empty, but this is the desired outcome when the shard consists entirely of padding.
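To make the clamping behavior concrete, here is a small numeric sketch (hypothetical shapes and helper, not code from this PR) showing how clamping both bounds keeps stop - start equal to the unpadded shard size and yields an empty slice for a shard that is entirely padding:

# Hypothetical sketch of the clamping discussed above; the shapes and helper
# are illustrative only, not taken from this PR.
tensor_dim = 5  # size of the tensor along dimension j
shard_dim = 2   # per-shard size along dimension j (4 shards, so 3 elements of padding)

def shard_slice(n_j):
    # Clamp both bounds to the tensor size so stop - start is the unpadded shard size.
    start = min(n_j * shard_dim, tensor_dim)
    stop = min((n_j + 1) * shard_dim, tensor_dim)
    return slice(start, stop)

print(shard_slice(0))  # slice(0, 2, None): full shard
print(shard_slice(2))  # slice(4, 5, None): partially padded shard, unpadded size 1
print(shard_slice(3))  # slice(5, 5, None): shard consisting entirely of padding, empty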
alanwaketan left a comment:
LGTM.
    return ShardingType::TUPLE;
  case xla::OpSharding::OTHER:
    // OTHER sharding can indicate either PARTIAL or TILED sharding.
    return sharding.replicate_on_last_tile_dim() ? ShardingType::PARTIAL
This seems pretty hacky, but I guess we don't have another way around it?
My understanding is that we distinguish partial replication as a separate sharding type, whereas XLA treats both partial and tiled sharding as the same OTHER type. @yeounoh could you confirm?
Yes, this is actually the correct way; the compiler treats TILED and PARTIAL the same, as the OTHER type. The difference between the two is how the tile shards are assigned to different devices.
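For readers unfamiliar with the proto, a minimal sketch (illustrative values only, not this PR's code) of how the same OTHER sharding is interpreted as TILED or PARTIAL:

# Illustrative sketch only: an OTHER OpSharding is reported as PARTIAL when its
# last tile dimension is reserved for replication, and as TILED otherwise.
def classify_other(tile_assignment_dims, replicate_on_last_tile_dim):
    # e.g. dims [2, 2] over 4 devices                       -> "TILED"
    #      dims [2, 2] with replicate_on_last_tile_dim=True -> 2-way tiling,
    #      2-way replication                                 -> "PARTIAL"
    return "PARTIAL" if replicate_on_last_tile_dim else "TILED"

print(classify_other([2, 2], False))  # TILED
print(classify_other([2, 2], True))   # PARTIAL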
@jonb377 @alanwaketan is this PR ready to merge?

Yes, I'll merge after TPU CI finishes.
This change adds a few utility functions to support distributed checkpointing. The following changes are included:
- Add sharding_type to XLAShardedTensor to get the ShardingType
- Add wrap_if_sharded to convert torch.Tensor into XLAShardedTensor if the underlying data is sharded.
- Remove the devices parameter from _get_local_shard_indices and instead always return the shard indices in the order of the shards
- Clamp the start bound of the index slices to the tensor's size.
- Add unpadded_data property of XLAShard
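A rough usage sketch of the new utilities, based only on the description above; the module path, names, and exact behavior here are assumptions and may differ from the actual torch_xla API:

# Assumed module path and behavior; treat this as a sketch, not the real API.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs

t = torch.randn(8, 8).to(xm.xla_device())
# ... t is assumed to have been sharded beforehand, e.g. via xs.mark_sharding(t, mesh, spec) ...

st = xs.wrap_if_sharded(t)   # XLAShardedTensor if t's data is sharded, else t unchanged
if isinstance(st, xs.XLAShardedTensor):
    print(st.sharding_type)  # the ShardingType of the underlying sharding
    for shard in st.local_shards:
        # unpadded_data strips the padding added to make shards equal-sized
        print(shard.unpadded_data.shape)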
sharding_typetoXLAShardedTensorto get the ShardingTypewrap_if_shardedto converttorch.TensorintoXLAShardedTensorif the underlying data is sharded.devicesparameter from_get_local_shard_indicesand instead always return the shard indices in the order of the shardsstartbound of the index slices to the tensor's size.unpadded_dataproperty ofXLAShard