
Conversation

zpcore (Member) commented Sep 5, 2025

pytorch-bot bot commented Sep 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162294

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit a7c1e62 with merge base 95a0532:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang (Contributor) commented Sep 6, 2025

Testing plan?

zpcore (Member, Author) commented Sep 8, 2025

> Testing plan?

Yes, let me add tests in AP first and then port them here.

zpcore added a commit that referenced this pull request Sep 11, 2025
ghstack-source-id: e67346b
Pull Request resolved: #162294
zpcore added a commit that referenced this pull request Sep 16, 2025
ghstack-source-id: 0faeebf
Pull Request resolved: #162294
zpcore added a commit that referenced this pull request Sep 24, 2025
zpcore added a commit that referenced this pull request Sep 24, 2025
zpcore added a commit that referenced this pull request Sep 26, 2025
***Not Ready For Review!***

### Summary
Introduce the `AllPermute` collective operation described in https://arxiv.org/pdf/2112.01075, Section 2.6 "Collective operations".

### What is AllPermute?
AllPermute can transform any 𝜏1 into 𝜏2 as long as the two types have matching local and global shapes. For example, given a mesh with axis sizes {X:4, Y:4, Z:16}:
- example 1: [32{X,Y}512, 128] -> [32{Y,X}512, 128]
- example 2: [128{Y}512, 32{X}128] -> [128{X}512, 32{Y}128]
- example 3: [32{X,Y}512, 128] -> [32{Z}512, 128]

Note: the annotation is borrowed from https://arxiv.org/pdf/2112.01075, Section 2.1 "Distributed array types".

### Why do we need AllPermute?
With AllPermute, we can eliminate some AllGather ops during redistribution, which plays an important role in reducing memory overhead. In theory, at most one AllPermute is needed to redistribute from any 𝜏1 to 𝜏2. The `AllPermute` can be performed as the final step, or moved before the last `AllGather` to minimize the amount of data relocated between shards during the `AllPermute`.
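
To make the permutation concrete, here is a minimal CPU-only sketch (not the implementation in this PR) that simulates example 1 with plain tensors. It assumes a row-major mapping of mesh coordinates to shard indices, which is only an illustrative convention:

```python
import torch

X, Y = 4, 4                                   # mesh axis sizes from the example
global_t = torch.arange(512 * 128, dtype=torch.float32).reshape(512, 128)
blocks = list(global_t.chunk(X * Y, dim=0))   # 16 shards, each of shape (32, 128)

# tau1: dim 0 sharded over (X, Y); device (x, y) holds block x*Y + y
# (an assumed row-major convention, used here only for illustration).
tau1 = {(x, y): blocks[x * Y + y] for x in range(X) for y in range(Y)}

# AllPermute: each device forwards its whole shard to the single device that
# owns the same block under tau2, where dim 0 is sharded over (Y, X) and
# device (x, y) holds block y*X + x.
tau2 = {}
for (x, y), shard in tau1.items():
    b = x * Y + y                 # block this device currently holds
    dst = (b % X, b // X)         # the (x2, y2) solving y2*X + x2 == b
    tau2[dst] = shard

# Local shapes are unchanged and the global tensor is still fully recoverable,
# so the redistribution is a pure permutation of whole shards.
assert all(s.shape == (32, 128) for s in tau2.values())
reassembled = torch.cat([tau2[(b % X, b // X)] for b in range(X * Y)], dim=0)
assert torch.equal(reassembled, global_t)
```

In a real multi-device run this exchange would be a single collective in which every rank sends and receives exactly one local shard, instead of gathering the full dimension and re-sharding it.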


Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #163772
* #162294
* #160903
* #160266



cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta ezyang msaroufim dcci

[ghstack-poisoned]
zpcore added a commit that referenced this pull request Sep 26, 2025
zpcore added a commit that referenced this pull request Sep 27, 2025
zpcore added a commit that referenced this pull request Sep 27, 2025
zpcore added a commit that referenced this pull request Sep 27, 2025
zpcore added a commit that referenced this pull request Sep 27, 2025
zpcore added a commit that referenced this pull request Sep 28, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 2, 2025
zpcore added a commit that referenced this pull request Oct 3, 2025
zpcore added a commit that referenced this pull request Oct 3, 2025
zpcore added a commit that referenced this pull request Oct 5, 2025
zpcore added a commit that referenced this pull request Oct 5, 2025
zpcore added a commit that referenced this pull request Oct 6, 2025
zpcore added a commit that referenced this pull request Oct 6, 2025