
Add batch_index_select_dim0 (w/ TBE backend) #1897

Closed

Conversation

sryap
Contributor

@sryap sryap commented Jul 27, 2023

Summary:
Usage:

```
# This target might change in the future
torch.ops.load_library("//deeplearning/fbgemm/fbgemm_gpu/codegen:index_select_ops")

...

output = torch.ops.fbgemm.batch_index_select_dim0(
            inputs, # Tensor - 1D tensor (concatenated flattened inputs)
            indices, # Tensor - 1D tensor (concatenated indices)
            input_num_indices, # List[int]
            input_rows, # List[int]
            input_columns, # List[int]
         )
```

Differential Revision: D46084590

@netlify

netlify bot commented Jul 27, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

🔨 Latest commit: e5ee9d2
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/64c9454e8a817f00075d41cd

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D46084590

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 27, 2023
fbshipit-source-id: a5ec8c2d45ae39d5eb79b61a8263e112276de50f
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
Summary:
Pull Request resolved: pytorch#1897

This diff introduces `batch_index_select_dim0` using the `SplitTBE`
implementation (it shares the same code generator as TBE).  The new
operator is designed to address limitations of
`group_index_select_dim0`.  Both operators are designed to operate on
multiple inputs.  However, `batch_index_select_dim0` requires all
inputs to be contiguous in memory, while `group_index_select_dim0` can
operate on inputs with a discrete memory layout.  Implementation-wise,
they are different.  We plan to merge their backends in the future.

Since `batch_index_select_dim0` is backed by TBE, it inherits TBE's
limitations, including:
- The column sizes must be a multiple of 4 and must not exceed 1024.
  Moreover, the underlying buffer of the inputs tensor must be 16-byte
  aligned, because the TBE kernel uses vector loads/stores that require
  16-byte alignment.  The kernel will raise an error if this assumption
  is violated.
- Due to this 16-byte alignment requirement, during the backward pass,
  if the output gradient is not 16-byte aligned, the operator will copy
  the output gradient into a new 16-byte-aligned buffer.  This can be
  expensive if the output gradient is large.
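The constraints above can be sketched as a small validation helper. This is an illustration only (`check_tbe_constraints` is a hypothetical name, not part of FBGEMM), assuming 4-byte elements so that a column size that is a multiple of 4 yields 16-byte rows:

```python
# Hypothetical helper illustrating the TBE-inherited constraints above:
# every column size must be a multiple of 4 and at most 1024, and the
# flattened inputs buffer must start at a 16-byte-aligned address.

def check_tbe_constraints(buffer_addr: int, input_columns: list) -> None:
    """Raise ValueError if the inputs would violate the TBE kernel's assumptions."""
    for cols in input_columns:
        if cols % 4 != 0 or cols > 1024:
            raise ValueError(
                f"column size {cols} must be a multiple of 4 and <= 1024")
    # The vector load/store path transfers 16 bytes at a time, so the
    # base address of the inputs buffer must be 16-byte aligned.
    if buffer_addr % 16 != 0:
        raise ValueError(
            f"inputs buffer at {buffer_addr:#x} is not 16-byte aligned")

check_tbe_constraints(0x7F0000000000, [4, 128, 1024])  # satisfies both rules
```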

Usage:

```
# This target might change in the future
torch.ops.load_library("//deeplearning/fbgemm/fbgemm_gpu/codegen:index_select_ops")

...

output = torch.ops.fbgemm.batch_index_select_dim0(
            inputs, # Tensor - 1D tensor (concatenated flattened inputs)
            indices, # Tensor - 1D tensor (concatenated indices)
            input_num_indices, # List[int]
            input_rows, # List[int]
            input_columns, # List[int]
         )
```
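To make the argument layout concrete, here is a pure-Python sketch of the operator's semantics (a reference illustration of how the flattened arguments are interpreted, not the FBGEMM/TBE implementation; `batch_index_select_dim0_ref` is a made-up name):

```python
# Reference semantics of batch_index_select_dim0: select rows (dim 0)
# from each of B row-major matrices that have been flattened and
# concatenated into one 1D buffer.

def batch_index_select_dim0_ref(inputs, indices, input_num_indices,
                                input_rows, input_columns):
    output = []
    in_off = 0   # element offset of the current input within `inputs`
    idx_off = 0  # offset of the current input's indices within `indices`
    for n_idx, rows, cols in zip(input_num_indices, input_rows, input_columns):
        for i in indices[idx_off: idx_off + n_idx]:
            row_start = in_off + i * cols
            output.extend(inputs[row_start: row_start + cols])
        in_off += rows * cols
        idx_off += n_idx
    return output

# Two inputs: a 3x2 matrix and a 2x2 matrix, flattened and concatenated.
inputs = [0, 1, 2, 3, 4, 5,    # input 0: rows [0,1], [2,3], [4,5]
          10, 11, 12, 13]      # input 1: rows [10,11], [12,13]
out = batch_index_select_dim0_ref(
    inputs, indices=[2, 0, 1], input_num_indices=[2, 1],
    input_rows=[3, 2], input_columns=[2, 2])
# out == [4, 5, 0, 1, 12, 13]
```

Rows 2 and 0 of the first input are followed by row 1 of the second, matching the concatenated-output convention described above.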

Differential Revision: D46084590

fbshipit-source-id: 96fb152d0270e2d09127fbaab349b5ac02068bcb
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: 69df7e36784e77bc4d06cec2e9aba1fb59587e42
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: 1b8a94d7eb886337c8764751e262df7016cbf7dc
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: 190a7bc19a205837da91f268b078a34f145c3273
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: ef02e43cff10311b29bff3d351839ac9fde13ddf
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: 3c36b52615176b8941f24682e993f63c564942d9
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: 59f99f5c2bc5c5424205bd668a6c7777ecf53f7b
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: 6225ae90362ca478b6c3120febb1dcf93291a3d6
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: 0d2e55c7150cf678a5feb1569beb4ab448565916
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: 6dd9d2bc6cf86832b301a855337a28fed76d748a
sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
fbshipit-source-id: 92a7b39c9e0ae6fe9dced906376e294efc6a6bf7
sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 1, 2023
Reviewed By: jianyuh
fbshipit-source-id: 04d4487c277e1d164e669af9bb71b4f6d19c1460
sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 1, 2023
fbshipit-source-id: 00a912f39e0b9f92c8c22b3a4b5ca26e0981d858
sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 1, 2023
fbshipit-source-id: eeb58b27c77ebcacba6ad27465c67feb673b4bbe
@facebook-github-bot
Contributor

This pull request has been merged in 410d264.
