Add optimized TBE training forward #1641
This pull request was exported from Phabricator. Differential Revision: D43634651
Summary: Pull Request resolved: pytorch#1641

This diff adds an optimized implementation of the TBE training forward pass, namely `split_embedding_codegen_forward_[weighted|unweighted]_v2_kernel`. The implementation currently supports only a subset of TBE use cases, including:

- Split TBE (`SplitTableBatchedEmbeddingBagsCodegen`)
- Pooled TBE (`pooling_mode`: `PoolingMode.SUM`, `PoolingMode.MEAN`)
- Weighted and unweighted TBE (`per_sample_weights`: `Tensor`, `None`)
- FP32 and FP16 weight types (`weights_precision`: `SparseType.FP32`, `SparseType.FP16`)
- FP32 and FP16 output types (`output_dtype`: `SparseType.FP32`, `SparseType.FP16`)
- Device, managed, and managed-caching embedding locations (`EmbeddingLocation`: `EmbeddingLocation.DEVICE`, `EmbeddingLocation.MANAGED`, `EmbeddingLocation.MANAGED_CACHING`)

Cases that the new implementation does **NOT** support:

- Dense TBE (`DenseTableBatchedEmbeddingBagsCodegen`)
- Sequence TBE (`pooling_mode`: `PoolingMode.NONE`)
- FP8, INT8, INT4, INT2, and BF16 weight types (`weights_precision`: `SparseType.FP8`, `SparseType.INT8`, `SparseType.INT4`, `SparseType.INT2`, `SparseType.BF16`)
- FP8, INT8, INT4, INT2, and BF16 output types (`output_dtype`: `SparseType.FP8`, `SparseType.INT8`, `SparseType.INT4`, `SparseType.INT2`, `SparseType.BF16`)
- Host embedding locations (`EmbeddingLocation`: `EmbeddingLocation.HOST`)

Note that this optimization is enabled for NVIDIA GPUs, but **not** for AMD GPUs.

**Usage**

The frontend changes are in D44479772.

The `FBGEMM_EXPERIMENTAL_TBE` environment variable flag is added for enabling/disabling the new implementation at runtime. If `FBGEMM_EXPERIMENTAL_TBE` is not set (the default), TBE uses the original implementation. If `FBGEMM_EXPERIMENTAL_TBE=1`, TBE uses the new implementation; for use cases that the new implementation does not support, TBE falls back to the original implementation.
The new implementation can also be enabled by passing `use_experimental_tbe=True` when instantiating the TBE operator:

```
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=...,
    ...,
    use_experimental_tbe=True,
)
```

**Optimization**

The new implementation contains the following optimizations:

- Use multiple warps per bag for D > 128 to maintain a constant number of registers per thread
- Use subwarps to process subsets of input rows in a bag if D < 128
- Cooperatively compute weight pointers and store them in shared memory
- Save state variables in shared memory instead of registers to free up registers for compiler optimizations
- Use the upper-bound number of warps for all tables to avoid complex warp-offset computation
- Process multiple samples (up to kWarpSize samples) in a warp for small Ls

Note: D = embedding dimension, L = pooling factor

Reviewed By: jianyuh

Differential Revision: D43634651

fbshipit-source-id: 96ad56f0e5567959fd28c72a649f862e1f5dd307
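The first two optimizations above come down to simple sizing arithmetic. The sketch below illustrates it under assumed constants (kWarpSize = 32 and 4 embedding elements per thread, so one warp covers up to 128 elements per pass); the actual kernel heuristics may differ:

```python
K_WARP_SIZE = 32        # threads per warp (assumption)
ELEMS_PER_THREAD = 4    # embedding elements handled per thread (assumption)
ELEMS_PER_WARP = K_WARP_SIZE * ELEMS_PER_THREAD  # 128

def warps_per_bag(D: int) -> int:
    # D > 128: split one bag across multiple warps so the per-thread
    # register count stays constant regardless of D.
    return max(1, -(-D // ELEMS_PER_WARP))  # ceiling division

def rows_per_subwarp_pass(D: int) -> int:
    # D < 128: split the warp into subwarps, each processing a different
    # input row of the same bag concurrently.
    threads_per_row = max(1, -(-D // ELEMS_PER_THREAD))
    return K_WARP_SIZE // threads_per_row
```

For example, under these assumptions a bag with D = 256 would use 2 warps, while D = 64 would let each warp process 2 rows per pass.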
This pull request was exported from Phabricator. Differential Revision: D43634651 |
Summary: Pull Request resolved: pytorch#1641 This diff adds an optimized implementation of TBE training forward, namely `split_embedding_codegen_forward_[weighted|unweighted]_v2_kernel`. The implementation currently supports only a subset of usecases of TBE including: - Split TBE (`SplitTableBatchedEmbeddingBagsCodegen`) - Pooled TBE (`pooling_mode`: `PoolingMode.SUM`, `PoolingMode.MEAN`) - Weighted and unweighted TBE (`per_sample_weights`: `Tensor`, `None`) - FP32 and FP16 weight types (`weights_precision`: `SparseType.FP32`, `SparseType.FP16`) - FP32 and FP16 output types (`output_dtype`: `SparseType.FP32`, `SparseType.FP16`) - Device, manged, managed caching embedding locations (`EmbeddingLocation`: `EmbeddingLocation.DEVICE`, `EmbeddingLocation.MANAGED`, `EmbeddingLocation.MANAGED_CACHING`) Cases that the new implementation does **NOT** support: - Dense TBE (`DenseTableBatchedEmbeddingBagsCodegen`) - Sequence TBE (`pooling_mode`: `PoolingMode.NONE`) - FP8, INT8, INT4, INT2, and BF16 weight types (`weights_precision`: `SparseType.FP8`, `SparseType.INT8`, `SparseType.INT4`, `SparseType.INT2`, `SparseType.BF16`) - FP8, INT8, INT4, INT2, and BF16 output types (`weights_precision`: `SparseType.FP8`, `SparseType.INT8`, `SparseType.INT4`, `SparseType.INT2`, `SparseType.BF16`) - Host embedding locations (`EmbeddingLocation`: `EmbeddingLocation.HOST`) The `IS_EXPERIMENTAL` environment variable flag is added for enabling/disabling the new implementation at runtime. If `IS_EXPERIMENTAL` is not set, TBE will use the orignal implementation. If `IS_EXPERIMENTAL=1`, TBE will use the new implementation. If the TBE usecases are not supported in the new implementation, TBE will fall back to the original implementation. By default, `IS_EXPERIMENTAL` is not set. 
The new implementation contains the following optimizations: - Use multiple warps per bag for D > 128 to maintain a constant number of registers per thread - Use subwarps to process subsets of input rows in a bag if D < 128 - Cooperatively compute weight pointers and store them in shared memory - Save state variables in shared memory instead of registers to free registers for compiler optimizations - Use the upper bound number of warps for all tables to avoid complex warp offset computation - Process multiple samples (up to kWarpSize samples) in a warp for small Ls Note: D = embedding dimension, L = pooling factor Differential Revision: D43634651 fbshipit-source-id: 42fea6790c0fef1e60bae3d57c247ca61da46ec0
Summary: Pull Request resolved: pytorch#1641

This diff adds an optimized implementation of the TBE training forward pass, namely `split_embedding_codegen_forward_[weighted|unweighted]_v2_kernel`. The implementation currently supports only a subset of TBE use cases, including:
- Split TBE (`SplitTableBatchedEmbeddingBagsCodegen`)
- Pooled TBE (`pooling_mode`: `PoolingMode.SUM`, `PoolingMode.MEAN`)
- Weighted and unweighted TBE (`per_sample_weights`: `Tensor`, `None`)
- FP32 and FP16 weight types (`weights_precision`: `SparseType.FP32`, `SparseType.FP16`)
- FP32 and FP16 output types (`output_dtype`: `SparseType.FP32`, `SparseType.FP16`)
- Device, managed, and managed-caching embedding locations (`EmbeddingLocation`: `EmbeddingLocation.DEVICE`, `EmbeddingLocation.MANAGED`, `EmbeddingLocation.MANAGED_CACHING`)

Cases that the new implementation does **NOT** support:
- Dense TBE (`DenseTableBatchedEmbeddingBagsCodegen`)
- Sequence TBE (`pooling_mode`: `PoolingMode.NONE`)
- FP8, INT8, INT4, INT2, and BF16 weight types (`weights_precision`: `SparseType.FP8`, `SparseType.INT8`, `SparseType.INT4`, `SparseType.INT2`, `SparseType.BF16`)
- FP8, INT8, INT4, INT2, and BF16 output types (`output_dtype`: `SparseType.FP8`, `SparseType.INT8`, `SparseType.INT4`, `SparseType.INT2`, `SparseType.BF16`)
- Host embedding locations (`EmbeddingLocation`: `EmbeddingLocation.HOST`)

Note that this optimization is enabled for NVIDIA GPUs, but **not** for AMD GPUs.

**Usage**

The frontend changes are in D44479772.

The `FBGEMM_EXPERIMENTAL_TBE` environment variable flag is added for enabling/disabling the new implementation at runtime. If `FBGEMM_EXPERIMENTAL_TBE` is not set, TBE uses the original implementation. If `FBGEMM_EXPERIMENTAL_TBE=1`, TBE uses the new implementation; for use cases the new implementation does not support, TBE falls back to the original implementation. By default, `FBGEMM_EXPERIMENTAL_TBE` is not set.

This can also be enabled by passing `use_experimental_tbe=True` when instantiating the TBE operator:

```
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=...,
    ...,
    use_experimental_tbe=True,
)
```

**Optimization**

The new implementation contains the following optimizations:
- Use multiple warps per bag for D > 128 to maintain a constant number of registers per thread
- Use subwarps to process subsets of input rows in a bag if D < 128
- Cooperatively compute weight pointers and store them in shared memory
- Save state variables in shared memory instead of registers to free registers for compiler optimizations
- Use the upper bound number of warps for all tables to avoid complex warp offset computation
- Process multiple samples (up to `kWarpSize` samples) in a warp for small Ls

Note: D = embedding dimension, L = pooling factor

Reviewed By: jianyuh

Differential Revision: D43634651

fbshipit-source-id: 64d0d0752fc2689dae75ea1064a7c80551d3a15f
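The runtime gating described above (environment flag or constructor argument, with a fall-back for unsupported use cases) can be sketched as follows; this is an illustrative model of the dispatch rule, not FBGEMM's actual code, and `should_use_experimental_tbe` and its parameters are hypothetical names:

```python
import os

def should_use_experimental_tbe(use_experimental_tbe: bool,
                                case_supported: bool) -> bool:
    """Sketch of the dispatch rule: the v2 kernel runs only when
    explicitly enabled -- via the FBGEMM_EXPERIMENTAL_TBE env var or the
    use_experimental_tbe constructor argument -- AND the use case is one
    the new implementation supports; otherwise the original kernel runs."""
    env_enabled = os.environ.get("FBGEMM_EXPERIMENTAL_TBE") == "1"
    return (env_enabled or use_experimental_tbe) and case_supported
```

Note that enabling the flag alone is not sufficient: an unsupported case (e.g. `PoolingMode.NONE` or an INT8 weight type) still silently falls back to the original implementation.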
This pull request has been merged in d1c4a6f.
@liligwu FYI, we currently disable this functionality on ROCm due to various compilation errors. This is the optimized table batched embedding implementation. It is not currently used by default, but this might change in the future; we are considering replacing the old implementation with the new one.
Hi @sryap, thank you for letting us know about the changes.