Add variable batch size support to TBE training #1752
Conversation
This pull request was exported from Phabricator. Differential Revision: D42663369
Summary: Pull Request resolved: pytorch#1752

This diff adds support for variable batch size (also called variable length) in split TBE training on GPU; the extension is called "VBE". VBE is enabled for the following use case:

- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer == OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: this feature is enabled only for this specific use case in order to keep the binary size of the FBGEMM library within limits.

**Usage:**

```python
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ...  # other params
)

# Batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1, 2, 8, 3],   # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg,
# because forward takes other keyword args as well. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:** `T` and `max_B` have to fit in 32 bits together.

- The lower `info_B_num_bits` bits store `b` (bag ID; `b` < `max_B`). Supported `max_B` = `2^info_B_num_bits`.
- The upper `32 - info_B_num_bits` bits store `t` (table ID; `t` < `T`). Supported `T` = `2^(32 - info_B_num_bits)`.

Note that `info_B_num_bits` is adjusted automatically at runtime based on `max_B` and `T`. If they cannot fit into 32 bits together, the operator aborts.

Reviewed By: jianyuh

Differential Revision: D42663369

fbshipit-source-id: d613b0a9ced838e3ae8b421a1e5a30de8b158e69
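For concreteness, here is a small sketch (not from the diff itself) of how the example batch sizes above translate into sizes, assuming the standard TBE convention that `offsets` holds one entry per bag plus a final terminator, and reading `max_B` as the largest per-feature total batch size:

```python
batch_size_per_feature_per_rank = [
    [1, 2, 8, 3],   # Feature 0
    [6, 10, 3, 5],  # Feature 1
]

# Total batch size (number of bags) per feature: [14, 24].
total_B_per_feature = [sum(sizes) for sizes in batch_size_per_feature_per_rank]

# Total number of bags across all features: 38, so `offsets` would have
# 38 + 1 = 39 entries under the usual TBE offsets convention.
total_B = sum(total_B_per_feature)
num_offsets = total_B + 1

# One plausible reading of `max_B` from the limitation above: the largest
# per-feature total batch size (24 here).
max_B = max(total_B_per_feature)
```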
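The 32-bit limitation above comes from packing the bag ID and the table ID into a single 32-bit word. Below is a minimal sketch of that packing, written for illustration rather than copied from the kernels (the function names are made up):

```python
def pack_info(t: int, b: int, info_B_num_bits: int) -> int:
    """Pack table ID t (upper bits) and bag ID b (lower bits) into 32 bits."""
    assert b < (1 << info_B_num_bits), "bag ID must be < max_B"
    assert t < (1 << (32 - info_B_num_bits)), "table ID must be < supported T"
    return (t << info_B_num_bits) | b

def unpack_info(info: int, info_B_num_bits: int) -> tuple[int, int]:
    """Recover (t, b) from a packed 32-bit info word."""
    b = info & ((1 << info_B_num_bits) - 1)
    t = info >> info_B_num_bits
    return t, b
```

With `info_B_num_bits = 26`, for example, this supports `max_B = 2^26` bags and `T = 2^6 = 64` tables; shrinking `info_B_num_bits` trades bag capacity for table capacity, which is why the value is adjusted at runtime.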
This pull request has been merged in 05bf018.