
Add variable batch size support to TBE training #1752

Closed
wants to merge 1 commit

Conversation

sryap
Contributor

@sryap commented May 5, 2023

Summary:
This diff adds variable batch size (or variable length) support to split TBE training on GPU.

Usage:

```
# Initialize TBE the same way as before.
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ...  # other params
)

# Batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass batch_size_per_feature_per_rank (a list of per-rank batch sizes for each feature) to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg, because forward takes other keyword args as well. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

Output format

{F967393126}

Limitation:

`T` and `max_B` have to fit in 32 bits.

  • We use the lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`). Supported `max_B` = `2^info_B_num_bits`.
  • We use the upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`). Supported `T` = `2^(32 - info_B_num_bits)`.

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`. If they cannot both fit into 32 bits, the operator aborts.
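
To make the packing scheme concrete, here is a small illustrative sketch (in Python, not the actual FBGEMM kernel code) of how a table ID and a bag ID can share one 32-bit value and how `info_B_num_bits` could be chosen from `max_B` and `T`; the helper names and the exact adjustment policy are assumptions for illustration only.

```
def choose_info_B_num_bits(max_B, T):
    # Smallest field widths that can represent every b < max_B and every t < T.
    b_bits = max(1, (max_B - 1).bit_length())
    t_bits = max(1, (T - 1).bit_length())
    if b_bits + t_bits > 32:
        raise RuntimeError("T and max_B cannot both fit in 32 bits")
    # Give the bag-ID field every bit that the table-ID field does not need.
    return 32 - t_bits

def pack_info(t, b, info_B_num_bits):
    # Upper bits hold the table ID t; lower info_B_num_bits bits hold the bag ID b.
    return (t << info_B_num_bits) | b

def unpack_info(info, info_B_num_bits):
    b = info & ((1 << info_B_num_bits) - 1)
    t = info >> info_B_num_bits
    return t, b

# Example: T = 2 tables; assuming max_B is the largest per-feature batch size
# summed over ranks, the usage above gives max_B = max(1+2+8+3, 6+10+3+5) = 24.
bits = choose_info_B_num_bits(max_B=24, T=2)
info = pack_info(t=1, b=23, info_B_num_bits=bits)
assert unpack_info(info, bits) == (1, 23)
```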

Differential Revision: D42663369

@netlify

netlify bot commented May 5, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

🔨 Latest commit: dbff94e
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6476e964b5e0c200080d7e50

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D42663369

sryap subsequently added commits to sryap/FBGEMM that referenced this pull request on May 5, May 10, May 16, May 17, May 19, and May 31, 2023. Each of these commits was exported from Phabricator (Differential Revision: D42663369) and carried a version of the summary above; the final commit message is reproduced below.

Summary:
Pull Request resolved: pytorch#1752

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case (see the initialization sketch after the important note below):
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.
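
As a concrete illustration of that configuration, the sketch below initializes a TBE op with the supported pooled / rowwise-Adagrad setup. The parameter names follow the public `SplitTableBatchedEmbeddingBagsCodegen` API, but the table specs, hyperparameters, and import paths are assumptions for illustration and may differ between FBGEMM versions.

```
from fbgemm_gpu.split_table_batched_embeddings_ops import (
    ComputeDevice,
    EmbeddingLocation,
    OptimType,
    PoolingMode,
    SplitTableBatchedEmbeddingBagsCodegen,
)

emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    # One (num_embeddings, embedding_dim, location, compute_device) tuple per table.
    embedding_specs=[
        (1000, 128, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
        (2000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
    ],
    optimizer=OptimType.EXACT_ROWWISE_ADAGRAD,  # the optimizer VBE is enabled for
    pooling_mode=PoolingMode.SUM,               # VBE requires pooled output
    learning_rate=0.01,
    eps=1.0e-8,
)
```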

**Usage:**

```
# Initialize TBE the same way as before.
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ...  # other params
)

# Batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass batch_size_per_feature_per_rank (a list of per-rank batch sizes for each feature) to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg, because forward takes other keyword args as well. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`. If they cannot both fit into 32 bits, the operator aborts.

Reviewed By: jianyuh

Differential Revision: D42663369

fbshipit-source-id: d613b0a9ced838e3ae8b421a1e5a30de8b158e69

@facebook-github-bot
Contributor

This pull request has been merged in 05bf018.
