
Add variable length (batch size) support to TBE training #1653

Closed
wants to merge 1 commit

Conversation

sryap
Contributor

@sryap commented on Mar 20, 2023

Summary:
This diff adds variable length (variable batch size) support to split TBE training on GPU.

Usage:

# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
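For reference, here is a minimal sketch (not part of this diff; it assumes the usual CSR-style TBE input layout) of how indices and offsets can be sized to match the per-feature batch sizes above:

import torch

Bs = [2, 3, 4, 5]     # per-feature batch sizes, as above
total_B = sum(Bs)     # total number of bags across all features

# One pooling factor (bag length) per bag; the values here are arbitrary.
lengths = torch.randint(0, 4, (total_B,))

# offsets is expected to have total_B + 1 entries: one boundary per bag plus a trailing end.
offsets = torch.zeros(total_B + 1, dtype=torch.long)
offsets[1:] = torch.cumsum(lengths, dim=0)

# indices holds all looked-up rows, lengths.sum() in total.
indices = torch.randint(0, 100, (int(lengths.sum()),))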

Output

{F854479754}

Limitation:

T and max_B together have to fit in 32 bits.

  • The lower info_B_num_bits bits store b (bag ID; b < max_B), so the supported max_B is 2^info_B_num_bits.
  • The upper 32 - info_B_num_bits bits store t (table ID; t < T), so the supported T is 2^(32 - info_B_num_bits).

Note that info_B_num_bits is adjusted automatically at runtime based on max_B and T; if t and b cannot both fit into 32 bits, the operation aborts.
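To make the bit layout concrete, here is a small illustrative sketch (not FBGEMM's internal helper) of packing a (table ID, bag ID) pair into, and unpacking it from, a single 32-bit info value:

def pack_info(t, b, info_B_num_bits):
    # Lower info_B_num_bits bits hold b (bag ID); upper 32 - info_B_num_bits bits hold t (table ID).
    assert b < (1 << info_B_num_bits)
    assert t < (1 << (32 - info_B_num_bits))
    return (t << info_B_num_bits) | b

def unpack_info(info, info_B_num_bits):
    b = info & ((1 << info_B_num_bits) - 1)
    t = info >> info_B_num_bits
    return t, b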

Differential Revision: D43259020

@netlify

netlify bot commented Mar 20, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

🔨 Latest commit: f413fdc
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/64646a42d2c2ba0008906934

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 21, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Differential Revision: D43259020

fbshipit-source-id: ac5950387d2908ab15f09d50c8ffeec483da5047
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 21, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Differential Revision: D43259020

fbshipit-source-id: 603a4fc3851ececce5eccf957df58dea9de121a1
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 27, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: f353e3c86bf873d2f999a21d27be5eff646da682
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 27, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 1b922366319f929a782044d45b5bbff796a58756
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020


sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 27, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 614b92e9dbf66286afb645e7b90800f85c922816
sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 27, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 3302c540a0f4227ae6299442e28328248a89ddf1
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020


sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 29, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 91a48a40ce2b9a4b427294ab6a14937dc2a6cfcb
sryap added a commit to sryap/FBGEMM that referenced this pull request May 2, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```
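For context, a rough sketch (an assumption about the usual CSR-style TBE input layout, not code from this diff) of how the flattened `indices`/`offsets` relate to these per-feature, per-rank batch sizes:

```python
import torch

batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # Feature 0: batch sizes for Ranks 0-3
    [6, 10, 3, 5],  # Feature 1: batch sizes for Ranks 0-3
]

# Total number of bags across all features and ranks: 14 + 24 = 38.
total_B = sum(sum(per_rank) for per_rank in batch_size_per_feature_per_rank)

# Assuming the usual CSR-style layout, `offsets` has total_B + 1 entries and
# `indices` holds sum(lengths) looked-up rows; the values below are arbitrary.
lengths = torch.randint(0, 4, (total_B,))
offsets = torch.zeros(total_B + 1, dtype=torch.long)
offsets[1:] = torch.cumsum(lengths, dim=0)
indices = torch.randint(0, 100, (int(lengths.sum()),))
```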

**Output format**

{F967393126}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: dc9f88e62086bf335f1662a56ed4c10c2fdcbe0c
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 5, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F967393126}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: cb62702ad49b8380bd120a33617a129708fdbc29
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F967393126}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 6577637feb35c6473f2708fffa50e71ef8dbff9c
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F967393126}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 1dbe4830e72826e2846b596926387ea12ee08a71
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case (a configuration sketch follows the list):
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 1f29816aab4ea7005bdd7da18940fd1c1aeba511
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 5b5f26c481da4193412c22ae6e2870fc7bf8ffcb
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 9cec581e56059c328adcade7870636706659d695
sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: dd11698ab91f747bff148b18e28083ffe20f0bd5
sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 9702b63511a91e8beabd7b9ce56f627dfdd7282a
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 3b82b6f6015a208273aab18ebc861f0ec27d7707
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 4b801c6b419096d1b1a6570b3696a18b6ae24ab7
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 7a635d25962dd33fe7a52767b64978850d696380
sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: a185c20af972e76195e1a844141a440f1f734290
@facebook-github-bot
Contributor

This pull request has been merged in f46904e.
