
Add variable length (batch size) support to TBE training #1653

Closed
wants to merge 1 commit

Conversation

sryap
Contributor

@sryap commented on Mar 20, 2023

Summary:
This diff adds variable length (variable batch size) support to split TBE training on GPU.

Usage:

# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
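For reference, here is a minimal sketch (not part of this diff; it assumes the usual CSR-style TBE input layout) of how indices and offsets can be sized to match the per-feature batch sizes above:

import torch

Bs = [2, 3, 4, 5]     # per-feature batch sizes, as above
total_B = sum(Bs)     # total number of bags across all features

# One pooling factor (bag length) per bag; the values here are arbitrary.
lengths = torch.randint(0, 4, (total_B,))

# offsets is expected to have total_B + 1 entries: one boundary per bag plus a trailing end.
offsets = torch.zeros(total_B + 1, dtype=torch.long)
offsets[1:] = torch.cumsum(lengths, dim=0)

# indices holds all looked-up rows, lengths.sum() in total.
indices = torch.randint(0, 100, (int(lengths.sum()),))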

Output

{F854479754}

Limitation:

T and max_B together have to fit in 32 bits.

  • The lower info_B_num_bits bits store b (bag ID; b < max_B), so the supported max_B is 2^info_B_num_bits.
  • The upper 32 - info_B_num_bits bits store t (table ID; t < T), so the supported T is 2^(32 - info_B_num_bits).

Note that info_B_num_bits is adjusted automatically at runtime based on max_B and T; if t and b cannot both fit into 32 bits, the operation aborts.
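To make the bit layout concrete, here is a small illustrative sketch (not FBGEMM's internal helper) of packing a (table ID, bag ID) pair into, and unpacking it from, a single 32-bit info value:

def pack_info(t, b, info_B_num_bits):
    # Lower info_B_num_bits bits hold b (bag ID); upper 32 - info_B_num_bits bits hold t (table ID).
    assert b < (1 << info_B_num_bits)
    assert t < (1 << (32 - info_B_num_bits))
    return (t << info_B_num_bits) | b

def unpack_info(info, info_B_num_bits):
    b = info & ((1 << info_B_num_bits) - 1)
    t = info >> info_B_num_bits
    return t, b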

Differential Revision: D43259020

@netlify

netlify bot commented Mar 20, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

🔨 Latest commit: f413fdc
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/64646a42d2c2ba0008906934

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 21, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Differential Revision: D43259020

fbshipit-source-id: ac5950387d2908ab15f09d50c8ffeec483da5047
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 21, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Differential Revision: D43259020

fbshipit-source-id: 603a4fc3851ececce5eccf957df58dea9de121a1
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 27, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: f353e3c86bf873d2f999a21d27be5eff646da682
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 27, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 1b922366319f929a782044d45b5bbff796a58756
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020


sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 27, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 614b92e9dbf66286afb645e7b90800f85c922816
sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 27, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable length (or variable batch size) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 3302c540a0f4227ae6299442e28328248a89ddf1
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020


sryap added a commit to sryap/FBGEMM that referenced this pull request Mar 29, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE).
# If `feature_table_map` is None, `len(Bs)` must equal `len(embedding_specs)`
# If `feature_table_map` is not None, `len(Bs)` must equal `len(feature_table_map)`
Bs = [2, 3, 4, 5]

# Pass a list of batch_sizes to forward.
# !! Make sure to pass batch_sizes as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_sizes=Bs)
```

**Output**

{F854479754}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 91a48a40ce2b9a4b427294ab6a14937dc2a6cfcb
sryap added a commit to sryap/FBGEMM that referenced this pull request May 2, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```
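For context, a rough sketch (an assumption about the usual CSR-style TBE input layout, not code from this diff) of how the flattened `indices`/`offsets` relate to these per-feature, per-rank batch sizes:

```python
import torch

batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # Feature 0: batch sizes for Ranks 0-3
    [6, 10, 3, 5],  # Feature 1: batch sizes for Ranks 0-3
]

# Total number of bags across all features and ranks: 14 + 24 = 38.
total_B = sum(sum(per_rank) for per_rank in batch_size_per_feature_per_rank)

# Assuming the usual CSR-style layout, `offsets` has total_B + 1 entries and
# `indices` holds sum(lengths) looked-up rows; the values below are arbitrary.
lengths = torch.randint(0, 4, (total_B,))
offsets = torch.zeros(total_B + 1, dtype=torch.long)
offsets[1:] = torch.cumsum(lengths, dim=0)
indices = torch.randint(0, 100, (int(lengths.sum()),))
```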

**Output format**

{F967393126}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: dc9f88e62086bf335f1662a56ed4c10c2fdcbe0c
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 5, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F967393126}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: cb62702ad49b8380bd120a33617a129708fdbc29
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F967393126}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 6577637feb35c6473f2708fffa50e71ef8dbff9c
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds the variable batch size (or variable length) support in split TBE training on GPU.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F967393126}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 1dbe4830e72826e2846b596926387ea12ee08a71
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 10, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case (a configuration sketch follows the list):
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 1f29816aab4ea7005bdd7da18940fd1c1aeba511
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 5b5f26c481da4193412c22ae6e2870fc7bf8ffcb
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 9cec581e56059c328adcade7870636706659d695
sryap added a commit to sryap/FBGEMM that referenced this pull request May 16, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: dd11698ab91f747bff148b18e28083ffe20f0bd5
sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 9702b63511a91e8beabd7b9ce56f627dfdd7282a
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 3b82b6f6015a208273aab18ebc861f0ec27d7707
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 4b801c6b419096d1b1a6570b3696a18b6ae24ab7
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D43259020

sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: 7a635d25962dd33fe7a52767b64978850d696380
sryap added a commit to sryap/FBGEMM that referenced this pull request May 17, 2023
Summary:
Pull Request resolved: pytorch#1653

This diff adds support for variable batch size (or variable length) in
split TBE training on GPU (the extension is called "VBE").

VBE is enabled for the following use case:
- split (`SplitTableBatchedEmbeddingBagsCodegen`), and
- pooled (`pooling_mode != PoolingMode.NONE`), and
- weighted/unweighted, and
- rowwise Adagrad optimizer (`optimizer ==
  OptimType.EXACT_ROWWISE_ADAGRAD`)

Important note: This feature is enabled for a specific use case in
order to keep the binary size of the FBGEMM library within limits.

This diff adds ~40 MB to the library size.

**Usage:**

```
# Initialize TBE the same as before
emb_op = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[...],
    ... # other params
)

# batch sizes (one for each FEATURE and each RANK).
# Example: num_features = 2, num_ranks = 4
batch_size_per_feature_per_rank = [
    [1,  2, 8, 3],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 0
    [6, 10, 3, 5],  # batch sizes for [Rank 0, Rank 1, Rank 2, Rank 3] in Feature 1
]

# Pass a list of batch_size_per_feature_per_rank to forward.
# !! Make sure to pass batch_size_per_feature_per_rank as a keyword arg because there can be other keyword args in forward. !!
output = emb_op(indices, offsets, batch_size_per_feature_per_rank=batch_size_per_feature_per_rank)
```

**Output format**

{F982891369}

**Limitation:**

`T` and `max_B` have to fit in 32 bits.
- We use lower `info_B_num_bits` bits to store `b` (bag ID; `b` < `max_B`).  Supported `max_B` = `2^info_B_num_bits`
- We use upper `32 - info_B_num_bits` bits to store `t` (table ID; `t` < `T`).  Supported `T` = `2^(32 - info_B_num_bits)`

Note that we adjust `info_B_num_bits` automatically at runtime based on `max_B` and `T`.  If they cannot fit into 32 bits, it will abort.

Reviewed By: jianyuh

Differential Revision: D43259020

fbshipit-source-id: a185c20af972e76195e1a844141a440f1f734290
@facebook-github-bot
Contributor

This pull request has been merged in f46904e.
