
Releases: pytorch/torchrec

v0.8.0

23 Jul 18:00

New Features

In Training Embedding Pruning (ITEP) for more efficient RecSys training

Introduces In-Training Embedding Pruning (ITEP), used internally at Meta to make RecSys training more efficient by decreasing the memory footprint of embedding tables. Pull request #2074 adds the modules to TorchRec, with tests showing how to use them.
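The core idea described above can be illustrated with a toy sketch. This is a hypothetical illustration, not the ITEP modules from #2074: track how often each embedding row is accessed during training, then keep only the hottest rows within a memory budget, remapping pruned ids to a shared fallback row.

```python
from collections import Counter

# Toy sketch of in-training pruning (hypothetical, not the ITEP modules
# from #2074): track row access frequency and keep only the hottest rows,
# remapping pruned ids to a shared slot to shrink the table's footprint.

class PrunableTable:
    def __init__(self, num_rows):
        self.counts = Counter()
        self.keep = set(range(num_rows))   # rows still materialized
        self.shared_slot = -1              # fallback row for pruned ids

    def lookup(self, row):
        self.counts[row] += 1
        return row if row in self.keep else self.shared_slot

    def prune(self, budget):
        """Keep only the `budget` most frequently accessed rows."""
        self.keep = {r for r, _ in self.counts.most_common(budget)}

table = PrunableTable(num_rows=6)
for row in [0, 0, 0, 1, 1, 2, 3]:
    table.lookup(row)
table.prune(budget=2)          # memory budget: 2 rows
print(table.lookup(0))         # hot row kept -> 0
print(table.lookup(3))         # cold row pruned -> -1
```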

Mean Pooling

Mean pooling is now enabled on embeddings for the row-wise and table-row-wise sharding types in TorchRec. Mean pooling done directly through TBE (table-batched embedding) is not accurate for these sharding types, since sharding modifies the input. This feature efficiently computes the divisor, using caching and overlapping in the input dist, to implement mean pooling; this proved much more performant than out-of-library implementations. PR: #1772
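Why a separately computed divisor is needed can be shown with a minimal sketch (not the TorchRec implementation): under row-wise sharding each shard sees only part of a bag, so per-shard means are wrong; instead each shard sum-pools, the partial sums are combined (as an all-reduce would), and the result is divided by the global bag length, which is the divisor this feature caches.

```python
# Conceptual sketch (not the TorchRec implementation): mean pooling under
# row-wise sharding needs a divisor computed from the *global* bag length,
# not from the per-shard lengths.

def shard_sum(values):
    """Sum pooling over the values a single shard sees for one bag."""
    return sum(values)

def mean_pool_rowwise(shard_values, global_length):
    """Combine per-shard partial sums (as an all-reduce would), then
    divide by the full bag length -- the cached divisor."""
    total = sum(shard_sum(v) for v in shard_values)
    return total / global_length

# One bag with 4 ids; row-wise sharding split its lookups 3/1 across shards.
shard0 = [1.0, 2.0, 3.0]   # rows owned by shard 0
shard1 = [4.0]             # rows owned by shard 1
print(mean_pool_rowwise([shard0, shard1], global_length=4))  # 2.5

# Averaging per shard would be wrong: shard 0 alone gives 2.0 and
# shard 1 alone gives 4.0; neither equals the true mean.
```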

Changelog

Torch.export (non-strict) compatibility with KJT/JT/KT, EBC/Quantized EBC, sharded variants #1815 #1816 #1788 #1850 #1976 and dynamic shapes #2058

torch.compile support with TorchRec #2045 #2018 #1979

TorchRec serialization with non-strict torch.export for regenerating eager sparse modules (EBC) from IR for sharding #1860 #1848 with meta functionalization when torch.exporting #1974

More benchmarking for TorchRec modules/data types #2094 #2033 #2001 #1855

More VBE support (data parallel sharding) #2093 (EmbeddingCollection) #2047 #1849

RegroupAsDict module for performance improvements with caching #2007

Train Pipeline improvements #1967 #1969 #1971

Bug Fixes and library improvements

v0.8.0-rc1

17 Jun 13:44
Pre-release
Update setup and version for release 0.8.0

v0.7.0

25 Apr 01:39

No major features in this release

Changelog

  • Expanding out ZCH/MCH
  • Increased support for Torch Dynamo/Export
  • Distributed Benchmarking introduced under torchrec/distributed/benchmarks for inference and training
  • VBE optimizations
  • TWRW support for VBE
  • Generalized train_pipeline for different pipeline stage overlapping
  • Autograd support for traceable collectives
  • Output dtype support for embeddings
  • Dynamo tracing for sharded embedding modules
  • Bug fixes

v0.7.0-rc1

18 Mar 18:38
Pre-release

Pre release for v0.7.0

v0.6.0

30 Jan 23:40
7c266e6

VBE

TorchRec now natively supports VBE (variable batched embeddings) within the EmbeddingBagCollection module. This allows a variable batch size per feature, unlocking sparse input data deduplication, which can greatly speed up embedding lookup and all-to-all time. To enable, simply initialize KeyedJaggedTensor with the stride_per_key_per_rank and inverse_indices fields, which specify the batch size per feature and the inverse indices to reindex the embedding output, respectively.
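The deduplication mechanism behind inverse_indices can be sketched in plain Python (this is the concept, not the KeyedJaggedTensor API): look up each unique id once, then expand the pooled results back to the full batch using the inverse indices.

```python
# Plain-Python sketch of the inverse-indices mechanism behind VBE input
# dedup (not the KeyedJaggedTensor API): look up each unique id once,
# then reindex the results back to the full batch via inverse indices.

def dedup(ids):
    """Return unique ids plus, for each input, its index into the uniques."""
    uniques, inverse, seen = [], [], {}
    for i in ids:
        if i not in seen:
            seen[i] = len(uniques)
            uniques.append(i)
        inverse.append(seen[i])
    return uniques, inverse

def lookup(ids, table):
    uniques, inverse = dedup(ids)
    pooled = [table[i] for i in uniques]     # one lookup per unique id
    return [pooled[j] for j in inverse]      # reindex to the full batch

table = {7: [0.1], 9: [0.2]}
print(lookup([7, 9, 7, 7], table))  # [[0.1], [0.2], [0.1], [0.1]]
```

With heavy id repetition, the deduplicated lookup and all-to-all move far fewer rows than the naive per-sample version.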

Embedding offloading

Embedding offloading is UVM caching (i.e. storing embedding tables in host memory with a cache in HBM) plus prefetching and optimal sizing of the cache. Embedding offloading allows running a larger model with fewer GPUs while maintaining competitive performance. To use it, one needs the prefetching pipeline (PrefetchTrainPipelineSparseDist) and must pass the per-table cache load factor and the prefetch_pipeline flag through constraints in the planner.
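The memory trade-off can be sketched with back-of-envelope arithmetic (an assumption-laden illustration, not TorchRec's planner): the full table lives in host memory, and only a fraction of its rows, given by the cache load factor, is held in HBM.

```python
# Back-of-envelope sketch of embedding offloading sizing (illustrative
# assumptions, not TorchRec's planner): the full table lives in host
# memory and a fraction -- the cache load factor -- is cached in HBM.

def cache_bytes(num_rows, dim, load_factor, bytes_per_elem=4):
    """HBM needed to cache `load_factor` of a num_rows x dim fp32 table."""
    cached_rows = int(num_rows * load_factor)
    return cached_rows * dim * bytes_per_elem

full = cache_bytes(100_000_000, 128, 1.0)       # whole table in HBM
offloaded = cache_bytes(100_000_000, 128, 0.2)  # 20% cache load factor
print(full // 2**30, "GiB vs", offloaded // 2**30, "GiB")  # 47 GiB vs 9 GiB
```

Prefetching is what keeps the smaller cache competitive: without it, misses on the 80% of rows left in host memory would stall the forward pass.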

Trec.shard/shard_modules

These APIs replace embedding submodules with their sharded variants. The shard API applies to an individual embedding module, while the shard_modules API replaces all embedding modules and won't touch other non-embedding submodules.
Embedding sharding behaves similarly to the prior TorchRec DistributedModelParallel behavior, except that the ShardedModules have been made composable, meaning the modules are backed by TableBatchedEmbeddingSlices, which are views into the underlying TBE (including .grad). This means that fused parameters are now returned by named_parameters(), including in DistributedModelParallel.
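The "views into the underlying TBE" point can be shown with a minimal sketch (this is the idea, not TableBatchedEmbeddingSlices): each per-table parameter is a window into one fused buffer, so writing through the view mutates the storage the fused kernel reads.

```python
# Minimal sketch (not TableBatchedEmbeddingSlices) of a sharded table's
# parameter being a *view* into one fused buffer: writes through the
# per-table view mutate the fused storage the TBE kernel would read.

class Slice:
    def __init__(self, buffer, start, stop):
        self.buffer, self.start, self.stop = buffer, start, stop

    def __getitem__(self, i):
        return self.buffer[self.start + i]

    def __setitem__(self, i, v):
        self.buffer[self.start + i] = v

fused = [0.0] * 6                                 # fused storage, two tables
table_a, table_b = Slice(fused, 0, 3), Slice(fused, 3, 6)
table_b[0] = 1.5                                  # update via per-table view
print(fused)  # [0.0, 0.0, 0.0, 1.5, 0.0, 0.0]
```

Because the views alias the fused buffer rather than copy it, exposing them through named_parameters() costs no extra memory.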

v0.6.0-rc2

22 Jan 19:38
1fc7cac
Pre-release


v0.6.0-rc1

18 Dec 18:05
b09fd9e
Pre-release

This release supports Python 3.8-3.11, with experimental support for 3.12:

pip install torchrec --index-url https://download.pytorch.org/whl/test/cpu
pip install torchrec --index-url https://download.pytorch.org/whl/test/cu118
pip install torchrec --index-url https://download.pytorch.org/whl/test/cu121

v0.5.0

05 Oct 17:21

[Prototype] Zero Collision / Managed Collision Embedding Bags

A common constraint in recommender systems is that the sparse id input range is larger than the number of embeddings the model can learn for a given parameter size. The conventional solution is to hash sparse ids into the same size range as the embedding table. This ultimately leads to hash collisions, with multiple sparse ids sharing the same embedding space. We have developed a performant alternative algorithm that addresses this problem by tracking the N most common sparse ids and ensuring that they have a unique embedding representation. The module definition and an example are available in the TorchRec repository.
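The tracking-and-remapping idea can be sketched with a toy implementation (hypothetical, not the TorchRec module): the N most common ids each get a dedicated embedding row, while everything else is hashed into the remaining rows and may still collide.

```python
from collections import Counter

# Toy managed-collision sketch (hypothetical, not the TorchRec module):
# the N most common sparse ids each get a dedicated embedding row;
# all other ids hash into the remaining rows and may collide.

class ManagedCollision:
    def __init__(self, table_size, num_reserved):
        self.counts = Counter()
        self.reserved = {}                 # sparse id -> dedicated row
        self.num_reserved = num_reserved
        self.overflow = table_size - num_reserved

    def observe(self, ids):
        """Update frequency stats and refresh the reserved-row mapping."""
        self.counts.update(ids)
        top = [i for i, _ in self.counts.most_common(self.num_reserved)]
        self.reserved = {i: slot for slot, i in enumerate(top)}

    def remap(self, sparse_id):
        if sparse_id in self.reserved:
            return self.reserved[sparse_id]          # collision-free row
        return self.num_reserved + hash(sparse_id) % self.overflow

mc = ManagedCollision(table_size=10, num_reserved=2)
mc.observe([42, 42, 42, 7, 7, 1001, 2002])
print(mc.remap(42) < 2, mc.remap(7) < 2)   # hot ids get unique rows
print(mc.remap(1001) >= 2)                 # cold ids share hashed rows
```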

[Prototype] UVM Caching - Prefetch Training Pipeline

For tables where on-device memory is insufficient to hold the entire embedding table, it is common to use a caching architecture where part of the embedding table is cached on device while the full table resides in host memory (typically DDR SDRAM). In practice, however, cache misses are common and hurt performance due to the relatively high latency of going to host memory. Building on TorchRec's existing data pipelining, we developed a new Prefetch Training Pipeline that avoids these misses by prefetching the relevant embeddings for the upcoming batch from host memory, effectively eliminating cache misses in the forward path.
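The overlap structure of such a pipeline can be sketched schematically (this is the scheduling idea, not PrefetchTrainPipelineSparseDist): while batch i trains, the embeddings batch i+1 needs are fetched into the cache, so the next forward pass does not miss.

```python
# Schematic prefetch pipeline (not PrefetchTrainPipelineSparseDist):
# while batch i trains, the embeddings batch i+1 needs are fetched from
# host memory into the on-device cache, hiding the fetch latency.

def prefetch_pipeline(batches, fetch, train):
    it = iter(batches)
    current = next(it, None)
    if current is not None:
        fetch(current)          # warm the cache for the first batch
    while current is not None:
        nxt = next(it, None)
        if nxt is not None:
            fetch(nxt)          # overlap: prefetch while current trains
        train(current)          # forward/backward on the current batch
        current = nxt

log = []
prefetch_pipeline(
    batches=["b0", "b1", "b2"],
    fetch=lambda b: log.append(f"fetch {b}"),
    train=lambda b: log.append(f"train {b}"),
)
print(log)
# ['fetch b0', 'fetch b1', 'train b0', 'fetch b2', 'train b1', 'train b2']
```

Every "fetch bK" precedes its "train bK", which is the invariant that keeps the forward path free of cache misses.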

v0.5.0-rc2

03 Oct 22:48
Pre-release

Install fbgemm via nova

v0.5.0-rc1

13 Sep 01:15
Pre-release
remove fbgemm-gpu-nightly instead