v0.3.2

@colin2328 released this 15 Dec 02:27

KeyedJaggedTensor

We observed a performance regression due to a bottleneck in sparse data distribution for models that have multiple, large KJTs to redistribute.

To combat this, we altered the comms pattern to transport only the minimum data required in the initial collective to support the subsequent collective calls for the actual KJT tensor data. Sending this data, the ‘splits’, in the initial collective means more data is transmitted over the comms stream overall, but the CPU is blocked for significantly shorter amounts of time, leading to better overall QPS.
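
As an illustration of this two-phase pattern, here is a minimal sketch using torch.distributed.all_to_all_single to exchange the splits first and then the KJT values. The function and variable names are illustrative only, not the actual TorchRec internals.

```python
import torch
import torch.distributed as dist

def exchange_kjt_values(values: torch.Tensor, input_splits: list) -> torch.Tensor:
    """Redistribute a flat KJT value tensor across ranks in two collectives (sketch)."""
    world_size = dist.get_world_size()

    # Phase 1: a small collective that only moves the 'splits' metadata, i.e.
    # how many values this rank sends to every other rank. This is the extra
    # data mentioned above; it is cheap to send and lets us size the receive
    # buffers without blocking the CPU on the large exchange.
    send_splits = torch.tensor(input_splits, dtype=torch.int64)
    recv_splits = torch.empty(world_size, dtype=torch.int64)
    dist.all_to_all_single(recv_splits, send_splits)

    # Phase 2: the expensive collective that moves the actual tensor data,
    # with buffer sizes derived from the splits exchanged in phase 1.
    output_splits = recv_splits.tolist()
    output = values.new_empty(sum(output_splits))
    dist.all_to_all_single(
        output,
        values,
        output_split_sizes=output_splits,
        input_split_sizes=input_splits,
    )
    return output
```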

Furthermore, we altered the TorchRec train pipeline to group the initial collective calls for the splits together before launching the more expensive KJT tensor collective calls. This pseudo ‘fusing’ minimizes CPU blocked time, as launching each subsequent input distribution no longer depends on the previous input distribution.
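
Below is a similarly hedged sketch of that grouped launch order, assuming several KJTs per batch and reusing the exchange pattern above: every cheap splits collective is issued asynchronously before any of the expensive tensor collectives. Again, the names are illustrative, not the real pipeline code.

```python
import torch
import torch.distributed as dist

def distribute_kjts(per_kjt_values: list, per_kjt_splits: list) -> list:
    """Redistribute several KJT value tensors, grouping the splits collectives (sketch)."""
    world_size = dist.get_world_size()

    # Step 1: launch every small splits collective up front (non-blocking),
    # so no KJT's metadata exchange waits behind another KJT's tensor exchange.
    recv_splits = [torch.empty(world_size, dtype=torch.int64) for _ in per_kjt_splits]
    handles = [
        dist.all_to_all_single(
            out, torch.tensor(splits, dtype=torch.int64), async_op=True
        )
        for out, splits in zip(recv_splits, per_kjt_splits)
    ]
    for handle in handles:
        handle.wait()

    # Step 2: only now launch the expensive tensor collectives, with every
    # receive buffer sized from the splits gathered in step 1.
    outputs = []
    for values, in_splits, out_splits in zip(per_kjt_values, per_kjt_splits, recv_splits):
        sizes = out_splits.tolist()
        out = values.new_empty(sum(sizes))
        dist.all_to_all_single(
            out, values, output_split_sizes=sizes, input_split_sizes=in_splits
        )
        outputs.append(out)
    return outputs
```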

We no longer pass in variable batch size in the sharder.

Planner

On the planner side, we introduced a new “early stopping” feature to GreedyProposer. This brings a 4X speedup to the planner when there are many proposals (>1000) to iterate through. To use the feature, simply pass “threshold=10” to GreedyProposer (10 is the suggested value; it means GreedyProposer will stop proposing after seeing 10 consecutive bad proposals). Secondly, we refactored the “deepcopy” logic in the planner code, which brings an 8X speedup on the overall planning time. See PR #665 for details.
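
A rough usage sketch for early stopping, assuming GreedyProposer exposes a threshold argument and can be handed to EmbeddingShardingPlanner via its proposer parameter; please check the torchrec.distributed.planner API for the exact signatures.

```python
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.proposers import GreedyProposer

# Illustrative topology for a 2-GPU job.
topology = Topology(world_size=2, compute_device="cuda")

planner = EmbeddingShardingPlanner(
    topology=topology,
    # Stop proposing after 10 consecutive proposals that do not improve on the
    # best plan found so far (the suggested value from these release notes).
    proposer=GreedyProposer(threshold=10),
)
```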

Pinning requirements

We are also pinning requirements to add more stability for TorchRec users.