
add torchbench for Distributed Shampoo Optimizer v2 #2616

Closed

Conversation

minddrummer
Contributor

Summary:

  • No optimizer has been integrated into TorchBench so far. Distributed Shampoo is quite complex and depends directly on PyTorch, which creates a need to add it to TorchBench to guard it against PyTorch 2.0 changes.
  • This diff implements that integration, specifically enabling Distributed Shampoo in TorchBench in eager mode. A follow-up diff will add PT2 (torch.compile) support.
  • Current design of the integration (see the sketch after this list):
    -- Pick the Ads DHEN CMF 5x model, since CMF is a major MC model.
    -- Benchmark the optimizer stage alone rather than end to end. The optimizer step itself is relatively light compared to forward and backward, so an end-to-end measurement would let the other stages (fwd, bwd) overshadow the optimizer-step results and make the benchmark insensitive.
    -- Build on top of the original ads_dhen_5x pipeline, skip the forward and backward stages, and set up the Shampoo config inside the Model __init__ stage.
    -- Distributed Shampoo performs a matrix root inverse computation. In production its frequency is controlled by precondition_frequency, and its share of the overall computation is negligible, so the benchmark skips it as well: the iteration count is advanced inside the _prepare_before_optimizer function to bypass the first root inverse computation.
    -- Overall, TorchBench then does the following: 1. initialize the ads_dhen_cmf 5x model on a local GPU, preload the data, and run forward and backward; 2. adjust some Shampoo state (e.g. the iteration step used for preconditioning) so the optimizer is ready; 3. benchmark the optimizer with the TorchBench pipeline and return the results.
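
Below is a minimal sketch of the optimizer-stage-only benchmark described above. The Ads DHEN CMF 5x model, the TorchBench harness, and _prepare_before_optimizer are internal, so a small stand-in model and hand-rolled timing take their place; the DistributedShampoo import path and constructor arguments follow the open-source Distributed Shampoo package and are assumptions, not the exact v2 API used in this diff.

```python
# Sketch only: a stand-in for the internal ads_dhen_cmf_5x model and the
# TorchBench pipeline. DistributedShampoo usage mirrors the open-source
# distributed_shampoo package and may differ from the internal v2 API.
import time

import torch
import torch.nn as nn

from distributed_shampoo import DistributedShampoo  # assumed import path

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for the Ads DHEN CMF 5x model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1)).to(device)
data = torch.randn(1024, 512, device=device)
target = torch.randn(1024, 1, device=device)

optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-3,
    precondition_frequency=100,  # root inverse recomputed every 100 steps
)

# Step 1: one forward/backward pass outside the timed region, to populate
# gradients and let Shampoo allocate its preconditioner state. The first
# optimizer.step() also runs the first matrix root inverse, keeping that
# cost out of the measured window (the diff does this by advancing the
# iteration count in _prepare_before_optimizer instead).
loss = nn.functional.mse_loss(model(data), target)
loss.backward()
optimizer.step()

# Steps 2-3: time only optimizer.step(); forward and backward are never
# re-run. With 50 timed iterations and precondition_frequency=100, no
# further root inverse falls inside the measured window.
if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(50):
    optimizer.step()
if device == "cuda":
    torch.cuda.synchronize()
print(f"avg optimizer.step() latency: {(time.perf_counter() - start) / 50 * 1e3:.3f} ms")
```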

05/16:

  • update the diff to reflect the Shampoo v2 implementation

Reviewed By: xuzhao9

Differential Revision: D51192560

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D51192560


netlify bot commented May 21, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: 2a576d9
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/664d1c8c0a392d0008d11da4
😎 Deploy Preview: https://deploy-preview-2616--pytorch-fbgemm-docs.netlify.app

minddrummer added a commit to minddrummer/FBGEMM that referenced this pull request May 21, 2024

@facebook-github-bot
Contributor

This pull request has been merged in d7a5500.
