Add benchmarks to CI #481

Closed (moaradwan wants to merge 9 commits)

@moaradwan moaradwan commented Aug 25, 2022

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Docs change / refactoring / dependency upgrade

Issue: #368

Motivation and Context / Related issue

Task #368 tracks committing the benchmark code. This change adds those benchmarks to the CI integration tests. To choose thresholds, I ran the benchmarks locally on all layers (batch size: 16, num_runs: 100, num_repeats: 20, forward_only: False); see the comment below for details.

Using the report and section 3 in the paper, I parameterised the runtime and memory thresholds for different layers.
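
To make the threshold check concrete, here is a minimal sketch of how a ratio-based check could work. It assumes the ratio is the wrapped layer's measurement divided by the control measurement and that a check passes when the ratio is at or below its threshold; the function names are hypothetical and not taken from the benchmark code. The numbers are the linear-layer measurements and thresholds reported in the results comment.

```python
def ratio(wrapped: float, control: float) -> float:
    """Cost of the wrapped (DP or GSM) layer relative to the plain layer."""
    if control <= 0:
        raise ValueError("control measurement must be positive")
    return wrapped / control


def within_thresholds(
    memory_ratio: float,
    runtime_ratio: float,
    memory_ratio_threshold: float,
    runtime_ratio_threshold: float,
) -> bool:
    """Assumed pass rule: both ratios must be at or below their thresholds."""
    return (
        memory_ratio <= memory_ratio_threshold
        and runtime_ratio <= runtime_ratio_threshold
    )


# Linear layer: control vs. gsm measurements from the report.
memory_ratio = ratio(36903936.0, 3283968.0)
runtime_ratio = ratio(0.0010285112289999743, 0.00041353099599993477)

# Thresholds used for the linear group: memory 13, runtime 3.6.
ok = within_thresholds(memory_ratio, runtime_ratio, 13.0, 3.6)
print(f"memory x{memory_ratio:.2f}, runtime x{runtime_ratio:.2f}, ok={ok}")
```

Because both sides of each ratio are measured on the same machine in the same run, the check is less sensitive to the absolute speed of the CI hardware than a check on raw seconds or bytes would be.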

How Has This Been Tested (if it applies)

  • I ran the jobs locally and generated reports.
  • Local CircleCI config validation: circleci config process .circleci/config.yml
  • Local CircleCI job run: circleci local execute --job JOB_NAME

Checklist

  • The documentation is up-to-date with the changes I made.
  • I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
  • All tests passed, and additional code has been covered with new tests.

@facebook-github-bot added the CLA Signed label Aug 25, 2022
@facebook-github-bot
Contributor

@moaradwan has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@moaradwan
Contributor Author

moaradwan commented Aug 26, 2022

Results after running on GPU

The following tables show the memory and runtime metrics from a CircleCI run on the gpu.nvidia.small.multi resource class.

Group 1: groupnorm, instancenorm, layernorm, dpmha

Threshold based on paper:

  • runtime_ratio_threshold: "2.6"
  • memory_ratio_threshold: "1.6"
  • ✅ Succeeded
  • Pipeline runtime: 22s
| base_layer | memory: control | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|
| groupnorm | 107520.0 | | 140288.0 | 1.3047619047619048 | 0.00040995383000020535 | | 0.000998464277499892 | 2.4355529926367363 |
| instancenorm | 6345728.0 | | 7394304.0 | 1.1652412457640795 | 0.0005690672350000909 | | 0.0012307228030001341 | 2.162701922207729 |
| layernorm | 28672.0 | | 37888.0 | 1.3214285714285714 | 0.0003554899874999648 | | 0.000726604483499955 | 2.043952035358034 |

Group 2: Linear layer

Threshold based on paper:

  • runtime_ratio_threshold: "3.6"
  • memory_ratio_threshold: "13"
  • ✅ Succeeded
  • Pipeline runtime: 14s
| base_layer | memory: control | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|
| linear | 3283968.0 | | 36903936.0 | 11.237605238540691 | 0.00041353099599993477 | | 0.0010285112289999743 | 2.487144226064584 |

Group 3: GSM-DPMHA

Threshold based on paper:

  • runtime_ratio_threshold: "3.5"
  • memory_ratio_threshold: "2.0"
  • ✅ Succeeded
  • Pipeline runtime: 23s
| base_layer | memory: control | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|
| mha | 13630464.0 | | 24162304.0 | 1.772669220945083 | 0.0012178095074999362 | | 0.0037577736325000903 | 3.0856826206050023 |

Group 4: GRU

Threshold based on paper:

  • runtime_ratio_threshold: "18.5"
  • memory_ratio_threshold: "1.5"
  • 🚫 FAILED
  • Pipeline runtime: 7m47s
| base_layer | memory: control | memory: dp | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|---|---|
| gru | 11186176.0 | 12154368.0 | 1.0865525448553643 | 16603136.0 | 1.484254851702673 | 0.004054944954499915 | 0.05725822975399994 | 14.120593595347938 | 0.14559441694749992 | 35.905399106818614 |

Group 5: LSTM

Threshold based on paper:

  • runtime_ratio_threshold: "16.5"
  • memory_ratio_threshold: "1.2"
  • 🚫 FAILED
  • Pipeline runtime: 7m17s
| base_layer | memory: control | memory: dp | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|---|---|
| lstm | 10801152.0 | 11527680.0 | 1.06726393629124 | 18021376.0 | 1.6684679560106181 | 0.004167011488000128 | 0.051514086188000296 | 12.362357612007312 | 0.13722742892250026 | 32.93185759570818 |

Group 6: RNN

Threshold based on paper:

  • runtime_ratio_threshold: "16.5"
  • memory_ratio_threshold: "1.5"
  • 🚫 FAILED
  • Pipeline runtime: 4m41s
| base_layer | memory: control | memory: dp | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|---|---|
| rnn | 6287360.0 | 5936640.0 | 0.9442182410423453 | 6346240.0 | 1.0093648208469055 | 0.003437245790000362 | 0.020753186856500516 | 6.03773722463366 | 0.09805997271550064 | 28.52864726775629 |

Group 7: Embedding

Threshold based on paper:

  • runtime_ratio_threshold: "6.0"
  • memory_ratio_threshold: "15.0"
  • ✅ Succeeded
  • Pipeline runtime: 20s
| base_layer | memory: control | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|
| embedding | 24021504.0 | | 280028160.0 | 11.657394974103203 | 0.0004076045779994501 | | 0.0023867599680010014 | 5.855576941052466 |

Open points

1. Reducing runtime of the jobs

So far I have excluded only the conv layer, since it takes up to an hour when run locally.

In total, the new tasks add ~20 minutes to the pipeline, most of it spent on the recurrent layers. This brings the integration pipeline's total execution time to 32 minutes.

Possible improvements:

  • Run the benchmark jobs in parallel.
  • Remove some of the layers.

2. Changing thresholds

So far I have used the paper to infer most of the thresholds. Validation of the recurrent layers has failed, though.

| Group | Memory Threshold - Hi Memory | Runtime Threshold - Hi Runtime |
|---|---|---|
| 4: GRU | ✅ 1.5, 1.48 | 🚫 18, 35.9 |
| 5: LSTM | 🚫 1.2, 1.668 | 🚫 16.5, 32.9 |
| 6: RNN | ✅ 1.5, 1.009 | 🚫 16.5, 28.528 |
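
The pass/fail pattern in this summary can be reproduced with a short sketch. It assumes a check passes when the measured ratio is at or below the threshold; the `Check` type is hypothetical, and the values are copied from the summary table as written (including the GRU runtime threshold of 18).

```python
from typing import NamedTuple


class Check(NamedTuple):
    threshold: float
    measured: float

    @property
    def passed(self) -> bool:
        # Assumed rule: at or below the threshold counts as a pass.
        return self.measured <= self.threshold


results = {
    "GRU": {"memory": Check(1.5, 1.48), "runtime": Check(18, 35.9)},
    "LSTM": {"memory": Check(1.2, 1.668), "runtime": Check(16.5, 32.9)},
    "RNN": {"memory": Check(1.5, 1.009), "runtime": Check(16.5, 28.528)},
}

# Every (layer, metric) pair that exceeds its threshold.
failed = [
    (layer, metric)
    for layer, checks in results.items()
    for metric, check in checks.items()
    if not check.passed
]

# How far each measured runtime ratio sits above its threshold.
runtime_overshoot = {
    layer: checks["runtime"].measured / checks["runtime"].threshold
    for layer, checks in results.items()
}

print(failed)
print({layer: round(x, 2) for layer, x in runtime_overshoot.items()})
```

Under these assumptions, all three runtime checks fail by a factor of roughly 1.7 to 2, which points at the thresholds being set for different hardware rather than at a small regression.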


@moaradwan moaradwan marked this pull request as ready for review August 26, 2022 12:23

@moaradwan
Contributor Author

moaradwan commented Aug 30, 2022

@ffuuugor @ashkan-software regarding point 2 in #481 (comment): should I just update the thresholds of the failing tests so that they pass?

Contributor

@ffuuugor ffuuugor left a comment


Great work, thanks a lot for this!

On thresholds: yes please, just take the current performance on the CI machine as the norm. It might not match the numbers from the paper because the hardware is different, and that's not the point anyway. We want to track relative changes in performance; absolute numbers are not that important.

On the 30 minutes: it's not a big deal, but I would suggest moving this to the nightly tests instead of running it on every commit.
I would create a separate job (rather than bundling benchmarks and integration tests together) and run it only in the nightly workflow. WDYT?
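
One way to turn "current performance on the CI machine" into a threshold is to pad the measured ratio with some headroom and round up. The sketch below is only an illustration of that idea: the 10% headroom, the 0.5 rounding step, and both function names are assumptions, not values or code from this PR.

```python
import math


def threshold_from_baseline(
    measured_ratio: float, headroom: float = 0.10, step: float = 0.5
) -> float:
    """Pad the measured ratio by `headroom`, then round up to a multiple of `step`."""
    padded = measured_ratio * (1.0 + headroom)
    return math.ceil(padded / step) * step


def regressed(new_ratio: float, threshold: float) -> bool:
    """A later run counts as a regression only if it exceeds the threshold."""
    return new_ratio > threshold


# Measured gsm/control runtime ratio for GRU on the CI machine.
gru_gsm_runtime = 35.905399106818614
threshold = threshold_from_baseline(gru_gsm_runtime)

print(threshold)
print(regressed(36.2, threshold), regressed(45.0, threshold))
```

The headroom trades noise tolerance against sensitivity: a larger value avoids flaky nightly failures, while a smaller one catches smaller relative slowdowns.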


@moaradwan
Contributor Author

Final results after updating thresholds

The benchmarks now run as a separate job, micro_benchmarks_py37_torch_release_cuda, in the nightly workflow only. The whole run takes ~27 minutes. There are 10 tasks, as follows.

  • An example run is here.

Group 1: GSM of: (groupnorm, instancenorm, layernorm), and DPMHA

Threshold based on paper:

  • runtime_ratio_threshold: "2.6"
  • memory_ratio_threshold: "1.6"
  • ✅ Succeeded
  • Pipeline runtime: 22s
| base_layer | memory: control | memory: dp | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|---|---|
| groupnorm | 107520.0 | | | 140288.0 | 1.3047619047619048 | 0.00040995383000020535 | | | 0.000998464277499892 | 2.4355529926367363 |
| instancenorm | 6345728.0 | | | 7394304.0 | 1.1652412457640795 | 0.0005690672350000909 | | | 0.0012307228030001341 | 2.162701922207729 |
| layernorm | 28672.0 | | | 37888.0 | 1.3214285714285714 | 0.0003554899874999648 | | | 0.000726604483499955 | 2.043952035358034 |
| mha | 13630464.0 | 13632512.0 | 1.00015025167155 | | | 0.0012336450990000003 | 0.001333286202000039 | 1.0807696663171673 | | |

Group 2: GSM-Linear layer

Threshold based on paper:

  • runtime_ratio_threshold: "3.6"
  • memory_ratio_threshold: "13"
  • ✅ Succeeded
  • Pipeline runtime: 14s
| base_layer | memory: control | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|
| linear | 3283968.0 | | 36903936.0 | 11.237605238540691 | 0.00041353099599993477 | | 0.0010285112289999743 | 2.487144226064584 |

Group 3: GSM-DPMHA

Threshold based on paper:

  • runtime_ratio_threshold: "3.5"
  • memory_ratio_threshold: "2.0"
  • ✅ Succeeded
  • Pipeline runtime: 23s
| base_layer | memory: control | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|
| mha | 13630464.0 | | 24162304.0 | 1.772669220945083 | 0.0012178095074999362 | | 0.0037577736325000903 | 3.0856826206050023 |

Group 4&5: DPGRU and GSM-DPGRU

DPGRU:

  • runtime_ratio_threshold: "18.5"
  • memory_ratio_threshold: "1.2"
  • ✅ Succeeded
  • Pipeline runtime: 2m28s

GSM-DPGRU:

  • runtime_ratio_threshold: "40"
  • memory_ratio_threshold: "1.6"
  • ✅ Succeeded
  • Pipeline runtime: 5m42s
| base_layer | memory: control | memory: dp | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|---|---|
| gru | 11186176.0 | 12154368.0 | 1.0865525448553643 | 16603136.0 | 1.484254851702673 | 0.004054944954499915 | 0.05725822975399994 | 14.120593595347938 | 0.14559441694749992 | 35.905399106818614 |

Group 6&7: DPLSTM and GSM-DPLSTM

DPLSTM:

  • runtime_ratio_threshold: "16.5"
  • memory_ratio_threshold: "1.2"
  • ✅ Succeeded
  • Pipeline runtime: 2m12s

GSM-DPLSTM:

  • runtime_ratio_threshold: "38"
  • memory_ratio_threshold: "1.8"
  • ✅ Succeeded
  • Pipeline runtime: 5m8s
| base_layer | memory: control | memory: dp | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|---|---|
| lstm | 10801152.0 | 11527680.0 | 1.06726393629124 | 18021376.0 | 1.6684679560106181 | 0.004167011488000128 | 0.051514086188000296 | 12.362357612007312 | 0.13722742892250026 | 32.93185759570818 |

Group 8&9: DPRNN and GSM-DPRNN

DPRNN:

  • runtime_ratio_threshold: "10"
  • memory_ratio_threshold: "1.2"
  • ✅ Succeeded
  • Pipeline runtime: 1m4s

GSM-DPRNN:

  • runtime_ratio_threshold: "33"
  • memory_ratio_threshold: "1.2"
  • ✅ Succeeded
  • Pipeline runtime: 3m44s
| base_layer | memory: control | memory: dp | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|---|---|
| rnn | 6287360.0 | 5936640.0 | 0.9442182410423453 | 6346240.0 | 1.0093648208469055 | 0.003437245790000362 | 0.020753186856500516 | 6.03773722463366 | 0.09805997271550064 | 28.52864726775629 |

Group 10: Embedding

Threshold based on paper:

  • runtime_ratio_threshold: "6.0"
  • memory_ratio_threshold: "15.0"
  • ✅ Succeeded
  • Pipeline runtime: 20s
| base_layer | memory: control | memory: dp/control | memory: gsm | memory: gsm/control | runtime: control | runtime: dp/control | runtime: gsm | runtime: gsm/control |
|---|---|---|---|---|---|---|---|---|
| embedding | 24021504.0 | | 280028160.0 | 11.657394974103203 | 0.0004076045779994501 | | 0.0023867599680010014 | 5.855576941052466 |


@moaradwan
Contributor Author

@ffuuugor I updated the code as follows:

  • Split the tasks into a separate job.
  • Updated the thresholds. PS: these are ratios, not absolute values.
  • Added separate tasks for DP and GSM on the recurrent layers, so each can have a tighter threshold.
  • The job runs only on nightly.

See the comment above for more details.
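
To illustrate why separate DP and GSM tasks allow tighter thresholds, the sketch below compares the slack each variant gets under one shared runtime threshold versus its own. It uses the GRU runtime ratios and thresholds from the final results; the `slack` helper and the idea of a single shared threshold equal to the larger of the two are assumptions made for the comparison.

```python
# Measured runtime ratios for GRU (dp/control and gsm/control).
measured = {"dp": 14.120593595347938, "gsm": 35.905399106818614}

# Per-task runtime thresholds from the final results (DPGRU, GSM-DPGRU).
separate = {"dp": 18.5, "gsm": 40.0}

# Hypothetical alternative: one threshold that both variants must fit under.
shared = max(separate.values())


def slack(measured_ratio: float, threshold: float) -> float:
    """Factor by which the ratio could grow before the check fails."""
    return threshold / measured_ratio


slack_shared = {name: slack(value, shared) for name, value in measured.items()}
slack_separate = {name: slack(value, separate[name]) for name, value in measured.items()}

print({name: round(s, 2) for name, s in slack_shared.items()})
print({name: round(s, 2) for name, s in slack_separate.items()})
```

With a shared threshold, the DP variant could become almost three times slower before the job failed; with its own threshold, the margin is about 1.3x.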

Contributor

@ffuuugor ffuuugor left a comment


Awesome, thank you for this amazing contribution! LGTM

Please note that, once approved, you shouldn't merge the PR on GitHub but rather land the diff on Phabricator; the PR will then be closed automatically.

Development

Successfully merging this pull request may close these issues.

Integrate microbenchmarks into CI pipeline
3 participants