Add benchmarks to CI #481

moaradwan · 2022-08-25T17:33:41Z

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Docs change / refactoring / dependency upgrade

Issue: #368

Motivation and Context / Related issue

Issue #368 tracks committing the benchmark code. This change adds those benchmarks to the CI integration tests. To choose thresholds, I ran the benchmarks locally on all layers (batch size: 16, num_runs: 100, num_repeats: 20, forward_only: False); see the comment below for details.

Using the report and Section 3 of the paper, I parameterised the runtime and memory thresholds for the different layers.
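
The kind of check these thresholds drive can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper `check_ratios`, its signature, and the use of `<=` are assumptions; only the two threshold names and the idea of comparing a ratio against the control layer come from the description above.

```python
def check_ratios(control, private, runtime_ratio_threshold, memory_ratio_threshold):
    """Return (ok, ratios) for one layer.

    `control` and `private` are dicts holding "runtime" (seconds) and
    "memory" (bytes) for the plain layer and its DP/GSM counterpart.
    Hypothetical helper, for illustration only.
    """
    ratios = {
        "runtime": private["runtime"] / control["runtime"],
        "memory": private["memory"] / control["memory"],
    }
    ok = (
        ratios["runtime"] <= runtime_ratio_threshold
        and ratios["memory"] <= memory_ratio_threshold
    )
    return ok, ratios


# Made-up measurements: twice as slow, 1.5x the memory of the control layer
ok, ratios = check_ratios(
    control={"runtime": 0.001, "memory": 1000.0},
    private={"runtime": 0.002, "memory": 1500.0},
    runtime_ratio_threshold=2.6,
    memory_ratio_threshold=1.6,
)
print(ok, ratios)
```

Because both checks are on ratios rather than absolute seconds or bytes, the same thresholds remain meaningful when the control layer itself gets faster or slower on a given machine.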

How Has This Been Tested (if it applies)

I ran the jobs locally and generated reports.
Local CircleCI config validation: circleci config process .circleci/config.yml
Local CircleCI job run: circleci local execute --job JOB_NAME

Checklist

The documentation is up-to-date with the changes I made.
I have read the CONTRIBUTING document and completed the CLA (see CONTRIBUTING).
All tests passed, and additional code has been covered with new tests.

facebook-github-bot · 2022-08-26T06:50:19Z

@moaradwan has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-08-26T06:50:28Z

@facebook-github-bot has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

moaradwan · 2022-08-26T10:14:41Z

Results after running on GPU

The following tables show the memory and runtime metrics from a CircleCI run using gpu.nvidia.small.multi.

Group 1: groupnorm, instancenorm, layernorm, dpmha

Threshold based on paper:

runtime_ratio_threshold: "2.6"
memory_ratio_threshold: "1.6"
✅ Succeeded
Pipeline runtime: 22s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
groupnorm	107520.0		140288.0	1.3047619047619048	0.00040995383000020535		0.000998464277499892	2.4355529926367363
instancenorm	6345728.0		7394304.0	1.1652412457640795	0.0005690672350000909		0.0012307228030001341	2.162701922207729
layernorm	28672.0		37888.0	1.3214285714285714	0.0003554899874999648		0.000726604483499955	2.043952035358034

Group 2: Linear layer

Threshold based on paper:

runtime_ratio_threshold: "3.6"
memory_ratio_threshold: "13"
✅ Succeeded
Pipeline runtime: 14s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
linear	3283968.0		36903936.0	11.237605238540691	0.00041353099599993477		0.0010285112289999743	2.487144226064584

Group 3: GSM-DPMHA

Threshold based on paper:

runtime_ratio_threshold: "3.5"
memory_ratio_threshold: "2.0"
✅ Succeeded
Pipeline runtime: 23s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
mha	13630464.0		24162304.0	1.772669220945083	0.0012178095074999362		0.0037577736325000903	3.0856826206050023

Group 4: GRU

Threshold based on paper:

runtime_ratio_threshold: "18.5"
memory_ratio_threshold: "1.5"
🚫 FAILED
Pipeline runtime: 7m47s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
gru	11186176.0	12154368.0	1.0865525448553643	16603136.0	1.484254851702673	0.004054944954499915	0.05725822975399994	14.120593595347938	0.14559441694749992	35.905399106818614

Group 5: LSTM

Threshold based on paper:

runtime_ratio_threshold: "16.5"
memory_ratio_threshold: "1.2"
🚫 FAILED
Pipeline runtime: 7m17s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
lstm	10801152.0	11527680.0	1.06726393629124	18021376.0	1.6684679560106181	0.004167011488000128	0.051514086188000296	12.362357612007312	0.13722742892250026	32.93185759570818

Group 6: RNN

Threshold based on paper:

runtime_ratio_threshold: "16.5"
memory_ratio_threshold: "1.5"
🚫 FAILED
Pipeline runtime: 4m41s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
rnn	6287360.0	5936640.0	0.9442182410423453	6346240.0	1.0093648208469055	0.003437245790000362	0.020753186856500516	6.03773722463366	0.09805997271550064	28.52864726775629

Group 7: Embedding

Threshold based on paper:

runtime_ratio_threshold: "6.0"
memory_ratio_threshold: "15.0"
✅ Succeeded
Pipeline runtime: 20s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
embedding	24021504.0		280028160.0	11.657394974103203	0.0004076045779994501		0.0023867599680010014	5.855576941052466

Open points

1. Reducing runtime of the jobs

Right now I have excluded only the conv layer, since it takes up to an hour when run locally.

In total the new tasks increase the pipeline execution time by ~20 minutes, with recurrent layers taking most of that time. This makes the integration pipeline total execution time 32 minutes.

Some improvements:

Run the benchmark jobs in parallel.
Remove some of the layers.
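
As a rough check on the first option, the per-group pipeline runtimes reported above can be summed (jobs run one after another) and compared with the slowest single job (an idealised fully parallel run). This is back-of-the-envelope arithmetic only: it ignores executor start-up and checkout time, and the dictionary keys are just labels chosen for this sketch.

```python
import re

# Per-group pipeline runtimes as reported in the results above
durations = {
    "norm layers + dpmha": "22s",
    "linear": "14s",
    "gsm-dpmha": "23s",
    "gru": "7m47s",
    "lstm": "7m17s",
    "rnn": "4m41s",
    "embedding": "20s",
}


def to_seconds(text):
    """Convert strings such as '7m47s' or '22s' to a number of seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


seconds = [to_seconds(d) for d in durations.values()]
sequential = sum(seconds)  # jobs run one after another
parallel = max(seconds)    # idealised: every job on its own executor

print(f"sequential: {sequential // 60}m{sequential % 60}s")
print(f"parallel:   {parallel // 60}m{parallel % 60}s")
```

The sequential total comes to 21m4s, in line with the ~20 minutes quoted above, while the idealised parallel bound is the GRU job at 7m47s.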

2. Changing thresholds

So far I have used the paper to infer most of the thresholds. Validation of the recurrent layers failed, though.

Group	Memory ratio (threshold, measured)	Runtime ratio (threshold, measured)
4: GRU	✅ 1.5, 1.48	🚫 18.5, 35.9
5: LSTM	🚫 1.2, 1.668	🚫 16.5, 32.9
6: RNN	✅ 1.5, 1.009	🚫 16.5, 28.528

moaradwan · 2022-08-30T08:07:32Z

@ffuuugor @ashkan-software regarding point 2 in #481 (comment): should I just update the thresholds of the failing tests so that they pass?

ffuuugor

Great work, thanks a lot for this!

On thresholds - yes please, just take the current performance on the CI machine as the norm. It might not match the numbers from the paper because of different hardware, and that's not the point anyway. We want to track relative changes in performance; absolute numbers are not that important.

On the 30 minutes - it's not a big deal, but I would suggest moving this to the nightly tests instead of running it on every commit.
I would create a separate job (rather than bundling benchmarks and integration tests together) and run it only in the nightly workflow. WDYT?
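
One way to turn "current performance on the CI machine as the norm" into concrete numbers is to add some headroom to each measured ratio, so that run-to-run noise does not trip the check. The sketch below assumes a 10% margin and rounding up to one decimal place; both are illustrative choices that nobody in this thread specified, and `threshold_from_measurement` is a hypothetical helper.

```python
import math


def threshold_from_measurement(measured_ratio, headroom=0.10, decimals=1):
    """Round measured_ratio * (1 + headroom) up to `decimals` places.

    The 10% default headroom is an assumption made for this sketch.
    """
    scale = 10 ** decimals
    return math.ceil(measured_ratio * (1 + headroom) * scale) / scale


# gsm/control runtime ratios measured on CI for the recurrent layers
measured = {
    "gru": 35.905399106818614,
    "lstm": 32.93185759570818,
    "rnn": 28.52864726775629,
}
for layer, ratio in measured.items():
    print(layer, threshold_from_measurement(ratio))
```

A threshold derived this way tracks relative regressions on the CI hardware rather than the absolute numbers from the paper.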

moaradwan · 2022-08-31T09:35:44Z

Final results after updating thresholds

The jobs run separately under the name micro_benchmarks_py37_torch_release_cuda, and only in the nightly workflow. The whole run takes ~27 minutes. There are 10 tasks, as follows.

An example run is here.

Group 1: GSM of: (groupnorm, instancenorm, layernorm), and DPMHA

Threshold based on paper:

runtime_ratio_threshold: "2.6"
memory_ratio_threshold: "1.6"
✅ Succeeded
Pipeline runtime: 22s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
groupnorm	107520.0	-	-	140288.0	1.3047619047619048	0.00040995383000020535	-	-	0.000998464277499892	2.4355529926367363
instancenorm	6345728.0	-	-	7394304.0	1.1652412457640795	0.0005690672350000909	-	-	0.0012307228030001341	2.162701922207729
layernorm	28672.0	-	-	37888.0	1.3214285714285714	0.0003554899874999648	-	-	0.000726604483499955	2.043952035358034
mha	13630464.0	13632512.0	1.00015025167155	-	-	0.0012336450990000003	0.001333286202000039	1.0807696663171673	-	-

Group 2: GSM-Linear layer

Threshold based on paper:

runtime_ratio_threshold: "3.6"
memory_ratio_threshold: "13"
✅ Succeeded
Pipeline runtime: 14s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
linear	3283968.0		36903936.0	11.237605238540691	0.00041353099599993477		0.0010285112289999743	2.487144226064584

Group 3: GSM-DPMHA

Threshold based on paper:

runtime_ratio_threshold: "3.5"
memory_ratio_threshold: "2.0"
✅ Succeeded
Pipeline runtime: 23s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
mha	13630464.0		24162304.0	1.772669220945083	0.0012178095074999362		0.0037577736325000903	3.0856826206050023

Group 4&5: DPGRU and GSM-DPGRU

DPGRU:

runtime_ratio_threshold: "18.5"
memory_ratio_threshold: "1.2"
✅ Succeeded
Pipeline runtime: 2m28s

GSM-DPGRU:

runtime_ratio_threshold: "40"
memory_ratio_threshold: "1.6"
✅ Succeeded
Pipeline runtime: 5m42s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
gru	11186176.0	12154368.0	1.0865525448553643	16603136.0	1.484254851702673	0.004054944954499915	0.05725822975399994	14.120593595347938	0.14559441694749992	35.905399106818614

Group 6&7: DPLSTM and GSM-DPLSTM

DPLSTM:

runtime_ratio_threshold: "16.5"
memory_ratio_threshold: "1.2"
✅ Succeeded
Pipeline runtime: 2m12s

GSM-DPLSTM:

runtime_ratio_threshold: "38"
memory_ratio_threshold: "1.8"
✅ Succeeded
Pipeline runtime: 5m8s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
lstm	10801152.0	11527680.0	1.06726393629124	18021376.0	1.6684679560106181	0.004167011488000128	0.051514086188000296	12.362357612007312	0.13722742892250026	32.93185759570818

Group 8&9: DPRNN and GSM-DPRNN

DPRNN:

runtime_ratio_threshold: "10"
memory_ratio_threshold: "1.2"
✅ Succeeded
Pipeline runtime: 1m4s

GSM-DPRNN:

runtime_ratio_threshold: "33"
memory_ratio_threshold: "1.2"
✅ Succeeded
Pipeline runtime: 3m44s

base_layer/value	memory	memory	memory	memory	memory	runtime	runtime	runtime	runtime	runtime
	control	dp	dp/control	gsm	gsm/control	control	dp	dp/control	gsm	gsm/control
rnn	6287360.0	5936640.0	0.9442182410423453	6346240.0	1.0093648208469055	0.003437245790000362	0.020753186856500516	6.03773722463366	0.09805997271550064	28.52864726775629

Group 10: Embedding

Threshold based on paper:

runtime_ratio_threshold: "6.0"
memory_ratio_threshold: "15.0"
✅ Succeeded
Pipeline runtime: 20s

base_layer/value	memory	memory	memory	memory	runtime	runtime	runtime	runtime
	control	dp/control	gsm	gsm/control	control	dp/control	gsm	gsm/control
embedding	24021504.0		280028160.0	11.657394974103203	0.0004076045779994501		0.0023867599680010014	5.855576941052466

moaradwan · 2022-08-31T09:42:31Z

@ffuuugor I updated the code as follows:

Split the tasks into different jobs.
Updated the thresholds. PS: these are ratios, not absolute values.
Used separate tasks for DP and GSM on the recurrent layers so that each can have tighter thresholds.
Run only on nightly.

See the comment above for more details.
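
The effect of splitting DP and GSM into separate tasks can be sketched with the numbers from the final results. The thresholds and (rounded) measured ratios below are taken from the comment above; the dictionary layout and the `failing_tasks` helper are hypothetical and only illustrate the comparison.

```python
# (runtime_ratio_threshold, memory_ratio_threshold) per task, from the final results
thresholds = {
    ("gru", "dp"): (18.5, 1.2),
    ("gru", "gsm"): (40.0, 1.6),
    ("lstm", "dp"): (16.5, 1.2),
    ("lstm", "gsm"): (38.0, 1.8),
    ("rnn", "dp"): (10.0, 1.2),
    ("rnn", "gsm"): (33.0, 1.2),
}

# (runtime ratio, memory ratio) relative to control, rounded from the report tables
measured = {
    ("gru", "dp"): (14.12, 1.087),
    ("gru", "gsm"): (35.91, 1.484),
    ("lstm", "dp"): (12.36, 1.067),
    ("lstm", "gsm"): (32.93, 1.668),
    ("rnn", "dp"): (6.04, 0.944),
    ("rnn", "gsm"): (28.53, 1.009),
}


def failing_tasks(measured, thresholds):
    """Return the tasks whose runtime or memory ratio exceeds its threshold."""
    failures = []
    for task, (runtime_ratio, memory_ratio) in measured.items():
        runtime_max, memory_max = thresholds[task]
        if runtime_ratio > runtime_max or memory_ratio > memory_max:
            failures.append(task)
    return failures


print(failing_tasks(measured, thresholds))  # [] with the split thresholds
```

With a single shared GRU runtime threshold of 40, the DP variant could regress from about 14 to almost 40 without failing; keeping a separate DP threshold of 18.5 catches that much earlier.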

ffuuugor

Awesome, thank you for this amazing contribution! LGTM

Please note that once approved, you shouldn't merge the PR on GitHub but rather land the diff on Phabricator; the PR will then be closed automatically.

facebook-github-bot added the CLA Signed label Aug 25, 2022

moaradwan force-pushed the benchmarks-ci branch from 05c3d0b to b2911a4 Compare August 26, 2022 09:29

moaradwan marked this pull request as ready for review August 26, 2022 12:23

romovpa requested review from ffuuugor and ashkan-software August 26, 2022 12:26

moaradwan force-pushed the benchmarks-ci branch from f86fe79 to 19f0a28 Compare August 26, 2022 14:01

moaradwan force-pushed the benchmarks-ci branch from 19f0a28 to b87eca0 Compare August 26, 2022 14:16

moaradwan force-pushed the benchmarks-ci branch from b87eca0 to b8b2cac Compare August 26, 2022 14:18

ffuuugor reviewed Aug 30, 2022

moaradwan force-pushed the benchmarks-ci branch from ded5a7c to 82474f1 Compare August 31, 2022 08:09

Attia Radwan added 8 commits August 31, 2022 10:58

Add benchmarks to CI

b9d21f9

update ci config

278c614

create different report for each benchmarks run

5fc1ba1

fix by isort

8d908c9

fix ci job

08fc15d

Refactor CI th and groups

b08b6ee

minor changes

6d2c3a2

Split benchmarks jobs and update thresholds

5ee3dd9

moaradwan force-pushed the benchmarks-ci branch from 8fb03aa to 5ee3dd9 Compare August 31, 2022 08:59

benchmark only on nightly

61df954

ffuuugor approved these changes Aug 31, 2022

facebook-github-bot closed this in 1e10a18 Aug 31, 2022

romovpa linked an issue Sep 13, 2022 that may be closed by this pull request

Integrate microbenchmarks into CI pipeline #369

Closed

romovpa mentioned this pull request Sep 13, 2022

Integrate microbenchmarks into CI pipeline #369

Closed
