This repository was archived by the owner on Aug 7, 2025. It is now read-only.

Conversation

@nikhil-sk (Collaborator) commented May 27, 2021

Description

Adds support for automating benchmarks of the BERT model using AWS Neuron.

Fixes part of #1067
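
For context, the bert_neuron model benchmarked below is compiled ahead of time for Inferentia and served as a TorchScript artifact. A rough sketch of producing such an artifact with the Neuron SDK is shown here; the model name, sequence length, and output path are illustrative and not taken from this PR, and the tracing call reflects the public torch-neuron API rather than this repository's scripts.

```python
import torch
import torch_neuron  # noqa: F401 - importing registers torch.neuron (AWS Neuron SDK)
from transformers import BertForSequenceClassification, BertTokenizer

# Illustrative choices; the benchmark suite's actual model/config may differ.
MODEL_NAME = "bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, torchscript=True)
model.eval()

encoded = tokenizer(
    "a sample input for tracing",
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
example_inputs = (encoded["input_ids"], encoded["attention_mask"])

# Compile the model for inf1 and save a scripted artifact that TorchServe can load.
neuron_model = torch.neuron.trace(model, example_inputs=example_inputs)
neuron_model.save("bert_neuron.pt")
```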

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the tests [UT/IT] that you ran to verify your changes and the relevant result summary. Provide instructions so they can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A

  • Test B

  • UT/IT execution results

  • Logs

  • Benchmark report

**bert_neuron | scripted_mode | inf1.6xlarge | batch size 1**

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert_neuron | 100 | 1000 | 0 | 213.10 | 438 | 440 | 2975 | 469.265 | 0.0 | 17.18 | 17.3 | 18.46 | 17.26 | 17.22 | 411.22 | 0.49 |

**bert_neuron | scripted_mode | inf1.6xlarge | batch size 2**

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert_neuron | 100 | 1000 | 0 | 392.73 | 223 | 225 | 3031 | 254.628 | 0.0 | 17.27 | 17.48 | 19.29 | 34.56 | 34.51 | 392.56 | 1.13 |

**bert_neuron | scripted_mode | inf1.6xlarge | batch size 4**

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | vgg16 | 100 | 1000 | 0 | 589.39 | 133 | 137 | 3578 | 169.667 | 0.0 | 20.17 | 21 | 23.93 | 67.03 | 66.98 | 345.57 | 2.15 |

**bert_neuron | scripted_mode | inf1.6xlarge | batch size 8**

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert_neuron | 100 | 1000 | 0 | 537.51 | 141 | 155 | 4267 | 186.043 | 0.0 | 44.04 | 45.37 | 48.27 | 130.94 | 130.88 | 272.18 | 4.09 |

Checklist:

  • Have you added tests that prove your fix is effective or that this feature works?
  • Do new and existing unit tests pass locally with these changes?
  • Has the code been commented, particularly in hard-to-understand areas?
  • TODO: Have you made corresponding changes to the documentation?

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: bdb874a
  • Result: FAILED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-win
  • Commit ID: bdb874a
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: bdb874a
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@nikhil-sk changed the title from "[WIP] Add neuron benchmarking to automation and other enhancements" to "Add neuron benchmarking to automation and other enhancements" on Jun 7, 2021
@nikhil-sk added the "enhancement" (New feature or request) label on Jun 7, 2021
@nikhil-sk force-pushed the neuron_automation branch from 413e952 to 0d37b0b on June 10, 2021 16:36
@nikhil-sk marked this pull request as ready for review on June 10, 2021 16:37
@nikhil-sk force-pushed the neuron_automation branch from 0d37b0b to ddf6d6d on June 14, 2021 15:48
&& git clone https://github.com/pytorch/serve.git \
&& cd serve \
&& git checkout ${BRANCH_NAME} \
&& git checkout --track origin/release_0.4.0 \
Contributor: Why is the branch being hardcoded? It was a parameter earlier.

Collaborator Author: This is a dev change that should have been reverted in the cleanup; fixed.

test_vgg16_benchmark:
inf1.6xlarge:
instance_id: i-04b6adea9c066ad0f
key_filename: /Users/nikhilsk/nskool/serve/test/benchmark/ec2-key-name-6028.pem
Contributor: Can we get the keys from a common shared location and not have user-specific items here?

Collaborator Author: Removed, and this file has been added to .gitignore.

@chauhang (Contributor) left a comment

Please see comments inline. There are a lot of EC2-specific items for the non-Neuron scenario as well. How do we expect contributors to run the benchmark in their environment?

@HamidShojanazeri (Collaborator) left a comment

Thanks @nskool, overall LGTM with some minor changes I added in the inline comments. Also, please handle the error message during model unregistration, where it reported that it failed to unregister the model.

test_vgg16_benchmark:
inf1.6xlarge:
instance_id: i-04b6adea9c066ad0f
key_filename: /Users/nikhilsk/nskool/serve/test/benchmark/ec2-key-name-6028.pem
Collaborator: @nskool please fix this line; in this test it was not required, and the key gets generated and placed in a fixed directory.

Collaborator Author: Removed, and this file has been added to .gitignore.
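
(For readers reproducing this, the behavior mentioned above - the key being generated and placed in a fixed directory - can be approximated with standard boto3 calls. This is a hedged sketch rather than the PR's actual code; the key name and directory are illustrative, and the region mirrors the DEFAULT_REGION used elsewhere in the suite.)

```python
import os
import boto3

def create_benchmark_key(key_name="ec2-key-benchmark", key_dir="/tmp/benchmark_keys"):
    """Generate an EC2 key pair and store the .pem file in a fixed directory."""
    os.makedirs(key_dir, exist_ok=True)
    ec2 = boto3.client("ec2", region_name="us-west-2")
    response = ec2.create_key_pair(KeyName=key_name)
    key_path = os.path.join(key_dir, f"{key_name}.pem")
    with open(key_path, "w") as key_file:
        key_file.write(response["KeyMaterial"])
    os.chmod(key_path, 0o600)  # ssh refuses keys with loose permissions
    return key_path
```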

DEFAULT_REGION = "us-west-2"
IAM_INSTANCE_PROFILE = "EC2Admin"
S3_BUCKET_BENCHMARK_ARTIFACTS = "s3://torchserve-model-serving/benchmark_artifacts"
S3_BUCKET_BENCHMARK_ARTIFACTS = "s3://nikhilsk-model-serving/benchmark_artifacts"
Collaborator: @nskool please make sure we can pass this S3 bucket as an argument.

Collaborator Author: Since this is a non-trivial and non-breaking change, I've added it to the issue that describes the changes in phase 3 of the automation: #1121

Contributor: @nskool Why is the URL change to s3://torchserve-model-serving/xxx vs. your personal S3 bucket a non-trivial change? Whatever tests we have can be put in the main s3://torchserve-model-serving bucket.
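
(For illustration, one lightweight way to make the bucket overridable while keeping the current default is an environment-variable fallback; the variable name below is an assumption, not something this PR defines, and a CLI flag would work similarly.)

```python
import os

# Default artifacts bucket, overridable per user/CI run; the env var name is illustrative.
DEFAULT_BUCKET = "s3://torchserve-model-serving/benchmark_artifacts"
S3_BUCKET_BENCHMARK_ARTIFACTS = os.environ.get(
    "TS_BENCHMARK_ARTIFACTS_BUCKET", DEFAULT_BUCKET
)
```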

8. For generating benchmarking report, modify the argument to function `generate_comprehensive_report()` to point to the s3 bucket uri for the benchmark run. Run the script as:
```
python report.py
```
Collaborator: @nskool as discussed, it might be good to add report.py as part of the automation as well.

Collaborator Author: Will make this part of a separate CR, with an all-inclusive report.

Collaborator Author: This is now fixed in the current PR.

@nikhil-sk force-pushed the neuron_automation branch 2 times, most recently from 30b1a68 to 6ce3107 on June 24, 2021 09:30
@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 30b1a68
  • Result: FAILED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 5fe56cc
  • Result: FAILED
  • Build Logs (available for 30 days)

@nikhil-sk dismissed chauhang’s stale review on June 24, 2021 21:45

Addressed concerns

RUN if [ "$MACHINE_TYPE" = "gpu" ]; then export USE_CUDA=1; fi \
&& git clone https://github.com/pytorch/serve.git \
&& cd serve \
&& git checkout --track origin/release_0.4.0 \
Contributor: Please parameterize the branch name.

Collaborator Author: Done.

DOCKER_BUILDKIT=1 docker build --file Dockerfile --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg CUDA_VERSION=$CUDA_VERSION -t $DOCKER_TAG .
else
DOCKER_BUILDKIT=1 docker build --file Dockerfile.dev -t $DOCKER_TAG --build-arg BUILD_TYPE=$BUILD_TYPE --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg BRANCH_NAME=$BRANCH_NAME --build-arg CUDA_VERSION=$CUDA_VERSION --build-arg MACHINE_TYPE=$MACHINE .
DOCKER_BUILDKIT=1 docker build --pull --network=host --no-cache --file Dockerfile.neuron.dev -t $DOCKER_TAG --build-arg BUILD_TYPE=$BUILD_TYPE --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg BRANCH_NAME=$BRANCH_NAME --build-arg CUDA_VERSION=$CUDA_VERSION --build-arg MACHINE_TYPE=$MACHINE .
Contributor: Please verify whether there is an impact on the benchmark when using --network=host.

Collaborator Author: Reverted; it's no longer needed because I've changed the Neuron benchmark to use a local TorchServe installation on an EC2 instance rather than a Docker installation.

The final benchmark report will be available in markdown format as `report.md` in the `serve/` folder.

**Example report for vgg16 model**
**Example report for vgg11 model**
Contributor: What is the reason for switching to vgg11 here?

Collaborator Author: The results quoted below are vgg11 results; 'vgg16' was just a typo that has been corrected.

requests: 10000
concurrency: 100
backend_profiling: True
backend_profiling: False
Contributor: Is there any reason for this change?

Collaborator Author: Benchmarking results have so far been recorded with 'backend_profiling' set to False; 'True' was only used to feature-test that specifying it works. Benchmarking with backend_profiling: True gives worse results than with backend_profiling: False.
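
(For reference, the keys quoted above come from the model's benchmark YAML; the tiny sketch below only illustrates the shape of that config and how such values parse - it is not the suite's actual loader, and the values mirror the settings discussed.)

```python
import yaml

# Illustrative benchmark settings; backend_profiling stays off for recorded runs.
CONFIG_YAML = """
requests: 10000
concurrency: 100
backend_profiling: False
"""

config = yaml.safe_load(CONFIG_YAML)
assert config["backend_profiling"] is False
assert config["concurrency"] == 100
```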


INSTANCE_TYPES_TO_TEST = ["p3.8xlarge"]

@pytest.mark.skip()
Contributor: Is this being skipped?

Collaborator Author: Yes, apologies, a leftover from testing. Unskipped.

)
docker_repo_tag_for_current_instance = docker_repo_tag
cuda_version_for_instance = cuda_version
cuda_version_for_instance = None
Contributor: Have we validated on GPU/CUDA after these changes?

Collaborator Author: Yes. In fact, validation on CPU was failing, so this 'cuda_version_for_instance' change is a bug fix: the value should have been None. I've validated that the GPU Docker image is used after these changes.
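
(A simplified sketch of the intent behind this fix, using the GPU_INSTANCES list quoted just below: only GPU instance families get a CUDA version, while CPU and Inferentia hosts fall back to None. This is illustrative rather than the PR's exact code, and the CUDA version string is a placeholder.)

```python
GPU_INSTANCES = ["p2", "p3", "p4", "g2", "g3", "g4"]

def cuda_version_for_instance(instance_type, cuda_version):
    """Return the CUDA version only for GPU instance families, else None."""
    family = instance_type.split(".")[0]
    if any(family.startswith(prefix) for prefix in GPU_INSTANCES):
        return cuda_version
    return None

assert cuda_version_for_instance("p3.8xlarge", "cu111") == "cu111"
assert cuda_version_for_instance("c4.4xlarge", "cu111") is None
assert cuda_version_for_instance("inf1.6xlarge", "cu111") is None
```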


GPU_INSTANCES = ["p2", "p3", "p4", "g2", "g3", "g4"]

# DLAMI with nVidia Driver ver. 450.119.03 (support upto CUDA 11.2), Ubuntu 18.04
Contributor: Is this description still valid for the new AMI?

Collaborator Author: Yes, it still holds true.

# Download the s3 files
run(f"mkdir -p /tmp/report")
run(f"aws s3 cp --recursive {s3_bucket_uri} /tmp/report")
# run(f"mkdir -p /tmp/report")
Contributor: Do we need these commented lines?

Collaborator Author: We do - currently 'report.py' needs to be run separately from the automation. I've added a task here to automate that as well: #1121

Collaborator Author: Fixed in the current PR now.

config_header = f"{model} | {mode} | {instance_type} | batch size {batch_size}"

markdownDocument.add_paragraph(config_header, bold=True, newline=True)
#markdownDocument.add_paragraph(config_header, bold=True, newline=True)
Contributor: Is this needed?

Collaborator Author: Removed, thanks.


# Clean up
run(f"rm -rf /tmp/report")
# run(f"rm -rf /tmp/report")
Contributor: Is this needed? It might be better to put logic in the code so that it automatically creates the report folder if it does not exist, so that one does not have to keep commenting and uncommenting these lines for reruns.

Collaborator Author: Agreed, that's the plan - but it will be done later as part of a couple of other changes listed here: #1121. The reason is that the 'wiring' needed to automate this, i.e. making the 's3 folder' available to the report.py script, is non-trivial and will need additional testing.

Collaborator Author: This has been fixed in the current PR itself.
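
(The suggestion above boils down to making the working directory idempotent so reruns need no manual commenting. A minimal sketch, assuming report.py keeps using /tmp/report and the aws CLI as in the quoted snippet; the helper name is illustrative.)

```python
import os
import shutil
import subprocess

def fetch_report_artifacts(s3_bucket_uri, report_dir="/tmp/report"):
    """Recreate the report folder on every run, then pull the benchmark artifacts."""
    shutil.rmtree(report_dir, ignore_errors=True)  # safe when the folder is absent
    os.makedirs(report_dir, exist_ok=True)
    subprocess.run(
        ["aws", "s3", "cp", "--recursive", s3_bucket_uri, report_dir],
        check=True,
    )
    return report_dir
```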

@chauhang (Contributor) left a comment

Please see the review comments inline.

@msaroufim (Member) commented Jul 1, 2021

Documenting how benchmarking a CPU-only model worked for me:

  1. In suite/docker/docker.yaml, deleted everything not CPU-specific
  2. In vgg16.yaml, deleted the torchscript section and deleted the gpu processor for eager mode
  3. In __init__.py, added my own S3 bucket
  4. In Dockerfile.dev, cloned my own fork and checked it out
  5. In test_vgg16.py I needed to uncomment LOGGER.info(f"processors: {processors[1]}"), and this issue exists in all the test files

Items 1, 2, and 3 just need mentions in the README.

Item 4 is a bit tricky and could be simplified. I haven't tried local benchmarks instead of Docker benchmarks yet.

Item 5 is a simple fix but is just annoying.

Still getting a weird error: https://gist.github.com/msaroufim/fb9ceb5be8ad710ae46e4464f56490c7

And I did not expect to see 3 instances created on my AWS account instead of just 1.

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: d06fa02
  • Result: FAILED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: d06fa02
  • Result: FAILED
  • Build Logs (available for 30 days)

@msaroufim (Member) commented Jul 9, 2021

Alright, I'm really glad to see the improvements.

For example, I was trying to run CPU benchmarks after setting torch_num_threads(1), and all I had to do was run:

python test/benchmark/run_benchmark.py --use-torchserve-branch master --run-only mnist 2>&1 | tee ~/benchmark.log

In a future PR it would be great to make the S3 bucket configurable in __init__.py, but once I set it myself this worked like a charm: S3_BUCKET_BENCHMARK_ARTIFACTS = "s3://torchserve-benchmark-fb/benchmark_artifacts"

Also, an awkward change was needed in Dockerfile.dev to replace the git checkout line with && git checkout ${BRANCH_NAME} \

Master benchmarks throughput: ~190 im/s

But it was cool because I could validate that the metrics were printed to my console:
(screenshot: console output showing the benchmark metrics)

That they were indeed in the S3 bucket I configured:
(screenshot: the configured S3 bucket containing the benchmark artifacts)

Actual metrics on the im threads = 1 branch: 267 im/s

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | mnist | 10 | 1000 | 0 | 267.32 | 34 | 65 | 126 | 37.409 | 0 | 7.48 | 10.42 | 11.37 | 27.07 | 27.02 | 6.29 | 1.01 |

In test_mnist.py, all I had to do was confirm the instance type INSTANCE_TYPES_TO_TEST = ["c4.4xlarge"], and GPU also worked just great.

I tried out a GPU instance as well, and that also worked easily with INSTANCE_TYPES_TO_TEST = ["p2.xlarge"], which created the instance correctly in my EC2 console.
(screenshot: EC2 console showing the created p2.xlarge instance)

list_of_rows: [['Benchmark', 'Model', 'Concurrency', 'Requests', 'TS failed requests', 'TS throughput', 'TS latency P50', 'TS latency P90', 'TS latency P99', 'TS latency mean', 'TS error rate', 'Model_p50', 'Model_p90', 'Model_p99', 'predict_mean', 'handler_time_mean', 'waiting_time_mean', 'worker_thread_mean'], ['AB', 'mnist', '10', '1000', '0', '2253.33', '3', '5', '79', '4.438', '0.0', '1.35', '1.37', '1.37', '1.61', '1.57', '0.25', '0.92']]

@msaroufim self-requested a review on July 9, 2021 02:29
@nikhil-sk force-pushed the neuron_automation branch from d06fa02 to 09d54a3 on July 9, 2021 19:50
@nikhil-sk force-pushed the neuron_automation branch from ec0bd82 to ddd9d27 on July 9, 2021 23:39
@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-win
  • Commit ID: ec0bd82
  • Result: FAILED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: ec0bd82
  • Result: FAILED
  • Build Logs (available for 30 days)

@lxning merged commit ccaba6f into pytorch:master on Jul 12, 2021
