This repository was archived by the owner on Aug 7, 2025. It is now read-only.

Conversation

@nikhil-sk (Collaborator) commented May 27, 2021

Description

Adds support for automating benchmarks of the BERT model using AWS Neuron.

Fixes part of #1067
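
For context, the bert_neuron model benchmarked below is compiled ahead of time for Inferentia and served as a TorchScript artifact. A rough sketch of producing such an artifact with the Neuron SDK is shown here; the model name, sequence length, and output path are illustrative and not taken from this PR, and the tracing call reflects the public torch-neuron API rather than this repository's scripts.

```python
import torch
import torch_neuron  # noqa: F401 - importing registers torch.neuron (AWS Neuron SDK)
from transformers import BertForSequenceClassification, BertTokenizer

# Illustrative choices; the benchmark suite's actual model/config may differ.
MODEL_NAME = "bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, torchscript=True)
model.eval()

encoded = tokenizer(
    "a sample input for tracing",
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
example_inputs = (encoded["input_ids"], encoded["attention_mask"])

# Compile the model for inf1 and save a scripted artifact that TorchServe can load.
neuron_model = torch.neuron.trace(model, example_inputs=example_inputs)
neuron_model.save("bert_neuron.pt")
```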

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the tests [UT/IT] that you ran to verify your changes and the relevant result summary. Provide instructions so they can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A

  • Test B

  • UT/IT execution results

  • Logs

  • Benchmark report

**bert_neuron | scripted_mode | inf1.6xlarge | batch size 1**

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert_neuron | 100 | 1000 | 0 | 213.10 | 438 | 440 | 2975 | 469.265 | 0.0 | 17.18 | 17.3 | 18.46 | 17.26 | 17.22 | 411.22 | 0.49 |

**bert_neuron | scripted_mode | inf1.6xlarge | batch size 2**

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert_neuron | 100 | 1000 | 0 | 392.73 | 223 | 225 | 3031 | 254.628 | 0.0 | 17.27 | 17.48 | 19.29 | 34.56 | 34.51 | 392.56 | 1.13 |

**bert_neuron | scripted_mode | inf1.6xlarge | batch size 4**

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | vgg16 | 100 | 1000 | 0 | 589.39 | 133 | 137 | 3578 | 169.667 | 0.0 | 20.17 | 21 | 23.93 | 67.03 | 66.98 | 345.57 | 2.15 |

**bert_neuron | scripted_mode | inf1.6xlarge | batch size 8**

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert_neuron | 100 | 1000 | 0 | 537.51 | 141 | 155 | 4267 | 186.043 | 0.0 | 44.04 | 45.37 | 48.27 | 130.94 | 130.88 | 272.18 | 4.09 |

Checklist:

  • Have you added tests that prove your fix is effective or that this feature works?
  • Do new and existing unit tests pass locally with these changes?
  • Has the code been commented, particularly in hard-to-understand areas?
  • TODO: Have you made corresponding changes to the documentation?

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: bdb874a
  • Result: FAILED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-win
  • Commit ID: bdb874a
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: bdb874a
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@nikhil-sk changed the title from "[WIP] Add neuron benchmarking to automation and other enhancements" to "Add neuron benchmarking to automation and other enhancements" on Jun 7, 2021
@nikhil-sk added the "enhancement" (New feature or request) label on Jun 7, 2021
@nikhil-sk force-pushed the neuron_automation branch from 413e952 to 0d37b0b on June 10, 2021 16:36
@nikhil-sk marked this pull request as ready for review on June 10, 2021 16:37
@nikhil-sk force-pushed the neuron_automation branch from 0d37b0b to ddf6d6d on June 14, 2021 15:48
&& git clone https://github.com/pytorch/serve.git \
&& cd serve \
&& git checkout ${BRANCH_NAME} \
&& git checkout --track origin/release_0.4.0 \
Contributor: Why is the branch being hardcoded? It was a parameter earlier.

Collaborator Author: This is a dev change that should have been reverted in the cleanup; fixed.

test_vgg16_benchmark:
inf1.6xlarge:
instance_id: i-04b6adea9c066ad0f
key_filename: /Users/nikhilsk/nskool/serve/test/benchmark/ec2-key-name-6028.pem
Contributor: Can we get the keys from a common shared location and not have user-specific items here?

Collaborator Author: Removed, and this file has been added to .gitignore.

@chauhang (Contributor) left a comment

Please see comments inline. There are a lot of EC2-specific items for the non-Neuron scenario as well. How do we expect contributors to run the benchmark in their environment?

@HamidShojanazeri (Collaborator) left a comment

Thanks @nskool, overall LGTM with some minor changes I added in the inline comments. Also, please handle the error message during model unregistration, where it reported that it failed to unregister the model.

test_vgg16_benchmark:
inf1.6xlarge:
instance_id: i-04b6adea9c066ad0f
key_filename: /Users/nikhilsk/nskool/serve/test/benchmark/ec2-key-name-6028.pem
Collaborator: @nskool please fix this line; in this test it was not required, and the key gets generated and placed in a fixed directory.

Collaborator Author: Removed, and this file has been added to .gitignore.
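
(For readers reproducing this, the behavior mentioned above - the key being generated and placed in a fixed directory - can be approximated with standard boto3 calls. This is a hedged sketch rather than the PR's actual code; the key name and directory are illustrative, and the region mirrors the DEFAULT_REGION used elsewhere in the suite.)

```python
import os
import boto3

def create_benchmark_key(key_name="ec2-key-benchmark", key_dir="/tmp/benchmark_keys"):
    """Generate an EC2 key pair and store the .pem file in a fixed directory."""
    os.makedirs(key_dir, exist_ok=True)
    ec2 = boto3.client("ec2", region_name="us-west-2")
    response = ec2.create_key_pair(KeyName=key_name)
    key_path = os.path.join(key_dir, f"{key_name}.pem")
    with open(key_path, "w") as key_file:
        key_file.write(response["KeyMaterial"])
    os.chmod(key_path, 0o600)  # ssh refuses keys with loose permissions
    return key_path
```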

DEFAULT_REGION = "us-west-2"
IAM_INSTANCE_PROFILE = "EC2Admin"
S3_BUCKET_BENCHMARK_ARTIFACTS = "s3://torchserve-model-serving/benchmark_artifacts"
S3_BUCKET_BENCHMARK_ARTIFACTS = "s3://nikhilsk-model-serving/benchmark_artifacts"
Collaborator: @nskool please make sure we can pass this S3 bucket as an argument.

Collaborator Author: Since this is a non-trivial and non-breaking change, I've added it to the issue that describes the changes in phase 3 of the automation: #1121

Contributor: @nskool Why is the URL change to s3://torchserve-model-serving/xxx vs. your personal S3 bucket a non-trivial change? Whatever tests we have can be put in the main s3://torchserve-model-serving bucket.
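
(For illustration, one lightweight way to make the bucket overridable while keeping the current default is an environment-variable fallback; the variable name below is an assumption, not something this PR defines, and a CLI flag would work similarly.)

```python
import os

# Default artifacts bucket, overridable per user/CI run; the env var name is illustrative.
DEFAULT_BUCKET = "s3://torchserve-model-serving/benchmark_artifacts"
S3_BUCKET_BENCHMARK_ARTIFACTS = os.environ.get(
    "TS_BENCHMARK_ARTIFACTS_BUCKET", DEFAULT_BUCKET
)
```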

8. For generating benchmarking report, modify the argument to function `generate_comprehensive_report()` to point to the s3 bucket uri for the benchmark run. Run the script as:
```
python report.py
```
Collaborator: @nskool as discussed, it might be good to add report.py as part of the automation as well.

Collaborator Author: Will make this part of a separate CR, with an all-inclusive report.

Collaborator Author: This is now fixed in the current PR.

@nikhil-sk force-pushed the neuron_automation branch 2 times, most recently from 30b1a68 to 6ce3107 on June 24, 2021 09:30
@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 30b1a68
  • Result: FAILED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 5fe56cc
  • Result: FAILED
  • Build Logs (available for 30 days)

@nikhil-sk dismissed chauhang’s stale review on June 24, 2021 21:45

Addressed concerns

RUN if [ "$MACHINE_TYPE" = "gpu" ]; then export USE_CUDA=1; fi \
&& git clone https://github.com/pytorch/serve.git \
&& cd serve \
&& git checkout --track origin/release_0.4.0 \
Contributor: Please parameterize the branch name.

Collaborator Author: Done.

DOCKER_BUILDKIT=1 docker build --file Dockerfile --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg CUDA_VERSION=$CUDA_VERSION -t $DOCKER_TAG .
else
DOCKER_BUILDKIT=1 docker build --file Dockerfile.dev -t $DOCKER_TAG --build-arg BUILD_TYPE=$BUILD_TYPE --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg BRANCH_NAME=$BRANCH_NAME --build-arg CUDA_VERSION=$CUDA_VERSION --build-arg MACHINE_TYPE=$MACHINE .
DOCKER_BUILDKIT=1 docker build --pull --network=host --no-cache --file Dockerfile.neuron.dev -t $DOCKER_TAG --build-arg BUILD_TYPE=$BUILD_TYPE --build-arg BASE_IMAGE=$BASE_IMAGE --build-arg BRANCH_NAME=$BRANCH_NAME --build-arg CUDA_VERSION=$CUDA_VERSION --build-arg MACHINE_TYPE=$MACHINE .
Contributor: Please verify whether there is an impact on the benchmark when using --network=host.

Collaborator Author: Reverted; it's no longer needed because I've changed the Neuron benchmark to use a local TorchServe installation on an EC2 instance rather than a Docker installation.

The final benchmark report will be available in markdown format as `report.md` in the `serve/` folder.

**Example report for vgg16 model**
**Example report for vgg11 model**
Contributor: What is the reason for switching to vgg11 here?

Collaborator Author: The results quoted below are vgg11 results; 'vgg16' was just a typo that has been corrected.

requests: 10000
concurrency: 100
backend_profiling: True
backend_profiling: False
Contributor: Is there any reason for this change?

Collaborator Author: Benchmarking results have so far been recorded with 'backend_profiling' set to False; 'True' was only used to feature-test that specifying it works. Benchmarking with backend_profiling: True gives worse results than with backend_profiling: False.
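
(For reference, the keys quoted above come from the model's benchmark YAML; the tiny sketch below only illustrates the shape of that config and how such values parse - it is not the suite's actual loader, and the values mirror the settings discussed.)

```python
import yaml

# Illustrative benchmark settings; backend_profiling stays off for recorded runs.
CONFIG_YAML = """
requests: 10000
concurrency: 100
backend_profiling: False
"""

config = yaml.safe_load(CONFIG_YAML)
assert config["backend_profiling"] is False
assert config["concurrency"] == 100
```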


INSTANCE_TYPES_TO_TEST = ["p3.8xlarge"]

@pytest.mark.skip()
Contributor: Is this being skipped?

Collaborator Author: Yes, apologies, a leftover from testing. Unskipped.

)
docker_repo_tag_for_current_instance = docker_repo_tag
cuda_version_for_instance = cuda_version
cuda_version_for_instance = None
Contributor: Have we validated on GPU/CUDA after these changes?

Collaborator Author: Yes. In fact, validation on CPU was failing, so this 'cuda_version_for_instance' change is a bug fix: the value should have been None. I've validated that the GPU Docker image is used after these changes.
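
(A simplified sketch of the intent behind this fix, using the GPU_INSTANCES list quoted just below: only GPU instance families get a CUDA version, while CPU and Inferentia hosts fall back to None. This is illustrative rather than the PR's exact code, and the CUDA version string is a placeholder.)

```python
GPU_INSTANCES = ["p2", "p3", "p4", "g2", "g3", "g4"]

def cuda_version_for_instance(instance_type, cuda_version):
    """Return the CUDA version only for GPU instance families, else None."""
    family = instance_type.split(".")[0]
    if any(family.startswith(prefix) for prefix in GPU_INSTANCES):
        return cuda_version
    return None

assert cuda_version_for_instance("p3.8xlarge", "cu111") == "cu111"
assert cuda_version_for_instance("c4.4xlarge", "cu111") is None
assert cuda_version_for_instance("inf1.6xlarge", "cu111") is None
```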


GPU_INSTANCES = ["p2", "p3", "p4", "g2", "g3", "g4"]

# DLAMI with nVidia Driver ver. 450.119.03 (support upto CUDA 11.2), Ubuntu 18.04
Contributor: Is this description still valid for the new AMI?

Collaborator Author: Yes, it still holds true.

# Download the s3 files
run(f"mkdir -p /tmp/report")
run(f"aws s3 cp --recursive {s3_bucket_uri} /tmp/report")
# run(f"mkdir -p /tmp/report")
Contributor: Do we need these commented lines?

Collaborator Author: We do - currently 'report.py' needs to be run separately from the automation. I've added a task here to automate that as well: #1121

Collaborator Author: Fixed in the current PR now.

config_header = f"{model} | {mode} | {instance_type} | batch size {batch_size}"

markdownDocument.add_paragraph(config_header, bold=True, newline=True)
#markdownDocument.add_paragraph(config_header, bold=True, newline=True)
Contributor: Is this needed?

Collaborator Author: Removed, thanks.


# Clean up
run(f"rm -rf /tmp/report")
# run(f"rm -rf /tmp/report")
Contributor: Is this needed? It might be better to put logic in the code so that it automatically creates the report folder if it does not exist, so that one does not have to keep commenting and uncommenting these lines for reruns.

Collaborator Author: Agreed, that's the plan - but it will be done later as part of a couple of other changes listed here: #1121. The reason is that the 'wiring' needed to automate this, i.e. making the 's3 folder' available to the report.py script, is non-trivial and will need additional testing.

Collaborator Author: This has been fixed in the current PR itself.
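
(The suggestion above boils down to making the working directory idempotent so reruns need no manual commenting. A minimal sketch, assuming report.py keeps using /tmp/report and the aws CLI as in the quoted snippet; the helper name is illustrative.)

```python
import os
import shutil
import subprocess

def fetch_report_artifacts(s3_bucket_uri, report_dir="/tmp/report"):
    """Recreate the report folder on every run, then pull the benchmark artifacts."""
    shutil.rmtree(report_dir, ignore_errors=True)  # safe when the folder is absent
    os.makedirs(report_dir, exist_ok=True)
    subprocess.run(
        ["aws", "s3", "cp", "--recursive", s3_bucket_uri, report_dir],
        check=True,
    )
    return report_dir
```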

@chauhang (Contributor) left a comment

Please see the review comments inline.

@msaroufim (Member) commented Jul 1, 2021

Documenting how benchmarking a CPU-only model worked for me:

  1. In suite/docker/docker.yaml, deleted everything not CPU-specific
  2. In vgg16.yaml, deleted the torchscript section and deleted the gpu processor for eager mode
  3. In __init__.py, added my own S3 bucket
  4. In Dockerfile.dev, cloned my own fork and checked it out
  5. In test_vgg16.py I needed to uncomment LOGGER.info(f"processors: {processors[1]}"), and this issue exists in all the test files

Items 1, 2, and 3 just need mentions in the README.

Item 4 is a bit tricky and could be simplified. I haven't tried local benchmarks instead of Docker benchmarks yet.

Item 5 is a simple fix but is just annoying.

Still getting a weird error: https://gist.github.com/msaroufim/fb9ceb5be8ad710ae46e4464f56490c7

And I did not expect to see 3 instances created on my AWS account instead of just 1.

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: d06fa02
  • Result: FAILED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: d06fa02
  • Result: FAILED
  • Build Logs (available for 30 days)

@msaroufim (Member) commented Jul 9, 2021

Alright, I'm really glad to see the improvements.

For example, I was trying to run CPU benchmarks after setting torch_num_threads(1), and all I had to do was run:

python test/benchmark/run_benchmark.py --use-torchserve-branch master --run-only mnist 2>&1 | tee ~/benchmark.log

In a future PR it would be great to make the S3 bucket configurable in __init__.py, but once I set it myself this worked like a charm: S3_BUCKET_BENCHMARK_ARTIFACTS = "s3://torchserve-benchmark-fb/benchmark_artifacts"

Also, an awkward change was needed in Dockerfile.dev to replace the git checkout line with && git checkout ${BRANCH_NAME} \

Master benchmarks throughput: ~190 im/s

But it was cool because I could validate that the metrics were printed to my console:
(screenshot: console output showing the benchmark metrics)

That they were indeed in the S3 bucket I configured:
(screenshot: the configured S3 bucket containing the benchmark artifacts)

Actual metrics on the im threads = 1 branch: 267 im/s

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | mnist | 10 | 1000 | 0 | 267.32 | 34 | 65 | 126 | 37.409 | 0 | 7.48 | 10.42 | 11.37 | 27.07 | 27.02 | 6.29 | 1.01 |

In test_mnist.py, all I had to do was confirm the instance type INSTANCE_TYPES_TO_TEST = ["c4.4xlarge"], and GPU also worked just great.

I tried out a GPU instance as well, and that also worked easily with INSTANCE_TYPES_TO_TEST = ["p2.xlarge"], which created the instance correctly in my EC2 console.
(screenshot: EC2 console showing the created p2.xlarge instance)

list_of_rows: [['Benchmark', 'Model', 'Concurrency', 'Requests', 'TS failed requests', 'TS throughput', 'TS latency P50', 'TS latency P90', 'TS latency P99', 'TS latency mean', 'TS error rate', 'Model_p50', 'Model_p90', 'Model_p99', 'predict_mean', 'handler_time_mean', 'waiting_time_mean', 'worker_thread_mean'], ['AB', 'mnist', '10', '1000', '0', '2253.33', '3', '5', '79', '4.438', '0.0', '1.35', '1.37', '1.37', '1.61', '1.57', '0.25', '0.92']]

@msaroufim self-requested a review on July 9, 2021 02:29
@nikhil-sk force-pushed the neuron_automation branch from d06fa02 to 09d54a3 on July 9, 2021 19:50
@nikhil-sk force-pushed the neuron_automation branch from ec0bd82 to ddd9d27 on July 9, 2021 23:39
@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-win
  • Commit ID: ec0bd82
  • Result: FAILED
  • Build Logs (available for 30 days)

@sagemaker-neo-ci-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: ec0bd82
  • Result: FAILED
  • Build Logs (available for 30 days)

@lxning merged commit ccaba6f into pytorch:master on Jul 12, 2021
