CPU optimizations #1138

Closed
wants to merge 15 commits into from

Conversation


@msaroufim msaroufim commented Jun 23, 2021

Description

CPU-based models currently suffer from very poor throughput unless we set torch.set_num_threads(1)

Tested on BERT mini, this change gives roughly a 100x throughput improvement on a 32-core GCP machine, on both bare metal and Docker. I still need to test a few more models to confirm that this pattern is reproducible.

This issue is not unique to TorchServe; it is a general recommendation when deploying PyTorch models: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html

Below is an example of a toy ResNet model that shows the issue when running on a CPU instance in Google Colab.

(Screenshot: toy ResNet throughput comparison in Colab)

So this PR introduces a simple change in the base handler that sets the number of torch threads to 1 (a sketch of the change is shown below). Since the base handler is used in many places, here is a breakdown of which handlers need to change and which can stay the same.
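For reference, a minimal sketch of roughly what the change amounts to in the base handler's initialize; this is illustrative only, not the exact diff, and the class below is a trimmed-down stand-in for ts.torch_handler.base_handler.BaseHandler:

```python
# Sketch of the base handler change -- illustrative, not the exact diff in this PR.
import torch


class BaseHandler:
    """Trimmed-down stand-in for ts.torch_handler.base_handler.BaseHandler."""

    def initialize(self, context):
        # Limit intra-op parallelism so each worker process uses a single OpenMP
        # thread; multiple TorchServe workers then provide request-level parallelism.
        torch.set_num_threads(1)

        properties = context.system_properties
        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )
        # ... existing model-loading logic continues unchanged ...
```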

Can stay the same

  • The default supported handlers require no change since they don't override the device anywhere
  • WaveGlow is GPU only
  • Text classification uses the default handler
  • Object detection uses the default handler
  • Image segmentation uses the default handler
  • All image classifiers
  • Workflow examples

Need update

Fixes #1128

Type of change

Please delete options that are not relevant.

  • [x] Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

Take a look at the attached mar file, run it, and observe the poor performance. If you then unzip it, uncomment torch.set_num_threads(1) in the handler, and re-archive it, you will see the improvement:

torch-model-archiver --model-name ptclassifier --version 1.0 --serialized-file traced_pt_classifer.pt --handler ./TransformerSeqClassificationHandler.py --extra-files "./config.json,./labels.json,./setup_config.json,./requirements.txt,./tokenizer.json,./tokenizer_config.json,./vocab.txt" --model-file model_ph.py

And then

mkdir model_store
mv ptclassifier.mar model_store

torchserve --ncs --model-store model_store
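Since --models isn't passed at startup here, the model can be registered afterwards through the management API. A minimal sketch, assuming default ports and the ptclassifier.mar produced above (uses the requests package, pip install requests if needed):

```python
# Register the archived model and spin up one worker via the management API.
# Assumes TorchServe is running locally with default ports and that
# ptclassifier.mar is in the model store.
import requests

resp = requests.post(
    "http://localhost:8081/models",
    params={"url": "ptclassifier.mar", "initial_workers": 1, "synchronous": "true"},
)
print(resp.status_code, resp.text)
```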

After launching torchserve, it's time to start the benchmark

To test this improvement we used JMeter with the JMX file from here: https://gist.github.com/msaroufim/1451b2623919e1b0a13c57d560867c18

Then to run the test

wget https://downloads.apache.org//jmeter/binaries/apache-jmeter-5.4.1.zip 
unzip apache-jmeter-5.4.1.zip
cd apache-jmeter-5.4.1/bin
./jmeter -n -t /home/marksaroufim/test.jmx -l testresults.jtl
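Before kicking off the full JMeter run, it can help to sanity-check the endpoint with a single request. A sketch, assuming the model is registered as ptclassifier and its handler accepts raw text in the request body:

```python
# Send one inference request to verify the endpoint responds before benchmarking.
# The exact payload format depends on the handler; raw UTF-8 text is assumed here.
import requests

resp = requests.post(
    "http://localhost:8080/predictions/ptclassifier",
    data="Bloomberg has decided to publish a new report on the global economy.".encode("utf-8"),
)
print(resp.status_code, resp.text)
```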

BERT Results

torch.set_num_threads(16) default

Bare metal: 4.3/s
Docker: 4.1/s

torch.set_num_threads(1) (this change)

Bare metal: 535.0/s
Docker: 524.0/s

| Machine | num_threads | Throughput | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) | Error Rate | Netty Threads |
|---|---|---|---|---|---|---|---|
| GCP N1 32 core | 16 | 4.3/s | 6990 | 3601 | 9750 | 0 | 30 |
| Docker | 16 | 4.1/s | 6939 | 4049 | 9390 | 0 | 30 |
| GCP N1 32 core | 1 | 535.0/s | 56 | 12 | 799 | 0 | 30 |
| Docker | 1 | 524.0/s | 57 | 11 | 635 | 0 | 30 |

ResNet-18

We also tested this change on ResNet-18 and observed the following improvements; not as drastic, but still substantial.

| Model | Num_torch_threads | Throughput | Netty Threads | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) |
|---|---|---|---|---|---|---|
| https://torchserve.pytorch.org/mar_files/resnet-18.mar | 1 | 1161.1/s | 30 | 25 | 1 | 179 |
| https://torchserve.pytorch.org/mar_files/resnet-18.mar | 18 | 585/s | 30 | 50 | 1 | 98 |

Faster CPU

Faster CPUs make a substantial difference:

| Machine | num_threads | Throughput | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) | Error Rate | Netty Threads |
|---|---|---|---|---|---|---|---|
| GCP N1 32 core | 16 | 4.3/s | 6990 | 3601 | 9750 | 0 | 30 |
| Docker | 16 | 4.1/s | 6939 | 4049 | 9390 | 0 | 30 |
| GCP N1 32 core | 1 | 535.0/s | 56 | 12 | 799 | 0 | 30 |
| Docker | 1 | 524.0/s | 57 | 11 | 635 | 0 | 30 |
| AWS C5n 36 core | 1 | 2160.6/s | 13 | 11 | 290 | 0 | 30 |

Batch inference

On a different AWS C5n machine with 36 cores I repeated a similar experiment to measure the effect of batch size. The type of hardware makes a huge difference for CPU inference, so the recommendation should be to use the latest generation.

For batch size 256 we saw a roughly 20% improvement in throughput over the default batch size 1, so our recommendation for CPU inference should be to use batching. Batch sizes 512 and 1024 caused the model to hang.

| Batch size | Throughput (r/s) | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) | Num Netty Threads | Improvement over batch size 1 |
|---|---|---|---|---|---|---|
| 1 | 2160.6 | 13 | 0 | 290 | 30 | 0% |
| 4 | 2117.9 | 13 | 0 | 262 | 30 | -1.98% |
| 128 | 2152.7 | 13 | 0 | 240 | 30 | -0.37% |
| 256 | 2555.1 | 11 | 0 | 194 | 30 | 18.26% |
The config.properties used:

```
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
job_queue_size=1000
number_of_netty_threads=36
number_of_client_threads=36
model_store=/home/model-server/model-store
models={\
  "ptclassifier": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "ptclassifier.mar",\
        "batchSize": 512,\
        "maxBatchDelay": 1000,\
        "responseTimeout": 1200\
    }\
  }\
}
```
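As an alternative to hard-coding batchSize in config.properties, the batch parameters can also be passed when registering the model through the management API. A sketch under the same assumptions as above (local TorchServe, default ports, ptclassifier.mar in the model store):

```python
# Register with batching enabled instead of configuring it in config.properties.
# batch_size / max_batch_delay mirror the batchSize / maxBatchDelay keys above.
import requests

resp = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "ptclassifier.mar",
        "initial_workers": 1,
        "batch_size": 256,        # best-performing batch size in the table above
        "max_batch_delay": 1000,  # ms to wait while filling a batch
        "synchronous": "true",
    },
)
print(resp.status_code, resp.text)
```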
  • UT/IT execution results

  • Logs

Checklist:

  • [x] Have you added tests that prove your fix is effective or that this feature works?
  • [x] New and existing unit tests pass locally with these changes?

Setting torch.set_num_threads(32) will saturate all cores but will not improve throughput.

So what increasing threads really does is just increase the number of context switches, which you can't identify with the current PyTorch profiler because it doesn't profile from within an OpenMP thread:

```
op start
    profiler record start time
        parallel_for
            OpenMP: multiple threads running concurrently; each thread runs the op logic on a single input partition
    profiler record end time
op end
```

(Screenshot: PyTorch profiler output)
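A quick way to see the interplay is to inspect the thread settings directly. A minimal sketch; setting OMP_NUM_THREADS=1 before the process starts is an alternative route to pinning the intra-op pool:

```python
# Inspect and pin PyTorch's CPU thread pools.
import torch

print("intra-op threads:", torch.get_num_threads())          # OpenMP pool used by parallel_for
print("inter-op threads:", torch.get_num_interop_threads())  # pool for running independent ops

torch.set_num_threads(1)  # the change benchmarked in this PR
print("intra-op threads after pinning:", torch.get_num_threads())
```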

Grid Search

Grid search shows that nothing except the number of torch threads meaningfully increases performance; every other parameter barely moves the needle.

Repro instructions: https://gist.github.com/msaroufim/73cf7f642e65a27f1ecb3ef5e832abd9
Results for the default thread count and for 1 thread, with all other parameters swept (num_workers, number_of_netty_threads, queue size, etc.):
Default threads: https://gist.github.com/msaroufim/d8e5090cc07ae4e6f833a0dc330950ed
1 thread: https://gist.github.com/msaroufim/cdf32020c1ba24e8bb32e990588f1994

Automated benchmark report

Artifacts are here, but the test has been flaky and randomly hangs with no error message: https://s3.console.aws.amazon.com/s3/buckets/torchserve-benchmark-fb?region=us-west-2&tab=objects

MNIST before change

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | mnist | 10 | 1000 | 0 | 768.34 | 13 | 19 | 87 | 13.015 | 0 | 4.11 | 10.02 | 11.47 | 6.78 | 6.73 | 2.35 | 1.6 |

MNIST after change (minimal improvement)

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | mnist | 10 | 1000 | 0 | 777.51 | 13 | 19 | 66 | 12.862 | 0 | 5.46 | 10.27 | 11.27 | 7.13 | 7.08 | 2.22 | 1.72 |

VGG 11 before change

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | vgg11 | 100 | 1000 | 0 | 6.84 | 14528 | 15308 | 15768 | 14623.285 | 0 | 1739.53 | 1809.89 | 1809.89 | 2328.74 | 2328.67 | 11660.6 | 5.84 |

VGG 11 after change (double throughput)

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | vgg11 | 100 | 1000 | 0 | 9.63 | 10191 | 11603 | 13215 | 10388.174 | 0 | 2764.69 | 3004.37 | 3004.37 | 3301.74 | 3301.68 | 6736.52 | 7.51 |

BERT before change (no improvement measured with Apache Bench, but improvement seen with JMeter and the custom grid search)

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert | 100 | 1000 | 0 | 0.45 | 221060 | 224425 | 227060 | 221136.921 | 0 | 6833.5 | 7046.13 | 7046.13 | 8841.3 | 8841.24 | 201744.58 | 0.7 |

BERT after change (close to no change)

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert | 100 | 1000 | 0 | 0.46 | 219597 | 223332 | 225492 | 219655.097 | 0 | 6578.73 | 6826.95 | 6826.95 | 8781.67 | 8781.59 | 200650.3 | 0.45 |

Next steps in a future PR

  • PyTorch inference mode (see the sketch below)
  • More Netty optimizations
  • Intel/AMD-specific CPU optimizations
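For the first item, a rough sketch of what adopting inference mode in a handler's inference step might look like; this is illustrative and not part of this PR:

```python
# Sketch: wrap the forward pass in torch.inference_mode() (PyTorch >= 1.9)
# instead of torch.no_grad() to skip autograd bookkeeping entirely.
import torch


def run_inference(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    with torch.inference_mode():
        return model(batch)
```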

@msaroufim msaroufim changed the title from "changed base handler" to "CPU optimizations" Jun 23, 2021
@sagemaker-neo-ci-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: 0c6c06c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-win
  • Commit ID: 0c6c06c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 0c6c06c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@chauhang chauhang requested a review from maaquib June 23, 2021 21:28
@lxning lxning requested a review from chauhang July 6, 2021 20:59
@sagemaker-neo-ci-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-win
  • Commit ID: 986a148
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@msaroufim
Member Author

msaroufim commented Jul 14, 2021

The best benchmark results come when the number of workers is maximized and the CPU optimization is on @HamidShojanazeri @chauhang

The only caveat is that with a small number of workers, reducing the number of torch threads may not be ideal.

Benchmarked VGG11, VGG16 and BERT

EDIT: Talked with @HamidShojanazeri; given that for small worker counts with batched inference this change can degrade performance, we shouldn't go ahead and merge this PR. For expensive CPUs with large core counts, this is still the way to go.

So perhaps instead of merging this change into the base handler, we should extend the CPU optimization guide I've been working on and add it to the repo as documentation for users worried about scaling. The rationale is that if a change to the base handler causes a degradation, users would have to modify the base handler and reinstall TorchServe, whereas if they are simply made aware of this trick they can update their own handler without reinstalling TorchServe; a sketch of what that could look like in a custom handler is shown below.
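A minimal sketch of the documented trick applied in a user's own handler; the class and environment variable names here are illustrative, not from the PR:

```python
# custom_handler.py -- illustrative example for the proposed documentation,
# not code from this PR. The thread pinning lives in the user's handler,
# so no change to the installed TorchServe package is required.
import os

import torch
from ts.torch_handler.base_handler import BaseHandler


class ThreadPinnedHandler(BaseHandler):
    def initialize(self, context):
        # Make the thread count tunable; default to 1 for high-core-count CPUs.
        # TS_NUM_TORCH_THREADS is a hypothetical env var chosen for this example.
        torch.set_num_threads(int(os.environ.get("TS_NUM_TORCH_THREADS", "1")))
        super().initialize(context)
```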

An easier-to-read table with colors: https://docs.google.com/spreadsheets/d/1T3MeDBeGcwI4z226cPcMc_FzAzc757e0_tZAINX0-_4/edit?usp=sharing


| Optimized | Workers | Netty Threads | Batch Size | TS throughput | Model | Machine | Concurrency | Requests | TS failed requests | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean | Benchmark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yes | 16 | 32 | 1 | 30.41 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x | 100 | 1000 | 0 | 3229 | 3305 | 3588 | 3288.39 | 0 | 511.59 | 549.89 | 632.66 | 519.3 | 519.21 | 2604.65 | 1.47 | AB |
| Yes | 16 | 32 | 4 | 19.41 | https://bert-mar-file.s3.us-west-2.amazonaws.com/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 5902 | 6547 | 7298 | 5151.245 | 0 | 3216.64 | 3561.94 | 3695.63 | 3220.12 | 3220.01 | 1759.8 | 3.7 | AB |
| Yes | 16 | 32 | 1 | 18.74 | https://bert-mar-file.s3.us-west-2.amazonaws.com/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 5170 | 5568 | 5932 | 5334.965 | 0 | 831.94 | 874.61 | 1137.79 | 846.18 | 846.08 | 4241.48 | 1.19 | AB |
| Yes | 16 | 32 | 1 | 17.26 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 5696 | 5868 | 6335 | 5792.164 | 0 | 908.85 | 955.26 | 1039.3 | 916.6 | 916.51 | 4590.38 | 1.27 | AB |
| Yes | 16 | 32 | 8 | 16.73 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 5868 | 7194 | 7652 | 5977.258 | 0 | 5790.01 | 7009.76 | 7198.24 | 5612.37 | 5612.29 | 8.05 | 6.57 | AB |
| Yes | 4 | 32 | 1 | 14.61 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x | 100 | 1000 | 0 | 6819 | 6855 | 6880 | 6846.011 | 0 | 271.91 | 273.56 | 286.89 | 271.88 | 271.83 | 6220.74 | 0.56 | AB |
| No | 4 | 32 | 8 | 13.85 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 250 | 6961 | 8889 | 9079 | 7219.088 | 25 | 2298.35 | 2367.42 | 2411.98 | 2284.17 | 2283.99 | 4651.55 | 5.57 | AB |
| No | 16 | 32 | 8 | 12.21 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 236 | 8298 | 8813 | 9278 | 8187.905 | 23.6 | 8195.87 | 8719.44 | 8807.56 | 7796.13 | 7795.9 | 10.52 | 6.57 | AB |
| No | 4 | 32 | 8 | 11.74 | https://s3.us-west-2.amazonaws.com/ts0.4.1-marfiles/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 8352 | 9223 | 10520 | 8517.33 | 0 | 2702.04 | 2946.85 | 3088.48 | 2704.87 | 2704.79 | 5505.18 | 3.72 | AB |
| Yes | 4 | 32 | 8 | 10.94 | https://bert-mar-file.s3.us-west-2.amazonaws.com/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 8537 | 10706 | 11725 | 9136.63 | 0 | 2837.11 | 2975.97 | 3448.93 | 2863.46 | 2863.41 | 5795.61 | 2.24 | AB |
| Yes | 4 | 32 | 1 | 9.49 | https://bert-mar-file.s3.us-west-2.amazonaws.com/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 10434 | 10511 | 10767 | 10534.992 | 0 | 415.65 | 424.3 | 499.04 | 419.23 | 419.18 | 9580.12 | 0.49 | AB |
| No | 16 | 32 | 1 | 8.72 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x (16 cores) | 100 | 1000 | 0 | 11420 | 11890 | 12260 | 11473.403 | 0 | 1847.03 | 2135.59 | 2432.08 | 1825.47 | 1825.39 | 9140.71 | 1.64 | AB |
| No | 16 | 16 | 1 | 8.72 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x | 100 | 1000 | 0 | 11437 | 11820 | 12176 | 11468.814 | 0 | 1842.98 | 2126.04 | 2472.25 | 1823.29 | 1823.22 | 9134.62 | 1.55 | AB |
| No | 32 | 32 | 1 | 8.66 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x | 100 | 1000 | 0 | 11445 | 12260 | 13433 | 11546.973 | 0 | 3688.31 | 4301.94 | 4796.47 | 3664.89 | 3664.82 | 7448.45 | 2.49 | AB |
| Yes | 4 | 32 | 1 | 8.33 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 11958 | 12059 | 12156 | 11999.592 | 0 | 480.68 | 491.28 | 508.19 | 477.68 | 477.62 | 10914.73 | 0.57 | AB |
| No | 4 | 32 | 1 | 8.24 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 12095 | 12339 | 12492 | 12133.689 | 0 | 479.21 | 572.51 | 645.38 | 482.09 | 482 | 11053.48 | 1.06 | AB |
| Yes | 4 | 32 | 8 | 7.99 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 11782 | 15058 | 15587 | 12510.709 | 0 | 3912.86 | 3944.17 | 4014.83 | 3910.71 | 3910.65 | 7906.94 | 3.47 | AB |
| No | 16 | 32 | 1 | 5.65 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 17666 | 18233 | 18753 | 17703.121 | 0 | 2866.73 | 3210.72 | 3622.22 | 2820.71 | 2820.64 | 14123.29 | 1.65 | AB |
| No | 16 | 32 | 4 | 5.64 | https://s3.us-west-2.amazonaws.com/ts0.4.1-marfiles/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 17323 | 21612 | 24331 | 17727.34 | 0 | 11353.91 | 12479.03 | 13071.58 | 11224.56 | 11224.49 | 6079.55 | 3.68 | AB |
| No | 16 | 32 | 1 | 1.96 | https://s3.us-west-2.amazonaws.com/ts0.4.1-marfiles/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 50559 | 52899 | 55677 | 51081.161 | 0 | 8166.78 | 8642.7 | 9011.59 | 8148.57 | 8148.51 | 40784.37 | 1.48 | AB |
| No | 16 | 32 | 8 | | https://s3.us-west-2.amazonaws.com/ts0.4.1-marfiles/BERTSeqClassification.mar | AWS c4.4x | | | | | | | | | | | | | | | | FAILED |

@msaroufim
Member Author

Closing this PR for now - we can revisit if needed

@msaroufim msaroufim deleted the cpu_opt branch June 16, 2022 01:39