CPU optimizations #1138

Closed
wants to merge 15 commits into from

Conversation


@msaroufim msaroufim commented Jun 23, 2021

Description

CPU-based models currently suffer from very poor throughput unless we set torch.set_num_threads(1)

Tested on BERT mini, this change gives roughly a 100x throughput improvement on a 32-core GCP machine, on both bare metal and Docker. I still need to test a few more models to confirm that this pattern is reproducible.

This issue is not unique to TorchServe; it is a general recommendation when deploying PyTorch models: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html

Below is an example of a toy ResNet model that shows the issue when running on a CPU instance in Google Colab.

(Screenshot: toy ResNet throughput comparison in Colab)

So this PR introduces a simple change in the base handler that sets the number of torch threads to 1 (a sketch of the change is shown below). Since the base handler is used in many places, here is a breakdown of which handlers need to change and which can stay the same.
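For reference, a minimal sketch of roughly what the change amounts to in the base handler's initialize; this is illustrative only, not the exact diff, and the class below is a trimmed-down stand-in for ts.torch_handler.base_handler.BaseHandler:

```python
# Sketch of the base handler change -- illustrative, not the exact diff in this PR.
import torch


class BaseHandler:
    """Trimmed-down stand-in for ts.torch_handler.base_handler.BaseHandler."""

    def initialize(self, context):
        # Limit intra-op parallelism so each worker process uses a single OpenMP
        # thread; multiple TorchServe workers then provide request-level parallelism.
        torch.set_num_threads(1)

        properties = context.system_properties
        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )
        # ... existing model-loading logic continues unchanged ...
```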

Can stay the same

  • The default supported handlers require no change since they don't override the device anywhere
  • WaveGlow is GPU only
  • Text classification uses the default handler
  • Object detection uses the default handler
  • Image segmentation uses the default handler
  • All image classifiers
  • Workflow examples

Need update

Fixes #1128

Type of change

Please delete options that are not relevant.

  • [x] Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

Take a look at the attached mar file, run it, and observe the poor performance. If you then unzip it, uncomment torch.set_num_threads(1) in the handler, and re-archive it, you will see the improvement:

torch-model-archiver --model-name ptclassifier --version 1.0 --serialized-file traced_pt_classifer.pt --handler ./TransformerSeqClassificationHandler.py --extra-files "./config.json,./labels.json,./setup_config.json,./requirements.txt,./tokenizer.json,./tokenizer_config.json,./vocab.txt" --model-file model_ph.py

And then

mkdir model_store
mv ptclassifier.mar model_store

torchserve --ncs --model-store model_store
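Since --models isn't passed at startup here, the model can be registered afterwards through the management API. A minimal sketch, assuming default ports and the ptclassifier.mar produced above (uses the requests package, pip install requests if needed):

```python
# Register the archived model and spin up one worker via the management API.
# Assumes TorchServe is running locally with default ports and that
# ptclassifier.mar is in the model store.
import requests

resp = requests.post(
    "http://localhost:8081/models",
    params={"url": "ptclassifier.mar", "initial_workers": 1, "synchronous": "true"},
)
print(resp.status_code, resp.text)
```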

After launching torchserve, it's time to start the benchmark

To test this improvement we used JMeter with the JMX file from here: https://gist.github.com/msaroufim/1451b2623919e1b0a13c57d560867c18

Then to run the test

wget https://downloads.apache.org//jmeter/binaries/apache-jmeter-5.4.1.zip 
unzip apache-jmeter-5.4.1.zip
cd apache-jmeter-5.4.1/bin
./jmeter -n -t /home/marksaroufim/test.jmx -l testresults.jtl
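Before kicking off the full JMeter run, it can help to sanity-check the endpoint with a single request. A sketch, assuming the model is registered as ptclassifier and its handler accepts raw text in the request body:

```python
# Send one inference request to verify the endpoint responds before benchmarking.
# The exact payload format depends on the handler; raw UTF-8 text is assumed here.
import requests

resp = requests.post(
    "http://localhost:8080/predictions/ptclassifier",
    data="Bloomberg has decided to publish a new report on the global economy.".encode("utf-8"),
)
print(resp.status_code, resp.text)
```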

BERT Results

torch.set_num_threads(16) default

Bare metal: 4.3/s
Docker: 4.1/s

torch.set_num_threads(1) (this change)

Bare metal: 535.0/s
Docker: 524.0/s

| Machine | num_threads | Throughput | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) | Error Rate | Netty Threads |
|---|---|---|---|---|---|---|---|
| GCP N1 32 core | 16 | 4.3/s | 6990 | 3601 | 9750 | 0 | 30 |
| Docker | 16 | 4.1/s | 6939 | 4049 | 9390 | 0 | 30 |
| GCP N1 32 core | 1 | 535.0/s | 56 | 12 | 799 | 0 | 30 |
| Docker | 1 | 524.0/s | 57 | 11 | 635 | 0 | 30 |

ResNet-18

We also tested this change on ResNet-18 and observed the following improvements; not as drastic, but still substantial.

| Model | Num_torch_threads | Throughput | Netty Threads | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) |
|---|---|---|---|---|---|---|
| https://torchserve.pytorch.org/mar_files/resnet-18.mar | 1 | 1161.1/s | 30 | 25 | 1 | 179 |
| https://torchserve.pytorch.org/mar_files/resnet-18.mar | 18 | 585/s | 30 | 50 | 1 | 98 |

Faster CPU

Faster CPUs make a substantial difference:

| Machine | num_threads | Throughput | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) | Error Rate | Netty Threads |
|---|---|---|---|---|---|---|---|
| GCP N1 32 core | 16 | 4.3/s | 6990 | 3601 | 9750 | 0 | 30 |
| Docker | 16 | 4.1/s | 6939 | 4049 | 9390 | 0 | 30 |
| GCP N1 32 core | 1 | 535.0/s | 56 | 12 | 799 | 0 | 30 |
| Docker | 1 | 524.0/s | 57 | 11 | 635 | 0 | 30 |
| AWS C5n 36 core | 1 | 2160.6/s | 13 | 11 | 290 | 0 | 30 |

Batch inference

On a different AWS C5n machine with 36 cores I repeated a similar experiment to measure the effect of batch size. The type of hardware makes a huge difference for CPU inference, so the recommendation should be to use the latest generation.

For batch size 256 we saw a roughly 20% improvement in throughput over the default batch size 1, so our recommendation for CPU inference should be to use batching. Batch sizes 512 and 1024 caused the model to hang.

| Batch size | Throughput (r/s) | Avg Latency (ms) | Min Latency (ms) | Max Latency (ms) | Num Netty Threads | Improvement over batch size 1 |
|---|---|---|---|---|---|---|
| 1 | 2160.6 | 13 | 0 | 290 | 30 | 0% |
| 4 | 2117.9 | 13 | 0 | 262 | 30 | -1.98% |
| 128 | 2152.7 | 13 | 0 | 240 | 30 | -0.37% |
| 256 | 2555.1 | 11 | 0 | 194 | 30 | 18.26% |
The config.properties used:

```
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
job_queue_size=1000
number_of_netty_threads=36
number_of_client_threads=36
model_store=/home/model-server/model-store
models={\
  "ptclassifier": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "ptclassifier.mar",\
        "batchSize": 512,\
        "maxBatchDelay": 1000,\
        "responseTimeout": 1200\
    }\
  }\
}
```
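As an alternative to hard-coding batchSize in config.properties, the batch parameters can also be passed when registering the model through the management API. A sketch under the same assumptions as above (local TorchServe, default ports, ptclassifier.mar in the model store):

```python
# Register with batching enabled instead of configuring it in config.properties.
# batch_size / max_batch_delay mirror the batchSize / maxBatchDelay keys above.
import requests

resp = requests.post(
    "http://localhost:8081/models",
    params={
        "url": "ptclassifier.mar",
        "initial_workers": 1,
        "batch_size": 256,        # best-performing batch size in the table above
        "max_batch_delay": 1000,  # ms to wait while filling a batch
        "synchronous": "true",
    },
)
print(resp.status_code, resp.text)
```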
  • UT/IT execution results

  • Logs

Checklist:

  • [x] Have you added tests that prove your fix is effective or that this feature works?
  • [x] New and existing unit tests pass locally with these changes?

Setting torch.set_num_threads(32) will saturate all cores but will not improve throughput.

So what increasing threads really does is just increase the number of context switches, which you can't identify with the current PyTorch profiler because it doesn't profile from within an OpenMP thread:

```
op start
    profiler record start time
        parallel_for
            OpenMP: multiple threads running concurrently; each thread runs the op logic on a single input partition
    profiler record end time
op end
```

(Screenshot: PyTorch profiler output)
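A quick way to see the interplay is to inspect the thread settings directly. A minimal sketch; setting OMP_NUM_THREADS=1 before the process starts is an alternative route to pinning the intra-op pool:

```python
# Inspect and pin PyTorch's CPU thread pools.
import torch

print("intra-op threads:", torch.get_num_threads())          # OpenMP pool used by parallel_for
print("inter-op threads:", torch.get_num_interop_threads())  # pool for running independent ops

torch.set_num_threads(1)  # the change benchmarked in this PR
print("intra-op threads after pinning:", torch.get_num_threads())
```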

Grid Search

Grid search shows that nothing except the number of torch threads meaningfully increases performance; every other parameter barely moves the needle.

Repro instructions: https://gist.github.com/msaroufim/73cf7f642e65a27f1ecb3ef5e832abd9
Results for the default thread count and for 1 thread, with all other parameters swept (num_workers, number_of_netty_threads, queue size, etc.):
Default threads: https://gist.github.com/msaroufim/d8e5090cc07ae4e6f833a0dc330950ed
1 thread: https://gist.github.com/msaroufim/cdf32020c1ba24e8bb32e990588f1994

Automated benchmark report

Artifacts are here, but the test has been flaky and randomly hangs with no error message: https://s3.console.aws.amazon.com/s3/buckets/torchserve-benchmark-fb?region=us-west-2&tab=objects

MNIST before change

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | mnist | 10 | 1000 | 0 | 768.34 | 13 | 19 | 87 | 13.015 | 0 | 4.11 | 10.02 | 11.47 | 6.78 | 6.73 | 2.35 | 1.6 |

MNIST after change (minimal improvement)

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | mnist | 10 | 1000 | 0 | 777.51 | 13 | 19 | 66 | 12.862 | 0 | 5.46 | 10.27 | 11.27 | 7.13 | 7.08 | 2.22 | 1.72 |

VGG 11 before change

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | vgg11 | 100 | 1000 | 0 | 6.84 | 14528 | 15308 | 15768 | 14623.285 | 0 | 1739.53 | 1809.89 | 1809.89 | 2328.74 | 2328.67 | 11660.6 | 5.84 |

VGG 11 after change (double throughput)

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | vgg11 | 100 | 1000 | 0 | 9.63 | 10191 | 11603 | 13215 | 10388.174 | 0 | 2764.69 | 3004.37 | 3004.37 | 3301.74 | 3301.68 | 6736.52 | 7.51 |

BERT before change (no improvement measured with Apache Bench, but improvement seen with JMeter and the custom grid search)

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert | 100 | 1000 | 0 | 0.45 | 221060 | 224425 | 227060 | 221136.921 | 0 | 6833.5 | 7046.13 | 7046.13 | 8841.3 | 8841.24 | 201744.58 | 0.7 |

BERT after change (close to no change)

| Benchmark | Model | Concurrency | Requests | TS failed requests | TS throughput | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AB | bert | 100 | 1000 | 0 | 0.46 | 219597 | 223332 | 225492 | 219655.097 | 0 | 6578.73 | 6826.95 | 6826.95 | 8781.67 | 8781.59 | 200650.3 | 0.45 |

Next steps in a future PR

  • PyTorch inference mode (see the sketch below)
  • More Netty optimizations
  • Intel/AMD-specific CPU optimizations
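For the first item, a rough sketch of what adopting inference mode in a handler's inference step might look like; this is illustrative and not part of this PR:

```python
# Sketch: wrap the forward pass in torch.inference_mode() (PyTorch >= 1.9)
# instead of torch.no_grad() to skip autograd bookkeeping entirely.
import torch


def run_inference(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    with torch.inference_mode():
        return model(batch)
```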

@msaroufim msaroufim changed the title from "changed base handler" to "CPU optimizations" Jun 23, 2021
@sagemaker-neo-ci-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-cpu
  • Commit ID: 0c6c06c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-win
  • Commit ID: 0c6c06c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-neo-ci-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-gpu
  • Commit ID: 0c6c06c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@chauhang chauhang requested a review from maaquib June 23, 2021 21:28
@lxning lxning requested a review from chauhang July 6, 2021 20:59
@sagemaker-neo-ci-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: torch-serve-build-win
  • Commit ID: 986a148
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@msaroufim
Member Author

msaroufim commented Jul 14, 2021

The best benchmark results come when the number of workers is maximized and the CPU optimization is on @HamidShojanazeri @chauhang

The only caveat is that with a small number of workers, reducing the number of torch threads may not be ideal.

Benchmarked VGG11, VGG16 and BERT

EDIT: Talked with @HamidShojanazeri; given that for small worker counts with batched inference this change can degrade performance, we shouldn't go ahead and merge this PR. For expensive CPUs with large core counts, this is still the way to go.

So perhaps instead of merging this change into the base handler, we should extend the CPU optimization guide I've been working on and add it to the repo as documentation for users worried about scaling. The rationale is that if a change to the base handler causes a degradation, users would have to modify the base handler and reinstall TorchServe, whereas if they are simply made aware of this trick they can update their own handler without reinstalling TorchServe; a sketch of what that could look like in a custom handler is shown below.
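A minimal sketch of the documented trick applied in a user's own handler; the class and environment variable names here are illustrative, not from the PR:

```python
# custom_handler.py -- illustrative example for the proposed documentation,
# not code from this PR. The thread pinning lives in the user's handler,
# so no change to the installed TorchServe package is required.
import os

import torch
from ts.torch_handler.base_handler import BaseHandler


class ThreadPinnedHandler(BaseHandler):
    def initialize(self, context):
        # Make the thread count tunable; default to 1 for high-core-count CPUs.
        # TS_NUM_TORCH_THREADS is a hypothetical env var chosen for this example.
        torch.set_num_threads(int(os.environ.get("TS_NUM_TORCH_THREADS", "1")))
        super().initialize(context)
```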

An easier-to-read table with colors: https://docs.google.com/spreadsheets/d/1T3MeDBeGcwI4z226cPcMc_FzAzc757e0_tZAINX0-_4/edit?usp=sharing


| Optimized | Workers | Netty Threads | Batch Size | TS throughput | Model | Machine | Concurrency | Requests | TS failed requests | TS latency P50 | TS latency P90 | TS latency P99 | TS latency mean | TS error rate | Model_p50 | Model_p90 | Model_p99 | predict_mean | handler_time_mean | waiting_time_mean | worker_thread_mean | Benchmark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yes | 16 | 32 | 1 | 30.41 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x | 100 | 1000 | 0 | 3229 | 3305 | 3588 | 3288.39 | 0 | 511.59 | 549.89 | 632.66 | 519.3 | 519.21 | 2604.65 | 1.47 | AB |
| Yes | 16 | 32 | 4 | 19.41 | https://bert-mar-file.s3.us-west-2.amazonaws.com/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 5902 | 6547 | 7298 | 5151.245 | 0 | 3216.64 | 3561.94 | 3695.63 | 3220.12 | 3220.01 | 1759.8 | 3.7 | AB |
| Yes | 16 | 32 | 1 | 18.74 | https://bert-mar-file.s3.us-west-2.amazonaws.com/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 5170 | 5568 | 5932 | 5334.965 | 0 | 831.94 | 874.61 | 1137.79 | 846.18 | 846.08 | 4241.48 | 1.19 | AB |
| Yes | 16 | 32 | 1 | 17.26 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 5696 | 5868 | 6335 | 5792.164 | 0 | 908.85 | 955.26 | 1039.3 | 916.6 | 916.51 | 4590.38 | 1.27 | AB |
| Yes | 16 | 32 | 8 | 16.73 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 5868 | 7194 | 7652 | 5977.258 | 0 | 5790.01 | 7009.76 | 7198.24 | 5612.37 | 5612.29 | 8.05 | 6.57 | AB |
| Yes | 4 | 32 | 1 | 14.61 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x | 100 | 1000 | 0 | 6819 | 6855 | 6880 | 6846.011 | 0 | 271.91 | 273.56 | 286.89 | 271.88 | 271.83 | 6220.74 | 0.56 | AB |
| No | 4 | 32 | 8 | 13.85 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 250 | 6961 | 8889 | 9079 | 7219.088 | 25 | 2298.35 | 2367.42 | 2411.98 | 2284.17 | 2283.99 | 4651.55 | 5.57 | AB |
| No | 16 | 32 | 8 | 12.21 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 236 | 8298 | 8813 | 9278 | 8187.905 | 23.6 | 8195.87 | 8719.44 | 8807.56 | 7796.13 | 7795.9 | 10.52 | 6.57 | AB |
| No | 4 | 32 | 8 | 11.74 | https://s3.us-west-2.amazonaws.com/ts0.4.1-marfiles/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 8352 | 9223 | 10520 | 8517.33 | 0 | 2702.04 | 2946.85 | 3088.48 | 2704.87 | 2704.79 | 5505.18 | 3.72 | AB |
| Yes | 4 | 32 | 8 | 10.94 | https://bert-mar-file.s3.us-west-2.amazonaws.com/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 8537 | 10706 | 11725 | 9136.63 | 0 | 2837.11 | 2975.97 | 3448.93 | 2863.46 | 2863.41 | 5795.61 | 2.24 | AB |
| Yes | 4 | 32 | 1 | 9.49 | https://bert-mar-file.s3.us-west-2.amazonaws.com/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 10434 | 10511 | 10767 | 10534.992 | 0 | 415.65 | 424.3 | 499.04 | 419.23 | 419.18 | 9580.12 | 0.49 | AB |
| No | 16 | 32 | 1 | 8.72 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x (16 cores) | 100 | 1000 | 0 | 11420 | 11890 | 12260 | 11473.403 | 0 | 1847.03 | 2135.59 | 2432.08 | 1825.47 | 1825.39 | 9140.71 | 1.64 | AB |
| No | 16 | 16 | 1 | 8.72 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x | 100 | 1000 | 0 | 11437 | 11820 | 12176 | 11468.814 | 0 | 1842.98 | 2126.04 | 2472.25 | 1823.29 | 1823.22 | 9134.62 | 1.55 | AB |
| No | 32 | 32 | 1 | 8.66 | https://torchserve.pytorch.org/mar_files/vgg11.mar | AWS c4.4x | 100 | 1000 | 0 | 11445 | 12260 | 13433 | 11546.973 | 0 | 3688.31 | 4301.94 | 4796.47 | 3664.89 | 3664.82 | 7448.45 | 2.49 | AB |
| Yes | 4 | 32 | 1 | 8.33 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 11958 | 12059 | 12156 | 11999.592 | 0 | 480.68 | 491.28 | 508.19 | 477.68 | 477.62 | 10914.73 | 0.57 | AB |
| No | 4 | 32 | 1 | 8.24 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 12095 | 12339 | 12492 | 12133.689 | 0 | 479.21 | 572.51 | 645.38 | 482.09 | 482 | 11053.48 | 1.06 | AB |
| Yes | 4 | 32 | 8 | 7.99 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 11782 | 15058 | 15587 | 12510.709 | 0 | 3912.86 | 3944.17 | 4014.83 | 3910.71 | 3910.65 | 7906.94 | 3.47 | AB |
| No | 16 | 32 | 1 | 5.65 | https://torchserve.pytorch.org/mar_files/vgg16.mar | AWS c4.4x | 100 | 1000 | 0 | 17666 | 18233 | 18753 | 17703.121 | 0 | 2866.73 | 3210.72 | 3622.22 | 2820.71 | 2820.64 | 14123.29 | 1.65 | AB |
| No | 16 | 32 | 4 | 5.64 | https://s3.us-west-2.amazonaws.com/ts0.4.1-marfiles/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 17323 | 21612 | 24331 | 17727.34 | 0 | 11353.91 | 12479.03 | 13071.58 | 11224.56 | 11224.49 | 6079.55 | 3.68 | AB |
| No | 16 | 32 | 1 | 1.96 | https://s3.us-west-2.amazonaws.com/ts0.4.1-marfiles/BERTSeqClassification.mar | AWS c4.4x | 100 | 1000 | 0 | 50559 | 52899 | 55677 | 51081.161 | 0 | 8166.78 | 8642.7 | 9011.59 | 8148.57 | 8148.51 | 40784.37 | 1.48 | AB |
| No | 16 | 32 | 8 | | https://s3.us-west-2.amazonaws.com/ts0.4.1-marfiles/BERTSeqClassification.mar | AWS c4.4x | | | | | | | | | | | | | | | | FAILED |

@msaroufim
Member Author

Closing this PR for now - we can revisit if needed

@msaroufim msaroufim deleted the cpu_opt branch June 16, 2022 01:39