CPU optimizations #1138
Conversation
Ideal benchmarks are when the number of workers is maximized and CPU optimization is on @HamidShojanazeri @chauhang. The only caveat is that with a small number of workers, reducing the number of torch threads may not be ideal. Benchmarked VGG11, VGG16 and BERT.

EDIT: Talked with @HamidShojanazeri, and given that for small worker counts with batched inference this change can degrade performance, we shouldn't go ahead and merge this PR. For expensive CPUs with large core counts this is the way to go. So perhaps instead of merging this change into the base handler, what we should do is extend the CPU optimization guide I've been working on and add it as documentation to the repo for users worried about scaling. The rationale is that if a change to the base handler causes a degradation, users would have to change the base handler and reinstall torchserve, whereas if they are just made aware of this trick they can update their handler without reinstalling torchserve.

Easier to read table with colors: https://docs.google.com/spreadsheets/d/1T3MeDBeGcwI4z226cPcMc_FzAzc757e0_tZAINX0-_4/edit?usp=sharing
Closing this PR for now - we can revisit if needed.
Description
CPU-based models currently suffer from poor throughput if we don't set
torch.set_num_threads(1)
Tested on BERT mini, this change introduces a 100x throughput improvement on a 32-core GCP machine, on both bare metal and Docker. I still need to test this on a few more models to confirm that this pattern is reproducible.
This issue is not unique to torchserve but is a general suggestion when deploying PyTorch models: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html
Below is an example with a toy resnet model where this issue shows up, running on a CPU instance in Google Colab.
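As a rough sketch of that kind of experiment (this is not the original Colab notebook, and the setup here is assumed), the following compares single-process CPU throughput with the default intra-op thread count versus a single torch thread; in torchserve the effect is larger because multiple worker processes contend for the same cores:

```python
# Minimal sketch (assumed setup, not the original Colab notebook): compare
# CPU throughput of a toy ResNet with the default thread count vs. 1 thread.
import time
import torch
import torchvision.models as models

model = models.resnet18(pretrained=False).eval()
x = torch.randn(1, 3, 224, 224)

def throughput(n_iters=50):
    with torch.no_grad():
        start = time.time()
        for _ in range(n_iters):
            model(x)
    return n_iters / (time.time() - start)

print(f"default ({torch.get_num_threads()} threads): {throughput():.1f} it/s")
torch.set_num_threads(1)
print(f"1 thread: {throughput():.1f} it/s")
```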
So this PR introduces a simple change in the base handler to set
num_threads
to 1. The base handler is used in many places, so we need to track which handlers need to change and which can stay the same:
- Can stay the same: handlers that don't use `device` anywhere
- Need update: the remaining handlers
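For handlers that do need the change, a minimal sketch of applying it outside the base handler (the class name below is hypothetical, and this is not the exact diff in this PR) could look like:

```python
# Hedged sketch: apply the thread limit in a custom handler instead of the
# base handler, so users can adopt it without reinstalling torchserve.
import torch
from ts.torch_handler.base_handler import BaseHandler

class SingleThreadHandler(BaseHandler):  # hypothetical name
    def initialize(self, context):
        # One intra-op thread per worker process avoids oversubscribing cores
        # when many torchserve workers share the same CPU.
        torch.set_num_threads(1)
        super().initialize(context)
```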
Fixes #1128
Feature/Issue validation/testing
Take a look at the attached mar file and run it to see the poor performance. If you then unzip it, uncomment
torch.set_num_threads(1)
in the handler, and re-archive it, you will see the improvements.
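For reference, launching torchserve with the re-archived model might look like the following (the model-store path and model name are placeholders, not taken from the attached mar file):

```bash
torchserve --start --ncs --model-store model_store --models bert=bert.mar
```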
After launching torchserve, it's time to start the benchmark
To test out this improvement we used jmeter with the JMX file from here: https://gist.github.com/msaroufim/1451b2623919e1b0a13c57d560867c18
Then, to run the test:
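Running the JMX plan in non-GUI mode could look like this (file names are placeholders):

```bash
jmeter -n -t bert_benchmark.jmx -l results.csv -e -o report/
```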
BERT Results
| Setting | Bare metal (req/s) | Docker (req/s) |
| --- | --- | --- |
| torch.set_num_threads(16) (default) | 4.3 | 4.1 |
| torch.set_num_threads(1) (this change) | 535.0 | 524 |
Resnet 18
We also tested out this change on resnet 18 and observed the following improvements - not as drastic but still substantial.
Faster CPU
Faster CPUs make a substantial difference.
Batch inference
On a different AWS C5n machine with 36 cores, I repeated a similar experiment to measure batch size differences. The type of hardware makes a huge difference for CPU inference, so the recommendation should be to use the latest generation.
For batch size 256 we saw a roughly 20% improvement in throughput over the default batch size of 1, so our recommendation for CPU inference should be to use batching. Batch sizes 512 and 1024 caused the model to hang.
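For context, a larger batch size can be requested when registering the model through the torchserve management API; the example below is a hedged sketch with a placeholder model name and illustrative values:

```bash
curl -X POST "http://localhost:8081/models?url=bert.mar&initial_workers=1&batch_size=256&max_batch_delay=100"
```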
Setting
torch.set_num_threads(32)
will saturate all cores but will not improve throughput. So really, all that increasing the thread count does is increase the number of context switches, which you can't identify with the current PyTorch profiler because it doesn't profile from within an OpenMP thread.
Grid Search
Grid search shows that nothing matters for increasing performance except the number of torch threads; everything else barely impacts it.
Repro instructions: https://gist.github.com/msaroufim/73cf7f642e65a27f1ecb3ef5e832abd9
Results for the default thread count and for 1 thread, with all other parameters varied (num_workers, num_netty_threads, queue size, etc.); a config sketch follows the links below:
Default threads: https://gist.github.com/msaroufim/d8e5090cc07ae4e6f833a0dc330950ed
1 thread: https://gist.github.com/msaroufim/cdf32020c1ba24e8bb32e990588f1994
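For reference, these are the kind of config.properties knobs varied above; the values shown are illustrative, not the exact grid:

```properties
default_workers_per_model=4
number_of_netty_threads=4
job_queue_size=100
```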
Automated benchmark report
Artifacts are here, but the test has been flaky and randomly hangs with no error message: https://s3.console.aws.amazon.com/s3/buckets/torchserve-benchmark-fb?region=us-west-2&tab=objects
MNIST before change
MNIST after change (minimal improvement)
VGG 11 before change
VGG 11 after change (double throughput)
BERT before change (no improvement measured using Apache Bench, but improvement seen using jmeter and the custom grid search)
BERT after change (close to no change)
Next steps in future PR