TorchServe Model Analyzer
Problem Statement
Preparing a model for inference is becoming an increasingly important part of shipping models to production. There's an overwhelming amount of choice: which hardware to use, which configs to set, which optimizations to apply, what the tradeoffs are, and how to benchmark things properly.
All of this has immediate impact because it helps anyone run PyTorch models more quickly and more cheaply.
This is an overwhelming number of problems to delegate to users in an unstructured way, so what does an end-to-end solution look like?
Subproblems
There are a few components to the solution:
How to benchmark models and get key metrics?
How to profile models using various tools to figure out bottlenecks in a structured way?
How to be aware of and explore various optimizations?
The good thing is we've already built most of these tools in isolation, but we haven't yet strung them together into a cohesive story.
Finally, we need to add support for benchmarking on specific Docker images so users can modularize their benchmark runs, letting anyone run them without a complex machine setup.
Solutions
Benchmarking models
The current torchserve benchmarking story relies on apache-bench, where users package up a model into a .mar file, set up a config.json, and then run python benchmark-ab.py --config config.json
This provides lots of useful information like throughput, latency at a given percentile, and the number of errors. There's also a lot of nuance that benchmarking tools need to account for, like process isolation and cold starts, which will throw off people building their own benchmarking tools from scratch.
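Nuances like cold starts are exactly why rolling your own harness is risky. As a toy illustration (plain Python, hypothetical, and much simpler than benchmark-ab.py), here's the warmup bookkeeping a latency harness needs:

```python
import time
import statistics

def benchmark(fn, warmup=10, iters=100):
    """Measure latency of a zero-arg callable in milliseconds.

    Warmup runs are discarded so cold-start effects (lazy init,
    caches, JIT compilation) don't skew the reported numbers.
    """
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    return {
        "mean_ms": statistics.mean(latencies),
        "p50_ms": latencies[len(latencies) // 2],
        "p99_ms": latencies[int(len(latencies) * 0.99) - 1],
        "throughput_rps": 1000.0 / statistics.mean(latencies),
    }
```

A real suite also needs process isolation, which this sketch ignores entirely.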
This approach is now being improved in #1442 by @lxning to:
Make configs YAML based with multiple options for a config to allow easy grid search
Allow JSON export for easy comparisons to past runs
Easy export to dashboarding solutions like CloudWatch or Prometheus
A major benefit of the approach in #1442 is that using a standard format makes it easy to compare, sort, and filter models, e.g.:
Show me only models that have lower than 50ms latency
Show me only models with throughput greater than 1000
Show me only models that consume less than 30% of GPU memory
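With a standard JSON export, those queries become one-liners. A sketch, assuming a flat per-run record schema (the field names below are made up for illustration, not #1442's actual format):

```python
# Hypothetical benchmark records, one dict per exported run.
runs = [
    {"model": "resnet152", "p99_latency_ms": 42.0, "throughput_rps": 1200, "gpu_mem_pct": 25},
    {"model": "bert-base", "p99_latency_ms": 87.0, "throughput_rps": 450,  "gpu_mem_pct": 60},
    {"model": "mobilenet", "p99_latency_ms": 9.0,  "throughput_rps": 5100, "gpu_mem_pct": 10},
]

def query(runs, predicate):
    """Return the names of models whose run satisfies the predicate."""
    return [r["model"] for r in runs if predicate(r)]

# "Show me only models that have lower than 50ms latency"
fast = query(runs, lambda r: r["p99_latency_ms"] < 50)
# "Show me only models with throughput greater than 1000"
high_tp = query(runs, lambda r: r["throughput_rps"] > 1000)
# "Show me only models that consume less than 30% of GPU memory"
low_mem = query(runs, lambda r: r["gpu_mem_pct"] < 30)
```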
The only thing #1442 is missing is letting anyone run comprehensive benchmarks on real infrastructure as well. There are two options here:
Add AWS credentials as an argument to the suite
Use a Github Action based workflow
AWS is convenient because it's the most flexible for setting up the environment, yet it won't work for community members who use another cloud. Making our work multi-cloud will also be very time consuming unless we move to something like Terraform templates.
GitHub Actions need work to set up custom runners that allow GPU profiling, BUT their big benefit is that artifacts are available directly in the GitHub Actions tab, so anyone can inspect them without needing permissions to a special S3 bucket. Also, because it's all on GitHub, community members who want to run their own benchmarks only need to fork the repo.
Profiling
We've recently added support in torchserve for the PyTorch profiler, gated behind a simple environment variable export ENABLE_TORCH_PROFILER=TRUE (see https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#profiling).
This provides some useful insight when it comes to debugging problems with the PyTorch model, but not so much problems with configuring torchserve. There is an extensive number of profilers that can run in a separate process without affecting performance, which we could either recommend users run or run out of the box gated behind another environment variable.
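For reference, here's what running the standalone PyTorch profiler looks like outside of torchserve — a minimal CPU-only sketch on a toy model:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 64)
x = torch.randn(32, 128)

# Profile a few inference passes on CPU; on a GPU box you would add
# ProfilerActivity.CUDA to the activities list.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        for _ in range(10):
            model(x)

# Aggregate per-operator stats, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

This surfaces per-operator hotspots in the model, but says nothing about serving-level bottlenecks like queueing or worker counts.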
Exploring various optimizations
Optimizations fall into a few very different categories.
Optimizations to the model
When optimizing a model there are a few commonly used tricks, ranging from quantization to pruning to distillation to simply using a smaller model.
We've attempted to unify many of these tools behind a single CLI interface called torchprep, a still very experimental tool that needs lots of work.
# quantize a cpu model with int8 on cpu and profile with a float tensor of shape [64,3,7,7]
torchprep quantize models/resnet152.pt int8 --input-shape 64,3,7,7
# profile a model for 100 iterations
torchprep profile models/resnet152.pt --iterations 100 --device cpu --input-shape 64,3,7,7
# set omp threads to 1 to optimize cpu inference
torchprep env --device cpu
# Prune 30% of model weights
torchprep prune models/resnet152.pt --prune-amount 0.3
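For reference, the model-level tricks the CLI wraps (int8 dynamic quantization, magnitude pruning) map to a few lines of plain PyTorch. A minimal sketch on a toy model, not a description of torchprep internals:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# int8 dynamic quantization of the Linear layers for CPU inference;
# returns a converted copy, leaving `model` untouched.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Prune 30% of the first Linear layer's weights by L1 magnitude
# (the equivalent of `torchprep prune ... --prune-amount 0.3`).
prune.l1_unstructured(model[0], name="weight", amount=0.3)
sparsity = float((model[0].weight == 0).sum()) / model[0].weight.numel()
```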
torchprep unfortunately has 3 weaknesses:
No good data format for multiple inputs and multiple dtypes
No good support for training-aware optimizations, including calibration
No support for exporting to optimized runtimes
Input data format
The input shape is used to generate a random tensor torch.randn(64,3,7,7), run it through the resnet152 model, and calculate the latency.
In this case resnet152 expects a single input with shape [64,3,7,7]; however, this doesn't work as well for something like BERT, which requires 2 inputs: the tokens and the masks.
The current data format also doesn't make it easy to deal with arbitrarily sized data like batches, which can range from 1 to n.
Instead, we could design a YAML-based data format that supports multiple inputs and dtypes, something like what @jamesr66a suggested.
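A sketch of what such a spec could look like; every field name below is a made-up assumption for illustration, not an existing torchprep format (the YAML file would be a straight transcription of this dict):

```python
# Hypothetical multi-input spec covering the BERT case: two named
# inputs, each with its own shape and dtype.
spec = {
    "inputs": [
        {"name": "tokens", "shape": [8, 128], "dtype": "int64"},
        {"name": "masks",  "shape": [8, 128], "dtype": "int64"},
    ]
}

def describe_inputs(spec):
    """Expand the spec into (name, shape, dtype) triples; a real
    implementation would call torch.randn / torch.randint per entry
    to synthesize the actual benchmark inputs."""
    return [(i["name"], tuple(i["shape"]), i["dtype"]) for i in spec["inputs"]]
```

Variable batch sizes could be handled by letting a shape dimension be a range rather than a fixed integer.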
Training aware optimizations
Training-aware optimizations generally preserve model quality better and are used in libraries like huggingface/optimum.
torchprep currently works only with saved model weights, but a natural extension would be letting users plug in their own training loop or data loader.
Runtime exports
A lot of torchserve users have been looking to export their models to an optimized runtime like TensorRT/IPEX/ORT for accelerated inference. All of these runtimes run inference within the context of a session; they don't work in an offline manner, and their optimizations aren't stored directly on a saved model.
Optimizations to the serving framework
Optimizations to the serving framework are even more opaque, but include notable knobs like num_workers, num_threads, number of models per GPU, queue_size, and batch_size.
Out of all of these configurations, only batch_size has a clear tradeoff:
A big batch size means high latency and high throughput, with diminishing returns
A small batch size means low latency and low throughput
For the others the tradeoff isn't so clear, and the expectation is to run a grid search, which depending on the model can take days of experiments and may not even lead to a conclusive answer. The goal should not be a comprehensive grid search, but just enough experiments to detect the performance tradeoffs.
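A grid search over these knobs is mechanically trivial; the cost is the number of runs. A sketch with a stubbed-out benchmark function (candidate values are illustrative, not recommended defaults):

```python
import itertools

# Candidate torchserve-level configs to sweep over.
grid = {
    "num_workers": [1, 2, 4],
    "batch_size": [1, 8, 32],
}

def grid_search(benchmark_fn, grid):
    """Run benchmark_fn once per config combination and rank the
    configs by the score it returns (e.g. throughput)."""
    results = []
    for combo in itertools.product(*grid.values()):
        config = dict(zip(grid.keys(), combo))
        results.append((config, benchmark_fn(config)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Even this tiny 3x3 grid is 9 full benchmark runs, which is why smarter search strategies matter.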
So there are a few options here:
Leveraging other libraries for optimizations like Launcher core pinning #1401 which takes care of pinning workers to different CPU cores so users don't have to experiment with it
Using a simple ranking model to decide on optimizations
Bayesian optimization like Ax but for inference
Scaling out experiments by launching various instances of torchserve concurrently and then collecting the results in a central place
Try out simple heuristics based on QPS or utilization to change torchserve-level configs like the number of workers, and see what happens. This could also ship by default behind torchserve --start --configure_worker.
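A toy version of such a heuristic, with entirely made-up thresholds, just to show the shape of the idea:

```python
def suggest_workers(current_workers, queue_depth, gpu_util_pct, max_workers=8):
    """Toy autoscaling heuristic: add a worker when requests are
    queueing, remove one when the device sits mostly idle.
    All thresholds here are illustrative assumptions."""
    if queue_depth > 10 and current_workers < max_workers:
        return current_workers + 1
    if gpu_util_pct < 20 and current_workers > 1:
        return current_workers - 1
    return current_workers
```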
Conclusion
Analyzing models is hard and building benchmark suites is hard, so it's worthwhile to create a streamlined experience for all of the above, making it easier for people to benchmark, profile, optimize, and analyze their models.
msaroufim changed the title from "[RFC]: TorchServe Model Analyzer - AutoML for Inference - Cost aware Deployment" to "[RFC]: TorchServe Model Analyzer" on Feb 23, 2022.
Are most of the parameters that users want to tune available from the RPC endpoint, or do you need to change the server config for each set? It might be feasible to bundle up the torchserve benchmark suite as a torchx component and run it via Ax for proper Bayesian HPO.
That sounds like it may require a lot of knobs to configure correctly so might be too hard to get started with for the average user
> Are most of the parameters that users want to tune available from the RPC endpoint or do you need to change the server config for each set?
You can change a lot of torchserve parameters dynamically using the management API; you can also swap in a new, more optimized model in the same running instance, so it should be possible to do quite a bit without ever stopping torchserve.
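For example, re-scaling workers on a running model is a single PUT against the management API (default port 8081); the helper below just builds the URL:

```python
from urllib.parse import urlencode

def scale_workers_url(model_name, min_worker, host="http://localhost:8081"):
    """Build the management-API URL for scaling a running model's
    workers; send it with requests.put(url) or curl -X PUT against a
    live torchserve instance."""
    return f"{host}/models/{model_name}?" + urlencode({"min_worker": min_worker})
```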
> That sounds like it may require a lot of knobs to configure correctly so might be too hard to get started with for the average user
Maybe, but I think we can do quite a bit to streamline comparisons.
cc: @chauhang @HamidShojanazeri @yqhu @mreso @lxning @nskool @maaquib @ashokei @d4l3k @gchanan