[RFC]: TorchServe Model Analyzer #1457

Closed · 7 of 8 tasks
msaroufim opened this issue Feb 22, 2022 · 3 comments
Labels: enhancement (New feature or request)

msaroufim commented Feb 22, 2022

TorchServe Model Analyzer

Tasks: Milestone 1 · Milestone 2

Problem Statement

Preparing a model for inference is becoming an increasingly important part of shipping models to production. There's an overwhelming amount of choice: which hardware to use, which configs to set, which optimizations to apply, what the tradeoffs are, and how to benchmark everything properly.

All of this has immediate impact because it helps anyone run PyTorch models more quickly and more cheaply.

This is too much to delegate to users in an unstructured way, so what does an end-to-end solution look like?

Subproblems

There are a few components to the solution:

  1. How to benchmark models and get key metrics?
  2. How to profile models using various tools to figure out bottlenecks in a structured way?
  3. How to be aware of and explore various optimizations?

The good news is we've already built most of these tools in isolation, but we haven't yet strung them together into a cohesive story.

Finally, we need to add support for benchmarking on specific Docker images so users can modularize their benchmark runs and anyone can reproduce them without a complex machine setup.

Solutions

Benchmarking models

The current TorchServe benchmarking story relies on apache-bench: users package a model into a .mar file, set up a config.json and then run:

{
  "url":"https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar",
  "requests": 1000,
  "concurrency": 10,
  "input": "../examples/image_classifier/kitten.jpg",
  "exec_env": "docker",
  "gpus": "2"
}

python benchmark-ab.py --config config.json

This provides lots of useful information like throughput, latency at various percentiles and the number of errors. There's also a lot of nuance that benchmarking tools need to account for, like process isolation and cold starts, which will trip up people building their own benchmarking tools from scratch.

This approach is now being improved in #1442 by @lxning to:

  1. Make configs YAML based, with multiple options per field to allow an easy grid search (see the sketch after this list)
  2. Allow JSON export for easy comparison to past runs
  3. Easy export to dashboarding solutions like CloudWatch or Prometheus
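
A minimal sketch of the grid-search idea from item 1, assuming a config where list-valued fields are the ones to search over (the field names and the expansion logic are illustrative, not the actual #1442 implementation):

# Hypothetical sketch: a config where some fields hold lists of candidate values
# gets expanded into one concrete benchmark config per combination.
import itertools

config = {
    "url": "https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar",
    "requests": 1000,
    "concurrency": [10, 50, 100],   # list-valued fields are searched over
    "batch_size": [1, 8, 32],
    "exec_env": "docker",
}

def expand(config):
    keys = list(config)
    values = [v if isinstance(v, list) else [v] for v in config.values()]
    for combo in itertools.product(*values):
        yield dict(zip(keys, combo))

for run_config in expand(config):
    print(run_config)   # each of these would be passed to one benchmark run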

A major benefit of the approach in #1442 is that a standard output format makes it easy to compare, sort and filter models, for example (sketched below):

  1. Show me only models that have lower than 50ms latency
  2. Show me only models with throughput greater than 1000
  3. Show me only models that consume less than 30% of GPU memory
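
As a sketch of what that filtering could look like, assuming a hypothetical JSON export with one record per run and fields like p99_latency_ms, throughput_rps and gpu_memory_pct (not the actual schema):

# Hypothetical sketch: filter benchmark runs exported as JSON by #1442-style tooling.
# Field names (p99_latency_ms, throughput_rps, gpu_memory_pct) are assumptions.
import json

with open("benchmark_results.json") as f:
    runs = json.load(f)  # expected: a list of dicts, one per model/config run

low_latency = [r for r in runs if r["p99_latency_ms"] < 50]
high_throughput = [r for r in runs if r["throughput_rps"] > 1000]
gpu_friendly = [r for r in runs if r["gpu_memory_pct"] < 30]

for r in sorted(low_latency, key=lambda r: r["p99_latency_ms"]):
    print(r["model"], r["p99_latency_ms"], r["throughput_rps"])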

The only thing #1442 is missing is a way for anyone to run comprehensive benchmarks on real infrastructure. There are two options here:

  1. Add AWS credentials as an argument to the suite
  2. Use a Github Action based workflow

AWS is convenient because it's the most flexible for setting up environments, yet it won't work for community members using another cloud. Making our work multi-cloud would also be very time consuming unless we move to something like Terraform templates.

GitHub Actions needs work to set up custom runners for GPU profiling, but its big benefit is that artifacts can be made available directly in the Actions tab, so anyone can inspect them without needing permissions to a special S3 bucket. And because it's all on GitHub, community members who want to run their own benchmarks only need to fork the repo.

Profiling

We've recently added support in TorchServe for the PyTorch profiler, gated behind a simple environment variable (see https://github.com/pytorch/serve/blob/master/docs/performance_guide.md#profiling):

export ENABLE_TORCH_PROFILER=TRUE

This provides some useful insight when debugging problems with the PyTorch model, but not so much for problems with configuring TorchServe. There are plenty of profilers that can run in a separate process without affecting performance, which we could either recommend to users or run out of the box gated behind some other environment variable.
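
For reference, a standalone sketch of the kind of per-operator breakdown the PyTorch profiler produces; the ENABLE_TORCH_PROFILER integration gives similar output from inside the handler:

# Standalone sketch of a per-operator breakdown with the PyTorch profiler;
# the model and input shape are illustrative.
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

model = models.resnet18().eval()
x = torch.randn(1, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))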

Options for macro profilers

  1. Omni-perf: @yqhu has his own profiling tool which is an aggregator that he's used to great effect: https://github.com/yqhu/omni_perf
  2. Scalene: https://github.com/plasma-umass/scalene is a low overhead tool for CPU, GPU and memory profiling
  3. Others: there is no shortage of profiling tools; see https://github.com/msaroufim/awesome-profiling

Exploring various optimizations

Optimizations fall into a few very different categories

Optimizations to the model

When optimizing a model there are a few commonly used tricks, from quantization to pruning to distillation to simply using a smaller model.

We've attempted to unify many of these tools behind a single CLI called torchprep, a still very much experimental tool which needs lots of work.

# Quantize a model to int8 on CPU and profile it with a float tensor of shape [64,3,7,7]
torchprep quantize models/resnet152.pt int8 --input-shape 64,3,7,7

# Profile a model for 100 iterations
torchprep profile models/resnet152.pt --iterations 100 --device cpu --input-shape 64,3,7,7

# Set OMP threads to 1 to optimize CPU inference
torchprep env --device cpu

# Prune 30% of model weights
torchprep prune models/resnet152.pt --prune-amount 0.3

torchprep unfortunately has 3 weaknesses:

  1. No good data format for multiple inputs and multiple dtypes
  2. No good support for training-aware optimizations, including calibration
  3. Can't deal with runtime exports

Input data format

An example of how to use torchprep:

torchprep quantize models/resnet152.pt int8 --input-shape 64,3,7,7

The input shape is used to generate a random tensor torch.randn(64,3,7,7), run it through the resnet152 model and measure the latency.

In this case resnet152 expects a single input with shape [64,3,7,7]; however, this doesn't work as well for something like BERT, which requires 2 inputs: the tokens and the attention masks.

The current data format also doesn't make it easy to deal with arbitrarily sized data, like batch dimensions that can range from 1 to n.

Instead, we could design a YAML-based data format that supports multiple inputs and dtypes (cc @jamesr66a), something like:

input1:
  size: 64
  dtype: float16
input2:
  size: -1   # -1 means arbitrary
  dtype: longint
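
A minimal sketch of how torchprep could consume such a spec, assuming PyYAML and an illustrative mapping from the dtype names above to torch dtypes (none of this is final API):

# Hypothetical sketch: turn a YAML input spec into random tensors for profiling.
# The schema (size / dtype per named input) follows the example above; names are illustrative.
import yaml
import torch

DTYPES = {"float16": torch.float16, "float32": torch.float32, "longint": torch.int64}

def make_inputs(spec_path, arbitrary_dim=8):
    with open(spec_path) as f:
        spec = yaml.safe_load(f)
    inputs = {}
    for name, cfg in spec.items():
        size = cfg["size"]
        if isinstance(size, int):
            size = [size]
        # -1 means "arbitrary", so substitute a concrete value for benchmarking
        size = [arbitrary_dim if d == -1 else d for d in size]
        dtype = DTYPES[cfg["dtype"]]
        if dtype.is_floating_point:
            inputs[name] = torch.randn(*size, dtype=dtype)
        else:
            inputs[name] = torch.randint(0, 2, size, dtype=dtype)  # e.g. token ids or masks
    return inputs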

Training aware optimizations

Training-aware optimizations generally preserve model accuracy better and are used in libraries like huggingface/optimum.

torchprep currently works only with saved model weights, but a natural extension would be letting users plug in their own training loop or data loader.
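
As a sketch of what a user-supplied data loader would enable, here is calibration-based static quantization using PyTorch's existing eager-mode torch.quantization APIs (the model and loader are placeholders, and real models also need QuantStub/DeQuantStub placement, omitted here):

# Sketch: calibration-based static quantization, which needs a user-supplied data loader.
# torchprep could accept the model and calibration_loader as arguments.
import torch
import torch.quantization as tq

def quantize_with_calibration(model, calibration_loader):
    model.eval()
    model.qconfig = tq.get_default_qconfig("fbgemm")
    tq.prepare(model, inplace=True)           # insert observers
    with torch.no_grad():
        for batch in calibration_loader:      # run representative data through the model
            model(batch)
    tq.convert(model, inplace=True)           # swap modules for quantized versions
    return model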

Runtime exports

A lot of TorchServe users have been looking to export their models to an optimized runtime like TensorRT/IPEX/ORT for accelerated inference. All of these runtimes run inference within the context of a session, so they can't simply be applied offline and stored directly in a saved model.
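
ORT illustrates the session pattern: the model is exported once and every subsequent inference call goes through a runtime session object rather than the original nn.Module (a minimal sketch; the model and shapes are illustrative):

# Sketch of the session-based pattern with ONNX Runtime: export once, then all inference
# happens through an InferenceSession instead of the saved PyTorch model.
import torch
import torchvision.models as models
import onnxruntime as ort

model = models.resnet18().eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet18.onnx", input_names=["input"], output_names=["output"])

session = ort.InferenceSession("resnet18.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(None, {"input": dummy.numpy()})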

Optimizations to the serving framework

Optimizations to the serving framework are even more opaque, but include notable knobs like num_workers, num_threads, number of models per GPU, queue_size and batch_size.

Out of all of these configurations only batch_size has a clear tradeoff:

  1. A big batch size means higher latency but higher throughput, with diminishing returns
  2. A small batch size means low latency but low throughput

For the others the tradeoff isn't so clear, and the expectation is to run a grid search which, depending on the model, can take days of experiments and still not lead to a conclusive answer. The goal should not be a comprehensive grid search but just enough runs to detect the performance tradeoffs.

So there are a few options here:

  1. Leveraging other libraries for optimizations like Launcher core pinning #1401 which takes care of pinning workers to different CPU cores so users don't have to experiment with it
  2. Using a simple ranking model to decide on optimizations
  3. Bayesian optimization like Ax but for inference
  4. Scaling out experiments by launching various instances of torchserve concurrently and then collecting the results in a central place
  5. Try out simple heuristics based on QPS or utilization to change TorchServe-level configs like the number of workers and see what happens (a rough sketch follows this list). This could also be shipped by default as torchserve --start --configure_worker
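
A rough sketch of what option 5 could look like, using the existing management API to scale workers; the latency source (measure_p99_ms) and thresholds are placeholders, not a proposed interface:

# Naive control loop: scale TorchServe workers via the management API (port 8081)
# when observed latency exceeds a target. Heuristic and numbers are illustrative.
import time
import requests

MANAGEMENT = "http://localhost:8081"
MODEL = "resnet152"
TARGET_P99_MS = 50
MAX_WORKERS = 8

def current_workers():
    status = requests.get(f"{MANAGEMENT}/models/{MODEL}").json()
    return len(status[0]["workers"])

def scale_to(n):
    requests.put(f"{MANAGEMENT}/models/{MODEL}", params={"min_worker": n, "synchronous": "true"})

def control_loop(measure_p99_ms, interval_s=30):
    while True:
        p99 = measure_p99_ms()        # e.g. from client-side timing or the metrics endpoint
        workers = current_workers()
        if p99 > TARGET_P99_MS and workers < MAX_WORKERS:
            scale_to(workers + 1)     # simple heuristic: add a worker when latency is high
        time.sleep(interval_s)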

Conclusion

Analyzing models is hard and building benchmark suites is hard, so it's worthwhile creating a streamlined experience for all of the above to make it easier for people to benchmark, profile, optimize and analyze their models.

cc: @chauhang @HamidShojanazeri @yqhu @mreso @lxning @nskool @maaquib @ashokei @d4l3k @gchanan

@msaroufim msaroufim changed the title [RFC]: TorchServe Model Analyzer - AutoML for Inference - Cost aware Deployment [RFC]: TorchServe Model Analyzer Feb 23, 2022
@msaroufim msaroufim pinned this issue Feb 23, 2022
@msaroufim msaroufim self-assigned this Feb 23, 2022
@maaquib maaquib added the enhancement New feature or request label Feb 23, 2022

d4l3k commented Feb 24, 2022

Are most of the parameters that users want to tune available from the RPC endpoint, or do you need to change the server config for each set? Might be feasible to bundle up the torchserve benchmark suite as a torchx component and run it via Ax for proper Bayesian HPO.

That sounds like it may require a lot of knobs to configure correctly, so it might be too hard for the average user to get started with.

msaroufim (Member, Author) replied:

> Are most of the parameters that users want to tune available from the RPC endpoint, or do you need to change the server config for each set?

You can change a lot of TorchServe parameters dynamically using the management API, and you can also swap in a new, more optimized model in the same running instance, so it should be possible to do quite a bit without ever stopping TorchServe.

> That sounds like it may require a lot of knobs to configure correctly, so it might be too hard for the average user to get started with.

Maybe, I think we can do quite a bit to streamline comparisons


d4l3k commented Feb 24, 2022

I think there are a lot of opportunities for automatic tuning while the server is running, based on memory, QPS and latency.

We might be able to mitigate a lot of these pain points at the service level so you don't even need a param sweep at all.

Advanced users would likely want more control but I bet smarter defaults/auto tuning is good enough for 80% of users

@msaroufim msaroufim unpinned this issue Feb 24, 2022
@lxning lxning added this to the v0.6.0 milestone Mar 7, 2022
@lxning lxning added this to backlog in v0.6.0 lifecycle Mar 7, 2022
@msaroufim msaroufim moved this from backlog to in progress in v0.6.0 lifecycle Mar 14, 2022
@pytorch pytorch deleted a comment from Flyingdog-Huang Mar 23, 2022
@msaroufim msaroufim pinned this issue Mar 25, 2022
@msaroufim msaroufim unpinned this issue Jun 15, 2022