[RFC]: Metrics Refactoring #1492

Open · 3 of 4 tasks
lxning opened this issue Mar 7, 2022 · 12 comments · Fixed by #1954
@lxning
Collaborator

lxning commented Mar 7, 2022

Current TorchServe has two mechanisms for emitting metrics.

  1. Emit metrics to log files in a StatsD-like format (the default).

In this case, both frontend and backend metrics are recorded in log files. However, the log format is not standard StatsD: it omits the metric type information (i.e. counter, gauge, timer, and so on), so users have to write regular expressions to parse the logs in order to build dashboards (a parsing sketch follows).
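For example, a metric log line and the kind of regex users end up writing look roughly like the sketch below (the exact line format is approximate and shown only to illustrate that the type information is missing):

```python
import re

# Approximate shape of a line in ts_metrics.log / model_metrics.log; note there is a
# name, unit, value and dimensions, but no metric type (counter, gauge, timer, ...).
line = "CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-0-0-1,timestamp:1635279979"

METRIC_RE = re.compile(
    r"(?P<name>\w+)\.(?P<unit>\w+):(?P<value>[0-9.]+)\|#(?P<dims>[^|]*)\|#hostname:(?P<host>.*)"
)

m = METRIC_RE.match(line)
if m:
    print(m.group("name"), m.group("unit"), m.group("value"), m.group("dims"))
```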

  2. Emit Prometheus-formatted metrics.

In this case, existing TorchServe only emits three metrics:

  • ts_inference_requests_total
  • ts_inference_latency_microseconds
  • ts_queue_latency_microseconds

Users are not able to get model metrics or system metrics via the metrics endpoint.
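For reference, a quick way to see what the metrics endpoint currently returns (default metrics API address shown; adjust to your configuration):

```python
import requests

# Scrape the Prometheus-format metrics endpoint; today this only exposes
# ts_inference_requests_total, ts_inference_latency_microseconds and
# ts_queue_latency_microseconds.
resp = requests.get("http://127.0.0.1:8082/metrics")
print(resp.text)
```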

No central place to store metric definitions

Existing TorchServe metric definitions are spread across the codebase. It is difficult for users to know which metrics are available.

Re-Design

TS_Metrics_Design.pdf

Sub-tasks on the frontend side

Tasks

  1. namannandan
  2. metrics triaged (namannandan)
  3. namannandan
@lxning lxning self-assigned this Mar 7, 2022
@lxning lxning added enhancement New feature or request benchmark labels Mar 7, 2022
@lxning lxning added this to the v0.6.0 milestone Mar 7, 2022
@lxning lxning added this to backlog in v0.6.0 lifecycle Mar 7, 2022
@msaroufim msaroufim changed the title from "expand model metrics in promethsus endpoint" to "expand model metrics in prometheus endpoint" Mar 14, 2022
@msaroufim
Member

For the requirements here we need to make sure to include:

  • Adding full coverage for frontend and backend metrics with Prometheus export
  • Instructions to make it easy for anyone to export their own metrics to Prometheus

@lxning lxning changed the title from "expand model metrics in prometheus endpoint" to "Metrics Refactoring" Mar 26, 2022
@thimabru1010

Yes, I can only see these 3 metrics in Prometheus. I tried to change the model log behavior with a log4j2.xml file, but with no success. How do I get system metrics in Prometheus?

@lxning lxning changed the title from "Metrics Refactoring" to "[RFC]: Metrics Refactoring" Apr 16, 2022
@lxning lxning pinned this issue Apr 16, 2022
@jonhilgart22

Custom metrics would be great. I'd like to add some histogram metrics for Prometheus to see latency percentiles.

@sharvil

sharvil commented May 10, 2022

Refactoring the existing metrics implementation is a welcome change. After reading through the proposed design, I recommend against introducing the metrics.yaml file.

In our case, we have model-specific performance metrics that we'd like to log and monitor. The ideal scenario is for each model archive to manage its own metrics: as archives are loaded/unloaded, they would register/unregister their metrics, and those changes would be reflected in the metrics exposed by the frontend.

Defining a global metrics schema in metrics.yaml would require us to update that file separately every time a model is loaded or unloaded. In addition, doing so is repetitive, since the proposed MetricsCaching.addMetric method already requires us to specify all the properties that would be in metrics.yaml.

@sharvil

sharvil commented Jul 13, 2022

@msaroufim just wanted to check in about my previous comment. Given that models can be loaded/unloaded dynamically, I'm not sure how a pre-defined metrics.yaml file will work. Can we discuss before the PR is merged?

@msaroufim
Member

msaroufim commented Jul 13, 2022

@joshuaan7 and @lxning are leading the design and implementation for this refactor. It sounds like you're suggesting we archive metric definitions along with the model to support model-specific metrics. That request is in line with being a multi-model serving framework if you view metrics as model-specific, but I can also understand the case for treating metrics as framework-level configuration. For convenience, we can provide a global, predefined metrics.yaml. I think your ask makes sense, but I'll leave it to @joshuaan7 and @lxning to decide if, when, and how to stage this work.

@sharvil

sharvil commented Jul 14, 2022

My comments in this thread revolve around two suggestions that are independent of each other but both relate to this metrics refactor.

1. Model-specific metrics

We have model-specific metrics that we'd like to log and monitor. In our ideal world, TorchServe would provide a mechanism for each model to log and publish those metrics just like it does for global metrics (e.g. ts_inference_latency_microseconds). Doing so would reduce code duplication and present a consistent mechanism to handle metrics.

Concretely, we have models with multiple preprocessing and inference steps. We need to capture timing metrics for each of those steps so we can identify bottlenecks and potential regressions as we update our models. TorchServe can't, in general, know what those individual preprocessing/inference steps are, so the metrics collection must happen within the model's handler.
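As a minimal sketch of the kind of per-step instrumentation meant here (the handler class and metric names are illustrative), the existing custom metrics API already lets a handler record per-step timings; what's missing is getting them published to Prometheus:

```python
import time

from ts.torch_handler.base_handler import BaseHandler


class MultiStepHandler(BaseHandler):
    def handle(self, data, context):
        metrics = context.metrics  # metrics store provided by the backend worker

        start = time.time()
        batch = self.preprocess(data)
        metrics.add_time("PreprocessStepTimeMs", round((time.time() - start) * 1000, 2), None, "ms")

        start = time.time()
        output = self.inference(batch)
        metrics.add_time("InferenceStepTimeMs", round((time.time() - start) * 1000, 2), None, "ms")

        return self.postprocess(output)
```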

2. Schema-free metrics

Defining an explicit schema in metrics.yaml introduces an additional step a developer must take beyond instrumenting the code to capture metrics. But this schema is already known to TorchServe at runtime, since the instrumented code must also provide this information. Moreover, it's error-prone: the schema declared in metrics.yaml can drift from what's implemented in code.

Instead of having an explicit schema, I suggest either a) having TorchServe expose the schema it's currently using through the metrics endpoint, e.g. http://127.0.0.1:8082/metrics?schema; or b) writing out a metrics.yaml file periodically so users can see which metrics are available by inspecting that file. In either case, TorchServe does the work of producing the schema instead of a developer writing it by hand.

@lxning
Collaborator Author

lxning commented Jul 15, 2022

@sharvil Let me try to answer your questions.

  • Existing TorchServe allows users to define custom model-specific metrics via the custom-metrics-api. All metrics from the backend (i.e. model metrics) are written into model_metrics.log; I assume that is what you are using now. The problem with this solution is that the metrics have no well-defined type, so they cannot be correctly populated into the Prometheus metrics framework. For legacy support, TorchServe will continue to accept metrics from the custom-metrics-api and write them into model_metrics.log in the current style.

  • The proposed MetricsCaching.addMetric method is used to load all model metrics defined in the yaml file during backend initialization. In other words, it is not necessary for users to call it again. Users can directly access MetricsCaching to get the metrics loaded from the yaml file (see the sketch after this list). Eventually, all the model metrics defined in the yaml file will be populated into the Prometheus metrics framework by the frontend, and users can access them via 127.0.0.1:8082/metrics.

  • Regarding "schema-free metrics", you mentioned two requirements.

  1. Metrics schema endpoint: the Prometheus API is able to provide the metrics info (e.g. example) once the [RFC] is implemented.

  2. Static metrics (i.e. a metrics config yaml file) vs. dynamic metrics: all the metrics, including the metrics from the TorchServe frontend (i.e. in the existing ts_metrics.log) and the model metrics from the backend (i.e. in the existing model_metrics.log), need to be cached and populated into Prometheus by the frontend. Considering performance and simplicity, we prefer static metric caching in the frontend. That is one of the reasons we introduced the metrics yaml file.
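As a rough illustration of this flow (get_metric and update below are assumed names used only for illustration, not a confirmed API): metric definitions come from the yaml config and are loaded into the backend cache at initialization, so handler code only looks metrics up and records values against them:

```python
import time


def run_model(data):
    # Placeholder for the user's preprocessing + inference steps.
    start = time.time()
    result = [len(item) for item in data]
    return result, (time.time() - start) * 1000.0


def handle(data, context):
    # context.metrics is the backend-side cache, pre-populated from the yaml config at
    # worker initialization; get_metric()/update() are hypothetical accessor names.
    latency_metric = context.metrics.get_metric("HandlerLatencyMs")

    result, elapsed_ms = run_model(data)
    latency_metric.update(elapsed_ms)  # later collected by the frontend and exposed
                                       # at http://127.0.0.1:8082/metrics
    return result
```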

@sharvil

sharvil commented Jul 18, 2022

Thanks for the detailed response.

Existing TorchServe allows users to define custom model-specific metrics via the custom-metrics-api. All metrics from the backend (i.e. model metrics) are written into model_metrics.log; I assume that is what you are using now. The problem with this solution is that the metrics have no well-defined type, so they cannot be correctly populated into the Prometheus metrics framework. For legacy support, TorchServe will continue to accept metrics from the custom-metrics-api and write them into model_metrics.log in the current style.

I think we're saying the same thing here. The existing custom metrics API doesn't go far enough because it doesn't publish to Prometheus. We'd like the metrics API refactor to allow for custom metrics that get published to Prometheus.

The proposed MetricsCaching.addMetric method is used to load all model metrics defined in the yaml file during backend initialization. In other words, it is not necessary for users to call it again. Users can directly access MetricsCaching to get the metrics loaded from the yaml file. Eventually, all the model metrics defined in the yaml file will be populated into the Prometheus metrics framework by the frontend, and users can access them via 127.0.0.1:8082/metrics.

Yes, I understand the proposed API design. I'm suggesting an alternative design to support the following scenario:

  • each model handler can log model-specific metrics
  • those model-specific metrics are published to Prometheus (by the frontend, which collects metrics from each backend)
  • TorchServe already has a mechanism to dynamically load/unload models without restarting the server, so the metrics logging and collection needs to be flexible enough to also handle dynamic model loading/unloading (i.e. not just at backend initialization time)

So maybe we should take a step back and answer two questions:

  1. Is the scenario I described above one that the TorchServe team is interested in enabling?
  2. If so, how does the proposed API support that scenario?

@lxning, what do you think?

This was referenced Jul 18, 2022
@duk0011

duk0011 commented Aug 11, 2022

If the model is configured with multiple workers, is there a way to get worker-specific metrics/information within the model logs? If not, can this be added?

@rohithkrn rohithkrn unpinned this issue Sep 21, 2022
namannandan pushed a commit to namannandan/serve that referenced this issue Oct 12, 2022
namannandan pushed a commit to namannandan/serve that referenced this issue Oct 18, 2022
namannandan pushed a commit to namannandan/serve that referenced this issue Oct 19, 2022
namannandan pushed a commit to namannandan/serve that referenced this issue Oct 21, 2022
@lxning
Collaborator Author

lxning commented Oct 28, 2022

@duk0011 Workers are dynamically generated; for example, the frontend will create a new worker if an existing worker dies. So, in general, it is not very useful to monitor metrics per worker, especially in the case of elastic worker thread pooling in a future implementation. That's why TorchServe does not provide worker as a default dimension in the existing model metrics.

However, it is not a big deal to support worker as a metric dimension. We can add the worker id (i.e. the port number) to the context so that users can fetch it in the handler and emit it as a dimension value, as sketched below.
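A minimal sketch of what that could look like in a handler, assuming the worker id is made available on the context (the worker_id attribute is hypothetical); Dimension and add_counter are from the existing custom metrics API:

```python
from ts.metrics.dimension import Dimension


def handle(data, context):
    # `worker_id` on the context is hypothetical; the proposal above is to expose the
    # worker's port number through the context so handlers can emit it as a dimension.
    worker_id = getattr(context, "worker_id", "unknown")

    context.metrics.add_counter(
        "RequestsPerWorker",
        value=1,
        dimensions=[Dimension("WorkerId", worker_id)],
    )

    return data
```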

@lxning
Collaborator Author

lxning commented Oct 28, 2022

@sharvil It is doable to support dynamic metrics configuration, but it will have a performance impact on the model server. So TorchServe will support static metrics in phase 1 and dynamic metrics in a later phase.

namannandan added a commit that referenced this issue Nov 1, 2022
* Metrics helper classes implementation
Dimension, Units and Metric

* Refactor metrics helper classes
1) Move metrics helper classes from src/backends to src/utils
2) Update Metric class to store a vector of values instead of a single value

* Fix metrics headers include guard to follow naming convention

* Refactor metrics implementation to follow the API described in the
metrics refactor RFC: #1492

* Revert changes to the following CMakeLists files since no change
is required as part of the metrics implementation:
cpp/src/backends/CMakeLists.txt
cpp/test/CMakeLists.txt

* Fix compiler warnings related to std::experimental::filesystem

* Refactor metrics helper classes to simplify backend metrics implementation
by emitting logs when the metrics API is called instead of storing them
until the completion of an inference request to flush the metrics

* Infer dimension names order from config file and use the same order
for dimension values argument in the metrics API.
Fix clang-tidy warnings.

* Refactor backend metrics unit tests to use same regex as frontend to parse metric logs
@maaquib maaquib reopened this Nov 15, 2022
github-merge-queue bot pushed a commit that referenced this issue Jan 18, 2024
* add workaround solution from nvidia

* add comments

* expand runtimeType

* add runtimeType in model config

* add unit test

* revert test/buildspec_gpu.yml

* update testng.xml

* update json files

* fmt

* fmt

* init cpp dir

* init code structure

* Init socket code and otf protocol

* add log api

* decouple backend and model worker; impl torchscript load model; build scripts [ci skip]

* delete src/CMakeLists.txt

* init model archive manifest loader

* add manifest and unit test

* integrate manifest into backend; add unit test

* update otf message internal structure; add inference request message

* update otfmessage function return [skip ci]

* add torch base handler

* support dynamic load shared lib

* disable install libtorch

* update utils/CMakeLists.txt

* add dynamic lib loader unit test

* [skip CI] update src/utils/CMakeLists.txt

* install kineto in build.sh

* [skip ci] add handler factory

* [skip ci] update inference request message

* vision handler inference impl.

* [skip ci] update torchscript backend api

* change model_path to model_dir [skip ci]

* [skip ci] torchscripted handler load model pass positive test

* [skip ci] fix dll test teardown

* [skip ci] add mnist inference positive test

* update torchscripted base handler postprocess

* [skip ci] add model instance status in backend

* [skip ci]add prediction  test for base and dll handler

* [skip ci] clean up

* add batch inference test

* [skip ci] add dll close

* [skip ci] file_system clean up

* [skip ci] add mnist scripted model pt file for testing

* [skip ci] torch_scripted/torch_scripted_backend_test refactor

* [skip ci] torch_scripted_backend_test refactor

* [skip ci] extend LoadPredict api

* [skip ci] add negative test in torch_scripted_backend_test

* explicit set ABI=1

* support different build step for linux and mac

* [skip ci] update folly installation

* add sudo for folly dep installation

* [skip ci] update libtorch cuda installation

* [skip ci] update sudo mac

* [skip ci] update cuda option flag

* [skip ci] set cuda compiler

* [skip ci] skip install kineto on linux

* [skip ci] fix cuda compile path

* add documentation

* update gcc version description

* add cpp log config file option

* add parameters

* update setup.py for package cpp

* set cpp libpath env

* add unit test

* [skip ci] install lib

* [skip ci] add cpp log config path for cpp backend start

* CPP OTFProtocol implementation for inference request and response (#1817)

* Add folly logging

* Adding model response serializer

* Slight refactor

* Adding test for otf protocol

* Address review comments

* Adding some logging tests

* Refactoring socket methods into its own class to enable mocking for testing

* otf protocol implementation for inference request and response

* rebase from #1814

* rebase from #1814

* refactor after PR#1814 rebase

* add unit tests for inference request and response otf protocol

* add cpp backend test for otf_protocol and handler

* Update logging flag to read log config file

* Address test review comments

* Remove response end from LoadModelResponse data struct

* Removing googletest since folly already has it

* Adding errno to socket send and receive failures

* address review comments

* refactor LoadRequest and Response OTF protocol to remove use of new and minor improvements

* address review comments

Co-authored-by: Aaqib Ansari <maaquib@gmail.com>

* update model archiver for cpp

* Bug fix in cpp integration (#1887)

* bug fixes in java - cpp integration

* revert arg changes

* add clang-tidy in build (#1896)

* replace strcpy with strncpy (#1898)

* add clang-tidy in build

* replace strcpy with strncpy

* [WIP] cpp backend with Java integ (#1890)

* Fix build script

* Fix socket issues

* Fix socket name truncation for uds

* Fix expected log lines format from frontend

* Removing some debug logs

* Address review comments

* Remove incorrect log line

* Fix inference issue

* Update filesystem import

* Fix path of default logging file

* Make build.sh executable

* add clang-tidy and clang-format for cpp backend lint (#1910)

* add clang-tidy in build

* replace strcpy with strncpy

* fix warnings for torchscript backends

* add .clang-tidy and update CMakeLists

* add clang-format

* remove unused parameter in cpp basehandler (#1917)

* add clang-tidy in build

* replace strcpy with strncpy

* fix warnings for torchscript backends

* add .clang-tidy and update CMakeLists

* add clang-format

* remove unused parameters in basehandler and update mnist handler

* remove libmnist_handler.dylib

* remove unnecessary func softmax in mnist example handler

* fix clang-tidy warnings (#1915)

* CPP mnist_base postman test (#1907)

* add mnist base cpp postman integration test

* refactor based on #1917

* add response body validation

* disable grpc inference api test for cpp backend model

* fix typo

* install clang-format on linux (#1926)

* Add CI for cpp_backend branch (#1916)

* Create ci-cpu-cpp.yml

* Update ci-cpu-cpp.yml

* Update ci-cpu-cpp.yml

* Update ci-cpu-cpp.yml

* Metrics helper classes implementation for C++ backend (#1874)

* Metrics helper classes implementation
Dimension, Units and Metric

* Refactor metrics helper classes
1) Move metrics helper classes from src/backends to src/utils
2) Update Metric class to store a vector of values instead of a single value

* Fix metrics headers include guard to follow naming convention

* Refactor metrics implementation to follow the API described in the
metrics refactor RFC: #1492

* Revert changes to the following CMakeLists files since no change
is required as part of the metrics implementation:
cpp/src/backends/CMakeLists.txt
cpp/test/CMakeLists.txt

* Fix compiler warnings related to std::experimental::filesystem

* Refactor metrics helper classes to simplify backend metrics implementation
by emitting logs when the metrics API is called instead of storing them
until the completion of an inference request to flush the metrics

* Infer dimension names order from config file and use the same order
for dimension values argument in the metrics API.
Fix clang-tidy warnings.

* Refactor backend metrics unit tests to use same regex as frontend to parse metric logs

* install cpp via install_from_src (#1883)

* add clang-tidy in build

* replace strcpy with strncpy

* fix warnings for torchscript backends

* add .clang-tidy and update CMakeLists

* add clang-format

* remove unused parameters in basehandler and update mnist handler

* remove libmnist_handler.dylib

* remove unnecessary func softmax in mnist example handler

* feature install cpp from install_from_src

* add --install-dependencies in setup.py

* fix typo

* update MANIFEST.in and readme

* update readme

* code cleanup

* update readme

* update logging path

* fix backend worker started checking

* update readme

* Update README.md

* YAML metrics configuration handling for C++ backend (#1941)

* fix yaml_cpp installation in build script (#1996)

* fix yaml_cpp installation

* build request id strings for one batch

* Metrics cache implementation and integration with C++ backend (#1975)

* Metrics cache implementation for C++ backend

* Metrics cache integration with C++ backend

Co-authored-by: Naman Nandan <namannan@amazon.com>

* Revert "Metrics cache implementation and integration with C++ backend (#1975)" (#2011)

This reverts commit 3451bb7.

* Metrics cache implementation and integration with C++ backend (#2012)

* Metrics cache implementation for C++ backend

* Metrics cache integration with C++ backend

Co-authored-by: Naman Nandan <namannan@amazon.com>

* Fix lint error

* Fix lint error

* Fix model-archiver after cpp merge

* Adjust signature of workerLifeCycleMnist.startWorker in test

* Fix unit tests after merging master into cpp_backend

* Fix linting error

* Install dependencies for cpp backend

* Fix unit tests after cpp merge

* Fix formatting

* Move installation of cpp deps to ts_scripts/install_dependencies.py

* Build cpp backend for regression and sanity tests

* Fix formatting

* Fix typo

* Temp fix hanging after starting cpp worker

* Add pytest for cpp backend

* Roll back building of cpp backend in ci regression and sanity tests; install deps in cpp ci

* Fix formatting

* Remove mnist_custom_cpp.mar file from postman test as we do not build cpp backend for general regression test

* Remove cpp model archive in additional place

* Remove cpp build from setup.py

* Remove additional ref to build_cpp in setup.py

* fix code link

* Update README.md

* Update libtorch versions + move installation of cpp backend to build.sh

* Prepare cpp build workflow for merge into master

* Update cuda version in cpp/build.sh

* Remove reference to LDP

* Fix WorkerLifeCycleTest

* rm src/test/resources/config_test_cpp.properties

* Remove debug prints

* Skip cpp backend test if cpp backend is not available

---------

Co-authored-by: lxning <lninga@amazon.com>
Co-authored-by: Aaqib Ansari <maaquib@gmail.com>
Co-authored-by: lxning <23464292+lxning@users.noreply.github.com>
Co-authored-by: rohithkrn <rohith.nallamaddi@gmail.com>
Co-authored-by: Naman Nandan <namankt55@gmail.com>
Co-authored-by: Naman Nandan <namannan@amazon.com>
Co-authored-by: Geeta Chauhan <4461127+chauhang@users.noreply.github.com>