Caching Metrics implementation #1954

maaquib · 2022-11-07T17:26:52Z

Description

This PR builds on #1727, fixing unit tests and making code changes based on spec updates.

> python ./ts_scripts/install_from_src.py
> torchserve --start --ncs --model-store ~/Downloads/model_store
> curl -X POST "http://localhost:8081/models?url=resnet-18.mar&model_name=resnet-18&initial_workers=1"
> curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_classifier/kitten.jpg
> curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_classifier/kitten.jpg
> torchserve --stop
> cat logs/model_metrics.log
2022-11-08T14:39:28,110 - HandlerTime.Milliseconds:134.14|#ModelName:resnet-18,Level:Model|#hostname:88665a4a5a1b.ant.amazon.com,requestID:8068b315-9df4-4b26-bff4-926b262a01af,timestamp:1667947168
2022-11-08T14:39:28,111 - PredictionTime.Milliseconds:134.43|#ModelName:resnet-18,Level:Model|#hostname:88665a4a5a1b.ant.amazon.com,requestID:8068b315-9df4-4b26-bff4-926b262a01af,timestamp:1667947168
2022-11-08T14:39:29,386 - HandlerTime.Milliseconds:105.97|#ModelName:resnet-18,Level:Model|#hostname:88665a4a5a1b.ant.amazon.com,requestID:8c350922-6f12-4681-9f82-0b60941a0b68,timestamp:1667947169
2022-11-08T14:39:29,388 - PredictionTime.Milliseconds:106.19|#ModelName:resnet-18,Level:Model|#hostname:88665a4a5a1b.ant.amazon.com,requestID:8c350922-6f12-4681-9f82-0b60941a0b68,timestamp:1667947169
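The log lines above appear to follow the pattern `<log timestamp> - <Name>.<Unit>:<value>|#<dimensions>|#hostname:...,requestID:...,timestamp:...`. A minimal sketch of splitting one such line into fields, assuming that layout holds for every line. `MetricRecord` and `parse_metric_line` are hypothetical names used for illustration and are not part of TorchServe:

```python
from dataclasses import dataclass


@dataclass
class MetricRecord:
    """Hypothetical container for one parsed metric log line."""
    name: str
    unit: str
    value: float
    dimensions: dict
    hostname: str
    request_id: str
    timestamp: int


def parse_metric_line(line: str) -> MetricRecord:
    """Parse a line shaped like the model_metrics.log samples in this PR.

    Assumes exactly three '|#'-separated sections: the metric itself,
    the dimensions, and the hostname/requestID/timestamp metadata.
    """
    # Drop the logger prefix ("<date> - ") when it is present.
    payload = line.strip().split(" - ", 1)[-1]
    metric, dims, meta = payload.split("|#")

    # "HandlerTime.Milliseconds:134.14" -> name, unit, value
    name_unit, value = metric.rsplit(":", 1)
    name, unit = name_unit.split(".", 1)

    # "ModelName:resnet-18,Level:Model" -> {"ModelName": "resnet-18", ...}
    dimensions = dict(item.split(":", 1) for item in dims.split(","))
    meta_fields = dict(item.split(":", 1) for item in meta.split(","))

    return MetricRecord(
        name=name,
        unit=unit,
        value=float(value),
        dimensions=dimensions,
        hostname=meta_fields["hostname"],
        request_id=meta_fields["requestID"],
        timestamp=int(meta_fields["timestamp"]),
    )


sample = (
    "2022-11-08T14:39:28,110 - HandlerTime.Milliseconds:134.14"
    "|#ModelName:resnet-18,Level:Model"
    "|#hostname:88665a4a5a1b.ant.amazon.com,"
    "requestID:8068b315-9df4-4b26-bff4-926b262a01af,timestamp:1667947168"
)
record = parse_metric_line(sample)
print(record.name, record.unit, record.value, record.dimensions)
```

The parser raises an exception on lines that do not match the assumed layout, which is usually preferable to silently accepting a malformed metric.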

Fixes #1492

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
New feature (non-breaking change which adds functionality)
This change requires a documentation update

Feature/Issue validation/testing

Sanity test
Benchmark

~/Development/serve/benchmarks> python benchmark-ab.py -r 5000
Starting AB benchmark suite...


Configured execution parameters are:
{'url': 'https://torchserve.pytorch.org/mar_files/resnet-18.mar', 'gpus': '', 'exec_env': 'local', 'batch_size': 1, 'batch_delay': 200, 'workers': 1, 'concurrency': 10, 'requests': 5000, 'input': '../examples/image_classifier/kitten.jpg', 'content_type': 'application/jpg', 'image': '', 'docker_runtime': '', 'backend_profiling': False, 'config_properties': 'config.properties', 'inference_model_url': 'predictions/benchmark', 'report_location': '/var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T', 'tmp_dir': '/var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T', 'result_file': '/var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/result.txt', 'metric_log': '/var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/logs/model_metrics.log', 'inference_url': 'http://0.0.0.0:8080', 'management_url': 'http://0.0.0.0:8081', 'config_properties_name': 'config.properties'}


Preparing local execution...
*Terminating any existing Torchserve instance ...
torchserve --stop
TorchServe is not currently running.
*Setting up model store...
*Starting local Torchserve instance...
torchserve --start --model-store /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/model_store --workflow-store /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/wf_store --ts-config /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/conf/config.properties > /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/logs/model_metrics.log
*Testing system health...
{
  "status": "Healthy"
}

*Registering model...
{
  "status": "Model \"benchmark\" Version: 1.0 registered with 1 initial workers"
}



Executing warm-up ...
ab -c 10  -n 500.0 -k -p /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/input -T  application/jpg http://0.0.0.0:8080/predictions/benchmark > /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/result.txt
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests


Executing inference performance tests ...
ab -c 10  -n 5000 -k -p /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/input -T  application/jpg http://0.0.0.0:8080/predictions/benchmark > /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/result.txt
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
*Unregistering model ...
{
  "status": "Model \"benchmark\" unregistered"
}

*Terminating Torchserve instance...
torchserve --stop
TorchServe has stopped.
Apache Bench Execution completed.


Generating Reports...
Dropping 5073 warmup lines from log

Writing extracted PredictionTime metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/predict.txt

Writing extracted HandlerTime metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/handler_time.txt

Writing extracted QueueTime metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/waiting_time.txt

Writing extracted WorkerThreadTime metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/worker_thread.txt

Writing extracted CPUUtilization metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/cpu_percentage.txt

Writing extracted MemoryUtilization metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/memory_percentage.txt

Writing extracted GPUUtilization metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/gpu_percentage.txt

Writing extracted GPUMemoryUtilization metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/gpu_memory_percentage.txt

Writing extracted GPUMemoryUsed metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/gpu_memory_used.txt
*Generating CSV output...
Saving benchmark results to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T
*Preparing graphs...
*Preparing Profile graphs...
Working with sampling rate of 50

Test suite execution complete.

~/Development/serve/benchmarks> ll /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/
total 520
-rw-r--r--  1 mohaan  staff   627B Nov  9 15:48 ab_report.csv
drwxr-xr-x  3 mohaan  staff    96B Nov  9 15:38 conf
-rw-r--r--  1 mohaan  staff    42B Nov  9 15:48 cpu_percentage.txt
-rw-r--r--  1 mohaan  staff     0B Nov  9 15:48 gpu_memory_percentage.txt
-rw-r--r--  1 mohaan  staff     0B Nov  9 15:48 gpu_memory_used.txt
-rw-r--r--  1 mohaan  staff     0B Nov  9 15:48 gpu_percentage.txt
-rw-r--r--  1 mohaan  staff    32K Nov  9 15:48 handler_time.txt
-rw-r--r--  1 mohaan  staff   108K Nov  9 15:38 input
drwxr-xr-x  3 mohaan  staff    96B Nov  9 15:38 logs
-rw-r--r--  1 mohaan  staff    45B Nov  9 15:48 memory_percentage.txt
-rw-r--r--  1 mohaan  staff    32K Nov  9 15:48 predict.txt
-rw-r--r--  1 mohaan  staff    23K Nov  9 15:48 predict_latency.png
-rw-r--r--  1 mohaan  staff   1.4K Nov  9 15:48 result.txt
-rw-r--r--  1 mohaan  staff    21K Nov  9 15:48 waiting_time.txt
-rw-r--r--  1 mohaan  staff   9.8K Nov  9 15:48 worker_thread.txt
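The extracted files listed above (for example `predict.txt` and `handler_time.txt`) can be spot-checked independently of the generated `ab_report.csv`. The sketch below assumes each file holds one numeric value per line, which is not confirmed by the output shown here. `load_values`, `percentile`, and `summarize` are hypothetical helpers, and the percentile uses the simple nearest-rank method, which may differ from whatever the benchmark script computes:

```python
import math
from pathlib import Path


def load_values(path):
    """Read one float per non-empty line (assumed file layout)."""
    text = Path(path).read_text()
    return [float(line) for line in text.splitlines() if line.strip()]


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values):
    """Return count, mean, and a few latency percentiles."""
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
    }


# Synthetic data stands in for a real file such as predict.txt.
demo = [float(i) for i in range(1, 101)]
print(summarize(demo))
```

For a real run, `summarize(load_values(".../benchmark/predict.txt"))` would take the place of the synthetic list, with the path taken from the report location printed by the benchmark.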

Checklist:

Did you have fun?
Have you added tests that prove your fix is effective or that this feature works?
Has code been commented, particularly in hard-to-understand areas?
Have you made corresponding changes to the documentation?

codecov · 2022-11-08T20:49:47Z

Codecov Report

Merging #1954 (ff27a88) into master (b316608) will increase coverage by 8.64%.
The diff coverage is 91.90%.

@@            Coverage Diff             @@
##           master    #1954      +/-   ##
==========================================
+ Coverage   44.66%   53.31%   +8.64%     
==========================================
  Files          63       70       +7     
  Lines        2624     3157     +533     
  Branches       56       56              
==========================================
+ Hits         1172     1683     +511     
- Misses       1452     1474      +22
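The percentages in the diff above follow from the hit and line counts. A short worked check, treating coverage as hits divided by lines (an assumption about how the report derives its figures; `coverage_percent` is a hypothetical helper):

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Line coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


# Counts copied from the Codecov diff for master and for #1954.
master = {"lines": 2624, "hits": 1172, "misses": 1452}
pr = {"lines": 3157, "hits": 1683, "misses": 1474}

# Hits and misses should account for every line.
assert master["hits"] + master["misses"] == master["lines"]
assert pr["hits"] + pr["misses"] == pr["lines"]

before = coverage_percent(master["hits"], master["lines"])
after = coverage_percent(pr["hits"], pr["lines"])
delta = after - before

print(f"before={before:.4f}% after={after:.4f}% delta={delta:.4f}")
print("added lines:", pr["lines"] - master["lines"])
print("added hits:", pr["hits"] - master["hits"])
print("added misses:", pr["misses"] - master["misses"])
```

The computed values agree with the reported 44.66%, 53.31%, and +8.64% to within 0.01 percentage points, and the count differences match the +533, +511, and +22 columns.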

Impacted Files	Coverage Δ
ts/arg_parser.py	`25.80% <0.00%> (-3.23%)`	⬇️
ts/metrics/metrics_store.py	`92.98% <ø> (ø)`
ts/metrics/unit.py	`100.00% <ø> (ø)`
ts/service.py	`78.26% <68.75%> (-0.62%)`	⬇️
ts/model_service_worker.py	`65.89% <75.00%> (+0.29%)`	⬆️
ts/metrics/metric.py	`80.64% <77.77%> (-8.65%)`	⬇️
ts/tests/unit_tests/test_model_service_worker.py	`99.14% <83.33%> (+0.01%)`	⬆️
ts/metrics/caching_metric.py	`89.47% <89.47%> (ø)`
ts/metrics/metric_cache_abstract.py	`92.77% <92.77%> (ø)`
ts/metrics/metric_abstract.py	`92.85% <92.85%> (ø)`
... and 12 more

msaroufim · 2022-11-09T06:14:15Z

Was the Prometheus export in scope for this change? If so, can we add some screenshots? Otherwise, let me know and I can review the intended scope.

maaquib · 2022-11-09T17:12:46Z

> Was the Prometheus export in scope for this change? If so, can we add some screenshots? Otherwise, let me know and I can review the intended scope.

@msaroufim The Prometheus changes were not in scope for this PR.

lxning

Could you please also quickly test whether benchmark-ab.py can generate results with this PR?

docs/metrics.md

ts/model_service_worker.py

ts/configs/metrics.yaml

mreso

LGTM in general, but I would like to refactor this a bit to improve readability and maintainability.

docs/metrics.md

ts/metrics/imetric.py

ts/metrics/metric.py

ts/metrics/metric_cache_yaml_impl.py

ts/model_service_worker.py

ts/tests/unit_tests/test_beckend_metric.py

An and others added 30 commits July 5, 2022 16:38

Created metrics cache class

89d8baf

added unit tests, cleaned up naming

6485b01

moved unit testing files

2a82a6b

fixing conditional logic in add_metric function, filled out name fiel…

1875ca5

…d in yaml files

Changed dimensions to be list of Dimension objects instead of list of…

1b0d91c

… str

moved to pytest for unit testing

d1a0755

abstracted metric cache class and added pytest unit tests with cleare…

7d3ac3f

…ra ssertions

Adding rough metrics flush method

2fdacef

Converting prints to loggers and emitting metrics to logs

ac89405

removed system metrics, creating custom handler

460e22a

reassigning the metric name and passing in yaml file as an argument

5758f43

Adding log lines and setting up MetricsCache obj integration

4d3c0d0

Adding more unit tests, fixing dimensions parsing

419ba96

added custom error class to act as wrapper for MetricsCache objects

d7ffd48

Added more unit tests for catching naming Metric strings

2efd5e5

Added more comments to code

2bd8c66

working in torchserve start cmd

01d3af9

getting rid of abs path method that is not in use

4abd315

migrating additional add_metric methods from store to cache

8340230

Creating custom handler to test migrated methods and beginnings of te…

b416bc7

…sting for passing file path through properties

editing custom handler

45d3fac

fixing custom handler

6ced1ad

adding flags to reset Metrics after being emitted, trying custom handler

994af49

revising custom handler and passing yaml file as arg

66e5162

refactoring metrics log var name

f299514

getting rid of unneeded log lines

5b69469

editing default metric log path

1e585a2

adding req ids to Metric objs parsed by yaml file

a918be0

editing custom handler

41cb423

Merge branch 'master' of https://github.com/joshuaan7/serve

e3a6fd3

Revert changes to emit metrics in service

4fbdb80

maaquib changed the title ~~[WIP] Caching Metrics implementation~~ Caching Metrics implementation Nov 8, 2022

maaquib added 2 commits November 8, 2022 11:52

Remove tests for emit metrics

7fdbceb

Add metrics config to snapshot config

e4b23e6

maaquib marked this pull request as ready for review November 8, 2022 20:02

Merge branch 'master' into metrics

eb92650

maaquib added 3 commits November 8, 2022 14:40

Backward compatibile

6c92664

Merge branch 'metrics' of github.com:maaquib/serve into metrics

959582c

Merge branch 'master' into metrics

ed046c1

maaquib requested review from msaroufim and rohithkrn November 8, 2022 22:43

Fix failing test

5ed3369

maaquib mentioned this pull request Nov 9, 2022

[RFC]: Metrics Refactoring #1492 Draft PR #1727

Closed

1 task

lxning reviewed Nov 9, 2022

maaquib added 2 commits November 9, 2022 15:09

File format sync with cpp

95ec336

Address review comments

373291b

msaroufim requested a review from mreso November 10, 2022 00:08

mreso approved these changes Nov 10, 2022

msaroufim and others added 3 commits November 9, 2022 21:50

Merge branch 'master' into metrics

cc877d7

Addressing review comments

142a49e

Merge branch 'metrics' of github.com:maaquib/serve into metrics

ff27a88

lxning approved these changes Nov 10, 2022

lxning merged commit a9d8d5d into pytorch:master Nov 10, 2022

This was referenced Nov 10, 2022

Move cache metrics initialization in constructor #1966

Open

Migrate metric_store's old add_metric api to the new cache API #1967

Open

maaquib deleted the metrics branch November 10, 2022 22:46

msaroufim mentioned this pull request Sep 29, 2023

Metric non-default unit is not set in Metric object in versions 0.6.1 onwards #2637

Closed
