Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching Metrics implementation #1954

Merged
merged 74 commits into from
Nov 10, 2022
Merged

Caching Metrics implementation #1954

merged 74 commits into from
Nov 10, 2022

Conversation

maaquib
Copy link
Collaborator

@maaquib maaquib commented Nov 7, 2022

Description

This PR builds on #1727, fixing unit tests and making some code changes based on spec updates

> python ./ts_scripts/install_from_src.py
> torchserve --start --ncs --model-store ~/Downloads/model_store
> curl -X POST "http://localhost:8081/models?url=resnet-18.mar&model_name=resnet-18&initial_workers=1"
> curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_classifier/kitten.jpg
> curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_classifier/kitten.jpg
> torchserve --stop
> cat logs/model_metrics.log
2022-11-08T14:39:28,110 - HandlerTime.Milliseconds:134.14|#ModelName:resnet-18,Level:Model|#hostname:88665a4a5a1b.ant.amazon.com,requestID:8068b315-9df4-4b26-bff4-926b262a01af,timestamp:1667947168
2022-11-08T14:39:28,111 - PredictionTime.Milliseconds:134.43|#ModelName:resnet-18,Level:Model|#hostname:88665a4a5a1b.ant.amazon.com,requestID:8068b315-9df4-4b26-bff4-926b262a01af,timestamp:1667947168
2022-11-08T14:39:29,386 - HandlerTime.Milliseconds:105.97|#ModelName:resnet-18,Level:Model|#hostname:88665a4a5a1b.ant.amazon.com,requestID:8c350922-6f12-4681-9f82-0b60941a0b68,timestamp:1667947169
2022-11-08T14:39:29,388 - PredictionTime.Milliseconds:106.19|#ModelName:resnet-18,Level:Model|#hostname:88665a4a5a1b.ant.amazon.com,requestID:8c350922-6f12-4681-9f82-0b60941a0b68,timestamp:1667947169

Fixes #1492

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

~/Development/serve/benchmarks> python benchmark-ab.py -r 5000
Starting AB benchmark suite...


Configured execution parameters are:
{'url': 'https://torchserve.pytorch.org/mar_files/resnet-18.mar', 'gpus': '', 'exec_env': 'local', 'batch_size': 1, 'batch_delay': 200, 'workers': 1, 'concurrency': 10, 'requests': 5000, 'input': '../examples/image_classifier/kitten.jpg', 'content_type': 'application/jpg', 'image': '', 'docker_runtime': '', 'backend_profiling': False, 'config_properties': 'config.properties', 'inference_model_url': 'predictions/benchmark', 'report_location': '/var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T', 'tmp_dir': '/var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T', 'result_file': '/var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/result.txt', 'metric_log': '/var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/logs/model_metrics.log', 'inference_url': 'http://0.0.0.0:8080', 'management_url': 'http://0.0.0.0:8081', 'config_properties_name': 'config.properties'}


Preparing local execution...
*Terminating any existing Torchserve instance ...
torchserve --stop
TorchServe is not currently running.
*Setting up model store...
*Starting local Torchserve instance...
torchserve --start --model-store /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/model_store --workflow-store /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/wf_store --ts-config /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/conf/config.properties > /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/logs/model_metrics.log
*Testing system health...
{
  "status": "Healthy"
}

*Registering model...
{
  "status": "Model \"benchmark\" Version: 1.0 registered with 1 initial workers"
}



Executing warm-up ...
ab -c 10  -n 500.0 -k -p /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/input -T  application/jpg http://0.0.0.0:8080/predictions/benchmark > /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/result.txt
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests


Executing inference performance tests ...
ab -c 10  -n 5000 -k -p /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/input -T  application/jpg http://0.0.0.0:8080/predictions/benchmark > /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/result.txt
Completed 500 requests
Completed 1000 requests
Completed 1500 requests
Completed 2000 requests
Completed 2500 requests
Completed 3000 requests
Completed 3500 requests
Completed 4000 requests
Completed 4500 requests
Completed 5000 requests
Finished 5000 requests
*Unregistering model ...
{
  "status": "Model \"benchmark\" unregistered"
}

*Terminating Torchserve instance...
torchserve --stop
TorchServe has stopped.
Apache Bench Execution completed.


Generating Reports...
Dropping 5073 warmup lines from log

Writing extracted PredictionTime metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/predict.txt

Writing extracted HandlerTime metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/handler_time.txt

Writing extracted QueueTime metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/waiting_time.txt

Writing extracted WorkerThreadTime metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/worker_thread.txt

Writing extracted CPUUtilization metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/cpu_percentage.txt

Writing extracted MemoryUtilization metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/memory_percentage.txt

Writing extracted GPUUtilization metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/gpu_percentage.txt

Writing extracted GPUMemoryUtilization metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/gpu_memory_percentage.txt

Writing extracted GPUMemoryUsed metrics to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/gpu_memory_used.txt
*Generating CSV output...
Saving benchmark results to /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T
*Preparing graphs...
*Preparing Profile graphs...
Working with sampling rate of 50

Test suite execution complete.

~/Development/serve/benchmarks> ll /var/folders/54/y4twglls2fb6541bntzlv53h0000gs/T/benchmark/
total 520
-rw-r--r--  1 mohaan  staff   627B Nov  9 15:48 ab_report.csv
drwxr-xr-x  3 mohaan  staff    96B Nov  9 15:38 conf
-rw-r--r--  1 mohaan  staff    42B Nov  9 15:48 cpu_percentage.txt
-rw-r--r--  1 mohaan  staff     0B Nov  9 15:48 gpu_memory_percentage.txt
-rw-r--r--  1 mohaan  staff     0B Nov  9 15:48 gpu_memory_used.txt
-rw-r--r--  1 mohaan  staff     0B Nov  9 15:48 gpu_percentage.txt
-rw-r--r--  1 mohaan  staff    32K Nov  9 15:48 handler_time.txt
-rw-r--r--  1 mohaan  staff   108K Nov  9 15:38 input
drwxr-xr-x  3 mohaan  staff    96B Nov  9 15:38 logs
-rw-r--r--  1 mohaan  staff    45B Nov  9 15:48 memory_percentage.txt
-rw-r--r--  1 mohaan  staff    32K Nov  9 15:48 predict.txt
-rw-r--r--  1 mohaan  staff    23K Nov  9 15:48 predict_latency.png
-rw-r--r--  1 mohaan  staff   1.4K Nov  9 15:48 result.txt
-rw-r--r--  1 mohaan  staff    21K Nov  9 15:48 waiting_time.txt
-rw-r--r--  1 mohaan  staff   9.8K Nov  9 15:48 worker_thread.txt

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

An and others added 30 commits July 5, 2022 16:38
…sting for passing file path through properties
@maaquib maaquib changed the title [WIP] Caching Metrics implementation Caching Metrics implementation Nov 8, 2022
@maaquib maaquib marked this pull request as ready for review November 8, 2022 20:02
@codecov
Copy link

codecov bot commented Nov 8, 2022

Codecov Report

Merging #1954 (ff27a88) into master (b316608) will increase coverage by 8.64%.
The diff coverage is 91.90%.

@@            Coverage Diff             @@
##           master    #1954      +/-   ##
==========================================
+ Coverage   44.66%   53.31%   +8.64%     
==========================================
  Files          63       70       +7     
  Lines        2624     3157     +533     
  Branches       56       56              
==========================================
+ Hits         1172     1683     +511     
- Misses       1452     1474      +22     
Impacted Files Coverage Δ
ts/arg_parser.py 25.80% <0.00%> (-3.23%) ⬇️
ts/metrics/metrics_store.py 92.98% <ø> (ø)
ts/metrics/unit.py 100.00% <ø> (ø)
ts/service.py 78.26% <68.75%> (-0.62%) ⬇️
ts/model_service_worker.py 65.89% <75.00%> (+0.29%) ⬆️
ts/metrics/metric.py 80.64% <77.77%> (-8.65%) ⬇️
ts/tests/unit_tests/test_model_service_worker.py 99.14% <83.33%> (+0.01%) ⬆️
ts/metrics/caching_metric.py 89.47% <89.47%> (ø)
ts/metrics/metric_cache_abstract.py 92.77% <92.77%> (ø)
ts/metrics/metric_abstract.py 92.85% <92.85%> (ø)
... and 12 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@msaroufim
Copy link
Member

msaroufim commented Nov 9, 2022

Was the prometheus export in scope for this change? If so can we add some screenshots otherwise lemme know and I can review the intended scope

@maaquib
Copy link
Collaborator Author

maaquib commented Nov 9, 2022

Was the prometheus export in scope for this change? If so can we add some screenshots otherwise lemme know and I can review the intended scope

@msaroufim The prometheus changes were not part of this scope

Copy link
Collaborator

@lxning lxning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please also quickly test if the benchmark-ab.py can generate result based on this PR?

docs/metrics.md Outdated Show resolved Hide resolved
docs/metrics.md Show resolved Hide resolved
docs/metrics.md Outdated Show resolved Hide resolved
docs/metrics.md Outdated Show resolved Hide resolved
ts/model_service_worker.py Outdated Show resolved Hide resolved
ts/configs/metrics.yaml Outdated Show resolved Hide resolved
@msaroufim msaroufim requested a review from mreso November 10, 2022 00:08
Copy link
Collaborator

@mreso mreso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general but I would like to refactor this a bit to improve readability/maintainability

docs/metrics.md Show resolved Hide resolved
docs/metrics.md Show resolved Hide resolved
docs/metrics.md Outdated Show resolved Hide resolved
ts/metrics/imetric.py Outdated Show resolved Hide resolved
ts/metrics/metric.py Show resolved Hide resolved
ts/metrics/metric_cache_yaml_impl.py Outdated Show resolved Hide resolved
ts/metrics/metric_cache_yaml_impl.py Outdated Show resolved Hide resolved
ts/model_service_worker.py Show resolved Hide resolved
ts/model_service_worker.py Show resolved Hide resolved
ts/tests/unit_tests/test_beckend_metric.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC]: Metrics Refactoring
5 participants