
managing nvml exception #1809

Merged
merged 3 commits on Aug 26, 2022
Conversation

@lromor (Contributor) commented Aug 23, 2022

Description


Fixes #1722

Type of change


  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

I tested the solution on the unsupported device: I can now run TorchServe even though monitoring for the device is not supported.

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal (Collaborator)

@lromor Can you please attach the failing and passing logs (which show the warning) to the PR?

@lromor (Contributor, Author) commented Aug 24, 2022

Sure, before:

$: ~/.local/bin/torchserve --start --foreground --model-store model-store/ 
Removing orphan pid file.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2022-08-24T08:44:21,415 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2022-08-24T08:44:21,458 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2022-08-24T08:44:21,516 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.6.0
TS Home: /home/research/serve
Current directory: /home/research/serve
Temp directory: /tmp
Number of GPUs: 1
Number of CPUs: 1
Max heap size: 15545 M
Python executable: /usr/bin/python3
Config file: logs/config/20220824084403675-shutdown.cfg
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/research/serve/model-store
Initial Models: N/A
Log dir: /home/research/serve/logs
Metrics dir: /home/research/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/research/serve/model-store
Model config: N/A
2022-08-24T08:44:21,523 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {
  "name": "20220824084403675-shutdown.cfg",
  "modelCount": 0,
  "created": 1661330643676,
  "models": {}
}
2022-08-24T08:44:21,530 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot 20220824084403675-shutdown.cfg
2022-08-24T08:44:21,531 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot 20220824084403675-shutdown.cfg validated successfully
2022-08-24T08:44:21,531 [WARN ] main org.pytorch.serve.snapshot.SnapshotManager - Model snapshot is empty. Starting TorchServe without initial models.
2022-08-24T08:44:21,533 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-08-24T08:44:21,584 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2022-08-24T08:44:21,585 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2022-08-24T08:44:21,586 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2022-08-24T08:44:21,586 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-08-24T08:44:21,587 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2022-08-24T08:44:22,191 [ERROR] Thread-1 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/home/research/serve/ts/metrics/system_metrics.py", line 91, in collect_all
    value(num_of_gpu)
  File "/home/research/serve/ts/metrics/system_metrics.py", line 72, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/home/research/.local/lib/python3.6/site-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/research/.local/lib/python3.6/site-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/research/.local/lib/python3.6/site-packages/nvgpu/list_gpus.py", line 26, in device_status
    temperature = nv.nvmlDeviceGetTemperature(handle, nv.NVML_TEMPERATURE_GPU)
  File "/home/research/.local/lib/python3.6/site-packages/pynvml/nvml.py", line 1956, in nvmlDeviceGetTemperature
    _nvmlCheckReturn(ret)
  File "/home/research/.local/lib/python3.6/site-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported

After:

$ ~/.local/bin/torchserve --start --foreground --model-store model-store/ 
Removing orphan pid file.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2022-08-24T08:45:59,854 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2022-08-24T08:45:59,899 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2022-08-24T08:45:59,958 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.6.0
TS Home: /home/research/serve
Current directory: /home/research/serve
Temp directory: /tmp
Number of GPUs: 1
Number of CPUs: 1
Max heap size: 15545 M
Python executable: /usr/bin/python3
Config file: logs/config/20220824084541786-shutdown.cfg
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/research/serve/model-store
Initial Models: N/A
Log dir: /home/research/serve/logs
Metrics dir: /home/research/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/research/serve/model-store
Model config: N/A
2022-08-24T08:45:59,965 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {
  "name": "20220824084541786-shutdown.cfg",
  "modelCount": 0,
  "created": 1661330741787,
  "models": {}
}
2022-08-24T08:45:59,973 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot 20220824084541786-shutdown.cfg
2022-08-24T08:45:59,973 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot 20220824084541786-shutdown.cfg validated successfully
2022-08-24T08:45:59,973 [WARN ] main org.pytorch.serve.snapshot.SnapshotManager - Model snapshot is empty. Starting TorchServe without initial models.
2022-08-24T08:45:59,975 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-08-24T08:46:00,028 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2022-08-24T08:46:00,029 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2022-08-24T08:46:00,030 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2022-08-24T08:46:00,030 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-08-24T08:46:00,031 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2022-08-24T08:46:00,564 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - Parse metrics failed: gpu device monitoring not supported
2022-08-24T08:46:00,566 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:16.7|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,566 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:460.01054763793945|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:24.38713836669922|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:5.0|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0|#Level:Host,device_id:0|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:0|#Level:Host,device_id:0|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:61089.7734375|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,568 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2535.2578125|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,568 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:5.0|#Level:Host|#hostname:training,timestamp:1661330760
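The merged diff isn't reproduced in this conversation, but the warning above suggests the metric collector now catches the NVML "not supported" error instead of crashing. A minimal, hypothetical sketch of that pattern (with a stand-in exception class so the snippet runs without pynvml or an NVIDIA driver; names like `collect_gpu_metrics` are illustrative, not TorchServe's actual API):

```python
class NVMLErrorNotSupported(Exception):
    """Stand-in for pynvml.nvml.NVMLError_NotSupported."""


def collect_gpu_metrics(device_statuses):
    """Return per-device statuses, or None with a warning when the
    device does not support NVML monitoring (instead of crashing)."""
    try:
        return device_statuses()
    except NVMLErrorNotSupported:
        # Degrade gracefully, as in the log line above:
        # "gpu device monitoring not supported"
        print("WARN: gpu device monitoring not supported")
        return None


def unsupported_device():
    # Simulates the failure in the "before" traceback.
    raise NVMLErrorNotSupported("Not Supported")


# A supported device returns its statuses; an unsupported one degrades
# to None so the remaining host metrics are still collected.
assert collect_gpu_metrics(lambda: [{"device_id": 0}]) == [{"device_id": 0}]
assert collect_gpu_metrics(unsupported_device) is None
```

In the actual fix, the `except` clause would catch `pynvml.nvml.NVMLError_NotSupported` around the `nvgpu` `device_statuses()` call, matching the traceback in the "before" log.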

@agunapal (Collaborator) left a comment

Thanks. Looks good to me.

@msaroufim msaroufim self-requested a review August 25, 2022 01:19
@codecov bot commented Aug 25, 2022

Codecov Report

Merging #1809 (ebbd4f1) into master (b8c23e6) will decrease coverage by 0.04%.
The diff coverage is 8.69%.

@@            Coverage Diff             @@
##           master    #1809      +/-   ##
==========================================
- Coverage   45.28%   45.23%   -0.05%     
==========================================
  Files          64       64              
  Lines        2597     2602       +5     
  Branches       60       60              
==========================================
+ Hits         1176     1177       +1     
- Misses       1421     1425       +4     
Impacted Files Coverage Δ
ts/metrics/system_metrics.py 30.50% <8.69%> (-0.98%) ⬇️


@agunapal agunapal force-pushed the fix-nvml-error-not-supported branch from 5aea057 to e7e03b7 on August 25, 2022 19:53
@agunapal (Collaborator)

@lromor Can you please rebase onto the latest master and fix the lint issues?

@lromor lromor force-pushed the fix-nvml-error-not-supported branch from e7e03b7 to be7ce4d on August 25, 2022 22:07
fix lint issue
@lromor lromor force-pushed the fix-nvml-error-not-supported branch from be7ce4d to 5d9e578 on August 25, 2022 22:11
@lromor (Contributor, Author) commented Aug 25, 2022

I'm having a few issues running pre-commit. Even from a fresh ubuntu:latest container, pre-commit makes changes that don't follow the coding standards you chose, and manually running isort and black seems to reformat the whole codebase. Any ideas?

@agunapal (Collaborator)


@lromor I checked out your branch and ran this command: pre-commit run --files ts/metrics/system_metrics.py

It worked; you then need to commit this file. Let me know if it doesn't work, and I can commit it.

@lromor (Contributor, Author) commented Aug 26, 2022

@agunapal, I've run the same command, and the last commit shows that it's now formatting parts of the code I didn't touch.
I ran pre-commit from a docker ubuntu:latest image, installing only python3-pip and git, then pip3 install pre-commit, followed by your command.

@agunapal (Collaborator)


@lromor That's expected behavior. The lint tool tries to correct the entire file.

@lromor (Contributor, Author) commented Aug 26, 2022

I see, but doesn't that imply that the original code wasn't properly formatted in the first place?

@agunapal (Collaborator)


@lromor Yes, that code was probably written before the lint tool was integrated.

@agunapal agunapal merged commit 696442b into pytorch:master Aug 26, 2022
@lxning lxning mentioned this pull request Sep 29, 2022
Development

Successfully merging this pull request may close these issues.

NVML_ERROR_NOT_SUPPORTED exception
3 participants