
managing nvml exception #1809

Merged
merged 3 commits on Aug 26, 2022
Conversation

@lromor (Contributor) commented Aug 23, 2022

Description


Fixes #1722

Type of change


  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

I tested the solution on the unsupported device: I can now run TorchServe even though monitoring for the device is not supported.

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal (Collaborator)

@lromor Can you please attach the failing and passing logs (which show the warning) to the PR?

@lromor (Contributor, Author) commented Aug 24, 2022

Sure, before:

$: ~/.local/bin/torchserve --start --foreground --model-store model-store/ 
Removing orphan pid file.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2022-08-24T08:44:21,415 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2022-08-24T08:44:21,458 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2022-08-24T08:44:21,516 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.6.0
TS Home: /home/research/serve
Current directory: /home/research/serve
Temp directory: /tmp
Number of GPUs: 1
Number of CPUs: 1
Max heap size: 15545 M
Python executable: /usr/bin/python3
Config file: logs/config/20220824084403675-shutdown.cfg
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/research/serve/model-store
Initial Models: N/A
Log dir: /home/research/serve/logs
Metrics dir: /home/research/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/research/serve/model-store
Model config: N/A
2022-08-24T08:44:21,523 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {
  "name": "20220824084403675-shutdown.cfg",
  "modelCount": 0,
  "created": 1661330643676,
  "models": {}
}
2022-08-24T08:44:21,530 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot 20220824084403675-shutdown.cfg
2022-08-24T08:44:21,531 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot 20220824084403675-shutdown.cfg validated successfully
2022-08-24T08:44:21,531 [WARN ] main org.pytorch.serve.snapshot.SnapshotManager - Model snapshot is empty. Starting TorchServe without initial models.
2022-08-24T08:44:21,533 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-08-24T08:44:21,584 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2022-08-24T08:44:21,585 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2022-08-24T08:44:21,586 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2022-08-24T08:44:21,586 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-08-24T08:44:21,587 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2022-08-24T08:44:22,191 [ERROR] Thread-1 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
  File "ts/metrics/metric_collector.py", line 27, in <module>
    system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
  File "/home/research/serve/ts/metrics/system_metrics.py", line 91, in collect_all
    value(num_of_gpu)
  File "/home/research/serve/ts/metrics/system_metrics.py", line 72, in gpu_utilization
    statuses = list_gpus.device_statuses()
  File "/home/research/.local/lib/python3.6/site-packages/nvgpu/list_gpus.py", line 67, in device_statuses
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/research/.local/lib/python3.6/site-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
    return [device_status(device_index) for device_index in range(device_count)]
  File "/home/research/.local/lib/python3.6/site-packages/nvgpu/list_gpus.py", line 26, in device_status
    temperature = nv.nvmlDeviceGetTemperature(handle, nv.NVML_TEMPERATURE_GPU)
  File "/home/research/.local/lib/python3.6/site-packages/pynvml/nvml.py", line 1956, in nvmlDeviceGetTemperature
    _nvmlCheckReturn(ret)
  File "/home/research/.local/lib/python3.6/site-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported

After:

$ ~/.local/bin/torchserve --start --foreground --model-store model-store/ 
Removing orphan pid file.
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2022-08-24T08:45:59,854 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2022-08-24T08:45:59,899 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2022-08-24T08:45:59,958 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.6.0
TS Home: /home/research/serve
Current directory: /home/research/serve
Temp directory: /tmp
Number of GPUs: 1
Number of CPUs: 1
Max heap size: 15545 M
Python executable: /usr/bin/python3
Config file: logs/config/20220824084541786-shutdown.cfg
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/research/serve/model-store
Initial Models: N/A
Log dir: /home/research/serve/logs
Metrics dir: /home/research/serve/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/research/serve/model-store
Model config: N/A
2022-08-24T08:45:59,965 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {
  "name": "20220824084541786-shutdown.cfg",
  "modelCount": 0,
  "created": 1661330741787,
  "models": {}
}
2022-08-24T08:45:59,973 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot 20220824084541786-shutdown.cfg
2022-08-24T08:45:59,973 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot 20220824084541786-shutdown.cfg validated successfully
2022-08-24T08:45:59,973 [WARN ] main org.pytorch.serve.snapshot.SnapshotManager - Model snapshot is empty. Starting TorchServe without initial models.
2022-08-24T08:45:59,975 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-08-24T08:46:00,028 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2022-08-24T08:46:00,029 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2022-08-24T08:46:00,030 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2022-08-24T08:46:00,030 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-08-24T08:46:00,031 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2022-08-24T08:46:00,564 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - Parse metrics failed: gpu device monitoring not supported
2022-08-24T08:46:00,566 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:16.7|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,566 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:460.01054763793945|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:24.38713836669922|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:5.0|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUtilization.Percent:0.0|#Level:Host,device_id:0|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - GPUMemoryUsed.Megabytes:0|#Level:Host,device_id:0|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,567 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:61089.7734375|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,568 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:2535.2578125|#Level:Host|#hostname:training,timestamp:1661330760
2022-08-24T08:46:00,568 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:5.0|#Level:Host|#hostname:training,timestamp:1661330760
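The merged diff isn't reproduced in this conversation, but the warning above suggests the metric collector now catches the NVML "not supported" error instead of crashing. A minimal, hypothetical sketch of that pattern (with a stand-in exception class so the snippet runs without pynvml or an NVIDIA driver; names like `collect_gpu_metrics` are illustrative, not TorchServe's actual API):

```python
class NVMLErrorNotSupported(Exception):
    """Stand-in for pynvml.nvml.NVMLError_NotSupported."""


def collect_gpu_metrics(device_statuses):
    """Return per-device statuses, or None with a warning when the
    device does not support NVML monitoring (instead of crashing)."""
    try:
        return device_statuses()
    except NVMLErrorNotSupported:
        # Degrade gracefully, as in the log line above:
        # "gpu device monitoring not supported"
        print("WARN: gpu device monitoring not supported")
        return None


def unsupported_device():
    # Simulates the failure in the "before" traceback.
    raise NVMLErrorNotSupported("Not Supported")


# A supported device returns its statuses; an unsupported one degrades
# to None so the remaining host metrics are still collected.
assert collect_gpu_metrics(lambda: [{"device_id": 0}]) == [{"device_id": 0}]
assert collect_gpu_metrics(unsupported_device) is None
```

In the actual fix, the `except` clause would catch `pynvml.nvml.NVMLError_NotSupported` around the `nvgpu` `device_statuses()` call, matching the traceback in the "before" log.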

@agunapal (Collaborator) left a comment

Thanks. Looks good to me.

@msaroufim msaroufim self-requested a review August 25, 2022 01:19
@codecov bot commented Aug 25, 2022

Codecov Report

Merging #1809 (ebbd4f1) into master (b8c23e6) will decrease coverage by 0.04%.
The diff coverage is 8.69%.

@@            Coverage Diff             @@
##           master    #1809      +/-   ##
==========================================
- Coverage   45.28%   45.23%   -0.05%     
==========================================
  Files          64       64              
  Lines        2597     2602       +5     
  Branches       60       60              
==========================================
+ Hits         1176     1177       +1     
- Misses       1421     1425       +4     
Impacted Files Coverage Δ
ts/metrics/system_metrics.py 30.50% <8.69%> (-0.98%) ⬇️


@agunapal agunapal force-pushed the fix-nvml-error-not-supported branch from 5aea057 to e7e03b7 on August 25, 2022 19:53
@agunapal (Collaborator)

@lromor Can you please rebase onto the latest master and fix the lint issues?

@lromor lromor force-pushed the fix-nvml-error-not-supported branch from e7e03b7 to be7ce4d on August 25, 2022 22:07
fix lint issue
@lromor lromor force-pushed the fix-nvml-error-not-supported branch from be7ce4d to 5d9e578 on August 25, 2022 22:11
@lromor (Contributor, Author) commented Aug 25, 2022

I'm having a few issues running pre-commit. Even from a fresh ubuntu:latest container, pre-commit makes changes that don't follow the coding standards you chose, and manually running isort and black seems to reformat the whole codebase. Any ideas?

@agunapal (Collaborator)


@lromor I checked out your branch and ran this command: pre-commit run --files ts/metrics/system_metrics.py

It worked; you then need to commit this file. Let me know if it doesn't work, and I can commit it.

@lromor (Contributor, Author) commented Aug 26, 2022

@agunapal, I've run the same command, and the last commit shows that it's now formatting parts of the code I didn't touch.
I ran pre-commit from a docker ubuntu:latest image, installing only python3-pip and git, then pip3 install pre-commit, followed by your command.

@agunapal (Collaborator)


@lromor That's expected behavior. The lint tool tries to correct the entire file.

@lromor (Contributor, Author) commented Aug 26, 2022

I see, but doesn't that imply that the original code wasn't properly formatted in the first place?

@agunapal (Collaborator)


@lromor Yes, that code was probably written before the lint tool was integrated.

@agunapal agunapal merged commit 696442b into pytorch:master Aug 26, 2022
@lxning lxning mentioned this pull request Sep 29, 2022
Development

Successfully merging this pull request may close these issues.

NVML_ERROR_NOT_SUPPORTED exception
3 participants