UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte [Ray component: Tune] #45492
Comments
Experiencing the same issue in a WSL2 instance after updating to the latest NVIDIA driver, 555. Windows: 11
A temporary fix is to roll back the NVIDIA driver to 552.44; that worked for me.
Indeed, I did that right after posting as a temporary workaround; it's not really a good long-term solution, though.
Agreed. My workaround, described in the original issue, was fairly easy but only covers Ray; wandb, for example, still suffered and didn't work as intended.
Dask has the same issue: dask/distributed#5768
The code that produces
Edit: add
It seems to be a problem with the new driver version, but only on WSL2. I commented on the upstream pynvml issue gpuopenanalytics/pynvml#53.
It seems this is a known problem and will be resolved in a new driver release (from gpuopenanalytics/pynvml#53)
Here is my output. Maybe it helps.
I'm having the same issue when using ray.data.read_csv. Can you please provide instructions on how to roll back to 552.44? Thanks in advance.
It turns out you can just download 552.44 from https://www.nvidia.com/download/driverResults.aspx/224484/en-us/ and install it in Windows.
Edit site-packages/ray/_private/accelerators/nvidia_gpu.py
Just updated to 560.70. Issue seems to be resolved!
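For anyone unsure which driver they are on, here is a minimal sketch that reads the version and compares it with the versions mentioned in this thread. It assumes `nvidia-smi` is on the PATH; `installed_driver_version` and `thread_status` are hypothetical helper names, and the two sets contain only the data points reported here (555.85 broken, 552.44 and 560.70 working), not an official compatibility list.

```python
import shutil
import subprocess

# Data points taken from this thread only; not an official compatibility list.
REPORTED_BROKEN = {"555.85"}
REPORTED_WORKING = {"552.44", "560.70"}


def installed_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            encoding="utf-8",
            errors="replace",  # do not crash if the output contains odd bytes
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


def thread_status(version):
    """Classify a version string using only the reports in this thread."""
    if version is None:
        return "driver version could not be determined"
    if version in REPORTED_BROKEN:
        return "reported broken"
    if version in REPORTED_WORKING:
        return "reported working"
    return "not mentioned in this thread"


if __name__ == "__main__":
    version = installed_driver_version()
    print(version, "->", thread_status(version))
```

A version that is "not mentioned in this thread" may or may not be affected; the sketch deliberately does not guess.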
Thanks for the update. Closing. Please open a new issue if the problem is not solved with information from the newer drivers.
Can we please integrate this fix into Ray? Thanks
I thought that only leads to problems later on. Maybe we should error out with a message about updating drivers?
It would be helpful to include information about drivers to provide some context; Ray definitely should be more developer-friendly. My concern with the drivers is that we are dependent on NVIDIA, and supporting a wider range of encodings is a good idea unless it leads to major problems. Having tested Ray with the current driver version, everything seems stable. Thanks
Ray sits near the bottom of a large stack of software. It tries to be as friendly as possible, but pulls in many pieces from third-party vendors: Python packages, OS components, user code. Not everything is under Ray's control.
The problem was that a particular version of the NVIDIA drivers on WSL was buggy. The immediate victim was an API call to determine the driver version, but simply changing the encoding would only allow the unsuspecting user to get to the next bug, which could have crashed the machine or led to wrong answers. I think failing fast is the best choice here; we don't know whether allowing unconventional encodings would have led to major problems or not. Perhaps the error message could be improved, but at the end of the day Ray depends on third-party software API calls working properly.
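To make the "fail fast, but with a better message" idea concrete, here is a minimal sketch. `decode_device_name` is a hypothetical helper, not part of Ray or pynvml, and the message wording is only a suggestion. It keeps strict UTF-8 decoding and turns the raw `UnicodeDecodeError` into an error that points at the driver.

```python
def decode_device_name(raw_name):
    """Decode a GPU name strictly as UTF-8, failing fast with an actionable message.

    Hypothetical helper for illustration; not Ray's or pynvml's API.
    """
    if isinstance(raw_name, str):
        return raw_name  # some pynvml versions already return str
    try:
        return raw_name.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise RuntimeError(
            f"NVML returned a device name that is not valid UTF-8 ({raw_name!r}). "
            "This has been reported with some NVIDIA driver versions on WSL2; "
            "updating or rolling back the driver may help."
        ) from exc


if __name__ == "__main__":
    print(decode_device_name(b"Example GPU"))
    try:
        decode_device_name(b"\xf8")  # the byte from the error in this issue
    except RuntimeError as err:
        print(err)
```

The original exception is preserved as `__cause__`, so the full traceback is still available for debugging.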
What happened + What you expected to happen
I've recently started experiencing the exact same issue (https://github.com/wandb/wandb/issues/7683#issue-2308957793), which I suspect is related to a recent NVIDIA driver update. Yesterday, I updated my NVIDIA driver to version 555.85, and since then, I've been encountering errors in both Ray Tune and W&B.
Initially, I encountered the error in Ray Tune, but after modifying the nvidia_gpu.py file in python3.11/site-packages/ray/_private/accelerators/ to use Latin-1 encoding instead of UTF-8, I was able to get my Ray Tune project working again. The modified code is as follows:
try:
    pynvml.nvmlInit()
except pynvml.NVMLError:
    return None  # pynvml init failed
device_count = pynvml.nvmlDeviceGetCount()
cuda_device_type = None
if device_count > 0:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    device_name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(device_name, bytes):
        device_name = device_name.decode("latin1")  # Changed from "utf-8" to "latin1"
    cuda_device_type = (
        NvidiaGPUAcceleratorManager._gpu_name_to_accelerator_type(device_name)
    )
pynvml.nvmlShutdown()
return cuda_device_type
However, I'm still experiencing issues with W&B, where I'm receiving errors and my metrics are not being monitored as intended.
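For context on why the Latin-1 change in the modified code makes the error disappear, here is a small standalone sketch using the single byte from the error message. Latin-1 assigns a character to every byte value, so decoding never raises; that hides the failure but does not guarantee the resulting device name is meaningful.

```python
raw = b"\xf8"  # the byte reported in the error message

# UTF-8: 0xf8 is not a valid start byte, so strict decoding raises.
try:
    raw.decode("utf-8")
    utf8_ok = True
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_ok = False
    utf8_error = str(exc)

# Latin-1 maps every byte 0x00-0xFF to the code point with the same number,
# so this cannot fail, whatever the bytes actually were meant to be.
latin1_text = raw.decode("latin1")

print(utf8_ok)
print(utf8_error)
print(repr(latin1_text))
```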
Versions / Dependencies
OS: Windows (WSL2)
Python version: 3.11.8
Ray: 2.22.0
NVIDIA driver version (installed on Windows): 555.85
Reproduction script
  File "/home/aricept094/MY_Scripts/Ray7_pl_augment_new.py", line 1629, in <module>
    analysis = tune.run(
               ^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/tune/tune.py", line 527, in run
    _ray_auto_init(entrypoint=error_message_map["entrypoint"])
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/tune/tune.py", line 252, in _ray_auto_init
    ray.init()
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/worker.py", line 1642, in init
    _global_node = ray._private.node.Node(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 336, in __init__
    self.start_ray_processes()
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 1396, in start_ray_processes
    resource_spec = self.get_resource_spec()
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 580, in get_resource_spec
    ).resolve(is_head=self.head, node_ip_address=self.node_ip_address)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/resource_spec.py", line 215, in resolve
    accelerator_manager.get_current_node_accelerator_type()
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/accelerators/nvidia_gpu.py", line 71, in get_current_node_accelerator_type
    device_name = device_name.decode("utf-8")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
import re
import os
import logging
from typing import Optional, List, Tuple

from ray._private.accelerators.accelerator import AcceleratorManager

logger = logging.getLogger(__name__)

CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
NOSET_CUDA_VISIBLE_DEVICES_ENV_VAR = "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"
NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")

class NvidiaGPUAcceleratorManager(AcceleratorManager):
    """Nvidia GPU accelerators."""
Issue Severity
High: It blocks me from completing my task.