UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte [<Ray component: tune } #45492

Aricept094 · 2024-05-22T08:04:01Z

What happened + What you expected to happen

I've recently started experiencing the exact same issue as https://github.com/wandb/wandb/issues/7683#issue-2308957793, which I suspect is related to a recent NVIDIA driver update. Yesterday I updated my NVIDIA driver to version 555.85, and since then I've been encountering errors in both Ray Tune and wandb.

Initially, I encountered the error in Ray Tune, but after modifying the nvidia_gpu.py file in python3.11/site-packages/ray/_private/accelerators/ to use Latin-1 encoding instead of UTF-8, I was able to get my Ray Tune project working again. The modified code is as follows:

try:
    pynvml.nvmlInit()
except pynvml.NVMLError:
    return None  # pynvml init failed
device_count = pynvml.nvmlDeviceGetCount()
cuda_device_type = None
if device_count > 0:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    device_name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(device_name, bytes):
        device_name = device_name.decode("latin1")  # Changed from "utf-8" to "latin1"
    cuda_device_type = (
        NvidiaGPUAcceleratorManager._gpu_name_to_accelerator_type(device_name)
    )
pynvml.nvmlShutdown()
return cuda_device_type
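
For context on why this swap makes the error disappear: byte 0xf8 can never start a valid UTF-8 sequence, whereas Latin-1 assigns a character to every byte value, so decoding with it cannot fail. A minimal sketch, using a made-up byte string rather than the actual bytes the driver returned (those were not posted in this issue):

```python
# Hypothetical bytes standing in for a corrupted device name; the real
# bytes returned by the 555.85 driver are not shown in this issue.
raw = b"\xf8\x01\x02"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xf8 is not a valid UTF-8 start byte, so strict decoding raises.
    utf8_ok = False

# Latin-1 maps each byte 0x00-0xff to the code point of the same value,
# so this never raises, but the result is not necessarily a real GPU name.
latin1_text = raw.decode("latin1")

print(utf8_ok)            # False
print(repr(latin1_text))  # 'ø\x01\x02'
```

In other words, the Latin-1 change suppresses the exception; it does not by itself recover the device name.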

However, I'm still experiencing issues with W&B, where I'm receiving errors and my metrics are not being monitored as intended.

Versions / Dependencies

OS: Windows (WSL2)

Python version: 3.11.8

Ray: 2.22.0

NVIDIA driver version (installed on Windows): 555.85

Reproduction script

  File "/home/aricept094/MY_Scripts/Ray7_pl_augment_new.py", line 1629, in <module>
    analysis = tune.run(
               ^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/tune/tune.py", line 527, in run
    _ray_auto_init(entrypoint=error_message_map["entrypoint"])
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/tune/tune.py", line 252, in _ray_auto_init
    ray.init()
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/worker.py", line 1642, in init
    _global_node = ray._private.node.Node(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 336, in __init__
    self.start_ray_processes()
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 1396, in start_ray_processes
    resource_spec = self.get_resource_spec()
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 580, in get_resource_spec
    ).resolve(is_head=self.head, node_ip_address=self.node_ip_address)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/resource_spec.py", line 215, in resolve
    accelerator_manager.get_current_node_accelerator_type()
  File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/accelerators/nvidia_gpu.py", line 71, in get_current_node_accelerator_type
    device_name = device_name.decode("utf-8")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

import re
import os
import logging
from typing import Optional, List, Tuple

from ray._private.accelerators.accelerator import AcceleratorManager

logger = logging.getLogger(__name__)

CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
NOSET_CUDA_VISIBLE_DEVICES_ENV_VAR = "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"

NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")

class NvidiaGPUAcceleratorManager(AcceleratorManager):
    """Nvidia GPU accelerators."""

    @staticmethod
    def get_resource_name() -> str:
        return "GPU"

    @staticmethod
    def get_visible_accelerator_ids_env_var() -> str:
        return CUDA_VISIBLE_DEVICES_ENV_VAR

    @staticmethod
    def get_current_process_visible_accelerator_ids() -> Optional[List[str]]:
        cuda_visible_devices = os.environ.get(
            NvidiaGPUAcceleratorManager.get_visible_accelerator_ids_env_var(), None
        )
        if cuda_visible_devices is None:
            return None

        if cuda_visible_devices == "":
            return []

        if cuda_visible_devices == "NoDevFiles":
            return []

        return list(cuda_visible_devices.split(","))

    @staticmethod
    def get_current_node_num_accelerators() -> int:
        import ray._private.thirdparty.pynvml as pynvml

        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError:
            return 0  # pynvml init failed
        device_count = pynvml.nvmlDeviceGetCount()
        pynvml.nvmlShutdown()
        return device_count

    @staticmethod
    def get_current_node_accelerator_type() -> Optional[str]:
        import ray._private.thirdparty.pynvml as pynvml

        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError:
            return None  # pynvml init failed
        device_count = pynvml.nvmlDeviceGetCount()
        cuda_device_type = None
        if device_count > 0:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            device_name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(device_name, bytes):
                device_name = device_name.decode("utf-8")
            cuda_device_type = (
                NvidiaGPUAcceleratorManager._gpu_name_to_accelerator_type(device_name)
            )
        pynvml.nvmlShutdown()
        return cuda_device_type

    @staticmethod
    def _gpu_name_to_accelerator_type(name):
        if name is None:
            return None
        match = NVIDIA_GPU_NAME_PATTERN.match(name)
        return match.group(1) if match else None

    @staticmethod
    def validate_resource_request_quantity(
        quantity: float,
    ) -> Tuple[bool, Optional[str]]:
        return (True, None)

    @staticmethod
    def set_current_process_visible_accelerator_ids(
        visible_cuda_devices: List[str],
    ) -> None:
        if os.environ.get(NOSET_CUDA_VISIBLE_DEVICES_ENV_VAR):
            return

        os.environ[
            NvidiaGPUAcceleratorManager.get_visible_accelerator_ids_env_var()
        ] = ",".join([str(i) for i in visible_cuda_devices])

    @staticmethod
    def get_ec2_instance_num_accelerators(
        instance_type: str, instances: dict
    ) -> Optional[int]:
        if instance_type not in instances:
            return None

        gpus = instances[instance_type].get("GpuInfo", {}).get("Gpus")
        if gpus is not None:
            # TODO(ameer): currently we support one gpu type per node.
            assert len(gpus) == 1
            return gpus[0]["Count"]
        return None

    @staticmethod
    def get_ec2_instance_accelerator_type(
        instance_type: str, instances: dict
    ) -> Optional[str]:
        if instance_type not in instances:
            return None

        gpus = instances[instance_type].get("GpuInfo", {}).get("Gpus")
        if gpus is not None:
            # TODO(ameer): currently we support one gpu type per node.
            assert len(gpus) == 1
            return gpus[0]["Name"]
        return None
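
To illustrate what the decoded name is used for, here is a small standalone sketch of the name-to-type mapping from the listing above. The function is a copy made for illustration, and the device names are example strings, not values taken from this issue:

```python
import re

# Same pattern as in the nvidia_gpu.py listing above.
NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")


def gpu_name_to_accelerator_type(name):
    """Standalone copy of _gpu_name_to_accelerator_type for illustration."""
    if name is None:
        return None
    match = NVIDIA_GPU_NAME_PATTERN.match(name)
    return match.group(1) if match else None


# Example device names chosen for illustration (not taken from this issue).
print(gpu_name_to_accelerator_type("Tesla V100-SXM2-16GB"))   # V100
print(gpu_name_to_accelerator_type("NVIDIA A100-SXM4-40GB"))  # A100
# A name that does not look like "<word> <MODEL>" yields None.
print(gpu_name_to_accelerator_type("\u00f8\x01\x02"))         # None
```

So a device name that decodes to something unexpected would most likely end up as an accelerator type of None rather than a usable label.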

Issue Severity

High: It blocks me from completing my task.


Yatagarasu50469 · 2024-05-27T07:02:37Z

Experiencing the same issue in WSL2 instance after updating to the latest NVIDIA driver 555.

Windows: 11
Ubuntu (WSL2): 22.04
Python: 3.10.12
Ray 2.20.0

import ray
ray.init()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 1642, in init
    _global_node = ray._private.node.Node(
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/node.py", line 336, in __init__
    self.start_ray_processes()
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/node.py", line 1396, in start_ray_processes
    resource_spec = self.get_resource_spec()
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/node.py", line 571, in get_resource_spec
    self._resource_spec = ResourceSpec(
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/resource_spec.py", line 215, in resolve
    accelerator_manager.get_current_node_accelerator_type()
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/accelerators/nvidia_gpu.py", line 71, in get_current_node_accelerator_type
    device_name = device_name.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Aricept094 · 2024-05-27T08:07:49Z

A temporary fix is to roll back the NVIDIA driver to 552.44; that worked for me.

Yatagarasu50469 · 2024-05-27T19:20:49Z

Indeed, did that right after posting, as a temporary workaround; not really a 'good' long-term solution moving forward though.

Aricept094 · 2024-05-28T05:45:14Z

Indeed, did that right after posting, as a temporary workaround; not really a 'good' long-term solution moving forward though.

Agreed. My workaround in the original post was fairly easy, but it only fixes Ray; wandb, for example, still failed and didn't work as intended.

rynewang · 2024-05-28T21:34:38Z

Dask has the same issue: dask/distributed#5768

mattip · 2024-05-29T04:51:36Z

The code that produces device_name comes from pynvml; see gpuopenanalytics/pynvml#53. Can one of the people with the error try this smaller reproducer and report the results? I think this only happens on machines with non-English user interfaces.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print( [x for x in pynvml.nvmlDeviceGetName(handle)])

Edit: add nvmlInit, improve output to show characters
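
Since the device name may come back as either bytes or str depending on the pynvml version, a small helper can make the reproducer's output easier to compare across machines. This is only a sketch; `show_name` is a hypothetical helper, not part of pynvml, and the inputs below are made up:

```python
def show_name(name):
    """Return the raw values behind a device name as readable strings.

    Accepts bytes (shown as byte values) or str (shown as code points).
    """
    if isinstance(name, bytes):
        return [f"0x{b:02x}" for b in name]
    return [f"U+{ord(ch):04X}" for ch in name]


# Illustrative inputs; on a real machine, pass the value returned by
# pynvml.nvmlDeviceGetName(handle) instead.
print(show_name(b"\xf8A"))  # ['0xf8', '0x41']
print(show_name("A1"))      # ['U+0041', 'U+0031']
```

If the installed pynvml decodes the name internally, the `nvmlDeviceGetName` call itself may raise before the helper ever sees a value.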

mattip · 2024-05-29T15:56:45Z

It seems to be a problem with the new driver version, but only on WSL2. I commented on the upstream pynvml issue gpuopenanalytics/pynvml#53.

mattip · 2024-05-30T13:34:24Z

It seems this is a known problem and will be resolved in a new driver release (from gpuopenanalytics/pynvml#53):

This issue has been escalated to the NVML team and the fix has been merged into the upcoming r560 driver branch. I do not believe there are plans to re-release the short-lived r555 branch.
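
For anyone who wants to check their own setup against this, a rough sketch of a branch check follows. `is_r555_driver` is a hypothetical helper, and it assumes you already have the driver version as a string (for example, as printed by `nvidia-smi`):

```python
def is_r555_driver(version: str) -> bool:
    """Return True if a version string such as '555.85' is in the r555 branch."""
    major = version.strip().split(".")[0]
    return major == "555"


print(is_r555_driver("555.85"))  # True  (version reported as failing in this thread)
print(is_r555_driver("552.44"))  # False (version reported as working)
```

This only looks at the branch number; whether every r555 release is affected is not established in this thread.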

MKLepium · 2024-06-09T21:01:56Z

Here is my output. Maybe it helps.

>>> import pynvml
>>> pynvml.nvmlInit()
>>> handle = pynvml.nvmlDeviceGetHandleByIndex(0)
>>> print( [x for x in pynvml.nvmlDeviceGetName(handle)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mk/.local/lib/python3.10/site-packages/pynvml.py", line 1921, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

wxie2013 · 2024-06-14T20:40:52Z

the temporary fix would be rolling back to Nvidia driver to 552.44 , worked for me .

I'm having the same issue when using ray.data.read_csv. Can you please provide instructions on how to roll back to 552.44? Thanks in advance.

wxie2013 · 2024-06-14T21:05:19Z

It turns out you just download 552.44 from https://www.nvidia.com/download/driverResults.aspx/224484/en-us/ and install it in Windows.

tecworks-dev · 2024-06-24T19:41:08Z

Edit site-packages/ray/_private/accelerators/nvidia_gpu.py, find

device_name = device_name.decode("utf-8")

and change it to

try:
    device_name = device_name.decode('utf-16be')
except UnicodeDecodeError as e:
    device_name = device_name.decode("utf-8")

Seems to work.
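
One caveat with trying UTF-16BE first: almost any even-length byte string decodes as UTF-16BE without error, so a perfectly valid ASCII name can be turned into unrelated characters before the UTF-8 fallback is ever reached. A minimal sketch, using example device names that are not taken from this issue:

```python
import re

# Same pattern Ray uses to extract the accelerator type from the name.
NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")


def decode_utf16_first(raw: bytes) -> str:
    """The fallback order suggested above: UTF-16BE first, then UTF-8."""
    try:
        return raw.decode("utf-16be")
    except UnicodeDecodeError:
        return raw.decode("utf-8")


# Illustrative, well-formed ASCII names (not taken from this issue).
odd_len = b"NVIDIA A100-SXM4-40GB"   # 21 bytes
even_len = b"Tesla V100-SXM2-16GB"   # 20 bytes

# Odd length: UTF-16BE raises (truncated data), so UTF-8 is used.
print(decode_utf16_first(odd_len))   # NVIDIA A100-SXM4-40GB

# Even length: UTF-16BE "succeeds" and pairs of ASCII bytes become
# unrelated characters, so the UTF-8 fallback is never reached.
garbled = decode_utf16_first(even_len)
print(garbled == even_len.decode("utf-8"))             # False
print(NVIDIA_GPU_NAME_PATTERN.match(garbled) is None)  # True
```

Under these assumptions the change avoids the exception, but the resulting name (and therefore the detected accelerator type) may be meaningless, even on a healthy driver.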

MKLepium · 2024-07-19T01:32:27Z

Just updated to 560.70. Issue seems to be resolved!

mattip · 2024-07-19T05:13:25Z

Thanks for the update. Closing. Please open a new issue if the problem is not solved with information from the newer drivers.

vladjohnson · 2024-08-05T18:34:03Z

Can we please integrate this fix into Ray? Thanks

device_name = device_name.decode("utf-8")

# Change to
try:
    device_name = device_name.decode('utf-16be')
except UnicodeDecodeError as e:
    device_name = device_name.decode("utf-8")

mattip · 2024-08-05T20:44:41Z

I thought that only leads to problems later on. Maybe we should error out with a message about updating drivers?

vladjohnson · 2024-08-07T16:48:54Z

It would be helpful to include information about drivers to provide some context; Ray definitely should be more developer-friendly.

My concern with the drivers is that we are dependent on NVIDIA, and having a wider range of supported encodings is a good idea, unless it leads to major problems. Having tested Ray with the current driver version, everything seems stable.

Thanks

mattip · 2024-08-08T04:38:05Z

Ray definitely should be more developer-friendly.

Ray sits near the bottom of a large stack of software. It tries to be as friendly as possible, but pulls in many pieces from third-party vendors: Python packages, OS components, user code. Not everything is under Ray's control.

and having a wider range of supported encodings

The problem was that a particular version of the NVIDIA drivers on WSL was buggy. The immediate victim was an API call to determine the device name, but simply changing the encoding would only allow the unsuspecting user to get to the next bug, which could have crashed the machine or led to wrong answers. I think failing fast is the best choice here; we don't know whether allowing unconventional encodings would have led to major problems or not. Perhaps the error message could be improved, but at the end of the day Ray depends on third-party software API calls working properly.
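
As a rough illustration of what "fail fast with a better message" could look like, here is a sketch. `decode_device_name` is a hypothetical helper and not Ray's actual code; the wording of the message and the byte strings are assumptions made for the example:

```python
def decode_device_name(raw) -> str:
    """Decode an NVML device name strictly, failing with an actionable message.

    Hypothetical helper for illustration; this is not Ray's actual code.
    """
    if isinstance(raw, str):
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise RuntimeError(
            f"NVML returned a device name that is not valid UTF-8 ({raw!r}). "
            "This has been seen with NVIDIA driver 555 on WSL2; "
            "try a different driver version (552.44 and 560.70 were "
            "reported to work in this thread)."
        ) from exc


print(decode_device_name(b"Tesla V100-SXM2-16GB"))  # Tesla V100-SXM2-16GB

try:
    decode_device_name(b"\xf8\x01")  # made-up invalid bytes
except RuntimeError as err:
    print(type(err.__cause__).__name__)  # UnicodeDecodeError
```

The decode stays strict, so a broken driver still stops startup; only the message changes, and the original UnicodeDecodeError is kept as the cause.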

Aricept094 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 22, 2024

anyscalesam added windows core Issues that should be addressed in Ray Core labels May 24, 2024

jjyao added the QS Quantsight triage label label May 28, 2024

jjyao added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 3, 2024

T-Atlas mentioned this issue Jun 7, 2024

Issues about Ray under WSL THUDM/GLM-4#110

Closed

mattip closed this as completed Jul 19, 2024

Aricept094 mentioned this issue Jul 29, 2024

[CLI]: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte wandb/wandb#7683

Closed
