UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte [<Ray component: tune } #45492

Closed
Aricept094 opened this issue May 22, 2024 · 18 comments
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core P1 Issue that should be fixed within a few weeks QS Quantsight triage label windows

Comments

@Aricept094

Aricept094 commented May 22, 2024

What happened + What you expected to happen

I've recently started experiencing the exact same issue as https://github.com/wandb/wandb/issues/7683#issue-2308957793, which I suspect is related to a recent NVIDIA driver update. Yesterday I updated my NVIDIA driver to version 555.85, and since then I've been encountering errors in both Ray Tune and W&B (wandb).

Initially, I encountered the error in Ray Tune, but after modifying the nvidia_gpu.py file in python3.11/site-packages/ray/_private/accelerators/ to use Latin-1 encoding instead of UTF-8, I was able to get my Ray Tune project working again. The modified code is as follows:

try:
    pynvml.nvmlInit()
except pynvml.NVMLError:
    return None  # pynvml init failed
device_count = pynvml.nvmlDeviceGetCount()
cuda_device_type = None
if device_count > 0:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    device_name = pynvml.nvmlDeviceGetName(handle)
    if isinstance(device_name, bytes):
        device_name = device_name.decode("latin1")  # Changed from "utf-8" to "latin1"
    cuda_device_type = (
        NvidiaGPUAcceleratorManager._gpu_name_to_accelerator_type(device_name)
    )
pynvml.nvmlShutdown()
return cuda_device_type

However, I'm still experiencing issues with W&B, where I'm receiving errors and my metrics are not being monitored as intended.
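
For readers wondering why the Latin-1 change gets past the error: Latin-1 maps every byte value to a character, so decoding with it never raises, whereas 0xf8 can never start a UTF-8 sequence. The following is a minimal sketch; the byte string is made up for illustration, and only the leading 0xf8 byte comes from the traceback.

```python
# Hypothetical payload: only the leading 0xf8 byte is taken from the traceback,
# the remaining bytes are invented for illustration.
raw = bytes([0xF8]) + b"VIDIA"

# UTF-8 rejects 0xf8 as the first byte of a sequence.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# Latin-1 accepts any byte, so this always succeeds, whether or not
# the resulting text is meaningful.
latin1_text = raw.decode("latin-1")

print(utf8_error)
print(repr(latin1_text))
```

Note that a successful Latin-1 decode says nothing about whether the bytes were a valid device name in the first place; it only avoids the exception.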

Versions / Dependencies

OS: Windows (WSL2)

Python version: 3.11.8

ray : 2.22.0

nvidia driver version ( installed on windows ) : 555.85

Reproduction script

File "/home/aricept094/MY_Scripts/Ray7_pl_augment_new.py", line 1629, in <module>
analysis = tune.run(
^^^^^^^^^
File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/tune/tune.py", line 527, in run
_ray_auto_init(entrypoint=error_message_map["entrypoint"])
File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/tune/tune.py", line 252, in _ray_auto_init
ray.init()
File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/worker.py", line 1642, in init
_global_node = ray._private.node.Node(
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 336, in __init__
self.start_ray_processes()
File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 1396, in start_ray_processes
resource_spec = self.get_resource_spec()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/node.py", line 580, in get_resource_spec
).resolve(is_head=self.head, node_ip_address=self.node_ip_address)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/resource_spec.py", line 215, in resolve
accelerator_manager.get_current_node_accelerator_type()
File "/home/aricept094/miniconda3/envs/ray4/lib/python3.11/site-packages/ray/_private/accelerators/nvidia_gpu.py", line 71, in get_current_node_accelerator_type
device_name = device_name.decode("utf-8")
^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

import re
import os
import logging
from typing import Optional, List, Tuple

from ray._private.accelerators.accelerator import AcceleratorManager

logger = logging.getLogger(__name__)

CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
NOSET_CUDA_VISIBLE_DEVICES_ENV_VAR = "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"

NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")

class NvidiaGPUAcceleratorManager(AcceleratorManager):
    """Nvidia GPU accelerators."""

    @staticmethod
    def get_resource_name() -> str:
        return "GPU"

    @staticmethod
    def get_visible_accelerator_ids_env_var() -> str:
        return CUDA_VISIBLE_DEVICES_ENV_VAR

    @staticmethod
    def get_current_process_visible_accelerator_ids() -> Optional[List[str]]:
        cuda_visible_devices = os.environ.get(
            NvidiaGPUAcceleratorManager.get_visible_accelerator_ids_env_var(), None
        )
        if cuda_visible_devices is None:
            return None

        if cuda_visible_devices == "":
            return []

        if cuda_visible_devices == "NoDevFiles":
            return []

        return list(cuda_visible_devices.split(","))

    @staticmethod
    def get_current_node_num_accelerators() -> int:
        import ray._private.thirdparty.pynvml as pynvml

        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError:
            return 0  # pynvml init failed
        device_count = pynvml.nvmlDeviceGetCount()
        pynvml.nvmlShutdown()
        return device_count

    @staticmethod
    def get_current_node_accelerator_type() -> Optional[str]:
        import ray._private.thirdparty.pynvml as pynvml

        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError:
            return None  # pynvml init failed
        device_count = pynvml.nvmlDeviceGetCount()
        cuda_device_type = None
        if device_count > 0:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            device_name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(device_name, bytes):
                device_name = device_name.decode("utf-8")
            cuda_device_type = (
                NvidiaGPUAcceleratorManager._gpu_name_to_accelerator_type(device_name)
            )
        pynvml.nvmlShutdown()
        return cuda_device_type

    @staticmethod
    def _gpu_name_to_accelerator_type(name):
        if name is None:
            return None
        match = NVIDIA_GPU_NAME_PATTERN.match(name)
        return match.group(1) if match else None

    @staticmethod
    def validate_resource_request_quantity(
        quantity: float,
    ) -> Tuple[bool, Optional[str]]:
        return (True, None)

    @staticmethod
    def set_current_process_visible_accelerator_ids(
        visible_cuda_devices: List[str],
    ) -> None:
        if os.environ.get(NOSET_CUDA_VISIBLE_DEVICES_ENV_VAR):
            return

        os.environ[
            NvidiaGPUAcceleratorManager.get_visible_accelerator_ids_env_var()
        ] = ",".join([str(i) for i in visible_cuda_devices])

    @staticmethod
    def get_ec2_instance_num_accelerators(
        instance_type: str, instances: dict
    ) -> Optional[int]:
        if instance_type not in instances:
            return None

        gpus = instances[instance_type].get("GpuInfo", {}).get("Gpus")
        if gpus is not None:
            # TODO(ameer): currently we support one gpu type per node.
            assert len(gpus) == 1
            return gpus[0]["Count"]
        return None

    @staticmethod
    def get_ec2_instance_accelerator_type(
        instance_type: str, instances: dict
    ) -> Optional[str]:
        if instance_type not in instances:
            return None

        gpus = instances[instance_type].get("GpuInfo", {}).get("Gpus")
        if gpus is not None:
            # TODO(ameer): currently we support one gpu type per node.
            assert len(gpus) == 1
            return gpus[0]["Name"]
        return None

Issue Severity

High: It blocks me from completing my task.

@Aricept094 Aricept094 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 22, 2024
@anyscalesam anyscalesam added windows core Issues that should be addressed in Ray Core labels May 24, 2024
@Yatagarasu50469

Experiencing the same issue in WSL2 instance after updating to the latest NVIDIA driver 555.

Windows: 11
Ubuntu (WSL2): 22.04
Python: 3.10.12
Ray 2.20.0

import ray
ray.init()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 1642, in init
    _global_node = ray._private.node.Node(
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/node.py", line 336, in __init__
    self.start_ray_processes()
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/node.py", line 1396, in start_ray_processes
    resource_spec = self.get_resource_spec()
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/node.py", line 571, in get_resource_spec
    self._resource_spec = ResourceSpec(
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/resource_spec.py", line 215, in resolve
    accelerator_manager.get_current_node_accelerator_type()
  File "/home/USER/.local/lib/python3.10/site-packages/ray/_private/accelerators/nvidia_gpu.py", line 71, in get_current_node_accelerator_type
    device_name = device_name.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

@Aricept094
Author

A temporary fix is to roll back the NVIDIA driver to 552.44; that worked for me.

@Yatagarasu50469

Indeed, did that right after posting, as a temporary workaround; not really a 'good' long-term solution moving forward though.

@Aricept094
Author

Indeed, did that right after posting, as a temporary workaround; not really a 'good' long-term solution moving forward though.

Agreed. The workaround in my original post was fairly easy, but it only fixes Ray; wandb, for example, was still affected and didn't work as intended.

@jjyao jjyao added the QS Quantsight triage label label May 28, 2024
@rynewang
Contributor

Dask has same issue: dask/distributed#5768

@mattip
Contributor

mattip commented May 29, 2024

The code that produces device_name comes from pynvml; see gpuopenanalytics/pynvml#53. Can one of the people with the error try this smaller reproducer and report the results? I think this only happens on machines with non-English user interfaces.

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print( [x for x in pynvml.nvmlDeviceGetName(handle)])

Edit: add nvmlInit, improve output to show characters

@mattip
Contributor

mattip commented May 29, 2024

It seems to be a problem with the new driver version, but only on WSL2. I commented on the upstream pynvml issue gpuopenanalytics/pynvml#53.

@mattip
Contributor

mattip commented May 30, 2024

It seems this is a known problem and will be resolved in a new driver release (from gpuopenanalytics/pynvml#53)

This issue has been escalated to the NVML team and the fix has been merged into the upcoming r560 driver branch. I do not believe there are plans to re-release the short-lived r555 branch.

@jjyao jjyao added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jun 3, 2024
@MKLepium

MKLepium commented Jun 9, 2024

Here is my output. Maybe it helps.

>>> import pynvml
>>> pynvml.nvmlInit()
>>> handle = pynvml.nvmlDeviceGetHandleByIndex(0)
>>> print( [x for x in pynvml.nvmlDeviceGetName(handle)])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mk/.local/lib/python3.10/site-packages/pynvml.py", line 1921, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

@wxie2013

the temporary fix would be rolling back to Nvidia driver to 552.44 , worked for me .

I'm having the same issue when using ray.data.read_csv. Could you please provide instructions on how to roll back to 552.44? Thanks in advance.

@wxie2013

It turns out you just download 552.44 from https://www.nvidia.com/download/driverResults.aspx/224484/en-us/ and install it in Windows.

@tecworks-dev

Edit site-packages/ray/_private/accelerators/nvidia_gpu.py.
Find:
device_name = device_name.decode("utf-8")
Change to:
try:
    device_name = device_name.decode('utf-16be')
except UnicodeDecodeError as e:
    device_name = device_name.decode("utf-8")
This seems to work.

@MKLepium

Just updated to 560.70. Issue seems to be resolved!

@mattip
Contributor

mattip commented Jul 19, 2024

Thanks for the update. Closing. Please open a new issue if the problem is not solved with information from the newer drivers.

@vladjohnson

Can we please integrate this fix into Ray? Thanks

device_name = device_name.decode("utf-8")

# Change to
try:
    device_name = device_name.decode('utf-16be')
except UnicodeDecodeError as e:
    device_name = device_name.decode("utf-8")

@mattip
Contributor

mattip commented Aug 5, 2024

I thought that only leads to problems later on. Maybe we should error out with a message about updating drivers?
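
To illustrate the kind of later problem this could cause, here is a rough sketch (not Ray code) that applies the proposed UTF-16BE-first decode to a healthy ASCII device name. The name `b"Tesla T4"` is picked only because it has an even number of bytes, and `decode_with_utf16_fallback` is a hypothetical helper wrapping the proposed change; the regex is the one from nvidia_gpu.py quoted earlier in the thread.

```python
import re

# Pattern as quoted from ray/_private/accelerators/nvidia_gpu.py in this thread.
NVIDIA_GPU_NAME_PATTERN = re.compile(r"\w+\s+([A-Z0-9]+)")


def decode_with_utf16_fallback(raw: bytes) -> str:
    """Hypothetical wrapper around the proposed change: UTF-16BE first, then UTF-8."""
    try:
        return raw.decode("utf-16be")
    except UnicodeDecodeError:
        return raw.decode("utf-8")


# Illustrative, perfectly valid device name with an even byte length.
healthy_name = b"Tesla T4"

# UTF-16BE pairs the ASCII bytes up into unrelated code points without raising,
# so the UTF-8 fallback is never reached.
decoded = decode_with_utf16_fallback(healthy_name)
match = NVIDIA_GPU_NAME_PATTERN.match(decoded)
accelerator_type = match.group(1) if match else None

print(repr(decoded), accelerator_type)
```

Under these assumptions the decode succeeds but yields text that no longer matches the name pattern, so the accelerator type would silently come back as None for a name that plain UTF-8 decoding resolves to "T4". Names with an odd byte length would still fall through to UTF-8, which makes the behavior depend on the length of the name.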

@vladjohnson

It would be helpful to include information about drivers to provide some context; Ray definitely should be more developer-friendly.

My concern with the drivers is that we are dependent on NVIDIA, and having a wider range of supported encodings is a good idea, unless it leads to major problems. Having tested Ray with the current driver version, everything seems stable.

Thanks

@mattip
Contributor

mattip commented Aug 8, 2024

Ray definitely should be more developer-friendly.

Ray sits near the bottom of a large stack of software. It tries to be as friendly as possible, but pulls in many pieces from third-party vendors: Python packages, OS components, user code. Not everything is under Ray's control.

and having a wider range of supported encodings

The problem was that a particular version of the NVIDIA drivers on WSL was buggy. The immediate victim was an API call to determine the device name, but simply changing the encoding would only allow the unsuspecting user to get to the next bug, which could have crashed the machine or led to wrong answers. I think failing fast is the best choice here: we don't know whether allowing unconventional encodings would have led to major problems or not. Perhaps the error message could be improved, but at the end of the day Ray depends on third-party software API calls working properly.
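
As one possible shape for an improved error message, the sketch below keeps the fail-fast behavior but wraps the decode error with driver context. `decode_gpu_name` is a hypothetical helper, not an existing Ray or pynvml function, and the driver versions in the message are only those reported in this thread.

```python
def decode_gpu_name(raw: bytes) -> str:
    """Hypothetical helper: decode an NVML device name, failing fast with context."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Still an error, but one that points at the likely cause instead of
        # surfacing a bare codec failure.
        raise RuntimeError(
            f"NVML returned a GPU name that is not valid UTF-8 ({raw!r}). "
            "This was reported with the NVIDIA r555 driver branch on WSL2; "
            "users in the issue thread reported that drivers 552.44 and 560.70 work."
        ) from exc


print(decode_gpu_name(b"Tesla T4"))
```

Chaining with `from exc` keeps the original UnicodeDecodeError available in the traceback for debugging.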

10 participants