[Core] Add Support for Furiosa AI NPU #63035

Merged
edoakes merged 22 commits into ray-project:master from nadongjun:core/add-furiosa-accelerator-manager
May 27, 2026

Conversation

@nadongjun
Contributor

Description

As a user of Furiosa AI's RNGD NPUs on Ray, I've found that management within Ray remains manual and error-prone, despite existing support in vLLM and Hugging Face.

This PR introduces first-class support for Furiosa AI NPUs (specifically the RNGD family: RNGD-S, RNGD, RNGD-Max, RNGD+) into Ray's accelerator management framework. This integration follows the established patterns used for other NPUs.

Currently, Furiosa RNGD is gaining traction in the LLM ecosystem with support in vLLM, Hugging Face Optimum (optimum-furiosa), and Kubernetes. However, Ray users running production inference workloads still face manual overhead:

  • Manual Resource Tagging: Users must pass --resources='{"FURIOSA": N}' to ray start as Ray lacks auto-detection.

  • Manual Device Isolation: Users have to manually manage FURIOSA_VISIBLE_DEVICES to prevent resource contention between actors.

  • Lack of SKU Awareness: There is no native way to target specific chip architectures (e.g., RNGD-Max vs. RNGD-S) using accelerator_type.

Usage Examples

import ray

# NPUs are auto-detected.
ray.init()

# Requesting a specific NPU with architecture pinning
@ray.remote(resources={"FURIOSA": 1}, accelerator_type="FURIOSA_RNGD")
class InferenceWorker:
    def __init__(self):
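        # Ray automatically sets the visible-devices variable for this worker
        # (it appears as FURIOSA_DEVICES in the hardware runs later in this thread).
        # furiosa-llm still needs the value passed explicitly via devices=...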
        from furiosa.runtime import session
        self.sess = session.create("model.dfg")

    def predict(self, x):
        return self.sess.run(x)

Related issues

Additional information

Product: Furiosa AI RNGD

SDK: furiosa-smi-py

Integrations: optimum-furiosa

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun nadongjun requested a review from a team as a code owner April 30, 2026 09:01
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds support for Furiosa AI NPUs by implementing the FuriosaAcceleratorManager for device detection and resource management. The changes include logic for architecture identification, environment variable configuration for visible devices, and comprehensive unit tests with mocks for the Furiosa SMI SDK. Review feedback identifies a bug in architecture detection that could return an incorrect string, suggests implementing get_current_node_accelerator_labels for improved dashboard visibility, and recommends updating typing imports.

Comment thread python/ray/_private/accelerators/furiosa.py
Comment thread python/ray/_private/accelerators/furiosa.py
Comment thread python/ray/_private/accelerators/furiosa.py
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Comment thread python/ray/_private/accelerators/furiosa.py
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@ray-gardener ray-gardener Bot added the core and community-contribution labels Apr 30, 2026
nadongjun added 4 commits May 1, 2026 19:30
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun
Contributor Author

@rueian @ryanaoleary @edoakes Gentle ping. Any thoughts on this?

Contributor

@elpis-furiosa elpis-furiosa left a comment

Thank you for working on Ray support for Furiosa RNGD. I stumbled upon this PR while researching ways to support new accelerators. The code looks fine at a glance, but I'll test it on actual RNGD hardware just to make sure.

Also, it would be great to have an example, similar to those for other accelerators, in the "Using accelerators in Tasks and Actors" section of the documentation. I can take a look at adding one as a follow-up PR.

Comment thread python/ray/_private/accelerators/furiosa.py Outdated
Comment thread python/ray/_private/accelerators/furiosa.py Outdated
Comment thread python/ray/tests/accelerators/test_furiosa.py
nadongjun and others added 2 commits May 9, 2026 09:08
Co-authored-by: Sukchul Cho <sukchul.cho@furiosa.ai>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Co-authored-by: Sukchul Cho <sukchul.cho@furiosa.ai>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Comment thread python/ray/tests/accelerators/mock_furiosa_smi_py.py
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit cae315e. Configure here.

Comment thread python/ray/tests/accelerators/test_furiosa.py
@nadongjun
Contributor Author

@elpis-furiosa Thanks for the review. I've applied all the suggested changes.

Most of the accelerators Ray supports don't seem to have real hardware tests, so verifying RNGD on actual hardware would be a meaningful contribution for end users.

If you'd like, I'm happy to hand off not just the follow-up PR but ownership of this PR as well (feel free to open a fresh PR if that's cleaner). Long-term maintenance items like SDK changes and RNGD family naming would naturally fit better on the Furiosa AI side.

If you're interested, please leave a comment here or DM me on the Ray Slack (handle: Dongjun Na).

nadongjun added 2 commits May 9, 2026 09:37
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@elpis-furiosa
Contributor

Thanks for your offer, @nadongjun. Looking at the scope of this PR, you've already done most of the implementation work, so rather than taking over ownership, we'd prefer to focus on providing hardware testing support on actual RNGD devices. Any additional contributions from our side will follow as separate PRs.

@Yicheng-Lu-llll Yicheng-Lu-llll self-assigned this May 19, 2026
@Yicheng-Lu-llll
Member

Thank you so much for the contribution, @nadongjun! I'll take a look by tmrw.

Member

@Yicheng-Lu-llll Yicheng-Lu-llll left a comment

Thank you for the PR! left some nits

token = token.strip()
if token.startswith(_FURIOSA_DEVICE_PREFIX):
    token = token[len(_FURIOSA_DEVICE_PREFIX) :]
# ``furiosa-llm`` allows ``npu:0:0-3`` to address a core range; we keep
Member

Just wanted to confirm my understanding, this npu:0:0-3 format only shows up and gets used at ray node startup, right? So ray can safely treat it as a full device. And we don't really support core level scheduling.

Contributor Author

Yes, that's the case. The :cores part only appears when users set FURIOSA_DEVICES themselves, so the same value can be passed straight through to furiosa-llm --devices.

Ray strips the suffix before scheduling and only operates at the device level, so core-level scheduling isn't supported.
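
The stripping behaviour described in this reply can be sketched as below. This is a minimal illustration and not the code in furiosa.py: the function names `to_device_id` and `visible_device_ids` are hypothetical, the prefix value `"npu:"` is assumed from the values quoted in this thread, and only the `:cores` suffix mentioned here is handled.

```python
_FURIOSA_DEVICE_PREFIX = "npu:"  # assumed value, based on the strings quoted in this thread


def to_device_id(token: str) -> str:
    """Reduce 'npu:0' or 'npu:0:0-3' to the device-level ID '0' (hypothetical helper)."""
    token = token.strip()
    if token.startswith(_FURIOSA_DEVICE_PREFIX):
        token = token[len(_FURIOSA_DEVICE_PREFIX):]
    # Drop an optional ':cores' suffix; scheduling is per device, not per core.
    return token.split(":", 1)[0]


def visible_device_ids(env_value: str) -> list:
    """Split a comma-separated value and de-duplicate IDs while keeping order."""
    ids = []
    for tok in env_value.split(","):
        if tok.strip():
            dev = to_device_id(tok)
            if dev not in ids:
                ids.append(dev)
    return ids


print(visible_device_ids("npu:0:0-3, npu:1"))
```

Two core ranges on the same chip collapse to a single device ID in this sketch, which matches the statement that core-level scheduling isn't supported.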

Member

But in this case, for a detailed example:

  1. A user sets FURIOSA_DEVICES=npu:0:0-3 at node startup.
  2. Ray would only see npu:0.
  3. So, when we start a Ray worker on this node, Ray will set the env var to FURIOSA_DEVICES=npu:0 (via set_current_process_visible_accelerator_ids).
  4. The Furiosa SDK then reads npu:0 instead of npu:0:0-3.

For reference, NVIDIA solves the same subdevice partitioning problem with MIG by giving each instance its own id and listing them flat: CUDA_VISIBLE_DEVICES=MIG-GPU-abc-1,MIG-GPU-abc-2 and ray treats each MIG instance as a whole device.

I'm thinking we either take a similar approach or just add a doc note asking users not to use values like npu:0:0-3.
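
The four steps above can be reproduced with a small stand-alone sketch. The two helper names are hypothetical stand-ins for the detection step and for set_current_process_visible_accelerator_ids; they only mimic the behaviour described in this thread and are not Ray's implementation.

```python
def detect_ids(env_value: str) -> list:
    """Mimic node startup: keep only the device index of each entry."""
    ids = []
    for tok in env_value.split(","):
        parts = tok.strip().split(":")  # e.g. ['npu', '0', '0-3']
        if len(parts) >= 2 and parts[0] == "npu":
            ids.append(parts[1])
    return ids


def env_for_worker(ids: list) -> str:
    """Mimic the value written back for a worker process."""
    return ",".join("npu:{}".format(i) for i in ids)


original = "npu:0:0-3"
round_tripped = env_for_worker(detect_ids(original))
# The core range does not survive the round trip.
print(original, "->", round_tripped)
```

A MIG-style design would instead keep each partition as its own opaque ID, so the string written back to the worker would equal the string the user supplied.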

Contributor

Thanks for the explanation. It would be better if FURIOSA_DEVICES could be partitioned, as is the case for CUDA_VISIBLE_DEVICES. To elaborate, Furiosa RNGD has 8 PEs (Processing Elements) on one PCIe card, and they are denoted as npu:{chip_id}:{pe_id}. Adjacent PEs can be fused to work together.

Can this issue be addressed in a different PR? I'll make a follow-up PR for this issue.

Comment thread doc/source/ray-core/scheduling/accelerators.rst Outdated
Comment thread python/ray/_private/accelerators/furiosa.py
Comment thread python/ray/_private/accelerators/furiosa.py Outdated
Comment thread python/ray/_private/accelerators/furiosa.py Outdated
nadongjun added 2 commits May 20, 2026 12:34
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Member

@Yicheng-Lu-llll Yicheng-Lu-llll left a comment

Thanks for the update! The only concern I have now is: #63035 (comment). Otherwise LGTM.

And @elpis-furiosa would you mind helping with testing this e2e?

Comment thread python/ray/tests/accelerators/test_furiosa.py Outdated
Comment thread python/ray/tests/accelerators/test_furiosa.py Outdated
Comment thread doc/source/ray-core/scheduling/accelerators.rst
Trouble only occurs if those tasks and actors
attempt to actually use accelerators that don't exist.

Using accelerators in Tasks and Actors
Member

Should we also consider adding furiosa to Sections 2 & 3?

Contributor Author

  • 7bbc9b7 adds the in-dev arch cases (RngdMax/RngdS/RngdPlus and rngd-max/rngd+) and the FURIOSA_RNGD constant. Kept the other SKUs out of accelerators.py since names may still change.
  • 338435e refactors test_get_current_process_visible_accelerator_ids to @pytest.mark.parametrize.

For the Tasks/Actors section example and the e2e testing on actual RNGD hardware, @elpis-furiosa offered to handle both earlier. Would it be OK to take care of them in a separate follow-up PR?

Contributor

For the Tasks/Actors section example and the e2e testing on actual RNGD hardware, @elpis-furiosa offered to handle both earlier. Would it be OK to take care of them in a separate follow-up PR?

I ran the following example on our hardware, which is similar to the examples of other accelerators:

(ray) ➜  ray git:(338435efad) ✗ cat ray_test.py
import os
import ray

ray.init(resources={"FURIOSA": 2})

@ray.remote(resources={"FURIOSA": 1})
class RNGDActor:
    def ping(self):
        print("RNGD IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["FURIOSA"]))
        print("FURIOSA_DEVICES: {}".format(os.environ["FURIOSA_DEVICES"]))

@ray.remote(resources={"FURIOSA": 1})
def rngd_task():
    print("RNGD IDs: {}".format(ray.get_runtime_context().get_accelerator_ids()["FURIOSA"]))
    print("FURIOSA_DEVICES: {}".format(os.environ["FURIOSA_DEVICES"]))

rngd_actor = RNGDActor.remote()
ray.get(rngd_actor.ping.remote())
# The actor uses the first RNGD so the task uses the second one.
ray.get(rngd_task.remote())
(ray) ➜  ray git:(338435efad) ✗ python3 ray_test.py
2026-05-21 15:52:42,642 INFO worker.py:2018 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(RNGDActor pid=693909) RNGD IDs: ['0']
(RNGDActor pid=693909) FURIOSA_DEVICES: npu:0
(rngd_task pid=693883) RNGD IDs: ['1']
(rngd_task pid=693883) FURIOSA_DEVICES: npu:1

@nadongjun If you like, you can add this code as an example.

Contributor Author

Thanks! I added the example with you as co-author and marked Fractional Accelerators as unsupported for now. Let me know if I should change that.

@nadongjun
Contributor Author

@Yicheng-Lu-llll The npu:0:0-3 format I used in the docstring is actually the PE fusion notation from the FuriosaRT-era notes that introduced the multi-NPU env var in furiosa-sdk 0.10.0 (docs). I had assumed furiosa-llm would parse the same syntax and carried it over to the RNGD scenario without nailing down the source, which was a mistake. Sorry about that.

Looking at the RNGD docs again, the partitioning model is completely different:

  • "By implementing Single Root I/O Virtualization (SR-IOV), the system allows a single physical chip to be partitioned into 2, 4, or 8 independent NPU instances."

So instead of software-level PE fusion from the older version, RNGD splits a single chip into 2/4/8 VFs via hardware SR-IOV. Since the current furiosa-llm source isn't public, I can't directly verify the env var parsing behavior, so let me lay out the scenarios under some assumptions for clarity.

Assumption: furiosa_smi_py.list_devices() returns entries depending on whether SR-IOV is configured:

Without SR-IOV: one entry per physical chip

>>> list_devices()
[Device(npu0)]
# len == 1

With N VFs configured: one entry per VF

>>> list_devices()
[Device(npu0vf0), Device(npu0vf1)]    # 2 VFs
# len == 2

Scenario 1: SR-IOV not configured (one RNGD chip used as a whole)

# Admin: no SR-IOV partitioning
$ ray start --head
# FuriosaAcceleratorManager detects list_devices() length = 1 → registers "FURIOSA: 1"

import ray
ray.init()

@ray.remote(resources={"FURIOSA": 1})
class LLMServer:
    def __init__(self, model_path):
        from furiosa_llm import LLM
        self.llm = LLM(model_path)
        # Ray sets FURIOSA_DEVICES="npu:0" beforehand,
        # so furiosa-llm uses only that NPU (assumed)

    def generate(self, prompt: str) -> str:
        return self.llm.generate([prompt])[0]

server_a = LLMServer.remote("/models/test")
print(ray.get(server_a.generate.remote("Test")))

# Second actor: no free NPU -> stays pending
server_b = LLMServer.remote("/models/test")

-> One actor per node. The RNGD chip is owned entirely by a single workload.

Scenario 2: SR-IOV with 2 VFs

# Admin: one RNGD chip split into 2 VFs
$ ray start --head
# list_devices() length = 2 -> registers "FURIOSA: 2"

import ray
ray.init()

@ray.remote(resources={"FURIOSA": 1})
class LLMServer:
    def __init__(self, model_path):
        from furiosa_llm import LLM
        self.llm = LLM(model_path)   # Ray sets a different VF as the env var for each actor

    def generate(self, prompt: str) -> str:
        return self.llm.generate([prompt])[0]

# Two actors are scheduled concurrently on different VFs
server_a = LLMServer.remote("/models/test")    # FURIOSA_DEVICES="npu:0" (VF 0)
server_b = LLMServer.remote("/models/test")       # FURIOSA_DEVICES="npu:1" (VF 1)

# Parallel execution on isolated VFs
results = ray.get([
    server_a.generate.remote("Test"),
    server_b.generate.remote("Test"),
])

-> Two workloads share one RNGD chip. Since the partitioning is already done at boot time by SR-IOV, the worker-side env var doesn't need to carry core info, so the round-trip concern doesn't surface.
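
Both scenarios reduce to counting whole devices. The toy allocator below is a hedged sketch of that device-level model and not Ray's scheduler: `DevicePool` and its method are hypothetical names, and the device count stands in for the length of list_devices() under the assumption stated above.

```python
from typing import Dict, List, Optional


class DevicePool:
    """Toy device-level allocator (hypothetical; not Ray's scheduler)."""

    def __init__(self, num_devices: int) -> None:
        self.free: List[str] = [str(i) for i in range(num_devices)]
        self.assigned: Dict[str, str] = {}

    def request(self, actor: str) -> Optional[str]:
        """Return the env value for the actor, or None if it must stay pending."""
        if not self.free:
            return None
        dev = self.free.pop(0)
        self.assigned[actor] = dev
        return "npu:{}".format(dev)


# Scenario 1: one entry reported, so the second actor stays pending.
pool1 = DevicePool(1)
s1 = [pool1.request("server_a"), pool1.request("server_b")]

# Scenario 2: two entries reported, so both actors get a distinct device.
pool2 = DevicePool(2)
s2 = [pool2.request("server_a"), pool2.request("server_b")]

print(s1)
print(s2)
```

Because the pool only hands out whole entries, nothing in the worker-side value needs to describe cores, which is the point made in the scenario notes.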


@elpis-furiosa, would you mind helping confirm the following three things?

  1. Enumeration unit of furiosa_smi_py.list_devices()
     • Without SR-IOV: is it one entry per physical chip?
     • With SR-IOV configured for N VFs: are N entries returned (one per VF)? Or just the PF, or both PF and VFs?
  2. Default behavior of furiosa_llm.LLM(devices=None): does it automatically honor the FURIOSA_DEVICES env var, or does the value need to be passed explicitly (e.g., devices=os.environ["FURIOSA_DEVICES"])?
  3. Legacy npu:N:cores notation: does RNGD's furiosa-llm still accept the older PE-fusion notation, or only SR-IOV VF indices (npu:N)?

If (1) matches the assumption, the current device-level model covers both scenarios as-is. If (2) matches, the user code examples stand without changes, otherwise we'd need one extra line. (3) mainly affects how I should tone down the docstring.

nadongjun and others added 3 commits May 21, 2026 10:57
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Co-authored-by: Sukchul Cho <sukchul.cho@furiosa.ai>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@elpis-furiosa
Contributor

  1. Enumeration unit of furiosa_smi_py.list_devices()
  • Without SR-IOV: is it one entry per physical chip?
  • With SR-IOV configured for N VFs: are N entries returned (one per VF)? Or just the PF, or both PF and VFs?

furiosa_smi_py.list_devices() enumerates one entry per physical chip. Since VF is an experimental feature, we'll primarily target PF in Ray for now.

  2. Default behavior of furiosa_llm.LLM(devices=None): does it automatically honor the FURIOSA_DEVICES env var, or does the value need to be passed explicitly (e.g., devices=os.environ["FURIOSA_DEVICES"])?

furiosa_llm.LLM(devices=None) discovers and allocates all available devices via furiosa-smi — it does not automatically honor FURIOSA_DEVICES. To restrict allocation to specific devices, the list must be passed explicitly (e.g., devices=os.environ["FURIOSA_DEVICES"]).
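
Given that answer, actor code has to forward the variable itself. A minimal sketch of such a helper follows. `devices_from_env` is a hypothetical name. The only behaviour taken from this thread is that Ray sets FURIOSA_DEVICES for the worker and that the value is then passed as devices=.

```python
import os
from typing import Mapping, Optional


def devices_from_env(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the FURIOSA_DEVICES value to pass as `devices=` (hypothetical helper).

    Raises instead of returning None because, per the answer above, omitting
    the argument lets the LLM allocate every available device.
    """
    env = os.environ if env is None else env
    value = env.get("FURIOSA_DEVICES", "").strip()
    if not value:
        raise RuntimeError("FURIOSA_DEVICES is not set for this worker")
    return value


# Inside an actor one would then write, for example:
#   llm = LLM(model_id, devices=devices_from_env())
print(devices_from_env({"FURIOSA_DEVICES": "npu:1"}))
```

Failing loudly is a design choice for this sketch: a silent fallback to "all devices" would defeat the per-actor isolation that Ray's assignment is meant to provide.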

  3. Legacy npu:N:cores notation: does RNGD's furiosa-llm still accept the older PE-fusion notation, or only SR-IOV VF indices (npu:N)?

furiosa-llm serve accepts both notations. For clarification, here's the relevant excerpt from furiosa-llm serve --help.

  --devices DEVICES     The devices to run the model. It can be a single device or a comma-separated list of devices. Each device can be either "npu:X" or "npu:X:Y", where X is a device index and Y is a NPU core range
                        notation (e.g. "npu:0" for whole npu 0, "npu:0:0" for core 0 of NPU 0, and "npu:0:0-3" for fused core 0-3 of npu 0). If not given, all available unoccupied devices will be used.
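
The notation in that help text can be parsed with a few lines of Python. This is a sketch based only on the excerpt above: `DeviceSpec` and `parse_devices` are hypothetical names, and furiosa-llm's own parser may accept or reject inputs differently.

```python
from typing import List, NamedTuple, Optional, Tuple


class DeviceSpec(NamedTuple):
    npu: int
    cores: Optional[Tuple[int, int]]  # inclusive (first, last); None means the whole NPU


def parse_devices(value: str) -> List[DeviceSpec]:
    """Parse 'npu:X' / 'npu:X:Y' entries from a comma-separated string."""
    specs = []
    for entry in value.split(","):
        parts = entry.strip().split(":")
        if len(parts) not in (2, 3) or parts[0] != "npu":
            raise ValueError("unsupported device entry: {!r}".format(entry))
        cores = None
        if len(parts) == 3:
            # 'Y' is either a single core ('0') or a range ('0-3').
            first, _, last = parts[2].partition("-")
            cores = (int(first), int(last or first))
        specs.append(DeviceSpec(int(parts[1]), cores))
    return specs


print(parse_devices("npu:0,npu:0:0,npu:0:0-3"))
```

Keeping the core range as a separate field makes it easy to see what a device-level scheduler discards: only the `npu` field is used when Ray assigns devices.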

And @elpis-furiosa would you mind helping with testing this e2e?

@Yicheng-Lu-llll Certainly. Are there any unit / E2E testing guidelines for adding new accelerators to Ray?

@Yicheng-Lu-llll
Member

Yicheng-Lu-llll commented May 21, 2026

@elpis-furiosa For the partition issue (npu:0:0-3), agreed, let's address it in a follow up PR. We don't have e2e testing guidelines since we don't have the hardware, so if you could help test it and share the results, and you're happy with the code, we should be good to go.

@nadongjun Could you update the doc description for points 2 and 3 you mentioned? Especially for point 2, it seems users need to set devices=os.environ["FURIOSA_DEVICES"] explicitly. Thanks! Also, you might need to rebase your PR; the CI failure is unrelated, but a rebase is needed to fix it.

@elpis-furiosa
Contributor

elpis-furiosa commented May 22, 2026

I was able to verify that the following code works with RNGDs:

Code
import os
import ray
from ray.util.actor_pool import ActorPool
from furiosa_llm import LLM, SamplingParams

ray.init(resources={"FURIOSA": 2})


@ray.remote(resources={"FURIOSA": 1})
class FuriosaLLMActor:
    def __init__(self):
        print(
            "Initializing LLM with FURIOSA_DEVICES: {}".format(
                os.environ["FURIOSA_DEVICES"]
            )
        )
        self.llm = LLM(
            "furiosa-ai/Llama-3.1-8B-Instruct", devices=os.environ["FURIOSA_DEVICES"]
        )
        self.sampling_params = SamplingParams(temperature=0.5, max_tokens=1024)

    def chat(self, messages):
        outputs = self.llm.chat(messages, sampling_params=self.sampling_params)
        return [o.outputs[0].text for o in outputs]


actor_pool = ActorPool([FuriosaLLMActor.remote() for _ in range(2)])
print(
    list(
        actor_pool.map(
            lambda a, v: a.chat.remote(v),
            [
                [
                    {"role": "system", "content": "You are a helpful assistant"},
                    {"role": "user", "content": "Why is the sky blue?."},
                ],
                [
                    {"role": "system", "content": "You are a helpful assistant"},
                    {
                        "role": "user",
                        "content": "What is good for your health, water or coffee?",
                    },
                ],
            ],
        )
    )
)
Logs (RAY_DEDUP_LOGS=0)
/home/furiosa/elpis/ray/.venv/lib/python3.10/site-packages/furiosa/models/core/attention/attention.py:6: UserWarning: LTW Backend is provided temporarily for compatibility purposes only. This feature may be removed without notice in future versions. (Default behavior, set USE_WTL_BACKEND=1 to use WTL backend)
  from furiosa.models.core.attention.backends import LLMAttentionBackend
2026-05-22 18:23:29,532 INFO worker.py:2035 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(pid=849518) /home/furiosa/elpis/ray/.venv/lib/python3.10/site-packages/furiosa/models/core/attention/attention.py:6: UserWarning: LTW Backend is provided temporarily for compatibility purposes only. This feature may be removed without notice in future versions. (Default behavior, set USE_WTL_BACKEND=1 to use WTL backend)
(pid=849518)   from furiosa.models.core.attention.backends import LLMAttentionBackend
(pid=849529) /home/furiosa/elpis/ray/.venv/lib/python3.10/site-packages/furiosa/models/core/attention/attention.py:6: UserWarning: LTW Backend is provided temporarily for compatibility purposes only. This feature may be removed without notice in future versions. (Default behavior, set USE_WTL_BACKEND=1 to use WTL backend)
(pid=849529)   from furiosa.models.core.attention.backends import LLMAttentionBackend
(FuriosaLLMActor pid=849518) Initializing LLM with FURIOSA_DEVICES: npu:0
(FuriosaLLMActor pid=849529) Initializing LLM with FURIOSA_DEVICES: npu:1
(FuriosaLLMActor pid=849518) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 10571.14it/s]
Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 16664.41it/s]
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:35.975243838+09:00  INFO furiosa_llm_common::artifact::types::next_gen: Loading artifact from path: /home/furiosa/.cache/huggingface/hub/models--furiosa-ai--Llama-3.1-8B-Instruct/snapshots/231d94fbc03cdd66aaeb2411697064a45f008ec7
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:36.154143104+09:00  INFO furiosa_llm_common::artifact::types::next_gen: Loading artifact from path: /home/furiosa/.cache/huggingface/hub/models--furiosa-ai--Llama-3.1-8B-Instruct/snapshots/231d94fbc03cdd66aaeb2411697064a45f008ec7
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:43.314344181+09:00  INFO furiosa_llm_common::artifact::types::commons: Loading artifact with schema version: 3.0
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:43.312929379+09:00  INFO furiosa_llm_common::artifact::types::commons: Loading artifact with schema version: 3.0
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:44.300213431+09:00  INFO furiosa::llm::engine: Loaded target artifact: SchemaVersion { major: 3, minor: 0 }
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:44.300281537+09:00  INFO furiosa::llm::engine: Parallelism Config: tp=8, pp=1, dp=1
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:44.313716413+09:00  INFO furiosa::llm::engine: Loaded target artifact: SchemaVersion { major: 3, minor: 0 }
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:44.313762504+09:00  INFO furiosa::llm::engine: Parallelism Config: tp=8, pp=1, dp=1
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:44.716205776+09:00  INFO furiosa_generator::next_gen::hf_compat_next_gen: Loading the target model ...
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:44.717235293+09:00  INFO device_runtime::context: Memory dump thread for Device([npu:0:0-3, npu:0:4-7]) started
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:44.708010467+09:00  INFO furiosa_generator::next_gen::hf_compat_next_gen: Loading the target model ...
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:44.708838951+09:00  INFO device_runtime::context: Memory dump thread for Device([npu:1:0-3, npu:1:4-7]) started
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:45.299625962+09:00  INFO furiosa_generator::next_gen::pipeline::resolve: PP device#0 allocation plan: Binary=283.8 MiB, Model weights=15.0 GiB, Reserved IO memory=2.0 GiB
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:45.296792233+09:00  INFO furiosa_generator::next_gen::pipeline::resolve: PP device#0 allocation plan: Binary=283.8 MiB, Model weights=15.0 GiB, Reserved IO memory=2.0 GiB
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:45.796373805+09:00  INFO furiosa_generator::next_gen::pipeline::resolve: Resolve 47 pipeline for 1 DP groups (DP=1, PP=1) in 1.08s
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:45.796431641+09:00  INFO furiosa_generator::backing_file: Total size of parameters loaded: 15.0 GiB in 0.4967606 s
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:45.806585898+09:00  INFO furiosa_generator::next_gen::pipeline::resolve: Resolve 47 pipeline for 1 DP groups (DP=1, PP=1) in 1.10s
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:45.806653519+09:00  INFO furiosa_generator::backing_file: Total size of parameters loaded: 15.0 GiB in 0.50978327 s
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:46.178928984+09:00  INFO furiosa_generator::next_gen::hf_compat_next_gen: Loading the target model took 1.462671207s
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:46.185103451+09:00  INFO furiosa_generator::next_gen::pipeline::resolve: PP device#0 KV cache=30.2 GiB
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:46.215650206+09:00  INFO furiosa_generator::next_gen::hf_compat_next_gen: Loading the target model took 1.507582617s
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:46.222273219+09:00  INFO furiosa_generator::next_gen::pipeline::resolve: PP device#0 KV cache=30.2 GiB
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:46.27466863+09:00  INFO furiosa_generator::next_gen::hf_compat_next_gen: Computed bucket limits: max_executable_len=131072
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:46.275410191+09:00  INFO furiosa_generator::structured_output::manager: Initializing structured output manager for backend: Auto
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:46.322329339+09:00  INFO furiosa_generator::next_gen::hf_compat_next_gen: Computed bucket limits: max_executable_len=131072
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:46.323092381+09:00  INFO furiosa_generator::structured_output::manager: Initializing structured output manager for backend: Auto
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:47.247807151+09:00  INFO furiosa_generator::structured_output::manager: XGrammar backend is initialized.
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:47.246899128+09:00  INFO furiosa_generator::structured_output::manager: XGrammar backend is initialized.
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:48.37119724+09:00  INFO furiosa_generator::structured_output::manager: LLGuidance backend is initialized.
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:48.371233334+09:00  INFO furiosa_generator::next_gen::generator: DP entry DpId(0) → device [npu1pe0-3, npu1pe4-7]
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:48.38912785+09:00  INFO furiosa_generator::next_gen::scheduler::request_management::task_selector: Initializing TaskSelector([[npu:1:0-3, npu:1:4-7]]) with config: TaskSelectorConfig { enable_jit_compilation: false }, 46 AOT wired pipelines
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:48.393971612+09:00  INFO furiosa_generator::next_gen::scheduler::memory_manager: Initialize KVCacheManager with config: KVCacheConfig(kv_cache_memory: {[npu:1:0-3, npu:1:4-7]: Buffer { addr: 0x80000000, size: 0x78cfb8c00, device: Npu([npu:1:0-3, npu:1:4-7], Dram) }}, KVCachePlan(global_attention_config: LayerConfig(attention_type: Global, unit_block_size: 2048, block_size: 2048, num_chips: 1), global_kv_cache_tensors: 64, aux_attention_config: None, aux_kv_cache_tensors: 0)), is_prefix_cache_enabled: true
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:48.394175638+09:00  INFO furiosa_generator::next_gen::scheduler::memory_manager: Configured KV cache blocks, global_num_blocks: 247421, aux_num_blocks: None
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:48.399114574+09:00  INFO furiosa_generator::structured_output::manager: LLGuidance backend is initialized.
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:48.399137112+09:00  INFO furiosa_generator::next_gen::generator: DP entry DpId(0) → device [npu0pe0-3, npu0pe4-7]
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:48.407217965+09:00  INFO furiosa_generator::next_gen::generator: Eager scheduler has started with: SchedulerConfig { scheduler_kind: None, npu_queue_limit: 1, max_processing_samples: 65536, spare_blocks_ratio: 0.0, estimation_time_limit_ms: None, prefix_cache_config: PrefixCacheConfig { enabled: true, lookahead_requests: 2 }, experimental_scheduling_loop_type: Eager, experimental_aggressive_batching: false, max_concurrency: None, max_num_batched_tokens: None, data_parallel_routing_policy: RoundRobin }
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:48.407258009+09:00  INFO furiosa_generator::next_gen::generator: max_kv_len=247420 (from KV cache blocks across 1 DP device(s))
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:48.416559489+09:00  INFO furiosa_generator::next_gen::scheduler::request_management::task_selector: Initializing TaskSelector([[npu:0:0-3, npu:0:4-7]]) with config: TaskSelectorConfig { enable_jit_compilation: false }, 46 AOT wired pipelines
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:48.420056039+09:00  INFO furiosa_generator::next_gen::scheduler::memory_manager: Initialize KVCacheManager with config: KVCacheConfig(kv_cache_memory: {[npu:0:0-3, npu:0:4-7]: Buffer { addr: 0x80000000, size: 0x78cfb8c00, device: Npu([npu:0:0-3, npu:0:4-7], Dram) }}, KVCachePlan(global_attention_config: LayerConfig(attention_type: Global, unit_block_size: 2048, block_size: 2048, num_chips: 1), global_kv_cache_tensors: 64, aux_attention_config: None, aux_kv_cache_tensors: 0)), is_prefix_cache_enabled: true
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:48.420132965+09:00  INFO furiosa_generator::next_gen::scheduler::memory_manager: Configured KV cache blocks, global_num_blocks: 247421, aux_num_blocks: None
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:48.429961633+09:00  INFO furiosa_generator::next_gen::generator: Eager scheduler has started with: SchedulerConfig { scheduler_kind: None, npu_queue_limit: 1, max_processing_samples: 65536, spare_blocks_ratio: 0.0, estimation_time_limit_ms: None, prefix_cache_config: PrefixCacheConfig { enabled: true, lookahead_requests: 2 }, experimental_scheduling_loop_type: Eager, experimental_aggressive_batching: false, max_concurrency: None, max_num_batched_tokens: None, data_parallel_routing_policy: RoundRobin }
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:48.430004409+09:00  INFO furiosa_generator::next_gen::generator: max_kv_len=247420 (from KV cache blocks across 1 DP device(s))
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:48.591132838+09:00  INFO furiosa_generator::next_gen::hf_compat_next_gen: Num samples received: 1
(FuriosaLLMActor pid=849518) 2026-05-22T18:23:48.591527234+09:00  INFO device_runtime::alloc::cpu: Support for huge page size of 2 MiB has been detected.
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:48.571864594+09:00  INFO furiosa_generator::next_gen::hf_compat_next_gen: Num samples received: 1
(FuriosaLLMActor pid=849529) 2026-05-22T18:23:48.572620078+09:00  INFO device_runtime::alloc::cpu: Support for huge page size of 2 MiB has been detected.
[["The sky appears blue because of a phenomenon called scattering. When sunlight enters Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen (N2) and oxygen (O2). These molecules scatter the light in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths.\n\nThis is known as Rayleigh scattering, named after the British physicist Lord Rayleigh, who first described the phenomenon in the late 19th century. The scattered blue light is then dispersed throughout the atmosphere, giving the sky its blue appearance.\n\nHere's a simplified explanation of the process:\n\n1. Sunlight enters the atmosphere as a mixture of all colors (white light).\n2. The shorter blue wavelengths are scattered more than the longer red wavelengths by the tiny molecules in the atmosphere.\n3. The scattered blue light is dispersed in all directions, reaching our eyes from all parts of the sky.\n4. Our eyes perceive the scattered blue light as the color of the sky.\n\nIt's worth noting that the exact shade of blue we see in the sky can vary depending on factors such as:\n\n* Time of day: During sunrise and sunset, the sky can take on hues of red, orange, and pink due to the scattering of light by atmospheric particles.\n* Atmospheric conditions: Pollution, dust, and water vapor in the atmosphere can affect the color of the sky.\n* Altitude: The sky can appear more intense blue at higher elevations due to the thinner atmosphere.\n\nOverall, the blue color of the sky is a result of the scattering of sunlight by the tiny molecules in the atmosphere, creating a breathtaking and ever-changing visual experience."], ["Both water and coffee have their own benefits and drawbacks when it comes to health. Here's a comparison:\n\n**Water:**\n\n1. **Hydration**: Water is essential for maintaining proper hydration and bodily functions.\n2. 
**Weight management**: Drinking water can help with weight loss and maintenance by suppressing appetite and increasing metabolism.\n3. **Flushes toxins**: Water helps to flush out toxins and waste products from the body.\n4. **Skin health**: Drinking enough water can improve skin health and reduce the appearance of wrinkles.\n5. **Exercise performance**: Proper hydration is essential for exercise performance and recovery.\n\n**Coffee:**\n\n1. **Cognitive function**: Caffeine in coffee can improve alertness, focus, and cognitive function.\n2. **Neuroprotection**: Moderate coffee consumption may have neuroprotective effects and reduce the risk of Parkinson's disease, Alzheimer's disease, and other neurodegenerative disorders.\n3. **Cardiovascular health**: Moderate coffee consumption may lower the risk of stroke, type 2 diabetes, and certain types of cancer.\n4. **Mood booster**: Caffeine can improve mood and reduce the risk of depression.\n5. **Antioxidants**: Coffee contains antioxidants, which can help protect cells from damage.\n\n**Key differences:**\n\n1. **Calorie content**: Water is calorie-free, while coffee contains calories, especially when added with sugar and cream.\n2. **Sleep**: Drinking coffee can interfere with sleep, while water is generally not a sleep disruptor.\n3. **Additives**: Coffee often contains added sugars, creamers, and syrups, which can greatly increase calorie intake.\n\n**The verdict:**\n\nWater is essential for maintaining proper hydration and overall health. Coffee, in moderation, can have cognitive and cardiovascular benefits, but it's essential to be mindful of calorie intake and potential sleep disruptions.\n\n**The ideal balance:**\n\n1. **Drink at least 8 cups of water per day**.\n2. **Consume coffee in moderation** (1-2 cups per day).\n3. **Avoid adding excessive sugar, cream, or syrup** to your coffee.\n4. 
**Monitor your sleep and adjust your coffee consumption accordingly**.\n\nRemember, a balanced lifestyle that includes a mix of water, coffee, and other healthy habits is key to maintaining overall health and well-being."]]
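For anyone re-checking isolation from the log above: each `(FuriosaLLMActor pid=...)` prefix should map to its own NPU index (here pid 849529 reports `npu:1:*` and pid 849518 reports `npu:0:*` / `npu0pe*`). Below is a minimal sketch of that check. The helper names `npus_by_pid` and `shared_npus` and both regexes are hypothetical; they are inferred from the line shapes in this log and are not part of Ray or the Furiosa runtime. The sample lines are abbreviated copies of lines from the log.

```python
import re
from collections import defaultdict

# Ray prefixes each worker log line with "(<ActorClass> pid=<pid>)".
ACTOR_PREFIX = re.compile(r"^\((?P<actor>\w+) pid=(?P<pid>\d+)\)")
# Assumption: device names look like "npu:<idx>:<pe-range>" or "npu<idx>pe<pe-range>",
# as seen in this log. Other runtime versions may format them differently.
NPU_INDEX = re.compile(r"npu:?(\d+)(?::|pe)")


def npus_by_pid(lines):
    """Collect the set of NPU indices mentioned by each actor pid."""
    seen = defaultdict(set)
    for line in lines:
        prefix = ACTOR_PREFIX.match(line)
        if prefix is None:
            continue  # not an actor log line (e.g. the printed model output)
        for idx in NPU_INDEX.findall(line):
            seen[int(prefix.group("pid"))].add(int(idx))
    return dict(seen)


def shared_npus(mapping):
    """Return {npu_index: [pids]} for any NPU index reported by more than one pid."""
    owners = defaultdict(list)
    for pid, indices in mapping.items():
        for idx in indices:
            owners[idx].append(pid)
    return {idx: sorted(pids) for idx, pids in owners.items() if len(pids) > 1}


# Abbreviated lines from the log above.
sample = [
    "(FuriosaLLMActor pid=849529) INFO memory_manager: kv_cache_memory: {[npu:1:0-3, npu:1:4-7]: Buffer { ... }}",
    "(FuriosaLLMActor pid=849518) INFO generator: DP entry DpId(0) → device [npu0pe0-3, npu0pe4-7]",
    "(FuriosaLLMActor pid=849518) INFO task_selector: Initializing TaskSelector([[npu:0:0-3, npu:0:4-7]])",
    '[["The sky appears blue ..."]]',
]

result = npus_by_pid(sample)
assert result == {849529: {1}, 849518: {0}}
assert shared_npus(result) == {}  # no NPU index is reported by two actors
```

An empty result from `shared_npus` only shows that the two actors logged different device indices; it does not by itself prove how `FURIOSA_VISIBLE_DEVICES` was set inside each process.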

Contributor

@elpis-furiosa left a comment

Thank you for your work. Looks good to me!

@Yicheng-Lu-llll Is there anything else other than the result in this comment that you want me to verify?

@Yicheng-Lu-llll added the "go" label (add ONLY when ready to merge, run all tests) on May 26, 2026
Member

@Yicheng-Lu-llll left a comment

LGTM, thanks for all the contributions!

cc @edoakes for merge.

@edoakes enabled auto-merge (squash) on May 26, 2026 19:24
@github-actions (bot) disabled auto-merge on May 27, 2026 06:17
@nadongjun
Contributor Author

@edoakes The premerge failure was unrelated to this PR (a CI infra issue already fixed on master). I've rebased to pick it up. Could you check and merge it when you get a chance?

@edoakes merged commit 5a7eb2d into ray-project:master on May 27, 2026
6 checks passed

Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core
go: add ONLY when ready to merge, run all tests

4 participants