
[misc][data.llm] Generalize the builder pattern in ray.data.llm#58484

Merged
richardliaw merged 6 commits into ray-project:master from jeffreyjeffreywang:generic_build_processor
Dec 10, 2025

Conversation

@jeffreyjeffreywang
Contributor

@jeffreyjeffreywang jeffreyjeffreywang commented Nov 9, 2025

Description


As discussed in https://docs.google.com/document/d/1danbyJjd3Zl_Q-CSsS3PjxtG4K9dZkyn0A9t7i4Fyjg/edit?disco=AAABtNCDbfw, the current builder function build_llm_processor is overly specific to LLM inference workloads and not flexible enough to support additional processors, such as those for multimodal preprocessing. To address this, we’ve generalized it to build_processor to better accommodate a broader range of LLM-related workloads.

Related issues


N/A

Additional information


Original:

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Meta-Llama-3.1-8B-Instruct",
    concurrency=1,
    batch_size=64,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=20),
    ),
    postprocess=lambda row: dict(resp=row["generated_text"]),
    preprocess_map_kwargs={"num_cpus": 0.5},
    postprocess_map_kwargs={"num_cpus": 0.25},
)

ds = ray.data.range(300)
ds = processor(ds)
for row in ds.take_all():
    print(row)

Updated:

import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_processor

config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Meta-Llama-3.1-8B-Instruct",
    concurrency=1,
    batch_size=64,
)

processor = build_processor( # This is the only difference. Arguments remain the same.
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.3, max_tokens=20),
    ),
    postprocess=lambda row: dict(resp=row["generated_text"]),
    preprocess_map_kwargs={"num_cpus": 0.5},
    postprocess_map_kwargs={"num_cpus": 0.25},
)

ds = ray.data.range(300)
ds = processor(ds)
for row in ds.take_all():
    print(row)

@jeffreyjeffreywang jeffreyjeffreywang requested review from a team as code owners November 9, 2025 22:30
@jeffreyjeffreywang
Contributor Author

cc: @nrghosh @kouroshHakha

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors build_llm_processor to the more generic build_processor. This is a good change that makes the API more flexible for various processing workloads beyond just LLM inference. The renaming has been applied consistently and thoroughly across the entire codebase, including source files, documentation, examples, and tests. The changes are well-executed and I found no issues in my review.

@ray-gardener ray-gardener bot added data Ray Data-related issues llm community-contribution Contributed by the community labels Nov 10, 2025
@jeffreyjeffreywang
Contributor Author

Will wait for #58298 to finalize before deciding if this PR is still necessary.

@nrghosh nrghosh added the go add ONLY when ready to merge, run all tests label Nov 20, 2025
@omatthew98 omatthew98 removed the data Ray Data-related issues label Dec 3, 2025
Contributor

@nrghosh nrghosh left a comment


conflicts resolved, good for merge

@kouroshHakha kouroshHakha enabled auto-merge (squash) December 7, 2025 00:20
auto-merge was automatically disabled December 9, 2025 03:11

Head branch was pushed to by a user without write access

@jeffreyjeffreywang
Contributor Author

Rebase to address failing doc tests.

Contributor

@nrghosh nrghosh left a comment


cc @kouroshHakha ready to merge 🚀

Contributor

@kouroshHakha kouroshHakha left a comment


needs some changes

"""
[DEPRECATED] Prefer build_processor. Build a LLM processor using the given config.
"""
deprecation_warning(
Contributor


wait this should be a decorator and we should also remove the PublicAPI annotation from build_llm_processor

Contributor

@jeffreywang-anyscale jeffreywang-anyscale Dec 10, 2025


I see similar usages of directly using the deprecation_warning helper: https://github.com/search?q=repo%3Aray-project%2Fray+%22deprecation_warning%28%22&type=code&p=2. I'm not able to find associated decorators though.

Contributor


Will remove the publicAPI annotation.

Contributor


Done.

Contributor


oh I guess I was talking about Deprecated class:

"""Decorator for documenting a deprecated class, method, or function.

Maybe we can use that instead?

Contributor


ah yeah, this is neater. adjusted in my latest revision.
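The resolution above (marking the old name with a decorator rather than calling the warning helper inline) can be sketched roughly as follows. This is a simplified stand-in that uses the stdlib `warnings` module instead of Ray's internal `Deprecated` annotation, and the function bodies are placeholders, not the actual Ray implementation:

```python
import functools
import warnings


def deprecated(replacement: str):
    """Decorator that emits a DeprecationWarning pointing at the replacement name."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            warnings.warn(
                f"{fn.__name__} is deprecated; use {replacement} instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            return fn(*args, **kwargs)
        return inner
    return wrap


def build_processor(config, **kwargs):
    # Placeholder for the generalized builder.
    return ("processor", config, kwargs)


@deprecated(replacement="build_processor")
def build_llm_processor(config, **kwargs):
    # Old name kept as a thin alias for backward compatibility.
    return build_processor(config, **kwargs)
```

With this shape, callers of the old name keep working but see a one-line warning, and the alias body stays a single delegating call.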

Contributor

@alexeykudinkin alexeykudinkin left a comment


Rubber-stamping

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
@jeffreywang-anyscale
Contributor

Rebasing onto latest master

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@richardliaw richardliaw merged commit a8857ae into ray-project:master Dec 10, 2025
6 checks passed
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

community-contribution (Contributed by the community), go (add ONLY when ready to merge, run all tests), llm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

7 participants