[air] Add batch predictor class #23808
I don't really like the |
python/ray/ml/scorer.py
batch_size: Split dataset into batches of this size for prediction.
max_scoring_actors: If set, specify the maximum number of scoring actors.
ray_remote_args: Additional resource requirements to request from
    ray (e.g., num_gpus=1 to request GPUs for the map tasks).
I'm wondering if num_gpus should be a top-level arg, since GPU-based batch inference is pretty fundamental. 🤔
Agreed, this should be easily accessible. I've added it to the top-level API.
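For illustration only, one way a top-level num_gpus could coexist with ray_remote_args is to fold it into the dict before the arguments are passed on. `merge_remote_args` is a hypothetical helper, not code from this PR, and the precedence rule (the explicit argument wins) is an assumption:

```python
from typing import Any, Dict, Optional


def merge_remote_args(
    num_gpus: Optional[float] = None,
    ray_remote_args: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: fold a top-level num_gpus into ray_remote_args."""
    # Copy so the caller's dict is never mutated.
    merged = dict(ray_remote_args or {})
    if num_gpus is not None:
        # Assumption: the explicit top-level argument takes precedence.
        merged["num_gpus"] = num_gpus
    return merged


gpu_args = merge_remote_args(num_gpus=1, ray_remote_args={"num_cpus": 2})
```

Keeping ray_remote_args around still lets callers pass less common resource options without widening the signature further.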
scorer = BatchScorer(DummyPredictor, Checkpoint.from_dict({"factor": 2.0}))

test_dataset = ray.data.from_items([1.0, 2.0, 3.0, 4.0])
assert scorer.score(test_dataset).to_pandas().to_numpy().squeeze().tolist() == [
If looking for a more concise conversion + comparison, content should be assertable without this chain, something like:
assert scorer.score(test_dataset).take() == [2.0, 4.0, 6.0, 8.0]
This leads to:
E AssertionError: assert [{'value': 8.0}, {'value': 6.0}, {'value': 4.0}, {'value': 2.0}] == [2.0, 4.0, 6.0, 8.0]
How do I get them as a series and not dicts?
Also, the order problem seems to remain. Any insights on that?
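As a sketch of one possible workaround, using only plain Python on the rows copied from the failure message above: unwrap the value field from each row dict and sort before comparing, so the check does not depend on the order in which results come back. Whether sorting in the test is acceptable, or whether order should be preserved upstream, is a separate question that this does not answer.

```python
# Rows as reported in the AssertionError above: a list of dicts in arbitrary order.
rows = [{"value": 8.0}, {"value": 6.0}, {"value": 4.0}, {"value": 2.0}]

# Unwrap the single column, then sort so the comparison is order-independent.
values = sorted(row["value"] for row in rows)

assert values == [2.0, 4.0, 6.0, 8.0]
```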
Regarding the naming, I don't think that "scoring" implies only one measure, but yeah, it's not ideal, because we're not actually scoring a dataset but performing inference on it. I wouldn't want to go with BatchPredictor or anything *Predictor, as we do have a top-level Predictor interface and this wrapper does not implement it. Maybe
Actually, we can go with
@@ -89,6 +89,9 @@ Predictors
.. autoclass:: ray.ml.predictor.Predictor
    :members:

.. autoclass:: ray.ml.batch_predictor.BatchPredictor
👍
Why are these changes needed?
What: This PR adds a generic BatchPredictor class that offers an interface to run batch inference on Ray datasets. It takes a Predictor class and a checkpoint as input, and provides a predict(dataset) method to run scalable batch inference.
Why: Currently users have to implement scorers themselves. This is mostly boilerplate and prone to errors, so we should provide a simple solution instead.
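To make the shape of the interface concrete, here is a minimal, Ray-free sketch of the pattern described above. It is not the implementation in this PR: the real class works on Ray datasets and Checkpoint objects, while this stand-in uses a plain list and a plain dict. `SketchBatchPredictor` and the `from_checkpoint` constructor are hypothetical names chosen for the example, and the `DummyPredictor` with `factor` 2.0 mirrors the test quoted in the review.

```python
from typing import Dict, List, Type


class DummyPredictor:
    """Toy predictor that multiplies every input by a factor from the checkpoint."""

    def __init__(self, factor: float):
        self.factor = factor

    @classmethod
    def from_checkpoint(cls, checkpoint: Dict[str, float]) -> "DummyPredictor":
        # Assumption: the checkpoint is a plain dict holding the factor.
        return cls(checkpoint["factor"])

    def predict(self, batch: List[float]) -> List[float]:
        return [x * self.factor for x in batch]


class SketchBatchPredictor:
    """Hypothetical stand-in: holds a predictor class and a checkpoint."""

    def __init__(self, predictor_cls: Type[DummyPredictor], checkpoint: Dict[str, float]):
        self.predictor_cls = predictor_cls
        self.checkpoint = checkpoint

    def predict(self, data: List[float], batch_size: int = 2) -> List[float]:
        # Build the predictor from the checkpoint, then run it batch by batch.
        predictor = self.predictor_cls.from_checkpoint(self.checkpoint)
        results: List[float] = []
        for start in range(0, len(data), batch_size):
            results.extend(predictor.predict(data[start:start + batch_size]))
        return results


batch_predictor = SketchBatchPredictor(DummyPredictor, {"factor": 2.0})
predictions = batch_predictor.predict([1.0, 2.0, 3.0, 4.0])
```

The sequential loop keeps input order; a distributed version would hand each batch to a worker instead, which is where ordering and resource arguments such as num_gpus come into play.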
Note that this predictor also implements the Predictor interface.
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.