
[AIR][Predictor] Enable numpy based predictor #28917

Merged
82 commits merged into ray-project:master on Nov 16, 2022

Conversation

@jiaodong (Member) commented Sep 30, 2022:

Why are these changes needed?

Add a numpy-first path for DL predictors such as TensorFlow and PyTorch.

Notable changes:

  • Split the preprocessor batch format from the separate GPU stage format in BatchPredictor
    • Between the dataset and the preprocessor, pick the format requiring the fewest conversions when possible
    • The Predictor base class now exposes a preferred native batch format, which is used as the separate GPU stage format
  • Changed the enum to BatchFormat and used it across the codebase instead of raw string values
  • predictor.py now chooses which implementation to call based on the input batch's data type, same as the preprocessor
  • Both predictor.py and batch_predictor.py return the same data type as the input batch / block format
  • Added _predict_numpy to the TF and Torch predictors
  • Logic updates in BatchPredictor
  • Removed the Arrow batch format from existing tests
  • Notebook and test updates
  • Rewrote test_predictor.py to remove mocks and test against all numpy + pandas combinations of {data batch, preprocessor, predictor}
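The type-based dispatch described above can be sketched as follows. This is a minimal illustration, not the actual Ray implementation; `BatchFormat`'s member names mirror the PR, but `SketchPredictor` and its method bodies are hypothetical stand-ins.

```python
# Hedged sketch: a predictor that picks _predict_numpy or _predict_pandas
# based on the input batch type, so no cross-format conversion is forced.
from enum import Enum

import numpy as np


class BatchFormat(str, Enum):
    NUMPY = "numpy"
    PANDAS = "pandas"


class SketchPredictor:
    def _predict_numpy(self, batch):
        # Stand-in model: double every column of the dict-of-arrays batch.
        return {k: v * 2 for k, v in batch.items()}

    def _predict_pandas(self, df):
        # Stand-in model for DataFrame input.
        return df * 2

    def predict(self, batch):
        # Dispatch on the batch's data type, mirroring the PR's behavior
        # of returning the same format the caller passed in.
        if isinstance(batch, (dict, np.ndarray)):
            if isinstance(batch, np.ndarray):
                batch = {"__value__": batch}
            return self._predict_numpy(batch)
        return self._predict_pandas(batch)


out = SketchPredictor().predict({"x": np.array([1, 2, 3])})
```

Because the numpy path returns a dict of ndarrays, `out["x"]` here is `array([2, 4, 6])` with no intermediate DataFrame allocated.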

Batch prediction results

TL;DR: faster, smaller memory footprint, and no GPU RAM leak or OOM.

  • Per-batch inference time
    • Numpy: 0.88 s / Pandas: 1.44 s
  • Final memory cost after processing 10 GB of image data
    • Numpy: 0.04 GB / Pandas: 6.34 GB
  • Per-batch incremental GPU RAM footprint
    • Numpy: 0 / Pandas: +0.6 GB
  • Final prediction output GPU RAM footprint
    • Numpy: 0 / Pandas: 3.03 GB -> OOM

Setup:

Image 1: Pandas narrow-waist prediction, +0.6 GB of accumulated GPU memory usage per batch.

Image 2: Pandas narrow-waist prediction, an extra 3.03 GB of GPU memory required to dump the final output from batchnorm, leading to OOM.

Image 3: Numpy narrow-waist prediction, constant memory usage; the run finishes.

Related issue number

Related #28346
Closes #28525, #28627, #29003

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jiaodong (Member, Author) commented Oct 3, 2022:

This PR is ready for initial review; one example notebook still has a pending fix.

=== edit ===
Notebook failure fixed with a minor change to the output format.

@jiaodong jiaodong marked this pull request as ready for review October 3, 2022 15:58
@jiaodong jiaodong requested a review from a team as a code owner October 3, 2022 15:58
@jiaodong jiaodong added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 3, 2022
@jiaodong (Member, Author) commented Oct 3, 2022:

Some release tests are flaky due to marginal e2e latency assertions, but this PR doesn't touch them (predictor changes only).

@jiaodong (Member, Author) commented Oct 4, 2022:

The failed release test air-benchmark-pytorch-training-e2e-gpu-1x1-20gb-9 is caused by the ragged tensor PR that was just reverted. The Serve test failure is unrelated.

@ericl ericl removed the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 5, 2022
@amogkam (Contributor) left a comment:

Thanks @jiaodong, overall LGTM! Left some minor comments.

@@ -183,7 +184,7 @@ def _fit(self, dataset: Dataset) -> "Preprocessor":
"""Sub-classes should override this instead of fit()."""
raise NotImplementedError()

- def _determine_transform_to_use(self, data_format: str) -> str:
+ def determine_transform_to_use(self, data_format: BlockFormat) -> BatchFormat:
Contributor:

Can we keep this private still? It should not be a public-facing API for users.

elif output_df.dtypes[col] == np.dtype(object) and all(
isinstance(v, np.ndarray) for v in output_df[col]
):
output_df.loc[:, col] = [v.tolist() for v in output_df[col]]
Contributor:

If numpy arrays are not JSON serializable, is this also a problem when a dict of ndarrays is returned?

This function is currently only called for pandas output.

Member (Author):

Serve already knows how to handle it: https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/serve/air_integrations.py?L129

This is a very small edge case where we return a DataFrame that happens to have an ndarray in it due to fallback casting. With your ongoing PR, we should be able to remove this function and the casting completely.
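The fallback casting under discussion can be sketched as below. This is illustrative only: the real code quoted above assigns via `.loc`, while this sketch uses `Series.apply` for the conversion, and the column name is hypothetical.

```python
# Hedged sketch: ndarray cells in an object-dtype DataFrame column are
# not JSON serializable, so they are converted to plain Python lists
# before the frame is serialized.
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({"preds": [np.array([0.1, 0.9]), np.array([0.8, 0.2])]})

for col in df.columns:
    # Only convert object columns whose every cell is an ndarray,
    # mirroring the guard in the quoted snippet.
    if df.dtypes[col] == np.dtype(object) and all(
        isinstance(v, np.ndarray) for v in df[col]
    ):
        df[col] = df[col].apply(lambda v: v.tolist())

# After the conversion, the frame serializes cleanly.
payload = json.dumps(df.to_dict(orient="records"))
```

Without the conversion, `json.dumps` raises `TypeError: Object of type ndarray is not JSON serializable`.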

- Predictor implementation (pandas vs numpy)
"""
# Got to inline this rather than using @pytest.mark.parametrize to avoid
# unknown object owner error when running test with python cli.
Contributor:

Following up on this: we use parametrize in test_batch_mapper.

Does that not work here?

else BatchFormat.PANDAS
)
# No preprocessor, just use the predictor format.
return self._predictor_cls._batch_format_to_use()
Contributor:

This function should never be called in the first place if the preprocessor is None; I don't think we need this if clause.

Contributor:

@amogkam Hmm, it seems like this is still called if the preprocessor is None?

self._determine_preprocessor_batch_format(data)

Contributor:

Yeah, but it doesn't need to be. This is a minor point, though, so looks good to merge.
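The selection logic debated in this thread can be sketched as follows. This is a stand-in, not the Ray code: `SketchBatchPredictor` and its constructor arguments are hypothetical, and only the "no preprocessor falls back to the predictor's format" branch mirrors the quoted snippet.

```python
# Hedged sketch: when there is no preprocessor, the batch format falls
# back to the predictor's preferred native format; otherwise the
# preprocessor's native format wins.
from enum import Enum


class BatchFormat(str, Enum):
    NUMPY = "numpy"
    PANDAS = "pandas"


class SketchBatchPredictor:
    def __init__(self, predictor_format, preprocessor_format=None):
        self._predictor_format = predictor_format
        self._preprocessor_format = preprocessor_format

    def _determine_preprocessor_batch_format(self):
        if self._preprocessor_format is None:
            # No preprocessor, just use the predictor format.
            return self._predictor_format
        return self._preprocessor_format


# Numpy-native predictor with no preprocessor: numpy wins.
fmt = SketchBatchPredictor(BatchFormat.NUMPY)._determine_preprocessor_batch_format()
# A pandas-native preprocessor overrides the predictor preference.
fmt2 = SketchBatchPredictor(
    BatchFormat.NUMPY, BatchFormat.PANDAS
)._determine_preprocessor_batch_format()
```

The review point is that the `None` branch may be dead code if callers guard on the preprocessor first; the sketch keeps it to match the quoted snippet.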

@amogkam (Contributor) left a comment:

Thanks, lgtm! Can we address the remaining comments before merging?

#28917 (comment)
#28917 (comment)

@clarkzinzow (Contributor) left a comment:

LGTM, only nits and suggestions for follow-ups, so I think we can merge!

"""
# We need schema to properly validate, so synchronously
# fetch it if necessary.
schema = self.schema(fetch_if_missing=True)
Contributor:

With the pipeline peeking implemented above, this triggering execution should be fine, i.e. we shouldn't hit the double-execution issue. 👍

"""
from ray.data.extensions import TensorDtype

for col in output_df.columns:
# TensorArray requires special handling to numpy array.
Contributor:

I'm assuming that we're leaving this relatively alone in this PR? Just double-checking, what was the decision?

@@ -222,13 +281,19 @@ def __call__(self, batch):
# Set the in-predictor preprocessing to a no-op when using a separate
# GPU stage. Otherwise, the preprocessing will be applied twice.
override_prep = BatchMapper(lambda x: x)
# preprocessor.transform will break for DatasetPipeline due to
# missing _dataset_format()
Contributor:

This is no longer true with your addition; we should try unifying these paths in a follow-up PR.
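The "no-op override" trick in the quoted diff can be sketched like this. `BatchMapper` here is a minimal stand-in for the Ray class of the same name, and the scaling function is a hypothetical preprocessor; only the two-stage wiring is the point.

```python
# Hedged sketch: when preprocessing runs as a separate (CPU) stage before
# the GPU prediction stage, the in-predictor preprocessor is replaced with
# an identity mapper so the transform is not applied twice.
class BatchMapper:
    """Stand-in for ray.data.preprocessors.BatchMapper."""

    def __init__(self, fn):
        self.fn = fn

    def transform(self, batch):
        return self.fn(batch)


def scale(batch):
    # Hypothetical real preprocessor: normalize pixel values.
    return [x / 255.0 for x in batch]


# Stage 1 (CPU): apply the real preprocessor exactly once.
preprocessed = BatchMapper(scale).transform([255, 510])

# Stage 2 (GPU): the in-predictor preprocessing is set to a no-op,
# mirroring `override_prep = BatchMapper(lambda x: x)` in the diff.
override_prep = BatchMapper(lambda x: x)
final = override_prep.transform(preprocessed)
```

If stage 2 reused `scale` instead of the identity mapper, the batch would be divided by 255 twice, which is exactly the double-application bug the override prevents.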

raise NotImplementedError(
"None of `_predict_pandas` or `_predict_numpy` are "
f"implemented for input data batch format `{batch_format}`."
)
Contributor:

Ah I see that the check is here, nice. This happens upstream of any Predictor._batch_format_to_use() calls, right?

Member (Author):

This one is a bit more downstream, though, since it only fires upon seeing data in predict, so I've added your suggestion above to surface the issue earlier.
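The guard quoted above can be sketched as follows. This is illustrative, not the Ray implementation: the dispatch-by-`getattr` mechanism and the class names are stand-ins; only the error message mirrors the quoted code.

```python
# Hedged sketch: if a predictor implements neither _predict_pandas nor
# _predict_numpy for the incoming batch format, predict() raises instead
# of silently converting formats.
class SketchDLPredictor:
    def predict(self, batch, batch_format):
        impl = getattr(self, f"_predict_{batch_format}", None)
        if impl is None:
            raise NotImplementedError(
                "None of `_predict_pandas` or `_predict_numpy` are "
                f"implemented for input data batch format `{batch_format}`."
            )
        return impl(batch)


class NumpyOnlyPredictor(SketchDLPredictor):
    def _predict_numpy(self, batch):
        # Stand-in model: increment every element.
        return [x + 1 for x in batch]


out = NumpyOnlyPredictor().predict([1, 2], "numpy")

try:
    NumpyOnlyPredictor().predict([1, 2], "pandas")
    raised = False
except NotImplementedError:
    raised = True
```

The review suggestion was to also surface this mismatch earlier, before any data is seen, rather than only at first `predict` call.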

- Predictor implementation (pandas vs numpy)
"""
# Got to inline this rather than using @pytest.mark.parametrize to avoid
# unknown object owner error when running test with python cli.
Contributor:

I fixed this issue for the test_batch_mapper tests by ensuring that the fixtures use the ray_start_regular_shared fixture for their Datasets execution; otherwise the fixtures could create the Datasets on a different Ray cluster than the one the eventual tests run on. If you have this test use the ray_start_regular_shared fixture, and turn these test cases into fixtures depending on it, it should work:

def ds_pandas_single_column_format(ray_start_regular_shared):

This can happen in a follow-up PR if you'd like.
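The fixture wiring suggested here can be sketched as below. Only the dependency pattern is the point: the real `ray_start_regular_shared` fixture starts or reuses a Ray cluster, while this stand-in yields a plain marker, and the dataset contents are hypothetical.

```python
# Hedged sketch: dataset fixtures depend on the shared-cluster fixture so
# the Dataset is created on the same cluster the test runs against,
# avoiding the "unknown object owner" error mentioned above.
import pytest


@pytest.fixture
def ray_start_regular_shared():
    # In the real suite this starts (or reuses) a shared Ray cluster.
    yield "shared-cluster"


@pytest.fixture
def ds_pandas_single_column_format(ray_start_regular_shared):
    # Depending on the cluster fixture guarantees creation order:
    # cluster first, then the dataset, on that same cluster.
    return {"cluster": ray_start_regular_shared, "ds": [1, 2, 3]}


def test_predict_roundtrip(ds_pandas_single_column_format):
    assert ds_pandas_single_column_format["ds"] == [1, 2, 3]
```

Run under pytest, the fixture chain resolves bottom-up, so every parametrized dataset case shares one cluster instead of each fixture spinning up its own.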

@richardliaw (Contributor) left a comment:

doc changes

jiaodong and others added 3 commits November 15, 2022 12:43
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Signed-off-by: Jiao <sophchess@gmail.com>
@jiaodong (Member, Author) commented:
The rllib getting-started test is flaky and also fails on master.

@richardliaw richardliaw merged commit 326d84f into ray-project:master Nov 16, 2022
fishbone added a commit that referenced this pull request Nov 16, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Successfully merging this pull request may close these issues.

Add predict_numpy to DLPredictor types (tf, torch)
5 participants