[data] Optimization to reduce ArrowBlock building time for blocks of size 1 (ray-project#38833)

Many Data ops depend on converting numpy batches to Arrow blocks. Converting a single np array to pyarrow is normally zero-copy, but blocks with multiple rows need a copy to combine the column of np arrays into one contiguous ndarray. This PR avoids that step for blocks with a single row by using np.expand_dims to reshape the array instead of copying it.
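The copy-vs-view distinction described above can be checked directly with plain numpy (a minimal sketch, independent of Ray's internals): np.expand_dims returns a view over the original buffer, while np.array over a multi-element list of ndarrays must allocate a new contiguous buffer.

```python
import numpy as np

arr = np.arange(6, dtype=np.int64).reshape(2, 3)

# Single-row "batch": expand_dims adds a leading axis as a view (no copy).
batched = np.expand_dims(arr, axis=0)
assert batched.shape == (1, 2, 3)
assert np.shares_memory(arr, batched)

# Multi-row batch: np.array over a list of ndarrays copies the data
# into one contiguous buffer.
other = np.arange(6, dtype=np.int64).reshape(2, 3)
stacked = np.array([arr, other])
assert stacked.shape == (2, 2, 3)
assert not np.shares_memory(arr, stacked)
```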

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Victor <vctr.y.m@example.com>
stephanie-wang authored and Victor committed Oct 11, 2023
1 parent 0a5439e commit 3d43aac
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions python/ray/data/_internal/numpy_support.py
```diff
@@ -51,6 +51,11 @@ def convert_udf_returns_to_numpy(udf_return_col: Any) -> Any:
         return udf_return_col
 
     if isinstance(udf_return_col, list):
+        if len(udf_return_col) == 1 and isinstance(udf_return_col[0], np.ndarray):
+            # Optimization to avoid conversion overhead from list to np.array.
+            udf_return_col = np.expand_dims(udf_return_col[0], axis=0)
+            return udf_return_col
+
         # Try to convert list values into an numpy array via
         # np.array(), so users don't need to manually cast.
         # NOTE: we don't cast generic iterables, since types like
```
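A simplified standalone sketch of the patched code path (hypothetical; the real helper in numpy_support.py handles additional cases not shown here) makes the fast path explicit: a one-element list of ndarrays is reshaped into a single-row batch without copying, while longer lists fall back to the copying np.array() path.

```python
from typing import Any

import numpy as np

def convert_udf_returns_to_numpy(udf_return_col: Any) -> Any:
    # Simplified sketch of the patched function; not the full Ray helper.
    if isinstance(udf_return_col, list):
        if len(udf_return_col) == 1 and isinstance(udf_return_col[0], np.ndarray):
            # Single-row block: add a leading axis as a view instead of copying.
            return np.expand_dims(udf_return_col[0], axis=0)
        # Multi-row block: np.array() copies into one contiguous ndarray.
        return np.array(udf_return_col)
    return udf_return_col

row = np.ones((4,), dtype=np.float64)
single = convert_udf_returns_to_numpy([row])   # view, shares memory with row
multi = convert_udf_returns_to_numpy([row, row])  # copy
```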
