[data] Optimization to reduce ArrowBlock building time for blocks of size 1 (ray-project#38833)

Many Data ops depend on converting numpy batches to Arrow blocks. Converting a single np array to pyarrow is normally zero-copy, but blocks with multiple rows need a copy to combine the column of np arrays into one contiguous ndarray. This PR avoids that step for blocks with a single row by using np.expand_dims to reshape the array instead of copying it.
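The copy-vs-view distinction described above can be checked directly with plain numpy (a minimal sketch, independent of Ray's internals): np.expand_dims returns a view over the original buffer, while np.array over a multi-element list of ndarrays must allocate a new contiguous buffer.

```python
import numpy as np

arr = np.arange(6, dtype=np.int64).reshape(2, 3)

# Single-row "batch": expand_dims adds a leading axis as a view (no copy).
batched = np.expand_dims(arr, axis=0)
assert batched.shape == (1, 2, 3)
assert np.shares_memory(arr, batched)

# Multi-row batch: np.array over a list of ndarrays copies the data
# into one contiguous buffer.
other = np.arange(6, dtype=np.int64).reshape(2, 3)
stacked = np.array([arr, other])
assert stacked.shape == (2, 2, 3)
assert not np.shares_memory(arr, stacked)
```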

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Victor <vctr.y.m@example.com>
stephanie-wang authored and Victor committed Oct 11, 2023
1 parent 0a5439e commit 3d43aac
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions python/ray/data/_internal/numpy_support.py
```diff
@@ -51,6 +51,11 @@ def convert_udf_returns_to_numpy(udf_return_col: Any) -> Any:
         return udf_return_col
 
     if isinstance(udf_return_col, list):
+        if len(udf_return_col) == 1 and isinstance(udf_return_col[0], np.ndarray):
+            # Optimization to avoid conversion overhead from list to np.array.
+            udf_return_col = np.expand_dims(udf_return_col[0], axis=0)
+            return udf_return_col
+
         # Try to convert list values into an numpy array via
         # np.array(), so users don't need to manually cast.
         # NOTE: we don't cast generic iterables, since types like
```
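A simplified standalone sketch of the patched code path (hypothetical; the real helper in numpy_support.py handles additional cases not shown here) makes the fast path explicit: a one-element list of ndarrays is reshaped into a single-row batch without copying, while longer lists fall back to the copying np.array() path.

```python
from typing import Any

import numpy as np

def convert_udf_returns_to_numpy(udf_return_col: Any) -> Any:
    # Simplified sketch of the patched function; not the full Ray helper.
    if isinstance(udf_return_col, list):
        if len(udf_return_col) == 1 and isinstance(udf_return_col[0], np.ndarray):
            # Single-row block: add a leading axis as a view instead of copying.
            return np.expand_dims(udf_return_col[0], axis=0)
        # Multi-row block: np.array() copies into one contiguous ndarray.
        return np.array(udf_return_col)
    return udf_return_col

row = np.ones((4,), dtype=np.float64)
single = convert_udf_returns_to_numpy([row])   # view, shares memory with row
multi = convert_udf_returns_to_numpy([row, row])  # copy
```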
