[Data] Added First aggregation by laysfire · Pull Request #62765 · ray-project/ray

laysfire · 2026-04-20T10:02:23Z

Description

Added a First aggregation that returns the first value in each group.

Signed-off-by: yifan.xie <xyfabcd@163.com>

gemini-code-assist

Code Review

This pull request introduces a First aggregation function to Ray Data, implementing the necessary first methods in Arrow and Pandas block accessors along with the First aggregation class and end-to-end tests. The reviewer identified several issues, including compatibility concerns with older PyArrow versions, potential exceptions in the Pandas implementation for empty columns, a logic error where empty blocks could incorrectly nullify aggregation results, and a lack of documentation for the new API.

gemini-code-assist · 2026-04-20T10:04:21Z

+    def first(self, *, ignore_nulls: bool, as_py: bool = True) -> Any:
+        import pyarrow.compute as pac
+        # If there are only null-values, the column's type is `NullType`.
+        # Pyarrow call will throw a 'no kernel matching input types (null)' Exception
+        try:
+            res = pac.first(self._column, skip_nulls=ignore_nulls)
+        except pyarrow.ArrowNotImplementedError:
+            return None
+        return res.as_py() if as_py else res


The pyarrow.compute.first kernel was introduced in PyArrow 13.0.0. Ray typically maintains compatibility with older versions of PyArrow, and using this kernel without a check will cause an AttributeError on older installations. Additionally, pac.first on an empty array returns a null scalar, which can lead to 'poisoning' the aggregation in AggregateFnV2 when ignore_nulls=False (where any null in the stream makes the result null). A fallback implementation using indexing is safer and more compatible.

Suggested change

-    def first(self, *, ignore_nulls: bool, as_py: bool = True) -> Any:
-        import pyarrow.compute as pac
-        # If there are only null-values, the column's type is `NullType`.
-        # Pyarrow call will throw a 'no kernel matching input types (null)' Exception
-        try:
-            res = pac.first(self._column, skip_nulls=ignore_nulls)
-        except pyarrow.ArrowNotImplementedError:
-            return None
-        return res.as_py() if as_py else res
+    def first(self, *, ignore_nulls: bool, as_py: bool = True) -> Any:
+        import pyarrow.compute as pac
+        if len(self._column) == 0:
+            return None
+        if hasattr(pac, "first"):
+            # If there are only null-values, the column's type is `NullType`.
+            # Pyarrow call will throw a 'no kernel matching input types (null)' Exception
+            try:
+                res = pac.first(self._column, skip_nulls=ignore_nulls)
+            except pyarrow.ArrowNotImplementedError:
+                return None
+        else:
+            # Fallback for older pyarrow versions
+            col = self.dropna() if ignore_nulls else self._column
+            if len(col) == 0:
+                return None
+            res = col[0]
+        return res.as_py() if as_py else res
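The empty-block concern raised in this comment can be illustrated without Arrow. The sketch below is a minimal pure-Python model, not Ray's `AggregateFnV2` code: `block_first` and `combine_first` are hypothetical helpers, and the `_EMPTY` sentinel is an assumption introduced here to tell "this block had no rows" apart from "the first row was null".

```python
from typing import Any, List, Sequence

# Hypothetical sentinel: marks a block that contributed no usable rows.
_EMPTY = object()


def block_first(values: Sequence[Any], ignore_nulls: bool) -> Any:
    """Return the first value of one block, or _EMPTY if nothing qualifies."""
    for v in values:
        if ignore_nulls and v is None:
            continue  # skip nulls only when asked to
        return v
    return _EMPTY


def combine_first(partials: Sequence[Any]) -> Any:
    """Merge per-block partial results in block order, skipping empty blocks."""
    for p in partials:
        if p is not _EMPTY:
            return p
    return None  # no block contributed a value


blocks: List[List[Any]] = [[], [5, None, 7], [9]]

# The leading empty block is skipped instead of turning the result into null.
result_keep_nulls = combine_first(
    [block_first(b, ignore_nulls=False) for b in blocks]
)

# With ignore_nulls=True the leading null in the second block is skipped too.
result_skip_nulls = combine_first(
    [block_first(b, ignore_nulls=True) for b in [[], [None, 3]]]
)
```

The design choice in this model is that a genuine leading null is still returned as `None` when `ignore_nulls=False`, while an empty block never overrides a later block's value.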

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 2330e92.

cursor · 2026-04-21T04:58:41Z

+        # If there are only null-values, the column's type is `NullType`.
+        # Pyarrow call will throw a 'no kernel matching input types (null)' Exception
+        try:
+            res = pac.first(self._column, skip_nulls=ignore_nulls)


Arrow and Pandas first disagree when ignore_nulls=False

Medium Severity

The Arrow implementation of first with ignore_nulls=False uses pac.first(column, skip_nulls=False), which returns null if any null exists anywhere in the array. The Pandas implementation uses iloc[0], which returns the actual first element. For a column like [5, None, 7], Arrow returns None while Pandas returns 5. The test doesn't catch this because every group with nulls happens to have null as the first element.

Additional Locations (1)

python/ray/data/_internal/pandas_block.py#L188-L189
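The point about test coverage can be made concrete with a small parity check. This is a sketch under stated assumptions: it calls neither pyarrow nor pandas, `positional_first` and `null_poisoned_first` are hypothetical stand-ins that model the two behaviors as the comment describes them, and `find_disagreements` is an illustrative helper rather than part of the PR's test suite.

```python
from typing import Any, Callable, Dict, List, Sequence

FirstFn = Callable[[Sequence[Any]], Any]


def positional_first(values: Sequence[Any]) -> Any:
    """Stand-in for iloc[0]-style behavior: return element 0 as-is."""
    return values[0] if len(values) > 0 else None


def null_poisoned_first(values: Sequence[Any]) -> Any:
    """Stand-in for the behavior the comment attributes to the Arrow path:
    a null anywhere in the column makes the result null."""
    if any(v is None for v in values):
        return None
    return values[0] if len(values) > 0 else None


def find_disagreements(
    a: FirstFn, b: FirstFn, groups: Dict[str, Sequence[Any]]
) -> List[str]:
    """Return the names of groups on which the two implementations differ."""
    return [name for name, vals in groups.items() if a(vals) != b(vals)]


# Every group that contains a null has it in first position, so both
# stand-ins return None for those groups and the difference stays hidden.
weak_groups = {"a": [None, 1, 2], "b": [3, 4], "c": [None]}

# Adding one group with an interior null exposes the difference.
strong_groups = {**weak_groups, "d": [5, None, 7]}

weak_report = find_disagreements(positional_first, null_poisoned_first, weak_groups)
strong_report = find_disagreements(positional_first, null_poisoned_first, strong_groups)
```

Under these assumptions, `weak_report` is empty while `strong_report` names group `d`, which is why a test group with a null in a non-leading position is worth adding.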

Added aggregation

aaa68bf

Signed-off-by: yifan.xie <xyfabcd@163.com>

laysfire requested a review from a team as a code owner April 20, 2026 10:02

gemini-code-assist Bot reviewed Apr 20, 2026

cursor Bot reviewed Apr 20, 2026

Comment thread python/ray/data/_internal/pandas_block.py

Comment thread python/ray/data/aggregate.py Outdated

ray-gardener Bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Apr 20, 2026

laysfire added 3 commits April 21, 2026 11:56

update

99c8612

Signed-off-by: yifan.xie <xyfabcd@163.com>

update

5a2095c

Signed-off-by: yifan.xie <xyfabcd@163.com>

lint

2330e92

Signed-off-by: yifan.xie <xyfabcd@163.com>

cursor Bot reviewed Apr 21, 2026

laysfire wants to merge 4 commits into ray-project:master from laysfire:add_first_aggregation
