[Data] pyarrow.lib.ArrowInvalid when aggregating over Python list of dicts #32056

@cadedaniel

Ray 2.2 on macOS, pyarrow==6.0.1, Python 3.7

stacktrace:

  File "python/ray/_raylet.pyx", line 830, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 834, in ray._raylet.execute_task
  File "/Users/cade/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/data/grouped_dataset.py", line 58, in map
    parts = [BlockAccessor.for_block(p).combine(key, aggs) for p in partitions]
  File "/Users/cade/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/data/grouped_dataset.py", line 58, in <listcomp>
    parts = [BlockAccessor.for_block(p).combine(key, aggs) for p in partitions]
  File "/Users/cade/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/data/_internal/arrow_block.py", line 515, in combine
    return builder.build()
  File "/Users/cade/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/data/_internal/table_block.py", line 98, in build
    tables = [self._table_from_pydict(self._columns)]
  File "/Users/cade/miniconda3/envs/anyscale/lib/python3.7/site-packages/ray/data/_internal/arrow_block.py", line 123, in _table_from_pydict
    return pyarrow.Table.from_pydict(columns)
  File "pyarrow/table.pxi", line 1724, in pyarrow.lib.Table.from_pydict
  File "pyarrow/table.pxi", line 2368, in pyarrow.lib._from_pydict
  File "pyarrow/array.pxi", line 341, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 315, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert {'name': 'C', 'amount': 10, 'country': 'C1'} with type ArrowRow: did not recognize Python value type when inferring an Arrow data type

repro script:

#!/usr/bin/env python3

import ray
from ray.data.aggregate import AggregateFn

data = [
    {'name': 'A', 'amount': 100, 'country': 'C1'},
    {'name': 'B', 'amount': 200, 'country': 'C2'},
    {'name': 'C', 'amount': 10, 'country': 'C1'},
    {'name': 'D', 'amount': 500, 'country': 'C2'},
    {'name': 'E', 'amount': 400, 'country': 'C3'},
]

ds = ray.data.from_items(data)
ds = ds.groupby('country')

result = ds.aggregate(AggregateFn(
    init=lambda k: [],
    accumulate_row=lambda a, r: a + [r],
    merge=lambda a1, a2: a1['amount'] + a2['amount'],
    finalize=lambda a: a
))
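
To make the roles of the four AggregateFn callbacks in the script above easier to follow, here is a minimal plain-Python sketch that mimics a grouped aggregation without Ray or pyarrow. It is not Ray's implementation: `simulate_groupby_aggregate` is a hypothetical helper, and the two-partition split is an assumption made only so that some groups span partitions. It suggests that `merge` receives two accumulators (lists here), so indexing them with `'amount'` as in the script would fail on its own, separately from the ArrowInvalid above, which the trace shows being raised in the combine step.

```python
def simulate_groupby_aggregate(rows, key, init, accumulate_row, merge, finalize):
    """Hypothetical stand-in for a grouped aggregation (not Ray's code)."""
    # Assumption: split the rows into two contiguous partitions.
    mid = len(rows) // 2
    partitions = [rows[:mid], rows[mid:]]

    # Combine step: build one accumulator per (partition, group key).
    partials = []
    for part in partitions:
        accs = {}
        for row in part:
            k = row[key]
            if k not in accs:
                accs[k] = init(k)
            accs[k] = accumulate_row(accs[k], row)
        partials.append(accs)

    # Merge step: merge() is called with two accumulators for the same key.
    merged = {}
    for accs in partials:
        for k, acc in accs.items():
            merged[k] = merge(merged[k], acc) if k in merged else acc

    # Finalize step: turn each accumulator into the output value.
    return {k: finalize(acc) for k, acc in merged.items()}


data = [
    {'name': 'A', 'amount': 100, 'country': 'C1'},
    {'name': 'B', 'amount': 200, 'country': 'C2'},
    {'name': 'C', 'amount': 10, 'country': 'C1'},
    {'name': 'D', 'amount': 500, 'country': 'C2'},
    {'name': 'E', 'amount': 400, 'country': 'C3'},
]

# With list accumulators, concatenation is a merge that matches their type.
collected = simulate_groupby_aggregate(
    data,
    key='country',
    init=lambda k: [],
    accumulate_row=lambda a, r: a + [r],
    merge=lambda a1, a2: a1 + a2,
    finalize=lambda a: a,
)
print({k: [r['name'] for r in v] for k, v in collected.items()})

# The merge from the report indexes list accumulators with a string key.
try:
    simulate_groupby_aggregate(
        data,
        key='country',
        init=lambda k: [],
        accumulate_row=lambda a, r: a + [r],
        merge=lambda a1, a2: a1['amount'] + a2['amount'],
        finalize=lambda a: a,
    )
except TypeError as exc:
    print('merge from the report fails on list accumulators:', exc)
```

Whether switching `merge` to list concatenation changes the behavior reported in this issue is untested here; the sketch only illustrates the accumulator types each callback sees.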

Full log: https://gist.github.com/cadedaniel/1080563aae30309aef98505aef9fc6bc
pip freeze: https://gist.github.com/cadedaniel/480a95d8d29da7795ebd19f092253b44

Labels

bug, data, stale