Support efficient conversion of Tensor and Complex Data in Python #2

Open
sjperkins opened this issue Mar 20, 2023 · 2 comments
Labels: enhancement (new feature or request), medium (medium priority), python (Python related issues)

Comments

@sjperkins
Member

sjperkins commented Mar 20, 2023

Currently, conversion of Tensors and Complex data in Python is inefficient:

import casa_arrow as ca
casa_table = ca.table("~/data/WSRT_polar.MS_p0/")
arrow_table = casa_table.to_arrow()

print(arrow_table.column("DATA").to_numpy())

produces the following output:

array([array([array([array([0., 1.], dtype=float32), array([0., 1.], dtype=float32),
                     array([0., 1.], dtype=float32), array([0., 1.], dtype=float32)],
                    dtype=object)                                                    ,
              array([array([0., 1.], dtype=float32), array([0., 1.], dtype=float32),
                     array([0., 1.], dtype=float32), array([0., 1.], dtype=float32)],
                    dtype=object)                                                    ,
           ...
           array([array([0., 1.], dtype=float32), array([0., 1.], dtype=float32),
                     array([0., 1.], dtype=float32), array([0., 1.], dtype=float32)],
                    dtype=object)                                                    ],
             dtype=object)                                                             ],
      dtype=object)

This is because the extension types are defined in C++, and the to_numpy() method on the default Python Extension Type wrapper isn't overridden. See daskms.experimental.arrow.extension_types.to_numpy for a possible implementation.

Two possible solutions exist:

Provide wrappers with richer features within Apache Arrow

The Arrow maintainers are aware of this issue:

The following exploratory PRs suggest initial solutions:

Provide wrappers at the casa-arrow level

Provide a table wrapper that creates numpy arrays directly from the arrow column buffers: e.g.

>>> import numpy as np
>>> AT.column("DATA").chunks[0].buffers()
[None, None, None, None, <pyarrow.Buffer address=0x7fca84011000 size=2048 is_cpu=True is_mutable=True>]
>>> bufs = AT.column("DATA").chunks[0].buffers()
>>> data = np.frombuffer(bufs[-1], np.complex64)

sjperkins added the bug label Mar 20, 2023
sjperkins self-assigned this Mar 20, 2023
sjperkins added the python, medium and enhancement labels and removed the bug label Mar 20, 2023

@jorisvandenbossche

> Provide a table wrapper that creates numpy arrays directly from the arrow column buffers: e.g.

FYI, if it's a fixed size list array under the hood, you can do this conversion to numpy a bit easier than accessing the buffers:

In [22]: arr = pa.array([[1, 2], [3, 4], [5, 6]], pa.list_(pa.int64(), 2))

In [23]: arr.to_numpy(zero_copy_only=False)
Out[23]: array([array([1, 2]), array([3, 4]), array([5, 6])], dtype=object)

In [24]: arr.flatten().to_numpy().reshape(len(arr), -1)
Out[24]: 
array([[1, 2],
       [3, 4],
       [5, 6]])

@sjperkins
Member Author

Ah, that's much better. Thanks @jorisvandenbossche.
