[FEATURE-REQUEST] Support interchanging vaex dataframes with Arrow-backend columns #2134

Closed
honno opened this issue Jul 26, 2022 · 3 comments

@honno
Contributor

honno commented Jul 26, 2022

Initialising an interchange protocol buffer (_VaexBuffer) only works for vaex columns with NumPy backends:

def __init__(self, x: np.ndarray, allow_copy: bool = True) -> None:
    """
    Handle only regular columns (= numpy arrays) for now.
    """

_VaexBuffer.__init__() is private API, but it affects interchange with other libraries because it is called by the public Column.get_buffers() API:

packages/vaex-core/vaex/dataframe_protocol.py:565: in get_buffers
    buffers["data"] = self._get_data_buffer()
packages/vaex-core/vaex/dataframe_protocol.py:603: in _get_data_buffer
    buffer = _VaexBuffer(self._col.values)

So obviously it'd be nice (if not practically essential?) if vaex supported interchanging Arrow-backend columns too. I just thought to raise this issue as a tracker, as I didn't quite see relevant conversation in #1509. cc @maartenbreddels
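
To make the request concrete, here is a rough sketch of how a buffer constructor could normalise either backend before wrapping it. The helper name `to_numpy_for_buffer` is hypothetical and is not part of vaex; it relies only on NumPy and, when installed, on pyarrow's `Array.to_numpy()` and `ChunkedArray.combine_chunks()`. It also assumes `allow_copy` means the same thing as in `_VaexBuffer.__init__()` above.

```python
import numpy as np

try:
    import pyarrow as pa
except ImportError:  # pyarrow is optional in this sketch
    pa = None


def to_numpy_for_buffer(values, allow_copy=True):
    """Hypothetical helper (not vaex API): return a NumPy array for `values`.

    NumPy input is passed through untouched. Arrow input is converted with
    pyarrow's own to_numpy(), which refuses to copy when zero_copy_only=True.
    """
    if isinstance(values, np.ndarray):
        return values  # already NumPy-backed, nothing to do

    if pa is not None and isinstance(values, pa.ChunkedArray):
        if values.num_chunks == 1:
            values = values.chunk(0)  # single chunk: no concatenation needed
        elif allow_copy:
            values = values.combine_chunks()  # concatenating chunks copies data
        else:
            raise RuntimeError("column is not a single chunk and allow_copy=False")

    if pa is not None and isinstance(values, pa.Array):
        # zero_copy_only=True raises if pyarrow cannot expose the data
        # without copying (for example when nulls are present).
        return values.to_numpy(zero_copy_only=not allow_copy)

    raise TypeError(f"unsupported column backend: {type(values)!r}")
```

With `allow_copy=False`, the Arrow path in this sketch either returns a view on the Arrow memory or raises, so a caller never gets a silent copy.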

@maartenbreddels
Member

Even if the buffer is stored as a NumPy array, the underlying data can still be an Arrow array.

I think it should be possible to do arrow->protocol->arrow without a memory copy. At least that's how we designed the spec, AFAIR. It could be that the implementation is still missing some parts.

@honno
Contributor Author

honno commented Jul 26, 2022

Ah, so you fixed the issue I was alluding to in #2122:

-                buffer = _VaexBuffer(self._col.values)
+                buffer = _VaexBuffer(indices.to_numpy())

Before that fix, a test like the following would fail

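import numpy as np

# df_factory is a pytest fixture, so the test runs once per dataframe backend
def test_smoke_get_buffers(df_factory):
    x = np.arange(5)
    df = df_factory(x=x)
    df = df.categorize("x")  # make x a categorical column
    interchange_df = df.__dataframe__()
    interchange_col = interchange_df.get_column_by_name("x")
    interchange_col.get_buffers()  # smoke check: this call should not raise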

for the pyarrow (+chunked) dataframes. So I think you're all good? I'll work on forcibly generating Arrow-backend examples for dataframe-interchange-tests.

@honno
Contributor Author

honno commented Jul 26, 2022

Wrote a regression test in #2135.
