Draft for the Dataframe interchange protocol #1509

AlenkaF · 2021-08-12T11:47:37Z

This is the first draft of the Dataframe protocol implementation for Vaex. It works for:

* numerical and boolean dtypes
* categorical columns constructed with categorize()

It currently raises an error for Arrow Dictionary columns when converting the dtype with arrow_type.to_pandas_dtype() (line 305 in _dtype_from_vaexdtype). Do you have any suggestions?

What is still on my todo list:

* implement expression.codes, see ✨ Expression.codes gives the numerical codes for categories/dict encoded #1503
* materializing virtual columns
* chunk handling
* support of missing values

cc @maartenbreddels @JovanVeljanoski

maartenbreddels

Excellent work, I gave it a first pass, hope that helps.

packages/vaex-core/vaex/dataframe_protocol.py

tests/dataframe_protocol_test.py

maartenbreddels · 2021-08-13T15:58:46Z

packages/vaex-core/vaex/dataframe_protocol.py

+            if self._col.values[0] in self.labels:
+                for i in self._col.values:
+                    codes[np.where(codes==i)] = np.where(self.labels == i) # if values are same as labels
+            else: 
+                codes = self._col.values # values are already codes for the labels


I don't think this is needed. Could you explain what you are trying to do here?

I also think this will no longer be needed once I am able to use expression.codes.

But for now, if I take this example:

df = vaex.from_arrays(year=[2012, 2015, 2019], weekday=[0, 4, 6])
df = df.categorize('year', min_value=2012, max_value=2019)
df = df.categorize('weekday', labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

and use .values on the year and weekday columns, I get two different kinds of output. In the first case it is a list of labels and in the other case I get the codes.

array([2012, 2015, 2019])
array([0, 4, 6])

To get the codes for the year column I calculate them as in lines 402-404.

For now I can't seem to find another way.
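As a rough illustration of the mapping described above, the sketch below derives codes from values when a column holds labels rather than codes. It is plain Python, not Vaex code: `codes_from_values` is a hypothetical helper, and treating the labels of `year` as the integers from `min_value` through `max_value` is an assumption.

```python
def codes_from_values(values, labels):
    """Map each value to the position of the matching label."""
    position = {label: code for code, label in enumerate(labels)}
    return [position[value] for value in values]


# Assumption: categorize('year', min_value=2012, max_value=2019)
# implies the labels 2012, 2013, ..., 2019.
year_labels = list(range(2012, 2020))
year_codes = codes_from_values([2012, 2015, 2019], year_labels)
print(year_codes)  # [0, 3, 7]
```

The weekday column needs no such step, since its values `[0, 4, 6]` are already positions in the label list.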

packages/vaex-core/vaex/dataframe_protocol.py

AlenkaF · 2021-09-02T08:33:06Z

packages/vaex-core/vaex/dataframe_protocol.py

+        # If it is internal, kind must be categorical (23)
+        # If it is external (call from_dataframe), dtype must give type of the data
+        if self._col.df.is_category(self._col):
+            return (_DtypeKind.CATEGORICAL, 64, 'u', '=') # what should be the default??


This will have to be revised. See first two items in data-apis/dataframe-api#46 (comment)
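For context, `'u'` is the Arrow format character for UTF-8 strings, so it does not describe integer codes. One option, sketched below, is to describe a categorical column through the integer type of its codes. The kind values follow the protocol's dtype kinds (the diff above already names 23 for categorical); `categorical_dtype` and the two lookup tables are hypothetical names, and this is not necessarily what the PR settled on.

```python
import enum


class DtypeKind(enum.IntEnum):
    """Dtype kinds used by the interchange protocol."""
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# Arrow C data interface format characters for signed integer types.
ARROW_FORMAT = {"int8": "c", "int16": "s", "int32": "i", "int64": "l"}
BIT_WIDTH = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}


def categorical_dtype(codes_type="int64"):
    """Hypothetical helper: (kind, bit width, format string, endianness)."""
    return (DtypeKind.CATEGORICAL, BIT_WIDTH[codes_type], ARROW_FORMAT[codes_type], "=")


dtype = categorical_dtype("int64")
print(dtype[1:])  # (64, 'l', '=')
```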

AlenkaF · 2021-09-02T12:47:52Z

packages/vaex-core/vaex/dataframe_protocol.py

+
+    # Join the chunks into tuple for now
+    df_new = vaex.concat(dataframe)
+    df_new._buffers = _buffers


A nested list may be hard to read. Could be changed to a list of dictionaries with column names as keys if that would be better.

AlenkaF · 2021-09-07T10:25:27Z

@maartenbreddels @JovanVeljanoski don't know if it will stay green =) nevertheless I think this is ready for a review.

AlenkaF · 2021-09-07T11:04:22Z

packages/vaex-core/vaex/dataframe_protocol.py

+    Convert an int, uint, float or bool column to an arrow array
+    """
+    if col.offset != 0:
+        raise NotImplementedError("column.offset > 0 not handled yet")    


Is offset used in Vaex?

This allows plotly express to take in any dataframe that supports the dataframe protocol, see: https://data-apis.org/blog/dataframe_protocol_rfc/ https://data-apis.org/dataframe-protocol/latest/index.html Test includes an example with vaex, which should work with vaexio/vaex#1509 (not yet released)

maartenbreddels

Nice work! My biggest comment/change concerns describe_null, which now distinguishes three cases (no null, np.ma, or arrow). I hope you agree with that.
Would love to see string support!

PS: Please keep the line endings as line feeds.
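The three describe_null cases mentioned above could look roughly like the sketch below. The integer kinds follow the protocol's null representations (0 non-nullable, 1 NaN, 2 sentinel, 3 bit mask, 4 byte mask). The enum name, the `storage` labels, and the function itself are hypothetical and are not taken from the Vaex code.

```python
import enum


class NullKind(enum.IntEnum):
    NON_NULLABLE = 0
    NAN = 1
    SENTINEL = 2
    BIT_MASK = 3
    BYTE_MASK = 4


def describe_null(storage):
    """Return (kind, value) for a hypothetical storage label."""
    if storage == "plain":
        # Plain NumPy data without a mask cannot hold missing values.
        return (NullKind.NON_NULLABLE, None)
    if storage == "numpy_masked":
        # numpy.ma masks use one byte per element; 1 (True) marks a missing value.
        return (NullKind.BYTE_MASK, 1)
    if storage == "arrow":
        # Arrow validity bitmaps use one bit per element; 0 marks a missing value.
        return (NullKind.BIT_MASK, 0)
    raise ValueError(f"unknown storage: {storage!r}")


print(describe_null("arrow") == (3, 0))  # True
```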

packages/vaex-core/vaex/dataframe_protocol.py

maartenbreddels · 2021-09-16T12:05:16Z

packages/vaex-core/vaex/dataframe_protocol.py

+            if self.dtype[0] == _k.BOOL and isinstance(self._col.values, (pa.Array, pa.ChunkedArray)):
+                buffer = _VaexBuffer(np.array(self._col.tolist(), dtype=bool))
+            else:
+                buffer = _VaexBuffer(self._col.to_numpy())


This could make a copy. We can use .values/.evaluate() but then we'd have to support arrow and numpy arrays, which should be fine, maybe the following works:

Suggested change

-buffer = _VaexBuffer(self._col.to_numpy())
+buffer = _VaexBuffer(np.asarray(self._col.evaluate(), dtype=bool))

maartenbreddels · 2021-09-16T12:05:40Z

packages/vaex-core/vaex/dataframe_protocol.py

+            # If arrow array is boolean .to_numpy changes values for some reason
+            # For that reason data is transferred to numpy through .tolist
+            if self.dtype[0] == _k.BOOL and isinstance(self._col.values, (pa.Array, pa.ChunkedArray)):
+                buffer = _VaexBuffer(np.array(self._col.tolist(), dtype=bool))


Not sure why that happens, I'll see if a test fails because of that.

Maybe I should restate: the values changed in a strange way when transferring data with buffer_to_ndarray in the case of a bool arrow array if .to_numpy was used.

But I will try to use previous suggestion (.values/.evaluate()) for arrow and numpy and I will delete the comment if it works.

I get the same strange change of values for evaluate() also. It happens in the case of arrow arrays (int and bool) with missing values.

df = vaex.from_arrays(
    arrow_int_m=pa.array([0, 1, 2, None, 0], mask=np.array([0, 0, 0, 1, 1], dtype=bool)),
    arrow_float_m=pa.array([0.5, 1.5, 2.5, None, 0.5], mask=np.array([0, 0, 0, 1, 0], dtype=bool)),
    arrow_bool_m=pa.array([True, False, True, None, True], mask=np.array([0, 0, 1, 1, 0], dtype=bool)),
)
col = df.__dataframe__().get_column_by_name('arrow_int_m')
b, d = col.get_buffers()["data"]
buffer_to_ndarray(b, d)

output:

array([ 0, 4607182418800017408, 4611686018427387904, -2251799813685248, -2251799813685248], dtype=int64)

I can commit with the error so you can have a look?
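One possible reading of that output, not confirmed in this thread: the integers look like float64 bit patterns read back as int64. That would fit a column converted to float64 with NaN for the two masked entries while the buffer is still described as int64. The standard-library sketch below only checks the bit patterns; `int64_bits_as_float64` is a hypothetical helper.

```python
import math
import struct

# Values copied from the output above.
observed = [0, 4607182418800017408, 4611686018427387904, -2251799813685248, -2251799813685248]


def int64_bits_as_float64(value):
    """Reinterpret the 8 bytes of a signed 64-bit integer as a float64."""
    return struct.unpack("<d", struct.pack("<q", value))[0]


decoded = [int64_bits_as_float64(v) for v in observed]
missing = [math.isnan(x) for x in decoded]
print(decoded)  # [0.0, 1.0, 2.0, nan, nan]
print(missing)  # [False, False, False, True, True]
```

The decoded values line up with the input `[0, 1, 2, None, 0]` and its mask `[0, 0, 0, 1, 1]`.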

packages/vaex-core/vaex/dataframe_protocol.py

maartenbreddels · 2021-09-16T12:13:36Z

packages/vaex-core/vaex/dataframe_protocol.py

+            size = self.num_rows()
+            i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
+            iterator = []
+            for i1, i2, chunk in i:
+                iterator.append(_VaexColumn(self._df[i1:i2]))
+            return iterator


Maybe I misunderstood, but I think we can simply do

Suggested change

-size = self.num_rows()
-i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
-iterator = []
-for i1, i2, chunk in i:
-    iterator.append(_VaexColumn(self._df[i1:i2]))
-return iterator
+yield self

Similar to https://github.com/data-apis/dataframe-api/blob/27b8e1cb676bf10704d1dfc3dca0d0d806e2e802/protocol/pandas_implementation.py#L766

I added this part thinking one could read a dataframe (that is not chunked) in chunks specifying the n_chunks (item 12 from protocol-design-requirements).

But there are some errors left, sorry about that. It should be:

def get_chunks(self, n_chunks: Optional[int] = None) -> Iterable["_VaexDataFrame"]:
    """
    Return an iterator yielding the chunks.
    TODO: details on ``n_chunks``
    """
    if n_chunks is None:
        size = self.num_rows()
        n_chunks = self.num_chunks()
        i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
        iterator = []
        for i1, i2, chunk in i:
            iterator.append(_VaexDataFrame(self._df[i1:i2]))
        return iterator
    elif self.num_chunks() == 1:
        size = self.num_rows()
        i = self._df.evaluate_iterator(self.get_column(0)._col, chunk_size=size // n_chunks)
        iterator = []
        for i1, i2, chunk in i:
            iterator.append(_VaexDataFrame(self._df[i1:i2]))
        return iterator
    else:
        raise ValueError("Dataframe is already chunked.")
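A note on the `size // n_chunks` chunk size used above: with floor division, 10 rows and `n_chunks=3` give a chunk size of 3, and covering 10 rows in steps of 3 takes four slices rather than three. The sketch below computes slice bounds with ceiling division so that the number of slices never exceeds `n_chunks`. It is plain Python with a hypothetical helper name, not the code that was merged.

```python
def chunk_bounds(num_rows, n_chunks):
    """Yield (start, stop) row ranges, producing at most n_chunks ranges."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be a positive integer")
    # Ceiling division, so a remainder does not spill into an extra chunk.
    chunk_size = max(1, -(-num_rows // n_chunks))
    for start in range(0, num_rows, chunk_size):
        yield start, min(start + chunk_size, num_rows)


print(list(chunk_bounds(10, 3)))  # [(0, 4), (4, 8), (8, 10)]
```

Each `(start, stop)` pair could then be used as a slice such as `self._df[start:stop]`.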

packages/vaex-core/vaex/dataframe_protocol.py

maartenbreddels · 2021-09-27T10:06:31Z

Awesome to see the strings coming in, getting close!
I pushed a change to how the buffers for strings are consumed, which is more efficient (see vaex/arrow/convert.py for some more code examples), and I think we should also use that method to produce the buffers (so that my new mem copy test does not fail).
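For readers unfamiliar with the buffers involved: a variable-length string column in the Arrow layout consists of one buffer of concatenated UTF-8 bytes plus an offsets buffer with one more entry than there are strings. The plain-Python sketch below shows that layout only. The helper names are hypothetical, and it does not reproduce the zero-copy code in vaex/arrow/convert.py.

```python
def encode_strings(strings):
    """Build a (data, offsets) pair in the Arrow string layout."""
    data = bytearray()
    offsets = [0]
    for s in strings:
        data += s.encode("utf-8")
        # Each offset marks where the next string starts in the data buffer.
        offsets.append(len(data))
    return bytes(data), offsets


def decode_strings(data, offsets):
    """Recover the strings by slicing the data buffer at consecutive offsets."""
    return [data[start:stop].decode("utf-8") for start, stop in zip(offsets, offsets[1:])]


data, offsets = encode_strings(["vaex", "", "héllo"])
print(offsets)  # [0, 4, 4, 10]
print(decode_strings(data, offsets))  # ['vaex', '', 'héllo']
```

Offsets count bytes, not characters, which is why the last string ("héllo") spans six bytes.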

AlenkaF · 2021-10-06T06:20:07Z

@maartenbreddels I added the suggested method to produce the buffers for the string dtype. I also added mask handling in convert_string_column(). I think this is ready for another round of review. Thanks!

maartenbreddels · 2021-10-13T12:47:44Z

@AlenkaF many many thanks for your work, this is actually already released in vaex-core 4.6.0a3

rgommers · 2021-10-13T20:48:52Z

This is really great to see - thanks a lot @AlenkaF and @maartenbreddels!

* 🐛 handle offset for categories
* Draft for the Dataframe interchange protocol
* Adding test for virtual column plus typo.
* Roundtrip test change plus some corrections in functions parameters
* Apply suggestions from code review
* Dtype for arrow dict plus use of arrow dict in convert_categorical_column
* Add missing value handling
* Added chunk handling and tests
* Corrected usage of metadata for categories
* Applying changes from general dataframe protocol
* Delete copy error
* Change sentinel value handling in convert_categorical_column
* Add select_columns() and test
* Update to _get_data_buffer() for Arrow Dictionary
* Minor commenting changes
* Correct typo error
* Add _VaexBuffer test
* Add tests and correction for _VaexColumn
* Added tests for _VaexDataFrame
* Added more tests and one correction for format_str
* format to LF and black
* support passing in allow_copy
* correct descibe_null for arrow and numpy
* correct _get_validity_buffer to match describe_null
* correct describe_null, convert_categorical_column and test_categorical_ordinal for categorical dtypes
* Apply suggestions from code review
* correct get_chunks for _VaexDataFrame
* Replace return with yield in get_chunks
* Check for LF and run black with -l 220
* Black with line length 220
* Add string dtype support
* Add Arrow Dict check to describe_categorical
* avoid copying data for strings
* small fix
* also test sliced dataframe
* test that we do not copy data
* Apply string no-mem copy suggestions
* fix and test get_chunks
* use future ordinal encoding feature
* make test work with dict encoded

Co-authored-by: Maarten A. Breddels <maartenbreddels@gmail.com>
Co-authored-by: Alenka Frim <alenkafrim@Alenkas-MacBook-Pro.local>

AlenkaF mentioned this pull request Aug 12, 2021

🚀 Create a first draft PR in Vaex AlenkaF/vaex-df-api-implementation#6

Closed

AlenkaF force-pushed the dataframe-protocol branch 2 times, most recently from 8c52ceb to 48c34f6 Compare August 12, 2021 13:44

AlenkaF marked this pull request as draft August 13, 2021 11:50

maartenbreddels reviewed Aug 13, 2021

AlenkaF force-pushed the dataframe-protocol branch from 48c34f6 to 954c4bb Compare August 16, 2021 12:21

AlenkaF mentioned this pull request Aug 17, 2021

Missing values AlenkaF/vaex-df-api-implementation#8

Closed

AlenkaF force-pushed the dataframe-protocol branch 2 times, most recently from 44fc157 to f74a168 Compare August 31, 2021 08:17

AlenkaF commented Sep 2, 2021

AlenkaF force-pushed the dataframe-protocol branch 2 times, most recently from c06da90 to 14fa9a4 Compare September 7, 2021 06:59

AlenkaF marked this pull request as ready for review September 7, 2021 10:26

AlenkaF commented Sep 7, 2021

maartenbreddels mentioned this pull request Sep 16, 2021

support dataframe protocol (tested with Vaex) plotly/plotly.py#3387

Closed

maartenbreddels force-pushed the dataframe-protocol branch from 14fa9a4 to 32a401c Compare September 16, 2021 11:56

maartenbreddels requested changes Sep 16, 2021

AlenkaF force-pushed the dataframe-protocol branch 2 times, most recently from 00cab17 to ba8cef8 Compare September 17, 2021 12:40

AlenkaF force-pushed the dataframe-protocol branch from 1133384 to a81f4f5 Compare October 6, 2021 06:03

maartenbreddels force-pushed the dataframe-protocol branch from 106fe28 to 1fe648e Compare October 11, 2021 13:21

maartenbreddels and others added 3 commits October 12, 2021 10:33

🐛 handle offset for categories

b3b5477

Draft for the Dataframe interchange protocol

d91f1c4

Adding test for virtual column plus typo.

a86a1df

AlenkaF and others added 21 commits October 12, 2021 10:33

Added more tests and one correction for format_str

811a952

format to LF and black

6726d55

support passing in allow_copy

77650a2

correct descibe_null for arrow and numpy

64f12a4

correct _get_validity_buffer to match describe_null

bf7ebb0

correct describe_null, convert_categorical_column and test_categorical_ordinal for categorical dtypes

5126959

Apply suggestions from code review

e2ad32b

correct get_chunks for _VaexDataFrame

14d69cc

Replace return with yield in get_chunks

a074474

Check for LF and run black with -l 220

e53a390

Black with line length 220

a61ad1e

Add string dtype support

43980a8

Add Arrow Dict check to describe_categorical

f95692b

avoid copying data for strings

f5bf82c

small fix

1879472

also test sliced dataframe

7060266

test that we do not copy data

8e988f8

Apply string no-mem copy suggestions

bc7ed73

fix and test get_chunks

e14fba5

use future ordinal encoding feature

51b2c06

make test work with dict encoded

0df4b34

maartenbreddels force-pushed the dataframe-protocol branch from 1fe648e to 0df4b34 Compare October 12, 2021 08:35

maartenbreddels merged commit d5410f8 into vaexio:master Oct 13, 2021

thomasjpfan mentioned this pull request May 30, 2022

SLEP018 Pandas output for transformers with set_output scikit-learn/enhancement_proposals#68

Merged

honno mentioned this pull request Jul 26, 2022

[FEATURE-REQUEST] Support interchanging vaex dataframes with Arrow-backend columns #2134

Closed
