Filtered Arrow tables and conversion to Arquero table: filter is lost #51
Comments
Thanks. You're right that Arquero does not currently account for filtered tables, as it accesses the columns directly. Do you happen to know if by chance Arrow has a method akin to Arquero's reify()?
There is also a related issue already filed for Arrow, but it looks like it’s been there for a while with no response: https://issues.apache.org/jira/browse/ARROW-9496
I tried to find such a reify() feature, at the table or column level, but to no avail. Generally speaking, the main interest of Arrow is to avoid, or delay as much as possible, the conversion to JavaScript objects. It is very interesting to use Arrow filtering functions when you want to extract a small dataset from a large table, so scan() seems to me, for the moment, an interesting avenue. Ideally, one could use all (or the most important) Arquero verbs in an equivalent way, whether they apply to an Arquero table or an Arrow table. This would imply some kind of internal compilation to the Arrow syntax, and I imagine it would be quite a huge undertaking! That's a bit what I think dplyr is trying to do (sorry to bring up R again ;))
The collect() is equivalent to your reify(), but before collect() this is standard dplyr syntax that works as usual, even though it applies here to an Arrow table.
Here is the current Arrow JS implementation of filtered scans: https://github.com/apache/arrow/blob/master/js/src/compute/dataframe.ts#L119 The initial filter call should be very fast indeed, as it does no work: the predicate is only tested later upon scanning. I’m curious how big a performance difference one sees from a filtered scan that extracts values into “collected”/“reified” arrays versus just performing the filtering in Arquero. We will probably need to experiment a bit.
I'll try and arrange some benchmarks in an Observable notebook.
Using dictionary-encoded strings makes a huge difference (×10) in performance (and also in Arrow file size: ×3, and ×6 if zipped). So from 17,000 ms early this morning to 400 ms now, I am quite happy; Arrow + Arquero really rocks!
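The speedup from dictionary encoding comes from replacing per-row string comparisons with integer comparisons. A minimal sketch in plain JavaScript, with made-up data (this is the idea behind the optimization, not Arrow's actual code):

```javascript
// A dictionary-encoded string column: the distinct values are stored once,
// and each row holds only a small integer index into that dictionary.
const dictionary = ['Paris', 'Lyon', 'Nice'];  // distinct values
const codes = [0, 1, 0, 2, 0, 1];              // per-row indices into dictionary

// To filter on equality, resolve the target string to its code once,
// then the per-row work is a cheap integer comparison.
function filterByValue(value) {
  const code = dictionary.indexOf(value);      // one string lookup, done once
  const hits = [];
  for (let i = 0; i < codes.length; i++) {
    if (codes[i] === code) hits.push(i);       // integer comparison per row
  }
  return hits;
}

console.log(filterByValue('Paris')); // [0, 2, 4]
```

With millions of rows and long strings, avoiding the per-row string comparison (and the string materialization it implies) is where the order-of-magnitude gain comes from.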
Here is a more detailed (and reproducible) benchmark comparison:
Fixed and released in v1.3.0.
Great, thanks! |
You wouldn't need to filter with Arrow in advance if Arquero could optimize predicates involving dictionary-encoded columns in the same way that Arrow's Table.filter does. @jheer is anything like that planned/feasible?
Thanks @TheNeuralBit, this is indeed something I am thinking about and should be feasible. That said, it's not currently at the top of my priority queue. In any case, clients could pass an Arrow FilteredDataFrame to Arquero, and so we should properly handle that (as added in PR #60). |
Thanks Jeff. You're right that filtering in Arrow first is a fine workaround for now. I'm just thinking that if Arquero gained that ability, then maybe Arrow could/should get out of the business of being a JS compute library and defer to Arquero. The dictionary optimization is its only real selling point right now. I filed #86 to track the work. I'd be interested in helping out with it if you could provide a few code pointers.
Really happy to see Brian's engagement here, considering all he has achieved so far for Arrow and its JS library! As I also tried to analyse here, the dictionary encoding in Arrow allows for spectacular performance, with filter but also summarise operations (simple sums grouped by dictionary-encoded variables). More generally, the Arrow JS implementation looks somewhat frozen, with, from my point of view, key pending issues (especially compression for web usage, considering raw Arrow file sizes). No doubt for me that Arquero leveraging Arrow, plus your engagement, could boost all that!
Arquero does not seem to take into account a filter applied to an Arrow table.
Here is my use case:
I have quite a big Arrow table:
arrow_tb.count()
=> 6159296
I filter it with Arrow functions, which is very fast, especially on dictionary-encoded columns:
=> 34951
I load it as an Arquero table and compute its number of rows:
aq.fromArrow(arrow_filtered_tb).numRows()
=> 6159296, which is not what I expect.
Apparently, the Arrow method column.toArray() exports the whole unfiltered column.
The Arrow scan method looks more effective; I'd like to avoid the conversion to objects before converting to an Arquero table:
=> 34951