WIP: Port to Nulls.jl #155
@@ -2,9 +2,9 @@ using Query
 using DataStreams
 using CSV

-q = @from i in CSV.Source("data.csv") begin
+q = @from i in CSV.Source("data.csv", categorical=false) begin
Out of curiosity, did you change this because it doesn't work with CategoricalArrays?
Ah, I've been meaning to discuss this w/ you. The approach I took in CSV is that when a column is detected as categorical, the actual value returned from parsefield(io, CategoricalValue, ...) is a WeakRefString. This works because in the DataStreams world, you always pre-allocate the output vector, and doing setindex!(::CategoricalArray) or push!(::CategoricalArray) w/ a WeakRefString works and is more efficient. In the case of Query though, we're only ever iterating rows from the Source, so there's not necessarily the same pre-allocation that's guaranteed to happen. It's easier for now to just say categorical=false, but it's definitely something we need to think about as CSV integrates w/ more frameworks.
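To make the distinction above concrete, here is a minimal sketch in plain Julia. It does not use CSV.jl, CategoricalArrays.jl, or WeakRefStrings.jl; RawField and PooledColumn are hypothetical stand-ins invented for illustration. It only shows the general pattern: a pre-allocated sink can convert a cheap raw value on assignment, while a consumer that just iterates rows sees the raw value unconverted.

```julia
# Hypothetical stand-ins; none of these names come from CSV.jl or CategoricalArrays.jl.

# A cheap "raw" value a parser might hand back instead of a finished pooled value.
struct RawField
    text::String
end

# A toy pooled column: each distinct label is stored once, plus one integer code per row.
struct PooledColumn
    pool::Vector{String}
    codes::Vector{Int}
end
PooledColumn(n::Integer) = PooledColumn(String[], zeros(Int, n))

# Storing a RawField converts it on the way in, as a pre-allocated sink can do.
function Base.setindex!(col::PooledColumn, v::RawField, i::Integer)
    code = findfirst(==(v.text), col.pool)
    if code === nothing
        push!(col.pool, v.text)
        code = length(col.pool)
    end
    col.codes[i] = code
    return col
end

Base.getindex(col::PooledColumn, i::Integer) = col.pool[col.codes[i]]

raw = [RawField("a"), RawField("b"), RawField("a")]

# Sink-style consumption: the column is allocated up front and converts on assignment.
col = PooledColumn(length(raw))
for (i, v) in enumerate(raw)
    col[i] = v
end

# Row-iteration consumption: nothing converts the values, so the consumer sees RawField.
seen = [typeof(v) for v in raw]
```

In this toy version the sink ends up with a two-entry pool, while the iterating consumer only ever observes RawField values, which is the mismatch that categorical=false sidesteps in the diff.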
OK, I'm not sure I see the big picture, but better discuss this elsewhere anyway.
Thanks for taking a stab at this, I very much appreciate it! I'm still under water with other projects (and will be for the next 1-2 weeks), so I only took a quick look (and didn't actually try to run anything). Two questions/issues:
A question I have is would we always expect the output composite type to have the form
AFAIK calls to
@davidanthoff, just let me know when you have some time to review and I'm happy to discuss here or we can have a higher-velocity chat on slack. No rush.
@quinnj Alright, I had another look and thought a bit about it. Here are some (difficult) issues I see with this approach:
Instead of DataValue, uses Union{T, Null} for cases of missing data.

Most tests pass for me locally with the following setup:

jq/nulls branches: IterableTables, QueryOperators with this PR [merged]

There seems to be an issue w/ current master Feather on 0.6 that I'm going to dig into. [fixed on master]

As can be seen in the diff, it's actually a surprisingly small change. The only tricky part was handling "select" statement projections, which generated NamedTuple types like @NT(a=1, c=lowercase(i.name)) and now generate them like @NT(a::typeof(1), c::Query.@infer(lowercase(i.name)))(1, lowercase(i.name)); the Query.@infer macro doesn't execute anything, but relies, much like the select iterator itself, on Base._return_type to infer the type of the projection to ensure a strongly typed NamedTuple.

Would love for others to take a look and provide feedback; it's a bit tricky to get the right "state" for testing, but happy to answer any questions.
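A rough sketch of the two ideas in the description, written for Julia 1.x with Base only. It is not the code from this PR: Null here is a hypothetical stand-in for the type from Nulls.jl, project_name is a made-up helper standing in for the projected expression, and the built-in NamedTuple type is swapped in for the @NT macro from NamedTuples.jl. Base._return_type is an internal function, so its result may differ between Julia versions.

```julia
# Hypothetical stand-in for the Null type from Nulls.jl, so this runs with Base alone.
struct Null end
const null = Null()

# A column with missing entries is an ordinary Vector whose element type is a small Union.
ages = Union{Int, Null}[31, null, 47]

# Stand-in for the projected expression lowercase(i.name).
project_name(name::String) = lowercase(name)

# Ask the compiler for the return type without calling the function.
# Base._return_type is internal; treat its answer as version-dependent.
NameT = Base._return_type(project_name, Tuple{String})

# Fix the row's field types up front, then fill in the values.
RowT = NamedTuple{(:a, :c), Tuple{typeof(1), NameT}}
row = RowT((1, project_name("ALICE")))
```

The point of fixing RowT before any value exists is that an iterator over such rows can advertise a concrete element type even when the source is empty, which is what a strongly typed projection needs.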
See corresponding PRs to IterableTables.jl, QueryOperators.jl