WIP: Port to Nulls.jl #155

quinnj · 2017-10-06T06:28:45Z

Instead of DataValue, uses Union{T, Null} for cases of missing data.

Most tests pass for me locally with the following setup:

jq/nulls branches: IterableTables, QueryOperators
Master: DataFrames, CSV, DataArrays, SQLite, Feather
Latest tagged releases: TypedTables, DataStreams, NullableArrays
NamedTuples ~~with this PR~~[merged]

~~There seems to be an issue w/ current master Feather on 0.6 that I'm going to dig into.~~[fixed on master]

As can be seen in the diff, it's actually a surprisingly small change. The only tricky part was handling "select" statement projections, which generated NamedTuple types like @NT(a=1, c=lowercase(i.name)) and now generate them like @NT(a::typeof(1), c::Query.@infer(lowercase(i.name)))(1, lowercase(i.name)); the Query.@infer macro doesn't execute anything, but relies, much like the select iterator itself, on Base._return_type to infer the type of the projection to ensure a strongly typed NamedTuple.

Would love for others to take a look and provide feedback; it's a bit tricky to get the right "state" for testing, but happy to answer any questions.

See corresponding PRs to IterableTables.jl, QueryOperators.jl

nalimilan · 2017-10-06T07:36:42Z

example/11-Datastream.jl

@@ -2,9 +2,9 @@ using Query
 using DataStreams
 using CSV

-q = @from i in CSV.Source("data.csv") begin
+q = @from i in CSV.Source("data.csv", categorical=false) begin


Out of curiosity, did you change this because it doesn't work with CategoricalArrays?

Ah, I've been meaning to discuss this w/ you. The approach I took in CSV is that when a column is detected as categorical, the actual value returned from parsefield(io, CategoricalValue, ...) is a WeakRefString. This works because in the DataStreams world, you always pre-allocate the output vector and doing setindex!(::CategoricalArray) or push!(::CategoricalArray w/ a WeakRefString works and is more efficient. In the case of Query though, we're only ever iterating rows from the Source, so there's not necessarily the same pre-allocation that's guaranteed to happen. It's easier for now to just say categorical=false, but definitely something we need to think about as CSV integrates w/ more frameworks.

OK, I'm not sure I see the big picture, but better discuss this elsewhere anyway.

davidanthoff · 2017-10-07T00:21:51Z

Thanks for taking a stab at this, I very much appreciate it!

I'm still under water with other projects (and will be for the next 1-2 weeks), so I only took a quick look (and didn't actually try to run anything). Two questions/issues:

This would have to work not just for NamedTuples, but for any composite type. I think this wouldn't work for queries like df |> @select(_.a=>_.b) |> Dict or say df |> @select(Foo(_.a, _.b)) |> Set (assuming Foo is something like struct Foo{T1,T2} a::T2 b::T1 end with line breaks), right? Or does it?
I think this moves a call to Base._return_type into the most inner loop, right? Can the compiler somehow eliminate that in the generated code, or will that actually run for every element? The latter would presumably be quite an issue for performance.

quinnj · 2017-10-08T08:46:46Z

This would have to work....

A question I have is would we always expect the output composite type to have the form T{T1, T2, ...}; a::T1; b::T2; end? If so, then I think my approach here would generalize to that (with some tweaking). And actually, now that I think about it some more, I guess that's the only case where we see issues w/ codegen trying to split unions (because otherwise the struct has a set type and the argument needs to be that type anyway).

nalimilan · 2017-10-08T14:25:48Z

I think this moves a call to Base._return_type into the most inner loop, right? Can the compiler somehow eliminate that in the generated code, or will that actually run for every element? The latter would presumably be quite an issue for performance.

AFAIK calls to return_type are resolved at compile time, so it doesn't matter whether it's called outside or inside a loop.

quinnj · 2017-10-10T05:38:13Z

@davidanthoff, just let me know when you have some time to review and I'm happy to discuss here or we can have a higher-velocity chat on slack. No rush.

davidanthoff · 2017-10-24T23:51:13Z

@quinnj Alright, I had another look and thought a bit about it. Here are some (difficult) issues I see with this approach:

As far as I can tell this approach will only work if the constructor call of the element that is being return actually occurs in the macro expression. But even a simple refactoring of say df |> @select(_.a=>_.b) |> Dict into f(x) = x.a=>x.b and df |> @select(f(_)) |> Dict would not work and couldn't be made to work, right?
Really this needs to work for any type, not just types of the form you quoted. The type I wrote down for example (on purpose) has the order of the type parameters be different from the order of the arguments in the constructor. It is not clear to me how that could be handled with this approach here.
In general, this needs to work with an outer constructor that looks like say function Foo(a::T,b::S) = Foo{T,T,S}(a,a,b). I don't see how that could be achieved with this approach?
I've generally tried to only do things in these macros where I have some hope that eventually julia base/language will provide similar features so that eventually the macros can be replaced with normal functions. This rewrite here looks like something that would never be accepted into base, though, right?
This is similar to the above point: I really hope that at one point we can make things like DataFrame({i.a,i.b} for i in df) and Dict(i.a=>i.b for i in df) work, but for that we can't rely on macro tricks like the one in this PR.
Right now the dependency on inference is only in the backend implementations. It would be entirely feasible to have a different backend that does not rely on inference at all. I've been thinking a lot about such a backend (possibly one that replaces the current default one) because the reliance on inference is really, really not good here. But with this change here the dependency on inference moves into the front-end syntax and would actually apply to all backends, i.e. one could no longer write a backend that doesn't rely on inference. This is a really major issue. The reliance on inference is really the biggest design problem in Query.jl right now, but currently it is relatively contained and at least theoretically feasible to work around it. With this PR we would close that door, i.e. it would no longer be possible to write a backend that doesn't use inference.
This point is kind of subtle: I tried really, really hard in the design of Query.jl to not change the semantics of any expression that appears in a query relative to what it would mean outside of a query. The only instance where I break that rule right now is in the .. operator, and I already regret it :) (plus I have a plan to undo that mistake!). But other than that, the various macros only ever add syntax, they never change the semantics of a given syntax. In my mind that kind of principle is really, really key for building a composable system. The strategy in this PR does away with this: essentially users would write constructor calls that in "normal" julia code would create a certain type, but in a query the same expression all of a sudden would create some other type.
This would make it a whole lot more difficult to write a backend that lives in the Queryable world. I don't think impossible, but certainly more tricky because one would get much more complicated ASTs with this PR.
I guess at the end of the day this looks like a complicated hack compared to the current design :) Certainly a very clever hack, and one I wasn't able to come up with, but still. The whole LINQ design is centered around this idea that you use these monadic designs for individual small building blocks, and then things just happen to compose well together and work smoothly. I've pretty much managed to carry that over to Query, but it does require a missing data story that is monad-like (like DataValue) to work. Just an example of this: Query.jl doesn't use anything from DataValues.jl anywhere right now. QueryOperators.jl has one query operator that depends on DataValue, the entire rest doesn't even know that things like missing values exist (and I have a plan to get rid of that dependency as well). But both just happen to work great if a source spits out DataValues, without any special casing. It works so nicely because both DataValues.jl and Query.jl are based around monads, and monads happen to compose nicely. I don't think any of this is an accident, the folks that designed LINQ had been researching how monads enable really composable designs for decades, and voila, LINQ and Query are pretty much an example that they were right. With this PR here we would essentially give up on that, now we would have a design where e.g. we have code in Query that is heavily dependent on a specific missing data value story.

quinnj added 2 commits October 5, 2017 09:43

Port to Nulls

eb65cae

A little cleanup

4559cf3

This was referenced Oct 6, 2017

WIP: Port to Nulls queryverse/QueryOperators.jl#2

Closed

WIP: Port to Nulls queryverse/IterableTables.jl#62

Closed

nalimilan mentioned this pull request Oct 6, 2017

Port dependencies to Missing and CategoricalArray JuliaData/DataFrames.jl#1232

Closed

nalimilan reviewed Oct 6, 2017

View reviewed changes

abalkin mentioned this pull request Oct 11, 2017

Pick proper abstractions to interoperate with kdb+ tables enlnt/Q.jl#55

Open

cjprybol mentioned this pull request Oct 12, 2017

Remove Query.jl from Query frameworks JuliaData/DataFrames.jl#1255

Closed

quinnj mentioned this pull request Dec 19, 2017

MySQL v0.2.0 is incompatible with DataFrames v0.11 and prevents a DataFrames upgrade to v0.11. JuliaDatabases/MySQL.jl#88

Closed

davidanthoff closed this Aug 26, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Port to Nulls.jl #155

WIP: Port to Nulls.jl #155

quinnj commented Oct 6, 2017 •

edited

Loading

nalimilan Oct 6, 2017

quinnj Oct 6, 2017

nalimilan Oct 6, 2017

davidanthoff commented Oct 7, 2017

quinnj commented Oct 8, 2017

nalimilan commented Oct 8, 2017

quinnj commented Oct 10, 2017

davidanthoff commented Oct 24, 2017

WIP: Port to Nulls.jl #155

WIP: Port to Nulls.jl #155

Conversation

quinnj commented Oct 6, 2017 • edited Loading

nalimilan Oct 6, 2017

Choose a reason for hiding this comment

quinnj Oct 6, 2017

Choose a reason for hiding this comment

nalimilan Oct 6, 2017

Choose a reason for hiding this comment

davidanthoff commented Oct 7, 2017

quinnj commented Oct 8, 2017

nalimilan commented Oct 8, 2017

quinnj commented Oct 10, 2017

davidanthoff commented Oct 24, 2017

quinnj commented Oct 6, 2017 •

edited

Loading