Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Port to Nulls.jl #155

Closed
wants to merge 2 commits into from
Closed

WIP: Port to Nulls.jl #155

wants to merge 2 commits into from

Conversation

quinnj
Copy link
Contributor

@quinnj quinnj commented Oct 6, 2017

Instead of DataValue, uses Union{T, Null} for cases of missing data.

Most tests pass for me locally with the following setup:

  • jq/nulls branches: IterableTables, QueryOperators
  • Master: DataFrames, CSV, DataArrays, SQLite, Feather
  • Latest tagged releases: TypedTables, DataStreams, NullableArrays
  • NamedTuples with this PR[merged]

There seems to be an issue w/ current master Feather on 0.6 that I'm going to dig into.[fixed on master]

As can be seen in the diff, it's actually a surprisingly small change. The only tricky part was handling "select" statement projections, which generated NamedTuple types like @NT(a=1, c=lowercase(i.name)) and now generate them like @NT(a::typeof(1), c::Query.@infer(lowercase(i.name)))(1, lowercase(i.name)); the Query.@infer macro doesn't execute anything, but relies, much like the select iterator itself, on Base._return_type to infer the type of the projection to ensure a strongly typed NamedTuple.

Would love for others to take a look and provide feedback; it's a bit tricky to get the right "state" for testing, but happy to answer any questions.

See corresponding PRs to IterableTables.jl, QueryOperators.jl

@@ -2,9 +2,9 @@ using Query
using DataStreams
using CSV

q = @from i in CSV.Source("data.csv") begin
q = @from i in CSV.Source("data.csv", categorical=false) begin

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, did you change this because it doesn't work with CategoricalArrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I've been meaning to discuss this w/ you. The approach I took in CSV is that when a column is detected as categorical, the actual value returned from parsefield(io, CategoricalValue, ...) is a WeakRefString. This works because in the DataStreams world, you always pre-allocate the output vector and doing setindex!(::CategoricalArray) or push!(::CategoricalArray w/ a WeakRefString works and is more efficient. In the case of Query though, we're only ever iterating rows from the Source, so there's not necessarily the same pre-allocation that's guaranteed to happen. It's easier for now to just say categorical=false, but definitely something we need to think about as CSV integrates w/ more frameworks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm not sure I see the big picture, but better discuss this elsewhere anyway.

@davidanthoff
Copy link
Member

Thanks for taking a stab at this, I very much appreciate it!

I'm still under water with other projects (and will be for the next 1-2 weeks), so I only took a quick look (and didn't actually try to run anything). Two questions/issues:

  1. This would have to work not just for NamedTuples, but for any composite type. I think this wouldn't work for queries like df |> @select(_.a=>_.b) |> Dict or say df |> @select(Foo(_.a, _.b)) |> Set (assuming Foo is something like struct Foo{T1,T2} a::T2 b::T1 end with line breaks), right? Or does it?
  2. I think this moves a call to Base._return_type into the most inner loop, right? Can the compiler somehow eliminate that in the generated code, or will that actually run for every element? The latter would presumably be quite an issue for performance.

@quinnj
Copy link
Contributor Author

quinnj commented Oct 8, 2017

  1. This would have to work....

A question I have is would we always expect the output composite type to have the form T{T1, T2, ...}; a::T1; b::T2; end? If so, then I think my approach here would generalize to that (with some tweaking). And actually, now that I think about it some more, I guess that's the only case where we see issues w/ codegen trying to split unions (because otherwise the struct has a set type and the argument needs to be that type anyway).

@nalimilan
Copy link

I think this moves a call to Base._return_type into the most inner loop, right? Can the compiler somehow eliminate that in the generated code, or will that actually run for every element? The latter would presumably be quite an issue for performance.

AFAIK calls to return_type are resolved at compile time, so it doesn't matter whether it's called outside or inside a loop.

@quinnj
Copy link
Contributor Author

quinnj commented Oct 10, 2017

@davidanthoff, just let me know when you have some time to review and I'm happy to discuss here or we can have a higher-velocity chat on slack. No rush.

@davidanthoff
Copy link
Member

@quinnj Alright, I had another look and thought a bit about it. Here are some (difficult) issues I see with this approach:

  1. As far as I can tell this approach will only work if the constructor call of the element that is being return actually occurs in the macro expression. But even a simple refactoring of say df |> @select(_.a=>_.b) |> Dict into f(x) = x.a=>x.b and df |> @select(f(_)) |> Dict would not work and couldn't be made to work, right?
  2. Really this needs to work for any type, not just types of the form you quoted. The type I wrote down for example (on purpose) has the order of the type parameters be different from the order of the arguments in the constructor. It is not clear to me how that could be handled with this approach here.
  3. In general, this needs to work with an outer constructor that looks like say function Foo(a::T,b::S) = Foo{T,T,S}(a,a,b). I don't see how that could be achieved with this approach?
  4. I've generally tried to only do things in these macros where I have some hope that eventually julia base/language will provide similar features so that eventually the macros can be replaced with normal functions. This rewrite here looks like something that would never be accepted into base, though, right?
  5. This is similar to the above point: I really hope that at one point we can make things like DataFrame({i.a,i.b} for i in df) and Dict(i.a=>i.b for i in df) work, but for that we can't rely on macro tricks like the one in this PR.
  6. Right now the dependency on inference is only in the backend implementations. It would be entirely feasible to have a different backend that does not rely on inference at all. I've been thinking a lot about such a backend (possibly one that replaces the current default one) because the reliance on inference is really, really not good here. But with this change here the dependency on inference moves into the front-end syntax and would actually apply to all backends, i.e. one could no longer write a backend that doesn't rely on inference. This is a really major issue. The reliance on inference is really the biggest design problem in Query.jl right now, but currently it is relatively contained and at least theoretically feasible to work around it. With this PR we would close that door, i.e. it would no longer be possible to write a backend that doesn't use inference.
  7. This point is kind of subtle: I tried really, really hard in the design of Query.jl to not change the semantics of any expression that appears in a query relative to what it would mean outside of a query. The only instance where I break that rule right now is in the .. operator, and I already regret it :) (plus I have a plan to undo that mistake!). But other than that, the various macros only ever add syntax, they never change the semantics of a given syntax. In my mind that kind of principle is really, really key for building a composable system. The strategy in this PR does away with this: essentially users would write constructor calls that in "normal" julia code would create a certain type, but in a query the same expression all of a sudden would create some other type.
  8. This would make it a whole lot more difficult to write a backend that lives in the Queryable world. I don't think impossible, but certainly more tricky because one would get much more complicated ASTs with this PR.
  9. I guess at the end of the day this looks like a complicated hack compared to the current design :) Certainly a very clever hack, and one I wasn't able to come up with, but still. The whole LINQ design is centered around this idea that you use these monadic designs for individual small building blocks, and then things just happen to compose well together and work smoothly. I've pretty much managed to carry that over to Query, but it does require a missing data story that is monad-like (like DataValue) to work. Just an example of this: Query.jl doesn't use anything from DataValues.jl anywhere right now. QueryOperators.jl has one query operator that depends on DataValue, the entire rest doesn't even know that things like missing values exist (and I have a plan to get rid of that dependency as well). But both just happen to work great if a source spits out DataValues, without any special casing. It works so nicely because both DataValues.jl and Query.jl are based around monads, and monads happen to compose nicely. I don't think any of this is an accident, the folks that designed LINQ had been researching how monads enable really composable designs for decades, and voila, LINQ and Query are pretty much an example that they were right. With this PR here we would essentially give up on that, now we would have a design where e.g. we have code in Query that is heavily dependent on a specific missing data value story.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants