WIP: parse . in formulas #28
Adds basic support for . in formulas like '~ .' to the highlevel interface, to indicate all otherwise unused variables. It *does not* support embedding . in arbitrary python strings like '~ np.log(.)'. I suppose that would be nice, in theory, but it would require rebuilding Python expressions, instead of just using patsy's formula language. Let me know what you think. It definitely needs more tests, documentation and exploration of the various edge cases.
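To make the intended meaning concrete, here is a minimal sketch of expanding a bare . by hand. It is not the code in this patch and does not use patsy at all: `expand_dot` is a hypothetical helper, it assumes the formula contains a `~` and that terms are joined only by `+`, and it treats any identifier appearing anywhere in the formula string as "used".

```python
import re

def expand_dot(formula, data):
    """Hypothetical helper: rewrite a standalone '.' term as the unused column names."""
    lhs, _, rhs = formula.partition("~")
    terms = [term.strip() for term in rhs.split("+")]
    # Assumption: any identifier appearing anywhere in the formula counts as "used".
    used = set(re.findall(r"[A-Za-z_]\w*", formula))
    unused = [name for name in data if name not in used]
    expanded = []
    for term in terms:
        if term == ".":
            expanded.extend(unused)   # only a standalone '.' is expanded
        else:
            expanded.append(term)     # a '.' inside e.g. np.log(.) is left untouched
    return (lhs.strip() + " ~ " + " + ".join(expanded)).strip()

data = {"y": [1, 2], "x1": [3, 4], "x2": [5, 6]}
print(expand_dot("y ~ .", data))       # y ~ x1 + x2
print(expand_dot("y ~ x1 + .", data))  # y ~ x1 + x2
```

A term such as `np.log(.)` passes through unchanged, which mirrors the limitation described above.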
if factor not in cat_sniffers:
    cat_sniffers[factor] = CategoricalSniffer(NA_action,
                                              factor.origin)
print NA_action, cat_sniffers[factor], value
cough
Awesome, thanks for the patch! You're probably the first person to actually read desc.py/build.py besides myself so that's exciting too :-).

(Regarding whitespace, yeah, I've been meaning to clean those up at some point. If you use emacs then I recommend ethan-wspace.el to avoid making ugly patches.)

Anyway: this is an interesting strategy for handling '.'. Right now, it definitely works differently than R formulas. I'm not sure if we want to follow R's implementation exactly, but they do handle two tricky cases that I think are hard to handle with the strategy currently in this PR:
In R, these both give a right-hand side (RHS) design matrix that has IIUC, with your current patch, both of these will give RHS designs that have both x1 and x2. In the first case, it's because the code handles the two design matrices separately, so it can't tell that x1 has been used on the LHS; in the second case, the formula evaluation machinery interprets the (And in the long run, we'll also probably want to support R's other semantics for I think example 2 means that we have to somehow pass the So maybe a better approach would be: add a 'context' argument to all the eval functions, which will contain information about whether we're on the LHS or RHS, for the RHS it will contain information on the contents of the LHS (both set up by the Also little note: we'll want to use
What do you think?
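The 'context' idea proposed above could look roughly like the sketch below. Every name in it (`EvalContext`, `side`, `lhs_variables`, `eval_dot`) is hypothetical and invented for illustration; none of it is patsy's API. It only shows how an eval function for . could consult a context object to learn which side of the formula it is on and which variables the LHS already used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalContext:
    """Hypothetical context object passed to every eval function."""
    side: str                                # "lhs" or "rhs"
    lhs_variables: frozenset = frozenset()   # names used on the LHS; filled in for the RHS

def eval_dot(columns, context):
    """Expand '.' to the columns the context says are still available."""
    if context.side == "lhs":
        # Nothing has been consumed yet when evaluating the LHS.
        return list(columns)
    # On the RHS, skip anything the LHS already used.
    return [name for name in columns if name not in context.lhs_variables]

columns = ["x1", "x2"]
rhs_context = EvalContext(side="rhs", lhs_variables=frozenset({"x1"}))
print(eval_dot(columns, rhs_context))  # ['x2']
```

Because the context is immutable and passed in, the eval functions do not need to share or mutate state to know that x1 was used on the LHS.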
Actually, example 1 is equivalent to

Example 2 should definitely be resolved, and for this I would agree that something like a "context" argument to keep track of what we've looked at is cleaner than using the

To make the design trade-offs a little clearer, let me share another example I had in mind with

dmatrix("np.sqrt(x1) + .", {"x1": ..., "x2": ...}) # example 3

In the current patch, this formula is equivalent to

I am inclined to switch to the R behavior, since this would let us get rid of

Let me give that a try...
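The trade-off in example 3 comes down to what counts as "already used". The sketch below contrasts two possible readings on a list of term strings. It is an illustration only: `expand_dot_terms` and its `skip_mentioned` flag are hypothetical, and it makes no claim about which reading matches the patch and which matches R.

```python
import re

def expand_dot_terms(terms, columns, skip_mentioned):
    """Hypothetical helper contrasting two readings of '.' in a term list."""
    if skip_mentioned:
        # Reading A: a column is "used" if it is mentioned anywhere, even inside a call.
        used = set()
        for term in terms:
            used.update(re.findall(r"[A-Za-z_]\w*", term))
    else:
        # Reading B: a column is "used" only if it appears as a term on its own.
        used = {term for term in terms if term in columns}
    expanded = []
    for term in terms:
        if term == ".":
            expanded.extend(c for c in columns if c not in used)
        else:
            expanded.append(term)
    return expanded

terms = ["np.sqrt(x1)", "."]
columns = ["x1", "x2"]
print(expand_dot_terms(terms, columns, skip_mentioned=True))   # ['np.sqrt(x1)', 'x2']
print(expand_dot_terms(terms, columns, skip_mentioned=False))  # ['np.sqrt(x1)', 'x1', 'x2']
```

Reading A needs to look inside Python expressions to find variable names, while reading B only needs the top-level terms.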
OK, take another look at this patch. I'm a lot happier with the design here -- I appreciate your feedback and willingness to work with me!
Another edge case to consider (from my most recent patch):
In R, both of these formulas yield the same result (the second one). I don't think we could get that result without doing another pass over the data.
In this latest patch (which has reverted many of the changes I made along the way), the meaning of
I think this meaning can be reasoned about and covers the important use cases. It does make
Sorry for not getting back to you on this earlier. I like this better, but I can't wrap my head around the thing where

So my suggestion is that the one piece of information about the formula that

And then instead of having all the eval functions mutate the

Two more points, that I'll just note and we can worry about the details of later, after we've sorted out the big picture stuff:
Not sure why Travis is not noticing this to test it... going to close/re-open and see if that helps.
On further investigation it looks like I led you astray on the
That sounds like a reasonable strategy to me. You've clearly put a lot of thought into designing Patsy in a careful way, so it makes sense to do it right. I will give it a shot when I have the chance.
Closing this since I don't think I'll get around to doing this to your standards... hopefully this discussion was helpful for you! Patsy turned out to be pretty complex on the inside.
Adds basic support for . in formulas like '~ .' to the highlevel
interface, to indicate all otherwise unused variables.
It does not support embedding . in arbitrary python strings like
'~ np.log(.)'. I suppose that would be nice, in theory, but it would
require rebuilding Python expressions, instead of just using patsy's
formula language.
Let me know what you think. It definitely needs more tests,
documentation and exploration of the various edge cases.
Apologies for all the extra lines in the pull request -- my editor
automatically trims trailing whitespace when saving a file.
CC: #10