Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: parse . in formulas #28

Closed
wants to merge 6 commits into from
Closed

WIP: parse . in formulas #28

wants to merge 6 commits into from

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Oct 20, 2013

Adds basic support for . in formulas like '~ .' to the highlevel
interface, to indicate all otherwise unused variables.

It does not support embedding . in arbitrary python strings like
'~ np.log(.)'. I suppose that would be nice, in theory, but it would
require rebuilding Python expressions, instead of just using patsy's
formula language.

Let me know what you think. It definitely needs more tests,
documentation and exploration of the various edge cases.

Apologies for all the extra lines in the pull request -- my editor
automatically trims trailing whitespace when saving a file.

CC: #10

Adds basic support for . in formulas like '~ .' to the highlevel
interface, to indicate all otherwise unused variables.

It *does not* support embedding . in arbitrary python strings like
'~ np.log(.)'. I suppose that would be nice, in theory, but it would
require rebuilding Python expression, instead of just using patsy's
formula language.

Let me know what you think. It definitely needs more tests,
documentation and exploration of the various edge cases.
if factor not in cat_sniffers:
cat_sniffers[factor] = CategoricalSniffer(NA_action,
factor.origin)
print NA_action, cat_sniffers[factor], value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cough

@njsmith
Copy link
Member

njsmith commented Oct 20, 2013

Awesome, thanks for the patch! You're probably the first person to actually read desc.py/build.py besides myself so that's exciting too :-). (Regarding whitespace, yeah, I've been meaning to clean those up at some point. If you use emacs then I recommend ethan-wspace.el to avoid making ugly patches.)

Anyway: this is an interesting strategy for handling . that I hadn't really considered. Having had some time to think about it now though, I'm sure how it can get the right semantics.

Right now, it definitely works differently than R formulas. I'm not sure if we want to follow R's implementation exactly, but they do handle two tricky cases that I think are hard to handle with the strategy currently in this PR:

dmatrices("x1 ~ .", {"x1": ..., "x2": ...})  # example 1
dmatrix("~ . - x1", {"x1": ..., "x2": ...})  # example 2

In R, these both give a right-hand side (RHS) design matrix that has x2 but not x1.

IIUC, with your current patch, both of these will give RHS designs that have both x1 and x2. In the first case, it's because the code handles the two design matrices separately, so it can't tell that x1 has been used on the LHS; in the second case, the formula evaluation machinery interprets the - x1 before the . gets expanded, so the - x1 evaluates away to a no-op. Does that sound right?

(And in the long run, we'll also probably want to support R's other semantics for ., with the update function where . means "insert here whatever terms you find in this other template formula". It's commonly used like update(model, . ~ . - x1), so it has similar issues to example 2, where you need to expand the . immediately before evaluating the rest of the formula parse tree.)

I think example 2 means that we have to somehow pass the . template information down into the formula evaluation machinery, so that it can be immediately substituted in when we evaluate .. And I think example 1 means that this substitution machinery needs to be able to somehow see the result of evaluating the LHS while it's in the middle of evaluating the RHS.

So maybe a better approach would be: add a 'context' argument to all the eval functions, which will contain information about whether we're on the LHS or RHS, for the RHS it will contain information on the contents of the LHS (both set up by the ~ evaluator and passed down recursively), and information on what . expands to, and then the dot eval function could just generate the right IntermediateExpr directly.

Also little note: we'll want to use EvalFactor for the auto-generated factors, so that - will work. Generating an EvalFactor for an arbitrary variable though needs something like

if is_valid_python_varname(name):
    code = name
else:
    code = "Q('%s')" % (name.encode("string_escape"),)

What do yo uthink?

@shoyer
Copy link
Member Author

shoyer commented Oct 20, 2013

Actually, example 1 is equivalent to "x1 ~ x2" in the current patch, thanks to the magic side effects of MarkedContainer. I'm generally not a fan of side effects (for exactly this reason!) but I think they are the only way to see what Python's built-in eval looks at.

Example 2 should definitely be resolved, and for this I would agree that something like a "context" argument to keep track of what we've looked is cleaner than using the MarkedContainer.

To make the design trade-offs a little clearer, let me share another example I had in mind with MarkedContainer:

dmatrix("np.sqrt(x1) + .", {"x1": ..., "x2": ...}) # example 3

In the current patch, this formula is equivalent to "np.sqrt(x1) + x2". It counts as "using" x1 even it's wrapped in a python expression. It occurs to me now that this is not necessarily the desired behavior. R expands an expression like this one to `"np.sqrt(x1) + x1 + x2".

I am inclined to switch to the R behavior, since this would let us get rid of MarkedContainer, which is honestly a bit of a hack, by using an explicit "context" argument instead. This would also let us avoid the second pass in design_matrix_builders, since we don't need to worry about evaluating all the python expressions before expanding ..

Let me give that a try...

@shoyer
Copy link
Member Author

shoyer commented Oct 21, 2013

OK, take another look at this patch. I'm a lot happier with the design here -- I appreciate your feedback and willingness to work with me!

@shoyer
Copy link
Member Author

shoyer commented Oct 21, 2013

Another edge case to consider (from my most recent patch):

In [6]: patsy.dmatrix("~ x:y + .", {'x': [0], 'y': [1], 'z': [2]})
Out[6]:
DesignMatrix with shape (1, 3)
  Intercept  x:y  z
          1    0  2
  Terms:
    'Intercept' (column 0)
    'x:y' (column 1)
    'z' (column 2)

In [7]: patsy.dmatrix("~ . + x:y", {'x': [0], 'y': [1], 'z': [2]})
Out[7]:
DesignMatrix with shape (1, 5)
  Intercept  y  x  z  x:y
          1  1  0  2    0
  Terms:
    'Intercept' (column 0)
    'y' (column 1)
    'x' (column 2)
    'z' (column 3)
    'x:y' (column 4)

In R, both of these formulas yield the same result (the second one). I don't think we could get that result without doing another pass over the data.

@shoyer
Copy link
Member Author

shoyer commented Oct 21, 2013

In this latest patch (which has reverted many of the changes I made along the way), the meaning of . is:

When the symbol . is encountered, it expands to the sum of all variables in the supplied data not already explicitly referenced by name in the formula.

I think this meaning can be reasoned about and covers the important use cases. It does make . slightly less useful for including on the LHS of formulas, but you can't put . on the LHS in an R formula at all. And it's actually more flexible than parsing "~ . + x:y" and "~ x:y + ." in the same way like R.

@njsmith
Copy link
Member

njsmith commented Oct 25, 2013

Sorry for not getting back to you on this earlier.

I like this better, but I can't wrap my head around the thing where .'s meaning depends on exactly what order things are evaluated in. Like adding parentheses can change what . evaluates to and all kinds of wacky stuff. Sure it's more flexible, but does anyone actually want . + x:y and x:y + . to be different?

So my suggestion is, that the one piece of information about the formula that . gets told is: what the final termlist on the LHS was. And all the variables that aren't mentioned there, are included by .. That seems a lot more predictable (and is also a lot closer to what R does). Does that seem reasonable to you?

And then instead of having all the eval functions mutate the Evaluator (mutable state sucks!), we'd just have _eval_tilde pass this information down when it evaluates the RHS, in a new third eval function argument.

Two more points, that I'll just note and we can worry about the details of later, after we've sorted out the big picture stuff:

  • R is clever enough to handle things like log(y) ~ . correctly (leaving out y from the RHS). I think we can and probably should do this too. (Requires adding another method to factors which returns a list of the data variables they reference; for EvalFactor we already calculate this, so it'd be easy to do, modulo weirdness around Q().)
  • Instead of hard-coding the behaviour where . expands to all-unmentioned-variables, from_formula should accept some sort of strategy object that implements the actual . expansion. That way it doesn't have to deal with data_iter_makers directly, and when we want to implement other behaviour like update then all we have to do is pass in a new strategy object.

@njsmith
Copy link
Member

njsmith commented Oct 25, 2013

Not sure why Travis is not noticing this to test it... going to close/re-open and see if that helps.

@njsmith njsmith closed this Oct 25, 2013
@njsmith njsmith reopened this Oct 25, 2013
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.11%) when pulling 0100c23 on shoyer:add-dot into 0a7c659 on pydata:master.

@njsmith
Copy link
Member

njsmith commented Oct 26, 2013

On further investigation it looks like I led you astray on the string_escape thing. It doesn't work in py3, and it doesn't do what we want anyway (e.g. it leaves literal quote marks unquoted). Looks like the way to go on that is just to use repr() instead.

@shoyer
Copy link
Member Author

shoyer commented Oct 28, 2013

That sounds like a reasonable strategy to me. You've clearly put a lot of thought into designing Patsy in a careful way, so it makes sense to do it right. I will give it a shot when I have the chance.

@shoyer
Copy link
Member Author

shoyer commented Dec 14, 2013

Closing this since I don't think I'll get around to doing this to your standards... hopefully this discussion was helpful for you! Patsy turned out to be pretty complex on the inside.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants