WIP: parse . in formulas #28

shoyer · 2013-10-20T01:53:35Z

Adds basic support for . in formulas like '~ .' to the highlevel
interface, to indicate all otherwise unused variables.

It does not support embedding . in arbitrary python strings like
'~ np.log(.)'. I suppose that would be nice, in theory, but it would
require rebuilding Python expressions, instead of just using patsy's
formula language.

Let me know what you think. It definitely needs more tests,
documentation and exploration of the various edge cases.

Apologies for all the extra lines in the pull request -- my editor
automatically trims trailing whitespace when saving a file.

CC: #10

Adds basic support for . in formulas like '~ .' to the highlevel interface, to indicate all otherwise unused variables. It *does not* support embedding . in arbitrary python strings like '~ np.log(.)'. I suppose that would be nice, in theory, but it would require rebuilding Python expression, instead of just using patsy's formula language. Let me know what you think. It definitely needs more tests, documentation and exploration of the various edge cases.

njsmith · 2013-10-20T10:44:22Z

patsy/build.py

                if factor not in cat_sniffers:
                    cat_sniffers[factor] = CategoricalSniffer(NA_action,
                                                              factor.origin)
+                print NA_action, cat_sniffers[factor], value


njsmith · 2013-10-20T16:50:56Z

Awesome, thanks for the patch! You're probably the first person to actually read desc.py/build.py besides myself so that's exciting too :-). (Regarding whitespace, yeah, I've been meaning to clean those up at some point. If you use emacs then I recommend ethan-wspace.el to avoid making ugly patches.)

Anyway: this is an interesting strategy for handling . that I hadn't really considered. Having had some time to think about it now though, I'm sure how it can get the right semantics.

Right now, it definitely works differently than R formulas. I'm not sure if we want to follow R's implementation exactly, but they do handle two tricky cases that I think are hard to handle with the strategy currently in this PR:

dmatrices("x1 ~ .", {"x1": ..., "x2": ...})  # example 1
dmatrix("~ . - x1", {"x1": ..., "x2": ...})  # example 2

In R, these both give a right-hand side (RHS) design matrix that has x2 but not x1.

IIUC, with your current patch, both of these will give RHS designs that have both x1 and x2. In the first case, it's because the code handles the two design matrices separately, so it can't tell that x1 has been used on the LHS; in the second case, the formula evaluation machinery interprets the - x1 before the . gets expanded, so the - x1 evaluates away to a no-op. Does that sound right?

(And in the long run, we'll also probably want to support R's other semantics for ., with the update function where . means "insert here whatever terms you find in this other template formula". It's commonly used like update(model, . ~ . - x1), so it has similar issues to example 2, where you need to expand the . immediately before evaluating the rest of the formula parse tree.)

I think example 2 means that we have to somehow pass the . template information down into the formula evaluation machinery, so that it can be immediately substituted in when we evaluate .. And I think example 1 means that this substitution machinery needs to be able to somehow see the result of evaluating the LHS while it's in the middle of evaluating the RHS.

So maybe a better approach would be: add a 'context' argument to all the eval functions, which will contain information about whether we're on the LHS or RHS, for the RHS it will contain information on the contents of the LHS (both set up by the ~ evaluator and passed down recursively), and information on what . expands to, and then the dot eval function could just generate the right IntermediateExpr directly.

Also little note: we'll want to use EvalFactor for the auto-generated factors, so that - will work. Generating an EvalFactor for an arbitrary variable though needs something like

if is_valid_python_varname(name):
    code = name
else:
    code = "Q('%s')" % (name.encode("string_escape"),)

What do yo uthink?

shoyer · 2013-10-20T21:06:05Z

Actually, example 1 is equivalent to "x1 ~ x2" in the current patch, thanks to the magic side effects of MarkedContainer. I'm generally not a fan of side effects (for exactly this reason!) but I think they are the only way to see what Python's built-in eval looks at.

Example 2 should definitely be resolved, and for this I would agree that something like a "context" argument to keep track of what we've looked is cleaner than using the MarkedContainer.

To make the design trade-offs a little clearer, let me share another example I had in mind with MarkedContainer:

dmatrix("np.sqrt(x1) + .", {"x1": ..., "x2": ...}) # example 3

In the current patch, this formula is equivalent to "np.sqrt(x1) + x2". It counts as "using" x1 even it's wrapped in a python expression. It occurs to me now that this is not necessarily the desired behavior. R expands an expression like this one to `"np.sqrt(x1) + x1 + x2".

I am inclined to switch to the R behavior, since this would let us get rid of MarkedContainer, which is honestly a bit of a hack, by using an explicit "context" argument instead. This would also let us avoid the second pass in design_matrix_builders, since we don't need to worry about evaluating all the python expressions before expanding ..

Let me give that a try...

shoyer · 2013-10-21T00:39:51Z

OK, take another look at this patch. I'm a lot happier with the design here -- I appreciate your feedback and willingness to work with me!

shoyer · 2013-10-21T02:44:07Z

Another edge case to consider (from my most recent patch):

In [6]: patsy.dmatrix("~ x:y + .", {'x': [0], 'y': [1], 'z': [2]})
Out[6]:
DesignMatrix with shape (1, 3)
  Intercept  x:y  z
          1    0  2
  Terms:
    'Intercept' (column 0)
    'x:y' (column 1)
    'z' (column 2)

In [7]: patsy.dmatrix("~ . + x:y", {'x': [0], 'y': [1], 'z': [2]})
Out[7]:
DesignMatrix with shape (1, 5)
  Intercept  y  x  z  x:y
          1  1  0  2    0
  Terms:
    'Intercept' (column 0)
    'y' (column 1)
    'x' (column 2)
    'z' (column 3)
    'x:y' (column 4)

In R, both of these formulas yield the same result (the second one). I don't think we could get that result without doing another pass over the data.

shoyer · 2013-10-21T07:20:36Z

In this latest patch (which has reverted many of the changes I made along the way), the meaning of . is:

When the symbol . is encountered, it expands to the sum of all variables in the supplied data not already explicitly referenced by name in the formula.

I think this meaning can be reasoned about and covers the important use cases. It does make . slightly less useful for including on the LHS of formulas, but you can't put . on the LHS in an R formula at all. And it's actually more flexible than parsing "~ . + x:y" and "~ x:y + ." in the same way like R.

njsmith · 2013-10-25T22:33:13Z

Sorry for not getting back to you on this earlier.

I like this better, but I can't wrap my head around the thing where .'s meaning depends on exactly what order things are evaluated in. Like adding parentheses can change what . evaluates to and all kinds of wacky stuff. Sure it's more flexible, but does anyone actually want . + x:y and x:y + . to be different?

So my suggestion is, that the one piece of information about the formula that . gets told is: what the final termlist on the LHS was. And all the variables that aren't mentioned there, are included by .. That seems a lot more predictable (and is also a lot closer to what R does). Does that seem reasonable to you?

And then instead of having all the eval functions mutate the Evaluator (mutable state sucks!), we'd just have _eval_tilde pass this information down when it evaluates the RHS, in a new third eval function argument.

Two more points, that I'll just note and we can worry about the details of later, after we've sorted out the big picture stuff:

R is clever enough to handle things like log(y) ~ . correctly (leaving out y from the RHS). I think we can and probably should do this too. (Requires adding another method to factors which returns a list of the data variables they reference; for EvalFactor we already calculate this, so it'd be easy to do, modulo weirdness around Q().)
Instead of hard-coding the behaviour where . expands to all-unmentioned-variables, from_formula should accept some sort of strategy object that implements the actual . expansion. That way it doesn't have to deal with data_iter_makers directly, and when we want to implement other behaviour like update then all we have to do is pass in a new strategy object.

njsmith · 2013-10-25T22:33:55Z

Not sure why Travis is not noticing this to test it... going to close/re-open and see if that helps.

coveralls · 2013-10-25T22:55:13Z

Coverage decreased (-0.11%) when pulling 0100c23 on shoyer:add-dot into 0a7c659 on pydata:master.

njsmith · 2013-10-26T11:26:55Z

On further investigation it looks like I led you astray on the string_escape thing. It doesn't work in py3, and it doesn't do what we want anyway (e.g. it leaves literal quote marks unquoted). Looks like the way to go on that is just to use repr() instead.

shoyer · 2013-10-28T06:05:30Z

That sounds like a reasonable strategy to me. You've clearly put a lot of thought into designing Patsy in a careful way, so it makes sense to do it right. I will give it a shot when I have the chance.

shoyer · 2013-12-14T19:52:37Z

Closing this since I don't think I'll get around to doing this to your standards... hopefully this discussion was helpful for you! Patsy turned out to be pretty complex on the inside.

shoyer added 2 commits October 19, 2013 18:33

Fixed categorical variables with . in formulas

e4085a4

njsmith reviewed Oct 20, 2013
View reviewed changes

A less hacky design for . in formulas

d0cd20c

Minor reworking of code for . in formulas

6c4610c

Build . with IntermediateExpr

5e99c30

. codes only generated if . used in a formula

0100c23

njsmith closed this Oct 25, 2013

njsmith reopened this Oct 25, 2013

shoyer mentioned this pull request Oct 30, 2013

Less awkward API for using simple stateful transforms? #31

Open

shoyer closed this Dec 14, 2013

njsmith mentioned this pull request Mar 3, 2015

Feature request: "target ~ ." syntax #10

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: parse . in formulas #28

WIP: parse . in formulas #28

shoyer commented Oct 20, 2013

njsmith Oct 20, 2013

njsmith commented Oct 20, 2013

shoyer commented Oct 20, 2013

shoyer commented Oct 21, 2013

shoyer commented Oct 21, 2013

shoyer commented Oct 21, 2013

njsmith commented Oct 25, 2013

njsmith commented Oct 25, 2013

coveralls commented Oct 25, 2013

njsmith commented Oct 26, 2013

shoyer commented Oct 28, 2013

shoyer commented Dec 14, 2013

WIP: parse . in formulas #28

WIP: parse . in formulas #28

Conversation

shoyer commented Oct 20, 2013

njsmith Oct 20, 2013

Choose a reason for hiding this comment

njsmith commented Oct 20, 2013

shoyer commented Oct 20, 2013

shoyer commented Oct 21, 2013

shoyer commented Oct 21, 2013

shoyer commented Oct 21, 2013

njsmith commented Oct 25, 2013

njsmith commented Oct 25, 2013

coveralls commented Oct 25, 2013

njsmith commented Oct 26, 2013

shoyer commented Oct 28, 2013

shoyer commented Dec 14, 2013