
Capture vars not env #66

Merged
merged 15 commits into pydata:master from chrish42:capture-vars-not-env on May 20, 2015

Conversation

chrish42
Contributor

This is the current state of my work to capture only the required variables for patsy formulas instead of the whole environment.

Current status:

  • Add new methods to EvalEnvironment to create an environment that contains only a subset of variables (see the sketch after this list).
  • Refactor EvalFactor (and update all users) to move the eval_env parameter out of its initializer and into the state.
  • Additions to doc/changes.rst (esp. to document the incompatible changes!)
  • Do a quick pass through the docs to catch anything that's now out-of-date.
  • Add "end-to-end" test or two -- something that tests the user-visible behaviour directly, like running a formula through dmatrix, then modifying the original environment, and then running some more data through the same builder (like we were doing predictions) and checking that it is unaffected by our changes to the env. (test_highlevel.py is where the other tests like this go.)
  • Create right subset EvalEnvironment and stash it into the state dict for EvalFactor.
  • Figure out the right thing to do when variables are shadowed between the EvalFactor's environment and the data. (Can be done together with #13, "TypeError: 'ClassRegistry' object is not callable", instead. Not a blocker for this.)
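
A minimal sketch of the environment-subsetting idea from the first item above (editor's illustration only; patsy's real EvalEnvironment carries more machinery, e.g. code evaluation and flag handling):

class EvalEnvironment(object):
    """Simplified stand-in for patsy's real EvalEnvironment."""

    def __init__(self, namespaces, flags=0):
        self._namespaces = list(namespaces)
        self.flags = flags

    def _lookup(self, name):
        # Earlier namespaces shadow later ones, like a scope chain.
        for ns in self._namespaces:
            if name in ns:
                return ns[name]
        raise NameError(name)

    def subset(self, names):
        # Keep only the names a formula actually uses, collapsed into one
        # small dict, so the rest of the caller's frame can be dropped.
        new_ns = dict((name, self._lookup(name)) for name in names)
        return EvalEnvironment([new_ns], self.flags)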

When this is complete, it will fix bug #25.

…expression is seen in patsy term; add tests.
…es_needed and stored in state variable instead. Requires adding an 'eval_env' parameter to design_matrix_builders function.
@coveralls

Coverage Status

Coverage decreased (-0.05%) to 97.8% when pulling 74c6dbf on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.86% when pulling 10f79f5 on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

    # more obscure backtrace later on.
    if not isinstance(eval_env, six.integer_types + (EvalEnvironment,)):
        raise TypeError("Parameter 'eval_env' must be either an integer or an instance" +
                        "of patsy.EvalEnvironment.")
Member


Nitpicks:

  • The "+" is unnecessary here -- the Python lexer (like that in many languages, e.g. C) will concatenate adjacent string tokens at compile time, e.g. "foo" "bar" is the same as "foobar" (and so is "foo" 'bar' for that matter). It's even slightly more efficient, b/c + is evaluated at runtime, while leaving it out gives a single string directly in the .pyc. Not that this efficiency matters at all :-). But as a style/consistency thing I leave the + out.
  • You're missing a space, this string says "instanceof" :-)
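
A quick illustration of the concatenation feature described above (editor's example):

# Adjacent string literals are concatenated at compile time; no "+" needed:
assert "foo" "bar" == "foobar"
assert "foo" 'bar' == "foobar"   # quote styles can even be mixed

# So the message in the diff can be split across lines without a runtime "+"
# (note the trailing space, which also fixes the "instanceof" typo):
msg = ("Parameter 'eval_env' must be either an integer or an instance "
       "of patsy.EvalEnvironment.")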

Contributor Author


Yeah. I'm aware of this feature, but I vaguely remember being bitten by it (although maybe it was in C instead of Python), so I personally tend to prefer being explicit unless necessary. But given that this is the style of the rest of the code base, I'll go with it. Done.

@njsmith
Member

njsmith commented Apr 24, 2015

Hi Christian! As you can see I've mostly recovered from PyCon, and did a quick pass for general style on what you have so far :-). Looking forward to seeing the rest!

@@ -614,7 +614,7 @@ def _make_term_column_builders(terms,
         term_to_column_builders[term] = column_builders
     return new_term_order, term_to_column_builders

-def design_matrix_builders(termlists, data_iter_maker, NA_action="drop"):
+def design_matrix_builders(termlists, data_iter_maker, eval_env=0, NA_action="drop"):
Member


After thinking about this a little more... I think maybe we should actually make eval_env a required argument here. This is the lower-level machinery for those who are doing clever programmatic things, so it's safer I think to insist the user think about how they want to handle the evaluation environment.

Contributor Author


Sounds good to me too. Done.
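
For illustration, a sketch of what call sites look like once eval_env is required (the wrapper and its arguments are made-up names; EvalEnvironment.capture and design_matrix_builders are real patsy APIs):

from patsy import EvalEnvironment
from patsy.build import design_matrix_builders

def my_clever_wrapper(termlists, data_iter_maker):
    # Capture the *caller's* environment (depth 1 = one frame up from here)
    # and pass it explicitly -- there is no longer a default.
    eval_env = EvalEnvironment.capture(1)
    return design_matrix_builders(termlists, data_iter_maker, eval_env)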

@chrish42
Contributor Author

Hi Nathaniel! Thanks for the review. I will reply to / apply your comments once I'm done with fixing the tests for using a subset of the environment.

I am currently in the process of fixing said tests. Instead of comparing state["eval_env"] to the captured EvalEnvironment, I now build an EvalEnvironment containing the correct subset, and compare it to state["eval_env"]. However, that cannot work, as the comparison for equality of EvalEnvironments only compares on identity of each namespace dict. How would you feel if I made EvalEnvironment.__eq__ check that each namespace dict has the same keys (by equality) and values (by identity) instead?
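
(For concreteness, an editor's sketch of the per-dict comparison semantics being proposed here -- keys by equality, values by identity -- not code from the PR:)

def namespaces_equal(ns1, ns2):
    # Same keys (compared by equality), same values (compared by identity).
    return (set(ns1) == set(ns2)
            and all(ns1[k] is ns2[k] for k in ns1))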

@njsmith
Member

njsmith commented Apr 29, 2015

Hi Christian,

Sounds like a good plan. Before I answer your actual question, can you back up a step and elaborate on why you need equality comparisons in the first place? The old code needed to define equality on EvalEnvironment because it had to define equality on EvalFactor, and EvalFactor used to have an EvalEnvironment. But now that it doesn't, I think EvalEnvironment.__eq__ should just be unused (and maybe even removed)?

@chrish42
Contributor Author

I need it for the tests. Here is one of the tests that I am in the process of fixing up (in eval.py):

def test_EvalFactor_memorize_passes_needed():
    from patsy.state import stateful_transform
    foo = stateful_transform(lambda: "FOO-OBJ")
    bar = stateful_transform(lambda: "BAR-OBJ")
    quux = stateful_transform(lambda: "QUUX-OBJ")
    e = EvalFactor("foo(x) + bar(foo(y)) + quux(z, w)")

    state = {}
    eval_env = EvalEnvironment.capture(0)
    passes = e.memorize_passes_needed(state, eval_env)
    expected_eval_env = EvalEnvironment([{'foo': foo,
                                          'bar': bar,
                                          'quux': quux}],
                                        eval_env.flags)
    assert passes == 2
    assert state["eval_env"] == expected_eval_env
# More stuff cut.

The last line is why I need equality on EvalEnvironment. (Before, the code just did assert state["eval_env"] == eval_env.) Maybe my previous comment makes more sense in light of this code? I think equality on EvalEnvironment needs to be defined for these tests, likely as I described it. Although, if equality is used only for the tests, we could also pull it out of the class and put it into a test-only helper function? Anyway, let me know what you think is best here.

@njsmith
Member

njsmith commented Apr 29, 2015

Ah, I see. My preference would still be to not define equality on EvalEnvironment at all (just because semantically it's extremely tricky -- e.g. two EvalEnvironment objects might act differently if they have different dict objects in them, because one thing you can do with an EvalEnvironment is execute code that modifies stuff in the environment -- and we don't really need it.) And for tests like this, just test that the EvalEnvironment does in fact have the correct behavior, e.g.:

# Variables in the current environment that were mentioned in the formula are accessible
assert state["eval_env"].eval("foo") is foo
# But other variables in the current environment that were not mentioned are not available
assert_raises(NameError, state["eval_env"].eval, "state")

(This isn't as thorough as a full equality check, but we already have thorough tests for EvalEnvironment.subset and ast_names, right? So it isn't crucial to reproduce those tests here IMHO.)
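
(For context, ast_names is the helper in patsy.eval that extracts the names an expression references, which is what drives the subsetting. A quick illustration, assuming it can be imported directly from patsy.eval:)

from patsy.eval import ast_names

# ast_names walks the expression's AST and yields every name it uses:
names = set(ast_names("foo(x) + bar(foo(y)) + quux(z, w)"))
assert names == set(["foo", "x", "bar", "y", "quux", "z", "w"])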

@chrish42
Contributor Author

Sounds reasonable to me. I'll move in that direction. I'll update the pull request when all the tests pass again.

@chrish42
Contributor Author

Here's the updated pull request, using the subset of the environment in the state. Let me know what you think.

@chrish42
Contributor Author

Also, what do you think is the right thing to do regarding shadowing of variables between the EvalEnvironment and the DataFrame (or other data object) passed to the formula?
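
To make the question concrete, a small illustration of the kind of shadowing meant here (editor's example; today the data silently takes priority over the environment):

import numpy as np
from patsy import dmatrix

x = np.array([1.0, 2.0, 3.0])                # an x in the calling environment...
data = {"x": np.array([10.0, 20.0, 30.0])}   # ...and an x in the data

# The data column shadows the environment variable; the open question is
# whether patsy should warn (or error) in cases like this.
mat = dmatrix("x", data)   # uses data["x"], not the environment's x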

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.87% when pulling 5b39dbf on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.


@njsmith
Member

njsmith commented May 4, 2015

Looking pretty good! Things I can think of besides the shadowing issue:

  • we need some additions to doc/changes.rst (esp. to document the incompatible changes!)
  • need to do a quick pass through the docs to catch anything that's now out-of-date.
  • I'd like to see an "end-to-end" test or two -- something that tests the user-visible behaviour directly: run a formula through dmatrix, modify the original environment, then run more data through the same builder (as if we were making predictions) and check that it is unaffected by the changes to the env. (test_highlevel.py is where the other tests like this go; a rough sketch follows this list.)
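
A rough sketch of that kind of end-to-end test (editor's illustration, not the test that was actually added; it uses the modern API, where build_design_matrices accepts DesignInfo objects -- at the time of this PR it took builder objects):

import numpy as np
from patsy import dmatrix, build_design_matrices

def test_builder_unaffected_by_env_changes():
    double = lambda x: 2 * x
    data = {"x": np.array([1.0, 2.0, 3.0])}
    mat = dmatrix("double(x)", data)

    # Rebind the name in the original environment after the build...
    double = lambda x: 100 * x

    # ...then run new data through the stored builder, as if predicting.
    # The captured subset should still hold the original `double`.
    (new_mat,) = build_design_matrices([mat.design_info],
                                       {"x": np.array([4.0])})
    assert np.allclose(np.asarray(new_mat)[0], [1.0, 8.0])  # intercept, 2*4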

@njsmith
Member

njsmith commented May 4, 2015

Okay, so having had a chance to look over the code that would be affected by shadowing, I think my thought is: some sort of systematic handling of shadowing would probably be a good idea, but it feels like something we can just as well handle as an independent change. In particular, shadowing checks should probably also notice when builtins are shadowed (this has been a persistent issue -- see #13). So I don't think it's a blocker for this PR.

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 97.85% when pulling 8ecb89b on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.87% when pulling 8ecb89b on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@chrish42
Contributor Author

Working on the doc now. I assume you will register this release on Zenodo before release time? (I'm also assuming the next release will be v0.4.0.)

@njsmith
Member

njsmith commented May 14, 2015

Actually, an annoying thing about Zenodo is that you actually cannot get the DOI until after the release is finalized, basically b/c the DOI-generation operation takes the release hash as an input. So I've just been putting the DOIs into the dev docs after the release, and the dev docs are what show up by default on readthedocs anyway, and I guess it's good enough.

And yeah, we'll call it v0.4.0.

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 97.85% when pulling 12c3ce9 on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.87% when pulling 12c3ce9 on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@chrish42
Contributor Author

Just updated the pull request with the results of my pass over the docs. Could you go over the diff for the doc update and see if things still make sense? I've deleted a lot more eval_env, etc. stuff than I added to the docs. I'm a bit worried that this means we missed an additional place where an eval_env argument should have been added.

I've also mentioned the incompatible change to EvalFactor.memorize_passes_needed() in the changelog, even though this is a pretty low-level method (which doesn't even have a docstring). Hopefully that's not too much detail for changes.rst? Anyway, feedback on my doc changes is just as welcome as on the code ones.

I'll add the high-level tests another day, but other than that, this is starting to look pretty complete.

@njsmith
Member

njsmith commented May 18, 2015

The doc changes look good to me -- I think the docs just spend more time talking about the factor interface than the actual builder interface, so it makes sense that you deleted more than you added. Mentioning the EvalFactor.memorize_passes_needed change is fine -- probably no-one cares, but oh well, it won't hurt.

@chrish42
Contributor Author

Added a high-level test. Let me know if you have other ideas for these kinds of tests. Otherwise, please let me know if you have any final comments, if there are still missing bits, etc. (Really looking forward to getting this merged.)

@njsmith
Member

njsmith commented May 20, 2015

Looks good to me!

One last thing: it looks like you have some merge conflicts which are greying out the merge button and preventing the tests from running. Could you merge or rebase from current master to resolve these?

@chrish42
Contributor Author

Merge done.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.87% when pulling 48d7e1e on chrish42:capture-vars-not-env into 845bded on pydata:master.

@njsmith
Member

njsmith commented May 20, 2015

Huzzah! We are good to go.

njsmith added a commit that referenced this pull request May 20, 2015
njsmith merged commit 3819ae9 into pydata:master May 20, 2015
@njsmith
Member

njsmith commented May 20, 2015

Thanks for the great work, and for sticking with it despite my nitpickiness :-).

@chrish42
Contributor Author

Woot! Great!

chrish42 deleted the capture-vars-not-env branch May 20, 2015 02:11
@chrish42
Contributor Author

Haha. No problem! Happy to be starting to make contributions to the scientific Python software stack. Most of your comments were very relevant, I thought. And until I come close to you in terms of code contributed to patsy, it's much more your code base than mine, so some nitpickiness is fully allowed. (I already tried to stick to your coding style as best I could.) I also much prefer code bases with a unified style, even when I don't agree 100% with it. So no problem again.
