
Capture vars not env #66

Merged
merged 15 commits into pydata:master from chrish42:capture-vars-not-env on May 20, 2015

Conversation

chrish42
Contributor

This is the current state of my work to capture only the required variables for patsy formulas instead of the whole environment.

Current status:

  • Add new methods to EvalEnvironment to create an environment that contains only a subset of variables (see the sketch after this list).
  • Refactor EvalFactor (and update all users) to move the eval_env parameter out of its initializer and into the state.
  • Additions to doc/changes.rst (esp. to document the incompatible changes!)
  • Do a quick pass through the docs to catch anything that's now out-of-date.
  • Add "end-to-end" test or two -- something that tests the user-visible behaviour directly, like running a formula through dmatrix, then modifying the original environment, and then running some more data through the same builder (like we were doing predictions) and checking that it is unaffected by our changes to the env. (test_highlevel.py is where the other tests like this go.)
  • Create right subset EvalEnvironment and stash it into the state dict for EvalFactor.
  • Figure out the right thing to do when variables are shadowed between the EvalFactor's environment and the data. (Can be done together with #13, "TypeError: 'ClassRegistry' object is not callable", instead. Not a blocker for this.)
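
A minimal sketch of the environment-subsetting idea from the first item above (editor's illustration only; patsy's real EvalEnvironment carries more machinery, e.g. code evaluation and flag handling):

class EvalEnvironment(object):
    """Simplified stand-in for patsy's real EvalEnvironment."""

    def __init__(self, namespaces, flags=0):
        self._namespaces = list(namespaces)
        self.flags = flags

    def _lookup(self, name):
        # Earlier namespaces shadow later ones, like a scope chain.
        for ns in self._namespaces:
            if name in ns:
                return ns[name]
        raise NameError(name)

    def subset(self, names):
        # Keep only the names a formula actually uses, collapsed into one
        # small dict, so the rest of the caller's frame can be dropped.
        new_ns = dict((name, self._lookup(name)) for name in names)
        return EvalEnvironment([new_ns], self.flags)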

When this is complete, it will fix bug #25.

…expression is seen in patsy term; add tests.
…es_needed and stored in state variable instead. Requires adding an 'eval_env' parameter to design_matrix_builders function.
@coveralls

Coverage Status

Coverage decreased (-0.05%) to 97.8% when pulling 74c6dbf on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.86% when pulling 10f79f5 on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

    # more obscure backtrace later on.
    if not isinstance(eval_env, six.integer_types + (EvalEnvironment,)):
        raise TypeError("Parameter 'eval_env' must be either an integer or an instance" +
                        "of patsy.EvalEnvironment.")
Member


Nitpicks:

  • The "+" is unnecessary here -- the Python lexer (like that in many languages, e.g. C) will concatenate adjacent string tokens at compile time, e.g. "foo" "bar" is the same as "foobar" (and so is "foo" 'bar' for that matter). It's even slightly more efficient, b/c + is evaluated at runtime, while leaving it out gives a single string directly in the .pyc. Not that this efficiency matters at all :-). But as a style/consistency thing I leave the + out.
  • You're missing a space, this string says "instanceof" :-)
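
A quick illustration of the concatenation feature described above (editor's example):

# Adjacent string literals are concatenated at compile time; no "+" needed:
assert "foo" "bar" == "foobar"
assert "foo" 'bar' == "foobar"   # quote styles can even be mixed

# So the message in the diff can be split across lines without a runtime "+"
# (note the trailing space, which also fixes the "instanceof" typo):
msg = ("Parameter 'eval_env' must be either an integer or an instance "
       "of patsy.EvalEnvironment.")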

Contributor Author


Yeah. I'm aware of this feature, but I vaguely remember being bitten by it (although maybe it was in C instead of Python), so I personally tend to prefer being explicit unless necessary. But given that this is the style of the rest of the code base, I'll go with it. Done.

@njsmith
Member

njsmith commented Apr 24, 2015

Hi Christian! As you can see I've mostly recovered from PyCon, and did a quick pass for general style on what you have so far :-). Looking forward to seeing the rest!

@@ -614,7 +614,7 @@ def _make_term_column_builders(terms,
         term_to_column_builders[term] = column_builders
     return new_term_order, term_to_column_builders

-def design_matrix_builders(termlists, data_iter_maker, NA_action="drop"):
+def design_matrix_builders(termlists, data_iter_maker, eval_env=0, NA_action="drop"):
Member


After thinking about this a little more... I think maybe we should actually make eval_env a required argument here. This is the lower-level machinery for those who are doing clever programmatic things, so it's safer I think to insist the user think about how they want to handle the evaluation environment.

Contributor Author


Sounds good to me too. Done.
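
For illustration, a sketch of what call sites look like once eval_env is required (the wrapper and its arguments are made-up names; EvalEnvironment.capture and design_matrix_builders are real patsy APIs):

from patsy import EvalEnvironment
from patsy.build import design_matrix_builders

def my_clever_wrapper(termlists, data_iter_maker):
    # Capture the *caller's* environment (depth 1 = one frame up from here)
    # and pass it explicitly -- there is no longer a default.
    eval_env = EvalEnvironment.capture(1)
    return design_matrix_builders(termlists, data_iter_maker, eval_env)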

@chrish42
Contributor Author

Hi Nathaniel! Thanks for the review. I will reply to / apply your comments once I'm done with fixing the tests for using a subset of the environment.

I am currently in the process of fixing said tests. Instead of comparing state["eval_env"] to the captured EvalEnvironment, I now build an EvalEnvironment containing the correct subset, and compare it to state["eval_env"]. However, that cannot work, as the comparison for equality of EvalEnvironments only compares on identity of each namespace dict. How would you feel if I made EvalEnvironment.__eq__ check that each namespace dict has the same keys (by equality) and values (by identity) instead?
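
(For concreteness, an editor's sketch of the per-dict comparison semantics being proposed here -- keys by equality, values by identity -- not code from the PR:)

def namespaces_equal(ns1, ns2):
    # Same keys (compared by equality), same values (compared by identity).
    return (set(ns1) == set(ns2)
            and all(ns1[k] is ns2[k] for k in ns1))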

@njsmith
Member

njsmith commented Apr 29, 2015

Hi Christian,

Sounds like a good plan. Before I answer your actual question, can you back up a step and elaborate on why you need equality comparisons in the first place? The old code needed to define equality on EvalEnvironment because it had to define equality on EvalFactor, and EvalFactor used to have an EvalEnvironment. But now that it doesn't, I think EvalEnvironment.__eq__ should just be unused (and maybe even removed)?

@chrish42
Contributor Author

I need it for the tests. Here is one of the tests that I am in the process of fixing up (in eval.py):

def test_EvalFactor_memorize_passes_needed():
    from patsy.state import stateful_transform
    foo = stateful_transform(lambda: "FOO-OBJ")
    bar = stateful_transform(lambda: "BAR-OBJ")
    quux = stateful_transform(lambda: "QUUX-OBJ")
    e = EvalFactor("foo(x) + bar(foo(y)) + quux(z, w)")

    state = {}
    eval_env = EvalEnvironment.capture(0)
    passes = e.memorize_passes_needed(state, eval_env)
    expected_eval_env = EvalEnvironment([{'foo': foo,
                                          'bar': bar,
                                          'quux': quux}],
                                        eval_env.flags)
    assert passes == 2
    assert state["eval_env"] == expected_eval_env
# More stuff cut.

The last line is why I need equality on EvalEnvironment. (Before, the code just did assert state["eval_env"] == eval_env.) Maybe my previous comment makes more sense in light of this code? I think equality on EvalEnvironment needs to be defined for these tests, likely as I described it. Although, if equality is used only for the tests, we could also pull it out of the class and put it into a test-only helper function? Anyway, let me know what you think is best here.

@njsmith
Member

njsmith commented Apr 29, 2015

Ah, I see. My preference would still be to not define equality on EvalEnvironment at all (just because semantically it's extremely tricky -- e.g. two EvalEnvironment objects might act differently if they have different dict objects in them, because one thing you can do with an EvalEnvironment is execute code that modifies stuff in the environment -- and we don't really need it.) And for tests like this, just test that the EvalEnvironment does in fact have the correct behavior, e.g.:

# Variables in the current environment that were mentioned in the formula are accessible
assert state["eval_env"].eval("foo") is foo
# But other variables in the current environment that were not mentioned are not available
assert_raises(NameError, state["eval_env"].eval, "state")

(This isn't as thorough as a full equality check, but we already have thorough tests for EvalEnvironment.subset and ast_names, right? So it isn't crucial to reproduce those tests here IMHO.)
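
(For context, ast_names is the helper in patsy.eval that extracts the names an expression references, which is what drives the subsetting. A quick illustration, assuming it can be imported directly from patsy.eval:)

from patsy.eval import ast_names

# ast_names walks the expression's AST and yields every name it uses:
names = set(ast_names("foo(x) + bar(foo(y)) + quux(z, w)"))
assert names == set(["foo", "x", "bar", "y", "quux", "z", "w"])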

@chrish42
Contributor Author

Sounds reasonable to me. I'll move in that direction. I'll update the pull request when all the tests pass again.

@chrish42
Contributor Author

Here's the updated pull request, using the subset of the environment in the state. Let me know what you think.

@chrish42
Contributor Author

Also, what do you think is the right thing to do regarding shadowing of variables between the EvalEnvironment and the DataFrame (or other data object) passed to the formula?
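
To make the question concrete, a small illustration of the kind of shadowing meant here (editor's example; today the data silently takes priority over the environment):

import numpy as np
from patsy import dmatrix

x = np.array([1.0, 2.0, 3.0])                # an x in the calling environment...
data = {"x": np.array([10.0, 20.0, 30.0])}   # ...and an x in the data

# The data column shadows the environment variable; the open question is
# whether patsy should warn (or error) in cases like this.
mat = dmatrix("x", data)   # uses data["x"], not the environment's x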

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.87% when pulling 5b39dbf on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.


@njsmith
Member

njsmith commented May 4, 2015

Looking pretty good! Things I can think of besides the shadowing issue:

  • we need some additions to doc/changes.rst (esp. to document the incompatible changes!)
  • need to do a quick pass through the docs to catch anything that's now out-of-date.
  • I'd like to see an "end-to-end" test or two -- something that tests the user-visible behaviour directly: run a formula through dmatrix, modify the original environment, then run more data through the same builder (as if we were making predictions) and check that it is unaffected by the changes to the env. (test_highlevel.py is where the other tests like this go; a rough sketch follows this list.)
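
A rough sketch of that kind of end-to-end test (editor's illustration, not the test that was actually added; it uses the modern API, where build_design_matrices accepts DesignInfo objects -- at the time of this PR it took builder objects):

import numpy as np
from patsy import dmatrix, build_design_matrices

def test_builder_unaffected_by_env_changes():
    double = lambda x: 2 * x
    data = {"x": np.array([1.0, 2.0, 3.0])}
    mat = dmatrix("double(x)", data)

    # Rebind the name in the original environment after the build...
    double = lambda x: 100 * x

    # ...then run new data through the stored builder, as if predicting.
    # The captured subset should still hold the original `double`.
    (new_mat,) = build_design_matrices([mat.design_info],
                                       {"x": np.array([4.0])})
    assert np.allclose(np.asarray(new_mat)[0], [1.0, 8.0])  # intercept, 2*4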

@njsmith
Member

njsmith commented May 4, 2015

Okay, so having had a chance to look over the code that would be affected by shadowing, I think my thought is: some sort of systematic handling of shadowing would probably be a good idea, but it feels like something we can just as well handle as an independent change. In particular, shadowing checks should probably also notice when builtins are shadowed (this has been a persistent issue -- see #13). So I don't think it's a blocker for this PR.

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 97.85% when pulling 8ecb89b on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.87% when pulling 8ecb89b on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@chrish42
Contributor Author

Working on the doc now. I assume you will register this release on Zenodo before release time? (I'm also assuming the next release will be v0.4.0.)

@njsmith
Member

njsmith commented May 14, 2015

Actually, an annoying thing about Zenodo is that you actually cannot get the DOI until after the release is finalized, basically b/c the DOI-generation operation takes the release hash as an input. So I've just been putting the DOIs into the dev docs after the release, and the dev docs are what show up by default on readthedocs anyway, and I guess it's good enough.

And yeah, we'll call it v0.4.0.

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 97.85% when pulling 12c3ce9 on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.87% when pulling 12c3ce9 on chrish42:capture-vars-not-env into 3dd86d0 on pydata:master.

@chrish42
Contributor Author

Just updated the pull request with the results of my pass over the docs. Could you go over the diff for the doc update and see if things still make sense? I've deleted a lot more eval_env, etc. stuff than I added to the docs. I'm a bit worried that this means we missed an additional place where an eval_env argument should have been added.

I've also mentioned the incompatible change to EvalFactor.memorize_passes_needed() in the changelog, even though this is a pretty low-level method (which doesn't even have a docstring). Hopefully that's not too much detail for changes.rst? Anyway, feedback on my doc changes is just as welcome as on the code ones.

I'll add the high-level tests another day, but other than that, this is starting to look pretty complete.

@njsmith
Member

njsmith commented May 18, 2015

The doc changes look good to me -- I think the docs just spend more time talking about the factor interface than the actual builder interface, so it makes sense that you deleted more than you added. Mentioning the EvalFactor.memorize_passes_needed change is fine -- probably no-one cares, but oh well, it won't hurt.

@chrish42
Contributor Author

Added a high-level test. Let me know if you have other ideas for these kinds of tests. Otherwise, please let me know if you have any final comments, if there are still missing bits, etc. (Really looking forward to getting this merged.)

@njsmith
Member

njsmith commented May 20, 2015

Looks good to me!

One last thing: it looks like you have some merge conflicts which are greying out the merge button and preventing the tests from running. Could you merge or rebase from current master to resolve these?

@chrish42
Contributor Author

Merge done.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 97.87% when pulling 48d7e1e on chrish42:capture-vars-not-env into 845bded on pydata:master.

@njsmith
Member

njsmith commented May 20, 2015

Huzzah! We are good to go.

njsmith added a commit that referenced this pull request May 20, 2015
njsmith merged commit 3819ae9 into pydata:master May 20, 2015
@njsmith
Member

njsmith commented May 20, 2015

Thanks for the great work, and for sticking with it despite my nitpickiness :-).

@chrish42
Contributor Author

Woot! Great!

chrish42 deleted the capture-vars-not-env branch May 20, 2015 02:11
@chrish42
Contributor Author

Haha. No problem! Happy to be starting to make contributions to the scientific Python software stack. Most of your comments were very relevant, I thought. And until I come close to you in terms of code contributed to patsy, it's much more your code base than mine, so some nitpickiness is fully allowed. (I already tried to stick to your coding style as best I could.) I also much prefer code bases with a unified style, even when I don't agree 100% with it. So no problem again.
