Skip to content

Stateful transforms compatible with 'mgcv' R package#43

Merged
njsmith merged 19 commits intopydata:masterfrom
broessli:mgcv
Jul 4, 2014
Merged

Stateful transforms compatible with 'mgcv' R package#43
njsmith merged 19 commits intopydata:masterfrom
broessli:mgcv

Conversation

@broessli
Copy link
Contributor

Hi!
I've added the file mgcv.py containing the code for the following stateful transforms: cr/cs, cc and te.
Benoît.

broessli added 2 commits May 22, 2014 13:04
New 'cr/cs', 'cc' and 'te' stateful transforms matching what is found
in the 'mgcv' R package.
@broessli
Copy link
Contributor Author

PS: current stateful_transforms classes need obviously some more work (but not much i think, see TODOs) to really qualify as genuine stateful_transforms

@coveralls
Copy link

Coverage Status

Coverage decreased (-2.43%) when pulling 2faf096 on broessli:mgcv into 6636eec on pydata:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-2.43%) when pulling 97997c0 on broessli:mgcv into 6636eec on pydata:master.

Stateful transforms 'cr'/'cs', 'cc' and 'te' contains now true
stateful_transform functionality.
@coveralls
Copy link

Coverage Status

Coverage decreased (-3.22%) when pulling 55b5e0b on broessli:mgcv into 6636eec on pydata:master.

@njsmith
Copy link
Member

njsmith commented May 23, 2014

Some comments on a quick pass:

  • Needs a copyright header at the top of the file (see the other files in patsy).
  • There are some PEP 8 violations -- notably inconsistent spacing (binary operators should always have spaces around them, except for :) and some extremely long lines (should be < 78 columns or so)
  • The docs are inadequate -- in particular even after reading the code I have no idea what te's arguments are or what they mean. And what does te do if you use it with other inputs (e.g., te(bs(...), bs(...))?
  • Most importantly, there are no tests. This is important in general (patsy has approximately 100% test coverage and we'd like to keep it that way :-)), and even more so for a change where compatibility with third-party software is important. We should really have tests that have some hardcoded output from mgcv, checking that we get the same results as they do, with and without explicit knot specification, inside and outside of the range of the original data, etc., because otherwise this will break at some point and no-one will notice. The check_stateful function in patsy/test_state.py is useful here -- it lets you specify a test case once, and then torture-tests a stateful transform to make sure that the memorize_chunk and memorize_finish functions are doing their job properly. See the test_* functions in patsy/test_state.py for examples.

Thanks!

@broessli
Copy link
Contributor Author

Thank you for your comments but please note that in fact the code committed so far was only meant to give you a quick idea of what it looks like, I should have mentioned that, sorry.
I was actually waiting at this stage -- but was not clear about it -- for a confirmation of your interest, after seeing this code basis, before putting the extra work of adding unit tests (existing tests used while designing the code were business related so I cannot include them) and making the code truly integrable into and compatible with patsy. This is what I'll do next, thanks again for your interest!

@njsmith
Copy link
Member

njsmith commented May 26, 2014

Ah, that makes sense -- let me know when this is ready to review, then, I
guess? (Github sends notifications on comments, but doesn't send
notifications on commits.)

On Fri, May 23, 2014 at 7:13 PM, broessli notifications@github.com wrote:

Thank you for your comments but please note that in fact the code
committed so far was only meant to give you a quick idea of what it looks
like, I should have mentioned that, sorry.
I was actually waiting at this stage -- but was not clear about it -- for
a confirmation of your interest, after seeing this code basis, before
putting the extra work of adding unit tests (existing tests used while
designing the code were business related so I cannot include them) and
making the code truly integrable into and compatible with patsy. This is
what I'll do next, thanks again for your interest!


Reply to this email directly or view it on GitHubhttps://github.com//pull/43#issuecomment-44043016
.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

- Reworked all code: pep8 compliance, added proper docstrings
- Added special treatment for values outside knots range for 'cr/cs'
- Reworked 'te' api to have full control on constraints
application/derivation
- Added basic tests, to be completed with full scale R comparisons
@broessli
Copy link
Contributor Author

Yep, I'll let you know when this is ready.
(I've already reworked all the code, see last commit for details, and will add R test cases like what is done for 'bs')

@broessli
Copy link
Contributor Author

broessli commented Jun 3, 2014

Hi, the code is now ready to review, thanks!
ps: Travis build is broken but I am not sure about what I did wrong: when I check the logs all tests are running fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually own the copyright to this code so you probably shouldn't say I do :-). Please put whoever actually owns the copyright (I guess probably the company you work for?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that would be my company (GDF Suez)

@broessli
Copy link
Contributor Author

broessli commented Jun 3, 2014

About the warnings vs errors, this is to reproduce mgcv behavior precisely, see also R test cases

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Can we call this constraint or something instead of cons?
  • I don't like the quirky True/False meanings -- how about we allow either an array or the string "center"?
  • Could you explain somewhere what the constraints actually are? Both why you might want a constraint, and how the array (if given) is interpreted?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same applies to the other classes that use cons= of course.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'll change that indeed, and add some doc about constraints. Interpretation of the constraints array is currently only described in _get_te_dmatrix, _get_crs_dmatrix and _absorb_constraints.

@njsmith
Copy link
Member

njsmith commented Jun 3, 2014

The test failures are because (a) latest pandas seems to have dropped python 2.6 support, oops, and (b) tests/check-API-refs.py is detecting that new functions have been added to the patsy.* namespace, but not to the docs.

The underlying issue here is that the docs are only partially automatic: they will pull out docstrings, but only when explicitly requested. So we need to add some entries to doc/API-reference.rst.

But this also means that we'll have to somewhat rearrange the docs you've already written. We don't want to explicitly add the CubicRegressionSpline class to the docs, because it's an internal, non-exported implementation detail. But this means that its docstring will not be in the docs, so referring people to it for critical information won't work. I suggest that we add a new file doc/splines.rst or similar, and put the detailed highlevel documentation about what constraints are and how these functions relate to each other there, and then the docstrings can refer to that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basis -> bases

@njsmith
Copy link
Member

njsmith commented Jun 28, 2014

That's all I noticed on this pass :-) A few details to sort out, but this is looking pretty good!

@broessli
Copy link
Contributor Author

Thanks a lot for your code review!

@njsmith
Copy link
Member

njsmith commented Jun 29, 2014

On Sun, Jun 29, 2014 at 8:17 PM, broessli notifications@github.com wrote:

Thanks a lot for your code review!

No problem :-). I think I've replied to all the new questions you raised,
let me know if I missed anything.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) when pulling 8e637cb on broessli:mgcv into 6636eec on pydata:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.19%) when pulling 32c34e9 on broessli:mgcv into 6636eec on pydata:master.

@broessli
Copy link
Contributor Author

broessli commented Jul 1, 2014

Hi, just to let you know I've addressed all the points raised during last review!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be written more idiomatically as

self._tmp.setdefault("count", 0)
self._tmp["count"] += tp.shape[0]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact I first wrote it this way but since I need np.zeros(..) for "sum" in setdefault() it would get evaluated everytime, that's why I changed with if/else. But I agree the idiomatic version is more readable, I'll change that!

@njsmith
Copy link
Member

njsmith commented Jul 1, 2014

Looks good, modulo the super-nitpicky comment about setdefault and I guess you missed the previous super-nitpicky comment about %r? Fix those and I'll merge :-)

@broessli
Copy link
Contributor Author

broessli commented Jul 1, 2014

Oh yes sorry I missed one '%r'

@broessli
Copy link
Contributor Author

broessli commented Jul 1, 2014

I've pushed the last corrections, waiting for travis...

@njsmith
Copy link
Member

njsmith commented Jul 1, 2014

Oh, yeah, I missed that about the zeros. In that case I'd be happy with

if "sum" not in ...:
   ...["sum"] = np.zeros(...)
...["sum"] += ...

as well, pick whichever you prefer :-)

@coveralls
Copy link

Coverage Status

Coverage increased (+0.19%) when pulling f7c583f on broessli:mgcv into 6636eec on pydata:master.

@broessli
Copy link
Contributor Author

broessli commented Jul 2, 2014

OK I'll keep the setdefault version :-)
Travis build seems OK!

njsmith added a commit that referenced this pull request Jul 4, 2014
Stateful transforms compatible with 'mgcv' R package
@njsmith njsmith merged commit 1710a5f into pydata:master Jul 4, 2014
@njsmith
Copy link
Member

njsmith commented Jul 4, 2014

Looks good to me!

Thanks so much, this was a pleasure to help you with :-)

@broessli
Copy link
Contributor Author

broessli commented Jul 4, 2014

I'd like to thank you very much for your help and taking the time that you
did for the detailed code reviews!
I've really enjoyed the whole process of improving the code to meet patsy's
standards. It's been a pleasure working with you on this as well :-)

Le samedi 5 juillet 2014, Nathaniel J. Smith notifications@github.com a
écrit :

Looks good to me!

Thanks so much, this was a pleasure to help you with :-)


Reply to this email directly or view it on GitHub
#43 (comment).

Benoît.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants