ENH: Add pipe method #10253

Merged
merged 2 commits into from
Jun 6, 2015

TomAugspurger
Contributor

Closes #10129

In the dev meeting, we settled on the following:

  • .pipe will not include a check for __pipe_func__ on the function passed in.
  • To avoid messiness with lambdas when a function takes the DataFrame other than in the first position, users can pass in a (callable, data_keyword) argument to .pipe (thanks @mwaskom)
   import numpy as np
   import pandas as pd
   import statsmodels.formula.api as sm

   bb = pd.read_csv('data/baseball.csv', index_col='id')

   (bb.query('h > 0')
      .assign(ln_h=lambda df: np.log(df.h))
      # sm.poisson expects `formula, data`
      .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
      .fit()
      .summary()
   )
## -- End pasted text --
Optimization terminated successfully.
         Current function value: 2.116284
         Iterations 24
Out[1]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                          Poisson Regression Results
==============================================================================
Dep. Variable:                     hr   No. Observations:                   68
Model:                        Poisson   Df Residuals:                       63
Method:                           MLE   Df Model:                            4
Date:                Tue, 02 Jun 2015   Pseudo R-squ.:                  0.6878
Time:                        20:57:27   Log-Likelihood:                -143.91
converged:                       True   LL-Null:                       -460.91
                                        LLR p-value:                6.774e-136
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept   -1267.3636    457.867     -2.768      0.006     -2164.767  -369.960
C(lg)[T.NL]    -0.2057      0.101     -2.044      0.041        -0.403    -0.008
ln_h            0.9280      0.191      4.866      0.000         0.554     1.302
year            0.6301      0.228      2.762      0.006         0.183     1.077
g               0.0099      0.004      2.754      0.006         0.003     0.017
===============================================================================
"""

Thanks everyone for the input in the issue.


This is mostly ready. What's a good name for the argument to pipe? I have func right now; @shoyer, you had target in the issue thread. I've been using target as the keyword expecting the data, i.e. where the DataFrame should be piped to.

The tests are extremely minimal... but so is the implementation. Am I missing any obvious edge-cases?

We'll see how this goes. I don't think I push .pipe as a protocol at all in the documentation, though we can change that in the future. We should be forwards-compatible if we do ever go down the __pipe_func__ route.

)

Pandas encourages the second style. It flows with the rest of pandas
methods which return DataFrames or Series and are non-mutating by
Contributor

maybe use pure instead?

Contributor Author

I'm being pedantic, but "pure" implies no side effects. df.plot() is non-mutating, but not pure, since it has the side effect of drawing a plot. I didn't want some functional programming guru to call us out :)
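
For readers following along, the distinction can be sketched in a few lines. This is only an illustration: `describe` and `calls` are hypothetical names, and a plain list stands in for a DataFrame.

```python
calls = []  # outside state, used here to make the side effect visible


def describe(values):
    # Non-mutating: the input list is left untouched...
    calls.append(len(values))  # ...but not pure: outside state changes.
    return sorted(values)


data = [3, 1, 2]
out = describe(data)
```

After the call, `data` is unchanged and `out` is a new sorted list, yet `calls` has grown, which is the kind of side effect that "pure" would rule out.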

Contributor

sure...

@shoyer
Member

shoyer commented Jun 3, 2015

A couple of other things that would be nice to highlight in the docs:

  1. Let's mention how this was inspired by popular pipe operator %>% from R's magrittr package, but the implementation here is explicit and Pythonic. I would encourage reading the source code -- we might even include it in the docs.
  2. We should encourage just using pipe instead of monkey patching. I would consider removing the mention of monkey patching at all -- this is a far better way to go.


>>> (df.pipe(h),
.pipe(g, arg1=1),
.pipe(f, arg2=2, arg3=3)
Member

I think the commas at the end of the lines are not correct here?
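
To make the point about the commas concrete, here is a small sketch. `Chain` and `add` are hypothetical stand-ins (not pandas objects), used only so the snippet runs without a DataFrame.

```python
class Chain:
    """Hypothetical stand-in for an object with a pipe method."""

    def __init__(self, value):
        self.value = value

    def pipe(self, func, *args, **kwargs):
        return Chain(func(self.value, *args, **kwargs))


def add(x, n):
    return x + n


# With a trailing comma, the next line starts with ".pipe", which is
# not valid Python, so the snippet fails to compile.
broken = "(Chain(1).pipe(add, 1),\n .pipe(add, n=2))"
try:
    compile(broken, "<example>", "eval")
    is_valid = True
except SyntaxError:
    is_valid = False

# Without the commas, the parentheses simply let the chain span lines.
result = (Chain(1).pipe(add, 1)
          .pipe(add, n=2)
          ).value
```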

@TomAugspurger TomAugspurger changed the title from "ENH: this is a pipe" to "ENH: Add pipe method" Jun 3, 2015
@TomAugspurger
Contributor Author

I would consider removing the mention of monkey patching at all

Any objections to removing this section from the docs? I forgot it even existed.

@TomAugspurger
Contributor Author

Let's mention how this was inspired by popular pipe operator %>% from R's magrittr package, but the implementation here is explicit and Pythonic. I would encourage reading the source code -- we might even include it in the docs.

The pipe method is inspired by unix pipes and more recently dplyr_ and magrittr_, which
have introduced the popular (%>%) (read pipe) operator for R_.
The implementation of pipe here is quite clean and feels right at home in python.
We encourage you to view the source code (pd.DataFrame.pipe?? in IPyhton).


Ok, addressed all the comments I think (thanks). I removed the section on monkey patching.

The outstanding issue I see is @shoyer's comment about maybe checking whether we're about to clobber a kwarg when the tuple-style is used.

if target in kwargs:
    raise ValueError('%s is both the pipe target and a keyword argument' % target)

to catch a case of df.add(1).pipe((f, 'data'), x=x, y=y, data=df). Right now we (silently) replace kwargs['data']. The closest parallel I see is writing f(a=1, a=2), which Python catches.
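
For readers following the thread, the tuple handling together with the proposed clobber check can be sketched as a free function. This is a minimal sketch under stated assumptions, not the code merged in this PR: `pipe_sketch` and `model` are hypothetical names, and a plain dict stands in for the DataFrame.

```python
def pipe_sketch(obj, func, *args, **kwargs):
    """Apply func to obj; func may be a (callable, data_keyword) tuple."""
    if isinstance(func, tuple):
        func, target = func
        if target in kwargs:
            # Refuse to silently overwrite a keyword the caller passed.
            raise ValueError(
                '%s is both the pipe target and a keyword argument' % target
            )
        kwargs[target] = obj
        return func(*args, **kwargs)
    # Default case: the object goes in as the first positional argument.
    return func(obj, *args, **kwargs)


def model(formula, data):
    # Hypothetical function that expects its data as the second argument.
    return (formula, data)


frame = {'h': [1, 2, 3]}  # stand-in for a DataFrame
result = pipe_sketch(frame, (model, 'data'), 'hr ~ ln_h')

try:
    pipe_sketch(frame, (model, 'data'), 'hr ~ ln_h', data=frame)
    clobbered = False
except ValueError:
    clobbered = True
```

Here `result` carries the formula and the original object, and the second call raises instead of replacing `kwargs['data']` behind the caller's back.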

The pipe method is inspired by unix pipes and more recently dplyr_ and magrittr_, which
have introduced the popular ``(%>%)`` (read pipe) operator for R_.
The implementation of ``pipe`` here is quite clean and feels right at home in python.
We encourage you to view the source code (``pd.DataFrame.pipe??`` in IPyhton).
Member

IPyhton -> IPython

Contributor Author

Thanks, I should read what I write :/

@TomAugspurger
Contributor Author

I put the kwarg clobbering (df.pipe((f, 'data'), x=1, data=2)) in a second commit: 968d0ae

EDIT: I also removed the monkey patching docs FYI, can reinstate them if anyone is attached to them.

@jreback
Contributor

jreback commented Jun 4, 2015

pls rebase on master and repush; had an issue with some builds

@TomAugspurger
Contributor Author

Travis is green. Are we good with checking whether the target is clobbered? And are we OK with raising a ValueError instead of SyntaxError like f(a=2, a=3) does?
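
As background for the ValueError versus SyntaxError question: Python rejects a repeated literal keyword when the call is compiled, while a duplicate that only shows up at run time (for example through `**` unpacking) surfaces as a TypeError. A small sketch, with `f` as a hypothetical function:

```python
def f(a=None):
    return a


# A repeated literal keyword is rejected when the source is compiled.
try:
    compile("f(a=2, a=3)", "<example>", "eval")
    literal_error = None
except SyntaxError as exc:
    literal_error = type(exc).__name__

# A duplicate introduced through ** unpacking is only seen at call time.
try:
    f(a=2, **{"a": 3})
    runtime_error = None
except TypeError as exc:
    runtime_error = type(exc).__name__
```

Since pipe can only see the conflict at run time, a run-time exception is the closest it can get to the literal case.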


1. `Tablewise Function Application`_: :meth:`~DataFrame.pipe`
2. `Row or Column-wise Function Application`_: :meth:`~DataFrame.apply`
3. Elementwise_ function application: :meth:`~DataFrame.applymap`
Contributor

this needs backticks to pick up the references

Contributor Author

backticks on the "Elementwise_"? It works w/o the backticks since it's a single word. Or do you mean the method references?

Contributor

oh, ok, then didn't know that.

@TomAugspurger
Contributor Author

Ok, fixed that newline in the docstring and I just link people to the pipe section from internals.rst instead of the monkeypatching.

@@ -10,6 +10,7 @@ We recommend that all users upgrade to this version.
Highlights include:

- Documentation on how to use ``numba`` with *pandas*, see :ref:`here <enhancingperf.numba>`
- A new ``pipe`` method, :ref:`here <basics.pipe>`
Contributor

minor point, but I usually have these refer to the whatsnew itself, e.g. below (with the actual doc link from the whatsnew to the docs)

Contributor Author

I think the numba one is linking to the docs (could be wrong). Want me to change that to the section in whatsnew?

Contributor Author

nvm, didn't see it doesn't have a section

Contributor

the numba one doesn't have another whatsnew entry, that's why it's that way. Whereas you have a section in the whatsnew (which is good). So the link will skip to there. Then you already have the link back to the docs (at the end).

@jreback
Contributor

jreback commented Jun 5, 2015

@TomAugspurger lgtm. merge when ready (see that minor point above though)


In the example above, the functions ``f``, ``g``, and ``h`` each expected the DataFrame as the first positional argument.
When the funciton you wish to apply takes its data anywhere other than the first argument, pass a tuple
of ``(funciton, keyword)`` indicating where the DataFrame should flow. For example:
Member

spelling: function

@shoyer
Member

shoyer commented Jun 5, 2015

looks pretty much good to go for me, too, after a few quick fixes

@TomAugspurger
Contributor Author

Ok, I added a section header for pipe in whatsnew, linked to that from above. Fixed the (two) typos on function, and added the ... continuations in the docstring.

@TomAugspurger
Contributor Author

We waiting on Travis? It's felt slow today :/

@@ -1,1597 +0,0 @@
.. currentmodule:: pandas
Contributor

? I think you checked in after you built the docs....!

Contributor Author

Bah, one sec.

@jreback
Contributor

jreback commented Jun 5, 2015

travis IS slow today (on master anyhow). not real sure why.

@TomAugspurger
Contributor Author

OK, fixed the accidental deletion of api.rst

... .pipe(g, arg1=a)
... .pipe(f, arg2=b, arg3=c)
... )

Contributor

maybe show an example of using the callable & data_keyword in the Notes? (can do later)

Contributor Author

Added.

@TomAugspurger
Contributor Author

@shoyer @jreback merge at your leisure. Or I'll be back online later tonight to push the button :)

... )

If you have a function that takes the data as (say) the second
argumnet, pass a tuple indicating which keyword expects the
Member

spelling: argument

Contributor Author

I'm normally not the bad at spelling.

Member

"the bad"? :)

Contributor Author

Ugh. Long day.

TomAugspurger pushed a commit that referenced this pull request Jun 6, 2015
@TomAugspurger TomAugspurger merged commit 031e3bc into pandas-dev:master Jun 6, 2015
shoyer added a commit to shoyer/xarray that referenced this pull request Jun 10, 2015
The implementation here is directly copied from pandas:
pandas-dev/pandas#10253
@TomAugspurger TomAugspurger deleted the pipe branch August 18, 2015 12:44
@jreback
Contributor

jreback commented Aug 19, 2015

Development

Successfully merging this pull request may close these issues.

API: Implement pipe protocol, a method for extensible method chaining
4 participants