Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: option to force slow code path (don't call apply function 1 too many times) in GroupBy.apply #2936

Closed
Aico opened this issue Feb 26, 2013 · 10 comments · Fixed by #24748
Closed
Labels
API Design Enhancement Groupby Performance Memory or execution speed performance
Milestone

Comments

@Aico
Copy link

Aico commented Feb 26, 2013

>>> def applym(df):
...     print df.irow(0)['a']
...     return DataFrame({'a':[1],'b':[1]})

>>> df = DataFrame({'a':[1,1,1,2,3],'b':['a','a','a','b','c']})

>>> df.groupby('b').apply(applym)
1
1
2
3
     a  b
b        
a 0  1  1
b 0  1  1
c 0  1  1
>>>

applym is called twice on the first group.

@stephenwlin
Copy link
Contributor

I believe think this is a duplicate of #2656 and not a bug...

The function is called twice for internal implementation reasons but the results are correct.

Possibly the docs should be updated to indicate that func passed to apply should not have side-effects.

@Aico
Copy link
Author

Aico commented Feb 26, 2013

I see. I agree with the suggestion on the other thread that there should be a way I can tell apply that there is no side effects so that it does not have to run the first group twice.

At the very least, it shouldn't run the first group twice if there is only one group.

@stephenwlin
Copy link
Contributor

hmm...well, what do you actually need guaranteed and why? just that the function is only called once per group? (or do you need a specific ordering, too?)

there's no way to add an option "i want to take the fast path always" because the fast path makes certain assumptions about memory aliasing and such and can cause segfault if they're incorrect. plus, the fast path implementation could change or an even faster path could be implemented, all of which are internal implementation details.

so it only makes sense to allow an option to force it to take the slow path, and the only use case for that would be if you depended on the order or number of calls (which would only matter if your function had side-effects, not vice-versa, unless your apply function is very expensive and you're worried about the CPU cycles...). is there a specific reason you need that?

@Aico
Copy link
Author

Aico commented Feb 27, 2013

The case I am came across, is expensive apply function with few groups. My apply takes 30 seconds to do so running the first group twice adds thirty seconds to the runtime. Though now that I know of this double run, I have changed my code to do the split-apply-concat step manually. I think I will try to use a global variable to just skip through the first run next time.

@stephenwlin
Copy link
Contributor

ok, if your apply is that expensive that makes sense then. (so maybe it should be an option to force the basic path)

@wesm
Copy link
Member

wesm commented Mar 19, 2013

Marked as enhancement for someday. Just have to get a groupby parameter to flow through

@jreback jreback modified the milestones: 0.15.0, Someday Apr 1, 2014
@jreback jreback modified the milestones: 0.15.0, 0.15.1 Jul 13, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014
@jreback jreback modified the milestones: 0.16.0, 0.17.0 Jan 26, 2015
@jrovegno
Copy link

Why, if I use a lambda expression, doesn't run twice the first iteration?

>>> df = pd.DataFrame({"a":["x", "y"], "b":[1,2]})
>>> identity = lambda row: print(tuple(row))
>>> df2 = df.apply(identity, axis=1)
('x', 1)
('y', 2)

@TomAugspurger
Copy link
Contributor

I'd like to consider adopting dask's behavior here, where the user provides a meta keyword to disable any inference.

An empty ``pd.DataFrame`` or ``pd.Series`` that matches the dtypes and
column names of the output. This metadata is necessary for many algorithms
in dask dataframe to work.  For ease of use, some alternative inputs are
also available. Instead of a ``DataFrame``, a ``dict`` of ``{name: dtype}``
or iterable of ``(name, dtype)`` can be provided. Instead of a series, a
tuple of ``(name, dtype)`` can be used. If not provided, dask will try to
infer the metadata. This may lead to unexpected results, so providing
``meta`` is recommended. For more information, see
``dask.dataframe.utils.make_meta``.

This has worked quite well for dask.

@jreback
Copy link
Contributor

jreback commented Mar 9, 2018

i don’t think we need to do this at all
rather just compute the path once and use it

@erwanp
Copy link

erwanp commented Jun 5, 2018

Also using groupby.apply() on little number of groups, that sometimes can be just 1 depending on User inputs. In that particular case the code is 2x slower.

An option to force the path would be quite appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Groupby Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants