ENH: option to force slow code path (don't call apply function 1 too many times) in GroupBy.apply #2936

Aico · 2013-02-26T06:08:53Z

>>> def applym(df):
...     print df.irow(0)['a']
...     return DataFrame({'a':[1],'b':[1]})

>>> df = DataFrame({'a':[1,1,1,2,3],'b':['a','a','a','b','c']})

>>> df.groupby('b').apply(applym)
1
1
2
3
     a  b
b        
a 0  1  1
b 0  1  1
c 0  1  1
>>>

applym is called twice on the first group.

stephenwlin · 2013-02-26T06:18:40Z

I believe this is a duplicate of #2656 and not a bug...

The function is called twice for internal implementation reasons but the results are correct.

Possibly the docs should be updated to indicate that func passed to apply should not have side-effects.
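To check how many times a given pandas version calls the function for each group, one option is to count the calls explicitly. The snippet below is a minimal sketch, not part of the original report: the names `calls` and `sum_a` are made up for illustration, and the counter is itself a side effect, used only for diagnosis. The counts depend on the installed pandas version; on versions that predate the fix referenced later in this thread, the first group may be counted twice.

```python
from collections import Counter

import pandas as pd

# Hypothetical diagnostic counter, keyed by the first value of column "a"
# in each group (1, 2 and 3 for the frame below).
calls = Counter()


def sum_a(group):
    # Side effect used only to observe how often pandas calls this function.
    calls[int(group["a"].iloc[0])] += 1
    return group["a"].sum()


df = pd.DataFrame({"a": [1, 1, 1, 2, 3], "b": ["a", "a", "a", "b", "c"]})
result = df.groupby("b").apply(sum_a)

# The result is the same no matter how many times sum_a was invoked.
print(result)
# The per-group call counts vary with the pandas version.
print(dict(calls))
```

Whatever the counts turn out to be, `result` is the same, which is why a function without side effects is unaffected by the extra call.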

Aico · 2013-02-26T07:24:16Z

I see. I agree with the suggestion on the other thread that there should be a way to tell apply that there are no side effects, so that it does not have to run the first group twice.

At the very least, it shouldn't run the first group twice if there is only one group.

stephenwlin · 2013-02-27T04:30:59Z

hmm...well, what do you actually need guaranteed and why? just that the function is only called once per group? (or do you need a specific ordering, too?)

there's no way to add an option "i want to take the fast path always" because the fast path makes certain assumptions about memory aliasing and such and can cause segfault if they're incorrect. plus, the fast path implementation could change or an even faster path could be implemented, all of which are internal implementation details.

so it only makes sense to allow an option to force it to take the slow path, and the only use case for that would be if you depended on the order or number of calls (which would only matter if your function had side-effects, not vice-versa, unless your apply function is very expensive and you're worried about the CPU cycles...). is there a specific reason you need that?

Aico · 2013-02-27T07:37:57Z

The case I came across is an expensive apply function with few groups. My apply takes 30 seconds to run, so running the first group twice adds 30 seconds to the runtime. Now that I know about this double run, I have changed my code to do the split-apply-concat step manually. Next time I think I will try using a global variable to skip the first run.
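The manual split-apply-concat step mentioned above could look roughly like the following. This is a sketch under assumptions, not the code actually used: the helper name `apply_once_per_group` and the stand-in function `expensive` are hypothetical. It iterates over the groups directly, so the function runs exactly once per group, and then rebuilds the group-keyed index with `pd.concat`.

```python
import pandas as pd


def apply_once_per_group(df, by, func):
    """Call func exactly once per group and concatenate the results.

    Hypothetical helper: `by` is assumed to be a single column name.
    """
    keys = []
    pieces = []
    for key, group in df.groupby(by):
        keys.append(key)
        pieces.append(func(group))
    # keys= adds the group key as the outer index level, named after `by`.
    return pd.concat(pieces, keys=keys, names=[by])


calls = []


def expensive(group):
    # Stand-in for a slow function; records each invocation.
    calls.append(int(group["a"].iloc[0]))
    return pd.DataFrame({"a": [1], "b": [1]})


df = pd.DataFrame({"a": [1, 1, 1, 2, 3], "b": ["a", "a", "a", "b", "c"]})
out = apply_once_per_group(df, "b", expensive)
print(out)
print(calls)
```

The trade-off is that this bypasses whatever fast path `GroupBy.apply` would have chosen, which only matters when the per-group function is cheap and there are many groups.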

stephenwlin · 2013-02-27T07:39:25Z

ok, if your apply is that expensive that makes sense then. (so maybe it should be an option to force the basic path)

wesm · 2013-03-19T23:07:38Z

Marked as enhancement for someday. Just have to get a groupby parameter to flow through

jrovegno · 2018-02-28T20:27:57Z

Why doesn't the first iteration run twice if I use a lambda expression?

>>> df = pd.DataFrame({"a":["x", "y"], "b":[1,2]})
>>> identity = lambda row: print(tuple(row))
>>> df2 = df.apply(identity, axis=1)
('x', 1)
('y', 2)

TomAugspurger · 2018-03-09T22:03:54Z

I'd like to consider adopting dask's behavior here, where the user provides a meta keyword to disable any inference.

An empty ``pd.DataFrame`` or ``pd.Series`` that matches the dtypes and
column names of the output. This metadata is necessary for many algorithms
in dask dataframe to work.  For ease of use, some alternative inputs are
also available. Instead of a ``DataFrame``, a ``dict`` of ``{name: dtype}``
or iterable of ``(name, dtype)`` can be provided. Instead of a series, a
tuple of ``(name, dtype)`` can be used. If not provided, dask will try to
infer the metadata. This may lead to unexpected results, so providing
``meta`` is recommended. For more information, see
``dask.dataframe.utils.make_meta``.

This has worked quite well for dask.

jreback · 2018-03-09T22:11:52Z

i don’t think we need to do this at all
rather just compute the path once and use it

erwanp · 2018-06-05T13:16:29Z

I also use groupby.apply() on a small number of groups, sometimes just one depending on user input. In that particular case the code is 2x slower.

An option to force the path would be quite appreciated.

wesm mentioned this issue Apr 7, 2013

The apply method over a group seems to visit the first group twice #2656

Closed

This was referenced Apr 1, 2014

The apply function of a DataFrame is called twice on the first row #6753

Closed

DOC: documented that .apply(func) executes func twice on the first time #6756

Merged

jreback modified the milestones: 0.15.0, Someday Apr 1, 2014

jreback added Groupby labels Apr 1, 2014

jeffreystarr pushed a commit to jeffreystarr/pandas that referenced this issue Apr 28, 2014

DOC: documented that .apply(func) executes func twice on the first time

ec2a8a1

Related issues are pandas-dev#2656, pandas-dev#2936 and pandas-dev#6753.

jsw-fnal mentioned this issue Jul 13, 2014

groupby - apply applies to the first group twice #7739

Closed

jreback modified the milestones: 0.15.0, 0.15.1 Jul 13, 2014

jreback added the API Design label Jul 13, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014

jreback modified the milestones: 0.16.0, 0.17.0 Jan 26, 2015

jreback mentioned this issue May 28, 2015

current implementation of DataFrame.apply where passed function has side effects is a real "gotcha" #10222

Closed

jreback mentioned this issue Jul 6, 2015

GroupBy.apply calling function twice #10519

Closed

jreback mentioned this issue Jan 27, 2016

BUG: GroupBy.apply iterates over first group twice #12155

Closed

jreback mentioned this issue Jul 15, 2017

An option to run once for pandas.DataFrame.apply #16946

Closed

jreback added Difficulty Intermediate labels Jul 15, 2017

jschendel mentioned this issue Mar 9, 2018

ENH: Add argument to GroupBy.apply to let user pre-select "slow path" & not run function twice on 1st group #20084

Closed

gfyoung mentioned this issue Jun 11, 2018

Additional calls to function when using GroupBy.apply with multiple index #21417

Closed

jorisvandenbossche modified the milestones: Next Major Release, 1.0 Jun 11, 2018

fjetter mentioned this issue Jan 13, 2019

ENH: Only apply first group once in fast GroupBy.apply #24748

Merged

jreback modified the milestones: 1.0, 0.25.0 Jan 26, 2019

WillAyd mentioned this issue Feb 26, 2019

Groupby.apply has issues with printing command #25450

Closed

WillAyd closed this as completed in #24748 Mar 26, 2019

kerwin6182828 mentioned this issue Jan 8, 2020

like old issue #2936: df.apply(foo) will run twice at the first row #30815

Closed
