POC: pandas shared kernels for mean | groupby mean | rolling mean #51
Test script:
cc @max-sixty, you might be interested in the performance benchmarking of the sliding mean vs. numbagg.
Thanks for the CC @mroeschke. That's a great speedup! Do you know whether numba will inline the called function? Here's the equivalent code in numbagg: https://github.com/numbagg/numbagg/blob/main/numbagg/moving.py#L82-L123. Assuming the function inlines, I guess the performance should be similar, though I'm not sure whether the gufunc parallelizes.
It looks like numba's gufuncs do parallelize:

```python
In [14]: import numbagg

In [16]: df = pd.DataFrame(np.random.rand(10000, 10000))

In [21]: %timeit numbagg.move_mean(df.values, 2)
698 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [22]: %timeit df.rolling(2).mean()
5.37 s ± 208 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Thanks for the numbagg comparison! I am not exactly sure how to check if numba is inlining the function.
Note to self. (This is the version that was originally being called; numba version 0.53.1.)

cc @stuartarchibald if you have any ideas whether the function above could be refactored to not serialize the inner loops (PS: thanks for being a great resource for answering numba questions).
@mroeschke RE serialization of inner loops: Numba will by default serialize nested parallelism. This is because the outermost loop usually contains the most work, and distributing that loop often gives the best performance. Consider the following:

```python
@njit(parallel=True)
def foo(x, y, z):
    for i in prange(x):
        for j in prange(y):
            for k in prange(z):
                pass  # <work>
```

The default is that the loop in `i` runs in parallel and the loops in `j` and `k` are serialized. Obviously there are use cases where you might want to either oversubscribe resources or run an inner loop in parallel. Example:

```python
@njit(parallel=True)
def inner_foo(y, z):
    for j in prange(y):      # parallel loop
        for k in prange(z):  # serial loop
            pass  # <work>

@njit(parallel=True)
def foo(x, y, z):
    for i in prange(x):  # parallel loop
        inner_foo(y, z)
```

As an aside, you'll also note that numerous "other" parallel loops are spotted; this is because Numba analyses array expressions for parallel loops and also has a number of specialised parallel versions of routines.
Hopefully the above answers this, but I think the question might need to be: "What do I really want to run in parallel?"
No problem, thanks for trying all these things out. Practical use cases and feedback help drive Numba's design and development! So thank you!!!
Closing as a POC
General Flow:

- An `Indexer` class determines the start and end bounds for the mean | groupby mean | rolling mean operation.

Mean functions tested:

- `np.nanmean`, parallelized over each start and end bounds
- Pure Numba

Performance Table: df.shape = (10_000, 10_001)

- `np.nanmean`
- Pure Numba

Performance Table (parallel columns=True / parallel start & end bounds=True): df.shape = (1_000_000, 1001)

- `np.nanmean`
- Pure Numba

Performance Table (parallel columns=True / parallel start & end bounds=False): df.shape = (1_000_000, 1001)

- `np.nanmean`
- Pure Numba

Performance Table (parallel columns=False / parallel start & end bounds=True): df.shape = (1_000_000, 1001)

- `np.nanmean`
- Pure Numba

Performance Table (parallel columns=False / parallel start & end bounds=False): df.shape = (1_000_000, 1001)

- `np.nanmean`
- Pure Numba