Compute cumulative values in a groupby #540

jmehault · 2020-01-02T15:46:43Z

Hi all,

First of all, thank you for this really cool lib. I discovered it few days ago, and I am starting to understand its philosophy.

I face a problem, I guess someone would give me some insight.
I am trying to compute the cumulative sum of a variable based on a group.
The straight solution
df['cum_sum'] = df.groupby("class", agg={'x': 'cumsum'})
does not work because cumsum() is not supported in vaex.
The workaround I use is to move to pandas, compute what I want and come back to vaex. Here is an example of my current solution :

import pandas as pd
df = pd.DataFrame({"class":np.repeat(["a", "b", "c"], 5), "x":np.arange(15)})
gr = df.groupby("class")
df["cum_x_by_class"] = gr.x.cumsum()

Do you think a solution with the current state of vaex is possible ?
Thanks a lot

The text was updated successfully, but these errors were encountered:

maartenbreddels · 2020-01-06T19:13:59Z

Hi Jérémie,

thanks!
I think we can do it, but it requires an in-memory copy of the data, and it also requires respecting the data order, so the performance could be quite bad. It may also require 2 passes over the data, and another one of the final result.
Together with @JovanVeljanoski we are thinking about new aggregates, and it's great you've come up with this example, we'll take it into consideration.

Would it help by doing an approximation? You could do a binby, to do a 2d histogram (with sum aggregation), and do a sum in the final direction?
sth like (didn't test it yet...):

# assume class is a categorical/ordinal
X = df.binby([df['class'], df['x']], agg={'x': 'sum'}, limits=[None, [0, 100]], shape=[None, 1000])
c = np.cumsum(X, axis=1)

cheers,

Maarten

jmehault · 2020-01-07T09:37:37Z

Thanks a lot for your answer Maarten,
I see I have a lot of work to know every existing vaex's functions.
In vaex-core 1.4.0 and on master, binby() does not support limits and shape parameters. Do I missed something ?
Without those attributes, the final result 'c' is:
<xarray.DataArray (statistic: 1, class_: 3, x: 15)>
array([[[ 0., 1., 3., 6., 10., 10., 10., 10., 10., 10.,
10., 10., 10., 10., 10.],
[ 0., 1., 3., 6., 10., 15., 21., 28., 36., 45.,
45., 45., 45., 45., 45.],
[ 0., 1., 3., 6., 10., 15., 21., 28., 36., 45.,
55., 66., 78., 91., 105.]]])
Coordinates:

statistic (statistic) <U1 'x'
class_ (class_) <U1 'a' 'b' 'c'
x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
This might help, but still with a lot of handling to get the expected result. I am going to script the required transformations and see if it is accpetable in term of time and simplicity.
I am using the result as a physical measurement and it is required to be precise.

Best regards,
Jérémie

hyperfra · 2020-06-29T22:43:28Z

Hi

any news on having cumsum implemented in vaex? I think this is really relevant when
you consider time in your analysis.

best

Francesco

meta-ks · 2023-10-31T07:32:59Z

Do we have cumsum/cumsum support in groups now in this wonderful library ?

BustaRams mentioned this issue May 17, 2020

Equivalent of cumsum in vaex #743

Open

yohplala mentioned this issue Nov 16, 2021

[FEATURE-REQUEST] How to share results from calculations on previous rows for calculations on next rows in apply? #1706

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compute cumulative values in a groupby #540

Compute cumulative values in a groupby #540

jmehault commented Jan 2, 2020

maartenbreddels commented Jan 6, 2020

jmehault commented Jan 7, 2020

hyperfra commented Jun 29, 2020

meta-ks commented Oct 31, 2023

Compute cumulative values in a groupby #540

Compute cumulative values in a groupby #540

Comments

jmehault commented Jan 2, 2020

maartenbreddels commented Jan 6, 2020

jmehault commented Jan 7, 2020

hyperfra commented Jun 29, 2020

meta-ks commented Oct 31, 2023