Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute cumulative values in a groupby #540

Open
jmehault opened this issue Jan 2, 2020 · 4 comments
Open

Compute cumulative values in a groupby #540

jmehault opened this issue Jan 2, 2020 · 4 comments

Comments

@jmehault
Copy link

jmehault commented Jan 2, 2020

Hi all,

First of all, thank you for this really cool lib. I discovered it few days ago, and I am starting to understand its philosophy.

I face a problem, I guess someone would give me some insight.
I am trying to compute the cumulative sum of a variable based on a group.
The straight solution
df['cum_sum'] = df.groupby("class", agg={'x': 'cumsum'})
does not work because cumsum() is not supported in vaex.
The workaround I use is to move to pandas, compute what I want and come back to vaex. Here is an example of my current solution :

import pandas as pd
df = pd.DataFrame({"class":np.repeat(["a", "b", "c"], 5), "x":np.arange(15)})
gr = df.groupby("class")
df["cum_x_by_class"] = gr.x.cumsum()

Do you think a solution with the current state of vaex is possible ?
Thanks a lot

@maartenbreddels
Copy link
Member

Hi Jérémie,

thanks!
I think we can do it, but it requires an in-memory copy of the data, and it also requires respecting the data order, so the performance could be quite bad. It may also require 2 passes over the data, and another one of the final result.
Together with @JovanVeljanoski we are thinking about new aggregates, and it's great you've come up with this example, we'll take it into consideration.

Would it help by doing an approximation? You could do a binby, to do a 2d histogram (with sum aggregation), and do a sum in the final direction?
sth like (didn't test it yet...):

# assume class is a categorical/ordinal
X = df.binby([df['class'], df['x']], agg={'x': 'sum'}, limits=[None, [0, 100]], shape=[None, 1000])
c = np.cumsum(X, axis=1)

cheers,

Maarten

@jmehault
Copy link
Author

jmehault commented Jan 7, 2020

Thanks a lot for your answer Maarten,
I see I have a lot of work to know every existing vaex's functions.
In vaex-core 1.4.0 and on master, binby() does not support limits and shape parameters. Do I missed something ?
Without those attributes, the final result 'c' is:
<xarray.DataArray (statistic: 1, class_: 3, x: 15)>
array([[[ 0., 1., 3., 6., 10., 10., 10., 10., 10., 10.,
10., 10., 10., 10., 10.],
[ 0., 1., 3., 6., 10., 15., 21., 28., 36., 45.,
45., 45., 45., 45., 45.],
[ 0., 1., 3., 6., 10., 15., 21., 28., 36., 45.,
55., 66., 78., 91., 105.]]])
Coordinates:

  • statistic (statistic) <U1 'x'
  • class_ (class_) <U1 'a' 'b' 'c'
  • x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
    This might help, but still with a lot of handling to get the expected result. I am going to script the required transformations and see if it is accpetable in term of time and simplicity.
    I am using the result as a physical measurement and it is required to be precise.

Best regards,
Jérémie

@hyperfra
Copy link

Hi

any news on having cumsum implemented in vaex? I think this is really relevant when
you consider time in your analysis.

best

Francesco

@meta-ks
Copy link

meta-ks commented Oct 31, 2023

Do we have cumsum/cumsum support in groups now in this wonderful library ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants