Advice on optimization? #1338
Replies: 2 comments 8 replies
-
It could be that your numeric values are actually strings. This happened to me with data in tidy/tall format as well, and the solution was to do a groupby and a pyarrow datatype cast on all of my components, with a syntax something like the sketch below.
You can use other Vaex aggregate functions in the same way. It would be worth checking whether your values are of integer type or are strings holding floats/scalars!
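A minimal sketch of that cast-then-aggregate pattern; the file name and the `component`/`count` column names are placeholders, not the original code:

```python
import vaex

df = vaex.open('components.parquet')  # placeholder path

# Check whether the numeric column actually came through as strings.
print(df['count'].dtype)

# Cast it to a proper integer type before aggregating.
df['count'] = df['count'].astype('int64')

# Group by component and sum the counts; other vaex.agg functions work the same way.
totals = df.groupby(by='component', agg={'total': vaex.agg.sum('count')})
print(totals)
```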
-
Hi, thanks for reaching out to us. Could you try exporting to HDF5 files, and try the same? Cheers, Maarten
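A minimal sketch of that conversion, with placeholder paths (convert each file once, then memory-map the HDF5 files):

```python
import glob
import vaex

# One-time conversion of each parquet file to HDF5.
for path in sorted(glob.glob('data/part_*.parquet')):  # placeholder pattern
    df = vaex.open(path)
    df.export_hdf5(path.replace('.parquet', '.hdf5'))

# Afterwards, open all converted files as a single memory-mapped DataFrame.
df_all = vaex.open('data/part_*.hdf5')
```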
-
Hi there!
First off, thanks for creating Vaex!
I was wondering if I could tap into your expertise, as I'm experiencing performance/stability issues.
Please excuse me for talking mainly theoretically - this question stems from my work for a SaaS company, and I won't be able to share the data I'm working on. If I have to, I might generate dummy data.
Anyways, my company saves parts of its data as huge JSONs.
While researching these JSONs, I quickly realized Pandas wouldn't be able to handle the huge amounts of data, so I turned to Vaex.
The flattened JSON includes (in some instances) thousands of columns (~22K) and currently ~2M rows. The rows increase every week, but we'll ignore that for now.
I saved the relevant data in 210 parquet files, which Vaex should open and concatenate.
After struggling to open the files for a bit, I read a few discussions and realized Vaex has trouble with a very large number of columns.
The solution I found was to melt the tables into 3 columns (asset id, component id, count) and concatenate the files on the index instead, to sum the counts of each component.
And it worked. Beautifully.
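Roughly, that reshaping step looks like this (a simplified sketch with placeholder paths and names, not the exact code; pandas does the per-file melt and Vaex does the concatenation and sum):

```python
import pandas as pd
import vaex

# Melt one wide chunk (thousands of component columns) into long format:
# one row per (asset_id, component, count).
wide = pd.read_parquet('data/chunk_000.parquet')          # placeholder path
long = wide.melt(id_vars=['asset_id'],
                 var_name='component',
                 value_name='count')
long.to_parquet('data/chunk_000_long.parquet')

# After all chunks are melted, open them together and sum the counts per component.
df = vaex.open('data/chunk_*_long.parquet')
totals = df.groupby(by='component', agg={'count': vaex.agg.sum('count')})
```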
But then, I wanted to calculate the distributions - min, max, mean, std, and percentiles [10,20,25,30,40,50...].
I used a selection -
df[df['component'] == 'component_name']
- and tested the calculations. And that's where I need your help. The calculations, although made on ~2M rows every time, take a LONG time. I even tried using the delay parameter to only pass through the data once, but it actually slowed the calculation down! When everything was delayed, it took ~40 mins to calculate the stats for a single component. Doing the same calculations without the delay cut it down to ~11 mins.
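The delayed variant I tried followed roughly this pattern (a simplified sketch with placeholder file and column names, not the exact code):

```python
import vaex

df = vaex.open('data/components_*.hdf5')            # placeholder path
dff = df[df['component'] == 'component_name']       # selection for one component

results = {}

@vaex.delayed
def collect(minimum, maximum, mean, std):
    # Runs once all scheduled aggregations have been computed.
    results.update(min=minimum, max=maximum, mean=mean, std=std)

# Schedule all aggregations, then run a single pass over the data.
collect(dff.min('count', delay=True),
        dff.max('count', delay=True),
        dff.mean('count', delay=True),
        dff.std('count', delay=True))
# percentile_approx('count', percentage=..., delay=True) can be scheduled the same way.
dff.execute()
print(results)
```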
What surprised me about the calculations, though, aside from the amount of time, was that while the min value is always zero for each component, the percentile_approx function actually returned negative values for the percentiles under 0.4.
Eventually I gave up and settled on doing value_counts and computing the statistics with pandas, but it turns out that value_counts takes ~40 mins as well. And I can't seem to loop through the different components, because after the first calculation is done the kernel just freezes/restarts!
Right now, the solution I have left is to use pandas to read each parquet individually, only 1 column at a time for performance reasons, and create a "manual" value counts. Please help me 😳
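That fallback, stripped down, looks something like this (placeholder paths and id column name; one column per component):

```python
import glob

import pandas as pd
import pyarrow.parquet as pq

paths = sorted(glob.glob('data/part_*.parquet'))   # placeholder for the 210 files
per_component = {}                                 # component column -> summed value counts

for path in paths:
    for col in pq.read_schema(path).names:         # read one column at a time
        if col == 'asset_id':                      # skip the id column (placeholder name)
            continue
        series = pd.read_parquet(path, columns=[col])[col]
        vc = series.value_counts()
        if col in per_component:
            per_component[col] = per_component[col].add(vc, fill_value=0)
        else:
            per_component[col] = vc
```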