Advice on optimization? #1338
Replies: 2 comments 8 replies
-
It could be that your numeric values are actually strings. This happened to me with data in tidy/tall format as well, and the solution was to do a groupby and a pyarrow datatype cast on all of my components, with a syntax something like the sketch below.
You can use other Vaex aggregate functions in the same way. It would be worth checking whether your values are of integer type or are strings holding floats/scalars!
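A minimal sketch of that cast-then-aggregate pattern; the file name and the `component`/`count` column names are placeholders, not the original code:

```python
import vaex

df = vaex.open('components.parquet')  # placeholder path

# Check whether the numeric column actually came through as strings.
print(df['count'].dtype)

# Cast it to a proper integer type before aggregating.
df['count'] = df['count'].astype('int64')

# Group by component and sum the counts; other vaex.agg functions work the same way.
totals = df.groupby(by='component', agg={'total': vaex.agg.sum('count')})
print(totals)
```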
-
Hi, thanks for reaching out to us. Could you try exporting to HDF5 files, and try the same? Cheers, Maarten
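A minimal sketch of that conversion, with placeholder paths (convert each file once, then memory-map the HDF5 files):

```python
import glob
import vaex

# One-time conversion of each parquet file to HDF5.
for path in sorted(glob.glob('data/part_*.parquet')):  # placeholder pattern
    df = vaex.open(path)
    df.export_hdf5(path.replace('.parquet', '.hdf5'))

# Afterwards, open all converted files as a single memory-mapped DataFrame.
df_all = vaex.open('data/part_*.hdf5')
```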
-
Hi there!
First off, thanks for creating Vaex!
I was wondering if I could tap into your expertise, as I'm experiencing performance/stability issues.
Please excuse me for talking mainly theoretically - this question stems from my work for a SaaS company, and I won't be able to share the data I'm working on. If I have to, I might generate dummy data.
Anyways, my company saves parts of its data as huge JSONs.
While researching these JSONs, I quickly realized Pandas wouldn't be able to handle the huge amounts of data, so I turned to Vaex.
The flattened JSON includes (in some instances) thousands of columns (~22K) and currently ~2M rows. The rows increase every week, but we'll ignore that for now.
I saved the relevant data in 210 parquet files, which Vaex should open and concatenate.
After struggling to open the files for a bit, I read a few discussions and realized Vaex has trouble with a very large number of columns.
The solution I found was to melt the tables into 3 columns (asset id, component id, count) and concatenate the files on the index instead, to sum the counts of each component.
And it worked. Beautifully.
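Roughly, that reshaping step looks like this (a simplified sketch with placeholder paths and names, not the exact code; pandas does the per-file melt and Vaex does the concatenation and sum):

```python
import pandas as pd
import vaex

# Melt one wide chunk (thousands of component columns) into long format:
# one row per (asset_id, component, count).
wide = pd.read_parquet('data/chunk_000.parquet')          # placeholder path
long = wide.melt(id_vars=['asset_id'],
                 var_name='component',
                 value_name='count')
long.to_parquet('data/chunk_000_long.parquet')

# After all chunks are melted, open them together and sum the counts per component.
df = vaex.open('data/chunk_*_long.parquet')
totals = df.groupby(by='component', agg={'count': vaex.agg.sum('count')})
```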
But then, I wanted to calculate the distributions - min, max, mean, std, and percentiles [10,20,25,30,40,50...].
I used a selection -
df[df['component'] == 'component_name']
- and tested the calculations. And that's where I need your help. The calculations, although made on ~2M rows every time, take a LONG time. I even tried using the delay parameter to only pass through the data once, but it actually slowed the calculation down! When everything was delayed, it took ~40 mins to calculate the stats for a single component. Doing the same calculations without the delay cut it down to ~11 mins.
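The delayed variant I tried followed roughly this pattern (a simplified sketch with placeholder file and column names, not the exact code):

```python
import vaex

df = vaex.open('data/components_*.hdf5')            # placeholder path
dff = df[df['component'] == 'component_name']       # selection for one component

results = {}

@vaex.delayed
def collect(minimum, maximum, mean, std):
    # Runs once all scheduled aggregations have been computed.
    results.update(min=minimum, max=maximum, mean=mean, std=std)

# Schedule all aggregations, then run a single pass over the data.
collect(dff.min('count', delay=True),
        dff.max('count', delay=True),
        dff.mean('count', delay=True),
        dff.std('count', delay=True))
# percentile_approx('count', percentage=..., delay=True) can be scheduled the same way.
dff.execute()
print(results)
```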
What surprised me about the calculations, though, aside from the amount of time, was that while the min value is always zero for each component, the percentile_approx function actually returned negative values for the percentiles under 0.4.
Eventually I gave up and settled on doing value_counts and computing the statistics with pandas, but it turns out that value_counts takes ~40 mins as well. And I can't seem to loop through the different components, because after the first calculation is done the kernel just freezes/restarts!
Right now, the solution I have left is to use pandas to read each parquet individually, only 1 column at a time for performance reasons, and create a "manual" value counts. Please help me 😳
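That fallback, stripped down, looks something like this (placeholder paths and id column name; one column per component):

```python
import glob

import pandas as pd
import pyarrow.parquet as pq

paths = sorted(glob.glob('data/part_*.parquet'))   # placeholder for the 210 files
per_component = {}                                 # component column -> summed value counts

for path in paths:
    for col in pq.read_schema(path).names:         # read one column at a time
        if col == 'asset_id':                      # skip the id column (placeholder name)
            continue
        series = pd.read_parquet(path, columns=[col])[col]
        vc = series.value_counts()
        if col in per_component:
            per_component[col] = per_component[col].add(vc, fill_value=0)
        else:
            per_component[col] = vc
```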