
Long load times on saved dataset #4565

Closed
Ecskrabacz10 opened this issue Jul 5, 2023 · 21 comments · Fixed by #4595

@Ecskrabacz10 commented Jul 5, 2023

Hello! This is not necessarily a bug report, but rather a question about how I can improve saving and loading a dataset.

I have been working with a filetype from a simulation that is not currently supported by yt. I have been able to create a dataset, build a time series, and save a dataset. For some context, I have made my own dataset with ~4.2e6 particles and the following fields:

[("all","particle_index")]
[("all","particle_mass")]
[("all","particle_type")]
[("all","particle_position_x")]
[("all","particle_position_y")]
[("all","particle_position_z")]
[("all","particle_velocity_x")]
[("all","particle_velocity_y")]
[("all","particle_velocity_z")]

Each of the particle types has the same fields, e.g. [("stable matter","particle_position_x")]. I have also changed values in the dataset (such as current_time, current_redshift, hubble_constant, etc.). Currently, to mitigate a memory issue, I have been saving these datasets with ad.save_as_dataset(path, fields=ds.field_list), where ds is the dataset and ad = ds.all_data(). Naturally, this saves both the .h5 file and the .ewah file.

Now is where I run into an issue. I am able to load the saved file and run the same command to define ad without any problem, and it takes only around one second to load the particle index. However, it takes around 2.5 minutes to load some of the saved data through, say, pos_x = ad[("all","particle_position_x")]. This would normally be a non-issue, but when loading every field in the field_list, this can easily take a little under half an hour for each of the particle types.
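
Roughly, the access pattern I'm timing looks like the following sketch (the file name is a placeholder for my saved .h5 file):

import time
import yt

ds = yt.load("saved_dataset.h5")  # placeholder path for the file written by save_as_dataset
ad = ds.all_data()

t0 = time.perf_counter()
pos_x = ad[("all", "particle_position_x")]  # this single access takes ~2.5 minutes
print(f"loaded {pos_x.size} values in {time.perf_counter() - t0:.1f} s")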

Could anyone give me any advice as to how I can make the loading process quicker? Am I possibly saving my dataset incorrectly?

@chrishavlin (Contributor)

Hi there! Apologies for the slow response.

I don't think you're doing anything incorrectly; I think you're hitting some performance limits... but I might have a fix incoming to speed things up. Just to double-check -- how are you initially loading your data to get your initial yt dataset? Using yt.load_particles?

@chrishavlin (Contributor)

@Ecskrabacz10 I just opened a PR that should speed things up for you when you're loading back in via all_data(). Would you be able to check out that PR and see if it works for you? You'd need to install from source off of my branch in that PR (let me know if you need a hand doing that). Also, I think it will only really speed things up for all_data(); it might not help as much for other data selections...

@Ecskrabacz10 (Author) commented Jul 18, 2023

Apologies in advance for the long comment.

Yes, absolutely! I would love some help installing the PR from source off of your branch.

I am unsure if this will help, but I was looking at the lengths of my saved datasets, and it seems that they have about 32 times the number of particles of the original dataset. I believe this may be the main cause of the long load times.

Explanation of dataset

For some clarification about the dataset itself, it includes four types of species: all, decaying matter, stable matter, and nbody (which I believe is the combination of all of the previously stated species). Before saving my dataset, I change the unit registry, current_redshift, current_time, hubble_constant, omega_matter, omega_lambda, and omega_radiation to match my cosmology.

Before I save my dataset, I have also checked that the lengths of the particle lists are what they should be: all has 2*(128^3) particles, decaying matter has 128^3 particles, stable matter has 128^3 particles, and nbody has 4*(128^3) particles.

Issue

The main issue now comes from the saved datasets. As I've stated previously, the dataset grew to 32 times its original size: instead of having 2*(128^3) particles, the all species now has 512^3 particles. Yikes.

Currently, I save my dataset through the following procedure:

import h5py
import yt

f = h5py.File(filename, "r")
data_list = []

# For loop to populate data_list with the decaying matter, stable matter, and all species,
# and definition of bbox (the bounding box of the particle positions)

ds = yt.load_particles(data_list)  # plus other keyword arguments such as length_unit and bbox

ds.current_redshift = current_redshift
# Other changes to modify the cosmology and unit_registry

ad = ds.all_data()
ad.save_as_dataset(path_to_save, fields=ds.field_list)

My main guess is that when I perform ad.save_as_dataset(), ad already includes the fields meant to be saved, so passing fields=ds.field_list may be redundant. But even if that were the case, the data would only be duplicated once, not 32 times.

Another possibility could be my CPU configuration. I have been running on one node with 16 cores, though I am not sure how that could cause an issue since I do not have parallelism enabled. I have also tested this with yt.is_root(), and I ran my code using mpirun -n 1; the issue still happened in both cases.

Do you or anyone else know how this process creates a dataset that is 32 times larger than the original?

@neutrinoceros (Member)

I would love some help installing the PR from source off of your branch.

Here's how. Note that it'll take longer than regular installation (about a couple minutes), because your computer will run extra compilation steps.

python -m pip install git+https://github.com/chrishavlin/yt@ytdata_check_for_all_data
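
Once that finishes, a quick sanity check that the new build is the one being imported:

import yt
print(yt.__version__)  # should reflect the freshly installed build, not your previous yt installation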

@chrishavlin (Contributor)

Thanks for the extra info, @Ecskrabacz10!

And after reading through, I don't actually think my PR will fully solve your problem -- that PR just adds code to bypass yt's selection routines when using all_data (and return all the data immediately). Since you've somehow ended up with more particles than you started with, I suspect my PR will still load those in.

One thing you could do to help isolate the issue: the file written by save_as_dataset is simply another HDF5 file, so you could open it with h5py and inspect the data to check whether the duplication happens on the initial save_as_dataset or on re-load.
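
Something like this (just a sketch; adjust the file path) walks the saved file and prints each dataset's name and shape, so any duplication should show up directly:

import h5py

# Print every dataset in the saved .h5 file along with its shape.
with h5py.File("path/to/saved_dataset.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape)
    f.visititems(show)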

@matthewturk (Member)

You could also post the output of h5ls -r on the file here for a quick look.

@Ecskrabacz10 (Author) commented Jul 18, 2023

Thank you both for the extra insight! Here's a snapshot of what the h5ls -r output of the saved files looks like.

[screenshot of the h5ls -r output for the saved file]

The length of each component looks correct. So, it seems that the duplication does not happen during the initial save_as_dataset, but instead on re-load, @chrishavlin.

Edit: Here is the output that I get when I load with ds = yt.load(filename)
[screenshot of the yt.load output]

The 1.678e+07 particles reported seems to be the correct total, but when we check the shape of each field, the arrays are 32 times the expected size. This factor of 32 also shows up when initializing ad = ds.all_data().
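
For reference, the check I'm doing is roughly the following sketch (the file name is a placeholder):

import yt

n_expected = 2 * 128**3           # particles in the original "all" species
ds = yt.load("saved_dataset.h5")  # placeholder path for the re-loaded file
ad = ds.all_data()

n_loaded = ad[("all", "particle_position_x")].shape[0]
print(n_loaded, n_loaded / n_expected)  # the ratio comes out to 32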

@chrishavlin (Contributor)

Interesting. It is suspicious that the counts are off by a factor of 32 and the dataset index gets initialized with 32 chunks... maybe each chunk is referencing the same index range and everything is getting loaded 32 times... Let me look a bit more at how all that works to see if I can reproduce this behavior with a smaller dataset.

@Ecskrabacz10 (Author)

I was curious and wanted to check if this problem was due to the number of CPUs in the node I have been using. I tested two different nodes, one with 16 CPUs and one with 20. After saving and loading the same dataset on the two different nodes, both saved datasets showed the same factor-of-32 problem.

@matthewturk (Member)

This really is very confusing. One thing that might help -- do any .ewah files get generated? If so, can you send them, and also send us the .ewah files from the original, pre-save dataset?

@chrishavlin (Contributor)

OK, I was able to reproduce this on main with a simple example:

import yt
import numpy as np

# Build a simple in-memory particle dataset with unique particle_mass values.
n_particles = int(1e6)
ppx, ppy, ppz = np.random.random(size=[3, n_particles])
ppm = np.arange(0, n_particles)
data = {
    "particle_position_x": ppx,
    "particle_position_y": ppy,
    "particle_position_z": ppz,
    "particle_mass": ppm,
}

ds = yt.load_particles(data)
ad = ds.all_data()

# Save, re-load, and compare the particle counts.
fn = ad.save_as_dataset('/var/tmp/test_save', fields=ds.field_list)
ds1 = yt.load(fn)
ad1 = ds1.all_data()

n_particles_out = ad1[('all', 'particle_mass')].shape[0]
print(n_particles_out, n_particles_out == n_particles, n_particles_out / n_particles)

prints out

yt : [INFO     ] 2023-07-21 17:52:04,155 Parameters: current_time              = 0.0
yt : [INFO     ] 2023-07-21 17:52:04,155 Parameters: domain_dimensions         = [1 1 1]
yt : [INFO     ] 2023-07-21 17:52:04,156 Parameters: domain_left_edge          = [0. 0. 0.]
yt : [INFO     ] 2023-07-21 17:52:04,157 Parameters: domain_right_edge         = [1. 1. 1.]
yt : [INFO     ] 2023-07-21 17:52:04,157 Parameters: cosmological_simulation   = 0
yt : [INFO     ] 2023-07-21 17:52:04,158 Allocating for 1e+06 particles
yt : [INFO     ] 2023-07-21 17:52:04,557 Saving field data to yt dataset: /var/tmp/test_save.h5.
yt : [INFO     ] 2023-07-21 17:52:04,924 Parameters: current_time              = 0.0 code_time
yt : [INFO     ] 2023-07-21 17:52:04,924 Parameters: domain_dimensions         = [1 1 1]
yt : [INFO     ] 2023-07-21 17:52:04,925 Parameters: domain_left_edge          = [0. 0. 0.] code_length
yt : [INFO     ] 2023-07-21 17:52:04,926 Parameters: domain_right_edge         = [1. 1. 1.] code_length
yt : [INFO     ] 2023-07-21 17:52:04,926 Parameters: cosmological_simulation   = 0
yt : [INFO     ] 2023-07-21 17:52:04,943 Allocating for 3e+06 particles
Initializing coarse index : 100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 2488.83it/s]
Initializing refined index: 100%|█████████████████████████████████████████████| 4/4 [00:00<00:00, 2351.07it/s]

4000000 False 4.0

So I ended up with 4 times as many particles. Furthermore, because I gave the particle_mass field unique values, we can see that the data is indeed duplicated:

print(np.unique(ad1[('all','particle_mass')]).shape)

prints (1000000,) (the proper size).

This only happens when there are enough particles to trigger the chunking -- smaller initial values of n_particles don't build an .ewah file.

So, I'm not sure what the problem is yet, but we now have a simpler toy problem to debug.

@chrishavlin (Contributor)

I'm gonna go ahead and label this a bug at this point ...

@chrishavlin (Contributor)

Oh -- it might be that the stream dataset always uses 1 chunk, but that info is not being passed along to the saved dataset, so when it gets loaded back in the particle index re-builds with multiple chunks??

@Ecskrabacz10 (Author)

.ewah files get generated when I load the saved dataset, but there are no .ewah files for the previous steps. I just tried uploading one, and it seems like GitHub does not support .ewah files for comments/posts. Is there anywhere I could post my sample .ewah file?

I also found it very odd that the particle index re-builds in multiple chunks. I thought the number of chunks might be related to the number of fields present in the dataset. However, I don't believe that it is, as each of my species has nine fields associated with it.

@chrishavlin (Contributor) commented Jul 24, 2023

The number of chunks is related to the length of the arrays on re-loading, and it only happens on re-load because, when loading back in from a dataset that was created with save_as_dataset, yt uses a slightly different dataset type internally to handle it. E.g., the default chunk size is 64^3, so on re-load yt should process the arrays in chunks of 64^3 elements. But instead it seems to be cycling through the chunks and loading the entire array each time.
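
To illustrate that failure mode with plain NumPy (this is only a sketch of the symptom, not yt's actual internals):

import numpy as np

data_on_disk = np.arange(10)                # stand-in for one saved particle field
chunk_slices = [slice(0, 5), slice(5, 10)]  # the index range each chunk should read

# Correct behavior: each chunk reads only its own slice of the array.
correct = np.concatenate([data_on_disk[s] for s in chunk_slices])

# Buggy behavior: every chunk re-reads the whole array.
buggy = np.concatenate([data_on_disk for _ in chunk_slices])

print(correct.size, buggy.size)  # 10 vs. 20: the buggy path is n_chunks times too long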

I just found the spot in the code where this is happening, so I'm hoping to have a fix in today.

@chrishavlin (Contributor)

(and don't worry about uploading the .ewah files)

@chrishavlin (Contributor) commented Jul 24, 2023

@Ecskrabacz10 would you be able to test out my fix in #4595 with your data?

You can install from my PR branch with

pip install git+https://github.com/chrishavlin/yt@ytdata_chunking

@Ecskrabacz10 (Author)

I just installed the fix and it seems like it worked! It still loads the particles in 32 chunks; however, it now has the same number of particles as the original dataset.

This may just be because it's a new fix, but I wanted to bring it up nonetheless: importing ytree with the fix installed results in the following error

ImportError: cannot import name 'validate_index_order' from 'yt.data_objects.static_output' (/anaconda3/envs/testing/lib/python3.11/site-packages/yt/data_objects/static_output.py)

Once again, not too big of an issue, but I wanted to bring it up just in case.

@chrishavlin (Contributor)

Great!

I think the second error is already fixed in ytree (ytree-project/ytree#162), so you'll want to update ytree to pick up the fix. It doesn't look like that bug fix has made it into a ytree release yet, so you'd need to install ytree from source.
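
If it helps, installing ytree from source (assuming the fix is on the default branch of the ytree repository) should be something like:

python -m pip install git+https://github.com/ytree-project/ytree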

@chrishavlin (Contributor)

Oh, and how's the load time? Much faster, I hope??

@Ecskrabacz10 (Author)

Ah okay, that makes sense, thank you so much! And yes, it is MUCH faster now: about 5 seconds to load each field instead of the 2-3 minutes it was taking before.

@neutrinoceros modified the milestones: 4.2.2, 4.3.0 (Jul 24, 2023)