Particle indices have to be re-generated every time for some datasets #3487

Closed
chummels opened this issue Aug 28, 2021 · 4 comments

Comments

@chummels
Member

Bug report

Bug summary

As discussed in the yt Slack, some datasets re-generate their particle indices every time they are loaded instead of reusing the existing .ewah files. This defeats the purpose of saving the indices to .ewah files, and loading can take hours depending on the dataset.

The issue arises for datasets where yt updates the refined index order to be more efficient. In that case it writes an ewah file named after the new coarse and refined index orders (e.g., halo_59.hdf5.index6_4.ewah). The next time yt loads the dataset, it searches for an ewah file named after the default coarse and refined index orders (e.g., halo_59.hdf5.index7_5.ewah), does not find it, and concludes that the index has to be generated again.
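The mismatch can be illustrated with a minimal sketch. It assumes the naming pattern seen in the two example filenames above; `ewah_name` is a hypothetical helper written for illustration, not a yt function, and the order values are taken from the example.

```python
def ewah_name(dataset_name, order1, order2):
    """Hypothetical helper: build an index filename following the
    pattern in the example above (this is not a yt API)."""
    return f"{dataset_name}.index{order1}_{order2}.ewah"


# Orders from the example: the file is written with the adjusted
# orders, but the next load looks it up with the default orders.
adjusted_orders = (6, 4)
default_orders = (7, 5)

written = ewah_name("halo_59.hdf5", *adjusted_orders)
looked_up = ewah_name("halo_59.hdf5", *default_orders)

# The two names differ, so the saved file is never found.
print(written)
print(looked_up)
print(written == looked_up)
```

Because the lookup name is built from different numbers than the name used when writing, the existing file is invisible to the second load.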

Code for reproduction

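import yt

# First load: if no .ewah file exists yet, the particle index is built from scratch.
ds = yt.load_sample('TNGHalo')
ds.index

# Second load: the saved index should be reused, but it is generated again.
ds = yt.load_sample('TNGHalo')
ds.index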

Actual outcome

This code generates the coarse index and refined index both times it loads the dataset. If you re-run the script with a different dataset, such as FIRE_M12i_ref11, it generates the coarse index and refined index only once; the second time, it simply loads the existing particle index data (reporting "Loading particle index").

@chummels
Member Author

This issue was discussed in #3198, and a solution was proposed:

I've now limited the heuristic, but I'm somewhat slightly concerned that what it now does is check for the old index_order in the filename, rather than the new, so it's entirely possible that if it does any modifications to the index_order, it will always always generate new ewah files.

One possible way around this would be to have the filename just have index_order1 in it, and if it's auto-generated, have it call it "auto" or something.
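One possible reading of that suggestion, as a rough sketch: keep the first index order in the filename and write "auto" in place of the second whenever it was not chosen explicitly. `stable_ewah_name` and the exact filename layout are assumptions made for illustration, not anything yt defines.

```python
def stable_ewah_name(dataset_name, order1, order2=None):
    """Hypothetical naming scheme (not a yt API).

    order2=None means the second order was picked automatically,
    so the filename says "auto" instead of a number that may change.
    """
    suffix = "auto" if order2 is None else str(order2)
    return f"{dataset_name}.index{order1}_{suffix}.ewah"


# Auto-generated second order: the name does not depend on its final value.
print(stable_ewah_name("halo_59.hdf5", 7))
# Explicitly requested second order: the number stays in the name.
print(stable_ewah_name("halo_59.hdf5", 7, 5))
```

With a scheme like this, the name used for the lookup no longer depends on a value that is only known after the index has been built.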

@chummels
Member Author

Alternatively, why do we need to list the indices in the ewah filename at all? Once an ewah file has been generated, does it matter at all what the coarse and refined indices are? Perhaps I'm being naive, but it seems like once it's generated, it'll just work for loading in the data. But perhaps one will see changes in efficiency depending on the future functions applied to that dataset, so maybe it does matter?

One solution would be to just use regular expressions to see if there is any ewah file (with the same filename stem) in the same directory as the dataset, and if so, try to use that. This would resolve the issue, I think, and be backwards compatible as well.
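A minimal sketch of that lookup, assuming the index files follow the `<dataset>.index<N>_<M>.ewah` pattern seen in the example filenames. `find_existing_ewah` is a hypothetical helper, not part of yt, and it says nothing about whether a file found this way is still valid for the dataset.

```python
import re
from pathlib import Path


def find_existing_ewah(dataset_path):
    """Hypothetical helper (not a yt API): list index files next to
    the dataset that share its filename stem, whatever their orders.

    Returns a sorted list of (path, order1, order2) tuples.
    """
    dataset_path = Path(dataset_path)
    # Match e.g. "halo_59.hdf5.index6_4.ewah" for dataset "halo_59.hdf5".
    pattern = re.compile(
        re.escape(dataset_path.name) + r"\.index(\d+)_(\d+)\.ewah"
    )
    matches = []
    for candidate in dataset_path.parent.iterdir():
        m = pattern.fullmatch(candidate.name)
        if m:
            matches.append((candidate, int(m.group(1)), int(m.group(2))))
    return sorted(matches)
```

The orders parsed from the filename could then be used in place of the defaults, so files written under the current naming scheme would still be picked up.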

@matthewturk
Member

matthewturk commented Aug 28, 2021 via email

@neutrinoceros
Member

closed via #4198
