ENH: Allow to store multiple bitmap indices in the ewah-sidecar #4307

Xarthisius · 2023-01-18T17:51:14Z

PR Summary

My attempt at fixing #3327. The main "upgrade" is that we can now store indices of the same order for different domain sizes, which we currently don't allow. Each cached index is identified by: left_edge, right_edge, hash of data file, periodicity, index_order1, index_order2 and nfiles.
Drawback is that h5py is a strict requirement now.
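
To make the identification scheme above concrete, here is a minimal sketch that folds the seven listed fields into one lookup key. The helper name `index_key`, the use of SHA-256, and the string encoding of each field are assumptions made for this illustration only; they are not taken from the PR's implementation.

```python
import hashlib


def index_key(left_edge, right_edge, file_hash, periodicity,
              index_order1, index_order2, nfiles):
    """Hypothetical helper: derive one key from the fields that identify an index."""
    parts = [
        ",".join(repr(float(x)) for x in left_edge),
        ",".join(repr(float(x)) for x in right_edge),
        str(file_hash),
        ",".join(str(bool(p)) for p in periodicity),
        str(int(index_order1)),
        str(int(index_order2)),
        str(int(nfiles)),
    ]
    # Join with a separator so adjacent fields cannot run together.
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


# Same index orders, different domain sizes: the keys differ, so both can be cached.
full = index_key((0, 0, 0), (1, 1, 1), "abc123", (True, True, True), 2, 7, 16)
sub = index_key((0, 0, 0), (0.5, 0.5, 0.5), "abc123", (True, True, True), 2, 7, 16)
```

With a key like this, two indices of the same order built for different bounding boxes no longer collide, which is the limitation the summary describes.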

PR Checklist

New features are documented, with docstrings and narrative docs
Adds a test for any bugs fixed. Adds tests for new features.

neutrinoceros · 2023-01-18T17:55:46Z

We need to coordinate this and #2711
@matthewturk

Xarthisius · 2023-01-18T17:57:30Z

We need to coordinate this and #2711 @matthewturk

I don't think it's necessary. I'm not touching any ewah proper.

neutrinoceros · 2023-01-18T18:02:13Z

I have very little expertise here but I don't fancy adding h5py as a hard requirement. However if it comes to that, this will require a couple edits to reflect it in pyproject.toml
Note that the Python 3.11 tests can't possibly run at the moment without building h5py from source, which isn't trivial.

neutrinoceros · 2023-01-18T18:07:17Z

pyproject.toml

@@ -43,6 +43,7 @@ keywords = [
 requires-python = ">=3.8"
 dependencies = [
    "cmyt>=1.1.2",
+    "h5py>=3.1.0,<4.0.0",


I think the <4.0.0 part was probably me trying to soothe my anger for some reason I don't even remember. In hindsight I view this as an anti-pattern and I'd rather we remove it if we're going to promote h5py as a hard requirement. You'll also want to add the corresponding constraint to the minimal target.

neutrinoceros · 2023-01-18T18:08:27Z

I don't think it's necessary. I'm not touching any ewah proper.

My bad. Viewing on my phone, I got ahead of myself.

jzuhone · 2023-01-18T18:33:11Z

FWIW I'm in favor of making h5py a hard requirement--long ago it used to be, and over ~90% of our functionality uses it in one way or another.

Surely h5py wheels have been built by now for Python 3.11?

neutrinoceros · 2023-01-18T18:40:24Z

Surely h5py wheels have been built by now for Python 3.11?

Believe it or not, but not yet. I've been watching the repo closely for weeks, because it's the one reason why half a dozen projects I maintain can't be upgraded fully yet (yt included).

neutrinoceros

Genuine, possibly naive question: what's the benefit of using h5py here ? I think I understand the maintenance costs much better, so right now it's a little hard for me to judge if it is worth it.

neutrinoceros · 2023-01-18T21:52:54Z

yt/geometry/particle_geometry_handler.py

-            fname, _ = _current_fname(
-                self.regions.index_order1, self.regions.index_order2
-            )
+            fname = getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"


I would expect this simpler form

Suggested change

fname = getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"

fname = getattr(ds, "index_filename", f"{ds.parameter_filename}.ewah")

but it's not 100% equivalent: it behaves differently in case ds.index_filename is defined but not truthy. Does this matter here ?

Yeah, I don't think that `or` is pretty either, but I specifically wanted to guard against someone/something passing None or ''.
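
The difference between the two forms discussed here can be checked with a small sketch. The function names and the `SimpleNamespace` stand-ins for a dataset are hypothetical; only the two `getattr` expressions come from the review.

```python
from types import SimpleNamespace


def sidecar_name_default(ds):
    # Fallback applies only when the attribute is missing entirely.
    return getattr(ds, "index_filename", f"{ds.parameter_filename}.ewah")


def sidecar_name_or(ds):
    # Fallback also applies when the attribute exists but is None or "".
    return getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"


# Hypothetical dataset stand-ins covering the three cases.
missing = SimpleNamespace(parameter_filename="snap_000")
unset = SimpleNamespace(parameter_filename="snap_000", index_filename=None)
custom = SimpleNamespace(parameter_filename="snap_000", index_filename="my.ewah")
```

For `missing` and `custom` both functions agree. For `unset`, the default-argument form returns `None`, while the `or` form falls back to the `.ewah` name.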

matthewturk · 2023-01-18T22:18:36Z

I can speak to a bit of the HDF5 advantage. I see three -- the first is that it lets us consolidate into a single file what would otherwise have been spread across multiple. The second is that we can have, in a forward-compatible way, metadata for the datasets. And finally, it lets us ensure that we aren't doing anything platform-specific with the EWAH arrays, which we have guarded against but which is a possibility, given that the word-alignment may change on different machines.
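
As a generic illustration of the platform-dependence concern, the sketch below uses Python's `struct` module to compare a native-alignment layout with an explicit, fixed one. It does not reproduce how yt serializes EWAH arrays; the record layout (`BQ`) is an assumption chosen only to show that raw binary layouts can differ between machines unless the format pins them down.

```python
import struct

# One 1-byte flag followed by one 8-byte unsigned word.
FIELDS = "BQ"

# Native mode ("@") uses the platform's sizes and alignment, so padding may appear.
native_size = struct.calcsize("@" + FIELDS)

# Standard little-endian mode ("<") uses fixed sizes and no padding.
portable_size = struct.calcsize("<" + FIELDS)

# Round-trip a record through the fixed layout.
record = struct.pack("<" + FIELDS, 1, 0xDEADBEEF)
flag, word = struct.unpack("<" + FIELDS, record)
```

A self-describing container such as HDF5 records the on-disk type of each dataset, which takes this kind of bookkeeping away from the calling code.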

jzuhone · 2023-01-18T22:39:09Z

plus HDF5 makes it trivial for a user to inspect via other means such as h5dump etc

neutrinoceros · 2023-01-18T22:42:13Z

Maybe I need to get into the details more, but all of that sounds achievable without it (to me). I get that it makes the code easier to write (?), but so far I'm not convinced it makes it easier to maintain, mostly because it puts a lot of maintenance cost on the whole code base.

Just to make my case clear too, here are the costs I anticipate

added complexity for testing dev versions of our dependencies (again, building h5py from source isn't as easy as numpy or matplotlib)
new opportunities for blocking bugs when, e.g., numpy breaks h5py (this sort of conflict already happens quite often with our dependencies)
we'll have to wait for months after the fact to even test a new Python version properly, which to me would be a regression. Last year we were able to test Python 3.11 as early as beta 4 and I discovered a critical compilation error in yt much before it actually reached any user. This is stuff I care about.
finally, and this is a more personal point, but as a user I tend to avoid h5py as much as I can, and I'm glad I don't need it for any of the stuff I do with yt.

I don't mean to block this, I just want to avoid that the decision be made lightly.

jzuhone · 2023-01-19T02:00:24Z

@neutrinoceros,

Maybe I need to get into the details more, but all of that sounds achievable without it (to me).

In theory yes, but it's nowhere near as easy, and if the file were an unformatted binary instead, the increased complexity would make it more likely that one makes mistakes when reading either the data itself or the metadata. The whole point of having a format like HDF5 is that it makes it much easier to read without making mistakes.

I get that it makes the code easier to write (?)

it definitely makes it not only easier to write, but also to interpret.

but so far I'm not convinced it makes it easier to maintain, mostly because it puts a lot of maintenance cost on the whole code base.
Just to make my case clear too, here are the costs I anticipate
added complexity for testing dev versions of our dependencies (again, building h5py from source isn't as easy as numpy or matplotlib)
new opportunities for blocking bugs when, e.g., numpy breaks h5py (this sort of conflict already happens quite often with our dependencies)
we'll have to wait for months after the fact to even test a new Python version properly, which to me would be a regression. Last year we were able to test Python 3.11 as early as beta 4 and I discovered a critical compilation error in yt much before it actually reached any user. This is stuff I care about.

HDF5 is ubiquitous in yt--not quite everywhere, of course, but most of the frontends use it in some way. Based on a quick scan, these are the ones that use it (at least in part):

Enzo
FLASH
Gadget
Arepo
GDF
Athena++
Chimera
Cholla
Chombo
Eagle
Enzo-E
GAMER
GIZMO
Moab
OWLS
Swift
ytdata

And we use it for file exporting in several places (yt arrays, fixed resolution buffers, data objects, etc).

finally, and this is a more personal point, but as a user I tend to avoid h5py as much as I can, and I'm glad I don't need it for any of the stuff I do with yt.

I think it's safe to say that as far as users go, those who can avoid HDF5 are very much in the minority. I don't think your experience is typical at all. Admittedly anecdotal, but several years back the developer of a major astrophysics hydrodynamics code who had avoided HDF5 for years finally gave in and re-wrote their entire I/O using it because it was just too convenient not to use, and doing so saved so much hassle for their users.

As I noted previously, we had it as a hard dependency in the beginning. I don't remember exactly when we dropped it as such (@matthewturk or @Xarthisius, do you remember?). I recall thinking at the time that dropping it as a hard dependency was unnecessary.

I do not recall since I've started using yt (which has been since 2010) that there have been this many problems with h5py--it makes me suspect that this is a recent phenomenon with Python 3.11 and probably will not be a long-term issue. I admit being baffled why they don't have a wheel for it--Anaconda has a binary that works just fine.

I take your points that there are some challenges, but I definitely think that they are fixable (either by us or h5py) and the benefits significantly outweigh the costs. Most of the frontends that need EWAH files use HDF5 anyway already.

jzuhone · 2023-01-19T02:07:30Z

latest status of h5py and Python 3.11: h5py/h5py#2146 (comment)

jzuhone · 2023-01-19T02:08:56Z

we could even wait on this until h5py has their act together on wheels

neutrinoceros · 2023-01-20T16:08:57Z

Thanks John for your detailed response.

I think it's safe to say that as far as users go, those who can avoid HDF5 are very much in the minority. I don't think your experience is typical at all.

granted !

I do not recall since I've started using yt (which has been since 2010) that there have been this many problems with h5py--it makes me suspect that this is a recent phenomenon with Python 3.11 and probably will not be a long-term issue.

I went back in their release history and it seems that this year is indeed an outlier. In recent years (excluding the current one) wheels for new Python versions were published within a month after the final release. Assuming this better represents what can be typically expected, the situation wouldn't be as bad as I previously described, but we'd still lose the ability to test pre-releases (probably).

I definitely think that they are fixable (either by us or h5py) and the benefits significantly outweigh the costs.

well I honestly don't want to spend time studying the benefits too much so I'll trust you on that one.

Two more remarks I thought of in between comments:

I wondered how much startup overhead this would add to import yt, and found that it's about 3% currently, so it's definitely acceptable

This means we probably don't need to keep yt.utilities.on_demand_imports._h5py_imports, but we need to be careful about importing order between h5py and netCDF4 see

yt/yt/utilities/on_demand_imports.py

Lines 85 to 94 in 4eda8b2

    class netCDF4_imports(OnDemand):
        def __init__(self):
            # this ensures the import ordering between netcdf4 and h5py. If h5py is
            # imported first, can get file lock errors on some systems (including travis-ci)
            # so we need to do this before initializing h5py_imports()!
            # similar to this issue https://github.com/pydata/xarray/issues/2560
            try:
                import netCDF4  # noqa F401
            except ImportError:
                pass
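
For readers unfamiliar with the on-demand import pattern referred to here, a minimal sketch of the general idea follows. `LazyModule` is a hypothetical stand-in and is not yt's `OnDemand` class; `json` is used only so the example runs without optional dependencies.

```python
import importlib


class LazyModule:
    """Hypothetical stand-in for an on-demand import wrapper."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Reached only for attributes not found on the wrapper itself,
        # so the real import happens on first use rather than at startup.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ImportError as exc:
                raise RuntimeError(
                    f"{self._name} is required for this feature but is not installed"
                ) from exc
        return getattr(self._module, attr)


# `json` is a placeholder here; the case discussed in this thread is h5py.
lazy_json = LazyModule("json")
encoded = lazy_json.dumps({"order": 2})
```

With this pattern, a missing optional package raises an error only when the feature that needs it is actually used, instead of breaking the top-level import.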

matthewturk · 2023-01-20T16:21:06Z

I'll chip in a bit more but I would suggest that we not change the h5py on-demand-import just yet

neutrinoceros · 2023-01-20T16:23:19Z

why not ?

matthewturk · 2023-01-20T16:24:07Z

Well, because hdf5 wouldn't strictly be a hard dependency outside of particle codes. For instance, idefix wouldn't need it!

neutrinoceros · 2023-01-20T16:29:30Z

No no, it is a strict requirement here just as much as matplotlib: not having it installed will just break import yt. Besides, as I previously mentioned, the startup overhead it contributes is small enough that there's no real benefit to trying to delay it. Even if we did, I don't think that it's feasible in a .pyx module.

matthewturk · 2023-01-20T16:32:28Z

I mean, if we use the on-demand, that shouldn't be the case. Right?

neutrinoceros · 2023-01-20T16:34:15Z

Well if I believed it was feasible in Cython, I would have suggested we do that instead of annoying you all :)

matthewturk · 2023-01-20T16:37:45Z

My reading is that we are doing this inside a def inside Cython, which will utilize Python bindings where available.

neutrinoceros · 2023-01-20T16:40:17Z

Worth a shot. If it works, would that make everybody happy ?

matthewturk · 2023-01-20T17:19:52Z

So I actually think I want to step back and say, we should probably just have h5py be a hard dependency. In particular, I think it would be extremely helpful to the majority of cases if conda install yt always got h5py.

neutrinoceros · 2023-01-20T18:26:14Z

I understand that extra dependencies are not as nice in conda land, but we could have h5py be declared as a hard dependency in conda builds and still keep it as optional for PyPI wheels. I think this would get us the best of both worlds, how about that ?

Xarthisius · 2023-01-20T18:35:44Z

After 20+ comments about h5py, I wouldn't be offended if someone chimed in on whether what I did with ewah sidecar is the way to go, does it work for use cases I haven't tried and/or are we generally happy with that approach regardless of the technology used to store it...

matthewturk

I really like this change.

yt/geometry/particle_oct_container.pyx

jzuhone · 2023-01-20T22:09:29Z

I'll test this either tonight or tomorrow

jzuhone · 2023-01-21T03:04:15Z

@Xarthisius I checked this against a couple of recent use cases (including adding a bounding box) and it works very well!

neutrinoceros · 2023-01-22T16:42:07Z

@Xarthisius is this still "in progress" ? Looks like we are good on reviews but I'm hesitant to push the button while "WIP" is in the title.

Xarthisius requested a review from jzuhone January 18, 2023 17:51

Xarthisius changed the title ~~Initial stab at ewah-sidecar that stores multiple indices.~~ [WIP] Initial stab at ewah-sidecar that stores multiple indices. Jan 18, 2023

Xarthisius added the "enhancement" and "index: particle" labels Jan 18, 2023

Xarthisius force-pushed the h5_ewah branch from 1cf4171 to a8d7128 Compare January 18, 2023 17:52

neutrinoceros reviewed Jan 18, 2023

Xarthisius force-pushed the h5_ewah branch 2 times, most recently from 5772967 to 5192a9e Compare January 19, 2023 00:35

Xarthisius added 2 commits January 18, 2023 19:08

Initial stab at ewah-sidecar that stores multiple indices.

eed1295

Add h5py as hard dependency

2902353

Xarthisius force-pushed the h5_ewah branch from 760c8e9 to 8cd9929 Compare January 19, 2023 01:08

Remove a test that is no longer valid

e28b487

Xarthisius force-pushed the h5_ewah branch from 8cd9929 to e28b487 Compare January 19, 2023 01:09

Xarthisius force-pushed the h5_ewah branch from 61ca13d to bc05b4d Compare January 20, 2023 19:56

Revert adding h5py as a hard dep

7e6cdda

Xarthisius force-pushed the h5_ewah branch from bc05b4d to 7e6cdda Compare January 20, 2023 20:02

matthewturk previously approved these changes Jan 20, 2023

yt/geometry/particle_oct_container.pyx (outdated, resolved)

0-pad nfile datasets

89d635a

Xarthisius dismissed matthewturk’s stale review via 89d635a January 20, 2023 22:35

jzuhone approved these changes Jan 21, 2023

Xarthisius changed the title ~~[WIP] Initial stab at ewah-sidecar that stores multiple indices.~~ ENH: Allow to store multiple bitmap indices in the ewah-sidecar Jan 22, 2023

neutrinoceros merged commit e7878a1 into yt-project:main Jan 24, 2023

This was referenced Mar 11, 2023

Continued problems with stored EWAH files #4371

Closed

BUG: Fix index order if loading from EWAH file #4372

Closed

neutrinoceros mentioned this pull request Jun 5, 2023

OverflowError in ParticleBitmap #4471

Closed
