Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow to store multiple bitmap indices in the ewah-sidecar #4307

Merged
merged 5 commits into from
Jan 24, 2023

Conversation

Xarthisius
Copy link
Member

@Xarthisius Xarthisius commented Jan 18, 2023

PR Summary

My attempt at fixing #3327. The main "upgrade" is that we can now store indices of the same order for different domain sizes, which we currently don't allow. Each cached index is identified by: left_edge, right_edge, hash of data file, periodicity, index_order1, index_order2 and nfiles.
Drowback is that h5py is a strict requirement now.

PR Checklist

  • New features are documented, with docstrings and narrative docs
  • Adds a test for any bugs fixed. Adds tests for new features.

@Xarthisius Xarthisius changed the title Initial stab at ewah-sidecar that stores multiple indices. [WIP] Initial stab at ewah-sidecar that stores multiple indices. Jan 18, 2023
@Xarthisius Xarthisius added enhancement Making something better index: particle labels Jan 18, 2023
@neutrinoceros
Copy link
Member

We need to coordinate this and #2711
@matthewturk

@Xarthisius
Copy link
Member Author

We need to coordinate this and #2711 @matthewturk

I don't think it's necessary. I'm not touching any ewah proper.

@neutrinoceros
Copy link
Member

I have very little expertise here but I don't fancy adding h5py as a hard requirement. However if it comes to that, this will require a couple edits to reflect it in pyproject.toml
Note that the Python 3.11 tests can't possibly run at the moment without building h5py from source, which isn't trivial.

pyproject.toml Outdated
@@ -43,6 +43,7 @@ keywords = [
requires-python = ">=3.8"
dependencies = [
"cmyt>=1.1.2",
"h5py>=3.1.0,<4.0.0",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the <4.0.0 part was probably me trying to sooth my anger for some reason I don't even remember. In hindsight I view this as an anti pattern and I'd rather we remove it if we're going to promote h5py as a hard requirement. You'll also want to add a the corresponding constraint to the minimal target.

@neutrinoceros
Copy link
Member

I don't think it's necessary. I'm not touching any ewah proper.

My bad. Viewing on my phone, I got ahead of myself.

@jzuhone
Copy link
Contributor

jzuhone commented Jan 18, 2023

FWIW I'm in favor of making h5py a hard requirement--long ago it used to be, and over ~90% of our functionality uses it in one way or another.

Surely h5py wheels have been built by now for Python 3.11?

@neutrinoceros
Copy link
Member

Surely h5py wheels have been built by now for Python 3.11?

Believe it or not, but not yet. I've been watching the repo closely for weeks, because it's the one reason why half a dozen projects I maintain can't be upgraded fully yet (yt included).

Copy link
Member

@neutrinoceros neutrinoceros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Genuine, possibly naive question: what's the benefit of using h5py here ? I think I understand the maintenance costs much better, so right now it's a little hard for me to judge if it is worth it.

fname, _ = _current_fname(
self.regions.index_order1, self.regions.index_order2
)
fname = getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this simpler form

Suggested change
fname = getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"
fname = getattr(ds, "index_filename", f"{ds.parameter_filename}.ewah")

but it's not 100% equivalent: it behaves differently in case ds.index_filename is defined but not truthy. Does this matter here ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think that or is pretty either, but I exactly wanted to avoid someone/something passing None or ''.

@matthewturk
Copy link
Member

I can speak to a bit of the HDF5 advantage. I see three -- the first is that it lets us consolidate into a single file what would otherwise have been spread across multiple. The second is that we can have, in a forward-compatible way, metadata for the datasets. And finally, it lets us ensure that we aren't doing anything platform-specific with the EWAH arrays, which we have guarded against but which is a possibility, given that the word-alignment may change on different machines.

@jzuhone
Copy link
Contributor

jzuhone commented Jan 18, 2023

plus HDF5 makes it trivial for a user to inspect via other means such as h5dump etc

@neutrinoceros
Copy link
Member

neutrinoceros commented Jan 18, 2023

Maybe I need to get into the details more, but all of that sounds achievable without it (to me). I get that it makes the code easier to write (?), but so far I'm not convinced it makes it easier to maintain, mostly because it puts a lot of maintenance cost on the whole code base.

Just to make my case clear too, here are the costs I anticipate

  • added complexity for testing dev versions of our dependencies (again, building h5py from source isn't as easy as numpy or matplotlib)
  • new opportunities for blocking bugs when, e.g., numpy breaks h5py (this sort of conflict already happens quite often with our dependencies)
  • we'll have to wait for months after the fact to even test a new Python version properly, which to me would be a regression. Last year we were able to test Python 3.11 as early as beta 4 and I discovered a critical compilation error in yt much before it actually reached any user. This is stuff I care about.
  • finally, and this is a more personal point, but as a user I tend to avoid h5py as much as I can, and I'm glad I don't need it for any of the stuff I do with yt.

I don't mean to block this, I just want to avoid that the decision be made lightly.

@Xarthisius Xarthisius force-pushed the h5_ewah branch 2 times, most recently from 5772967 to 5192a9e Compare January 19, 2023 00:35
@jzuhone
Copy link
Contributor

jzuhone commented Jan 19, 2023

@neutrinoceros,

Maybe I need to get into the details more, but all of that sounds achievable without it (to me).

In theory yes, but it's nowhere near as easy, and if the file was an unformatted binary instead, the increased complexity will make it more likely that one will make mistakes in trying to read either the data itself or the metadata. The whole point of having a format like HDF5 is that it makes it much easier to read without making mistakes.

I get that it makes the code easier to write (?)

it definitely makes it not only easier to write, but also to interpret.

but so far I'm not convinced it makes it easier to maintain, mostly because it puts a lot of maintenance cost on the whole code base.
Just to make my case clear too, here are the costs I anticipate
added complexity for testing dev versions of our dependencies (again, building h5py from source isn't as easy as numpy or matplotlib)
new opportunities for blocking bugs when, e.g., numpy breaks h5py (this sort of conflict already happens quite often with our dependencies)
we'll have to wait for months after the fact to even test a new Python version properly, which to me would be a regression. Last year we were able to test Python 3.11 as early as beta 4 and I discovered a critical compilation error in yt much before it actually reached any user. This is stuff I care about.

HDF5 is ubiquitous in yt--not quite everywhere, of course, but most of the frontends use it in some way. Based on a quick scan, these are the ones that use it (at least in part):

  • Enzo
  • FLASH
  • Gadget
  • Arepo
  • GDF
  • Athena++
  • Chimera
  • Cholla
  • Chombo
  • Eagle
  • Enzo-E
  • GAMER
  • GIZMO
  • Moab
  • OWLS
  • Swift
  • ytdata

And we use it for file exporting in several places (yt arrays, fixed resolution buffers, data objects, etc).

finally, and this is a more personal point, but as a user I tend to avoid h5py as much as I can, and I'm glad I don't need it for any of the stuff I do with yt.

I think it's safe to say that as far as users go, those who can avoid HDF5 are very much in the minority. I don't think your experience is typical at all. Admittedly anecdotal, but several years back the developer of a major astrophysics hydrodynamics code who had avoided HDF5 for years finally gave in and re-wrote their entire I/O using it because it was just too convenient not to use, and doing so saved so much hassle for their users.

As I noted previously, we had it as a hard dependency in the beginning. I don't remember exactly when we dropped it as such (@matthewturk or @Xarthisius, do you remember?). I recall thinking at the time that dropping it as a hard dependency was unnecessary.

I do not recall since I've started using yt (which has been since 2010) that there have been this many problems with h5py--it makes me suspect that this is a recent phenomenon with Python 3.11 and probably will not be a long-term issue. I admit being baffled why they don't have a wheel for it--Anaconda has a binary that works just fine.

I take your points that there are some challenges, but I definitely think that they are fixable (either by us or h5py) and the benefits significantly outweigh the costs. Most of the frontends that need EWAH files use HDF5 anyway already.

@jzuhone
Copy link
Contributor

jzuhone commented Jan 19, 2023

latest status of h5py and Python 3.11: h5py/h5py#2146 (comment)

@jzuhone
Copy link
Contributor

jzuhone commented Jan 19, 2023

we could even wait on this until h5py has their act together on wheels

@neutrinoceros
Copy link
Member

Thanks John for your detailed response.

I think it's safe to say that as far as users go, those who can avoid HDF5 are very much in the minority. I don't think your experience is typical at all.

granted !

I do not recall since I've started using yt (which has been since 2010) that there have been this many problems with h5py--it makes me suspect that this is a recent phenomenon with Python 3.11 and probably will not be a long-term issue.

I went back in their release history and it seems that this year is indeed an outlier. In recent years (excluding the current one) wheels for new Python versions were published within a month after the final release. Assuming this better represents what can be typically expected, the situation wouldn't be as bad as I previously described, but we'd still loose the ability to test pre-releases (probably).

I definitely think that they are fixable (either by us or h5py) and the benefits significantly outweigh the costs.

well I honestly don't want to spend time studying the benefits too much so I'll trust you on that one.

Two more remarks I thought of in between comments:

  • I wondered how much startup overhead this would add to import yt, and found that it's about 3% currently, so it's definitely acceptable
  • This means we probably don't need to keep yt.utilities.on_demand_imports._h5py_imports, but we need to be careful about importing order between h5py and netCDF4 see
    class netCDF4_imports(OnDemand):
    def __init__(self):
    # this ensures the import ordering between netcdf4 and h5py. If h5py is
    # imported first, can get file lock errors on some systems (including travis-ci)
    # so we need to do this before initializing h5py_imports()!
    # similar to this issue https://github.com/pydata/xarray/issues/2560
    try:
    import netCDF4 # noqa F401
    except ImportError:
    pass

@matthewturk
Copy link
Member

I'll chip in a bit more but I would suggest that we not change the h5py on-demand-import just yet

@neutrinoceros
Copy link
Member

why not ?

@matthewturk
Copy link
Member

Well, because hdf5 wouldn't strictly be a hard dependency outside of particle codes. For instance, idefix wouldn't need it!

@neutrinoceros
Copy link
Member

No no, it is a strict requirement here just as much as matplotlib: not having it installed will just break import yt. Besides, as I previously mentioned, the startup overhead it contributes is small enough that there's no real benefit to trying to delay it. Even if we did, I don't think that it's feasible in a .pyx module.

@matthewturk
Copy link
Member

I mean, if we use the on-demand, that shouldn't be the case. Right?

@neutrinoceros
Copy link
Member

Well if I believed it was feasible in Cython, I would have suggested we do that instead of annoying you all :)

@matthewturk
Copy link
Member

My reading is that we are doing this inside a def inside Cython, which will utilize Python bindings where available.

@neutrinoceros
Copy link
Member

neutrinoceros commented Jan 20, 2023

Worth a shot. If it works, would that make everybody happy ?

@matthewturk
Copy link
Member

So I actually think I want to step back and say, we should probably just have h5py be a hard dependency. In particular, I think it would be extremely helpful to the majority of cases if conda install yt always got h5py.

@neutrinoceros
Copy link
Member

I understand that extra dependencies are not as nice in conda land, but we could have h5py be declared as a hard dependency in conda builds and still keep it as optional for PyPI wheels. I think this would get us the best of both worlds, how about that ?

@Xarthisius
Copy link
Member Author

After 20+ comments about h5py, I wouldn't be offended if someone chimed in on whether what I did with ewah sidecar is the way to go, does it work for use cases I haven't tried and/or are we generally happy with that approach regardless of the technology used to store it...

matthewturk
matthewturk previously approved these changes Jan 20, 2023
Copy link
Member

@matthewturk matthewturk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this change.

yt/geometry/particle_oct_container.pyx Outdated Show resolved Hide resolved
@jzuhone
Copy link
Contributor

jzuhone commented Jan 20, 2023

I'll test this either tonight or tomorrow

@jzuhone
Copy link
Contributor

jzuhone commented Jan 21, 2023

@Xarthisius I checked this against a couple of recent use cases (including adding a bounding box) and it works very well!

@neutrinoceros
Copy link
Member

@Xarthisius is this still "in progress" ? Looks like we are good on reviews but I'm hesitant to push the button while "WIP" is in the title.

@Xarthisius Xarthisius changed the title [WIP] Initial stab at ewah-sidecar that stores multiple indices. ENH: Allow to store multiple bitmap indices in the ewah-sidecar Jan 22, 2023
@neutrinoceros neutrinoceros merged commit e7878a1 into yt-project:main Jan 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Making something better index: particle
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants