ENH: Allow to store multiple bitmap indices in the ewah-sidecar #4307
We need to coordinate this and #2711
I don't think it's necessary. I'm not touching any ewah proper.
I have very little expertise here but I don't fancy adding h5py as a hard requirement. However if it comes to that, this will require a couple edits to reflect it in |
pyproject.toml
Outdated
@@ -43,6 +43,7 @@ keywords = [
 requires-python = ">=3.8"
 dependencies = [
     "cmyt>=1.1.2",
+    "h5py>=3.1.0,<4.0.0",
I think the <4.0.0 part was probably me trying to soothe my anger for some reason I don't even remember. In hindsight I view this as an anti-pattern and I'd rather we remove it if we're going to promote h5py as a hard requirement. You'll also want to add the corresponding constraint to the minimal target.
My bad. Viewing on my phone, I got ahead of myself.
FWIW I'm in favor of making h5py a hard requirement--long ago it used to be, and over ~90% of our functionality uses it in one way or another. Surely h5py wheels have been built by now for Python 3.11?
Believe it or not, but not yet. I've been watching the repo closely for weeks, because it's the one reason why half a dozen projects I maintain can't be upgraded fully yet (yt included).
Genuine, possibly naive question: what's the benefit of using h5py here? I think I understand the maintenance costs much better, so right now it's a little hard for me to judge whether it is worth it.
fname, _ = _current_fname(
    self.regions.index_order1, self.regions.index_order2
)
fname = getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"
I would expect this simpler form
- fname = getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"
+ fname = getattr(ds, "index_filename", f"{ds.parameter_filename}.ewah")
but it's not 100% equivalent: it behaves differently in case ds.index_filename is defined but not truthy. Does this matter here?
Yeah, I don't think that "or" is pretty either, but I specifically wanted to avoid someone/something passing None or ''.
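To make the difference between the two forms concrete, here is a minimal sketch. The helper names and the stand-in dataset objects are made up for illustration (they are not yt classes); only the two getattr expressions come from the review thread.

```python
from types import SimpleNamespace


def fname_with_or(ds):
    # Form used in the PR: falls back whenever the attribute is missing OR falsy.
    return getattr(ds, "index_filename", None) or f"{ds.parameter_filename}.ewah"


def fname_with_default(ds):
    # Suggested simpler form: falls back only when the attribute is missing.
    return getattr(ds, "index_filename", f"{ds.parameter_filename}.ewah")


# Hypothetical stand-ins for a dataset object.
missing = SimpleNamespace(parameter_filename="snap_000")
unset = SimpleNamespace(parameter_filename="snap_000", index_filename=None)
empty = SimpleNamespace(parameter_filename="snap_000", index_filename="")
custom = SimpleNamespace(parameter_filename="snap_000", index_filename="my.ewah")

for label, ds in [("missing", missing), ("None", unset), ("''", empty), ("custom", custom)]:
    print(f"{label:8s} or-form={fname_with_or(ds)!r:16s} default-form={fname_with_default(ds)!r}")
```

The two forms agree when the attribute is absent or set to a real filename; they differ only when it is explicitly None or '', which is the case the "or" form guards against.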
I can speak to a bit of the HDF5 advantage. I see three -- the first is that it lets us consolidate into a single file what would otherwise have been spread across multiple. The second is that we can have, in a forward-compatible way, metadata for the datasets. And finally, it lets us ensure that we aren't doing anything platform-specific with the EWAH arrays, which we have guarded against but which is a possibility, given that the word-alignment may change on different machines. |
plus HDF5 makes it trivial for a user to inspect via other means such as |
Maybe I need to get into the details more, but all of that sounds achievable without it (to me). I get that it makes the code easier to write (?), but so far I'm not convinced it makes it easier to maintain, mostly because it puts a lot of maintenance cost on the whole code base. Just to make my case clear too, here are the costs I anticipate
I don't mean to block this, I just want to avoid that the decision be made lightly. |
In theory yes, but it's nowhere near as easy, and if the file was an unformatted binary instead, the increased complexity will make it more likely that one will make mistakes in trying to read either the data itself or the metadata. The whole point of having a format like HDF5 is that it makes it much easier to read without making mistakes.
it definitely makes it not only easier to write, but also to interpret.
HDF5 is ubiquitous in yt--not quite everywhere, of course, but most of the frontends use it in some way. Based on a quick scan, these are the ones that use it (at least in part):
And we use it for file exporting in several places (yt arrays, fixed resolution buffers, data objects, etc).
I think it's safe to say that as far as users go, those who can avoid HDF5 are very much in the minority. I don't think your experience is typical at all. Admittedly anecdotal, but several years back the developer of a major astrophysics hydrodynamics code who had avoided HDF5 for years finally gave in and re-wrote their entire I/O using it because it was just too convenient not to use, and doing so saved so much hassle for their users. As I noted previously, we had it as a hard dependency in the beginning. I don't remember exactly when we dropped it as such (@matthewturk or @Xarthisius, do you remember?). I recall thinking at the time that dropping it as a hard dependency was unnecessary. I do not recall since I've started using yt (which has been since 2010) that there have been this many problems with I take your points that there are some challenges, but I definitely think that they are fixable (either by us or |
latest status of |
we could even wait on this until h5py has their act together on wheels
Thanks John for your detailed response.
granted !
I went back in their release history and it seems that this year is indeed an outlier. In recent years (excluding the current one) wheels for new Python versions were published within a month after the final release. Assuming this better represents what can be typically expected, the situation wouldn't be as bad as I previously described, but we'd still lose the ability to test pre-releases (probably).
well I honestly don't want to spend time studying the benefits too much so I'll trust you on that one. Two more remarks I thought of in between comments:
I'll chip in a bit more but I would suggest that we not change the h5py on-demand-import just yet
why not?
Well, because hdf5 wouldn't strictly be a hard dependency outside of particle codes. For instance, idefix wouldn't need it!
No no, it is a strict requirement here just as much as matplotlib: not having it installed will just break
I mean, if we use the on-demand, that shouldn't be the case. Right?
Well if I believed it was feasible in Cython, I would have suggested we do that instead of annoying you all :)
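For readers unfamiliar with the on-demand import idea being debated here, the following is a generic pure-Python sketch, not yt's actual implementation. The class name OnDemandModule is hypothetical, and whether the same trick is usable from Cython is exactly what the thread leaves open.

```python
import importlib


class OnDemandModule:
    """Hypothetical lazy wrapper: the real import happens on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # Only reached for attributes not found on the wrapper itself.
        if self._module is None:
            try:
                self._module = importlib.import_module(self._name)
            except ImportError as exc:
                raise ImportError(
                    f"{self._name} is required for this feature but is not installed"
                ) from exc
        return getattr(self._module, attr)


# Creating the wrappers never fails, even for a package that is absent.
lazy_json = OnDemandModule("json")
lazy_missing = OnDemandModule("surely_not_an_installed_module")

print(lazy_json.dumps([1, 2]))  # the import of json happens here

try:
    lazy_missing.File  # the failure is deferred until the module is actually used
except ImportError as err:
    print(err)
```

With this pattern a missing package only raises at the point where a feature touches it, which is why code paths that never use the wrapped module keep working.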
My reading is that we are doing this inside a |
Worth a shot. If it works, would that make everybody happy?
So I actually think I want to step back and say, we should probably just have h5py be a hard dependency. In particular, I think it would be extremely helpful to the majority of cases if |
I understand that extra dependencies are not as nice in conda land, but we could have h5py be declared as a hard dependency in conda builds and still keep it as optional for PyPI wheels. I think this would get us the best of both worlds, how about that?
After 20+ comments about h5py, I wouldn't be offended if someone chimed in on whether what I did with ewah sidecar is the way to go, does it work for use cases I haven't tried and/or are we generally happy with that approach regardless of the technology used to store it...
I really like this change.
I'll test this either tonight or tomorrow
@Xarthisius I checked this against a couple of recent use cases (including adding a bounding box) and it works very well!
@Xarthisius is this still "in progress"? Looks like we are good on reviews but I'm hesitant to push the button while "WIP" is in the title.
PR Summary
My attempt at fixing #3327. The main "upgrade" is that we can now store indices of the same order for different domain sizes, which we currently don't allow. Each cached index is identified by: left_edge, right_edge, hash of data file, periodicity, index_order1, index_order2 and nfiles.
Drawback is that h5py is a strict requirement now.
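As an illustration of how the identifying fields listed in the summary could distinguish cached indices, here is a minimal sketch that folds them into a single lookup key. The function index_key and the choice of JSON plus SHA-256 are assumptions made for this example; the PR's actual storage layout inside the sidecar file may differ.

```python
import hashlib
import json


def index_key(left_edge, right_edge, file_hash, periodicity,
              index_order1, index_order2, nfiles):
    """Hypothetical helper: derive one stable key from the identifying fields."""
    payload = {
        "left_edge": [float(x) for x in left_edge],
        "right_edge": [float(x) for x in right_edge],
        "file_hash": str(file_hash),
        "periodicity": [bool(p) for p in periodicity],
        "index_order1": int(index_order1),
        "index_order2": int(index_order2),
        "nfiles": int(nfiles),
    }
    # sort_keys gives a deterministic serialization, so equal inputs give equal keys.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Same index orders and data file, two different domain sizes (made-up values).
k_small = index_key([0, 0, 0], [1, 1, 1], "abc123", (True, True, True), 2, 5, 4)
k_large = index_key([0, 0, 0], [2, 2, 2], "abc123", (True, True, True), 2, 5, 4)
print(k_small != k_large)
```

Because the domain edges are part of the key, two indices of the same order built for different bounding boxes get distinct entries instead of overwriting each other.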