ENH: Allow stream frontends to accept callables #3421
Conversation
@chummels I wanted to ping you on this one, as I think it would make it much easier to prototype some of the frontends you've been looking at, specifically as it might allow you to dynamically repartition grids.
```diff
@@ -177,7 +180,9 @@ def process_data(data, grid_dims=None):
     # At this point, we have arrays for all our fields
     new_data = {}
     for field in data:
-        n_shape = len(data[field].shape)
+        n_shape = 3
```
I wasn't incredibly happy with this. Right now, if it's callable, it assumes it's 3D. I'm wondering if maybe we could have a decorator that supplied some metadata about the field that would either be returned or would be assigned to the function object.
We could provide a `@shape(...)` decorator that would add a `.shape` attribute to callable field functions, so this line would work transparently without change. It would work something like this:

```python
@shape(1, 2, 3)
def my_callable_field(...):
    ...

my_callable_field.shape == (1, 2, 3)
```

That said, if we only really care about dimensionality (`len(data[field].shape)`), then we should just call `data[field].ndim` and have a corresponding, simpler decorator instead:

```python
@ndim(3)
def my_callable_field(...):
    ...
```
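A minimal sketch of what such an `ndim` decorator could look like (hypothetical; this is not code from the PR, just an illustration of the idea of tagging the function object with metadata):

```python
def ndim(n):
    """Hypothetical decorator: tag a callable field function with an
    ``ndim`` attribute so code that inspects ``data[field].ndim`` can
    treat the callable the same way it treats an actual array."""
    def decorator(func):
        func.ndim = n
        return func
    return decorator

@ndim(3)
def my_callable_field(grid, field_name):
    # Body elided; a real callable field would return an array here.
    pass

# The metadata is now readable without calling the function:
my_callable_field.ndim  # 3
```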
Great idea. I personally think an `ndim` decorator would be the minimum ... I think a more expansive one could come later.
```diff
@@ -18,6 +18,8 @@ def __init__(self, ds):
     def _read_data_set(self, grid, field):
         # This is where we implement processor-locking
         tr = self.fields[grid.id][field]
+        if callable(tr):
```
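The dispatch in the hunk above amounts to something like the following toy sketch (class and method names here are simplified stand-ins, not the PR's actual code):

```python
import numpy as np

class StreamHandlerSketch:
    """Toy model of the stream IO handler lookup: stored values may be
    plain arrays or callables that generate the array on demand."""

    def __init__(self, fields):
        # fields: {grid_id: {field_name: array_or_callable}}
        self.fields = fields

    def read(self, grid_id, field):
        tr = self.fields[grid_id][field]
        if callable(tr):
            # Lazily materialize the field by calling the user-supplied
            # function instead of returning a pre-stored array.
            tr = tr(grid_id, field)
        return tr

handler = StreamHandlerSketch({0: {
    "density": np.ones((4, 4, 4)),
    "temperature": lambda gid, fname: np.full((4, 4, 4), 300.0),
}})
handler.read(0, "density")      # stored array, returned directly
handler.read(0, "temperature")  # generated on demand by the callable
```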
This, and the change below, is not the most elegant. I'm not sure I have a better solution.
it's typing madness, but I'd say that however inelegant, it's at least very compact and doesn't leak out of scope, so I think it's a reasonable approach overall
OK
Awesome! Provided we add tests & doc, this would be a really good addition
Force-pushed from cd7c806 to 35dd1be.
I've rebased this to fix the tests.
@chrishavlin and I have been talking, and there are a handful of changes we want to make, in addition to fixing any broken tests and adding docs and examples.
Probably more, but that was the main thing at the moment.
Ok, with a little more experimentation, I've come up with this as a workable thing:

```python
import yt
from yt.utilities.decompose import get_psize, decompose_array
import numpy as np
import h5py

with h5py.File("example.h5", "w") as f:
    f1, f2, f3 = np.mgrid[0.0:1.0:256j, 0.0:1.0:256j, 0.0:1.0:256j]
    f.create_dataset("/density", data=f1)
    f.create_dataset("/temperature", data=f2)
    f.create_dataset("/dinosaurs", data=f3)

class WeirdFunGenerator:
    def __init__(self, filename):
        self.filename = filename
        self._handle = h5py.File(filename, "r")

    def read_data(self, grid, field_name):
        ftype, fname = field_name
        si = grid.get_global_startindex()
        ei = si + grid.ActiveDimensions
        return self._handle[fname][si[0]:ei[0], si[1]:ei[1], si[2]:ei[2]]

wfg = WeirdFunGenerator("example.h5")
grid_data = []
psize = get_psize(np.array((256, 256, 256)), 16)
left_edges, right_edges, shapes, slices = decompose_array(
    (256, 256, 256), psize, np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
)
for le, re, s in zip(left_edges, right_edges, shapes):
    grid_data.append(
        {
            "left_edge": le,
            "right_edge": re,
            "dimensions": s,
            "density": wfg.read_data,
            "temperature": wfg.read_data,
            "dinosaurs": wfg.read_data,
            "level": 1,
        }
    )

ds = yt.load_amr_grids(grid_data, [256, 256, 256])
ds.r[:, 0.5, :].plot([("gas", "density")])
```

which does what we want! I think if we get docs and a wrapper function that handles all the decomp, we should be good to go. The test-case example we could use is a very simple mmap-ing of an on-disk gigantic array.
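That mmap test-case idea could be sketched roughly like this (hypothetical helper, not part of the PR; the grid interface `get_global_startindex`/`ActiveDimensions` follows the example above, and the factory name is made up):

```python
import numpy as np

def make_memmap_reader(filename, shape, dtype="float64"):
    """Build a callable field that slices an np.memmap of a large
    on-disk array, so only the subregion a grid requests is ever
    actually read into memory."""
    arr = np.memmap(filename, mode="r", dtype=dtype, shape=shape)

    def read_data(grid, field_name):
        si = grid.get_global_startindex()
        ei = si + grid.ActiveDimensions
        return arr[si[0]:ei[0], si[1]:ei[1], si[2]:ei[2]]

    return read_data
```

The returned `read_data` could then be supplied as the field value in `grid_data`, exactly like `wfg.read_data` above.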
With a little bit more work, I made this: https://gist.github.com/84911dce25e6aac8d6ece9a3d6dbe668 which I think might be getting close to what I'd be happy with.
I think that I'd like to get two loaders in. For the latter, the example that I was thinking of was going to be just a set of shaped grids that return their xyz coordinates, to demonstrate it. This would also work for the tests.
here's some feedback
I have two more things to do:
I'd appreciate some feedback on the existing function I added for HDF5 files.
I suspect the docs build failure is related to my stubbing of non-existent files, etc.
I made a notebook that I think I might start with for showing off the way you can load data from functional forms. I also intend to add an example that just loads from individual files, too -- maybe replicating the Enzo data loading. I got a little confused about the best way to show overlapping wavemodes without doing any whole-domain convolution, and I'm not sure I like this all that much ... any suggestions for a better way to demonstrate? https://gist.github.com/82e13dfab8a2f0d956cce791cddb9611 (n.b., this has no narration, and I think I messed things up with the
I've spent some time looking at the slicing stuff, and I think I want to hold off for the time being. I'll check that one off and we can consider doing it at a later time. I'll get the tests added for the HDF5 file.
Given your last message, I'm going to assume it's safe to remove this from the milestone.
Sorry, no, what I meant was: the slices don't need to go in. I think I am ready for this to go in. I want to add tests for the existing functionality, but that's all.
Took a quick look at the current state, since the hdf5 loader was added since I last looked. In addition to the in-line questions/comments, a couple more:

- the `load_amr_grids` docstring entry for `grid_data` should be updated to reflect that you can supply callables
- do you intend this to work fully with `load_uniform_grid` as well? It should work as it is now when `nproc=1`, but it'll break when `nproc>1`. I think the domain decomposition when `nproc > 1` could be rewritten to use the `domain_dimensions` argument along with how you handled the `nchunks` in `load_hdf5_file`? But I'm not sure that's needed for this PR.
In `yt/loaders.py` (outdated):

```python
    dataset_arguments: Optional[dict] = None,
):
    """
    Create a yt dataset given the path to an hdf5 file.
```
Should mention that this works only with grid data and not, e.g., particles stored in an HDF5 file (unless I'm wrong about that).
```python
    root_node: Optional[str] = "/",
    fields: Optional[List[str]] = None,
    bbox: np.ndarray = None,
    nchunks: int = 0,
```
Does `nchunks=0` have meaning beyond being a flag to trigger the auto-chunking? I'm inclined to change this to

```diff
-    nchunks: int = 0,
+    nchunks: Optional[int] = None,
```

and change the `nchunks` logic below to match. But I don't feel strongly about it.
I've updated most of the suggestions, but I am running out of gas on this for the moment. I think the biggest item of what is left from the suggestions is to throw an error for the unsupported `nproc` case. @chrishavlin @neutrinoceros, take a look?
New tests look good to me!
In addition to adding the `nproc` error check, I think it'd be good to update the `load_hdf5` docstring to mention that it's for loading grids (for now at least? unless I'm confused...).
No further comments from me at this point; thank you for adding tests!
I think I've addressed them all! I also added unit stuff like @chrishavlin suggested.
I am very confused by the type-checking error. It hardly seems related to anything we're doing here. What seems to be happening is that we're running mypy targeting Python 3.7, but using numpy 1.23, which uses some Python 3.8+ syntax; that raises an error and prevents type checking from running at all. This seems like something that should happen every time the job runs, or not at all, but I've never seen it happen before.
Before we do anything else, I suggest upgrading mypy, hoping that the issue (which I'm assuming is internal to mypy) was already resolved. I'm working on this in #4139.
PR Summary
At present, all the stream frontends can only accept actual, instantiated arrays. For instance, this means that any calls to `load_uniform_grid` require that the entire array already be present in memory. This isn't a hard-and-fast requirement of the Stream frontend -- in fact, it is the opposite of the Stream frontend's original purpose! -- but it has evolved to be the only way to load data into it. This PR starts the process of reversing this, enabling functions to be supplied instead.
While this is somewhat useful at present, in the long run it will ideally allow us to expose a much simpler interface for prototyping frontends and supplying items to the Stream frontend, without actually writing a fully-fledged frontend and without requiring that the data be loaded in advance.
It is still in a draft state, and is somewhat brittle.
PR Checklist