
ENH: Allow stream frontends to accept callables #3421

Merged: 16 commits, Sep 23, 2022

Conversation

matthewturk (Member)

PR Summary

At present, all the stream frontends can only accept actual, instantiated arrays. For instance, this means that any call to load_uniform_grid requires that the entire array already be present in memory.

This isn't a hard-and-fast requirement of the Stream frontend -- in fact, it's the opposite of the Stream frontend's original purpose! -- but it has evolved to be the only way to load data into it. This PR starts the process of reversing that, enabling functions to be supplied instead.

While this is somewhat useful at present, in the long run it should allow us to expose a much simpler interface for prototyping frontends and for supplying data to the Stream frontend, without writing a fully-fledged frontend and without requiring that the data be loaded in advance.

It is still in a draft state, and is somewhat brittle.

PR Checklist

  • New features are documented, with docstrings and narrative docs
  • Adds a test for any bugs fixed. Adds tests for new features.

@matthewturk added labels: code frontends, enhancement (Jul 8, 2021)
@matthewturk (Member Author)

@chummels I wanted to ping you on this one, as I think it would make it much easier to prototype some of the frontends you've been looking at, specifically as it might allow you to dynamically repartition grids.

@@ -177,7 +180,9 @@ def process_data(data, grid_dims=None):
         # At this point, we have arrays for all our fields
         new_data = {}
         for field in data:
-            n_shape = len(data[field].shape)
+            n_shape = 3
matthewturk (Member Author)

I wasn't incredibly happy with this. Right now, if it's callable, it assumes it's 3D. I'm wondering if maybe we could have a decorator that supplied some metadata about the field that would either be returned or would be assigned to the function object.

neutrinoceros (Member)

we could provide a @shape(...) decorator that would add a .shape attribute to callable field functions, so this line would work transparently, without change. It would work something like this:

@shape(1, 2, 3)
def my_callable_field(...):
    ...

my_callable_field.shape == (1, 2, 3)

That said, if we only really care about dimensionality (i.e., len(data[field].shape)), then we should just call data[field].ndim and have a corresponding, simpler decorator instead.

@ndim(3)
def my_callable_field(...):
    ...
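
A minimal implementation of these two proposed decorators might look like the following (a sketch; neither decorator exists in yt at this point, and the names are just the ones floated above -- the (grid, field_name) signature matches the callables used later in this thread):

def shape(*dims):
    """Attach a .shape attribute to a callable field function."""
    def decorator(func):
        func.shape = tuple(dims)
        return func
    return decorator

def ndim(n):
    """Attach a .ndim attribute to a callable field function."""
    def decorator(func):
        func.ndim = n
        return func
    return decorator

@ndim(3)
def my_callable_field(grid, field_name):
    ...

assert my_callable_field.ndim == 3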

matthewturk (Member Author)

Great idea. I personally think an ndim would be the minimum ... a more expansive one could come later.

@@ -18,6 +18,8 @@ def __init__(self, ds):
     def _read_data_set(self, grid, field):
         # This is where we implement processor-locking
         tr = self.fields[grid.id][field]
+        if callable(tr):
matthewturk (Member Author)

This, and the change below, are not the most elegant. I'm not sure I have a better solution.
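
For context, the pattern under discussion is roughly the following (a sketch of the approach, not the literal diff):

def _read_data_set(self, grid, field):
    # This is where we implement processor-locking
    tr = self.fields[grid.id][field]
    if callable(tr):
        # The stored "field" is a function rather than an array:
        # invoke it now to generate this grid's data on demand.
        tr = tr(grid, field)
    return tr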

neutrinoceros (Member)

It's typing madness, but I'd say that, however inelegant, it's at least very compact and doesn't leak out of scope, so I think it's a reasonable approach overall.

matthewturk (Member Author)

OK

cphyc previously approved these changes Jul 27, 2021
@cphyc (Member) left a comment

Awesome! Provided we add tests and docs, this would be a really good addition.

@matthewturk matthewturk changed the title [WIP] Allow stream frontends to accept callables [WIP] ENH: Allow stream frontends to accept callables Oct 9, 2021
@matthewturk (Member Author)

I've rebased this to fix the tests.

@matthewturk matthewturk changed the title [WIP] ENH: Allow stream frontends to accept callables ENH: Allow stream frontends to accept callables Aug 2, 2022
@matthewturk matthewturk added this to the 4.1.0 milestone Aug 2, 2022
@matthewturk (Member Author) commented Aug 2, 2022

@chrishavlin and I have been talking and there are a handful of changes we want to make, in addition to fixing any broken tests and adding docs and examples.

  1. Make the callables accept a well-defined set of args and kwargs.
  2. Set up "virtual" grids for them, so that you can specify directly the base size and have it decompose using the existing machinery.

Probably more, but those are the main things at the moment.

@matthewturk (Member Author)

Ok, with a little more experimentation, I've come up with this as a workable thing:

import h5py
import numpy as np

import yt
from yt.utilities.decompose import decompose_array, get_psize

# Write out a 256^3 example dataset with three fields.
with h5py.File("example.h5", "w") as f:
    f1, f2, f3 = np.mgrid[0.0:1.0:256j, 0.0:1.0:256j, 0.0:1.0:256j]
    f.create_dataset("/density", data=f1)
    f.create_dataset("/temperature", data=f2)
    f.create_dataset("/dinosaurs", data=f3)

class WeirdFunGenerator:
    def __init__(self, filename):
        self.filename = filename
        self._handle = h5py.File(filename, "r")

    def read_data(self, grid, field_name):
        # Called lazily by yt, once per grid: read only the sub-block
        # that this grid covers.
        ftype, fname = field_name
        si = grid.get_global_startindex()
        ei = si + grid.ActiveDimensions
        return self._handle[fname][si[0]:ei[0], si[1]:ei[1], si[2]:ei[2]]

wfg = WeirdFunGenerator("example.h5")

# Decompose the 256^3 domain into 16 patches.
psize = get_psize(np.array((256, 256, 256)), 16)
left_edges, right_edges, shapes, slices = decompose_array(
    (256, 256, 256), psize, np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
)

grid_data = [
    {
        "left_edge": le,
        "right_edge": re,
        "dimensions": s,
        "level": 1,
        # callables stand in for the actual arrays
        "density": wfg.read_data,
        "temperature": wfg.read_data,
        "dinosaurs": wfg.read_data,
    }
    for le, re, s in zip(left_edges, right_edges, shapes)
]
ds = yt.load_amr_grids(grid_data, [256, 256, 256])

ds.r[:, 0.5, :].plot([("gas", "density")])

which does what we want!

[image: slice plot of the resulting density field]

I think if we get docs and a wrapper function that handles all the decomp, we should be good to go. The test-case example we could use is a very simple mmap-ing of an on-disk gigantic array.
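
As a rough illustration of that test case, an mmap-backed callable might look like this (a sketch; the file name and sizes are made up, and np.memmap only reads the slices that are actually requested):

import numpy as np

# Create a small raw binary file standing in for the "gigantic" array.
shape = (64, 64, 64)
scratch = np.memmap("huge_array.dat", dtype="float64", mode="w+", shape=shape)
scratch[:] = np.random.random(shape)
scratch.flush()

# Re-open read-only; no data is loaded until a slice is taken.
big = np.memmap("huge_array.dat", dtype="float64", mode="r", shape=shape)

def read_mmap(grid, field_name):
    # yt calls this per grid; only the covered sub-block touches disk.
    si = grid.get_global_startindex()
    ei = si + grid.ActiveDimensions
    return big[si[0]:ei[0], si[1]:ei[1], si[2]:ei[2]]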

@matthewturk (Member Author)

With a little bit more work, I made this:

https://gist.github.com/84911dce25e6aac8d6ece9a3d6dbe668

which I think might be getting close to what I'd be happy with.

@matthewturk (Member Author)

I think I'd like to get two loaders into loaders.py before this is ready -- one that does the hdf5 thing, and one that accepts a dict of objects that receive slices.

For the latter, the example I was thinking of was just a set of shaped grids that return their xyz coordinates, to demonstrate it (something like the sketch below). This would also work for the tests.
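
Such a coordinate-returning callable might look like this (a hypothetical sketch; LeftEdge, RightEdge, and ActiveDimensions are standard yt grid attributes, but the function itself is made up):

import numpy as np

def coord_callable(grid, field_name):
    # Generate the cell-centered coordinate named by the field
    # ("x", "y", or "z") on the fly, with no backing store at all.
    ftype, fname = field_name
    axis = "xyz".index(fname)
    le = grid.LeftEdge.to_value("code_length")
    re = grid.RightEdge.to_value("code_length")
    dims = grid.ActiveDimensions
    edges = np.linspace(le[axis], re[axis], dims[axis] + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Broadcast the 1D coordinate array to the full 3D grid shape.
    new_shape = [1, 1, 1]
    new_shape[axis] = dims[axis]
    return np.broadcast_to(centers.reshape(new_shape), tuple(dims)).copy()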

@neutrinoceros (Member) left a comment

here's some feedback

doc/source/examining/loading_data.rst (outdated, resolved)

@matthewturk (Member Author) commented Aug 4, 2022

I have a few more things to do:

  • add an even simpler slicer loader, that we can feed mmap data into even if it's not hdf5
  • really shore up some documentation and examples
  • add some tests for this, but likely not hdf5 loader tests (yet)
  • put in stubs for ndim, but probably not implement it yet
  • Add docs for the hdf5 loader

I'd appreciate some feedback on the existing function I added for HDF5 files.

@matthewturk (Member Author)

I suspect the docs build failure is related to my stubbing of non-existent files, etc.

@matthewturk (Member Author) commented Aug 4, 2022

I made a notebook that I think I might start with for showing off how you can load data from functional forms. I also intend to add an example that just loads from individual files -- maybe replicating the enzo data loading. I got a little confused about the best way to show overlapping wavemodes without doing any whole-domain convolution, and I'm not sure I like this all that much ... any suggestions for a better way to demonstrate?

https://gist.github.com/82e13dfab8a2f0d956cce791cddb9611

(n.b., this has no narration, and I think I messed things up with the 2**i bit)

@matthewturk (Member Author)

I've spent some time looking at the slicing stuff and I think I want to hold off for the time being. I'll check that one off and we can consider doing it at a later time. I'll get the tests added for the HDF5 file.

@neutrinoceros (Member)

Given your last message, I'm going to assume it's safe to remove this from the milestone.

@neutrinoceros neutrinoceros removed this from the 4.1.0 milestone Aug 16, 2022
@matthewturk (Member Author)

Sorry, no, what I meant was: the slices don't need to go in. I think I am ready for this to go in. I want to add tests for the existing functionality, but that's all.

@matthewturk matthewturk added this to the 4.1.0 milestone Aug 16, 2022
@chrishavlin (Contributor) left a comment

Took a quick look at the current state, since the hdf5 loader was added after I last looked. In addition to the in-line questions/comments, a couple more:

  • the load_amr_grids docstring entry for grid_data should be updated to reflect that you can supply callables
  • do you intend this to work fully with load_uniform_grid as well? It should work as it is now when nproc=1, but it'll break when nproc>1 (see the guard sketched below). I think the domain decomposition when nproc > 1 could be rewritten to use the domain_dimensions argument along with how you handled nchunks in load_hdf5_file, but I'm not sure that's needed for this PR.
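
A minimal version of such a guard might look like the following (a hypothetical sketch; nprocs and data are the existing load_uniform_grid arguments, but the exact placement and wording are made up):

# Fail early: fields supplied as callables cannot (yet) be decomposed
# across processors.
if nprocs > 1 and any(callable(v) for v in data.values()):
    raise NotImplementedError(
        "Fields supplied as callables are not supported with nprocs > 1"
    )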

doc/source/examining/Loading_Data_via_Functions.ipynb (outdated, resolved)
yt/frontends/stream/definitions.py (resolved)
yt/loaders.py (outdated)
    dataset_arguments: Optional[dict] = None,
):
    """
    Create a yt dataset given the path to an hdf5 file.
chrishavlin (Contributor)

Should mention this works only with grid data and not, e.g., particles stored in an HDF5 file (unless I'm wrong about that)

yt/loaders.py (resolved)
root_node: Optional[str] = "/",
fields: Optional[List[str]] = None,
bbox: np.ndarray = None,
nchunks: int = 0,
chrishavlin (Contributor)

does nchunks=0 have meaning beyond being a flag to trigger the auto-chunking? I'm inclined to change this to

Suggested change
-    nchunks: int = 0,
+    nchunks: Optional[int] = None,

and change the nchunks logic below to match. But I don't feel strongly about it.
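
The matching logic might then read something like this (a sketch; _resolve_nchunks and its at-most-64^3-cells heuristic are made up for illustration):

from typing import Optional, Tuple

def _resolve_nchunks(nchunks: Optional[int], shape: Tuple[int, int, int]) -> int:
    # None means "choose automatically"; any explicit value is validated.
    if nchunks is not None:
        if nchunks < 1:
            raise ValueError(f"Expected nchunks >= 1, got {nchunks}")
        return nchunks
    # Stand-in heuristic: target chunks of at most 64^3 cells.
    ncells = shape[0] * shape[1] * shape[2]
    return max(1, ncells // 64**3)

print(_resolve_nchunks(None, (256, 256, 256)))  # -> 64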

@matthewturk (Member Author)

I've addressed most of the suggestions, but I am running out of gas on this for the moment. I think the biggest remaining item is to throw an error if nproc > 1 and there are any callables.

@chrishavlin @neutrinoceros take a look?

chrishavlin previously approved these changes Sep 21, 2022
@chrishavlin (Contributor) left a comment

new tests look good to me!

In addition to adding the nproc error check, I think it'd be good to update the load_hdf5 docstring to mention that it's for loading grids (for now at least? unless I'm confused...).

@neutrinoceros (Member)

No further comments from me at this point; thank you for adding tests!

@matthewturk (Member Author)

I think I've addressed them all! I also added unit stuff like @chrishavlin suggested.

@neutrinoceros (Member)

I am very confused by the type-checking error. It hardly seems related to anything we're doing here. What seems to be happening is that we're running mypy targeting Python 3.7, but using numpy 1.23, which uses some Python 3.8+ syntax; that raises an error and prevents type checking from running at all. This seems like something that should happen every time the job runs, or not at all, but I've never seen it happen before.

@neutrinoceros (Member)

Before we do anything else, I suggest upgrading mypy, in the hope that the issue (which I'm assuming is internal to mypy) has already been resolved. I'm working on this in #4139.

@jzuhone jzuhone merged commit e7bb936 into yt-project:main Sep 23, 2022
Labels: code frontends, enhancement

5 participants