Revamp main script functionality and associated examples #155
Conversation
`Calc.region` had been a dict of `{region.name: region}` items. But there is no need for this; we were never using the key, and we can always just grab the name from the object itself. Doing so made the main script overhaul easier.
Delete find_obj.py
+1 for both of these
we can come back to [...] bigger picture stuff later. Sound good?
Agreed; we could easily get caught in a lengthy discussion here, but for now let's hold off.
Still working through the PR; I need to head out now, but will return to this later today and tomorrow.
aospy/automate.py
Outdated
```python
elif name == 'default':
    return getattr(parent, defaults_attr_name).values()
else:
    return getattr(parent, attr_name)[name]
```
Not necessarily opposed, just want to confirm: does this mean that we need to switch from using lists to dictionaries when we specify the `models` attribute of a `Proj` object or the `runs` attribute of a `Model` object?
E.g. go from this:
```python
example_model = Model(
    name='example_model',
    grid_file_paths=('/path/to/files'),
    runs=[runs.control, runs.conv_off],
    default_runs=[]
)
```
to this:
```python
example_model = Model(
    name='example_model',
    grid_file_paths=('/path/to/files'),
    runs=dict(control=runs.control, conv_off=runs.conv_off),
    default_runs=dict()
)
```
I'm assuming this was a decision made to eliminate the need for the `find_obj.py` module? (I'm all for that decision, by the way.)
This is actually already taken care of via the `utils.io.dict_name_keys` function, so the list syntax still works.
On that point, the main script and example object library from these commits should work; try running the main script (from inside the example directory so that you don't have to install the object library).
Ah, I see this gets addressed automatically; should have noticed that, sorry. Should we add some checking to `utils.io.dict_name_keys` to make sure that there are no duplicate names in the list of provided core objects (at least warn, maybe raise an exception)?
For example, right now if you had two Runs with the same name attribute in the list provided, the first one would get overwritten by the second without the user knowing.
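To make the suggestion concrete, here is a minimal sketch of such a check. The helper name is hypothetical, and it assumes `dict_name_keys` receives an iterable of objects with a `name` attribute; the real signature in `utils.io` may differ:

```python
def dict_name_keys_checked(objs):
    """Map each object's `name` attribute to the object itself,
    raising if two objects share a name (sketch of the proposed check)."""
    names = [obj.name for obj in objs]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError("Duplicate names found: {}".format(duplicates))
    return dict(zip(names, objs))
```

Raising (rather than warning) guarantees the silent-overwrite case described above can't happen unnoticed.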
On that point, the main script and example object library from these commits should work; try running the main script (from inside the example directory so that you don't have to install the object library).
Indeed, I can confirm this works!
Thanks a lot. Sorry again for the large scope and putting us into a bit of a time crunch.
@spencerahill I've finished making a first pass here -- overall I think it looks pretty good! Thanks for taking this on.
A few high-level questions came up (related to #3), but we might want to put those off until later.
examples/example_obj_lib.py
Outdated
```python
projects = {example_proj.name: example_proj}
variables = {var.name: var for var in [precip_largescale, precip_convective,
```
This is somewhat big picture, but I'll just note here that this requirement (creating a dictionary mapping Var name attributes to Var objects) breaks my current object library, because I have some Var objects that have the same name attribute, but are different Var objects (to accommodate cases where a variable is native in one model, but needs to be computed from other variables for output from another model, see #3 (comment)).
E.g. these two variables for OLR:
```python
olr = Var(
    name='olr',
    alt_names=('rlut',),
    units=units.W_m2,
    domain='atmos',
    description='All-sky outgoing longwave radiation at TOA.',
    def_time=True,
    def_vert=False,
    def_lat=True,
    def_lon=True
)

olr_imr = Var(
    name='olr',
    domain='atmos',
    description=('Outgoing longwave radiation'),
    variables=(swdn_sfc, vert_int_tdtsw_rad_imr, netrad_toa_imr),
    def_time=True,
    def_vert=False,
    def_lat=True,
    def_lon=True,
    func=calcs.idealized_moist_rad.energy.olr,
    units=units.W_m2
)
```
This will take some effort/thinking to address, and I don't think it's necessarily a blocker here. There's nothing that precludes me (or anyone else) from rolling their own solutions to a main script or object library (so until we have a better solution for #3 I may need to just continue using `find_obj.py` and my old main script to get things done).
On a more general note, this practice of requiring the user to create module-level dictionaries to set up their object libraries feels a little non-ideal; again though, without more thought I don't have a good alternative solution. For small object libraries it feels OK, but for larger ones (that get spread out across multiple submodules, like what we use in practice), these would need to be placed (somewhat hidden) in an `aospy_user/__init__.py` file.

For variables in particular, if one has a lot defined in a module (e.g. in `aospy_user/variables/__init__.py`), is there a straightforward pythonic way to create this dictionary without much typing?
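One possibility (purely illustrative; the helper name is hypothetical, not aospy API) is to scan a module's namespace with `vars()` and keep only instances of the relevant class:

```python
def objs_of_type_by_name(module, cls):
    """Collect every instance of `cls` defined at the top level of
    `module` into a {obj.name: obj} dict."""
    return {obj.name: obj for obj in vars(module).values()
            if isinstance(obj, cls)}
```

Usage might then look like `variables = objs_of_type_by_name(aospy_user.variables, Var)` in the package `__init__.py` (assuming each `Var` carries a `name` attribute).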
this requirement (creating a dictionary mapping Var name attributes to Var objects) breaks my current object library, because I have some Var objects that have the same name attribute, but are different Var objects
this practice of requiring the user to create module level dictionaries to set up their object libraries feels a little non-ideal
These are both good points and I definitely want to avoid breaking your workflow. Will have a separate comment and/or Issue on these...I have some ideas but need to clarify them.
aospy/automate.py
Outdated
```python
specs = self._specs_in.copy()
[specs.pop(core) for core in self._CORE_SPEC_NAMES]
# TODO: don't just pull from first project.
regions = _get_mult_objs_by_names(specs['regions'], self._projects[0],
```
Would this prevent a user from using a region that was not explicitly listed as a Region in a particular Proj object?
Yes that's a good point...I need to think about this
A related issue (although I don't see an immediate use-case) is if the same name was used by Region objects in different projects but with different region definitions. In that case this logic would incorrectly use the region definition for the first project in all cases.
same name was used by Region objects in different projects but with different region definitions
This would remain a problem even if e.g. we formed the union of regions defined across all projects. In that case the problem reduces to the same issue you raised above re: conflicting names of Var objects.
aospy/automate.py
Outdated
```python
def create_calcs(self):
    return _create_calcs(self._combine_core_aux_specs())

def _print_specs(self):
```
I think you'll need to update this function to use the `self._specs_in` dictionary, rather than the (now removed) `projects`, `models`, `runs`, etc. attributes, in order for it to run (though I'm guessing you're aware of this, since you've commented it out below).
Yes definitely
aospy/automate.py
Outdated
```python
             prompt_verify=False, verbose=True):
    """Generate and execute all specified computations."""
    calc_suite = CalcSuite(calc_suite_specs, obj_lib)
    # calc_suite._print_specs()
```
Uncomment this once `_print_specs()` is updated.
aospy/automate.py
Outdated
```python
    return _permute_aux_specs(specs)

def _combine_core_aux_specs(self):
    return _combine_core_aux_specs(self._permute_core_specs(),
```
It's kind of confusing to have two functions with the same name that are defined differently. Is there a good reason to define the module-level version of `_combine_core_aux_specs`? Could you simply implement what it does within this class-level method?
Maybe at the module level create a function called `_merge_dicts`, and then here you can write a one-liner using `itertools.product`? Would that work?
E.g. at the module level:
```python
def _merge_dicts(a, b):
    merged = a.copy()
    merged.update(b)
    return merged
```
and at the instance-level (here):
```python
def _combine_core_aux_specs(self):
    return [_merge_dicts(a, b) for a, b in itertools.product(
        self._permute_core_specs(), self._permute_aux_specs())]
```
I agree, I got carried away here with the module-level functions. The original motivation was for ease of testing and potential re-use of functionality elsewhere, but this logic seems pretty specific to CalcSuite.
aospy/automate.py
Outdated
```python
        defaults_attr_name) for name in names]


def _permute_aux_specs(specs):
```
I think you could write this function pretty compactly as a list-comprehension:
```python
def _permute_aux_specs(specs):
    return [dict(zip(specs.keys(), perm))
            for perm in itertools.product(*specs.values())]
```
Indeed, I can confirm this works!
Great!
For example, right now if you had two Runs with the same name attribute in the list provided, the first one would get overwritten by the second without the user knowing.
Good catch; will address.
I think you could write this function pretty compactly as a list-comprehension
Yes, will do, and likewise for the other list comprehensions you suggest.
aospy/automate.py
Outdated
```python
    return all_specs


def _create_calcs(specs):
```
I think this could also be written more compactly as a list comprehension.
aospy/automate.py
Outdated
```python
        return pool.map(lambda calc: calc.compute(), calcs)
    out = []
    for calc in calcs:
        try:
```
Could there be a way to encapsulate this try-except logic such that it could also be used above when we parallelize the calculations? That way it would ensure the same warning behavior through each pathway.
Good idea, fixing this makes that logic much cleaner
@spencerkclark thanks a lot for the review. Your concerns largely coincide with the parts I was most uncomfortable about, so at least we're on the same page about what needs improving. In terms of long-term solutions, I opened #156 to discuss the aspects re: how to register objects with minimal input by the users. The other big issue you note that I still need to think about is what the appropriate "parent" is of Vars and Regions. Will follow up with my thinking on that.
@spencerkclark I think I have managed to relax the prior onerous requirement of defining `{name: obj}` dicts for the projects and variables. CalcSuite now simply searches the object. However, now I'm realizing this still doesn't work with either of our existing setups, because we are importing a […]

Perhaps a good solution is a try/except case, wherein if `variables` is defined it is used, and if not then this new method of finding all the […] is used.

I added some initial tests; will add more. C.f. our convo offline, you're welcome to add more, but more important for me is your feedback on the overall approach.
Edit: typo in the commit message; should be 'variables'.

Just pushed this; let me know if it works for you.
Thanks for the updates; yes, I think this is a very good compromise. I will have a closer look at your changes tomorrow afternoon.

Great, thanks! I'll try to do some more testing in the meantime.
Just added some more tests. This is my first time using pytest; I immediately like it a lot, although I'm not certain I'm using it correctly. I feel like maybe I'm overdoing it with all of the […].

@spencerkclark since I know you've been using pytest recently, any comments on these aspects would be appreciated.
I have encountered another problem. When I execute aospy_main the first time, it works fine. But the second time, I'm getting an error related to the tar output: […]

If I delete the tar output and re-submit, it works fine. @spencerkclark does this happen for you? It's possibly a macOS thing, given the OSError and that we've not had this problem in the past with the tar output (I'm running from my MacBook Pro).
Thanks for adding some more tests! I'm still getting the hang of pytest as well, but I also think it's pretty awesome. I think your use of fixtures is actually what's causing your trouble with […]. Try something like:

```python
@pytest.mark.parametrize(
    ('type_', 'obj_lib', 'expected'),
    [(Var, examples, {condensation_rain, convection_rain, precip, ps, sphum}),
     (Proj, examples, {example_proj})])
def test_get_all_objs_of_type(type_, obj_lib, expected):
    actual = _get_all_objs_of_type(type_, obj_lib)
    assert expected == actual
```

and I think things should work. I'm taking a look at the rest of your changes now, and will see if I can fill in some gaps in the test coverage after.
aospy/automate.py
Outdated
```python
# Drop the "core" specifications, which are handled separately.
specs = self._specs_in.copy()
[specs.pop(core) for core in self._CORE_SPEC_NAMES]
specs['regions'] = self._get_regions()
```
I think this will always use every region defined by the user in their object library.
Also does this work if there is no region specified? (In other words if someone doesn't want to do a spatial reduction on result?)
Wow, both very good points, and same with `_get_variables` below.
aospy/automate.py
Outdated
```python
specs = self._specs_in.copy()
[specs.pop(core) for core in self._CORE_SPEC_NAMES]
specs['regions'] = self._get_regions()
specs['variables'] = self._get_variables()
```
(Similar to above) I think this will always use every variable defined by the user in their object library. Is that correct?
aospy/automate.py
Outdated
```python
def _get_variables(self):
    if hasattr(self._obj_lib, 'variables'):
        return self._obj_lib.variables
```
I've gone ahead and tried this in my object library; this line does get called, but it causes a problem because this function needs to return an iterable to be used in line 53.
The obvious change (which I tried) would be to return the result of `_get_all_objs_of_type(Var, self._obj_lib.variables)`, but this results in `automate` attempting to create calculations involving all the variables defined in my object library (ignoring the one I specified in the specifications), prompting my comments below. Is there a simple way to address this?
I've actually had something similar happen (prior to this PR), and not for tar output but for standard netCDF output. It only seems to occur for regional calculations. Here's a minimal working example:

```python
In [1]: import xarray as xr
In [2]: ds = xr.Dataset()
In [3]: ds['a'] = 1
In [4]: ds['b'] = 2
In [5]: ds.to_netcdf('test.nc')
In [6]: ds.to_netcdf('test.nc')
```

For some reason I can call […]

(This is on a GFDL workstation by the way, so not Mac OS specific.)
I'm not sure if this issue in xarray is related: pydata/xarray#1215
aospy/automate.py
Outdated
```python
def _get_regions(self):
    return [_get_all_objs_of_type(Region, self._obj_lib)]

def _get_variables(self):
```
Would rewriting this function to look like this be acceptable? This would basically revert the behavior back to using the python variable name for each Var (not its name attribute).
```python
def _get_variables(self):
    objs = getattr(self._obj_lib, 'variables', self._obj_lib).__dict__
    variables = {
        name: var for name, var in objs.items() if isinstance(var, Var)}
    names = self._specs_in['variables']
    return set([var for name, var in variables.items() if name in names])
```
This is pretty confusing though, since the name attribute is used for every other core object type (though that's also the way things are handled in the main script prior to this PR).
To be honest, I think for my particular situation/use-case, writing a similar but simpler kind of main script to handle permutation, one that asks the user to specify lists of actual aospy core objects (rather than strings), would be the way to go. It's up to you what you want to do here; I'm not suggesting we switch to that in the current PR. I'm more considering, in the medium term, writing that type of main script on my own, and potentially thinking about adding it to the aospy core later.
You're right, we should stop dinking around with this string silliness and use the actual objects. I have started in on this.
@spencerahill cool! Please let me know if you need any help.
@spencerkclark big thanks for this batch of comments. Sorry I wasn't able to get to them tonight; I will tomorrow for sure. Once again you've caught some important things.
(GitHub isn't letting me respond to this in-line for some reason.)

Darn. I kind of assumed that was the case but was hoping there was some trickery to avoid it. Oh well. Thanks for the commit fixing it.
Among other things, need to (re-) implement converting from 'default' date_ranges value to each Run's actual default range
@spencerkclark I've taken a stab at switching from names to objects and have managed to get back to roughly the same spot I was at before otherwise. I also added a "library" kwarg to the […]

Note that the […]

FYI I'll be away for the next hour or so, but this is my sole task otherwise tonight. Will try my best to get all outstanding problems addressed before tomorrow.
Yes, will do and will ping you |
@spencerkclark I'm still working through these; I'll ping you when I'm done, so don't worry about it in the meantime (I was going to push that same fix shortly).
- pretty print summary of requested calcs
- getattr rather than try/except
- user doesn't have to add extra list around 'default' for date_ranges, or around output_time_regional_reductions or regions

NOTE: I'm getting weird failures on test_permute_aux_specs; sometimes it fails, sometimes it doesn't, depending on the context (i.e. whether I call pytest on the whole package or just that test module). When it fails, basically the order of the two elements has been swapped somehow.
…/spencerahill/aospy into docs-update-install-instructions
@spencerkclark ok, I think I've addressed everything at this point, less some further tests of automate to bring the coverage up. However, see my note in the dd25efb commit re: the weird failing test. On Travis it's only failing for 2.7, but it's not simply a Python version issue; it happens for me on 3.5 sometimes, but not always.

In this test I want to say we are implicitly assuming that the order is consistent across the sets that go into producing the dictionaries (which is not guaranteed). Using sets is nice to automatically trim duplicates, but I wonder if we should just switch to lists for test simplicity, which always have the same order?
aospy/utils/io.py
Outdated
```python
        raise AttributeError(e)
    return objs


def to_dup_list(x, n, single_to_list=True):
```
I just realized we no longer use this (untested) function, so I think we can remove it
Good call, will delete
FYI, I recently learned of the vulture package, which finds unused code in a package; it could be useful to run periodically on aospy.
Yes, this is probably the right move. I was motivated by a vague desire to think more carefully about what data structures we are using in each place, but I guess in this case a list's order preservation is more useful than a set's prevention of duplicates (there is no built-in ordered set). Will do this shortly.
We actually do need sets/unordered containers after all, because the _get_all_objs_of_type list comprehension pulls items from __dict__, which depending on the Python version may or may not be ordered. So since we can't rely on an order there, we can't rely on an order in anything downstream of it or tests thereof.
aospy/test/test_automate.py
Outdated
```python
        return True
    assert len(actual) == len(expected)
    for act in actual:
        assert act in expected
```
But of course! Much cleaner solution here. Sorry for leading you down the wrong path.
Wow, this ended up being a headache! @spencerkclark automate is only at 78% coverage, but if it's alright with you I'd like to let it slide. Partly because I see a clear path forward for an improved long-term solution (#158, with more details to come). Partly because I'm burnt out on this module! Self-inflicted 😄
@spencerkclark good catch on the docstring. Just updated the what's new. Since all CI passed on the previous non-doc commits, we're green in my view. Any final concerns?
@spencerahill thanks again! Bear with me, let me see if I can write a simple test for […].
@spencerahill let me know if you're good with the test I added; if that checks out, and the AppVeyor build succeeds, I'm good to merge!

@spencerkclark done! Once again, huge thanks for catching my myriad mistakes on this and for all your other effort on it. Ironic that we went for a "quick fix", since it ended up being such an ordeal! Just how it goes sometimes, I guess 😄
Closes #151 and #152.
@spencerkclark I would really appreciate a review on this within the next few days if at all possible. (Sorry for the time crunch, and that I accidentally did this on the same branch as my last PR #153, so the diffs from that are also showing up below.) Thanks in advance!
This is a major overhaul of the functionality relating to the main script, and it also includes one minor change to the `Calc` API:

- `Calc.region` had been a dict of `{region.name: region}` items. But there is no need for this: we were never using the key, and we can always just grab the name from the object itself. Doing so made the main script overhaul easier.
- Allow multiple `projs` to be iterated over in the same multi-calc-generation call.
- Rename `main.py` to `aospy_main.py` and the `main()` function to `submit_mult_calcs()`.
- Make […] `submit_mult_calcs`, and thus the main script, more descriptive.
- Delete `find_obj.py` and most of the logic from `main.py`; move what remained of `main.py` to the new module `automate.py`.

I am not completely satisfied with my new approach in `automate.py`, which involves separating the specifications for `Calc` into "core" and "aux" categories. This whole business would be trivial (just call `itertools.product` on everything) were it not for our support of `'default'` and `'all'` for some of the specs.

Basically I think this is well suited to a more OOP approach: a `CalcSpec` abstract base class, with one implementation for each spec that gets passed to `Calc`. Each implementation would specify how it is to be handled, e.g. where to find its `'default'` or `'all'` values, or what data type to expect (e.g. `dtype_out_time` is a tuple that gets permuted over within a Calc, not across Calcs). This would make it easier to, for example, support `'default'` and `'all'` values for every spec.

BUT we are a bit time-crunched in terms of the pending blog post, and I've already ended up down the rabbit hole as it is. So, barring fundamental objections to the current approach, I'm looking for critiques there, and we can come back to this bigger-picture stuff later. Sound good?
This bigger-picture stuff notwithstanding, there are still some outstanding items: […]
Once this is locked in I will address #150, which I've already started on.
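To make the envisioned OOP idea concrete, here is a rough, purely illustrative sketch of what a `CalcSpec` abstract base class might look like. All names and behaviors here are hypothetical; nothing like this exists in aospy yet:

```python
from abc import ABC, abstractmethod

class CalcSpec(ABC):
    """One specification that gets passed to Calc (e.g. models, runs)."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def resolve(self, value, parent):
        """Expand shorthands such as 'default' or 'all' into the
        concrete objects this spec refers to on `parent`."""

class AuxSpec(CalcSpec):
    """A spec with no shorthands: values pass through unchanged."""

    def resolve(self, value, parent):
        return list(value)

class DefaultableSpec(CalcSpec):
    """A spec supporting 'default', looked up on the parent object."""

    def __init__(self, name, defaults_attr):
        super().__init__(name)
        self.defaults_attr = defaults_attr

    def resolve(self, value, parent):
        if value == 'default':
            return list(getattr(parent, self.defaults_attr))
        return list(value)
```

Each subclass would own the knowledge of where its `'default'`/`'all'` values live, so the permutation machinery could treat all specs uniformly.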