
Use coordinates directly from xarray.DataArrays, not from parent Model #14

Open
spencerahill opened this issue Oct 14, 2015 · 10 comments

@spencerahill
Owner

Currently, each Model object is instantiated with paths to various netCDF files which supply basic information about its grid: latitude, longitude, etc. Some physical calculations require this grid information in addition to the physical variables themselves. However, the values of these grid properties are not necessarily unique to a Model: simulations run in the same model may be output to different grids (horizontal or vertical), or changes in software may cause, e.g., the latitude values to differ in the 5th digit, causing xray to throw an error (the last example comes from @spencerkclark).

The netCDF files from which data is loaded all contain the coordinate data locally already, and so do the xray.Dataset and/or DataArray objects derived from those files. So there is no reason to always pull from the Model's grid attributes -- they should be pulled from the local variable itself.

As a starting point, Calc should just grab the needed coordinate/grid values from the first variable it loads that has those values. Pressure values and vertical pressure thicknesses will likely need to be handled as their own unique case, which is what's done already anyway.
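A hedged sketch of that starting point (first_available_coord is an illustrative helper, not existing aospy code; the modern xarray API is used rather than the xray namespace of 2015):

```python
import numpy as np
import xarray as xr

def first_available_coord(arrays, name):
    """Return the named coordinate from the first DataArray carrying it."""
    for arr in arrays:
        if name in arr.coords:
            return arr[name]
    raise KeyError(name)

# Toy loaded variables: one vertical-only, one lat-only.
bk = xr.DataArray([0.0, 1.0], dims="phalf", coords={"phalf": [1.0, 1000.0]})
t_surf = xr.DataArray(np.zeros(3), dims="lat",
                      coords={"lat": [-60.0, 0.0, 60.0]})

# 'lat' is taken from the first loaded variable that has it.
lat = first_available_coord([bk, t_surf], "lat")
assert lat.values.tolist() == [-60.0, 0.0, 60.0]
```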

@spencerahill
Owner Author

spencerahill commented Oct 23, 2015

Now that we have migrated aospy to being largely xarray-based, this has been mostly resolved, in that the functions are now passed xarray.DataArrays by default, which have the grid information embedded. So it's up to the user to write functions that extract these embedded coordinate arrays as needed.
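For example, a user function in this style might extract latitude from the DataArray it receives (an illustrative sketch, not an actual aospy function; modern xarray API):

```python
import numpy as np
import xarray as xr

def area_weighted_mean(arr):
    """Cosine-of-latitude weighted mean, using the lat coord embedded in arr."""
    weights = np.cos(np.deg2rad(arr["lat"]))
    total = (arr * weights).sum(dim=("lat", "lon"))
    return total / (weights.sum() * arr.sizes["lon"])

# A constant field should average to itself.
t_surf = xr.DataArray(np.full((2, 3), 288.0), dims=("lat", "lon"),
                      coords={"lat": [-45.0, 45.0],
                              "lon": [0.0, 120.0, 240.0]})
assert abs(float(area_weighted_mean(t_surf)) - 288.0) < 1e-9
```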

Some functions and variables remain, however, that aren't defined in their own netCDF files but reasonably could be needed by themselves. Currently these are still pulled from the model. This aspect still has room for improvement/refactoring.

@spencerahill
Owner Author

One potential solution, although it could be heavy-handed, is to copy all of the grid fields into each DataArray when they are loaded from disk. So, for example, a t_surf DataArray, which is defined only in lat, lon, and time, would also get pfull, phalf, ak, bk, zsurf, land_mask, etc. added as coordinates. That way the function can rely on these being available from the DataArrays that get passed in. These are all sufficiently small relative to the typical actual data arrays of interest that the added computational expense isn't an issue.

However, at least the last two aren't coordinates but data arrays themselves, meaning that we would need to pass in Datasets, not DataArrays, with them included among the variables. So we still need to think this one through.

@spencerkclark
Collaborator

This might actually work better than you think; multi-dimensional coordinates are supported within xray (so zsurf or land_mask could in fact be coordinates). That being said, there are some subtle differences between the behavior of coordinates within Datasets and DataArrays that could be important (long explanation below). Outside of the case of pressure though, I can't think of any other edge cases where these differences could pose problems (so the differences might not be an issue if we just handle pressure separately).


Multi-dimensional coordinates are supported within xray; the catch is that to include them in a DataArray as coordinates, they must be related to the existing dimensions (that is the dimensions that the variable described by the DataArray lies upon). What this means is that the likes of pfull and phalf would not be able to be added as coordinates to DataArrays that were inherently 2D in space (like t_surf), but zsurf and land_mask would be. Conversely, a variable like zsurf would not be able to be added as a coordinate to a DataArray describing bk as its variable. Take as an example a typical atmos.static.nc file.

This is what it looks like raw upon being read in as a Dataset:

In [3]: ds
Out[3]:
<xray.Dataset>
Dimensions:    (lat: 90, latb: 91, lon: 144, lonb: 145, phalf: 25)
Coordinates:
  * lat        (lat) float64 -89.49 -87.98 -85.96 -83.93 -81.91 -79.89 ...
  * latb       (latb) float64 -90.0 -88.99 -86.97 -84.94 -82.92 -80.9 -78.88 ...
  * lon        (lon) float64 1.25 3.75 6.25 8.75 11.25 13.75 16.25 18.75 ...
  * lonb       (lonb) float64 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 ...
  * phalf      (phalf) float64 1.0 9.034 34.75 75.06 127.9 191.1 262.1 339.1 ...
Data variables:
    bk         (phalf) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.04357 0.1102 0.1922 ...
    pk         (phalf) float64 100.0 903.4 3.475e+03 7.506e+03 1.279e+04 ...
    zsurf      (lat, lon) float64 2.848e+03 2.848e+03 2.848e+03 2.848e+03 ...
    land_mask  (lat, lon) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
Attributes:
    filename: atmos.static.nc
    title: am2clim_reyoi_extratropics_sp
    grid_type: regular
    grid_tile: N/A

I can now set land_mask as a 2D coordinate in a new Dataset:

In [4]: ds_land_mask = ds.set_coords('land_mask')
In [5]: ds_land_mask
Out[5]:
<xray.Dataset>
Dimensions:    (lat: 90, latb: 91, lon: 144, lonb: 145, phalf: 25)
Coordinates:
  * lat        (lat) float64 -89.49 -87.98 -85.96 -83.93 -81.91 -79.89 ...
  * latb       (latb) float64 -90.0 -88.99 -86.97 -84.94 -82.92 -80.9 -78.88 ...
  * lon        (lon) float64 1.25 3.75 6.25 8.75 11.25 13.75 16.25 18.75 ...
  * lonb       (lonb) float64 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 ...
  * phalf      (phalf) float64 1.0 9.034 34.75 75.06 127.9 191.1 262.1 339.1 ...
    land_mask  (lat, lon) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
Data variables:
    bk         (phalf) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.04357 0.1102 0.1922 ...
    pk         (phalf) float64 100.0 903.4 3.475e+03 7.506e+03 1.279e+04 ...
    zsurf      (lat, lon) float64 2.848e+03 2.848e+03 2.848e+03 2.848e+03 ...
Attributes:
    filename: atmos.static.nc
    title: am2clim_reyoi_extratropics_sp
    grid_type: regular
    grid_tile: N/A

If I look to pick off the zsurf DataArray from this Dataset, it will now contain land_mask as a coordinate (but latb, lonb, and phalf are gone):

In [6]: ds_land_mask.zsurf
Out[6]:
<xray.DataArray 'zsurf' (lat: 90, lon: 144)>
array([[  2.84800220e+03,   2.84800220e+03,   2.84800220e+03, ...,
          2.84800220e+03,   2.84800220e+03,   2.84800220e+03],
       [  2.76080835e+03,   2.76813501e+03,   2.77564966e+03, ...,
          2.73906250e+03,   2.74635181e+03,   2.75357593e+03],
       [  2.57908862e+03,   2.60248169e+03,   2.62594678e+03, ...,
          2.50599146e+03,   2.53115771e+03,   2.55542334e+03],
       ...,
       [  1.98320568e-01,   1.47022083e-01,   1.06589988e-01, ...,
          4.26998675e-01,   3.37892354e-01,   2.61694580e-01],
       [  6.06214553e-02,   5.01112714e-02,   4.08714749e-02, ...,
          9.91002619e-02,   8.52438807e-02,   7.23643824e-02],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]])
Coordinates:
  * lon        (lon) float64 1.25 3.75 6.25 8.75 11.25 13.75 16.25 18.75 ...
    land_mask  (lat, lon) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
  * lat        (lat) float64 -89.49 -87.98 -85.96 -83.93 -81.91 -79.89 ...
Attributes:
    long_name: surface height
    units: m
    cell_methods: time: point

If I pick off bk it will not contain land_mask, because it is not a function of the phalf coordinate:

In [7]: ds_land_mask.bk
Out[7]:
<xray.DataArray 'bk' (phalf: 25)>
array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.0435679 ,  0.1102275 ,  0.1922249 ,  0.28176561,
        0.36949971,  0.45323479,  0.53162527,  0.60387331,  0.6695556 ,
        0.72851759,  0.78080171,  0.82659918,  0.86621481,  0.90004063,
        0.92853642,  0.952214  ,  0.97162521,  0.98735231,  1.        ])
Coordinates:
  * phalf    (phalf) float64 1.0 9.034 34.75 75.06 127.9 191.1 262.1 339.1 ...
Attributes:
    long_name: vertical coordinate sigma value
    units: none
    cell_methods: time: point

I think the question we may need to ask is: are there frequently scenarios when a 2D variable needs access to the level dimensions (when NOT already within the context of a 3D variable which already does)? The only case I can think of is in computing pressure on sigma levels -- if this is the only case perhaps we should just continue to handle pressure as a special case.

On the flip-side, is there ever a case where a variable defined only in the vertical needs access to 2D grid information? The only variables that I can think of that are only defined in the vertical are pk and bk; in general we consider these to hold at every 2D gridpoint so for instance taking a regional or area average of them alone doesn't really make sense.

For that reason I see no harm in adding all of these "grid" variables (including surface area as a function of latitude and longitude) as coordinates to the Datasets[1] we read in, and then simply letting xray determine which ones to pass on as coordinates when we isolate them as DataArrays in passing them to our calculation functions.

Under this system, any variable that is defined in lat-lon would automatically get the spatial grid attributes (zsurf, land_mask, surface_area etc.) appended to it as coordinates, and any variable that is defined in the vertical would automatically get the vertical grid attributes (bk, pk etc.). A 3D variable would get all of them. Would that suffice for most use cases (pressure excluded)?
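That self-selection can be sketched directly (modern xarray API rather than the xray of 2015; variable names mirror the static file above, shapes are toy):

```python
import numpy as np
import xarray as xr

# Toy Dataset mixing a 2D field, a 3D field, and grid variables.
ds = xr.Dataset(
    {
        "t_surf": (("lat", "lon"), np.zeros((3, 4))),
        "temp": (("phalf", "lat", "lon"), np.zeros((2, 3, 4))),
        "land_mask": (("lat", "lon"), np.ones((3, 4))),
        "bk": ("phalf", np.array([0.0, 1.0])),
    },
    coords={"lat": [-60.0, 0.0, 60.0],
            "lon": [0.0, 90.0, 180.0, 270.0],
            "phalf": [1.0, 1000.0]},
)

# Promote the grid variables to coordinates once, on the Dataset...
ds = ds.set_coords(["land_mask", "bk"])

# ...and xarray self-selects which ones each DataArray keeps.
assert "land_mask" in ds["t_surf"].coords   # lat-lon variable gets spatial grid
assert "bk" not in ds["t_surf"].coords      # but not the vertical grid
assert "bk" in ds["temp"].coords            # 3D variable gets both
assert "land_mask" in ds["temp"].coords
```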

You could accomplish this within _create_input_data_obj() by something like:

def _create_input_data_obj(self, var, start_date=False,
                           end_date=False, n=0, set_dt=False):
    """Create xray.DataArray for the Var from files on disk."""
    paths = self._get_input_data_paths(var, start_date, end_date, n)
    ds_chunks = []
    for file_ in paths:
        test = xray.open_dataset(file_, decode_cf=False,
                                 drop_variables=['time_bounds', 'nv',
                                                 'average_T1',
                                                 'average_T2'])
        if start_date.year < 1678:
            test['time'].attrs['units'] = ('days since 1900-01-01 '
                                           '00:00:00')
            test['time'].attrs['calendar'] = 'noleap'
        test = xray.decode_cf(test)
        ds_chunks.append(test)
    ds = xray.concat(ds_chunks, dim='time')

    ##### Add multi-dimensional coordinate information to Dataset here. #####

    for name in var.names:
        try:
            ##### This stays the same; let xray self-select relevant #####
            ##### multi-dimensional grid attributes.                 #####
            arr = ds[name]
        except KeyError:
            pass
        else:
            break
    else:
        raise KeyError('Variable not found: {}'.format(var))

[1] Note that you can add all the coordinates you want to a Dataset -- as opposed to a DataArray, for a Dataset there is no stipulation that they must be related to a variable (take latb or lonb as an example). Things only get strict with coordinates when you move to a DataArray (but as shown above, this doesn't preclude us from having additional coordinates within DataArrays as long as they are functions of the variable's coordinates).
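This looseness can be shown directly (modern xarray API; latb here is a toy cell-bounds coordinate on a dimension no data variable uses):

```python
import numpy as np
import xarray as xr

# A Dataset happily carries a coordinate on an otherwise-unused
# dimension, analogous to latb/lonb cell bounds.
ds = xr.Dataset(
    {"zsurf": (("lat", "lon"), np.zeros((2, 2)))},
    coords={"lat": [0.0, 1.0], "lon": [0.0, 1.0],
            "latb": [-0.5, 0.5, 1.5]},
)
assert "latb" in ds.coords               # fine on the Dataset
assert "latb" not in ds["zsurf"].coords  # dropped when a DataArray is extracted
```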

@spencerahill
Owner Author

This looks really solid overall. Regarding the special case of pressure, what about, in the case of data on model coordinates, pre-computing the full 4D pressure array and then tacking it on as a coordinate defined in (time, pfull, lat, lon)? In that case it would persist into the DataArrays, right?

(Could we even potentially re-assign the variable's coordinates to include it? This sounds more tricky. But at least for the functions I use (in particular finite differencing ones), there would be upside.)
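A sketch of that idea (modern xarray API; shapes and values are toy, and "p" stands for the hypothetical pre-computed pressure field):

```python
import numpy as np
import xarray as xr

nt, nk, ny, nx = 2, 3, 4, 5  # toy (time, pfull, lat, lon) sizes
ds = xr.Dataset(
    {"temp": (("time", "pfull", "lat", "lon"),
              np.full((nt, nk, ny, nx), 250.0))},
    coords={"time": np.arange(nt), "pfull": [250.0, 500.0, 850.0],
            "lat": np.linspace(-60.0, 60.0, ny),
            "lon": np.linspace(0.0, 288.0, nx)},
)

# Pre-computed 4D pressure, attached as a (time, pfull, lat, lon) coordinate.
p = xr.DataArray(np.full((nt, nk, ny, nx), 5.0e4),
                 dims=("time", "pfull", "lat", "lon"))
ds = ds.assign_coords(p=p)

# The pressure coordinate persists onto extracted DataArrays.
assert "p" in ds["temp"].coords
assert ds["temp"].coords["p"].shape == (nt, nk, ny, nx)
```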

At the same time, this basically doubles the amount of data in memory, since the pressure array is the same shape as the variable array. This is non-negligible for high frequency output.

@spencerkclark
Collaborator

Including pressure as a coordinate (regardless of whether sigma or interpolated) would help simplify things a bunch. It's a bit confusing to me how currently 'p' and 'dp' are treated as strings and not aospy variables within calc.py; among other things this solution would remove this unique treatment.

In contrast to horizontal coordinates where we have a choice, we must allow different vertical coordinates from Run to Run from the same Model. This is especially true for pfull, since it is time dependent on sigma levels, but it is also true if the results from a particular Run were interpolated to different pressure levels than another. For reasons I mention in the pull request, I think we should hold firm that the horizontal coordinates between a model and run are always aligned (let me know if you disagree!), but it seems like we must relax this constraint for the vertical coordinate.

As such, to remove the possibility of using the wrong pressure levels, I think in handling pressure we should allow pk and bk to be taken from the Model object, but level or pfull should be taken solely from the Run objects.

The memory issue is an important consideration; however, if we eventually move to passing Datasets as arguments to functions rather than DataArrays then we could pass just one copy of pressure at a time. That said, if a calculation doesn't use pressure (like if it's just averaging a certain quantity that is defined in 4D), then it would be a bit of a waste.

@spencerkclark
Collaborator

One other thing to note: beyond memory at runtime, without some logic to strip off secondary grid attributes before saving output, these extra coordinates will be written out to .nc files. In the case of the 1D and 2D attributes this isn't a big deal, but like you mention, if we add sigma pressure to the mix, then it becomes non-negligible. I don't think it would be too hard to add such logic, but I just want to bring it up here so we don't forget.

@spencerahill
Owner Author

I agree, the 'p' and 'dp' strings are unnecessary. We can just flag them as special using var.name in ('p', 'dp').

We should allow for different horizontal grids within a single model: in the GFDL ana4mips repo, at least one of the products (MERRA, I believe) outputs vertically defined data on a coarser lat-lon grid and non-vertically defined data on a more refined lat-lon grid. But in my mind these should still fall under the umbrella of a single Model and Run. So bk and pk are really the only ones that could reasonably be defined at the Model level.

But even for those two, at least for the full GCMs (let me know if the idealized models differ), each run has an atmos.static.nc file produced at the top of its postprocessing directory, which complicates even that. And, at least conceptually, I see no reason why the same model couldn't be run at different vertical resolutions. So I think we should ultimately support all grid items being defined at the Run level, i.e. adding the grid_file logic to the Run class.

At the same time, we should retain the "chain-of-responsibility" framework that already exists for some attributes (via the get_parent_attr function), wherein an attribute, e.g. bk, is searched for first in the Run, and if not there then in the parent Model, and then Proj. I could foresee users having whole Proj's whose data are sufficiently homogeneous that they could do almost all of their specifications at the Proj level.
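The chain-of-responsibility lookup described here can be sketched as follows (a hedged sketch, not aospy's actual get_parent_attr; the Node class and its parent attribute are illustrative assumptions):

```python
def get_parent_attr_sketch(obj, attr):
    """Search obj, then its chain of parents, for a non-None attribute."""
    while obj is not None:
        val = getattr(obj, attr, None)
        if val is not None:
            return val
        obj = getattr(obj, "parent", None)
    raise AttributeError(attr)

class Node:
    """Stand-in for Proj/Model/Run with an explicit parent link."""
    def __init__(self, parent=None, **attrs):
        self.parent = parent
        for key, val in attrs.items():
            setattr(self, key, val)

proj = Node(bk=[0.0, 0.5, 1.0])          # defined once at the Proj level
model = Node(parent=proj)
run = Node(parent=model, bk=[0.0, 1.0])  # a Run-level value takes priority

assert get_parent_attr_sketch(run, "bk") == [0.0, 1.0]
assert get_parent_attr_sketch(model, "bk") == [0.0, 0.5, 1.0]  # falls back to Proj
```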

Any function (at least of mine) that requires knowledge of the pressure values explicitly passes it in as an argument, either p or dp.

Good point re: writing extra coordinates to disk. For now let's not worry about it, since the current user base (=you and me) has little storage limitation, and unless lat-lon ts data is being saved to disk, the reductions applied generally make the output relatively small.

@spencerkclark
Collaborator

I understand completely. If we go this route (which I'm fully on board with) what is the purpose of a Model object? Might we be able to phase it out?

Once you allow grid attributes to be defined at the run level, you could accomplish this "grid inheritance" feature of the Model object by just having an optional argument within a Run constructor along the lines of "use grid from 'pre-existing Run name'" (if you didn't have a Run to inherit from, you'd provide the paths to all the grid files, like you do currently in the Model object). You could have a similar argument for inheriting the data_in_dur, data_in_start_date and other time-related arguments from an existing Run.
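One possible shape for that constructor (names like grid_from and grid_paths are hypothetical, not aospy's API):

```python
class Run:
    """Sketch of a Run that can inherit its grid from an existing Run."""
    def __init__(self, name, grid_paths=None, grid_from=None):
        self.name = name
        if grid_from is not None:
            # Inherit the grid specification from a pre-existing Run.
            self.grid_paths = grid_from.grid_paths
        else:
            self.grid_paths = grid_paths

control = Run("control", grid_paths=["atmos.static.nc"])
perturb = Run("perturb", grid_from=control)
assert perturb.grid_paths == ["atmos.static.nc"]
```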

What else does the Model object do besides aggregate grid information (and group Runs to some extent)?

@spencerahill
Owner Author

I would like to retain the Model class. Even if most of the grid specifications can be made at the Run level, the Model level is still useful.

  • If a user comes to aospy with dozens of runs from the same model and knows they all have the exact same grid specifications, they can point to the pertinent file in the single Model object instead of each run.
  • The default_runs attribute of the Model is convenient for automatically iterating over the Runs you want to perform calculations on (although I think I broke this functionality at some point while refactoring aospy-obj-lib).
  • Similarly, if more than one model has a run with the same name, e.g. 'control', the user should be able to specify the model names, and just the single run name 'control', and be able to execute the calculations on each (again, this might not work in the aospy-obj-lib's current state).

I really like the idea of inheriting grid information from another object. Maybe it doesn't even require a separate argument: if the method gets passed a Proj/Model/Run for that argument, then use that object's value for that argument.

@spencerkclark
Collaborator

That makes sense -- thanks for bringing those aspects up! That's an interesting thought about passing objects directly as arguments. Ultimately (in continuing with what's already done in aospy) I think we should strive to code things in a way such that the user has to input as little repeated information as possible (since often in experiments the grids and time periods are the same across Runs); if you think that's the most intuitive way to do that for non-default Model grids and time periods, I'm all for it.

spencerahill pushed a commit that referenced this issue Nov 3, 2015
@spencerahill spencerahill added this to the v0.1 milestone Jul 2, 2016
@spencerahill spencerahill changed the title Use coordinates directly from xray.DataArrays, not from parent Model Use coordinates directly from xarray.DataArrays, not from parent Model Oct 12, 2016
@spencerahill spencerahill removed this from the v0.1 milestone Jan 25, 2017