Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add average module with average_shape #109

Merged
merged 19 commits into from
Feb 19, 2021

Conversation

aulemahal
Copy link
Collaborator

@aulemahal aulemahal commented Jan 12, 2021

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
  • Tests for the changes have been added (for bug fixes / features)
  • Documentation has been added / updated (for bug fixes / features)
  • HISTORY.rst has been updated (with summary of main changes)
  • bumpversion minor has been called on this branch
  • I have added my relevant user information to AUTHORS.md
  • What kind of change does this PR introduce?:
    Adds clisops/core/average.py with a first method average_shape that uses the power of xESMF to perform averages over polygons. As discussed in the slack channel, I might have had a different idea of what the average submodule was supposed to be, this is why I only implemented the method I needed and only the computational pare (not the clisops.ops.average part).

My understanding at first was that most methods in average would be no more than a call to the related subset_* followed by a mean over the appropriate dimensions. The use of xESMF for polygon averaging adds complexity but also precision as it considers partial overlaps, grid cell sizes and holes, while subset_shape does not (it is a simple "contains" test).
On the other hand, it restricts the inputs:

  1. Input as datasets is still fragile because of Regridder fails when ds_in doesn't have (lat,lon) as last two coordinates pangeo-data/xESMF#53
  2. Inputs must follow the CF conventions in the naming and attributes of the coordinates.

Because of that, we could also choose to

a. not implement a xESMF-backed shape-averaging method, but this would break our relatively clean "clisops->finch" pipeline.
b. implement average_shape_simple and/or average_shape_exact (or other name alternatives). The "simple" version would be something like subet_shape(ds, shape).mean(['lat', 'lon']).
c. implement a switch in the function itself, either with a keyword or a try..except. I do not like this alternative, because I feel like that is the job of the clisops.ops.average version.

This is a draft PR. I plan to add documentation, fix all requirements and maybe implement a create_weight_masks helper, that returns the weight mask for each polygon in a given GeoDataFrame, giving more control on the average to the user, similar to subset.create_mask.
If my understanding for the base average method (subset + mean on dims) is correct, I could also add them here.

  • Does this PR introduce a breaking change?:
    This will add a new dependency : xESMF (and thus ESMF), which is not light and makes pip installs difficult.

@agstephens
Copy link
Collaborator

Hi @aulemahal, thanks for putting the PR together. I think that our requirements for average are very limited in scope compared to the larger requirements that you are looking at.

I think that our (C3S) requirements could be summed up by implementing something like:

In: clisops.core.average:

from roocs_utils.xarray_utils.xarray_utils import get_coord_type, known_coord_types

def average_over_dims(ds, dims=None, ignore_unfound_dims=False):
    if not dims: return ds
    if not set(dims).issubset(set(known_coord_types)):
         raise Exception(f'Unknown dimension requested for averaging, must be within: {known_coord_types}.')

    found_dims = []

    # Work out real coordinate types for each dimension 
    for coord in ds.coords:
          coord_type = get_coord_type(coord)
          if coord_type:
               found_dims.append(coord.name)

    if ignore_unfound_dims == False and set(found_dims) != set(dims):
        raise Exception(f'Requested dimensions were not found in input dataset: {set(dims) - set(found_dims)}.')

    da = ...cast dataset to DataArray...
    ds_averaged_over_dims = da.mean(dim=found_dims, skip_na=???, keep_attrs=???)

    return ...

What do think? If it is really that easy for us create the functionality we need for C3S work then the real issue here is making sure that we structure the code so that the various layers work together properly.

clisops/core/average.py Outdated Show resolved Hide resolved
@huard
Copy link
Contributor

huard commented Jan 13, 2021

Hi Ag,

Let's say a use case is to compute the tropical average (averaged over both lon and lat, between lat=[-23, 23]).
One option would be to do

sub = subset_bbox(lat_bnds=[-23, 23])
out = average_over_dims(sub, dims=["lat", "lon"]

I think what we are proposing (not in this PR but in general), is to have a average_bbox function where we could do:
out = average_bbox(lat_bnds=[-23, 23], lon_bnds=[0, 360])

In other words, I think trying to pack everything in one average function is going to be difficult as users require additional features. We suggest there should be a number of average_* functions similar to subset_*. I think your very simple use case can essentially be met with average_bbox that would look very similar to subset_bbox but with a mean operation on each dimension that is not None.

@agstephens
Copy link
Collaborator

Hi Ag,

Let's say a use case is to compute the tropical average (averaged over both lon and lat, between lat=[-23, 23]).
One option would be to do

sub = subset_bbox(lat_bnds=[-23, 23])
out = average_over_dims(sub, dims=["lat", "lon"]

I think what we are proposing (not in this PR but in general), is to have a average_bbox function where we could do:
out = average_bbox(lat_bnds=[-23, 23], lon_bnds=[0, 360])

In other words, I think trying to pack everything in one average function is going to be difficult as users require additional features. We suggest there should be a number of average_* functions similar to subset_*. I think your very simple use case can essentially be met with average_bbox that would look very similar to subset_bbox but with a mean operation on each dimension that is not None.

@huard, @aulemahal: I agree that your general approach of matching average_* functions to similar subset_* functions keeps our core module structure nice and clear. We also have the clisops.ops.<module> route which our WPS uses as its entry point. We could put multiple functions in that module to meet different use cases, and call them what we like.

@agstephens
Copy link
Collaborator

@huard and @aulemahal: Thinking in more detail, there is one part of the function signature to average_bbox() that I am struggling with. I think we want to the user to call it with the standard arguments:

    da: Union[xarray.DataArray, xarray.Dataset],
    lon_bnds: Union[np.array, Tuple[Optional[float], Optional[float]]] = None,
    lat_bnds: Union[np.array, Tuple[Optional[float], Optional[float]]] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    first_level: Optional[Union[float, int]] = None,
    last_level: Optional[Union[float, int]] = None

But we also want to allow the user to say:

  • average over all longitudes
  • average over all latitudes
  • average over all levels
  • average over all times

Do you have a proposal for a function interface that allows all these options? I am struggling to come up with something elegant that copes with the two different approaches.

@huard
Copy link
Contributor

huard commented Jan 14, 2021

Thought about the same thing yesterday.
One option could be slice(None).

@agstephens
Copy link
Collaborator

@huard: I think the problem is that we have a single argument for latitude, a single argument for longitude, and then a pair of arguments for each of level and time. That makes the slice approach is still quite inelegant.

@huard
Copy link
Contributor

huard commented Jan 14, 2021

@agstephens Agreed.

I must confess that I'm a bit confused by the fact that core functions call each other. I think a better design would be for core to have low-level functionality (subset_bbox operating only on lon, lat, not time and level, which are left to subset_time and subset_level). It is the job of ops to combine there various low-level interfaces.

@aulemahal
Copy link
Collaborator Author

I think I share the perspective of @huard : We could have the computational side of things in core, where functions are as simple as possible and the more user-friendly all-encompassing methods in ops?

To be completely coherent with that idea, functions in core.subset could be modified to remove time- and level-subsetting options from the spatial-subsetting functions. Thus, in core.average we could have only the cases where da.mean(dims) is not sufficient. All the simpler cases could be handled in ops.average, which would basically be a stronger ds.cf.mean(dims), AFAIU.

Thus I am suggesting here to have things in core that are not wrapped in any of the ops, in a perspective of using clisops as a dependency elsewhere, like in finch.

@huard
Copy link
Contributor

huard commented Jan 14, 2021

This PR is consistent with this vision. So if @agstephens agrees, I suggest we go with this and defer the refactoring of core processes to another PR.

@agstephens
Copy link
Collaborator

@aulemahal and @huard: Yes, I agree. Let's go ahead with this approach.

@aulemahal
Copy link
Collaborator Author

Cool! Now this leaves another issue I raised above : the use of xESMF is powerful for "exact" shape-averaging, but it has 1 more restriction then the previous "contains" (as in subset.create_mask) approach : it will not work with 2D lat / lon coordinates unless "lat/lon_bounds" are provided.
AFAIU, the 2D lat/lon case is out of scope for now in clisops?
Moreover, there is a plan to implement automatic computation of those bounds in cf_xarray, so this will eventually be fixed downstream.

If, however, this is a problem, we could write a average_shape_simple method that uses subset.create_mask. It would also be faster (I think).

@agstephens
Copy link
Collaborator

@aulemahal: From the perspective of what we are required to do for the C3S service:

  • we only need a simple average_over_dims(ds, dims=['time', 'latitude']) function
  • it does not have to cope with 2D coordinates
  • it only needs to support averaging over the four coordinates:
    • time
    • level
    • latitude
    • longitude

So, I think that we can implement that quickly, and it leaves space for Ouranos to add in a range of more sophisticated averaging functions within clisops/core/average.py.

I propose that we will propagate the functionality through to our WPS, with:

  • clisops.core.average::average_over_dims(), is called by
  • clisops.ops.average::average_over_dims(), is called by
  • daops.ops.average::average_over_dims(), is called by
  • rook.processes.wps_average::Average() - is exposed as a WPS process

Tagging @ellesmith88

@aulemahal aulemahal marked this pull request as ready for review January 22, 2021 20:38
@aulemahal
Copy link
Collaborator Author

Hi! So this last push:

  • adds the xESMF >= 0.5.0 requirement where needed. Pypi stops at 0.3.1, which is quite unfortunate. I create an "extra" pip extra requirements and made average_shape optional : error if xESMF is too old/doesn't exist and skip tests. xESMf has to be installed through conda or from source, it has been made clear in the docs.
  • adds a new notebook core_average with only one section for now.
  • fixes a small error in create_mask : when the given geodataframe had a non-integer index the function would crash. It nows resets the index if needed.

@aulemahal
Copy link
Collaborator Author

Update doc and new method:
clisops.core.subset.create_weight_masks to use the power of xESMF in creating weight masks instead of integer masks. For elegant output it required use of xESMF 0.5.2, so I bumped that everywhere I had added it.

I think this is ready for review!

clisops/core/subset.py Outdated Show resolved Hide resolved
Co-authored-by: Trevor James Smith <10819524+Zeitsperre@users.noreply.github.com>
@agstephens
Copy link
Collaborator

I had a chat with @aulemahal. We will await the release of xesmf:0.5.2 on conda-forge before merging this PR.

@aulemahal
Copy link
Collaborator Author

Added a comment on the non-installability of xESMF on windows + fixed a conflict in history.

And woups I just realized that all patch version are noted in the HISTORY, should I not have run bumpversion?

docs/installation.rst Outdated Show resolved Hide resolved
@aulemahal
Copy link
Collaborator Author

Hi @agstephens , is there a (tentative) schedule for the release of clisops 0.6 (or 0.5.2) ?
My goal is to use the work of this PR in finch, somewhere in the next month or two. Thanks!

@agstephens
Copy link
Collaborator

Hi @agstephens , is there a (tentative) schedule for the release of clisops 0.6 (or 0.5.2) ?
My goal is to use the work of this PR in finch, somewhere in the next month or two. Thanks!

@aulemahal: @ellesmith88 will be in touch to discuss a new release :-)

@ellesmith88 ellesmith88 mentioned this pull request Feb 11, 2021
4 tasks
@ellesmith88
Copy link
Collaborator

ellesmith88 commented Feb 11, 2021

Hi @aulemahal We plan to release v0.6.0 when the averaging work is all merged in, so within the next few weeks hopefully. We've been working on average_over_dims as mentioned above and would like that to go in as well.

I've opened our average as a separate PR to be reviewed (#125) - once we're happy that the 2 average PRs line up I'll go about merging and releasing.

I couldn't do the PR to your branch as its on a fork so it's to master but we can merge them so that you are able to resolve any conflicts how you would like. Let me know what you think of this plan.

@aulemahal
Copy link
Collaborator Author

Sounds good to me! Thanks.

@aulemahal
Copy link
Collaborator Author

aulemahal commented Feb 18, 2021

All done!

EDIT: With comments from @tlogan2000, I just realized this is missing a relatively important option to skip, or not, missing values in the sum.

@aulemahal aulemahal marked this pull request as draft February 18, 2021 21:32
@aulemahal
Copy link
Collaborator Author

I was wrong, xESMF's SpatialAverager is broken with the usual NaN-skipping solution (passing a mask). I'll leave this improvement for a future release. pangeo-data/xESMF#77 must be fixed before.

Even if the functionality of average_shape is not perfect, most improvements are to be made in xESMF and I think this is good to merge.

@aulemahal aulemahal marked this pull request as ready for review February 18, 2021 23:35
@agstephens agstephens merged commit ff0508a into roocs:master Feb 19, 2021
@agstephens
Copy link
Collaborator

This all looks good. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants