Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read grid mapping and bounds as coords #2844

Merged
merged 40 commits into from
Feb 17, 2021

Conversation

DWesl
Copy link
Contributor

@DWesl DWesl commented Mar 22, 2019

I prefer having these as coordinates rather than data variables.

This does not cooperate with slicing/pulling out individual variables.
grid_mapping should only be associated with variables that have
horizontal dimensions or coordinates.
bounds should stay associated despite having more dimensions.

I have not implemented similar functionality for the iris conversions.

An alternate approach to dealing with bounds (not used here) is to use a
pandas.IntervalIndex
http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.html#pandas.IntervalIndex
and use where the coordinate is within its cell to determine on which
side the intervals are closed (x_dim == x_dim_bnds[:, 0] corresponds
to "left", x_dim == x_dim_bnds[:, 1] corresponds to "right", and
anything else is "neither"). This would stay through slicing and
might already be used for .groupby_bins(), but would not generalize
to boundaries of multidimensional coordinates unless someone
implements a multidimensional generalization of pd.IntervalIndex

  • Closes #xxxx
  • Tests added
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

This does not cooperate with slicing/pulling out individual variables.
`grid_mapping` should only be associated with variables that have
horizontal dimensions or coordinates.
`bounds` should stay associated despite having more dimensions.

An alternate approach to dealing with bounds is to use a
`pandas.IntervalIndex`
http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.html#pandas.IntervalIndex
and use where the coordinate is within its cell to determine on which
side the intervals are closed (`x_dim == x_dim_bnds[:, 0]` corresponds
to "left", `x_dim == x_dim_bnds[:, 1]` corresponds to "right", and
anything else is "neither").  This would stay through slicing and
might already be used for `.groupby_bins()`, but would not generalize
to boundaries of multidimensional coordinates unless someone
implements a multidimensional generalization of `pd.IntervalIndex`.

I do not yet know where to put tests for this.  This should probably
also be mentioned in the documentation.
@DWesl
Copy link
Contributor Author

DWesl commented Mar 26, 2019

Related to #1475 and #2288 , but this is just keeping the metadata consistent where already present, not extending the data model to include bounds, cells, or projections. I should add a test to ensure saving still works if the bounds are lost when pulling out variables.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sympathetic to this change. My main concerns are mentioned inline below: the way this is currently written encoding/attrs are used inconsistently with other CF metadata.

Alternatively, we might only put these fields in encoding when reading from disk, and only use encoding when choosing how to write data. The downside is that these attributes would not be as visible in xarray's data model.

xarray/conventions.py Outdated Show resolved Hide resolved
xarray/conventions.py Outdated Show resolved Hide resolved
@DWesl
Copy link
Contributor Author

DWesl commented Mar 30, 2019

I can shift this to use encoding only, but I'm having trouble figuring out where that code would go.

Would the preferred path be to create VariableCoder classes for each and add them to encode_cf_variable, then add tests to xarray.tests.test_coding?

@shoyer
Copy link
Member

shoyer commented Mar 30, 2019

The current VariableCoder interface only works for coders that work at the level of individual variables. But coordinates only make sense on a dataset, so we'll need a different abstraction for that, e.g., a DatasetCoder?

For now, it's probably fine to keep this logic where you have it currently.

Discussion on GH2844 indicates that encoding is the preferred location
for how things are stored on disk.
Copy link
Contributor Author

@DWesl DWesl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I shifted everything to use encoding rather than encoding and attrs, and added enough new code that my new tests pass locally.

xarray/conventions.py Outdated Show resolved Hide resolved
@andreas-h
Copy link
Contributor

I'd like this one to be merged very much. Is there anything holding this back?

Also, it might be nice to update the documentation with info on this.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a minor code suggestion, otherwise this looks good to me 👍 .

Please also add a brief note to the release notes in whats-new.rst

@@ -235,6 +235,11 @@ def encode_cf_variable(var, needs_copy=True, name=None):
var = maybe_default_fill_value(var)
var = maybe_encode_bools(var)
var = ensure_dtype_not_object(var, name=name)

if var.encoding.get('grid_mapping', None) is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to supply a default value of None with .get(), that is already the default:

Suggested change
if var.encoding.get('grid_mapping', None) is not None:
if var.encoding.get('grid_mapping') is not None:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But actually: are there any cases where someone would explicitly have grid_mapping=None in encoding? If not, then let's just check:

Suggested change
if var.encoding.get('grid_mapping', None) is not None:
if 'grid_mapping' in var.encoding:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not. That detail should go in documentation somewhere. Where would you suggest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code change is done. Docs change is not. What page should that go on? IO/NetCDF?

xarray/conventions.py Outdated Show resolved Hide resolved
@dcherian
Copy link
Contributor

I'm a little confused on why these are in encoding and not in attrs.

@DWesl
Copy link
Contributor Author

DWesl commented May 31, 2019

This is briefly mentioned above, in
#2844 (comment)
The rationale was that everywhere else xarray uses CF attributes for something, the original values of those attributes are recorded in var.encoding, not var.attrs, and consistency across a code base is a good thing. Since I'm doing this primarily to get grid_mapping and bounds variables out of ds.data_vars, I don't have strong opinions on the subject.

If you feel strongly to the contrary, there's an idea at the top of this thread for getting bounds information encoded in terms xarray already uses in some cases (Dataset.groupby_bins()), and the diffs for this PR should help you figure out what needs changing to support this.

For grid_mapping there's
http://xarray.pydata.org/en/latest/weather-climate.html#cf-compliant-coordinate-variables
which is enough for my uses.

@DWesl
Copy link
Contributor Author

DWesl commented May 31, 2019

Switched to use in rather than is not None.

Re: grid_mapping in .encoding not .attrs
MetPy assumes grid_mapping will be in .attrs. Since the xarray documentation mentions this capability, should I be making concurrent changes to MetPy to allow this to continue?

If so, would it be sufficient to change their .attrs references to .encoding and mentioning in both sets of documentation that the user should call ds.metpy.parse_cf() immediately after loading to ensure the information is available for MetPy to use? I don't entirely understand the accessor API.

@dcherian
Copy link
Contributor

It isn't just MetPy though. I'm sure there's existing code relying on adding grid_mapping and bounds to attrs in order to write CF-compliant files. So there's a (potentially big) backward compatibility issue. This becomes worse if in the future we keep interpreting more CF attributes and moving them to encoding :/.

Since I'm doing this primarily to get grid_mapping and bounds variables out of ds.data_vars.

I'm +1 on this but I wonder whether saving them in attrs and using that information when encoding coordinates would be the more pragmatic choice.

We could define encoding as containing a specified set of CF attributes that control on-disk representation such as units, scale_factor, contiguous etc. and leaving everything else in attrs. A full list of attributes that belong in encoding could be in the docs so that downstream packages can fully depend on this behaviour.

Currently I see coordinates is interpreted and moved to encoding. In the above proposal, this would be left in attrs but its value would still be interpreted if decode_coords=True.

What do you think?

@DWesl
Copy link
Contributor Author

DWesl commented Jun 1, 2019 via email

@DWesl
Copy link
Contributor Author

DWesl commented Feb 14, 2020

I just noticed pandas.PeriodIndex would be an alternative to pandas.IntervalIndex for time data if which side the interval is closed on is largely irrelevant for such data.

Is there an interest in using these for 1D coordinates with bounds? I think ds.groupby_bins() already returns an IntervalIndex.

@pep8speaks
Copy link

pep8speaks commented Feb 14, 2020

Hello @DWesl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-02-11 17:47:29 UTC

@DWesl
Copy link
Contributor Author

DWesl commented Feb 18, 2020

The test failures seem to all be due to recent changes in cftime/CFTimeIndex, which I haven't touched.

Is sticking the grid_mapping and bounds attributes in encoding good, or should I put them back in attrs?

@dcherian
Copy link
Contributor

My preference is for attrs but someone else should weigh in. cc @pydata/xarray

Maybe @dopplershift , as MetPy maintainer has thoughts on this matter?

@dopplershift
Copy link
Contributor

As a downstream user, I just want to be told what to do (assuming encoding is part of the public API for xarray). I'd love not to have to modify our code, but that's not essential necessarily.

So to clarify: is this about whether they should be in one spot or the other? Or is it about having grid_mapping and bounds in both?

@DWesl
Copy link
Contributor Author

DWesl commented Mar 10, 2020

I think the choice is between attrs and encoding, not both.

If it helps lean your decision one way or the other, attrs tends to stay associated with Datasets through more operations than encoding, so parse_cf() would have to be called fairly soon after opening if the information ends up in encoding, while putting it in attrs gives users a bit more time for that.

@dopplershift
Copy link
Contributor

Thanks for the info. Based on that, I lean towards attrs.

I think a better rationale, though, would be to formalize the role of encoding in xarray.

@shoyer
Copy link
Member

shoyer commented Mar 12, 2020 via email

DWesl and others added 4 commits January 16, 2021 21:12
There's more than three attributes used to assign variables as coordinates.

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Use two backticks for monospace font.  Single backticks trop you into the default role.

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
I'm not just testing ``grid_mapping`` and ``bounds`` here, so I should
have the name reflect that.
Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DWesl just pushed a small docs commit. I think this is ready to merge. Thank you for your patience.

@DWesl
Copy link
Contributor Author

DWesl commented Jan 17, 2021

Looks good to me. I was wondering where those docstrings were.

@jthielen
Copy link
Contributor

I unfortunately have not been following along with the recent developments in this PR, so these may have already been previously covered. Sorry if that is the case!

However, earlier on in the development of this PR, there were some substantial concerns about backwards compatibility (e.g., for libraries like MetPy that currently rely on grid_mapping and the like being in attrs) and the improved propagation of encoding through operations (so that moving these to encoding doesn't mean they are unnecessarily lost). What is the current status with regards to these?

@dcherian
Copy link
Contributor

there were some substantial concerns about backwards compatibility

:( yes this is currently backward incompatible (so the next release will bump major version). There's reluctance to add yet another decode_* kwarg to open_dataset, and also reluctance to issue a warning saying that the behaviour of decode_coords will change in a future version (since this would be very common).

One option is to change decode_coords from bool to Union[bool, str] and allow decode_coords to be either "all" (this PR) or "coordinates", or True ( == "coordinates", backwards compatible). I can bring this up at the next dev meeting.

…_and_bounds_as_coords

* upstream/master: (51 commits)
  Ensure maximum accuracy when encoding and decoding cftime.datetime values (pydata#4758)
  Fix `bounds_error=True` ignored with 1D interpolation (pydata#4855)
  add a drop_conflicts strategy for merging attrs (pydata#4827)
  update pre-commit hooks (mypy) (pydata#4883)
  ensure warnings cannot become errors in assert_ (pydata#4864)
  update pre-commit hooks (pydata#4874)
  small fixes for the docstrings of swap_dims and integrate (pydata#4867)
  Modify _encode_datetime_with_cftime for compatibility with cftime > 1.4.0 (pydata#4871)
  vélin (pydata#4872)
  don't skip the doctests CI (pydata#4869)
  fix da.pad example for numpy 1.20 (pydata#4865)
  temporarily pin dask (pydata#4873)
  Add units if "unit" is in the attrs. (pydata#4850)
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  ...
@dcherian
Copy link
Contributor

One option is to change decode_coords from bool to Union[bool, str] and allow decode_coords to be either "all" (this PR) or "coordinates", or True ( == "coordinates", backwards compatible). I can bring this up at the next dev meeting.

Updated to implement this proposal after getting OK at the previous meeting.

@DWesl can you take a look at the most recent changes please?

@DWesl
Copy link
Contributor Author

DWesl commented Feb 13, 2021

I think this looks good.

@dcherian
Copy link
Contributor

Great I'll merge before the next release if no one else has comments. Thanks @DWesl

Copy link
Member

@andersy005 andersy005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!

Does anyone know why the xr.open_dataset(....) call is echoed in the warning message. Is this intentional? Cc @dcherian @DWesl

In [4]: ds = xr.open_dataset('/home/abanihi/Downloads/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc', decode_c
   ...: oords="all")
<ipython-input-4-ccf90ed4a433>:1: UserWarning: Variable(s) referenced in cell_measures not in variables: ['areacella']
  ds = xr.open_dataset('/home/abanihi/Downloads/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc', decode_coords="all")

In [5]: ds
Out[5]: 
<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 145, lon: 192, time: 60)
Coordinates:
  * time       (time) datetime64[ns] 2010-01-16T12:00:00 ... 2014-12-16T12:00:00
    time_bnds  (time, bnds) datetime64[ns] ...
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    lon_bnds   (lon, bnds) float64 ...
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
    lat_bnds   (lat, bnds) float64 ...
Dimensions without coordinates: bnds
Data variables:
    pr         (time, lat, lon) float32 ...

@DWesl
Copy link
Contributor Author

DWesl commented Feb 14, 2021

Does anyone know why the xr.open_dataset(....) call is echoed in the warning message. Is this intentional? Cc @dcherian @DWesl

It seems you've already figured this out, but for anyone else with this question, the repeat of the call on that file is part of the warning that the file does not have all the variables the attributes refer to. You can fix this by recreating the file with the listed variables added (areacella, or by deleting the attribute from the variables (cell_measures). You can also ignore the warning using the machinery in the warnings module.

@dcherian dcherian merged commit 12b4480 into pydata:master Feb 17, 2021
dopplershift added a commit to dopplershift/MetPy that referenced this pull request Oct 25, 2024
This support was added in pydata/xarray#2844 and handles converting the
grid_mapping variable to a coordinate in xarray itself, which was
incompatible with some assumptions in parse_cf(). Add some handling for
the case where the grid_mapping. Also add decode_coords='all' to our
full test suite and adjust a few tests as necessary.
dopplershift added a commit to dopplershift/MetPy that referenced this pull request Oct 28, 2024
This support was added in pydata/xarray#2844 and handles converting the
grid_mapping variable to a coordinate in xarray itself, which was
incompatible with some assumptions in parse_cf(). Add some handling for
the case where the grid_mapping. Also add decode_coords='all' to our
full test suite and adjust a few tests as necessary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants