Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opt out of auto creating index variables #8711

Merged
merged 11 commits into from
Mar 26, 2024

Conversation

TomNicholas
Copy link
Contributor

@TomNicholas TomNicholas commented Feb 5, 2024

Tries fixing #8704 by cherry-picking from #8124 as @benbovy suggested in #8704 (comment)

@TomNicholas TomNicholas added this to In progress in Explicit Indexes via automation Feb 5, 2024
@TomNicholas
Copy link
Contributor Author

This currently doesn't work - when I run test_coordinates.py I get some failures:

FAILED xarray/tests/test_coordinates.py::TestCoordinates::test_init_default_index - AssertionError: {'x': <class 'xarray.core.variable.Variable'>}
FAILED xarray/tests/test_coordinates.py::TestCoordinates::test_getitem - AssertionError: {'x': <class 'xarray.core.variable.Variable'>}
FAILED xarray/tests/test_coordinates.py::TestCoordinates::test_assign - AssertionError: {'x': <class 'xarray.core.variable.Variable'>, 'y': <class 'xarray.core.variable.IndexVariable'>}
FAILED xarray/tests/test_coordinates.py::TestCoordinates::test_align - ValueError: cannot reindex or align along dimension 'x' because of conflicting dimension sizes: {2, 3}

The first couple of errors seem to be because the 1D variable doesn't have a corresponding index created by default.

I noticed that the local variable default_indexes is being defined within Coordinates.__init__ as an empty dict, which is never changed before being used to update indexes. This seems like an indication of some missing logic to populate default_indexes?

https://github.com/pydata/xarray/pull/8711/files#diff-c73f63e529ab4065cca1aa23d93c91a3b060f314e2e498b708b90fef87141a88R286

* ``Coordinates.__init__`` create default indexes

... for any input dimension coordinate, if ``indexes=None``.

Also, if another ``Coordinates`` object is passed, extract its indexes
and raise if ``indexes`` is not None (no align/merge supported here).

* add docstring examples

* fix doctests

* fix tests

* update what's new
@benbovy
Copy link
Member

benbovy commented Feb 6, 2024

Sorry the two commits picked from #8124 where done before #8107 and messed up things a bit. I also picked and added the last merge commit from #8124, which should hopefully make things better (coordinate tests are all running fine locally at least).

after unintentionally reverted a valid previous change.
@benbovy
Copy link
Member

benbovy commented Feb 6, 2024

About the auto_convert option added to as_variable() here: this is a temporary workaround before we make a deeper and better refactoring of non-index vs. index coordinate variables. That is not super nice since as_variable() is public API, but maybe that's fine if it is clearly stated in the docstrings that auto_convert is for internal use only (and we likely be removed in the future)?

if auto_convert:
if name is not None and name in obj.dims and obj.ndim == 1:
# automatically convert the Variable into an Index
emit_user_level_warning(
Copy link
Member

@benbovy benbovy Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should remove this warning since we're not exactly sure yet what the future behavior will be?

Or at least let's make sure this warning won't be emitted in user code unless explicitly calling as_variable(..., auto_convert=True), i.e., in xarray internals all new variables should be converted into index variables explicitly (if needed).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so I tried to ensure that this warning won't be emitted in user code simply by replacing the warning with an exception and running all the tests to see where it raised. The only places it raised were:

Is that sufficient?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that may be enough for now.

@TomNicholas
Copy link
Contributor Author

Thanks @benbovy !

That is not super nice since as_variable() is public API, but maybe that's fine if it is clearly stated in the docstrings that auto_convert is for internal use only (and we likely be removed in the future)?

I mean I don't think that's a huge deal, but if you definitely wanted to hide it from users couldn't you make a private _as_variable(..., auto_convert) and use that for internal calls and have the public as_variable call the private one?

@dcherian
Copy link
Contributor

dcherian commented Feb 6, 2024

Why is as_variable public API?

@benbovy
Copy link
Member

benbovy commented Feb 6, 2024

why is as_variable public API?

More context in #1303

@benbovy
Copy link
Member

benbovy commented Feb 7, 2024

I mean I don't think that's a huge deal

Me neither (I wanted to see what others think).

@TomNicholas have you tried running your notebook mentioned in #8704 (comment) with this PR? Does it fix the errors raised?

@TomNicholas
Copy link
Contributor Author

@TomNicholas have you tried running your notebook mentioned in #8704 (comment) with this PR? Does it fix the errors raised?

Yes I did, and I think it does! I still can't do a full xr.concat on Dataset objects because of a different bug (see #8714), but with this PR gets to the same point as v2023.08.0 does at least. 😀

@TomNicholas
Copy link
Contributor Author

@benbovy I would really like to get this merged as I'm using it in a new package (see zarr-developers/VirtualiZarr#42), but I'm not really sure what the steps to make this PR ready for merging are. Does this need any new tests?

@dcherian dcherian requested a review from benbovy March 19, 2024 17:04
Copy link
Member

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomNicholas I think that this PR is almost good to go in.

We just need to ensure that unnecessary warnings won't get emitted all over the place (see #8711 (comment)).

Also a test for the skipped creation of index variables would be welcome (see my comment below).

Comment on lines +301 to 302
var = as_variable(data, name=name, auto_convert=False)
if var.dims == (name,) and indexes is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This place is (I think) the only place where a newly created "dimension coordinate" might not always be an IndexVariable (i.e., when an empty dictionary or any other value than None is passed as the indexes argument). It would be nice to add a test for this specific case.

The other refactored as_variable places in this PR always create an IndexVariable for a dimension coordinate so I think it is covered by the existing tests (invariant checks).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read this comment many times over, and I don't understand why this test isn't already testing the case you're asking about (i.e. passing indexes={} to the Coordinates constructor, hence preventing an IndexVariable being automatically created).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to add a check in that specific test to make sure that the created "x" variable is not coerced to an IndexVariable (before this PR it was still the case even though no xr.indexes.PandasIndex was created from it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I added a check both explicitly and for if that warning is raised.

@dcherian
Copy link
Contributor

Can we catch these warnings if they are expected?


xarray/tests/test_variable.py::TestVariable::test_as_variable
  /home/runner/work/xarray/xarray/xarray/tests/test_variable.py:1219: FutureWarning: variable 'x' with name matching its dimension will not be automatically converted into an `IndexVariable` object in the future.
    actual = as_variable(data, name="x")

xarray/tests/test_variable.py::TestVariable::test_as_variable
  /home/runner/work/xarray/xarray/xarray/tests/test_variable.py:1237: FutureWarning: variable 'time' with name matching its dimension will not be automatically converted into an `IndexVariable` object in the future.
    assert as_variable(dt, "time").dtype.kind == "M"

xarray/tests/test_variable.py::TestVariable::test_as_variable
  /home/runner/work/xarray/xarray/xarray/tests/test_variable.py:1239: FutureWarning: variable 'time' with name matching its dimension will not be automatically converted into an `IndexVariable` object in the future.
    assert as_variable(td, "time").dtype.kind == "m"

@TomNicholas
Copy link
Contributor Author

This should be done now!

Can we catch these warnings if they are expected?

Done

Copy link
Member

@benbovy benbovy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks @TomNicholas!

@benbovy benbovy merged commit aaf3b7e into pydata:main Mar 26, 2024
29 checks passed
Explicit Indexes automation moved this from In progress to Done Mar 26, 2024
@TomNicholas TomNicholas deleted the opt-out-auto-create-index-variables branch March 26, 2024 13:55
dcherian added a commit to dcherian/xarray that referenced this pull request Apr 2, 2024
* main: (26 commits)
  [pre-commit.ci] pre-commit autoupdate (pydata#8900)
  Bump the actions group with 1 update (pydata#8896)
  New empty whatsnew entry (pydata#8899)
  Update reference to 'Weighted quantile estimators' (pydata#8898)
  2024.03.0: Add whats-new (pydata#8891)
  Add typing to test_groupby.py (pydata#8890)
  Avoid in-place multiplication of a large value to an array with small integer dtype (pydata#8867)
  Check for aligned chunks when writing to existing variables (pydata#8459)
  Add dt.date to plottable types (pydata#8873)
  Optimize writes to existing Zarr stores. (pydata#8875)
  Allow multidimensional variable with same name as dim when constructing dataset via coords (pydata#8886)
  Don't allow overwriting indexes with region writes (pydata#8877)
  Migrate datatree.py module into xarray.core. (pydata#8789)
  warn and return bytes undecoded in case of UnicodeDecodeError in h5netcdf-backend (pydata#8874)
  groupby: Dispatch quantile to flox. (pydata#8720)
  Opt out of auto creating index variables (pydata#8711)
  Update docs on view / copies (pydata#8744)
  Handle .oindex and .vindex for the PandasMultiIndexingAdapter and PandasIndexingAdapter (pydata#8869)
  numpy 2.0 copy-keyword and trapz vs trapezoid (pydata#8865)
  upstream-dev CI: Fix interp and cumtrapz (pydata#8861)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

Currently no way to create a Coordinates object without indexes for 1D variables
3 participants