
Adds Dataset.query() method, analogous to pandas DataFrame.query() #4984

Merged · 16 commits · Mar 16, 2021

Conversation

@alimanfoo (Contributor) commented Mar 2, 2021

This PR adds a Dataset.query() method which enables making a selection from a dataset based on values in one or more data variables, where the selection is given as a query expression to be evaluated against the data variables in the dataset. See also discussion.

  • Tests added
  • Passes pre-commit run --all-files
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
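
For context, a minimal usage sketch of the method being added, based on the description above and the API discussed later in the thread (the example data and expression are illustrative, not taken from the PR):

```python
import numpy as np
import xarray as xr

# Illustrative data: two data variables along a single "x" dimension.
ds = xr.Dataset(
    {"a": ("x", np.arange(10)), "b": ("x", np.linspace(0.0, 1.0, 10))}
)

# Each keyword names a dimension; the string is a query expression
# evaluated against the dataset's variables, and the positions where it
# is True are selected along that dimension.
selected = ds.query(x="(a > 5) & (b < 0.9)")
```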

@alimanfoo (Contributor, Author)

Hi folks, thought I'd put up a proof of concept PR here for further discussion. Any advice/suggestions about if/how to take this forward would be very welcome.

@max-sixty (Collaborator) left a comment

Thanks @alimanfoo, this looks like a great start. And forgive me for taking a few days to respond.

Does pd.eval work with more than two dimensions? Great if so! This would be very high impact per line of code :)

Please could we add some tests for that?

@alimanfoo (Contributor, Author)

Hi @max-sixty, no problem. Re this...

> Does pd.eval work with more than two dimensions?

...not quite sure what you mean, could you elaborate?

@alimanfoo (Contributor, Author)

Just to mention I've added tests to verify this works with variables backed by dask arrays. Also added explicit tests of different eval engine and query parser options. And added a docstring.
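
As a sketch of what those options look like in use (illustrative data; this assumes the engine and parser arguments follow the pandas-style expression evaluation described in this thread, and the lazy example requires dask):

```python
import dask.array as darr
import xarray as xr

# A dask-backed data variable along "x".
ds = xr.Dataset({"a": ("x", darr.arange(100, chunks=10))})

# parser/engine mirror the pandas eval options: the "python" engine
# avoids the optional numexpr dependency, and "pandas" is the default
# expression parser.
result = ds.query(x="a > 50", parser="pandas", engine="python")
```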

@max-sixty (Collaborator)

> Hi @max-sixty, no problem. Re this...
>
> Does pd.eval work with more than two dimensions?
>
> ...not quite sure what you mean, could you elaborate?

For sure — forgive me if I wasn't clear.

Currently the test runs over an array of two dimensions — x & y. Would pd.query work if there were also a z dimension?

@alimanfoo (Contributor, Author)

> Currently the test runs over an array of two dimensions — x & y. Would pd.query work if there were also a z dimension?

No worries. Yes, any number of dimensions can be queried; I've added tests covering three dimensions.

As an aside, in writing these tests I came upon a probable upstream bug in pandas, reported as pandas-dev/pandas#40436. I don't think it affects this PR, though, and it has low impact, since only the "python" query parser is affected and most people will use the default "pandas" parser.
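
To illustrate the point about dimensionality, a hypothetical three-dimensional example (the variables and expressions are made up rather than copied from the tests):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": ("x", np.arange(4)),
        "b": ("y", np.linspace(0.0, 1.0, 5)),
        "c": ("z", np.arange(6)),
    }
)

# One expression per dimension; each is evaluated independently and the
# dataset is subset along all three dimensions at once.
ds.query(x="a > 1", y="b < 0.5", z="c >= 2")
```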

@max-sixty (Collaborator)

Great re the dimensions!

I reviewed the tests more fully; they look great.

It looks like we need a requires_numexpr decorator on the tests — would you be OK to add that?

Could we add a simple method to DataArray which converts to a Dataset, calls the Dataset method, and converts back? (There are already lots of examples of this; let me know if you hit any issues.)

And we should add the methods to api.rst, and a whatsnew entry if possible.


Does anyone have any other thoughts? I think the API is very reasonable. I could imagine a more sophisticated API that takes a single query rather than a dict of queries keyed by dimension (currently it's da.query(x="x>3", y="y>4")), but that would require more work and more decisions, even if it were preferable.
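
For reference, the "convert to a Dataset and back" pattern mentioned above might look roughly like this as a free function; the helper name and placeholder are hypothetical, and this is not the merged implementation:

```python
import xarray as xr

def dataarray_query(arr: xr.DataArray, parser="pandas", engine=None, **queries) -> xr.DataArray:
    # Wrap the array in a temporary single-variable Dataset, reuse
    # Dataset.query(), then unwrap the result and restore the name.
    name = arr.name if arr.name is not None else "__temporary__"  # placeholder name
    ds = arr.to_dataset(name=name)
    queried = ds.query(parser=parser, engine=engine, **queries)
    result = queried[name]
    result.name = arr.name
    return result
```

With the example above, this would be called as dataarray_query(da, x="x>3", y="y>4").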

@alimanfoo (Contributor, Author)

Hi @max-sixty,

> It looks like we need a requires_numexpr decorator on the tests — would you be OK to add that?

Sure, done.
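
For illustration, a simplified sketch of the usual skip-if-missing pattern for optional test dependencies; the helper below is a stand-in rather than the exact helper in xarray's test suite:

```python
import importlib

import pytest

def _importorskip(modname):
    # Return a flag plus a skipif marker for an optional dependency.
    try:
        importlib.import_module(modname)
        has = True
    except ImportError:
        has = False
    return has, pytest.mark.skipif(not has, reason=f"requires {modname}")

has_numexpr, requires_numexpr = _importorskip("numexpr")

@requires_numexpr
def test_query_with_numexpr_engine():
    ...
```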

> Could we add a simple method to DataArray which converts to a Dataset, calls the Dataset method, and converts back? (There are already lots of examples of this; let me know if you hit any issues.)

Done.

> And we should add the methods to api.rst, and a whatsnew entry if possible.

Done.

Let me know if there's anything else. Looking forward to using this 😄

@max-sixty (Collaborator)

Excellent!

Could we add a very small test for the DataArray? Given the coverage on Dataset, it should mostly just test that the method works.

Any thoughts from others before we merge?

@alimanfoo (Contributor, Author)

> Could we add a very small test for the DataArray? Given the coverage on Dataset, it should mostly just test that the method works.

No problem, some DataArray tests are there.
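
A hypothetical minimal test along those lines (not the merged test; the data and assertion are illustrative):

```python
import numpy as np
import xarray as xr

def test_dataarray_query_minimal():
    # Name the array so the query expression can refer to it by name.
    da = xr.DataArray(np.arange(10), dims="x", name="a")
    actual = da.query(x="a > 5")
    expected = da[da > 5]  # plain boolean indexing on the same condition
    xr.testing.assert_identical(actual, expected)
```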

> Any thoughts from others before we merge?

Good to go from my side.

@dcherian (Contributor) left a comment

The code LGTM. I didn't check the tests closely but they seem v. thorough. Thanks @alimanfoo!

In a future PR, it would be good to add some docs comparing this to using .where: https://xarray.pydata.org/en/latest/user-guide/indexing.html#masking-with-where
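
To sketch the kind of comparison such docs could make (illustrative data, not from the PR):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.arange(5))})

# .where keeps the original shape and masks non-matching values with NaN
# (or drops them when drop=True is passed) ...
masked = ds.where(ds.a > 2)

# ... while .query subsets along the named dimension, dropping the
# non-matching positions entirely.
subset = ds.query(x="a > 2")
```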

@max-sixty (Collaborator)

Great, merging!

Seconded re the docs!

@max-sixty merged commit 37fe544 into pydata:master on Mar 16, 2021
@alimanfoo deleted the dataset-query-20210302 branch on March 16, 2021 18:26
@alimanfoo (Contributor, Author)

Yay, first xarray PR 🥳

dcherian added a commit to dcherian/xarray that referenced this pull request Mar 18, 2021
…indow

* upstream/master:
  Fix regression in decoding large standard calendar times (pydata#5050)
  Fix sticky sidebar responsiveness on small screens (pydata#5039)
  Flexible indexes refactoring notes (pydata#4979)
  add a install xarray step to the upstream-dev CI (pydata#5044)
  Adds Dataset.query() method, analogous to pandas DataFrame.query() (pydata#4984)
  run tests on python 3.9 (pydata#5040)
  Add date attribute to datetime accessor (pydata#4994)
  📚 New theme & rearrangement of the docs (pydata#4835)
  upgrade ci-trigger to the most recent version (pydata#5037)
  GH5005 fix documentation on open_rasterio (pydata#5021)
  GHA for automatically canceling previous CI runs (pydata#5025)
  Implement GroupBy.__getitem__ (pydata#3691)
  conventions: decode unsigned integers to signed if _Unsigned=false (pydata#4966)
  Added support for numpy.bool_ (pydata#4986)
  Add additional str accessor methods for DataArray (pydata#4622)
dcherian added a commit to dcherian/xarray that referenced this pull request Mar 23, 2021
…-tasks

* upstream/master:
  Fix regression in decoding large standard calendar times (pydata#5050)
  Fix sticky sidebar responsiveness on small screens (pydata#5039)
  Flexible indexes refactoring notes (pydata#4979)
  add a install xarray step to the upstream-dev CI (pydata#5044)
  Adds Dataset.query() method, analogous to pandas DataFrame.query() (pydata#4984)
  run tests on python 3.9 (pydata#5040)
  Add date attribute to datetime accessor (pydata#4994)
  📚 New theme & rearrangement of the docs (pydata#4835)
  upgrade ci-trigger to the most recent version (pydata#5037)
  GH5005 fix documentation on open_rasterio (pydata#5021)
  GHA for automatically canceling previous CI runs (pydata#5025)
  Implement GroupBy.__getitem__ (pydata#3691)
  conventions: decode unsigned integers to signed if _Unsigned=false (pydata#4966)
  Added support for numpy.bool_ (pydata#4986)
  Add additional str accessor methods for DataArray (pydata#4622)
  add polyval to polyfit see also (pydata#5020)
  mention map_blocks in the docstring of apply_ufunc (pydata#5011)
  Switch backend API to v2 (pydata#4989)
  WIP: add new backend api documentation (pydata#4810)
  pin netCDF4=1.5.3 in min-all-deps (pydata#4982)