
Adds Dataset.query() method, analogous to pandas DataFrame.query() #4984

Merged · 16 commits · Mar 16, 2021

Conversation

@alimanfoo (Contributor) commented Mar 2, 2021

This PR adds a Dataset.query() method which enables making a selection from a dataset based on values in one or more data variables, where the selection is given as a query expression to be evaluated against the data variables in the dataset. See also discussion.

  • Tests added
  • Passes pre-commit run --all-files
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
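
For context, a minimal usage sketch of the method being added, based on the description above and the API discussed later in the thread (the example data and expression are illustrative, not taken from the PR):

```python
import numpy as np
import xarray as xr

# Illustrative data: two data variables along a single "x" dimension.
ds = xr.Dataset(
    {"a": ("x", np.arange(10)), "b": ("x", np.linspace(0.0, 1.0, 10))}
)

# Each keyword names a dimension; the string is a query expression
# evaluated against the dataset's variables, and the positions where it
# is True are selected along that dimension.
selected = ds.query(x="(a > 5) & (b < 0.9)")
```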

@alimanfoo (Contributor, Author)

Hi folks, thought I'd put up a proof of concept PR here for further discussion. Any advice/suggestions about if/how to take this forward would be very welcome.

@max-sixty (Collaborator) left a comment

Thanks @alimanfoo, this looks like a great start. And forgive me for taking a few days to respond.

Does pd.eval work with more than two dimensions? Great if so! This would be very high impact per line of code :)

Please could we add some tests for that?

@alimanfoo (Contributor, Author)

Hi @max-sixty, no problem. Re this...

> Does pd.eval work with more than two dimensions?

...not quite sure what you mean, could you elaborate?

@alimanfoo (Contributor, Author)

Just to mention I've added tests to verify this works with variables backed by dask arrays. Also added explicit tests of different eval engine and query parser options. And added a docstring.
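
As a sketch of what those options look like in use (illustrative data; this assumes the engine and parser arguments follow the pandas-style expression evaluation described in this thread, and the lazy example requires dask):

```python
import dask.array as darr
import xarray as xr

# A dask-backed data variable along "x".
ds = xr.Dataset({"a": ("x", darr.arange(100, chunks=10))})

# parser/engine mirror the pandas eval options: the "python" engine
# avoids the optional numexpr dependency, and "pandas" is the default
# expression parser.
result = ds.query(x="a > 50", parser="pandas", engine="python")
```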

@max-sixty (Collaborator)

> Hi @max-sixty, no problem. Re this...
>
> Does pd.eval work with more than two dimensions?
>
> ...not quite sure what you mean, could you elaborate?

For sure — forgive me if I wasn't clear.

Currently the test runs over an array of two dimensions — x & y. Would pd.query work if there were also a z dimension?

@alimanfoo (Contributor, Author)

> Currently the test runs over an array of two dimensions — x & y. Would pd.query work if there were also a z dimension?

No worries. Yes, any number of dimensions can be queried; I've added tests covering three dimensions.

As an aside, in writing these tests I came upon a probable upstream bug in pandas, reported as pandas-dev/pandas#40436. I don't think it affects this PR, though, and it has low impact, since only the "python" query parser is affected and most people will use the default "pandas" parser.
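
To illustrate the point about dimensionality, a hypothetical three-dimensional example (the variables and expressions are made up rather than copied from the tests):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "a": ("x", np.arange(4)),
        "b": ("y", np.linspace(0.0, 1.0, 5)),
        "c": ("z", np.arange(6)),
    }
)

# One expression per dimension; each is evaluated independently and the
# dataset is subset along all three dimensions at once.
ds.query(x="a > 1", y="b < 0.5", z="c >= 2")
```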

@max-sixty (Collaborator)

Great re the dimensions!

I reviewed the tests more fully; they look great.

It looks like we need a requires_numexpr decorator on the tests — would you be OK to add that?

Could we add a simple method to DataArray which converts to a Dataset, calls the Dataset method, and converts back? (There are already lots of examples of this; let me know if you hit any issues.)

And we should add the methods to api.rst, and a whatsnew entry if possible.


Does anyone have any other thoughts? I think the API is very reasonable. I could imagine a more sophisticated API that takes a single query rather than a dict of queries keyed by dimension (currently it's da.query(x="x>3", y="y>4")), but that would require more work and more decisions, even if it were preferable.
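
For reference, the "convert to a Dataset and back" pattern mentioned above might look roughly like this as a free function; the helper name and placeholder are hypothetical, and this is not the merged implementation:

```python
import xarray as xr

def dataarray_query(arr: xr.DataArray, parser="pandas", engine=None, **queries) -> xr.DataArray:
    # Wrap the array in a temporary single-variable Dataset, reuse
    # Dataset.query(), then unwrap the result and restore the name.
    name = arr.name if arr.name is not None else "__temporary__"  # placeholder name
    ds = arr.to_dataset(name=name)
    queried = ds.query(parser=parser, engine=engine, **queries)
    result = queried[name]
    result.name = arr.name
    return result
```

With the example above, this would be called as dataarray_query(da, x="x>3", y="y>4").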

@alimanfoo (Contributor, Author)

Hi @max-sixty,

> It looks like we need a requires_numexpr decorator on the tests — would you be OK to add that?

Sure, done.
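
For illustration, a simplified sketch of the usual skip-if-missing pattern for optional test dependencies; the helper below is a stand-in rather than the exact helper in xarray's test suite:

```python
import importlib

import pytest

def _importorskip(modname):
    # Return a flag plus a skipif marker for an optional dependency.
    try:
        importlib.import_module(modname)
        has = True
    except ImportError:
        has = False
    return has, pytest.mark.skipif(not has, reason=f"requires {modname}")

has_numexpr, requires_numexpr = _importorskip("numexpr")

@requires_numexpr
def test_query_with_numexpr_engine():
    ...
```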

> Could we add a simple method to DataArray which converts to a Dataset, calls the Dataset method, and converts back? (There are already lots of examples of this; let me know if you hit any issues.)

Done.

> And we should add the methods to api.rst, and a whatsnew entry if possible.

Done.

Let me know if there's anything else. Looking forward to using this 😄

@max-sixty (Collaborator)

Excellent!

Could we add a very small test for the DataArray? Given the coverage on Dataset, it should mostly just test that the method works.

Any thoughts from others before we merge?

@alimanfoo (Contributor, Author)

> Could we add a very small test for the DataArray? Given the coverage on Dataset, it should mostly just test that the method works.

No problem, some DataArray tests are there.
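
A hypothetical minimal test along those lines (not the merged test; the data and assertion are illustrative):

```python
import numpy as np
import xarray as xr

def test_dataarray_query_minimal():
    # Name the array so the query expression can refer to it by name.
    da = xr.DataArray(np.arange(10), dims="x", name="a")
    actual = da.query(x="a > 5")
    expected = da[da > 5]  # plain boolean indexing on the same condition
    xr.testing.assert_identical(actual, expected)
```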

> Any thoughts from others before we merge?

Good to go from my side.

@dcherian (Contributor) left a comment

The code LGTM. I didn't check the tests closely but they seem v. thorough. Thanks @alimanfoo!

In a future PR, it would be good to add some docs comparing this to using .where: https://xarray.pydata.org/en/latest/user-guide/indexing.html#masking-with-where
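
To sketch the kind of comparison such docs could make (illustrative data, not from the PR):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.arange(5))})

# .where keeps the original shape and masks non-matching values with NaN
# (or drops them when drop=True is passed) ...
masked = ds.where(ds.a > 2)

# ... while .query subsets along the named dimension, dropping the
# non-matching positions entirely.
subset = ds.query(x="a > 2")
```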

@max-sixty (Collaborator)

Great, merging!

Seconded re the docs!

@max-sixty merged commit 37fe544 into pydata:master on Mar 16, 2021
@alimanfoo deleted the dataset-query-20210302 branch on March 16, 2021 18:26
@alimanfoo (Contributor, Author)

Yay, first xarray PR 🥳

dcherian added a commit to dcherian/xarray that referenced this pull request Mar 18, 2021
…indow

* upstream/master:
  Fix regression in decoding large standard calendar times (pydata#5050)
  Fix sticky sidebar responsiveness on small screens (pydata#5039)
  Flexible indexes refactoring notes (pydata#4979)
  add a install xarray step to the upstream-dev CI (pydata#5044)
  Adds Dataset.query() method, analogous to pandas DataFrame.query() (pydata#4984)
  run tests on python 3.9 (pydata#5040)
  Add date attribute to datetime accessor (pydata#4994)
  📚 New theme & rearrangement of the docs (pydata#4835)
  upgrade ci-trigger to the most recent version (pydata#5037)
  GH5005 fix documentation on open_rasterio (pydata#5021)
  GHA for automatically canceling previous CI runs (pydata#5025)
  Implement GroupBy.__getitem__ (pydata#3691)
  conventions: decode unsigned integers to signed if _Unsigned=false (pydata#4966)
  Added support for numpy.bool_ (pydata#4986)
  Add additional str accessor methods for DataArray (pydata#4622)
dcherian added a commit to dcherian/xarray that referenced this pull request Mar 23, 2021
…-tasks

* upstream/master:
  Fix regression in decoding large standard calendar times (pydata#5050)
  Fix sticky sidebar responsiveness on small screens (pydata#5039)
  Flexible indexes refactoring notes (pydata#4979)
  add a install xarray step to the upstream-dev CI (pydata#5044)
  Adds Dataset.query() method, analogous to pandas DataFrame.query() (pydata#4984)
  run tests on python 3.9 (pydata#5040)
  Add date attribute to datetime accessor (pydata#4994)
  📚 New theme & rearrangement of the docs (pydata#4835)
  upgrade ci-trigger to the most recent version (pydata#5037)
  GH5005 fix documentation on open_rasterio (pydata#5021)
  GHA for automatically canceling previous CI runs (pydata#5025)
  Implement GroupBy.__getitem__ (pydata#3691)
  conventions: decode unsigned integers to signed if _Unsigned=false (pydata#4966)
  Added support for numpy.bool_ (pydata#4986)
  Add additional str accessor methods for DataArray (pydata#4622)
  add polyval to polyfit see also (pydata#5020)
  mention map_blocks in the docstring of apply_ufunc (pydata#5011)
  Switch backend API to v2 (pydata#4989)
  WIP: add new backend api documentation (pydata#4810)
  pin netCDF4=1.5.3 in min-all-deps (pydata#4982)