Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sparse option to reindex and unstack #3542

Merged
merged 11 commits into from
Nov 19, 2019
Merged

Conversation

fujiisoup
Copy link
Member

Added sparse option to reindex and unstack.
I just added a minimal set of codes necessary to unstack and reindex.

There is still a lot of space to complete the sparse support as discussed in #3245.

@pep8speaks
Copy link

pep8speaks commented Nov 16, 2019

Hello @fujiisoup! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-18 09:02:36 UTC

@@ -993,6 +993,32 @@ def chunk(self, chunks=None, name=None, lock=False):

return type(self)(self.dims, data, self._attrs, self._encoding, fastpath=True)

def _as_sparse(self, sparse_format=_default, fill_value=dtypes.NA):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this is a private method.
Probably we can expose it to the public and add the same method to DataArray and Dataset as well in the future.

data = as_sparse(self.data.astype(dtype), fill_value=fill_value)
return self._replace(data=data)

def _to_dense(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also private, as is _as_sparse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make these public in DataArray and Dataset. See discussion here: #3245 (comment)

Can be left for a future PR though :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @dcherian.
I would like to expose them to public, but what is the best name of these functions?
#3245

actual = data.reindex(dim3=dim3, sparse=True)
expected = data.reindex(dim3=dim3, sparse=False)
for k, v in data.data_vars.items():
np.testing.assert_equal(actual[k].data.todense(), expected[k].data)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, assert_equal cannot be used as we need to explicitly densify the array for the comparison.

@@ -1862,6 +1863,17 @@ def test_getitem_with_mask_nd_indexer(self):
)


@requires_sparse
class TestVariableWithSparse:
# TODO inherit VariableSubclassobjects to cover more tests
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


actual = ds["var"].unstack("index", sparse=True)
expected = ds["var"].unstack("index")
assert actual.variable._to_dense().equals(expected.variable)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we test whether actual.variable is actually sparse?

actual = data.reindex(dim3=dim3, sparse=True, fill_value=-10)
expected = data.reindex(dim3=dim3, sparse=False, fill_value=-10)
for k, v in data.data_vars.items():
np.testing.assert_equal(actual[k].data.todense(), expected[k].data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we generally use assert_array_equal for numpy arrays (but I can't immediately recall the difference...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these actually end up doing the exact same checks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values property does not work for the sparse-backed Variable, resulting in the failure of assert_array_equal.
I'll add TODO comment for this.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fujiisoup it's great to have you back!

xarray/core/variable.py Outdated Show resolved Hide resolved
actual = data.reindex(dim3=dim3, sparse=True, fill_value=-10)
expected = data.reindex(dim3=dim3, sparse=False, fill_value=-10)
for k, v in data.data_vars.items():
np.testing.assert_equal(actual[k].data.todense(), expected[k].data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these actually end up doing the exact same checks.


if sparse_format is _default:
sparse_format = "coo"
as_sparse = getattr(sparse, "as_{}".format(sparse_format.lower()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of not hard-coding supported sparse formats, but I wonder if we could be a little more careful here if AttributeError is raised. We should probably catch and re-raise Attribute error with a more informative message if this fails.

Otherwise, I expect we might see bug reports from confused users, e.g., when sparse_format='csr' raises a confusing message.

@@ -251,6 +253,9 @@ def count(data, axis=None):

def where(condition, x, y):
"""Three argument where() with better dtype promotion rules."""
# sparse support
if isinstance(x, sparse_array_type) or isinstance(y, sparse_array_type):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little surprised this is necessary. Does sparse not support __array_function__ for np.where?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, yes. sparse looks not working with np.result_type and astype(copy=False).
I'll add a TODO here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have the latest version of sparse installed?

when I test this on my machine, it works:

In [13]: import sparse

In [14]: import numpy as np

In [15]: import xarray

In [16]: x = sparse.COO(np.arange(3))

In [17]: xarray.core.duck_array_ops.where(x > 1, x, x)
Out[17]: <COO: shape=(3,), dtype=int64, nnz=2, fill_value=0>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. You are right.
I was running with sparse 0.7.0. With 0.8.0, it is running.

"""
import sparse

# TODO what to do if dask-backended?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully sparse will raise an error if you try to convert a dask array into a sparse array! If not, we should do that ourselves.

Long term, the best solution would be to convert a dask array from dense chunks to sparse chunks.

xarray/core/dataset.py Outdated Show resolved Hide resolved
@shoyer shoyer merged commit 220adbc into pydata:master Nov 19, 2019
@shoyer
Copy link
Member

shoyer commented Nov 19, 2019

Thank you @fujiisoup !

I think it could be a little cleaner to entirely avoid sparse inside any of the reindex functions, but this is fine for now.

@fujiisoup fujiisoup deleted the sparse_reindex branch November 19, 2019 22:40
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 22, 2019
* master: (24 commits)
  Tweaks to release instructions (pydata#3555)
  Clarify conda environments for new contributors (pydata#3551)
  Revert to dev version
  0.14.1 whatsnew (pydata#3547)
  sparse option to reindex and unstack (pydata#3542)
  Silence sphinx warnings (pydata#3516)
  Numpy 1.18 support (pydata#3537)
  tweak whats-new. (pydata#3540)
  small simplification of rename from pydata#3532 (pydata#3539)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Dec 4, 2019
* upstream/master: (22 commits)
  Resolve the version issues on RTD (pydata#3589)
  Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585)
  update whats-new.rst (pydata#3581)
  Examples for quantile (pydata#3576)
  add cftime intersphinx entries (pydata#3577)
  Add pyXpcm to Related Projects doc page (pydata#3578)
  Reimplement quantile with apply_ufunc (pydata#3559)
  add environment file for binderized examples (pydata#3568)
  Add drop to api.rst under pending deprecations (pydata#3561)
  replace duplicate method _from_vars_and_coord_names (pydata#3565)
  propagate indexes in to_dataset, from_dataset (pydata#3519)
  Switch examples to notebooks + scipy19 docs improvements (pydata#3557)
  fix whats-new.rst (pydata#3554)
  Tweaks to release instructions (pydata#3555)
  Clarify conda environments for new contributors (pydata#3551)
  Revert to dev version
  0.14.1 whatsnew (pydata#3547)
  sparse option to reindex and unstack (pydata#3542)
  Silence sphinx warnings (pydata#3516)
  Numpy 1.18 support (pydata#3537)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Dec 5, 2019
* upstream/master: (35 commits)
  fix plotting with transposed nondim coords. (pydata#3441)
  make coarsen reductions consistent with reductions on other classes (pydata#3500)
  Resolve the version issues on RTD (pydata#3589)
  Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585)
  update whats-new.rst (pydata#3581)
  Examples for quantile (pydata#3576)
  add cftime intersphinx entries (pydata#3577)
  Add pyXpcm to Related Projects doc page (pydata#3578)
  Reimplement quantile with apply_ufunc (pydata#3559)
  add environment file for binderized examples (pydata#3568)
  Add drop to api.rst under pending deprecations (pydata#3561)
  replace duplicate method _from_vars_and_coord_names (pydata#3565)
  propagate indexes in to_dataset, from_dataset (pydata#3519)
  Switch examples to notebooks + scipy19 docs improvements (pydata#3557)
  fix whats-new.rst (pydata#3554)
  Tweaks to release instructions (pydata#3555)
  Clarify conda environments for new contributors (pydata#3551)
  Revert to dev version
  0.14.1 whatsnew (pydata#3547)
  sparse option to reindex and unstack (pydata#3542)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Have "unstack" return a boolean mask?
5 participants