sparse option to reindex and unstack #3542

fujiisoup · 2019-11-16T14:41:00Z

Closes Have "unstack" return a boolean mask? #3518
Tests added
Passes black . && mypy . && flake8
Fully documented, including whats-new.rst for all changes and api.rst for new API

Added sparse option to reindex and unstack.
I added only the minimal code needed to support unstack and reindex.

There is still a lot of room to complete sparse support, as discussed in #3245.

# Conflicts: # xarray/core/dataarray.py # xarray/core/dataset.py # xarray/tests/test_dataset.py

pep8speaks · 2019-11-16T14:41:32Z

Hello @fujiisoup! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-18 09:02:36 UTC

fujiisoup · 2019-11-16T14:49:04Z

xarray/core/variable.py

@@ -993,6 +993,32 @@ def chunk(self, chunks=None, name=None, lock=False):

        return type(self)(self.dims, data, self._attrs, self._encoding, fastpath=True)

+    def _as_sparse(self, sparse_format=_default, fill_value=dtypes.NA):


Currently, this is a private method.
We could probably make it public and add the same method to DataArray and Dataset in the future.

fujiisoup · 2019-11-16T14:49:58Z

xarray/core/variable.py

+        data = as_sparse(self.data.astype(dtype), fill_value=fill_value)
+        return self._replace(data=data)
+
+    def _to_dense(self):


Also private, as is _as_sparse.

We should make these public in DataArray and Dataset. See discussion here: #3245 (comment)

Can be left for a future PR though :)

Thanks, @dcherian.
I would like to make them public, but what would be the best names for these functions?
#3245

fujiisoup · 2019-11-16T14:51:08Z

xarray/tests/test_dataset.py

+        actual = data.reindex(dim3=dim3, sparse=True)
+        expected = data.reindex(dim3=dim3, sparse=False)
+        for k, v in data.data_vars.items():
+            np.testing.assert_equal(actual[k].data.todense(), expected[k].data)


Currently, assert_equal cannot be used as we need to explicitly densify the array for the comparison.

max-sixty · 2019-11-16T14:49:19Z

xarray/tests/test_variable.py

@@ -1862,6 +1863,17 @@ def test_getitem_with_mask_nd_indexer(self):
        )


+@requires_sparse
+class TestVariableWithSparse:
+    # TODO inherit VariableSubclassobjects to cover more tests


max-sixty · 2019-11-16T14:51:00Z

xarray/tests/test_dataset.py

+
+        actual = ds["var"].unstack("index", sparse=True)
+        expected = ds["var"].unstack("index")
+        assert actual.variable._to_dense().equals(expected.variable)


Do we test whether actual.variable is actually sparse?

max-sixty · 2019-11-16T14:53:31Z

xarray/tests/test_dataset.py

+        actual = data.reindex(dim3=dim3, sparse=True, fill_value=-10)
+        expected = data.reindex(dim3=dim3, sparse=False, fill_value=-10)
+        for k, v in data.data_vars.items():
+            np.testing.assert_equal(actual[k].data.todense(), expected[k].data)


I think we generally use assert_array_equal for numpy arrays (but I can't immediately recall the difference...)

I think these actually end up doing the exact same checks.

The values property does not work for a sparse-backed Variable, which causes assert_array_equal to fail.
I'll add a TODO comment for this.

shoyer

@fujiisoup it's great to have you back!

xarray/core/variable.py

shoyer · 2019-11-16T20:28:31Z

xarray/tests/test_dataset.py

+        actual = data.reindex(dim3=dim3, sparse=True, fill_value=-10)
+        expected = data.reindex(dim3=dim3, sparse=False, fill_value=-10)
+        for k, v in data.data_vars.items():
+            np.testing.assert_equal(actual[k].data.todense(), expected[k].data)


I think these actually end up doing the exact same checks.

shoyer · 2019-11-16T20:34:24Z

xarray/core/variable.py

+
+        if sparse_format is _default:
+            sparse_format = "coo"
+        as_sparse = getattr(sparse, "as_{}".format(sparse_format.lower()))


I like the idea of not hard-coding supported sparse formats, but I wonder if we could be a little more careful here if AttributeError is raised. We should probably catch AttributeError and re-raise with a more informative message if this fails.

Otherwise, I expect we might see bug reports from confused users, e.g., when sparse_format='csr' raises a confusing message.
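A minimal sketch of the suggested guard follows. It is an assumption about how this could be done, not the code that was merged: `get_sparse_converter` is a hypothetical helper, `fake_sparse` is a stand-in namespace so the example runs without the sparse package installed, and raising `ValueError` rather than a re-worded `AttributeError` is one possible choice.

```python
from types import SimpleNamespace

# Stand-in for the real `sparse` module (hypothetical); it only offers as_coo.
fake_sparse = SimpleNamespace(as_coo=lambda data: ("coo", data))


def get_sparse_converter(module, sparse_format="coo"):
    """Look up ``as_<format>`` on ``module`` and fail with a clear message."""
    name = "as_{}".format(sparse_format.lower())
    try:
        return getattr(module, name)
    except AttributeError:
        # Hide the bare AttributeError and name the offending format instead.
        raise ValueError(
            "{!r} is not a valid sparse format".format(sparse_format)
        ) from None


converter = get_sparse_converter(fake_sparse, "COO")
```

With this in place, an unsupported format such as `"csr"` produces a message that names the format, instead of an attribute lookup error on the module.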

shoyer · 2019-11-16T20:36:28Z

xarray/core/duck_array_ops.py

@@ -251,6 +253,9 @@ def count(data, axis=None):

 def where(condition, x, y):
    """Three argument where() with better dtype promotion rules."""
+    # sparse support
+    if isinstance(x, sparse_array_type) or isinstance(y, sparse_array_type):


I am a little surprised this is necessary. Does sparse not support __array_function__ for np.where?

Well, yes. sparse does not seem to work with np.result_type and astype(copy=False).
I'll add a TODO here.

do you have the latest version of sparse installed?

when I test this on my machine, it works:

In [13]: import sparse
In [14]: import numpy as np
In [15]: import xarray
In [16]: x = sparse.COO(np.arange(3))
In [17]: xarray.core.duck_array_ops.where(x > 1, x, x)
Out[17]: <COO: shape=(3,), dtype=int64, nnz=2, fill_value=0>

Thanks. You are right.
I was running sparse 0.7.0. With 0.8.0, it works.

shoyer · 2019-11-16T20:37:54Z

xarray/core/variable.py

+        """
+        import sparse
+
+        # TODO  what to do if dask-backended?


Hopefully sparse will raise an error if you try to convert a dask array into a sparse array! If not, we should do that ourselves.

Long term, the best solution would be to convert a dask array from dense chunks to sparse chunks.

xarray/core/dataset.py

shoyer · 2019-11-19T16:23:49Z

Thank you @fujiisoup !

I think it could be a little cleaner to entirely avoid sparse inside any of the reindex functions, but this is fine for now.

* master: (24 commits) Tweaks to release instructions (pydata#3555) Clarify conda environments for new contributors (pydata#3551) Revert to dev version 0.14.1 whatsnew (pydata#3547) sparse option to reindex and unstack (pydata#3542) Silence sphinx warnings (pydata#3516) Numpy 1.18 support (pydata#3537) tweak whats-new. (pydata#3540) small simplification of rename from pydata#3532 (pydata#3539) Added fill_value for unstack (pydata#3541) Add DatasetGroupBy.quantile (pydata#3527) ensure rename does not change index type (pydata#3532) Leave empty slot when not using accessors interpolate_na: Add max_gap support. (pydata#3302) units & deprecation merge (pydata#3530) Fix set_index when an existing dimension becomes a level (pydata#3520) add Variable._replace (pydata#3528) Tests for module-level functions with units (pydata#3493) Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502) FUNDING.yml (pydata#3523) ...

* upstream/master: (22 commits) Resolve the version issues on RTD (pydata#3589) Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585) update whats-new.rst (pydata#3581) Examples for quantile (pydata#3576) add cftime intersphinx entries (pydata#3577) Add pyXpcm to Related Projects doc page (pydata#3578) Reimplement quantile with apply_ufunc (pydata#3559) add environment file for binderized examples (pydata#3568) Add drop to api.rst under pending deprecations (pydata#3561) replace duplicate method _from_vars_and_coord_names (pydata#3565) propagate indexes in to_dataset, from_dataset (pydata#3519) Switch examples to notebooks + scipy19 docs improvements (pydata#3557) fix whats-new.rst (pydata#3554) Tweaks to release instructions (pydata#3555) Clarify conda environments for new contributors (pydata#3551) Revert to dev version 0.14.1 whatsnew (pydata#3547) sparse option to reindex and unstack (pydata#3542) Silence sphinx warnings (pydata#3516) Numpy 1.18 support (pydata#3537) ...

* upstream/master: (35 commits) fix plotting with transposed nondim coords. (pydata#3441) make coarsen reductions consistent with reductions on other classes (pydata#3500) Resolve the version issues on RTD (pydata#3589) Add bottleneck & rasterio git tip to upstream-dev CI (pydata#3585) update whats-new.rst (pydata#3581) Examples for quantile (pydata#3576) add cftime intersphinx entries (pydata#3577) Add pyXpcm to Related Projects doc page (pydata#3578) Reimplement quantile with apply_ufunc (pydata#3559) add environment file for binderized examples (pydata#3568) Add drop to api.rst under pending deprecations (pydata#3561) replace duplicate method _from_vars_and_coord_names (pydata#3565) propagate indexes in to_dataset, from_dataset (pydata#3519) Switch examples to notebooks + scipy19 docs improvements (pydata#3557) fix whats-new.rst (pydata#3554) Tweaks to release instructions (pydata#3555) Clarify conda environments for new contributors (pydata#3551) Revert to dev version 0.14.1 whatsnew (pydata#3547) sparse option to reindex and unstack (pydata#3542) ...

fujiisoup added 6 commits November 16, 2019 20:07

Added fill_value for unstack

4a6237a

remove sparse option and fix unintended changes

e7b470d

a bug fix

1df4a3c

Added sparse option to unstack and reindex

6a66831

black

3a369a1

Merge branch 'master' into sparse_reindex

6fe30e2

# Conflicts: # xarray/core/dataarray.py # xarray/core/dataset.py # xarray/tests/test_dataset.py

More tests

179cc1f

fujiisoup commented Nov 16, 2019

black

5d8ab27

max-sixty reviewed Nov 16, 2019

shoyer reviewed Nov 16, 2019

fujiisoup added 2 commits November 17, 2019 15:11

Remove sparse option from reindex

13ad683

try __array_function__ where

ac41ef8

fujiisoup mentioned this pull request Nov 17, 2019

sparse and other duck array issues #3245

Closed

flake8

92ce6cd

shoyer merged commit 220adbc into pydata:master Nov 19, 2019

fujiisoup deleted the sparse_reindex branch November 19, 2019 22:40

friedrichknuth mentioned this pull request Jan 14, 2020

Need documentation on sparse / cupy integration #3484

Open
