Add N-ary broadcasting operations. #98

hameerabbasi · 2018-02-10T11:20:08Z

This PR adds N-ary broadcasting operations (in preparation for where) and simplifies code for the N-ary case.

Discussed in #1

hameerabbasi · 2018-02-10T11:20:53Z

cc @shoyer Your input would be valuable, if you have time.
cc @mrocklin Your input is welcome, as always.

mrocklin

I left a few comments, mostly on style. I need to go over your earlier conversation with @shoyer before I'm able to properly review this.

mrocklin · 2018-02-10T13:52:54Z

sparse/tests/test_coo.py

+
+    assert_eq(sparse.elemwise(func, xs, ys, zs), func(x, y, z))
+
+


There are some extra checks in the removed tests that we may want to maintain, for example that the result of elemwise is a COO, and that its non-zeros are as expected.

We might also consider having tests for some of the following:

N-ary broadcasting where the arguments have different shapes

N-ary broadcasting including arguments that are scalars and zero-dimensional arrays

mrocklin · 2018-02-10T13:57:34Z

sparse/coo.py

+
+        __doc__ = func.__doc__
+
+    return Partial()


Thoughts on replacing this with just a functools.partial on top of a normal function?

This is our solution for dask

def partial_by_order(*args, **kwargs):
    """
    >>> from operator import add
    >>> partial_by_order(5, function=add, other=[(1, 10)])
    15
    """
    function = kwargs.pop('function')
    other = kwargs.pop('other')
    args2 = list(args)
    for i, arg in other:
        args2.insert(i, arg)
    return function(*args2, **kwargs)

I could have, but our situation is slightly unique:

We're using str(func) for exceptions. functools.wraps doesn't work on that for all callables (e.g. ufuncs), and breaks a few docstrings. This leads to illegible names in exceptions like _posarg_partial.<locals>.wrapper (and the same for debugging).

We're replacing a number of arguments in different positions.

I guess I could turn it into a class rather than a decorator style function.

I turned it into a callable class.
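A minimal sketch of what such a callable class could look like, for readers following the thread. The class name, attribute names, and constructor signature here are assumptions made for illustration and are not taken from the PR's actual code; the only points it tries to show are that fixed arguments are re-inserted at their original positions and that str() is delegated to the wrapped function so exception messages stay legible.

```python
import operator


class PositionalArgumentPartial:
    """Hypothetical callable that re-inserts fixed positional arguments."""

    def __init__(self, func, pos, posargs):
        self.func = func
        # Sort by position so that inserts happen from left to right.
        self.fixed = sorted(zip(pos, posargs), key=lambda pair: pair[0])
        self.__doc__ = func.__doc__

    def __call__(self, *args, **kwargs):
        args = list(args)
        for i, arg in self.fixed:
            # Each position refers to an index in the final argument list.
            args.insert(i, arg)
        return self.func(*args, **kwargs)

    def __str__(self):
        # Delegate to the wrapped callable so error messages name it directly.
        return str(self.func)

    __repr__ = __str__


sub_from_ten = PositionalArgumentPartial(operator.sub, [0], [10])
difference = sub_from_ten(3)  # computes operator.sub(10, 3)
```

Because the class defines __str__ itself, it does not depend on functools.wraps being able to copy metadata from callables such as ufuncs.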

mrocklin · 2018-02-10T13:58:37Z

sparse/coo.py

@@ -2426,80 +2468,39 @@ def _elemwise_unary(func, self, *args, **kwargs):
               sorted=self.sorted)


-def _get_matching_coords(coords1, coords2, shape1, shape2):
+def _get_nary_matching_coords(coords, params, shape):


Maybe just call this _get_matching_coords and drop the nary. Presumably there will be no need to distinguish any longer.

mrocklin · 2018-02-10T13:59:13Z

sparse/coo.py

-    matching_coords : np.ndarray
-        The coordinates of the output array for which both inputs will be nonzero.
+    numpy.ndarray
+        The broacasted coordinates.


Style nit, there is no need to place a period at the end of a phrase like this. We tend to reserve periods for full sentences.

mrocklin · 2018-02-10T14:02:31Z

sparse/coo.py

-    result_shape = _get_broadcast_shape(self.shape, other.shape)
+    Parameters
+    ----------
+    args : tuple[COO]


If you're trying for parametrized python type annotations then I think it's supposed to be standard to use capitalized types like List[COO] or Tuple[np.ndarray]

I don't know though, this is somewhat new to me.

tuple and list tend to work better with intersphinx and code type annotations, so I tend to prefer those. Of course, I could import something, but then that gives me PEP8 failures as I don't use it in code, just in docstrings.

mrocklin · 2018-02-10T14:04:53Z

sparse/coo.py

-    other_data = other_data[i]
+    # Filter out scalars as they are 'baked' into the function.
+    func = _posarg_partial(func, pos, posargs)
+    args = list(filter(lambda arg: not isscalar(arg), args))


You might consider toolz.remove here

I'd prefer not to introduce a dependency for something as simple as this.

mrocklin · 2018-02-10T14:07:07Z

sparse/coo.py

+    args = list(args)
+    posargs = []
+    pos = []
+    for i in range(len(args)):


You might consider for i, arg in enumerate(args), which might be a bit more idiomatic for Python readers
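As a rough illustration of the suggested idiom, the loop that separates scalars from array arguments could be written with enumerate along these lines. The function name split_scalars and the use of numbers.Number as the scalar check are assumptions for this sketch; the PR uses its own isscalar helper.

```python
from numbers import Number


def split_scalars(args):
    """Return (positions, scalar values, remaining non-scalar arguments)."""
    pos, posargs, arrays = [], [], []
    for i, arg in enumerate(args):
        if isinstance(arg, Number):  # assumed stand-in for the PR's isscalar
            pos.append(i)
            posargs.append(arg)
        else:
            arrays.append(arg)
    return pos, posargs, arrays


pos, posargs, arrays = split_scalars([[1, 2], 3, [4], 5.0])
```

With enumerate, both the index and the value are available without indexing back into args.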

mrocklin · 2018-02-10T14:08:33Z

sparse/coo.py

@@ -1954,6 +1894,20 @@ def tril(x, k=0):
    return COO(coords, data, x.shape, x.has_duplicates, x.sorted)


+def _nary_match(*arrays):


I'm not able to quickly figure out what this function does. Can I ask you for a small docstring? If possible I find small example sections in docstrings to be very helpful when learning codebases that others have written.

Whoops, must have missed that one.

It looks like this function is no longer used. Delete?

mrocklin · 2018-02-10T14:15:19Z

sparse/coo.py

+        ci, di = _unmatch_coo(func, args, mask, **kwargs)
+
+        coords_list.extend(ci)
+        data_list.extend(di)


This confuses me and seems concerning. I see that this was a main point of your conversation with @shoyer earlier. I probably have some thinking to do on this problem before I'm able to reasonably comment on this.

mrocklin · 2018-02-10T14:21:33Z

I think that it would be good to see a more comprehensive test suite that fully explains the complexity of what we're trying to accomplish here. I think that that will make it more clear as we discuss different possibilities here. We might ask "why are we doing X" and the answer can be "see test_X". I get the sense that you've thought deeply about this problem and know all of the problems that might arise. It would be very valuable to encode that deep thinking and all of those corner cases into a test suite.

hameerabbasi · 2018-02-10T15:06:43Z

I plan to make more comprehensive tests, yes. But the issue is some of the complexity can't be directly tested: For example, the optimizations are just that: Optimizations. We can design the tests so the optimizations are hit but we can't know that they kicked in without weird monkey-patching of some sort.

hameerabbasi · 2018-02-11T15:36:37Z

It seems there's a slight bug for number of inputs >2 and broadcasting, nothing unfixable, but will have to think a bit. I'm on it.

mrocklin · 2018-02-11T21:26:59Z

I plan to make more comprehensive tests, yes. But the issue is some of the complexity can't be directly tested: For example, the optimizations are just that: Optimizations. We can design the tests so the optimizations are hit but we can't know that they kicked in without weird monkey-patching of some sort.

I think that there are probably a lot of correctness tests that could be written as well. In #1 you discussed many situations that might arise for which a system like this would be necessary to catch. Ideally we would encode all of those situations as tests to ensure that future developers don't change code to alter correct behavior here.

shoyer · 2018-02-11T21:50:29Z

sparse/coo.py

-    other_data = other_data[i]
+    # Filter out scalars as they are 'baked' into the function.
+    func = PositinalArgumentPartial(func, pos, posargs)
+    args = list(filter(lambda arg: not isscalar(arg), args))


optional: consider using a list comprehension instead

shoyer · 2018-02-11T21:52:49Z

sparse/coo.py

-    matched_coords : np.ndarray
-        The overall coordinates that match from both arrays.
+    args : tuple[COO]
+        The input :obj:`COO` arrays.


add in func, mask and **kwargs to the docstring?

shoyer · 2018-02-11T21:53:01Z

sparse/coo.py


-    coords_list = []
-    data_list = []
+    pos, = np.where([not m for m in mask])


maybe use np.flatnonzero()?

This isn't really a numerical operation. I've converted it to a tuple(generator comprehension) form and avoided np.where altogether. The exact code is

pos = tuple(i for i, m in enumerate(mask) if not m)

shoyer · 2018-02-11T22:01:24Z

I agree with @mrocklin that a more extensive test suite is vital here. This logic is complicated and fixing bugs later will be hard. I haven't seriously tried to follow it yet.

I would suggest parametric tests verifying proper broadcasting with 2 or 3 arguments with:

the * operator, for sparse broadcasting

the + operator, for dense broadcasting

an order-dependent function (e.g., -), to verify that the argument order is right

pathological values such as np.infty and np.nan

hameerabbasi · 2018-02-17T11:38:20Z

I think this PR is now ready for a comprehensive review + merge. cc @mrocklin

Also cc @shoyer if you have the time.

mrocklin

Some small coverage comments

mrocklin · 2018-02-17T14:04:41Z

sparse/coo.py

+            pos.append(i)
+            posargs.append(args[i])
+        elif isinstance(arg, SparseArray) and not isinstance(arg, COO):
+            args[i] = COO(arg)


This line doesn't get hit by tests. Should we add a small DOK test?

mrocklin · 2018-02-17T14:05:09Z

sparse/coo.py

+            posargs.append(args[i])
+        elif isinstance(arg, SparseArray) and not isinstance(arg, COO):
+            args[i] = COO(arg)
+        elif not isinstance(arg, COO):
            raise ValueError("Performing this operation would produce "
                             "a dense result: %s" % str(func))


Same here. No test triggers this error-handling code.

Added a small test that hits this.

mrocklin · 2018-02-17T14:05:30Z

sparse/coo.py

+    args = [arg for arg in args if not isscalar(arg)]
+
+    if len(args) == 0:
+        return func(**kwargs)


Also here. No test operates on no args.

Added another small test for this.

mrocklin · 2018-02-17T14:06:58Z

sparse/coo.py

+        raise ValueError('Unknown kwargs %s' % kwargs.keys())
+
+    if return_midx and (len(args) != 2 or cache is not None):
+        raise NotImplementedError('Matching only supported for two args, and no cache.')


Do we still need this option?

No, we don't. I'm not omniscient, so I went ahead and added this check in case someone tried to trigger caching on return_midx (which we don't cache, it's never repeated); or tried to match indices for len(args) != 2 (I'm not sure if we'll need this in the future, but we might, and it's useful to err rather than have it return incorrect results).

mrocklin · 2018-02-17T14:22:51Z

sparse/tests/test_coo.py

+    fs = sparse.elemwise(func, *args)
+    assert isinstance(fs, COO)
+
+    assert_eq(fs, func(*dense_args))


It would be nice to test and verify that we are not creating unnecessary zeroes in the data attribute. We might either test that explicitly here, or we might put it into assert_eq. I've gone ahead and pushed a commit to your branch that adds a check into assert_eq. Please remove if you prefer not to add this here.

I'd like to verify we don't create additional zeros for all our operations, so that seems like a rather useful addition.

Although I would prefer to use np.count_nonzero.

Edit: I reconsidered, this might be more useful for fill values.

mrocklin · 2018-02-17T14:24:11Z

sparse/tests/test_coo.py

+    def value_array(n):
+        ar = np.empty((n,), dtype=np.float_)
+        ar[:] = value
+        return ar


We might want just a few of the values to be pathological instead of all of them.

I'll modify the test to match that.

hameerabbasi · 2018-02-17T20:34:59Z

I've incorporated more or less all of your suggestions about coverage, with one exception (see comments!)

mrocklin · 2018-02-17T20:36:16Z

@shoyer do you have a chance to look at this? "Nope" is a fine answer.

hameerabbasi · 2018-02-17T22:30:47Z

I'm guessing @shoyer doesn't work weekends. :-) If there's no reply or a "Nope" by the end of Monday, we can decide what to do next.

shoyer · 2018-02-17T22:32:02Z

It's a mixed bag on weekends, but this weekend my wife is away so I have time for open source :).

I'll take a look.

shoyer · 2018-02-17T22:36:20Z

sparse/tests/test_coo.py

+         (2,),
+         (3, 2),
+         (4, 3, 2),
+     ], lambda x, y, z: (x + y) * z),


Consider doing a full cross-product of shapes and functions here.

shoyer · 2018-02-17T22:40:17Z

sparse/tests/test_coo.py

+         (4, 4),
+         (4, 4, 4),
+     ], lambda x, y, z: x - y + z),
+])


It would be good to add checks for a few more variations on the broadcasting logic to exercise the matching logic:

Dimensions of size 1, e.g., (3, 1) + (3, 4) -> (3, 4)

Output shapes that don't match one of the inputs, e.g., (3, 1) + (1, 4) -> (3, 4).

Outputs that require matching across three inputs, e.g., (1, 1, 2) + (1, 3, 1) + (4, 1, 1) -> (4, 3, 2).

The first two were already covered in test_broadcasting. I renamed that to test_binary_broadcasting and moved it closer to these.

The third, I also added.
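For reference, the output shapes in the three cases above follow the usual NumPy broadcasting rule: shapes are aligned on their trailing dimensions, and along each axis the sizes must either agree or be 1. A small pure-Python sketch of that rule for any number of inputs (broadcast_shape is a hypothetical helper name, not a function from this PR):

```python
from itertools import zip_longest


def broadcast_shape(*shapes):
    """Compute the broadcast shape of any number of input shapes."""
    result = []
    # Walk the axes from the last one backwards; missing leading axes count as 1.
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError("shapes %r cannot be broadcast together" % (shapes,))
        result.append(sizes.pop() if sizes else 1)
    return tuple(reversed(result))


case_size_one = broadcast_shape((3, 1), (3, 4))
case_new_shape = broadcast_shape((3, 1), (1, 4))
case_three_way = broadcast_shape((1, 1, 2), (1, 3, 1), (4, 1, 1))
```

The third case is the one that only an N-ary implementation exercises, since no single input has the output shape and all three inputs contribute one axis each.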

shoyer · 2018-02-17T22:47:00Z

sparse/tests/test_coo.py

+    fs = sparse.elemwise(func, *args)
+    assert isinstance(fs, COO)
+
+    assert_eq(fs, func(*dense_args))


On a related note, maybe it would be useful to add a test that verifies "sparse" broadcasting is actually done in a sparse way?

I think you could do this by mocking the underlying functions (e.g., np.multiply) and then verifying that the calls match expectations.
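One way to picture the mocking idea, using only the standard library: wrap the elementwise function in a unittest.mock.Mock and check which operands it was actually called with. The dict-backed toy below is an assumption made purely for illustration and is unrelated to how COO stores data; sparse_multiply is a hypothetical name.

```python
import operator
from unittest import mock


def sparse_multiply(func, x, y):
    """Toy elementwise op on sparse vectors stored as {index: value}.

    Only indices present in both inputs are passed to func.
    """
    return {i: func(x[i], y[i]) for i in x.keys() & y.keys()}


x = {0: 2.0, 5: 3.0, 9: 4.0}
y = {5: 10.0, 7: 1.0}

# wraps= makes the mock forward calls to operator.mul while recording them.
spy = mock.Mock(wraps=operator.mul)
result = sparse_multiply(spy, x, y)

# The function was evaluated once, on the single pair of matching non-zeros.
spy.assert_called_once_with(3.0, 10.0)
```

A check of this kind is tied to the implementation rather than to the public API, so it would need updating whenever the internal call pattern changes.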

hameerabbasi · 2018-02-17T23:28:21Z

I can't seem to be able to respond to your "sparse" broadcasting comment, so I'm responding here.

I monkey-patched one of our own functions and verified the behavior is correct there. I also verified operator.add takes the 'dense' path.

Edit: However, I will add that like all monkey patching, it's implementation dependent, not (just) API dependent.

mrocklin · 2018-02-17T23:31:26Z

I can't seem to be able to respond to your "sparse" broadcasting comment, so I'm responding here.

Yeah, I was trying to do that as well. I haven't seen that before.

Is checking for the right number of non-zeros in the output not sufficient? Do we have code paths that would re-sparsify a dense intermediate result?

hameerabbasi · 2018-02-17T23:33:03Z

We're actually talking about the "optimized" code path for things like operator.mul where it only calculates full matches and not partial matches.
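To make the terminology concrete: a "full match" is a coordinate where every input is non-zero, and a "partial match" is one where only some inputs are. The sketch below uses dict-backed vectors as a stand-in for sparse arrays and assumes the caller already knows whether func returns zero whenever any operand is zero; how the library itself makes that decision is not shown here, and the elemwise function and its only_full_matches flag are hypothetical.

```python
import operator


def elemwise(func, x, y, only_full_matches):
    """Apply func to two sparse vectors stored as {index: value}."""
    if only_full_matches:
        keys = x.keys() & y.keys()  # coordinates non-zero in both inputs
    else:
        keys = x.keys() | y.keys()  # full matches plus partial matches
    out = {}
    for i in sorted(keys):
        value = func(x.get(i, 0.0), y.get(i, 0.0))
        if value != 0:  # never store explicit zeros
            out[i] = value
    return out


x = {0: 2.0, 5: 3.0}
y = {5: 10.0, 7: 1.0}

product_fast = elemwise(operator.mul, x, y, only_full_matches=True)
product_slow = elemwise(operator.mul, x, y, only_full_matches=False)
total = elemwise(operator.add, x, y, only_full_matches=False)
```

For multiplication both paths give the same result, which is why the shortcut is an optimization and why counting non-zeros in the output cannot tell the two paths apart.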

mrocklin · 2018-02-17T23:40:13Z

Ah, happy to retract my comment

hameerabbasi · 2018-02-17T23:41:54Z

No, let's leave it up for other people who can't follow the terminology.

hameerabbasi · 2018-02-18T19:05:49Z

Are there any further comments or is this good to merge?

shoyer · 2018-02-18T19:21:28Z

I haven't reviewed the logic in detail, but the implementation looks relatively sane and I am satisfied with the test coverage. 👍

mrocklin · 2018-02-18T21:29:44Z

Merged!

hameerabbasi added 3 commits February 4, 2018 21:53

Start working on N-ary broadcasting.

6c2ecf9

N-ary broadcasting now works!

fa65902

Docs.

9ec04e6

hameerabbasi requested review from mrocklin and removed request for mrocklin February 10, 2018 11:30

hameerabbasi mentioned this pull request Feb 10, 2018

Support Everything that XArray Expects #1

Open

16 tasks

mrocklin reviewed Feb 10, 2018

Add back test disabled for debugging.

84e4e6c

hameerabbasi force-pushed the nary-broadcast branch from 03a552f to 84e4e6c Compare February 10, 2018 15:09

shoyer reviewed Feb 11, 2018

Complete elemwise broadcasting and add tests recommended by @mrocklin and @shoyer.

a715e9c

hameerabbasi force-pushed the nary-broadcast branch from d96e178 to a715e9c Compare February 17, 2018 11:31

Filter out warnings in pathological case.

ebd69f2

mrocklin reviewed Feb 17, 2018

Add test for nnz in assert_eq

4ba99d1

mrocklin reviewed Feb 17, 2018

Add additional tests recommended by @mrocklin.

2fabb72

Delete unused function.

0b3111f

shoyer reviewed Feb 17, 2018

Move broadcasting test and add sparse vs dense broadcasting case.

031fb67

hameerabbasi added 2 commits February 18, 2018 09:21

Fix potential spontaneous test failures.

e10fd4a

Avoid unnecessary operations when matching indices.

97353c8

mrocklin merged commit ef423c6 into pydata:master Feb 18, 2018

hameerabbasi deleted the nary-broadcast branch February 18, 2018 21:32

hameerabbasi added a commit to hameerabbasi/sparse that referenced this pull request Feb 27, 2018

Add N-ary broadcasting operations. (pydata#98)

cf4a147
