This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Behavior of .all() is not consistent with numpy #166

Closed
masonproffitt opened this issue Jul 19, 2019 · 7 comments

Comments

@masonproffitt
Contributor

Consider the difference between .all() with a numpy.ndarray and a JaggedArray:

>>> import numpy, awkward
>>> numpy_array = numpy.array([[True]])
>>> awkward_array = awkward.fromiter([[True]])
>>> numpy_array
array([[ True]])
>>> awkward_array
<JaggedArray [[True]] at 0x7f353ea89278>
>>> numpy_array.all()
True
>>> awkward_array.all()
array([ True])

One would intuitively expect .all() to give the same result in both of these cases, but for a JaggedArray an array is returned rather than a bool. Numpy does a logical AND over all dimensions of the input array, whereas awkward seems to do an AND over only the final dimension. Is there a good reason why AwkwardArray.all() doesn't follow the same convention as numpy's default .all()?

@masonproffitt
Contributor Author

masonproffitt commented Jul 19, 2019

The same goes of course for similar reduction methods like any, sum, prod, min, max, mean, var, and std. I get that you can't reduce along an arbitrary axis like in numpy because of the jaggedness (so there can't be an integer axis parameter), but it is always possible to reduce along all axes (by repeatedly reducing along the final dimension until you get a scalar).

Edit: The last claim is at least true for JaggedArray. I guess there are things like Tables with Columns of differing shapes where this isn't possible, but it seems like those are the special cases that should require behavior diverging from numpy rather than JaggedArray.
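For example, a minimal sketch of that repeated reduction (the reduce_all helper here is hypothetical, not part of awkward's API):

>>> import numpy, awkward
>>> def reduce_all(array):
...     # keep applying .all() along the last axis until only a scalar is left
...     result = array.all()
...     while not numpy.isscalar(result):
...         result = result.all()
...     return result
...
>>> reduce_all(awkward.fromiter([[[True, True], [True]], [[True]]]))
True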

@jpivarski
Member

The reducers are consistent with Numpy with axis=-1, to the extent that -1 can be interpreted as "the deepest axis, not following branch points." Because of jaggedness, no other axis is possible; there are only two cases: -1 (current behavior) and None (complete reduction).

In awkward 1.0, all operations on awkward arrays will be in the module namespace, so awkward.all(array), which is allowed to be different from numpy.all(array).

Because arbitrary axis values are not possible, it would be misleading to call the parameter axis, but that's the only way to get even partial agreement with Numpy...
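In Numpy terms, the two cases are:

>>> import numpy
>>> flat = numpy.array([[True, True], [True, False]])
>>> flat.all(axis=-1)     # reduce only the deepest axis: one result per row
array([ True, False])
>>> flat.all(axis=None)   # complete reduction: a single scalar
False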

@nsmith-
Contributor

nsmith- commented Jul 19, 2019

Personally, I think the default behavior of applying each reducer along axis=-1 is much more useful.

@masonproffitt
Contributor Author

A claim about which default behavior is more useful is subjective, because it obviously depends on your use cases. The most common reduction I've used is all() along all axes to get a scalar bool, but I'm not taking a side on which is most useful. One thing I do take issue with is an unnecessary deviation from numpy (because it's surprising and confusing). The more important issue here, though, is that without any parameter available to change the behavior, reducing along all axes is a pain: you have to check how many axes there are and then run that many all()s on the array.

Incidentally, I think you can meaningfully reduce along an arbitrary axis in JaggedArray if you define an identity element for each operator (0 for sum(), 1 for prod(), etc.) and assume "missing" elements (where entries don't exist due to jaggedness) are the identity. Along each dimension, the output would have the maximum size of any element along that dimension. I think that could make for some difficult debugging, so I'm not really advocating that this part be implemented. But I think there should at minimum be a parameter that indicates numpy's axis=None behavior, which doesn't need to be named axis.
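As a rough sketch of what I mean, hand-padding with the identity in plain numpy (not an awkward feature):

>>> rows = [[1, 1, 1], [1], []]
>>> width = max(len(row) for row in rows)
>>> padded = numpy.array([row + [0] * (width - len(row)) for row in rows])  # 0 is sum's identity
>>> padded.sum(axis=0)
array([2, 1, 1])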

@jpivarski
Member

My example (written in #167) was supposed to be here:

>>> a = numpy.ones((2, 3, 4), dtype=int)
>>> a
array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]])
>>> a.sum(axis=0)
array([[2, 2, 2, 2],
       [2, 2, 2, 2],
       [2, 2, 2, 2]])
>>> a.sum(axis=1)
array([[3, 3, 3, 3],
       [3, 3, 3, 3]])
>>> a.sum(axis=2)
array([[4, 4, 4],
       [4, 4, 4]])

Here's a version with some zeros:

>>> b = numpy.array([[[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0]], [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]])
>>> b
array([[[1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0]],

       [[1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]])
>>> b.sum(axis=0)
array([[2, 1, 1, 1],
       [1, 1, 1, 0],
       [1, 1, 0, 0]])
>>> b.sum(axis=1)
array([[3, 3, 2, 1],
       [1, 0, 0, 0]])
>>> b.sum(axis=2)
array([[4, 3, 2],
       [1, 0, 0]])

Now what if instead of zeros, we had gaps...

>>> c = awkward.fromiter([[[1, 1, 1, 1], [1, 1, 1], [1, 1]], [[1], []], []])
>>> c
<JaggedArray [[[1 1 1 1] [1 1 1] [1 1]] [[1] []] []] at 0x78e211da4400>

Note that I introduced emptiness in three different ways: (1) I turned a [0, 0, 0, 0] array into an empty array; (2) I eliminated it entirely, so nothing instead of []; (3) I added an empty array at a higher level of nesting.

It's clear what an axis=-1 summation should be because for all the irregularity of this array, the deepest level of arrays can be replaced with scalars, no matter how many of those there are. This axis=-1 summation is what JaggedArray.sum does:

>>> c.sum()
<JaggedArray [[4 3 2] [1 0] []] at 0x78e212497588>

Compare this to

>>> b.sum(axis=-1)
array([[4, 3, 2],
       [1, 0, 0]])

The first row is the same because nonexistent elements are equivalent to zeros in summation and we haven't dealt with any of the tricky empty cases. The second row is different because while [] sums to 0, nothing sums to nothing, not 0. And then the jagged array has a third row, which is empty, corresponding to the empty array I stuck in there at the middle nesting level.

So we have axis=2 covered, but what about axis=1? Drawing it out to make the nesting clear,

[
 [
  [1, 1, 1, 1],     # first "block"
  [1, 1, 1],
  [1, 1]
 ],

 [
  [1],              # second "block"
  []
 ],

 [
                    # third "block"
 ]
]

an axis=1 sum means summing downward, such that the first row becomes [3, 3, 2, 1] (see where they line up in that first block?), just like the first row of b.sum(axis=1). Following the same procedure, the second row becomes [1], corresponding to the [1, 0, 0, 0] second row of b.sum(axis=1). But is there a third row? If so, is it []?
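At least the unambiguous part can be checked by hand by padding the first block's missing slots with the identity (plain numpy, just for illustration):

>>> first_block = numpy.array([[1, 1, 1, 1],
...                            [1, 1, 1, 0],
...                            [1, 1, 0, 0]])
>>> first_block.sum(axis=0)
array([3, 3, 2, 1])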

It gets even harder to give a well-reasoned definition for axis=0. As a reminder,

>>> b.sum(axis=0)
array([[2, 1, 1, 1],
       [1, 1, 1, 0],
       [1, 1, 0, 0]])

For the flat array b, the 2 in the top-left slot of b.sum(axis=0) is because two of the blocks have a 1 in the top-left slot. But for the jagged array, it's not clear how to align the rows and columns of the blocks to one another because they have different numbers of elements.

That's why I decided some time ago that we can't make sense of axis != -1 when jaggedness is involved. But looking at it now, we can make a consistent definition: slide all inner arrays to the leftmost and topmost and expand them with the reducer's identity to the smallest common size, then use Numpy's definition. With something like that, a jagged array of Muon_pt would have

Muon_pt.max(axis=0)

give you an n-element array, where n is the maximum number of muons in any event, with the first slot showing the maximum pt among all events' first muons, the second showing the maximum among all second muons, etc. That could be useful.
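As a hand-padded illustration of that idea (plain numpy, with -inf as max's identity; the pt values are made up):

>>> muon_pt = [[40.0, 25.0], [60.0], [], [30.0, 20.0, 10.0]]
>>> n = max(len(event) for event in muon_pt)
>>> padded = numpy.array([event + [-numpy.inf] * (n - len(event)) for event in muon_pt])
>>> padded.max(axis=0)
array([60., 25., 10.])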

In awkward 1.0, when this is done in C++, we can have a reasonable implementation of an output array that grows each time we see a larger inner array (using std::vector's push_back, probably). In awkward 0.x, I think we'd have to use things like

>>> c.pad(c.flatten().counts.max(), axis=1).fillna(0).pad(c.counts.max(), axis=0).fillna(0)
<JaggedArray [[[1 1 1 1] [1 1 1 0] [1 1 0 0]] [[1 0 0 0] [0 0 0 0] [0 0 0 0]]] at 0x7f03145497f0>

before running Numpy's sum with an axis, which is a potential memory hog...
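For completeness, assuming the padding above has made everything rectangular, tolist() plus numpy.array gets it into a form that accepts an axis:

>>> padded = c.pad(c.flatten().counts.max(), axis=1).fillna(0).pad(c.counts.max(), axis=0).fillna(0)
>>> numpy.array(padded.tolist()).sum(axis=0)
array([[2, 1, 1, 1],
       [1, 1, 1, 0],
       [1, 1, 0, 0]])

which reproduces b.sum(axis=0) from above.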

@jpivarski
Member

I'm going to close this here, though the axis parameter for reductions is planned for awkward 1.0.
