Implemented JaggedArray.concatenate(axis=1) #80

guitargeek · 2019-02-09T14:31:08Z

Hi Jim,

for my analysis I had to combine jagged arrays with muons with the ones for electrons to new jagged arrays for leptons in general. This is what the stack method was supposed to do, right? I suggest here an implementation for it, using a new helper method startsstops2mask. Probably you have some suggestions for improvement but this can be a starting point :)

Cheers,
Jonas

jpivarski · 2019-02-09T16:58:40Z

Yay, new functionality! The implementation of this new function looks spot-on: no missing edge-cases. (Except maybe one: I'll make a comment.)

I don't think it should be named "stack," though. Numpy's stack does something different:

>>> a = numpy.array([[1, 1.1], [2, 2.2], [3, 3.3]])
>>> b = numpy.array([[4, 4.4], [5, 5.5], [6, 6.6]])
>>> numpy.stack([a, b])
array([[[1. , 1.1],
        [2. , 2.2],
        [3. , 3.3]],

       [[4. , 4.4],
        [5. , 5.5],
        [6. , 6.6]]])
>>> numpy.hstack([a, b])
array([[1. , 1.1, 4. , 4.4],
       [2. , 2.2, 5. , 5.5],
       [3. , 3.3, 6. , 6.6]])
>>> numpy.vstack([a, b])
array([[1. , 1.1],
       [2. , 2.2],
       [3. , 3.3],
       [4. , 4.4],
       [5. , 5.5],
       [6. , 6.6]])
>>> numpy.dstack([a, b])
array([[[1. , 4. ],
        [1.1, 4.4]],

       [[2. , 5. ],
        [2.2, 5.5]],

       [[3. , 6. ],
        [3.3, 6.6]]])

What you've implemented is different from all the "stack" variants:

>>> a
<JaggedArray [[0.0 1.1 2.2] [] [3.3 4.4] [5.5 6.6 7.7 8.8 9.9]] at 0x72a9fa91d358>
>>> b
<JaggedArray [[6.5 7.6 8.7 9.8 10.9] [4.3 5.4] [] [1.0 2.1 3.2]] at 0x72a9fa91d390>
>>> a.stack(b)   # (simulated)
[[0.,1.1,2.2,6.5,7.6,8.7,9.8,10.9],[4.3,5.4],[3.3,4.4],[5.5,6.6,7.7,8.8,9.9,1.,2.1,3.2]]

However, it is a bit like "concatenate" with an "axis" parameter:

>>> numpy.concatenate([a, b], axis=1)
array([[1. , 1.1, 4. , 4.4],
       [2. , 2.2, 5. , 5.5],
       [3. , 3.3, 6. , 6.6]])

There's an analogue of this kind of generalization in JaggedArray.flatten: this function handles the "axis" parameter (unlike many jagged operations, for which "axis" would be nonsensical). Much earlier, I had thought about implementing what you did as a method called "allconcat" or "allconcatenate", but that doesn't leave room for future generalization to other "axis" values.

Perhaps you should add this implementation as concatenate with "axis=1"? You could keep the function as it is, renaming it to something internal, like _concatenate_axis1 and then generalize the concatenate method to:

    def _concatenate_axis1(cls, self, arrays):
        # your implementation without cls_or_self handling

    @awkward.util.bothmethod
    def concatenate(isclassmethod, cls_or_self, arrays, axis=0):
        # standard cls_or_self handling (cls is well defined, but self might be None)
        if axis == 1:
            return _concatenate_axis1(cls, self, arrays)
        if axis > 1:
            return NotImplementedError("concatenate with axis > 1")
        # original concatenate implementation

I'll give a few smaller comments inline as a review.

But bottom line: thanks! I think this is a good implementation of much-needed functionality. (The reason I considered implementing it earlier was because another user asked for it, but I was overwhelmed at the time.)

jpivarski

See inline...

jpivarski · 2019-02-09T17:02:02Z

awkward/array/jagged.py


        return starts, starts + counts

+    @classmethod


This implementation is not general enough to be a public-facing function. If, for instance, the stops are out of order, it should be stops.max(), not stops[-1]. However, I don't think you should generalize it; it's general enough for your case (starts and stops made from counts). I think you should hide it with an underscore or absorb it into your public function.

Motivation behind that request: I don't see how it would be useful in general. I see why your new function needs it, but not beyond that.

jpivarski · 2019-02-09T17:02:24Z

awkward/array/jagged.py

+
+        np = cls.numpy
+
+        mask = np.zeros(length+1, dtype=cls.numpy.int)


dtype=cls.INDEXTYPE

jpivarski · 2019-02-09T17:02:55Z

awkward/array/jagged.py

+        mask = np.zeros(length+1, dtype=cls.numpy.int)
+        mask[starts] += 1
+        mask[stops] -= 1
+        return np.array(np.cumsum(mask)[:-1], dtype=np.bool)


dtype=cls.MASKTYPE (so that they can be changed centrally in the future)

jpivarski · 2019-02-09T17:03:53Z

awkward/array/jagged.py

-    def stack(cls, first, *rest):    # each item in first followed by second, etc.
-        raise NotImplementedError
+    @awkward.util.bothmethod
+    def stack(isclassmethod, cls_or_self, arrays):    # each item in first followed by second, etc.


As in my first comment, I think this should be the axis=1 case of concatenate, for better agreement with Numpy. Also, that means you wouldn't have to copy the cls_or_self handling.

awkward/array/jagged.py

jpivarski · 2019-02-09T17:08:04Z

awkward/array/jagged.py

+            arrays = (self,) + tuple(arrays)
+
+        if len(set([len(a) for a in arrays])) > 1:
+            raise ValueError("cannot stack JaggedArrays of different lengths")


What about when len(arrays) == 0? Maybe handle both error states without creating a set:

if len(arrays) == 0: raise ValueError("at least one array must be provided") # this can only happen in the classmethod case if any(len(x) != len(arrays[0]) for x in arrays): raise ValueError("cannot concatenate JaggedArrays of different lengths with axis=1")

awkward/array/jagged.py

jpivarski · 2019-02-09T17:22:52Z

awkward/array/jagged.py

+
+        n = len(arrays)
+        n_content = sum([len(a.flatten()) for a in arrays])
+        dtype = arrays[0].content.dtype


What if the first array has dtype=int and the others have dtype=float? You don't want floating point numbers to turn into integers; you want the most general type. If you've already guaranteed that the arrays have numpy.ndarray contents (see comment on line 1400), then you could add the first flattened element of each:

dtype = np.dtype(sum(x[0] for x in flatarrays if len(x) != 0, False))

Summing a single element is not too expensive and it implements the right type-promotion from boolean to int to float to complex. Note that we start the sum with a boolean zero: False. That way, an array of booleans won't be unintentionally promoted to integers.

Hmm when I try this idea out, for example with type([True, True], False), it is unfortunately an int. It seems to me this shortcut via the tentative sum does not work if there are only booleans. Or am I understanding something wrong?

Thanks a lot for all the feedback!

Oh, then maybe you just need to handle boolean as a special case. (If it's adding True + True, that's like 2, which is too high of a number to be boolean...) If all arrays are boolean, the result is boolean, and otherwise addition will work.

Here's another way that takes advantage of some logic directly intended to promote types (that we control):

cls.awkward.UnionArray([], [], flatarrays).dtype

And this doesn't like to merge booleans with numbers, either (it returns Numpy's object type, which we should avoid whenever possible).

Ok! I went with handling as a special case now and made an extra unit test for it.

jpivarski · 2019-02-09T17:23:52Z

awkward/array/jagged.py

+        np = cls.numpy
+
+        n = len(arrays)
+        n_content = sum([len(a.flatten()) for a in arrays])


Maybe flatten them once so that you don't have to re-flatten them. Flattening is trivial if the arrays have nice starts and stops, but if they're multidimensional, out of order, or whatever, it's not necessarily a fast operation.

jpivarski · 2019-02-09T17:24:26Z

tests/test_jagged.py

+        a = JaggedArray([0, 3, 3, 5], [3, 3, 5, 10], [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
+        b = JaggedArray([0, 5, 7, 7], [5, 7, 7, 10], [6.5, 7.6, 8.7, 9.8, 10.9, 4.3, 5.4, 1., 2.1, 3.2])
+        c = a.stack([b])
+        assert c.tolist() == [[0.,1.1,2.2,6.5,7.6,8.7,9.8,10.9],[4.3,5.4],[3.3,4.4],[5.5,6.6,7.7,8.8,9.9,1.,2.1,3.2]]


Thanks for testing!

nsmith- · 2019-02-19T14:47:16Z

Will this support also concatenation of Table instances without special consideration? I would then be very interested. Personally I wanted to do a similar operation to merge lepton types together, while still preserving their identity. This can simplify future pairs-like operations significantly.

guitargeek · 2019-02-27T16:03:26Z

Hi @jpivarski, what are your plans with this? Is there something I still should do here? My colleagues start to get interested in my developments if would make things simpler if this is in the awkward release.

@nsmith-, I don't think you can concatenate tables like this in the same way, and so far I had no need for this myself. But that will probably come soon and then I'd make another PR for that.

I thought there were unresolved issues, but it's hard to go through it now that things have changed so much. I'll review it fresh.

jpivarski

I had thought I was waiting on you for this, but it looks like you've addressed my concerns and I didn't see the update. Thanks! I'll work on merging it now.

jpivarski · 2019-02-27T16:24:11Z

Sorry for the confusion! I was looking at this PR off and on, thinking that I was waiting for you, but you were waiting for me. I'll see about getting a new version number for it, too. (awkward 0.8.6)

guitargeek · 2019-02-27T16:53:11Z

Ok, thank you very much! Should have pinged you earlier then 👍

jpivarski · 2019-02-27T17:33:59Z

When this reaches Python 3.7, awkward 0.8.6 will be released. Enjoy!

implemented jagged stack

a13dd7b

jpivarski previously requested changes Feb 9, 2019

View reviewed changes

Implemented Jims comments

ebe28a6

guitargeek changed the title ~~Implemented jagged stack~~ Implemented JaggedArray.concatenate(axis=1) Feb 9, 2019

Bugfix in concatenate with axis=1

20c0575

jpivarski self-requested a review February 27, 2019 16:14

jpivarski approved these changes Feb 27, 2019

View reviewed changes

jpivarski merged commit 24a5cac into scikit-hep:master Feb 27, 2019

guitargeek deleted the jagged_stack branch February 27, 2019 16:52

jpivarski added a commit that referenced this pull request Feb 27, 2019

Bump version for release of PR #80.

cf844a9

guitargeek mentioned this pull request Apr 6, 2019

JaggedArray.concatenate (axis 0 and 1) for ObjectArrays with Tables #117

Merged

jpivarski mentioned this pull request Mar 20, 2020

Concatenate for axis != 0 scikit-hep/awkward#184

Closed


		np = cls.numpy

		mask = np.zeros(length+1, dtype=cls.numpy.int)

Implemented JaggedArray.concatenate(axis=1) #80

Implemented JaggedArray.concatenate(axis=1) #80

Uh oh!

Conversation

guitargeek commented Feb 9, 2019

Uh oh!

jpivarski commented Feb 9, 2019

Uh oh!

jpivarski left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jpivarski Feb 9, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nsmith- commented Feb 19, 2019

Uh oh!

guitargeek commented Feb 27, 2019

Uh oh!

jpivarski left a comment

Choose a reason for hiding this comment

Uh oh!

jpivarski commented Feb 27, 2019

Uh oh!

guitargeek commented Feb 27, 2019

Uh oh!

jpivarski commented Feb 27, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jpivarski Feb 9, 2019 •

edited

Loading