Skip to content
This repository was archived by the owner on Jun 21, 2022. It is now read-only.

Conversation

@guitargeek
Copy link
Contributor

Hi Jim,

for my analysis I had to combine jagged arrays with muons with the ones for electrons to new jagged arrays for leptons in general. This is what the stack method was supposed to do, right? I suggest here an implementation for it, using a new helper method startsstops2mask. Probably you have some suggestions for improvement but this can be a starting point :)

Cheers,
Jonas

@jpivarski
Copy link
Member

Yay, new functionality! The implementation of this new function looks spot-on: no missing edge-cases. (Except maybe one: I'll make a comment.)

I don't think it should be named "stack," though. Numpy's stack does something different:

>>> a = numpy.array([[1, 1.1], [2, 2.2], [3, 3.3]])
>>> b = numpy.array([[4, 4.4], [5, 5.5], [6, 6.6]])
>>> numpy.stack([a, b])
array([[[1. , 1.1],
        [2. , 2.2],
        [3. , 3.3]],

       [[4. , 4.4],
        [5. , 5.5],
        [6. , 6.6]]])
>>> numpy.hstack([a, b])
array([[1. , 1.1, 4. , 4.4],
       [2. , 2.2, 5. , 5.5],
       [3. , 3.3, 6. , 6.6]])
>>> numpy.vstack([a, b])
array([[1. , 1.1],
       [2. , 2.2],
       [3. , 3.3],
       [4. , 4.4],
       [5. , 5.5],
       [6. , 6.6]])
>>> numpy.dstack([a, b])
array([[[1. , 4. ],
        [1.1, 4.4]],

       [[2. , 5. ],
        [2.2, 5.5]],

       [[3. , 6. ],
        [3.3, 6.6]]])

What you've implemented is different from all the "stack" variants:

>>> a
<JaggedArray [[0.0 1.1 2.2] [] [3.3 4.4] [5.5 6.6 7.7 8.8 9.9]] at 0x72a9fa91d358>
>>> b
<JaggedArray [[6.5 7.6 8.7 9.8 10.9] [4.3 5.4] [] [1.0 2.1 3.2]] at 0x72a9fa91d390>
>>> a.stack(b)   # (simulated)
[[0.,1.1,2.2,6.5,7.6,8.7,9.8,10.9],[4.3,5.4],[3.3,4.4],[5.5,6.6,7.7,8.8,9.9,1.,2.1,3.2]]

However, it is a bit like "concatenate" with an "axis" parameter:

>>> numpy.concatenate([a, b], axis=1)
array([[1. , 1.1, 4. , 4.4],
       [2. , 2.2, 5. , 5.5],
       [3. , 3.3, 6. , 6.6]])

There's an analogue of this kind of generalization in JaggedArray.flatten: this function handles the "axis" parameter (unlike many jagged operations, for which "axis" would be nonsensical). Much earlier, I had thought about implementing what you did as a method called "allconcat" or "allconcatenate", but that doesn't leave room for future generalization to other "axis" values.

Perhaps you should add this implementation as concatenate with "axis=1"? You could keep the function as it is, renaming it to something internal, like _concatenate_axis1 and then generalize the concatenate method to:

    def _concatenate_axis1(cls, self, arrays):
        # your implementation without cls_or_self handling

    @awkward.util.bothmethod
    def concatenate(isclassmethod, cls_or_self, arrays, axis=0):
        # standard cls_or_self handling (cls is well defined, but self might be None)
        if axis == 1:
            return _concatenate_axis1(cls, self, arrays)
        if axis > 1:
            return NotImplementedError("concatenate with axis > 1")
        # original concatenate implementation

I'll give a few smaller comments inline as a review.

But bottom line: thanks! I think this is a good implementation of much-needed functionality. (The reason I considered implementing it earlier was because another user asked for it, but I was overwhelmed at the time.)

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See inline...


return starts, starts + counts

@classmethod
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is not general enough to be a public-facing function. If, for instance, the stops are out of order, it should be stops.max(), not stops[-1]. However, I don't think you should generalize it; it's general enough for your case (starts and stops made from counts). I think you should hide it with an underscore or absorb it into your public function.

Motivation behind that request: I don't see how it would be useful in general. I see why your new function needs it, but not beyond that.


np = cls.numpy

mask = np.zeros(length+1, dtype=cls.numpy.int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtype=cls.INDEXTYPE

mask = np.zeros(length+1, dtype=cls.numpy.int)
mask[starts] += 1
mask[stops] -= 1
return np.array(np.cumsum(mask)[:-1], dtype=np.bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtype=cls.MASKTYPE (so that they can be changed centrally in the future)

def stack(cls, first, *rest): # each item in first followed by second, etc.
raise NotImplementedError
@awkward.util.bothmethod
def stack(isclassmethod, cls_or_self, arrays): # each item in first followed by second, etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in my first comment, I think this should be the axis=1 case of concatenate, for better agreement with Numpy. Also, that means you wouldn't have to copy the cls_or_self handling.

arrays = (self,) + tuple(arrays)

if len(set([len(a) for a in arrays])) > 1:
raise ValueError("cannot stack JaggedArrays of different lengths")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about when len(arrays) == 0? Maybe handle both error states without creating a set:

if len(arrays) == 0:
    raise ValueError("at least one array must be provided")   # this can only happen in the classmethod case
if any(len(x) != len(arrays[0]) for x in arrays):
    raise ValueError("cannot concatenate JaggedArrays of different lengths with axis=1")


n = len(arrays)
n_content = sum([len(a.flatten()) for a in arrays])
dtype = arrays[0].content.dtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the first array has dtype=int and the others have dtype=float? You don't want floating point numbers to turn into integers; you want the most general type. If you've already guaranteed that the arrays have numpy.ndarray contents (see comment on line 1400), then you could add the first flattened element of each:

dtype = np.dtype(sum(x[0] for x in flatarrays if len(x) != 0, False))

Summing a single element is not too expensive and it implements the right type-promotion from boolean to int to float to complex. Note that we start the sum with a boolean zero: False. That way, an array of booleans won't be unintentionally promoted to integers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm when I try this idea out, for example with type([True, True], False), it is unfortunately an int. It seems to me this shortcut via the tentative sum does not work if there are only booleans. Or am I understanding something wrong?

Thanks a lot for all the feedback!

Copy link
Member

@jpivarski jpivarski Feb 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, then maybe you just need to handle boolean as a special case. (If it's adding True + True, that's like 2, which is too high of a number to be boolean...) If all arrays are boolean, the result is boolean, and otherwise addition will work.

Here's another way that takes advantage of some logic directly intended to promote types (that we control):

cls.awkward.UnionArray([], [], flatarrays).dtype

And this doesn't like to merge booleans with numbers, either (it returns Numpy's object type, which we should avoid whenever possible).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I went with handling as a special case now and made an extra unit test for it.

np = cls.numpy

n = len(arrays)
n_content = sum([len(a.flatten()) for a in arrays])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe flatten them once so that you don't have to re-flatten them. Flattening is trivial if the arrays have nice starts and stops, but if they're multidimensional, out of order, or whatever, it's not necessarily a fast operation.

a = JaggedArray([0, 3, 3, 5], [3, 3, 5, 10], [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
b = JaggedArray([0, 5, 7, 7], [5, 7, 7, 10], [6.5, 7.6, 8.7, 9.8, 10.9, 4.3, 5.4, 1., 2.1, 3.2])
c = a.stack([b])
assert c.tolist() == [[0.,1.1,2.2,6.5,7.6,8.7,9.8,10.9],[4.3,5.4],[3.3,4.4],[5.5,6.6,7.7,8.8,9.9,1.,2.1,3.2]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing!

@guitargeek guitargeek changed the title Implemented jagged stack Implemented JaggedArray.concatenate(axis=1) Feb 9, 2019
@nsmith-
Copy link
Member

nsmith- commented Feb 19, 2019

Will this support also concatenation of Table instances without special consideration? I would then be very interested. Personally I wanted to do a similar operation to merge lepton types together, while still preserving their identity. This can simplify future pairs-like operations significantly.

@guitargeek
Copy link
Contributor Author

Hi @jpivarski, what are your plans with this? Is there something I still should do here? My colleagues start to get interested in my developments if would make things simpler if this is in the awkward release.

@nsmith-, I don't think you can concatenate tables like this in the same way, and so far I had no need for this myself. But that will probably come soon and then I'd make another PR for that.

@jpivarski jpivarski self-requested a review February 27, 2019 16:14
@jpivarski jpivarski dismissed their stale review February 27, 2019 16:15

I thought there were unresolved issues, but it's hard to go through it now that things have changed so much. I'll review it fresh.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought I was waiting on you for this, but it looks like you've addressed my concerns and I didn't see the update. Thanks! I'll work on merging it now.

@jpivarski jpivarski merged commit 24a5cac into scikit-hep:master Feb 27, 2019
@jpivarski
Copy link
Member

Sorry for the confusion! I was looking at this PR off and on, thinking that I was waiting for you, but you were waiting for me. I'll see about getting a new version number for it, too. (awkward 0.8.6)

@guitargeek guitargeek deleted the jagged_stack branch February 27, 2019 16:52
@guitargeek
Copy link
Contributor Author

Ok, thank you very much! Should have pinged you earlier then 👍

jpivarski added a commit that referenced this pull request Feb 27, 2019
@jpivarski
Copy link
Member

When this reaches Python 3.7, awkward 0.8.6 will be released. Enjoy!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants