-
Notifications
You must be signed in to change notification settings - Fork 38
Implemented JaggedArray.concatenate(axis=1) #80
Conversation
|
Yay, new functionality! The implementation of this new function looks spot-on: no missing edge-cases. (Except maybe one: I'll make a comment.) I don't think it should be named "stack," though. Numpy's stack does something different: >>> a = numpy.array([[1, 1.1], [2, 2.2], [3, 3.3]])
>>> b = numpy.array([[4, 4.4], [5, 5.5], [6, 6.6]])
>>> numpy.stack([a, b])
array([[[1. , 1.1],
[2. , 2.2],
[3. , 3.3]],
[[4. , 4.4],
[5. , 5.5],
[6. , 6.6]]])
>>> numpy.hstack([a, b])
array([[1. , 1.1, 4. , 4.4],
[2. , 2.2, 5. , 5.5],
[3. , 3.3, 6. , 6.6]])
>>> numpy.vstack([a, b])
array([[1. , 1.1],
[2. , 2.2],
[3. , 3.3],
[4. , 4.4],
[5. , 5.5],
[6. , 6.6]])
>>> numpy.dstack([a, b])
array([[[1. , 4. ],
[1.1, 4.4]],
[[2. , 5. ],
[2.2, 5.5]],
[[3. , 6. ],
[3.3, 6.6]]])What you've implemented is different from all the "stack" variants: >>> a
<JaggedArray [[0.0 1.1 2.2] [] [3.3 4.4] [5.5 6.6 7.7 8.8 9.9]] at 0x72a9fa91d358>
>>> b
<JaggedArray [[6.5 7.6 8.7 9.8 10.9] [4.3 5.4] [] [1.0 2.1 3.2]] at 0x72a9fa91d390>
>>> a.stack(b) # (simulated)
[[0.,1.1,2.2,6.5,7.6,8.7,9.8,10.9],[4.3,5.4],[3.3,4.4],[5.5,6.6,7.7,8.8,9.9,1.,2.1,3.2]]However, it is a bit like "concatenate" with an "axis" parameter: >>> numpy.concatenate([a, b], axis=1)
array([[1. , 1.1, 4. , 4.4],
[2. , 2.2, 5. , 5.5],
[3. , 3.3, 6. , 6.6]])There's an analogue of this kind of generalization in Perhaps you should add this implementation as concatenate with "axis=1"? You could keep the function as it is, renaming it to something internal, like def _concatenate_axis1(cls, self, arrays):
# your implementation without cls_or_self handling
@awkward.util.bothmethod
def concatenate(isclassmethod, cls_or_self, arrays, axis=0):
# standard cls_or_self handling (cls is well defined, but self might be None)
if axis == 1:
return _concatenate_axis1(cls, self, arrays)
if axis > 1:
return NotImplementedError("concatenate with axis > 1")
# original concatenate implementationI'll give a few smaller comments inline as a review. But bottom line: thanks! I think this is a good implementation of much-needed functionality. (The reason I considered implementing it earlier was because another user asked for it, but I was overwhelmed at the time.) |
jpivarski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See inline...
awkward/array/jagged.py
Outdated
|
|
||
| return starts, starts + counts | ||
|
|
||
| @classmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is not general enough to be a public-facing function. If, for instance, the stops are out of order, it should be stops.max(), not stops[-1]. However, I don't think you should generalize it; it's general enough for your case (starts and stops made from counts). I think you should hide it with an underscore or absorb it into your public function.
Motivation behind that request: I don't see how it would be useful in general. I see why your new function needs it, but not beyond that.
awkward/array/jagged.py
Outdated
|
|
||
| np = cls.numpy | ||
|
|
||
| mask = np.zeros(length+1, dtype=cls.numpy.int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtype=cls.INDEXTYPE
awkward/array/jagged.py
Outdated
| mask = np.zeros(length+1, dtype=cls.numpy.int) | ||
| mask[starts] += 1 | ||
| mask[stops] -= 1 | ||
| return np.array(np.cumsum(mask)[:-1], dtype=np.bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtype=cls.MASKTYPE (so that they can be changed centrally in the future)
awkward/array/jagged.py
Outdated
| def stack(cls, first, *rest): # each item in first followed by second, etc. | ||
| raise NotImplementedError | ||
| @awkward.util.bothmethod | ||
| def stack(isclassmethod, cls_or_self, arrays): # each item in first followed by second, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in my first comment, I think this should be the axis=1 case of concatenate, for better agreement with Numpy. Also, that means you wouldn't have to copy the cls_or_self handling.
awkward/array/jagged.py
Outdated
| arrays = (self,) + tuple(arrays) | ||
|
|
||
| if len(set([len(a) for a in arrays])) > 1: | ||
| raise ValueError("cannot stack JaggedArrays of different lengths") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about when len(arrays) == 0? Maybe handle both error states without creating a set:
if len(arrays) == 0:
raise ValueError("at least one array must be provided") # this can only happen in the classmethod case
if any(len(x) != len(arrays[0]) for x in arrays):
raise ValueError("cannot concatenate JaggedArrays of different lengths with axis=1")
awkward/array/jagged.py
Outdated
|
|
||
| n = len(arrays) | ||
| n_content = sum([len(a.flatten()) for a in arrays]) | ||
| dtype = arrays[0].content.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the first array has dtype=int and the others have dtype=float? You don't want floating point numbers to turn into integers; you want the most general type. If you've already guaranteed that the arrays have numpy.ndarray contents (see comment on line 1400), then you could add the first flattened element of each:
dtype = np.dtype(sum(x[0] for x in flatarrays if len(x) != 0, False))Summing a single element is not too expensive and it implements the right type-promotion from boolean to int to float to complex. Note that we start the sum with a boolean zero: False. That way, an array of booleans won't be unintentionally promoted to integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm when I try this idea out, for example with type([True, True], False), it is unfortunately an int. It seems to me this shortcut via the tentative sum does not work if there are only booleans. Or am I understanding something wrong?
Thanks a lot for all the feedback!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, then maybe you just need to handle boolean as a special case. (If it's adding True + True, that's like 2, which is too high of a number to be boolean...) If all arrays are boolean, the result is boolean, and otherwise addition will work.
Here's another way that takes advantage of some logic directly intended to promote types (that we control):
cls.awkward.UnionArray([], [], flatarrays).dtypeAnd this doesn't like to merge booleans with numbers, either (it returns Numpy's object type, which we should avoid whenever possible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! I went with handling as a special case now and made an extra unit test for it.
awkward/array/jagged.py
Outdated
| np = cls.numpy | ||
|
|
||
| n = len(arrays) | ||
| n_content = sum([len(a.flatten()) for a in arrays]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe flatten them once so that you don't have to re-flatten them. Flattening is trivial if the arrays have nice starts and stops, but if they're multidimensional, out of order, or whatever, it's not necessarily a fast operation.
tests/test_jagged.py
Outdated
| a = JaggedArray([0, 3, 3, 5], [3, 3, 5, 10], [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]) | ||
| b = JaggedArray([0, 5, 7, 7], [5, 7, 7, 10], [6.5, 7.6, 8.7, 9.8, 10.9, 4.3, 5.4, 1., 2.1, 3.2]) | ||
| c = a.stack([b]) | ||
| assert c.tolist() == [[0.,1.1,2.2,6.5,7.6,8.7,9.8,10.9],[4.3,5.4],[3.3,4.4],[5.5,6.6,7.7,8.8,9.9,1.,2.1,3.2]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for testing!
|
Will this support also concatenation of |
|
Hi @jpivarski, what are your plans with this? Is there something I still should do here? My colleagues start to get interested in my developments if would make things simpler if this is in the awkward release. @nsmith-, I don't think you can concatenate tables like this in the same way, and so far I had no need for this myself. But that will probably come soon and then I'd make another PR for that. |
I thought there were unresolved issues, but it's hard to go through it now that things have changed so much. I'll review it fresh.
jpivarski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought I was waiting on you for this, but it looks like you've addressed my concerns and I didn't see the update. Thanks! I'll work on merging it now.
|
Sorry for the confusion! I was looking at this PR off and on, thinking that I was waiting for you, but you were waiting for me. I'll see about getting a new version number for it, too. (awkward 0.8.6) |
|
Ok, thank you very much! Should have pinged you earlier then 👍 |
|
When this reaches Python 3.7, awkward 0.8.6 will be released. Enjoy! |
Hi Jim,
for my analysis I had to combine jagged arrays with muons with the ones for electrons to new jagged arrays for leptons in general. This is what the
stackmethod was supposed to do, right? I suggest here an implementation for it, using a new helper methodstartsstops2mask. Probably you have some suggestions for improvement but this can be a starting point :)Cheers,
Jonas