Start flatten implementation and add tests #45

ianna · 2020-01-07T10:59:02Z

TODO:

implement checks for an argument axis: if less or equal -1 or greater then dimension trigger a wrong user input by throwing std::invalid_argument
add tests for the above
* add a C++ test for a RawArrayOf
finish flatten implementations

@jpivarski - please, comment

jpivarski

I think I should add tests directly, both to illustrate corner cases and to illustrate what it should do.

src/libawkward/array/EmptyArray.cpp

src/libawkward/array/ListArray.cpp

src/libawkward/array/NumpyArray.cpp

src/libawkward/array/Record.cpp

src/libawkward/array/RecordArray.cpp

src/libawkward/array/RegularArray.cpp

jpivarski · 2020-01-07T14:54:17Z

Oh, and you probably don't need a test of RawArrayOf<T>. This kind of array is strictly one-dimensional, so flatten would always return an error. That's a simple enough case that it can get away without a test.

…scikit-hep/awkward-1.0 into feature/PR045-flatten-operation

jpivarski · 2020-01-07T16:05:09Z

I just put in an example of a ListArray that demonstrates its generality. What's allowed and what isn't allowed is described here.

When you enter the flatten implementation, you don't know if the array is in a legal state; the policy is to check as many rules as you need to perform the operation (during the operation).

ListArray is sufficiently general that it will need a new cpu-kernel. That's part of what makes this function a good introduction. :)

tests/test_PR045_flatten_operation.py

jpivarski · 2020-01-08T21:10:38Z

Uh-oh: I didn't intend to merge the whole master branch into this branch, but if I'm reading it right, that's what I did with 5dde95c. If that's a problem, you can roll back that commit—I only did it to make sure that the VERSION_INFO was ahead of the latest release.

ianna · 2020-01-08T21:15:13Z

Uh-oh: I didn't intend to merge the whole master branch into this branch, but if I'm reading it right, that's what I did with 5dde95c. If that's a problem, you can roll back that commit—I only did it to make sure that the VERSION_INFO was ahead of the latest release.

No problem, I'll update my branch

…scikit-hep/awkward-1.0 into feature/PR045-flatten-operation

ianna · 2020-01-09T15:57:33Z

@jpivarski - it's not finished yet. I just wanted to bring it up-to-date with my local branch. I'm a bit confused with the indices as you can see from the tests :-)

ianna · 2020-01-10T11:50:13Z

@jpivarski - Could you, please, have a look? I'm not sure if I'm going in the right direction. Besides, I need some help here. If I retrieve the indices for the sliced array, I need an offset:

     array2 = array[2:-1]
-    assert awkward1.tolist(array2.flatten()) == [3.3, 4.4, 5.5]
+    #assert awkward1.tolist(array2.flatten()) == [3.3, 4.4, 5.5]

        array2 = array[2:-1]
>       assert awkward1.tolist(array2.flatten()) == [3.3, 4.4, 5.5]
E       assert [0.0, 1.1, 2.2] == [3.3, 4.4, 5.5]
E         At index 0 diff: 0.0 != 3.3
E         Full diff:
E         - [0.0, 1.1, 2.2]
E         + [3.3, 4.4, 5.5]

jpivarski · 2020-01-10T15:45:17Z

First of all, with the ListArray defined as it is in the test code ([[0.0, 1.1, 2.2], [], [5.5], [8.8]] and not [[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]]), the assertion should be false because array[2:-1] of this array is [[5.5]] and its flattening is [5.5]. It's not an interesting example. (Your ListArray example differs from the ListOffsetArray example because it's missing the 3:5 subarray and the 6:10 subarray has been shortened to 8:9.)

To see this, try running the pure Python flatten function from the top of the test file:

>>> flatten(awkward1.tolist(array))
[0.0, 1.1, 2.2, 5.5, 8.8]
>>> flatten(awkward1.tolist(array[2:-1]))
[5.5]

Bigger troubles

However, you're going to have more difficulty when you get to the next block:

content = awkward1.layout.NumpyArray(numpy.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=numpy.int64))
starts  = awkward1.layout.Index64(numpy.array([4, 999, 0, 0, 1, 7]))
stops   = awkward1.layout.Index64(numpy.array([7, 999, 1, 4, 5, 10]))
array   = awkward1.layout.ListArray64(starts, stops, content)
assert awkward1.tolist(array) == [[4, 5, 6], [], [0], [0, 1, 2, 3], [1, 2, 3, 4], [7, 8, 9]]

This one not only has gaps, but they're out of order and overlap. Maybe you should concentrate on this one first because of its generality.

Suggested strategy

There are many ways to implement these functions, and I see that you're following a procedure for ListArray that is similar to what was done for ListOffsetArray. Here's a better idea: instead of trying to "fix" the ListArray with a new starts and stops (you can't because of the gaps, overlaps, and out-of-orderness), write a kernel that performs the flattening as an integer array that will be passed on to the next level down.

For instance, with the first ListArray:

>>> awkward1.tolist(array)
[[0.0, 1.1, 2.2], [], [5.5], [8.8]]
>>> flatten(awkward1.tolist(array))
[0.0, 1.1, 2.2, 5.5, 8.8]

Your kernel can create an Index64 named nextcarry that looks like this: [0, 1, 2, 5, 8]. Then when you call

std::shared_ptr<Content> out = content_.get()->carry(nextcarry)

that out array will be [0.0, 1.1, 2.2, 5.5, 8.8]. "Carrying" is an operation that I added to support getitem; it does exactly the same thing as this:

>>> content[[0, 1, 2, 5, 8]]
<NumpyArray format="d" shape="5" data="0 1.1 2.2 5.5 8.8" at="0x55f493797580"/>

("gather" in SIMD and numpy.take in NumPy). It doesn't matter what kind of an array content_ is—it could be records or more jagged arrays or whatever.

Back when all of these things were implemented exclusively in NumPy calls (see old implementation of flatten), we used this trick everywhere: an array of non-negative integers is a partial application of a function on any type of data structure. I came up with this explanation why: considering element-selection from an array (i.e. passing an integer between square brackets) as a function call, treating that array of length n, element type D as a function [0, n) → D, and gather/numpy.take is function composition over those function calls:

(   [0, m) → [0, n)   ) ∘ (   [0, n) → D   ) = (   [0, m) → D   )

Function composition is associative, so we can apply it whenever we like. In getitem, I had to recursively apply an operation that had to carry information from one recursive step to the next, much like carrying a digit in longhand addition, so this operation is called carry. It's only used internally, but ListArray::flatten would make good use of it.

So, given an array like

content = awkward1.layout.NumpyArray(numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]))
starts  = awkward1.layout.Index64(numpy.array([4, 999, 0, 0, 1, 7]))
stops   = awkward1.layout.Index64(numpy.array([7, 999, 1, 4, 5, 10]))
array   = awkward1.layout.ListArray64(starts, stops, content)
assert awkward1.tolist(array) == [[4.4, 5.5, 6.6], [], [0.0], [0.0, 1.1, 2.2, 3.3], [1.1, 2.2, 3.3, 4.4], [7.7, 8.8, 9.9]]

The task of your new cpu-kernel would be to generate this Index64:

[4, 5, 6, 0, 0, 1, 2, 3, 1, 2, 3, 4, 7, 8, 9]

and then pass that Index64 as "nextcarry" to

content_.get()->carry(nextcarry)

to get

>>> numpy.asarray(content[[4, 5, 6, 0, 0, 1, 2, 3, 1, 2, 3, 4, 7, 8, 9]])
array([4.4, 5.5, 6.6, 0. , 0. , 1.1, 2.2, 3.3, 1.1, 2.2, 3.3, 4.4, 7.7, 8.8, 9.9])

Actually, you'll need two cpu-kernels: the first one adds up all the stop[i] - start[i] to determine the length of nextcarry before you can allocate it. Then the second cpu-kernel fills it with integers.

src/libawkward/array/ListArray.cpp

ianna · 2020-01-10T17:42:01Z

@jpivarski - thanks, using the offsets fixed the test. Please, review the PR.

jpivarski

I put a lot of comments inline. The only actual "request changes" is that the flatten_length cpu-kernel needs starts_.offset() and stops_.offset() and NumpyArray::flatten needs more extreme testing (described inline).

Once those are done, it's up to you whether you want to extend to axis != 0 in this PR or in a new one. Now that we're collaborating, I'd prefer smaller PRs because then we can merge more frequently.

include/awkward/cpu-kernels/operations.h

src/cpu-kernels/operations.cpp

src/libawkward/array/ListArray.cpp

src/libawkward/array/NumpyArray.cpp

src/libawkward/array/RecordArray.cpp

tests/test_PR045_flatten_operation.py

ianna · 2020-01-13T11:36:19Z

@jpivarski - I think, I'd prefer to open a new PR for axis != 0.

The tests with NumpyArray work, but I couldn't figure out how to test it on the stride tricks - are they allowed?

>>> a = np.arange(1,10).reshape(3,3)
>>> a
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> np.lib.stride_tricks.as_strided(a, shape=a.shape[::-1], strides=a.strides[::-1])
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

jpivarski · 2020-01-13T12:44:53Z

The examples I have in the inline comment, by slicing a e-dimensional array, should provide enough extreme examples of strides.

The stride_trucks function allows you to create arbitrary shape/stride combinations, some of which could even raise segfaults when iterating over the array. (It's one of the few ways to raise a segfault in pure Python.) But slicing can also make some extreme shape/stride combinations, all of which are legal.

The only shape/stride combination I can think of that you can't make with slicing, but can with stride_tricks is an axis with shape != 0 and stride == 0. This effectively duplicates a value from the original array shape times. Is used to broadcast a scalar or single-valued dimension to match a dimension of length shape without actually copying data.

But you don't need the low-level stride_tricks to do this: there are broadcast functions that create such examples more safely.

ianna · 2020-01-13T17:08:17Z

The examples I have in the inline comment, by slicing a e-dimensional array, should provide enough extreme examples of strides.

The stride_trucks function allows you to create arbitrary shape/stride combinations, some of which could even raise segfaults when iterating over the array. (It's one of the few ways to raise a segfault in pure Python.) But slicing can also make some extreme shape/stride combinations, all of which are legal.

The only shape/stride combination I can think of that you can't make with slicing, but can with stride_tricks is an axis with shape != 0 and stride == 0. This effectively duplicates a value from the original array shape times. Is used to broadcast a scalar or single-valued dimension to match a dimension of length shape without actually copying data.

But you don't need the low-level stride_tricks to do this: there are broadcast functions that create such examples more safely.

Thanks, @jpivarski! I think, I'm done with this PR. What are the next steps?

jpivarski · 2020-01-13T17:42:45Z

I just resolved the merge conflict (version number), removed the [WIP] from the name, and the tests have just passed. Are there any intermediate commit states you think are important to save before I squash and merge? If not, then the intermediate history will be lost on GitHub, preserved only in our local clones of the repo.

Beyond that, I think it would be time to add axis != 0, which would be a good introduction to what recursion looks like on columnar data. The first case to add would be axis > 0, for which the Python-only flatten is a good illustration. Test examples for this case should be pretty deep, like 3 levels or more. It can be particularly tricky crossing the boundary from Awkward Array nodes like ListArray and RegularArray (each node is at most one "dimension") into NumpyArray nodes (can be more than one dimension). Some examples should have multidimensional NumpyArrays, in which the axis picks out one of the NumPy dimensions.

The axis < 0 case, counting from -1 being the innermost dimension, is hard even to define. For example, is this structure

two-dimensional or three-dimensional (assuming the NumpyArrays here are one-dimensional)? Following the "x" branch, it's a ListOffsetArray64 of a one-dimensional NumpyArray, which looks two-dimensional. Following the "y" branch, it's a ListOffsetArray64 of ListOffsetArray64 of a one-dimensional NumpyArray, which looks three-dimensional.

For a structure like this, should axis=-1 mean that we should flatten the nested array inside "y"? If so, then there would be no way to deal with "x". In other cases (where "x" and "y" are both multidimensional, perhaps to different degrees), such an axis=-1 notation would provide ways of flattening that wouldn't be possible (in one step) if axis=-1 weren't defined this way.

If, instead, axis=-1 means we should only flatten above the RecordArray: removing the outermost ListOffsetArray64 in this example, then it would always be easier to make sense of: it would succeed or fail in the same cases that you've already defined in this PR. However, that would make it impossible to flatten inside of a RecordArray—in one step. (The user would have to project the record into its components, flatten each, and recombine them. That sounds annoying, but it's a special case that probably doesn't come up often.)

I'm inclined to define axis=-1 to be above the RecordArray. Since all operations should adhere to the same conventions, this means that subsequent operations should also only descend to the first RecordArray when determining what axis=-1 means. There would need to be a new virtual method, list_depth, beside minmax_depth to do this counting and convert negative axis values into non-negative ones before recursing. All of that would be the subject of the new PR that adds axis != 0 to flatten.

jpivarski · 2020-01-13T17:50:08Z

When you're sure that intermediate commits don't have to be preserved, give me the okay and I'll squash-and-merge. Thanks!

ianna · 2020-01-13T21:45:39Z

When you're sure that intermediate commits don't have to be preserved, give me the okay and I'll squash-and-merge. Thanks!

Thanks, yes, you can squash-and-merge it.

start flatten implementation and add tests

63b8a4c

ianna requested a review from jpivarski January 7, 2020 10:59

Merge branch 'master' into feature/PR045-flatten-operation

22f350c

jpivarski reviewed Jan 7, 2020

View reviewed changes

ianna and others added 6 commits January 7, 2020 16:32

[skip ci] correct EmptyArray

6a68c0f

[skip ci] correct Record

96e520c

[skip ci] correct RegularArray

e375859

[skip ci] correct RawArray

6aaf021

Merge branch 'feature/PR045-flatten-operation' of https://github.com/…

ac79329

…scikit-hep/awkward-1.0 into feature/PR045-flatten-operation

Example of an extreme (but legal) ListArray.

a73a451

jpivarski reviewed Jan 7, 2020

View reviewed changes

tests/test_PR045_flatten_operation.py Outdated Show resolved Hide resolved

jpivarski and others added 2 commits January 8, 2020 15:04

Keep version number ahead of latest release.

e76eaa6

Merge branch 'master' into feature/PR045-flatten-operation

5dde95c

ianna added 2 commits January 9, 2020 16:43

[skip ci] operations in cpu kernels

3838f01

Merge branch 'feature/PR045-flatten-operation' of https://github.com/…

b51893b

…scikit-hep/awkward-1.0 into feature/PR045-flatten-operation

[skip ci] cleanup debugging messages

55fb884

ianna commented Jan 10, 2020

View reviewed changes

src/libawkward/array/ListArray.cpp Show resolved Hide resolved

ianna added 2 commits January 10, 2020 18:26

[skip ci] restructure functions

5568dcd

use offsets in calculations

67c6769

jpivarski requested changes Jan 10, 2020

View reviewed changes

ianna added 2 commits January 11, 2020 19:02

[skip ci] address Jim's comments to fix ListArray and add more tests

774a724

flatten non contiguous NumpyArray

554f3f5

jpivarski approved these changes Jan 13, 2020

View reviewed changes

Merge branch 'master' into feature/PR045-flatten-operation

1a6ee15

jpivarski changed the title ~~[WIP] start flatten implementation and add tests~~ Start flatten implementation and add tests Jan 13, 2020

jpivarski merged commit 8a71e52 into master Jan 13, 2020

jpivarski deleted the feature/PR045-flatten-operation branch January 13, 2020 23:03

This was referenced Jan 14, 2020

*::flatten and *::count for axis != 0 #51

Closed

Generic reducer operation #69

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Start flatten implementation and add tests #45

Start flatten implementation and add tests #45

ianna commented Jan 7, 2020 •

edited

jpivarski left a comment

jpivarski commented Jan 7, 2020

jpivarski commented Jan 7, 2020

jpivarski commented Jan 8, 2020

ianna commented Jan 8, 2020

ianna commented Jan 9, 2020

ianna commented Jan 10, 2020

jpivarski commented Jan 10, 2020

ianna commented Jan 10, 2020

jpivarski left a comment

ianna commented Jan 13, 2020 •

edited

jpivarski commented Jan 13, 2020

ianna commented Jan 13, 2020

jpivarski commented Jan 13, 2020 •

edited

jpivarski commented Jan 13, 2020

ianna commented Jan 13, 2020

Start flatten implementation and add tests #45

Start flatten implementation and add tests #45

Conversation

ianna commented Jan 7, 2020 • edited

jpivarski left a comment

Choose a reason for hiding this comment

jpivarski commented Jan 7, 2020

jpivarski commented Jan 7, 2020

jpivarski commented Jan 8, 2020

ianna commented Jan 8, 2020

ianna commented Jan 9, 2020

ianna commented Jan 10, 2020

jpivarski commented Jan 10, 2020

Bigger troubles

Suggested strategy

ianna commented Jan 10, 2020

jpivarski left a comment

Choose a reason for hiding this comment

ianna commented Jan 13, 2020 • edited

jpivarski commented Jan 13, 2020

ianna commented Jan 13, 2020

jpivarski commented Jan 13, 2020 • edited

jpivarski commented Jan 13, 2020

ianna commented Jan 13, 2020

ianna commented Jan 7, 2020 •

edited

ianna commented Jan 13, 2020 •

edited

jpivarski commented Jan 13, 2020 •

edited