Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start flatten implementation and add tests #45

Merged
merged 18 commits into from Jan 13, 2020

Conversation

ianna
Copy link
Collaborator

@ianna ianna commented Jan 7, 2020

TODO:

  • implement checks for an argument axis: if less or equal -1 or greater then dimension trigger a wrong user input by throwing std::invalid_argument
  • add tests for the above
    * add a C++ test for a RawArrayOf
  • finish flatten implementations

@jpivarski - please, comment

@ianna ianna requested a review from jpivarski January 7, 2020 10:59
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I should add tests directly, both to illustrate corner cases and to illustrate what it should do.

src/libawkward/array/EmptyArray.cpp Outdated Show resolved Hide resolved
src/libawkward/array/ListArray.cpp Outdated Show resolved Hide resolved
src/libawkward/array/NumpyArray.cpp Outdated Show resolved Hide resolved
src/libawkward/array/Record.cpp Outdated Show resolved Hide resolved
src/libawkward/array/RecordArray.cpp Outdated Show resolved Hide resolved
src/libawkward/array/RegularArray.cpp Outdated Show resolved Hide resolved
@jpivarski
Copy link
Member

Oh, and you probably don't need a test of RawArrayOf<T>. This kind of array is strictly one-dimensional, so flatten would always return an error. That's a simple enough case that it can get away without a test.

@jpivarski
Copy link
Member

I just put in an example of a ListArray that demonstrates its generality. What's allowed and what isn't allowed is described here.

When you enter the flatten implementation, you don't know if the array is in a legal state; the policy is to check as many rules as you need to perform the operation (during the operation).

ListArray is sufficiently general that it will need a new cpu-kernel. That's part of what makes this function a good introduction. :)

@jpivarski
Copy link
Member

Uh-oh: I didn't intend to merge the whole master branch into this branch, but if I'm reading it right, that's what I did with 5dde95c. If that's a problem, you can roll back that commit—I only did it to make sure that the VERSION_INFO was ahead of the latest release.

@ianna
Copy link
Collaborator Author

ianna commented Jan 8, 2020

Uh-oh: I didn't intend to merge the whole master branch into this branch, but if I'm reading it right, that's what I did with 5dde95c. If that's a problem, you can roll back that commit—I only did it to make sure that the VERSION_INFO was ahead of the latest release.

No problem, I'll update my branch

@ianna
Copy link
Collaborator Author

ianna commented Jan 9, 2020

@jpivarski - it's not finished yet. I just wanted to bring it up-to-date with my local branch. I'm a bit confused with the indices as you can see from the tests :-)

@ianna
Copy link
Collaborator Author

ianna commented Jan 10, 2020

@jpivarski - Could you, please, have a look? I'm not sure if I'm going in the right direction. Besides, I need some help here. If I retrieve the indices for the sliced array, I need an offset:

     array2 = array[2:-1]
-    assert awkward1.tolist(array2.flatten()) == [3.3, 4.4, 5.5]
+    #assert awkward1.tolist(array2.flatten()) == [3.3, 4.4, 5.5]
        array2 = array[2:-1]
>       assert awkward1.tolist(array2.flatten()) == [3.3, 4.4, 5.5]
E       assert [0.0, 1.1, 2.2] == [3.3, 4.4, 5.5]
E         At index 0 diff: 0.0 != 3.3
E         Full diff:
E         - [0.0, 1.1, 2.2]
E         + [3.3, 4.4, 5.5]

@jpivarski
Copy link
Member

First of all, with the ListArray defined as it is in the test code ([[0.0, 1.1, 2.2], [], [5.5], [8.8]] and not [[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]]), the assertion should be false because array[2:-1] of this array is [[5.5]] and its flattening is [5.5]. It's not an interesting example. (Your ListArray example differs from the ListOffsetArray example because it's missing the 3:5 subarray and the 6:10 subarray has been shortened to 8:9.)

To see this, try running the pure Python flatten function from the top of the test file:

>>> flatten(awkward1.tolist(array))
[0.0, 1.1, 2.2, 5.5, 8.8]
>>> flatten(awkward1.tolist(array[2:-1]))
[5.5]

Bigger troubles

However, you're going to have more difficulty when you get to the next block:

content = awkward1.layout.NumpyArray(numpy.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=numpy.int64))
starts  = awkward1.layout.Index64(numpy.array([4, 999, 0, 0, 1, 7]))
stops   = awkward1.layout.Index64(numpy.array([7, 999, 1, 4, 5, 10]))
array   = awkward1.layout.ListArray64(starts, stops, content)
assert awkward1.tolist(array) == [[4, 5, 6], [], [0], [0, 1, 2, 3], [1, 2, 3, 4], [7, 8, 9]]

This one not only has gaps, but they're out of order and overlap. Maybe you should concentrate on this one first because of its generality.

Suggested strategy

There are many ways to implement these functions, and I see that you're following a procedure for ListArray that is similar to what was done for ListOffsetArray. Here's a better idea: instead of trying to "fix" the ListArray with a new starts and stops (you can't because of the gaps, overlaps, and out-of-orderness), write a kernel that performs the flattening as an integer array that will be passed on to the next level down.

For instance, with the first ListArray:

>>> awkward1.tolist(array)
[[0.0, 1.1, 2.2], [], [5.5], [8.8]]
>>> flatten(awkward1.tolist(array))
[0.0, 1.1, 2.2, 5.5, 8.8]

Your kernel can create an Index64 named nextcarry that looks like this: [0, 1, 2, 5, 8]. Then when you call

std::shared_ptr<Content> out = content_.get()->carry(nextcarry)

that out array will be [0.0, 1.1, 2.2, 5.5, 8.8]. "Carrying" is an operation that I added to support getitem; it does exactly the same thing as this:

>>> content[[0, 1, 2, 5, 8]]
<NumpyArray format="d" shape="5" data="0 1.1 2.2 5.5 8.8" at="0x55f493797580"/>

("gather" in SIMD and numpy.take in NumPy). It doesn't matter what kind of an array content_ is—it could be records or more jagged arrays or whatever.

Back when all of these things were implemented exclusively in NumPy calls (see old implementation of flatten), we used this trick everywhere: an array of non-negative integers is a partial application of a function on any type of data structure. I came up with this explanation why: considering element-selection from an array (i.e. passing an integer between square brackets) as a function call, treating that array of length n, element type D as a function [0, n) → D, and gather/numpy.take is function composition over those function calls:

(   [0, m) → [0, n)   ) ∘ (   [0, n) → D   ) = (   [0, m) → D   )

Function composition is associative, so we can apply it whenever we like. In getitem, I had to recursively apply an operation that had to carry information from one recursive step to the next, much like carrying a digit in longhand addition, so this operation is called carry. It's only used internally, but ListArray::flatten would make good use of it.

So, given an array like

content = awkward1.layout.NumpyArray(numpy.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]))
starts  = awkward1.layout.Index64(numpy.array([4, 999, 0, 0, 1, 7]))
stops   = awkward1.layout.Index64(numpy.array([7, 999, 1, 4, 5, 10]))
array   = awkward1.layout.ListArray64(starts, stops, content)
assert awkward1.tolist(array) == [[4.4, 5.5, 6.6], [], [0.0], [0.0, 1.1, 2.2, 3.3], [1.1, 2.2, 3.3, 4.4], [7.7, 8.8, 9.9]]

The task of your new cpu-kernel would be to generate this Index64:

[4, 5, 6, 0, 0, 1, 2, 3, 1, 2, 3, 4, 7, 8, 9]

and then pass that Index64 as "nextcarry" to

content_.get()->carry(nextcarry)

to get

>>> numpy.asarray(content[[4, 5, 6, 0, 0, 1, 2, 3, 1, 2, 3, 4, 7, 8, 9]])
array([4.4, 5.5, 6.6, 0. , 0. , 1.1, 2.2, 3.3, 1.1, 2.2, 3.3, 4.4, 7.7, 8.8, 9.9])

Actually, you'll need two cpu-kernels: the first one adds up all the stop[i] - start[i] to determine the length of nextcarry before you can allocate it. Then the second cpu-kernel fills it with integers.

@ianna
Copy link
Collaborator Author

ianna commented Jan 10, 2020

@jpivarski - thanks, using the offsets fixed the test. Please, review the PR.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put a lot of comments inline. The only actual "request changes" is that the flatten_length cpu-kernel needs starts_.offset() and stops_.offset() and NumpyArray::flatten needs more extreme testing (described inline).

Once those are done, it's up to you whether you want to extend to axis != 0 in this PR or in a new one. Now that we're collaborating, I'd prefer smaller PRs because then we can merge more frequently.

include/awkward/cpu-kernels/operations.h Outdated Show resolved Hide resolved
src/cpu-kernels/operations.cpp Outdated Show resolved Hide resolved
src/cpu-kernels/operations.cpp Show resolved Hide resolved
src/libawkward/array/ListArray.cpp Show resolved Hide resolved
src/libawkward/array/NumpyArray.cpp Outdated Show resolved Hide resolved
src/libawkward/array/RecordArray.cpp Outdated Show resolved Hide resolved
tests/test_PR045_flatten_operation.py Outdated Show resolved Hide resolved
tests/test_PR045_flatten_operation.py Outdated Show resolved Hide resolved
@ianna
Copy link
Collaborator Author

ianna commented Jan 13, 2020

@jpivarski - I think, I'd prefer to open a new PR for axis != 0.

The tests with NumpyArray work, but I couldn't figure out how to test it on the stride tricks - are they allowed?

>>> a = np.arange(1,10).reshape(3,3)
>>> a
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> np.lib.stride_tricks.as_strided(a, shape=a.shape[::-1], strides=a.strides[::-1])
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

@jpivarski
Copy link
Member

The examples I have in the inline comment, by slicing a e-dimensional array, should provide enough extreme examples of strides.

The stride_trucks function allows you to create arbitrary shape/stride combinations, some of which could even raise segfaults when iterating over the array. (It's one of the few ways to raise a segfault in pure Python.) But slicing can also make some extreme shape/stride combinations, all of which are legal.

The only shape/stride combination I can think of that you can't make with slicing, but can with stride_tricks is an axis with shape != 0 and stride == 0. This effectively duplicates a value from the original array shape times. Is used to broadcast a scalar or single-valued dimension to match a dimension of length shape without actually copying data.

But you don't need the low-level stride_tricks to do this: there are broadcast functions that create such examples more safely.

@ianna
Copy link
Collaborator Author

ianna commented Jan 13, 2020

The examples I have in the inline comment, by slicing a e-dimensional array, should provide enough extreme examples of strides.

The stride_trucks function allows you to create arbitrary shape/stride combinations, some of which could even raise segfaults when iterating over the array. (It's one of the few ways to raise a segfault in pure Python.) But slicing can also make some extreme shape/stride combinations, all of which are legal.

The only shape/stride combination I can think of that you can't make with slicing, but can with stride_tricks is an axis with shape != 0 and stride == 0. This effectively duplicates a value from the original array shape times. Is used to broadcast a scalar or single-valued dimension to match a dimension of length shape without actually copying data.

But you don't need the low-level stride_tricks to do this: there are broadcast functions that create such examples more safely.

Thanks, @jpivarski! I think, I'm done with this PR. What are the next steps?

@jpivarski jpivarski changed the title [WIP] start flatten implementation and add tests Start flatten implementation and add tests Jan 13, 2020
@jpivarski
Copy link
Member

jpivarski commented Jan 13, 2020

I just resolved the merge conflict (version number), removed the [WIP] from the name, and the tests have just passed. Are there any intermediate commit states you think are important to save before I squash and merge? If not, then the intermediate history will be lost on GitHub, preserved only in our local clones of the repo.

Beyond that, I think it would be time to add axis != 0, which would be a good introduction to what recursion looks like on columnar data. The first case to add would be axis > 0, for which the Python-only flatten is a good illustration. Test examples for this case should be pretty deep, like 3 levels or more. It can be particularly tricky crossing the boundary from Awkward Array nodes like ListArray and RegularArray (each node is at most one "dimension") into NumpyArray nodes (can be more than one dimension). Some examples should have multidimensional NumpyArrays, in which the axis picks out one of the NumPy dimensions.

The axis < 0 case, counting from -1 being the innermost dimension, is hard even to define. For example, is this structure

example-hierarchy

two-dimensional or three-dimensional (assuming the NumpyArrays here are one-dimensional)? Following the "x" branch, it's a ListOffsetArray64 of a one-dimensional NumpyArray, which looks two-dimensional. Following the "y" branch, it's a ListOffsetArray64 of ListOffsetArray64 of a one-dimensional NumpyArray, which looks three-dimensional.

For a structure like this, should axis=-1 mean that we should flatten the nested array inside "y"? If so, then there would be no way to deal with "x". In other cases (where "x" and "y" are both multidimensional, perhaps to different degrees), such an axis=-1 notation would provide ways of flattening that wouldn't be possible (in one step) if axis=-1 weren't defined this way.

If, instead, axis=-1 means we should only flatten above the RecordArray: removing the outermost ListOffsetArray64 in this example, then it would always be easier to make sense of: it would succeed or fail in the same cases that you've already defined in this PR. However, that would make it impossible to flatten inside of a RecordArray—in one step. (The user would have to project the record into its components, flatten each, and recombine them. That sounds annoying, but it's a special case that probably doesn't come up often.)

I'm inclined to define axis=-1 to be above the RecordArray. Since all operations should adhere to the same conventions, this means that subsequent operations should also only descend to the first RecordArray when determining what axis=-1 means. There would need to be a new virtual method, list_depth, beside minmax_depth to do this counting and convert negative axis values into non-negative ones before recursing. All of that would be the subject of the new PR that adds axis != 0 to flatten.

@jpivarski
Copy link
Member

When you're sure that intermediate commits don't have to be preserved, give me the okay and I'll squash-and-merge. Thanks!

@ianna
Copy link
Collaborator Author

ianna commented Jan 13, 2020

When you're sure that intermediate commits don't have to be preserved, give me the okay and I'll squash-and-merge. Thanks!

Thanks, yes, you can squash-and-merge it.

@jpivarski jpivarski merged commit 8a71e52 into master Jan 13, 2020
@jpivarski jpivarski deleted the feature/PR045-flatten-operation branch January 13, 2020 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants