Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ak.Array wrongly deduces input type #1807

Open
HDembinski opened this issue Oct 18, 2022 · 18 comments
Open

ak.Array wrongly deduces input type #1807

HDembinski opened this issue Oct 18, 2022 · 18 comments
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@HDembinski
Copy link
Member

HDembinski commented Oct 18, 2022

Version of Awkward Array

1.10.1

Description and code to reproduce

a = np.array([1, 2, 3], dtype=np.int32)
ak.Array(a)  # <Array [1, 2, 3] type='3 * int32'> OK
ak.Array([a])   # <Array [[1, 2, 3]] type='1 * var * int64'> BAD

As you can see, in the varlength context, Array deduces int64 although the underlying type is int32. It is important that Array deduces the exact type, since the array in 64 bit wastes memory and CPU cycles.

This is also important when these arrays are written to ROOT files. Using types that are too large wastes disk space.

@HDembinski HDembinski added the bug (unverified) The problem described would be a bug, but needs to be triaged label Oct 18, 2022
@agoose77
Copy link
Collaborator

If you pass in a Python iterable, Awkward invokes ArrayBuilder under the hood (via ak.from_iter) as it is ultimately looping in Python to build the result. I don't believe we expose an array fast-path in ArrayBuilder; I haven't looked at this in depth, but I think if this were trivial to do, we'd already have done it. Our ArrayBuilder mechanism supports things like unions, so it's slightly complicated under the hood.

The answer here is that if you want to adopt a NumPy array, then you need to pass it in as a NumPy array (or use ak.from_numpy()).

@HDembinski
Copy link
Member Author

HDembinski commented Oct 18, 2022

I need a varlength array, aka JaggedArray, so from_numpy() wouldn't work for me.

@agoose77
Copy link
Collaborator

agoose77 commented Oct 18, 2022

In that case, you can use from_regular, e.g.

array = ak.from_regular(a[np.newaxis, :], axis=1)

Most of the high-level Awkward functions call ak.to_layout under the hood, which knows how to adopt various array types. So, in this case, I've dropped the from_numpy() which usually isn't necessary.

@ianna
Copy link
Collaborator

ianna commented Oct 18, 2022

ArrayBuilder has only Int64Builder. We could extend the API to support int32.

@HDembinski
Copy link
Member Author

This command works for the minimal example, but I don't know how to apply this to my actual case.

array = ak.from_regular(a[np.newaxis, :], axis=1)

I have a list of numpy arrays. I want to turn them into Awkward (JaggedArray) types.

@HDembinski
Copy link
Member Author

HDembinski commented Oct 18, 2022

Yes, please support all integer and float types in ArrayBuilder that numpy supports.

@ianna
Copy link
Collaborator

ianna commented Oct 18, 2022

I think, there is a more type friendly option to use a LayoutBuilder, but it currently lacks pythonization:

template<class PRIMITIVE>
using NumpyBuilder = awkward::LayoutBuilder::Numpy<PRIMITIVE>;

@agoose77
Copy link
Collaborator

agoose77 commented Oct 18, 2022

I have a list of numpy arrays. I want to turn them into Awkward (JaggedArray) types.

In this case, using ArrayBuilder e.g. via ak.Array(arrays) is going to involve iterating over the elements of each array. It would be faster to do something like:

flattened = np.concatenate(arrays, axis=0)
counts = [len(c) for c in arrays]
ak.unflatten(flattened, counts)

this will involve fewer allocations AFAICR.

@HDembinski
Copy link
Member Author

Ah, yes, that's also a good workaround for the meantime for me.

@agoose77
Copy link
Collaborator

agoose77 commented Oct 18, 2022

I'm not sure that I'd call this a workaround. ak.ArrayBuilder needs to be able to handle all kinds of user input; even if we added support for all of the NumPy primitive types, ArrayBuilder still performs multiple copies of intermediate buffers (it allocates in panels) and has to visit all of the array elements. So, I may be corrected here, but I think it's fairly certain that this will remain the advised solution in the case that you have existing NumPy arrays.

@HDembinski
Copy link
Member Author

It is a workaround for me, because it is not an intuitive API. It is also not documented.

@HDembinski
Copy link
Member Author

At least fix the initial issue that I reported, because that behavior is harmful in any case.

@HDembinski
Copy link
Member Author

When designing APIs, one needs to think about all ways in which this API can be used and then handle all those cases. Not only a few.

@HDembinski
Copy link
Member Author

I'm not sure that I'd call this a workaround. ak.ArrayBuilder needs to be able to handle all kinds of user input; even if we added support for all of the NumPy primitive types, ArrayBuilder still performs multiple copies of intermediate buffers (it allocates in panels) and has to visit all of the array elements. So, I may be corrected here, but I think it's fairly certain that this will remain the advised solution in the case that you have existing NumPy arrays.

I don't know how ArrayBuilder works internally, but I suppose ArrayBuilder can figure out how to handle this special case more efficiently. The question is only whether you have enough information at the call-site, but I think you do.

@jpivarski
Copy link
Member

What's happening here is that we have several specialized functions for loading arrays in different ways (all of which are documented in docstrings/API reference; the hard part is leading users to the appropriate page).

The ak.Array (and ak.Record) constructor dispatches to these specialized functions by argument type, as a convenience. The mapping from types to method of construction is given in the ak.Array (and ak.Record) constructor documentation, for the data argument.

a = np.array([1, 2, 3], dtype=np.int32)

ak.Array(a)  # <Array [1, 2, 3] type='3 * int32'>
ak.Array([a])   # <Array [[1, 2, 3]] type='1 * var * int64'>

is the behavior that we want, for the following reason:

  • When a (data) is a NumPy array, ak.from_numpy is called. This wraps the NumPy array without copying it, so the data type is preserved.
  • When a is a generic iterable (not CuPy, not pyarrow, not a dict of str → columns, and not a string), then the generic ak.from_iter is called. All of the nested data are presumed to be generic iterables, including NumPy arrays, so the integers from NumPy are generic integers. We use int64 as a generic integer type.

Having ak.from_iter preserve NumPy arrays that it encounters inside arbitrary iterables would make its behavior more complex, more difficult to reason about. Your example works this way for simplicity's sake: if the outermost element is a non-array iterable ([a] is a list), everything is generic iterables and generic numeric types.

Since iterating in Python is slow anyway, there's not much speed advantage to recognizing the suite of NumPy scalar types and adding specialized-integer append methods to ArrayBuilder, so that they can be mixed into the output. There is a memory-space advantage, but that can be fixed after constructing the array using the type coercion method described in #1805 (comment), or depending on how simple your use-case is, ak.values_astype can already do it.

Also note that we've only been talking about the memory-space used by the numerical values of the array. The offsets of the jagged array also take space, and ArrayBuilder makes int64 offsets. There is nowhere in the API to insert the information that you might want int32 offsets, since you've only given it an iterable. Meanwhile, ak.unflatten lets you specify this because you give it two arrays, and you can set the dtype of the arrays you give it.

@ianna mentioned LayoutBuilder because it was designed for this purpose. ArrayBuilder takes generic iterables, looks at the type of ever element it is given, and builds an array conforming to that type, no matter how wacky the tagged union has to be. LayoutBuilder is constructed with a type and can only be filled with data of that type. It is designed for speed, so it only exists in compiled languages, currently C++ and someday Numba. (We may need a slow version of LayoutBuilder in pure Python for use in debugging code that will be sent to Numba, but only for the purpose of debugging.) I thought that LayoutBuilder would be a good interface to impy and described how it could be used here: impy-project/chromo#65.

Since ArrayBuilder needs to discover types as it goes along, it needs to be able to compare the current expected type X to the given type Y of each new datum. For $n$ data types, this is n classes (instance of the current expected type X) with $n$ methods each (for handling the given datum of type Y). That's a total of $n^2$ methods. Right now, $n=12$. If, in addition to Int64Builder, we add builders for int8, uint8, int16, uint16, int32, uint32, int64, uint64, and in addition to Float64Builder, we add a builder for float32, then $n=21$. The codebase would triple, and all those new methods would need to properly implement type unification: just as int64 and float64 goes to float64 instead of a tagged union, the whole matrix of numeric types would have to be implemented. It's not impossible, but the value proposition is not strong enough for all of that work because there are other ways of getting what you want.

@HDembinski
Copy link
Member Author

HDembinski commented Oct 19, 2022

Thank you for the long write-up, but it does not matter to me how you implement it, we are talking about API design. You can change your implementation so that it matches user expectations.

My user expectation is that these two commands

ak.Array(a)  # <Array [1, 2, 3] type='3 * int32'>
ak.Array([a])   # <Array [[1, 2, 3]] type='1 * var * int64'>

both give me int32. Using a larger int than the input is wasting resources.

If you cannot provide this because it would be an unreasonable effort (although that may point to a problem with the scalability of the implementation), then you should at very least provide a way for me to restrict the int to int32. Right now, that's not possible.

@agoose77
Copy link
Collaborator

agoose77 commented Oct 19, 2022

All APIs are ultimately a compromise between multiple different constraints. Everyone has different expectations about how these constraints should be satisfied, and they apply different weightings according to their own needs and interests. As you note, you have expectations, but that does not mean that all users share those expectations. It's our job to find a suitable solution that doesn't compromise our goals.

Ultimately, we only have so many developers with a finite amount of time. We could apply a host of micro-optimisations to ak.from_iter, but we need to prioritise the more important parts of the project (bugs, and cornerstone features) over things that currently work, even if with some caveats. It might be that we need to improve our documentation here so that it is clearer how Awkward behaves in ak.Array for non-typed objects.

Jim's given a good overview of why this happens. All libraries require some domain specific knowledge, and in this case, if you care about performance and/or memory usage, then you need to use the high-performance APIs such as from_numpy. Python lists are untyped, so it will by definition be slower than dealing with flat buffers and shape information.

at very least provide a way for me to restrict the int to int32

We have APIs to do this, but they're aimed more at library authors than analysis users. Note that, most users are consuming existing ragged data from Parquet, ROOT, or other sources. It is my understanding that you are integrating Awkward as a library author. The solution that I mentioned above is one way to restrict the type. If you pass in a typed array, e.g. np.ndarray to ak.array or ak.from_numpy, Awkward will not change the type. This is why these functions are here - from_numpy will error if you pass the wrong type. We also have from_buffers, which accepts Awkward's Form (a high-level representation of an array data type). Python types don't have a one-to-one correspondence with Awkward types, so if you care about these things you should use a form to describe your Array.

@jpivarski
Copy link
Member

I had intended this:

is the behavior that we want, for the following reason:

  • When a (data) is a NumPy array, ...
  • When a is a generic iterable...

as a statement of the desired interface. I know I expressed it as "if X, from_numpy is called; if Y, from_iter is called," but I didn't mean that as a statement of implementation; I meant that as "if X, the resulting array is formed as though from_numpy were called; if Y, the resulting array is formed as though from_iter were called." I'm defining the interface to the ak.Array constructor in terms of the interfaces to from_numpy and from_iter.

I was saying that when the argument is pure NumPy, the result will preserve the NumPyness. When the argument is a mixture of NumPy and Python builtins, or any other sequence types, there's no longer any attempt to dig out the nested NumPy types. This decision was made for the sake of simplicity, so that the output is more predictable. (I remember making it, responding to a user issue: simplicity was the overriding concern that led us to this choice.)

The next part of my message was about the difficulty of implementing specialized types in ArrayBuilder and how little it would aid performance. @agoose77 described a way to build the array without ever creating int64 intermediates (and you can also control the integer type of the offsets buffer) and I described a way to change the int64 back into int32. Both methods would be 1‒3 lines of code in a library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged
Projects
None yet
Development

No branches or pull requests

4 participants