ak.Array wrongly deduces input type #1807
Comments
If you pass in a Python iterable, Awkward invokes The answer here is that if you want to adopt a NumPy array, then you need to pass it in as a NumPy array (or use |
I need a varlength array, aka JaggedArray, so from_numpy() wouldn't work for me. |
In that case, you can use
array = ak.from_regular(a[np.newaxis, :], axis=1)
Most of the high-level Awkward functions call |
This command works for the minimal example, but I don't know how to apply this to my actual case.
array = ak.from_regular(a[np.newaxis, :], axis=1)
I have a list of NumPy arrays. I want to turn them into Awkward (JaggedArray) types. |
Yes, please support all integer and float types in ArrayBuilder that numpy supports. |
I think there is a more type-friendly option to use a
template<class PRIMITIVE>
using NumpyBuilder = awkward::LayoutBuilder::Numpy<PRIMITIVE>; |
In this case, using
flattened = np.concatenate(arrays, axis=0)
counts = [len(c) for c in arrays]
ak.unflatten(flattened, counts)
this will involve fewer allocations AFAICR. |
Ah, yes, that's also a good workaround for the meantime for me. |
I'm not sure that I'd call this a workaround. |
It is a workaround for me, because it is not an intuitive API. It is also not documented. |
At least fix the initial issue that I reported, because that behavior is harmful in any case. |
When designing APIs, one needs to think about all ways in which this API can be used and then handle all those cases. Not only a few. |
I don't know how ArrayBuilder works internally, but I suppose ArrayBuilder can figure out how to handle this special case more efficiently. The question is only whether you have enough information at the call-site, but I think you do. |
What's happening here is that we have several specialized functions for loading arrays in different ways (all of which are documented in docstrings/API reference; the hard part is leading users to the appropriate page). The ak.Array (and ak.Record) constructor dispatches to these specialized functions by argument type, as a convenience. The mapping from types to method of construction is given in the ak.Array (and ak.Record) constructor documentation, for the
a = np.array([1, 2, 3], dtype=np.int32)
ak.Array(a) # <Array [1, 2, 3] type='3 * int32'>
ak.Array([a]) # <Array [[1, 2, 3]] type='1 * var * int64'>
is the behavior that we want, for the following reason:
Having
Since iterating in Python is slow anyway, there's not much speed advantage to recognizing the suite of NumPy scalar types and adding specialized-integer append methods to ArrayBuilder, so that they can be mixed into the output. There is a memory-space advantage, but that can be fixed after constructing the array using the type coercion method described in #1805 (comment), or depending on how simple your use-case is, ak.values_astype can already do it.
Also note that we've only been talking about the memory-space used by the numerical values of the array. The offsets of the jagged array also take space, and ArrayBuilder makes
@ianna mentioned LayoutBuilder because it was designed for this purpose. ArrayBuilder takes generic iterables, looks at the type of every element it is given, and builds an array conforming to that type, no matter how wacky the tagged union has to be. LayoutBuilder is constructed with a type and can only be filled with data of that type. It is designed for speed, so it only exists in compiled languages, currently C++ and someday Numba. (We may need a slow version of LayoutBuilder in pure Python for use in debugging code that will be sent to Numba, but only for the purpose of debugging.)
I thought that LayoutBuilder would be a good interface to impy and described how it could be used here: impy-project/chromo#65.
Since ArrayBuilder needs to discover types as it goes along, it needs to be able to compare the current expected type |
Thank you for the long write-up, but it does not matter to me how you implement it, we are talking about API design. You can change your implementation so that it matches user expectations. My user expectation is that these two commands
ak.Array(a) # <Array [1, 2, 3] type='3 * int32'>
ak.Array([a]) # <Array [[1, 2, 3]] type='1 * var * int64'>
both give me int32. Using a larger int than the input is wasting resources. If you cannot provide this because it would be an unreasonable effort (although that may point to a problem with the scalability of the implementation), then you should at the very least provide a way for me to restrict the int to int32. Right now, that's not possible. |
All APIs are ultimately a compromise between multiple different constraints. Everyone has different expectations about how these constraints should be satisfied, and they apply different weightings according to their own needs and interests. As you note, you have expectations, but that does not mean that all users share those expectations. It's our job to find a suitable solution that doesn't compromise our goals. Ultimately, we only have so many developers with a finite amount of time. We could apply a host of micro-optimisations to
Jim's given a good overview of why this happens. All libraries require some domain-specific knowledge, and in this case, if you care about performance and/or memory usage, then you need to use the high-performance APIs such as
We have APIs to do this, but they're aimed more at library authors than analysis users. Note that most users are consuming existing ragged data from Parquet, ROOT, or other sources. It is my understanding that you are integrating Awkward as a library author. The solution that I mentioned above is one way to restrict the type. If you pass in a typed array, e.g. |
I had intended this:
as a statement of the desired interface. I know I expressed it as "if X, I was saying that when the argument is pure NumPy, the result will preserve the NumPyness. When the argument is a mixture of NumPy and Python builtins, or any other sequence types, there's no longer any attempt to dig out the nested NumPy types. This decision was made for the sake of simplicity, so that the output is more predictable. (I remember making it, responding to a user issue: simplicity was the overriding concern that led us to this choice.) The next part of my message was about the difficulty of implementing specialized types in ArrayBuilder and how little it would aid performance. @agoose77 described a way to build the array without ever creating |
Version of Awkward Array
1.10.1
Description and code to reproduce
As you can see, in the varlength context, Array deduces int64 although the underlying type is int32. It is important that Array deduces the exact type, since a 64-bit array wastes memory and CPU cycles.
This is also important when these arrays are written to ROOT files. Using types that are too large wastes disk space.
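To put a number on the memory concern, here is a small NumPy-only sketch comparing the in-memory size of the same values stored as int32 and as int64. The array length of one million is an arbitrary choice for illustration, and the comparison covers only the value buffer, not any offsets or file-format overhead.

```python
import numpy as np

# One million values stored as 32-bit integers (arbitrary example size).
values32 = np.arange(1_000_000, dtype=np.int32)

# The same values widened to 64-bit integers.
values64 = values32.astype(np.int64)

print(values32.nbytes)  # 4000000
print(values64.nbytes)  # 8000000
```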