Allow specifying type for empty Array #1805

Closed
HDembinski opened this issue Oct 18, 2022 · 8 comments · Fixed by #2365
Labels
feature New feature or request

Comments

@HDembinski
Member

Description of new feature

The API does not allow one to create an empty ak.Array with a given type. It is missing the equivalent of a dtype argument.

from awkward import Array

a = Array([[1.2], []])  # succeeds, type inferred
b = Array([])  # succeeds, but type '0 * unknown'

Please allow the user to pass a type keyword, so that they can create empty arrays with an appropriate type.

HDembinski added the feature (New feature or request) label on Oct 18, 2022
HDembinski changed the title from "ak.Array does not allow to specify type" to "Allow specifying type for empty Array" on Oct 18, 2022
@ianna
Collaborator

ianna commented Oct 18, 2022

I wonder if using an ArrayBuilder can help you to solve this issue:

>>> builder = ak.ArrayBuilder()
>>> builder.append(ak.Array([]))
>>> builder
<ArrayBuilder type='1 * var * unknown'>
>>> builder.snapshot()
<Array [[]] type='1 * var * unknown'>
>>> builder.append(ak.Array([1.1]))
>>> builder.snapshot()
<Array [[], [1.1]] type='2 * var * float64'>
>>> builder.append(ak.Array([]))
>>> builder.snapshot()
<Array [[], [1.1], []] type='3 * var * float64'>

@agoose77
Collaborator

Whilst it is obvious what should happen in the case of a single ragged array of identically typed values, many Awkward Arrays have more complex structures (and types). In those cases, passing a single dtype argument is not possible.

One could imagine an API in which we pass the ak.types.Type object. We currently have two things like that: LayoutBuilder (although this has disappeared in v2, at the Python level at least; we'll recover it at the Numba / RDataFrame level, IIRC) and ak.from_buffers. The latter accepts a high-level Form argument that is capable of describing the rich structures that Awkward understands.

If you're only dealing with identically typed arrays, you can also get away with using ak.values_astype on your empty array.

@jpivarski
Member

There's already an ArrayBuilder implicitly in

a = ak.Array([[1.2], []])  # succeeds, type inferred
b = ak.Array([])  # succeeds, but type '0 * unknown'

because the ak.Array constructor calls ak.from_iter and ak.from_iter fills an ak.ArrayBuilder.

We have some of the pieces needed to do this, but not all.

  1. We have a parser that parses type strings into ak.types.Type objects.
  2. We do not have a converter from ak.types.Type objects to ak.forms.Form objects, as that is a non-unique transformation. (Every Form maps to one Type, whereas a Type maps to multiple Forms.)
  3. We can generate an empty array from a given ak.forms.Form (using a trick, but it's a nice trick).
  4. I think we have a function to check the consistency of an array (ak.Array/ak.Record or its layout) with an ak.types.Type. The "consistency" (not equality) logic considers unknown to be consistent with any other type.
  5. We do not have a way of, after checking consistency of most of the tree, filling in the unknown nodes with the type they're judged to be consistent with. In the array, this is an EmptyArray node, and it needs to be replaced with a Content subtree generated from a Form made from the Type. (The choice of Form is non-unique, but since it won't have any data, it could be some canonical choice.)

It sounds like satisfying this request means writing two things: (a) Type → canonical Form, making the most simple choice, (b) a recursive function that walks down a Content and a Type at the same time. At most nodes, it checks consistency, but at EmptyArray Content nodes, it takes the current Type node, generates a canonical Form from it (using (a)), and generates an empty array from that (using the trick from item 3 above).
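The two pieces (a) and (b) can be sketched with a toy model in plain Python. This is not Awkward's implementation: every class and function name below is hypothetical, and the model covers only numbers, variable-length lists, and unknown.

```python
from dataclasses import dataclass
from typing import List, Union

# Toy "types": what the user asks for, with no layout implied.
@dataclass(frozen=True)
class NumberType:
    dtype: str

@dataclass(frozen=True)
class ListType:
    item: "TypeNode"

@dataclass(frozen=True)
class UnknownType:
    pass

TypeNode = Union[NumberType, ListType, UnknownType]

# Toy "contents": concrete layouts that hold data.
@dataclass
class Empty:  # stands in for EmptyArray
    pass

@dataclass
class Numbers:  # stands in for NumpyArray
    dtype: str
    data: list

@dataclass
class Lists:  # stands in for ListOffsetArray
    offsets: List[int]
    content: "ContentNode"

ContentNode = Union[Empty, Numbers, Lists]


def canonical_empty(t: TypeNode) -> ContentNode:
    """(a) Pick the simplest zero-length layout for a type."""
    if isinstance(t, NumberType):
        return Numbers(t.dtype, [])
    if isinstance(t, ListType):
        return Lists([0], canonical_empty(t.item))
    return Empty()  # unknown stays unknown


def enforce(content: ContentNode, t: TypeNode) -> ContentNode:
    """(b) Walk content and type together, filling Empty nodes from the type."""
    if isinstance(content, Empty):
        return canonical_empty(t)
    if isinstance(t, UnknownType):
        return content  # unknown is consistent with anything
    if isinstance(content, Numbers) and isinstance(t, NumberType):
        if content.dtype != t.dtype:
            raise TypeError(f"{content.dtype} is not consistent with {t.dtype}")
        return content
    if isinstance(content, Lists) and isinstance(t, ListType):
        return Lists(content.offsets, enforce(content.content, t.item))
    raise TypeError("content is not consistent with the requested type")


# Two empty lists, like ak.Array([[], []]) with type 2 * var * unknown.
ragged = Lists([0, 0, 0], Empty())
fixed = enforce(ragged, ListType(NumberType("int64")))
print(fixed)
```

Only the Empty node is replaced; the list structure above it is kept as-is, so existing offsets are never rewritten.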

This recursive function (b) would be applied after constructing the array, to ensure that it conforms to a given Type, filling in where necessary. It's similar to another problem (c) that hasn't been asked for: ensuring that an array conforms to a given Form. In that case, though, the replacement of EmptyArray with anything but EmptyForm would have to be an explicit rule; it doesn't come out naturally as part of the consistency rules, because those rules are defined on Types, not Forms. Maybe we should ignore (c) until it's actually raised, if ever.

This is doable, but it's not as simple as you might have been imagining, @HDembinski. I don't think you should take it on right away, @agoose77, since you have so many things on your plate right now, but I'll think about it. To make that formal, I'll assign this to myself.

@HDembinski
Member Author

HDembinski commented Oct 19, 2022

Whenever I report something here, I get frustrated to be honest, and that's keeping me from reporting here or getting involved in awkward. My feeling is that awkward was designed to be so general that it can do amazing things that I and most other users never need, but it cannot do comparably simple things well (JaggedArrays is all I want to work with) that I need.

@jpivarski
Member

From user feedback, I know that record types and option types are used fairly often, though union types are quite rare. (Maybe I can scan the full set of GitHub repos that import awkward to get more rigorous statistics.)

But there is a good argument for drawing a dotted line around just the ragged arrays (or arbitrarily deep ragged arrays? with or without fixed-length dimensions? with or without option types?), since that's an especially frequent use-case. The trouble is defining a closed system around that—functions that only return types $t \in T$ and only take types $t \in T$ as arguments. Functions like min and max on lists whose length might be zero have reason to return missing values, and functions like cartesian and combinations have reason to return tuples (a record-like structure) to represent the paired output, which can be different types because the depths might be different.

But you may be interested in this: pydata/xarray#4285 (comment). We've been talking about interoperability between Awkward and xarray, and the use-cases described there are more strictly focused on ragged arrays (unlike the HEP use-cases). Developers of the CloudDrift project (oceanography) are thinking of creating a RaggedArray structure that excludes records and unions, though I don't know yet what they think of option types and fixed-length dimensions. It won't be backed by an Awkward Array or xarray, though it will be developed in such a way as to be easily convertible.

Regarding this issue that you raised, if your arrays have no record or union types, then ak.values_astype will completely solve the problem of ensuring a specified numeric type. The disadvantage of this function is that it will convert the type of all leaves in the tree to a given dtype, but purely ragged arrays have only one leaf.

I'm still thinking about the general problem of ensuring that a given Awkward Array conforms to a given type, narrowing the type (from unknown) if necessary, but that won't impact you because the value it would add beyond values_astype is that it would handle trees.

@raymondEhlers
Contributor

Just chiming in to say that I would find this useful too. I've run into this issue recently because I build up some arrays which at times end up empty. Most of my analysis code calling ak.* functions works fine, but the unknown type breaks numba compilation. I understand from skimming the above that this may not be so trivial to resolve, which is fine for me since it's not a showstopper and I can work around it by checking for "unknown" in the type in just a couple of places. But it took some time and effort to track down, so it would be nice if there was a way to avoid it!

@jpivarski
Member

Specifically for Numba (and C++/RDataFrame), I thought EmptyArrays get promoted to NumpyArrays with dtype=np.float64 (like empty np.array([])) here:

def tolookup(layout, positions):
    if isinstance(layout, ak.contents.EmptyArray):
        return tolookup(layout.toNumpyArray(np.dtype(np.float64)), positions)

Consequently, this should work:

>>> import awkward as ak, numba as nb
>>> @nb.njit
... def nested_sum(array):
...     output = 0
...     for one_dimensional in array:
...         for item in one_dimensional:
...             output += item
...     return output
... 
>>> nested_sum(ak.Array([[1, 2, 3], [], [4, 5]]))
15
>>> nested_sum(ak.Array([[], [], []]))
0.0

and it does.

Is the problem with Numba compilation that np.float64 is being assumed, when an integer type is wanted?

@raymondEhlers
Contributor

Humm, I unfortunately don't have a concise reproducer - all I have is the traceback (which probably isn't terribly useful without getting into my rather nested code):

Traceback (most recent call last):
  File "/software/rehlers/dev/mammoth/mammoth/hardest_kt/produce_flat_skim_from_track_skim.py", line 440, in _run_embedding_skim
    result = analysis_track_skim_to_flat_tree.hardest_kt_embedding_skim(
  File "/software/rehlers/dev/mammoth/mammoth/hardest_kt/analysis_track_skim_to_flat_tree.py", line 421, in hardest_kt_embedding_skim
    _hardest_kt_embedding_skim(
  File "/software/rehlers/dev/mammoth/mammoth/hardest_kt/analysis_track_skim_to_flat_tree.py", line 231, in _hardest_kt_embedding_skim
    skim_to_flat_tree.calculate_embedding_skim_impl(
  File "/software/rehlers/dev/mammoth/mammoth/hardest_kt/skim_to_flat_tree.py", line 889, in calculate_embedding_skim_impl
    generator_subjet_momentum_fraction_in_measured_jet_numba_wrapper(
  File "/software/rehlers/dev/mammoth/mammoth/hardest_kt/skim_to_flat_tree.py", line 662, in generator_subjet_momentum_fraction_in_measured_jet_numba_wrapper
    leading_momentum_fraction, subleading_momentum_fraction = generator_subjet_momentum_fraction_in_measured_jet_numba(
  File "/software/rehlers/dev/mammoth/.venv/lib/python3.9/site-packages/numba/core/dispatcher.py", line 468, in _compile_for_args
    error_rewrite(e, 'typing')
  File "/software/rehlers/dev/mammoth/.venv/lib/python3.9/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of getiter with parameters (float64)

During: typing of intrinsic-call at /software/rehlers/dev/mammoth/mammoth/hardest_kt/skim_to_flat_tree.py (293)

File "mammoth/hardest_kt/skim_to_flat_tree.py", line 293:
def _sort_subjets(input_jet: ak.Array, input_subjets: List[analysis_jet_substructure.Subjet]) -> Tuple[analysis_jet_substructure.Subjet, analysis_jet_substructure.Subjet]:
    <source elided>
        py = 0
        for constituent_index in sj.constituent_indices:
        ^

During: resolving callee type: type(CPUDispatcher(<function _sort_subjets at 0x7fa6ed3e9dc0>))
During: typing of call at /software/rehlers/dev/mammoth/mammoth/hardest_kt/skim_to_flat_tree.py (637)

During: resolving callee type: type(CPUDispatcher(<function _sort_subjets at 0x7fa6ed3e9dc0>))
During: typing of call at /software/rehlers/dev/mammoth/mammoth/hardest_kt/skim_to_flat_tree.py (637)


File "mammoth/hardest_kt/skim_to_flat_tree.py", line 637:
def generator_subjet_momentum_fraction_in_measured_jet_numba(
    <source elided>
        # Sort
        generator_like_leading, generator_like_subleading = _sort_subjets(generator_like_jet, generator_like_subjets)

The issue Invalid use of getiter with parameters (float64) only appears to occur when I had an "unknown" in the type.
In any case, my intention is not to hijack this issue :-) Perhaps the underlying cause of my issue is something else. Since I have a workaround, I probably can't take the time to dig into it further. Thanks for the pointer!
