fix: support `highlevel=False` in all branches for `from_parquet` #2646

agoose77 · 2023-08-15T12:39:04Z

This PR ensures that ak.from_parquet can handle highlevel=False in all branches. It also adds some useful comments to the Arrow table-reading code.

codecov · 2023-08-15T12:47:28Z

Codecov Report

Merging #2646 (38a06e5) into main (961637e) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files

Files Changed	Coverage Δ
src/awkward/operations/ak_from_parquet.py	`91.17% <100.00%> (+1.39%)`	⬆️
src/awkward/operations/ak_to_arrow_table.py	`100.00% <100.00%> (ø)`

agoose77 · 2023-08-15T12:49:16Z

src/awkward/operations/ak_from_parquet.py

    if len(arrays) == 0:
        return wrap_layout(
-            subform.length_zero_array(highlevel=False), behavior=behavior
+            subform.length_zero_array(highlevel=False),
+            highlevel=highlevel,
+            behavior=behavior,


It doesn't look like we can actually trigger this pathway - the metadata function validates the path list to ensure it's non-empty. We should perhaps remove this case from the function.

If so, then it can be replaced by an

assert len(arrays) != 0

(It could be this PR or another one. At the moment, it's just lost coverage.)

agoose77 · 2023-08-15T13:53:57Z

@jim Pivarski I started this PR with the intent of solving #2337, but I realised that some prep work would be necessary.

Could you help me to figure out some of our design choices?

Why do we serialise tuples as a table with an unnamed field-of-struct, where the struct has string names, rather than as a top-level field, i.e. why do we do

pyarrow.Table
: struct<0: large_string not null, 1: large_string not null, 2: large_string not null> not null
child 0, 0: large_string not null
child 1, 1: large_string not null
child 2, 2: large_string not null
----

Rather than

pyarrow.Table
0: large_string not null
1: large_string not null
2: large_string not null
----

Why do we store metadata for non-extensionarray arrays? This metadata is currently used to reconstruct the top-level RecordArray and parents (which is not covered by the extensionarray logic). In my view, our non-extension output should be considered a lossy, tool-friendly format that coerces the output to something that can be written & read form from a file. As such, we don't need to include any metadata.
Should we start versioning our extension-array system / metadata? Unlike to_buffers, this feels less stable and more likely to change in future.

jpivarski · 2023-08-15T16:29:42Z

Why do we serialise tuples as a table with an unnamed field-of-struct?

Background: Arrow's Struct and Table are distinct things, but we only have Record. Sometimes we have to ask for user help in deciding which of these to turn a Record into, and that's why we have ak.to_arrow and ak.to_arrow_table. In Arrow, Struct is a type that can be mixed with any other types, but Table is always at top-level, and includes partitioning (a Table contains RecordBatches). Serialized formats like Parquet must be Tables, regardless of the Awkward type.

You're asking about Awkward tuples, i.e. Records without field names. Arrow doesn't have an equivalent of tuples: Structs must have names (which is why, unfortunately, pyarrow.compute.extract_regex requires the regex to have named captures, rather than simple parenthesized groups). Arrow Tables also need names. Either way, names must be invented, though we have a rule for creating these names: "0", "1", "2", etc.

But even if either Arrow Struct or Arrow Table had names, we couldn't choose one or the other on that basis, since Awkward Records can appear at any level and Arrow Table is a special top-level only thing. I think you're asking, "Given that ak.to_arrow_table is being called, why is a RecordArray with field names split into separate Table fields while a RecordArray without field names is put into a Table with one empty field name?" I.e. why does this if-statement depend on check[-1].is_tuple?

awkward/src/awkward/operations/ak_to_arrow_table.py

Lines 108 to 155 in a960a56

    
           parameters = None 
        
           paarrays, pafields = [], [] 
        
           if check[-1].is_record and not check[-1].is_tuple: 
        
               optiontype_fields = [] 
        
               for name in check[-1].fields: 
        
                   paarrays.append( 
        
                       layout[name].to_arrow( 
        
                           list_to32=list_to32, 
        
                           string_to32=string_to32, 
        
                           bytestring_to32=bytestring_to32, 
        
                           emptyarray_to=emptyarray_to, 
        
                           categorical_as_dictionary=categorical_as_dictionary, 
        
                           extensionarray=extensionarray, 
        
                           count_nulls=count_nulls, 
        
                           record_is_scalar=record_is_scalar, 
        
                       ) 
        
                   ) 
        
                   pafields.append( 
        
                       pyarrow.field(name, paarrays[-1].type).with_nullable( 
        
                           layout[name].is_option 
        
                       ) 
        
                   ) 
        
                   if check[-1].contents[check[-1].field_to_index(name)].is_option: 
        
                       optiontype_fields.append(name) 
        
               parameters = [ 
        
                   {"optiontype_fields": optiontype_fields}, 
        
                   {"record_is_scalar": record_is_scalar}, 
        
               ] 
        
               for x in check: 
        
                   parameters.append({direct_Content_subclass(x).__name__: x._parameters}) 
        
           else: 
        
               paarrays.append( 
        
                   layout.to_arrow( 
        
                       list_to32=list_to32, 
        
                       string_to32=string_to32, 
        
                       bytestring_to32=bytestring_to32, 
        
                       emptyarray_to=emptyarray_to, 
        
                       categorical_as_dictionary=categorical_as_dictionary, 
        
                       extensionarray=extensionarray, 
        
                       count_nulls=count_nulls, 
        
                       record_is_scalar=record_is_scalar, 
        
                   ) 
        
               ) 
        
               pafields.append( 
        
                   pyarrow.field("", paarrays[-1].type).with_nullable(layout.is_option) 
        
               )

I'm not really sure. Are Arrow Table names "0", "1", "2", etc. legal? Even if they're legal, are they reasonable in the Arrow world?

The most important thing is that data can round-trip through Arrow. The choice between turning Awkward tuples into a Table or a Struct-in-Table only changes how Arrow users see it if they're not using Awkward.

It would also be hard to change, since I don't see a way for a user to opt-in during a deprecation cycle.

Why do we store metadata for non-extensionarray arrays?

We choose between the full extensionarray framework or no extensionarray based on user choice, not the type of the array. Even if a simple Awkward Array could easily round-trip through Arrow without any extensionarray metadata, the extensionarray will be there if extensionarray=True (the default).

Maybe you're asking something different: if extensionarray=False, it should be a "lossy, tool-friendly format." It doesn't carry metadata because it can't—there's nowhere to put it.

Should we start versioning our extension-array system / metadata?

We do expect it to change and you're right that we should be concerned about backward compatibility. We should feel free to add fields to the JSON, but have a fallback for if they're not there, pulling them from the dict with get.

There are drawbacks to including a version number in the protocol. Let's consider two ways of doing schema evolution:

There is a version number; a file has version v_file and the code has version v_code. If v_code >= v_file, then the code knows that it can read the file, because the code is full of if-then logic or it has separate functions for each version or something similar. If v_code < v_file, then the (old) code does not attempt to read the file because it assumes that future versions of the format may be changed in arbitrary ways. Advantage: future versions of the code are allowed to change in arbitrary ways. Disadvantage: forward-compatibility is categorically excluded. It may be that a future change is minor and it ought to be possible for old versions of the code to read it, but they're not allowed to in principle.
There is no version number; changes to the format must always be narrowing. That is, a future version of the format must be a subtype of past versions of the format, for example, by adding fields to a record. It's particularly easy to do this with JSON because JSON is dynamically typed, new fields can always be queried with get, rather than __getitem__, and if old code doesn't know about a new field, it doesn't even ask for it. Advantage: all versions of the code can read all versions of the format, though old code may be less capable than new code. (For example, a format change can patch a missing case that prevented lossless round-trips; the old code or files made with old code would still suffer from that lossiness. But it would be generally readable, despite that caveat.) Disadvantage: new formats must always be subtypes of old formats, which limits how the format can change.

Python's pickle is an example of a versioned format, and it makes a lot of sense: the pickle developers want to be able to arbitrarily change the byte-format to make it as efficient as possible. The limitation, though, is that old versions of Python can't read new pickle protocols, and therefore they can't introduce new protocols too quickly or users would be faced with a big matrix of what's compatible and what's not.

Avro schemas are versionless, and its schema evolution works by requiring the reader code's schema to be a subtype of the schema in the file (the writer code's schema). This makes a lot of sense, too, because these schemas are for user data, data analysts have to change their schemas frequently, and they need flexibility to use different code versions. Since there are a lot of record types, there are a lot of opportunities to insert new fields.

Since our extensionarray format is a JSON document in metadata (scaling with the size of the type, not the size of the array), the byte-for-byte performance is not relevant. And I agree that we'll likely need more changes beyond the "NullType Arrow field must be nullable" fix. Because of that reason, I think we don't want a version number. I assume that we can keep adding features by adding fields to the JSON object indefinitely: new ones need to have a default value, much like Avro.

agoose77 · 2023-08-15T18:05:37Z

which is why, unfortunately, pyarrow.compute.extract_regex requires the regex to have named captures, rather than simple parenthesized groups

We could parse the regex itself, and rewrite unnamed groups as named groups. I won't make a PR, because writing the grammar for RE2 will be time consuming and it's not yet asked for!

I think you're asking, "Given that ak.to_arrow_table is being called, why is a RecordArray with field names split into separate Table fields while a RecordArray without field names is put into a Table with one empty field name?"

Yes, this.

agoose77 · 2023-08-15T18:08:42Z

Maybe you're asking something different: if extensionarray=False, it should be a "lossy, tool-friendly format." It doesn't carry metadata because it can't—there's nowhere to put it.

Nearly. I mean to say; this metadata is used to reconstruct properties of the Awkward layout if we round-trip. However, we don't guarantee round-trip for extensionarray=False, so why do we include the metadata required to do so for that case? It's not actively harmful, but I was curious.

Because of that reason, I think we don't want a version number.

Yes, in fact I ask because of that bugfix; we'll need to include metadata even for non-record arrays, e.g. if we have a bare empty array. I wondered about restructuring the metadata, but ultimately it's not hugely user-facing.

jpivarski · 2023-08-15T19:22:35Z

However, we don't guarantee round-trip for extensionarray=False, so why do we include the metadata required to do so for that case? It's not actively harmful, but I was curious.

Oh! Just because it's less complicated to do so and the computational cost is not significant enough to worry about it.

we'll need to include metadata even for non-record arrays, e.g. if we have a bare empty array.

That would be fine. In fact, isn't it already the case? We need to carry information about option-types, even with no records present. Also,

I wondered about restructuring the metadata, but ultimately it's not hugely user-facing.

it's absolutely an implementation detail. It needs to be consistent so that we don't lose the ability to read old files, but it doesn't have to be beautifully organized, just reasonably well. (This is what I mean when I say, "Someplace to stash some metadata.")

In fact, one easy way to ensure that future formats are subtypes of past formats is to populate the old fields and add a new field called "entirely_new_data_for_version_two", and within that, an "entirely_new_data_for_version_three", and so on. I'm not actually suggesting it (the format needs to be non-horrible for developers, too), but if we're not primarily concerned with presentation, we'll never be locked out of full backward and forward compatibility.

agoose77 · 2023-08-15T20:51:20Z

@jpivarski are you happy for this to merge as-is, and I'll follow up with the bugfix PR we discussed above?

jpivarski

@jpivarski are you happy for this to merge as-is, and I'll follow up with the bugfix PR we discussed above?

Yes, this is ready to be merged.

jpivarski · 2023-08-15T21:58:34Z

src/awkward/operations/ak_from_parquet.py

    if len(arrays) == 0:
        return wrap_layout(
-            subform.length_zero_array(highlevel=False), behavior=behavior
+            subform.length_zero_array(highlevel=False),
+            highlevel=highlevel,
+            behavior=behavior,


If so, then it can be replaced by an

assert len(arrays) != 0

(It could be this PR or another one. At the moment, it's just lost coverage.)

agoose77 added 3 commits August 15, 2023 13:21

fix: support highlevel in all branches for from_parquet

6676847

refactor: improve readability of to_arrow_table

e9f9f44

refactor: improve readability of to_arrow_table

4729d00

test: add simple test

cc3379b

agoose77 requested a review from jpivarski August 15, 2023 12:48

agoose77 commented Aug 15, 2023

View reviewed changes

fix: drop partial change

11a0447

agoose77 temporarily deployed to docs-preview August 15, 2023 13:02 — with GitHub Actions Inactive

test: importorskip

a284992

agoose77 temporarily deployed to docs-preview August 15, 2023 13:19 — with GitHub Actions Inactive

jpivarski approved these changes Aug 15, 2023

View reviewed changes

agoose77 added 2 commits August 15, 2023 23:05

refactor: drop unused code-path

9f8b620

Merge branch 'main' into agoose77/fix-highlevel-from-parquet

38a06e5

agoose77 enabled auto-merge (squash) August 15, 2023 22:19

agoose77 temporarily deployed to docs-preview August 15, 2023 22:24 — with GitHub Actions Inactive

agoose77 merged commit 75ca3bf into main Aug 15, 2023
33 checks passed

agoose77 deleted the agoose77/fix-highlevel-from-parquet branch August 15, 2023 22:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: support `highlevel=False` in all branches for `from_parquet` #2646

fix: support `highlevel=False` in all branches for `from_parquet` #2646

agoose77 commented Aug 15, 2023

codecov bot commented Aug 15, 2023 •

edited

Loading

agoose77 Aug 15, 2023

jpivarski Aug 15, 2023

agoose77 commented Aug 15, 2023

jpivarski commented Aug 15, 2023

agoose77 commented Aug 15, 2023

agoose77 commented Aug 15, 2023

jpivarski commented Aug 15, 2023

agoose77 commented Aug 15, 2023

jpivarski left a comment

jpivarski Aug 15, 2023

fix: support highlevel=False in all branches for from_parquet #2646

fix: support highlevel=False in all branches for from_parquet #2646

Conversation

agoose77 commented Aug 15, 2023

codecov bot commented Aug 15, 2023 • edited Loading

Codecov Report

agoose77 Aug 15, 2023

Choose a reason for hiding this comment

jpivarski Aug 15, 2023

Choose a reason for hiding this comment

agoose77 commented Aug 15, 2023

jpivarski commented Aug 15, 2023

agoose77 commented Aug 15, 2023

agoose77 commented Aug 15, 2023

jpivarski commented Aug 15, 2023

agoose77 commented Aug 15, 2023

jpivarski left a comment

Choose a reason for hiding this comment

jpivarski Aug 15, 2023

Choose a reason for hiding this comment

fix: support `highlevel=False` in all branches for `from_parquet` #2646

fix: support `highlevel=False` in all branches for `from_parquet` #2646

codecov bot commented Aug 15, 2023 •

edited

Loading