Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support highlevel=False in all branches for from_parquet #2646

Merged
merged 8 commits into from
Aug 15, 2023

Conversation

agoose77
Copy link
Collaborator

This PR ensures that ak.from_parquet can handle highlevel=False in all branches. It also adds some useful comments to the Arrow table-reading code.

@codecov
Copy link

codecov bot commented Aug 15, 2023

Codecov Report

Merging #2646 (38a06e5) into main (961637e) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files
Files Changed Coverage Δ
src/awkward/operations/ak_from_parquet.py 91.17% <100.00%> (+1.39%) ⬆️
src/awkward/operations/ak_to_arrow_table.py 100.00% <100.00%> (ø)

Comment on lines 221 to 225
if len(arrays) == 0:
return wrap_layout(
subform.length_zero_array(highlevel=False), behavior=behavior
subform.length_zero_array(highlevel=False),
highlevel=highlevel,
behavior=behavior,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like we can actually trigger this pathway - the metadata function validates the path list to ensure it's non-empty. We should perhaps remove this case from the function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, then it can be replaced by an

assert len(arrays) != 0

(It could be this PR or another one. At the moment, it's just lost coverage.)

@agoose77 agoose77 temporarily deployed to docs-preview August 15, 2023 13:02 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview August 15, 2023 13:19 — with GitHub Actions Inactive
@agoose77
Copy link
Collaborator Author

@jim Pivarski I started this PR with the intent of solving #2337, but I realised that some prep work would be necessary.

Could you help me to figure out some of our design choices?

  1. Why do we serialise tuples as a table with an unnamed field-of-struct, where the struct has string names, rather than as a top-level field, i.e. why do we do
    pyarrow.Table
    : struct<0: large_string not null, 1: large_string not null, 2: large_string not null> not null
    child 0, 0: large_string not null
    child 1, 1: large_string not null
    child 2, 2: large_string not null
    ----
    
    Rather than
    pyarrow.Table
    0: large_string not null
    1: large_string not null
    2: large_string not null
    ----
    
  2. Why do we store metadata for non-extensionarray arrays? This metadata is currently used to reconstruct the top-level RecordArray and parents (which is not covered by the extensionarray logic). In my view, our non-extension output should be considered a lossy, tool-friendly format that coerces the output to something that can be written & read form from a file. As such, we don't need to include any metadata.
  3. Should we start versioning our extension-array system / metadata? Unlike to_buffers, this feels less stable and more likely to change in future.

@jpivarski
Copy link
Member

  1. Why do we serialise tuples as a table with an unnamed field-of-struct?

Background: Arrow's Struct and Table are distinct things, but we only have Record. Sometimes we have to ask for user help in deciding which of these to turn a Record into, and that's why we have ak.to_arrow and ak.to_arrow_table. In Arrow, Struct is a type that can be mixed with any other types, but Table is always at top-level, and includes partitioning (a Table contains RecordBatches). Serialized formats like Parquet must be Tables, regardless of the Awkward type.

You're asking about Awkward tuples, i.e. Records without field names. Arrow doesn't have an equivalent of tuples: Structs must have names (which is why, unfortunately, pyarrow.compute.extract_regex requires the regex to have named captures, rather than simple parenthesized groups). Arrow Tables also need names. Either way, names must be invented, though we have a rule for creating these names: "0", "1", "2", etc.

But even if either Arrow Struct or Arrow Table had names, we couldn't choose one or the other on that basis, since Awkward Records can appear at any level and Arrow Table is a special top-level only thing. I think you're asking, "Given that ak.to_arrow_table is being called, why is a RecordArray with field names split into separate Table fields while a RecordArray without field names is put into a Table with one empty field name?" I.e. why does this if-statement depend on check[-1].is_tuple?

parameters = None
paarrays, pafields = [], []
if check[-1].is_record and not check[-1].is_tuple:
optiontype_fields = []
for name in check[-1].fields:
paarrays.append(
layout[name].to_arrow(
list_to32=list_to32,
string_to32=string_to32,
bytestring_to32=bytestring_to32,
emptyarray_to=emptyarray_to,
categorical_as_dictionary=categorical_as_dictionary,
extensionarray=extensionarray,
count_nulls=count_nulls,
record_is_scalar=record_is_scalar,
)
)
pafields.append(
pyarrow.field(name, paarrays[-1].type).with_nullable(
layout[name].is_option
)
)
if check[-1].contents[check[-1].field_to_index(name)].is_option:
optiontype_fields.append(name)
parameters = [
{"optiontype_fields": optiontype_fields},
{"record_is_scalar": record_is_scalar},
]
for x in check:
parameters.append({direct_Content_subclass(x).__name__: x._parameters})
else:
paarrays.append(
layout.to_arrow(
list_to32=list_to32,
string_to32=string_to32,
bytestring_to32=bytestring_to32,
emptyarray_to=emptyarray_to,
categorical_as_dictionary=categorical_as_dictionary,
extensionarray=extensionarray,
count_nulls=count_nulls,
record_is_scalar=record_is_scalar,
)
)
pafields.append(
pyarrow.field("", paarrays[-1].type).with_nullable(layout.is_option)
)

I'm not really sure. Are Arrow Table names "0", "1", "2", etc. legal? Even if they're legal, are they reasonable in the Arrow world?

The most important thing is that data can round-trip through Arrow. The choice between turning Awkward tuples into a Table or a Struct-in-Table only changes how Arrow users see it if they're not using Awkward.

It would also be hard to change, since I don't see a way for a user to opt-in during a deprecation cycle.

  1. Why do we store metadata for non-extensionarray arrays?

We choose between the full extensionarray framework or no extensionarray based on user choice, not the type of the array. Even if a simple Awkward Array could easily round-trip through Arrow without any extensionarray metadata, the extensionarray will be there if extensionarray=True (the default).

Maybe you're asking something different: if extensionarray=False, it should be a "lossy, tool-friendly format." It doesn't carry metadata because it can't—there's nowhere to put it.

  1. Should we start versioning our extension-array system / metadata?

We do expect it to change and you're right that we should be concerned about backward compatibility. We should feel free to add fields to the JSON, but have a fallback for if they're not there, pulling them from the dict with get.

There are drawbacks to including a version number in the protocol. Let's consider two ways of doing schema evolution:

  • There is a version number; a file has version v_file and the code has version v_code. If v_code >= v_file, then the code knows that it can read the file, because the code is full of if-then logic or it has separate functions for each version or something similar. If v_code < v_file, then the (old) code does not attempt to read the file because it assumes that future versions of the format may be changed in arbitrary ways. Advantage: future versions of the code are allowed to change in arbitrary ways. Disadvantage: forward-compatibility is categorically excluded. It may be that a future change is minor and it ought to be possible for old versions of the code to read it, but they're not allowed to in principle.
  • There is no version number; changes to the format must always be narrowing. That is, a future version of the format must be a subtype of past versions of the format, for example, by adding fields to a record. It's particularly easy to do this with JSON because JSON is dynamically typed, new fields can always be queried with get, rather than __getitem__, and if old code doesn't know about a new field, it doesn't even ask for it. Advantage: all versions of the code can read all versions of the format, though old code may be less capable than new code. (For example, a format change can patch a missing case that prevented lossless round-trips; the old code or files made with old code would still suffer from that lossiness. But it would be generally readable, despite that caveat.) Disadvantage: new formats must always be subtypes of old formats, which limits how the format can change.

Python's pickle is an example of a versioned format, and it makes a lot of sense: the pickle developers want to be able to arbitrarily change the byte-format to make it as efficient as possible. The limitation, though, is that old versions of Python can't read new pickle protocols, and therefore they can't introduce new protocols too quickly or users would be faced with a big matrix of what's compatible and what's not.

Avro schemas are versionless, and its schema evolution works by requiring the reader code's schema to be a subtype of the schema in the file (the writer code's schema). This makes a lot of sense, too, because these schemas are for user data, data analysts have to change their schemas frequently, and they need flexibility to use different code versions. Since there are a lot of record types, there are a lot of opportunities to insert new fields.

Since our extensionarray format is a JSON document in metadata (scaling with the size of the type, not the size of the array), the byte-for-byte performance is not relevant. And I agree that we'll likely need more changes beyond the "NullType Arrow field must be nullable" fix. Because of that reason, I think we don't want a version number. I assume that we can keep adding features by adding fields to the JSON object indefinitely: new ones need to have a default value, much like Avro.

@agoose77
Copy link
Collaborator Author

which is why, unfortunately, pyarrow.compute.extract_regex requires the regex to have named captures, rather than simple parenthesized groups

We could parse the regex itself, and rewrite unnamed groups as named groups. I won't make a PR, because writing the grammar for RE2 will be time consuming and it's not yet asked for!

I think you're asking, "Given that ak.to_arrow_table is being called, why is a RecordArray with field names split into separate Table fields while a RecordArray without field names is put into a Table with one empty field name?"

Yes, this.

@agoose77
Copy link
Collaborator Author

Maybe you're asking something different: if extensionarray=False, it should be a "lossy, tool-friendly format." It doesn't carry metadata because it can't—there's nowhere to put it.

Nearly. I mean to say; this metadata is used to reconstruct properties of the Awkward layout if we round-trip. However, we don't guarantee round-trip for extensionarray=False, so why do we include the metadata required to do so for that case? It's not actively harmful, but I was curious.

Because of that reason, I think we don't want a version number.

Yes, in fact I ask because of that bugfix; we'll need to include metadata even for non-record arrays, e.g. if we have a bare empty array. I wondered about restructuring the metadata, but ultimately it's not hugely user-facing.

@jpivarski
Copy link
Member

However, we don't guarantee round-trip for extensionarray=False, so why do we include the metadata required to do so for that case? It's not actively harmful, but I was curious.

Oh! Just because it's less complicated to do so and the computational cost is not significant enough to worry about it.

we'll need to include metadata even for non-record arrays, e.g. if we have a bare empty array.

That would be fine. In fact, isn't it already the case? We need to carry information about option-types, even with no records present. Also,

I wondered about restructuring the metadata, but ultimately it's not hugely user-facing.

it's absolutely an implementation detail. It needs to be consistent so that we don't lose the ability to read old files, but it doesn't have to be beautifully organized, just reasonably well. (This is what I mean when I say, "Someplace to stash some metadata.")

In fact, one easy way to ensure that future formats are subtypes of past formats is to populate the old fields and add a new field called "entirely_new_data_for_version_two", and within that, an "entirely_new_data_for_version_three", and so on. I'm not actually suggesting it (the format needs to be non-horrible for developers, too), but if we're not primarily concerned with presentation, we'll never be locked out of full backward and forward compatibility.

@agoose77
Copy link
Collaborator Author

@jpivarski are you happy for this to merge as-is, and I'll follow up with the bugfix PR we discussed above?

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpivarski are you happy for this to merge as-is, and I'll follow up with the bugfix PR we discussed above?

Yes, this is ready to be merged.

Comment on lines 221 to 225
if len(arrays) == 0:
return wrap_layout(
subform.length_zero_array(highlevel=False), behavior=behavior
subform.length_zero_array(highlevel=False),
highlevel=highlevel,
behavior=behavior,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, then it can be replaced by an

assert len(arrays) != 0

(It could be this PR or another one. At the moment, it's just lost coverage.)

@agoose77 agoose77 enabled auto-merge (squash) August 15, 2023 22:19
@agoose77 agoose77 temporarily deployed to docs-preview August 15, 2023 22:24 — with GitHub Actions Inactive
@agoose77 agoose77 merged commit 75ca3bf into main Aug 15, 2023
33 checks passed
@agoose77 agoose77 deleted the agoose77/fix-highlevel-from-parquet branch August 15, 2023 22:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants