fix: support `highlevel=False` in all branches for `from_parquet` #2646
  if len(arrays) == 0:
      return wrap_layout(
-         subform.length_zero_array(highlevel=False), behavior=behavior
+         subform.length_zero_array(highlevel=False),
+         highlevel=highlevel,
+         behavior=behavior,
It doesn't look like we can actually trigger this pathway: the metadata function validates the path list to ensure it's non-empty. We should perhaps remove this case from the function.
If so, then it can be replaced by an `assert len(arrays) != 0`. (It could be this PR or another one. At the moment, it's just lost coverage.)
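To make the point of the diff concrete, here is a minimal sketch of the pattern under discussion. It does not use Awkward itself: `Layout`, `Array`, `wrap_layout`, and `concatenate_or_empty` are hypothetical stand-ins written for this illustration, and the real `wrap_layout` in Awkward has a different, richer signature. The idea is only that every return path should forward `highlevel`, including the empty-input branch.

```python
class Layout:
    """Hypothetical stand-in for a low-level layout object."""

    def __init__(self, data):
        self.data = list(data)


class Array:
    """Hypothetical stand-in for the high-level array wrapper."""

    def __init__(self, layout, behavior=None):
        self.layout = layout
        self.behavior = behavior


def wrap_layout(layout, highlevel=True, behavior=None):
    # Simplified stand-in: wrap only when the caller asked for a high-level result.
    if highlevel:
        return Array(layout, behavior=behavior)
    return layout


def concatenate_or_empty(arrays, empty_layout, highlevel=True, behavior=None):
    # Hypothetical helper mirroring the shape of the branch in the diff.
    if len(arrays) == 0:
        # If highlevel were not forwarded here, this branch would always
        # return the wrapper, even for highlevel=False.
        return wrap_layout(empty_layout, highlevel=highlevel, behavior=behavior)
    merged = Layout(x for a in arrays for x in a.data)
    return wrap_layout(merged, highlevel=highlevel, behavior=behavior)
```

If the empty branch really is unreachable, as suggested above, the `if` could instead become an `assert len(arrays) != 0` in front of the merging code.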
@jpivarski I started this PR with the intent of solving #2337, but I realised that some prep work would be necessary. Could you help me figure out some of our design choices?
Background: Arrow's Struct and Table are distinct things, but we only have Record. Sometimes we have to ask for user help in deciding which of these to turn a Record into, and that's why we have
You're asking about Awkward tuples, i.e. Records without field names. Arrow doesn't have an equivalent of tuples: Structs must have names (which is why, unfortunately, pyarrow.compute.extract_regex requires the regex to have named captures, rather than simple parenthesized groups). Arrow Tables also need names. Either way, names must be invented, though we have a rule for creating these names:
But even if either Arrow Struct or Arrow Table had names, we couldn't choose one or the other on that basis, since Awkward Records can appear at any level and Arrow Table is a special top-level only thing.
I think you're asking, "Given that awkward/src/awkward/operations/ak_to_arrow_table.py Lines 108 to 155 in a960a56
I'm not really sure. Are Arrow Table names
The most important thing is that data can round-trip through Arrow. The choice between turning Awkward tuples into a Table or a Struct-in-Table only changes how Arrow users see it if they're not using Awkward. It would also be hard to change, since I don't see a way for a user to opt in during a deprecation cycle.
We choose between the full extensionarray framework or no extensionarray based on user choice, not the type of the array. Even if a simple Awkward Array could easily round-trip through Arrow without any extensionarray metadata, the extensionarray will be there if
Maybe you're asking something different: if
We do expect it to change and you're right that we should be concerned about backward compatibility. We should feel free to add fields to the JSON, but have a fallback for if they're not there, pulling them from the dict with
There are drawbacks to including a version number in the protocol. Let's consider two ways of doing schema evolution:
Python's pickle is an example of a versioned format, and it makes a lot of sense: the pickle developers want to be able to arbitrarily change the byte format to make it as efficient as possible. The limitation, though, is that old versions of Python can't read new pickle protocols, so the developers can't introduce new protocols too quickly or users would be faced with a big matrix of what's compatible and what's not.
Avro schemas are versionless, and Avro's schema evolution works by requiring the reader code's schema to be a subtype of the schema in the file (the writer code's schema). This makes a lot of sense, too, because these schemas are for user data, data analysts have to change their schemas frequently, and they need flexibility to use different code versions. Since there are a lot of record types, there are a lot of opportunities to insert new fields.
Since our extensionarray format is a JSON document in metadata (scaling with the size of the type, not the size of the array), byte-for-byte performance is not relevant. And I agree that we'll likely need more changes beyond the "NullType Arrow field must be nullable" fix. For that reason, I think we don't want a version number. I assume that we can keep adding features by adding fields to the JSON object indefinitely: new ones need to have a default value, much like Avro.
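A rough sketch of the versionless approach described here, using only the standard library. The field names (`record_is_tuple`, `extra_flag`) and their defaults are invented for illustration and are not the actual keys in Awkward's Arrow metadata. The reader knows a default for every field, so documents written before a field existed still load, and a reader that predates a field simply ignores it.

```python
import json

# Hypothetical fields and defaults; not the real metadata keys.
DEFAULTS_V1 = {"record_is_tuple": False}
DEFAULTS_V2 = {"record_is_tuple": False, "extra_flag": False}


def read_metadata(raw, defaults):
    """Parse a JSON metadata document, falling back to defaults for absent keys."""
    doc = json.loads(raw)
    # dict.get supplies the default when an older writer never emitted the key;
    # keys the reader does not know about are ignored.
    return {key: doc.get(key, default) for key, default in defaults.items()}


old_document = json.dumps({"record_is_tuple": True})
new_document = json.dumps({"record_is_tuple": True, "extra_flag": True})

# New reader, old document: the missing field takes its default.
new_reader_old_doc = read_metadata(old_document, DEFAULTS_V2)
# Old reader, new document: the unknown field is skipped.
old_reader_new_doc = read_metadata(new_document, DEFAULTS_V1)
```

No version number appears anywhere: compatibility in both directions comes from each added field having a default.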
We could parse the regex itself, and rewrite unnamed groups as named groups. I won't make a PR, because writing the grammar for RE2 will be time consuming and it's not yet asked for!
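For a sense of what such a rewrite might look like, here is a deliberately naive sketch. It is not a grammar for RE2: it only skips backslash escapes and simple bracketed character classes, and it does not handle POSIX classes such as `[[:alpha:]]` or `\Q...\E` quoting. The function name `name_unnamed_groups` and the `group_` prefix are assumptions made for this example, and the check at the end uses Python's `re` module rather than RE2.

```python
import re


def name_unnamed_groups(pattern, prefix="group_"):
    """Rewrite each plain "(" group as "(?P<prefix N>", numbering from 0."""
    out = []
    count = 0
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escape sequence (such as "\(") through untouched.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            # Inside [...], parentheses are literal characters.
            if ch == "]":
                in_class = False
            out.append(ch)
            i += 1
            continue
        if ch == "[":
            in_class = True
            out.append(ch)
            i += 1
            # A leading "^" and a leading "]" belong to the class itself.
            if i < len(pattern) and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < len(pattern) and pattern[i] == "]":
                out.append("]")
                i += 1
            continue
        if ch == "(" and pattern[i + 1:i + 2] != "?":
            # A plain capturing group: give it a generated name.
            out.append(f"(?P<{prefix}{count}>")
            count += 1
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


rewritten = name_unnamed_groups(r"(\d+)-(\w+)")
match = re.match(rewritten, "12-ab")
```

Groups that already start with `(?`, such as `(?:...)` or `(?P<name>...)`, are left alone, so existing names are preserved.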
Yes, this.
Nearly. I mean to say; this metadata is used to reconstruct properties of the Awkward layout if we round-trip. However, we don't guarantee round-trip for
Yes, in fact I ask because of that bugfix; we'll need to include metadata even for non-record arrays, e.g. if we have a bare empty array. I wondered about restructuring the metadata, but ultimately it's not hugely user-facing.
Oh! Just because it's less complicated to do so and the computational cost is not significant enough to worry about it.
That would be fine. In fact, isn't it already the case? We need to carry information about option-types, even with no records present. Also,
it's absolutely an implementation detail. It needs to be consistent so that we don't lose the ability to read old files, but it doesn't have to be beautifully organized, just reasonably well. (This is what I mean when I say, "Someplace to stash some metadata.") In fact, one easy way to ensure that future formats are subtypes of past formats is to populate the old fields and add a new field called "entirely_new_data_for_version_two", and within that, an "entirely_new_data_for_version_three", and so on. I'm not actually suggesting it (the format needs to be non-horrible for developers, too), but if we're not primarily concerned with presentation, we'll never be locked out of full backward and forward compatibility.
@jpivarski are you happy for this to merge as-is, and I'll follow up with the bugfix PR we discussed above?
Yes, this is ready to be merged.
This PR ensures that `ak.from_parquet` can handle `highlevel=False` in all branches. It also adds some useful comments to the Arrow table-reading code.