Use pyarrow.field to preserve nullability in Arrow conversion #602

Merged
merged 6 commits into from Dec 15, 2020

Conversation

jpivarski
Member

No description provided.

@jpivarski jpivarski linked an issue Dec 15, 2020 that may be closed by this pull request
@jpivarski
Member Author

It is quite fortunate that all the old Arrow tests checked for agreement at the level of Python lists. Changing the nullability/option-type of data at various levels doesn't affect any of those tests (but the new tests require the right types).

@jpivarski
Member Author

@lgray and @nsmith- I just noticed that Arrow can finally convert records nested within variable-length lists! (That is, particles in events!) I've been waiting for that since—searches email—August 2016.

Behold:

import os

import awkward as ak

# tmp_path is pytest's temporary-directory fixture
original = ak.Array(
    [
        [{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}],
        [],
        [{"x": 4, "y": 4.4}, {"x": 5, "y": 5.5}],
        [],
        [],
        [
            {"x": 6, "y": 6.6},
            {"x": 7, "y": 7.7},
            {"x": 8, "y": 8.8},
            {"x": 9, "y": 9.9},
        ],
    ]
)

ak.to_parquet(original, os.path.join(tmp_path, "data.parquet"))
reconstituted = ak.from_parquet(os.path.join(tmp_path, "data.parquet"))
assert reconstituted.tolist() == [
    [{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}],
    [],
    [{"x": 4, "y": 4.4}, {"x": 5, "y": 5.5}],
    [],
    [],
    [
        {"x": 6, "y": 6.6},
        {"x": 7, "y": 7.7},
        {"x": 8, "y": 8.8},
        {"x": 9, "y": 9.9},
    ],
]
assert str(reconstituted.type) == '6 * var * {"x": int64, "y": float64}'

The part I did was making sure that it comes back without option-types, but it's more of a big deal that we can write and read back types like N * var * {some record} now.

Let me try another one:

>>> events = ak.Array([
...     {"MET": 1, "muons": [{"px": 1, "py": 1}, {"px": 2, "py": 2}], "jets": [{"x": 1, "y": 1}]},
...     {"MET": 2, "muons": [{"px": 3, "py": 3}, {"px": 4, "py": 4}], "jets": [{"x": 2, "y": 2}]}])
>>> ak.to_parquet(events, "events.parquet")
>>> events2 = ak.from_parquet("events.parquet")
>>> events2.tolist()
[{'MET': 1, 'muons': [{'px': 1, 'py': 1}, {'px': 2, 'py': 2}], 'jets': [{'x': 1, 'y': 1}]},
 {'MET': 2, 'muons': [{'px': 3, 'py': 3}, {'px': 4, 'py': 4}], 'jets': [{'x': 2, 'y': 2}]}]
>>> events2.type
2 * {"MET": int64, "muons": var * {"px": int64, "py": int64}, "jets": var * {"x": int64, "y": int64}}

@lgray
Contributor

lgray commented Dec 15, 2020

image

@jpivarski jpivarski merged commit cad41a3 into main Dec 15, 2020
@jpivarski jpivarski deleted the jpivarski/preserve-nullability-in-arrow-and-parquet branch December 15, 2020 23:42
Development

Successfully merging this pull request may close these issues.

Investigate pyarrow.field to preserve nullability in Arrow conversion
2 participants