Awkward table #2597

Closed
HealthyPear opened this issue Jul 28, 2023 · 7 comments
Labels
feature New feature or request

Comments

@HealthyPear

Description of new feature

Now that #2545 is already tackling #2468, I feel like a cool feature would be to create a table like astropy.table.QTable

Is this already possible, or is it linked to #1391 and should it come after #2545?

@HealthyPear HealthyPear added the feature New feature or request label Jul 28, 2023
@agoose77
Collaborator

What would the purpose of such a QTable-like feature be? If it's about grouping data in a column-based structure, we have record arrays. Is this what you're referring to, or is it the Quantity aspect to this that I'm not considering?

@HealthyPear
Author

I am just starting to use this project, so maybe what you mention is already enough, but an astropy QTable is a table made of Quantities (arrays with attached units, which is what triggered #2468) that supports all table-like operations (masking, indexing, etc.).

More details here
https://docs.astropy.org/en/stable/table/

@jpivarski jpivarski added this to Unprioritized in Finalization Jan 19, 2024
@jpivarski
Member

This looks to me like a DataFrame, and Awkward Arrays can be placed in Pandas DataFrames with awkward-pandas. Also, @martindurant is looking at doing this for Polars and CuDF as well.

If I've overlooked something and what you mean is more than a DataFrame, or more than a DataFrame of Awkward Arrays that have units, then we can reopen this. (#2468 is still going to happen, and these features should compose easily.) At the moment, I'm cleaning old issues.

@jpivarski jpivarski removed this from Unprioritized in Finalization Jan 19, 2024
@HealthyPear
Author

Hi Jim,

thanks for coming back to this!

To be honest I still need to play around with awkward-array so I probably missed something during the last development efforts.

My original idea was to add a C++ interface and associated Python bindings to this data format in order to read files directly as an awkward "table" (with units, like astropy's QTables). As far as I know, anything based on pandas cannot work with arrays in the cells, am I right?

A file from that data format might look like this,

A      B  B1
"foo"  3  array[4.3, 5, 7.9]
"bar"  6  array[4.3, 5, 7.9, 2.4, 1, 6.4]

with metadata containing e.g. the units of quantities B and B1.

@HealthyPear
Author

indeed awkward-pandas seems to go in that direction actually (apart from the units support)

@jpivarski
Member

All this time, I've been under the wrong impression about what this request is—I thought it was to produce something table-like in the Awkward ecosystem, but you want to connect Awkward Arrays to existing file formats (XCDF) and interfaces (QTable), right?

That's a different story (and maybe should be a new issue, if we can narrow in on it).

For DataFrame-like functionality with ragged arrays, it's possible to do in awkward-pandas like this:

>>> import awkward as ak
>>> import awkward_pandas as akpd
>>> import pandas as pd
>>> 
>>> a = ak.Array(["foo", "bar"])
>>> b = ak.Array([3, 6])
>>> b1 = ak.Array([[4.3, 5, 7.9], [4.3, 5, 7.9, 2.4, 1, 6.4]])
>>> 
>>> df = pd.DataFrame({
...     "a": akpd.from_awkward(a),
...     "b": akpd.from_awkward(b),
...     "b1": akpd.from_awkward(b1),
... })
>>> df
     a  b                              b1
0  foo  3                 [4.3, 5.0, 7.9]
1  bar  6  [4.3, 5.0, 7.9, 2.4, 1.0, 6.4]

These aren't Python objects; they're stored in a packed way:

>>> df["b1"].values._data.layout.offsets.data
array([0, 3, 9])
>>> df["b1"].values._data.layout.content.data
array([4.3, 5. , 7.9, 4.3, 5. , 7.9, 2.4, 1. , 6.4])

so there's probably an efficient way to get them to and from other packed formats, which presumably XCDF and QTable are. I was looking at the XCDF documentation and couldn't find a description of the disk format itself.
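To make the packed layout concrete, here is a pure-Python sketch of how the two buffers printed above encode the ragged column. The helper names `unpack` and `pack` are hypothetical and are not part of Awkward or awkward-pandas; they only illustrate the offsets/content idea.

```python
# Buffers copied from the session above.
offsets = [0, 3, 9]
content = [4.3, 5.0, 7.9, 4.3, 5.0, 7.9, 2.4, 1.0, 6.4]


def unpack(offsets, content):
    """Rebuild ragged rows: row i is content[offsets[i]:offsets[i + 1]]."""
    return [content[start:stop] for start, stop in zip(offsets[:-1], offsets[1:])]


def pack(rows):
    """Flatten ragged rows into an (offsets, content) pair."""
    offsets, content = [0], []
    for row in rows:
        content.extend(row)
        offsets.append(len(content))
    return offsets, content


rows = unpack(offsets, content)
```

Any format that stores a flat value buffer plus row boundaries can, in principle, be mapped onto these two buffers without building Python objects per row.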

If you're planning to go through a disk format anyway, it's likely that Parquet is an efficient way to do it, end to end (no conversion into inefficient formats in the middle).

But if you want an in-memory transfer, that would be inefficient because (a) it would go through disk access and (b) Parquet is a very packed format. QTable's documentation says that it goes through Arrow and pyarrow, as does Awkward, so a more streamlined, in-memory connection could be made through Arrow arrays or Tables. (Awkward ↔ Arrow is mostly zero-copy, with some corner-case exceptions.)

@HealthyPear
Author

Indeed, the original use case is still that of writing Python bindings for this format, which would allow reading an XCDF file directly as an Awkward table instead of a dictionary of NumPy arrays (which is what has been done first).

Something like a QTable would be awesome. The problem is that, even though it has amazing support for units, it is based on NumPy, which doesn't like ragged arrays, so I cannot use it to read the file at once as a single table.


3 participants