Should Awkward Arrays be usable as Pandas columns? #350

Closed
jpivarski opened this issue Jul 23, 2020 · 11 comments · Fixed by #460
Labels
policy Choice of behavior

Comments

@jpivarski
Member

This was one of the design goals described in the original motivations document, but it has required some non-intuitive sorcery to implement and it's not clear to me that it's a valuable feature. To be clear, we're talking about

>>> import awkward1 as ak
>>> import pandas as pd
>>> pd.DataFrame({"awkward": ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])})
           awkward
0  [1.1, 2.2, 3.3]
1               []
2       [4.4, 5.5]

and not

>>> ak.pandas.df(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
                values
entry subentry        
0     0            1.1
      1            2.2
      2            3.3
2     0            4.4
      1            5.5

The explicit conversion into a MultiIndex DataFrame with ak.pandas.df has no issues: the implementation is straightforward and I know how I would use it—there are plenty of Pandas functions for dealing with MultiIndex. For example,

>>> df = ak.pandas.df(ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
>>> df.unstack()
         values          
subentry      0    1    2
entry                    
0           1.1  2.2  3.3
2           4.4  5.5  NaN
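The (entry, subentry) index in the output above can be sketched in plain Python. This is only an illustration of the flattening idea, not the implementation of ak.pandas.df; the helper name to_entry_subentry is made up for this example.

```python
def to_entry_subentry(ragged):
    """Flatten a list of lists into ((entry, subentry), value) rows.

    Hypothetical helper: it mimics the MultiIndex layout only,
    it is not how ak.pandas.df is implemented.
    """
    rows = []
    for entry, sublist in enumerate(ragged):
        for subentry, value in enumerate(sublist):
            rows.append(((entry, subentry), value))
    return rows


rows = to_entry_subentry([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
for index, value in rows:
    print(index, value)
```

Note that the empty list at entry 1 contributes no rows, which is why entry 1 is absent from the MultiIndex DataFrame as well.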

But for the Awkward-in-Pandas, the only things I know of that can be used directly are ufuncs:

>>> pd.DataFrame({"awkward": ak.Array([[1, 2, 3], [], [4, 5]])}) + 100
           awkward
0  [101, 102, 103]
1               []
2       [104, 105]

but not all ufuncs, for some Pandas reason:

>>> import numpy as np
>>> np.sqrt(pd.DataFrame({"awkward": ak.Array([[1, 2, 3], [], [4, 5]])}))
Traceback (most recent call last):
  File "/home/pivarski/irishep/awkward-1.0/awkward1/highlevel.py", line 996, in __getattr__
    raise AttributeError("no field named {0}".format(repr(where)))
AttributeError: no field named 'sqrt'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: loop of ufunc does not support argument 0 of type Array which has no callable sqrt method

Presumably, we could narrow in on that reason and get it to work, but there are a lot of Pandas functions to test. The fundamental problem is that Awkward objects are "black boxes" to Pandas. Sure, we can put them in a DataFrame, but what's Pandas going to do with them once they're there?

There are other downsides to making Awkward Arrays subclasses of pandas.core.arrays.base.ExtensionArray (so that they can be columns). For one thing, it implies that we have to import pandas at startup, which can cost up to a second on slow machines or might try to import a broken installation of Pandas even if the user isn't planning on using Pandas. (If Pandas is not installed, we can change the class hierarchy, but that means ak.Array behaves differently, depending on whether you've installed Pandas, even if you're not using it.)

To avoid the above, the current implementation only makes ak.Array inherit from pandas.core.arrays.base.ExtensionArray if you try to use it in Pandas, which can be detected by a call to dtype. But for consistency, that's even worse, since the inheritance of ak.Array now changes at runtime, depending on whether you've ever tried to use an Awkward Array in a DataFrame. This came up in a difference in behavior (reported on Slack) that I couldn't reproduce at first because my test didn't invoke Pandas. Namely, pandas.core.arrays.base.ExtensionArray defines some methods, and these methods exist or don't exist on ak.Array depending on that history, unless they're overshadowed by my own implementations. At the very least, I should overshadow all the non-underscored ones so that their existence is not history-dependent, but that fills up the ak.Array namespace with names I don't necessarily want.
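The history dependence described here can be reproduced with a toy example. All class names below (ArrayBase, ExtensionLike, Array) and the register function are stand-ins invented for illustration; this is a sketch of runtime base-class changes in general, not the actual awkward1 code.

```python
class ArrayBase:
    """Placeholder base class, so that Array.__bases__ can be reassigned."""


class ExtensionLike:
    """Stand-in for an externally defined interface class."""

    def astype(self, dtype):
        return "inherited astype({0})".format(dtype)


class Array(ArrayBase):
    def __init__(self, data):
        self.data = data


a = Array([1, 2, 3])
before = hasattr(a, "astype")  # the method does not exist yet


def register():
    # Hypothetical opt-in hook: splice the interface into the class bases.
    if ExtensionLike not in Array.__bases__:
        Array.__bases__ = (ArrayBase, ExtensionLike)


register()
after = hasattr(a, "astype")  # the same object now has the method
print(before, after)
```

The same object answers hasattr differently before and after register() is called, which is the kind of behavior that is hard to reproduce in a test that never touches the registration path.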

  • to_numpy: This would be fine; it would call ak.to_numpy, though the other methods don't have an underscore, such as tolist (for consistency with NumPy).
  • dtype: Already tricky, since Pandas requires a new one, AwkwardDType, and Dask requires np.dtype("O").
  • shape: Pandas needs this to be one-dimensional, which is misleading for an Awkward Array. Preferably, Awkward Arrays would have no shape at all; the combined dtype and shape can only be fully captured by ak.type.
  • ndim: Much like shape, it's misleading for this to always be 1.
  • nbytes: This is fine, and other libraries expect such a property, too.
  • astype: This was the surprise that triggered this issue: I didn't think Awkward Arrays had an astype, since it's not clear what it should mean. For changing numeric types, there's an open PR ("Operation to change the number type", #346), but it's a new function since it doesn't change the whole type of the array; it descends to the leaves where the numbers are.
  • isna: This can go to ak.is_none, though "na" is not how we refer to missing data.
  • argsort: This can go to ak.argsort.
  • fillna: This can go to ak.fill_none, but see the note on isna above.
  • dropna: We don't have an ak.drop_none, but such a thing wouldn't be too hard to write.
  • shift: This one only makes sense for rectangular tables. (See the definition.)
  • unique: We don't have an ak.unique and there could be some subtleties there. We don't have a definition for record equality, for example, and string equality is already handled through a behavioral extension.
  • searchsorted: Only makes sense if the data are actually sorted. Should there be an axis=1 version of this for variable-length lists? Usually, physics events are unsorted but the particles (axis=1) are sorted by pT.
  • factorize: This is a non-intuitive name, but it could be good to have an Awkward function that turns arrays into an IndexedArray of unique values. But for complex objects like records, this brings up the same issues as unique (above).
  • repeat: We don't have an ak.repeat, but that might be useful in some contexts. I usually find np.repeat and np.tile to be a pair that have to be used together, usually to make a Cartesian product (and we already have ak.cartesian).
  • take: This seems unnecessary to me, since we already have __getitem__ with integer arrays.
  • copy: I don't know if we have a high-level "copy" function, but we have the low-level ones to link it up.
  • view: This wouldn't make much sense for an Awkward Array. It's not a simple buffer.
  • ravel: Maybe the equivalent of this is ak.flatten? Flattening variable-length arrays, particularly ones that include records, is a different kind of thing from flattening rectilinear data.
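To illustrate the astype point in the list above, "descending to the leaves" can be sketched on plain nested Python lists. This is an assumption-laden illustration: map_leaves is a hypothetical name, and it is not the function proposed in #346.

```python
def map_leaves(data, convert):
    """Apply `convert` to every number, keeping list nesting and None as-is.

    Hypothetical helper operating on nested Python lists only.
    """
    if isinstance(data, list):
        return [map_leaves(item, convert) for item in data]
    if data is None:
        return None  # missing values stay missing
    return convert(data)


nested = [[1, 2, 3], [], [4, None]]
as_float = map_leaves(nested, float)
print(as_float)
```

Only the numbers change type; the list structure and the missing value are untouched, which is why this is a different operation from changing the type of the whole array.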

Given these mismatches, I'm strongly considering removing the Awkward-in-Pandas feature before Awkward1 actually becomes 1.0. The explicit conversion functions, ak.pandas.df and ak.pandas.dfs, would be kept.

But I might be wrong—there might be some fantastic use-case for Awkward-in-Pandas that I don't know about. This question is an informal vote on the feature. You might have been sent here by an error message, where the feature is provisionally removed with a way to opt-in. If you find it useful to include Awkward Arrays inside of Pandas DataFrames (distinct from the ak.pandas.df conversion), then say so here, describing the use-case. You can opt-in now by calling ak.pandas.register(), but if I don't hear from people saying that they really use it, the feature will be removed and you won't be able to use it past 1.0.

So let me know!

@mloning
Contributor

mloning commented Jul 27, 2020

I think we may have an interesting use case with sktime, but not sure if that justifies the extra maintenance burden.

We want to represent a variety of time series formats, including univariate, multivariate, panel data, unequal length and unequally sampled time series data. We currently (ab)use pandas by storing entire time series as a pd.Series or np.array in the cells of a pd.DataFrame, but that makes it very awkward to use for most people, which brings me here! Awkward array basically meets all our requirements except two:

  1. Handling of time indices: we'd like to support not just sequences, but series consisting of (index, value) pairs where the value represents the observed value and the index the time points at which we observed the value;
  2. Handling of metadata (column names, etc).

For time series analysis, 1) seems important. 2) would be nice to have, but not essential.

  • Does awkward array support date/time indexing?
  • Are you aware of any other libraries similar to awkward array? We only know of xarray (but they don't support ragged arrays) apart from a few other smaller libraries.

This has gone slightly off-topic; please let me know if there's a better place to discuss this!

For more info, see our condensed data container discussion here.

cc @fkiraly @prockenschaub @matteogales

@prockenschaub

The above argument by @mloning is related to an earlier enquiry, #289. Unfortunately, I haven't yet had the time to make a deep dive into the options that @jpivarski laid out in #289 to represent time indices. However, even without time indices, this gist should illustrate what sktime is hoping to achieve by using Awkward Arrays as pandas columns (for now assuming time is simply represented by position in the array).

As @mloning mentioned, if this is the only use case it might not justify the extra maintenance burden on you. In this case, maybe there is an option to factor the awkwardarray-as-extensionarray into a separate package?

@fkiraly

fkiraly commented Jul 27, 2020

As @mloning mentioned, if this is the only use case it might not justify the extra maintenance burden on you.

I slightly disagree with @prockenschaub and @mloning here, since I think that time series* are a pretty important use case, that to my knowledge none of the existing data container solutions is solving particularly well.

While indeed it would put the maintenance burden on you (and not on us 😃), I'd see it as a potential solution to a long-standing annoyance - the eternal search for a great family of data containers for time series* - and therefore with potential to become a "pillar of data science"...

*univariate, multivariate, panel data, unequal length and unequally sampled time series data, as @mloning says.

@fkiraly

fkiraly commented Jul 27, 2020

so, where do I vote

@jpivarski
Member Author

(This is the vote. It's informal.)

I'm reading what you've written above and also logged into https://gitter.im/Scikit-HEP/awkward-array so we can chat in real time.

@jpivarski
Member Author

So far, I see three things that you need: (1) time-valued data, (2) data-valued index, and (3) complex data structures.

(1) Pandas has always had good handling of time-valued data (from my perspective as someone who doesn't use time-valued data much).

(2) The data-valued index is, I think, the thing that sets Pandas (1d) and xarray (nd) apart from NumPy (nd). This isn't a failure of NumPy, either: it's a lower level component that handles the data in the arrays, whereas Pandas and xarray are higher level components that manage what the data means through indexing. Awkward Array has been targeting that lower-level slot, too: it's designed to handle x-y data as two arrays, rather than a unified object like a Pandas Series.

I've left a placeholder in the implementation called an array's Identities, based on some examples (AwkwardQL) that it would be useful to keep track of where each item came from for the sake of future joins. The Identities is underdeveloped, but with the understanding that it's a stub for future growth. What the Identities currently do is associate each quantity of an array with a unique integer—if that unique integer is then associated with elements of a data array, that's a Pandas/xarray style index.
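The idea of a unique integer per item acting as a join key can be sketched with plain lists. Everything here is hypothetical (the variable names and the timestamp strings are made up) and says nothing about how Identities are stored in Awkward Array.

```python
values = [1.1, 2.2, 3.3, 4.4, 5.5]

# One unique integer per item, assigned when the data are created.
identities = list(range(len(values)))

# A filter keeps each surviving value paired with its identity.
kept = [(i, v) for i, v in zip(identities, values) if v > 2.0]

# A separate "index" array keyed by the same integers (made-up timestamps).
timestamps = {
    0: "2020-07-23",
    1: "2020-07-24",
    2: "2020-07-25",
    3: "2020-07-26",
    4: "2020-07-27",
}

# Joining on the identity recovers a Pandas/xarray-style (index, value) pair.
indexed = [(timestamps[i], v) for i, v in kept]
print(indexed)
```

Because the identity travels with each value through the filter, the index can be attached afterward instead of being carried inside the array.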

(3) Awkward Array handles complex data structures in a unique way, which can't be done in a relational-like structure such as Pandas without multiple tables (DataFrames).

So the real issue here is that you want all three: you can get (1) and (2) from Pandas/xarray and (3) from Awkward, but not all together. Coming back to the original point of this thread, I'm not 100% sure that putting Awkward Arrays in Pandas DataFrames and Series will make that happen: the data will be physically contained in those containers, but without operations that know how to use it, it's not much use.

Moving conversation over to Gitter now...

@jpivarski
Member Author

A key function in that example is summarise_over_time, which is more efficiently computed over a ragged array of many sublists than a NumPy reducer would be over separate NumPy arrays (assuming that you have many arrays, as discussed on Gitter).
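To show why one ragged array can be reduced in a single pass, here is a sketch of a per-sublist sum over a flat content buffer plus offsets. The definition of summarise_over_time in the gist is not reproduced here; sum_sublists and the offsets layout below are assumptions for illustration, written in pure Python where an array library would use vectorized operations.

```python
from itertools import accumulate

# [[1, 2, 3], [], [4, 5]] stored as one flat buffer plus sublist boundaries.
content = [1, 2, 3, 4, 5]
offsets = [0, 3, 3, 5]


def sum_sublists(content, offsets):
    """Sum each sublist using prefix sums over the flat content."""
    prefix = [0] + list(accumulate(content))
    return [
        prefix[stop] - prefix[start]
        for start, stop in zip(offsets, offsets[1:])
    ]


sums = sum_sublists(content, offsets)
print(sums)
```

The reducer touches the flat buffer once, regardless of how many sublists there are, instead of calling a reducer separately on each small array.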

jpivarski added a commit that referenced this issue Jul 31, 2020
* jupyter-books 0.7.3 no longer supports 'headers'.

* Update GitHub README to reflect the focus on tutorials.

* Tweak sizes and port to setup.py.

* Drop test_0090 in light of #350 and the fact that it's now broken.
@jpivarski
Member Author

Awkward arrays as Pandas columns will be deprecated.

The next release will present a deprecation warning when you try to use an Awkward array in Pandas (as a Series or a DataFrame column) and it will be removed in 0.3.0.

The ak.pandas.df and ak.pandas.dfs functions will be combined and renamed as ak.to_pandas for consistency. The new function name already exists and the old ones will be removed in 0.3.0.

@jpivarski
Member Author

The next release I deploy will be 0.3.0 and will not have the Awkward-as-Pandas-column feature.

@TomAugspurger

But for the Awkward-in-Pandas, the only things I know of that can be used directly are ufuncs: [...] but not all ufuncs, for some Pandas reason:

The specific issue of np.sqrt(dataframe) failing for DataFrames with extension arrays comes down to DataFrame not defining __array_ufunc__ yet. That's a known issue: pandas-dev/pandas#23743 (I don't think anyone is working on it at the moment). But to your next point:

The fundamental problem is that Awkward objects are "black boxes" to Pandas. Sure, we can put them in a DataFrame, but what's Pandas going to do with them once they're there?

That's the essential motivation for ExtensionArrays: a way for pandas and these black boxes of arrays to interact through a well-defined interface. For example, cyberpandas provides vectorized implementations of ipaddress operations to pandas. pandas doesn't need to know about the memory layout of cyberpandas (a 2D int64 ndarray) or any IP operations for this to work.

Now, the interface is relatively young. Some things work and some things (as you've discovered) don't. But it is improving with each release.

There are other downsides to making Awkward Arrays subclasses of pandas.core.arrays.base.ExtensionArray

I personally wouldn't recommend making general-purpose objects like AwkwardArray try to implement pandas' Extension Array interface. As you note, there are some public methods that might clash with implementations in AwkwardArray. And I've never had good experiences making base classes dynamic. I'd instead recommend a dedicated object that implements the interface.

This raises some issues around putting AwkwardArray objects into a pandas DataFrame, if AwkwardArray doesn't implement the interface. I'm sure the pandas maintainers would be happy to discuss options there, like a __pandas_extension_array__ interface that objects can implement to return a pandas extension-array-compatible object. That would ensure that pd.DataFrame({"A": my_awkward_array}) keeps the data as an awkward array, rather than copying to an object-dtype ndarray.
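A loose coupling of this kind could look like the sketch below. To be clear, __pandas_extension_array__ is only an idea floated in this thread and is not a pandas API; ExtensionColumn, RaggedArray, and as_column are stand-ins invented for the example.

```python
class ExtensionColumn:
    """Stand-in for a dedicated object implementing a column interface."""

    def __init__(self, data):
        self.data = data  # keeps a reference, no copy

    def __len__(self):
        return len(self.data)


class RaggedArray:
    """Stand-in general-purpose array; it subclasses nothing from pandas."""

    def __init__(self, lists):
        self.lists = lists

    def __len__(self):
        return len(self.lists)

    def __pandas_extension_array__(self):
        # Hypothetical protocol method: hand back a dedicated wrapper.
        return ExtensionColumn(self)


def as_column(obj):
    """What a dataframe constructor might do under this hypothetical protocol."""
    hook = getattr(obj, "__pandas_extension_array__", None)
    if hook is not None:
        return hook()
    return list(obj)  # fallback: copy the items into generic objects


arr = RaggedArray([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
col = as_column(arr)
print(type(col).__name__, len(col))
```

The array class keeps its own namespace and base classes, and only the small wrapper has to follow the dataframe's interface.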

As a general point, though, the extension array interface is still evolving. If you run into issues, please do speak up, either here or on the pandas issue tracker!

@jpivarski
Member Author

This issue was describing the problems involved in making Awkward arrays subclasses of the Pandas ExtensionArray, particularly as dynamic subclasses, and justifying the decision to drop this original design requirement. The difficulties encountered and gaps in usefulness once implemented are surmountable technical problems, but for the stability of the Awkward Array library, I had to remove the dynamic subclassing. In the future, it would be great if we could make Awkward arrays into Pandas columns through a loose coupling like __pandas_extension_array__, but it shouldn't be done the way it was before this issue was opened.

In particular, I'm interested in getting this to work on cuDF, which is introducing ListDtype and StructDtype into its data model and is backed by Apache Arrow. (See #359.) If these column types are not black boxes but something that is understood by the dataframe class, then it could make sense to introduce non-NumPy, non-Pandas functions like ak.cartesian to the dataframe, implemented by Awkward. Showing that this is a usable interface, sensible for analysis, on a specialized dataframe like cuDF (which is only implemented for GPUs) would make a good argument for bringing it to Pandas and Dask DataFrame, showing that the changes required to make that work are justified. Or maybe by implementing it in cuDF, we might find that the first draft of an interface is wrong and needs to be tweaked.

By the way, the above are entirely my own aspirations, not a formal plan. I've been talking with the cuDF developers on their Slack, but only in the sense of floating this idea.
