
Categorical Array #8463

Closed

ilan-gold opened this issue Nov 17, 2023 · 19 comments · Fixed by #8723

Comments

@ilan-gold
Contributor

Is your feature request related to a problem?

We are looking to improve compatibility between AnnData and xarray (see scverse/anndata#744), and so categoricals are naturally on our roadmap. Thus, I think some sort of standard-use categoricals array would be desirable. It seems something similar has come up with netCDF, although my knowledge is limited so this issue may be more distinct than I am aware. So what comes of this issue may solve two birds with one stone, or it may work towards some common solution that can at least help both use-cases (AnnData and netCDF ENUM).

Describe the solution you'd like

The goal would be a standard-use categorical data type xarray container of some sort. I'm not sure what form this can take.

We have something functional here that inherits from ExplicitlyIndexedNDArrayMixin and returns pandas.CategoricalDtype. So let's say this implementation would be at least a conceptual starting point to work from (it also seems not dissimilar to what is done here for new CF types).

Some issues:

  1. I have no idea what a standard "return type" for an xarray categorical array should be (i.e., numpy with the categories applied, pandas, something custom, etc.). So I'm not sure if using pandas.CategoricalDtype is acceptable, as I do in the linked implementation. Relatedly....
  2. I don't think using pandas.CategoricalDtype really helps with the already existing CF Enum need if you want the return type to be some sort of numpy array (although again, not sure about the return type). As I understand it, though, the whole point of categoricals is to use integers as the base type and then only show "strings" outwardly (printing, equality operations, accessors, etc.) while the internals are based on integers. So I'm not really sure numpy is even an option here. Maybe we roll our own solution?
  3. I am not sure this is the right level at which to implement this (maybe it should be a Variable? I don't think so, but I am just a beginner here 😄 )
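For reference on point 2, pandas' own categorical type already follows the integers-internally, labels-outwardly layout described above; a quick illustration:

```python
import pandas as pd

# A pandas Categorical is backed by small integer codes plus a
# categories index; only the labels are shown outwardly.
cat = pd.Categorical(['a', 'b', 'a', 'b', 'c'])
print(list(cat.codes))       # [0, 1, 0, 1, 2] -- the integer base type
print(list(cat.categories))  # ['a', 'b', 'c'] -- the label mapping
```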

It seems you may want, in addition to the array container, some sort of i/o functionality for this feature (so maybe some on-disk specification?).

Describe alternatives you've considered

I think there is some route via VariableCoder as hinted here i.e., using encode/decode. This would probably be more general purpose as we could encode directly to other data types if using pandas is not desirable. Maybe this would be a way to support both netCDF and returning a pandas.CategoricalDtype (again, not sure what the netCDF return type should be for ENUM).

Additional context

So just for reference, the current behavior of to_xarray with pandas.CategoricalDtype is object dtype from numpy:

import pandas as pd
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'b', 'c']})
df['cat'] = df['cat'].astype('category')
df.to_xarray()['cat']
# <xarray.DataArray 'cat' (index: 5)>
# array(['a', 'b', 'a', 'b', 'c'], dtype=object)
# Coordinates:
#   * index    (index) int64 0 1 2 3 4

And as stated in the netCDF issue, for that use-case, the information about ENUM is lost (from what I can read).

Apologies if I'm missing something here! Feedback welcome! Sorry if this is a bit chaotic, just trying to cover my bases.


@TomNicholas
Contributor

@pydata/xarray Ilan's question seems interesting and potentially impactful if we can support his use case, but I don't know enough about categorical arrays to weigh in. Can anyone provide guidance?

@rabernat
Contributor

It would be great to reuse the Arrow Dictionary concept here somehow.

@ilan-gold
Contributor Author

@rabernat / @TomNicholas thanks for the replies :) @rabernat What concept here are you referring to?

@dcherian
Contributor

Alternatively, could you write an array-api compliant container instead? This would wrap a (potentially lazy) numpy-like array of "codes" with dtype=np.intXX and a CategoricalDtype to preserve the mapping from codes to category labels.

We could "decode" netcdf Enum dtypes to this new container and also allow converting from CF-style "Flag Variables".
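A minimal sketch of that idea (all names here are hypothetical, not an existing xarray API): a thin container holding the integer codes plus a pandas.CategoricalDtype, materializing labels only on indexing:

```python
import numpy as np
import pandas as pd

class CategoricalContainer:
    """Hypothetical sketch: wraps integer codes plus a
    pandas.CategoricalDtype mapping codes to category labels."""

    def __init__(self, codes, dtype: pd.CategoricalDtype):
        self.codes = np.asarray(codes)  # could be a lazy array instead
        self.dtype = dtype

    @property
    def shape(self):
        return self.codes.shape

    @property
    def ndim(self):
        return self.codes.ndim

    def __getitem__(self, key):
        # Materialize string labels only on access.
        return self.dtype.categories.to_numpy()[self.codes[key]]

dtype = pd.CategoricalDtype(categories=['clear', 'cloudy'])
arr = CategoricalContainer([1, 0, 1, 0], dtype)
```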

@benbovy
Member

benbovy commented Dec 1, 2023

Supporting both netcdf enum and pandas.Categorical seems sensible to me, although I don't know if it will be easy to do so via a single array wrapper.

On a related note, I'm wondering if we shouldn't start thinking about how to generalize the approach based on ExplicitlyIndexedNDArrayMixin that @ilan-gold describes above:

  1. maybe refactor the indexing classes in xarray.core.indexing and expose (some of) them as public API such that these could be subclassed in 3rd-party libraries (related to Lazy indexing arrays as a stand-alone package #5081).
  2. provide built-in helpers to facilitate and uniformize integration of any pandas.ExtensionArray and/or pyarrow.ExtensionArray with Xarray.

I guess we will need to address this if we want to keep Xarray up to date with Pandas switching backends from Numpy to Arrow? The question about how to work with Arrow arrays in Xarray has also been asked multiple times (AFAIK in Xarray geospatial extensions).

@ilan-gold
Contributor Author

Alternatively, could you write an array-api compliant container instead? This would wrap a (potentially lazy) numpy-like array of "codes" with dtype=np.intXX and a CategoricalDtype to preserve the mapping from codes to category labels.

We could "decode" netcdf Enum dtypes to this new container and also allow converting from CF-style "Flag Variables".

Yes, I will try to do that. If I run into any roadblocks, I will post here.

On a related note, I'm wondering if we shouldn't start thinking about how to generalize the approach based on ExplicitlyIndexedNDArrayMixin that @ilan-gold describe above:

I do think this would be good. I think what I have makes sense, but making it more explicit/public to end-users would be cool.

@ilan-gold
Contributor Author

ilan-gold commented Dec 11, 2023

This is not the place for this but it is related: how do I encode the necessary information for encoding/decoding? From what I understand, this belongs in attrs but that seems to be reserved for the returned value of ncattrs.

In the docs, it seems that attrs is the place that decode should read from, before the information moves to encoding: "Variable attributes no more applicable after the decoding, are dropped and stored in the Variable.encoding to make them available to the encode method, which performs the inverse transformation."

@ilan-gold
Contributor Author

ilan-gold commented Dec 11, 2023

Relatedly, how strictly should the information in attrs or encoding follow the netCDF convention for encoding the enum type information? NetCDF stores this enum mapping as a dictionary on-disk and in-memory. But one thing I think about, for example in AnnData, is that your categories might be stored in a zarr array (i.e., an array of strings where the mapping values are just the positional indices) and you don't want to have to read that in on load. So a dict for the enum type would not be ideal, and we would want something like:

categories = zarr.open_array('unique_categories')
encoding = {
    ...
    'enum_dict': {
        'from': list(range(len(categories))),
        'to': categories,
    },
}

instead of

encoding = {
    ...
    'enum_dict': dict(zip(zarr.open_array('unique_categories'), list(range(len(categories))))) 
}

which is what netcdf has (i.e., an on-disk dictionary, instead of a stored array) and would force the user to read in all the categories data as well as the code data. But this might be too big of an ask.

@ilan-gold
Contributor Author

ilan-gold commented Jan 10, 2024

So I have started working on a branch, but have come across a bit of an issue. When the Dataset is created, it loses any reference to the CategoricalArray - this is a problem for serialization because you need a reference to the underlying integer (or whatever dtype) arrays to write out. For example, consider xr_ds here:

import netCDF4 as nc
import xarray as xr
import numpy as np


ds = nc.Dataset("mre.nc", "w", format="NETCDF4")
cloud_type_enum = ds.createEnumType(int, "cloud_type", {"clear": 0, "cloudy": 1})
ds.createDimension("time", size=10)
x = np.arange(10)
ds.createVariable("x", np.int32, dimensions=("time",))
ds.variables["x"][:] = x
# {'cloud_type': <class 'netCDF4._netCDF4.EnumType'>: name = 'cloud_type', numpy dtype = int64, fields/values ={'clear': 0, 'cloudy': 1}}
ds.createVariable("cloud", cloud_type_enum, dimensions=("time",))
ds["cloud"][:] = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
ds.close()

# -- Open dataset with xarray
xr_ds = xr.open_dataset("./mre.nc")

xr_ds, when opened with my branch, lacks any reference to CategoricalArray (which is similar to e.g., BoolTypeArray), even though the Dataset was initiated with it.

This was quite surprising, actually - there were 4 or more reconstructions of the Variable object containing the codes, and each time _data got more complicated (although it still contained some reference to CategoricalArray), until the final Variable completely lacks a way to recover the CategoricalArray.

For example, calling xr.open_dataset("./mre.nc") has Variable constructions for CategoricalArray(codes=..., categories={0: 'clear', 1: 'cloudy'}, ordered=False), then LazilyIndexedArray(array=CategoricalArray(codes=..., categories={0: 'clear', 1: 'cloudy'}, ordered=False), key=BasicIndexer((slice(None, None, None),))), and then that same LazilyIndexedArray three more times, but xr_ds['cloud'].variable._data is MemoryCachedArray(array=NumpyIndexingAdapter(array=array(['cloudy', 'clear', 'cloudy', 'clear', 'cloudy', 'clear', 'cloudy', 'clear', 'clear', 'cloudy'], dtype='<U6'))). This lacks a reference to the original data.

In terms of investigating, there seems to be something going on here where backend_ds in this line has a reference to CategoricalArray via backend_ds['cloud'].variable._data (which isn't amazing but something) but then loses it somehow once you step into the next function, _dataset_from_backend_dataset!


TL;DR I will keep digging, but if anyone has any quick suggestions on how to keep a reference to the underlying CategoricalArray object when constructing a Dataset for NetCDF datasets, let me know. In writing this comment, I discovered that this is not a general problem, i.e., the following works well:

import numpy as np
import xarray as xr

codes = np.array([0, 1, 2, 1, 0])
categories = {0: 'foo', 1: 'jazz', 2: 'bar'}
cat_arr = xr.coding.variables.CategoricalArray(codes=codes, categories=categories)
v = xr.Variable(("time",), cat_arr, fastpath=True)
ds = xr.Dataset({'cloud': v})
ds['cloud']._variable._data
# CategoricalArray(codes=..., categories={0: 'foo', 1: 'jazz', 2: 'bar'}, ordered=False)

So perhaps it would make sense to release this as a general feature first instead of focusing on NetCDF. Perhaps that is a good place to start for a PR.

@kmuehlbauer
Contributor

@ilan-gold Sorry, just saw this popping up in my inbox. There is work to read/write netCDF4.EnumType in #8147. This PR is close to being finished.

The proposed solution is to add metadata to `encoding["dtype"]`:

        if isinstance(var.datatype, netCDF4.EnumType):
            encoding["dtype"] = np.dtype(
                data.dtype,
                metadata={
                    "enum": var.datatype.enum_dict,
                    "enum_name": var.datatype.name,
                },
            )

This follows what h5py is doing, besides adding the enum_name. Could this be used here?
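For illustration, the metadata mechanism from that PR can be exercised directly with plain numpy (the keys below mirror the snippet above):

```python
import numpy as np

# np.dtype accepts a read-only metadata mapping; h5py uses the same
# mechanism to carry enum information alongside the base integer dtype.
dt = np.dtype(
    "int64",
    metadata={"enum": {"clear": 0, "cloudy": 1}, "enum_name": "cloud_type"},
)
codes = np.array([1, 0, 1], dtype=dt)  # behaves like a normal int64 array
print(dt.metadata["enum"])
```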

@dcherian
Contributor

For background, these various arrays will all get lost on data load, which I think is what you're discovering.

I would instead work on a CategoricalArray that follows the array API. Once that works, we can discuss lazy loading and decoding to that array type. Potentially, this could be built as a numpy custom dtype, but I haven't followed the progress there.

@ilan-gold
Contributor Author

@dcherian That sounds like a plan to me. I will check that out.

@ilan-gold
Contributor Author

@kmuehlbauer Thanks for this, that is good to see.

@ilan-gold
Contributor Author

So I have spent some time looking into this. @kmuehlbauer Thanks for directing me to that PR, not sure how I missed it. This PR is interesting and contains some good stuff. It's definitely a step in the right direction.

@dcherian I don't think an array-api compliant categorical container is possible, unless we implement array methods (i.e., xp.exp) directly on the codes, which seems not good. So if we're not going to do that, trying to create an array-api compliant container for categoricals reduces to the problem of creating an array-api container for strings, since this is the array that __getitem__ on some categorical container should return. Creating an array-api compliant container for strings, though, seems out of scope (the thread below this comment should explain more).

Separately, it still seems strange to me that one can write code that declares a Variable object during i/o process and then lose that object or any reference to it or its constituent parts by the end of the process. This being said, as I noted, the following works in that using ds doesn't seem to error out in any horrifying way with normal usage:

import numpy as np
import xarray as xr

codes = np.array([0, 1, 2, 1, 0])
categories = {0: 'foo', 1: 'jazz', 2: 'bar'}
cat_arr = xr.coding.variables.CategoricalArray(codes=codes, categories=categories)
v = xr.Variable(("time",), cat_arr)
ds = xr.Dataset({'cloud': v})

# Some simple relevant operations, there are probably others but these are what I came up with quickly to test
ds['cloud']._variable._data # This is a CategoricalArray
ds.where(ds.cloud == 1) # all nans as expected even though this is the correct code
ds.where(ds.cloud == "jazz") # correct masking
ds.where(ds.cloud.isin(["jazz", "bar"])) # more correct masking, but more complicated
ds.groupby("cloud") # three groups as expected

The current status of CategoricalArray is that it inherits from ExplicitlyIndexedNDArrayMixin, which means that everything stays kosher within xarray and normal development patterns (from what I can tell).

Would a PR that tests this array and its behavior with Variable/Dataset be acceptable? We can punt on i/o. This could give people the opportunity to work with the object before integrating into officially supported i/o.

Concretely, I could see a PR containing CategoricalArray alone (and tests), and then a separate PR that adds some sort of _writeable_data attribute to Variable or DataArray for i/o, which would allow us to store a reference to the codes there. The encoding for the categories could then be stored as in #8147 or something similar.

Does this seem feasible?

@ilan-gold
Contributor Author

ilan-gold commented Jan 16, 2024

P.S. Another step along this path could be adding #5287 for pandas extension array support, which would give us Categorical support because pandas' categorical array is an extension array. But I'm not sure this really changes the roadmap I've laid out, beyond changing the return type of our categorical array wrapper. It's tough to say without seeing #5287 implemented.

@dcherian
Contributor

AFAICT you're subclassing from ExplicitlyIndexedNDArrayMixin to bypass our checks for what can go in an Xarray object. The purpose of these classes is to enable lazy indexing of data from disk, and to abstract away gory indexing-support details for internal use.

If you satisfy this duck array check, you won't need to do that:

xarray/xarray/core/utils.py

Lines 260 to 270 in 33d51c8

def is_duck_array(value: Any) -> TypeGuard[T_DuckArray]:
    if isinstance(value, np.ndarray):
        return True
    return (
        hasattr(value, "ndim")
        and hasattr(value, "shape")
        and hasattr(value, "dtype")
        and (
            (hasattr(value, "__array_function__") and hasattr(value, "__array_ufunc__"))
            or hasattr(value, "__array_namespace__")
        )
    )

I agree that Array API compliance probably makes no sense. Would it make sense to support __array_function__ and __array_ufunc__? IIUC you are allowed to return NotImplemented from these, so adding those two methods would let Xarray know that CategoricalArray is a real array.
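A rough sketch of what that could look like (hypothetical class, not existing xarray code): the container exposes ndim/shape/dtype plus the two protocol hooks, which return NotImplemented so numpy raises TypeError instead of silently operating on the codes:

```python
import numpy as np

class CategoricalDuck:
    """Hypothetical container passing duck-array attribute checks."""

    def __init__(self, codes, categories):
        self.codes = np.asarray(codes)
        self.categories = categories  # mapping from code to label

    @property
    def shape(self):
        return self.codes.shape

    @property
    def ndim(self):
        return self.codes.ndim

    @property
    def dtype(self):
        return self.codes.dtype

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Ufuncs like np.exp make no sense on category codes.
        return NotImplemented

    def __array_function__(self, func, types, args, kwargs):
        return NotImplemented

arr = CategoricalDuck([0, 1, 0], {0: 'clear', 1: 'cloudy'})
```

With both hooks present, a check like is_duck_array above would accept this object, while numpy operations on it fail loudly.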

I agree that writing a NEP-18 wrapper for Pandas Extension arrays may be quite valuable. It would allow IntervalArray too, but perhaps only <=2D (?).

@dcherian
Contributor

We discussed this at a meeting today.

  • The numpy custom dtype is probably the best solution but now is probably not the best time to do it since things are in flux.
  • I haven't followed the array api string dtype discussion; but one could raise NotImplementedError from a bunch of methods that don't make sense. This seems not ideal.
  • Could modify your current container to have __array_function__ and __array_ufunc__ overrides. These are free to return NotImplemented, so this might fit best.
  • (repeating from above) a NEP-18 wrapper for Pandas Extension arrays may be quite valuable. It would allow IntervalArray too but perhaps only <=2D (?).

@ilan-gold
Contributor Author

ilan-gold commented Jan 17, 2024

Thanks @dcherian . I will look into these. I think 1+3 from your above points go together, and form one concrete path forward. Point 4 is a separate path forward from what I can tell. I think I will look into the fourth point closely then because it seems to offer the best way forward IMO in terms of generalization, unless this is not an issue (#5287) that is seen as important. If point 4 also goes belly up, we will need to fall back to a combination of 1+3.

And I am strongly opposed to point 2 for a variety of reasons, especially now that I have looked into it.
