
RFC: In-memory sparse array interchange #840

Opened by @hameerabbasi (Contributor)

Motivation

The sparse (also called PyData/Sparse) development team has been working on integration efforts with the ecosystem, most notably with SciPy, scikit-learn and others, with CuPy, PyTorch, JAX and TensorFlow also on the radar. One of the challenges we were facing was the lack of (possibly zero-copy) interchange between the different sparse array implementations. We believe this may be a pain point for many sparse array implementations moving forward.

This mirrors an issue seen for dense arrays previously, where the DLPack protocol was one of the first things to be standardised. We're hoping to achieve community consensus on a solution to a similar problem.

Luckily, nearly all sparse array formats (with the possible exception of DOK) are collections of dense arrays underneath. In addition, this problem has already been solved for on-disk arrays by the binsparse specification. @willow-ahrens is a co-author of that spec, and is also a collaborator on the sparse work.

Proposal

We propose introducing two new methods on array-API-compliant sparse array objects (such as those in sparse), described below.

__binsparse_descriptor__

This method returns a dict equivalent to the parsed JSON binsparse descriptor of the array.

__binsparse__

This method returns a dict[str, Array] of __dlpack__-compatible arrays, which are the constituent arrays of the sparse array. Each key corresponds to the equivalent key in the descriptor.
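
For illustration, here is roughly what these two methods might return for a small CSR array (a hypothetical csr_array of shape 4x6 with 3 stored values); the descriptor layout and array key names follow binsparse conventions, but the concrete values are illustrative:

descriptor = csr_array.__binsparse_descriptor__()
# {
#     "binsparse": {
#         "version": "0.1",
#         "format": "CSR",
#         "shape": [4, 6],
#         "number_of_stored_values": 3,
#         "data_types": {
#             "pointers_to_1": "uint64",
#             "indices_1": "uint64",
#             "values": "float64",
#         },
#     }
# }

constituents = csr_array.__binsparse__()
# {
#     "pointers_to_1": <uint64 array of length 5, supports __dlpack__>,
#     "indices_1": <uint64 array of length 3, supports __dlpack__>,
#     "values": <float64 array of length 3, supports __dlpack__>,
# }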

Introduction of a from_binsparse function.

If a library supports sparse arrays, its from_binsparse function should accept (zero-copy when possible) objects that follow this __binsparse__ protocol and have an equivalent sparse format within the library.

Pseudocode implementation

Here's a pseudocode example using two libraries, xp1 and xp2, both supporting sparse arrays:

# In library code:
xp2_sparray = xp2.from_binsparse(xp1_sparray, ...)

# This pseudocode impl is common between `xp1` and `xp2`
def from_binsparse(x: object, /, *, device: device | None = None, copy: bool | None = None) -> array:
    binsparse_descr = getattr(x, "__binsparse_descriptor__", None)
    binsparse_impl = getattr(x, "__binsparse__", None)
    if binsparse_impl is None or binsparse_descr is None:
        raise TypeError(...)

    binsparse_descriptor = binsparse_descr()
    # Will raise an error if the format/descriptor is unsupported.
    sparse_type = _type_from_binsparse_descriptor(binsparse_descriptor)
    # Import each constituent array via DLPack (zero-copy when possible).
    constituent_arrays = binsparse_impl()
    my_constituent_arrays = {
        k: from_dlpack(arr, device=device, copy=copy)
        for k, arr in constituent_arrays.items()
    }
    return sparse_type.from_strided_arrays(my_constituent_arrays, shape=...)

Parallel implementation in sparse: pydata/sparse#764
Parallel implementation in SciPy: scipy/scipy#22553

Alternative solutions

There are formats for on-disk sparse-array interchange [1] [2], but none for in-memory interchange; binsparse is the one that comes closest to offering in-memory interchange.

Pinging possibly interested parties:

Updated on 2024.10.09 as agreed in #840 (comment).

Activity

@pearu commented on Sep 6, 2024

A couple of quick notes.

First, dlpack and binsparse are self-contained specifications: dlpack provides a protocol for sharing strided arrays between different array libraries, and binsparse provides a unified description of various sparse formats. This proposal tries to glue these together by implicitly extending the DLPack protocol, but that contradicts the semantics of DLPack, which "describes the memory layout of dense, strided, n-dimensional arrays". In addition, the Python Array API standard specifies that DLPack should not support sparse arrays.

My suggestion is to cook up a sparse array interchange protocol that may use dlpack and binsparse (as these are obviously relevant pieces of this problem) but in a cleaner way.

For instance, binsparse specifies that the binsparse-compatible object provides a 2-tuple (<sparse array descriptor>, <list of sparse array data as strided arrays>). So, the appropriate place to deploy the dlpack protocol is to require that the arrays in the <list of sparse array data as strided arrays> must support the dlpack protocol.

For instance 2, introduce a "Python specification of binsparse" (similar to the one in dlpack) that consists of:

  1. a from_binsparse function that accepts objects that implement the binsparse protocol per 2. below
  2. a __binsparse__ method on the objects representing sparse arrays that will return a 2-tuple (<sparse array descriptor>, <list of sparse array data as strided arrays>) which is used by the from_binsparse function of the consumer library (sketched below).
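
A minimal sketch of how a consumer might use this 2-tuple variant; from_dlpack is the consumer's DLPack import function, and _build_sparse_array is a hypothetical internal helper:

def from_binsparse(x):
    # One dunder returning (descriptor, constituent strided arrays).
    descriptor, strided_arrays = x.__binsparse__()
    # Each constituent array must support the DLPack protocol, so the
    # consumer imports each one via its own from_dlpack (zero-copy when possible).
    imported = [from_dlpack(arr) for arr in strided_arrays]
    return _build_sparse_array(descriptor, imported)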

The specification should include the C-side interface as well.

What do you think?

@hameerabbasi (Contributor, Author) commented on Sep 7, 2024

While I'm on board with the proposed __binsparse__ protocol, I'm of the opinion that sparse arrays should have as few user-facing API differences as possible compared to strided arrays, i.e. using sparse arrays should be as frictionless from the Array API side as using strided arrays. Allocating a different function based on sparse vs. dense arrays (from_dlpack vs from_binsparse) seems counter to that.

I'm not opposed to a C-side specification; I'd welcome it, so that the benefits can extend beyond the Python ecosystem. We do have to be mindful that the binsparse specification requires key-value maps, and we would need to maintain their relative order. Perhaps an iterable of 2-tuples makes more sense in this case.

@pearu commented on Sep 7, 2024

While I'm on board with the proposed __binsparse__ protocol, I'm of the opinion that sparse arrays should have as few user-facing API differences as possible compared to strided arrays, i.e. using sparse arrays should be as frictionless from the Array API side as using strided arrays.

Absolutely. I presume that users of sparse arrays are aware of the advantages and disadvantages of using a particular sparse format over other sparse formats or strided arrays. Using a particular sparse format is a choice of optimization method that must be supported by the user-facing API.

Allocating a different function based on sparse and dense arrays (from_dlpack vs from_binsparse) seems counter to that.

from_dlpack is specifically defined for strided arrays, both by the Python Array API and dlpack. So, I find abusing the from_dlpack function for sparse arrays more confusing than providing a new API function which would follow the same design pattern that dlpack introduced: the pair (from_foo, __foo__) provides a minimal API for sharing data between provider and consumer libraries that support a protocol foo.

Btw, if one considers strided and sparse arrays semantically equivalent (as in PyTorch, for instance), a more intuitive approach would be to use asarray instead of from_dlpack as a common API function for constructing strided or sparse arrays from provider library objects. However, the Python Array API does not support that, and existing implementations (PyTorch, JAX, ...) do not mix the functionalities of asarray and from_dlpack. I don't suggest doing that here either.

@hameerabbasi (Contributor, Author) commented on Sep 9, 2024

Btw, if one considers strided and sparse arrays semantically equivalent (as in PyTorch, for instance), a more intuitive approach would be to use asarray instead of from_dlpack as a common API function for constructing strided or sparse arrays from provider library objects.

While I agree with a common API, the disadvantage I find in this approach is that from_dlpack requires the producer to pass ownership to the consumer, while asarray requires the producer to maintain ownership of the consumed array. This, in turn, means that there's no way to have a zero-copy transfer in asarray using the from_dlpack function; i.e. one would have to do something like def asarray(x, ...): return from_dlpack(copy(x), ...) somewhere to maintain ownership semantics, incurring a copy. See #840 (comment)

My intention was to have an API that essentially could support zero-copy interchange across libraries, regardless of whether the consumed array was strided, sparse or something else.

@pearu commented on Sep 9, 2024

While I agree with a common API; the disadvantage I find in this approach is that since from_dlpack requires the producer to pass ownership to the consumer;

This is incorrect. When using from_dlpack(x), the producer keeps owning the memory of x. See https://dmlc.github.io/dlpack/latest/python_spec.html#semantics.

@hameerabbasi (Contributor, Author) commented on Sep 9, 2024

While I agree with a common API; the disadvantage I find in this approach is that since from_dlpack requires the producer to pass ownership to the consumer;

This is incorrect. When using from_dlpack(x), the producer keeps owning the memory of x. See https://dmlc.github.io/dlpack/latest/python_spec.html#semantics.

Ah, in that case, yes: asarray can use from_dlpack and a possible from_binsparse, and still be zero-copy. Thanks for the correction; asarray seems like the right format-agnostic API for this.
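
A minimal sketch of such a format-agnostic asarray dispatching on the two protocols (hypothetical consumer code; a real asarray has more responsibilities):

def asarray(obj, /, *, device=None, copy=None):
    # Sparse producers: zero-copy interchange via the binsparse protocol.
    if hasattr(obj, "__binsparse__"):
        return from_binsparse(obj, device=device, copy=copy)
    # Strided producers: zero-copy interchange via DLPack.
    if hasattr(obj, "__dlpack__"):
        return from_dlpack(obj, device=device, copy=copy)
    ...  # fall back to the buffer protocol, nested sequences, etc.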

@hameerabbasi (Contributor, Author) commented on Sep 11, 2024

I've updated the issue with @pearu's feedback.

Moved to Stage 0 in Proposals on Sep 16, 2024
Moved from Stage 0 to Stage 1 in Proposals on Sep 16, 2024

[16 remaining items hidden]

@pearu commented on Dec 3, 2024

Libraries may support more formats than are supported by the binsparse protocol, and may need to perform an additional conversion.

I am not sure how __binsparse_format__ will be helpful here. If a library implements a sparse format that the binsparse protocol does not support, then the binsparse protocol will be useless anyway: the binsparse protocol cannot convert a format that it does not support, so any necessary conversion should be done by the provider library.
If you disagree, please provide more details about the __binsparse_format__ specification, as I may be missing some points here.

@hameerabbasi (Contributor, Author) commented on Dec 3, 2024

any necessary conversion should be done by the provider library

I think you've hit the nail on the head with this part -- a conversion might be necessary, which may make the cost of constructing a capsule O(n), and therefore we're left with two options if we want to guarantee an O(1) conversion (both require a .asformat which can take binsparse descriptors):

  • Raise on unsupported formats. The consumer must choose whether to call .asformat on error.
  • Allow the consumer to query whether it supports the format via __binsparse_format__, and then raise or call .asformat as necessary (sketched below).
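
A rough sketch of the second option from the consumer's side; __binsparse_format__ and .asformat are as described above, while _supports_format, _closest_supported_descriptor and _import_arrays are hypothetical consumer helpers:

def from_binsparse(x):
    # O(1) handshake: inspect the format without touching any data.
    descriptor = x.__binsparse_format__()
    if not _supports_format(descriptor):
        # Either raise here, or request a supported format explicitly,
        # making the O(n) conversion visible at the call site.
        x = x.asformat(_closest_supported_descriptor(descriptor))
    # Only now import the constituent arrays.
    return _import_arrays(x.__binsparse__())
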
@pearu commented on Dec 3, 2024

An exception with a helpful message is better than a silent expensive conversion, imho.

@rgommers (Member) commented on Dec 4, 2024

The two-method format seems clearly preferred, indeed for the same reason as __dlpack_device__: it allows introspecting metadata for compatibility and "hand-shaking" (the consumer can change its request when it supports multiple devices/formats/etc.).

@pearu the analogy with __array_interface__ is not a good one. That method is numpy-specific, so the actual data is always already in memory and always in a single format (a strided array on CPU). A __binsparse__ call, just like a __dlpack__ call, requires the data to be in memory in a supported format, so for lazily evaluated formats or custom formats not part of the protocol, for example, a call to __binsparse__ may indeed require an expensive operation (and no, always raising an exception instead is definitely not the correct answer; there are always cases where the expensive thing is needed because no cheap thing is possible).

If anything, we've so far found a few corner cases where it would be helpful for more parts of the DLPack protocol to be introspectable separately from the actual "give me a capsule with a pointer to data in memory".

@pearu commented on Dec 4, 2024

A __binsparse__ call, just like a __dlpack__ call, requires the data to be in memory in a supported format

Sparse tensors of any format can be modeled as a pair

(shape-dtype-format specs, (<a list of dlpack-compatible objects that hold the indices data and values>))

(this is how we think about sparse tensors in PyTorch, for instance). Notice that the __binsparse__ structure does not (need to) hold raw pointers to memory, which is the case for the __dlpack__ structure. Hence, the __binsparse__ structure would be lazy by definition under this mental model.

Lazy access to sparse tensors is about accessing the shape-dtype-format specs without accessing the data of indices and values. Notice that the __dlpack__ capsule that holds pointers to raw memory is constructed when the __dlpack__ method is called (this is the point where data needs to be materialized in memory, not earlier). So, lazy access to a sparse tensor's shape-dtype-format data is possible as long as one does not trigger the creation of __dlpack__ capsules (read: materialize data in memory) earlier than necessary.
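
A small sketch of this mental model; lazy_sparse is a hypothetical lazily evaluated sparse tensor:

spec, arrays = lazy_sparse.__binsparse__()  # no raw pointers held here
# The wrappers in `arrays` are dlpack-compatible, but memory is only
# materialized when a consumer actually requests a capsule:
capsule = arrays[0].__dlpack__()  # data materialized here, not earlier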

In any case, the laziness property should be provided at the DLPack protocol level, not in the binsparse protocol, which just uses the DLPack protocol for accessing sparse tensor data.

@rgommers (Member) commented on Dec 4, 2024

Lazy access to sparse tensors is about accessing shape-dtype-format specs without accessing the data of indices and values.

This is the entire point of why two separate methods have to exist. Your answer agrees that there should be this separation; you're just moving the separation between accessing metadata and data to a place where it doesn't exist (__dlpack__). Coupling the introduction of a binsparse protocol to not-yet-proposed and probably infeasible changes to __dlpack__ isn't a good idea. Separation of concerns is nice :)

Please either go with the 2-method approach for binsparse itself instead, or propose a way of what a 1-method __binsparse__ should return to keep things lazy without any changes to DLPack.

@hameerabbasi (Contributor, Author) commented on Feb 19, 2025

I've modified the issue description to use the two-method version.

@hameerabbasi (Contributor, Author) commented on Mar 6, 2025

Looping in @BenBrock.

Based on scipy/scipy#22553 (comment): we discussed in the bi-weekly Binsparse meeting yesterday where the spec should live. It seems the most appropriate thing to do is (similar to DLPack) to let the binsparse repo take care of the descriptor and the naming conventions of the arrays, while the array-api repo specifies how that descriptor is used in Python-land.

For context, all the binsparse spec requires from back-ends is:

  1. A way to represent and interpret JSON (Python objects equivalent to the parsed JSON in our case).
  2. A key-value store of 1D and 2D arrays (a dict[str, SupportsDLPack] in our case).

So this seems to be a good separation of concerns.
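
In Python typing terms, these two requirements could be captured roughly as follows; SupportsBinsparse and SupportsDLPack are hypothetical Protocol names for illustration:

from typing import Any, Protocol

class SupportsDLPack(Protocol):
    def __dlpack__(self, **kwargs: Any) -> Any: ...

class SupportsBinsparse(Protocol):
    # 1. Python objects equivalent to the parsed-JSON binsparse descriptor.
    def __binsparse_descriptor__(self) -> dict[str, Any]: ...
    # 2. A key-value store of the 1D/2D constituent arrays.
    def __binsparse__(self) -> dict[str, SupportsDLPack]: ...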

@rgommers (Member) commented on Mar 8, 2025

That seems reasonable. I think a draft PR on this repo for review of the new methods in more detail would be useful. Based on feedback and working PRs to 2-3 array libraries that are all in sync, things can be tested and iterated on until both library and spec maintainers are happy, before merging anything.

@hameerabbasi (Contributor, Author) commented on Mar 11, 2025

PR is up at #912.

@hameerabbasi (Contributor, Author) commented on Apr 4, 2025

@rgommers Following the discussion in yesterday's meeting: one valid use-case for implementing __binsparse__ on dense arrays might be an array library like PyTorch or finch-tensor, which supports both dense and sparse arrays. In that case, it isn't clear that asarray should densify: indeed, it should keep the format as-is and attempt to use __dlpack__ to consume the data. However, in this case, arr.__binsparse__(descriptor=some_format) (or, to a user, xp2.from_binsparse(xp1_arr, descriptor=some_format)) might be suitable to convert to a given sparse array format, or even to a dense one.

Alternatively we can add array.asformat(some_descriptor) on both sparse and dense arrays if hasattr(xp, 'sparse') holds.

I'm open to both options.

@rgommers (Member) commented on Apr 4, 2025

That rationale isn't logically consistent. asarray(x_dense) will always use __dlpack__ and it's irrelevant whether __binsparse__ is present.

Can you please start simple, with only binsparse on sparse arrays, and then have a very concrete code example of where/how things can go wrong?

Alternatively we can add array.asformat(some_descriptor) on both sparse and dense arrays if hasattr(xp, 'sparse') holds.

As a reminder: we're only working on __binsparse__ and from_binsparse to work out any potential kinks and to guide libraries with sparse arrays on how to implement those two things in the same way. Once that is stable in multiple libraries and proven useful, we can add it to the standard. It's quite premature at this point to discuss a larger API surface.

hasattr(xp, 'sparse') is also a misconception. We need hasattr(xp, 'from_binsparse') or hasattr(an_array, '__binsparse__').


Metadata

Labels: API extension (Adds new functions or objects to the API.), Needs Discussion (Needs further discussion.), RFC (Request for comments. Feature requests and proposed changes.), topic: DLPack.

Status: Stage 1

Participants: @rgommers, @pearu, @hameerabbasi, @kgryte, @willow-ahrens