Description
Motivation
The `sparse` (also called PyData/Sparse) development team has been working on integration efforts with the ecosystem, most notably with SciPy, scikit-learn and others, with CuPy, PyTorch, JAX and TensorFlow also on the radar. One of the challenges we were facing was the lack of (possibly zero-copy) interchange between the different sparse array implementations. We believe this may be a pain point for many sparse array implementations moving forward.

This mirrors an issue seen for dense arrays previously, where the DLPack protocol was one of the first things to be standardised. We're hoping to achieve community consensus for a similar problem.

Luckily, all sparse array formats (with the possible exception of DOK) are usually collections of dense arrays underneath. In addition, this problem has been solved for on-disk arrays before by the binsparse specification. @willow-ahrens is a co-author of that spec, and is also a collaborator on the `sparse` work.
Proposal
We propose introducing two new methods on array-API compliant sparse array objects (such as those in `sparse`), which are described below.
`__binsparse_descriptor__`

Returns a `dict` equivalent to a parsed JSON binsparse descriptor of the array.

`__binsparse__`

Returns a `dict[str, Array]` of `__dlpack__`-compatible arrays, which are the constituent arrays of the sparse array. Each key corresponds to the equivalent key in the descriptor.
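For illustration, here is roughly what the two methods might return for a CSR array. This is an abbreviated sketch only; the authoritative descriptor layout and constituent-array key names are defined by the binsparse specification, not by this example.

```python
# Illustrative sketch: `csr` stands for some CSR-format sparse array from a
# provider library. The exact keys below are the binsparse spec's concern.
csr.__binsparse_descriptor__()
# {
#     "binsparse": {
#         "version": "0.1",
#         "format": "CSR",
#         "shape": [4, 6],
#         "data_types": {"pointers_to_1": "uint64", "indices_1": "uint64", "values": "float64"},
#     }
# }

csr.__binsparse__()
# {
#     "pointers_to_1": <__dlpack__-compatible array of row pointers>,
#     "indices_1":     <__dlpack__-compatible array of column indices>,
#     "values":        <__dlpack__-compatible array of stored values>,
# }
```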
Introduction of a `from_binsparse` function.

If a library supports sparse arrays, its `from_binsparse` function should accept (when possible, zero-copy) versions of objects that follow this `__binsparse__` protocol and have an equivalent sparse format within the library.
Pseudocode implementation

Here's a pseudocode example using two libraries, `xp1` and `xp2`, both supporting sparse arrays:
```python
# In library code:
xp2_sparray = xp2.from_binsparse(xp1_sparray, ...)

# This pseudocode implementation is common between `xp1` and `xp2`.
def from_binsparse(x: object, /, *, device: device | None = None, copy: bool | None = None) -> array:
    binsparse_descr = getattr(x, "__binsparse_descriptor__", None)
    binsparse_impl = getattr(x, "__binsparse__", None)
    if binsparse_impl is None or binsparse_descr is None:
        raise TypeError(...)

    binsparse_descriptor = binsparse_descr()
    # Will raise an error if the format/descriptor is unsupported.
    sparse_type = _type_from_binsparse_descriptor(binsparse_descriptor)

    constituent_arrays = binsparse_impl()
    # Import each constituent array via DLPack (zero-copy where possible).
    my_constituent_arrays = {
        k: from_dlpack(arr, device=device, copy=copy)
        for k, arr in constituent_arrays.items()
    }
    return sparse_type.from_strided_arrays(my_constituent_arrays, shape=...)
```
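The helper `_type_from_binsparse_descriptor` above is left to each library; a minimal sketch of one possible shape follows. The registry and the `CSRArray`/`CSCArray`/`COOArray` classes are hypothetical stand-ins for whatever concrete containers the consuming library provides.

```python
# Hypothetical helper sketch; the class names are placeholders, not a real API.
_FORMAT_REGISTRY: dict[str, type] = {
    "CSR": CSRArray,
    "CSC": CSCArray,
    "COO": COOArray,
}

def _type_from_binsparse_descriptor(descriptor: dict) -> type:
    # Custom formats are described by a nested dict rather than a name in the
    # binsparse spec; that case is omitted from this sketch.
    fmt = descriptor["binsparse"]["format"]
    try:
        return _FORMAT_REGISTRY[fmt]
    except KeyError:
        raise TypeError(f"unsupported binsparse format: {fmt!r}") from None
```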
Parallel implementation in `sparse`: pydata/sparse#764
Parallel implementation in SciPy: scipy/scipy#22553
Alternative solutions

There are formats for on-disk sparse-array interchange [1] [2], but none for in-memory interchange. binsparse is the one that comes closest to offering in-memory interchange.
Pinging possibly interested parties:
- @mtsokol @ivirshup (for `scipy.sparse`)
- @willow-ahrens @mtsokol (from `binsparse` and `finch-tensor`/`sparse`)
- @leofang (for `cupyx.sparse`)
- @pearu (for `torch.sparse`)
- @jakevdp (for JAX/TensorFlow)
Updated on 2024.10.09 as agreed in #840 (comment).
pearu commented on Sep 6, 2024

A couple of quick notes.

First, DLPack and binsparse are self-contained specifications: DLPack provides a protocol for sharing strided arrays between different array libraries, and binsparse provides a unified description of various sparse formats. This proposal tries to glue these together by implicitly extending the DLPack protocol, but that contradicts the semantics of DLPack, which "describes the memory layout of dense, strided, n-dimensional arrays". In addition, the Python Array API standard specifies that DLPack should not support sparse arrays.

My suggestion is to cook up a sparse array interchange protocol that may use DLPack and binsparse (as these are obviously relevant pieces of this problem), but in a cleaner way.
For instance, binsparse specifies that the binsparse-compatible object provides a 2-tuple `(<sparse array descriptor>, <list of sparse array data as strided arrays>)`. So, the appropriate place to deploy the DLPack protocol is to require that the arrays in the `<list of sparse array data as strided arrays>` must support the DLPack protocol.

For a second instance, introduce a "Python specification of binsparse" (similar to the one in DLPack) that consists of:
1. a `from_binsparse` function that accepts objects that implement the binsparse protocol per 2. below;
2. a `__binsparse__` method on the objects representing sparse arrays that will return a 2-tuple `(<sparse array descriptor>, <list of sparse array data as strided arrays>)`, which is used by the `from_binsparse` function of the consumer library.

Also, the specification should include the C-side interface as well.

What do you think?
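A minimal sketch of the 2-tuple exchange proposed above; all names here (`_descriptor`, the constituent attributes, `_build_sparse_array`) are hypothetical and only illustrate the shape of the protocol, not a finalized API.

```python
# Provider side:
class ProviderSparseArray:
    def __binsparse__(self):
        # (<sparse array descriptor>, <list of sparse array data as strided arrays>)
        # Each strided array must itself support the DLPack protocol.
        return self._descriptor, [self._pointers, self._indices, self._values]

# Consumer side:
def from_binsparse(obj):
    descriptor, constituents = obj.__binsparse__()
    arrays = [from_dlpack(a) for a in constituents]   # zero-copy where possible
    return _build_sparse_array(descriptor, arrays)    # hypothetical constructor
```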
hameerabbasi commented on Sep 7, 2024

While I'm on board with the proposed `__binsparse__` protocol, I'm of the opinion that sparse arrays should have as few user-facing API differences as possible compared to strided arrays, i.e. using sparse arrays should be as frictionless from the Array API side as using strided arrays. Allocating a different function based on sparse vs. dense arrays (`from_dlpack` vs `from_binsparse`) seems counter to that.

I'm not opposed to a C-side specification; I'd welcome that, so that the benefits can extend beyond the Python ecosystem. We do have to be mindful that the binsparse specification requires key-value maps, and we would need to maintain their relative order. Perhaps an iterable of 2-tuples makes more sense in this case.
pearu commented on Sep 7, 2024

Absolutely. I presume that users of sparse arrays are aware of the advantages and disadvantages of using a particular sparse format over other sparse formats or strided arrays. Using a particular sparse format is a choice of optimization method that must be supported by the user-facing API.

`from_dlpack` is specifically defined for strided arrays, both by the Python Array API and DLPack. So, I find abusing the `from_dlpack` function for sparse arrays more confusing than providing a new API function which would follow the same design pattern that DLPack introduced: the pair (`from_foo`, `__foo__`) provides a minimal API for sharing data between provider and consumer libraries that support a protocol `foo`.

Btw, if one considers strided and sparse arrays semantically equivalent (as in PyTorch, for instance), a more intuitive approach would be to use `asarray` instead of `from_dlpack` as a common API function for constructing strided or sparse arrays from provider library objects. Although, the Python Array API does not support that, and existing implementations (PyTorch, JAX, ...) do not mix the functionalities of `asarray` and `from_dlpack`. I don't suggest doing that here either.

hameerabbasi commented on Sep 9, 2024
While I agree with a common API, the disadvantage I find in this approach is that `from_dlpack` requires the producer to pass ownership to the consumer, while `asarray` requires one to maintain ownership of the consumed array. (See #840 (comment).) This, in turn, means that there's no way to have a zero-copy transfer in an `asarray` built on the `from_dlpack` function; i.e. one would have to do something like `def asarray(x, ...): return from_dlpack(copy(x), ...)` somewhere to maintain ownership semantics, incurring a copy.

My intention was to have an API that essentially could support zero-copy interchange across libraries, regardless of whether the consumed array was strided, sparse or something else.
pearu commented on Sep 9, 2024

This is incorrect. When using `from_dlpack(x)`, the producer keeps owning the memory of `x`.
See https://dmlc.github.io/dlpack/latest/python_spec.html#semantics.
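A quick illustration of those semantics, using NumPy purely as an example producer/consumer pair: the consumer gets a zero-copy view, and the producer retains ownership of the underlying buffer.

```python
import numpy as np

x = np.arange(5)        # the producer allocates and keeps owning this buffer
y = np.from_dlpack(x)   # zero-copy import on the consumer side

# Both names refer to the same memory; ownership stayed with the producer.
assert np.shares_memory(x, y)
```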
hameerabbasi commented on Sep 9, 2024

Ah, in that case, yes: `asarray` can use `from_dlpack` and a possible `from_binsparse`, and still be zero-copy. Thanks for the correction; `asarray` seems like the right format-agnostic API for this.

hameerabbasi commented on Sep 11, 2024
I've updated the issue with @pearu's feedback.
[16 remaining items not shown]
pearu commented on Dec 3, 2024

I am not sure how `__binsparse_format__` will be helpful here. If a library implements a sparse format that the binsparse protocol does not support, then the binsparse protocol will be useless anyway: the binsparse protocol cannot implement conversion for a format that it does not support; any necessary conversion should be done by the provider library.

If you disagree, please provide more details about the `__binsparse_format__` specification, as I may be missing some points here.

hameerabbasi commented on Dec 3, 2024

I think you've hit the nail on the head with this part -- a conversion might be necessary, which may make the cost of constructing a capsule O(n). Therefore we're left with two options if we want to guarantee an O(1) conversion (both require an `.asformat` which can take binsparse descriptors):
1. Call `.asformat` on error.
2. Introspect `__binsparse_format__` and then raise or call `.asformat` as necessary (see the sketch below).
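A minimal consumer-side sketch of option 2, using the `__binsparse_format__` name as it was called at this point in the discussion (it corresponds to what the issue description now calls `__binsparse_descriptor__`). `_SUPPORTED_FORMATS`, the descriptor layout, `.asformat`, and `_from_descriptor_and_arrays` are assumptions for illustration.

```python
# Option 2 sketched: introspect the (cheap) format descriptor first, then either
# consume directly, request an explicit conversion, or raise.
_SUPPORTED_FORMATS = {"COO", "CSR", "CSC"}   # hypothetical consumer capabilities

def from_binsparse(x, *, device=None, copy=None):
    descriptor = x.__binsparse_format__()         # O(1): metadata only
    fmt = descriptor["binsparse"]["format"]       # assumed descriptor layout
    if fmt not in _SUPPORTED_FORMATS:
        if copy is False:
            # The caller forbade copies, so a potentially O(n) conversion is out.
            raise TypeError(f"cannot import binsparse format {fmt!r} without a copy")
        x = x.asformat("CSR")                     # hypothetical, possibly O(n)
        descriptor = x.__binsparse_format__()
    return _from_descriptor_and_arrays(descriptor, x.__binsparse__())  # hypothetical
```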
pearu commented on Dec 3, 2024

An exception with a helpful error message is better than a silent expensive conversion, imho.
rgommers commented on Dec 4, 2024

The two-method format seems clearly preferred, indeed for the same reason as `__dlpack_device__`: it allows introspecting metadata for compatibility and "hand-shaking" (it allows the consumer to change its request when it supports multiple devices/formats/etc.).

@pearu the analogy with `__array_interface__` is not a good one. That method is numpy-specific, and the actual data is hence always already in memory and always in a single format (strided array on CPU). A `__binsparse__` call, just like a `__dlpack__` call, requires the data to be in memory in a supported format, so for (for example) lazily evaluated formats or custom formats not part of the protocol, a call to `__binsparse__` may indeed require an expensive operation (and no, always raising an exception instead is definitely not the correct answer; there is always a need to do the expensive thing if no cheap thing is possible).

If anything, we've so far found a few corner cases where it would be helpful for more parts of the DLPack protocol to be introspectable separately from the actual "give me a capsule with a pointer to data in memory".
pearu commented on Dec 4, 2024

Sparse tensors of any format can be modeled as a pair of shape-dtype-format specs and the data of indices and values (this is how we think about sparse tensors in PyTorch, for instance). Notice that the `__binsparse__` structure does not (need to) hold raw pointers to memory, which indeed is the case for the `__dlpack__` structure. Hence, the `__binsparse__` structure would be lazy by definition when using this mental model.

Lazy access to sparse tensors is about accessing the shape-dtype-format specs without accessing the data of indices and values. Notice that the `__dlpack__` capsule that holds pointers to raw memory is constructed when the `__dlpack__` method is called (this is the point where data needs to be materialized in memory, not earlier). So, lazy access to sparse tensors' shape-dtype-format data is possible if one does not trigger the creation of `__dlpack__` capsules (read: materialize data in memory) earlier than necessary.

In any case, the laziness property should be provided at the DLPack protocol level, not in the binsparse protocol, which just uses the DLPack protocol for accessing sparse tensor data.
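A minimal sketch of that mental model; the wrapper class and its delegation to NumPy are assumptions for illustration. Metadata stays cheap, and memory is only materialized when `__dlpack__` is actually called.

```python
import numpy as np

class LazyConstituent:
    """Hypothetical constituent-array wrapper: nothing is materialized until
    __dlpack__ is called, which is when the capsule (and the memory) is needed."""

    def __init__(self, materialize):
        self._materialize = materialize   # zero-argument callable producing array data
        self._cache = None

    def _array(self):
        if self._cache is None:
            self._cache = np.asarray(self._materialize())
        return self._cache

    def __dlpack_device__(self):
        return (1, 0)   # kDLCPU, device id 0 -- answerable without touching the data

    def __dlpack__(self, **kwargs):
        # Materialization happens here, at capsule-creation time, not earlier.
        return self._array().__dlpack__(**kwargs)
```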
rgommers commented on Dec 4, 2024

This is the entire point of why two separate methods have to exist. Your answer agrees that there should be this separation; you're just moving the separation between accessing metadata and data to a place it doesn't exist (`__dlpack__`). Coupling the introduction of a binsparse protocol to not-yet-proposed and probably infeasible changes to `__dlpack__` isn't a good idea. Separation of concerns is nice :)

Please either go with the 2-method approach for binsparse itself, or propose what a 1-method `__binsparse__` should return to keep things lazy without any changes to DLPack.

hameerabbasi commented on Feb 19, 2025
I've modified the issue description to use the two-method version.

`__binsparse__` protocol scipy/scipy#22553

hameerabbasi commented on Mar 6, 2025
Looping in @BenBrock.

Based on scipy/scipy#22553 (comment): we discussed in the bi-weekly binsparse meeting yesterday where the spec should live -- it seems the most appropriate thing to do is (similar to DLPack) to let the binsparse repo take care of the descriptor and the naming conventions of the arrays, while the array-api repo specifies how that descriptor is used in Python-land.

For context, all the binsparse spec requires from back-ends is access to the named constituent arrays (a `dict[str, SupportsDLPack]` in our case). So this seems to be a good separation of concerns.
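For illustration, a typing sketch of what `SupportsDLPack` could mean in that `dict[str, SupportsDLPack]`; this is an assumption about the intended meaning, not a type defined by either spec.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsDLPack(Protocol):
    """Anything exposing the two DLPack dunders can serve as a constituent array."""
    def __dlpack__(self, **kwargs): ...
    def __dlpack_device__(self) -> tuple[int, int]: ...
```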
rgommers commented on Mar 8, 2025

That seems reasonable. I think a draft PR on this repo, reviewing the new methods in more detail, would be useful. Based on feedback and working PRs to 2-3 array libraries that are all in sync, things can be tested and iterated on until both library and spec maintainers are happy, before merging anything.

`__binsparse__` protocol #912

hameerabbasi commented on Mar 11, 2025
PR is up at #912.
hameerabbasi commented on Apr 4, 2025

@rgommers Following the discussion in yesterday's meeting: one valid use-case for implementing `__binsparse__` on dense arrays might be an array library like PyTorch or `finch-tensor`, which supports both dense and sparse arrays. In that case, it isn't clear that `asarray` should densify: indeed, it should keep the format as-is and attempt to use `__dlpack__` to consume the data. However, in this case, `arr.__binsparse__(descriptor=some_format)` (or, to a user, `xp2.from_binsparse(xp1_arr, descriptor=some_format)`) might be suitable to convert to a given sparse array format, or even to a dense one.

Alternatively, we can add `array.asformat(some_descriptor)` on both sparse and dense arrays if `hasattr(xp, 'sparse')` holds.

I'm open to both options.
rgommers commented on Apr 4, 2025

That rationale isn't logically consistent. `asarray(x_dense)` will always use `__dlpack__`, and it's irrelevant whether `__binsparse__` is present.

Can you please start simple, with only binsparse on sparse arrays, and then have a very concrete code example of where/how things can go wrong?

As a reminder: we're only working on `__binsparse__` and `from_binsparse` to work out any potential kinks and to guide libraries with sparse arrays on how to implement those two things in the same way. Once that is stable in multiple libraries and proven useful, we can add it to the standard. It's quite premature at this point to discuss a larger API surface.

`hasattr(xp, 'sparse')` is also a misconception. We need `hasattr(xp, 'from_binsparse')` or `hasattr(an_array, '__binsparse__')`.
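For concreteness, those capability checks might look like the following on the consumer side; the helper names are hypothetical.

```python
# Hypothetical helpers spelling out the checks mentioned above.
def supports_binsparse_export(an_array) -> bool:
    """True if `an_array` can be consumed via the proposed protocol."""
    return hasattr(an_array, "__binsparse__") and hasattr(an_array, "__binsparse_descriptor__")

def supports_binsparse_import(xp) -> bool:
    """True if the namespace `xp` can construct arrays from the protocol."""
    return hasattr(xp, "from_binsparse")
```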
`.dtype` or `.dtype.kind` in sparse indexer scipy/scipy#22935