New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow np.array(COO) #218

Closed
hameerabbasi opened this Issue Dec 12, 2018 · 8 comments

Comments

Projects
None yet
6 participants
@hameerabbasi
Copy link
Collaborator

hameerabbasi commented Dec 12, 2018

Currently, we have a mix of when to allow dense inputs and when not to allow them. Currently, we have the __array__ protocol which converts the sparse array into a dense one when using np.array(COO) or similar.

This would be fine if we knew that we were densifying, but in practice, we really don't. Think of something that calls np.asarray internally, this would densify, and in many cases fill up memory and raise a MemoryError.

In practice, what we want most of the time is to disallow np.[as]array(COO) and provide an explicit escape hatch in the form of COO.todense().

Recently, I added an environment variable to control this, so this discussion is mostly about changing the default behaviour.

Thoughts, @rgommers, @shoyer, @mrocklin, @perimosocordiae?

@asmeurer

This comment has been minimized.

Copy link
Member

asmeurer commented Dec 12, 2018

Having worked with the environment variable enabled for the past few weeks, I can confirm that it often densifies in unexpected places. For instance, an algorithm that uses einsum will densify because einsum isn't implemented for sparse.

@mrocklin

This comment has been minimized.

Copy link
Collaborator

mrocklin commented Dec 12, 2018

CuPy explicitly doesn't implement the __array__ protocol for this reason. Dask.array still does. Both seem defensible. No strong thoughts either way.

@scopatz

This comment has been minimized.

Copy link

scopatz commented Dec 12, 2018

I agree with @asmeurer here. The purpose of sparse seems to be to create, manage, and operate on sparse arrays. I don't expect that default behavior is to densify as frequently as it does.

@shoyer

This comment has been minimized.

Copy link
Member

shoyer commented Dec 12, 2018

__array_function__ will help here, since NumPy functions won't automatically densify anymore.

I agree that either choice is defensible.

In the long term, it would be nice to have a protocol for coercion to NumPy arrays even if the operation could be very expensive.

@asmeurer

This comment has been minimized.

Copy link
Member

asmeurer commented Dec 12, 2018

I think @hameerabbasi didn't want to add the flag to the Python level because of thread safety (right now it only exists as an environment variable that must be set before importing sparse). See the discussion at #210.

@perimosocordiae

This comment has been minimized.

Copy link

perimosocordiae commented Dec 13, 2018

I'm +1 for disallowing implicit densification. That's the approach taken by scipy.sparse, and I think it has been effective in guiding users away from unexpected performance traps.

Users often test their code using small inputs, so problems caused by densification have a bad habit of only showing up in production.

@hameerabbasi

This comment has been minimized.

Copy link
Collaborator

hameerabbasi commented Dec 13, 2018

It seems all comments are either in favor or neutral. I've implemented this in #220.

In the long term, it would be nice to have a protocol for coercion to NumPy arrays even if the operation could be very expensive.

This already exists as a method rather than a protocol, .todense(), the problem with __array__ seems to be that it makes densification implicit, with a lot of NumPy and other library functions calling it.

__array_function__ will help here, since NumPy functions won't automatically densify anymore.

__array_function__ will help, but only internal NumPy functions are guaranteed to use it. So unless we allow overriding np.asarray, it isn't perfect.

@shoyer

This comment has been minimized.

Copy link
Member

shoyer commented Dec 13, 2018

This already exists as a method rather than a protocol, .todense()

Ideally the method would also be available on non-sparse arrays

mlemainque added a commit to mlemainque/dask-lightgbm that referenced this issue Jan 2, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment