Size and density properties #69

Merged: 12 commits merged into pydata:master from nils-werner:density on Jan 16, 2018

@nils-werner
Contributor

nils-werner commented Jan 10, 2018

NumPy has a .size property that gives you the number of elements in an array.

Additionally for sparse arrays a .density property may be handy, which is the ratio of nonzero elements to total elements in the array (a value between 0 and 1).

This PR implements both properties for COO arrays.
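For illustration, a minimal sketch of the two properties in use, assuming the COO interface from this repository (values shown as comments are illustrative):

import numpy as np
import sparse

x = np.zeros((10, 10))
x[0, :] = 1

s = sparse.COO.from_numpy(x)
s.size     # 100 -- total number of elements, matching numpy's .size
s.density  # 0.1 -- s.nnz / s.size, a value between 0 and 1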

@hameerabbasi

Collaborator

hameerabbasi commented Jan 10, 2018

Seems fine; however, it would be really nice to have docstrings for these and, if possible, tests.

@nils-werner

Contributor

nils-werner commented Jan 10, 2018

Added them

@hameerabbasi

Collaborator

hameerabbasi commented Jan 10, 2018

Ignore the Travis CI failure. Numpy 1.14 changed the way it converts arrays to strings, and that's causing a doctest failure. I've addressed it over in #43.

@nils-werner

Contributor

nils-werner commented Jan 10, 2018

Yup, I already noticed that

@hameerabbasi

Collaborator

hameerabbasi commented Jan 10, 2018

Well, this looks good. Let me know if you're finished working on it, and I'll review and merge (don't want to be too fast this time).

def test_density():
    s = sparse.random((20, 30, 40), density=0.1)
    assert 0.09 < s.density < 0.11

@hameerabbasi

hameerabbasi Jan 10, 2018

Collaborator

It might be wise to use np.isclose here.
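A hedged sketch of the test rewritten along the lines of that suggestion:

import numpy as np
import sparse

def test_density():
    s = sparse.random((20, 30, 40), density=0.1)
    assert np.isclose(s.density, 0.1)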

>>> x[0, :] = 1
>>> s = COO.from_numpy(x)
>>> s.density
0.10000000000000001

@hameerabbasi

hameerabbasi Jan 10, 2018

Collaborator

This might fluctuate across different systems' floating-point implementations. Perhaps np.isclose(density, 0.1)?
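One hedged way to keep the doctest stable (not necessarily what the PR ended up doing) would be to wrap the comparison in np.isclose, or to pick a density that is exactly representable in binary floating point, for example:

>>> x = np.zeros((4, 4))
>>> x[0, :] = 1
>>> s = COO.from_numpy(x)
>>> s.density
0.25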

@mrocklin

This comment has been minimized.

Collaborator

mrocklin commented Jan 10, 2018

@nils-werner

Contributor

nils-werner commented Jan 11, 2018

@mrocklin that's clever! Done.

@hameerabbasi

Collaborator

hameerabbasi commented Jan 11, 2018

This looks ready. If you merge, I can do a final review and merge.

@nils-werner

Contributor

nils-werner commented Jan 12, 2018

Did you see my code comment?

@hameerabbasi

Collaborator

hameerabbasi commented Jan 12, 2018

Hmmm, that's strange, it isn't showing any of your comments in that link, or on any commits.

@@ -1277,7 +1325,7 @@ def astype(self, dtype, out=None):
def maybe_densify(self, allowed_nnz=1e3, allowed_fraction=0.25):
    """ Convert to a dense numpy array if not too costly. Err othrewise """
-   if reduce(operator.mul, self.shape) <= allowed_nnz or self.nnz >= np.prod(self.shape) * allowed_fraction:
+   if self.size <= allowed_nnz or self.density >= allowed_fraction:

@nils-werner

nils-werner Jan 12, 2018

Contributor

It is easy to see that

self.nnz >= np.prod(self.shape) * allowed_fraction

can be transformed into

self.density >= allowed_fraction

(divide both sides by np.prod(self.shape), which is self.size, and use density = nnz / size). But does self.density >= allowed_fraction actually make sense? Shouldn't it be the other way round?

@hameerabbasi

hameerabbasi Jan 12, 2018

Collaborator

It actually does make sense. If the density is low, then the cost of a dense array is higher compared to that of a sparse array, and we should avoid densifying.

If, however, density is high, the overhead of a sparse matrix far outweighs the benefits and we should densify.

@hameerabbasi

hameerabbasi Jan 12, 2018

Collaborator

To put it another way, a large array with a very low density should NOT be densified but it's okay to densify a high-density or a small array.
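A minimal sketch of the rule being described (should_densify is a hypothetical helper for illustration, not the library's code):

def should_densify(size, density, allowed_nnz=1000, allowed_fraction=0.25):
    # Densify small arrays, or arrays dense enough that sparse
    # bookkeeping no longer pays off; refuse otherwise.
    return size <= allowed_nnz or density >= allowed_fraction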

@nils-werner

nils-werner Jan 12, 2018

Contributor

True. Maybe I got confused because one allowed_* means a maximum and the other means a minimum... What about renaming them to

maybe_densify(self, max_nnz=1e3, min_density=0.25)

@hameerabbasi

hameerabbasi Jan 12, 2018

Collaborator

I would make sure that no tests break either here or in Dask before doing that. We can change unit tests here... But it's best to send a PR to Dask if any tests break there.

@hameerabbasi

hameerabbasi Jan 15, 2018

Collaborator

Let's change it anyway and I'll send a PR to Dask if anything fails for them.

@nils-werner

nils-werner Jan 15, 2018

Contributor

Actually, shouldn't max_nnz be named max_size? We are not comparing it to .nnz anyway...

@hameerabbasi

hameerabbasi Jan 15, 2018

Collaborator

Yes, max_size is a good idea as well.

@nils-werner

Contributor

nils-werner commented Jan 12, 2018

Aha! After commenting on lines you also have to open the code review thingy.

@hameerabbasi

Collaborator

hameerabbasi commented Jan 15, 2018

Is NotImplementedError the best way to show that densification is forbidden? What would be the right exception here?

@nils-werner

Contributor

nils-werner commented Jan 15, 2018

I would say ValueError is the right one, as it's the choice of function arguments that makes the operation fail.

A NotImplementedError sounds more like "whatever you pass to this function, in this version it will never succeed".

@hameerabbasi

Collaborator

hameerabbasi commented Jan 15, 2018

Best to switch that to a ValueError then.

@nils-werner nils-werner force-pushed the nils-werner:density branch from 1fa7541 to 832b5ab Jan 15, 2018

an error.
>>> x = np.zeros((5, 5), dtype=np.uint8)
>>> x[2, 2] = 1
>>> s = COO.from_numpy(x)
- >>> s.maybe_densify(allowed_nnz=5, allowed_fraction=0.25)
+ >>> s.maybe_densify(max_size=5, min_fraction=0.25)
Traceback (most recent call last):
...
NotImplementedError: Operation would require converting large sparse array to dense

@hameerabbasi

hameerabbasi Jan 15, 2018

Collaborator

You need to update this line in the docstring. See here.

@@ -2381,17 +2381,17 @@ def astype(self, dtype, out=None):
assert out is None
return self.elemwise(np.ndarray.astype, dtype)
- def maybe_densify(self, allowed_nnz=1000, allowed_fraction=0.25):
+ def maybe_densify(self, max_size=1000, min_fraction=0.25):

@hameerabbasi

hameerabbasi Jan 15, 2018

Collaborator

Best to keep this as min_density; it's clearer that it represents the density.

an error.
>>> x = np.zeros((5, 5), dtype=np.uint8)
>>> x[2, 2] = 1
>>> s = COO.from_numpy(x)
- >>> s.maybe_densify(allowed_nnz=5, allowed_fraction=0.25)
+ >>> s.maybe_densify(max_size=5, min_density=0.25)
Traceback (most recent call last):
...
NotImplementedError: Operation would require converting large sparse array to dense

@hameerabbasi

hameerabbasi Jan 15, 2018

Collaborator

Line 2425...

NotImplementedError: Operation would require converting large sparse array to dense

Need to change this to ValueError.
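With the switch to ValueError, the doctest's expected output would presumably become (assuming the message text is kept):

Traceback (most recent call last):
...
ValueError: Operation would require converting large sparse array to dense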

@hameerabbasi

hameerabbasi Jan 15, 2018

Collaborator

It's causing doctests to fail. :)

@nils-werner

nils-werner Jan 15, 2018

Contributor

Currently there are a million tests failing on master, so it is hard to spot my own mistakes...

This comment has been minimized.

@hameerabbasi

@nils-werner

nils-werner Jan 15, 2018

Contributor

I checked: they only fail up to Numpy 1.13.1. The latest Numpy works fine...

@hameerabbasi

hameerabbasi Jan 15, 2018

Collaborator

That is really weird; I tested with Numpy 1.13.3 on my local system and it still works fine.

@mrocklin

Collaborator

mrocklin commented Jan 15, 2018

Is NotImplementedError the best way to show that densification is forbidden? What would be the right exception here?

To me NotImplementedError means "We have not yet implemented this, but it is a reasonable request". If it is a case of a type not being supported like "we don't think that densifying scipy.sparse matrices is a good idea" then I would raise a TypeError. If it is a case of the values being inappropriate like "sometimes we do densify sparse arrays, but this particular array is very very sparse" then I would use a ValueError.
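As a hedged sketch of what the ValueError path could look like after the rename (schematic only, not the exact implementation in this PR):

def maybe_densify(self, max_size=1000, min_density=0.25):
    # Only densify when the array is small or dense enough.
    if self.size <= max_size or self.density >= min_density:
        return self.todense()
    # The argument values make this particular conversion unreasonable.
    raise ValueError("Operation would require converting "
                     "large sparse array to dense")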

@hameerabbasi

Collaborator

hameerabbasi commented Jan 15, 2018

So, we all agree that it should be ValueError in maybe_densify then. :-)

@hameerabbasi

Collaborator

hameerabbasi commented Jan 15, 2018

Are you still making changes here? If not, this gets a +1 from me.

@hameerabbasi hameerabbasi added this to the 0.3 milestone Jan 15, 2018

@nils-werner

Contributor

nils-werner commented Jan 16, 2018

No, I'm done.

@hameerabbasi

Collaborator

hameerabbasi commented Jan 16, 2018

A rebase would be nice.

@nils-werner nils-werner force-pushed the nils-werner:density branch from 647fc33 to a65e1d0 Jan 16, 2018

@hameerabbasi hameerabbasi merged commit bbb0869 into pydata:master Jan 16, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

hameerabbasi added a commit to hameerabbasi/sparse that referenced this pull request Feb 27, 2018

Size and density properties (pydata#69)
* Size and density properties

* Docstrings and tests

* Test .density using np.isclose

* Fixed typos that made tests fail

* Replaced other occurrences of np.prod(self.shape) with self.size

* Avoid float precision errors in .density docstring

* maybe_densify raises ValueError instead of NotImplementedError

* Updated docstring to match ValueError exception

* changed allowed_nnz and allowed_fraction to max_size and min_density

* changed min_fraction to min_density

* Another docstring that didn't show ValueError

* Documentation pages