
Add random() function #41

Merged
merged 18 commits into pydata:master on Jan 2, 2018

Conversation

@nils-werner (Contributor) commented Dec 26, 2017

I think a random() function could be pretty handy for generating quick testing data.

I tried several methods, like generating .coords and .data myself and creating a COO object from them. In the end, the easiest approach turned out to be creating a linear scipy.sparse array and simply reshaping it to the desired output shape.
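
The linear-array-then-reshape idea could be sketched like this (random_coo is a hypothetical helper name, not the PR's actual implementation; it maps linear row indices back to a multidimensional shape rather than reshaping, which is equivalent):

```python
import numpy as np
import scipy.sparse

def random_coo(shape, density=0.01, dtype=None):
    # Draw a linear (n x 1) sparse random matrix, then map its row
    # indices back onto the requested multidimensional shape.
    elements = int(np.prod(shape))
    linear = scipy.sparse.rand(elements, 1, density=density, dtype=dtype).tocoo()
    coords = np.unravel_index(linear.row, shape)
    return coords, linear.data

coords, data = random_coo((2, 3, 5), density=0.5)
```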

@mrocklin (Collaborator) commented Dec 26, 2017

Thanks for this. Some notes:

  1. This could use a test
  2. Any thoughts on using this to replace the random_x function currently used in tests? Your solution is likely far more efficient.

@nils-werner nils-werner force-pushed the nils-werner:random branch from 02eb6ae to 0eee01b Dec 26, 2017

@nils-werner (Contributor) commented Dec 26, 2017

  1. This could use a test

Definitely! What would you like to be tested?

  1. Any thoughts on using this to replace the random_x function currently used in tests?

Might be worth investigating. I didn't understand what random_x was doing at first glance, so I just ignored it for now.

@mrocklin (Collaborator) commented Dec 26, 2017

Well, we might verify a variety of things we know about sparse matrices. Here are a few that come to mind:

  1. The shape matches the intended shape
  2. The dtype matches the intended dtype (for a few dtypes)
  3. The density matches the intended density (according to nnz)
  4. Two calls to the same function don't produce the same results
  5. Two calls with the same args and with the same random state seed (we'll need to pass this through) produce equivalent results
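
A test covering these five properties could be sketched directly against scipy.sparse.random (the shape, density, and dtype values are arbitrary examples, and the PR's own assert_eq helpers are not reproduced here):

```python
import numpy as np
import scipy.sparse

shape, density, dtype = (40, 25), 0.1, np.float64

s1 = scipy.sparse.random(*shape, density=density, dtype=dtype, random_state=42)
s2 = scipy.sparse.random(*shape, density=density, dtype=dtype, random_state=42)
s3 = scipy.sparse.random(*shape, density=density, dtype=dtype)
s4 = scipy.sparse.random(*shape, density=density, dtype=dtype)

assert s1.shape == shape                               # 1. intended shape
assert s1.dtype == dtype                               # 2. intended dtype
assert s1.nnz == int(density * np.prod(shape))         # 3. density via nnz
assert not np.array_equal(s3.toarray(), s4.toarray())  # 4. unseeded calls differ (w.h.p.)
assert np.array_equal(s1.toarray(), s2.toarray())      # 5. same seed, same result
```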

@mrocklin (Collaborator) commented Dec 26, 2017

Might be worth investigating. I didn't understand what random_x was doing at first glance, so I just ignored it for now.

It attempts to do the same thing that you're doing here, just much less efficiently, and much less generally :)

@hameerabbasi (Collaborator) commented Dec 26, 2017

The proper way to test a random function is currently a hot debate topic. There seem to be two ways to test it, neither of which I'm happy with:

  • Mock/monkeypatch the underlying random function (in this case, scipy.sparse.rand) to check that it receives the correct arguments.
  • Run it many times until the result is statistically certain to fall in a certain range, then test for that range.

I will also add that random_x returns a dense NumPy array, while this returns a sparse one.

@mrocklin (Collaborator) commented Dec 26, 2017

I don't think we need to test the random number generator within this. We can rely on SciPy/NumPy to produce a decent distribution of numbers. I'm mostly concerned that we test basic user expectations.

elements = np.prod(shape)
return COO.from_scipy_sparse(
    scipy.sparse.rand(elements, 1, density, dtype=dtype)

@hameerabbasi (Collaborator) commented Dec 27, 2017

Here, it would be useful to specify format='coo' explicitly. Might speed up the conversion.

@nils-werner (Contributor) commented Dec 29, 2017

Maybe I am missing something, but scipy.sparse.rand() defaults to format='coo'...

@hameerabbasi (Collaborator) commented Dec 27, 2017

(we'll need to pass this through)

It looks like the version of scipy.sparse.rand that supports passing through a random state is called scipy.sparse.random and only supports floating point dtypes. So we might lose something if we implement this check.

@mrocklin (Collaborator) commented Dec 27, 2017

It looks like the version of scipy.sparse.rand that supports passing through a random state is called scipy.sparse.random and only supports floating point dtypes. So we might lose something if we implement this check.

That's unfortunate

@nils-werner (Contributor) commented Dec 29, 2017

It looks like the version of scipy.sparse.rand that supports passing through a random state is called scipy.sparse.random

Really? Looking at the reference it says

scipy.sparse.rand(m, n, density=0.01, format='coo', dtype=None, random_state=None)

and

scipy.sparse.random(m, n, density=0.01, format='coo', dtype=None, random_state=None, data_rvs=None)

Note the random_state kwarg for both functions.
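
Consistent with those signatures, both functions can be seeded for reproducible draws (the shapes and densities below are arbitrary examples):

```python
import numpy as np
import scipy.sparse

# Both rand and random accept random_state; the same seed reproduces
# the same sparsity pattern and values.
a1 = scipy.sparse.rand(10, 10, density=0.2, random_state=0)
a2 = scipy.sparse.rand(10, 10, density=0.2, random_state=0)
b1 = scipy.sparse.random(10, 10, density=0.2, random_state=0)
b2 = scipy.sparse.random(10, 10, density=0.2, random_state=0)

assert np.array_equal(a1.toarray(), a2.toarray())
assert np.array_equal(b1.toarray(), b2.toarray())
```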

@hameerabbasi (Collaborator) commented Dec 29, 2017

Ah, my mistake, the notes on rand were misleading.

Similar function that allows a user-specified random data source.

@nils-werner nils-werner force-pushed the nils-werner:random branch from 1badb20 to a8ae22d Dec 30, 2017

@nils-werner (Contributor) commented Dec 31, 2017

I have replaced random_x and random_x_bool with calls to sparse.random in all the tests. However, I stumbled across a few oddities that need to be discussed. See code review.

@@ -9,28 +9,6 @@
import sparse

@hameerabbasi (Collaborator) commented Dec 31, 2017

Please remove the random import here; it's unneeded.

x[tuple(random.randint(0, d - 1) for d in x.shape)] = True
return x
@pytest.mark.parametrize('reduction,kwargs', [

@hameerabbasi (Collaborator) commented Dec 31, 2017

Add an extra newline here.

@hameerabbasi (Collaborator) commented Dec 31, 2017

The issue you were having here earlier was already fixed with #56. I've added comments that should help fix the flake8 errors.

@hameerabbasi (Collaborator) commented Dec 31, 2017

I'm good to merge as soon as flake8 is fixed.

Edit: It'd be helpful to add the test @mrocklin described, though.

@nils-werner (Contributor) commented Dec 31, 2017

Don't merge yet; the tests still fail randomly due to the input data.

@hameerabbasi (Collaborator) commented Dec 31, 2017

You might want to rebase this on master and see if the tests still fail. If we're talking about the one I saw fail (test_reductions) then it was fixed in #56.

@nils-werner nils-werner force-pushed the nils-werner:random branch from 40d97a9 to 6222628 Dec 31, 2017

@hameerabbasi (Collaborator) commented Dec 31, 2017

Since it's only failing for the np.float16 dtype, I'm pretty sure these are just floating-point errors. np.allclose checks for a relative error of 1.0e-5, and float16 is nowhere near that accurate. You might want to add an rtol argument to assert_eq that defaults to 1.0e-5, and increase it in this case. I think this fails because we're moving from straight integers stored in a floating-point dtype to real floating-point values, which increases the likelihood of rounding errors.

@nils-werner (Contributor) commented Dec 31, 2017

If I raise atol to 1e-2 (which I find uncomfortably high), I stop seeing random errors...

@hameerabbasi (Collaborator) commented Dec 31, 2017

Great! Fix flake8 and add a test for random and we should be good to merge! I could do that too, if you like!

Thanks for your work on all this.

Edit: You just need to run pip install flake8 (or the conda equivalent) and run flake8 inside the root of the local git repo.

@hameerabbasi (Collaborator) commented Dec 31, 2017

float16 has just 10 bits of floating point precision, out of which one is essentially a sign bit. That leaves 9 bits. 2 ** -9 = 1.953125e-3. Account for the fact that the errors can accumulate, and I think we're good with atol=1.0e-2.

Of course, 2 ** -9 is a relative tolerance, but since the maximum value of random is 1.0-ish, the same numbers carry over to the absolute tolerance.
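
For reference, IEEE half precision has a separate sign bit and a 10-bit significand, so the spacing of float16 values near 1.0 is 2 ** -10 ≈ 9.77e-4. A small demonstration of why np.allclose's default tolerance is too strict for float16:

```python
import numpy as np

# Machine epsilon for float16 is 2 ** -10, far above np.allclose's
# default rtol of 1e-5.
print(np.finfo(np.float16).eps)  # 0.000977

a = np.float16(0.1) + np.float16(0.2)      # rounds to 0.2998046875
assert not np.allclose(a, 0.3, rtol=1e-5)  # default tolerance fails
assert np.allclose(a, 0.3, atol=1e-2)      # the looser atol passes
```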

@nils-werner (Contributor) commented Dec 31, 2017

While playing around with ways of comparing arrays I realized that COO(scipy.sparse.random(...)) always returns unsorted indices. Do we want them to be never sorted/always sorted/optionally sorted?

@hameerabbasi (Collaborator) commented Dec 31, 2017

It would be nice to have an option sorted which defaults to False. I actually think sorted=False is better because we don't want our methods to assume sorted indices, we want them to work regardless of whether the input is or isn't sorted.
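
A sorted=True option could, for instance, reorder the nonzeros by their linearized C-order index. This is only a sketch of that idea, not the PR's actual code:

```python
import numpy as np
import scipy.sparse

s = scipy.sparse.random(4, 6, density=0.5, random_state=0)  # COO format by default
coords = np.vstack([s.row, s.col])
linear = np.ravel_multi_index(coords, s.shape)

# Reorder nonzeros so their linearized indices are ascending.
order = np.argsort(linear)
coords_sorted = coords[:, order]
data_sorted = s.data[order]

# Positions are distinct, so the sorted linear indices strictly increase.
assert (np.diff(linear[order]) > 0).all()
```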

@hameerabbasi hameerabbasi force-pushed the nils-werner:random branch from b3c2637 to 0471fcc Jan 1, 2018

@hameerabbasi (Collaborator) commented Jan 1, 2018

During testing I found that the dtype argument in scipy.sparse.random has no effect. I was considering removing it.

Edit: I also added tests and rebased this onto master.

@mrocklin (Collaborator) commented Jan 1, 2018

Long term we might consider copying scipy's logic to produce the coordinates but then create our own data.

@hameerabbasi (Collaborator) commented Jan 1, 2018

Do you think this is good to merge? It's good from my side. I took a look at SciPy's code; it's got a lot of unnecessary branches, and I think we're better off just using it.

@nils-werner (Contributor) commented Jan 1, 2018

Or we could use the coordinates from SciPy and then do

self.data = numpy.random.rand(elements, dtype=dtype)

@hameerabbasi (Collaborator) commented Jan 1, 2018

numpy.random.rand doesn't seem to have a dtype parameter either.

@nils-werner (Contributor) commented Jan 1, 2018

That's right... no idea why I thought it had one...

@nils-werner (Contributor) commented Jan 1, 2018

Maybe this is starting to overengineer the solution a bit, but we could allow for a data_callback kwarg, and users could inject their own random data generator into the function call:

sparse.random((2, 3, 5), density=0.5)  # float
sparse.random((2, 3, 5), density=0.5, data_callback=lambda x: np.random.choice([True, False], size=x))  # True and False
sparse.random((2, 3, 5), density=0.5, data_callback=lambda x: np.random.randint(10, 20, size=x)).todense()  # Integers from 10 to 20
sparse.random((2, 3, 5), density=0.5, data_callback=lambda x: np.full(x, True))  # all True

@hameerabbasi (Collaborator) commented Jan 1, 2018

No, I actually really like this solution. However, I think it should be in a separate method rather than in sparse.random. I think we should mimic scipy.sparse.random with sparse.random, and add another utility method random_callback (my brain is crapping out on a better name). However, that's for another day... Maybe another pull request?

@hameerabbasi (Collaborator) commented Jan 1, 2018

Merging if there are no comments by 21:00 German time. This can hold back tests on other branches, and they may need to be rebased on this, so I don't want to hold it off too long.

@nils-werner (Contributor) commented Jan 1, 2018

Actually, scipy.sparse.random() does exactly that, and calls the parameter data_rvs. However, what it doesn't do is change the output array dtype.

I've pushed a commit that implements the parameter and tests it with a few possible shapes, densities, and RVSs (one taken from the scipy.sparse.random() docs).
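
For illustration, here is how scipy.sparse.random's data_rvs hook can fill the nonzeros from a custom distribution (the shape and density are arbitrary examples; note the array dtype stays float64 unless requested otherwise):

```python
import numpy as np
import scipy.sparse

rng = np.random.RandomState(0)

# data_rvs is called with the number of nonzeros to generate and must
# return that many values; here, integers in [10, 20).
s = scipy.sparse.random(6, 8, density=0.25, random_state=rng,
                        data_rvs=lambda n: rng.randint(10, 20, size=n))

assert s.nnz == int(0.25 * 6 * 8)              # 12 nonzeros
assert ((s.data >= 10) & (s.data < 20)).all()  # custom value range
```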

@hameerabbasi (Collaborator) commented Jan 1, 2018

Hmmm... Generating the data twice seems rather inefficient and wasteful. Maybe it's worth copying this part of the SciPy code if we can guarantee consistency internally.

@hameerabbasi (Collaborator) commented Jan 1, 2018

@nils-werner Any more changes? If not, we should merge this.

@nils-werner (Contributor) commented Jan 1, 2018

Slow down... we just doubled the LOC in random() a minute ago, and nobody has had time to review the code yet... :-)

@hameerabbasi (Collaborator) commented Jan 1, 2018

Good point. I just really don't want to put off changing the test code in the other branches.

@@ -2,7 +2,7 @@
from .core import COO
def assert_eq(x, y):
def assert_eq(x, y, rtol=1.e-5, atol=1.e-8):

@mrocklin (Collaborator) commented Jan 1, 2018

Should we just pass through **kwargs here?

@hameerabbasi (Collaborator) commented Jan 1, 2018

Yes, that seems like a better option.
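
A **kwargs pass-through for the comparison helper might look like this minimal sketch (the PR's real assert_eq does more than this):

```python
import numpy as np

def assert_eq(x, y, **kwargs):
    # Forward tolerance arguments (rtol, atol, ...) straight to
    # np.allclose instead of enumerating them explicitly.
    assert x.shape == y.shape
    assert np.allclose(x, y, **kwargs)

# float16 rounding error exceeds the default rtol, so callers loosen it.
assert_eq(np.float16([0.1, 0.2]), np.array([0.1, 0.2]), atol=1e-2)
```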

@mrocklin (Collaborator) commented Jan 1, 2018

Overall this seems great to me. It looks like we're able to handle most cases and we're staying within the scipy.sparse API.

@hameerabbasi hameerabbasi merged commit 462f606 into pydata:master Jan 2, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed