
Add random() function #41

Merged
merged 18 commits into pydata:master on Jan 2, 2018

Conversation

@nils-werner (Contributor)

I think a random() function could be pretty handy for generating some quick testing data.

I tried several methods, like generating .coords and .data myself and creating a COO object from them. In the end it turned out to be easiest to create a linear scipy.sparse array and simply reshape it to the desired output shape.
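
Roughly, the approach looks like this (a sketch using the existing COO.from_scipy_sparse and reshape helpers; the final parameter list may differ):

import numpy as np
import scipy.sparse
from sparse import COO

def random(shape, density=0.01, dtype=None):
    # Draw a flat (n x 1) sparse matrix, then reshape it to the target shape.
    elements = np.prod(shape)
    flat = scipy.sparse.rand(elements, 1, density, dtype=dtype)
    return COO.from_scipy_sparse(flat).reshape(shape)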

@mrocklin (Collaborator)

Thanks for this. Some notes:

  1. This could use a test
  2. Any thoughts on using this to replace the random_x function currently used in the tests? Your solution is likely far more efficient.

@nils-werner (Contributor, Author)

  1. This could use a test

Definitely! What would you like to be tested?

  2. Any thoughts on using this to replace the random_x function currently used in the tests?

Might be worth investigating. I didn't understand what random_x was doing at first glance, so I just ignored it for now.

@mrocklin (Collaborator)

Well, we might verify a variety of properties we know about sparse matrices. Here are a few that come to mind (a test sketch follows the list):

  1. The shape matches the intended shape
  2. The dtype matches the intended dtype (for a few dtypes)
  3. The density matches the intended density (according to nnz)
  4. Two calls to the same function don't produce the same results
  5. Two calls with the same args and with the same random state seed (we'll need to pass this through) produce equivalent results
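
A minimal pytest sketch of those checks, assuming sparse.random ends up accepting density, dtype and random_state (the last of which still needs to be wired through):

import numpy as np
import pytest
import sparse


def test_random_shape_and_density():
    shape = (10, 20, 30)
    s = sparse.random(shape, density=0.1)
    assert s.shape == shape                                   # 1. shape matches
    expected_nnz = 0.1 * np.prod(shape)
    assert abs(s.nnz - expected_nnz) < 0.01 * np.prod(shape)  # 3. density matches (via nnz)


@pytest.mark.parametrize('dtype', [np.float32, np.float64])
def test_random_dtype(dtype):
    s = sparse.random((10, 10), density=0.2, dtype=dtype)
    assert s.dtype == dtype                                   # 2. dtype matches


def test_random_seed():
    a = sparse.random((10, 10), density=0.3, random_state=42)
    b = sparse.random((10, 10), density=0.3, random_state=42)
    c = sparse.random((10, 10), density=0.3)
    assert np.array_equal(a.todense(), b.todense())           # 5. same seed, same result
    assert not np.array_equal(a.todense(), c.todense())       # 4. unseeded calls differ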

@mrocklin (Collaborator)

Might be worth investigating. I didn't understand what random_x was doing at first glance, so I just ignored it for now.

It attempts to do the same thing that you're doing here, just much less efficiently, and much less generally :)

@hameerabbasi (Collaborator)

The proper way to test a random function is currently a hot debate topic. There seem to be two ways to test it, neither of which I'm happy with:

  • Mock/monkeypatch the underlying random function to check that it receives the correct arguments; in this case, scipy.sparse.rand (a rough sketch follows this list).
  • Run it lots of times until the result is statistically certain to fall within a certain range, then test for that range.
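
A rough sketch of the first approach, assuming sparse.random forwards its arguments to scipy.sparse.rand:

import scipy.sparse
from unittest import mock

import sparse


def test_density_is_forwarded():
    # Spy on scipy.sparse.rand and check that the density argument reaches it.
    with mock.patch('scipy.sparse.rand', wraps=scipy.sparse.rand) as spy:
        sparse.random((2, 3, 5), density=0.5)
    args, kwargs = spy.call_args
    assert 0.5 in args or kwargs.get('density') == 0.5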

I will also add that random_x returns a dense NumPy array, and this returns a sparse one.

@mrocklin (Collaborator)

I don't think we need to test the random number generator within this. We can rely on SciPy/NumPy to produce a decent distribution of numbers. I'm mostly concerned that we test basic user expectations.

sparse/core.py Outdated
elements = np.prod(shape)

return COO.from_scipy_sparse(
    scipy.sparse.rand(elements, 1, density, dtype=dtype)
)
Collaborator

Here, it would be useful to specify format='coo' explicitly. Might speed up the conversion.

Contributor Author

Maybe I am missing something, but scipy.sparse.rand() defaults to format='coo'...

@hameerabbasi (Collaborator) commented Dec 27, 2017

(we'll need to pass this through)

It looks like the version of scipy.sparse.rand that supports passing through a random state is called scipy.sparse.random and only supports floating point dtypes. So we might lose something if we implement this check.

@mrocklin (Collaborator)

It looks like the version of scipy.sparse.rand that supports passing through a random state is called scipy.sparse.random and only supports floating point dtypes. So we might lose something if we implement this check.

That's unfortunate

@nils-werner (Contributor, Author) commented Dec 29, 2017

It looks like the version of scipy.sparse.rand that supports passing through a random state is called scipy.sparse.random

Really? Looking at the reference it says

scipy.sparse.rand(m, n, density=0.01, format='coo', dtype=None, random_state=None)

and

scipy.sparse.random(m, n, density=0.01, format='coo', dtype=None, random_state=None, data_rvs=None)

Note the random_state kwarg for both functions.
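
For example, seeding either helper makes the output reproducible (a quick sketch):

import scipy.sparse

a = scipy.sparse.random(4, 5, density=0.25, format='coo', random_state=0)
b = scipy.sparse.random(4, 5, density=0.25, format='coo', random_state=0)
assert (a.toarray() == b.toarray()).all()   # same pattern and same values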

@hameerabbasi (Collaborator)

Ah, my mistake, the note in the rand docs was misleading:

Similar function that allows a user-specified random data source.

@nils-werner (Contributor, Author)

I have replaced random_x and random_x_bool with calls to sparse.random in all the tests. However, I stumbled across a few oddities that need to be discussed. See code review.

@@ -9,28 +9,6 @@
import sparse
Collaborator

Please remove random here, it's unneeded.

x[tuple(random.randint(0, d - 1) for d in x.shape)] = True
return x


@pytest.mark.parametrize('reduction,kwargs', [
Collaborator

Add an extra newline here.

@hameerabbasi (Collaborator)

The issue you were having here earlier was already fixed with #56. I've added comments that should help fix the flake8 errors.

@hameerabbasi (Collaborator) commented Dec 31, 2017

I'm good to merge as soon as flake8 is fixed.

Edit: It'd be helpful to add the test @mrocklin described, though.

@nils-werner (Contributor, Author)

Don't merge yet, the tests still randomly fail due to the input data

@hameerabbasi (Collaborator)

You might want to rebase this on master and see if the tests still fail. If we're talking about the one I saw fail (test_reductions) then it was fixed in #56.

@hameerabbasi (Collaborator) commented Dec 31, 2017

Since it's just failing for the np.float16 dtype, I'm pretty sure these are just floating point errors. np.allclose checks for a relative error of 1.0e-5 by default, and float16 is nowhere near that accurate. You might want to add an rtol argument to assert_eq that defaults to 1.0e-5 and increase it in this case. I think this fails because we're moving from straight integers stored in a floating-point dtype to real floating-point values, which increases the likelihood of rounding errors showing up.
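
A quick illustration of why the default tolerance is too tight for float16:

import numpy as np

a = np.float16(0.1) * np.float16(0.2)   # rounds to ~0.019989 in float16
b = 0.1 * 0.2                           # ~0.02 in float64

np.allclose(a, b)                # False: error ~1.1e-5 exceeds rtol * b ~= 2e-7
np.allclose(a, b, atol=1e-2)     # True once the absolute tolerance is loosened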

@nils-werner (Contributor, Author) commented Dec 31, 2017

If I raise atol to 1e-2 (which I find uncomfortably high) I stop seeing random errors...

@hameerabbasi (Collaborator) commented Dec 31, 2017

Great! Fix flake8 and add a test for random and we should be good to merge! I could do that too, if you like!

Thanks for your work on all this.

Edit: You just need to run pip install flake8 (or the conda equivalent) and run flake8 inside the root of the local git repo.

@hameerabbasi (Collaborator) commented Dec 31, 2017

float16 has just 10 explicit bits of mantissa, which puts its machine epsilon at about 2 ** -10, i.e. roughly 1e-3. Account for the fact that the errors can accumulate, and I think we're good with atol=1.0e-2.

Of course, that epsilon is a relative error, but since the maximum value of random is 1.0-ish, the same bound carries over to the absolute tolerance.
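
NumPy reports the corresponding resolution directly:

import numpy as np

np.finfo(np.float16).eps   # ~0.000977, the relative spacing of float16 values near 1.0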

@nils-werner (Contributor, Author) commented Dec 31, 2017

While playing around with ways of comparing arrays I realized that COO(scipy.sparse.random(...)) always returns unsorted indices. Do we want them to be never sorted/always sorted/optionally sorted?

@hameerabbasi (Collaborator)

It would be nice to have an option sorted which defaults to False. I actually think sorted=False is better because we don't want our methods to assume sorted indices; we want them to work regardless of whether the input is sorted or not.

@hameerabbasi (Collaborator) commented Jan 1, 2018

During testing I found that the dtype argument in scipy.sparse.random has no effect. I was considering removing it.

Edit: I also added tests and rebased this onto master.

@mrocklin (Collaborator) commented Jan 1, 2018

Long term we might consider copying scipy's logic to produce the coordinates but then creating our own data.

@hameerabbasi (Collaborator)

Do you think this is good to merge? It's good from my side. I took a look at SciPy's code; it has a lot of branches we don't need, so I think we're better off just calling it rather than copying it.

@nils-werner (Contributor, Author)

Or we could use the coordinates from SciPy and then do

self.data = numpy.random.rand(elements, dtype=dtype)

@hameerabbasi (Collaborator)

numpy.random.rand doesn't seem to have a dtype parameter either.

@nils-werner (Contributor, Author)

That's right... no idea why I thought it had one...

@nils-werner (Contributor, Author)

Maybe this is starting to overengineer the solution a bit, but we could allow for a data_callback kwarg, and users could inject their own random data generator into the function call:

sparse.random((2, 3, 5), density=0.5)  # float
sparse.random((2, 3, 5), density=0.5, data_callback=lambda x: np.random.choice([True, False], size=x))  # True and False
sparse.random((2, 3, 5), density=0.5, data_callback=lambda x: np.random.randint(10, 20, size=x)).todense()  # Integers from 10 to 20
sparse.random((2, 3, 5), density=0.5, data_callback=lambda x: np.full(x, True))  # all True
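
Under the hood this could be as simple as letting scipy pick the sparsity pattern and then overwriting the values (a hypothetical sketch, not the current implementation):

import numpy as np
import scipy.sparse

# scipy chooses the coordinates; the callback supplies the nonzero values.
flat = scipy.sparse.rand(np.prod((2, 3, 5)), 1, density=0.5, format='coo')
flat.data = np.random.randint(10, 20, size=flat.nnz)   # e.g. integers from 10 to 20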

@hameerabbasi (Collaborator) commented Jan 1, 2018

No, I actually really like this solution. However, I think it should be in a separate method rather than in sparse.random. I think we should mimic scipy.sparse.random with sparse.random, and add another utility method random_callback (my brain is crapping out on a better name). However, that's for another day... Maybe another pull request?

@hameerabbasi (Collaborator)

Merging if there are no comments by 21:00 German time. This can hold back tests on other branches, and they may need to be rebased on this, so I don't want to hold it open too long.

@nils-werner (Contributor, Author) commented Jan 1, 2018

Actually, scipy.sparse.random() does exactly that, and calls the parameter data_rvs. However, what they don't do is change the output array's dtype.

I've pushed a commit that implements the parameter and tests it with a few possible shapes, densities and RVSs (one taken from the scipy.sparse.random() docs).
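
The example from the scipy.sparse.random docs draws the nonzero values from a frozen distribution's rvs method, roughly:

import scipy.sparse
from scipy import stats

# Nonzero values come from a Poisson distribution instead of uniform [0, 1).
rvs = stats.poisson(25, loc=10).rvs
s = scipy.sparse.random(3, 4, density=0.25, data_rvs=rvs)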

@hameerabbasi (Collaborator)

Hmmm... it seems rather inefficient and wasteful to generate the data twice. Maybe it's worth copying this part of the SciPy code if we can guarantee consistency internally.

@hameerabbasi (Collaborator)

@nils-werner Any more changes? If not, we should merge this.

@nils-werner (Contributor, Author) commented Jan 1, 2018

Slow down... we doubled the LOC in random() just a minute ago, and nobody has had time to review the code yet... :-)

@hameerabbasi (Collaborator)

Good point. I just really don't want to put off changing the test code in the other branches.

sparse/utils.py Outdated
@@ -2,7 +2,7 @@
from .core import COO


-def assert_eq(x, y):
+def assert_eq(x, y, rtol=1.e-5, atol=1.e-8):
Collaborator

Should we just pass through **kwargs here?

Collaborator

Yes, that seems like a better option.
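
A minimal sketch of that change (not the actual sparse/utils.py body):

import numpy as np


def assert_eq(x, y, **kwargs):
    # Forward any tolerance options (rtol, atol, ...) straight to np.allclose.
    assert x.shape == y.shape
    xx = x.todense() if hasattr(x, 'todense') else x
    yy = y.todense() if hasattr(y, 'todense') else y
    assert np.allclose(xx, yy, **kwargs)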

@mrocklin (Collaborator) commented Jan 1, 2018

Overall this seems great to me. It looks like we're able to handle most cases and we're staying within the scipy.sparse API.

@hameerabbasi merged commit 462f606 into pydata:master on Jan 2, 2018