BUG: sparse: smarter random index selection #3650

Merged
merged 1 commit into from
May 29, 2014

Conversation

perimosocordiae
Member

Fixes #3648.

Caveat: The np.random.choice function was added in numpy 1.7.0, so we probably need a shim for older versions. I'm not sure how best to accomplish that, so suggestions are welcome.

@coveralls

Coverage Status

Coverage remained the same when pulling ee4f0eb on perimosocordiae:patch-5 into 7ff4e90 on scipy:master.

@WarrenWeckesser
Member

This would be the right thing to do, except that np.random.choice ultimately does this:

    idx = self.permutation(pop_size)[:size]

where pop_size will be m*n and size will be k. permutation is implemented like this:

    def permutation(self, object x):
        if isinstance(x, (int, long, np.integer)):
            arr = np.arange(x)
        else:
            arr = np.array(x)
        self.shuffle(arr)
        return arr

So a temporary array with size m*n is created behind the scenes. That's bad if m*n is large.
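For a sense of scale, here is a rough back-of-the-envelope sketch. It only computes how large the `np.arange(m*n)` temporary would be, without allocating it. The helper name `permutation_temp_bytes`, the 8-byte integer dtype and the example shape are assumptions for illustration, not anything from the patch.

```python
import numpy as np

def permutation_temp_bytes(m, n, dtype=np.int64):
    """Bytes needed for the arange(m*n) temporary (hypothetical helper)."""
    return m * n * np.dtype(dtype).itemsize

# Example: a 1,000,000 x 1,000,000 sparse matrix, however few nonzeros we want.
nbytes = permutation_temp_bytes(10**6, 10**6)
print(nbytes / 1e12, "TB")
```

The cost depends only on m*n, not on k, which is why asking for a handful of entries in a huge matrix can still fail.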

@argriffing
Contributor

> This would be the right thing to do, except that np.random.choice ultimately does this:

Should this not be changed in numpy?

    >>> np.random.choice(1000000000000, size=2)
    array([586913473276, 824603097730])
    >>> np.random.choice(1000000000000, size=2, replace=False)
    Traceback (most recent call last):
      File "mtrand.pyx", line 4490, in mtrand.RandomState.permutation (numpy/random/mtrand/mtrand.c:20666)
    MemoryError

@perimosocordiae
Member Author

I agree that it's a numpy bug, and there's even an issue for it: numpy/numpy#2764

Commit f375852 uses the workaround from that issue. It doesn't use the numpy random state (because it uses the Python stdlib's random.sample), but I'm not sure if that's a problem.
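For readers who have not seen the `random.sample` approach, a minimal sketch of the idea follows. It is written for Python 3, where `range` is lazy; the values of `mn` and `k` are assumptions chosen for illustration, and this is not the code from the commit.

```python
import random

mn = 10**12   # total number of cells, m*n (illustrative value)
k = 5         # number of entries wanted (illustrative value)

# range() is lazy, so no sequence of length mn is ever materialized.
ind = random.sample(range(mn), k)
print(sorted(ind))
```

This draws from the stdlib generator, not from numpy's, which is the concern raised above.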

@perimosocordiae
Member Author

It's a problem for Travis, at least. The failing tests are due to not using numpy's random seed for index selection, which makes the tests nondeterministic.
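A small sketch of why the tests become nondeterministic: seeding numpy's global generator leaves the stdlib generator untouched, so `random.sample` keeps whatever state it already had. The seed value is arbitrary.

```python
import random
import numpy as np

# Snapshot the stdlib generator, then seed numpy.
before = random.getstate()
np.random.seed(1234)
after = random.getstate()

# The stdlib state is unchanged: np.random.seed() only affects numpy.
print(before == after)

# numpy draws are reproducible under the seed...
a = np.random.randint(0, 100, size=3)
np.random.seed(1234)
b = np.random.randint(0, 100, size=3)
print((a == b).all())
```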

I'm not sure how to fix this. Any ideas?

@WarrenWeckesser
Member

Before using Python's random.sample, the internal state of the Python RNG can be set using random.setstate, giving it numpy's state retrieved using np.random.get_state(). I haven't thought too much about this, so I don't know what problems that will cause, if any.
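To make the suggestion concrete, here is a sketch that just inspects the two state objects side by side. Both generators are Mersenne Twisters, but the state tuples are laid out differently, so passing one directly to the other's setter is not possible without conversion. The seeds are arbitrary, and the sketch does not attempt the conversion itself.

```python
import random
import numpy as np

np_state = np.random.RandomState(0).get_state()
py_state = random.Random(0).getstate()

# numpy: (name, array of 32-bit words, position, has_gauss, cached_gaussian)
print(np_state[0], len(np_state[1]))

# stdlib: (version, tuple of words with the position appended, gauss_next)
print(py_state[0], len(py_state[1]))
```

Even with the words copied across, the two libraries turn raw generator output into bounded integers in their own ways, so identical index sequences should not be assumed.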

@perimosocordiae
Member Author

A quick check shows that the states aren't easily transferable. I've rewritten the patch (again) to use np.random.choice when n < 3*k, and to use the simple set-based algorithm from the python stdlib otherwise.

The heuristic used in the actual random.sample is more sophisticated, but I'm not sure that it transfers correctly to the numpy case.
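Roughly, the two-branch approach described above could look like the sketch below. The function name `sample_unique_indices` is hypothetical, `mn` stands for the population size m*n, and the sketch assumes `k <= mn` and a numpy new enough to have `choice`. It is an illustration of the idea, not the patch itself.

```python
import numpy as np

def sample_unique_indices(random_state, mn, k):
    """Pick k distinct integers from range(mn) (hypothetical helper)."""
    if mn < 3 * k:
        # Dense case: the temporary is at most about 3x larger than k.
        return random_state.choice(mn, size=k, replace=False)
    # Sparse case: draw one index at a time and reject duplicates,
    # in the spirit of the set-based branch of random.sample.
    selected = set()
    ind = np.empty(k, dtype=np.int64)
    for i in range(k):
        j = random_state.randint(mn)
        while j in selected:
            j = random_state.randint(mn)
        selected.add(j)
        ind[i] = j
    return ind

rs = np.random.RandomState(0)
ind = sample_unique_indices(rs, mn=10**9, k=5)
print(ind)
```

Because every draw goes through `random_state`, seeding it makes the result reproducible. With k below mn/3, each draw collides with an already chosen index less than a third of the time, so the rejection loop stays short.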

    gk *= 1.05
    ind = _gen_unique_rand(random_state, gk)
    # Use the algorithm from python's random.sample for k < n/3.
    if n < 3*k:
Member

Should be mn < 3*k (and mn in the comment above).

@coveralls

Coverage Status

Coverage remained the same when pulling 02842a6 on perimosocordiae:patch-5 into 7ff4e90 on scipy:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 7dd6169 on perimosocordiae:patch-5 into 7ff4e90 on scipy:master.

@WarrenWeckesser
Member

The tests passed with numpy 1.5.1? Looks like we need a unit test for the case mn < 3*k.

@WarrenWeckesser
Member

> Caveat: The np.random.choice function was added in numpy 1.7.0...

Instead of using choice, just roll your own version, with something like

    # ind = random_state.choice(mn, size=k, replace=False)
    r = np.arange(mn)
    random_state.shuffle(r)
    ind = r[:k]

@WarrenWeckesser
Member

permutation is in numpy 1.5.1, so a shorter replacement is

    # ind = random_state.choice(mn, size=k, replace=False)
    ind = random_state.permutation(mn)[:k]

@argriffing
Contributor

> Instead of using np.random.choice, just roll your own version

I've been putting these into scipy/lib/_numpy_compat.py.
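As an illustration of what such a compatibility helper might look like (the name `choice_without_replacement` and the `hasattr` check are assumptions, not the contents of `_numpy_compat.py`):

```python
import numpy as np

def choice_without_replacement(random_state, mn, k):
    """k distinct integers from range(mn) (hypothetical compat helper)."""
    if hasattr(random_state, "choice"):
        # numpy >= 1.7 provides RandomState.choice.
        return random_state.choice(mn, size=k, replace=False)
    # Older numpy: fall back to a truncated permutation.
    return random_state.permutation(mn)[:k]

rs = np.random.RandomState(0)
ind = choice_without_replacement(rs, mn=20, k=5)
print(ind)
```

Both branches allocate an array of length mn, so a helper like this is only appropriate when mn is not much larger than k.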

@coveralls

Coverage Status

Coverage remained the same when pulling 69555fd on perimosocordiae:patch-5 into 7ff4e90 on scipy:master.

@perimosocordiae
Member Author

Any further comments before merging?

@WarrenWeckesser
Member

I'll take another look this week.

I'm curious how the performance of the new method compares to the (debugged) old method.

    ind = _gen_unique_rand(random_state, gk)
    # Use the algorithm from python's random.sample for k < mn/3.
    if mn < 3*k:
    # ind = random_state.choice(mn, size=k, replace=False)
Member

We generally don't leave in code that has been commented out. This line should be removed, or additional comments should be added to explain that this is what we would use, but choice is only available in numpy version 1.7 or later.

@coveralls

Coverage Status

Coverage increased (+0.2%) when pulling 918cb7f on perimosocordiae:patch-5 into 7ff4e90 on scipy:master.

@WarrenWeckesser
Member

I don't know what went wrong with the python 2.7 build on Travis.

How about rebasing and squashing your commits into a single commit? That will clean up the history and rerun Travis. If the tests pass, I think this is ready to go.

@WarrenWeckesser
Member

By the way, since this is a bug fix, the prefix for the commit message should be "BUG: sparse: ...".

@perimosocordiae perimosocordiae changed the title ENH: smarter random index selection BUG: smarter random index selection May 29, 2014
@perimosocordiae perimosocordiae changed the title BUG: smarter random index selection BUG: sparse: smarter random index selection May 29, 2014
@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling 589c372 on perimosocordiae:patch-5 into acc7c94 on scipy:master.

WarrenWeckesser added a commit that referenced this pull request May 29, 2014
BUG: sparse: smarter random index selection
@WarrenWeckesser WarrenWeckesser merged commit 93656a8 into scipy:master May 29, 2014
@WarrenWeckesser
Member

Merged. Thanks!

Successfully merging this pull request may close these issues.

scipy.sparse.rand leaves empty columns
5 participants