Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Vitter Random Sampling without replacement #540

Merged
merged 7 commits into from Mar 15, 2022
Merged

Adding Vitter Random Sampling without replacement #540

merged 7 commits into from Mar 15, 2022

Conversation

smldub
Copy link
Contributor

@smldub smldub commented Mar 13, 2022

Fixes #539
Added the random number generation algorithms from this paper:
https://www.cs.emory.edu/~cheung/Courses/584/Syllabus/papers/RandomSampling/1984-Vitter-Faster-random-sampling.pdf
Seems to run fine if the new functions are not compiled with numba, but that step introduces quite a few new errors.

@smldub smldub changed the title Rand Adding Vitter Random Sampling without replacement Mar 13, 2022
@codecov
Copy link

codecov bot commented Mar 13, 2022

Codecov Report

Merging #540 (42fb5ce) into master (fc77d69) will decrease coverage by 2.58%.
The diff coverage is 18.26%.

@@            Coverage Diff             @@
##           master     #540      +/-   ##
==========================================
- Coverage   95.33%   92.74%   -2.59%     
==========================================
  Files          20       20              
  Lines        3022     3116      +94     
==========================================
+ Hits         2881     2890       +9     
- Misses        141      226      +85     

@smldub
Copy link
Contributor Author

smldub commented Mar 13, 2022

I believe the main problem arises from Numba not supporting RandomState. It does however support np.random.seed, so I think the work around is feed a random number from random_state into functions as the seed.

@hameerabbasi
Copy link
Collaborator

Can you post an empirical CDF sampled with the new function, just for comparison?

@smldub
Copy link
Contributor Author

smldub commented Mar 14, 2022

Screen Shot 2022-03-14 at 10 27 42 AM

Code based off newest commit:

import numpy as np
import sparse
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
N = int(1E4)
for i in range(1,30,4):
    a=sparse.random((N),i/100).coords.reshape(-1)
    ax1.plot(a, np.linspace(0, 1, len(a), endpoint=False))
    a = np.random.choice(N,int(N*i/100),replace = False)
    ax2.plot(np.sort(a), np.linspace(0, 1, len(a), endpoint=False))

ax1.legend([('dens=',i/100) for i in range(1,30,4)])
ax2.legend([('dens=',i/100) for i in range(1,30,4)])
ax1.set_title("New Function")
ax2.set_title("np.random.choice")
ax1.plot(np.linspace(0,N,N),np.linspace(0,1,N), 'k--')
ax2.plot(np.linspace(0,N,N),np.linspace(0,1,N), 'k--')
plt.show()

@hameerabbasi
Copy link
Collaborator

Thanks! Always good to check if there are any major numerical inaccuracies. Would you happen to know the numerical stability of the algorithm and how it affects the resulting distribution, just out of curiosity? If not, the plot should suffice.

@hameerabbasi
Copy link
Collaborator

Also, let me know when you're done making changes, and I'll merge.

Copy link
Collaborator

@hameerabbasi hameerabbasi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@smldub
Copy link
Contributor Author

smldub commented Mar 14, 2022

I don't know how to measure numerical stability, but the algorithm is used in some languages' random number generation libaries like Julia, and the original paper been cited a lot, so I think the underlying principle is good.
Potential problems:

  1. I might've messed up in the implementation.
  2. I worry that the way I implemented randomness is overly limiting the possible sparse matrices that can be produced. (might be better not to feed in any random_state info at all if none is provided by the user)
  3. This change has messed up anything that relied on a specific random seeded matrix, so I don't know if that could be a problem for legacy projects.
  4. I am not a computer science guy, so I don't really have the experience to check the code for common edge cases that might cause this to fail.

I think having someone else look over it for a sanity check before merging would probably be a good idea.

@hameerabbasi
Copy link
Collaborator

Okay! If you're willing, you could post to the SciPy Developers mailing list and request a review.

I'm not particularly concerned about breaking random stability guarantees for the same seed, just about the distribution being okay.

That said, if one considers the cardinality of all sequences that can be sampled, and the size of the seed (which still exists internally even if we pass nothing) some sparse matrices were ALWAYS going to be avoided.

@hameerabbasi
Copy link
Collaborator

One potential problem I spot right away is that Numba doesn't pass the seed back to the code, and data_rvs could be called with a stale state (the two random bit streams could be correlated).

It might make sense to return the RNG state from each algorithm manually and seed the RandomState object with that.

@smldub
Copy link
Contributor Author

smldub commented Mar 15, 2022

Yea unfortunately Numba doesn't support np.random.get_state, but the numba functions aren't calling the same seed that makes the random state. The seed for those functions is
seed = random_state.choice(np.iinfo(np.int32).max),
so the likelihood (from my pretty bad understanding of statistics) that the states will be the same is 1/2^64 (assuming the first seed is just randomly generated). Also numba functions don't advance the counter for random_state (global or local) or change the global np.random.seed.
I am going to post one more (hopefully) commit changing some comments, and if you are fine with how it looks/my above argument. Feel free to merge it.

@hameerabbasi hameerabbasi merged commit 07eeab2 into pydata:master Mar 15, 2022
@smldub smldub deleted the rand branch March 15, 2022 14:37
@hameerabbasi
Copy link
Collaborator

Thank-you for your excellent work on this, @smldub! You're a fast learner. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Additional Random Generator
2 participants