New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Vitter Random Sampling without replacement #540
Conversation
Codecov Report
@@ Coverage Diff @@
## master #540 +/- ##
==========================================
- Coverage 95.33% 92.74% -2.59%
==========================================
Files 20 20
Lines 3022 3116 +94
==========================================
+ Hits 2881 2890 +9
- Misses 141 226 +85 |
I believe the main problem arises from Numba not supporting RandomState. It does however support np.random.seed, so I think the work around is feed a random number from random_state into functions as the seed. |
Can you post an empirical CDF sampled with the new function, just for comparison? |
Code based off newest commit:
|
Thanks! Always good to check if there are any major numerical inaccuracies. Would you happen to know the numerical stability of the algorithm and how it affects the resulting distribution, just out of curiosity? If not, the plot should suffice. |
Also, let me know when you're done making changes, and I'll merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I don't know how to measure numerical stability, but the algorithm is used in some languages' random number generation libaries like Julia, and the original paper been cited a lot, so I think the underlying principle is good.
I think having someone else look over it for a sanity check before merging would probably be a good idea. |
Okay! If you're willing, you could post to the SciPy Developers mailing list and request a review. I'm not particularly concerned about breaking random stability guarantees for the same seed, just about the distribution being okay. That said, if one considers the cardinality of all sequences that can be sampled, and the size of the seed (which still exists internally even if we pass nothing) some sparse matrices were ALWAYS going to be avoided. |
One potential problem I spot right away is that Numba doesn't pass the seed back to the code, and It might make sense to return the RNG state from each algorithm manually and seed the |
Yea unfortunately Numba doesn't support np.random.get_state, but the numba functions aren't calling the same seed that makes the random state. The seed for those functions is |
Thank-you for your excellent work on this, @smldub! You're a fast learner. 😄 |
Fixes #539
Added the random number generation algorithms from this paper:
https://www.cs.emory.edu/~cheung/Courses/584/Syllabus/papers/RandomSampling/1984-Vitter-Faster-random-sampling.pdf
Seems to run fine if the new functions are not compiled with numba, but that step introduces quite a few new errors.