Adding Vitter Random Sampling without replacement #540

smldub · 2022-03-13T13:43:50Z

Fixes #539
Added the random number generation algorithms from this paper:
https://www.cs.emory.edu/~cheung/Courses/584/Syllabus/papers/RandomSampling/1984-Vitter-Faster-random-sampling.pdf
The new functions seem to run fine when they are not compiled with Numba, but compiling them introduces quite a few new errors.
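For readers who have not opened the paper, here is a minimal sketch of the simpler skip-based method usually called Algorithm A in Vitter's work. It is not the code from this PR: the function name `vitter_algorithm_a`, its argument order, and the use of `np.random.default_rng` are assumptions made for illustration only.

```python
import numpy as np


def vitter_algorithm_a(n, N, rng):
    """Return n distinct, sorted indices from range(N) (hypothetical helper)."""
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    sample = np.empty(n, dtype=np.int64)
    if n == 0:
        return sample
    top = N - n        # records that may still be skipped
    remaining = N      # records not yet processed
    current = -1       # index of the last processed record
    j = 0
    while n >= 2:
        v = rng.random()
        skip = 0
        quot = top / remaining  # P(skip >= 1)
        # Walk along the skip distribution until it drops to v or below.
        while quot > v:
            skip += 1
            top -= 1
            remaining -= 1
            quot = quot * top / remaining
        current += skip + 1     # skip `skip` records, select the next one
        sample[j] = current
        j += 1
        remaining -= 1
        n -= 1
    # Last element: the skip is uniform over the remaining records.
    skip = min(int(remaining * rng.random()), remaining - 1)
    sample[j] = current + skip + 1
    return sample


rng = np.random.default_rng(0)
print(vitter_algorithm_a(10, 1000, rng))
```

The output is already sorted, which is why no separate sort or de-duplication pass is needed afterwards.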

codecov · 2022-03-13T13:45:38Z

Codecov Report

Merging #540 (42fb5ce) into master (fc77d69) will decrease coverage by 2.58%.
The diff coverage is 18.26%.

@@            Coverage Diff             @@
##           master     #540      +/-   ##
==========================================
- Coverage   95.33%   92.74%   -2.59%     
==========================================
  Files          20       20              
  Lines        3022     3116      +94     
==========================================
+ Hits         2881     2890       +9     
- Misses        141      226      +85

smldub · 2022-03-13T21:31:13Z

I believe the main problem arises from Numba not supporting RandomState. It does, however, support np.random.seed, so I think the workaround is to feed a random number drawn from random_state into the functions as the seed.
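A rough sketch of that workaround, under stated assumptions: `draw_seed` and `seeded_draws` are hypothetical names, not functions from this PR, and the body of the jitted function is a placeholder rather than the real sampler. The `try`/`except` only keeps the snippet runnable where Numba is not installed; in that fallback `np.random.seed` reseeds NumPy's global generator, which the Numba-compiled version does not do.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: fall back to plain Python without Numba
    def njit(func):
        return func


def draw_seed(random_state):
    """Draw a 32-bit-range integer seed from a RandomState (hypothetical helper)."""
    return int(random_state.randint(np.iinfo(np.int32).max))


@njit
def seeded_draws(n, k, seed):
    # Seed the generator used inside the compiled function, then draw from it.
    np.random.seed(seed)
    out = np.empty(k, dtype=np.int64)
    for i in range(k):
        out[i] = np.random.randint(0, n)
    return out


random_state = np.random.RandomState(42)
seed = draw_seed(random_state)
print(seeded_draws(1000, 5, seed))
```

Passing the same seed twice reproduces the same draws, so the result still depends only on the user's random_state.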

hameerabbasi · 2022-03-14T08:57:34Z

Can you post an empirical CDF sampled with the new function, just for comparison?

smldub · 2022-03-14T15:31:48Z

Code based off newest commit:

import matplotlib.pyplot as plt
import numpy as np
import sparse

N = int(1e4)
densities = [i / 100 for i in range(1, 30, 4)]

fig, (ax1, ax2) = plt.subplots(1, 2)
for density in densities:
    # Empirical CDF of the coordinates sampled by the new sparse.random.
    a = sparse.random((N,), density).coords.reshape(-1)
    ax1.plot(a, np.linspace(0, 1, len(a), endpoint=False))
    # Reference: NumPy sampling without replacement, sorted to form a CDF.
    a = np.random.choice(N, int(N * density), replace=False)
    ax2.plot(np.sort(a), np.linspace(0, 1, len(a), endpoint=False))

labels = [f"dens={density}" for density in densities]
for ax, title in ((ax1, "New Function"), (ax2, "np.random.choice")):
    ax.legend(labels)
    ax.set_title(title)
    # Dashed diagonal: CDF of an ideal uniform distribution.
    ax.plot(np.linspace(0, N, N), np.linspace(0, 1, N), "k--")
plt.show()

hameerabbasi · 2022-03-14T15:43:24Z

Thanks! Always good to check if there are any major numerical inaccuracies. Would you happen to know the numerical stability of the algorithm and how it affects the resulting distribution, just out of curiosity? If not, the plot should suffice.
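If a number is wanted alongside the plot, one option is the largest gap between the empirical CDF and the uniform CDF (the Kolmogorov-Smirnov distance). The sketch below is an illustration only: `ks_distance_to_uniform` is a hypothetical helper, and it samples with NumPy's `Generator.choice` rather than the new function, so it shows the check, not a result for this PR.

```python
import numpy as np


def ks_distance_to_uniform(indices, N):
    """Largest gap between the sample's empirical CDF and the uniform CDF on [0, N)."""
    x = np.sort(np.asarray(indices, dtype=float)) / N
    n = len(x)
    steps = np.arange(n + 1) / n
    d_plus = np.max(steps[1:] - x)    # ECDF above the diagonal
    d_minus = np.max(x - steps[:-1])  # ECDF below the diagonal
    return max(d_plus, d_minus)


rng = np.random.default_rng(0)
N = 10_000
sample = rng.choice(N, size=1000, replace=False)
print(ks_distance_to_uniform(sample, N))
```

A small distance for several densities would support what the plot shows visually; a distance that grows with density would point to a bias in the sampler.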

hameerabbasi · 2022-03-14T15:46:42Z

Also, let me know when you're done making changes, and I'll merge.

hameerabbasi

LGTM!

smldub · 2022-03-14T16:18:10Z

I don't know how to measure numerical stability, but the algorithm is used in some languages' random number generation libraries, like Julia's, and the original paper has been cited a lot, so I think the underlying principle is good.
Potential problems:

I might've messed up in the implementation.
I worry that the way I implemented randomness is overly limiting the possible sparse matrices that can be produced. (might be better not to feed in any random_state info at all if none is provided by the user)
This change has messed up anything that relied on a specific random seeded matrix, so I don't know if that could be a problem for legacy projects.
I am not a computer science guy, so I don't really have the experience to check the code for common edge cases that might cause this to fail.

I think having someone else look over it for a sanity check before merging would probably be a good idea.

hameerabbasi · 2022-03-14T16:27:04Z

Okay! If you're willing, you could post to the SciPy Developers mailing list and request a review.

I'm not particularly concerned about breaking random stability guarantees for the same seed, just about the distribution being okay.

That said, if one considers the cardinality of all sequences that can be sampled, and the size of the seed (which still exists internally even if we pass nothing) some sparse matrices were ALWAYS going to be avoided.
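The counting argument can be made concrete with a little arithmetic. The values here are assumptions chosen for illustration: N = 10,000 matches the plotting script earlier in the thread, k = 100 corresponds to a density of 0.01, and 32 bits matches the seed width mentioned in the commit messages.

```python
import math

N, k = 10_000, 100   # assumed: 1-D array of length N with k nonzero entries
seed_bits = 32       # assumed seed width

# Number of distinct coordinate sets, expressed in bits.
subset_bits = math.log2(math.comb(N, k))

print(f"distinct coordinate sets: about 2^{subset_bits:.0f}")
print(f"distinct seeds:           2^{seed_bits}")
print("every set reachable:", subset_bits <= seed_bits)
```

Because the number of possible coordinate sets is far larger than the number of seeds, most sets can never be produced, whichever sampling algorithm is used.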

hameerabbasi · 2022-03-14T16:32:00Z

One potential problem I spot right away is that Numba doesn't pass the seed back to the code, and data_rvs could be called with a stale state (the two random bit streams could be correlated).

It might make sense to return the RNG state from each algorithm manually and seed the RandomState object with that.
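One way to keep the two streams apart, sketched here as an alternative to returning state from the compiled function (so not what the comment above proposes and not what the PR does), is to draw two separate seeds from the user's random_state: one for the coordinate sampler and one for a fresh RandomState handed to data_rvs. `split_streams` is a hypothetical name.

```python
import numpy as np


def split_streams(random_state):
    """Derive a sampler seed and a separate data RandomState (hypothetical helper)."""
    max_seed = np.iinfo(np.int32).max
    coord_seed = int(random_state.randint(max_seed))  # for the compiled sampler
    data_seed = int(random_state.randint(max_seed))   # for the data values
    return coord_seed, np.random.RandomState(data_seed)


parent = np.random.RandomState(0)
coord_seed, data_rng = split_streams(parent)
print(coord_seed, data_rng.rand(3))
```

Both seeds come from the parent stream, so results stay reproducible for a given random_state, while the data values no longer depend on where the parent stream happened to stop.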

smldub · 2022-03-15T02:17:47Z

Yeah, unfortunately Numba doesn't support np.random.get_state, but the Numba functions aren't using the same seed that created the random state. The seed for those functions is
seed = random_state.choice(np.iinfo(np.int32).max),
so the likelihood (from my pretty bad understanding of statistics) that the states will be the same is 1/2^64 (assuming the first seed is just randomly generated). Also, the Numba functions don't advance the counter for random_state (global or local) or change the global np.random.seed.
I am going to post one more (hopefully) commit changing some comments. If you are fine with how it looks and with my argument above, feel free to merge it.

sparse/_utils.py

hameerabbasi · 2022-03-16T07:38:19Z

Thank you for your excellent work on this, @smldub! You're a fast learner. 😄

smldub added 2 commits March 12, 2022 23:58

added random algorithms

c36dec7

fixed code without numba

df2395f

smldub changed the title ~~Rand~~ Adding Vitter Random Sampling without replacement Mar 13, 2022

smldub added 3 commits March 13, 2022 17:25

fixed numba issue

d4d20ce

fixed dense sampling behavior & changed to 32 bit for seed

2bdd17a

fixed doc

e96fc75

fix extraneous variable

f585236

hameerabbasi approved these changes Mar 14, 2022

added comments, fixed edge case

42fb5ce

hameerabbasi reviewed Mar 15, 2022

hameerabbasi merged commit 07eeab2 into pydata:master Mar 15, 2022

smldub deleted the rand branch March 15, 2022 14:37
