ENH: Faster _select_samples in _differential_evolution.py #18496
Conversation
Here is a little snippet to demonstrate the speed improvement.
```python
import numpy as np

rng = np.random.default_rng(462983456835469345692)

# stand-in for DifferentialEvolutionSolver with a large population
class MyClass:
    num_population_members = 1000000
    random_number_generator = rng

self = MyClass()
candidate = 500000
number_samples = 100000

# the implementation in main
def _select_samples1(self, candidate, number_samples):
    idxs = list(range(self.num_population_members))
    idxs.remove(candidate)
    self.random_number_generator.shuffle(idxs)
    idxs = idxs[:number_samples]
    return idxs

# %timeit _select_samples1(self, candidate, number_samples)
# 334 ms ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# the obvious improvement
def _select_samples1b(self, candidate, number_samples):
    idxs = np.concatenate([np.arange(candidate),
                           np.arange(candidate + 1, self.num_population_members)])
    self.random_number_generator.shuffle(idxs)
    idxs = idxs[:number_samples]
    return idxs

# %timeit _select_samples1b(self, candidate, number_samples)
# 41.7 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# the PR
def _select_samples2(self, candidate, number_samples):
    idxs = self.random_number_generator.choice(self.num_population_members - 1,
                                               number_samples, replace=False)
    idxs = idxs + (idxs >= candidate)
    return idxs

# %timeit _select_samples2(self, candidate, number_samples)
# 6.96 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
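A quick sanity check (not from the PR) that the index-shift trick in `_select_samples2` never returns `candidate` and can reach every other index:

```python
import numpy as np

rng = np.random.default_rng(0)
num_population_members = 100
candidate = 42
number_samples = 5

seen = set()
for _ in range(2000):
    # draw from 0..N-2, then shift every value >= candidate up by one
    idxs = rng.choice(num_population_members - 1, number_samples, replace=False)
    idxs = idxs + (idxs >= candidate)
    assert candidate not in idxs           # candidate is never selected
    assert len(set(idxs)) == number_samples  # draws remain distinct
    seen.update(idxs.tolist())

# every index except `candidate` should eventually appear
assert seen == set(range(num_population_members)) - {candidate}
```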
Ideally, you'd show that the improvement holds for different values of the population size, `candidate`, and `number_samples` (or whatever the default values are), so we can be confident that we're not improving one case at the expense of others. But I would still be happy if this were to improve the slowest cases at the (slight) expense of faster cases. A few histograms would convince me.
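A sketch of such a comparison across population sizes (the function names here are illustrative stand-ins for the implementations above, not the PR's actual benchmark):

```python
import timeit
import numpy as np

rng = np.random.default_rng(1)

# list-based approach (as in main)
def select_list(num_members, candidate, number_samples):
    idxs = list(range(num_members))
    idxs.remove(candidate)
    rng.shuffle(idxs)
    return idxs[:number_samples]

# choice-based approach (as in the PR)
def select_choice(num_members, candidate, number_samples):
    idxs = rng.choice(num_members - 1, number_samples, replace=False)
    return idxs + (idxs >= candidate)

for n in (100, 10_000, 100_000):
    args = (n, n // 2, 5)  # number_samples is at most 5 in practice
    t_list = timeit.timeit(lambda: select_list(*args), number=5)
    t_choice = timeit.timeit(lambda: select_choice(*args), number=5)
    print(f"n={n:>7}: list {t_list:.5f}s  choice {t_choice:.5f}s")
```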
This looks like it still does what it is supposed to. The improved code still works with `RandomState` (good), although the improvement is not as substantial.

If you don't need the order of the indices to be shuffled, the documentation of `Generator` says you can get a little boost by passing `shuffle=False` in the call to `choice`. This wouldn't work for `RandomState`, and it doesn't save much time on my machine. Not sure if this is beneficial outside `_select_samples`.
Depending on how these are used, there are some other ideas that might be faster (e.g. logical indices?), or maybe if this is called many times you could batch the results. But I don't see a reason to delay a simple improvement with a search for a bigger improvement.
To do: please reduce the line length (preferably to 79, but this may be changing to 88?) and add a comment about why it works.
```diff
-        self.random_number_generator.shuffle(idxs)
-        idxs = idxs[:number_samples]
+        idxs = self.random_number_generator.choice(self.num_population_members - 1, number_samples, replace=False)
+        idxs = idxs + (idxs >= candidate)
```
It's not clear to me what this line does. It's important in the sampling process that the candidate solution is not a member of `idxs`. Can you explain the line in detail?
In the initial `choice`, the integers `candidate`, `candidate + 1`, ..., `num_population_members - 2` are used to represent the integers `candidate + 1`, `candidate + 2`, ..., `num_population_members - 1`. This line increments them all by one to ensure that the final values don't include `candidate` (and so that the distribution of possible values is what it is supposed to be).
Is this accurate @tacchinotacchi?
This is not as obvious as the code in main, so I asked for a comment in the code explaining how it works. Normally we'd prefer the explicit version instead of the "clever" version, but here the speedup may be a bit too big to pass over.
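A small worked example of that remapping, with a population of 6 and `candidate = 2`, may help as the basis for the in-code comment:

```python
import numpy as np

N, candidate = 6, 2
raw = np.arange(N - 1)             # all possible raw draws: 0 1 2 3 4
mapped = raw + (raw >= candidate)  # after the shift:        0 1 3 4 5

# the mapping is one-to-one onto {0..N-1} minus {candidate}
assert candidate not in mapped
assert sorted(mapped.tolist()) == [0, 1, 3, 4, 5]
```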
Yes, that's correct; I'll add in some comments.
@mdhaber the indices need to be shuffled. In addition
It ensures that
@mdhaber thanks for the snippet, will do some investigation to make sure that it's faster across different input parameters. It seems that some tests are failing; I'm not sure if that's because the new code is doing something wrong, or because the tests compare the output to reference values that rely on the previous implementation. In the latter case, I assume there is no problem with overwriting the reference values?
I meant to comment - that failure of
I think I know what the code is doing now. Unfortunately the change is a lot less performant at the other end of the scale. For one of my projects the following overhead would be ~10% of the function evaluation time.
I wonder if there's another way?
@andyfaff would you have noticed that regression if you weren't looking for it?
`select_samples` is run in a loop over every entry of the population, so the method gets called 1:1 with every objective function evaluation. In terms of overhead, what matters is the relative time of the two. The OP gets stung because the overhead of `select_samples` outweighs objective function time for very high popsizes. For small popsizes the PR would penalize fast objective functions.

Would I have noticed it? Probably not immediately. However, I remember writing the code like this on purpose after having looked at the alternatives (hence the use of list/range). I looked at

It'll be worth profiling the existing code. The rate limiting step for the existing code may be creation of the large list in the first line. There might be a way of storing that, then shuffling an array with the candidate value masked.

Lastly, the number of samples is never greater than 5, so don't go higher than that in profiling.
`_select_samples1b` might be the best compromise.
Is that the total population size?
If
it's the value of
I'll benchmark that too; I also didn't notice that `number_samples = 5` when I wrote my version.
Re: rejection sampling. This takes about 30 µs regardless of population size:

```python
def _select_samples3(self, candidate, number_samples):
    while True:
        idxs = self.random_number_generator.integers(0, self.num_population_members - 1,
                                                     number_samples)
        samples = set(idxs)
        if len(samples) == number_samples and candidate not in samples:
            return list(samples)
```

Unfortunately, that's 5x as long as

For small samples, it is faster to generate a list (or array) with all the elements of the population, (don't remove

```python
def _select_samples4(self, candidate, number_samples):
    idxs = np.arange(self.num_population_members)
    while True:
        self.random_number_generator.shuffle(idxs)
        samples = set(idxs[:number_samples])
        if candidate not in samples:
            return list(samples)
```

For a population size of 100, this takes me about 14 µs, which is about twice as long as main. For a large population size, it's a little faster than this PR is now.

```python
def _select_samples5(self, candidate, number_samples):
    while True:
        a, b = 0, self.num_population_members - 1
        samples = set(random.randint(a, b) for i in range(number_samples))
        if len(samples) == number_samples and candidate not in samples:
            return list(samples)
```

This takes < 10 µs regardless of population size. So I think we're just up against the overhead of NumPy here. So if it really matters, you could create an object that makes one big call to
Let's not branch. I think the profiling by @tacchinotacchi shows we should use 1b
and I know we prefer not to branch, but by only switching the type of the
@tacchinotacchi can you rejig this PR to the contents of
(if you can do it in the next day or so it'll make the branch for the next scipy release)
I'm not sure what's going on with this test failure. But it looks like someone had a similar problem in
No problem. Either change
LGTM. If the tests pass we'll merge.
I think there's one test failure that needs fixing. (There are others which I don't think are related.) See
This PR is not bit-for-bit backwards compatible, because the code consumes the RNG in a different manner, shuffling N numbers rather than N-1.
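A small sketch of why this breaks bit-for-bit reproducibility: consuming the RNG differently (shuffling N values versus drawing without replacement from N-1) leaves the generator in a different internal state, so later draws differ even when seeded identically. The counts below are illustrative, not the PR's exact call pattern:

```python
import numpy as np

seed = 1234
rng_old = np.random.default_rng(seed)
rng_new = np.random.default_rng(seed)

idxs = list(range(10))
rng_old.shuffle(idxs)                # old-style consumption: shuffle N values
rng_new.choice(9, 5, replace=False)  # new-style consumption: draw from N-1

# both generators have advanced past a freshly seeded one,
# and in general their states no longer match each other
rng_fresh = np.random.default_rng(seed)
assert rng_old.bit_generator.state != rng_fresh.bit_generator.state
assert rng_new.bit_generator.state != rng_fresh.bit_generator.state
```

This is why reference values baked into old tests no longer match and need regenerating.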
Already there were other distributions that didn't pass the basic test, and that's OK. For those, I just added a custom test, so I did the same here.
LGTM now. Thank you @tacchinotacchi, @mdhaber
thanks @mdhaber for picking this up, was busier than usual this week
I found that the old code got very slow with population size > 100000, to the point where most of the time for my optimization run was spent in `_select_samples`. This change gives about a 33% speedup in my runs. I'm still seeing more than 95% of the time spent in this function, so I'm open to suggestions for further speedups.
Keep in mind I haven't run the unit tests. I tried to look up how to run them and landed at https://docs.scipy.org/doc/scipy/dev/contributor/runtests.html, but it seems that page is outdated, because the runtests.py script does not exist anymore. How do I run the tests?