
ENH: Faster _select_samples in _differential_evolution.py #18496

Merged: 2 commits into scipy:main on Jun 1, 2023

Conversation

@tacchinotacchi (Contributor) commented May 21, 2023:

I found that the old code got very slow with population size > 100000 to the point where most of the time for my optimization run was spent in _select_samples. This gives about a 33% speedup in my runs. I'm still seeing more than 95% of the time spent in this function, so I'm open to suggestions for further speedups.

Keep in mind I haven't run the unit tests. I tried to look up how to run them, and landed at https://docs.scipy.org/doc/scipy/dev/contributor/runtests.html but it seems like this is outdated because the runtests.py script does not exist anymore. How do I run the tests?

@j-bowhay added the labels enhancement (A new feature or improvement) and scipy.optimize on May 21, 2023
@mdhaber (Contributor) left a comment:

Here is a little snippet to demonstrate the speed improvement.

import numpy as np

rng = np.random.default_rng(462983456835469345692)

class MyClass:
    num_population_members = 1000000
    random_number_generator = rng

self = MyClass()
candidate = 500000
number_samples = 100000

def _select_samples1(self, candidate, number_samples):
    idxs = list(range(self.num_population_members))
    idxs.remove(candidate)
    self.random_number_generator.shuffle(idxs)
    idxs = idxs[:number_samples]
    return idxs

# %timeit _select_samples1(self, candidate, number_samples)
# 334 ms ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# the obvious improvement
def _select_samples1b(self, candidate, number_samples):
    idxs = np.concatenate([np.arange(candidate),
                           np.arange(candidate+1, self.num_population_members)])
    self.random_number_generator.shuffle(idxs)
    idxs = idxs[:number_samples]
    return idxs

# %timeit _select_samples1b(self, candidate, number_samples)
# 41.7 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# the PR
def _select_samples2(self, candidate, number_samples):
    idxs = self.random_number_generator.choice(self.num_population_members - 1,
                                               number_samples, replace=False)
    idxs = idxs + (idxs >= candidate)
    return idxs

# %timeit _select_samples2(self, candidate, number_samples)
# 6.96 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Ideally, you'd show that the improvement holds for different population sizes, candidate, and number_samples values (or whatever the defaults are) so we can be confident that we're not improving one case at the expense of others. But I would still be happy if this were to improve the slowest cases at the (slight) expense of faster cases. A few histograms would convince me.
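
For reference, here is a minimal benchmarking sketch (not from the thread) of how such a comparison could be scripted across population sizes; the bench helper and the chosen sizes are illustrative only, and it reuses _select_samples1 and _select_samples2 defined above.

import timeit

def bench(select_samples, num_population_members, candidate=0,
          number_samples=5, repeats=20):
    # tiny stand-in object with just the attributes _select_samples* needs
    class Dummy:
        pass
    obj = Dummy()
    obj.num_population_members = num_population_members
    obj.random_number_generator = rng
    return min(timeit.repeat(
        lambda: select_samples(obj, candidate, number_samples),
        number=repeats, repeat=3)) / repeats

for n in (50, 200, 1000, 10_000, 100_000):
    t_main = bench(_select_samples1, n)
    t_pr = bench(_select_samples2, n)
    print(f"popsize={n}: main {t_main * 1e6:.1f} us, PR {t_pr * 1e6:.1f} us")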

This looks like it still does what it is supposed to. The improved code still works with RandomState (good), although the improvement is not as substantial.

If you don't need the order of the indices to be shuffled, the documentation of Generator says you can get a little boost by using shuffle=False in the call to choice. This wouldn't work for RandomState, and it doesn't save much time on my machine. Not sure if this is beneficial outside _select_samples.
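
As a sketch only (reusing the names from the snippet above; the name _select_samples2b is made up here), that variant would look like:

def _select_samples2b(self, candidate, number_samples):
    # Generator.choice accepts shuffle=False to skip shuffling the sample;
    # RandomState.choice has no such parameter.
    idxs = self.random_number_generator.choice(self.num_population_members - 1,
                                               number_samples, replace=False,
                                               shuffle=False)
    idxs = idxs + (idxs >= candidate)
    return idxs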

Depending on how these are used, there are some other ideas that might be faster (e.g. logical indices?), or maybe if this is called many times you could batch the results. But I don't see a reason to delay a simple improvement with a search for a bigger improvement.

To do: please reduce the line length (preferably to 79, but this may be changing to 88?) and add a comment about why it works.

# old code (main):
self.random_number_generator.shuffle(idxs)
idxs = idxs[:number_samples]
# new code (this PR):
idxs = self.random_number_generator.choice(self.num_population_members - 1,
                                           number_samples, replace=False)
idxs = idxs + (idxs >= candidate)
A contributor commented:

It's not clear to me what this line does. It's important in the sampling process that the candidate solution is not a member of idxs. Can you explain the line in detail?

@mdhaber (Contributor) commented May 21, 2023:

In the initial choice, integers candidate, candidate + 1, ..., num_population_members - 2 are used to represent integers candidate + 1, candidate + 2, ..., num_population_members - 1. This line increments them all by one to ensure that the final values don't include candidate (and so that the distribution of possible values is what it is supposed to be).

Is this accurate @tacchinotacchi?

This is not as obvious as the code in main, so I asked for a comment in the code explaining how it works. Normally we'd prefer the explicit version instead of the "clever" version, but here the speedup may be a bit too big to pass over.
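
As a tiny illustration (numbers chosen here for the example, not from the PR): with num_population_members = 6 and candidate = 3, choice draws from {0, 1, 2, 3, 4}; the increment then maps 3 -> 4 and 4 -> 5, so candidate 3 can never appear in the result.

import numpy as np

candidate = 3
drawn = np.array([0, 2, 3, 4])        # drawn from range(6 - 1) = {0, ..., 4}
final = drawn + (drawn >= candidate)  # -> array([0, 2, 4, 5]); 3 never appears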

@tacchinotacchi (Contributor Author) commented:

Yes, that's correct; I'll add in some comments.

@andyfaff (Contributor) commented:

@mdhaber the indices need to be shuffled. In addition candidate can't be in idxs. I'm not sure what the last line of the PR is doing.

@mdhaber (Contributor) commented May 21, 2023:

In addition candidate can't be in idxs. I'm not sure what the last line of the PR is doing.

It ensures that candidate is not in idxs.

@tacchinotacchi (Contributor Author) commented:

@mdhaber thanks for the snippet, will do some investigation to make sure that it's faster across different input parameters.

It seems that some tests are failing. I'm not sure if that's because the new code is doing something wrong or because the test compares the output to reference values that rely on the previous implementation. In the latter case, I assume there is no problem with overwriting the reference values?

@mdhaber (Contributor) commented May 21, 2023:

I meant to comment - that failure of test_integrality is not a concern. Just adjust the tolerance. It is using differential_evolution to fit a distribution to data sampled from a distribution with known parameters. The fitted distribution parameters should be similar to the known parameters, but it is stochastic. The tolerance was adjusted to work with the old implementation, but if the random number stream changes, the match will get better or worse. In this case, it got a little worse, but the same could happen by changing the seed, so I'm not bothered by it. (If there are other failures, I didn't see them.)

@andyfaff (Contributor) commented:

I think I know what the code is doing now. Unfortunately the change is a lot less performant at the other end of the scale. For one of my projects the following overhead would be ~10% of the function evaluation time.

# num_population_members = 100
# candidate = 2
# number_samples = 3

3.5 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.06 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
16.1 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

I wonder if there's another way?

@mdhaber (Contributor) commented May 21, 2023:

@andyfaff would you have noticed that regression if you weren't looking for it? What fraction of the total execution time of differential_evolution is that in that context? (Oops, you wrote that.)
Are you doing differential_evolution in loops such that those microseconds add up?
We could branch based on the size of the arrays, but we'd have to justify the extra complexity.

@andyfaff (Contributor) commented May 21, 2023:

select_samples is run in a loop over every entry of the population, so the method gets called 1:1 with every objective function evaluation. In terms of overhead, what matters is the relative time of the two. The OP gets stung because the overhead of select_samples outweighs objective function time for very high popsizes. For small popsizes the PR would penalize fast objective functions.

Would I have noticed it? Probably not immediately. However, I remember writing the code like this on purpose after having looked at the alternatives (hence the use of list/range). I looked at rng.choice, but discarded it because it was much slower for total popsizes on the lower end, where I anticipated the majority of problems to be. I've tried to optimize this further in the past.
The OP popsizes are the largest I've heard of.

It'll be worth profiling the existing code. The rate limiting step for the existing code may be creation of the large list in the first line. There might be a way of storing that, then shuffling an array with the candidate value masked.

Lastly, the number of samples is never greater than 5, so don't go higher than that in profiling.

@andyfaff (Contributor) commented:

_select_samples1b might be the best compromise.

@tacchinotacchi (Contributor Author) commented May 21, 2023:

It'll be worth profiling the existing code. The rate limiting step for the existing code may be creation of the large list in the first line. There might be a way of storing that, then shuffling an array with the candidate value masked.

I think the bottleneck of the existing code (for large popsize) is the list.remove() call, which is O(n) as it has to search the whole list for candidate.
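
A quick, illustrative check (not from the thread) of that claim, timing list construction with and without the remove:

import timeit

n = 10**6
t_build = timeit.timeit("list(range(n))", globals={"n": n}, number=10) / 10
t_build_remove = timeit.timeit("idxs = list(range(n)); idxs.remove(n // 2)",
                               globals={"n": n}, number=10) / 10
print(f"build: {t_build:.4f} s, build + remove: {t_build_remove:.4f} s")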

Here's a plot of mean run times by popsize as measured on my rig. Indeed, there's a surprising overhead with popsize < 200; I wonder why.
[plot: mean run time vs. popsize]

@andyfaff (Contributor) commented:

Is that total population size?

@mdhaber (Contributor) commented May 21, 2023:

If number_samples is always small compared to population size, how about rejection sampling? (Uniform random sample of integers, just try again if there are duplicates or candidate is in the sample.) I'll try it when I get back. Maybe not worth it.

@tacchinotacchi (Contributor Author) commented:

Is that total population size?

It's the value of self.num_population_members.

@tacchinotacchi (Contributor Author) commented:

If number_samples is always small compared to population size, how about rejection sampling? (Random sample and try again if it fails the criteria - duplicates or candidate.)

I'll benchmark that too. I also didn't notice that number_samples = 5 when I wrote my version.

@mdhaber (Contributor) commented May 22, 2023:

Re: rejection sampling.

This takes about 30 µs regardless of population size:

def _select_samples3(self, candidate, number_samples):
    while True:
        # integers() excludes the upper bound, so pass num_population_members
        # to cover all indices 0 .. num_population_members - 1
        idxs = self.random_number_generator.integers(
            0, self.num_population_members, number_samples)
        samples = set(idxs)
        if len(samples) == number_samples and candidate not in samples:
            return list(samples)

Unfortunately, that's 5x as long as main for a population size of 100. (Much faster than any of the solutions above for large populations, though.)

For small samples, it is faster to generate a list (or array) with all the elements of the population (don't remove candidate), shuffle it, and take the first few elements. Maybe that's because it guarantees that the elements are unique, so the loop is less likely to have to repeat.

def _select_samples4(self, candidate, number_samples):
    idxs = np.arange(self.num_population_members)
    while True:
        self.random_number_generator.shuffle(idxs)
        samples = set(idxs[:number_samples])
        if candidate not in samples:
            return list(samples)

For a population size of 100, this takes me about 14 µs, which is about twice as long as main. For a large population size, it's a little faster than this PR is now.

import random  # stdlib random; needed for randint below

def _select_samples5(self, candidate, number_samples):
    while True:
        a, b = 0, self.num_population_members - 1
        samples = set(random.randint(a, b) for i in range(number_samples))
        if len(samples) == number_samples and candidate not in samples:
            return list(samples)

This takes < 10 µs regardless of population size. So I think we're just up against the overhead of NumPy here.

So if it really matters, you could create an object that makes one big call to rng.integers at the beginning of differential_evolution to generate a lot of random integers, and then consumes some of those numbers every loop. When it runs out of numbers, it makes another big call, etc... That way the NumPy overhead is amortized over all the iterations. In this way, I think you could get it down to < 10 µs (amortized) regardless of sample size and use the specified NumPy random stream. But at that level of complexity, you might as well just branch the logic based on population size: for small populations, use what we already have, and for large populations, use rejection sampling.
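
A rough sketch (class and method names made up here, not from the thread) of that amortized idea: draw a large block of random integers once, hand out small chunks, and refill only when the block runs out.

import numpy as np

class BufferedIntegers:
    def __init__(self, rng, high, block_size=2**16):
        self.rng = rng
        self.high = high
        self.block_size = block_size
        self._buf = rng.integers(0, high, block_size)
        self._pos = 0

    def take(self, k):
        # refill with one big call when the buffer is exhausted
        if self._pos + k > self._buf.size:
            self._buf = self.rng.integers(0, self.high, self.block_size)
            self._pos = 0
        out = self._buf[self._pos:self._pos + k]
        self._pos += k
        return out

rng = np.random.default_rng(0)
buf = BufferedIntegers(rng, high=100)
print(buf.take(5))  # five integers in [0, 100), with no fresh NumPy call on most iterations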

@andyfaff (Contributor) commented:

Let's not branch. I think the profiling by @tacchinotacchi shows we should use 1b.

@tacchinotacchi (Contributor Author) commented May 22, 2023:

I tried a lot of variants and I can't seem to get anything better than this one when it comes to overall performance:

def _select_samples6(self, candidate, number_samples):
    pool = np.arange(self.num_population_members)
    self.random_number_generator.shuffle(pool)

    idxs = []
    while len(idxs) < number_samples:
        idx = pool[0]
        pool = pool[1:]
        if idx != candidate:
            idxs.append(idx)
    
    return idxs

This one scales similarly to 1b but is slightly faster across the board:
[plot: run time comparison by popsize]

Everyone OK to merge this one?
EDIT: forgot to test it against 4, which is very similar. One minute...

EDIT 2: it seems to slightly beat 4, especially at low-end popsizes.
[plot: run time comparison by popsize]

@tacchinotacchi (Contributor Author) commented May 22, 2023:

And I know we prefer not to branch, but by only switching the type of the pool variable...

def _select_samples7(self, candidate, number_samples):
    # use a plain list for small populations, a NumPy array otherwise
    if self.num_population_members < 100:
        pool = list(range(self.num_population_members))
    else:
        pool = np.arange(self.num_population_members)
    self.random_number_generator.shuffle(pool)

    idxs = []
    while len(idxs) < number_samples and len(pool) > 0:
        idx = pool[0]
        pool = pool[1:]
        if idx != candidate:
            idxs.append(idx)

    return idxs

[plot: run time comparison by popsize]

@andyfaff (Contributor) commented:

@tacchinotacchi can you rejig this PR to the contents of _select_samples6? I think that it's going to be good enough, and I don't want the branch.

@andyfaff (Contributor) commented:

(if you can do it in the next day or so it'll make the branch for the next scipy release)

@tacchinotacchi (Contributor Author) commented:

I'm not sure what's going on with this test failure

FAILED scipy/stats/tests/test_fit.py::TestFit::test_fit_only_loc_scale - RuntimeWarning: overflow encountered in divide

But it looks like someone had a similar problem in basic_fit_test and decided to ignore the warning

with npt.suppress_warnings() as sup:
    sup.filter(RuntimeWarning, "overflow encountered")
    res = stats.fit(dist, data, bounds, method=method,
                    optimizer=self.opt)

@mdhaber (Contributor) commented May 23, 2023:

No problem. Either change scale_bounds to (0.01, 5) (for example) to avoid it, or use a with np.errstate(over='ignore'): context (instead of npt.suppress_warnings). I think one of those should do it.
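
A sketch of the second option, mirroring the earlier suppress_warnings snippet (same variable names assumed):

with np.errstate(over='ignore'):
    res = stats.fit(dist, data, bounds, method=method,
                    optimizer=self.opt)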

@andyfaff (Contributor) left a comment:

LGTM. If the tests pass we'll merge.

@andyfaff (Contributor) commented:

I think there's one test failure that needs fixing. (There are others which I don't think are related.) See test_basic_fit_mle.
The MLE resulting from a fit to a distribution isn't being minimised enough; the fit is done with differential_evolution. @mdhaber would be able to help further on how to get the minimizer over the line. In this situation I would suggest increasing the popsize keyword for differential_evolution. One way may be to monkey-patch the keyword dictionary supplied to differential_evolution on L365:

    def opt(self, *args, **kwds):
        kwds['popsize'] = 30
        return differential_evolution(*args, seed=0, **kwds)

This PR is not bit-for-bit backwards compatible because the code consumes the RNG in a different manner, shuffling N numbers rather than N - 1.

@mdhaber (Contributor) commented May 31, 2023:

There were already other distributions that didn't pass the basic test, and that's OK. For those, I just added a custom test, so I did the same here.

@andyfaff (Contributor) commented Jun 1, 2023:

LGTM now. Thank you @tacchinotacchi , @mdhaber

@andyfaff merged commit 950112f into scipy:main on Jun 1, 2023 (24 checks passed).
@tacchinotacchi (Contributor Author) commented:

Thanks @mdhaber for picking this up; I was busier than usual this week.

@j-bowhay j-bowhay added this to the 1.12.0 milestone Jun 1, 2023