
ENH: Faster _select_samples in _differential_evolution.py #18496

Merged: 2 commits into scipy:main on Jun 1, 2023

Conversation

@tacchinotacchi (Contributor) commented May 21, 2023:

I found that the old code got very slow with population size > 100000 to the point where most of the time for my optimization run was spent in _select_samples. This gives about a 33% speedup in my runs. I'm still seeing more than 95% of the time spent in this function, so I'm open to suggestions for further speedups.

Keep in mind I haven't run the unit tests. I tried to look up how to run them, and landed at https://docs.scipy.org/doc/scipy/dev/contributor/runtests.html but it seems like this is outdated because the runtests.py script does not exist anymore. How do I run the tests?

@j-bowhay added the labels enhancement (A new feature or improvement) and scipy.optimize on May 21, 2023
@mdhaber (Contributor) left a comment:

Here is a little snippet to demonstrate the speed improvement.

import numpy as np

rng = np.random.default_rng(462983456835469345692)

class MyClass:
    num_population_members = 1000000
    random_number_generator = rng

self = MyClass()
candidate = 500000
number_samples = 100000

def _select_samples1(self, candidate, number_samples):
    idxs = list(range(self.num_population_members))
    idxs.remove(candidate)
    self.random_number_generator.shuffle(idxs)
    idxs = idxs[:number_samples]
    return idxs

# %timeit _select_samples1(self, candidate, number_samples)
# 334 ms ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# the obvious improvement
def _select_samples1b(self, candidate, number_samples):
    idxs = np.concatenate([np.arange(candidate),
                           np.arange(candidate+1, self.num_population_members)])
    self.random_number_generator.shuffle(idxs)
    idxs = idxs[:number_samples]
    return idxs

# %timeit _select_samples1b(self, candidate, number_samples)
# 41.7 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# the PR
def _select_samples2(self, candidate, number_samples):
    idxs = self.random_number_generator.choice(self.num_population_members - 1,
                                               number_samples, replace=False)
    idxs = idxs + (idxs >= candidate)
    return idxs

# %timeit _select_samples2(self, candidate, number_samples)
# 6.96 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Ideally, you'd show that the improvement holds for different population sizes, candidate, and number_samples values (or whatever the defaults are) so we can be confident that we're not improving one case at the expense of others. But I would still be happy if this were to improve the slowest cases at the (slight) expense of faster cases. A few histograms would convince me.
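
For reference, here is a minimal benchmarking sketch (not from the thread) of how such a comparison could be scripted across population sizes; the bench helper and the chosen sizes are illustrative only, and it reuses _select_samples1 and _select_samples2 defined above.

import timeit

def bench(select_samples, num_population_members, candidate=0,
          number_samples=5, repeats=20):
    # tiny stand-in object with just the attributes _select_samples* needs
    class Dummy:
        pass
    obj = Dummy()
    obj.num_population_members = num_population_members
    obj.random_number_generator = rng
    return min(timeit.repeat(
        lambda: select_samples(obj, candidate, number_samples),
        number=repeats, repeat=3)) / repeats

for n in (50, 200, 1000, 10_000, 100_000):
    t_main = bench(_select_samples1, n)
    t_pr = bench(_select_samples2, n)
    print(f"popsize={n}: main {t_main * 1e6:.1f} us, PR {t_pr * 1e6:.1f} us")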

This looks like it still does what it is supposed to. The improved code still works with RandomState (good), although the improvement is not as substantial.

If you don't need the order of the indices to be shuffled, the documentation of Generator says you can get a little boost by using shuffle=False in the call to choice. This wouldn't work for RandomState, and it doesn't save much time on my machine. Not sure if this is beneficial outside _select_samples.
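
As a sketch only (reusing the names from the snippet above; the name _select_samples2b is made up here), that variant would look like:

def _select_samples2b(self, candidate, number_samples):
    # Generator.choice accepts shuffle=False to skip shuffling the sample;
    # RandomState.choice has no such parameter.
    idxs = self.random_number_generator.choice(self.num_population_members - 1,
                                               number_samples, replace=False,
                                               shuffle=False)
    idxs = idxs + (idxs >= candidate)
    return idxs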

Depending on how these are used, there are some other ideas that might be faster (e.g. logical indices?), or maybe if this is called many times you could batch the results. But I don't see a reason to delay a simple improvement with a search for a bigger improvement.

To do: please reduce the line length (preferably to 79, but this may be changing to 88?) and add a comment about why it works.

# old code (main):
self.random_number_generator.shuffle(idxs)
idxs = idxs[:number_samples]
# new code (this PR):
idxs = self.random_number_generator.choice(self.num_population_members - 1,
                                           number_samples, replace=False)
idxs = idxs + (idxs >= candidate)
A contributor commented:

It's not clear to me what this line does. It's important in the sampling process that the candidate solution is not a member of idxs. Can you explain the line in detail?

@mdhaber (Contributor) commented May 21, 2023:

In the initial choice, integers candidate, candidate + 1, ..., num_population_members - 2 are used to represent integers candidate + 1, candidate + 2, ..., num_population_members - 1. This line increments them all by one to ensure that the final values don't include candidate (and so that the distribution of possible values is what it is supposed to be).

Is this accurate @tacchinotacchi?

This is not as obvious as the code in main, so I asked for a comment in the code explaining how it works. Normally we'd prefer the explicit version instead of the "clever" version, but here the speedup may be a bit too big to pass over.
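
As a tiny illustration (numbers chosen here for the example, not from the PR): with num_population_members = 6 and candidate = 3, choice draws from {0, 1, 2, 3, 4}; the increment then maps 3 -> 4 and 4 -> 5, so candidate 3 can never appear in the result.

import numpy as np

candidate = 3
drawn = np.array([0, 2, 3, 4])        # drawn from range(6 - 1) = {0, ..., 4}
final = drawn + (drawn >= candidate)  # -> array([0, 2, 4, 5]); 3 never appears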

@tacchinotacchi (Contributor Author) commented:

Yes, that's correct; I'll add in some comments.

@andyfaff (Contributor) commented:

@mdhaber the indices need to be shuffled. In addition candidate can't be in idxs. I'm not sure what the last line of the PR is doing.

@mdhaber (Contributor) commented May 21, 2023:

In addition candidate can't be in idxs. I'm not sure what the last line of the PR is doing.

It ensures that candidate is not in idxs.

@tacchinotacchi (Contributor Author) commented:

@mdhaber thanks for the snippet, will do some investigation to make sure that it's faster across different input parameters.

It seems that some tests are failing. I'm not sure if that's because the new code is doing something wrong or because the test compares the output to reference values that rely on the previous implementation. In the latter case, I assume there is no problem with overwriting the reference values?

@mdhaber (Contributor) commented May 21, 2023:

I meant to comment - that failure of test_integrality is not a concern. Just adjust the tolerance. It is using differential_evolution to fit a distribution to data sampled from a distribution with known parameters. The fitted distribution parameters should be similar to the known parameters, but it is stochastic. The tolerance was adjusted to work with the old implementation, but if the random number stream changes, the match will get better or worse. In this case, it got a little worse, but the same could happen by changing the seed, so I'm not bothered by it. (If there are other failures, I didn't see them.)

@andyfaff (Contributor) commented:

I think I know what the code is doing now. Unfortunately the change is a lot less performant at the other end of the scale. For one of my projects the following overhead would be ~10% of the function evaluation time.

# num_population_members = 100
# candidate = 2
# number_samples = 3

3.5 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.06 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
16.1 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

I wonder if there's another way?

@mdhaber (Contributor) commented May 21, 2023:

@andyfaff would you have noticed that regression if you weren't looking for it? What fraction of the total execution time of differential_evolution is that in that context? (Oops, you wrote that.)
Are you doing differential_evolution in loops such that those microseconds add up?
We could branch based on the size of the arrays, but we'd have to justify the extra complexity.

@andyfaff (Contributor) commented May 21, 2023:

select_samples is run in a loop over every entry of the population, so the method gets called 1:1 with every objective function evaluation. In terms of overhead, what matters is the relative time of the two. The OP gets stung because the overhead of select_samples outweighs objective function time for very high popsizes. For small popsizes the PR would penalize fast objective functions.

Would I have noticed it? Probably not immediately. However, I remember writing the code like this on purpose after having looked at the alternatives (hence the use of list/range). I looked at rng.choice, but discarded it because it was much slower for total popsizes on the lower end, where I anticipated the majority of problems to be. I've tried to optimize this further in the past.
The OP popsizes are the largest I've heard of.

It'll be worth profiling the existing code. The rate limiting step for the existing code may be creation of the large list in the first line. There might be a way of storing that, then shuffling an array with the candidate value masked.

Lastly, the number of samples is never greater than 5, so don't go higher than that in profiling.

@andyfaff (Contributor) commented:

_select_samples1b might be the best compromise.

@tacchinotacchi (Contributor Author) commented May 21, 2023:

It'll be worth profiling the existing code. The rate limiting step for the existing code may be creation of the large list in the first line. There might be a way of storing that, then shuffling an array with the candidate value masked.

I think the bottleneck of the existing code (for large popsize) is the list.remove() call, which is O(n) as it has to search the whole list for candidate.
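
A quick, illustrative check (not from the thread) of that claim, timing list construction with and without the remove:

import timeit

n = 10**6
t_build = timeit.timeit("list(range(n))", globals={"n": n}, number=10) / 10
t_build_remove = timeit.timeit("idxs = list(range(n)); idxs.remove(n // 2)",
                               globals={"n": n}, number=10) / 10
print(f"build: {t_build:.4f} s, build + remove: {t_build_remove:.4f} s")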

Here's a plot of mean run times by popsize as measured on my rig. Indeed, there's a surprising overhead with popsize < 200; I wonder why.
[plot: mean run time vs. popsize]

@andyfaff (Contributor) commented:

Is that total population size?

@mdhaber (Contributor) commented May 21, 2023:

If number_samples is always small compared to population size, how about rejection sampling? (Uniform random sample of integers, just try again if there are duplicates or candidate is in the sample.) I'll try it when I get back. Maybe not worth it.

@tacchinotacchi (Contributor Author) commented:

Is that total population size?

It's the value of self.num_population_members.

@tacchinotacchi (Contributor Author) commented:

If number_samples is always small compared to population size, how about rejection sampling? (Random sample and try again if it fails the criteria - duplicates or candidate.)

I'll benchmark that too. I also didn't notice that number_samples = 5 when I wrote my version.

@mdhaber (Contributor) commented May 22, 2023:

Re: rejection sampling.

This takes about 30 µs regardless of population size:

def _select_samples3(self, candidate, number_samples):
    while True:
        # integers() excludes the upper bound, so pass num_population_members
        # to cover all indices 0 .. num_population_members - 1
        idxs = self.random_number_generator.integers(
            0, self.num_population_members, number_samples)
        samples = set(idxs)
        if len(samples) == number_samples and candidate not in samples:
            return list(samples)

Unfortunately, that's 5x as long as main for a population size of 100. (Much faster than any of the solutions above for large populations, though.)

For small samples, it is faster to generate a list (or array) with all the elements of the population (don't remove candidate), shuffle it, and take the first few elements. Maybe that's because it guarantees that the elements are unique, so the loop is less likely to have to repeat.

def _select_samples4(self, candidate, number_samples):
    idxs = np.arange(self.num_population_members)
    while True:
        self.random_number_generator.shuffle(idxs)
        samples = set(idxs[:number_samples])
        if candidate not in samples:
            return list(samples)

For a population size of 100, this takes me about 14 µs, which is about twice as long as main. For a large population size, it's a little faster than this PR is now.

import random  # stdlib random; needed for randint below

def _select_samples5(self, candidate, number_samples):
    while True:
        a, b = 0, self.num_population_members - 1
        samples = set(random.randint(a, b) for i in range(number_samples))
        if len(samples) == number_samples and candidate not in samples:
            return list(samples)

This takes < 10 µs regardless of population size. So I think we're just up against the overhead of NumPy here.

So if it really matters, you could create an object that makes one big call to rng.integers at the beginning of differential_evolution to generate a lot of random integers, and then consumes some of those numbers every loop. When it runs out of numbers, it makes another big call, etc... That way the NumPy overhead is amortized over all the iterations. In this way, I think you could get it down to < 10 µs (amortized) regardless of sample size and use the specified NumPy random stream. But at that level of complexity, you might as well just branch the logic based on population size: for small populations, use what we already have, and for large populations, use rejection sampling.
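
A rough sketch (class and method names made up here, not from the thread) of that amortized idea: draw a large block of random integers once, hand out small chunks, and refill only when the block runs out.

import numpy as np

class BufferedIntegers:
    def __init__(self, rng, high, block_size=2**16):
        self.rng = rng
        self.high = high
        self.block_size = block_size
        self._buf = rng.integers(0, high, block_size)
        self._pos = 0

    def take(self, k):
        # refill with one big call when the buffer is exhausted
        if self._pos + k > self._buf.size:
            self._buf = self.rng.integers(0, self.high, self.block_size)
            self._pos = 0
        out = self._buf[self._pos:self._pos + k]
        self._pos += k
        return out

rng = np.random.default_rng(0)
buf = BufferedIntegers(rng, high=100)
print(buf.take(5))  # five integers in [0, 100), with no fresh NumPy call on most iterations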

@andyfaff (Contributor) commented:

Let's not branch. I think the profiling by @tacchinotacchi shows we should use 1b.

@tacchinotacchi (Contributor Author) commented May 22, 2023:

I tried a lot of variants and I can't seem to get anything better than this one when it comes to overall performance:

def _select_samples6(self, candidate, number_samples):
    pool = np.arange(self.num_population_members)
    self.random_number_generator.shuffle(pool)

    idxs = []
    while len(idxs) < number_samples:
        idx = pool[0]
        pool = pool[1:]
        if idx != candidate:
            idxs.append(idx)
    
    return idxs

This one scales similarly to 1b but is slightly faster across the board:
[plot: run time comparison by popsize]

Everyone OK to merge this one?
EDIT: forgot to test it against 4, which is very similar. One minute...

EDIT 2: it seems to slightly beat 4, especially at low-end popsizes.
[plot: run time comparison by popsize]

@tacchinotacchi (Contributor Author) commented May 22, 2023:

And I know we prefer not to branch, but by only switching the type of the pool variable...

def _select_samples7(self, candidate, number_samples):
    # use a plain list for small populations, a NumPy array otherwise
    if self.num_population_members < 100:
        pool = list(range(self.num_population_members))
    else:
        pool = np.arange(self.num_population_members)
    self.random_number_generator.shuffle(pool)

    idxs = []
    while len(idxs) < number_samples and len(pool) > 0:
        idx = pool[0]
        pool = pool[1:]
        if idx != candidate:
            idxs.append(idx)

    return idxs

[plot: run time comparison by popsize]

@andyfaff (Contributor) commented:

@tacchinotacchi can you rejig this PR to the contents of _select_samples6? I think that it's going to be good enough, and I don't want the branch.

@andyfaff (Contributor) commented:

(if you can do it in the next day or so it'll make the branch for the next scipy release)

@tacchinotacchi (Contributor Author) commented:

I'm not sure what's going on with this test failure

FAILED scipy/stats/tests/test_fit.py::TestFit::test_fit_only_loc_scale - RuntimeWarning: overflow encountered in divide

But it looks like someone had a similar problem in basic_fit_test and decided to ignore the warning

with npt.suppress_warnings() as sup:
    sup.filter(RuntimeWarning, "overflow encountered")
    res = stats.fit(dist, data, bounds, method=method,
                    optimizer=self.opt)

@mdhaber (Contributor) commented May 23, 2023:

No problem. Either change scale_bounds to (0.01, 5) (for example) to avoid it, or use a with np.errstate(over='ignore'): context (instead of npt.suppress_warnings). I think one of those should do it.
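
A sketch of the second option, mirroring the earlier suppress_warnings snippet (same variable names assumed):

with np.errstate(over='ignore'):
    res = stats.fit(dist, data, bounds, method=method,
                    optimizer=self.opt)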

@andyfaff (Contributor) left a comment:

LGTM. If the tests pass we'll merge.

@andyfaff (Contributor) commented:

I think there's one test failure that needs fixing. (There are others which I don't think are related.) See test_basic_fit_mle.
The MLE resulting from a fit to a distribution isn't being minimised enough; the fit is done with differential_evolution. @mdhaber would be able to help further on how to get the minimizer over the line. In this situation I would suggest increasing the popsize keyword for differential_evolution. One way may be to monkey-patch the keyword dictionary supplied to differential_evolution on L365:

    def opt(self, *args, **kwds):
        kwds['popsize'] = 30
        return differential_evolution(*args, seed=0, **kwds)

This PR is not bit-for-bit backwards compatible because the code consumes the RNG in a different manner, shuffling N numbers rather than N - 1.

@mdhaber (Contributor) commented May 31, 2023:

There were already other distributions that didn't pass the basic test, and that's OK. For those, I just added a custom test, so I did the same here.

@andyfaff (Contributor) commented Jun 1, 2023:

LGTM now. Thank you @tacchinotacchi , @mdhaber

@andyfaff merged commit 950112f into scipy:main on Jun 1, 2023 (24 checks passed).
@tacchinotacchi (Contributor Author) commented:

Thanks @mdhaber for picking this up; I was busier than usual this week.

@j-bowhay j-bowhay added this to the 1.12.0 milestone Jun 1, 2023