Picking initial points with latin hypercubes #433

Open
betatim opened this Issue Jul 12, 2017 · 35 comments

Owner

betatim commented Jul 12, 2017

Inspired by the comments in #432: currently we pick the initial points at random, and there we discussed using a Sobol sequence (or another quasi-random sequence) instead.

Finding a good implementation of a QR sequence generator in many dimensions isn't so easy.

However, I just realised that we pick the number of initial points "up front", which means we can design an "optimal" grid (latin hypercube?) and evaluate the objective on that. Sobol or random only wins over this static allocation if you do not know in advance how many points you will sample. This would be great, because I think latin hypercubes should be much easier to code up than a Sobol sequence.

Would be good to hear some opinions on this.

betatim added the Moderate label Jul 12, 2017

Contributor

kiudee commented Jul 12, 2017

Owner

betatim commented Jul 13, 2017

I thought for a 1D problem it would be fine to use an evenly spaced line of numbers?

Then in N dimensions the property you want is that none of the points overlap when projected onto a particular axis, so that you get x distinct evaluations per axis (if you have x points in total). (I am getting close to the edge of my experience here, though.)

Contributor

kiudee commented Jul 13, 2017

Yes, in 1D this already suffices. In more dimensions you will increasingly find large volumes of space which are not covered if the latin hypercube is chosen randomly.

For 2D, consider this example:

[figure: randomly chosen 2D latin hypercube sample]

None of the points overlap and the 1D marginals are covered evenly. Yet large areas are left uncovered, and the GP will consequently have high uncertainty there.
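As an aside, the random-LHS construction being discussed here can be sketched in a few lines of NumPy (a sketch for illustration, not skopt or notebook code):

```python
import numpy as np

def random_lhs(n_points, n_dims, rng=None):
    """Random latin hypercube sample in the unit cube [0, 1)^n_dims.

    Each axis is split into n_points equal bins; every bin on every
    axis contains exactly one point, so all 1D marginals are covered
    evenly even though the joint placement is random."""
    rng = np.random.default_rng(rng)
    # One random offset inside each bin, then shuffle the bin order
    # independently per dimension.
    samples = (np.arange(n_points)[:, None] + rng.random((n_points, n_dims))) / n_points
    for d in range(n_dims):
        rng.shuffle(samples[:, d])
    return samples

points = random_lhs(10, 2, rng=0)
# Projecting onto either axis puts exactly one point in each of the
# ten bins [k/10, (k+1)/10) -- but nothing constrains the 2D layout,
# which is how the large empty areas above can arise.
```

The per-dimension shuffle is what distinguishes this from a diagonal grid: without it every point would sit in bin i on every axis.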

Owner

glouppe commented Jul 13, 2017 edited

Owner

iaroslav-ai commented Jul 13, 2017

I agree with @glouppe . We should replace things only if we can show that they improve on existing stuff.

Contributor

kiudee commented Jul 13, 2017

Owner

iaroslav-ai commented Jul 13, 2017

Sounds good! I am quite curious about what results will be :)

Contributor

kiudee commented Jul 13, 2017

Here is the first comparison on Branin-Hoo with the following parameters:

  • n_calls = 32
  • n_init = 10
  • The experiment is repeated 100 times. For Sobol I skip to a random start index of the sequence.

On this plot I show 100 repetitions of gp_minimize, each time with a different randomized initialization. Each point shows the minimal objective value attained on that run.
[figure: stripplot of minimal Branin-Hoo values per init method]

| Init Method | Mean | SD |
| --- | --- | --- |
| Uniform Random | 0.8221 | 1.0343 |
| Random LHS | 0.6649 | 0.7492 |
| Maximin LHS | 0.7383 | 0.9826 |
| Sobol | 0.5860 | 0.3587 |

As expected the uniform random initialization has the highest standard deviation. I was a little surprised that the Sobol sequence performs so well when compared to Maximin LHS.
I will try it on the Hart6 function next.

Here is the Jupyter notebook if you want to play around with it yourself. 100 repetitions on one init method take around 30 minutes.
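For context, the "Maximin LHS" row above refers to choosing, among several random latin hypercubes, the one that maximizes the minimum pairwise distance between points. A minimal best-of-k sketch (pure NumPy; not the notebook's implementation):

```python
import numpy as np

def maximin_lhs(n_points, n_dims, n_candidates=100, rng=None):
    """Draw n_candidates random latin hypercubes in [0, 1)^n_dims and
    keep the one whose closest pair of points is furthest apart."""
    rng = np.random.default_rng(rng)
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        # Random LHS: one point per bin on each axis, bin order shuffled.
        lhs = (np.arange(n_points)[:, None] + rng.random((n_points, n_dims))) / n_points
        for d in range(n_dims):
            rng.shuffle(lhs[:, d])
        # Minimum pairwise squared distance of this candidate.
        diff = lhs[:, None, :] - lhs[None, :, :]
        dist2 = (diff ** 2).sum(axis=-1)
        np.fill_diagonal(dist2, np.inf)
        score = dist2.min()
        if score > best_score:
            best, best_score = lhs, score
    return best
```

Every candidate keeps the LHS marginal property; the maximin criterion only selects among them, pushing points apart in the full space.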

Contributor

kiudee commented Jul 14, 2017 edited

Here is the preliminary result for Hart6 using the same parameters:

[figure: stripplot of minimal Hart6 values per init method]

I will let the simulation run again with more n_calls, so that more repetitions arrive at the minimum.

Owner

betatim commented Jul 14, 2017

Nice plots! (I had to finish my coffee and read the seaborn docs to realise what a stripplot does :) )

Should we be looking at median and several other quantiles instead of mean and std? From looking at the plots I was surprised by the number of outliers.

Can we compare the minimum found after the init phase only? Just as a way to speed things up. Maybe we need to run for more than 10 samples in that case. I am not sure whether we want to find out which initialization method gives us the best starting point, or whether we care about the combined performance of init + optimizer. I will run your notebook with forest_minimize and see what happens.

Contributor

kiudee commented Jul 14, 2017 edited

Yes, I noticed that too. I computed the median in between simulation runs for comparison and it gave the same relative ordering. In general I would say the spread is a really important metric for the initialization. If we look at the median absolute deviation or other robust measures we could remove the effect of some of the more extreme outliers.

I also like the idea of comparing only the initializers. I might start an experiment for that later.
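The robust summary discussed here is cheap to compute; a one-function sketch:

```python
import numpy as np

def median_mad(values):
    """Median and median absolute deviation: robust counterparts of
    mean and standard deviation for heavy-tailed benchmark results."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    return med, np.median(np.abs(values - med))

median_mad([1.0, 1.1, 0.9, 1.05, 10.0])  # → (1.05, ~0.05): the outlier barely matters
```

Compare with the mean/std of the same sample, which the single outlier drags far away from the bulk of the runs.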

Owner

betatim commented Jul 14, 2017

[figure: stripplot of minima after initialization for the various initializers]

This is with n_calls=20 and n_init=20 for the various initializers.

Median and MAD:

| Method | Median | MAD |
| --- | --- | --- |
| uniform | 1.7089 | 1.5807 |
| random LHS | 2.2237 | 1.6864 |
| maximin | 1.9107 | 1.5757 |
| sobol | 1.7609 | 1.2007 |

Notebook: https://gist.github.com/betatim/c67a068f7d68d9dbd973810596fe575e

Owner

betatim commented Jul 14, 2017

[figure: stripplot of minima per init method with forest_minimize]

n_calls=30 and n_init=20 with forest_minimize. The red line is the median.

| Method | Median | MAD |
| --- | --- | --- |
| uniform | 1.1629 | 0.8606 |
| random LHS | 1.2208 | 0.9466 |
| maximin | 1.4049 | 0.9712 |
| sobol | 1.3609 | 0.7719 |

Owner

iaroslav-ai commented Jul 14, 2017

Yeah, at the end you (almost) always want to run an optimizer on top, so it makes more sense to look at the final result of the optimization.
I have some models (a variational autoencoder and an SVM) modelling the training behavior of a deep network; I will give those a go. (Part of it is that I want to master these plots too :D)

Contributor

amueller commented Jul 14, 2017

Did you guys have a look at dask-patternsearch?

Owner

iaroslav-ai commented Jul 14, 2017

@amueller dask-patternsearch seems similar in spirit to the Powell method implemented in scipy, and to derivative-free scipy.optimize.minimize in general. It would be interesting to benchmark against these some time. The assumption they seem to make is that the objective is not too expensive to evaluate (or that you have massive parallel computing power). Maybe we could take some of their ideas for initialization, if that improves the benchmarks.

Owner

iaroslav-ai commented Jul 14, 2017 edited

I did some optimizations of 4-layer narrow neural networks on the digits dataset, and trained a GradientBoostingRegressor and a VAE to estimate the cross-validation score given the configuration of the neural network. The GradientBoostingRegressor / VAE is then used as a proxy for the objective. Here are the results for the GradientBoostingRegressor:

[figure: init method comparison on the GradientBoostingRegressor proxy objective]

The Sobol sequence performs well, but maxmin_lhs performs best. Will post results with the VAE when I get them.
Edit: I used GradientBoostingRegressor, not SVM
Edit2: GradientBoostingRegressor score = 0.8
Edit3: gp_minimize was used for optimization

eriknw commented Jul 14, 2017

> @amueller dask-patternsearch seems similar in spirit to the Powell method implemented in scipy, and to derivative-free scipy.optimize.minimize in general. It would be interesting to benchmark against these some time. The assumption they seem to make is that the objective is not too expensive to evaluate (or that you have massive parallel computing power). Maybe we could take some of their ideas for initialization, if that improves the benchmarks.

Hey! Author of dask-patternsearch here. Let me first say that dask-patternsearch is very young, and its current state (which appears to not be completely broken or useless) is intended to be an early proof of concept. I expect to redo a lot of it so it can better handle my longer-term plans for the library, and I plan to develop it more actively in the immediate future. I'd say it's more like the Hooke and Jeeves algorithm than Powell's method. You're correct that, by design, it assumes you can run in parallel. A goal of the library is to scale to an arbitrarily large number of cores while keeping the overhead relatively small. It is much more naive than scikit-optimize or any Bayesian optimization. I apologize for the lack of documentation.

The pattern in dask-patternsearch assumes the current best point is probably closer to a minimum than the previous best point, so the pattern that is searched is focused around the current best point, while also exploring larger steps away from it. As such, I don't think it is especially relevant for choosing initial points in your project.

A useful property of dask-patternsearch is that it can (er, will) readily accept trial points from any number of oracles. Model-assisted optimization is a fantastic idea, and I would like to be able to use scikit-optimize as an oracle in the future.

I'm happy to have discussions (maybe in a different forum) and to collaborate as appropriate. Cheers!

Owner

iaroslav-ai commented Jul 14, 2017

Hi @eriknw :)

Thanks for introducing your project and sorry if I got something wrong. Looks interesting indeed!

The discussion on benchmarks in eriknw/dask-patternsearch#9 is quite relevant for scikit-optimize too, and has some interesting ideas. You can take a look at the benchmarks that we have, though these might change somewhat in the future, as more complexity makes for more interesting benchmarks.

Cheers! :)

Owner

MechCoder commented Jul 15, 2017

@iaroslav-ai Could you please also show the minimum objective value along the y axis, similar to plot_convergence?

Owner

MechCoder commented Jul 15, 2017

For historians: the strategy proposed in #262 was to use points generated from a Sobol sequence to optimize the acquisition function, not as initial points as proposed in this issue. As the benchmarks in that PR indicate, it performs worse than using a combination of sampling and lbfgs (sampling to provide the start points of the lbfgs optimisation).

Owner

MechCoder commented Jul 15, 2017

@betatim In your two plots, why does the blue line change in between the first and the second plot?

Owner

iaroslav-ai commented Jul 15, 2017 edited

@MechCoder yup, will update the plots a bit later.
Here are the results with the Variational Autoencoder, used to map NN parameters to a noisy objective. The main idea is to have an objective which is cheap to evaluate and which models the noise present in the NN's score due to the local-minima problem. The results are as follows:

[figure: init method comparison on the VAE proxy objective]

While the VAE does not model the score behavior perfectly (some scores are estimated to be larger than 1.0), it should still reflect the original optimization problem reasonably accurately.
P.S. Does anyone know of any kind of model, other than GANs or VAEs, that can learn to sample from a one-dimensional distribution conditioned on a vector of parameters? It would be quite useful, for example, for modelling noisy, expensive objectives.

Owner

iaroslav-ai commented Jul 15, 2017

Hmm, I take back my comment about the VAE; I just checked, and it does not make estimates larger than -1.0.

Owner

MechCoder commented Jul 15, 2017

How are you using a VAE to model score?

My understanding of a VAE is that it samples from the data-generating distribution P(X), but here you use P(Z)P(X | Z) and approximate P(X | Z) with a network. Is that right?

Owner

iaroslav-ai commented Jul 15, 2017

Yeah I think you are right, I learn to sample from P(X | Z) where Z are parameters of the neural network, and X are objective values. My implementation follows the Figure 6 in https://arxiv.org/pdf/1606.05908.pdf.
(I will also double-check my VAE pipeline; there might be a bug, which I found by checking against the reference again, so thanks for your comment :D)

Owner

iaroslav-ai commented Jul 17, 2017

Hmm, yeah, the optimization algorithm actually seems to have found some adversarial inputs to the VAE that result in an estimated score > 1.0 :-\ . The complete results are shown below.
Not sure if the differences are just noise.

[figure: init method comparison on the VAE proxy objective]

[figure: init method comparison on the GradientBoostingRegressor proxy objective]

Owner

iaroslav-ai commented Jul 17, 2017

Hmm, the plots above seem to suggest that anything other than uniform initialization might be better in the end.

Owner

betatim commented Jul 17, 2017

> @betatim In your two plots, why does the blue line change in between the first and the second plot?

Two things: the "grid" line moves, which makes it a bit hard to compare the two plots; and in one plot it is just random sampling (the init phase), while in the other it is the init + SMBO phase.

Owner

betatim commented Jul 17, 2017

Can we move the discussion of benchmark (methods) and models etc to a new issue? We are starting to mix several threads of conversation here which makes it hard to follow.

For me these are the questions for this thread:

  • are there good, well-maintained, low-dependency packages that implement latin hypercubes or QR sequences?
  • how much would it cost to implement and maintain one of them ourselves?
  • what does a good API for offering a choice of initialisation procedures look like?
    • should *_minimize grow yet another parameter?
    • a simple function to generate initial points which we pass to x0?
    • make x0 accept a callable that generates points?
    • something totally different?
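The second option could be as small as the sketch below. sample_lhs is a hypothetical helper name (nothing like it exists in skopt at this point), and the commented-out gp_minimize call only illustrates the intended wiring:

```python
import numpy as np

def sample_lhs(bounds, n_points, rng=None):
    """Hypothetical helper: latin hypercube points inside the given
    (low, high) bounds, returned as a list of lists suitable for x0."""
    rng = np.random.default_rng(rng)
    bounds = np.asarray(bounds, dtype=float)
    n_dims = len(bounds)
    # Unit-cube LHS: one point per bin on each axis, bin order shuffled.
    unit = (np.arange(n_points)[:, None] + rng.random((n_points, n_dims))) / n_points
    for d in range(n_dims):
        rng.shuffle(unit[:, d])
    low, high = bounds[:, 0], bounds[:, 1]
    return (low + unit * (high - low)).tolist()

x0 = sample_lhs([(-5.0, 10.0), (0.0, 15.0)], n_points=10, rng=0)
# res = gp_minimize(branin, [(-5.0, 10.0), (0.0, 15.0)], x0=x0, n_calls=32)
```

This keeps the minimizer signatures untouched; the cost is that the user has to know n_init up front, which (per the discussion above) the static designs require anyway.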
Contributor

kiudee commented Jul 27, 2017

> are there good, well-maintained, low-dependency packages that implement latin hypercubes or QR sequences?

After looking at several Python libraries, the most promising and best maintained one is SALib. It offers the Sobol sequence in SALib.sample.sobol_sequence.

Advantages are:

  • Up to 20999 dimensions are supported (compared to the sobol package, which only supports 40)
  • Dependencies: numpy>=1.9.0, scipy, matplotlib>=1.4.3
Owner

betatim commented Jul 27, 2017

Is Sobol better than latin hypercubes?

This is the bit of code we would need, right? I'd be tempted to take just that code instead of adding a new dependency :-/

Semi related question: what would we do about categorical dimensions?

Contributor

kiudee commented Jul 27, 2017

Here is an expanded comparison of the advantages of LHS and Sobol:

Sobol sequence

Advantage:

  • Proven low-discrepancy, which guarantees even coverage of the marginals and the complete space

Disadvantage:

  • Deterministic sequence. We need to specify a random starting value in order to "randomize" it, which requires computing the sequence until that point.

Latin hypercube sampling

Advantages:

  • Randomized: Both the latin squares and the points can be randomized, thus allowing the user to use different initial points
  • Latin squares assure even coverage of the marginals

Disadvantages:

  • No guarantees regarding discrepancy
  • Worse empirical rate of convergence on a number of test problems (see here for an empirical comparison; LHS breaks down especially for higher numbers of dimensions)

Of course LHS without any optimization is easy to implement, and we could simply offer both.

> This is the bit of code we would need, right? I'd be tempted to take just that code instead of adding a new dependency :-/

Yes, we could just take this and the directions.py file, since it is licensed under the MIT license.

> Semi related question: what would we do about categorical dimensions?

Categorical dimensions should not be a problem because we currently use one-hot encoding and the low-discrepancy sequence will evenly cover both sides of the 0/1 interval for each dummy variable. The start value of 0.5 is the only problematic value.
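One way to make the one-hot argument concrete (an illustration of the idea, not skopt code): take the dummy block of a quasi-random point in the unit cube and pick the category whose dummy coordinate is largest; exact ties, such as the sequence start value 0.5 in every coordinate, are the only ambiguous inputs.

```python
import numpy as np

def decode_one_hot(unit_block, categories):
    """Map the one-hot block of a quasi-random point in [0, 1)^k back
    to a category by picking the largest dummy coordinate.  With an
    even cover of [0, 1) per dummy, each category is chosen with
    roughly equal frequency; exact ties (e.g. the start value 0.5 on
    every coordinate) are the only ambiguous inputs."""
    return categories[int(np.argmax(np.asarray(unit_block)))]

decode_one_hot([0.2, 0.9, 0.4], ["rbf", "poly", "linear"])  # → 'poly'
```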

Owner

glouppe commented Jul 28, 2017

This is the bit of code we would need right? I'd be tempted to take just that code instead of adding a new dependency :-/

+1 for that

glouppe closed this Jul 28, 2017

glouppe reopened this Jul 28, 2017

Owner

betatim commented Jul 28, 2017

Sounds like we have a proposal to add Sobol sequences as a way to initialise the optimizers.

> Deterministic sequence. We need to specify a random starting value in order to "randomize" it, which requires computing the sequence until that point.

OK, let's find out how tedious this is to implement nicely in practice. For example, as a user I would just keep setting random_state to random_state + 1 and expect a "completely different" set of numbers to be used. We should keep this property ... and somehow not waste too much CPU time ;)
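To make the skip idea concrete, here is a sketch using the base-2 van der Corput sequence (the 1D radical-inverse building block behind Halton and, via direction numbers, Sobol) as a stand-in for Sobol. Note one caveat: the radical inverse allows random access to element i, whereas the common recursive (Gray-code) Sobol generators must step through every skipped element, which is where the wasted CPU time comes from.

```python
def van_der_corput(index, base=2):
    """Radical inverse of `index` in the given base: mirror the digits
    of `index` around the radix point."""
    result, f = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * f
        f /= base
    return result

def quasi_random(n_points, skip=0):
    """First n_points elements of the sequence after skipping `skip`
    of them; different skips give different-looking prefixes, which is
    one way to map random_state onto the sequence."""
    return [van_der_corput(i) for i in range(skip + 1, skip + n_points + 1)]

quasi_random(4)          # → [0.5, 0.25, 0.75, 0.125]
quasi_random(4, skip=4)  # → [0.625, 0.375, 0.875, 0.0625]
```

Mapping random_state to a skip count would give the "random_state + 1 looks completely different" behaviour, at the cost of generating (for recursive Sobol implementations) all skipped elements first.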

Wondering if this is best implemented as a function in skopt.utils (or similar) that I can use to generate a list of starting points x0, which I then pass to the minimizer, instead of adding another flag to the various methods. Or maybe we should add a new kind of Space which uses a Sobol sequence to sample points instead of sampling at random. WDYT?


Right now the sampling of random values is delegated to each Dimension object. This allows each of them to apply transforms etc.
