Local-search based technique to optimize the acquisition function of trees and friends #74
Comments
They also say "we plan to investigate better mechanisms in the future". Might be a good place for us to investigate as well :P |
Thanks for digging this up. This does look very hackish indeed and it would definitely be nice to find a better solution for that. |
Should we implement this first as a baseline? |
Having re-read the SMAC paper, I think we should stick with random sampling for now; their procedure seems quite convoluted. They combine the results of 10 local searches with 10000 random samples, and it is not terribly clear to me that this does better than simple random sampling. Do you understand the last paragraph in section 4.3? |
Hmm.. They claim, however, in the paragraph before the last that the ten local searches always outperform the random samples. So it might be worth a try to just do the local searches independent of the random sampling. I have it in my branch here (https://github.com/MechCoder/scikit-optimize/tree/paramils) and will send a PR once I have verified that it works. |
What I couldn't work out was: if the ten local searches "always" outperform the 10000 random samples, why do they keep drawing the random samples at all? The other thing I find confusing is the statement about interleaving random samples for training purposes and unbiasedness. Looking forward to a PR; then we can benchmark things and compare how much we gain from using a smarter/more complicated method. |
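For what it's worth, a minimal sketch of the selection step being discussed, as I understand it: SMAC proposes the best point found by either the local searches or the random batch, and the random samples also feed the model, which may be where the training/unbiasedness argument comes in. All names here are made up for illustration, and the acquisition function is written as something to minimize:

```python
import numpy as np

def propose_next(acq, local_optima, random_samples):
    """Pick the candidate with the lowest acquisition value from the
    union of local-search results and random samples (hypothetical)."""
    candidates = list(local_optima) + list(random_samples)
    values = [acq(x) for x in candidates]
    return candidates[int(np.argmin(values))]
```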
OK, I now have a working version, but it fails for 2 of the 6 minimizers. I shall verify tomorrow and let you know whether the method performs worse or it is a bug in the code. (I hope it is the latter) |
Great! Could you make a WIP PR? I am curious to see how you do that :) |
done :-) |
We cannot optimize the acquisition function of tree-based models using conventional gradient / second-order methods. SMAC does it in the following way, described on page 13 of http://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf
Some terminology: given `p` parameters and a parameter configuration, a one-exchange neighbourhood is the set of configurations that differ from it in exactly one parameter.

- For a parameter (`X`) that is continuous, a neighbour is sampled from a Gaussian centered at `X` with std 0.2, keeping all other parameters constant.
- For a parameter (`Y`) that is categorical, a neighbour takes any other category of `Y`, keeping all other parameters constant.

Seems like they do a multi-start local search with 10 points. For each local search:

- Compute the acquisition values of the one-exchange neighbours of the current point `p`.
- If `p` has a lower acquisition value than all of its neighbours, then terminate.
- Else, reassign `p` to the neighbour with minimum acquisition value.

Then return the minimum of all the 10 local searches.
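A minimal Python sketch of this procedure, assuming a simple list-of-tuples encoding of the search space and an acquisition function `acq(x)` to be minimized. None of these names are skopt's or SMAC's actual API; the std-0.2 Gaussian follows the paper's convention (as I read it) of sampling in the parameter's normalised [0, 1] range:

```python
import numpy as np

# space: list of ("continuous", (low, high)) or ("categorical", values)
# pairs -- a made-up encoding for illustration only.

def one_exchange_neighbours(x, space, rng):
    """Configurations differing from x in exactly one parameter."""
    neighbours = []
    for i, (kind, spec) in enumerate(space):
        if kind == "continuous":
            low, high = spec
            # Sample a Gaussian step with std 0.2 in the parameter's
            # normalised [0, 1] range, then map back to (low, high).
            z = (x[i] - low) / (high - low) + rng.normal(0.0, 0.2)
            neighbour = list(x)
            neighbour[i] = low + np.clip(z, 0.0, 1.0) * (high - low)
            neighbours.append(neighbour)
        else:
            # Categorical: every other category is a neighbour.
            for value in spec:
                if value != x[i]:
                    neighbour = list(x)
                    neighbour[i] = value
                    neighbours.append(neighbour)
    return neighbours

def local_search(acq, x, space, rng, max_iter=100):
    """Greedy descent over one-exchange neighbourhoods."""
    best, best_val = list(x), acq(x)
    for _ in range(max_iter):
        neighbours = one_exchange_neighbours(best, space, rng)
        values = [acq(n) for n in neighbours]
        i = int(np.argmin(values))
        if values[i] >= best_val:
            break  # no neighbour improves: (approximate) local minimum
        best, best_val = neighbours[i], values[i]
    return best, best_val

def multi_start_local_search(acq, starts, space, seed=0):
    """Run a local search from each start; return the overall minimum."""
    rng = np.random.default_rng(seed)
    return min((local_search(acq, x, space, rng) for x in starts),
               key=lambda r: r[1])
```

With `starts` being the 10 points SMAC uses, the proposal step would then take the minimum over these local-search results and the batch of random samples, as in the snippet further up the thread.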