This repository has been archived by the owner on Feb 28, 2024. It is now read-only.

Gaussian Process for Integer and Categorical Dimensions? #580

Closed
Hvass-Labs opened this issue Dec 17, 2017 · 5 comments

Comments

@Hvass-Labs
Contributor

One of the many things that I really like about skopt is that I can mix real-valued, integer and categorical dimensions in the search-space. This is REALLY powerful!

However, I'm not an expert on Gaussian Processes (GP) and Bayesian Optimization (BO), so can you explain to me (and the rest of the users) how you actually do this internally in skopt? All the lectures I have watched on GP and BO always assume a single real-valued variable. How do you model integers and categorical variables using a GP?

@Hvass-Labs Hvass-Labs changed the title Gaussian Process for Integer and Categorial Dimensions? Gaussian Process for Integer and Categorical Dimensions? Dec 17, 2017
@betatim
Copy link
Member

betatim commented Dec 21, 2017

If the search space is purely categorical, then we use a special kernel. Beyond that, take a look at the transform for categorical dimensions.
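To make the idea concrete, here is a rough numpy sketch of a Hamming-type kernel over categorical points, where similarity depends only on how many coordinates two points disagree on. This is an illustration of the general idea, not skopt's actual kernel code, and the example values are made up:

```python
import numpy as np

def hamming_kernel(x, y, theta=1.0):
    # k(x, y) = exp(-theta * #{i : x_i != y_i})
    # Identical points get k = 1; similarity decays with every
    # categorical coordinate on which the two points disagree.
    x, y = np.asarray(x), np.asarray(y)
    return np.exp(-theta * np.sum(x != y))

print(hamming_kernel(['relu', 'adam'], ['relu', 'adam']))  # 1.0
print(hamming_kernel(['relu', 'adam'], ['relu', 'sgd']))   # exp(-1) ~ 0.368
```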

Personally, I use tree-based optimisers instead of a GP-based optimiser when the space has lots of integers/categories.
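For reference, a minimal sketch of what that looks like with skopt's forest_minimize; the dimensions and the toy objective below are made up for illustration:

```python
from skopt import forest_minimize  # tree-based optimiser
from skopt.space import Integer, Categorical, Real

# A hypothetical mixed search-space.
space = [Integer(4, 13, name='n_layers'),
         Categorical(['linear', 'sigmoid', 'relu'], name='activation'),
         Real(1e-4, 1e-1, prior='log-uniform', name='learning_rate')]

def objective(params):
    # Toy objective standing in for e.g. a validation loss.
    n_layers, activation, lr = params
    return (n_layers - 8) ** 2 + (activation != 'relu') + abs(lr - 0.01)

res = forest_minimize(objective, space, n_calls=20, random_state=1)
print(res.x, res.fun)
```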

@Hvass-Labs
Contributor Author

I would like to understand this so I can explain it in my tutorial.

For Integer dimensions, it appears from the comment (yay!) in Integer.inverse_transform() that the search-space is perhaps always real-valued and this is simply cast to / from integers. Is that correct?

Is it documented somewhere that all dimensions are converted / transformed internally to real-valued? Or is this transformation only done for Gaussian Processes?
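To make my reading concrete, here is a back-of-the-envelope sketch of the cast I have in mind: the dimension is scaled to [0, 1] and rounded on the way back. This is just my guess, not skopt's exact code, and low/high are an example Integer dimension:

```python
import numpy as np

low, high = 4, 13  # an example Integer dimension

def to_real(k):
    # Map an integer in [low, high] to a real in [0, 1].
    return (k - low) / (high - low)

def to_integer(x):
    # Map a real in [0, 1] back by rescaling and rounding.
    return int(np.round(x * (high - low) + low))

print(to_real(8))         # 0.444...
print(to_integer(0.444))  # 8
```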

For Categorical dimensions, I managed to track down something called LabelBinarizer, but I don't really understand how it works with a Gaussian Process. For example, suppose we have a Categorical dimension named activation that can take one of three values: linear, sigmoid, relu. Does this get one-hot encoded, so that linear corresponds to e.g. [1.0, 0.0, 0.0], sigmoid to [0.0, 1.0, 0.0], and relu to [0.0, 0.0, 1.0]? Then a position in the search-space could be e.g. [0.1, 0.6, 0.3], and to get the category we do an argmax and find the sigmoid category? Is that how it works?
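My hypothesis can be checked directly with scikit-learn's LabelBinarizer; note that it sorts the classes alphabetically, so the one-hot positions come out in a different order than in my example above:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(['linear', 'sigmoid', 'relu'])
print(lb.classes_)                # ['linear' 'relu' 'sigmoid'] (sorted)
print(lb.transform(['sigmoid']))  # [[0 0 1]]

# A point proposed in the transformed space need not be a valid
# one-hot vector; inverse_transform picks the class with the
# largest score, i.e. an argmax.
print(lb.inverse_transform(np.array([[0.1, 0.2, 0.7]])))  # ['sigmoid']
```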

But the categorical dimension can have a very non-smooth impact on the objective function. Is the Gaussian Process really able to handle such non-smooth dimensions? Or is that perhaps why you prefer to use tree-based search?

@betatim
Member

betatim commented Dec 29, 2017

How each optimisation method handles the encoding/transforming of dimensions is considered an implementation detail. As a result, you need to read the code to find out how it is done. This has the advantage that we can change how it works without having to deprecate behaviour first.

Your description of what happens for a GP is correct:

```python
In [8]: import numpy as np

In [9]: from skopt.utils import normalize_dimensions

In [10]: space = normalize_dimensions([(4, 13), ('a', 'b', 'c', 'd'), (34., 42.23)])

In [11]: space.rvs(random_state=1)
Out[11]: [[8, 'c', 34.000941304746746]]

In [13]: space.transform(space.rvs(random_state=1))
Out[13]:
array([[  4.44444444e-01,   0.00000000e+00,   0.00000000e+00,
          1.00000000e+00,   0.00000000e+00,   1.14374817e-04]])

In [14]: space.inverse_transform(np.array([[0.444444444444444445, 0.2, 0.2, 0.2,
    ...:  0.4, 1.14374817e-4]]))
Out[14]: [[8, 'd', 34.000941304743911]]
```

> But the categorical dimension can have a very non-smooth impact on the objective function. Is the Gaussian Process really able to handle such non-smooth dimensions? Or is that perhaps why you prefer to use tree-based search?

Correct. To learn more, you can start by reading papers by Frank Hutter, for example https://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf (section 4). I think they were the first people to use trees instead of GPs. I think the best way to understand the trade-offs and caveats of GPs vs trees is to watch some of the many videos from the various summer schools on black-box optimisation and GPs (google for Nando de Freitas and the Sheffield summer school).

@Hvass-Labs
Contributor Author

Thank you. It might be useful to add some of this information to the docstrings.

@mjmikulski

@Hvass-Labs if you are still interested in the subject, have a look at this very illustrative paper:
Dealing with Categorical and Integer-valued Variables in Bayesian Optimization with Gaussian Processes

They modify the kernel to use the information about the integer-ness of a variable; see sections 3.1 and 3.2 and figure 1 on page 7.
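A tiny numpy sketch of the idea from section 3.1: round the integer-valued inputs inside the kernel, so the GP treats all real points that round to the same integer as identical. The RBF helper and the example values are just for illustration, not the paper's code:

```python
import numpy as np

def rbf(x, y, length_scale=1.0):
    # Standard squared-exponential (RBF) kernel.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-np.sum((x - y) ** 2) / (2 * length_scale ** 2))

def integer_aware_rbf(x, y, length_scale=1.0):
    # Round integer dimensions *before* evaluating the kernel, so the
    # GP posterior is flat between consecutive integer values.
    return rbf(np.round(x), np.round(y), length_scale)

print(rbf([3.1], [3.4]))                # ~0.956: inputs differ slightly
print(integer_aware_rbf([3.1], [3.4]))  # 1.0: both round to 3
```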
