[WIP] Gaussian Process revisited (refactoring of correlation models + some new features/bugfixes) #3388

Closed

6 participants

@jmetzen
Member
jmetzen commented Jul 15, 2014

This PR contains a proposal for a refactoring of the gaussian_process subpackage (mainly the correlation_models module) and some new features that become possible because of this refactoring. Moreover, some subtle bugs are fixed. In general, I tried to keep the external interface of GaussianProcess fixed and to change only internals. The main purpose of this PR is to start a discussion on the future of the GP subpackage and to revive its further development.

CHANGES:

  • Correlation models are now classes instead of functions and inherit from the abstract base class StationaryCorrelation. This reduces code redundancy (DRY) and separates GPs and their correlation models more strictly. Moreover, non-stationary correlation models become possible (see examples/gaussian_process/plot_gp_nonstationary.py); a sketch of the proposed interface follows this list. This was not possible before, since only the differences between data points were passed to the correlation models, not the data points themselves.
  • The hyperparameters theta are estimated as maximum-a-posteriori (MAP) estimates rather than maximum-likelihood estimates. At the moment, the prior on the hyperparameters theta is uniform, so both estimates coincide. However, a non-uniform prior could be used in the future in cases where the risk of overfitting theta to the data is serious (e.g., forcing the factor analysis distance to be close to diagonal).
  • Hyperparameters theta are represented as 1D ndarrays instead of 2D 1xn arrays in GaussianProcess.
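
For illustration, here is a minimal sketch of the class-based interface (method names follow the diff quoted later in this thread; the bodies and the SquaredExponential subclass are simplified stand-ins, not the PR's actual implementation):

```python
from abc import ABCMeta, abstractmethod

import numpy as np


class StationaryCorrelation(metaclass=ABCMeta):
    """Sketch of the abstract base class for correlation models."""

    def fit(self, X, nugget=10. * np.finfo(np.double).eps):
        # Store the training data once; the real implementation also
        # precomputes the componentwise cross-differences here.
        self.X = np.asarray(X)
        self.nugget = nugget
        self.n_samples = self.X.shape[0]
        return self

    def __call__(self, theta, X=None):
        # Correlation of the training data with X, or the
        # autocorrelation of the training data if X is None.
        X = self.X if X is None else np.asarray(X)
        d = self.X[:, np.newaxis, :] - X[np.newaxis, :, :]
        return self._correlation(np.asarray(theta), d)

    @abstractmethod
    def _correlation(self, theta, d):
        """Correlation as a function of componentwise differences d."""


class SquaredExponential(StationaryCorrelation):
    def _correlation(self, theta, d):
        return np.exp(-np.sum(theta * d ** 2, axis=-1))
```

A model would then be used as corr = SquaredExponential().fit(X_train), after which corr(theta) yields the training autocorrelation matrix and corr(theta, X_test) the cross-correlations.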

NEW FEATURES:

  • Matern correlation models for nu=1.5 and nu=2.5 have been added (see https://en.wikipedia.org/wiki/Mat%C3%A9rn_covariance_function); the closed forms are sketched after this list. An example script showing the potential benefit of the Matern correlation model compared to squared-exponential and absolute-exponential has been added under examples/gaussian_process/plot_matern_kernel.py (see attached image).
  • The squared_exponential, absolute_exponential, and Matern correlation models support a factor analysis distance. This can be seen as an extension of learning dimension-specific length scales in which correlations between feature dimensions are also taken into account; see Rasmussen and Williams (2006), p. 107, for details. An example script showing the potential benefit of this extension has been added under examples/gaussian_process/plot_gp_learning_curve.py (see attached image).
  • GaussianProcess supports additional optimizers for ML estimation passed as callables: in addition to 'fmin_cobyla' and 'Welch', other optimizers can now be passed directly via the parameter optimizer.
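
For reference, the closed forms of the two added Matern cases (standard textbook formulas; this is a standalone sketch, not the PR's implementation):

```python
import numpy as np


def matern_nu_1_5(d, length_scale=1.0):
    """Matern correlation for nu=1.5 (once-differentiable sample paths);
    d >= 0 is the Euclidean distance between two inputs."""
    r = np.sqrt(3.0) * d / length_scale
    return (1.0 + r) * np.exp(-r)


def matern_nu_2_5(d, length_scale=1.0):
    """Matern correlation for nu=2.5 (twice-differentiable sample paths)."""
    r = np.sqrt(5.0) * d / length_scale
    return (1.0 + r + r ** 2 / 3.0) * np.exp(-r)
```

For nu -> infinity the Matern model converges to squared-exponential, while nu=0.5 recovers absolute-exponential; the added cases thus interpolate in smoothness between the two models the example compares against.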

BUGFIXES

  • Setting optimal_rlf_value correctly in GPs and actually maximizing the likelihood over several random starts. The sign of optimal_rlf_value was set wrongly in _arg_max_reduced_likelihood_function(), since it was set to the negative of the return value of reduced_likelihood_function(). This error was probably caused by confusing reduced_likelihood_function(theta) with minus_reduced_likelihood_function(log10t). As a result, _arg_max_reduced_likelihood_function() returned the worst local maximum instead of the best one when performing several random starts.
  • A typo in gp_diabetes_dataset.py that caused the example to crash was fixed (gp.theta_ versus gp.theta).

To be discussed:

  • The factor analysis distance has hyperparameters that can take on arbitrary real values (not just positive ones). Since sklearn enforces the hyperparameters theta to be positive, this is handled internally by taking the log of the corresponding components of theta (see the sketch below). Are there better ideas?
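
A small sketch of that trick (the unpack helper and its argument layout are hypothetical, purely to illustrate the mapping):

```python
import numpy as np


def unpack(theta, n_real):
    """Split GaussianProcess's positive hyperparameter vector theta:
    the last n_real components encode unconstrained real values w via
    theta = exp(w), so the correlation model recovers w = log(theta)."""
    positive_params = theta[:len(theta) - n_real]
    real_params = np.log(theta[len(theta) - n_real:])
    return positive_params, real_params
```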

(Attached images: plot_gp_learning_curve, plot_matern_kernel)

Jan Hendrik Metzen added some commits Jul 4, 2014
Jan Hendrik Metzen REFACTOR Correlation models now subclasses of generic StationaryCorrelation (work-in-progress)

Advantages
 * GPs and their correlation models more strictly separated
 * Allow non-stationary correlation models in the future
 * Reduce code redundancy
e36452b
Jan Hendrik Metzen REFACTOR Factored out common code of all correlation models into base class bd293b6
Jan Hendrik Metzen PEP8 Minor polishing of test_gaussian_process.py d95c1a5
Jan Hendrik Metzen DOC Documentation and interface of StationaryCorrelation 883c916
Jan Hendrik Metzen FIX Welch optimizer in GPs (self.corr_ now refers to the specific instance, self.corr to the class) 0f4febf
Jan Hendrik Metzen DOC Adapted documentation of GP to changed correlation model interface 5dd83bd
Jan Hendrik Metzen REFACTOR Conversion to absolute value of componentwise differences only when required 01a022f
Jan Hendrik Metzen REFACTOR Using l1_cross_differences instead of l1_cross_distances d8a5929
Jan Hendrik Metzen FIX Using l1_cross_differences instead of l1_cross_distances also during prediction a33acd6
Jan Hendrik Metzen FIX reduced_likelihood_function_value_ in GP correct if no hyperparameters are tuned 1334db1
Jan Hendrik Metzen REFACTOR Representing hyperparameters theta as 1D ndarrays instead of 2D 1xn arrays 4f47269
Jan Hendrik Metzen FIX Setting optimal_rlf_value correctly in GPs and actually maximizing likelihood over several random starts.

The sign of the optimal_rlf_value was set wrongly in
_arg_max_reduced_likelihood_function() since it was set to the negative
of the return of reduced_likelihood_function(). This error was probably
caused by confusing reduced_likelihood_function(theta)
with minus_reduced_likelihood_function(log10t). It resulted, however,
in _arg_max_reduced_likelihood_function() returning the worst local
maximum instead of the best one when performing several random_starts.
b5436fd
Jan Hendrik Metzen FIX Adapted Welch optimizer to 1D shape of theta introduced in 4f4726 6c675b5
Jan Hendrik Metzen REFACTOR GP supports additional optimizers for ML estimation passed as callables

In addition to 'fmin_cobyla' and 'Welch', GaussianProcess now allows
passing other optimizers directly as callables via the parameter optimizer.
0627904
Jan Hendrik Metzen ADD Matern correlation models for nu=1.5 and nu=2.5 92220aa
Jan Hendrik Metzen ADD Factor analysis distance for _quadratic_activation and example showing its advantages b7a7ec0
Jan Hendrik Metzen REFACTOR Correlation models separate between __init__ and fit()

This allows calling __init__ of more complex correlation models outside
of the GP before the training data X used in fit() is known.
ff5ef89
Jan Hendrik Metzen ENH GaussianProcess hyperparameter optimization now based on MAP instead of ML

Instead of choosing the hyperparameters that maximize the likelihood,
the maximum-a-posteriori estimate is used. While the prior is assumed to be
uniform for the implemented stationary correlation models, more
complex correlation models with more hyperparameters may use an
appropriate prior to reduce the risk of overfitting when the training
set is small.
dfa51cb
Jan Hendrik Metzen ADD Example of a non-stationary correlation model ed22c8a
Jan Hendrik Metzen DOC Gaussian Process documentation reflects MAP estimation f46634d
Jan Hendrik Metzen MISC PEP8 and typos fixed in gaussian_process.py 3d5cd8c
@jmetzen
Member
jmetzen commented Jul 15, 2014

This PR might be interesting for @dubourg, @agramfort, @larsmans, @JohnFNovak, and @AlexanderFabisch

@kastnerkyle kastnerkyle commented on the diff Jul 15, 2014
sklearn/gaussian_process/correlation_models.py
+            (nugget = 10. * MACHINE_EPSILON).
+        """
+        self.X = X
+        self.nugget = nugget
+        self.n_samples = X.shape[0]
+
+        # Calculate array with shape (n_eval, n_features) giving the
+        # componentwise distances between locations x and x' at which the
+        # correlation model should be evaluated.
+        self.D, self.ij = l1_cross_differences(self.X)
+        if (np.min(np.sum(self.D, axis=1)) == 0.
+                and not isinstance(self.corr, PureNugget)):
+            raise Exception("Multiple input features cannot have the same"
+                            " value.")
+
+    def __call__(self, theta, X=None):
@kastnerkyle
kastnerkyle Jul 15, 2014 Member

Is there any way we could avoid overriding the __call__ method, and call transform() explicitly instead by setting properties on calls to fit()?

@kastnerkyle
Member

Several things here:

  1. While we are refactoring, is there a way we can deprecate the term "nugget"? I really, really dislike the name - there has to be some clearer name for general users.
  2. I don't know about the two abstract base classes. On the one hand, I see the logic in doing it this way, but I also feel that explicit independent functions for each would make things more modular (which is the reasoning behind the change) and would lend a hand toward using these kernels in an eventual KernelMixin class, which was discussed by @mblondel somewhere around here. At the very least, I currently don't see why overriding __call__ is preferable to simply making that functionality live in .transform() or another method and setting the hyperparameters on fit().
@kastnerkyle
Member

Actually, I am not 100% sure why these classes need all the fit() machinery/stateful behavior at all. It seems like there isn't much need for the class structure; these could be implemented as pure functions that also use some shared functions at the module level. Maybe I am missing something crucial?
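
For illustration, a minimal sketch of this stateless alternative: the correlation model as a plain function of theta and the precomputed componentwise differences D, with the GaussianProcess itself owning D (close in spirit to the pre-refactoring functional models):

```python
import numpy as np


def squared_exponential(theta, D):
    """Stateless squared-exponential correlation.

    theta : array of shape (n_features,), positive length-scale parameters
    D : array of shape (n_pairs, n_features), componentwise differences
        precomputed and stored by the GaussianProcess itself
    """
    return np.exp(-np.sum(theta * D ** 2, axis=1))
```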

@jmetzen
Member
jmetzen commented Jul 15, 2014

Regarding 1): I also dislike nugget, but it is part of the external GaussianProcess API, so I didn't change it.
Regarding 2): I am open to using an interface other than __call__; I mainly used __call__ because it allows using the correlation model in a similar way to the old functional correlation models. I don't know about calling it transform; it does not really transform the data and can even be called without data X (computing the autocorrelation on self.X). Regarding splitting the abstract base class into several independent functions: I am not sure this could easily be passed to the GP via the corr parameter of the constructor.

@kastnerkyle
Member

On the first, I am glad to see that I am not the only person who is bugged by the name :). It would be a solid opportunity to deprecate nugget at the API level and then introduce a new parameter that we expose as well and use internally, though I am not sure of a good name yet.

I will try to work up a skeleton example off your branch to show what I mean on the second count. Thanks for the work so far - GP definitely needs some love!

@jmetzen
Member
jmetzen commented Jul 15, 2014

Regarding the fit() machinery/stateful behavior: the l1_cross_differences stored in StationaryCorrelation.D need to be computed only once. Formerly, they were stored in the GP itself (where they are not really required). The stateful behavior would also be required for implementing some non-stationary correlation models from the literature (not part of this PR).

@kastnerkyle
Member

On the l1_cross_differences, it would be pretty easy to do an "if None" check to avoid recomputing.

I would actually prefer storing state in the GP itself - if that was the case, we could still use the state stored in the GP to implement non-stationary correlation models, right? It seems more explicit to me to keep the state in the class itself, and let the correlations stay as simple as possible. In any case, I will try and work out an example of "stupid" correlations off your branch to see if it is a reasonable approach.

@kastnerkyle
Member

Ultimately, I think my comments are better addressed by a larger refactor of the GP class. Ignore me for now, and we can analyze this PR on its own merits.

In general we should work toward making the GP class more transparent and accessible - I currently find it quite complicated for (possibly) no reason, and much of the class does not adhere to modern sklearn API standards.

Adding a class hierarchy to the kernel methods which is not shared with other kernel classes, pairwise_metrics, etc. doesn't really help this. If we decide to merge this PR, I would prefer to see these implementation details (the two metaclasses and their subclasses) private rather than public, in order to make a larger refactor easier in the future. Making the correlations/kernels/covariances very dumb and keeping as much state contained to the GP class as possible is also a step in this direction. That said, I think this PR can definitely help people work in the GP class now, which is very important.

There is a lot in common between kernel ridge regression and GP/kriging - I do not advocate writing "the class to end all classes" but it is definitely worth seeing if we can merge some common functionality now and in the future.

@larsmans
Member

This is a big PR, let's not start changing the API ("nugget" vs. whatever) if we don't need to. Too many good PRs go stale because too many changes are introduced at the same time and the review process becomes a nightmare.

@kastnerkyle
Member

Good point. If it helps get this merged, let's minimize the surface area
and go with what it does now. I think it is very good, but I am unhappy
with the GP in general. This should help that :)

@jmetzen
Member
jmetzen commented Jul 16, 2014

+1 that the GP class needs to become more transparent and accessible.
In my opinion, the best way to achieve this is to focus on one part of it at a time (this PR is probably already too big). Maybe we can leave the ML vs. MAP part for later and focus first on getting the correlation models straight. Here are my thoughts on this:

  • correlation_models vs. pairwise_metrics: the strongest overlap here is in my opinion the function l1_cross_differences, which is pretty similar to manhattan_distances from pairwise (for Y == None). The main difference seems to be that l1_cross_differences exploits the symmetry of D (when Y == None) and thus has a smaller memory footprint (see the sketch after this list). Another overlap is between SquaredExponential and the rbf_kernel from pairwise; rbf_kernel is restricted to the isotropic case at the moment, however.
  • correlation models as functions versus stateful objects: for the correlation models currently implemented in sklearn, the state basically consists of the pairwise distances of the training data (via D and ij). This could be kept in the GaussianProcess class. I see many use cases for custom correlation models (and have some myself) that need not become part of sklearn itself. It would be sufficient to keep them external but be able to use them with sklearn's GP, as long as they have an interface compatible with sklearn's correlation models. Since some kernels might differ not only in computation but also in their internal state (one of my examples stores a clustering of X internally), stateful objects are in my opinion better suited for this. The specific interface of StationaryCorrelation is certainly not optimal; I am open to suggestions ;-)
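
A small illustration of the symmetry/memory point (this mimics only the idea; the actual l1_cross_differences helper returns the componentwise differences together with the pair indices ij):

```python
import numpy as np
from sklearn.metrics.pairwise import manhattan_distances

X = np.random.RandomState(0).randn(4, 2)
n = X.shape[0]

# pairwise computes the full, redundant n x n distance matrix:
D_full = manhattan_distances(X)            # shape (4, 4)

# exploiting symmetry stores only the n * (n - 1) / 2 pairs with i < j:
i, j = np.triu_indices(n, k=1)
D_half = np.abs(X[i] - X[j]).sum(axis=1)   # shape (6,)

assert np.allclose(D_full[i, j], D_half)
```
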
@kastnerkyle
Member

Is there any way we could separate the bugfix stuff and the refactoring/API change proposal into two PRs? I really, really like the bugfixes but we will need to have a lot of thought and input about the API stuff. If we can split this into two smaller PRs I think it is more likely to get looked at.

@jmetzen
Member
jmetzen commented Jul 21, 2014

I will create a separate PR for the bugfixes and corresponding tests

@kastnerkyle
Member

Great - there are a lot of fixes here :)

@dubourg

Yes indeed! See PR #2632.

@jmetzen
Member
jmetzen commented Jul 22, 2014

Oops, it seems that we fixed the same issue twice. In order to finally get PR #2632 merged, I added a proposal for a regression test in the comments.

@jmetzen
Member
jmetzen commented Aug 30, 2014

Now that the bugfixes are merged via PR #3545, what is the general opinion regarding the new features? And using classes for the correlation_models?

If this PR is too big, I can focus on a subset of the features/refactoring and create a smaller PR.

@jmetzen
Member
jmetzen commented Dec 19, 2014

I have created a separate PR (PR #3885) for adding the Matern kernel to pairwise.py. Once that is done, I will try to refactor correlation_models.py such that it uses as much from pairwise.py as possible, and then close this PR. What is your opinion, @kastnerkyle?

@kastnerkyle
Member

That sounds awesome! Thanks for all your work on this, it is really nice to
see the GP getting some fixes - sharing the kernels via pairwise.py should
be good for many estimators. It will also provide a clear roadmap for
people wanting to use their own kernels.

@jmetzen jmetzen closed this Jan 17, 2015
@jmetzen jmetzen referenced this pull request Jan 17, 2015
Open

[MRG] Matern kernel #3885

2 of 2 tasks complete
@amueller
Member

@jmetzen you closed this because you want to break it up? Sorry for the lack of feedback before.

@jmetzen
Member
jmetzen commented Jan 18, 2015

I plan to split the contents of this PR into several smaller PRs. First step would be to refactor correlation_models.py such that it reuses the kernels from pairwise.py.

@amueller
Member

Btw, did you follow the (somewhat) recent discussion on the mailing list about whether it would be worthwhile to have a language for specifying kernels in sklearn?
When using the pairwise module we might end up with a lot of redundant computation if we combine many kernels, right?
I guess it would still be better than what we currently have, though.

@mblondel
Member

I suspect that we'll need kernel classes anyway for kernel parameter optimization by gradient descent. Also, popular kernels tend to be different for SVMs and GPs.

@amueller
Member

So if we need kernel classes, I'm not sure the proposed refactoring is the best idea.

@mblondel
Member

My advice would be to worry about code factorization after the GP module got rewritten.

@jmetzen
Member
jmetzen commented Jan 20, 2015

Makes sense to defer refactoring until the GP module is rewritten. I also checked how pairwise's kernels could be reused in the GP context, and there are at least two drawbacks: pairwise's kernels are restricted to the isotropic case (scalar gamma), and they don't exploit that k(X, X) is symmetric, in contrast to the current GP implementation. Thus, code reuse between pairwise and correlation_models would probably also require some non-trivial refactoring of pairwise.
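
A quick illustration of the isotropy restriction, with per-feature rescaling of the inputs as one possible workaround (the symmetry issue remains either way):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.RandomState(0).randn(5, 3)

# rbf_kernel accepts only a scalar gamma (one global length scale) and
# computes all n * n entries although k(X, X) is symmetric:
K_iso = rbf_kernel(X, gamma=0.5)

# anisotropic, per-feature length scales can be emulated by rescaling
# the inputs before calling rbf_kernel:
length_scales = np.array([1.0, 2.0, 0.5])
K_aniso = rbf_kernel(X / length_scales, gamma=0.5)
```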

BTW: Is anyone currently working on the GP refactoring?

@amueller
Member

Unfortunately not.
@dfm said he might have some time to help with improving documentation, but he seems busy finding a new earth ;)
