[WIP] Add Gaussian Process Classification (GPC) with Laplace approximation #2340

Closed

@dashesy
dashesy commented Aug 3, 2013

This is almost entirely based on "Gaussian Processes for Machine Learning" by Rasmussen and Williams.

Please let me know where I should add test methods, and anything else I need to do (I am not sure, but I cannot subscribe to the mailing list to ask my questions there).
I can add real benchmarks as soon as I get some feedback, and hopefully some pointers.

Code

I followed the Newton formulation described in the text. The method computes the inverse Hessian directly, so optimize.fmin_ncg was not used; this has the advantage that the kernel does not need to be inverted, and it avoids numerical instability (as described in the book). I have an implementation that does use fmin_ncg, but it requires the inversion; it can be added as a separate method to the class.
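For reference, a minimal sketch of the Newton iteration for the binary case, along the lines of Algorithm 3.1 in Rasmussen & Williams; the multi-class code in the diff works block-wise, but the idea is the same. The kernel matrix K, the 0/1 labels y, and the logistic likelihood are assumptions of this sketch, not names taken from the PR:

```python
import numpy as np
from scipy import linalg


def laplace_mode(K, y, max_iter=20, tol=1e-8):
    """Find the mode of the latent posterior for binary GPC with a logistic
    likelihood (Algorithm 3.1, Rasmussen & Williams).
    K is the (n, n) kernel matrix, y contains 0/1 labels."""
    n = K.shape[0]
    f = np.zeros(n)                       # latent function values, start at 0
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-f))     # sigmoid likelihood
        W = pi * (1.0 - pi)               # negative Hessian of log-likelihood (diagonal)
        sqrt_W = np.sqrt(W)
        # B = I + W^1/2 K W^1/2 is well conditioned; only B is factorized,
        # K itself is never inverted.
        B = np.eye(n) + sqrt_W[:, None] * K * sqrt_W[None, :]
        L = linalg.cholesky(B, lower=True)
        b = W * f + (y - pi)              # Newton step in the "b" parametrization
        v = linalg.solve_triangular(L, sqrt_W * K.dot(b), lower=True)
        a = b - sqrt_W * linalg.solve_triangular(L.T, v, lower=False)
        f_new = K.dot(a)                  # f = K a, so a plays the role of K^-1 f
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f, a                           # mode f_hat and a = K^-1 f_hat
```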

Comparison

It seems the Laplace approximation for finding the posterior distribution is not well regarded in the literature, but it should be a good reference implementation.
Since the Expectation Propagation (EP) method described there is for binary classification only, I implemented just the Laplace approximation for now; if there is any good reference for EP or other methods, please let me know. Another option is to use EP in a one-vs-rest (OvR) classification.
Variational methods are also mentioned in the book and seem to be easier to implement than EP for multiple classes.
If integrating with PyMC is an option, an MCMC approach would be a nice addition too.

Rationale

Gaussian Process Classification (GPC) is a kernel-based non-parametric method. Its main advantages are flexibility and noise tolerance. It is a natural generalization of linear logistic regression, similar to the transition from linear regression to GP regression: training learns latent functions rather than a fixed weight vector.
It is also the starting point for some powerful dimensionality reduction (DR) techniques based on GP-LVM, which learn the manifold of high-dimensional data from a small number of samples.
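To make that analogy concrete, a toy contrast between the two models; mu_x and var_x stand for the approximate posterior mean and variance of the latent function at a test point, and are illustrative names, not names from the diff:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Linear logistic regression: the score is a fixed linear function of x.
def p_logistic(x, w):
    return sigmoid(np.dot(w, x))


# GP classification: the score is a latent function value f(x) with a
# (Gaussian) posterior; the prediction averages the sigmoid over that
# posterior, here by sampling from N(mu_x, var_x).
def p_gpc(mu_x, var_x, rng, n_samples=1000):
    f = rng.normal(mu_x, np.sqrt(var_x), size=n_samples)
    return sigmoid(f).mean()
```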

TODO

I wanted to get some feedback before I continue, but here is my TODO list:
1- The hyperparameter theta can be optimized in a straightforward way, similar to GPR
2- Add numerical integration for prediction instead of Monte Carlo (MC) sampling; I wanted to follow the textbook first (see the quadrature sketch after this list)
3- Add a variational or EP method if I find good references within my level of math
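For the binary case, item 2 could be done with a one-dimensional Gauss-Hermite quadrature over the Gaussian latent at each test point (the multi-class softmax needs a higher-dimensional rule or a different reduction). A sketch under those assumptions, where mu and var are the Laplace predictive mean and variance (hypothetical names):

```python
import numpy as np


def predict_proba_binary(mu, var, n_points=32):
    """Approximate p(y=1|x) = E[sigmoid(f)] with f ~ N(mu, var)
    using Gauss-Hermite quadrature instead of Monte Carlo sampling."""
    # nodes/weights for integrals of the form: integral g(t) * exp(-t^2) dt
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    # change of variables f = mu + sqrt(2 * var) * t gives
    # E[g(f)] ~= (1 / sqrt(pi)) * sum_i w_i * g(mu + sqrt(2 * var) * t_i)
    f = mu + np.sqrt(2.0 * var) * nodes
    return np.dot(weights, 1.0 / (1.0 + np.exp(-f))) / np.sqrt(np.pi)
```

For scalar mu/var this converges to the same value as the MC estimate as mc_iter grows, at a fraction of the cost.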

My plan is to continue by implementing the latent variable model (GP-LVM) and then making it supervised (SGP-LVM). It will be based on these papers:

"Supervised Gussian Process LVM for Dimensionality Reduction", Xinbo Gao
"Supervised Latent Linear Gaussian Process LVM Based Classification", Hou, Feng, Zou
"Supervised Latent Linear Gaussian Process LVM for Dimensionality Reduction", Jiang, Gao, Wong, Zheng
"Discriminative Gaussian Process LVM for Classification", Urtasun and Darrell

Benchmarks

I ran the standard plot_classifier_comparison against all classifiers; the last one is GPC (without MLE of theta).

(comparison figure: output of plot_classifier_comparison with GPC added as the last classifier)

I will add one more comparison as soon as I finish the MLE of theta.

@ogrisel ogrisel commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -878,3 +916,643 @@ def _check_params(self, n_samples=None):
# Force random_start type to int
self.random_start = int(self.random_start)
+
+
+class GaussianProcessClassifier(BaseEstimator, ClassifierMixin):
+ """The Gaussian Process Classifier class.
@ogrisel
ogrisel Aug 4, 2013 Member

Just put "Gaussian Process Classifier" here.

Also please add a small 2- or 3-line paragraph that gives the gist of how the classification reduction from a regression model such as GP works, if possible.

@ogrisel
Member
ogrisel commented Aug 4, 2013

I am not familiar at all with GPC but isn't there anything that can be reused from the GP regression model already implemented in scikit-learn? Or maybe factorize some common parts into private helper functions or a base or mixin class?

I would be very interested in benchmarks too.

@ogrisel
Member
ogrisel commented Aug 4, 2013

If integrating with PyMC is an option, an MCMC approach would be a nice addition too.

Integrating with PyMC is not an option, as we don't want to maintain any more dependencies besides numpy and scipy. However, the PyMC devs themselves are starting to implement Bayesian, MC-based machine learning models of their own, so they might be interested in a GPC model implementation.

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -24,6 +25,43 @@
def solve_triangular(x, y, lower=True):
return linalg.solve(x, y)
+def inv_triangular(A, lower = False):
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

Two remarks on this function (which is cool!):

  • It should be in utils.extmath
  • The routing between the 2 approaches should be performed at module-loading time. In other words, you should implement 2 functions, and at module loading time assign one to 'inv_triangular', as it is currently done for np.unique in utils.fixes.
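A minimal sketch of that load-time routing applied to inv_triangular (helper names are hypothetical; the real versions would live in utils.extmath, with the dispatch done as in utils.fixes):

```python
import numpy as np
from scipy import linalg


def _inv_triangular_solve(A, lower=False):
    # Fallback: solve A X = I, which works on any scipy version.
    return linalg.solve_triangular(A, np.eye(len(A)), lower=lower)


def _inv_triangular_lapack(A, lower=False):
    # Fast path: invert the triangular factor directly with LAPACK dtrtri.
    A_inv, info = linalg.lapack.clapack.dtrtri(np.asarray(A, dtype=np.double),
                                               lower=lower)
    if A_inv is None:  # mirror the diff's fallback when dtrtri declines
        A_inv = _inv_triangular_solve(A, lower=lower)
    return A_inv


# Route once, at module-loading time, instead of checking inside the function.
if hasattr(linalg.lapack, 'clapack') and hasattr(linalg.lapack.clapack, 'dtrtri'):
    inv_triangular = _inv_triangular_lapack
else:
    inv_triangular = _inv_triangular_solve
```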
@dashesy
dashesy Aug 5, 2013

Thanks, I will; then I need to move solve_triangular to extmath too.

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+ dtrtri = linalg.lapack.clapack.dtrtri
+ A = np.double(A)
+ A_inv, info = dtrtri(A, lower = lower)
+
+ if A_inv is None:
+ A_inv = solve_triangular(A, np.eye(len(A)), lower = lower)
+
+ return A_inv
+
+if hasattr(sparse, 'block_diag'):
+ # only in scipy since 0.11
+ block_diag = sparse.block_diag
+else:
+ # slower, but works
+ def block_diag(lst, *params, **kwargs):
+ return sparse.csr_matrix(linalg.block_diag(*lst))
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

For testing purposes, it is better to first define _block_diag independently of the version of scipy, and then assign it to block_diag. That way you can test the version for old scipy even on recent scipy.
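A sketch of that suggestion: the fallback stays importable under its own name, so tests can exercise it even when a recent scipy is installed (names hypothetical):

```python
from scipy import linalg, sparse


def _block_diag_dense(lst, *args, **kwargs):
    # Version-independent fallback: build a dense block-diagonal matrix,
    # then sparsify. Slower than sparse.block_diag, but always available
    # and always directly testable.
    return sparse.csr_matrix(linalg.block_diag(*lst))


# Prefer the sparse implementation when present (scipy >= 0.11), but keep
# _block_diag_dense importable so the test suite can call it explicitly.
block_diag = getattr(sparse, 'block_diag', _block_diag_dense)
```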

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+ # TODO: Implement variational approximation to the posterior
+ # TODO: better initialize f_vec
+ # TODO: investigate reformulating to use fmin_ncg but without the need to K^-1
+ # TODO: implement numerical integration for prediction to use if mc_iter=0
+
+ # Force data to 2D numpy.array
+ X = array2d(X)
+
+ # Check shapes of DOE & observations
+ n_samples_X, n_features = X.shape
+ n_samples_y = y.shape[0]
+
+ if n_samples_X != n_samples_y:
+ raise ValueError("X and y must have the same number of rows.")
+ else:
+ n_samples = n_samples_X
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

sklearn.utils.check_arrays should do these checks and a bit more, rendering your code even more robust.
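A sketch of what that replacement could look like, assuming the check_arrays helper available in scikit-learn at the time (since superseded by check_X_y / check_array):

```python
from sklearn.utils import check_arrays

X_raw = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
y_raw = [0, 1, 1]

# check_arrays converts the inputs to arrays and raises a ValueError itself
# when X and y do not have a consistent number of samples, replacing the
# manual shape comparison in the diff.
X, y = check_arrays(X_raw, y_raw)
n_samples, n_features = X.shape
```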

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+
+ # Create empty cache
+ self._remove_cache()
+
+ # Normalize data or don't
+ if self.normalize:
+ X_mean = np.mean(X, axis=0)
+ X_std = np.std(X, axis=0)
+ X_std[X_std == 0.] = 1.
+ # center and scale X if necessary
+ X = (X - X_mean) / X_std
+ else:
+ X_mean = np.zeros(1)
+ X_std = np.ones(1)
+
+ D, ij = l1_cross_distances(X)
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

Could you avoid names that are made of a single initial? We made the mistake of using these in the beginning, and when coming back to the code after a few years, it is really hard to understand what they mean. You should try to find names that somewhat summarize what the variable is about (not saying that it's easy, but it is useful).

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+ raise Exception("Multiple input features cannot have the same"
+ " value.")
+
+ self._cache['D'] = D
+ self._cache['ij'] = ij
+
+ # 0/1 encoding of the labels
+ y_vec = np.zeros((n_classes, n_samples))
+ for c in range(n_classes):
+ y_vec[c, y == c] = 1
+ y_vec = y_vec.reshape((n_classes * n_samples, 1))
+
+ # Keep model parameters
+ self.X = X
+ self.y = y_vec
+ self.X_mean, self.X_std = X_mean, X_std
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

All the above should have trailing underscores (as in 'X_'), as they are quantities derived from the data during training.

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+ # Looka at them all
+ pi_vec = np.hstack((pi_vec, pi_vec2))
+
+ p[idx] = pi_vec.mean(axis = 1)
+
+ elif self.mc_iter > 0:
+ # If a specific number is given, use it
+ for idx in range(n_eval):
+ f_vec = np.random.multivariate_normal(test_means_[idx], test_covars_[idx], self.mc_iter).T
+ f_vec = f_vec.reshape(f_vec.size)
+ pi_vec = self.soft_max(f_vec)
+ pi_vec = pi_vec.reshape(n_classes, self.mc_iter)
+ p[idx] = pi_vec.mean(axis = 1)
+ else:
+ # TODO: use a integrals calculation
+ raise ValueError("Only monte carlo method implemented at this stage.")
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

Our policy in scikit-learn has so far been to avoid any Monte Carlo-based methods, because they do not scale.

@ogrisel
ogrisel Aug 4, 2013 Member

Well, RBM training is based on some sort of sampling, albeit very specific to the RBM architecture.

@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

Well, RBM training is based on some sort of sampling, albeit very specific to the RBM architecture.

Good point. And it doesn't scale :þ

@ogrisel
ogrisel Aug 4, 2013 Member

Also, GPs themselves do not scale, as far as I know, since they need to maintain a full copy of the training dataset. Still, they might be interesting for small, smooth problems.

@dashesy
dashesy Aug 5, 2013

I plan to make numerical integration the default and keep MC sampling available because it is mentioned in the book. GPs usually do not scale, but there are some interesting works that use exotic kernels or approximations to make them usable with large data. Simple GPs work very well with a small number of samples and high dimensions (curse of dimensionality); that is at least where I have found them to be most useful.

@GaelVaroquaux
GaelVaroquaux Aug 5, 2013 Member

GPs usually do not scale, but there are some interesting works that use exotic kernels or approximations to make them usable with large data.

That's interesting!

Simple GPs work very well with a small number of samples and high dimensions (curse of dimensionality); that is at least where I have found them to be most useful.

Not everybody has high n, low p data, indeed. I just want to avoid something that is slow if p or n are high.

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+ print("Computed MC samples for 95%% confidense is %s" % N)
+
+ if N > 100:
+ f_vec = np.random.multivariate_normal(test_means_[idx], test_covars_[idx], N - 100).T
+ f_vec = f_vec.reshape(f_vec.size)
+ pi_vec2 = self.soft_max(f_vec)
+ pi_vec2 = pi_vec2.reshape(n_classes, N - 100)
+ # Looka at them all
+ pi_vec = np.hstack((pi_vec, pi_vec2))
+
+ p[idx] = pi_vec.mean(axis = 1)
+
+ elif self.mc_iter > 0:
+ # If a specific number is given, use it
+ for idx in range(n_eval):
+ f_vec = np.random.multivariate_normal(test_means_[idx], test_covars_[idx], self.mc_iter).T
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

You should never call np.random, but use a random_state that gets set during construction. Check other learners (grep the source code for random_state).
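A sketch of that random_state pattern, with a hypothetical _sample_latent helper standing in for the sampling loop in the diff:

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils import check_random_state


class GaussianProcessClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, mc_iter=100, random_state=None):
        # Store the parameter untouched; resolve it to a RandomState at use time.
        self.mc_iter = mc_iter
        self.random_state = random_state

    def _sample_latent(self, mean, cov):
        # Seeded sampling instead of the global np.random functions,
        # so results are reproducible.
        rng = check_random_state(self.random_state)
        return rng.multivariate_normal(mean, cov, self.mc_iter).T
```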

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+ P[c * n_samples : (c + 1) * n_samples,:] = np.diagflat(pi_mat[c,:])
+
+ # W = D - Π x Π^T
+ W = np.diagflat(pi_vec) - np.dot(P, P.T)
+ y_pi_diff = self.y - pi_vec
+ b = np.dot(W, f_vec) + y_pi_diff
+
+ # E is block diagonal
+ E = []
+
+ F = np.zeros((n_samples, n_samples))
+ # B = I + W^1/2 x K x W^1/2
+ # Compute B^-1 and det(B)
+ B_log_det = 0
+ for c in range(n_classes):
+ D_c_sqrt = np.diagflat(pi_mat[c,:] ** .5)
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

I would rather have an explicit 'np.sqrt' than using '** .5'.

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+ F = F + E_c
+
+ L = linalg.cholesky(F, lower=True)
+ L_inv = inv_triangular(L, lower = True)
+ F_inv = np.dot(L_inv.T, L_inv)
+
+ E_sparse = block_diag(E, format='csr')
+ c = E_sparse.dot(K_sparse).dot(b)
+ ERM = E_sparse.dot(R_sparse).dot(F_inv)
+ RTC = R_sparse.T.dot(c)
+ # a = K^-1 x f
+ a = b - c + np.dot(ERM, RTC)
+ f_vec = K_sparse.dot(a)
+ # gradiant = -K^-1 x f + y - π
+ gradiant = -a + y_pi_diff
+ # At maximum must have f = K x (y - π) and gradiant = 0
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

Please avoid using non-ascii characters in the source code.

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+ likelihood function for the given autocorrelation parameters theta.
+
+ Maximizing this function wrt the latent function values is
+ equivalent to maximizing the likelihood of the assumed joint Gaussian
+ distribution of the observations y evaluated onto the design of
+ experiments X.
+
+ Parameters
+ ----------
+ theta : array_like, optional
+ An array containing the autocorrelation parameters at which the
+ Gaussian Process model parameters should be determined.
+ Default uses the built-in autocorrelation parameters
+ (ie ``theta = self.theta_``).
+
+ f_vec: array_like, optional
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

You are missing a space, as in "f_vec :". Also, all parameters learned from the data should have a trailing underscore, which means that you need to protect them in the docstrings, as in "f_vec_ :"

@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

Oops, sorry, I got confused about the trailing character; I thought I was looking at the class docstring, not the function docstring.
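So, for the function docstring, the only fix needed is the space before the colon, which is what the numpydoc parser keys on. A sketch of that form (function name hypothetical):

```python
def log_likelihood(theta=None, f_vec=None):
    """Compute the Laplace-approximated likelihood.

    Parameters
    ----------
    theta : array_like, optional
        Autocorrelation parameters at which the model is evaluated.

    f_vec : array_like, optional
        Latent function values at which to evaluate the likelihood.
    """
```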

@GaelVaroquaux GaelVaroquaux commented on the diff Aug 4, 2013
sklearn/gaussian_process/gaussian_process.py
+
+ if (n_classes is not None
+ and n_classes < 2):
+ raise ValueError("at least two classes are needed for training.")
+
+ theta0 = array2d(self.theta0)
+ if n_classes is not None:
+ # if for single class given, extend it to all classes
+ if theta0.shape[0] == 1:
+ theta0 = np.tile(theta0[0,:], (n_classes,1))
+ elif theta0.shape[0] != n_classes:
+ raise ValueError("first dimension of theta0 (if non-single) must be number of classs (= %s)." % n_classes)
+ self.theta0 = theta0
+
+ # Check nugget value
+ self.nugget = np.asarray(self.nugget)
@GaelVaroquaux
GaelVaroquaux Aug 4, 2013 Member

You shouldn't modify any of the attributes given at construction time. You can keep a local version of this variable, as in:

nugget = np.asarray(self.nugget)
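A sketch of the two conventions together: constructor arguments are stored untouched, and everything derived during fit gets a trailing underscore (as in the earlier comment about X_):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class GaussianProcessClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, nugget=1e-10):
        # Only store; never modify or validate here.
        self.nugget = nugget

    def fit(self, X, y):
        # Work on a local, validated copy of the parameter ...
        nugget = np.asarray(self.nugget)
        # ... and store quantities derived from the data with a trailing underscore.
        self.X_ = np.asarray(X)
        self.y_ = np.asarray(y)
        self.nugget_ = nugget
        return self
```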

@GaelVaroquaux
Member

Thanks for the pull request. Two general remarks:

  • Like @ogrisel, I am wondering if any code could be reused from the existing Gaussian process implementation
  • There need to be tests, with high test coverage.
@dashesy
dashesy commented Aug 5, 2013

Thanks for the review. I will try to address all the issues, and especially will add tests.

@mblondel
Member

@dashesy What do you think of our implementation for regression? It doesn't follow the common terminology used in ML, which makes it difficult to use IMO. Personally, I would prefer if the API were as close as possible to SVR (e.g. ability to choose the kernel from "rbf", "poly", etc). If you feel like it, I would personally be +1 for a complete rewrite.

@dashesy
dashesy commented Aug 14, 2013

@mblondel IMO, it would certainly be more elegant and would reduce confusion to use the same terminology and avoid code duplication, as long as feature parity can be achieved. Fortunately it does not need a complete rewrite; the algorithm works fine. As soon as I am done with a deadline project here, I will study the SVR API and try to make GP more similar to it in a backward-compatible way.

@glouppe
Member
glouppe commented Oct 19, 2015

Closing. A brand new implementation of Gaussian processes has been merged through #4270.

Thanks for your contribution.

@glouppe glouppe closed this Oct 19, 2015