# scikit-learn/scikit-learn


# [WIP] Add Gaussian Process Classification (GPC) with Laplace approximation #2340

Open
wants to merge 7 commits
+686 −6

### 4 participants

This is almost entirely based on "Gaussian Processes for Machine Learning" by Rasmussen and Williams.

Please let me know where I should add test methods, and anything else I need to do (I am not sure, but I cannot subscribe to the mailing list to ask my questions there).
I can add real benchmarks as soon as I get some feedback, and hopefully some pointers.

# Code

I followed the Newton formulation described in the text. The method directly computes the inverse Hessian, so `optimize.fmin_ncg` was not used; this has the advantage that the kernel does not need to be inverted, which avoids numerical instability (as described in the book). I have an implementation that does use `fmin_ncg`, but it requires the inverse; it can be added as a separate method to the class.
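For the binary case, the stable Newton iteration from the book (Algorithm 3.1 in Rasmussen & Williams) can be sketched as follows. This is a minimal illustration, not the PR's code: `laplace_mode`, the toy RBF kernel, and the data are all made up for the example. The key point is that only `B = I + W^1/2 K W^1/2`, which is well conditioned, is ever factorized; `K` itself is never inverted.

```python
import numpy as np

def laplace_mode(K, y, max_iter=50, tol=1e-10):
    """Mode of the latent posterior for binary GPC with a logistic
    likelihood (hypothetical helper sketching GPML Alg. 3.1).

    y is in {-1, +1}; K is the n x n kernel matrix.
    """
    n = len(y)
    f = np.zeros(n)
    t = (y + 1) / 2.0                          # labels mapped to {0, 1}
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-f))          # sigmoid of latent values
        W = pi * (1 - pi)                      # -d^2 log p(y|f)/df^2 (diagonal)
        sqrt_W = np.sqrt(W)
        # B = I + W^1/2 K W^1/2 is well conditioned, unlike K
        B = np.eye(n) + sqrt_W[:, None] * K * sqrt_W[None, :]
        L = np.linalg.cholesky(B)
        b = W * f + (t - pi)                   # Newton right-hand side
        # a = b - W^1/2 B^-1 W^1/2 K b, via two triangular solves
        a = b - sqrt_W * np.linalg.solve(
            L.T, np.linalg.solve(L, sqrt_W * (K @ b)))
        f_new = K @ a                          # implicitly a = K^-1 f
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    return f

# toy usage: two well-separated 1-D clusters under an RBF kernel
X = np.array([-2., -1.5, -1., 1., 1.5, 2.])
y = np.array([-1, -1, -1, 1, 1, 1])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
f_hat = laplace_mode(K, y)
print(np.sign(f_hat))  # latent mean takes the sign of the labels
```

The multiclass version in the PR follows the same idea but builds `B` per class and assembles the block-diagonal `E` matrices, as in Algorithm 3.3 of the book.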

# Comparison

It seems the Laplace approximation for finding the posterior distribution is not well received in the literature, but it should be a good reference point.
Since the Expectation Propagation (EP) method described there is for binary classification only, I implemented just the Laplace approximation for now; if there is any good reference for EP or other methods, please let me know. Another option is to use EP in a one-vs-rest (OvR) classification.
Variational methods are also mentioned in the book, and seem easier to implement than EP for multiple classes.
If integrating with PyMC is an option, an MCMC approach would be a nice addition too.

# Rationale

Gaussian Process Classification (GPC) is a kernel-based non-parametric method. Its main advantages are flexibility and noise tolerance. It is a natural generalization of linear logistic regression, similar to the transition from linear regression to GP regression: the latent function values are given a GP prior and squashed through a link function to produce class probabilities.
It is also the starting point for some powerful dimensionality reduction (DR) techniques based on GP-LVM, which learn the manifold of high-dimensional data from a small number of samples.

# TODO

I wanted to get some feedback before I continue, but here is my TODO list:
1- Hyperparameter theta can be optimized in a straightforward approach similar to GPR
2- Add numerical integration for prediction instead of Monte Carlo (MC) sampling (I wanted to follow the textbook)
3- Add a variational method or EP method if I find good references within my level of math
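TODO item 2 can be illustrated on the binary case: Monte Carlo sampling and numerical quadrature both estimate the same predictive integral, `E[sigmoid(f)]` with `f ~ N(mu, var)`. A minimal sketch (the mean/variance values here are made up, and Gauss-Hermite quadrature is one possible choice for the numerical integration, not necessarily what the PR would use):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss  # probabilists' Hermite

# Predictive probability pi* = E[sigmoid(f)], f ~ N(mu, var).
mu, var = 0.5, 2.0
sigmoid = lambda f: 1.0 / (1.0 + np.exp(-f))

# Monte Carlo estimate (what the PR does, following the book)
rng = np.random.RandomState(0)
mc = sigmoid(rng.normal(mu, np.sqrt(var), size=100000)).mean()

# Gauss-Hermite quadrature estimate (TODO item 2); the hermegauss
# nodes/weights integrate against exp(-x^2/2), hence the sqrt(2*pi)
nodes, weights = hermegauss(32)
quad = np.dot(weights, sigmoid(mu + np.sqrt(var) * nodes)) / np.sqrt(2 * np.pi)

print(abs(mc - quad) < 1e-2)  # both estimate the same integral
```

The quadrature version is deterministic and needs only a few dozen function evaluations per test point, which is why it makes a better default than sampling.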

My plan is to continue by implementing the latent variable model (GP-LVM) and making it supervised (SGP-LVM). It will be based on these papers:

"Supervised Gaussian Process LVM for Dimensionality Reduction", Xinbo Gao
"Supervised Latent Linear Gaussian Process LVM Based Classification", Hou, Feng, Zou
"Supervised Latent Linear Gaussian Process LVM for Dimensionality Reduction", Jiang, Gao, Wong, Zheng
"Discriminative Gaussian Process LVM for Classification", Urtasun and Darrell

# Benchmarks

I ran the standard plot_classifier_comparison against all classifiers; the last one is GPC (without MLE of theta).

I will add one more comparison as soon as I finish the MLE of theta.

1. dashesy: `Add Gaussian Process Classification with Laplace approximation` (`f6f7ab2`)
2. dashesy: `Fix print to use functional form` (`52597a7`)
3. dashesy: `Bring back mistakenly omitted classifer` (`6c8a727`)
4. dashesy: `Fix indentation` (`c4b76ec`)
5. dashesy: `Fix print to make it functional form` (`071761b`)
6. dashesy: `Resort to linalg if sparse does not have block_diag support` (`7c93024`)
7. dashesy: `pep-0263 requires the first or second line to be encoding` (`693a8eb`)
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
@@ -878,3 +916,643 @@ def _check_params(self, n_samples=None):
         # Force random_start type to int
         self.random_start = int(self.random_start)
+
+
+class GaussianProcessClassifier(BaseEstimator, ClassifierMixin):
+    """The Gaussian Process Classifier class.
```

> ogrisel (Owner) added a note, Aug 4, 2013: Just put "Gaussian Process Classifier" here. Also please add a small 2 or 3 line paragraph that gives the gist of how the classification reduction from a regression model such as GP works, if possible.
Owner

I am not familiar at all with GPC, but isn't there anything that can be reused from the GP regression model already implemented in scikit-learn? Or maybe factorize some common parts into private helper functions or a base or mixin class?

I would be very interested in benchmarks too.

Owner

> If integrating with PyMC is an option, an MCMC approach would be a nice addition too.

Integrating with PyMC is not an option, as we don't want to maintain any more dependencies besides numpy and scipy. However, the PyMC devs themselves are starting to implement Bayesian, MC-based machine learning models of their own, so they might be interested in a GPC model implementation.

commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
@@ -24,6 +25,43 @@ def solve_triangular(x, y, lower=True):
         return linalg.solve(x, y)
+def inv_triangular(A, lower=False):
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: Two remarks on this function (which is cool!): it should be in utils.extmath, and the routing between the 2 approaches should be performed at module-loading time. In other words, you should implement 2 functions, and at module loading time assign one to `inv_triangular`, as is currently done for `np.unique` in utils.fixes.
>
> dashesy added a note, Aug 5, 2013: Thanks, I will; then I need to also move solve_triangular to extmath too.
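The module-loading-time routing suggested here could look like the sketch below. The function names `_inv_triangular_lapack` and `_inv_triangular_fallback` are hypothetical; note also that modern scipy exposes `dtrtri` directly on `scipy.linalg.lapack`, whereas the scipy of 2013 nested it under `clapack`:

```python
import numpy as np
from scipy import linalg

def _inv_triangular_fallback(A, lower=False):
    # Always available: solve A X = I via triangular back-substitution.
    return linalg.solve_triangular(A, np.eye(len(A)), lower=lower)

def _inv_triangular_lapack(A, lower=False):
    # Direct LAPACK triangular inverse, when the binding exists.
    A_inv, info = linalg.lapack.dtrtri(np.asarray(A, dtype=np.double),
                                       lower=lower)
    return A_inv

# Routing happens once, at module load time, not on every call.
if hasattr(linalg, 'lapack') and hasattr(linalg.lapack, 'dtrtri'):
    inv_triangular = _inv_triangular_lapack
else:
    inv_triangular = _inv_triangular_fallback
```

Callers then use `inv_triangular` without caring which implementation was selected, and both underscore-prefixed variants remain individually testable.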
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((24 lines not shown))
+            dtrtri = linalg.lapack.clapack.dtrtri
+            A = np.double(A)
+            A_inv, info = dtrtri(A, lower=lower)
+
+    if A_inv is None:
+        A_inv = solve_triangular(A, np.eye(len(A)), lower=lower)
+
+    return A_inv
+
+if hasattr(sparse, 'block_diag'):
+    # only in scipy since 0.11
+    block_diag = sparse.block_diag
+else:
+    # slower, but works
+    def block_diag(lst, *params, **kwargs):
+        return sparse.csr_matrix(linalg.block_diag(*lst))
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: For testing purposes, it is better to first define `_block_diag` independently of the version of scipy, and then assign it to `block_diag`. That way you can test the version for old scipy even on recent scipy.
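The testability pattern GaelVaroquaux describes, sketched standalone: the fallback is always defined under a private name and only the public name is routed, so the old-scipy code path stays reachable from tests even on a scipy that has `sparse.block_diag`:

```python
import numpy as np
from scipy import linalg, sparse

def _block_diag(blocks, *args, **kwargs):
    # Dense fallback for scipy < 0.11: always defined, so it can be
    # tested even when scipy.sparse.block_diag exists.
    return sparse.csr_matrix(linalg.block_diag(*blocks))

if hasattr(sparse, 'block_diag'):
    block_diag = sparse.block_diag  # fast path, scipy >= 0.11
else:
    block_diag = _block_diag        # slower, but works everywhere

# On recent scipy both versions can now be compared directly:
blocks = [np.eye(2), 2 * np.eye(3)]
dense = _block_diag(blocks).toarray()
```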
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((135 lines not shown))
+        # TODO: Implement variational approximation to the posterior
+        # TODO: better initialize f_vec
+        # TODO: investigate reformulating to use fmin_ncg but without the need for K^-1
+        # TODO: implement numerical integration for prediction to use if mc_iter=0
+
+        # Force data to 2D numpy.array
+        X = array2d(X)
+
+        # Check shapes of DOE & observations
+        n_samples_X, n_features = X.shape
+        n_samples_y = y.shape[0]
+
+        if n_samples_X != n_samples_y:
+            raise ValueError("X and y must have the same number of rows.")
+        else:
+            n_samples = n_samples_X
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: sklearn.utils.check_arrays should do these checks and a bit more, rendering your code even more robust.
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((156 lines not shown))
+
+        # Create empty cache
+        self._remove_cache()
+
+        # Normalize data or don't
+        if self.normalize:
+            X_mean = np.mean(X, axis=0)
+            X_std = np.std(X, axis=0)
+            X_std[X_std == 0.] = 1.
+            # center and scale X if necessary
+            X = (X - X_mean) / X_std
+        else:
+            X_mean = np.zeros(1)
+            X_std = np.ones(1)
+
+        D, ij = l1_cross_distances(X)
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: Could you avoid names that are made of a single initial? We made the mistake of using these in the beginning, and when coming back to the code after a few years it is really hard to understand what they mean. You should try to find names that somewhat summarize what the variable is about (not saying that it's easy, but it is useful).
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((174 lines not shown))
+            raise Exception("Multiple input features cannot have the same"
+                            " value.")
+
+        self._cache['D'] = D
+        self._cache['ij'] = ij
+
+        # 0/1 encoding of the labels
+        y_vec = np.zeros((n_classes, n_samples))
+        for c in range(n_classes):
+            y_vec[c, y == c] = 1
+        y_vec = y_vec.reshape((n_classes * n_samples, 1))
+
+        # Keep model parameters
+        self.X = X
+        self.y = y_vec
+        self.X_mean, self.X_std = X_mean, X_std
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: All the above should have trailing underscores (as in `X_`), as they are quantities derived from the data during training.
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((323 lines not shown))
+                    # Look at them all
+                    pi_vec = np.hstack((pi_vec, pi_vec2))
+
+                p[idx] = pi_vec.mean(axis=1)
+
+        elif self.mc_iter > 0:
+            # If a specific number is given, use it
+            for idx in range(n_eval):
+                f_vec = np.random.multivariate_normal(test_means_[idx], test_covars_[idx], self.mc_iter).T
+                f_vec = f_vec.reshape(f_vec.size)
+                pi_vec = self.soft_max(f_vec)
+                pi_vec = pi_vec.reshape(n_classes, self.mc_iter)
+                p[idx] = pi_vec.mean(axis=1)
+        else:
+            # TODO: use numerical integration
+            raise ValueError("Only the Monte Carlo method is implemented at this stage.")
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: Our policy in scikit-learn has been so far to avoid any Monte Carlo based method, because they do not scale.
>
> ogrisel (Owner) added a note, Aug 4, 2013: Well, RBM training is based on some sort of sampling, albeit very specific to the RBM architecture.
>
> GaelVaroquaux: Good point. And it doesn't scale :þ
>
> ogrisel: Also, GPs themselves do not scale, as far as I know, since they need to maintain a full copy of the training dataset. Still, they might be interesting for small smooth problems.
>
> dashesy added a note, Aug 5, 2013: I plan to make numerical integration the default and keep MC sampling available because it is mentioned in the book. GPs usually do not scale, but there are some interesting works that use exotic kernels or approximations that make them usable with large data. Simple GPs work very well with a small number of samples and high dimensions (curse of dimensionality); that is at least where I found them most useful.
>
> GaelVaroquaux, Aug 5, 2013: That's interesting! Not everybody has high-n, low-p data, indeed. I just want to avoid something that is slow if p or n are high.
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((316 lines not shown))
+                    print("Computed MC samples for 95%% confidence is %s" % N)
+
+                if N > 100:
+                    f_vec = np.random.multivariate_normal(test_means_[idx], test_covars_[idx], N - 100).T
+                    f_vec = f_vec.reshape(f_vec.size)
+                    pi_vec2 = self.soft_max(f_vec)
+                    pi_vec2 = pi_vec2.reshape(n_classes, N - 100)
+                    # Look at them all
+                    pi_vec = np.hstack((pi_vec, pi_vec2))
+
+                p[idx] = pi_vec.mean(axis=1)
+
+        elif self.mc_iter > 0:
+            # If a specific number is given, use it
+            for idx in range(n_eval):
+                f_vec = np.random.multivariate_normal(test_means_[idx], test_covars_[idx], self.mc_iter).T
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: You should never call np.random, but use a random_state that gets set during construction. Check other learners (grep the source code for random_state).
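The convention being requested, sketched standalone: the estimator stores the `random_state` given at construction untouched and derives a `RandomState` from it when sampling, instead of calling `np.random` directly. The local `check_random_state` below mirrors the helper in `sklearn.utils`; `TinySampler` is illustrative, not the PR's class:

```python
import numpy as np

def check_random_state(seed):
    # Simplified local copy of sklearn.utils.check_random_state.
    if seed is None:
        return np.random.mtrand._rand
    if isinstance(seed, (int, np.integer)):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError("%r cannot be used to seed a RandomState" % seed)

class TinySampler:
    def __init__(self, random_state=None):
        self.random_state = random_state  # stored exactly as given

    def sample(self, mean, cov, size):
        # Derive the RNG here rather than calling np.random globally.
        rng = check_random_state(self.random_state)
        return rng.multivariate_normal(mean, cov, size)

draws = TinySampler(random_state=0).sample([0., 0.], np.eye(2), 5)
```

With an integer seed, two estimators constructed the same way produce identical draws, which keeps results reproducible and tests deterministic.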
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((427 lines not shown))
+            P[c * n_samples: (c + 1) * n_samples, :] = np.diagflat(pi_mat[c, :])
+
+        # W = D - Π x Π^T
+        W = np.diagflat(pi_vec) - np.dot(P, P.T)
+        y_pi_diff = self.y - pi_vec
+        b = np.dot(W, f_vec) + y_pi_diff
+
+        # E is block diagonal
+        E = []
+
+        F = np.zeros((n_samples, n_samples))
+        # B = I + W^1/2 x K x W^1/2
+        # Compute B^-1 and det(B)
+        B_log_det = 0
+        for c in range(n_classes):
+            D_c_sqrt = np.diagflat(pi_mat[c, :] ** .5)
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: I would rather have an explicit `np.sqrt` than using `** .5`.
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((454 lines not shown))
+            F = F + E_c
+
+        L = linalg.cholesky(F, lower=True)
+        L_inv = inv_triangular(L, lower=True)
+        F_inv = np.dot(L_inv.T, L_inv)
+
+        E_sparse = block_diag(E, format='csr')
+        c = E_sparse.dot(K_sparse).dot(b)
+        ERM = E_sparse.dot(R_sparse).dot(F_inv)
+        RTC = R_sparse.T.dot(c)
+        # a = K^-1 x f
+        a = b - c + np.dot(ERM, RTC)
+        f_vec = K_sparse.dot(a)
+        # gradient = -K^-1 x f + y - π
+        gradient = -a + y_pi_diff
+        # At the maximum we must have f = K x (y - π) and gradient = 0
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: Please avoid using non-ASCII characters in the source code.
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((476 lines not shown))
+        likelihood function for the given autocorrelation parameters theta.
+
+        Maximizing this function wrt the latent function values is
+        equivalent to maximizing the likelihood of the assumed joint
+        Gaussian distribution of the observations y evaluated onto the
+        design of experiments X.
+
+        Parameters
+        ----------
+        theta : array_like, optional
+            An array containing the autocorrelation parameters at which
+            the Gaussian Process model parameters should be determined.
+            Default uses the built-in autocorrelation parameters
+            (i.e. ``theta = self.theta_``).
+
+        f_vec: array_like, optional
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: You are missing a space, as in "f_vec :". Also, all parameters learned from the data should have a trailing underscore, which means that you need to protect them in the docstrings, as in "`f_vec_` :".
>
> GaelVaroquaux: Oops, sorry, I got confused with regards to the trailing character; I thought I was looking at the class docstring, not the function docstring.
commented on the diff
sklearn/gaussian_process/gaussian_process.py

```diff
 ((606 lines not shown))
+
+        if (n_classes is not None
+                and n_classes < 2):
+            raise ValueError("at least two classes are needed for training.")
+
+        theta0 = array2d(self.theta0)
+        if n_classes is not None:
+            # if given for a single class, extend it to all classes
+            if theta0.shape[0] == 1:
+                theta0 = np.tile(theta0[0, :], (n_classes, 1))
+            elif theta0.shape[0] != n_classes:
+                raise ValueError("first dimension of theta0 (if non-single) must be the number of classes (= %s)." % n_classes)
+        self.theta0 = theta0
+
+        # Check nugget value
+        self.nugget = np.asarray(self.nugget)
```

> GaelVaroquaux (Owner) added a note, Aug 4, 2013: You shouldn't modify any of the attributes given at construction time. You can keep a local version of this variable, as in: `nugget = np.asarray(self.nugget)`.
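The pattern this note asks for, sketched standalone: constructor parameters stay exactly as given, and `fit` works on a local (or trailing-underscore) copy. `TinyEstimator` and its `nugget` handling are illustrative, not the PR's real class:

```python
import numpy as np

class TinyEstimator:
    def __init__(self, nugget=1e-12):
        self.nugget = nugget              # stored exactly as given

    def fit(self, X):
        nugget = np.asarray(self.nugget)  # local, validated copy
        if nugget.ndim == 1 and nugget.shape[0] != X.shape[0]:
            raise ValueError("nugget length must match n_samples")
        self.nugget_ = nugget             # derived value gets a trailing _
        return self

est = TinyEstimator(nugget=[1e-12, 1e-12]).fit(np.zeros((2, 3)))
# the constructor argument is left untouched by fit()
assert est.nugget == [1e-12, 1e-12]
```

Keeping `__init__` side-effect free is what makes `clone()` and grid search work correctly on scikit-learn estimators.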

Thanks for the pull request. Two general remarks:

• As @ogrisel, I am wondering if any code could be reused from the existing Gaussian process implementation.

• There need to be tests, with high test coverage.

Thanks for the review, I will try to address all the issues, and especially will add tests.

Owner

@dashesy What do you think of our implementation for regression? It doesn't follow the common terminology used in ML, which makes it difficult to use IMO. Personally, I would prefer if the API were as close as possible to SVR (e.g. ability to choose the kernel from "rbf", "poly", etc). If you feel like it, I would personally be +1 for a complete rewrite.

@mblondel IMO, it would certainly be more elegant and would reduce confusion to use the same terminology and avoid code duplication, as long as feature parity can be achieved. Fortunately it does not need a complete rewrite; the algorithm works fine. As soon as I am done with a deadline project here, I will study the SVR API and try to make GP more similar to it in a backward-compatible way.

Commits on Aug 2, 2013
1. dashesy authored
Commits on Aug 3, 2013
1. dashesy authored
2. dashesy authored
3. dashesy authored
4. dashesy authored
5. dashesy authored
6. dashesy authored
 @@ -1,19 +1,20 @@ -from __future__ import print_function #!/usr/bin/python # -*- coding: utf-8 -*- +from __future__ import print_function # Author: Vincent Dubourg # (mostly translation, see implementation details) # Licence: BSD 3 clause import numpy as np -from scipy import linalg, optimize, rand +from scipy import linalg, optimize, rand, sparse, stats -from ..base import BaseEstimator, RegressorMixin +from ..base import BaseEstimator, RegressorMixin, ClassifierMixin from ..metrics.pairwise import manhattan_distances from ..utils import array2d, check_random_state from . import regression_models as regression from . import correlation_models as correlation +from ..utils.extmath import logsumexp MACHINE_EPSILON = np.finfo(np.double).eps if hasattr(linalg, 'solve_triangular'): @@ -24,6 +25,43 @@ def solve_triangular(x, y, lower=True): return linalg.solve(x, y) +def inv_triangular(A, lower = False): Owner GaelVaroquaux added a note Aug 4, 2013 Two remarks on this function (which is cool!): It should be in utils.extmath The routing between the 2 approaches should be performed at module-loading time. In other words, you should implement 2 functions, and at module loading time assign one to 'inv_triangular', as it is currently done for np.unique in utils.fixes. dashesy added a note Aug 5, 2013 Thanks, I will, then I need to also move solve_triangular to extmath too. to join this conversation on GitHub. Already have an account? 
Sign in to comment + """ + Compute the inverse of triangular matrix + + Parameters + ---------- + + A - Triangular matrix + lower - if A is lower-triangular + + Returns + ------- + A_inv - Inverse of A (equivalant to linalg.inv(A)) + + """ + + A_inv = None + # see if available use scipy.linalg.lapack.clapack.dtrtri instead + if hasattr(linalg, 'lapack'): + if hasattr(linalg.lapack, 'clapack'): + dtrtri = linalg.lapack.clapack.dtrtri + A = np.double(A) + A_inv, info = dtrtri(A, lower = lower) + + if A_inv is None: + A_inv = solve_triangular(A, np.eye(len(A)), lower = lower) + + return A_inv + +if hasattr(sparse, 'block_diag'): + # only in scipy since 0.11 + block_diag = sparse.block_diag +else: + # slower, but works + def block_diag(lst, *params, **kwargs): + return sparse.csr_matrix(linalg.block_diag(*lst)) Owner GaelVaroquaux added a note Aug 4, 2013 For testing purposes, it is better to first define _block_diag independently of the version of scipy, and then assign it to block_diag. That way you can test the version for old scipy even on recent scipy. to join this conversation on GitHub. Already have an account? Sign in to comment + def l1_cross_distances(X): """ @@ -878,3 +916,643 @@ def _check_params(self, n_samples=None): # Force random_start type to int self.random_start = int(self.random_start) + + +class GaussianProcessClassifier(BaseEstimator, ClassifierMixin): + """The Gaussian Process Classifier class. Owner ogrisel added a note Aug 4, 2013 Just put "Gaussian Process Classifier" here. Also please add a small 2 or 3 lines paragraph that gives the gist of how the classification reduction from a regression model such as GP works if possible. to join this conversation on GitHub. Already have an account? Sign in to comment + + Parameters + ---------- + method : string, optional + 'laplace' - Use laplace approximation (default) + + verbose : boolean, optional + A boolean specifying the verbose level. + Default is verbose = False. 
+ + normalize : boolean, optional + Input X is centered and reduced wrt + means and standard deviations estimated from the n_samples + observations provided. + Default is normalize = True so that data is normalized to ease + maximum likelihood estimations. + + nugget : double or ndarray, optional + Introduce a nugget effect to allow smooth predictions from noisy + data. If nugget is an ndarray, it must be the same length as the + number of data points used for the fit. + The nugget is added to the diagonal of the assumed training covariance; + in this way it acts as a Tikhonov regularization in the problem. In + the special case of the squared exponential correlation function, the + nugget mathematically represents the variance of the input values. + Default assumes a nugget close to machine precision for the sake of + robustness (nugget = 10. * MACHINE_EPSILON). + + max_iter : int, optional + Maximum number of iterations in newton algorithm in MAP + Default: 0 means 10 * n_samples * n_classes + + mc_iter: int, optional + Number of Monte Carlo (MC) samples to take to estimate test + Default: None, do enough sampling for 95% confidence interval + + tol: float + Convergence tolerance in MAP algorithm + Default: 10. * MACHINE_EPSILON + + theta0 : double array_like, optional + An array with shape (n_classes, n_features, ) or (n_classes,) or (1, ). + The parameters in the autocorrelation model. + Default assumes isotropic autocorrelation model with theta0 = 1e-1 for all classes and all features. + + Attributes + ---------- + `theta_`: array + Specified theta OR the best set of autocorrelation parameters (the \ + sought maximizer of the reduced likelihood function). + + `means_` : array, shape (`n_classes`, `n_samples`) + Mean parameters for each latent function value. + + `reduced_likelihood_function_value_`: array + The optimal reduced likelihood function value. 
+ + Examples + -------- + >>> import numpy as np + >>> from sklearn.gaussian_process import GaussianProcessClassifier + >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) + >>> y = np.array([0, 0, 0, 1, 1, 1]) + >>> gpc = GaussianProcessClassifier() + >>> gpc.fit(X, y) + >>> gpc.predict([[-0.8, -1]]) + array([0]) + + Notes + ----- + The presentation implementation is based on the Gaussian + Processes for Machine Learning (Rassmussen Williams), + see reference [RASWIL06] + + References + ---------- + + .. [RASWIL06] C. E. Rasmussen & C. K. I. Williams, G, `Gaussian Processes for Machine Learning`, + the MIT Press, 2006, ISBN 026218253X + http://www.gaussianprocess.org/gpml/ + + """ + + _correlation_types = { + 'absolute_exponential': correlation.absolute_exponential, + 'squared_exponential': correlation.squared_exponential, + 'generalized_exponential': correlation.generalized_exponential, + 'cubic': correlation.cubic, + 'linear': correlation.linear} + + def __init__(self, method='laplace', corr='squared_exponential', verbose=False, + normalize=True, max_iter=0, mc_iter = None, tol=10 * MACHINE_EPSILON, theta0=1e-1, + nugget=10. * MACHINE_EPSILON): + + self.method = method + self.verbose = verbose + self.normalize = normalize + self.max_iter = max_iter + self.mc_iter = mc_iter + self.tol = tol + self.theta0 = theta0 + self.nugget = nugget + self.corr = corr + + def fit(self, X, y): + """ + The Gaussian Process model fitting method. + + Parameters + ---------- + X : double array_like + An array with shape (n_samples, n_features) with the input at which + observations were made. + + y : double array_like + An array with shape (n_samples, ) with the class labels + + Returns + ------- + gpc : self + A fitted Gaussian Process Classifier model object awaiting data to perform + predictions. 
+ """ + + # TODO: maximize hyperparameter theta using arg_max_reduced_likelyhood + # TODO: Implement expectation maximization 'EM'/'EP' method for multiclass + # TODO: Can use PyMC for multiclass posterior approximation + # TODO: Implement variational approximation to the posterior + # TODO: better initialize f_vec + # TODO: investigate reformulating to use fmin_ncg but without the need to K^-1 + # TODO: implement numerical integration for prediction to use if mc_iter=0 + + # Force data to 2D numpy.array + X = array2d(X) + + # Check shapes of DOE & observations + n_samples_X, n_features = X.shape + n_samples_y = y.shape[0] + + if n_samples_X != n_samples_y: + raise ValueError("X and y must have the same number of rows.") + else: + n_samples = n_samples_X Owner GaelVaroquaux added a note Aug 4, 2013 sklearn.utils.check_arrays should do these checks and a bit more, rendering your code even more robust. to join this conversation on GitHub. Already have an account? Sign in to comment + + n_classes = len(np.unique(y)) + + # Run input checks + self._check_params(n_classes, n_samples) + + # Create empty cache + self._remove_cache() + + # Normalize data or don't + if self.normalize: + X_mean = np.mean(X, axis=0) + X_std = np.std(X, axis=0) + X_std[X_std == 0.] = 1. + # center and scale X if necessary + X = (X - X_mean) / X_std + else: + X_mean = np.zeros(1) + X_std = np.ones(1) + + D, ij = l1_cross_distances(X) Owner GaelVaroquaux added a note Aug 4, 2013 Could you avoid names that are made of a single initial. We made the mistake of using these in the beginning, and when coming back to code after a few years, it is really hard to understand what it means. You should try to find name that somewhat summarize what the variable is about (not saying that it's easy, but that it is useful). to join this conversation on GitHub. Already have an account? Sign in to comment + if (np.min(np.sum(D, axis=1)) == 0. 
+ and self.corr != correlation.pure_nugget): + raise Exception("Multiple input features cannot have the same" + " value.") + + self._cache['D'] = D + self._cache['ij'] = ij + + # 0/1 encoding of the labels + y_vec = np.zeros((n_classes, n_samples)) + for c in range(n_classes): + y_vec[c, y == c] = 1 + y_vec = y_vec.reshape((n_classes * n_samples, 1)) + + # Keep model parameters + self.X = X + self.y = y_vec + self.X_mean, self.X_std = X_mean, X_std Owner GaelVaroquaux added a note Aug 4, 2013 All the above should have trailing underscores (as in 'X_'), as they are quantities derived from the data during training. to join this conversation on GitHub. Already have an account? Sign in to comment + self.n_classes = n_classes + + # Compute the maximum aposteriori value + self.theta_ = self.theta0 + self.reduced_likelihood_function_value_, par = self.reduced_likelihood_function() + + # Keep some needed computed values + self.means_ = par['f_vec'].reshape(n_classes, n_samples) # posterios mean + self.E = par['E'] + self.F_inv = par['F_inv'] + self.pi_vec = par['pi_vec'] + + # Remove cache + self._remove_cache() + + return self + + def predict(self, X): + """Predict label for data. + + Parameters + ---------- + X : array-like, shape = [n_eval, n_features] + + Returns + ------- + C : array, shape = (n_eval,) with predicted class labels + """ + responsibilities = self.predict_proba(X) + return responsibilities.argmax(axis=1) + + def predict_proba(self, X): + """ + This function evaluates the Gaussian Process model at X. + + Parameters + ---------- + X : array_like + An array with shape (n_eval, n_features) giving the point(s) at + which the prediction(s) should be made. + + Returns + ------- + responsibilities : array-like, shape = (n_eval, n_classes) + Returns the probability of the sample for each class. 
+ + """ + + # Check input shapes + X = array2d(X) + n_eval, n_features_X = X.shape + n_samples, n_features = self.X.shape + n_classes = self.n_classes + + # Run input checks + self._check_params() + + if n_features_X != n_features: + raise ValueError(("The number of features in X (X.shape[1] = %d) " + "should match the sample size used for fit() " + "which is %d.") % (n_features_X, n_features)) + + # No memory management for now + # (evaluates all given points in a single batch run) + + # Normalize input + X = (X - self.X_mean) / self.X_std + + y_mat = self.y.reshape(n_classes, n_samples) + pi_mat = self.pi_vec.reshape(n_classes, n_samples) + + # Get pairwise componentwise L1-distances to the input training set + dx = manhattan_distances(X, Y=self.X, sum_over_features=False) + + # test mean + test_means_ = np.zeros((n_eval, n_classes)) + + K_test = [] + for c in range(n_classes): + # Test correlation model + r = self.corr(self.theta_[c,:], dx).reshape(n_eval, n_samples) + K_test.append(r) + + # Compute test mean + tmp = (y_mat[c,:] - pi_mat[c,:]).reshape(n_samples, 1) + test_means_[:, c] = np.dot(r, tmp).reshape(n_eval) + + # Compute test covariance + test_covars_ = np.zeros((n_eval, n_classes, n_classes)) + D = np.zeros((1, n_features)) + for c in range(n_classes): + # For each test point x_0, for each class c compute the vector of k_c(x_0, x_0) + # should be 1 for most interesting correlations + r = self.corr(self.theta_[c,:], D) + + b = np.dot(self.E[c], K_test[c].T) + # corrected based on errata + # note: first term could be saved during training + v = np.dot(np.dot(self.E[c], self.F_inv), b) + + for idx in range(n_eval): + for c_p in range(n_classes): + sigma = np.dot(K_test[c_p][idx, :], v[:,idx]) + test_covars_[idx,c,c_p] = sigma + test_covars_[idx][c,c] = test_covars_[idx][c,c] + r - np.dot(b[:, idx].T, K_test[c][idx, :]) + + # decision + p = np.zeros((n_eval, n_classes)) + # if montecarlo (MC) is specified then sample from the posterior + if self.mc_iter 
is None: + for idx in range(n_eval): + sv = linalg.svd(test_covars_[idx], compute_uv=False) + # only need to care about the component with maximum variance + max_std_idx = np.argmax(sv) + max_std = sv[max_std_idx] + + # Compute output standard deviation (perhaps should do this in a feedback loop) + f_vec = np.random.multivariate_normal(test_means_[idx], test_covars_[idx], 100).T + f_vec = f_vec.reshape(f_vec.size) + pi_vec = self.soft_max(f_vec) + pi_vec = pi_vec.reshape(n_classes, 100) + + out_std = np.std(pi_vec[max_std_idx,:]) + confidence = 1. / stats.norm.cdf((1 + .95)/2) + N = int((max_std * confidence / out_std) ** 2) + if self.verbose: + print("Computed MC samples for 95%% confidense is %s" % N) + + if N > 100: + f_vec = np.random.multivariate_normal(test_means_[idx], test_covars_[idx], N - 100).T + f_vec = f_vec.reshape(f_vec.size) + pi_vec2 = self.soft_max(f_vec) + pi_vec2 = pi_vec2.reshape(n_classes, N - 100) + # Looka at them all + pi_vec = np.hstack((pi_vec, pi_vec2)) + + p[idx] = pi_vec.mean(axis = 1) + + elif self.mc_iter > 0: + # If a specific number is given, use it + for idx in range(n_eval): + f_vec = np.random.multivariate_normal(test_means_[idx], test_covars_[idx], self.mc_iter).T Owner GaelVaroquaux added a note Aug 4, 2013 You should never call np.random, but use a random_state that gets set during construction. Check other learners (grep the source code for random_state). to join this conversation on GitHub. Already have an account? Sign in to comment + f_vec = f_vec.reshape(f_vec.size) + pi_vec = self.soft_max(f_vec) + pi_vec = pi_vec.reshape(n_classes, self.mc_iter) + p[idx] = pi_vec.mean(axis = 1) + else: + # TODO: use a integrals calculation + raise ValueError("Only monte carlo method implemented at this stage.") Owner GaelVaroquaux added a note Aug 4, 2013 Our policy in scikit-learn has been so far to avoid any monte-carlo based method, because they do not scale. 
Owner ogrisel added a note Aug 4, 2013 Well, RBM training is based on some sort of sampling, albeit very specific to the RBM architecture. Owner GaelVaroquaux added a note Aug 4, 2013 Well, RBM training is based on some sort of sampling, albeit very specific to the RBM architecture. Good point. And it doesn't scale :þ Owner ogrisel added a note Aug 4, 2013 Also GP themselves do not scale as as far as I know they need to maintain a full copy of the training dataset. Still they might be interesting for small smooth problems. dashesy added a note Aug 5, 2013 I plan to do a numerical integrals as default and keep MC sampling to be there because it is mentioned in the book. GPs usually do not scale but there are some interesting works that use exotic kernels or approximations that make them useable with large data. Simple GPs work very well with few number of samples and high dimensions (curse of dimensionality), that is at least what I found them to be most useful. Owner GaelVaroquaux added a note Aug 5, 2013 GPs usually do not scale but there are some interesting works that use exotic kernels or approximations that make them useable with large data. That's interesting! Simple GPs work very well with few number of samples and high dimensions (curse of dimensionality), that is at least what I found them to be most useful. Not everybody has high n, low p data, indeed. I just want to avoid something that is slow if p or n are high. to join this conversation on GitHub. Already have an account? 
+
+        return p
+
+    def soft_max(self, f_vec):
+        """
+        Soft-max decision function
+
+        Parameters
+        ----------
+        f_vec : array_like, shape=(n_classes * n_samples,)
+            latent function values
+
+        Returns
+        -------
+        pi_vec : array_like
+            An array of shape (n_classes * n_samples, 1) with
+            squashed probabilities for the decision function values
+        """
+
+        # logit softmax
+        logit_soft_max = (lambda c, i, f: np.exp(f[c, i]) / np.exp(logsumexp(f[:, i])))
+        n_classes = self.n_classes
+        n_samples = f_vec.size // n_classes
+
+        # Reshape is cheap
+        f_vec = f_vec.reshape(n_classes, n_samples)
+        # decision function values: pi of size Cn
+        pi_vec = np.zeros((n_classes, n_samples))
+        for i in range(n_samples):
+            for c in range(n_classes):
+                pi_vec[c, i] = logit_soft_max(c, i, f_vec)
+        pi_vec = pi_vec.reshape((n_classes * n_samples, 1))
+        return pi_vec
+
+    def _remove_cache(self):
+        """
+        Remove cached values
+        """
+        self._cache = {'R': None, 'D': None, 'ij': None}
+
+    def _get_R(self):
+        """
+        Create and cache R; return from cache if available.
+
+        Returns
+        -------
+        R : sparse matrix of stacked identity matrices
+        """
+
+        n_samples = self.X.shape[0]
+        n_classes = self.n_classes
+        R_sparse = self._cache['R']
+        if R_sparse is None:
+            # Matrix of stacked identity matrices
+            R_sparse = []
+            for c in range(n_classes):
+                R = sparse.csr_matrix(np.eye(n_samples))
+                R_sparse.append(R)
+            R_sparse = sparse.vstack(R_sparse).tocsr()
+            self._cache['R'] = R_sparse
+
+        return R_sparse
+
+    def _update_func(self, f_vec, K, K_sparse, R_sparse):
+        """
+        Update function
+
+        Parameters
+        ----------
+        f_vec : array_like, shape=(n_classes * n_samples, 1)
+            latent function values
+
+        K : list of per-class kernel blocks in dense form, each of shape (n_samples, n_samples)
+        K_sparse : block-diagonal kernel matrix in sparse form, shape=(n_classes * n_samples, n_classes * n_samples)
+
+        R_sparse : stacked identity matrix, in sparse form
+        """
+
+        n_classes = self.n_classes
+        n_samples = f_vec.size // n_classes
+
+        pi_vec = self.soft_max(f_vec)
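The double loop in `soft_max` can be vectorized; a sketch of an equivalent computation (in exact arithmetic), assuming `logsumexp` from `scipy.special` and a hypothetical standalone function name:

```python
import numpy as np
from scipy.special import logsumexp


def soft_max_vectorized(f_vec, n_classes):
    """Numerically stable softmax across classes for stacked latent values (size Cn)."""
    f_mat = np.asarray(f_vec, dtype=float).reshape(n_classes, -1)
    # log-domain softmax: subtract the per-column log normalizer
    log_pi = f_mat - logsumexp(f_mat, axis=0)
    # return a column vector, matching the PR's pi_vec shape (Cn, 1)
    return np.exp(log_pi).reshape(-1, 1)


# two classes, two samples: latent values stacked class-by-class
pi = soft_max_vectorized([0., 0., 1., -1.], 2)
```

Working in the log domain avoids overflow when latent values are large, which is why the PR's per-element lambda also routes through `logsumexp`.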
+        pi_mat = pi_vec.reshape(n_classes, n_samples)
+
+        # Stacked diagonal matrices: Π of size Cn x n
+        P = np.zeros((n_classes * n_samples, n_samples))
+
+        # Stack vertically to build Π
+        for c in range(n_classes):
+            P[c * n_samples : (c + 1) * n_samples, :] = np.diagflat(pi_mat[c, :])
+
+        # W = D - Π x Π^T
+        W = np.diagflat(pi_vec) - np.dot(P, P.T)
+        y_pi_diff = self.y - pi_vec
+        b = np.dot(W, f_vec) + y_pi_diff
+
+        # E is block diagonal
+        E = []
+
+        F = np.zeros((n_samples, n_samples))
+        # B = I + W^1/2 x K x W^1/2
+        # Compute B^-1 and det(B)
+        B_log_det = 0
+        for c in range(n_classes):
+            D_c_sqrt = np.diagflat(pi_mat[c, :] ** .5)

Owner GaelVaroquaux added a note Aug 4, 2013:
I would rather have an explicit `np.sqrt` than using `** .5`.

+            B_c = np.eye(n_samples) + np.dot(np.dot(D_c_sqrt, K[c]), D_c_sqrt)
+            # B_c is stable
+            L = linalg.cholesky(B_c, lower=True)
+            L_inv = inv_triangular(L, lower=True)
+            # det(L x L^T) = det(L)^2
+            B_log_det = B_log_det + 2 * np.sum(np.log(np.diagonal(L)))
+            # B_c = L x L^T => B_c^-1 = (L^T)^-1 x L^-1 = (L^-1)^T x L^-1
+            B_c_inv = np.dot(L_inv.T, L_inv)
+            E_c = np.dot(np.dot(D_c_sqrt, B_c_inv), D_c_sqrt)
+            E.append(E_c)
+            # F = R^T x E x R = sum{E_c}
+            F = F + E_c
+
+        L = linalg.cholesky(F, lower=True)
+        L_inv = inv_triangular(L, lower=True)
+        F_inv = np.dot(L_inv.T, L_inv)
+
+        E_sparse = block_diag(E, format='csr')
+        c = E_sparse.dot(K_sparse).dot(b)
+        ERM = E_sparse.dot(R_sparse).dot(F_inv)
+        RTC = R_sparse.T.dot(c)
+        # a = K^-1 x f
+        a = b - c + np.dot(ERM, RTC)
+        f_vec = K_sparse.dot(a)
+        # gradient = -K^-1 x f + y - π
+        gradient = -a + y_pi_diff
+        # At the maximum we must have f = K x (y - π) and gradient = 0

Owner GaelVaroquaux added a note Aug 4, 2013:
Please avoid using non-ASCII characters in the source code.
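The per-class loop above leans on two Cholesky identities: `log det(B_c) = 2 * sum(log(diag(L)))` and `B_c^-1 = (L^-1)^T L^-1`. A standalone check of both on a single synthetic block, using `scipy.linalg.solve_triangular` in place of the PR's `inv_triangular` helper:

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
n = 5
A = rng.randn(n, n)
K_c = A.dot(A.T) + n * np.eye(n)               # SPD stand-in for one kernel block
d_sqrt = np.diag(np.sqrt(rng.rand(n)))         # D_c^{1/2}, with D_c diagonal
B_c = np.eye(n) + d_sqrt.dot(K_c).dot(d_sqrt)  # well conditioned by construction

L = linalg.cholesky(B_c, lower=True)
# det(B_c) = det(L)^2, so the log-determinant comes from the Cholesky diagonal
B_log_det = 2 * np.sum(np.log(np.diag(L)))
# B_c^-1 = (L^-1)^T L^-1, with L^-1 obtained by a triangular solve
L_inv = linalg.solve_triangular(L, np.eye(n), lower=True)
B_c_inv = L_inv.T.dot(L_inv)
```

Computing the log-determinant from the Cholesky diagonal, rather than `np.linalg.det`, is what keeps the marginal-likelihood term stable for poorly scaled kernels.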
+
+        return f_vec, E, F_inv, B_log_det, a, pi_vec, gradient
+
+    def reduced_likelihood_function(self, theta=None, f_vec=None):
+        """
+        Determine the Laplace-approximated marginal likelihood
+        function for the given autocorrelation parameters theta.
+
+        Maximizing this function w.r.t. the latent function values is
+        equivalent to maximizing the likelihood of the assumed joint Gaussian
+        distribution of the observations y evaluated onto the design of
+        experiments X.
+
+        Parameters
+        ----------
+        theta : array_like, optional
+            An array containing the autocorrelation parameters at which the
+            Gaussian Process model parameters should be determined.
+            Default uses the built-in autocorrelation parameters
+            (i.e. ``theta = self.theta_``).
+
+        f_vec: array_like, optional

Owner GaelVaroquaux added a note Aug 4, 2013:
You are missing a space, as in "f_vec :". Also, all parameters learned from the data should have a trailing underscore, which means that you need to protect them in the docstrings, as in "`f_vec_` :".

Owner GaelVaroquaux added a note Aug 4, 2013:
Oops, sorry, I got confused with regards to the trailing character; I thought I was looking at the class docstring, not the function docstring.

+            Starting values for the latent function values.
+            Default uses a vector of zeros.
+
+        Returns
+        -------
+        reduced_likelihood_function_value : double
+            The value of the reduced likelihood function associated to the
+            given latent function values f_vec.
+
+        par : dict
+            A dictionary containing the requested Gaussian Process model
+            parameters:
+
+            f_vec
+                Latent function values
+            E
+                Computed block-diagonal matrix to find the Hessian inverse.
+            F_inv
+                Internal square matrix to find the Hessian inverse.
+            pi_vec
+                Squashed decision function values.
+ """ + + n_classes = self.n_classes + n_samples = self.X.shape[0] + + if theta is None: + # Use built-in autocorrelation parameters + theta = self.theta_ + + if f_vec is None: + # Column vector of latent function values at all training + # points for all classes, shape = (Cn,1) + f_vec = np.zeros((n_classes * n_samples, 1)) + + D, ij = self._cache['D'], self._cache['ij'] + + # Numerical considerations: + # should avoid computing K^-1, formulating based on fmin_ncg needs this term + # see if we can change the cost function, with new hessian and prime + # in a way that K^-1 is avoided (hopefully replaced by B^-1) + # then we could use fmin_ncg + + # Kernel is block diagonal with equivalant dense shape of Cn x Cn + K = [] + for c in range(n_classes): + # Assumption is that C latent processes are uncorrelated (between classes) + # Calculate matrix of distances D between samples for each class + + # Correlation model + r = self.corr(theta[c,:], D) + # Set up kernel matrix + R = np.eye(n_samples) * (1. 
+ self.nugget) + R[ij[:, 0], ij[:, 1]] = r + R[ij[:, 1], ij[:, 0]] = r + + K.append(R) + + K_sparse = block_diag(K, format='csr') + + # Get R from cache + R_sparse = self._get_R() + + # Initialize output + reduced_likelihood_function_value = - np.inf + + k = 0 + xtol = len(f_vec) * self.tol + update = [2 * xtol] + + f_vec_old = f_vec + objective = 0 + while k < self.max_iter and np.sum(np.abs(update)) > xtol: + f_vec, E, F_inv, B_log_det, a, pi_vec, gradiant = self._update_func(f_vec, K, K_sparse, R_sparse) + # TODO: use gradiant to update step-size and feed it back to _update_func + + # update to x is inherent to the algorithm above + update = f_vec - f_vec_old + f_vec_old = f_vec + + objective = -0.5 * np.dot(a.T, f_vec) + np.dot(self.y.T, f_vec) + objective = np.asarray(objective).squeeze() + + f_mat = f_vec.reshape(n_classes, n_samples) + logsquash_term = np.sum(logsumexp(f_mat, axis = 0)) + objective = objective - logsquash_term + + k = k + 1 + + if self.verbose: + print("Maximum objective reached %s" % objective) + print("Number of iterations %s (of total %s)" % (k, self.max_iter)) + + # Reduced likelyhood value + reduced_likelihood_function_value = objective - 0.5 * B_log_det + + # Save those that are used for prediction + # soome can be computed from f_vec for a lightweight edition + par = {} + par['f_vec'] = f_vec + par['E'] = E + par['F_inv'] = F_inv + par['pi_vec'] = pi_vec + + return reduced_likelihood_function_value, par + + + def _check_params(self, n_classes = None, n_samples = None): + ''' + Check to make sure parameters are valid + ''' + + if self.method != 'laplace': + raise ValueError("Method can only be 'laplace' currently.") + + if (n_classes is not None + and n_classes < 2): + raise ValueError("at least two classes are needed for training.") + + theta0 = array2d(self.theta0) + if n_classes is not None: + # if for single class given, extend it to all classes + if theta0.shape[0] == 1: + theta0 = np.tile(theta0[0,:], (n_classes,1)) + elif 
theta0.shape[0] != n_classes: + raise ValueError("first dimension of theta0 (if non-single) must be number of classs (= %s)." % n_classes) + self.theta0 = theta0 + + # Check nugget value + self.nugget = np.asarray(self.nugget) Owner GaelVaroquaux added a note Aug 4, 2013 You shouldn't modify any of the attributes given at construction time. You can keep a local version of this variable, as in: nugget = np.asarray(self.nugget) to join this conversation on GitHub. Already have an account? Sign in to comment + if np.any(self.nugget) < 0.: + raise ValueError("nugget must be positive or zero.") + + if (n_samples is not None + and self.nugget.shape not in [(), (n_samples,)]): + raise ValueError("nugget must be either a scalar " + "or array of length n_samples.") + + if (n_samples is not None + and n_classes is not None): + if self.max_iter < 1: + self.max_iter = 10 * n_samples * n_classes + + # Check correlation model + if not callable(self.corr): + if self.corr in self._correlation_types: + self.corr = self._correlation_types[self.corr] + else: + raise ValueError("corr should be one of %s or callable, " + "%s was given." + % (self._correlation_types.keys(), self.corr)) +
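`reduced_likelihood_function` assembles the joint kernel as a block-diagonal sparse matrix, one block per class, because the C latent processes are assumed uncorrelated between classes. A minimal sketch of that construction with `scipy.sparse.block_diag` on synthetic SPD blocks (the block contents here are illustrative, not the PR's correlation model):

```python
import numpy as np
from scipy.sparse import block_diag

n_classes, n_samples = 3, 4
rng = np.random.RandomState(0)

blocks = []
for c in range(n_classes):
    A = rng.rand(n_samples, n_samples)
    # SPD stand-in for the per-class kernel block K_c
    blocks.append(A.dot(A.T) + np.eye(n_samples))

# Joint kernel over all classes: dense shape (Cn, Cn),
# with nonzeros confined to the diagonal blocks
K_sparse = block_diag(blocks, format='csr')
```

The CSR format keeps the subsequent `K_sparse.dot(...)` products in `_update_func` at block-diagonal cost rather than dense Cn x Cn cost.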