[MRG] Gaussian process for arbitrary-dimensional output spaces #1611

Closed

@JohnFNovak

I have fixed the Gaussian process submodule so that Gaussian processes can be trained on data whose y values are vector-valued. Now, when training a GP, each point in both X and y can have arbitrary dimension. Previously, the values of y were required to be scalars, a restriction that is not inherent to GPs.

The changes are fully backwards compatible, and the old tests work fine. I have not written new tests to test the new functionality, but I can if it is necessary.
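A minimal numpy-only sketch of the shape handling at stake (the helper names are mine, not from the patch): the old code flattened y into a single column, which destroys a multi-output y, while the new behavior keeps a 2-D y intact and only promotes a 1-D y to a column.

```python
import numpy as np

def normalize_y_old(y):
    # Old behavior: flatten y and force a single output column.
    # A (n_samples, n_targets) y is mangled into (n_samples * n_targets, 1).
    return np.asarray(y).ravel()[:, np.newaxis]

def normalize_y_new(y):
    # New behavior: keep multi-output y as (n_samples, n_targets),
    # promoting 1-D input to a single-column 2-D array.
    y = np.asarray(y)
    return y[:, np.newaxis] if y.ndim == 1 else y

y1 = [1.0, 2.0, 3.0]              # scalar targets
y2 = [[1.0, 2.0], [3.0, 4.0]]     # vector targets
print(normalize_y_old(y1).shape)  # (3, 1)
print(normalize_y_new(y1).shape)  # (3, 1)  -- backwards compatible
print(normalize_y_old(y2).shape)  # (4, 1)  -- targets destroyed
print(normalize_y_new(y2).shape)  # (2, 2)  -- targets preserved
```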

@GaelVaroquaux
Member

That's a useful contribution! Thanks.

New tests are indeed useful. If multi-output tests were not around for the other models that support multi-output, I would have broken this support more than once when making changes.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jan 23, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -271,11 +271,14 @@ def fit(self, X, y):
# Force data to 2D numpy.array
X = array2d(X)
- y = np.asarray(y).ravel()[:, np.newaxis]
+ y = array2d(y)
+
+ if y.shape[1] == X.shape[0] and y.shape[0] == 1:
@GaelVaroquaux
GaelVaroquaux Jan 23, 2013 Member

That test seems a bit too general. I think you are only trying to capture the situation where the original y vector was 1-D. You can do that by storing the number of dimensions of y before the call to array2d.
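A sketch of the suggested pattern (numpy only; array2d is approximated here with np.atleast_2d, which is an assumption about its behavior): record y's dimensionality before coercing it to 2-D, then branch on that record rather than on the coerced shape.

```python
import numpy as np

def fit_shapes(X, y):
    X = np.atleast_2d(X)
    y = np.asarray(y)
    y_ndim = y.ndim              # remembered BEFORE any reshaping
    if y_ndim == 1:
        y = y[:, np.newaxis]     # promote 1-D targets to a column
    return y, y_ndim

X = np.zeros((5, 3))
y, nd = fit_shapes(X, [1., 2., 3., 4., 5.])
print(y.shape, nd)               # (5, 1) 1
y, nd = fit_shapes(X, np.ones((5, 2)))
print(y.shape, nd)               # (5, 2) 2
```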

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jan 23, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -449,7 +456,10 @@ def predict(self, X, eval_MSE=False, batch_size=None):
y_ = np.dot(f, self.beta) + np.dot(r, self.gamma)
# Predictor
- y = (self.y_mean + self.y_std * y_).ravel()
+ y = (self.y_mean + self.y_std * y_).reshape(n_eval, self.y.shape[1])
+
+ if self.y.shape[1] == 1:
@GaelVaroquaux
GaelVaroquaux Jan 23, 2013 Member

Same remark about the test being a bit general. I think that self.y should be stored with the initial shape that y had in the call to fit, with array2d applied to it in predict. That's the road of least surprise for a user who does not know about multi-output situations.
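A sketch of that "least surprise" contract with a toy stand-in model (all names here are hypothetical, not from the patch): predict returns output in whatever shape convention y had at fit time.

```python
import numpy as np

class ShapePreservingModel:
    """Toy model illustrating the contract only: predict() echoes the
    training mean, returned in the same shape convention as the fitted y."""
    def fit(self, X, y):
        y = np.asarray(y)
        self.y_ndim_ = y.ndim                      # original dimensionality
        self.y_2d_ = y[:, np.newaxis] if y.ndim == 1 else y
        self.mean_ = self.y_2d_.mean(axis=0)
        return self

    def predict(self, X):
        n_eval = np.atleast_2d(X).shape[0]
        pred = np.tile(self.mean_, (n_eval, 1))    # (n_eval, n_targets)
        return pred.ravel() if self.y_ndim_ == 1 else pred

m = ShapePreservingModel().fit(np.zeros((4, 2)), [1., 2., 3., 4.])
print(m.predict(np.zeros((3, 2))).shape)           # (3,)  -- 1-D in, 1-D out
m = ShapePreservingModel().fit(np.zeros((4, 2)), np.ones((4, 5)))
print(m.predict(np.zeros((3, 2))).shape)           # (3, 5)
```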

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Jan 23, 2013
sklearn/gaussian_process/gaussian_process.py
# Mean Squared Error might be slightly negative depending on
# machine precision: force to zero!
MSE[MSE < 0.] = 0.
+ if MSE.shape[1] == 1:
@GaelVaroquaux
GaelVaroquaux Jan 23, 2013 Member

I still don't like testing for .shape[1] :)

@JohnFNovak

I fixed the tests. The original shape of y is now stored, and all the checks test whether the length of that shape is 1. If the original y is 1-D, it has shape == (n_samples,) and len(shape) == 1.

I apologize that it has taken an eternity for me to get around to fixing this; I am working on my doctorate, and this code is at best tangentially related to some of my analysis code. It has been low on the to-do list.

@agramfort
Member

thanks @JohntheBear
Can you add a test with y of shape (n_samples, 1) and check that it gives the same result as with shape (n_samples,), and add a test where y.shape == (n_samples, 2)?
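The requested equivalence check can be sketched as follows. The legacy GaussianProcess class under review has since been replaced, so this uses the modern GaussianProcessRegressor, which supports multi-output y natively; it is an illustration of the test idea, not the PR's test code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(20, 2))
y = np.sin(X[:, 0]) + np.cos(X[:, 1])

# (n_samples,) versus (n_samples, 1) targets should give the same predictions
p1 = GaussianProcessRegressor(random_state=0).fit(X, y).predict(X)
p2 = GaussianProcessRegressor(random_state=0).fit(X, y[:, np.newaxis]).predict(X)
print(p1.shape)                          # (20,)
print(np.allclose(p1, np.ravel(p2)))     # True

# and a genuinely multi-output y, shape (n_samples, 2)
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 1])])
p3 = GaussianProcessRegressor(random_state=0).fit(X, Y).predict(X)
print(p3.shape)                          # (20, 2)
```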

@JohnFNovak JohnFNovak Added test of y_training shape so that if it is given as a 2d array and incorrectly transposed it will be fixed; also the return values will be given in the same form as the training values
0a1e6f7
@JohnFNovak

@agramfort
I didn't think of the case where the training y is 1-D but given as a 2-D array transposed incorrectly; that would have slipped past my test and failed a few seconds later. I have addressed that, and changed the checks so that the return values for y and MSE are given in the same form as the training y. Although it may be safest in general to fail when the input values are given in a nonsensical form rather than press on. The only reasons I was including shape tests were: 1) so the return values are given in the same form as the training values, and 2) it works out that array2d(np.array([list of length n_samples])) returns an array of shape (1, n_samples), which will otherwise fail.

But I don't think that is what you are asking about. Do you want me to write something to go in the build tests? I don't quite understand what you are asking for.

I do know that this code works when y.shape = (n_samples, 15) because that is what I have been using it for.
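The shape pitfall described above can be reproduced with numpy alone (approximating the old array2d helper with np.atleast_2d, an assumption): a plain Python list of n_samples scalars comes out of the coercion as (1, n_samples), so naive code sees one sample with many features.

```python
import numpy as np

samples = [0.1, 0.2, 0.3, 0.4]      # n_samples scalar targets
coerced = np.atleast_2d(samples)     # what an array2d-style helper yields
print(coerced.shape)                 # (1, 4), not (4, 1)

# Remembering the original ndim lets fit() repair this unambiguously:
y = np.asarray(samples)
if y.ndim == 1:
    y = y[:, np.newaxis]
print(y.shape)                       # (4, 1)
```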

@agramfort
Member

have a look at the test file : test_gaussian_process.py in gaussian_process/tests folder.

that is the file that should be edited to test the new features.

@andyli
andyli commented Apr 18, 2013

I need to use this feature, and I would like to know whether it will be merged soon. Is only the unit test missing?

@JohnFNovak

I'm not dead, and I haven't forgotten about this. My Linux box died, I'm in grad school, and this hasn't been very high on my to-do list. Sorry, all.

@JohnFNovak

I think this is good now

@GaelVaroquaux
Member

I think this is good now

Could you add a '[MRG]' in the title of your PR, so that reviewers know
that they should prioritize it.

@agramfort agramfort commented on an outdated diff Jul 27, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -271,11 +271,19 @@ def fit(self, X, y):
# Force data to 2D numpy.array
X = array2d(X)
- y = np.asarray(y).ravel()[:, np.newaxis]
+ self.y_shape = np.array(y).shape
@agramfort
agramfort Jul 27, 2013 Member

Every attribute that is data dependent should end with _.

If you need to keep the number of targets as an attribute, I'd recommend naming it self.n_targets_ or self.n_outputs_.

@agramfort agramfort commented on an outdated diff Jul 27, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -271,11 +271,19 @@ def fit(self, X, y):
# Force data to 2D numpy.array
X = array2d(X)
- y = np.asarray(y).ravel()[:, np.newaxis]
+ self.y_shape = np.array(y).shape
+ y = array2d(y)
+
+ # If the y values are given to fit() as a list:
+ if len(self.y_shape) == 1:
@agramfort
agramfort Jul 27, 2013 Member

Use y.ndim.

Also, since you use array2d above, you'll never enter this branch.
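The dead-code point, sketched with numpy (np.atleast_2d standing in for array2d, an assumption): after coercion to 2-D, y.ndim is always 2, so a branch on the coerced array's dimensionality can never fire; the check has to use the value recorded before coercion.

```python
import numpy as np

raw = [1., 2., 3.]
ndim_before = np.asarray(raw).ndim   # 1 -- must be captured here
y = np.atleast_2d(raw)               # coercion to 2-D
print(y.ndim)                        # 2, always
print(y.ndim == 1)                   # False: this branch is dead
print(ndim_before == 1)              # True: this is the usable test
```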

@agramfort agramfort commented on an outdated diff Jul 27, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -271,11 +271,19 @@ def fit(self, X, y):
# Force data to 2D numpy.array
X = array2d(X)
- y = np.asarray(y).ravel()[:, np.newaxis]
+ self.y_shape = np.array(y).shape
+ y = array2d(y)
+
+ # If the y values are given to fit() as a list:
+ if len(self.y_shape) == 1:
+ y = y.T
+ # If the y vales are given to fit() as an array, but transposed wrong
+ elif self.y_shape[0] == 1 and self.y_shape[1] == X.shape[0]:
@agramfort
agramfort Jul 27, 2013 Member

As you have access to y directly, why use self. everywhere? It makes the lines longer, and each self. access causes a dictionary lookup in the instance attributes.

@agramfort agramfort commented on an outdated diff Jul 27, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -407,11 +415,15 @@ def predict(self, X, eval_MSE=False, batch_size=None):
Returns
-------
y : array_like
@agramfort
agramfort Jul 27, 2013 Member

y : array_like, shape (n_samples, ) or (n_samples, n_targets)

@agramfort agramfort commented on an outdated diff Jul 27, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -476,13 +493,19 @@ def predict(self, X, eval_MSE=False, batch_size=None):
# Ordinary Kriging
u = np.zeros(y.shape)
- MSE = self.sigma2 * (1. - (rt ** 2.).sum(axis=0)
- + (u ** 2.).sum(axis=0))
+ MSE = np.vstack(map(lambda x: x * (1. - (rt ** 2.).sum(axis=0)
+ + (u.T ** 2.).sum(axis=0)),
+ self.sigma2.tolist())).sum(axis=0) / n_features
@agramfort
agramfort Jul 27, 2013 Member

You should be able to do this in a purely vectorized way, avoiding lists and array reallocation.
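A hedged sketch of the vectorized form. The shapes are assumptions inferred from the diff (sigma2 holds one variance per target; the rt and u reductions yield one value per evaluation point): broadcasting replaces the map/lambda/vstack construction with a single outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
n_targets, n_eval, k = 3, 5, 4
sigma2 = rng.random(n_targets)       # per-target process variance (assumed)
rt = rng.random((k, n_eval))
u = rng.random((k, n_eval))

base = 1. - (rt ** 2).sum(axis=0) + (u ** 2).sum(axis=0)   # shape (n_eval,)

# list/map construction as in the diff: per-target rows stacked, then reduced
mse_loop = np.vstack([s * base for s in sigma2.tolist()]).sum(axis=0) / n_targets

# vectorized: broadcasting does the outer product, no Python-level loop
mse_vec = (sigma2[:, np.newaxis] * base).sum(axis=0) / n_targets
# equivalently: sigma2.mean() * base

print(np.allclose(mse_loop, mse_vec))   # True
```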

@amueller
Member

We discussed this and I think we rather not want to merge this in a hurry. We actually wanted to do a feature freeze yesterday. It will probably be merged soon after the release. Sorry.

@JohnFNovak

That is just fine. Delayed is better than forgotten. It looks like there are new comments for me to address as well.

@GaelVaroquaux
Member

Delayed is better than forgotten.

Certainly not forgotten!

@colincsl

Is there any chance this is going to be merged any time soon? I'm thinking about adding some extra GP-related functionality and would like to base it on the multi-input/output version.

@JohnFNovak

My apologies. I am writing my dissertation. I'll take a look at the outstanding comments in the next day or two and address them.

@JohnFNovak

I think I have addressed the comments that were raised. If anyone has any more, please let me know.

@agramfort agramfort commented on an outdated diff Sep 3, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -34,30 +34,30 @@ def l1_cross_distances(X):
----------
X: array_like
- An array with shape (n_samples, n_features)
+ An array with shape (n_samples_, n_features_)
@agramfort
agramfort Sep 3, 2013 Member

Why did you add the trailing underscores? That's not common in scikit-learn. Also, be careful with _ as it has a meaning in reST, so it needs to be escaped with a backslash. You should try building the doc.

@agramfort agramfort commented on an outdated diff Sep 3, 2013
sklearn/gaussian_process/gaussian_process.py
The indices i and j of the vectors in X associated to the cross-
distances in D: D[k] = np.abs(X[ij[k, 0]] - Y[ij[k, 1]]).
"""
X = array2d(X)
- n_samples, n_features = X.shape
- n_nonzero_cross_dist = n_samples * (n_samples - 1) / 2
+ n_samples_, n_features_ = X.shape
@agramfort
agramfort Sep 3, 2013 Member

I think there is a confusion: object attributes that are data dependent need the trailing _, but not every data-dependent variable in the code. Makes sense?

@agramfort agramfort commented on an outdated diff Sep 3, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -476,13 +491,19 @@ def predict(self, X, eval_MSE=False, batch_size=None):
# Ordinary Kriging
u = np.zeros(y.shape)
- MSE = self.sigma2 * (1. - (rt ** 2.).sum(axis=0)
- + (u ** 2.).sum(axis=0))
+ MSE = (self.sigma2.reshape(n_targets_, 1) * (1. - (rt ** 2.).sum(axis=0)
@agramfort
agramfort Sep 3, 2013 Member

this line is too long. Did you run the pep8 tool on your code?

@JohnFNovak

I misunderstood the trailing underscores; I thought you meant all data-dependent variables. I also realize now that there were a few other pep8 problems, which I have addressed. Thanks for the feedback: I do most of my coding in a vacuum, so I rarely get helpful criticism.

@agramfort agramfort commented on an outdated diff Sep 3, 2013
sklearn/gaussian_process/gaussian_process.py
"""
# Check input shapes
X = array2d(X)
- n_eval, n_features_X = X.shape
+ n_eval, n_featuresX = X.shape
@agramfort
agramfort Sep 3, 2013 Member

too quick find / replace :)

@agramfort agramfort commented on the diff Sep 3, 2013
sklearn/gaussian_process/tests/test_gaussian_process.py
@@ -65,6 +65,36 @@ def test_2d(regr=regression.constant, corr=correlation.squared_exponential,
assert_true(np.allclose(y_pred, y) and np.allclose(MSE, 0.))
+def test_2d_2d(regr=regression.constant, corr=correlation.squared_exponential,
+ random_start=10, beta0=None):
@agramfort
agramfort Sep 3, 2013 Member

run pep8 on test file too.

@JohnFNovak
JohnFNovak Sep 4, 2013

pep8 doesn't give any errors on this file. If I formatted it strangely, just let me know how you would like it.

@agramfort
Member

Besides that, it looks good to me. Could it be illustrated in an example?

@JohnFNovak

I have used this code myself for multi-dimensional interpolation. In my research we have a relativistic blast-wave hydrodynamic heavy-ion nuclear physics model, but it is very slow. So we ran it at a bunch of points in parameter space, then used the GP as a proxy/pseudo-model while optimizing against experimental data, because the GP is fast. [Edit: Sorry, that was jargon heavy. We have a model and it's slow, so we interpolate model points with the GP because the GP is fast.] Paper: http://arxiv.org/pdf/1303.5769v2.pdf (the GP material starts around pages 9-10). I doubt that this was what you had in mind when you asked for an example...

But... I could probably put together some sort of multidimensional interpolation example: 2D -> 2D (or something similar). It just gets very hard to visualize pretty quickly.

@agramfort agramfort commented on an outdated diff Sep 4, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -406,18 +411,23 @@ def predict(self, X, eval_MSE=False, batch_size=None):
Returns
-------
- y : array_like
- An array with shape (n_eval, ) with the Best Linear Unbiased
+ y : array_like, shape (n_samples, ) or (n_samples, n_targets)
+ An array with shape (n_eval, ) if the Gaussian Process was trained
+ on an array of shape (n_samples, ) or and array with shape
@agramfort
agramfort Sep 4, 2013 Member

and array -> an array

@agramfort
Member

Fix the typo, then +1 for merge on my side. As for the example, I wanted more of an illustration, but forget it.

@JohnFNovak

I was trying to think of a good example, and while fiddling with the old example I realized that the MSE calculation had gotten broken. I've fixed it, but now I'm a little perplexed how it passed the build test while clearly not giving reasonable MSE values...

@JohnFNovak

I've got it figured out. The build tests only check that the MSE is small at the training points; they never test that it is sane away from them. I'll add something to the build tests to check that the values are sane away from the training points as well.

@JohnFNovak

I added another test to the 1-D case. It isn't the most rigorous, but it would have caught the error I introduced. Unfortunately, the MSE values depend on things like the kernel, so you can't make blanket statements about what counts as "reasonable", but if someone makes a mistake like mine again, this should catch it.

I don't know what the system is for adding examples, but I have something in mind for 2d->2d interpolation. I'll see if I can get it working.

@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Sep 5, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -271,11 +271,16 @@ def fit(self, X, y):
# Force data to 2D numpy.array
X = array2d(X)
- y = np.asarray(y).ravel()[:, np.newaxis]
+ self.y_shape_ = np.array(y).shape
+ y = array2d(y)
+
+ # If the y vales are given to fit() as an array, but transposed wrong
+ if y.shape == (1, X.shape[0]):
+ y = y.T
@GaelVaroquaux
GaelVaroquaux Sep 5, 2013 Member

I don't think that we should be supporting this. We should raise a meaningful error rather than silently transposing an input.

The meaningful error can be raised by using sklearn.utils.validation.check_arrays.
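A sketch of the error-raising alternative in plain numpy (check_arrays is the old sklearn validation helper; this stand-in only mimics the relevant sample-count check): reject a transposed y loudly instead of silently fixing it.

```python
import numpy as np

def validate_y(X, y):
    X = np.atleast_2d(X)
    y = np.asarray(y)
    if y.ndim == 1:
        y = y[:, np.newaxis]
    if y.shape[0] != X.shape[0]:
        raise ValueError(
            "y has %d samples but X has %d; did you pass y transposed?"
            % (y.shape[0], X.shape[0]))
    return y

X = np.zeros((5, 3))
print(validate_y(X, np.ones(5)).shape)   # (5, 1)
try:
    validate_y(X, np.ones((1, 5)))       # transposed input
except ValueError as e:
    print("rejected:", e)
```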

@agramfort
agramfort Sep 5, 2013 Member

It's a bit inelegant, I agree, but this case arises from array2d when y.ndim == 1 on input, and the code that follows now only works with 2-D y. Makes sense?

@GaelVaroquaux
GaelVaroquaux Sep 5, 2013 Member

it's a bit inelegant I agree but this case happens due to the array2d when y.ndim == 1 as input.

OK, but then we need to test and store the shape of y before the call to
array2d.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Sep 5, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -449,7 +459,12 @@ def predict(self, X, eval_MSE=False, batch_size=None):
y_ = np.dot(f, self.beta) + np.dot(r, self.gamma)
# Predictor
- y = (self.y_mean + self.y_std * y_).ravel()
+ y = (self.y_mean + self.y_std * y_).reshape(n_eval, n_targets)
+
+ if len(self.y_shape_) == 1:
+ y = y.ravel()
+ elif self.y_shape_ == (1, X.shape[0]):
+ y = y.T
@GaelVaroquaux
GaelVaroquaux Sep 5, 2013 Member

Same thing here: I'd rather not accept transposed arrays.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Sep 5, 2013
sklearn/gaussian_process/gaussian_process.py
@@ -474,15 +489,22 @@ def predict(self, X, eval_MSE=False, batch_size=None):
np.dot(self.Ft.T, rt) - f.T)
else:
# Ordinary Kriging
- u = np.zeros(y.shape)
+ u = np.zeros(y.shape).T
@GaelVaroquaux
GaelVaroquaux Sep 5, 2013 Member

If I understand well, the shape of this array (u) is such that the samples dimension is the second one?

I think that I would prefer if the first dimension was the samples, as it is standard in scikit-learn.

@GaelVaroquaux GaelVaroquaux commented on an outdated diff Sep 5, 2013
sklearn/gaussian_process/gaussian_process.py
# Mean Squared Error might be slightly negative depending on
# machine precision: force to zero!
MSE[MSE < 0.] = 0.
+ if len(self.y_shape_) == 1:
+ MSE = MSE.ravel()
+ elif self.y_shape_ == (1, X.shape[0]):
@GaelVaroquaux
GaelVaroquaux Sep 5, 2013 Member

Once again, I don't think that we should deal with an arbitrary transpose.

@GaelVaroquaux
Member

Apart from the transposition issues, this seems pretty much good to go to me.

@JohnFNovak

The Gaussian process will no longer accept matrices which are transposed wrong, and I think I have removed anything that hints at the possibility that it might. There were a few if: ... else: ... checks where the "else" could never happen, so they have been pruned. There may be a few ".T"s still in there, but those arise from the case where a list is turned into a 2-D array with the wrong dimensions. All checks for that sort of behavior test len(self.y_shape_) == 1, where self.y_shape_ is stored before y is forced to a 2-D array.

@agramfort
Member

I've rebased on master for a cleaner history and tried to simplify the input checking. See:

#2482

If you're happy, I'll update the whats_new page and ask for a merge.

@agramfort
Member

feel free to close this one if you want.

@JohnFNovak

Awesome! Thank you all for your patience!

@JohnFNovak JohnFNovak closed this Sep 29, 2013