Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Gaussian process for arbitrary-dimensional output spaces #1611

Closed
wants to merge 26 commits into from

Conversation

JohnFNovak
Copy link

I have fixed the Gaussian Process submodule so that Gaussian Processes can be trained on data which is vector like in y. Now when training a GP each point in both X and Y can be of arbitrary dimension. Previously, the values of Y were required to be scalars, which is a requirement which is not necessary for GPs.

The changes are fully backwards compatible, and the old tests work fine. I have not written new tests to test the new functionality, but I can if it is necessary.

@GaelVaroquaux
Copy link
Member

That's a useful contribution! Thanks.

New tests are indeed useful. If multi-output tests were not around for the other models that support multi-output, I would have broken this support more than once when making changes.

y = np.asarray(y).ravel()[:, np.newaxis]
y = array2d(y)

if y.shape[1] == X.shape[0] and y.shape[0] == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That test seems a bit too general. I think that you are only trying to capture the situation when the original 'y' vector was 1D. You can do that by storing the number of dimension of 'y' before the call to array2d

@JohnFNovak
Copy link
Author

I fixed the tests. The original shape of y is now stored and all the checks are to whether or not the length of the shape is 1 or not. If the original y is 1-d, it will have a shape = (n_samples,) and len(shape) = 1.

I apologize that it has taken an eternity for me to get around to fixing this; I am working on my doctorate and this code is at best tangentially related to some of the analysis code. It's been low on the to-do list.

@agramfort
Copy link
Member

thanks @JohnTheBear
can you add a test with y of shape (n_samples, 1) and check that it gives the same as if (n_samples, )
and add a test when y.shape = (n_samples, 2) ?

…nd incorrectly transposed it will be fixed, also the return values will be given in the same form as the training values
@JohnFNovak
Copy link
Author

@agramfort
I didn't think of the case where the training y was 1-d but given as a 2d array and transposed incorrectly, that would have slipped past my test and failed a few seconds later. I have addressed that, and changed the test when the values are returned so that the return vales for y and MSE are given in the same form as the training y. Although it may be safest in general that if the input values are given in a non-sensical way the module fails instead of pressing on. The only reason I was including shape tests was 1) so the return values are given in the same form as the training values, 2) it works out that array2d(np.array([list of length n_samples])) returns an array of shape == (1, n_samples), which will fail.

But I don't think that is what you are asking about. Do you want me to write something to go in the build tests? I don't quite understand what you are asking for.

I do know that this code works when y.shape = (n_samples, 15) because that is what I have been using it for.

@agramfort
Copy link
Member

have a look at the test file : test_gaussian_process.py in gaussian_process/tests folder.

that is the file that should be edited to test the new features.

@agramfort
Copy link
Member

@andyli
Copy link

andyli commented Apr 18, 2013

I need to use this feature, and I would like to know if it will be merged soon?
Is that only unit test is missing?

@JohnFNovak
Copy link
Author

I'm not dead, and I haven't forgotten about this. My linux box died, I'm in grad school, and this hasn't been super high on my to do list. Sorry all.

@JohnFNovak
Copy link
Author

I think this is good now

@GaelVaroquaux
Copy link
Member

I think this is good now

Could you add a '[MRG]' in the title of your PR, so that reviewers know
that they should prioritize it.

@@ -271,11 +271,19 @@ def fit(self, X, y):

# Force data to 2D numpy.array
X = array2d(X)
y = np.asarray(y).ravel()[:, np.newaxis]
self.y_shape = np.array(y).shape
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every attribute that is data dependent should end with _

if you need keep the number of targets at attribute I'd recommend to name it self.n_targets_ or self.n_outputs_

@amueller
Copy link
Member

We discussed this and I think we rather not want to merge this in a hurry. We actually wanted to do a feature freeze yesterday. It will probably be merged soon after the release. Sorry.

@JohnFNovak
Copy link
Author

That is just fine. Delayed is better than forgotten. It looks like there are new comments for me to address as well.

@GaelVaroquaux
Copy link
Member

Delayed is better than forgotten.

Certainly not forgotten!

@colincsl
Copy link

Is any chance this is going to be merged any time soon? I'm thinking about adding some extra stuff related to GPs and would like to base it on the multi-input/output version.

@JohnFNovak
Copy link
Author

I misunderstood about the trailing underscores. I thought you meant data dependent variables. I also realize now that there were a few other pep8 problems. I have addressed them. Thanks for the feedback, I do most of my coding in a vacuum, so I rarely get helpful criticism.

"""

# Check input shapes
X = array2d(X)
n_eval, n_features_X = X.shape
n_eval, n_featuresX = X.shape
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too quick find / replace :)

@agramfort
Copy link
Member

besides looks good to me. Could it be illustrated in an example?

@JohnFNovak
Copy link
Author

I have used this code myself for multi-dimensional interpolation. In my research we have a relativistic blast-wave hydrodynamic heavy-ion nuclear physics model, but it is very slow. So, we ran it at a bunch of points in parameter space, then used the GP as a proxy/pseudo model while optimizing against experimental data because the GP is fast. [Edit: Sorry, that was jargon heavy. We have a model and it's slow. So, we interpolate model points with the GP because the GP is fast] Paper: http://arxiv.org/pdf/1303.5769v2.pdf (GP stuff kicks in around page 9-10). I doubt that this was what you had in mind when you asked for an example....

But... I could probably put together some sort of multidimensional interpolation example: 2D -> 2D (or something similar). It just gets very hard to visualize pretty quickly.

An array with shape (n_eval, ) with the Best Linear Unbiased
y : array_like, shape (n_samples, ) or (n_samples, n_targets)
An array with shape (n_eval, ) if the Gaussian Process was trained
on an array of shape (n_samples, ) or and array with shape
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and array -> an array

@agramfort
Copy link
Member

fix the typo then +1 for merge on my side. For example I wanted more an illustration but forget it.

@JohnFNovak
Copy link
Author

I was trying to think of a good example, and while fiddling with the old example I realized that the MSE calculation had gotten broken. I've fixed it, but now I'm a little perplexed how it passed the build test while clearly not giving reasonable MSE values...

@JohnFNovak
Copy link
Author

I've got it figured out. The build tests only check that the MSE is small at the training points, it never tests that it is sane away from the training points. I'll add something to the build tests to check that the values are sane away from the training points as well.

@JohnFNovak
Copy link
Author

I added another test to the 1d case. it isn't the most rigorous, but it would have caught the error I introduced. Unfortunately, the MSE values will depend on things like the kernel, so you can't make blanket statements about what is necessarily "reasonable", but now if someone screws up like what I did, it should catch it.

I don't know what the system is for adding examples, but I have something in mind for 2d->2d interpolation. I'll see if I can get it working.


# If the y vales are given to fit() as an array, but transposed wrong
if y.shape == (1, X.shape[0]):
y = y.T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we should be supporting this. We should be raising a meaningful error rather than silently transposing an input.

The meaningful array can be raised by using sklearn.utils.validation.check_arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a bit inelegant I agree but this case happens due to the array2d when y.ndim == 1 as input. The code next now only works with 2d y. Makes sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a bit inelegant I agree but this case happens due to the array2d when y.ndim == 1 as input.

OK, but then we need to test and store the shape of y before the call to
array2d.

@GaelVaroquaux
Copy link
Member

Appart from the transposition issues, this seems pretty much good to go to me.

@JohnFNovak
Copy link
Author

The Gaussian Process will not accept matrices which are transposed wrong, and I think I have removed anything that hints at the possibility that it might. There were a few if: ... else:... checks where the "else" would never happen, so they have been pruned. There may be a few ".T"s still in there, but that arrises from when a list is turned into a 2d array and it has the wrong dimensions. All checks for that sort of behavior should check for len(self.y_shape_) == 1, where self.y_shape_ is stored before y forced to be an array.

@agramfort
Copy link
Member

I've rebased on master for a cleaner history and tried to simplify input checking. See:

#2482

if you're happy I'll update the whats_new page and ask for a merge.

@agramfort
Copy link
Member

feel free to close this one if you want.

@JohnFNovak
Copy link
Author

Awesome! Thank you all for your patience!

@JohnFNovak JohnFNovak closed this Sep 29, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants