
FIX compute y_std properly with multi-target in GPR #20761

Merged
merged 24 commits into from Oct 20, 2021

Conversation

patrickctrf
Contributor

Reference Issues/PRs

Fixes #17394. Fixes #18065.

What does this implement/fix? Explain your changes.

GaussianProcessRegressor could not return y_std
from predict(X, return_std=True) when n_targets was greater than 1.
This happened because line 415 in file
"sklearn/gaussian_process/_gpr.py" computed
y_var * self._y_train_std ** 2 with plain elementwise multiplication (a1 * a2).

This fails when self._y_train_std has more than one entry
(i.e. when n_targets > 1), so the multiplication is now implemented
with np.outer, which performs the conventional scalar-array
multiplication for EACH output feature
(self._y_train_std holds one normalization scale per output
feature).

After this fix, each output feature will receive its respective
normalized variance when return_std == True.

A simple example of how to train the GPR and reproduce the error:

#=====breaking-code-before-fix============================================

import numpy as np

X_train = np.random.rand(11, 10)
y_train = np.random.rand(11, 6)  # 6 target features == ERROR

X_test = np.random.rand(4, 10)

import sklearn.gaussian_process as gp

kernel = gp.kernels.ConstantKernel(1.0, (1e-1, 1e3)) * gp.kernels.RBF(10.0, (1e-3, 1e3))

model = gp.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=0.1, normalize_y=True)

model.fit(X_train, y_train)
params = model.kernel_.get_params()

y_pred, std = model.predict(X_test, return_std=True)

#=====end-of-breaking-code-before-fix======================================


#=====well-running-code-before-fix============================================

import numpy as np

X_train = np.random.rand(11, 1)
y_train = np.random.rand(11, 1)  # 1 target feature == FINE

X_test = np.random.rand(4, 1)

import sklearn.gaussian_process as gp

kernel = gp.kernels.ConstantKernel(1.0, (1e-1, 1e3)) * gp.kernels.RBF(10.0, (1e-3, 1e3))

model = gp.GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, alpha=0.1, normalize_y=True)

model.fit(X_train, y_train)
params = model.kernel_.get_params()

y_pred, std = model.predict(X_test, return_std=True)

#=====end-of-well-running-code-before-fix======================================

Line 415 in file "sklearn/gaussian_process/_gpr.py":

# before
y_var = y_var * self._y_train_std ** 2

# after
y_var = np.outer(y_var, self._y_train_std ** 2)
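To see the shape problem concretely, here is a small numpy-only sketch (dimensions chosen purely for illustration): elementwise multiplication of the (n_samples,) variance vector by the (n_targets,) scale vector fails to broadcast, while np.outer produces one scaled variance column per target.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_targets = 4, 6

y_var = rng.rand(n_samples)        # per-sample predictive variance
y_train_std = rng.rand(n_targets)  # one normalization scale per target

# Elementwise multiply fails: shapes (4,) and (6,) do not broadcast.
try:
    _ = y_var * y_train_std ** 2
except ValueError as exc:
    print("broadcast error:", exc)

# np.outer scales every sample's variance by every target's std**2,
# giving one variance column per target.
y_var_fixed = np.outer(y_var, y_train_std ** 2)
print(y_var_fixed.shape)  # (4, 6)
```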

@glemaitre
Member

Tests are failing, you need to:

  • reshape to (n_samples,) instead of (n_samples, 1) for a single target (this is the reason why the tests fail)
  • add a non-regression test
  • add an entry in whats new

@glemaitre glemaitre changed the title fix: GaussianProcessRegressor fails to compute y_std when n_targets > 1 FIX compute y_std properly with multi-target in GPR Aug 17, 2021
@patrickctrf
Contributor Author

Tests are failing, you need to:

  • reshape to (n_samples,) instead of (n_samples, 1) for a single target (this is the reason why the tests fail)
  • add a non-regression test
  • add an entry in whats new

Thanks for your review! What is a non-regression test? I'm only editing GPR, not GPC (the classifier).
Also, where is the "what's new" section? I couldn't find it, sorry. This is my first contribution.

@glemaitre
Member

We are still missing a non-regression test and an entry to the what's new.

@glemaitre
Member

You can read the following regarding regression testing.

Here, the idea is to write a test function that checks that code which was previously failing now gives the proper result. Basically, it corresponds to isolating the code from your first comment (breaking-code-before-fix) and adding some asserts. This function should be implemented in test_gpr.py.

Please add an entry to the change log at doc/whats_new/v1.0.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
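For reference, an entry following the surrounding format would look something like the sketch below (the wording is illustrative only; the final text was settled later in review):

```rst
:mod:`sklearn.gaussian_process`
...............................

- |Fix| :class:`gaussian_process.GaussianProcessRegressor` now computes
  `y_std` properly with multi-target regression when `normalize_y=True`.
  :pr:`20761` by :user:`patrickctrf`.
```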

@glemaitre glemaitre added this to REVIEWED AND WAITING FOR CHANGES in Guillaume's pet Sep 6, 2021
@patrickctrf
Contributor Author

I am getting a linting error after adding the entry in whats_new and the non-regression test.
Can you tell me where the error is happening? I've tried different linting inspections without success.

@glemaitre
Member

You can see the failure there. You can also click through the different "Details" links, which land on this page.

The issue is that you need to run black on the file so that no style diff is detected by the CI. The following command will do the job:

black -t py37 sklearn/gaussian_process/tests/test_gpr.py

Then commit the changes. A handy alternative to doing this manually is to install pre-commit, as mentioned in the contributing guide (item 9).

Check if GPR can compute y_std in predict() method when normalize_y==True
in multi-target regression.
"""
X_train = np.random.rand((11, 10))
Member

Your test is failing as well: cf. there

Here it would be best to define a random state and then you can just use randn:

rng = np.random.RandomState(42)

n_samples, n_features, n_targets = 12, 10, 6

X_train = rng.randn(n_samples, n_features)
y_train = rng.randn(n_samples, n_targets)
X_test = rng.randn(n_samples, n_features)

Comment on lines 668 to 670
# Generic kernel
kernel = kernels.ConstantKernel(1.0, (1e-1, 1e3))
kernel *= kernels.RBF(10.0, (1e-3, 1e3))
Member

Suggested change
# Generic kernel
kernel = kernels.ConstantKernel(1.0, (1e-1, 1e3))
kernel *= kernels.RBF(10.0, (1e-3, 1e3))
# Generic kernel
kernel = (
    kernels.ConstantKernel(1.0, (1e-1, 1e3))
    * kernels.RBF(10.0, (1e-3, 1e3))
)

in multi-target regression.
"""
X_train = np.random.rand((11, 10))
# 6 target features -> multi-target
Member

no need for this comment. Variable names and the code should be as much as possible self-explanatory.

kernel = kernels.ConstantKernel(1.0, (1e-1, 1e3))
kernel *= kernels.RBF(10.0, (1e-3, 1e3))

# normalize_y == True
Member

remove this comment

alpha=0.1,
normalize_y=True)
model.fit(X_train, y_train)
y_pred, std = model.predict(X_test, return_std=True)
Member

Suggested change
y_pred, std = model.predict(X_test, return_std=True)
y_pred, y_std = model.predict(X_test, return_std=True)

Member

It is true that the previous code was failing already. But we could make a couple of assertions regarding the shape of the returned values.

assert y_pred.shape == (n_samples, n_targets)

and something similar for y_std

Comment on lines 658 to 662
"""
Regression test for issues #17394 and #18065.
Check if GPR can compute y_std in predict() method when normalize_y==True
in multi-target regression.
"""
Member

Suggested change
"""
Regression test for issues #17394 and #18065.
Check if GPR can compute y_std in predict() method when normalize_y==True
in multi-target regression.
"""
"""Check that `y_std` is properly computed when `normalize_y=True`.
Non-regression test for:
https://github.com/scikit-learn/scikit-learn/issues/17394
https://github.com/scikit-learn/scikit-learn/issues/18065
"""

:mod:`sklearn.gaussian_process`
.........................

- |Fix| Compute 'y_std' properly with multi-target in
Member

Suggested change
- |Fix| Compute 'y_std' properly with multi-target in
- |Fix| Compute `y_std` properly with multi-target in

Comment on lines 826 to 827
:mod:`sklearn.gaussian_process`
.........................
Member

Suggested change
:mod:`sklearn.gaussian_process`
.........................
:mod:`sklearn.gaussian_process`
...............................

Member

We will let this entry here for the moment but we will have to move it in a section 1.0.1 most probably.

@patrickctrf
Contributor Author

Thanks for your support, @glemaitre. Just one more question: when I ran black, other functions in test_gpr.py were reformatted. Is that ok, or should I change them back to their original state?

@glemaitre
Member

Is it ok or should I change it back to its original state?

I think it is fine, because the same command will be run on the CI and should therefore lead to the same changes.

@glemaitre
Member

I relaunched the failing build, but it is most certainly only a timeout, so everything should be fine otherwise.

@patrickctrf
Contributor Author

Do I need to do anything else?

@glemaitre
Member

everything looks fine

@glemaitre
Member

Tagging this PR for 1.0.1 to include it in the next bug fix release

@glemaitre glemaitre added this to the 1.0.1 milestone Sep 16, 2021
@patrickctrf
Contributor Author

Ok. It is asking for changes in the what's new section. Should I change it?

Member

@jjerphan left a comment

Thank you @patrickctrf!

LGTM with a few modifications for the change log and to follow conventions.

sklearn/gaussian_process/tests/test_gpr.py
kernel=kernel,
n_restarts_optimizer=n_restarts_optimizer,
random_state=0,
kernel=kernel, n_restarts_optimizer=n_restarts_optimizer, random_state=0
Member

This is unrelated to this PR motives and also comes with no semantic changes.
Has black changed it? If it is not the case, could you revert this change, please?

Contributor Author

Yes, black changed it.

Member

Can you try to revert those changes? Normally it should be possible to keep the original lines.

Contributor Author

Sure! Done

Comment on lines 386 to 388
theta_opt, func_min = (
initial_theta,
obj_func(initial_theta, eval_gradient=False),
Member

Same comment as above.

Contributor Author

Black changed it too.

Member

Identically, can you try to revert those changes? Normally it should be possible to keep the original lines.

Contributor Author

Sure! Done

doc/whats_new/v1.0.rst
sklearn/gaussian_process/_gpr.py
patrickctrf and others added 4 commits October 9, 2021 01:28
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@glemaitre glemaitre self-assigned this Oct 14, 2021
Member

@jeremiedbb left a comment

thanks @patrickctrf. The same issue appears in the if return_cov branch: y_cov has shape (n_samples, n_samples) but should have shape (n_samples, n_samples, n_targets). Would you mind fixing that too? np.einsum("ij,k->ijk", y_var, self._y_train_std ** 2) followed by np.squeeze will do it.
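A numpy-only sketch of that return_cov scaling (dimensions invented for illustration): the einsum scales the 2-D covariance by each target's variance, stacking one matrix per target along the last axis, and np.squeeze drops that axis in the single-target case.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_targets = 4, 6

y_cov = rng.rand(n_samples, n_samples)  # unscaled predictive covariance
y_train_std = rng.rand(n_targets)       # one normalization scale per target

# Scale the covariance by each target's variance: result is one
# (n_samples, n_samples) matrix per target, stacked on the last axis.
y_cov_scaled = np.einsum("ij,k->ijk", y_cov, y_train_std ** 2)
print(y_cov_scaled.shape)  # (4, 4, 6)

# For a single target, np.squeeze restores the original 2-D shape.
single = np.einsum("ij,k->ijk", y_cov, np.ones(1))
print(np.squeeze(single).shape)  # (4, 4)
```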

sklearn/gaussian_process/_gpr.py
@jeremiedbb jeremiedbb self-requested a review October 20, 2021 15:22
Member

@jeremiedbb left a comment

LGTM. Thanks @patrickctrf !

@jeremiedbb jeremiedbb merged commit 9b210ae into scikit-learn:main Oct 20, 2021
@patrickctrf
Contributor Author

Thanks for your support, guys!

@glemaitre glemaitre mentioned this pull request Oct 23, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021
@glemaitre glemaitre moved this from REVIEWED AND WAITING FOR CHANGES to MERGED in Guillaume's pet Oct 26, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021