
changed pls_.py to properly scale data, modified documents #6077

Closed
wants to merge 2 commits

Conversation

@OlaPaw commented Dec 21, 2015

1. In many instances the `scale` setting was ignored: data was always centered and always scaled. The data is now scaled only when `scale=True`, and left unscaled when it is `False`.
2. The return value of `predict()` was always de-centered but never de-scaled, so we got `Ypred = Ypred + self.y_mean_` in all cases, instead of `Ypred = (Ypred * self.y_std_) + self.y_mean_` when `scale=True`.
3. In the NIPALS algorithm, either the C or the Q matrix is used for deflation, depending on the mode (regression or canonical). This is now implemented, and the `#FIXME` note is no longer necessary, as `coef_` is calculated differently in those two cases. The calculated `coef_` was also scaled unnecessarily by the standard deviations of X and Y; this is now commented out. Only the W and C matrices are scaled by their norms in the inner loop. The outer loop scales the P, Q, and C matrices (canonical and regression) by `T.T * T` or `U.T * U`.
4. Some documentation was changed to reflect the code modifications. Instructions to use `sklearn.preprocessing.scale` are also included, as that could be a useful step for most users.
5. It should be noted that the example included in the comment sections now gives better predictions of Y.
6. If X and Y are scaled using `sklearn.preprocessing.scale` (or column mean centered and column variance scaled), the `pls_` functions all return the SAME results for `scale=True` and `scale=False`; this was not the case in the previous version of the code.
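Fix (2) can be sketched with plain NumPy. The variable names below are illustrative, not the actual `pls_.py` internals: the point is that predictions produced in the centered/scaled space must be both de-scaled and de-centered.

```python
import numpy as np

# Minimal sketch of fix (2): predictions made in the centered/scaled
# space must be de-scaled AND de-centered on the way out.
# Names here are illustrative, not the actual pls_.py attributes.
rng = np.random.RandomState(0)
Y = rng.randn(20, 2) * 5.0 + 10.0    # raw targets with nonzero mean/std

y_mean = Y.mean(axis=0)
y_std = Y.std(axis=0, ddof=1)
Yc = (Y - y_mean) / y_std            # what the model sees when scale=True

Y_back = (Yc * y_std) + y_mean       # corrected: de-scale, then de-center
Y_wrong = Yc + y_mean                # old behavior: de-center only

assert np.allclose(Y_back, Y)
assert not np.allclose(Y_wrong, Y)
```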

and added changes to canonical regression:

```diff
-        self.coef_ = (1. / self.x_std_.reshape((p, 1)) * self.coef_ *
-                      self.y_std_)
+        # self.coef_ = (1. / self.x_std_.reshape((p, 1)) * self.coef_ *
+        #               self.y_std_)
```
Member:
I am not familiar with PLS but if this is no longer needed please delete those commented out lines.

@ogrisel (Member) commented Feb 4, 2016

Some tests are failing. They might need to be updated or relaxed a bit; I am not sure.

@arthurmensch can probably have a look once the ICML deadline is over (tomorrow).

Also, please add one or more new tests to highlight the cases that have been fixed in this PR and that were not properly tested previously.

@arthurmensch (Contributor):

@OlaPaw are you still willing to work on this? Sorry for the review delay, by the way. As @ogrisel said, we'll need a non-regression test for the scaling bug in #6002. It would be great to see if it also fixes #6279. Do you happen to work under Windows?

```diff
@@ -41,7 +42,8 @@ def _nipals_twoblocks_inner_loop(X, Y, mode="A", max_iter=500, tol=1e-06,
     ite = 1
     X_pinv = Y_pinv = None
     eps = np.finfo(X.dtype).eps
-    # Inner loop of the Wold algo.
+    # Inner loop of the Wold algo. Very similar to chemometrics::pls2_nipals
+    # in R. Tolarence looks at changes in the X matrix rather than Y
```
Contributor:
Tolerance. chemometrics is not a widely known reference; either reference it explicitly or, I think, it should be removed.

@arthurmensch (Contributor):

Quoting @rlgc79:

Hi OlaPaw,

I've seen that in your pull request you have added many statements of the type:

```python
if self.scale is True:
    Xr /= self.x_std_
```

I am not sure whether the "if" statement is necessary. If you look at the code that scales the data, we can see that x_std and y_std are arrays of ones if we decided not to scale the data:

```python
if scale:
    x_std = X.std(axis=0, ddof=1)
    x_std[x_std == 0.0] = 1.0
    X /= x_std
    y_std = Y.std(axis=0, ddof=1)
    y_std[y_std == 0.0] = 1.0
    Y /= y_std
else:
    x_std = np.ones(X.shape[1])
    y_std = np.ones(Y.shape[1])
return X, Y, x_mean, y_mean, x_std, y_std
```

As a result, I believe that we can safely use `Xr /= self.x_std_` in your pull request, without the if statement, whether we decide to scale the data or not.
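The quoted argument can be verified in a couple of lines (a sketch; `x_std` here stands in for what the scaling helper returns when `scale=False`):

```python
import numpy as np

# Sketch of @rlgc79's point: with scale=False the std arrays are all
# ones, so an unconditional division is a no-op and the `if` is redundant.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
x_std = np.ones(X.shape[1])  # what the scaling helper returns for scale=False
assert np.allclose(X / x_std, X)
```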

I think the changes you made to `.fit`, `.predict`, and `_scale_x_y` are cosmetic ones. The real fix for bug #6002 is in `_nipals_twoblocks_inner_loop`.

I think your changes make the code more explicit, but I'd like to have another opinion on this.

```python
# 2.2 Normalize y_weights
if norm_y_weights:
    y_weights /= np.sqrt(np.dot(y_weights.T, y_weights)) + eps
# 2.3 Update y_score: the Y latent scores
y_score = np.dot(Y, y_weights) / (np.dot(y_weights.T, y_weights) + eps)
# y_score = np.dot(Y, y_weights) / np.dot(y_score.T, y_score) ## BUG
y_score = np.dot(Y, y_weights)
```
Contributor:
Effective bug fix

@OlaPaw (Author):
Not quite, but the dot product becomes unity, so it is not necessary.
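The point can be checked directly (a sketch, not the actual inner-loop code): once `y_weights` has been normalized to unit norm, the divisor `np.dot(y_weights.T, y_weights)` is 1, so dividing by it changes nothing.

```python
import numpy as np

# Sketch: after normalizing y_weights to unit norm, the Gram term
# np.dot(y_weights.T, y_weights) equals 1, so the division is redundant
# rather than a bug per se.
rng = np.random.RandomState(0)
y_weights = rng.randn(5, 1)
y_weights /= np.sqrt(np.dot(y_weights.T, y_weights))
assert np.allclose(np.dot(y_weights.T, y_weights), 1.0)
```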

@ogrisel (Member) commented Feb 19, 2016

@arthurmensch if @OlaPaw does not reply, please feel free to go ahead and extract the bugfix into a separate new PR, and please also enable the tests on AppVeyor, as done in #6274, to check whether it fixes #6279.

@OlaPaw (Author) commented Feb 19, 2016

Hi @arthurmensch and @ogrisel,

Yes, I think you have the entire gist of the bug here. If you don't mind, I'll take a look at this over the weekend and test with all my separate test cases. I'm really new to GitHub and debugging code, so thank you for your patience.

I just found a lot of the PLS code somewhat cryptic as it doesn't seem to follow the typical conventions I've seen in literature.

Finally, scikit-learn already has a pre-scaling function, and I'm wondering whether we should even scale/center anything in this function.

no need for separate check if scale is True, as self_std is set to ones if set to False
Fixed re-scaling and re-centering issues reported previously
@OlaPaw (Author) commented Feb 21, 2016

Ok, I've removed the cosmetic changes, and just fixed the issues with improper scaling and centering. I tried to leave the rest of the code as pure as possible.

@ogrisel (Member) commented Feb 22, 2016

Can you please add a non-regression test and update the existing tests (there are some failures reported by Travis)?

@lesteve (Member) commented Mar 7, 2016

> It would be great to see if it also fixes #6279

Unfortunately it does not, I just checked.

@amueller (Member) commented Oct 7, 2016

@OlaPaw are you still working on this? It would be good to have.

@jabenitez:

I'm now using PLSRegression. Please let me know if I can help complete this PR. I got here after noticing that my `predict` outputs were out of scale.

@amueller (Member):

@jabenitez feel free to help with this by adding the appropriate tests. You can create a new PR based on this one.

@lesteve (Member) commented Jun 9, 2017

@Gscorreia89 as a non PLS expert, it is hard to figure out what remains to be done on this one now that #7819 has been merged. Would you be kind enough to elaborate a bit?

Base automatically changed from master to main January 22, 2021 10:48
7 participants