
changed pls_.py to properly scale data, modified documents #6077

Closed
wants to merge 2 commits

Conversation

@OlaPaw commented Dec 21, 2015

1. In many instances the `scale` setting was ignored: data was always centered and always scaled. The data is now scaled only when `scale=True`, and left unscaled when it is `False`.
2. The return value of `predict()` was always de-centered but never de-scaled, so we got `Ypred = Ypred + self.y_mean_` in all cases, instead of `Ypred = (Ypred * self.y_std_) + self.y_mean_` when `scale=True`.
3. In the NIPALS algorithm, either the C or the Q matrix is used for deflation, depending on the mode (regression or canonical). This is now implemented, and the `#FIXME` note is no longer necessary, as `coef_` is calculated differently in those two cases. The calculated `coef_` was also scaled unnecessarily by the standard deviations of X and Y; this is now commented out. Only the W and C matrices are scaled by their norms in the inner loop. The outer loop scales the P, Q, and C matrices (canonical and regression) by `T.T * T` or `U.T * U`.
4. Some documentation was changed to reflect the code modifications. Instructions to use `sklearn.preprocessing.scale` are also included, as that could be a useful step for most users.
5. It should be noted that the example included in the comment sections now gives better predictions of Y.
6. If X and Y are scaled using `sklearn.preprocessing.scale` (or column mean centered and column variance scaled), the `pls_` functions all return the SAME results for `scale=True` and `scale=False`; this was not the case in the previous version of the code.
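Fix (2) can be sketched with plain NumPy. The variable names below are illustrative, not the actual `pls_.py` internals: the point is that predictions produced in the centered/scaled space must be both de-scaled and de-centered.

```python
import numpy as np

# Minimal sketch of fix (2): predictions made in the centered/scaled
# space must be de-scaled AND de-centered on the way out.
# Names here are illustrative, not the actual pls_.py attributes.
rng = np.random.RandomState(0)
Y = rng.randn(20, 2) * 5.0 + 10.0    # raw targets with nonzero mean/std

y_mean = Y.mean(axis=0)
y_std = Y.std(axis=0, ddof=1)
Yc = (Y - y_mean) / y_std            # what the model sees when scale=True

Y_back = (Yc * y_std) + y_mean       # corrected: de-scale, then de-center
Y_wrong = Yc + y_mean                # old behavior: de-center only

assert np.allclose(Y_back, Y)
assert not np.allclose(Y_wrong, Y)
```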

and added changes to canonical regression:

```diff
-        self.coef_ = (1. / self.x_std_.reshape((p, 1)) * self.coef_ *
-                      self.y_std_)
+        # self.coef_ = (1. / self.x_std_.reshape((p, 1)) * self.coef_ *
+        #               self.y_std_)
```
Member:
I am not familiar with PLS but if this is no longer needed please delete those commented out lines.

@ogrisel (Member) commented Feb 4, 2016

Some tests are failing. They might need to be updated or relaxed a bit; I am not sure.

@arthurmensch can probably have a look once the ICML deadline is over (tomorrow).

Also, please add one or more new tests to highlight the cases that have been fixed in this PR and that were not properly tested previously.

@arthurmensch (Contributor):

@OlaPaw are you still willing to work on this? Sorry for the review delay, by the way. As @ogrisel said, we'll need a non-regression test for the scaling bug in #6002. It would be great to see if it also fixes #6279. Do you happen to work under Windows?

```diff
@@ -41,7 +42,8 @@ def _nipals_twoblocks_inner_loop(X, Y, mode="A", max_iter=500, tol=1e-06,
     ite = 1
     X_pinv = Y_pinv = None
     eps = np.finfo(X.dtype).eps
-    # Inner loop of the Wold algo.
+    # Inner loop of the Wold algo. Very similar to chemometrics::pls2_nipals
+    # in R. Tolarence looks at changes in the X matrix rather than Y
```
Contributor:
Tolerance. chemometrics is not a widely known reference; either reference it explicitly or, I think, it should be removed.

@arthurmensch (Contributor):

Quoting @rlgc79:

Hi OlaPaw,

I've seen that in your pull request you have added many statements of the type:

```python
if self.scale is True:
    Xr /= self.x_std_
```

I am not sure whether the "if" statement is necessary. If you look at the code that scales the data, we can see that x_std and y_std are arrays of ones if we decided not to scale the data:

```python
if scale:
    x_std = X.std(axis=0, ddof=1)
    x_std[x_std == 0.0] = 1.0
    X /= x_std
    y_std = Y.std(axis=0, ddof=1)
    y_std[y_std == 0.0] = 1.0
    Y /= y_std
else:
    x_std = np.ones(X.shape[1])
    y_std = np.ones(Y.shape[1])
return X, Y, x_mean, y_mean, x_std, y_std
```

As a result, I believe that we can safely use `Xr /= self.x_std_` in your pull request, without the if statement, whether we decide to scale the data or not.
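The quoted argument can be verified in a couple of lines (a sketch; `x_std` here stands in for what the scaling helper returns when `scale=False`):

```python
import numpy as np

# Sketch of @rlgc79's point: with scale=False the std arrays are all
# ones, so an unconditional division is a no-op and the `if` is redundant.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
x_std = np.ones(X.shape[1])  # what the scaling helper returns for scale=False
assert np.allclose(X / x_std, X)
```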

I think the changes you made to `.fit`, `.predict`, and `_scale_x_y` are cosmetic ones. The real fix for bug #6002 is in `_nipals_twoblocks_inner_loop`.

I think your changes make the code more explicit, but I'd like to have another opinion on this.

```python
# 2.2 Normalize y_weights
if norm_y_weights:
    y_weights /= np.sqrt(np.dot(y_weights.T, y_weights)) + eps
# 2.3 Update y_score: the Y latent scores
y_score = np.dot(Y, y_weights) / (np.dot(y_weights.T, y_weights) + eps)
# y_score = np.dot(Y, y_weights) / np.dot(y_score.T, y_score) ## BUG
y_score = np.dot(Y, y_weights)
```
Contributor:
Effective bug fix

@OlaPaw (Author):
Not quite, but the dot product becomes unity, so it is not necessary.
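The point can be checked directly (a sketch, not the actual inner-loop code): once `y_weights` has been normalized to unit norm, the divisor `np.dot(y_weights.T, y_weights)` is 1, so dividing by it changes nothing.

```python
import numpy as np

# Sketch: after normalizing y_weights to unit norm, the Gram term
# np.dot(y_weights.T, y_weights) equals 1, so the division is redundant
# rather than a bug per se.
rng = np.random.RandomState(0)
y_weights = rng.randn(5, 1)
y_weights /= np.sqrt(np.dot(y_weights.T, y_weights))
assert np.allclose(np.dot(y_weights.T, y_weights), 1.0)
```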

@ogrisel (Member) commented Feb 19, 2016

@arthurmensch if @OlaPaw does not reply, please feel free to go ahead and extract the bugfix into a separate new PR, and please also enable the tests on AppVeyor, as done in #6274, to check whether it fixes #6279.

@OlaPaw (Author) commented Feb 19, 2016

Hi @arthurmensch and @ogrisel,

Yes, I think you have the entire gist of the bug here. If you don't mind, I'll take a look at this over the weekend and test with all my separate test cases. I'm really new to GitHub and debugging code, so thank you for your patience.

I just found a lot of the PLS code somewhat cryptic as it doesn't seem to follow the typical conventions I've seen in literature.

Finally, scikit-learn already has a pre-scaling function, and I'm wondering whether we should even scale/center anything in this function.

no need for separate check if scale is True, as self_std is set to ones if set to False
Fixed re-scaling and re-centering issues reported previously
@OlaPaw (Author) commented Feb 21, 2016

Ok, I've removed the cosmetic changes, and just fixed the issues with improper scaling and centering. I tried to leave the rest of the code as pure as possible.

@ogrisel (Member) commented Feb 22, 2016

Can you please add a non-regression test and update the existing tests (there are some failures reported by Travis)?

@lesteve (Member) commented Mar 7, 2016

> It would be great to see if it also fixes #6279

Unfortunately it does not, I just checked.

@amueller (Member) commented Oct 7, 2016

@OlaPaw are you still working on this? It would be good to have.

@jabenitez:

I'm now using PLSRegression. Please let me know if I can help complete this PR. I got here after noticing that my `predict` outputs were out of scale.

@amueller (Member):

@jabenitez feel free to help with this by adding the appropriate tests. You can create a new PR based on this one.

@lesteve (Member) commented Jun 9, 2017

@Gscorreia89 as a non PLS expert, it is hard to figure out what remains to be done on this one now that #7819 has been merged. Would you be kind enough to elaborate a bit?

Base automatically changed from master to main January 22, 2021 10:48
7 participants