changed pls_.py to properly scale data, modified documents #6077
Conversation
and added changes to canonical regression
sklearn/cross_decomposition/pls_.py
Outdated
-        self.coef_ = (1. / self.x_std_.reshape((p, 1)) * self.coef_ *
-                      self.y_std_)
+        # self.coef_ = (1. / self.x_std_.reshape((p, 1)) * self.coef_ *
+        #               self.y_std_)
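For readers unfamiliar with this back-scaling: a minimal sketch of what the (now commented-out) expression computes, assuming coefficients were fit on column-standardized X and Y; the variable names below are illustrative, not from the PR:

import numpy as np

rng = np.random.RandomState(0)
p, q = 3, 2                     # number of X and Y columns
B_std = rng.rand(p, q)          # coefficients fit in standardized space
x_std = rng.rand(p) + 0.5      # per-column standard deviations of X
y_std = rng.rand(q) + 0.5      # per-column standard deviations of Y

# Back-transform the coefficients to the original units of X and Y:
# divide by the X scales (rows) and multiply by the Y scales (columns),
# the same shape as the commented-out expression above.
B = (1.0 / x_std.reshape((p, 1))) * B_std * y_std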
I am not familiar with PLS, but if this is no longer needed, please delete those commented-out lines.
Some tests are failing. They might need to be updated or relaxed a bit; I am not sure. @arthurmensch can probably have a look once the ICML deadline is over (tomorrow). Also, please add one or more new tests to highlight the cases that have been fixed in this PR and that were not properly tested previously.
sklearn/cross_decomposition/pls_.py
Outdated
@@ -41,7 +42,8 @@ def _nipals_twoblocks_inner_loop(X, Y, mode="A", max_iter=500, tol=1e-06,
     ite = 1
     X_pinv = Y_pinv = None
     eps = np.finfo(X.dtype).eps
-    # Inner loop of the Wold algo.
+    # Inner loop of the Wold algo. Very similar to chemometrics::pls2_nipals
+    # in R. Tolarence looks at changes in the X matrix rather than Y
Tolerance. chemometrics is not a widely known reference; either reference it explicitly or, I think, it should be removed.
Quoting @rlgc79:
I think your changes make the code more explicit, but I'd like to have another opinion on this.
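To make the tolerance remark concrete, here is a rough sketch of the two-block Wold/NIPALS inner loop being discussed, with convergence judged on the change in the X weight vector; this is an illustration, not the PR's code:

import numpy as np

def nipals_inner_loop_sketch(X, Y, max_iter=500, tol=1e-06):
    # Alternate regressions between the X and Y blocks until the X
    # weight vector stops changing. Convergence is judged on x_weights,
    # not on the Y side.
    eps = np.finfo(X.dtype).eps
    y_score = Y[:, [0]]
    x_weights_old = 0.0
    for _ in range(max_iter):
        # 1.1 Regress each X column on y_score to get the X weights
        x_weights = np.dot(X.T, y_score) / (np.dot(y_score.T, y_score) + eps)
        # 1.2 Normalize x_weights
        x_weights /= np.sqrt(np.dot(x_weights.T, x_weights)) + eps
        # 1.3 Update the X latent scores
        x_score = np.dot(X, x_weights)
        # 2.1 Regress each Y column on x_score to get the Y weights
        y_weights = np.dot(Y.T, x_score) / (np.dot(x_score.T, x_score) + eps)
        # 2.3 Update the Y latent scores
        y_score = np.dot(Y, y_weights)
        # Tolerance check on the change in the X weights
        diff = x_weights - x_weights_old
        if np.dot(diff.T, diff) < tol:
            break
        x_weights_old = x_weights
    return x_weights, y_weights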
     # 2.2 Normalize y_weights
     if norm_y_weights:
         y_weights /= np.sqrt(np.dot(y_weights.T, y_weights)) + eps
     # 2.3 Update y_score: the Y latent scores
-    y_score = np.dot(Y, y_weights) / (np.dot(y_weights.T, y_weights) + eps)
-    # y_score = np.dot(Y, y_weights) / np.dot(y_score.T, y_score) ## BUG
+    y_score = np.dot(Y, y_weights)
Effective bug fix
Not quite, but since y_weights has just been normalized, the dot product becomes unity, so the division is not necessary.
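A quick numeric check of that remark (an illustration, not code from the PR): once y_weights has been normalized in step 2.2, its squared norm is unity, so the division removed in step 2.3 was a no-op up to eps:

import numpy as np

eps = np.finfo(np.float64).eps
y_weights = np.random.rand(4, 1)
# Step 2.2: normalize y_weights
y_weights /= np.sqrt(np.dot(y_weights.T, y_weights)) + eps
# The denominator that step 2.3 used to divide by is now unity,
# so the division can be dropped:
print(np.dot(y_weights.T, y_weights))  # ~1.0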
@arthurmensch if @OlaPaw does not reply, please feel free to go ahead and extract the bugfix in a separate new PR, and please also enable the tests on AppVeyor as done in #6274 to check whether it fixes #6279.
Hi @arthurmensch and @ogrisel, yes, I think you have the entire gist of the bug here. If you don't mind, I'll take a look at this over the weekend and test with all my separate test cases. I'm really new to GitHub and debugging code, so thank you for your patience. I just found a lot of the PLS code somewhat cryptic, as it doesn't seem to follow the typical conventions I've seen in the literature. Finally, scikit-learn already has a pre-scaling function, and I'm wondering whether we should even scale / center anything in this function.
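For reference, the pre-scaling function mentioned here is presumably sklearn.preprocessing.scale, which centers each column to zero mean and scales it to unit variance:

import numpy as np
from sklearn.preprocessing import scale

X = np.random.rand(20, 5)
Xs = scale(X)           # each column: zero mean, unit variance
print(Xs.mean(axis=0))  # ~0
print(Xs.std(axis=0))   # ~1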
no need for separate check if scale is True, as self_std is set to ones if set to False
Fixed re-scaling and re-centering issues reported previously
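A minimal sketch of the behavior that commit message relies on, assuming the centering/scaling helper returns a vector of ones when scale=False (the function name below is illustrative, not the PR's exact code):

import numpy as np

def center_scale_sketch(X, scale=True):
    # Center the columns of X; optionally scale them to unit variance.
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    if scale:
        x_std = Xc.std(axis=0, ddof=1)
        x_std[x_std == 0.0] = 1.0  # avoid division by zero
        Xc = Xc / x_std
    else:
        # With scale=False the std vector is set to ones, so callers can
        # divide or multiply by it unconditionally: no separate branch needed.
        x_std = np.ones(X.shape[1])
    return Xc, x_mean, x_std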
OK, I've removed the cosmetic changes and just fixed the issues with improper scaling and centering. I tried to leave the rest of the code as pure as possible.
Can you please add a non-regression test and update the existing tests? (There are some failures reported by Travis.)
Unfortunately it does not; I just checked.
@OlaPaw are you still working on this? It would be good to have.
I'm now using PLSRegression. Please let me know if I can help complete this PR. I got here after noticing my outputs of predict were out of scale.
@jabenitez feel free to help with this by adding the appropriate tests. You can create a new PR based on this one.
@Gscorreia89 as a non-PLS expert, it is hard to figure out what remains to be done on this one now that #7819 has been merged. Would you be kind enough to elaborate a bit?
1. The scale setting was ignored: data was always centered, and always scaled. The data is now scaled appropriately when scale=True is set, and not scaled when it's False.
2. The output of the predict() function was always de-centered, but never de-scaled, so that we got Ypred = Ypred + self.y_mean_ for all cases, instead of Ypred = (Ypred * self.y_std_) + self.y_mean_ when scale=True.
3. coef_ is calculated differently depending on those two cases. The calculated coef_ was also scaled unnecessarily by the standard deviations of X and Y: this is now commented out. Only the matrices W and C are scaled by their norms in the inner loop. The outer loop scales the P, Q, and C (canonical and regression) matrices by T.T*T or U.T*U.
4. Tests comparing against sklearn.preprocessing.scale are included, as that could be a useful step for most users. When the data is pre-processed with sklearn.preprocessing.scale (or column mean centered and column variance scaled), the pls_ functions all return the SAME results when scale=True and scale=False; this was not the case in the previous version of the code (see the sketch after this list).
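A sketch of the kind of equivalence check described in the last point, assuming the fixed behavior; this is hypothetical test code, not the PR's actual test suite:

import numpy as np
from sklearn.preprocessing import scale
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(42)
X = rng.rand(30, 5)
Y = rng.rand(30, 2)
Xs, Ys = scale(X), scale(Y)

# On pre-standardized data, the estimator should give the same predictions
# whether its own scaling is enabled or not.
pls_on = PLSRegression(n_components=2, scale=True).fit(Xs, Ys)
pls_off = PLSRegression(n_components=2, scale=False).fit(Xs, Ys)
np.testing.assert_array_almost_equal(pls_on.predict(Xs), pls_off.predict(Xs))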