
FIX Fixes test_scale_and_stability #18746

Merged

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Nov 3, 2020

Reference Issues/PRs

Fixes #18613
Fixes #6279
Similar to #13903

What does this implement/fix? Explain your changes.

Mapping small weights to zero helps with stabilization.
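A minimal sketch of that idea, assuming a relative threshold (the helper name and tolerance are illustrative, not the code in this PR):

```python
import numpy as np

def clip_small_weights(w, rel_tol=1e-10):
    """Zero out entries that are negligible relative to the largest weight.

    `rel_tol` is an illustrative threshold, not the value used in the PR.
    """
    w = np.asarray(w, dtype=float).copy()
    w[np.abs(w) < rel_tol * np.abs(w).max()] = 0.0
    return w

# A weight vector like the one discussed in the conversation:
# the second entry is numerical noise.
print(clip_small_weights([1.56, 7.10e-15]))
```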

Any other comments?

This error can be reproduced on Windows with the latest numpy + scipy installed from PyPI and Python 3.8.

Member

@ogrisel ogrisel left a comment

+1 (I trust you for the Windows stability manual check :)

Member

@ogrisel ogrisel left a comment

Retracting my previous +1 review because I now realize that I had misunderstood the code in my first review. I will need more time to understand what's going on and what kind of fix would make sense.

I think it would help if the failing test could be split/decomposed or parametrized to show for which kind of data the problem is happening.

@ogrisel ogrisel self-requested a review November 4, 2020 13:24
@thomasjpfan
Member Author

thomasjpfan commented Nov 4, 2020

As background, this issue appears for CCA and X2, Y2 in the loop:

for (X, Y) in [(X1, Y1), (X2, Y2), (X3, Y3)]:

The issue stems from the following dot product having a different result depending on the platform + numpy version:

y_weights = np.dot(Y_pinv, x_score)

On Windows one of the iterations gets [1.56, 7.10e-15], while on Linux it gets [1.56, 1.77e-15]; this error propagates to the next iteration.

Edit: There are small differences in the output of pinv2 between platforms as well.
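For context, here is a runnable sketch of where that dot product sits in the power-method step (names and shapes are illustrative, and np.linalg.pinv stands in for scipy's pinv2):

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((20, 3)), rng.standard_normal((20, 2))

# One step of the power method on illustrative random data.
y_score = Y[:, 0].copy()
x_weights = X.T @ y_score
x_weights /= np.linalg.norm(x_weights)
x_score = X @ x_weights

# The platform-sensitive step: a pseudo-inverse followed by a dot product.
# Tiny components of y_weights (~1e-15) differ between BLAS builds, and
# the error compounds across iterations.
Y_pinv = np.linalg.pinv(Y)
y_weights = Y_pinv @ x_score
print(y_weights.shape)
```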

@NicolasHug
Member

Is this just a matter of setting max_iter or tol then?

(CCA is notoriously unstable, especially for poorly conditioned matrices. When Y2 is scaled into Ys it looks like a rank-1 matrix to me)

Member Author

@thomasjpfan thomasjpfan left a comment

Updated this PR by adjusting the `cond` of pinv2.

sklearn/cross_decomposition/_pls.py
Member

@NicolasHug NicolasHug left a comment

LGTM then, thanks @thomasjpfan

This might need a what's new entry since this may induce a change in results in some cases.

doc/whats_new/v0.24.rst
Member

@ogrisel ogrisel left a comment

The solution looks fine to me, but I would be more confident if you could update test_scale_and_stability to split it into 3 tests: test_scale_and_stability_linerud, test_scale_and_stability_non_regression_2821 and test_scale_and_stability_non_regression_7819, so that we know instantly which dataset is causing the instability should we observe a new failure in the future.

Also we could use sklearn.preprocessing.scale instead of manually scaling the input datasets.

I don't have a Windows machine handy, so I cannot quickly check whether the changes I suggest would still cause the failure to happen on master (scaling with a different ddof in particular).

@ogrisel
Member

ogrisel commented Nov 5, 2020

Or alternatively, if you understand the cause of the problem well enough, could you implement a new test that would trigger a similar problem on any platform (not just windows + whatever BLAS is used on the failing CI) by making the instability even stronger?

@@ -127,6 +127,9 @@ Changelog
predictions for `est.transform(Y)` when the training data is single-target.
:pr:`17095` by `Nicolas Hug`_.

- |Fix| Increases the stability of :class:`cross_decomposition.CCA` :pr:`18746`
Member

Specify that it's only for poorly conditioned matrices? So that not all users expect changes in their results.

Member Author

@thomasjpfan thomasjpfan Nov 5, 2020

The issue comes from:

https://github.com/scipy/scipy/blob/8e30f7797bd1ee442f4f1a25172e4402521c1e16/scipy/linalg/basic.py#L1385

when pinv2 is called. On Windows with the PyPI version of numpy, the singular values for Xk are:

[0.7, 0.14, 9e-16]

which results in a rank of 3.

On other platforms, where the original test passes, the singular values are

[0.7, 0.14, 1.4e-17]

which results in a rank of 2. This is by design, as stated in the reference paper on page 12, where the rank should decrease when Xk gets updated.

Xk is updated here:

Xk -= np.outer(x_scores, x_loadings)

Given this, I do not think it's about poorly conditioned matrices. It is our usage of pinv2 that was flaky.
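The rank drop from deflation can be reproduced with a small sketch (random data, not the test's matrices): after subtracting the leading rank-1 component, the third singular value of Xk is only approximately zero, and its exact magnitude depends on the platform and BLAS.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))

# One deflation step: subtract the best rank-1 approximation, as in
# Xk -= np.outer(x_scores, x_loadings).
u, s, vt = np.linalg.svd(X, full_matrices=False)
x_scores = u[:, 0] * s[0]
x_loadings = vt[0]
Xk = X - np.outer(x_scores, x_loadings)

# Mathematically Xk now has rank 2; numerically the third singular
# value is rounding noise whose size varies across platforms.
s_k = np.linalg.svd(Xk, compute_uv=False)
print(s_k)
```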

Member

when pinv2 is called. On Windows with the PyPI version of numpy, the singular values for Xk are:

After which iteration? If this happens at the first iteration, then this is indeed a problem of condition number, I believe.

Member

(by first iteration I mean the first component, or equivalently the first call to _get_first_singular_vectors_power_method)

Member Author

After which iteration? If this happens at the first iteration, then this is indeed a problem of condition number, I believe.

This happened on the second call to _get_first_singular_vectors_power_method.

Member

I think it would still be valuable to identify in which cases there's a stability improvement. Clearly, not all users will be impacted by this

Member Author

I think this happens randomly. I updated test_scale_and_stability to generate random ys and found that there were seeds that fail on my machine and on windows.

From a user point of view, for some datasets they may have gotten an incorrect model.

Member

Both random y and X? Because otherwise the problem might just be coming from X.

I still believe this is related to poorly conditioned matrices, because what you pass to _get_first_singular_vectors_power_method at iteration i are the deflated matrices from iteration i - 1 (deflation = subtracting a rank-1 matrix to obtain a rank (r - 1) matrix).

I'm quite certain that the name of the parameter to pinv (cond or rcond) is related to the condition number.

Member Author

Both random y and X? Because otherwise the problem might just be coming from X.

Updated with randomly generated X and y. The condition number will always get much bigger when the matrix becomes a rank r - 1 matrix.

When the rank becomes r - 1, pinv2 by default has trouble determining the rank of the matrix:

https://github.com/scipy/scipy/blob/8e30f7797bd1ee442f4f1a25172e4402521c1e16/scipy/linalg/basic.py#L1457-L1466

Setting cond = 10 * eps helps in this case.
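A sketch of why the cutoff matters, reusing the singular values quoted above (the default-cutoff formula is an approximation of what scipy's pinv2 computes internally):

```python
import numpy as np

s = np.array([0.7, 0.14, 9e-16])   # singular values reported on Windows
eps = np.finfo(np.float64).eps

# pinv2's default cutoff is roughly max(M, N) * eps * s.max(); for a
# 3-column matrix that lands below 9e-16, so the noise value is kept.
default_cutoff = 3 * eps * s.max()
print((s > default_cutoff).sum())   # numerical rank: 3

# With cond = 10 * eps the cutoff moves above 9e-16 and the rank is 2.
fixed_cutoff = 10 * eps * s.max()
print((s > fixed_cutoff).sum())     # numerical rank: 2
```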

Member Author

@thomasjpfan thomasjpfan Nov 6, 2020

Specifically, when pinv2 thinks a rank r - 1 matrix is rank r, it divides by a really small singular value when computing the inverse.
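To illustrate, a sketch using np.linalg.pinv, whose rcond plays the same role as pinv2's cond: keeping the noise singular value makes the pseudo-inverse blow up, while a 10 * eps cutoff keeps it well scaled.

```python
import numpy as np

# Build a numerically rank-2 matrix whose third singular value is noise.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((6, 3)))
v, _ = np.linalg.qr(rng.standard_normal((3, 3)))
s = np.array([0.7, 0.14, 9e-16])
A = u @ np.diag(s) @ v.T

eps = np.finfo(np.float64).eps
# A tiny cutoff keeps 9e-16 and divides by it: enormous pseudo-inverse.
loose = np.linalg.pinv(A, rcond=1e-17)
# The fix's cutoff treats 9e-16 as zero: well-scaled pseudo-inverse.
tight = np.linalg.pinv(A, rcond=10 * eps)

print(np.linalg.norm(loose))
print(np.linalg.norm(tight))
```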

sklearn/cross_decomposition/tests/test_pls.py
@cmarmo
Contributor

cmarmo commented Nov 12, 2020

This is ready to be merged, right? @glemaitre? Thanks!

@jeremiedbb jeremiedbb added this to the 0.24 milestone Nov 13, 2020
Member

@ogrisel ogrisel left a comment

The explanation @thomasjpfan provided above looks good to me and is empirically backed by the fact that the updated test can fail with different random seeds even on 64-bit Linux.

I am wondering: shall we try to remove the _UnstableArchMixin from the CCA mro?

@@ -128,6 +128,9 @@ Changelog
predictions for `est.transform(Y)` when the training data is single-target.
:pr:`17095` by `Nicolas Hug`_.

- |Fix| Increases the stability of :class:`cross_decomposition.CCA` :pr:`18746`
Member

@ogrisel ogrisel Nov 13, 2020

I think this is not just for CCA but also for the related PLSCanonical model, isn't it? PLSSVD uses a different fit method.

PLSCanonical uses mode="A", which does not rely on those pinv2 calls. So the what's new entry is correct.

@ogrisel
Member

ogrisel commented Nov 13, 2020

Let me try to push this and trigger a full build of all the wheels on all the architectures (32 bit and 64 bit windows and linux).

@ogrisel
Member

ogrisel commented Nov 13, 2020

Both the 32-bit Linux and Windows builds passed without the _UnstableArchMixin, which is apparently useless now. Let me delete it. Edit: _UnstableArchMixin is also used by LocallyLinearEmbedding.

I will also launch the arm64 test just to be sure. Unfortunately we don't have any PPC CI configuration, but I am pretty sure this fix would solve the numerical stability problems we observed there as well.

@ogrisel
Member

ogrisel commented Nov 13, 2020

It also passed on arm64. Waiting for the final Azure builds to complete, and then I plan to merge unless @NicolasHug or @thomasjpfan has an objection.

@ogrisel ogrisel merged commit 84bd4e2 into scikit-learn:master Nov 13, 2020
@ogrisel
Member

ogrisel commented Nov 13, 2020

Merged! Thanks @thomasjpfan for tracking down the cause of this issue. I am really glad to have a numerically stable CCA in scikit-learn after all these years.
