
[MRG+1] Enforce deterministic output in kernel PCA #13241

Merged
merged 8 commits into scikit-learn:master on Feb 26, 2019

Conversation

5 participants
@bellet (Contributor) commented Feb 25, 2019

Fixes #8798

This PR adds a call to svd_flip to ensure that the output of KPCA is deterministic, as done for PCA.

Example code:

import numpy as np
from sklearn.decomposition import KernelPCA

np.random.seed(42)
data = np.random.rand(12).reshape(4, 3)

for i in range(10):
    kpca = KernelPCA(n_components=2, eigen_solver='arpack')
    print(kpca.fit_transform(data)[0])

Output before this PR:

[-0.42930193 -0.11840723]
[0.42930193 0.11840723]
[-0.42930193 -0.11840723]
[0.42930193 0.11840723]
[-0.42930193 -0.11840723]
[0.42930193 0.11840723]
[-0.42930193 -0.11840723]
[ 0.42930193 -0.11840723]
[0.42930193 0.11840723]
[-0.42930193 -0.11840723]

With this PR:

[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]

Some comments:

  • Calling svd_flip directly requires copying the fitted self.alphas_, which is not great. The alternative would be to add an equivalent of svd_flip for eigendecomposition (taking a single input instead of two); a sketch of such a helper is given after this list. I can easily do this if this is the preferred solution.
  • This deterministic behavior does not seem to be tested for PCA, so I did not add any test here either. Can do if needed.
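
For reference, a minimal sketch of what such a single-input helper could look like (the helper name eig_flip is hypothetical and not part of this PR; it mirrors the u-based sign convention of scikit-learn's svd_flip):

import numpy as np

def eig_flip(v):
    # Flip the sign of each eigenvector (column of v) so that its
    # largest-magnitude entry is positive, making the output of an
    # eigendecomposition deterministic up to this convention.
    max_abs_rows = np.argmax(np.abs(v), axis=0)
    signs = np.sign(v[max_abs_rows, np.arange(v.shape[1])])
    return v * signs
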
@agramfort (Member) commented Feb 25, 2019

I confirm that it fixes the problem, but can you still add a test and update what's new?

Yes, it should be done for PCA too.

@GaelVaroquaux added this to To do in Sprint Paris 2019 via automation Feb 25, 2019

@GaelVaroquaux moved this from To do to Needs review in Sprint Paris 2019 Feb 25, 2019

@bellet (Contributor, Author) commented Feb 25, 2019

I confirm that it fixes the problem, but can you still add a test and update what's new?

Yes, it should be done for PCA too.

Done.

@@ -71,6 +71,23 @@ def test_kernel_pca_consistent_transform():
    assert_array_almost_equal(transformed1, transformed2)


def test_kernel_pca_deterministic_output():
    state = np.random.RandomState(0)

@agramfort (Member) commented Feb 25, 2019

state -> rng

it's how we usually name RandomState instances.

@agramfort (Member) left a comment

Besides my nitpick, good to go from my end.

@bellet bellet changed the title [MRG] Enforce deterministic output in kernel PCA [MRG+1] Enforce deterministic output in kernel PCA Feb 25, 2019

@bellet (Contributor, Author) commented Feb 25, 2019

Done!

# flip eigenvectors' sign to enforce deterministic output
# note: copying the second element is needed so that both inputs do
# not refer to the same object
self.alphas_, _ = svd_flip(self.alphas_, self.alphas_.copy().T)

@vene (Member) commented Feb 25, 2019

Would it work to do

svd_flip(self.alphas_, np.empty_like(self.alphas_).T)

to avoid a memory copy?

@bellet (Contributor, Author) commented Feb 25, 2019

Correct, since the value of the second argument does not affect our output. Done.
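
A quick check of this claim (a sketch relying on svd_flip's default u_based_decision=True, under which the signs are derived from u alone and v is only carried along; svd_flip operates in place on its inputs here, hence the copies):

import numpy as np
from sklearn.utils.extmath import svd_flip

u = np.array([[1.0, -2.0],
              [3.0, 1.0]])
# Different contents in the second argument: only its shape matters
# for the returned u, not its values.
u1, _ = svd_flip(u.copy(), np.zeros_like(u).T)
u2, _ = svd_flip(u.copy(), np.ones_like(u).T)
print(np.allclose(u1, u2))  # True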

    transformed_X[i, :] = kpca.fit_transform(X)[0]
    i += 1

assert np.isclose(transformed_X, transformed_X[0, :]).all()

@rth (Member) commented Feb 25, 2019

Better to use numpy.testing.assert_allclose.

@rth (Member) commented Feb 25, 2019

I find that understanding what is done in this test takes some effort, because we check equality across both different random seeds and solvers (and we store the results in a matrix with a non-standard (n_random_seeds*n_solvers, n_components) shape). Maybe something like this would be easier:

    for solver in eigen_solver: 
        transformed_X = np.zeros((10, 2))
        for i in range(10):
            kpca = KernelPCA(n_components=2, eigen_solver=solver,
                             random_state=i)
            transformed_X[i, :] = kpca.fit_transform(X)[0]

        assert_allclose(transformed_X, transformed_X[0, :])

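
For context, a self-contained version of this suggested test could look as follows (a sketch only: the data X, the list of solvers, and the seeds are illustrative choices, not the exact code merged in this PR):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.decomposition import KernelPCA

X = np.random.RandomState(0).rand(10, 5)

for solver in ['auto', 'dense', 'arpack']:
    transformed_X = np.zeros((10, 2))
    for i in range(10):
        kpca = KernelPCA(n_components=2, eigen_solver=solver,
                         random_state=i)
        # only the first transformed sample is needed to spot sign flips
        transformed_X[i, :] = kpca.fit_transform(X)[0]
    assert_allclose(transformed_X, transformed_X[0, :])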

@vene (Member) commented Feb 25, 2019

Also, is it needed to set the random state to i rather than just use random_state=rng?

@bellet (Contributor, Author) commented Feb 25, 2019

random_state=rng would use the same seed for all runs. The point here is to check that the arpack random state does not result in different solutions because of sign ambiguity.

@bellet (Contributor, Author) commented Feb 25, 2019

I used random_state=i to show explicitly that the random state changes from one run to the next. But removing it would be valid as well, as the random state would still change.

@vene (Member) commented Feb 25, 2019

I may be wrong, but I don't agree that random_state=rng would use the same seed, unless rng is an int. Here, rng is a RandomState object (initialized a few lines above), so it has internal state. For instance:

~{main}$ ipython
Python 3.7.2 (default, Feb 11 2019, 11:12:39)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: rng = np.random.RandomState(0)

In [3]: rng.randn(3)
Out[3]: array([1.76405235, 0.40015721, 0.97873798])

In [4]: rng.randn(3)
Out[4]: array([ 2.2408932 ,  1.86755799, -0.97727788])

@bellet (Contributor, Author) commented Feb 25, 2019

You're right, I can switch to that.

    transformed_X[i, :] = pca.fit_transform(X)[0]
    i += 1

assert np.isclose(transformed_X, transformed_X[0, :]).all()

@rth (Member) commented Feb 25, 2019

Same comments as above.

# note: copying the second element is needed so that both inputs do
# not refer to the same object
self.alphas_, _ = svd_flip(self.alphas_,
                           np.empty_like(self.alphas_).T)

@vene (Member) commented Feb 25, 2019

Awesome, thanks! Maybe remove the note above, since it's no longer relevant.

bellet added some commits Feb 25, 2019

@bellet (Contributor, Author) commented Feb 25, 2019

I have simplified the tests as suggested by @vene and @rth.

@vene (Member) commented Feb 25, 2019

LGTM pending CI pass! Thanks @bellet

@vene approved these changes Feb 25, 2019

@agramfort merged commit 8647658 into scikit-learn:master Feb 26, 2019

11 checks passed:

  • LGTM analysis: C/C++ (no code changes detected)
  • LGTM analysis: JavaScript (no code changes detected)
  • LGTM analysis: Python (no new or fixed alerts)
  • ci/circleci: deploy (tests passed on CircleCI)
  • ci/circleci: doc (tests passed on CircleCI)
  • ci/circleci: doc-min-dependencies (tests passed on CircleCI)
  • ci/circleci: lint (tests passed on CircleCI)
  • codecov/patch (100% of diff hit, target 92.49%)
  • codecov/project (absolute coverage decreased by <.01%, relative coverage increased by +7.5% compared to f8b108d)
  • continuous-integration/appveyor/pr (AppVeyor build succeeded)
  • continuous-integration/travis-ci/pr (Travis CI build passed)

Sprint Paris 2019 automation moved this from Needs review to Done Feb 26, 2019

@agramfort (Member) commented Feb 26, 2019

Thanks @bellet!

@rth (Member) commented Feb 26, 2019

Thanks!

Does this mean we can remove the _UnstableOn32BitMixin from the KernelPCA class, so that tests will pass on 32-bit architectures as well? (The corresponding estimator tag is non_deterministic.)

Kiku-git added a commit to Kiku-git/scikit-learn that referenced this pull request Mar 4, 2019

[MRG+1] Enforce deterministic output in kernel PCA (scikit-learn#13241)
* enforce deterministic output in kernel PCA

* add tests and update whats new

* replace state by rng

* simplified assert

* avoid copy

* clarify tests

* remove now useless comment

* use rng as seed everywhere
@ogrisel (Member) commented Mar 5, 2019

The cron jobs for the nightly builds on Linux 32-bit pass:

https://travis-ci.org/MacPython/scikit-learn-wheels/builds
