
[MRG+1] SPCA centering and fixing scaling issue #11585

Merged
merged 10 commits into from Jul 20, 2018

Conversation

@FollowKenny (Contributor) commented Jul 17, 2018

Reference Issues/PRs

fixes #9394

What does this implement/fix? Explain your changes.

@andrewww reported a scaling issue with SPCA: the transform scaling depended on the number of samples. While investigating this issue, we found several disturbing things:

  • In the transform of SPCA, the last lines consisted of a peculiar normalization sequence. These lines were the cause of the issue.
  • SPCA with alpha=0 and ridge_alpha=0 and the standard PCA gave inconsistent results.
  • The data were not centered during fit.
  • The data were not centered during transform either.
  • The components were not normalized. I'm still not sure that they should be; from what I saw in this paper I'm inclined to say yes, but I did not read it thoroughly yet and I did not cross-check sources.

This PR aims to fix all that. With all the fixes implemented, the scaling issue disappears, and the transforms from PCA and SPCA(alpha=0, ridge_alpha=0) give exactly the same results.
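The batch-size dependence came from a per-call normalization of the codes in transform. A minimal numpy sketch of the mechanism (the ridge closed form and the normalization step are reconstructed here for illustration; the components are random stand-ins, not fitted sparse components):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 8)
components = rng.randn(3, 8)  # stand-in for fitted components_

def transform_old(X, components, ridge_alpha=0.01):
    # ridge regression of X onto the components (closed form, per sample)
    C = components
    U = X @ C.T @ np.linalg.inv(C @ C.T + ridge_alpha * np.eye(C.shape[0]))
    # the problematic step: each column is divided by its norm over the
    # current batch, so the codes depend on which samples are passed in
    s = np.sqrt((U ** 2).sum(axis=0))
    s[s == 0] = 1
    return U / s

def transform_new(X, components, ridge_alpha=0.01):
    # fixed behaviour: plain ridge codes, no per-batch normalization
    C = components
    return X @ C.T @ np.linalg.inv(C @ C.T + ridge_alpha * np.eye(C.shape[0]))

# old behaviour: codes for the same rows change with the batch size
print(np.allclose(transform_old(X, components)[:5],
                  transform_old(X[:5], components)))  # False

# fixed behaviour: each row's code is independent of the batch
print(np.allclose(transform_new(X, components)[:5],
                  transform_new(X[:5], components)))  # True
```

This is exactly why the bug report saw transform results scale with the number of samples.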

Any other comments?

TODO :

  • Decide on the normalization strategy (read more thoroughly)
  • Implement the fixes inside a deprecation path
  • Implement the fixes for MiniBatchSparsePCA
  • Implement scaling test (from the bug report)
  • Implement PCA vs SPCA(0, 0) test (to check the new behaviour is correct)
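The deprecation-path item can be sketched as follows (a hypothetical toy estimator, not the actual scikit-learn code; the warning text is one plausible wording):

```python
import warnings
import numpy as np

class TinyEstimator:
    """Toy estimator illustrating the deprecation path (hypothetical class)."""
    def __init__(self, normalize_components=False):
        self.normalize_components = normalize_components

    def fit(self, X):
        if self.normalize_components:
            # new, correct behaviour: center the data with the train mean
            self.mean_ = np.mean(X, axis=0)
            X = X - self.mean_
        else:
            # old behaviour kept as the default during the deprecation window
            warnings.warn("normalize_components=False is a backward-compatible "
                          "setting that implements a non-standard definition of "
                          "sparse PCA. This compatibility mode will be removed "
                          "in 0.22.", DeprecationWarning)
        return self

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    TinyEstimator().fit(np.zeros((4, 2)))
print(any(issubclass(w.category, DeprecationWarning) for w in caught))  # True
```

Passing normalize_components=True opts into the fixed behaviour and emits no warning.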
Ivan PANICO added 3 commits Jul 17, 2018
@FollowKenny (Contributor, Author) commented Jul 17, 2018

@agramfort Ready for a first review (if tests pass)

@FollowKenny FollowKenny changed the title [WIP] SPCA centering and fixing scaling issue [MRG] SPCA centering and fixing scaling issue Jul 17, 2018
Ivan PANICO
@@ -66,6 +66,13 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.
normalize_components : boolean

@massich (Contributor) commented Jul 17, 2018

I would show in the docstring that normalize_components is optional and what its default is.
like here:

bootstrap_features : boolean, optional (default=False)

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used

@massich (Contributor) commented Jul 17, 2018

If normalize_components=False should not be used but it's the default for backward compatibility, the usual deprecation cycle should be added.

@massich (Contributor) commented Jul 17, 2018

My fault, you already have the deprecation warning. You should state in the docstring when this was added and when it will be removed.

@massich (Contributor) commented Jul 17, 2018

class SparsePCA(BaseEstimator, TransformerMixin):
    """Sparse Principal Components Analysis (SparsePCA)
 
...
    Parameters
    ----------
...
    normalize_components : boolean, optional (default=False) 
        False : Use a version of Sparse PCA without ...

        .. versionadded:: 0.20
        .. deprecated:: 0.22
       ``normalize_components`` was added and set to ``False`` for backward
       compatibility. It would be set to ``True`` from 0.22 onwards.

@massich (Contributor) commented Jul 17, 2018

Now I'm not sure if we do this, so feel free to contradict me.

@GaelVaroquaux (Member) commented Jul 17, 2018

Yes, technically we add the parameter, then change its value, then remove it. It's long!

@glemaitre (Contributor) commented Jul 17, 2018

If it is a bug fix, we should just change the behavior, shouldn't we?

@massich (Contributor) left a comment

Would normalize_components be removed in version 0.22 and kept to True?

Or would that be a second deprecation cycle, with removal in 0.24?

If removed in 0.22, it should be stated in the deprecation messages.

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used

@massich (Contributor) commented Jul 17, 2018

Now I'm not sure if we do this, so feel free to contradict me.

random_state=rng, normalize_components=True)
results_train = spca_lars.fit_transform(Y)
results_test = spca_lars.transform(Y[:10])
assert_array_equal(results_train[0], results_test[0])

@GaelVaroquaux (Member) commented Jul 17, 2018

This is probably too stringent (and it's failing on travis). You probably want to use assert_array_almost_equal.

@glemaitre (Contributor) commented Jul 17, 2018

assert_allclose in fact
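The distinction matters because the codes come out of an iterative solver, so exact float equality is brittle. A quick illustration:

```python
import numpy as np
from numpy.testing import assert_allclose

a = np.array([0.1 + 0.2, 0.3])
b = np.array([0.3, 0.3])

print(a[0] == b[0])    # False: 0.1 + 0.2 != 0.3 exactly in floating point
assert_allclose(a, b)  # passes: equal within the default relative tolerance
```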

self.mean_ = np.mean(X, axis=0)
X = X - self.mean_
else:
warnings.warn("normalize_components should be set to True. "

@GaelVaroquaux (Member) commented Jul 17, 2018

I would be more explicit: "normalize_components=False is a backward-compatible setting that implements a non-standard definition of sparse PCA. This compatibility mode will be removed in 0.22."

@GaelVaroquaux (Member) left a comment

Aside from the small comments I made (including fixing travis), there needs to be an entry added to whats_new. This is an important change.

@glemaitre (Contributor) commented Jul 17, 2018

I am confused here. Do we actually want to keep the buggy behaviour?

@GaelVaroquaux (Member) commented Jul 17, 2018

@@ -66,6 +66,13 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.
normalize_components : boolean
False : Use a version of Sparse PCA without components normalization

@glemaitre (Contributor) commented Jul 17, 2018

Use indent

* If False, ...
* If True, ...
@@ -66,6 +66,13 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.
normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's

@glemaitre (Contributor) commented Jul 17, 2018

centring -> centering

normalize_components : boolean
False : Use a version of Sparse PCA without components normalization
and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used

@glemaitre (Contributor) commented Jul 17, 2018

If it is a bug fix, we should just change the behavior, shouldn't we?

and without data centring. This is likely a bug and eventhough it's
the default for backward compatibility, this should not be used
True : Use a version of Sparse PCA with components normalization
and data centring

@glemaitre (Contributor) commented Jul 17, 2018

centering

@glemaitre (Contributor) commented Jul 17, 2018

add a full stop.

@@ -77,6 +84,11 @@ class SparsePCA(BaseEstimator, TransformerMixin):
n_iter_ : int
Number of iterations run.
mean_ : array, shape (n_features,)
Per-feature empirical mean, estimated from the training set.

@glemaitre (Contributor) commented Jul 17, 2018

remove this blank line

mean_ : array, shape (n_features,)
Per-feature empirical mean, estimated from the training set.
Equal to `X.mean(axis=0)`.

@glemaitre (Contributor) commented Jul 17, 2018

double backticks

@FollowKenny (Contributor, Author) commented Jul 17, 2018

All right. I’m off for tonight but I’ll take care of it tomorrow evening if that’s all right. I’ll take all the comments into account and update the what’s new. Are we still OK with the double deprecation path we chose initially (0.20: deprecate the default option False; 0.22: deprecate the param and change the default to True; 0.24: delete the param)?

Ivan PANICO
@FollowKenny (Contributor, Author) commented Jul 18, 2018

@GaelVaroquaux @glemaitre @massich Is travis supposed to fail on DeprecationWarnings? Because I think, except for some last changes, that's the last step.

@GaelVaroquaux (Member) commented Jul 19, 2018

Yes, travis is supposed to fail on uncaught deprecation warnings. You should catch them in the test, using warnings.simplefilter or @pytest.mark.filterwarnings.
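A sketch of both options (the warned function here is a stand-in, not the actual estimator):

```python
import warnings
import pytest

def legacy_fit():
    # stand-in for SparsePCA(normalize_components=False).fit(...)
    warnings.warn("normalize_components should be set to True.",
                  DeprecationWarning)

# Option 1: assert (and thereby catch) the warning explicitly in the test
def test_warns():
    with pytest.warns(DeprecationWarning, match="normalize_components"):
        legacy_fit()

# Option 2: silence it for a whole test with a marker (only acts under pytest)
@pytest.mark.filterwarnings("ignore:normalize_components:DeprecationWarning")
def test_silenced():
    legacy_fit()

test_warns()  # the pytest.warns context also works outside the pytest runner
```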

@agramfort (Member) commented Jul 19, 2018

The maths are correct (I think).

@FollowKenny you need to make sure now that normalize_components is set to True in all examples and documentation pages, so that no deprecation warning pops up.

Ivan PANICO added 2 commits Jul 19, 2018
@FollowKenny (Contributor, Author) commented Jul 19, 2018

I used git grep "SparsePCA(" and it only showed examples/decomposition/plot_faces_decomposition.py. Am I missing something, or is it safe to assume this is the only doc update?
Tests are running.

@@ -237,5 +237,6 @@ def test_pca_vs_spca():
def test_spca_deprecation_warning(spca):
rng = np.random.RandomState(0)
Y, _, _ = generate_toy_data(3, 10, (8, 8), random_state=rng)
with pytest.warns(DeprecationWarning, match="normalize_components"):
spca(normalize_components=False).fit(Y)
warn_message ="normalize_components"

@jeremiedbb (Member) commented Jul 19, 2018

Why not X? Y is a bit confusing.

@FollowKenny (Author, Contributor) commented Jul 19, 2018

Indeed, but all the other tests are with Y (and equally confusing), so I kept the local convention. Should I change this?

Ivan PANICO
@GaelVaroquaux GaelVaroquaux changed the title [MRG] SPCA centering and fixing scaling issue [MRG+1] SPCA centering and fixing scaling issue Jul 20, 2018
@GaelVaroquaux (Member) left a comment

LGTM.

+1 for merge

@GaelVaroquaux GaelVaroquaux added the Bug label Jul 20, 2018
@GaelVaroquaux GaelVaroquaux added this to the 0.20 milestone Jul 20, 2018
@glemaitre (Contributor) left a comment

Couple of nitpicks before merging

@@ -66,6 +66,20 @@ class SparsePCA(BaseEstimator, TransformerMixin):
If None, the random number generator is the RandomState instance used
by `np.random`.
normalize_components : boolean, optional (default=False)
- if False, Use a version of Sparse PCA without components

@glemaitre (Contributor) commented Jul 20, 2018

if False, Use -> If False, use

normalization and without data centering. This is likely a bug and
even though it's the default for backward compatibility,
this should not be used.
- if True, Use a version of Sparse PCA with components normalization

@glemaitre (Contributor) commented Jul 20, 2018

if True, Use -> If True, use

and data centering.
.. versionadded:: 0.20
.. deprecated:: 0.22

@glemaitre (Contributor) commented Jul 20, 2018

I would add a line in between the version added and deprecated.

.. versionadded:: 0.20
.. deprecated:: 0.22
``normalize_components`` was added and set to ``False`` for

@glemaitre (Contributor) commented Jul 20, 2018

You should start the line under the "d" of deprecated of the previous line.
Refer to: http://www.sphinx-doc.org/en/stable/markup/para.html#directive-deprecated for an example
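For reference, the alignment described above looks like this inside a docstring (a sketch of the layout only, not the final wording):

```python
class SparsePCA:
    """Sparse Principal Components Analysis (SparsePCA).

    Parameters
    ----------
    normalize_components : boolean, optional (default=False)
        Whether to normalize the components and center the data.

        .. versionadded:: 0.20

        .. deprecated:: 0.22
           ``normalize_components`` was added and set to ``False`` for
           backward compatibility. It will be set to ``True`` from 0.22
           onwards.
    """

print("deprecated" in SparsePCA.__doc__)  # True
```

The directive body starts under the "d" of "deprecated", one blank line separates the two directives, and the continuation lines keep that same indent.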

from sklearn.utils import check_random_state

import pytest

@glemaitre (Contributor) commented Jul 20, 2018

Put this import in line 5 above import numpy as np

@@ -116,6 +136,17 @@ def fit(self, X, y=None):
"""
random_state = check_random_state(self.random_state)
X = check_array(X)

if self.normalize_components:
self.mean_ = np.mean(X, axis=0)

@glemaitre (Contributor) commented Jul 20, 2018

I would almost call X.mean(axis=0), but it is true that we don't support sparse matrices.

random_state=0).fit(Y).transform(Y)
random_state=0,
normalize_components=norm_comp)\
.fit(Y).transform(Y)

@glemaitre (Contributor) commented Jul 20, 2018

avoid that \. Just create an instance and then call fit_transform on a second line. It is more readable.

finally:
joblib_par.multiprocessing = _mp
else: # we can efficiently use parallelism
U2 = MiniBatchSparsePCA(n_components=3, n_jobs=2, alpha=alpha,
random_state=0).fit(Y).transform(Y)
random_state=0,
normalize_components=norm_comp)\

@glemaitre (Contributor) commented Jul 20, 2018

Same comment as before.

model.fit(rng.randn(5, 4))
assert_array_equal(model.components_, V_init)
if norm_comp:
assert_array_equal(model.components_,

@glemaitre (Contributor) commented Jul 20, 2018

It seems to be a float comparison. Shall we use assert_allclose?

@FollowKenny (Author, Contributor) commented Jul 20, 2018

All right, but in the else branch I'm leaving assert_array_equal, as this comes from the previous test and it should really be equal since nothing is done on the array in this case.

assert_array_equal(model.components_,
V_init / np.linalg.norm(V_init, axis=1)[:, None])
else:
assert_array_equal(model.components_, V_init)

@glemaitre (Contributor) commented Jul 20, 2018

It seems to be a float comparison. Shall we use assert_allclose?

@FollowKenny (Author, Contributor) commented Jul 20, 2018

Ah, I did not see that you suggested this change too, hence my previous comment...

Ivan PANICO
@glemaitre (Contributor) commented Jul 20, 2018

Waiting for the CI to be green

@glemaitre glemaitre merged commit 6eb1983 into scikit-learn:master Jul 20, 2018
7 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 95.12% of diff hit (within 1% threshold of 95.37%)
codecov/project 95.31% (-0.06%) compared to 58fa28e
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
@glemaitre (Contributor) commented Jul 20, 2018

@FollowKenny Thanks a lot!!!!

@FollowKenny (Contributor, Author) commented Jul 20, 2018

Awesome! @agramfort @GaelVaroquaux @glemaitre @massich Thanks for your help!

@@ -182,6 +183,14 @@ Decomposition, manifold learning and clustering
This applies to the dictionary and sparse code.
:issue:`6374` by :user:`John Kirkham <jakirkham>`.

- :class:`decomposition.SparsePCA` now exposes ``normalize_components``. When
set to True, the train and test data are centered with the train mean
repsectively during the fit phase and the transform phase. This fixes the

@amueller (Member) commented Jul 21, 2018

respectively

@amueller (Member) commented Jul 21, 2018

I'm surprised we do a backward deprecation for a bug. We haven't really done that in the past and I'm -0 on it. @jnothman do you have an opinion?
I'd rather remove the deprecation (does that count as talking you out of it, @GaelVaroquaux?).
My main argument would be consistency with how we usually do things, which is to break behavior if we consider it a bug. I haven't looked at the issue, though from the description it sounds like a bug.

@GaelVaroquaux (Member) commented Jul 21, 2018

Sure, if you want to remove the deprecation, it does count as talking me out of it.

It's really a change in behavior, but the previous behavior had no statistical meaning whatsoever.

@GaelVaroquaux (Member) commented Jul 21, 2018

@FollowKenny : thanks a lot for this work. It was hard and important.
