
[MRG] Add function score_samples to Pipeline (fix issue #12542) #13806

Merged

merged 23 commits into scikit-learn:master Jun 5, 2019

Conversation

@ab-anssi
Contributor

ab-anssi commented May 6, 2019

Reference Issues/PRs

Fix issue #12542.

What does this implement/fix? Explain your changes.

The pull request adds a score_samples function to the Pipeline, GridSearchCV, and RandomizedSearchCV classes (the code is very similar to that of the existing predict_proba, decision_function, and predict_log_proba functions).
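As a rough sketch of the delegation pattern this involves (simplified stand-in classes for illustration only, not the actual scikit-learn implementation, which additionally guards the method with if_delegate_has_method so it only exists when the final estimator supports it):

```python
class TinyPipeline:
    """Simplified stand-in for sklearn.pipeline.Pipeline (illustration only)."""
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def score_samples(self, X):
        Xt = X
        # Run the data through every transformer except the final step...
        for _, transformer in self.steps[:-1]:
            Xt = transformer.transform(Xt)
        # ...then delegate to the final estimator, just as predict_proba
        # and decision_function delegate in the real Pipeline.
        return self.steps[-1][1].score_samples(Xt)

class Doubler:
    def transform(self, X):
        return [2 * x for x in X]

class Scorer:
    def score_samples(self, X):
        return [-abs(x) for x in X]

pipe = TinyPipeline([('double', Doubler()), ('score', Scorer())])
print(pipe.score_samples([1, -2]))  # → [-2, -4]
```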

@ab-anssi ab-anssi changed the title from "[MRG] Add function score_samples to Pipeline (fix issue #12542)" to "[WIP] Add function score_samples to Pipeline (fix issue #12542)" May 7, 2019
@ab-anssi ab-anssi force-pushed the ab-anssi:pipeline__score_samples branch from 9b6d9e7 to 30e8c6d May 7, 2019
@ab-anssi

Contributor Author

ab-anssi commented May 7, 2019

I don't understand why it fails with Python 3.7 (failed tests: Linux pylatest_conda and Windows py37_64); the error messages are unclear to me. I have tried with OneClassSVM and LOF, and the same kind of error occurs.
Do you have any idea how to fix it?

@ab-anssi ab-anssi closed this May 7, 2019
@ab-anssi ab-anssi reopened this May 7, 2019
@jnothman

Member

jnothman commented May 8, 2019

@ab-anssi

Contributor Author

ab-anssi commented May 9, 2019

The error occurs in the test I added, but not when the newly implemented score_samples function is called: it is the call to fit that raises the error.

Failure in test test_pipeline_score_samples_pca_lof

    def test_pipeline_score_samples_pca_lof():
        iris = load_iris()
        X = iris.data
        # Test with PCA + OneClassSVM
        clf = LocalOutlierFactor(novelty=True)
        pca = PCA(svd_solver='full', n_components='mle', whiten=True)
        pipe = Pipeline([('pca', pca), ('lof', clf)])
        pipe.fit(X)

### The error message for pipe.fit(X)

X          = array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
  ...],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]])
clf        = LocalOutlierFactor(algorithm='auto', contamination='legacy', leaf_size=30,
                   metric='minkowski', metric_params=None, n_jobs=None,
                   n_neighbors=20, novelty=True, p=2)
iris       = {'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5,...idth (cm)', 'petal length (cm)', 'petal width (cm)'], 'filename': '/home/vsts/work/1/s/sklearn/datasets/data/iris.csv'}
pca        = PCA(copy=True, iterated_power='auto', n_components='mle', random_state=None,
    svd_solver='full', tol=0.0, whiten=True)
pipe       = Pipeline(memory=None,
         steps=[('pca',
                 PCA(copy=True, iterated_power='auto', n_components='mle...ms=None, n_jobs=None,
                                    n_neighbors=20, novelty=True, p=2))],
         verbose=False)

../1/s/sklearn/tests/test_pipeline.py:343: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../1/s/sklearn/pipeline.py:355: in fit
    self._final_estimator.fit(Xt, y, **fit_params)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = LocalOutlierFactor(algorithm='auto', contamination='legacy', leaf_size=30,
                   metric='minkowski', metric_params=None, n_jobs=None,
                   n_neighbors=20, novelty=True, p=2)
X = array([[-1.30533786,  0.64836932, -0.09981716],
       [-1.31993521, -0.35930856, -0.75257299],
       [-1.40496732, -...6008172,  0.46657302],
       [ 0.9244616 ,  0.23675217,  2.58618523],
       [ 0.67607348, -0.57379543,  1.29768343]])
y = None

    def fit(self, X, y=None):
        """Fit the model using X as training data.
    
        Parameters
        ----------
        X : {array-like, sparse matrix, BallTree, KDTree}
            Training data. If array or matrix, shape [n_samples, n_features],
            or [n_samples, n_samples] if metric='precomputed'.
    
        y : Ignored
            not used, present for API consistency by convention.
    
        Returns
        -------
        self : object
        """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
>                         FutureWarning)
E           FutureWarning: default contamination parameter 0.1 will change in version 0.22 to "auto". This will change the predict method behavior.

X          = array([[-1.30533786,  0.64836932, -0.09981716],
       [-1.31993521, -0.35930856, -0.75257299],
       [-1.40496732, -...6008172,  0.46657302],
       [ 0.9244616 ,  0.23675217,  2.58618523],
       [ 0.67607348, -0.57379543,  1.29768343]])
__class__  = <class 'sklearn.neighbors.lof.LocalOutlierFactor'>
self       = LocalOutlierFactor(algorithm='auto', contamination='legacy', leaf_size=30,
                   metric='minkowski', metric_params=None, n_jobs=None,
                   n_neighbors=20, novelty=True, p=2)
y          = None
@glemaitre

Contributor

glemaitre commented May 10, 2019

You need to catch the warning using pytest.mark.filterwarnings on top of the test.
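The suggested fix can be sketched as follows. Here fit_like_lof is a hypothetical stand-in for LocalOutlierFactor.fit, which emitted the FutureWarning seen in the traceback; the pytest mark itself is shown in comments, and the stdlib warnings machinery demonstrates the same filtering effect:

```python
import warnings

def fit_like_lof():
    # Stand-in for LocalOutlierFactor.fit, which emitted this FutureWarning.
    warnings.warn('default contamination parameter 0.1 will change '
                  'in version 0.22 to "auto".', FutureWarning)
    return 'fitted'

# In the test suite the fix is a mark on top of the test:
#
#     @pytest.mark.filterwarnings('ignore: default contamination parameter')
#     def test_pipeline_score_samples_pca_lof():
#         ...
#
# The mark installs the same kind of filter the stdlib provides directly:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    warnings.filterwarnings('ignore', message='default contamination parameter')
    result = fit_like_lof()

assert result == 'fitted'
assert caught == []  # the FutureWarning was swallowed by the filter
```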

@ab-anssi

Contributor Author

ab-anssi commented May 10, 2019

Thanks for your help! It seems OK now. Can you please review the PR?

@ab-anssi ab-anssi changed the title from "[WIP] Add function score_samples to Pipeline (fix issue #12542)" to "[MRG] Add function score_samples to Pipeline (fix issue #12542)" May 10, 2019
Member

jnothman left a comment

Thanks

@@ -342,8 +342,8 @@ def test_pipeline_score_samples_pca_lof():
     pca = PCA(svd_solver='full', n_components='mle', whiten=True)
     pipe = Pipeline([('pca', pca), ('lof', clf)])
     pipe.fit(X)
-    assert_equal(list(pipe.score_samples(X[:2])),
-                 [-0.9564705258081416, -1.0371487183896932])
+    assert_array_almost_equal(pipe.score_samples(X[:2]),

@jnothman

jnothman May 22, 2019

Member

assert_allclose is the preferred helper

@ab-anssi

ab-anssi May 22, 2019

Author Contributor

Done.

@thomasjpfan

thomasjpfan May 28, 2019

Member

assert_array_almost_equal is still being used.

@ab-anssi

ab-anssi May 28, 2019

Author Contributor

Sorry, I did something wrong when rebasing the branch on master and lost two commits along the way (removing assert_array_almost_equal and updating the what's new). It should be OK now.

Member

jnothman left a comment

I'm happy with supporting this.

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
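For reference, such an entry might look like the sketch below (the wording and section placement are illustrative only, following the general conventions of the whats_new files):

```rst
:mod:`sklearn.pipeline`
.......................

- |Enhancement| :class:`pipeline.Pipeline` now exposes ``score_samples`` when
  the final estimator implements it. :pr:`13806` by :user:`ab-anssi`.
```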

@ab-anssi ab-anssi force-pushed the ab-anssi:pipeline__score_samples branch 4 times, most recently from 7c1de43 to c29a559 May 22, 2019
@ab-anssi

Contributor Author

ab-anssi commented May 22, 2019

Thanks for your review.

@ab-anssi

Contributor Author

ab-anssi commented May 26, 2019

A test fails with the last version I have committed. However, it does not seem to be related to the changes I have made.

__________________________ test_scale_and_stability ___________________________

def test_scale_and_stability():
    # We test scale=True parameter
    # This allows to check numerical stability over platforms as well

    d = load_linnerud()
    X1 = d.data
    Y1 = d.target
    # causes X[:, -1].std() to be zero
    X1[:, -1] = 1.0

    # From bug #2821
    # Test with X2, T2 s.t. clf.x_score[:, 1] == 0, clf.y_score[:, 1] == 0
    # This test robustness of algorithm when dealing with value close to 0
    X2 = np.array([[0., 0., 1.],
                   [1., 0., 0.],
                   [2., 2., 2.],
                   [3., 5., 4.]])
    Y2 = np.array([[0.1, -0.2],
                   [0.9, 1.1],
                   [6.2, 5.9],
                   [11.9, 12.3]])

    for (X, Y) in [(X1, Y1), (X2, Y2)]:
        X_std = X.std(axis=0, ddof=1)
        X_std[X_std == 0] = 1
        Y_std = Y.std(axis=0, ddof=1)
        Y_std[Y_std == 0] = 1

        X_s = (X - X.mean(axis=0)) / X_std
        Y_s = (Y - Y.mean(axis=0)) / Y_std

        for clf in [CCA(), pls_.PLSCanonical(), pls_.PLSRegression(),
                    pls_.PLSSVD()]:
            clf.set_params(scale=True)
            X_score, Y_score = clf.fit_transform(X, Y)
            clf.set_params(scale=False)
            X_s_score, Y_s_score = clf.fit_transform(X_s, Y_s)
          assert_array_almost_equal(X_s_score, X_score)

E AssertionError:
E Arrays are not almost equal to 6 decimals
E
E Mismatch: 50%
E Max absolute difference: 7.04879453e-05
E Max relative difference: 0.00169015
E x: array([[-1.337317, -0.041776],
E [-1.108472, 0.0982 ],
E [ 0.407632, -0.103027],
E [ 2.038158, 0.046602]])
E y: array([[-1.337317, -0.041705],
E [-1.108472, 0.098154],
E [ 0.407632, -0.103083],
E [ 2.038158, 0.046634]])

X = array([[0., 0., 1.],
[1., 0., 0.],
[2., 2., 2.],
[3., 5., 4.]])
X1 = array([[ 5., 162., 1.],
[ 2., 110., 1.],
[ 12., 101., 1.],
[ 12., 105., 1.],
[ 1...0., 1.],
[ 4., 60., 1.],
[ 11., 230., 1.],
[ 15., 225., 1.],
[ 2., 110., 1.]])
X2 = array([[0., 0., 1.],
[1., 0., 0.],
[2., 2., 2.],
[3., 5., 4.]])
X_s = array([[-1.161895 , -0.7406129 , -0.43915503],
[-0.38729833, -0.7406129 , -1.02469508],
[ 0.38729833, 0.10580184, 0.14638501],
[ 1.161895 , 1.37542395, 1.3174651 ]])
X_s_score = array([[-1.3373174 , -0.04177564],
[-1.10847164, 0.09820033],
[ 0.40763151, -0.1030265 ],
[ 2.03815753, 0.04660181]])
X_score = array([[-1.3373174 , -0.04170515],
[-1.10847164, 0.09815418],
[ 0.40763151, -0.10308337],
[ 2.03815753, 0.04663434]])
X_std = array([1.29099445, 2.36290781, 1.70782513])
Y = array([[ 0.1, -0.2],
[ 0.9, 1.1],
[ 6.2, 5.9],
[11.9, 12.3]])
Y1 = array([[191., 36., 50.],
[189., 37., 52.],
[193., 38., 58.],
[162., 35., 62.],
[18...7., 62.],
[176., 37., 54.],
[157., 32., 52.],
[156., 33., 54.],
[138., 33., 68.]])
Y2 = array([[ 0.1, -0.2],
[ 0.9, 1.1],
[ 6.2, 5.9],
[11.9, 12.3]])
Y_s = array([[-0.85511537, -0.87878921],
[-0.70878547, -0.64915585],
[ 0.26065014, 0.19872118],
[ 1.3032507 , 1.32922388]])
Y_s_score = array([[-8.55115369e-01, -8.12697758e-14],
[-7.08785466e-01, 1.92383868e-13],
[ 2.60650139e-01, -2.01848251e-13],
[ 1.30325070e+00, 9.07341589e-14]])
Y_score = array([[-8.55115369e-01, -1.89870573e-10],
[-7.08785466e-01, 4.46868095e-10],
[ 2.60650139e-01, -4.69311104e-10],
[ 1.30325070e+00, 2.12313582e-10]])
Y_std = array([5.46709856, 5.66119834])
clf = CCA(copy=True, max_iter=500, n_components=2, scale=False, tol=1e-06)
d = {'data_filename': 'c:\hostedtoolcache\windows\python\3.5.4\x86\lib\site-packages\sklearn\datasets\data/linne..... topic:: References\n\n * Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.\n'}

@thomasjpfan

Member

thomasjpfan commented May 26, 2019

Merging with master should fix this issue.

@ab-anssi ab-anssi force-pushed the ab-anssi:pipeline__score_samples branch from c29a559 to a2953f5 May 26, 2019
ab-anssi added 3 commits May 28, 2019
ab-anssi and others added 2 commits May 29, 2019
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Contributor

NicolasHug left a comment

Please add a test making sure the appropriate error is raised if we try using score_samples() when the final estimator doesn't support it.

The function score_samples is then automatically available for GridSearchCV and RandomizedSearchCV.

I'm a bit confused; are you sure? If that's the case, it should be tested and mentioned in the what's new.

(resolved review thread on sklearn/pipeline.py)
@ab-anssi ab-anssi force-pushed the ab-anssi:pipeline__score_samples branch from 48e3208 to 87c90a9 May 29, 2019
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
@ab-anssi ab-anssi force-pushed the ab-anssi:pipeline__score_samples branch from 87c90a9 to 4f8723e May 29, 2019
@ab-anssi

Contributor Author

ab-anssi commented May 31, 2019

Hi @NicolasHug,
I will add the test you've mentioned.
Regarding GridSearchCV and RandomizedSearchCV, you are right: the score_samples function is not automatically available. I will add it to the BaseSearchCV class.

ab-anssi added 2 commits May 31, 2019
score_samples is then available for GridSearchCV and RandomizedSearchCV.
@ab-anssi

Contributor Author

ab-anssi commented May 31, 2019

I have added score_samples to BaseSearchCV, but I still need to add some tests. I have looked at the file sklearn/model_selection/tests/test_search.py but it is not straightforward for me to add tests for this new function. I will come back to you when I have a solution.

pca = PCA(svd_solver='full', n_components='mle', whiten=True)
clf = LogisticRegression()
pipe = Pipeline([('pca', pca), ('clf', clf)])
assert_raises_regex(

@NicolasHug

NicolasHug May 31, 2019

Contributor

We're trying to avoid assert_raises_regex and use with pytest.raises(): instead.
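The two styles can be contrasted with a minimal stdlib-only sketch. NoScoreSamples is a hypothetical stand-in for a final estimator that lacks the method; the deprecated helper and the preferred pytest idiom are shown in comments:

```python
import re

class NoScoreSamples:
    """Hypothetical stand-in for a final estimator lacking score_samples."""
    def fit(self, X, y=None):
        return self

est = NoScoreSamples().fit([[0.0]])

# Deprecated helper the review asks to drop:
#     assert_raises_regex(AttributeError, 'score_samples',
#                         est.score_samples, [[0.0]])
#
# Preferred pytest idiom:
#     with pytest.raises(AttributeError, match='score_samples'):
#         est.score_samples([[0.0]])
#
# The same check with only the stdlib:
try:
    est.score_samples([[0.0]])
    raised = False
except AttributeError as exc:
    # The default AttributeError message names the missing attribute.
    raised = re.search('score_samples', str(exc)) is not None

assert raised
```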

@ab-anssi

ab-anssi May 31, 2019

Author Contributor

I have updated the code accordingly.

@NicolasHug

Contributor

NicolasHug commented May 31, 2019

I don't think you can implement this for the search estimators. Outlier detectors don't have a proper scoring so they can't be used with grid search or random search.

ab-anssi added 2 commits May 31, 2019
Review by NicolasHug.
@ab-anssi

Contributor Author

ab-anssi commented May 31, 2019

Ok. I have removed the score_samples method from BaseSearchCV.

@ab-anssi

Contributor Author

ab-anssi commented May 31, 2019

I think I have taken all your reviews into account. Thanks for your time!



def test_score_samples_on_pipeline_without_score_samples():
iris = load_iris()

@NicolasHug

NicolasHug May 31, 2019

Contributor

For this kind of error-checking test, you don't need real-world data or good parameters. We just need the test to check what's needed and to run fast.

pipe = make_pipeline(LogisticRegression()) is enough

You could also just hard code X and y, or use make_classification

LGTM regardless, thanks @ab-anssi
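Following that advice, the pared-down test might look like the sketch below (assumes a scikit-learn version where Pipeline.score_samples exists, plus pytest; the toy data is hard-coded, as suggested):

```python
import pytest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def test_score_samples_on_pipeline_without_score_samples():
    # Hard-coded toy data: an error-checking test needs no real dataset.
    X = [[0.0], [1.0], [2.0], [3.0]]
    y = [0, 0, 1, 1]
    pipe = make_pipeline(LogisticRegression())
    pipe.fit(X, y)
    # LogisticRegression has no score_samples, so the pipeline must not
    # expose one either; accessing it raises AttributeError.
    with pytest.raises(AttributeError):
        pipe.score_samples(X)

test_score_samples_on_pipeline_without_score_samples()
```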

@@ -330,6 +332,37 @@ def test_pipeline_methods_pca_svm():
pipe.score(X, y)


@pytest.mark.filterwarnings('ignore: default contamination parameter') # 0.22

@glemaitre

glemaitre Jun 3, 2019

Contributor

You don't need this line anymore; we merged the PR that removes the deprecation for 0.22.

@ab-anssi

ab-anssi Jun 5, 2019

Author Contributor

Ok.

@glemaitre

Contributor

glemaitre commented Jun 3, 2019

2 small changes and I think that we are good.

ab-anssi and others added 2 commits Jun 5, 2019
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@ab-anssi

Contributor Author

ab-anssi commented Jun 5, 2019

@glemaitre, I have taken all your comments into account. Thanks!

@glemaitre glemaitre merged commit ec35ed2 into scikit-learn:master Jun 5, 2019
16 checks passed

LGTM analysis: C/C++ No code changes detected
LGTM analysis: JavaScript No code changes detected
LGTM analysis: Python No new or fixed alerts
ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: doc Your tests passed on CircleCI!
ci/circleci: doc-min-dependencies Your tests passed on CircleCI!
ci/circleci: lint Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 96.8%)
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +3.19% compared to 9adba49
scikit-learn.scikit-learn Build #20190605.7 succeeded
scikit-learn.scikit-learn (Linux py35_conda_openblas) Linux py35_conda_openblas succeeded
scikit-learn.scikit-learn (Linux py35_np_atlas) Linux py35_np_atlas succeeded
scikit-learn.scikit-learn (Linux pylatest_conda) Linux pylatest_conda succeeded
scikit-learn.scikit-learn (Windows py35_32) Windows py35_32 succeeded
scikit-learn.scikit-learn (Windows py37_64) Windows py37_64 succeeded
scikit-learn.scikit-learn (macOS pylatest_conda) macOS pylatest_conda succeeded
@ab-anssi ab-anssi deleted the ab-anssi:pipeline__score_samples branch Jun 5, 2019
koenvandevelde added a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019