
[MRG] Add function score_samples to Pipeline (fix issue #12542) #13806

Merged
merged 23 commits into scikit-learn:master on Jun 5, 2019

Conversation

@ab-anssi ab-anssi commented May 6, 2019

Reference Issues/PRs

Fix issue #12542.

What does this implement/fix? Explain your changes.

The pull request adds a score_samples function to the Pipeline, GridSearchCV, and RandomizedSearchCV classes (the code is very similar to that of the predict_proba, decision_function, and predict_log_proba functions).
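For illustration, here is a minimal sketch of what the added method enables, using the same PCA + LocalOutlierFactor combination as this PR's own test (the parameter choices are simplified, and running it requires a scikit-learn release that contains this change, i.e. 0.22 or later):

```python
# Sketch: Pipeline.score_samples transforms X through every intermediate
# step, then delegates to the final estimator, mirroring predict_proba
# and decision_function.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline

X = load_iris().data
pipe = Pipeline([('pca', PCA(n_components=2)),
                 ('lof', LocalOutlierFactor(novelty=True))])
pipe.fit(X)

# One call, instead of transforming manually and scoring separately:
scores = pipe.score_samples(X)

# Manual equivalent for comparison: transform through all but the last
# step, then score with the final estimator.
Xt = pipe[:-1].transform(X)
manual = pipe.named_steps['lof'].score_samples(Xt)
assert np.allclose(scores, manual)
```

The delegation pattern keeps the pipeline's contract consistent: a method exists on the pipeline exactly when the final estimator supports it.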

@ab-anssi ab-anssi changed the title [MRG] Add function score_samples to Pipeline (fix issue #12542) [WIP] Add function score_samples to Pipeline (fix issue #12542) May 7, 2019
@ab-anssi ab-anssi force-pushed the pipeline__score_samples branch from 9b6d9e7 to 30e8c6d Compare May 7, 2019
@ab-anssi ab-anssi commented May 7, 2019

I don't understand why it does not work with Python 3.7 (failed tests: Linux pylatest_conda and Windows py37_64), and the error messages don't help me. I have tried with both OneClassSVM and LocalOutlierFactor, and the same kind of error occurs in each case.
Do you have any idea how to fix it?

@ab-anssi ab-anssi closed this May 7, 2019
@ab-anssi ab-anssi reopened this May 7, 2019
@jnothman jnothman commented May 8, 2019

@ab-anssi ab-anssi commented May 9, 2019

The error occurs in the test I have added, but not when the newly implemented score_samples function is called: it is the call to fit that raises the error.

Failure in test test_pipeline_score_samples_pca_lof

    def test_pipeline_score_samples_pca_lof():
        iris = load_iris()
        X = iris.data
        # Test with PCA + LocalOutlierFactor
        clf = LocalOutlierFactor(novelty=True)
        pca = PCA(svd_solver='full', n_components='mle', whiten=True)
        pipe = Pipeline([('pca', pca), ('lof', clf)])
        pipe.fit(X)

### The error message for pipe.fit(X)

X          = array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
  ...],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]])
clf        = LocalOutlierFactor(algorithm='auto', contamination='legacy', leaf_size=30,
                   metric='minkowski', metric_params=None, n_jobs=None,
                   n_neighbors=20, novelty=True, p=2)
iris       = {'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5,...idth (cm)', 'petal length (cm)', 'petal width (cm)'], 'filename': '/home/vsts/work/1/s/sklearn/datasets/data/iris.csv'}
pca        = PCA(copy=True, iterated_power='auto', n_components='mle', random_state=None,
    svd_solver='full', tol=0.0, whiten=True)
pipe       = Pipeline(memory=None,
         steps=[('pca',
                 PCA(copy=True, iterated_power='auto', n_components='mle...ms=None, n_jobs=None,
                                    n_neighbors=20, novelty=True, p=2))],
         verbose=False)

../1/s/sklearn/tests/test_pipeline.py:343: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../1/s/sklearn/pipeline.py:355: in fit
    self._final_estimator.fit(Xt, y, **fit_params)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = LocalOutlierFactor(algorithm='auto', contamination='legacy', leaf_size=30,
                   metric='minkowski', metric_params=None, n_jobs=None,
                   n_neighbors=20, novelty=True, p=2)
X = array([[-1.30533786,  0.64836932, -0.09981716],
       [-1.31993521, -0.35930856, -0.75257299],
       [-1.40496732, -...6008172,  0.46657302],
       [ 0.9244616 ,  0.23675217,  2.58618523],
       [ 0.67607348, -0.57379543,  1.29768343]])
y = None

    def fit(self, X, y=None):
        """Fit the model using X as training data.
    
        Parameters
        ----------
        X : {array-like, sparse matrix, BallTree, KDTree}
            Training data. If array or matrix, shape [n_samples, n_features],
            or [n_samples, n_samples] if metric='precomputed'.
    
        y : Ignored
            not used, present for API consistency by convention.
    
        Returns
        -------
        self : object
        """
        if self.contamination == "legacy":
            warnings.warn('default contamination parameter 0.1 will change '
                          'in version 0.22 to "auto". This will change the '
                          'predict method behavior.',
>                         FutureWarning)
E           FutureWarning: default contamination parameter 0.1 will change in version 0.22 to "auto". This will change the predict method behavior.

X          = array([[-1.30533786,  0.64836932, -0.09981716],
       [-1.31993521, -0.35930856, -0.75257299],
       [-1.40496732, -...6008172,  0.46657302],
       [ 0.9244616 ,  0.23675217,  2.58618523],
       [ 0.67607348, -0.57379543,  1.29768343]])
__class__  = <class 'sklearn.neighbors.lof.LocalOutlierFactor'>
self       = LocalOutlierFactor(algorithm='auto', contamination='legacy', leaf_size=30,
                   metric='minkowski', metric_params=None, n_jobs=None,
                   n_neighbors=20, novelty=True, p=2)
y          = None

@glemaitre glemaitre commented May 10, 2019

You need to catch the warning using pytest.mark.filterwarnings on top of the test.
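Sketched, the suggested fix looks like this: a filterwarnings mark on the test tells pytest to ignore any warning whose message starts with the given prefix, so the FutureWarning raised by fit() no longer fails the test (the test body is elided here):

```python
# The mark ignores warnings whose message starts with
# "default contamination parameter", which is exactly the
# FutureWarning raised by LocalOutlierFactor.fit above.
import pytest

@pytest.mark.filterwarnings('ignore:default contamination parameter')
def test_pipeline_score_samples_pca_lof():
    # body from the PR (fit the PCA + LOF pipeline, check scores) goes here
    pass
```

The `'ignore:<prefix>'` syntax follows Python's warning-filter specification, matching on the start of the warning message.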

@ab-anssi ab-anssi commented May 10, 2019

Thanks for your help! It seems OK now. Can you please review the PR?

@ab-anssi ab-anssi changed the title [WIP] Add function score_samples to Pipeline (fix issue #12542) [MRG] Add function score_samples to Pipeline (fix issue #12542) May 10, 2019
sklearn/tests/test_pipeline.py (resolved review thread)
@jnothman jnothman left a comment

Thanks

@@ -342,8 +342,8 @@ def test_pipeline_score_samples_pca_lof():
pca = PCA(svd_solver='full', n_components='mle', whiten=True)
pipe = Pipeline([('pca', pca), ('lof', clf)])
pipe.fit(X)
assert_equal(list(pipe.score_samples(X[:2])),
[-0.9564705258081416, -1.0371487183896932])
assert_array_almost_equal(pipe.score_samples(X[:2]),
@jnothman jnothman May 22, 2019

assert_allclose is the preferred helper
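The distinction can be illustrated with a toy example: exact float equality is brittle across platforms, while numpy.testing.assert_allclose compares within a relative tolerance (rtol=1e-7 by default). The drift factor below is made up to simulate platform differences:

```python
# Exact equality fails on a tiny perturbation; assert_allclose passes.
import numpy as np
from numpy.testing import assert_allclose

expected = np.array([-0.9564705258081416, -1.0371487183896932])
computed = expected * (1 + 1e-12)  # simulated platform-dependent drift

assert not np.array_equal(computed, expected)  # exact comparison fails
assert_allclose(computed, expected)            # tolerant comparison passes
```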

@ab-anssi ab-anssi May 22, 2019

Done.

@thomasjpfan thomasjpfan May 28, 2019

assert_array_almost_equal is still being used.

@ab-anssi ab-anssi May 28, 2019

Sorry, I did something wrong when rebasing the branch onto master and lost two commits along the way (removing assert_array_almost_equal and updating the what's new entry). Now it should be OK.

@jnothman jnothman left a comment

I'm happy with supporting this.

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
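For illustration, a hypothetical entry in the style of that file — the badge, wording, and credit format here are assumptions, not the text actually merged:

```rst
:mod:`sklearn.pipeline`
.......................

- |Enhancement| :class:`pipeline.Pipeline` now exposes ``score_samples``
  when its final estimator implements it.
  :pr:`13806` by :user:`Anaël Beaugnon <ab-anssi>`.
```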

@ab-anssi ab-anssi force-pushed the pipeline__score_samples branch 4 times, most recently from 7c1de43 to c29a559 Compare May 22, 2019
@ab-anssi ab-anssi commented May 22, 2019

Thanks for your review.

@ab-anssi ab-anssi commented May 26, 2019

A test fails with the latest version I have committed; however, it does not seem related to the changes I have made.

__________________________ test_scale_and_stability ___________________________

def test_scale_and_stability():
    # We test scale=True parameter
    # This allows to check numerical stability over platforms as well

    d = load_linnerud()
    X1 = d.data
    Y1 = d.target
    # causes X[:, -1].std() to be zero
    X1[:, -1] = 1.0

    # From bug #2821
    # Test with X2, T2 s.t. clf.x_score[:, 1] == 0, clf.y_score[:, 1] == 0
    # This test robustness of algorithm when dealing with value close to 0
    X2 = np.array([[0., 0., 1.],
                   [1., 0., 0.],
                   [2., 2., 2.],
                   [3., 5., 4.]])
    Y2 = np.array([[0.1, -0.2],
                   [0.9, 1.1],
                   [6.2, 5.9],
                   [11.9, 12.3]])

    for (X, Y) in [(X1, Y1), (X2, Y2)]:
        X_std = X.std(axis=0, ddof=1)
        X_std[X_std == 0] = 1
        Y_std = Y.std(axis=0, ddof=1)
        Y_std[Y_std == 0] = 1

        X_s = (X - X.mean(axis=0)) / X_std
        Y_s = (Y - Y.mean(axis=0)) / Y_std

        for clf in [CCA(), pls_.PLSCanonical(), pls_.PLSRegression(),
                    pls_.PLSSVD()]:
            clf.set_params(scale=True)
            X_score, Y_score = clf.fit_transform(X, Y)
            clf.set_params(scale=False)
            X_s_score, Y_s_score = clf.fit_transform(X_s, Y_s)
          assert_array_almost_equal(X_s_score, X_score)

E AssertionError:
E Arrays are not almost equal to 6 decimals
E
E Mismatch: 50%
E Max absolute difference: 7.04879453e-05
E Max relative difference: 0.00169015
E x: array([[-1.337317, -0.041776],
E [-1.108472, 0.0982 ],
E [ 0.407632, -0.103027],
E [ 2.038158, 0.046602]])
E y: array([[-1.337317, -0.041705],
E [-1.108472, 0.098154],
E [ 0.407632, -0.103083],
E [ 2.038158, 0.046634]])

X = array([[0., 0., 1.],
[1., 0., 0.],
[2., 2., 2.],
[3., 5., 4.]])
X1 = array([[ 5., 162., 1.],
[ 2., 110., 1.],
[ 12., 101., 1.],
[ 12., 105., 1.],
[ 1...0., 1.],
[ 4., 60., 1.],
[ 11., 230., 1.],
[ 15., 225., 1.],
[ 2., 110., 1.]])
X2 = array([[0., 0., 1.],
[1., 0., 0.],
[2., 2., 2.],
[3., 5., 4.]])
X_s = array([[-1.161895 , -0.7406129 , -0.43915503],
[-0.38729833, -0.7406129 , -1.02469508],
[ 0.38729833, 0.10580184, 0.14638501],
[ 1.161895 , 1.37542395, 1.3174651 ]])
X_s_score = array([[-1.3373174 , -0.04177564],
[-1.10847164, 0.09820033],
[ 0.40763151, -0.1030265 ],
[ 2.03815753, 0.04660181]])
X_score = array([[-1.3373174 , -0.04170515],
[-1.10847164, 0.09815418],
[ 0.40763151, -0.10308337],
[ 2.03815753, 0.04663434]])
X_std = array([1.29099445, 2.36290781, 1.70782513])
Y = array([[ 0.1, -0.2],
[ 0.9, 1.1],
[ 6.2, 5.9],
[11.9, 12.3]])
Y1 = array([[191., 36., 50.],
[189., 37., 52.],
[193., 38., 58.],
[162., 35., 62.],
[18...7., 62.],
[176., 37., 54.],
[157., 32., 52.],
[156., 33., 54.],
[138., 33., 68.]])
Y2 = array([[ 0.1, -0.2],
[ 0.9, 1.1],
[ 6.2, 5.9],
[11.9, 12.3]])
Y_s = array([[-0.85511537, -0.87878921],
[-0.70878547, -0.64915585],
[ 0.26065014, 0.19872118],
[ 1.3032507 , 1.32922388]])
Y_s_score = array([[-8.55115369e-01, -8.12697758e-14],
[-7.08785466e-01, 1.92383868e-13],
[ 2.60650139e-01, -2.01848251e-13],
[ 1.30325070e+00, 9.07341589e-14]])
Y_score = array([[-8.55115369e-01, -1.89870573e-10],
[-7.08785466e-01, 4.46868095e-10],
[ 2.60650139e-01, -4.69311104e-10],
[ 1.30325070e+00, 2.12313582e-10]])
Y_std = array([5.46709856, 5.66119834])
clf = CCA(copy=True, max_iter=500, n_components=2, scale=False, tol=1e-06)
d = {'data_filename': 'c:\hostedtoolcache\windows\python\3.5.4\x86\lib\site-packages\sklearn\datasets\data/linne..... topic:: References\n\n * Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.\n'}

@thomasjpfan thomasjpfan commented May 26, 2019

Merging with master should fix this issue.

@ab-anssi ab-anssi force-pushed the pipeline__score_samples branch from c29a559 to a2953f5 Compare May 26, 2019
Anaël Beaugnon and others added 3 commits May 29, 2019
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@NicolasHug NicolasHug left a comment

Please add a test making sure the appropriate error is raised if we try using score_samples() when the final estimator doesn't support it.

The function score_samples is then automatically available for GridSearchCV and RandomizedSearchCV.

I'm a bit confused: are you sure? If that's the case, it should be tested and mentioned in the what's new.

sklearn/pipeline.py (resolved review thread)
@ab-anssi ab-anssi force-pushed the pipeline__score_samples branch from 48e3208 to 87c90a9 Compare May 29, 2019
Co-Authored-By: Nicolas Hug <contact@nicolas-hug.com>
@ab-anssi ab-anssi force-pushed the pipeline__score_samples branch from 87c90a9 to 4f8723e Compare May 29, 2019
@ab-anssi ab-anssi commented May 31, 2019

Hi @NicolasHug,
I will add the test you've mentioned.
Regarding GridSearchCV and RandomizedSearchCV, you are right: the score_samples function is not automatically available. I will add it to the BaseSearchCV class.

Anaël Beaugnon added 2 commits May 31, 2019
score_samples is then available for GridSearchCV and RandomizedSearchCV.
@ab-anssi ab-anssi commented May 31, 2019

I have added score_samples to BaseSearchCV, but I still need to add some tests. I have looked at sklearn/model_selection/tests/test_search.py, but it is not straightforward for me to add tests for this new function. I will come back to you when I have a solution.

pca = PCA(svd_solver='full', n_components='mle', whiten=True)
clf = LogisticRegression()
pipe = Pipeline([('pca', pca), ('clf', clf)])
assert_raises_regex(
@NicolasHug NicolasHug May 31, 2019

We're trying to avoid assert_raises_regex; use with pytest.raises(): instead.
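The requested style, sketched with a toy function: pytest.raises works as a context manager, and its match= argument takes over the regex role that assert_raises_regex's pattern argument used to play (the function here is illustrative, not from the PR):

```python
# pytest.raises(..., match=...) searches the exception message with a
# regex, like assert_raises_regex did, but reads as a plain with-block.
import pytest

def reciprocal(x):
    return 1.0 / x

# Old style (being phased out):
#   assert_raises_regex(ZeroDivisionError, "division by zero",
#                       reciprocal, 0)
# Preferred style:
with pytest.raises(ZeroDivisionError, match="division by zero"):
    reciprocal(0)
```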

@ab-anssi ab-anssi May 31, 2019

I have updated the code accordingly.

@NicolasHug NicolasHug commented May 31, 2019

I don't think you can implement this for the search estimators. Outlier detectors don't have a proper scoring so they can't be used with grid search or random search.

@ab-anssi ab-anssi commented May 31, 2019

Ok. I have removed the score_samples method from BaseSearchCV.

@ab-anssi ab-anssi commented May 31, 2019

I think I have taken into account all your reviews. Thanks for your time!



def test_score_samples_on_pipeline_without_score_samples():
iris = load_iris()
@NicolasHug NicolasHug May 31, 2019

For this kind of error-checking test, you don't need real-world data or good parameters. We just need the test to check what's needed and to run fast.

pipe = make_pipeline(LogisticRegression()) is enough

You could also just hard code X and y, or use make_classification

LGTM regardless, thanks @ab-anssi
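A minimal version of the test along the lines suggested above might look like this (the hard-coded data values are illustrative, and accessing the delegated method on a pipeline whose final estimator lacks score_samples raises AttributeError in current scikit-learn):

```python
# Error-check test: LogisticRegression has no score_samples, so the
# pipeline's delegated score_samples should not be available either.
import pytest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def test_score_samples_on_pipeline_without_score_samples():
    X = [[0.0], [1.0], [2.0], [3.0]]
    y = [0, 0, 1, 1]
    pipe = make_pipeline(LogisticRegression())
    pipe.fit(X, y)
    with pytest.raises(AttributeError):
        pipe.score_samples(X)

test_score_samples_on_pipeline_without_score_samples()
```

No real dataset, no tuned parameters: the test only needs to exercise the delegation check, and it runs in milliseconds.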

@@ -330,6 +332,37 @@ def test_pipeline_methods_pca_svm():
pipe.score(X, y)


@pytest.mark.filterwarnings('ignore: default contamination parameter') # 0.22
@glemaitre glemaitre Jun 3, 2019

You don't need this line anymore. We merged the PR which removes the deprecation for 0.22.

@ab-anssi ab-anssi Jun 5, 2019

Ok.

sklearn/tests/test_pipeline.py (resolved review thread)
@glemaitre glemaitre commented Jun 3, 2019

2 small changes and I think that we are good.

Anaël Beaugnon and others added 2 commits Jun 5, 2019
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@ab-anssi ab-anssi commented Jun 5, 2019

@glemaitre, I have taken into account all your comments. Thanks!

@glemaitre glemaitre merged commit ec35ed2 into scikit-learn:master Jun 5, 2019
16 checks passed
@ab-anssi ab-anssi deleted the pipeline__score_samples branch Jun 5, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this issue Jul 12, 2019