[MRG+1] Uncontroversial fixes from estimator tags branch #8086

Merged: 24 commits merged into scikit-learn:master on Jun 6, 2017

Conversation (6 participants)
amueller (Member) commented Dec 19, 2016

These are some of the simple fixes from #8022 which should be fairly uncontroversial and easy to review.
I hope that makes the review of #8022 easier and allows that PR to focus on the API.

doc/whats_new.rst
+ - Fixes to the input validation in :class:`sklearn.covariance.EllipticEnvelope` by
+ `Andreas Müller`_.
+
+ - Fix shape output shape of :class:`sklearn.decomposition.DictionaryLearning` transform

amueller (Member) commented Dec 19, 2016

this still needs a test, I think...

doc/whats_new.rst
+
+ - Gradient boosting base models are not longer estimators. By `Andreas Müller`_.
+
+ - :class:`feature_extraction.text.TfidfTransformer` now supports numpy

amueller (Member) commented Dec 19, 2016

this needs tests, I guess

@amueller amueller changed the title from Uncontroversial fixes from estimator tags branch to [MRG] Uncontroversial fixes from estimator tags branch Dec 19, 2016

sklearn/covariance/outlier_detection.py
@@ -101,7 +102,7 @@ def predict(self, X):
return is_inlier
-class EllipticEnvelope(ClassifierMixin, OutlierDetectionMixin, MinCovDet):
+class EllipticEnvelope(OutlierDetectionMixin, MinCovDet):

amueller (Member) commented Dec 19, 2016

This is not a classifier and should not inherit from ClassifierMixin.

@@ -93,7 +93,7 @@ class GaussianRandomProjectionHash(ProjectionToHashMixin,
GaussianRandomProjection):
"""Use GaussianRandomProjection to produce a cosine LSH fingerprint"""
def __init__(self,
- n_components=8,
+ n_components=32,

amueller (Member) commented Dec 19, 2016

This class is never instantiated with default parameters, and the code doesn't run with n_components != 32.

jnothman (Member) commented Dec 27, 2016

ha!

agramfort (Member) commented Jun 6, 2017

is the error captured if not 32? should it be a parameter if only one param value works?

amueller (Member) commented Jun 6, 2017

I told @ogrisel to look into it ;)
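As an aside on the 32-component constraint discussed above, here is a hedged sketch of why only 32 can work: the cosine LSH fingerprint packs one sign bit per projected component into a single 32-bit integer. This illustrates the idea only and is not the actual scikit-learn internals; the function name is invented.

    import numpy as np

    def cosine_lsh_fingerprint(X_projected):
        # X_projected: (n_samples, 32) array of random projections
        bits = (X_projected > 0).astype(np.uint8)  # one sign bit per component
        packed = np.packbits(bits, axis=1)         # (n_samples, 4) bytes
        return packed.view(np.uint32).ravel()      # one 32-bit hash per sample

With any other n_components the packed bytes no longer line up with a 32-bit view, which is consistent with the observation that the code only runs with n_components == 32.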

sklearn/feature_extraction/text.py
- else:
- # convert counts or binary occurrences to floats
- X = sp.csr_matrix(X, dtype=np.float64, copy=copy)
+ X = check_array(X, accept_sparse=["csr"], copy=copy,

amueller (Member) commented Dec 19, 2016

I'm actually somewhat uncertain whether this is the right approach. In fit we always convert to sparse. Maybe we should be consistent? I'm not sure.

jnothman (Member) commented Dec 27, 2016

I think the principle is that for df to be meaningful, X needs to be fundamentally sparse (regardless of data structure). I can see why you consider this broken, but I'm tempted to say don't fix. You're creating a backwards-incompatibility.

amueller (Member) commented May 15, 2017

reverted
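For illustration, a hedged sketch of the two input-handling strategies discussed in this thread, assuming check_array is used as in the diff:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.utils import check_array

    X = np.array([[1., 0., 2.],
                  [0., 3., 0.]])

    # old behavior (restored by the revert): always coerce to sparse CSR
    X_old = sp.csr_matrix(X, dtype=np.float64)

    # the proposed change: validate, but keep dense inputs dense
    X_new = check_array(X, accept_sparse=["csr"], dtype=np.float64)

    assert sp.issparse(X_old) and not sp.issparse(X_new)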

doc/whats_new.rst
+
+ - :class:`feature_extraction.text.TfidfTransformer` now supports numpy
+ arrays as inputs, and produces numpy arrays for list inputs and numpy
+ array inputs. By `Andreas Müller_`.

TomDLT (Member) commented Dec 20, 2016

swap _ and `

jnothman (Member) commented

otherwise LGTM, I think.

Thanks! But I can't say it's so easy to review a heterogeneous patch like this, and I wish you'd pulled the entirely cosmetic things out separately.

sklearn/covariance/outlier_detection.py
@@ -176,3 +177,29 @@ def fit(self, X, y=None):
self.threshold_ = sp.stats.scoreatpercentile(
self.dist_, 100. * (1. - self.contamination))
return self
+
+ def score(self, X, y, sample_weight=None):

jnothman (Member) commented Dec 27, 2016

Why is this not part of OutlierDetectionMixin?

amueller (Member) commented May 15, 2017

moved it there.
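A hedged sketch of what such a mixin-level score method plausibly looks like (illustrative, not quoted from the diff): the mean accuracy of the +1/-1 inlier/outlier predictions.

    from sklearn.metrics import accuracy_score

    def score(self, X, y, sample_weight=None):
        # mean accuracy of predict(X) with respect to the true labels y
        return accuracy_score(y, self.predict(X), sample_weight=sample_weight)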

sklearn/dummy.py
@@ -117,6 +117,9 @@ def fit(self, X, y, sample_weight=None):
self.sparse_output_ = sp.issparse(y)
+ X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])

jnothman (Member) commented Dec 27, 2016

I think we need to be liberal wrt X: DummyClassifier will often substitute for a pipeline / grid search, where X may be a list of strings, a dataframe, a list of open files, or other mess. Don't demand too much of it. Here you ensure it is 2d and numeric and of a particular format, but why?

amueller (Member) commented May 15, 2017

removed.

doc/whats_new.rst
+
+ - :class:`feature_selection.SelectFromModel` now validates the ``threshold``
+ parameter and sets the ``threshold_`` attribute during the call to
+ ``fit``, and no longer during the call to ``transform```, by `Andreas

jnothman (Member) commented Dec 27, 2016

I think it's still being set in transform now, no?


sklearn/naive_bayes.py
@@ -480,13 +480,13 @@ def partial_fit(self, X, y, classes=None, sample_weight=None):
y : array-like, shape = [n_samples]
Target values.
- classes : array-like, shape = [n_classes], optional (default=None)
+ classes : array-like, shape = [n_classes], (default=None)

jnothman (Member) commented Dec 27, 2016

no comma before parentheses?

sklearn/naive_bayes.py
List of all the classes that can possibly appear in the y vector.
Must be provided at the first call to partial_fit, can be omitted
in subsequent calls.
- sample_weight : array-like, shape = [n_samples], optional (default=None)
+ sample_weight : array-like, shape = [n_samples], (default=None)

jnothman (Member) commented Dec 27, 2016

no comma before parentheses?

codecov bot commented Feb 25, 2017

Codecov Report

Merging #8086 into master will decrease coverage by 0.05%.
The diff coverage is 98.01%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8086      +/-   ##
==========================================
- Coverage   95.53%   95.48%   -0.06%     
==========================================
  Files         333      342       +9     
  Lines       61184    60958     -226     
==========================================
- Hits        58451    58204     -247     
- Misses       2733     2754      +21
Impacted Files Coverage Δ
sklearn/utils/multiclass.py 96.52% <ø> (+0.69%) ⬆️
sklearn/naive_bayes.py 100% <ø> (ø) ⬆️
sklearn/feature_selection/rfe.py 97.45% <ø> (ø) ⬆️
sklearn/neighbors/approximate.py 98.95% <ø> (ø) ⬆️
sklearn/multiclass.py 94.78% <100%> (ø)
sklearn/decomposition/truncated_svd.py 94.11% <100%> (-0.23%) ⬇️
sklearn/covariance/outlier_detection.py 97.22% <100%> (+0.25%) ⬆️
sklearn/decomposition/dict_learning.py 93.45% <100%> (ø) ⬆️
sklearn/tests/test_multiclass.py 100% <100%> (ø)
sklearn/ensemble/gradient_boosting.py 95.79% <100%> (-0.01%) ⬇️
... and 135 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0301e06...f4c9d60.


@@ -64,7 +64,7 @@
from ..exceptions import NotFittedError
-class QuantileEstimator(BaseEstimator):
+class QuantileEstimator(object):

GaelVaroquaux (Member) commented Feb 27, 2017

Why this change? (there is probably a good reason, just asking)

amueller (Member) commented May 15, 2017

These are not scikit-learn estimators, they don't fulfill the sklearn estimator API and the inheritance doesn't provide any functionality. So I think having them inherit is rather confusing.
The more direct reason is that I don't want these to be discovered by the common tests, as they are not sklearn estimators and will definitely fail the tests.

jnothman (Member) commented May 17, 2017

They are also not in a position for get_params/set_params to be used.
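For illustration, a hedged sketch of the discovery concern (class names are invented for the example): subclassing BaseEstimator is what opts a class in to being collected by the common tests, so internal helpers are better off as plain objects.

    from sklearn.base import BaseEstimator

    class LossMeanHelper(object):
        """Internal helper: a plain object, so estimator discovery skips it."""
        def fit(self, X, y):
            self.mean_ = y.mean()
            return self

    class WouldBeDiscovered(BaseEstimator):
        """Subclassing BaseEstimator opts this class in to discovery."""

    assert not issubclass(LossMeanHelper, BaseEstimator)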

GaelVaroquaux (Member) commented Feb 27, 2017

Overall +1 for merge, but I added a small comment. It also seems that there are unanswered comments by @jnothman




doc/whats_new.rst
+ - Gradient boosting base models are not longer estimators. By `Andreas Müller`_.
+
+ - :class:`feature_selection.SelectFromModel` now validates the ``threshold``
+ parameter and sets the ``threshold_`` attribute during the call to

amueller (Member) commented May 15, 2017

Now actually doesn't set it any more during transform. This is a backward incompatible change, but could be considered a bug in the previous implementation? I tried to make it clearer what happens now.

amueller (Member) commented May 15, 2017

Ok so I reworked the SelectFromModel again. The point of self.threshold_ is to provide a threshold to the user; it's not used internally (before or after this PR). However, before the PR, if the user re-assigned threshold (a use-case that we explicitly support and test) then self.threshold_ was outdated until transform was called. Even calling fit would not update it!

The PR now makes threshold_ a property so it doesn't go stale. Computing it is quick, so it's not a big deal. We can think about deprecating the property and making it a method, which it should probably have been in the first place.
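A minimal sketch of the property approach described above, assuming a fitted linear sub-estimator exposing coef_; this is an illustration, not the actual scikit-learn source:

    import numpy as np
    from sklearn.base import clone

    class SelectFromModelSketch(object):
        def __init__(self, estimator, threshold="mean"):
            self.estimator = estimator
            self.threshold = threshold      # may be reassigned after fit

        def fit(self, X, y):
            self.estimator_ = clone(self.estimator).fit(X, y)
            return self

        @property
        def threshold_(self):
            # recomputed on every access, so it tracks the current
            # `threshold` parameter instead of going stale
            scores = np.abs(self.estimator_.coef_).ravel()
            if self.threshold == "mean":
                return scores.mean()
            return float(self.threshold)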

amueller (Member) commented May 16, 2017

Can anyone explain to me what codecov is doing? I'm looking at https://codecov.io/gh/scikit-learn/scikit-learn/compare/master...tags_backup/changes but the changes seem to be in the docstrings?!

sklearn/feature_extraction/text.py
- else:
- return None
+ # if _idf_diag is not set, this will raise an attribute error,
+ # which means hasatt(self, "idf_") is False

jnothman (Member) commented May 17, 2017

*hasattr

jnothman (Member) commented May 17, 2017

Also, test this?

amueller (Member) commented May 17, 2017

It is tested :)
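A hedged sketch of the pattern in this hunk: letting the idf_ property raise AttributeError while the transformer is unfitted makes hasattr(self, "idf_") False exactly when _idf_diag is unset.

    import numpy as np

    class TfidfSketch(object):
        @property
        def idf_(self):
            # AttributeError propagates if fit() never set _idf_diag
            return np.ravel(self._idf_diag.sum(axis=0))

    t = TfidfSketch()
    assert not hasattr(t, "idf_")  # the property raised, so hasattr is False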

@@ -169,6 +169,12 @@ def fit(self, X, y=None, **fit_params):
self.estimator_.fit(X, y, **fit_params)
return self
+ @property
+ def threshold_(self):
+ scores = _get_feature_importances(self.estimator_, self.norm_order)

jnothman (Member) commented May 17, 2017

This can be expensive. So where a user used to be able to use sfm.threshold_ with impunity, now it may take some time.

amueller (Member) commented May 17, 2017

When is this expensive? I guess if you have many classes and many features, the reduction to the vector is expensive?

Before, when they used sfm.threshold_ it might have been wrong ;)
We could store the scores in fit and / or transform. If people re-assign estimator that would go stale, though. So we could support re-assigning threshold but not estimator?

amueller (Member) commented May 17, 2017

If we cache scores, then the property might still require recomputing the mean or std. But we can't avoid that if we want threshold_ to be up-to-date with threshold unless we overwrite setattr which I'm not sure is a good idea ;)

amueller (Member) commented May 17, 2017

no matter if they are used for computing the threshold or not. I could make the computation dependent on whether self.threshold is a string type.

vene (Member) commented Jun 5, 2017

I agree, the cost of evaluating the column-wise norm and the mean/median shouldn't be so significant. Can't think of use cases where this would be called in a tight loop.

GaelVaroquaux (Member) commented Jun 6, 2017

What is the logic for making this a property, compared to setting it in fit?

amueller (Member) commented Jun 6, 2017

we explicitly allow people to change a threshold after fitting the estimator. There's a test for that.

sklearn/feature_selection/from_model.py
- self.threshold_ = _calculate_threshold(estimator, scores,
- self.threshold)
- return scores >= self.threshold_
+ threshold = _calculate_threshold(self.estimator, scores,

jnothman (Member) commented May 17, 2017

should use estimator, not self.estimator

jnothman (Member) commented May 17, 2017

Needs a test

amueller (Member) commented May 17, 2017

Actually, it mostly doesn't matter. _calculate_threshold only checks whether estimator.penalty == 'l1' in some cases. I guess you could come up with a case where that's true for estimator but not estimator_ or vice versa, but changing penalty during fit would actually violate API constraints, right? (because no trailing underscore).
I changed it because it's more obvious.

What did you want tested?

jnothman (Member) commented May 17, 2017

Yes, some of that codecov output is weird.

amueller (Member) commented Jun 3, 2017

can I haz review? ;)

sklearn/decomposition/dict_learning.py
if code.ndim == 1:
- code = code[np.newaxis, :]
+ code = code[:, np.newaxis]

GaelVaroquaux (Member) commented Jun 3, 2017

I guess that this is the controversial part of the PR.

Do we have any estimate of how much user code we will break? Maybe we should make this change more visible in the whats_new.

amueller (Member) commented Jun 5, 2017

You mean fix? ;) This is a bugfix, but I'm happy to make it more visible in the whatsnew. It's been a while since I wrote this; I need to look up the exact edge cases this fixes.

The controversial parts were supposed to be the complete refactoring of all the common tests and the introduction of estimator tags. This PR fixes all the bugs that better testing finds.

GaelVaroquaux (Member) commented Jun 5, 2017

OK, your logic is that the return dimensions varied depending on whether n_jobs=1 or not, hence this is a clear-cut bug?

I am worried that some users might rely on this behavior. Upgrading scikit-learn might break their code. Do you think it's unlikely? I don't know, I don't personally use this part of scikit-learn.

This is why I was suggesting a more prominent entry in whats_new.

amueller (Member) commented Jun 5, 2017

It's a clear-cut bug fix for its use in DictionaryLearning, which was broken with n_components=1. I need to double check how n_jobs affects this and what this means for people that directly use SparseEncode.

amueller (Member) commented Jun 5, 2017

(One direction of code needs to be n_features, so there could only be ambiguity when n_features=1. Both n_components=1 and n_features=1 seem odd, but I'll add some tests that make sure both work as expected and see if this has any behavior change.)

amueller (Member) commented Jun 5, 2017

n_components=1 is still broken with n_jobs > 1

amueller (Member) commented Jun 5, 2017

Sorry my second to last comment doesn't make sense, I was confused about code vs dictionary.

amueller (Member) commented Jun 5, 2017

I can also split this up into one PR per bugfix? Seems like the changes are not that simple after all, in particular without regression tests. The point is kind of that adding the estimator tags will find all these bugs, but reviewing without regression tests might be too tricky.

GaelVaroquaux (Member) commented Jun 5, 2017

WRT splitting this PR into smaller ones: it is indeed a tough review, but I think that we need to make sure that you don't lose time. I think that we can review and merge this guy in the next few days.

amueller (Member) commented Jun 5, 2017

@GaelVaroquaux ok cool. I was hoping that we can discuss the main PR #8022 soon, as this will fix a whole bunch of issues with the testing.

amueller (Member) commented Jun 5, 2017

@vene this one is the first step to more contrib friendly tests ;)

- assert_true(dico.components_.shape == (n_components, n_features))
+ assert_equal(dico.components_.shape, (n_components, n_features))
+
+ n_components = 1

amueller (Member) commented Jun 5, 2017

@GaelVaroquaux this is the regression test for the SparseEncode change. I can add a more direct test on SparseEncode, though.

GaelVaroquaux (Member) commented Jun 5, 2017

No, I think that this is good.

amueller (Member) commented Jun 5, 2017

Added more tests now, one testing the core issue in sparse_encode, the other the resulting bug in DictionaryLearning.

amueller added some commits Jun 5, 2017

Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew
Merge branch 'master' into tags_backup
# Conflicts:
#	.travis.yml
amueller (Member) commented Jun 5, 2017

Ok, I completely rewrote the fix for DictionaryLearning and added tests for sparse_encode directly. Previously, n_jobs>1 code paths raised ValueError, and sparse_encode and dictionary learning returned nonsensical results in several cases.
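A hedged regression sketch of the shape contract described above, assuming a scikit-learn version that includes this fix: sparse_encode should return (n_samples, n_components) even when n_components == 1, regardless of n_jobs.

    import numpy as np
    from sklearn.decomposition import sparse_encode

    rng = np.random.RandomState(0)
    X = rng.randn(12, 10)            # (n_samples, n_features)
    dictionary = rng.randn(1, 10)    # a single atom: n_components = 1

    for n_jobs in (1, 2):
        code = sparse_encode(X, dictionary, n_jobs=n_jobs)
        assert code.shape == (12, 1)  # always 2-d, even for one component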

GaelVaroquaux (Member) commented Jun 5, 2017

amueller (Member) commented Jun 5, 2017

@GaelVaroquaux neither did I. I just delved into it with @vene ;) Your comment gave rise to a way better fix and tests.

model.fit(data, y)
X_transform = model.transform(data)
# Set a higher threshold to filter out more features.
- model.threshold = 1.0
+ model.threshold = "1.0 * mean"

vene (Member) commented Jun 5, 2017

is this just for covering the expression parsing? The tests should not fail with the old values, right?

amueller (Member) commented Jun 5, 2017

It was also to make sure that we're recomputing self.threshold_.
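A hedged usage sketch of the supported pattern being tested here: reassign threshold after fit and have transform reflect the recomputed threshold_ without refitting.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_features=20, random_state=0)
    sfm = SelectFromModel(LogisticRegression(), threshold="mean").fit(X, y)
    n_before = sfm.transform(X).shape[1]

    sfm.threshold = "1.5 * mean"     # stricter threshold, no refit needed
    n_after = sfm.transform(X).shape[1]
    assert n_after <= n_before       # threshold_ was recomputed, not stale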

vene (Member) commented Jun 5, 2017

My only remaining controversy in this PR is whether we want to change the GaussianRandomProjections default n_components like that, or fix that in a separate PR. (But I don't want to delay this.)

amueller (Member) commented Jun 5, 2017

Is @ogrisel around? I think he did the GaussianRandomProjection. It's only used internally. I guess we should open an issue for that?

vene (Member) commented Jun 5, 2017

issue #8029 also mentions that these classes are not the greatest. (btw this is GaussianRandomProjectionHash, sorry for the typo.)

I'll comment under #8029 with the weird behaviour we are finding.

sklearn/covariance/outlier_detection.py
+ X : array-like, shape = (n_samples, n_features)
+ Test samples.
+
+ y : array-like, shape = (n_samples) or (n_samples, n_outputs)

agramfort (Member) commented Jun 6, 2017

shape = (n_samples,) or (n_samples, n_outputs)

sklearn/covariance/outlier_detection.py
+ y : array-like, shape = (n_samples) or (n_samples, n_outputs)
+ True labels for X.
+
+ sample_weight : array-like, shape = [n_samples], optional

agramfort (Member) commented Jun 6, 2017

array-like, shape = (n_samples,), optional

@@ -484,13 +484,13 @@ def partial_fit(self, X, y, classes=None, sample_weight=None):
y : array-like, shape = [n_samples]
Target values.
- classes : array-like, shape = [n_classes], optional (default=None)
+ classes : array-like, shape = [n_classes] (default=None)

agramfort (Member) commented Jun 6, 2017

I see that we are not consistent here.

I think we should use tuples for shapes in docstrings, so:

array-like, shape = (n_classes,) (default=None)

amueller (Member) commented Jun 6, 2017

well I could change that but that would really increase the size of the pull request, even if I just do it for naive_bayes.py
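For reference, a small sketch of the convention agramfort suggests, applied to this parameter docstring (illustrative wording, not the actual naive_bayes source):

    def partial_fit(self, X, y, classes=None, sample_weight=None):
        """Illustrative docstring following the suggested tuple convention.

        Parameters
        ----------
        classes : array-like, shape = (n_classes,) (default=None)
            List of all the classes that can possibly appear in y.

        sample_weight : array-like, shape = (n_samples,), optional
            Weights applied to individual samples.
        """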

doc/whats_new.rst
+ ``fit``, and no longer during the call to ``transform```, by `Andreas
+ Müller`_.
+
+ - :class:`features_selection.SelectFromModel` now has a ``partial_fit``

TomDLT (Member) commented Jun 6, 2017

features -> feature

agramfort (Member) commented Jun 6, 2017

that's it for me.

doc/whats_new.rst
+ :issue:`8086` by `Andreas Müller`_.
+
+ - Fix output shape and bugs with n_jobs > 1 in
+ :class:`sklearn.decomposition.SparseEncoder` transform and :func:`sklarn.decomposition.sparse_encode`

TomDLT (Member) commented Jun 6, 2017

SparseEncoder -> SparseCoder

+ if dictionary.shape[1] != X.shape[1]:
+ raise ValueError("Dictionary and X have different numbers of features:"
+ "dictionary.shape: {} X.shape{}".format(
+ dictionary.shape, X.shape))

vene (Member) commented Jun 6, 2017

could use check_consistent_length(X.T, dictionary.T)

You could argue that's less clear though.

amueller (Member) commented Jun 6, 2017

It'll say "different number of samples" in the error, right?

doc/whats_new.rst
@@ -213,7 +213,8 @@ Bug fixes
- Fixed a bug where :class:`sklearn.linear_model.LassoLars` does not give
the same result as the LassoLars implementation available
- in R (lars library). :issue:`7849` by :user:`Jair Montoya Martinez <jmontoyam>`
+ in R (lars library). :issue:`7849` by `Jair Montoya Martinez`_

TomDLT (Member) commented Jun 6, 2017

There is no link for Jair Montoya Martinez.
Revert this one?

amueller (Member) commented Jun 6, 2017

thanks, I guess that got lost in a merge.

TomDLT (Member) commented Jun 6, 2017

use this url : https://github.com/jmontoyam

Then revert it to

:user:`Jair Montoya Martinez <jmontoyam>`
doc/whats_new.rst
- being subtracted from the centroids. :issue:`7872` by `Josh Karnofsky <https://github.com/jkarno>`_.
+ - Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a sparse
+ array X and initial centroids, where X's means were unnecessarily being
+ subtracted from the centroids. :issue:`7872` by `Josh Karnofsky

TomDLT (Member) commented Jun 6, 2017

:user:`Josh Karnofsky <jkarno>`
agramfort (Member) commented Jun 6, 2017

good to go on my end

@agramfort agramfort changed the title from [MRG] Uncontroversial fixes from estimator tags branch to [MRG+1] Uncontroversial fixes from estimator tags branch Jun 6, 2017

doc/whats_new.rst
@@ -336,6 +351,20 @@ API changes summary
:func:`sklearn.model_selection.cross_val_predict`.
:issue:`2879` by :user:`Stephen Hoover <stephen-hoover>`.
+
+ - Gradient boosting base models are not longer estimators. By `Andreas Müller`_.

vene (Member) commented Jun 6, 2017

not longer -> no longer

vene (Member) commented Jun 6, 2017

Comment moved to new issue: #8998, this is not relevant to this PR.

vene (Member) commented Jun 6, 2017

(Since we plan to deprecate the GaussianRandomProjectionHash class in this cycle, maybe we should simply skip it from the common tests rather than change its default n_components. I really doubt that class is seeing any external use, and internally it is only ever called with a fixed n_components.) scratch everything up to here. The code simply doesn't work with the old default so there can be no breakage.

Other than the very minor check_consistent_length question in dict learning this PR looks good to me.

amueller (Member) commented Jun 6, 2017

@GaelVaroquaux do you wanna have a final look or merge?

amueller (Member) commented Jun 6, 2017

@vene merge?

@vene vene merged commit 1c41368 into scikit-learn:master Jun 6, 2017

3 of 5 checks passed

codecov/patch: 95.65% of diff hit (target 95.91%)
codecov/project: 95.91% (-0.01%) compared to dd7a3dd
ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] Uncontroversial fixes from estimator tags branch (#8086)
* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG+1] Uncontroversial fixes from estimator tags branch (#8086)

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

[MRG+1] Uncontroversial fixes from estimator tags branch (#8086)

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+1] Uncontroversial fixes from estimator tags branch (#8086)

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] Uncontroversial fixes from estimator tags branch (#8086)

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

[MRG+1] Uncontroversial fixes from estimator tags branch (#8086)

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] Uncontroversial fixes from estimator tags branch (#8086)

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

[MRG+1] Uncontroversial fixes from estimator tags branch (#8086)