
[MRG+1] Uncontroversial fixes from estimator tags branch #8086

Merged
merged 24 commits into scikit-learn:master on Jun 6, 2017

Conversation


@amueller amueller commented Dec 19, 2016

These are some of the simple fixes from #8022 which should be fairly uncontroversial and easy to review.
I hope that makes the review of #8022 easier and allows that PR to focus on the API.

- Fixes to the input validation in :class:`sklearn.covariance.EllipticEnvelope` by
`Andreas Müller`_.

- Fix output shape of :class:`sklearn.decomposition.DictionaryLearning` transform

@amueller amueller Dec 19, 2016

this still needs a test, I think...


- Gradient boosting base models are not longer estimators. By `Andreas Müller`_.

- :class:`feature_extraction.text.TfidfTransformer` now supports numpy

@amueller amueller Dec 19, 2016

this needs tests, I guess

@amueller amueller changed the title Uncontroversial fixes from estimator tags branch [MRG] Uncontroversial fixes from estimator tags branch Dec 19, 2016
@@ -101,7 +102,7 @@ def predict(self, X):
return is_inlier


class EllipticEnvelope(ClassifierMixin, OutlierDetectionMixin, MinCovDet):
class EllipticEnvelope(OutlierDetectionMixin, MinCovDet):

@amueller amueller Dec 19, 2016

This is not a classifier and should not inherit from ClassifierMixin.
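To make the distinction concrete, here is a minimal sketch of the outlier-detection contract this relies on (illustrative names, not the actual scikit-learn source): predict derives +1/-1 inlier labels from a fitted threshold, and no class labels y are ever involved, which is why ClassifierMixin is a poor fit.

import numpy as np

class OutlierDetectorSketch(object):
    def fit(self, X):
        dist = np.sum(X ** 2, axis=1)              # stand-in outlyingness measure
        self.threshold_ = np.percentile(dist, 90.0)
        return self

    def predict(self, X):
        is_inlier = -np.ones(X.shape[0], dtype=int)   # -1 = outlier by default
        is_inlier[np.sum(X ** 2, axis=1) <= self.threshold_] = 1
        return is_inlier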

@@ -93,7 +93,7 @@ class GaussianRandomProjectionHash(ProjectionToHashMixin,
GaussianRandomProjection):
"""Use GaussianRandomProjection to produce a cosine LSH fingerprint"""
def __init__(self,
n_components=8,
n_components=32,

@amueller amueller Dec 19, 2016

This class is never instantiated with default parameters and the code doesn't run with n_components != 32.


@jnothman jnothman Dec 27, 2016

ha!


@agramfort agramfort Jun 6, 2017

is the error captured if not 32? should it be a parameter if only one param value works?


@amueller amueller Jun 6, 2017

I told @ogrisel to look into it ;)
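A plausible reading of the 32-bit constraint, offered as an assumption rather than taken from the library code: the cosine-LSH fingerprint packs one sign bit per projected component into a single 32-bit integer, so the hash width is hard-wired.

import numpy as np

def pack_fingerprint_32(projected):
    # Pack the signs of exactly 32 projected components into one uint32
    # per sample; any other width would not fit the fixed-size hash.
    assert projected.shape[1] == 32
    bits = (projected > 0).astype(np.uint32)                 # (n_samples, 32)
    powers = np.uint32(1) << np.arange(32, dtype=np.uint32)  # bit weights
    return bits @ powers                                     # one uint32 hash each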

else:
# convert counts or binary occurrences to floats
X = sp.csr_matrix(X, dtype=np.float64, copy=copy)
X = check_array(X, accept_sparse=["csr"], copy=copy,

@amueller amueller Dec 19, 2016

I'm actually somewhat uncertain whether this is the right approach. In fit we always convert to sparse. Maybe we should be consistent? I'm not sure.


@jnothman jnothman Dec 27, 2016

I think the principle is that for df to be meaningful, X needs to be fundamentally sparse (regardless of data structure). I can see why you consider this broken, but I'm tempted to say don't fix. You're creating a backwards-incompatibility.


@amueller amueller May 15, 2017

reverted
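For readers following the thread, the two behaviours under discussion differ roughly as below (a toy illustration, not the TfidfTransformer source): fit-style code always converts to CSR, while the proposed transform let dense input stay dense.

import numpy as np
import scipy.sparse as sp
from sklearn.utils import check_array

X = np.array([[1.0, 0.0], [0.0, 2.0]])

X_forced = sp.csr_matrix(X, dtype=np.float64)     # always sparse, as in fit
X_kept = check_array(X, accept_sparse=["csr"])    # dense input stays dense

print(sp.issparse(X_forced), sp.issparse(X_kept))  # True False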


- :class:`feature_extraction.text.TfidfTransformer` now supports numpy
arrays as inputs, and produces numpy arrays for list inputs and numpy
array inputs. By `Andreas Müller_`.

@TomDLT TomDLT Dec 20, 2016

swap _ and `


@jnothman jnothman left a comment

otherwise LGTM, I think.

Thanks! But I can't say it's so easy to review a heterogeneous patch like this, and I wish you'd pulled the entirely cosmetic things out separately.

@@ -176,3 +177,29 @@ def fit(self, X, y=None):
self.threshold_ = sp.stats.scoreatpercentile(
self.dist_, 100. * (1. - self.contamination))
return self

def score(self, X, y, sample_weight=None):

@jnothman jnothman Dec 27, 2016

Why is this not part of OutlierDetectionMixin?


@amueller amueller May 15, 2017

moved it there.
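The method in question follows this pattern (a sketch under the assumption that score is plain accuracy of the +1/-1 predictions; not a verbatim copy of the source):

from sklearn.metrics import accuracy_score

class OutlierDetectionMixinSketch(object):
    def score(self, X, y, sample_weight=None):
        # accuracy of the +1 (inlier) / -1 (outlier) predictions against y
        return accuracy_score(y, self.predict(X), sample_weight=sample_weight)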

sklearn/dummy.py Outdated
@@ -117,6 +117,9 @@ def fit(self, X, y, sample_weight=None):

self.sparse_output_ = sp.issparse(y)

X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])

@jnothman jnothman Dec 27, 2016

I think we need to be liberal wrt X: DummyClassifier will often substitute for a pipeline / grid search, where X may be a list of strings, a dataframe, a list of open files, or other mess. Don't demand too much of it. Here you ensure it is 2d and numeric and of a particular format, but why?


@amueller amueller May 15, 2017

removed.


- :class:`feature_selection.SelectFromModel` now validates the ``threshold``
parameter and sets the ``threshold_`` attribute during the call to
``fit``, and no longer during the call to ``transform```, by `Andreas

@jnothman jnothman Dec 27, 2016

I think it's still being set in transform now, no?


@@ -480,13 +480,13 @@ def partial_fit(self, X, y, classes=None, sample_weight=None):
y : array-like, shape = [n_samples]
Target values.
classes : array-like, shape = [n_classes], optional (default=None)
classes : array-like, shape = [n_classes], (default=None)

@jnothman jnothman Dec 27, 2016

no comma before parentheses?

List of all the classes that can possibly appear in the y vector.
Must be provided at the first call to partial_fit, can be omitted
in subsequent calls.
sample_weight : array-like, shape = [n_samples], optional (default=None)
sample_weight : array-like, shape = [n_samples], (default=None)

@jnothman jnothman Dec 27, 2016

no comma before parentheses?


@codecov codecov bot commented Feb 25, 2017

Codecov Report

Merging #8086 into master will decrease coverage by 0.05%.
The diff coverage is 98.01%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8086      +/-   ##
==========================================
- Coverage   95.53%   95.48%   -0.06%     
==========================================
  Files         333      342       +9     
  Lines       61184    60958     -226     
==========================================
- Hits        58451    58204     -247     
- Misses       2733     2754      +21
Impacted Files Coverage Δ
sklearn/utils/multiclass.py 96.52% <ø> (+0.69%) ⬆️
sklearn/naive_bayes.py 100% <ø> (ø) ⬆️
sklearn/feature_selection/rfe.py 97.45% <ø> (ø) ⬆️
sklearn/neighbors/approximate.py 98.95% <ø> (ø) ⬆️
sklearn/multiclass.py 94.78% <100%> (ø)
sklearn/decomposition/truncated_svd.py 94.11% <100%> (-0.23%) ⬇️
sklearn/covariance/outlier_detection.py 97.22% <100%> (+0.25%) ⬆️
sklearn/decomposition/dict_learning.py 93.45% <100%> (ø) ⬆️
sklearn/tests/test_multiclass.py 100% <100%> (ø)
sklearn/ensemble/gradient_boosting.py 95.79% <100%> (-0.01%) ⬇️
... and 135 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0301e06...f4c9d60. Read the comment docs.

@@ -64,7 +64,7 @@
from ..exceptions import NotFittedError


class QuantileEstimator(BaseEstimator):
class QuantileEstimator(object):

@GaelVaroquaux GaelVaroquaux Feb 27, 2017

Why this change? (there is probably a good reason, just asking)


@amueller amueller May 15, 2017

These are not scikit-learn estimators, they don't fulfill the sklearn estimator API and the inheritance doesn't provide any functionality. So I think having them inherit is rather confusing.
The more direct reason is that I don't want these to be discovered by the common tests, as they are not sklearn estimators and will definitely fail the tests.


@jnothman jnothman May 17, 2017

They are also not in a position for get_params/set_params to be used.
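As a rough illustration of what such a base model looks like after the change (an assumed shape of the code, not the actual module): a plain object with fit/predict and none of the estimator machinery.

import numpy as np

class QuantileEstimatorSketch(object):
    # Predicts a constant: the alpha-quantile of the training targets.
    def __init__(self, alpha=0.9):
        self.alpha = alpha

    def fit(self, X, y, sample_weight=None):
        self.quantile = np.percentile(y, self.alpha * 100.0)
        return self

    def predict(self, X):
        # no get_params/set_params, no input validation: not an estimator
        return np.full(X.shape[0], self.quantile)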


@GaelVaroquaux GaelVaroquaux commented Feb 27, 2017

Overall +1 for merge, but I added a small comment. It also seems that there are unanswered comments by @jnothman




- Gradient boosting base models are not longer estimators. By `Andreas Müller`_.

- :class:`feature_selection.SelectFromModel` now validates the ``threshold``
parameter and sets the ``threshold_`` attribute during the call to

@amueller amueller May 15, 2017

Now actually doesn't set it any more during transform. This is a backward incompatible change, but could be considered a bug in the previous implementation? I tried to make it clearer what happens now.


@amueller amueller commented May 15, 2017

Ok so I reworked the SelectFromModel again. The point of self.threshold_ is to provide a threshold to the user; it's not used internally (before or after this PR). However, before this PR, if the user re-assigned threshold (a use-case that we explicitly support and test), self.threshold_ was outdated until transform was called. Even calling fit would not update it!

The PR now makes threshold_ a property so it doesn't go stale. Computing it is quick so it's not a big deal. We can think about deprecating the property and making it a method, which it should have probably been in the first place.
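In sketch form the change amounts to computing the attribute on access; the scoring rule below is simplified and assumed, not the actual from_model.py logic.

import numpy as np
from sklearn.base import clone

class SelectFromModelSketch(object):
    def __init__(self, estimator, threshold=None):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    @property
    def threshold_(self):
        # recomputed on every access, so re-assigning ``self.threshold``
        # takes effect immediately instead of only after ``transform``
        scores = np.abs(getattr(self.estimator_, 'coef_',
                        getattr(self.estimator_, 'feature_importances_', 0.0)))
        if self.threshold is None:
            return float(np.mean(scores))    # simplified default rule
        return float(self.threshold)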


@amueller amueller commented Jun 5, 2017

Is @ogrisel around? I think he did the GaussianRandomProjection. It's only used internally. I guess we should open an issue for that?


@vene vene commented Jun 5, 2017

issue #8029 also mentions that these classes are not the greatest. (btw this is GaussianRandomProjectionHash, sorry for the typo.)

I'll comment under #8029 with the weird behaviour we are finding.

X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)

@agramfort agramfort Jun 6, 2017

shape = (n_samples,) or (n_samples, n_outputs)

y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional

@agramfort agramfort Jun 6, 2017

array-like, shape = (n_samples,), optional

@@ -484,13 +484,13 @@ def partial_fit(self, X, y, classes=None, sample_weight=None):
y : array-like, shape = [n_samples]
Target values.
classes : array-like, shape = [n_classes], optional (default=None)
classes : array-like, shape = [n_classes] (default=None)

@agramfort agramfort Jun 6, 2017

I see that we are not consistent here.

I think we should use tuples for shapes in doctrings so:

array-like, shape = (n_classes,) (default=None)


@amueller amueller Jun 6, 2017

well I could change that but that would really increase the size of the pull request, even if I just do it for naive_bayes.py

``fit``, and no longer during the call to ``transform```, by `Andreas
Müller`_.

- :class:`features_selection.SelectFromModel` now has a ``partial_fit``

@TomDLT TomDLT Jun 6, 2017

features -> feature


@agramfort agramfort commented Jun 6, 2017

that's it for me.

:issue:`8086` by `Andreas Müller`_.

- Fix output shape and bugs with n_jobs > 1 in
  :class:`sklearn.decomposition.SparseEncoder` transform and :func:`sklearn.decomposition.sparse_encode`

@TomDLT TomDLT Jun 6, 2017

SparseEncoder -> SparseCoder

if dictionary.shape[1] != X.shape[1]:
raise ValueError("Dictionary and X have different numbers of features: "
                 "dictionary.shape: {} X.shape: {}".format(
                     dictionary.shape, X.shape))

@vene vene Jun 6, 2017

could use check_consistent_length(X.T, dictionary.T)

You could argue that's less clear though.


@amueller amueller Jun 6, 2017

It'll say "different number of samples" in the error, right?
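For the record, a quick way to check the wording (illustrative snippet; the exact message may differ between versions):

import numpy as np
from sklearn.utils import check_consistent_length

X = np.zeros((5, 3))
dictionary = np.zeros((4, 2))        # 2 features vs. X's 3

try:
    check_consistent_length(X.T, dictionary.T)  # transposing compares n_features
except ValueError as exc:
    print(exc)  # message speaks of "inconsistent numbers of samples"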

@@ -213,7 +213,8 @@ Bug fixes

- Fixed a bug where :class:`sklearn.linear_model.LassoLars` does not give
the same result as the LassoLars implementation available
in R (lars library). :issue:`7849` by :user:`Jair Montoya Martinez <jmontoyam>`
in R (lars library). :issue:`7849` by `Jair Montoya Martinez`_

@TomDLT TomDLT Jun 6, 2017

There is no link for Jair Montoya Martinez.
Revert this one?


@amueller amueller Jun 6, 2017

thanks, I guess that got lost in a merge.


@TomDLT TomDLT Jun 6, 2017

use this URL: https://github.com/jmontoyam

Then revert it to

:user:`Jair Montoya Martinez <jmontoyam>`

being subtracted from the centroids. :issue:`7872` by `Josh Karnofsky <https://github.com/jkarno>`_.
- Fix a bug regarding fitting :class:`sklearn.cluster.KMeans` with a sparse
array X and initial centroids, where X's means were unnecessarily being
subtracted from the centroids. :issue:`7872` by `Josh Karnofsky

@TomDLT TomDLT Jun 6, 2017

:user:`Josh Karnofsky <jkarno>`


@agramfort agramfort commented Jun 6, 2017

good to go on my end

@agramfort agramfort changed the title [MRG] Uncontroversial fixes from estimator tags branch [MRG+1] Uncontroversial fixes from estimator tags branch Jun 6, 2017
@@ -336,6 +351,20 @@ API changes summary
:func:`sklearn.model_selection.cross_val_predict`.
:issue:`2879` by :user:`Stephen Hoover <stephen-hoover>`.


- Gradient boosting base models are not longer estimators. By `Andreas Müller`_.

@vene vene Jun 6, 2017

not longer -> no longer


@vene vene commented Jun 6, 2017

Comment moved to new issue: #8998, this is not relevant to this PR.


@vene vene commented Jun 6, 2017

(Since we plan to deprecate the GaussianRandomProjectionHash class in this cycle, maybe we should simply skip it from the common tests rather than change its default n_components. I really doubt that class is seeing any external use, and internally it is only ever called with a fixed n_components.) scratch everything up to here. The code simply doesn't work with the old default so there can be no breakage.

Other than the very minor check_consistent_length question in dict learning this PR looks good to me.


@amueller amueller commented Jun 6, 2017

@GaelVaroquaux do you wanna have a final look or merge?


@amueller amueller commented Jun 6, 2017

@vene merge?

@vene vene merged commit 1c41368 into scikit-learn:master Jun 6, 2017
3 of 5 checks passed
Sundrique added a commit to Sundrique/scikit-learn that referenced this issue Jun 14, 2017
…n#8086)

* some bug fixes.

* minor fixes to whatsnew

* typo in whatsnew

* add test for n_components = 1 transform in dict learning

* feature extraction doc fix

* fix broken test

* revert aggressive input validation changes

* in SelectFromModel, don't store threshold_ in transform. If we called "fit", use estimates from last "fit".

* move score from EllipticEnvelope to OutlierDetectionMixin

* revert changes to Tfidf documentation

* remove dummy input validation from whatsnew

* fix text feature tests

* rewrite from_model threshold again...

* remove stray condition

* fix self.estimator -> estimator, slightly more interesting test

* typo in comment

* Fix issues in SparseEncoder, add tests.
more explicit explanation of SparseEncoder change, add issue numbers to whatsnew

* minor fixes in whats_new.rst

* slightly more consistency with tuples for shapes

* not longer typo
dmohns added a commit to dmohns/scikit-learn that referenced this issue Aug 7, 2017
dmohns added a commit to dmohns/scikit-learn that referenced this issue Aug 7, 2017
NelleV added a commit to NelleV/scikit-learn that referenced this issue Aug 11, 2017
paulha added a commit to paulha/scikit-learn that referenced this issue Aug 19, 2017
AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this issue Aug 29, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this issue Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this issue Dec 18, 2017