Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Implements ducktyping to allow for alternative Memory implementations #9584

Merged
merged 68 commits into from Aug 30, 2017
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
68 commits
Select commit Hold shift + click to select a range
160d9bc
Initial Commit
thechargedneutron Aug 18, 2017
c907dfb
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 18, 2017
28e6254
Import removed in hierarchical
thechargedneutron Aug 19, 2017
d4b5024
Minor change
thechargedneutron Aug 19, 2017
8d48223
check_memory added
thechargedneutron Aug 19, 2017
8cdebae
Conflict Resolved
thechargedneutron Aug 19, 2017
d608fc9
Changes added
thechargedneutron Aug 19, 2017
8e4845a
Mistake removed
thechargedneutron Aug 19, 2017
e26703c
Parameters revised
thechargedneutron Aug 19, 2017
7df3e38
test_pipeline revised
thechargedneutron Aug 19, 2017
2d1b58c
cachedir added in Dummy()
thechargedneutron Aug 19, 2017
2a063a4
ValueError msg edited
thechargedneutron Aug 19, 2017
b7f4c99
Pep8 errors removed
thechargedneutron Aug 19, 2017
065b595
Pep8 failures removed
thechargedneutron Aug 19, 2017
869c35e
Changes added
thechargedneutron Aug 22, 2017
9b68a22
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 22, 2017
88ce9b0
Import statements added
thechargedneutron Aug 22, 2017
4b40c74
Changes added
thechargedneutron Aug 22, 2017
79b9982
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 22, 2017
ece5049
Changes added
thechargedneutron Aug 22, 2017
419ef3b
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 22, 2017
9145f42
assert
thechargedneutron Aug 22, 2017
2fba11b
cachedir name changed
thechargedneutron Aug 22, 2017
bc7515b
cachedir format updated
thechargedneutron Aug 22, 2017
b0afc2b
Added description in whats_new.rst and other fixes
thechargedneutron Aug 22, 2017
008e2d5
missing import added
thechargedneutron Aug 22, 2017
976646d
Adds the suggested changes
thechargedneutron Aug 23, 2017
a0d0cc9
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 23, 2017
688ca8a
Error message revised
thechargedneutron Aug 23, 2017
5cd86fc
changes added
thechargedneutron Aug 23, 2017
3337c41
changed
thechargedneutron Aug 23, 2017
59f2d38
Changes added
thechargedneutron Aug 23, 2017
975dfa6
Chenges added
thechargedneutron Aug 23, 2017
9e1d6b5
Error removed
thechargedneutron Aug 23, 2017
e0e3e6d
doc changed
thechargedneutron Aug 23, 2017
b86dc4c
- removed
thechargedneutron Aug 23, 2017
b3f1dfa
Tests again added in test_pipeline.py
thechargedneutron Aug 24, 2017
4775fa2
getattr added
thechargedneutron Aug 24, 2017
1ad7814
setattr added to set cachedir to None by default
thechargedneutron Aug 24, 2017
b73f674
changes added
thechargedneutron Aug 24, 2017
fcef75b
not statement removed
thechargedneutron Aug 24, 2017
cf4d794
Update pipeline.py
thechargedneutron Aug 24, 2017
3e22d23
Merge branch 'third' of https://github.com/thechargedneutron/scikit-l…
thechargedneutron Aug 24, 2017
792245e
changes added
thechargedneutron Aug 28, 2017
309c88a
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 28, 2017
0d12170
__init__ removed
thechargedneutron Aug 28, 2017
b123d8f
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 28, 2017
a4e9d66
prints memory instead of type
thechargedneutron Aug 28, 2017
14e11d1
changes added
thechargedneutron Aug 29, 2017
967b0f1
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 29, 2017
8c1b826
error message corrected
thechargedneutron Aug 29, 2017
b6cf291
Space removed
thechargedneutron Aug 29, 2017
bf31e1b
Change added
thechargedneutron Aug 29, 2017
510a46b
dummy variable added
thechargedneutron Aug 29, 2017
d2c6c7e
Changes added
thechargedneutron Aug 29, 2017
2aea261
test_pipeline.py updated
thechargedneutron Aug 29, 2017
ce88f2d
documentation revised
thechargedneutron Aug 29, 2017
06fcd71
parameter revised
thechargedneutron Aug 29, 2017
648749d
error statement revised
thechargedneutron Aug 29, 2017
6bcc0a8
pep8 failure removed
thechargedneutron Aug 29, 2017
f7b1f72
edited
thechargedneutron Aug 29, 2017
26f2927
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 29, 2017
f078fbf
doc changed
thechargedneutron Aug 30, 2017
2953c3e
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
thechargedneutron Aug 30, 2017
ac53970
changes added
thechargedneutron Aug 30, 2017
d1eb1a7
Formatting
jnothman Aug 30, 2017
c1c2707
save what's new for 0.19.1
jnothman Aug 30, 2017
84ec245
Wording
jnothman Aug 30, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/modules/classes.rst
Expand Up @@ -1380,6 +1380,7 @@ Low-level methods
utils.validation.check_symmetric
utils.validation.column_or_1d
utils.validation.has_fit_parameter
utils.validation.check_memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please put in alphabetical order.


Recently deprecated
===================
Expand Down
21 changes: 7 additions & 14 deletions sklearn/cluster/hierarchical.py
Expand Up @@ -15,7 +15,6 @@
from scipy.sparse.csgraph import connected_components

from ..base import BaseEstimator, ClusterMixin
from ..externals.joblib import Memory
from ..externals import six
from ..metrics.pairwise import paired_distances, pairwise_distances
from ..utils import check_array
Expand All @@ -26,6 +25,8 @@

from ..externals.six.moves import xrange

from sklearn.utils.validation import check_memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually make relative import:

..utils.validation import check_memory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to line 21 as well under the other utils


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary space

###############################################################################
# For non fully-connected graphs

Expand Down Expand Up @@ -196,7 +197,8 @@ def ward_tree(X, connectivity=None, n_clusters=None, return_distance=False):
else:
if n_clusters > n_samples:
raise ValueError('Cannot provide more clusters than samples. '
'%i n_clusters was asked, and there are %i samples.'
'%i n_clusters was asked, and there are'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to be sure. was it not PEP8 previosuly. If it was less than 80 characters avoid to make a diff for nothing.

' %i samples.'
% (n_clusters, n_samples))
n_nodes = 2 * n_samples - n_clusters

Expand Down Expand Up @@ -609,7 +611,7 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
"manhattan", "cosine", or 'precomputed'.
If linkage is "ward", only "euclidean" is accepted.

memory : Instance of sklearn.externals.joblib.Memory or string, optional \
memory : Instance of joblib.Memory or string, optional \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fitting on a single line?

(default=None)
Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
Expand Down Expand Up @@ -693,16 +695,7 @@ def fit(self, X, y=None):
self
"""
X = check_array(X, ensure_min_samples=2, estimator=self)
memory = self.memory
if memory is None:
memory = Memory(cachedir=None, verbose=0)
elif isinstance(memory, six.string_types):
memory = Memory(cachedir=memory, verbose=0)
elif not isinstance(memory, Memory):
raise ValueError("'memory' should either be a string or"
" a sklearn.externals.joblib.Memory"
" instance, got 'memory={!r}' instead.".format(
type(memory)))
memory = check_memory(self.memory)

if self.n_clusters <= 0:
raise ValueError("n_clusters should be an integer greater than 0."
Expand Down Expand Up @@ -779,7 +772,7 @@ class FeatureAgglomeration(AgglomerativeClustering, AgglomerationTransform):
"manhattan", "cosine", or 'precomputed'.
If linkage is "ward", only "euclidean" is accepted.

memory : Instance of sklearn.externals.joblib.Memory or string, optional \
memory : Instance of joblib.Memory or string, optional \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fitting on a single line

(default=None)
Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
Expand Down
13 changes: 13 additions & 0 deletions sklearn/cluster/tests/test_hierarchical.py
Expand Up @@ -34,6 +34,7 @@
from sklearn.utils.fast_dict import IntFloatDict
from sklearn.utils.testing import assert_array_equal
from sklearn.utils.testing import assert_warns
from sklearn.utils.tests.test_validation import DummyMemory, WrongDummyMemory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would need @jnothman to tell me if doing this is fine.
I am not sure anymore that it is usually done in that way (even if we avoid code duplication)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, once we're using check_memory it's not that important to have a test in test_hierarchical.py. Yes, it's good to have some kind of integration testing, but I don't think we get much value out of it in this case. I'd happily see this PR without the changes to test_hierarchical and test_pipeline.



def test_deprecation_of_n_components_in_linkage_tree():
Expand All @@ -50,6 +51,7 @@ def test_deprecation_of_n_components_in_linkage_tree():
assert_equal(n_leaves, n_leaves_t)
assert_equal(parent, parent_t)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert this change?

Going in the same direction as @jnothman's remark above, adding unrelated changes like on this in a PR adds noise to the diff, makes reviewing harder, and makes it less likely to get good feeback on your PR.

That means you need to look carefully at your diff before and after pushing a commit into a PR. Maybe your editor is doing it automatically in which case you need to find out how to disable this feature.

def test_linkage_misc():
# Misc tests on linkage
rng = np.random.RandomState(42)
Expand Down Expand Up @@ -140,6 +142,17 @@ def test_agglomerative_clustering_wrong_arg_memory():
assert_raises(ValueError, clustering.fit, X)


def test_agglomerative_clustering_with_cache_attribute():
rng = np.random.RandomState(0)
n_samples = 100
X = rng.randn(n_samples, 50)
clustering = AgglomerativeClustering(memory=DummyMemory())
clustering.fit(X)

clustering = AgglomerativeClustering(memory=WrongDummyMemory())
assert_raises(ValueError, clustering.fit, X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert_raises_message or assert_raises_regex to check the message which is raised.



def test_agglomerative_clustering():
# Check that we obtain the correct number of clusters with
# agglomerative clustering.
Expand Down
16 changes: 4 additions & 12 deletions sklearn/pipeline.py
Expand Up @@ -22,6 +22,7 @@
from .utils import Bunch

from .utils.metaestimators import _BaseComposition
from sklearn.utils.validation import check_memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make relative import as previously stated


__all__ = ['Pipeline', 'FeatureUnion']

Expand Down Expand Up @@ -52,7 +53,7 @@ class Pipeline(_BaseComposition):
chained, in the order in which they are chained, with the last object
an estimator.

memory : Instance of sklearn.external.joblib.Memory or string, optional \
memory : Instance of joblib.Memory or string, optional \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it fitting on one line

(default=None)
Used to cache the fitted transformers of the pipeline. By default,
no caching is performed. If a string is given, it is the path to
Expand Down Expand Up @@ -187,16 +188,7 @@ def _final_estimator(self):
def _fit(self, X, y=None, **fit_params):
self._validate_steps()
# Setup the memory
memory = self.memory
if memory is None:
memory = Memory(cachedir=None, verbose=0)
elif isinstance(memory, six.string_types):
memory = Memory(cachedir=memory, verbose=0)
elif not isinstance(memory, Memory):
raise ValueError("'memory' should either be a string or"
" a sklearn.externals.joblib.Memory"
" instance, got 'memory={!r}' instead.".format(
type(memory)))
memory = check_memory(self.memory)

fit_transform_one_cached = memory.cache(_fit_transform_one)

Expand Down Expand Up @@ -538,7 +530,7 @@ def make_pipeline(*steps, **kwargs):
----------
*steps : list of estimators,

memory : Instance of sklearn.externals.joblib.Memory or string, optional \
memory : Instance of joblib.Memory or string, optional \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one line

(default=None)
Used to cache the fitted transformers of the pipeline. By default,
no caching is performed. If a string is given, it is the path to
Expand Down
16 changes: 13 additions & 3 deletions sklearn/tests/test_pipeline.py
Expand Up @@ -19,6 +19,7 @@
from sklearn.utils.testing import assert_array_equal
from sklearn.utils.testing import assert_array_almost_equal
from sklearn.utils.testing import assert_dict_equal
from sklearn.utils.tests.test_validation import DummyMemory, WrongDummyMemory

from sklearn.base import clone, BaseEstimator
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
Expand Down Expand Up @@ -266,6 +267,16 @@ def test_pipeline_sample_weight_unsupported():
)


def test_pipeline_with_cache_attribute():
X = np.array([[1, 2]])
pipe = Pipeline([('transf', Transf()), ('clf', Mult())], memory=DummyMemory())
pipe.fit(X, y=None)

pipe = Pipeline([('transf', Transf()), ('clf', Mult())],
memory=WrongDummyMemory())
assert_raises(ValueError, pipe.fit, X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert_raises_message or assert_raises_regex to check the message which is raised.



def test_pipeline_raise_set_params_error():
# Test pipeline raises set params error message for nested models.
pipe = Pipeline([('cls', LinearRegression())])
Expand Down Expand Up @@ -852,9 +863,8 @@ def test_pipeline_wrong_memory():
memory = 1
cached_pipe = Pipeline([('transf', DummyTransf()), ('svc', SVC())],
memory=memory)
assert_raises_regex(ValueError, "'memory' should either be a string or a"
" sklearn.externals.joblib.Memory instance, got",
cached_pipe.fit, X, y)
assert_raises_regex(ValueError, "'memory' should either be a string or"
" a joblib.Memory instance", cached_pipe.fit, X, y)


def test_pipeline_memory():
Expand Down
21 changes: 21 additions & 0 deletions sklearn/utils/tests/test_validation.py
Expand Up @@ -31,6 +31,7 @@
check_is_fitted,
check_consistent_length,
assert_all_finite,
check_memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a note for you future PR:
We only make put absolute path in the test file like here :)

)
import sklearn

Expand Down Expand Up @@ -539,3 +540,23 @@ def test_suppress_validation():
assert_all_finite(X)
sklearn.set_config(assume_finite=False)
assert_raises(ValueError, assert_all_finite, X)


class DummyMemory(object):
def __init__(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need init as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

pass

def cache(self, func):
return func

cachedir = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this in __init__ method

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to require cachedir, I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be useful for later checking maybe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean. An example?

What I mean is that we might as well accept (and therefore test) objects that have a cache method which operates as expected, regardless of whether it's got all the other attributes that a joblib.Memory has.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ok.

I only wanted to make test something like

memory = check_memory(DummyMemory())
assert_equal(memory.cachedir = 'something')

but it can be cachedir or anything else as attributes

Copy link
Member

@jnothman jnothman Aug 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just:

dummy = DummyMemory()
memory = check_memory(dummy)
assert memory is dummy

??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes much better indeed



class WrongDummyMemory(object):
def __init__(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need init as well

pass


def test_check_memory():
memory = check_memory(DummyMemory())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also need to test what happens when a string or None is passed in.

assert_raises(ValueError, check_memory, WrongDummyMemory())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assert_raises_regex to check the message raised

27 changes: 27 additions & 0 deletions sklearn/utils/validation.py
Expand Up @@ -20,6 +20,7 @@
from ..exceptions import NonBLASDotWarning
from ..exceptions import NotFittedError
from ..exceptions import DataConversionWarning
from ..externals.joblib import Memory


FLOAT_DTYPES = (np.float64, np.float32, np.float16)
Expand Down Expand Up @@ -155,6 +156,32 @@ def _shape_repr(shape):
return "(%s)" % joined


def check_memory(memory):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to test this functions in test_validation.py. Create the DummyMemory and WrongDummyMemory classes there.

"""Check that the memory is an instance of joblib.Memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to come up with a better wording here and everywhere else in this PR. joblib.Memory is misleading since joblib may not be installed.

I would use "memory-like" (maybe Memory-like i.e.with a capital M), similarly to array-like in numpy.

The docstring would read something like

"""Check that ``memory`` is memory-like.

memory-like means that ``memory`` can be converted into a
sklearn.externals.joblib.Memory instance (typically a str denoting the
``cachedir``) or has the same interface.

Parameters
----------
memory : memory-like or None

Returns
------
memory : object with the sklearn.externals.joblib.Memory interface

Raises
------
ValueError
    if ``memory`` is not memory-like
"""

And then use memory-like as the type info in all the docstrings, possibly with a rst link to check_memory for more details.


Raises a ValueError if the passed object does not have a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need a blank line between the short description and the long description.


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no blank line here, please

cache attribute.

Parameters
----------
memory: input object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need space before :. The right of the colon should only specify type, i.e. str or something with cache or None


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lesteve we should put a description to follow numpydoc? (That's true that there is no much to say apart of "memory instance to be validated" and "Validated memory instance"

Returns
-------
memory: the input memory if it is valid. A valueError if invalid memory instance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need space before :. The right of the colon should only specify type, i.e. Memory.

"""

if memory is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to add a docstring

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can actually tackle this case in the same way as a string:

if memory is None or isinstance(memory, six.string_types):
    memory = Memory(cachedir=memory, verbose=0)t
elif not hasattr(memory, 'cache'):
    # rest of the code is the same

memory = Memory(cachedir=None, verbose=0)
elif isinstance(memory, six.string_types):
memory = Memory(cachedir=memory, verbose=0)
elif not hasattr(memory, 'cache'):
raise ValueError("'memory' should either be a string or"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find the error message enough meaningful.

'memory' is not a string or a Memory instance implementing a cache method. Got a {} instance, instead.'.format(type(memory))

" a joblib.Memory instance")
return memory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the raise statement should come here.



def check_consistent_length(*arrays):
"""Check that all arrays have consistent first dimensions.

Expand Down