[MRG+2] Implements ducktyping to allow for alternative Memory implementations #9584

Merged
merged 68 commits into from Aug 30, 2017
+119 −42

@@ -43,6 +43,11 @@ should be used when applicable.
be sliced or indexed using safe_index. This is used to validate input for
cross-validation.
The :func:`check_memory` checks that input is ``joblib.Memory``-like,
which means that it can be converted into a
``sklearn.externals.joblib.Memory`` instance (typically a str denoting
the ``cachedir``) or has the same interface.
If your code relies on a random number generator, it should never use
functions like ``numpy.random.random`` or ``numpy.random.normal``. This
approach can lead to repeatability issues in unit tests. Instead, a
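The ``joblib.Memory``-like contract described above can be illustrated with a small, self-contained sketch. ``FakeMemory`` and this ``check_memory`` replica are hypothetical stand-ins for illustration only, not the actual sklearn/joblib code:

```python
class FakeMemory(object):
    """Hypothetical stand-in for sklearn.externals.joblib.Memory."""

    def __init__(self, cachedir=None, verbose=0):
        self.cachedir = cachedir

    def cache(self, func):
        # the real Memory would memoize ``func`` to ``cachedir``
        return func


def check_memory(memory):
    # Simplified replica of the validation this PR adds: accept None,
    # a str (treated as the cachedir path), or any object exposing
    # a ``cache`` method (duck typing).
    if memory is None or isinstance(memory, str):
        return FakeMemory(cachedir=memory, verbose=0)
    if not hasattr(memory, 'cache'):
        raise ValueError("'memory' should be None, a string or have the"
                         " same interface as sklearn.externals.joblib."
                         "Memory. Got memory='{}' instead.".format(memory))
    return memory
```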
@@ -1378,6 +1378,7 @@ Low-level methods
utils.sparsefuncs.inplace_swap_column
utils.sparsefuncs.mean_variance_axis
utils.validation.check_is_fitted
utils.validation.check_memory
utils.validation.check_symmetric
utils.validation.column_or_1d
utils.validation.has_fit_parameter
@@ -63,6 +63,10 @@ Decomposition, manifold learning and clustering
division on Python 2 versions. :issue:`9492` by
:user:`James Bourbeau <jrbourbeau>`.
- Add :func:`utils.validation.check_memory` to validate that input is
``joblib.Memory``-like. :issue:`9563` by
:user:`Kumar Ashutosh <thechargedneutron>`.
Version 0.19
============
@@ -15,17 +15,18 @@
from scipy.sparse.csgraph import connected_components
from ..base import BaseEstimator, ClusterMixin
from ..externals.joblib import Memory
from ..externals import six
from ..metrics.pairwise import paired_distances, pairwise_distances
from ..utils import check_array
from ..utils.validation import check_memory
from . import _hierarchical
from ._feature_agglomeration import AgglomerationTransform
from ..utils.fast_dict import IntFloatDict
from ..externals.six.moves import xrange

@glemaitre (Contributor), Aug 29, 2017:

unnecessary space

###############################################################################
# For non fully-connected graphs
@@ -609,8 +610,7 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
"manhattan", "cosine", or 'precomputed'.
If linkage is "ward", only "euclidean" is accepted.
-    memory : Instance of sklearn.externals.joblib.Memory or string, optional \
-        (default=None)
+    memory : joblib.Memory-like, optional (default=None)
Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
path to the caching directory.
@@ -693,16 +693,7 @@ def fit(self, X, y=None):
self
"""
X = check_array(X, ensure_min_samples=2, estimator=self)
-        memory = self.memory
-        if memory is None:
-            memory = Memory(cachedir=None, verbose=0)
-        elif isinstance(memory, six.string_types):
-            memory = Memory(cachedir=memory, verbose=0)
-        elif not isinstance(memory, Memory):
-            raise ValueError("'memory' should either be a string or"
-                             " a sklearn.externals.joblib.Memory"
-                             " instance, got 'memory={!r}' instead.".format(
-                                 type(memory)))
+        memory = check_memory(self.memory)
if self.n_clusters <= 0:
raise ValueError("n_clusters should be an integer greater than 0."
@@ -779,8 +770,7 @@ class FeatureAgglomeration(AgglomerativeClustering, AgglomerationTransform):
"manhattan", "cosine", or 'precomputed'.
If linkage is "ward", only "euclidean" is accepted.
-    memory : Instance of sklearn.externals.joblib.Memory or string, optional \
-        (default=None)
+    memory : joblib.Memory-like, optional (default=None)
Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
path to the caching directory.
@@ -19,6 +19,7 @@
from .externals import six
from .utils.metaestimators import if_delegate_has_method
from .utils import Bunch
from .utils.validation import check_memory
from .utils.metaestimators import _BaseComposition
@@ -51,8 +52,7 @@ class Pipeline(_BaseComposition):
chained, in the order in which they are chained, with the last object
an estimator.
-    memory : Instance of sklearn.external.joblib.Memory or string, optional \
-        (default=None)
+    memory : joblib.Memory-like, optional (default=None)
Used to cache the fitted transformers of the pipeline. By default,
no caching is performed. If a string is given, it is the path to
the caching directory. Enabling caching triggers a clone of
@@ -186,16 +186,7 @@ def _final_estimator(self):
def _fit(self, X, y=None, **fit_params):
self._validate_steps()
# Setup the memory
-        memory = self.memory
-        if memory is None:
-            memory = Memory(cachedir=None, verbose=0)
-        elif isinstance(memory, six.string_types):
-            memory = Memory(cachedir=memory, verbose=0)
-        elif not isinstance(memory, Memory):
-            raise ValueError("'memory' should either be a string or"
-                             " a sklearn.externals.joblib.Memory"
-                             " instance, got 'memory={!r}' instead.".format(
-                                 type(memory)))
+        memory = check_memory(self.memory)
fit_transform_one_cached = memory.cache(_fit_transform_one)
@@ -209,7 +200,7 @@ def _fit(self, X, y=None, **fit_params):
if transformer is None:
pass
else:
-            if memory.cachedir is None:
+            if hasattr(memory, 'cachedir') and memory.cachedir is None:
# we do not clone when caching is disabled to preserve
# backward compatibility
cloned_transformer = transformer
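The ``hasattr`` guard in this hunk exists because a duck-typed memory need not define ``cachedir`` at all, so probing the attribute first keeps alternative implementations working. A minimal sketch (the class names here are hypothetical, not from the PR):

```python
class NoDirMemory(object):
    """Hypothetical Memory-like object without a ``cachedir`` attribute."""
    def cache(self, func):
        return func


class DirMemory(object):
    """Hypothetical Memory-like object with caching switched off."""
    cachedir = None

    def cache(self, func):
        return func


def caching_disabled(memory):
    # Mirrors the guarded check: only joblib-style objects that expose
    # ``cachedir`` and set it to None count as "caching disabled".
    return hasattr(memory, 'cachedir') and memory.cachedir is None
```

A bare ``memory.cachedir is None`` would raise ``AttributeError`` on ``NoDirMemory``, which is exactly what the added ``hasattr`` avoids.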
@@ -537,8 +528,7 @@ def make_pipeline(*steps, **kwargs):
----------
*steps : list of estimators,
-    memory : Instance of sklearn.externals.joblib.Memory or string, optional \
-        (default=None)
+    memory : joblib.Memory-like, optional (default=None)

@jnothman (Member), Aug 29, 2017:

I think we need to say string is an option in the type spec, or we make this seem harder to use than it is. If we're short of space, I don't feel default=None adds anything.

@thechargedneutron (Contributor), Aug 29, 2017:

@lesteve suggested this to me, WDYT should be here?

@glemaitre (Contributor), Aug 29, 2017:

Agreed, did not spot it.

@thechargedneutron (Contributor), Aug 29, 2017:

Should I add "or string"?

@glemaitre (Contributor), Aug 29, 2017:

Drop (default=None) for sure. I would add "or string" since it would work if you pass one.

@thechargedneutron (Contributor), Aug 29, 2017:

What's the reason for dropping default=None? We are not short of space.

@jnothman (Member), Aug 30, 2017:
It's something we're not very consistent with here. I find that default=None is very uninformative:

  1. Default values (but not their meaning) are mentioned in the function signature, and key projects like numpy etc do not mention them here in the type spec.
  2. "optional" and "default=..." are redundant together.
  3. "default=None" has a different semantics (i.e. it does something different) everywhere you see it: sometimes it means "determine it automatically"; sometimes "switch the feature off"; sometimes "all"; sometimes "none"; sometimes "this feature is deprecated".

So I much prefer not to say "default=None", where "optional" already implies (and can be made explicit in the description) that the user can live without the feature.

But you certainly should state that passing a string is one way to specify memory.

Used to cache the fitted transformers of the pipeline. By default,
no caching is performed. If a string is given, it is the path to
the caching directory. Enabling caching triggers a clone of
@@ -868,9 +868,33 @@ def test_pipeline_wrong_memory():
memory = 1
cached_pipe = Pipeline([('transf', DummyTransf()), ('svc', SVC())],
memory=memory)
-    assert_raises_regex(ValueError, "'memory' should either be a string or a"
-                        " sklearn.externals.joblib.Memory instance, got",
-                        cached_pipe.fit, X, y)
+    assert_raises_regex(ValueError, "'memory' should be None, a string or"
+                        " have the same interface as "
+                        "sklearn.externals.joblib.Memory."
+                        " Got memory='1' instead.", cached_pipe.fit, X, y)
class DummyMemory(object):
def cache(self, func):
return func
class WrongDummyMemory(object):
pass
def test_pipeline_with_cache_attribute():
X = np.array([[1, 2]])
pipe = Pipeline([('transf', Transf()), ('clf', Mult())],
memory=DummyMemory())
pipe.fit(X, y=None)
dummy = WrongDummyMemory()
pipe = Pipeline([('transf', Transf()), ('clf', Mult())],
memory=dummy)
assert_raises_regex(ValueError, "'memory' should be None, a string or"
" have the same interface as "
"sklearn.externals.joblib.Memory."
" Got memory='{}' instead.".format(dummy), pipe.fit, X)
def test_pipeline_memory():
@@ -1,6 +1,7 @@
"""Tests for input validation functions"""
import warnings
import os
from tempfile import NamedTemporaryFile
from itertools import product
@@ -10,7 +11,8 @@
import scipy.sparse as sp
from sklearn.utils.testing import assert_true, assert_false, assert_equal
-from sklearn.utils.testing import assert_raises, assert_raises_regexp
+from sklearn.utils.testing import assert_raises
+from sklearn.utils.testing import assert_raises_regex
from sklearn.utils.testing import assert_no_warnings
from sklearn.utils.testing import assert_warns_message
from sklearn.utils.testing import assert_warns
@@ -31,6 +33,7 @@
check_is_fitted,
check_consistent_length,
assert_all_finite,
check_memory

This comment has been minimized.

@glemaitre

glemaitre Aug 23, 2017

Contributor

Just as a note for you future PR:
We only make put absolute path in the test file like here :)

)
import sklearn
@@ -39,6 +42,7 @@
from sklearn.utils.testing import assert_raise_message
def test_as_float_array():
# Test function for as_float_array
X = np.ones((3, 10), dtype=np.int32)
@@ -506,17 +510,17 @@ def test_check_consistent_length():
check_consistent_length([1], [2], [3], [4], [5])
check_consistent_length([[1, 2], [[1, 2]]], [1, 2], ['a', 'b'])
check_consistent_length([1], (2,), np.array([3]), sp.csr_matrix((1, 2)))
-    assert_raises_regexp(ValueError, 'inconsistent numbers of samples',
-                         check_consistent_length, [1, 2], [1])
-    assert_raises_regexp(TypeError, 'got <\w+ \'int\'>',
-                         check_consistent_length, [1, 2], 1)
-    assert_raises_regexp(TypeError, 'got <\w+ \'object\'>',
-                         check_consistent_length, [1, 2], object())
+    assert_raises_regex(ValueError, 'inconsistent numbers of samples',
+                        check_consistent_length, [1, 2], [1])
+    assert_raises_regex(TypeError, 'got <\w+ \'int\'>',
+                        check_consistent_length, [1, 2], 1)
+    assert_raises_regex(TypeError, 'got <\w+ \'object\'>',
+                        check_consistent_length, [1, 2], object())
assert_raises(TypeError, check_consistent_length, [1, 2], np.array(1))
# Despite ensembles having __len__ they must raise TypeError
-    assert_raises_regexp(TypeError, 'estimator', check_consistent_length,
-                         [1, 2], RandomForestRegressor())
+    assert_raises_regex(TypeError, 'estimator', check_consistent_length,
+                        [1, 2], RandomForestRegressor())
# XXX: We should have a test with a string, but what is correct behaviour?
@@ -539,3 +543,31 @@ def test_suppress_validation():
assert_all_finite(X)
sklearn.set_config(assume_finite=False)
assert_raises(ValueError, assert_all_finite, X)
class DummyMemory(object):
def cache(self, func):
return func
class WrongDummyMemory(object):
pass
def test_check_memory():
memory = check_memory("cache_directory")
assert_equal(memory.cachedir, os.path.join('cache_directory', 'joblib'))
memory = check_memory(None)
assert_equal(memory.cachedir, None)
dummy = DummyMemory()
memory = check_memory(dummy)
assert memory is dummy
assert_raises_regex(ValueError, "'memory' should be None, a string or"
" have the same interface as "
"sklearn.externals.joblib.Memory."
" Got memory='1' instead.", check_memory, 1)
dummy = WrongDummyMemory()
assert_raises_regex(ValueError, "'memory' should be None, a string or"
" have the same interface as "
"sklearn.externals.joblib.Memory. Got memory='{}' "
"instead.".format(dummy), check_memory, dummy)
@@ -20,6 +20,7 @@
from ..exceptions import NonBLASDotWarning
from ..exceptions import NotFittedError
from ..exceptions import DataConversionWarning
from ..externals.joblib import Memory
FLOAT_DTYPES = (np.float64, np.float32, np.float16)
@@ -155,6 +156,36 @@ def _shape_repr(shape):
return "(%s)" % joined
def check_memory(memory):

@glemaitre (Contributor), Aug 21, 2017:

You need to test this function in test_validation.py. Create the DummyMemory and WrongDummyMemory classes there.

"""Check that ``memory`` is joblib.Memory-like.
joblib.Memory-like means that ``memory`` can be converted into a
sklearn.externals.joblib.Memory instance (typically a str denoting the
``cachedir``) or has the same interface.
Parameters
----------
memory : joblib.Memory-like

@jnothman (Member), Aug 30, 2017:

or string or None

@glemaitre (Contributor), Aug 29, 2017:

@lesteve should we put a description to follow numpydoc? (It's true that there is not much to say apart from "memory instance to be validated" and "validated memory instance".)

Returns
-------
memory : object with the joblib.Memory interface
Raises
------
ValueError
if ``memory`` is not joblib.Memory-like

@glemaitre (Contributor), Aug 29, 2017:

Capital letter and a final full stop:

       If ``memory`` is not joblib.Memory-like.
"""
if memory is None or isinstance(memory, six.string_types):
memory = Memory(cachedir=memory, verbose=0)
elif not hasattr(memory, 'cache'):
raise ValueError("'memory' should be None, a string or have the same"
" interface as sklearn.externals.joblib.Memory."
" Got memory='{}' instead.".format(memory))
return memory
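Since the check above only requires a ``cache`` method, any object implementing it can stand in for ``joblib.Memory``. A hypothetical alternative implementation (``LoggingMemory`` is illustrative only, not part of the PR) shows how ``memory.cache(_fit_transform_one)`` would behave with a duck-typed memory:

```python
class LoggingMemory(object):
    """Hypothetical Memory-like object: records calls instead of caching."""

    def __init__(self):
        self.calls = []

    def cache(self, func):
        # Return a wrapper with the same signature; joblib.Memory.cache
        # would instead memoize results to disk.
        def wrapper(*args, **kwargs):
            self.calls.append(func.__name__)
            return func(*args, **kwargs)
        return wrapper


def fit_transform_one(X):
    # stand-in for a pipeline step's fit_transform work
    return [x + 1 for x in X]


mem = LoggingMemory()
cached_fit = mem.cache(fit_transform_one)
result = cached_fit([1, 2])
```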

@glemaitre (Contributor), Aug 21, 2017:

I think that the raise statement should come here.

def check_consistent_length(*arrays):
"""Check that all arrays have consistent first dimensions.