New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+2] Implements ducktyping to allow for alternative Memory implementations #9584

Merged
merged 68 commits into from Aug 30, 2017

Conversation

Projects
None yet
6 participants
@thechargedneutron
Contributor

thechargedneutron commented Aug 18, 2017

Reference Issue

Fixes #9563

What does this implement/fix? Explain your changes.

Introduced ducktyping in pipeline.py as well as in hierarchical.py

Any other comments?

Added tests as well.

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 18, 2017

How to import joblib. Tests failed because of ImportError. Do I need to mention somewhere that I used a new module?

@jnothman

I think in practice we also access memory.cachedir. Seeing as this is just used as a proxy for "is there actual caching happening", we can just use getattr(memory, 'cachedir', True) in its place.

@@ -15,7 +15,7 @@
from scipy.sparse.csgraph import connected_components
from ..base import BaseEstimator, ClusterMixin
from ..externals.joblib import Memory
from joblib import Memory

This comment has been minimized.

@jnothman

jnothman Aug 19, 2017

Member

I don't think this needs importing if all references to it in code have been removed...?

thechargedneutron added some commits Aug 19, 2017

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 19, 2017

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 19, 2017

I removed those lines and added a check memory.cachedir condition. Can you please check the last commit I made and decide whether it works?

thechargedneutron added some commits Aug 19, 2017

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 19, 2017

I hope this satisfies the requirements. Kindly review it and suggest changes, if any.

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Aug 21, 2017

You should change your title to MRG. You will get more attention from the developers :)

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 21, 2017

I am waiting for a review. I think it's complete but not sure ;) What do you think?

@glemaitre

You need to add check_memory in the documentation in doc/modules/classes.rst and doc/developers/utilities.rst
You need also to add an entry in the what's new.

@@ -155,6 +156,14 @@ def _shape_repr(shape):
return "(%s)" % joined
def check_memory(memory):
if memory is None:

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

You need to add a docstring

memory = Memory(cachedir=None, verbose=0)
elif isinstance(memory, six.string_types):
memory = Memory(cachedir=memory, verbose=0)
return memory

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

I think that the raise statement should come here.

memory = Memory(cachedir=memory, verbose=0)
elif not isinstance(memory, Memory):
memory = check_memory(self.memory)
if not hasattr(memory, 'cache'):

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

This raise statement will be duplicate. You need to include it in check_memory.

memory = Memory(cachedir=memory, verbose=0)
elif not isinstance(memory, Memory):
memory = check_memory(self.memory)
if not hasattr(memory, 'cache'):

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

This raise statement will be duplicate. You need to include it in check_memory.

@@ -140,6 +141,30 @@ def test_agglomerative_clustering_wrong_arg_memory():
assert_raises(ValueError, clustering.fit, X)
class Dummy(object):

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

Don't call this Dummy. It is not straightforward for something that it is not used to the code. DummyMemory is more informative

@@ -140,6 +141,30 @@ def test_agglomerative_clustering_wrong_arg_memory():
assert_raises(ValueError, clustering.fit, X)
class Dummy(object):

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

I think that those two class should be moved into the testing of check_memory and could be imported here.

@@ -155,6 +156,14 @@ def _shape_repr(shape):
return "(%s)" % joined
def check_memory(memory):

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

You need to test this functions in test_validation.py. Create the DummyMemory and WrongDummyMemory classes there.

clustering.fit(X)
clustering = AgglomerativeClustering(memory=Wrong_Dummy())
assert_raises(ValueError, clustering.fit, X)

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

Use assert_raises_message or assert_raises_regex to check the message which is raised.

pipe = Pipeline([('transf', Transf()), ('clf', Mult())],
memory=Wrong_Dummy())
assert_raises(ValueError, pipe.fit, X)

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

Use assert_raises_message or assert_raises_regex to check the message which is raised.

assert_raises_regex(ValueError, "'memory' should either be a string or"
" a joblib.Memory"
" instance, got 'memory={!r}' instead.".format(
type(memory)),

This comment has been minimized.

@glemaitre

glemaitre Aug 21, 2017

Contributor

Hard code type(memory). We don't want the test to be able to modify depending of the type of memory.

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 21, 2017

I am waiting for a review. I think it's complete but not sure ;) What do you think?

WIP means: Hello everyone this is a snippet of what I'm working on. WDYT?

MRG means: I've done what I'm going to do to make this mergeable. Now I need review.

@@ -34,6 +34,7 @@
from sklearn.utils.fast_dict import IntFloatDict
from sklearn.utils.testing import assert_array_equal
from sklearn.utils.testing import assert_warns
from sklearn.utils.tests.test_validation import DummyMemory, WrongDummyMemory

This comment has been minimized.

@glemaitre

glemaitre Aug 22, 2017

Contributor

I would need @jnothman to tell me if doing this is fine.
I am not sure anymore that it is usually done in that way (even if we avoid code duplication)

This comment has been minimized.

@jnothman

jnothman Aug 22, 2017

Member

TBH, once we're using check_memory it's not that important to have a test in test_hierarchical.py. Yes, it's good to have some kind of integration testing, but I don't think we get much value out of it in this case. I'd happily see this PR without the changes to test_hierarchical and test_pipeline.

def test_check_memory():
memory = check_memory(DummyMemory())

This comment has been minimized.

@jnothman

jnothman Aug 22, 2017

Member

we also need to test what happens when a string or None is passed in.

Parameters
----------
memory: input object.

This comment has been minimized.

@jnothman

jnothman Aug 22, 2017

Member

need space before :. The right of the colon should only specify type, i.e. str or something with cache or None

Returns
-------
memory: the input memory if it is valid. A valueError if invalid memory instance.

This comment has been minimized.

@jnothman

jnothman Aug 22, 2017

Member

need space before :. The right of the colon should only specify type, i.e. Memory.

"""Check that the memory is an instance of joblib.Memory.
Raises a ValueError if the passed object does not have a

This comment has been minimized.

@jnothman

jnothman Aug 22, 2017

Member

no blank line here, please

@glemaitre

LGTM, only nitpick ... @lesteve I would almost do them if I would have push rights :)

from . import _hierarchical
from ._feature_agglomeration import AgglomerationTransform
from ..utils.fast_dict import IntFloatDict
from ..externals.six.moves import xrange

This comment has been minimized.

@glemaitre

glemaitre Aug 29, 2017

Contributor

unnecessary space

Parameters
----------
memory : joblib.Memory-like

This comment has been minimized.

@glemaitre

glemaitre Aug 29, 2017

Contributor

@lesteve we should put a description to follow numpydoc? (That's true that there is no much to say apart of "memory instance to be validated" and "Validated memory instance"

Raises
------
ValueError
if ``memory`` is not joblib.Memory-like

This comment has been minimized.

@glemaitre

glemaitre Aug 29, 2017

Contributor

Capital letter and final full mark:

       If ``memory`` is not joblib.Memory-like.
@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 29, 2017

@glemaitre Done! except the 2nd comment.

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Aug 29, 2017

@glemaitre Done! except the 2nd comment.

Great. Was it a review to remove them previously. If yes, I am fine for merging.

@@ -537,8 +528,7 @@ def make_pipeline(*steps, **kwargs):
----------
*steps : list of estimators,
memory : Instance of sklearn.externals.joblib.Memory or string, optional \
(default=None)
memory : joblib.Memory-like, optional (default=None)

This comment has been minimized.

@jnothman

jnothman Aug 29, 2017

Member

I think we need to say string is an option in the type spec or we make this seem harder to use than it is. If we're short of space, I don't feel default=None adds anything.

This comment has been minimized.

@thechargedneutron

thechargedneutron Aug 29, 2017

Contributor

@lesteve suggested this to me, WDYT should be here?

This comment has been minimized.

@glemaitre

glemaitre Aug 29, 2017

Contributor

Agreed, did not spot it.

This comment has been minimized.

@thechargedneutron

thechargedneutron Aug 29, 2017

Contributor

Should I add or string ?

This comment has been minimized.

@glemaitre

glemaitre Aug 29, 2017

Contributor

drop (default=None) for sure. I would add or string since it would work if you pass one.

This comment has been minimized.

@thechargedneutron

thechargedneutron Aug 29, 2017

Contributor

What's the reason for dropping default=None ? We are not short of space.

This comment has been minimized.

@jnothman

jnothman Aug 30, 2017

Member

It's something we're not very consistent with here. I find that default=None is very uninformative:

  1. Default values (but not their meaning) are mentioned in the function signature, and key projects like numpy etc do not mention them here in the type spec.
  2. "optional" and "default=..." are redundant together.
  3. "default=None" has a different semantics (i.e. it does something different) everywhere you see it: sometimes it means "determine it automatically"; sometimes "switch the feature off"; sometimes "all"; sometimes "none"; sometimes "this feature is deprecated".

So I much prefer not to say "default=None", where "optional" already implies (and can be made explicit in the description) that the user can live without the feature.

But you certainly should state that passing a string is one way to specify memory.

@jnothman jnothman added this to the 0.19.1 milestone Aug 30, 2017

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 30, 2017

I've marked this for inclusion in 0.19.1 because it may help a lot with the new feature in Pipeline.

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 30, 2017

@jnothman Changed the function description. Review please.

Parameters
----------
memory : joblib.Memory-like

This comment has been minimized.

@jnothman

jnothman Aug 30, 2017

Member

or string or None

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 30, 2017

Fix that up and I'll merge on green.

thechargedneutron and others added some commits Aug 30, 2017

@jnothman jnothman merged commit e66aa6d into scikit-learn:master Aug 30, 2017

0 of 4 checks passed

ci/circleci CircleCI is running your tests
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
lgtm analysis: Python Running analyses for revisions
Details
@jnothman

This comment has been minimized.

Member

jnothman commented Aug 30, 2017

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Aug 30, 2017

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 30, 2017

@jnothman welcome!! Thanks for the help :)

@lesteve

This comment has been minimized.

Member

lesteve commented Aug 30, 2017

I am not sure why we added string on top of joblib.Memory-like explicitly to the type info. As per the check_memory docstring:

joblib.Memory-like means that memory can be converted into a
sklearn.externals.joblib.Memory instance (typically a str denoting the
cachedir) or has the same interface (has a cache method).

In my mind, joblib.Memory-like was supposed to be modelled after array-like in numpy. You don't see this in numpy docstrings:

parameter : array_like or list

I would agree that being explicit and adding string makes it easier to use for user not familiar with joblib.Memory (in general the docstring does say what happens if a string is passed in though). I was hoping that just using joblib.Memory-like in the type info and either adding a link to check_memory function for more details or copy and pasting the same docstring around would work.

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 30, 2017

@lesteve

This comment has been minimized.

Member

lesteve commented Aug 30, 2017

Well, we can debate it, but I strongly disagree that "joblib.Memory-like" is sufficiently helpful to a user who's never heard of joblib.

Makes sense but then I would change the wording in the check_memory docstring, probably so that joblib.Memory-like means "same interface as joblib.Memory".

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Aug 30, 2017

@thechargedneutron thechargedneutron deleted the thechargedneutron:third branch Aug 30, 2017

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment