New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid using isinstance(memory, Memory) to allow for alternative Memory implementations #9563

Closed
jnothman opened this Issue Aug 16, 2017 · 11 comments

Comments

Projects
None yet
4 participants
@jnothman
Member

jnothman commented Aug 16, 2017

As raised in #9561, we should not be using isinstance(memory, Memory) to validate Memory arguments, as it's not straightforward for users to realise they need to use sklearn.externals.joblib.Memory (rather than joblib.Memory).

I think we should be using ducktyping wherever we check for memory. If a memory parameter has a method cache, we should treat it as Memory. We should have test cases with dummy replacements for memory that have the right cache interface to make sure this is sufficient for our needs.

This affects hierarchical clustering and Pipeline.

@amueller

This comment has been minimized.

Member

amueller commented Aug 16, 2017

I agree, but also, we should stop vendoring joblib ;)

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 16, 2017

I want to work on this. Can one of you please provide some more background on this.

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 16, 2017

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Aug 17, 2017

I remember to have the discussion IRL with @lesteve when working on this PR 4c33ea2

I really had a hard time getting the error "memory needs to be a joblib.Memory" while passing a joblib.Memory. This said checking ".cache" was also in the discussion

@thechargedneutron let me know if you are on it or I can take this one.

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 17, 2017

@glemaitre I am working on this. I hope I'll be able to solve this. If not, I'll ping you. Thanks

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 17, 2017

What is the test function supposed to do? I suppose it to create a tmp memory and then run the fit function. Is it a good idea to just use assert_equal on two fit functions, one which uses memory argument and one which doesn't. We can certainly check if ducktyping works or not here, but is it enough?
Also, do we intend to use ducktyping in linear_model.randomized_l1.py. This uses isinstance(memory, Memory) but was not mentioned above.

@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Aug 17, 2017

@jnothman

This comment has been minimized.

Member

jnothman commented Aug 18, 2017

@thechargedneutron

This comment has been minimized.

Contributor

thechargedneutron commented Aug 18, 2017

As suggested by @glemaitre , I intend to make add the following function. Does this look good?

class Dummy(object):
    pass
def test_agglomerative_clustering_with_cache_attribute():
    dummy = Dummy()
    setattr(dummy, 'cache', 'nothing')
    rng = np.random.RandomState(0)
    n_samples = 100
    X = rng.randn(n_samples, 50)
    clustering = AgglomerativeClustering(memory = dummy)
    clustering.fit(X)
@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Aug 18, 2017

The potential test should around that:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.utils.testing import assert_raises_regex


class DummyMemory(object):

    def __init__(self):
        pass

    def cache(self, func):
        return func


class WrongMemory(object):

    def __init__(self):
        pass


X = np.random.random((100, 1000))
clustering = AgglomerativeClustering(memory=DummyMemory())
clustering.fit(X)

clustering = AgglomerativeClustering(memory=WrongMemory())
assert_raises_regex(ValueError, "'memory' should either be",
                    clustering.fit, X)

and the solve in the codebase should change

elif not isinstance(memory, Memory):
            raise ValueError("'memory' should either be a string or"
                             " a sklearn.externals.joblib.Memory"
                             " instance, got 'memory={!r}' instead.".format(
                                 type(memory)))

to

elif getattr(memory, "cache", None) is None:
            raise ValueError("'memory' should either be a string or"
                             " a joblib.Memory"
                             " instance, got 'memory={!r}' instead.".format(
                                 type(memory)))
@glemaitre

This comment has been minimized.

Contributor

glemaitre commented Aug 18, 2017

This should work but it is on the top of the head.
Open a PR and the review will improve your solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment