
[MRG+2] Implements ducktyping to allow for alternative Memory implementations #9584

Merged
68 commits merged, Aug 30, 2017
+60 −12
Initial Commit

thechargedneutron committed Aug 18, 2017
commit 160d9bc18bda99fade4bd7fb9a5ff108a7fccc6f
@@ -15,7 +15,7 @@
from scipy.sparse.csgraph import connected_components
from ..base import BaseEstimator, ClusterMixin
-from ..externals.joblib import Memory
+from joblib import Memory

@jnothman (Member), Aug 19, 2017:
I don't think this needs importing if all references to it in code have been removed...?

from ..externals import six
from ..metrics.pairwise import paired_distances, pairwise_distances
from ..utils import check_array
@@ -196,7 +196,8 @@ def ward_tree(X, connectivity=None, n_clusters=None, return_distance=False):
else:
if n_clusters > n_samples:
raise ValueError('Cannot provide more clusters than samples. '
-'%i n_clusters was asked, and there are %i samples.'
+'%i n_clusters was asked, and there are'

@glemaitre (Contributor), Aug 23, 2017:
Just to be sure: was this not PEP8 previously? If it was fewer than 80 characters, avoid making a diff for nothing.

+' %i samples.'
% (n_clusters, n_samples))
n_nodes = 2 * n_samples - n_clusters
@@ -609,7 +610,7 @@ class AgglomerativeClustering(BaseEstimator, ClusterMixin):
"manhattan", "cosine", or 'precomputed'.
If linkage is "ward", only "euclidean" is accepted.
-memory : Instance of sklearn.externals.joblib.Memory or string, optional \
+memory : Instance of joblib.Memory or string, optional \

@glemaitre (Contributor), Aug 22, 2017:
Is it fitting on a single line?

(default=None)
Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
@@ -698,9 +699,9 @@ def fit(self, X, y=None):
memory = Memory(cachedir=None, verbose=0)
elif isinstance(memory, six.string_types):
memory = Memory(cachedir=memory, verbose=0)
-elif not isinstance(memory, Memory):
+elif not hasattr(memory, 'cache'):
raise ValueError("'memory' should either be a string or"
-" a sklearn.externals.joblib.Memory"
+" a joblib.Memory"
" instance, got 'memory={!r}' instead.".format(
type(memory)))
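The hunk above is the core of the PR: the strict `isinstance(memory, Memory)` check is replaced by duck typing, so any object exposing a `cache` method is accepted. A minimal stand-alone sketch of that behavior (the names `DummyMemory` and `check_memory_like` are hypothetical, not part of the PR):

```python
class DummyMemory(object):
    """Any object exposing a ``cache`` method passes the duck-typing check."""
    def cache(self, func):
        # No-op caching: return the function unchanged.
        return func


def check_memory_like(memory):
    """Validate ``memory`` the way the patched ``fit`` does."""
    if memory is None or isinstance(memory, str):
        return memory  # handled later by constructing a joblib.Memory
    if hasattr(memory, 'cache'):
        return memory  # duck-typed Memory-like object, accepted as-is
    raise ValueError("'memory' should either be a string or"
                     " a joblib.Memory"
                     " instance, got 'memory={!r}' instead.".format(
                         type(memory)))
```

This lets users plug in alternative Memory implementations (distributed caches, mocks for testing) without depending on sklearn's vendored joblib.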
@@ -779,7 +780,7 @@ class FeatureAgglomeration(AgglomerativeClustering, AgglomerationTransform):
"manhattan", "cosine", or 'precomputed'.
If linkage is "ward", only "euclidean" is accepted.
-memory : Instance of sklearn.externals.joblib.Memory or string, optional \
+memory : Instance of joblib.Memory or string, optional \

@glemaitre (Contributor), Aug 22, 2017:
Is it fitting on a single line?

(default=None)
Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
@@ -50,6 +50,7 @@ def test_deprecation_of_n_components_in_linkage_tree():
assert_equal(n_leaves, n_leaves_t)
assert_equal(parent, parent_t)

@lesteve (Member), Aug 28, 2017:
Can you revert this change?

Going in the same direction as @jnothman's remark above, adding unrelated changes like this one in a PR adds noise to the diff, makes reviewing harder, and makes it less likely that you get good feedback on your PR.

That means you need to look carefully at your diff before and after pushing a commit into a PR. Maybe your editor is doing it automatically, in which case you need to find out how to disable this feature.

def test_linkage_misc():
# Misc tests on linkage
rng = np.random.RandomState(42)
@@ -140,6 +141,28 @@ def test_agglomerative_clustering_wrong_arg_memory():
assert_raises(ValueError, clustering.fit, X)
class Dummy(object):

@glemaitre (Contributor), Aug 21, 2017:
Don't call this Dummy; that name is not informative for someone who is not familiar with the code. DummyMemory is more informative.

@glemaitre (Contributor), Aug 21, 2017:
I think those two classes should be moved into the tests for check_memory and imported here.

    def __init__(self):
        pass

    def cache(self):
        pass
class Wrong_Dummy(object):

@glemaitre (Contributor), Aug 21, 2017:
Don't use the underscore; WrongDummyMemory is a better name.

    def __init__(self):
        pass
def test_agglomerative_clustering_with_cache_attribute():
    X = np.random.random((100, 1000))
    clustering = AgglomerativeClustering(memory=Dummy())
    clustering.fit(X)
    clustering = AgglomerativeClustering(memory=Wrong_Dummy())
    assert_raises(ValueError, clustering.fit, X)

@glemaitre (Contributor), Aug 21, 2017:
Use assert_raises_message or assert_raises_regex to check the message that is raised.
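For illustration, here is a simplified stand-in for such a message-checking assertion; this sketch only conveys the idea and is not the actual sklearn.utils.testing implementation:

```python
import re


def assert_raises_regex(exc_type, pattern, callable_, *args, **kwargs):
    """Assert that calling ``callable_`` raises ``exc_type`` whose message
    matches ``pattern`` (simplified sketch of the testing helper)."""
    try:
        callable_(*args, **kwargs)
    except exc_type as exc:
        # The raised message must match the expected pattern.
        assert re.search(pattern, str(exc)), (
            "message %r does not match %r" % (str(exc), pattern))
    else:
        raise AssertionError("%s not raised" % exc_type.__name__)
```

Checking the message, not just the exception type, guards against the test passing when a ValueError is raised for an unrelated reason.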

def test_agglomerative_clustering():
# Check that we obtain the correct number of clusters with
# agglomerative clustering.
@@ -15,7 +15,8 @@
from scipy import sparse
from .base import clone, TransformerMixin
-from .externals.joblib import Parallel, delayed, Memory
+from .externals.joblib import Parallel, delayed
+from joblib import Memory
from .externals import six
from .utils import tosequence
from .utils.metaestimators import if_delegate_has_method
@@ -52,7 +53,7 @@ class Pipeline(_BaseComposition):
chained, in the order in which they are chained, with the last object
an estimator.
-memory : Instance of sklearn.external.joblib.Memory or string, optional \
+memory : Instance of joblib.Memory or string, optional \

@glemaitre (Contributor), Aug 22, 2017:
Is it fitting on one line?

(default=None)
Used to cache the fitted transformers of the pipeline. By default,
no caching is performed. If a string is given, it is the path to
@@ -192,9 +193,9 @@ def _fit(self, X, y=None, **fit_params):
memory = Memory(cachedir=None, verbose=0)
elif isinstance(memory, six.string_types):
memory = Memory(cachedir=memory, verbose=0)
-elif not isinstance(memory, Memory):
+elif not hasattr(memory, 'cache'):
raise ValueError("'memory' should either be a string or"
-" a sklearn.externals.joblib.Memory"
+" a joblib.Memory"
" instance, got 'memory={!r}' instead.".format(
type(memory)))
@@ -538,7 +539,7 @@ def make_pipeline(*steps, **kwargs):
----------
*steps : list of estimators,
-memory : Instance of sklearn.externals.joblib.Memory or string, optional \
+memory : Instance of joblib.Memory or string, optional \

@glemaitre (Contributor), Aug 22, 2017:
One line?

(default=None)
Used to cache the fitted transformers of the pipeline. By default,
no caching is performed. If a string is given, it is the path to
@@ -31,7 +31,7 @@
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
-from sklearn.externals.joblib import Memory
+from joblib import Memory
JUNK_FOOD_DOCS = (
@@ -266,6 +266,29 @@ def test_pipeline_sample_weight_unsupported():
)
class Dummy(object):
    def __init__(self):
        pass

    def cache(self):
        pass


class Wrong_Dummy(object):
    def __init__(self):
        pass
def test_pipeline_with_cache_attribute():
    X = np.array([[1, 2]])
    pipe = Pipeline([('transf', Transf()), ('clf', Mult())], memory=Dummy())
    pipe.fit(X, y=None)
    pipe = Pipeline([('transf', Transf()), ('clf', Mult())],
                    memory=Wrong_Dummy())
    assert_raises(ValueError, pipe.fit, X)

@glemaitre (Contributor), Aug 21, 2017:
Use assert_raises_message or assert_raises_regex to check the message that is raised.

def test_pipeline_raise_set_params_error():
# Test pipeline raises set params error message for nested models.
pipe = Pipeline([('cls', LinearRegression())])