
[MRG] Changed problematic API of KNeighbors classes, #2609 #3093

Closed · wants to merge 12 commits · 6 participants
Contributor

nmayorov commented Apr 21, 2014

The modification was made according to the advice in issue #2609.

This also addresses a bug where set_params did not actually change the 'p' parameter of the minkowski metric.
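A minimal sketch of the bug described above (illustrative classes, not the actual sklearn code): a parameter copied into a derived dict at `__init__` time is never refreshed when `set_params` later changes it, so the fix is to derive the effective dict at fit time.

```python
# Hypothetical stand-ins for the broken and fixed behaviour.
class BrokenEstimator:
    def __init__(self, p=2):
        self.p = p
        # BUG: 'p' is frozen into the kwds dict at construction time.
        self.metric_kwds = {'p': p}

    def set_params(self, **params):
        # Generic attribute-setting, as in sklearn's BaseEstimator.
        for key, value in params.items():
            setattr(self, key, value)
        return self


class FixedEstimator(BrokenEstimator):
    def fit(self):
        # FIX: derive the effective kwds from self.p at fit time instead.
        self.effective_metric_kwds_ = {'p': self.p}
        return self


broken = BrokenEstimator(p=2).set_params(p=3)
print(broken.metric_kwds['p'])            # still 2 -- set_params had no effect

fixed = FixedEstimator(p=2).set_params(p=3).fit()
print(fixed.effective_metric_kwds_['p'])  # 3
```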

Nikolay Mayorov added some commits Apr 19, 2014

Nikolay Mayorov Fixed issue #2609 + bug with set_params on minkowski 'p'.
There was a bug where set_params didn't effectively change the value
of minkowski 'p': 'p' was put into self.metric_kwds in _init_params
and then never updated there.
bd54e33
Nikolay Mayorov Modified docstrings. 941f738
Nikolay Mayorov A small grammar change of the deprecation message. 2612fa4
Contributor

nmayorov commented Apr 21, 2014

Sorry, I see there is a problem with the default value of metric_kwds. I'll try to fix it soon.

Coverage Status

Coverage remained the same when pulling 9e9d326 on nmayorov:neighbors_fixes into f2fa887 on scikit-learn:master.

Nikolay Mayorov added some commits Apr 21, 2014

Coverage Status

Coverage remained the same when pulling 8b52bef on nmayorov:neighbors_fixes into f2fa887 on scikit-learn:master.

Contributor

nmayorov commented Apr 27, 2014

Hey guys! Could someone actually review this, please?

I understand that this is a more or less serious issue, and my experience is admittedly not very rich. But trying won't hurt: it doesn't break anything, and the old interface is still here. Seems fine to me.

Owner

arjoly commented Apr 29, 2014

Thanks for your contribution!

Could you change the tests in sklearn/neighbors so as to conform to the new API?

There are some deprecation warnings when I test the module:

(sklearn) ± nosetests sklearn/neighbors 
.......................... (hundreds of passing-test dots elided)
/Users/ajoly/git/scikit-learn/sklearn/neighbors/tests/test_neighbors.py:837: DeprecationWarning: Passing additional arguments to the metric function as **kwargs is deprecated. Use metric_kwds dict instead.
  metric=metric, **kwds)
/usr/local/Cellar/python/2.7.4/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py:476: DeprecationWarning: Passing additional arguments to the metric function as **kwargs is deprecated. Use metric_kwds dict instead.
  callableObj(*args, **kwargs)
..
----------------------------------------------------------------------
Ran 697 tests in 1.768s
Owner

arjoly commented Apr 29, 2014

Can you also add the suffix [MRG] to the title of your PR so as to attract reviewers?

@arjoly arjoly and 1 other commented on an outdated diff Apr 29, 2014

sklearn/neighbors/base.py
@@ -125,11 +134,8 @@ def _init_params(self, n_neighbors=None, radius=None,
raise ValueError("Metric '%s' not valid for algorithm '%s'"
% (metric, algorithm))
- if self.metric in ['wminkowski', 'minkowski']:
- self.metric_kwds['p'] = p
- if p < 1:
- raise ValueError("p must be greater than one "
- "for minkowski metric")
+ if self.metric in ['wminkowski', 'minkowski'] and self.p < 1:
+ raise ValueError("p must be greater than one for minkowski metric")
@arjoly

arjoly Apr 29, 2014

Owner

What should happen when both p and metric_kwds['p'] are set?

@nmayorov

nmayorov Apr 29, 2014

Contributor

Good point.

I think in this case metric_kwds['p'] should be prioritized over p, because if a user added p to metric_kwds he definitely knew what he was doing, and p from __init__ becomes irrelevant.

Another option would be to remove p from __init__ altogether. It makes sense, because after all p is also a metric parameter and belongs in metric_kwds. But I'm not sure if it's a good idea.

So what are your thoughts?
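The precedence rule proposed here could be sketched as follows (illustrative helper only, not the final sklearn API):

```python
# Hypothetical resolver: an explicit metric_kwds['p'] wins over the
# __init__ argument p; otherwise p is used as the fallback.
def resolve_p(p, metric_kwds):
    """Return the effective Minkowski exponent."""
    metric_kwds = metric_kwds or {}
    return metric_kwds.get('p', p)


print(resolve_p(2, None))      # 2 -- falls back to the __init__ value
print(resolve_p(2, {'p': 3}))  # 3 -- explicit metric_kwds wins
```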

@arjoly

arjoly Apr 29, 2014

Owner

Can you point me to an example where some custom metrics are used?

@arjoly

arjoly Apr 29, 2014

Owner

The point of having p is to have an easy way to define a grid over the parameter p.

With your modification, is it possible to define a grid of parameters as

{"metric": "mindowski", metric_kwds__p: [1, 2, 3] }

or would we have to do

[ {"metric": "mindowski",  "metric_kwds": {"p": p} } for p in range(1, 4)]

?

@arjoly

arjoly Apr 29, 2014

Owner

One point to consider is that removing p will break backward compatibility.

@nmayorov

nmayorov Apr 29, 2014

Contributor

At the moment I can't point you to a famous paper or anything like that on this topic. But, for example, it can be useful for searching over feature weights with the wminkowski metric, in order to find the relative importance of features in distance computation. As far as I know this is called "metric learning" in general.

Another situation is when a user has implemented his own distance function which accepts some parameter. Without an explicit metric_kwds he won't be able to cross-validate his estimator, which is very disappointing.

Currently you can setup a grid as

{"metric": ["minkowski"], "p": [1, 2, 3]}

or

{"metric": ["minkowski"], "metric_kwds": [{"p": p} for p in range(1, 4)]}

Could you explain how it could possibly accept "metric_kwds__p"? As far as I understand such names are used for set_params on estimators that contain other estimators (like Pipeline).

OK, let's not remove p. But if metric_kwds['p'] is set, then p from __init__ is ignored. Agreed?
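For reference, the two grid styles quoted above expand to the same candidate settings. A pure-Python sketch mimicking ParameterGrid-style expansion (the `expand` helper is illustrative, not a sklearn function):

```python
from itertools import product


def expand(grid):
    """Expand a dict of value lists into a list of candidate dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


flat = expand({"metric": ["minkowski"], "p": [1, 2, 3]})
nested = expand({"metric": ["minkowski"],
                 "metric_kwds": [{"p": p} for p in range(1, 4)]})

print(flat[0])    # {'metric': 'minkowski', 'p': 1}
print(nested[0])  # {'metric': 'minkowski', 'metric_kwds': {'p': 1}}
```

Both grids yield three candidates; the nested form just moves p inside the metric_kwds dict.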

@arjoly

arjoly Apr 30, 2014

Owner

At the moment I can't point you to a famous paper or anything like that on this topic. But, for example, it can be useful for searching over feature weights with the wminkowski metric, in order to find the relative importance of features in distance computation. As far as I know this is called "metric learning" in general.

Another situation is when a user has implemented his own distance function which accepts some parameter. Without an explicit metric_kwds he won't be able to cross-validate his estimator, which is very disappointing.

I wasn't clear. Is there an example in the documentation that illustrates the use of custom metrics? While reviewing, I quickly searched for an example and haven't found one.

Could you explain how it can possibly accept "metric_kwds__p"? As far as I understand such names are used for set_params on estimators that contains other estimators (like Pipeline).

Indeed. Without a get_params and set_params, you can't.

OK, let's not remove p. But if metric_kwds['p'] is set, then p from init is ignored. Agreed?

This would require proper documentation.

On the other hand, this might be the time to clean everything up.
("Special cases aren't special enough to break the rules.")
Sorry for the confusion. Ping @jakevdp, @ogrisel.

@arjoly arjoly commented on an outdated diff Apr 29, 2014

sklearn/neighbors/base.py
@@ -137,11 +143,14 @@ def _init_params(self, n_neighbors=None, radius=None,
def _fit(self, X):
self.effective_metric_ = self.metric
- self.effective_metric_kwds_ = self.metric_kwds
+ self.effective_metric_kwds_ = \
+ self.metric_kwds.copy() if self.metric_kwds is not None else {}
@arjoly

arjoly Apr 29, 2014

Owner

Can you avoid using \?
It might be better to use a traditional if... else...
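The continuation-free version being asked for might look like this (written as a standalone helper so it runs by itself; names mirror the diff above):

```python
def effective_metric_kwds(metric_kwds):
    """Copy the user-supplied kwds dict, falling back to an empty dict."""
    if metric_kwds is not None:
        return metric_kwds.copy()
    return {}


print(effective_metric_kwds(None))      # {}
print(effective_metric_kwds({'p': 3}))  # {'p': 3}
```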

@arjoly arjoly commented on an outdated diff Apr 29, 2014

sklearn/neighbors/base.py
@@ -137,11 +143,14 @@ def _init_params(self, n_neighbors=None, radius=None,
def _fit(self, X):
self.effective_metric_ = self.metric
- self.effective_metric_kwds_ = self.metric_kwds
+ self.effective_metric_kwds_ = \
+ self.metric_kwds.copy() if self.metric_kwds is not None else {}
+
+ if self.effective_metric_ in ['wminkowski', 'minkowski']:
@arjoly

arjoly Apr 29, 2014

Owner

This looks like a change from previous implementation.

Owner

arjoly commented Apr 29, 2014

Could you fix those pep8 violations?

(sklearn) ± flake8 sklearn/neighbors
sklearn/neighbors/base.py:102:80: E501 line too long (80 > 79 characters)
sklearn/neighbors/base.py:103:80: E501 line too long (85 > 79 characters)
sklearn/neighbors/classification.py:326:80: E501 line too long (80 > 79 characters)
...
sklearn/neighbors/regression.py:260:80: E501 line too long (80 > 79 characters)
sklearn/neighbors/unsupervised.py:82:69: W291 trailing whitespace
sklearn/neighbors/unsupervised.py:87:62: W291 trailing whitespace
...

@nmayorov nmayorov changed the title from Changed problematic API of KNeighbors classes, #2609 to [MRG] Changed problematic API of KNeighbors classes, #2609 Apr 29, 2014

Contributor

nmayorov commented Apr 29, 2014

@arjoly thank you very much for your feedback.

I will try to fix everything.

@arjoly arjoly and 1 other commented on an outdated diff Apr 29, 2014

sklearn/neighbors/base.py
@@ -97,13 +97,23 @@ def __init__(self):
def _init_params(self, n_neighbors=None, radius=None,
algorithm='auto', leaf_size=30, metric='minkowski',
- p=2, **kwargs):
+ p=2, metric_kwds=None, **kwargs):
@arjoly

arjoly Apr 29, 2014

Owner

Why metric_kwds instead of metrics_kwargs?

@nmayorov

nmayorov Apr 29, 2014

Contributor

There is an internal variable self.metric_kwds (in the current master), so I decided to name the new argument like this.

We can choose from metric_kwds, metric_kwargs, or metric_params. I prefer the last one. Should we go with metric_params? Then I think it's a good idea to rename the internal variables as well.

@arjoly

arjoly Apr 29, 2014

Owner

Good point, however I don't think it's worth breaking backward compatibility (with serialization) if we can avoid it.

Sorry for the noise.

Owner

arjoly commented Apr 29, 2014

Could you add a non-regression test?

Contributor

nmayorov commented Apr 29, 2014

Please give me a hint: what should such a test do?

Owner

arjoly commented Apr 30, 2014

Could you add a non-regression test?

You could add or modify a test ensuring that the proper warning is raised when the user uses the previous behaviour, but not when the new metric_kwds argument is used.

The same could be done with the handling of parameter p.
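A hedged sketch of such a non-regression test, using a stand-in function rather than the real estimator (`init_params` here only mimics the deprecation path under discussion): the warning must fire for the old **kwargs style and stay silent for the new metric_kwds dict.

```python
import warnings


def init_params(metric='minkowski', metric_kwds=None, **kwargs):
    # Stand-in for the real _init_params: old-style **kwargs trigger
    # the deprecation path, the new metric_kwds dict does not.
    if kwargs:
        warnings.warn("Passing additional arguments to the metric function"
                      " as **kwargs is deprecated. Use metric_kwds dict"
                      " instead.", DeprecationWarning)
        metric_kwds = kwargs
    return metric_kwds or {}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    init_params(metric='minkowski', p=3)  # old style -> must warn
    old_warns = any(issubclass(w.category, DeprecationWarning) for w in caught)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    init_params(metric='minkowski', metric_kwds={'p': 3})  # new style
    new_warns = any(issubclass(w.category, DeprecationWarning) for w in caught)

print(old_warns, new_warns)  # True False
```

The same recording pattern would apply to a test covering the handling of p.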

Contributor

nmayorov commented Apr 30, 2014

Could you clearly state whether I can remove p from the __init__ arguments or not? It will determine the future work. At the moment I'm quite confused about what I should do.

Owner

arjoly commented May 1, 2014

Sorry to have confused you. Right now, it's also not clear to me what would be the best solution to get rid of the **kwargs. So far, we have discussed two solutions:

  1. Put everything in a metric_kwds argument. Unfortunately, this breaks backward compatibility. Furthermore, it makes grid search harder, and we won't be able to use RandomizedSearchCV.
  2. Put everything in a metric_kwds argument, except the parameter p. This also breaks backward compatibility for the "obscured" metrics. If metric_kwds["p"] != p, we have some problems; this seems unintuitive to users and introduces a special case, and we know that "Special cases aren't special enough to break the rules." It is also a mixed solution for specifying a hyper-parameter search.

A third solution would be to explicitly define all arguments of the supported metrics in __init__. This would preserve backward compatibility (in terms of interface). It would also allow easily defining a hyper-parameter search. However, I don't know how many parameters would have to be added to each neighbors class.

Could you clarify the following points:

  • What are the supported metrics?
  • Which ones require some special parameters?

One part of the confusion could be alleviated by improving the documentation.

  • Where is the documentation about all pairwise distance definitions/support/parameters?
  • Is there an example with a pairwise distance that requires the **kwargs?
Member

jakevdp commented May 1, 2014

All the available metrics are listed in the docstring of sklearn.neighbors.dist_metrics.DistanceMetric, along with the additional arguments they take.

For the built-in metrics, the keywords needed are p, V, VI, w, and func. If we go the route of defining these explicitly, we'd probably also want to allow passing metric_kwds for use in user-defined functions.

Member

jakevdp commented May 1, 2014

Removing p from the keyword list will break backward compatibility, so this is not an option without a deprecation cycle (though I think that might be worth doing).

Contributor

nmayorov commented May 1, 2014

Thank you for sharing your view.

Could you clarify the following points:

What are the supported metrics?
Which ones require some special parameters?

In DistanceMetric the following metrics accept additional arguments:

  1. minkowski : p
  2. wminkowski: p, w
  3. seuclidean: V — essentially the same as wminkowski with p=2 and w = 1 / sqrt(V)
  4. mahalanobis: accepts V or its inverse VI

We must also keep in mind that a user may supply his own metric function which accepts additional arguments. I believe that putting all possible arguments into __init__ is a rather hopeless approach.
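For concreteness, here is a pure-Python sketch of what the parameters listed above mean, following the formulas in the DistanceMetric docstring (no sklearn needed; the helper functions are illustrative):

```python
# minkowski:  d = (sum |x - y|^p)^(1/p)
# wminkowski: d = (sum |w * (x - y)|^p)^(1/p)
# seuclidean: d = sqrt(sum (x - y)^2 / V)

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def wminkowski(x, y, p, w):
    return sum(abs(wi * (a - b)) ** p
               for wi, a, b in zip(w, x, y)) ** (1.0 / p)


def seuclidean(x, y, V):
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, V)) ** 0.5


x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, p=2))             # 5.0 (plain Euclidean)
print(wminkowski(x, y, p=2, w=[1, 1]))  # 5.0 (unit weights change nothing)
print(seuclidean(x, y, V=[1.0, 1.0]))   # 5.0 (unit variances)
```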

Where is the documentation about all pairwise distance definition/support/parameters?
Is there an example with pairwise distance that require the **kwargs?

Everything related to distance metrics is documented in DistanceMetric. I believe there is no example of using a metric with **kwargs (besides p) in the current documentation.

Contributor

nmayorov commented May 1, 2014

With your modification, is it possible to define a grid of parameters as

{"metric": "mindowski", metric_kwds__p: [1, 2, 3] }

What if we go with this approach? Modify set_params so that it works with dicts. Then we can conveniently grid search over metric parameters. The dubious point is that metric_kwds might be None (the default); in that case None needs to be converted to an empty dict.

As @jakevdp confirmed, p must be preserved, at least for now.

Do you mind if I work in this direction (set_params + handling p and metric_kwds['p'])? Then you can review.

Also, what name should be preferred for the metric parameters dict? Is metric_kwds already accepted?
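The set_params extension discussed here could be sketched like this (hypothetical, not what was eventually merged): double-underscore keys such as "metric_kwds__p" are routed into the dict parameter, the way nested estimator parameters are routed in Pipeline.

```python
class Sketch:
    def __init__(self, p=2, metric_kwds=None):
        self.p = p
        self.metric_kwds = metric_kwds

    def set_params(self, **params):
        for key, value in params.items():
            if '__' in key:
                # Route "metric_kwds__p" into self.metric_kwds['p'],
                # converting a default of None into an empty dict first.
                name, sub = key.split('__', 1)
                target = getattr(self, name) or {}
                target[sub] = value
                setattr(self, name, target)
            else:
                setattr(self, key, value)
        return self


est = Sketch().set_params(metric_kwds__p=3)
print(est.metric_kwds)  # {'p': 3}
```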

Owner

arjoly commented May 2, 2014

Thanks @nmayorov and @jakevdp!

Side question. For 'wminkowski', what is the difference between w and a sample_weight parameter in the fit?

If we go the route of defining these explicitly, we'd probably also want to allow passing metric_kwds for use in user-defined functions.

Without metric_kwds, the user could still use functools.partial to specify the arguments of the metric.

What if we go with this approach? Modify set_params such that it will work with dicts. Then we can conveniently grid search over metric parameters.

Modifying set_params to handle that case is tempting, however this would require further consensus.

Consensus seems to be for adding a metric_params dictionary, so let's go in that direction.
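The functools.partial alternative mentioned above, sketched with a made-up custom metric (the metric itself is purely for illustration):

```python
from functools import partial


def scaled_manhattan(x, y, scale):
    """Toy user-defined metric with one extra parameter."""
    return scale * sum(abs(a - b) for a, b in zip(x, y))


# Bind the extra argument up front; the result is a plain two-argument
# callable that could be passed as metric=... without any **kwargs.
metric = partial(scaled_manhattan, scale=0.5)
print(metric([0, 0], [2, 4]))  # 3.0
```

The drawback, as discussed, is that a partial-bound parameter is invisible to grid search, which is what metric_params solves.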

Member

jakevdp commented May 2, 2014

Which sample_weight are you referring to? A grep didn't find that anywhere in the neighbors submodule.

Owner

arjoly commented May 3, 2014

Which sample_weight are you referring to? A grep didn't find that anywhere in the neighbors submodule.

I wondered if the w parameter would be equivalent to adding sample_weight support. However this won't be the case: the prediction won't take w into account, but it should take sample_weight into account.

Contributor

nmayorov commented May 3, 2014

I did some easy things and also modified tests.

How about merging this to master in the current state? The motivation is the following:

  • the previous interface is fully workable, and we can finally grid search over p
  • we can supply additional metric parameters via metric_params and do grid search or cross-validation

Sure, the interface can be further improved. But since there is no consensus on how to proceed and no wide interest in this issue, I suggest merging this improvement and leaving it at that for now.

Owner

larsmans commented Aug 25, 2014

I confirm that with this PR, the example from #3589

>>> import numpy as np
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X = np.random.randn(10, 3)
>>> y = np.random.randint(0, 2, 10)
>>> grid = {"n_neighbors": [2,3], "metric": ['euclidean']}
>>> GridSearchCV(KNeighborsClassifier(), grid).fit(X, y)

works correctly. I just rebased and cleaned up the PR locally. @jakevdp @arjoly do you mind if I merge it? It's been open for months now.

Owner

GaelVaroquaux commented Aug 25, 2014

works correctly. I just rebased and cleaned up the PR locally. @jakevdp
@arjoly do you mind if I merge it? It's been open for months now.

Go ahead. Thanks a lot!

Owner

larsmans commented Aug 25, 2014

Merged as ad751a3. Thank you @nmayorov, and sorry to keep you waiting for so long!

@larsmans larsmans closed this Aug 25, 2014

Member

jakevdp commented Aug 25, 2014

Thanks @larsmans

@nmayorov nmayorov deleted the nmayorov:neighbors_fixes branch Sep 30, 2014
