
MRG Fit pairwise #803

Merged
merged 32 commits into scikit-learn:master on Aug 27, 2012

Conversation

amueller
Member

This is supposed to be (a draft of?) the fit_pairwise proposal.
It creates fit_pairwise and predict_pairwise functions for SVM, KernelPCA, SpectralClustering and AffinityPropagation.
It also makes all of these usable with GridSearchCV via a new fit_pairwise for the grid search.

Now uses a property to check whether the estimator was fit using a "pairwise" X.

@mblondel
Member

In retrospect, I am a bit worried by the fact that this will add 4 new methods (fit_pairwise, predict_pairwise, transform_pairwise and fit_transform_pairwise). Another way to solve the grid search problem would be to agree upon a flag that must be set by the estimator and can be checked by the grid search. So, for example, if kernel="precomputed", we internally set self.pairwise = True, a flag that would be checked by grid search.

In algorithms that use a similarity / affinity matrix in fit, we should add a new metric option. The input X in fit and other methods would then be considered a similarity / distance matrix if and only if metric is set to "precomputed". Otherwise, X would be conveniently transformed for the user by calling pairwise_distances with the given metric.
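A minimal sketch of the grid-search side of that idea, assuming the estimator exposes a boolean pairwise attribute as described above (cv_slices is a made-up helper name, not code from this PR):

import numpy as np


def cv_slices(estimator, X, train_idx, test_idx):
    # Illustrative helper: how a grid search could use the pairwise flag.
    if getattr(estimator, "pairwise", False):
        # X is a kernel / affinity matrix: slice rows *and* columns.
        return X[np.ix_(train_idx, train_idx)], X[np.ix_(test_idx, train_idx)]
    # X is a plain feature matrix: slice rows only.
    return X[train_idx], X[test_idx]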

@amueller
Member Author

Yeah, four new methods are a lot. I didn't really see that until I tried to implement everything and adjust the tests.
Just setting a self.pairwise variable might be a good thing.

I would not call the option you suggest "metric", as I think it does not need to be one. This can probably be decided on a per-algorithm basis and made as consistent as possible ;)
As long as internally self.pairwise is set, this should be fine.

@mblondel
Member

I would not call the option you suggest "metric", as I think it does not need to be one. This can probably be decided on a per-algorithm basis and made as consistent as possible ;)
As long as internally self.pairwise is set, this should be fine.

Agreed. But if internally pairwise_distances is used, we can call it metric (for pairwise_kernels, I would have preferred kernel...)

@agramfort
Member

self.pairwise is a simple and tempting idea. I would however use a function like is_pairwise() as you probably don't want to have svc.kernel and svc.pairwise defined in the init (required by clone). A property would work too but a function is more explicit.
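For illustration, the two shapes that could take (a sketch with a made-up estimator, not code from this PR):

from sklearn.base import BaseEstimator


class MyKernelEstimator(BaseEstimator):

    def __init__(self, kernel="rbf"):
        self.kernel = kernel

    # Option 1: an explicit zero-argument method.
    def is_pairwise(self):
        return self.kernel == "precomputed"

    # Option 2: a read-only property; nothing extra is set in __init__,
    # so clone, which only cares about constructor parameters, is unaffected.
    @property
    def pairwise(self):
        return self.kernel == "precomputed"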

@mblondel
Member

If I'm not mistaken, this requirement only holds if the attribute is a constructor parameter, which wouldn't be the case here.

@agramfort
Member

I certainly agree, but to rephrase: for svc.pairwise to remain properly defined after a clone, it should either be an init param or a property. Otherwise we could have svc.is_pairwise().

@mblondel
Member

Ah ok, I got your point now. A 3rd way is to modify clone so as to check for the presence of pairwise and reset it in the cloned object.

@agramfort
Member

Ah ok, I got your point now. A 3rd way is to modify clone so as to check for the presence of pairwise and reset it in the cloned object.

Why not, if it's an exception for pairwise. I am sure @GaelVaroquaux will have a strong opinion on this :)
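A rough illustration of the clone tweak suggested above, written here as a wrapper around clone rather than an actual change to sklearn.base.clone (the helper name is made up, and "reset" is read as re-establishing the flag on the new object):

from sklearn.base import clone


def clone_keeping_pairwise(estimator):
    # clone only propagates constructor parameters, so carry the
    # pairwise flag over to the new object explicitly if it is present.
    new_est = clone(estimator)
    if hasattr(estimator, "pairwise"):
        new_est.pairwise = estimator.pairwise
    return new_est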

@mblondel
Member

I wouldn't be shocked to have it in clone, since pairwise could be thought of as a reserved attribute that is part of the API specification. Let's wait for @GaelVaroquaux 's verdict :)

@amueller
Member Author

amueller commented Jun 3, 2012

ping @mblondel @agramfort @GaelVaroquaux
I would love some opinions on this PR ;)

@amueller
Member Author

amueller commented Jun 3, 2012

Maybe the affinity propagation still needs some love... though I'm not very familiar with that.

@amueller
Member Author

amueller commented Aug 7, 2012

One more comment on property vs function:
If _pairwise is a function instead of a property, the test becomes

getattr(est, '_pairwise', lambda: False)()

which looks a bit ugly to me. Therefore I'm +0 on property vs. function.
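For comparison, the two checks side by side, wrapped in throwaway helper functions for illustration:

def is_pairwise(est):
    # If _pairwise is an attribute or property, the check is simply:
    return getattr(est, "_pairwise", False)


def is_pairwise_method_style(est):
    # If _pairwise were a zero-argument method instead, as discussed above:
    return getattr(est, "_pairwise", lambda: False)()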

@amueller
Member Author

Any opinions on this at all?

@ogrisel
Member

ogrisel commented Aug 21, 2012

I would +1 reusing pairwise_kernels whenever it makes sense, possibly extending it with new affinities (not necessarily PSD, since the sigmoid kernel is not PSD either) such as "nearest_neighbors" as implemented in the current SpectralClustering fit method.

I don't see why you say that spectral clustering and affinity propagation use different styles of affinities: you can select the affinity according to what makes sense given the nature of the data and the assumptions you make about the cluster structure.
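A sketch of what reusing pairwise_kernels could look like on the clustering side, assuming the affinity="precomputed" option discussed in this PR (the data and gamma value are arbitrary):

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import pairwise_kernels

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# Build the affinity matrix explicitly with pairwise_kernels ...
K = pairwise_kernels(X, metric="rbf", gamma=1.0)

# ... and hand it to the clustering estimator as a precomputed affinity.
labels = SpectralClustering(n_clusters=3, affinity="precomputed").fit(K).labels_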

@ogrisel
Member

ogrisel commented Aug 21, 2012

As for the _pairwise attribute / property vs. method debate, I would be +0 on the attribute / property option, as it seems more natural than a method without arguments.

@amueller
Member Author

@ogrisel thanks for your comment. Maybe I'm just not familiar enough with affinity propagation. Do you have an example where it is used with something other than the Euclidean distance?
I think I'll add string options to affinity propagation and spectral clustering, extend pairwise_kernels, rename the attribute in affinity propagation, and then hopefully we are good to go.

@ogrisel
Member

ogrisel commented Aug 21, 2012

@ogrisel thanks for your comment. Maybe I'm just not familiar enough with affinity propagation. Do you have an example where it is used with something other than the Euclidean distance?

I think that cosine similarity (e.g. the dot product of normalized TF-IDF vectors) would make sense for text data, for instance.
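A hedged sketch of that idea (assuming the affinity="precomputed" option for AffinityPropagation discussed in this PR; the toy documents are made up):

from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "kernel methods for text",
    "text classification with kernels",
    "clustering of documents",
    "document clustering by similarity",
]

# Cosine similarity of TF-IDF vectors as the affinity matrix.
S = cosine_similarity(TfidfVectorizer().fit_transform(docs))

labels = AffinityPropagation(affinity="precomputed").fit(S).labels_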

…ightly change the meaning of the scalar parameter. Scaling the median seems more intuitive than giving absolute values.

Parameters
----------

S: array [n_points, n_points]
X: array [n_points, n_points]
Member

The shape should be [n_samples, n_samples] if affinity == "precomputed" and [n_samples, n_features] otherwise.

…f a ``precomputed`` parameter, to support other affinities in the future.

np.exp(-gamma * d(X,X) ** 2)

or a k-nearest neighbors matrix.
Member

connectivity matrix
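Illustrative versions of those two affinity constructions (the data, gamma, and n_neighbors are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.rand(30, 4)
gamma = 1.0

# RBF affinity: np.exp(-gamma * d(X, X) ** 2)
rbf_affinity = np.exp(-gamma * euclidean_distances(X) ** 2)

# k-nearest-neighbors connectivity matrix (sparse, 0/1 entries).
knn_connectivity = kneighbors_graph(X, n_neighbors=5, mode="connectivity")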

@amueller
Member Author

@mblondel Thanks for your comments. You gave a -1 on something somewhere but I can't see it anymore. Was that the thing I just fixed?

@mblondel
Member

I thought you wanted to rename affinity="precomputed" to affinity="affinity"... Your change is fine :)

@mblondel
Member

It's nice that SpectralClustering and AffinityPropagation will follow the scikit-learn API, at last :)

As I said on the ML, we could postpone the decision about the code factorization of affinity propagation until later. This way we can address the more cosmetic changes (like using a property or not) and merge soon. This PR has been pending for too long :)

@amueller
Member Author

I think this looks ok now. It does not do a lot of code sharing, because of the inclusion loop problem (see ML).
But it does change the API in a way that should make the refactoring very easy.

What this PR accomplishes now is:

  • A consistent API for clustering, the same as in the rest of scikit-learn
  • Grid search for all estimators with precomputed similarities / kernels (see the sketch at the end of this comment)

This is important because it makes these next steps possible:

  • Work on the precomputed kernel interface for SVMs (sparse case)
  • Unify testing for clustering
  • Enable grid search for clustering (needs scoring function)
  • Refactor affinity computations

@mblondel I fully agree :)
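A minimal sketch of the precomputed-kernel grid search this enables (imports assume a current scikit-learn layout; at the time of this PR, GridSearchCV lived in sklearn.grid_search):

from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import pairwise_kernels
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Precompute the kernel once; cross-validation then has to slice this
# square matrix on both axes, which is exactly what the pairwise flag signals.
K = pairwise_kernels(X, metric="rbf", gamma=0.1)

grid = GridSearchCV(SVC(kernel="precomputed"), {"C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(K, y)
print(grid.best_params_)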

@ogrisel
Member

ogrisel commented Aug 27, 2012

+1 for postponing the affinity == neighbors refactoring and getting this merged first.

@ogrisel
Member

ogrisel commented Aug 27, 2012

I think I am 👍 for merging if all the tests pass. Can you quickly check the test coverage report to make sure the new code does not introduce new untested code blocks?

Also, can you check that the affinity propagation / spectral clustering examples still run and that their plots still look good with your refactoring?

BTW thanks for working on this.

@amueller
Member Author

Always a pleasure ;)
I'll add some more tests and then merge.

@amueller merged commit dcbe01f into scikit-learn:master on Aug 27, 2012
@amueller
Member Author

hurray :)

@GaelVaroquaux
Member

Good job!!!
