add predict and fit_predict to more clustering algorithms #901

Closed
mblondel opened this issue Jun 12, 2012 · 13 comments

@mblondel (Member)

We should add predict and fit_predict to clustering algorithms other than KMeans: they are useful to retrieve cluster labels independently of the underlying attribute names...

@amueller (Member)

This could be done via a mixin, since fit should set the labels_ attribute.
Also it could then be easily tested using my PR. But first we need #806 ;)
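
Roughly, a minimal sketch of what such a mixin could look like (assuming only that every clustering estimator's fit sets labels_ on self):

class ClusterMixin(object):
    # fit_predict works for any clustering estimator whose fit()
    # sets a labels_ attribute.
    def fit_predict(self, X, y=None):
        self.fit(X)
        return self.labels_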

@mblondel (Member, Author)

Are there clustering algorithms in scikit-learn which are transductive (i.e., ones for which we cannot retrieve the cluster membership of new data)? For those algorithms, we can define fit_predict but not predict.

@amueller (Member)

Yes, SpectralClustering should be one of them.
Does this relate to my comment or was that just a general question?

@mblondel (Member, Author)

A bit of both ;) (we may need to take this into account for our design decision)

@mblondel (Member, Author)

@robertlayton, is DBSCAN transductive?

@amueller (Member)

Well, it doesn't have a predict function ;) In principle, all clusterings can get a predict using nearest neighbor prediction.
For example, we could build a NN object on the first call to predict and issue a warning or something...
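
Something along these lines, as a sketch only; the _nn_ cache and the _X_fit attribute holding the training data are hypothetical names, not existing scikit-learn API:

import warnings
from sklearn.neighbors import KNeighborsClassifier

class NearestNeighborPredictMixin(object):
    # On the first call to predict, lazily build a 1-nearest-neighbor
    # classifier over the stored training data and warn about the fallback.
    def predict(self, X):
        if not hasattr(self, "_nn_"):
            warnings.warn("predict falls back to 1-nearest-neighbor "
                          "prediction on the training data")
            self._nn_ = KNeighborsClassifier(n_neighbors=1).fit(
                self._X_fit, self.labels_)
        return self._nn_.predict(X)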

@mblondel (Member, Author)

Indeed, we could use nearest neighbor prediction!

predict should never change the state of the object so we can just use kNN with brute-force search.
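
For instance, a stateless sketch (predict_nearest is a hypothetical helper; X_fit is the data the estimator was fitted on):

from sklearn.metrics import pairwise_distances_argmin

def predict_nearest(estimator, X_fit, X_new):
    # Brute-force search on every call: each new sample gets the
    # cluster label of its closest training sample, and the
    # estimator's state is never modified.
    nearest = pairwise_distances_argmin(X_new, X_fit)
    return estimator.labels_[nearest]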

@amueller (Member)

Well, this wouldn't really be changing the state, rather doing some caching. A similar strategy was talked about for SVC in the linear case, where we concluded it would be better to do the computation in any case. Maybe take it to the mailing list?

@mblondel (Member, Author)

Ok, I understand what you mean now. Another option would be to add a constructor parameter like fit_inverse_transform in KernelPCA.

One problem with the nearest neighbor approach is that we need to keep a reference to the entire training set... So maybe we should just implement fit_predict in those cases.

@robertlayton (Member)

I'd be reluctant to use kNN by default for the predict function, as it doesn't hold in many cases. A standard example is k-means, where a sample's cluster is the nearest centroid, not the label of the nearest training point.

For DBSCAN it is also different, as DBSCAN can also determine whether a point is noise. The procedure for DBSCAN is to check whether a given sample is within eps distance of one of the core_samples. If it is, it takes the label of that core sample; if it is not, it is noise. The procedure is a little ambiguous when more than one core sample is within eps of the point, but I believe we would just go with whichever is closest.
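
In code, that procedure would look roughly like this (a sketch; dbscan_predict is hypothetical, not part of scikit-learn):

import numpy as np
from sklearn.metrics import pairwise_distances

def dbscan_predict(db, X_new):
    # db is a fitted DBSCAN estimator; db.components_ holds the core
    # samples and db.core_sample_indices_ their positions in labels_.
    dist = pairwise_distances(X_new, db.components_)
    nearest = dist.argmin(axis=1)
    labels = db.labels_[db.core_sample_indices_][nearest]
    # Samples with no core sample within eps are labelled noise (-1).
    labels[dist[np.arange(len(X_new)), nearest] > db.eps] = -1
    return labels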

@mblondel (Member, Author)

On second thought, using kNN is quite arbitrary... In fact, any multi-class classifier would do.

from sklearn.cluster import SpectralClustering
from sklearn.svm import LinearSVC

# X_tr, X_te: training and test data
sc = SpectralClustering(n_clusters=5)
y_tr = sc.fit_predict(X_tr)
clf = LinearSVC().fit(X_tr, y_tr)
y_te = clf.predict(X_te)

Conclusion: in the transductive case we should implement fit_predict only.

@amueller (Member)

Ok. Was just an idea ;)

@amueller (Member) commented Sep 2, 2012

Closing this, as the new mixin from #907 does exactly that, and the tests ensure that it works consistently. Also, all clustering estimators now have a consistent labels_ attribute.

@amueller closed this as completed on Sep 2, 2012