add predict and fit_predict to more clustering algorithms #901
Comments
This could be done via a mixin since |
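To make the mixin idea concrete, here is a minimal sketch. It assumes each clustering estimator stores the training labels in a `labels_` attribute after `fit`; the class names `ClusterMixinSketch` and `ThresholdClusterer` are hypothetical and are not the code that was merged into the project.

```python
class ClusterMixinSketch:
    """Hypothetical mixin: supplies fit_predict to any estimator whose
    fit() stores the training labels in a `labels_` attribute (assumption)."""

    def fit_predict(self, X):
        # Fit on X, then return the labels found for those same samples.
        self.fit(X)
        return self.labels_


class ThresholdClusterer(ClusterMixinSketch):
    """Toy 1-D clusterer for illustration only: values below the mean
    go to cluster 0, all other values go to cluster 1."""

    def fit(self, X):
        mean = sum(X) / len(X)
        self.labels_ = [0 if x < mean else 1 for x in X]
        return self


labels = ThresholdClusterer().fit_predict([1.0, 2.0, 9.0, 10.0])
```

With this design an estimator only has to implement `fit`; `fit_predict` comes for free, and a transductive algorithm never needs to pretend it can label unseen data.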
Are there clustering algorithms in scikit-learn which are transductive? (i.e. we cannot retrieve the cluster membership of new data) For those algorithms, we can define |
Yes, |
A bit of both ;) (we may need to take this into account for our design decision) |
@robertlayton, is DBSCAN transductive? |
Well, it doesn't have a |
Indeed, we could use nearest prediction!
|
Well, this wouldn't really be changing the state, rather doing some caching. A similar strategy was talked about for SVC in the linear case, where we concluded it would be better to do the computation in any case. Maybe take it to the mailing list? |
Ok, I understand what you mean now. Another option would be to add a constructor parameter like One problem with the nearest neighbor approach is that we need to keep a reference to the entire training set... So maybe we should just implement |
I'd be reluctant to use KNN by default for the predict function, as it doesn't hold in many cases. A standard example is k-means, where the class is the nearest centroid, not KNN. For DBSCAN it is also different, as DBSCAN can also determine if a point is noise. The procedure for DBSCAN is to check whether a given sample is within |
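The difference between the two prediction rules can be sketched in plain Python. The k-means rule (nearest centroid) is as stated in the comment above. The DBSCAN rule is an assumption made for illustration, because the comment is cut off: a new point takes the label of the nearest core sample if that sample lies within a radius `eps`, and is otherwise marked as noise with label `-1`. The function and parameter names are hypothetical.

```python
import math


def predict_nearest_centroid(centroids, X):
    """k-means style prediction: index of the closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(x, centroids[k]))
        for x in X
    ]


def predict_dbscan_like(core_samples, core_labels, X, eps):
    """Assumed DBSCAN-style rule: label of the nearest core sample that is
    within eps of the point, or -1 (noise) if no core sample is that close."""
    labels = []
    for x in X:
        best_label, best_dist = -1, eps
        for core, label in zip(core_samples, core_labels):
            d = math.dist(x, core)
            if d <= best_dist:
                best_label, best_dist = label, d
        labels.append(best_label)
    return labels


centers = [(0.0, 0.0), (10.0, 10.0)]
km_labels = predict_nearest_centroid(centers, [(1.0, 1.0), (9.0, 8.0)])
db_labels = predict_dbscan_like(
    centers, [0, 1], [(1.0, 0.0), (5.0, 5.0), (10.0, 11.0)], eps=2.0
)
```

The point `(5.0, 5.0)` shows why a generic nearest-neighbor rule does not fit both algorithms: the nearest-centroid rule would always assign it to some cluster, while the assumed DBSCAN-style rule reports it as noise.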
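On second thought, using kNN is quite arbitrary... In fact, any multi-class classifier would do.

    from sklearn.cluster import SpectralClustering
    from sklearn.svm import LinearSVC

    sc = SpectralClustering(n_clusters=5)
    y_tr = sc.fit_predict(X_tr)          # cluster labels for the training set
    clf = LinearSVC().fit(X_tr, y_tr)    # train a classifier on those labels
    y_te = clf.predict(X_te)             # assign clusters to unseen data

Conclusion: in the transductive case we should implement |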
Ok. Was just an idea ;) |
Closing this, as the new mixin from #907 does exactly that. And the tests ensure that it works consistently. Also, all classifiers have a consistent |
We should add predict and fit_predict to clustering algorithms other than KMeans: they are useful to retrieve cluster labels independently of the underlying attribute names...