add predict and fit_predict to more clustering algorithms #901

Closed
mblondel opened this issue Jun 12, 2012 · 13 comments

@mblondel (Member)

We should add predict and fit_predict to clustering algorithms other than KMeans: they are useful to retrieve cluster labels independently of the underlying attribute names...

@amueller (Member)

This could be done via a mixin, since fit should set the labels_ attribute.
Also it could then be easily tested using my PR. But first we need #806 ;)
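
Roughly, a minimal sketch of what such a mixin could look like (assuming only that every clustering estimator's fit sets labels_ on self):

class ClusterMixin(object):
    # fit_predict works for any clustering estimator whose fit()
    # sets a labels_ attribute.
    def fit_predict(self, X, y=None):
        self.fit(X)
        return self.labels_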

@mblondel (Member, Author)

Are there clustering algorithms in scikit-learn which are transductive (i.e., ones for which we cannot retrieve the cluster membership of new data)? For those algorithms, we can define fit_predict but not predict.

@amueller (Member)

Yes, SpectralClustering should be one of them.
Does this relate to my comment or was that just a general question?

@mblondel (Member, Author)

A bit of both ;) (we may need to take this into account for our design decision)

@mblondel (Member, Author)

@robertlayton, is DBSCAN transductive?

@amueller (Member)

Well, it doesn't have a predict function ;) In principle, all clusterings can get a predict using nearest neighbor prediction.
For example, we could build a NN object on the first call to predict and issue a warning or something...
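
Something along these lines, as a sketch only; the _nn_ cache and the _X_fit attribute holding the training data are hypothetical names, not existing scikit-learn API:

import warnings
from sklearn.neighbors import KNeighborsClassifier

class NearestNeighborPredictMixin(object):
    # On the first call to predict, lazily build a 1-nearest-neighbor
    # classifier over the stored training data and warn about the fallback.
    def predict(self, X):
        if not hasattr(self, "_nn_"):
            warnings.warn("predict falls back to 1-nearest-neighbor "
                          "prediction on the training data")
            self._nn_ = KNeighborsClassifier(n_neighbors=1).fit(
                self._X_fit, self.labels_)
        return self._nn_.predict(X)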

@mblondel (Member, Author)

Indeed, we could use nearest neighbor prediction!

predict should never change the state of the object so we can just use kNN with brute-force search.
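
For instance, a stateless sketch (predict_nearest is a hypothetical helper; X_fit is the data the estimator was fitted on):

from sklearn.metrics import pairwise_distances_argmin

def predict_nearest(estimator, X_fit, X_new):
    # Brute-force search on every call: each new sample gets the
    # cluster label of its closest training sample, and the
    # estimator's state is never modified.
    nearest = pairwise_distances_argmin(X_new, X_fit)
    return estimator.labels_[nearest]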

@amueller (Member)

Well, this wouldn't really be changing the state, rather doing some caching. A similar strategy was talked about for SVC in the linear case, where we concluded it would be better to do the computation in any case. Maybe take it to the mailing list?

@mblondel (Member, Author)

Ok, I understand what you mean now. Another option would be to add a constructor parameter like fit_inverse_transform in KernelPCA.

One problem with the nearest neighbor approach is that we need to keep a reference to the entire training set... So maybe we should just implement fit_predict in those cases.

@robertlayton (Member)

I'd be reluctant to use kNN by default for the predict function, as it doesn't hold in many cases. A standard example is k-means, where a sample's cluster is the nearest centroid, not the label of the nearest training point.

For DBSCAN it is also different, as DBSCAN can also determine whether a point is noise. The procedure for DBSCAN is to check whether a given sample is within eps distance of one of the core_samples. If it is, it takes the label of that core sample; if it is not, it is noise. The procedure is a little ambiguous when more than one core sample is within eps of the point, but I believe we would just go with whichever is closest.
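
In code, that procedure would look roughly like this (a sketch; dbscan_predict is hypothetical, not part of scikit-learn):

import numpy as np
from sklearn.metrics import pairwise_distances

def dbscan_predict(db, X_new):
    # db is a fitted DBSCAN estimator; db.components_ holds the core
    # samples and db.core_sample_indices_ their positions in labels_.
    dist = pairwise_distances(X_new, db.components_)
    nearest = dist.argmin(axis=1)
    labels = db.labels_[db.core_sample_indices_][nearest]
    # Samples with no core sample within eps are labelled noise (-1).
    labels[dist[np.arange(len(X_new)), nearest] > db.eps] = -1
    return labels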

@mblondel (Member, Author)

On second thought, using kNN is quite arbitrary... In fact, any multi-class classifier would do.

from sklearn.cluster import SpectralClustering
from sklearn.svm import LinearSVC

# X_tr, X_te: training and test data
sc = SpectralClustering(n_clusters=5)
y_tr = sc.fit_predict(X_tr)
clf = LinearSVC().fit(X_tr, y_tr)
y_te = clf.predict(X_te)

Conclusion: in the transductive case we should implement fit_predict only.

@amueller (Member)

Ok. Was just an idea ;)

@amueller (Member) commented Sep 2, 2012

Closing this, as the new mixin from #907 does exactly that, and the tests ensure that it works consistently. Also, all clustering estimators now have a consistent labels_ attribute.

@amueller closed this as completed on Sep 2, 2012