Added Kernel k-means (Issue #5373) #5483
Conversation
sklearn/cluster/kernel_kmeans_.py (outdated)
I believe one extra label update is needed here, after the loop: because cluster centers have been updated but training labels haven't, it's possible for the loop to finish in a way that `self.fit(X).predict(X)` returns different results than `self.fit_predict(X)`. See e.g. #5231.
Really great start! Nice, clean, easy-to-read code. One overall comment: I think an …
sklearn/cluster/kernel_kmeans_.py (outdated)
Print statement needs parentheses. Also, we probably want `from __future__ import print_function` at the top of the file.
`from __future__ import division` would be useful as well: then you wouldn't have to cast to float in e.g. `float(n_same) / n_samples` and other places (it also prevents potential hard-to-track errors between Python 2 and 3).
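A minimal illustration of both suggested imports:

```python
from __future__ import division, print_function

n_same, n_samples = 3, 7
ratio = n_same / n_samples      # true division on Python 2 as well: no float() cast
print("ratio =", ratio)         # print as a function works on both Python 2 and 3
```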
@jakevdp I have done some of the corrections. This code was not completely written by me; I added tests, examples, and some minor modifications, so I have a few doubts …
sklearn/cluster/kernel_kmeans_.py (outdated)
`mask_j`? Or is it too heavy?
In standard k-means, the chosen result is based on minimizing the "inertia", i.e. the sum of squared distances from cluster centers. We could do something similar here.
The last label update is needed because the cluster centers might change at the end of the final iteration; the …
@jakevdp So what do you suggest should be used in place of inertia?
I think you can generalize the inertia to be the sum of squared kernel dissimilarities. These dissimilarities are already computed within each iteration; I think it's as easy as doing …
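A hedged sketch of that generalization (the names here are illustrative, not the PR's): in feature space, the squared distance of sample `i` to the implicit center of cluster `j` expands via the kernel trick to `K_ii - (2/|C_j|) * sum_{l in C_j} K_il + (1/|C_j|^2) * sum_{l,m in C_j} K_lm`, and the inertia is the sum of each sample's squared distance to its assigned cluster:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

def kernel_inertia(X, labels, n_clusters, kernel="rbf"):
    """Sum of squared kernel distances of samples to their assigned
    implicit cluster centers (illustrative sketch, not the PR's code)."""
    K = pairwise_kernels(X, metric=kernel)
    n_samples = len(X)
    dist = np.zeros((n_samples, n_clusters))
    for j in range(n_clusters):
        mask = labels == j
        n_j = mask.sum()
        if n_j == 0:
            dist[:, j] = np.inf  # an empty cluster is infinitely far away
            continue
        # ||phi(x_i) - c_j||^2 expanded via the kernel trick
        dist[:, j] = (np.diag(K)
                      - 2.0 * K[:, mask].sum(axis=1) / n_j
                      + K[np.ix_(mask, mask)].sum() / n_j ** 2)
    return dist[np.arange(n_samples), labels].sum()
```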
Force-pushed from ca4bc24 to f629b54
Sorry for the long absence; I completely forgot about this PR.
Force-pushed from 1800df7 to e40c184
I'd rebase on master; I think those CircleCI issues have been fixed.
@jakevdp I have already done that, but the CircleCI build is still failing.
sklearn/cluster/kernel_kmeans_.py (outdated)

```python
    kernel : {'linear_kernel', 'polynomial_kernel', 'sigmoid_kernel',
              'rbf_kernel', 'chi2_kernel'}, default: 'linear_kernel'
        The type of kernel to be used in kernel k means algorithm.
```
I don't think this will render well in the HTML docs. After the colon you should specify the type, not the allowed values. See svm for an example of what I mean.
Also, ending each with `_kernel` is a bit redundant. Throughout the rest of the package we tend to use just "linear", "sigmoid", etc.
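Something along these lines would match the package convention (an illustrative snippet; the exact wording in the PR may differ):

```python
from sklearn.base import BaseEstimator, ClusterMixin

class KernelKMeans(BaseEstimator, ClusterMixin):
    """Kernel K-Means clustering.

    Parameters
    ----------
    kernel : string or callable, default='linear'
        Kernel used to compute pairwise similarities. If a string, it
        must be one of 'linear', 'poly', 'rbf', 'sigmoid' or 'chi2'
        (as in sklearn.metrics.pairwise.pairwise_kernels); if a
        callable, it is invoked on pairs of samples.
    """
```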
I have corrected this. Please have a look.
Looks pretty good, but tests are currently lacking. There are about 10 parameters that control the behavior of the Kernel K-Means object, but only a few of them are ever used in the test script as far as I can tell. In particular, we should run a test of all the valid …
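For instance, a loop-style test along these lines would exercise every allowed kernel string (a sketch only: the PR's actual tests may be organized differently, and `KernelKMeans` is assumed importable from `sklearn.cluster` with a `random_state` parameter):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KernelKMeans  # assumes this PR's class

def test_all_kernels():
    X, _ = make_blobs(n_samples=60, centers=3, random_state=0)
    # 'chi2' is skipped here because it requires non-negative features
    for kernel in ["linear", "poly", "rbf", "sigmoid"]:
        km = KernelKMeans(n_clusters=3, kernel=kernel, random_state=0)
        labels = km.fit_predict(X)
        assert labels.shape == (60,)
        assert len(np.unique(labels)) <= 3
```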
Force-pushed from fa4b135 to a1982f6
Force-pushed from 4efa460 to b25c223
a first pass
doc/modules/clustering.rst (outdated)
```rst
* - :ref:`Kernel K-Means`
  - number of clusters
  - Medium ``n_clusters`` and ``n_samples``
```
Indentation
doc/modules/clustering.rst (outdated)
```rst
Kernel K-Means
==============

The :class:`KernelKMeans` algorithm is an enhancement of the :class:`KMeans` algorithm
```
Please try to keep lines under 80 chars
sklearn/cluster/kernel_kmeans.py (outdated)
@@ -0,0 +1,287 @@

```python
"""
Kernel K-means clustering
```
Please put this on the first line.
```python
    Data used in clustering.
n_iter_ : Iteration in which algorithm converged
"""
```
References section belongs here, as does perhaps a brief doctest example
```python
# sanity check: re-predict labeling for training set samples
pred = clf.predict(X)
assert_array_equal(pred, clf.labels_)
```
You should also check that prediction is possible on a non-training dataset, and results should at least differ between datasets.
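Concretely, the check could look like this (a sketch extending the quoted test; `clf` and `X` are assumed to be the fitted estimator and training data from that test):

```python
import numpy as np

rng = np.random.RandomState(42)
X_test = rng.uniform(low=X.min(), high=X.max(), size=X.shape)

# prediction must be possible on data the model was not fit on
pred_test = clf.predict(X_test)
assert pred_test.shape == (X_test.shape[0],)
# and the labeling should (almost surely) differ from the training labels
assert not np.array_equal(pred_test, clf.labels_)
```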
doc/modules/clustering.rst (outdated)
```rst
==============

The :class:`KernelKMeans` algorithm is an enhancement of the :class:`KMeans` algorithm
which uses a kernel function to generate an appropriate non-linear mapping drom the
```
drom -> from
doc/modules/clustering.rst (outdated)
```rst
The :class:`KernelKMeans` algorithm is an enhancement of the :class:`KMeans` algorithm
which uses a kernel function to generate an appropriate non-linear mapping drom the
original (input) space to a higher-dimensional feature space to extract clusters
that are non-linearly seperable in input space.
```
"non-" -> "not"
sklearn/cluster/kernel_kmeans.py (outdated)
```python
sample_weight_ : array-like, shape=(n_samples,)
labels_ : shape=(n_samples,)
    Labels of each point.
```
*each training sample
sklearn/cluster/kernel_kmeans.py (outdated)
```python
K : Kernel matrix
dist : array-like, shape=(n_samples, n_clusters)
    Distance of each sample from cluster centers.
```
Note that this array is used for output; as an input, it should be all-zero.
sklearn/cluster/kernel_kmeans.py (outdated)
```python
within_distances : array, shape=(n_clusters,)
    Distance update.
update_within : {true, false}
```
`{True, False}`, or just `bool`.
Edited docs of kernel_params; removed gamma if not passed by user
Force-pushed from d354daa to 7e235f6
@jnothman I have completed all the changes.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5483      +/-   ##
==========================================
- Coverage   96.19%   95.48%   -0.71%
==========================================
  Files         348      344       -4
  Lines       64645    61084    -3561
==========================================
- Hits        62187    58328    -3859
- Misses       2458     2756     +298
```

Continue to review the full report at Codecov.
Hi @IshankGulati, I got here because I'm trying to use k-means with cosine distance. It looks like the kernel can be used for that? I see a callable option but couldn't find an example of how to write and use a custom kernel. Could you point me in the right direction, please? Would it work to implement it like this and set kernel=my_cosine_function? Thanks!
Note you should be using the cosine similarity kernel, not cosine distance.
> On 11 Mar 2017, "Ishank Gulati" wrote:
> Hi @lrusnac, sklearn already has a cosine kernel
> (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L875).
> If you want to use a custom kernel, you can follow the link I mentioned to implement
> one and then pass it as a callable. If you face any problems, feel free to ping me.
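For example (a sketch assuming this PR's `KernelKMeans` API, which was never merged into scikit-learn):

```python
from sklearn.cluster import KernelKMeans          # hypothetical: this PR's class
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import cosine_similarity

X, _ = make_blobs(n_samples=100, centers=5, random_state=0)

# any callable taking two sample matrices and returning a kernel matrix works
km = KernelKMeans(n_clusters=5, kernel=cosine_similarity)
labels = km.fit_predict(X)

# if the implementation delegates to pairwise_kernels, the string form
# kernel="cosine" may work as well
```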
@IshankGulati and @jnothman thanks.
Likely no reason.

> On 14 Mar 2017, "Leonid" wrote:
> @IshankGulati and @jnothman thanks.
> I was able to use cosine_similarity, but I found an issue: KernelKMeans has no
> n_jobs parameter, so it doesn't run in parallel. I found out that the bottleneck
> is in the function _get_kernel, where it calls pairwise_kernels. That function has
> an n_jobs parameter, but it defaults to 1, so only one core is used. I set its
> value to -1 and it increased the speed of clustering. Is there a reason
> KernelKMeans has no parallelisation, or is it something that got forgotten?
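The bottleneck described is the kernel computation itself, which `pairwise_kernels` can already parallelize; a minimal illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

X = np.random.RandomState(0).rand(2000, 50)
# n_jobs=-1 splits the pairwise kernel computation over all cores
K = pairwise_kernels(X, metric="cosine", n_jobs=-1)
```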
Is anyone planning on merging this?
I suppose we need to decide whether it's something we're keen on, or whether it would better reside in scikit-learn-extra. Have you tried using it, @ajilling? What does it help you solve, and is it efficient?
@jnothman I have used it. It has one weakness: it fails whenever it encounters an empty cluster, which differs from most of the other sklearn clustering implementations, which seem to be able to handle that. Efficiency is on par with everything else. I'm working on a project where I apply different clustering algorithms to datasets and compare the results. sklearn is the first place I look, so to me, the more algorithms in there the better; otherwise I have to resort to unknown packages or embedded R code.
Well, if the value in having something is for completeness rather than out of applied need, it's not really worth it for us to merge and maintain it. Thanks for noting the issue with empty clusters.
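For reference, one common way to avoid failing on empty clusters is to reseed them, e.g. with the sample farthest from its current assignment; a hedged sketch of that strategy (not this PR's code, and only one of several possible policies):

```python
import numpy as np

def relocate_empty_clusters(dist, labels, n_clusters):
    """Reseed any empty cluster with the sample farthest from its own
    assigned cluster. `dist` is the (n_samples, n_clusters) distance
    matrix; sketch only."""
    for j in range(n_clusters):
        if not np.any(labels == j):
            farthest = dist[np.arange(len(labels)), labels].argmax()
            labels[farthest] = j
    return labels
```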
Just in case someone finds this closed issue by googling, one alternative would be to do:

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import Nystroem
from sklearn.cluster import KMeans

X_train = ...

fake_kernel_kmeans = make_pipeline(
    Nystroem(n_components=1000, kernel="rbf", gamma=1e-3),  # gamma needs to be tuned
    PCA(50),
    KMeans(n_clusters=10),
).fit(X_train)
```

Note: I am not sure whether the PCA step to reduce the dimensionality is helpful or detrimental prior to k-means. Maybe dropping the PCA step and reducing the number of Nystroem components would work as well.

It's probably also necessary to add a column transformer to one-hot encode categorical features and StandardScale, QuantileTransform, or SplineTransform the numerical features to get a meaningful kernel expansion.
Added kernel k-means with tests and examples (Issue #5373)