
Finding next medoid only selects medoid from within same cluster #2

Closed
thomaskern opened this issue Dec 7, 2014 · 7 comments

Comments

@thomaskern

Let's try it one last time ;)

On line 25 [1] the current elements from the medoid cluster are passed to the compute_new_medoid function. As far as I can see k-medoids swaps the current medoid with any non-medoid element, not just within the current cluster. If that is true, the masking done on line 39 [2] seems wrong.

What I am fuzzy about is what 'swap' means here, as I don't yet have access to the actual paper. If an item from another cluster is selected, swapping might have wider implications.
If it just means the new item becomes the cluster's medoid and the old medoid remains in the same cluster, the computation seems easier to handle.

Am I wrong again? ;)

[1] https://github.com/salspaugh/machine_learning/blob/master/clustering/kmedoids.py#L25
[2] https://github.com/salspaugh/machine_learning/blob/master/clustering/kmedoids.py#L39
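To illustrate what I mean by swapping with any non-medoid, here is a rough sketch (function name and signature are made up, not from your code) of an update step where the candidate set is every non-medoid point, not just the current cluster's members:

```python
import numpy as np

def compute_new_medoid_global(cluster, medoids, distances):
    """Hypothetical sketch: pick the replacement medoid from ALL
    non-medoid points, scoring each by its summed distance to the
    cluster's members. `distances` is the full pairwise matrix."""
    mask = np.ones(distances.shape[0], dtype=bool)
    mask[medoids] = False                       # exclude current medoids
    candidates = np.where(mask)[0]
    # Cost of a candidate = total distance from the cluster's points to it.
    costs = distances[np.ix_(cluster, candidates)].sum(axis=0)
    return candidates[np.argmin(costs)]
```

This is only meant to show the shape of the alternative, not a drop-in replacement for compute_new_medoid.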

@salspaugh
Owner

Thanks for posting -- I will be able to look at this and reply on Thursday night or Friday. Sorry I can't get to it sooner. I hope it's not urgent.

@thomaskern
Author

Cool, I hope I can get a pull request up by then.

@thomaskern
Author

I have found two different variations: one where the new medoid can only be chosen from within the cluster [1] and one where any non-medoid is a candidate [2].
You seem to have implemented [1], and as far as I can see it is actually doing what it is supposed to ;)
That being said, the function assign_points_to_clusters is pretty much definitely wrong, as it doesn't even assign medoids to their own clusters; after running assign_points_to_clusters there are medoids which have no (!) object assigned to them.

I circumvented that by doing:

def assign_points_to_clusters(medoids, distances):
    distances_to_medoids = distances[:,medoids]
    # Label each point with its closest medoid.
    clusters = medoids[np.argmin(distances_to_medoids, axis=1)]
    # Force every medoid into its own cluster, regardless of ties.
    clusters[medoids] = medoids
    return clusters

Problem is, if each medoid is not assigned to its own cluster, the medoids are going to collapse during the iteration over the medoids and compute_new_medoid, and you basically end up with only one object serving as the medoid for all k medoids.

Sorry for going through your code like that. Since you only did it for class work, I suppose you don't care about it anyway.

[1] http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_Kmedoids.html
[2] Sergios Theodoridis & Konstantinos Koutroumbas (2006). Pattern Recognition 3rd ed. p. 635. / http://en.wikipedia.org/wiki/K-medoids
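For contrast, a minimal sketch of variant [1] (function name made up, not taken from the repo), where the replacement medoid is chosen only among the cluster's own members:

```python
import numpy as np

def compute_new_medoid_within_cluster(cluster, distances):
    """Hypothetical sketch of variant [1]: restrict candidates to the
    cluster's members and pick the one minimizing the total distance
    to the rest of the cluster."""
    cluster = np.asarray(cluster)
    # Submatrix of pairwise distances among the cluster's members;
    # column sums give each member's total distance to the others.
    costs = distances[np.ix_(cluster, cluster)].sum(axis=0)
    return cluster[np.argmin(costs)]
```

The only difference from variant [2] is the candidate set; the cost function is the same.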

@salspaugh
Owner

I took a closer look and ran it a few times, and based on my observations it appears that line 34 does assign medoids to their own clusters, unless there is some other medoid at zero distance to that medoid, which I suppose could happen if the two medoids were equal (the same point, or two different points that are equal). Did you observe this happening? When I test it with random Gaussian clusters and an assert (clusters[medoids] == medoids).all() added after line 34 to make sure that medoids are assigned to their own clusters, it all seems to work fine.

To reason out loud: np.argmin should return, for each point (i.e. each row of distances_to_medoids), the column index of the minimum value, which corresponds to the distance to the closest medoid. It uses these indices to get the relevant medoids -- this becomes clusters in your change. Since the distance between a medoid and itself is zero, that index should always be the one returned by np.argmin, unless there is another medoid at zero distance, in which case I think it's true that it could break, though I didn't reproduce this behavior.
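For instance, a tiny made-up example of that failure mode, where two identical points are both chosen as medoids:

```python
import numpy as np

# Hypothetical 3-point dataset: points 0 and 1 are identical, so
# d(0, 1) == 0. Suppose both happen to be selected as medoids.
distances = np.array([[0., 0., 5.],
                      [0., 0., 5.],
                      [5., 5., 0.]])
medoids = np.array([0, 1])

distances_to_medoids = distances[:, medoids]
clusters = medoids[np.argmin(distances_to_medoids, axis=1)]

# np.argmin breaks the tie by taking the first column, so medoid 1
# lands in medoid 0's cluster and the assertion would fail:
print((clusters[medoids] == medoids).all())   # -> False
```

So the break requires a zero-distance tie between medoids, which random Gaussian data essentially never produces.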

I don't mind you going through it! It's public so it's fair game. I am not actively using it and I didn't really expect anyone else would either, but it's always good to fix bugs where they exist.

@salspaugh
Owner

Updated the code. Thanks for commenting.

@thomaskern
Author

Yes, I did observe it -- all the time, actually. I am doing TF-IDF on short text, and the distance between two items is regularly the same (unfortunately). If you want to experience it too, I prepared a numpy array.

Here is the numpy array: https://www.dropbox.com/s/j68f9av6eyithsz/tfidf.txt.zip?dl=0

Try this (with the old code):

    import numpy as np

    tfidf = np.loadtxt("tfidf.txt")
    cluster(tfidf, 300)

This should never converge.

@salspaugh
Owner

Ah, I see. Thanks!
