
Finding next medoid only selects medoid from within same cluster #2

Closed
thomaskern opened this issue Dec 7, 2014 · 7 comments

Comments

@thomaskern

Let's try it one last time ;)

On line 25 [1] the current elements from the medoid cluster are passed to the compute_new_medoid function. As far as I can see k-medoids swaps the current medoid with any non-medoid element, not just within the current cluster. If that is true, the masking done on line 39 [2] seems wrong.

What I am fuzzy about is what 'swap' means here, as I don't yet have access to the actual paper. If an item from another cluster is selected, swapping might have wider implications.
If it just means the new item becomes the cluster's medoid and the old medoid remains in the same cluster, the computation seems easier to handle.

Am I wrong again? ;)

[1] https://github.com/salspaugh/machine_learning/blob/master/clustering/kmedoids.py#L25
[2] https://github.com/salspaugh/machine_learning/blob/master/clustering/kmedoids.py#L39
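To illustrate what I mean by swapping with any non-medoid, here is a rough sketch (function name and signature are made up, not from your code) of an update step where the candidate set is every non-medoid point, not just the current cluster's members:

```python
import numpy as np

def compute_new_medoid_global(cluster, medoids, distances):
    """Hypothetical sketch: pick the replacement medoid from ALL
    non-medoid points, scoring each by its summed distance to the
    cluster's members. `distances` is the full pairwise matrix."""
    mask = np.ones(distances.shape[0], dtype=bool)
    mask[medoids] = False                       # exclude current medoids
    candidates = np.where(mask)[0]
    # Cost of a candidate = total distance from the cluster's points to it.
    costs = distances[np.ix_(cluster, candidates)].sum(axis=0)
    return candidates[np.argmin(costs)]
```

This is only meant to show the shape of the alternative, not a drop-in replacement for compute_new_medoid.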

@salspaugh
Owner

Thanks for posting -- I will be able to look at this and reply on Thursday night or Friday. Sorry I can't get to it sooner. I hope it's not urgent.

@thomaskern
Author

Cool, I hope I can get a pull request up by then.

@thomaskern
Author

I have found two different variations: one where the new medoid can only be chosen from within the cluster [1] and one where any non-medoid is a candidate [2].
You seem to have implemented [1], and as far as I can see it is actually doing what it is supposed to ;)
That being said, the function assign_points_to_clusters is pretty much definitely wrong, as it doesn't even assign medoids to their own clusters; after running assign_points_to_clusters there are medoids which have no (!) object assigned to them.

I circumvented that by doing:

def assign_points_to_clusters(medoids, distances):
    distances_to_medoids = distances[:,medoids]
    # Label each point with its closest medoid.
    clusters = medoids[np.argmin(distances_to_medoids, axis=1)]
    # Force every medoid into its own cluster, regardless of ties.
    clusters[medoids] = medoids
    return clusters

Problem is, if each medoid is not assigned to its own cluster, the medoids are going to collapse during the iteration over the medoids and compute_new_medoid, and you basically end up with only one object serving as the medoid for all k medoids.

Sorry for going through your code like that. Since you only did it for class work, I suppose you don't care about it anyway.

[1] http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_Kmedoids.html
[2] Sergios Theodoridis & Konstantinos Koutroumbas (2006). Pattern Recognition 3rd ed. p. 635. / http://en.wikipedia.org/wiki/K-medoids
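For contrast, a minimal sketch of variant [1] (function name made up, not taken from the repo), where the replacement medoid is chosen only among the cluster's own members:

```python
import numpy as np

def compute_new_medoid_within_cluster(cluster, distances):
    """Hypothetical sketch of variant [1]: restrict candidates to the
    cluster's members and pick the one minimizing the total distance
    to the rest of the cluster."""
    cluster = np.asarray(cluster)
    # Submatrix of pairwise distances among the cluster's members;
    # column sums give each member's total distance to the others.
    costs = distances[np.ix_(cluster, cluster)].sum(axis=0)
    return cluster[np.argmin(costs)]
```

The only difference from variant [2] is the candidate set; the cost function is the same.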

@salspaugh
Owner

I took a closer look and ran it a few times, and based on my observations it appears that line 34 does assign medoids to their own clusters, unless there is some other medoid at zero distance to that medoid, which I suppose could happen if the two medoids were equal (the same point, or two different points that are equal). Did you observe this happening? When I test it with random Gaussian clusters and an assert (clusters[medoids] == medoids).all() added after line 34 to make sure that medoids are assigned to their own clusters, it all seems to work fine.

To reason out loud: np.argmin should return, for each point (i.e. each row of distances_to_medoids), the column index of the minimum value, which corresponds to the distance to the closest medoid. It uses these indices to get the relevant medoids -- this becomes clusters in your change. Since the distance between a medoid and itself is zero, that index should always be the one returned by np.argmin, unless there is another medoid at zero distance, in which case I think it's true that it could break, though I didn't reproduce this behavior.
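For instance, a tiny made-up example of that failure mode, where two identical points are both chosen as medoids:

```python
import numpy as np

# Hypothetical 3-point dataset: points 0 and 1 are identical, so
# d(0, 1) == 0. Suppose both happen to be selected as medoids.
distances = np.array([[0., 0., 5.],
                      [0., 0., 5.],
                      [5., 5., 0.]])
medoids = np.array([0, 1])

distances_to_medoids = distances[:, medoids]
clusters = medoids[np.argmin(distances_to_medoids, axis=1)]

# np.argmin breaks the tie by taking the first column, so medoid 1
# lands in medoid 0's cluster and the assertion would fail:
print((clusters[medoids] == medoids).all())   # -> False
```

So the break requires a zero-distance tie between medoids, which random Gaussian data essentially never produces.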

I don't mind you going through it! It's public so it's fair game. I am not actively using it and I didn't really expect anyone else would either, but it's always good to fix bugs where they exist.

@salspaugh
Owner

Updated the code. Thanks for commenting.

@thomaskern
Author

Yes, I did observe it -- all the time, actually. I am doing TF-IDF on short text, and the distance between two items is regularly the same (unfortunately). If you want to experience it too, I prepared a numpy array.

Here is the numpy array: https://www.dropbox.com/s/j68f9av6eyithsz/tfidf.txt.zip?dl=0

Try this (with the old code):

    import numpy as np

    tfidf = np.loadtxt("tfidf.txt")
    cluster(tfidf, 300)

This should never converge.

@salspaugh
Owner

Ah, I see. Thanks!
