Finding next medoid only selects medoid from within same cluster #2
Comments
Thanks for posting. I will be able to look at this and reply on Thursday night or Friday. Sorry I can't get to it sooner. I hope it's not urgent.
Cool, I hope I can get a pull request up by then.
I have found two different variations: one where you can only choose from within the cluster [1] and one where you can choose any non-medoid [2]. I circumvented that by doing
The problem is that if each medoid is not assigned to its own cluster, the medoids collapse when you iterate over them and call compute_new_medoid, and you end up with a single object being used as the medoid for all k clusters. Sorry for going through your code like that. Since you only wrote it for class work, I suppose you don't care about it anyway. [1] http://www.math.le.ac.uk/people/ag153/homepage/KmeansKmedoids/Kmeans_Kmedoids.html
I took a closer look and ran it a few times. Based on my observations, line 34 does assign medoids to their own clusters, unless some other medoid is also at zero distance from that medoid, which I suppose could happen if the two medoids were equal (the same point, or two different points with equal values). Did you observe this happening? When I test it out with random Gaussian clusters with an
I don't mind you going through it! It's public, so it's fair game. I am not actively using it and I didn't really expect anyone else would either, but it's always good to fix bugs where they exist.
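For anyone following along, here is a minimal sketch of the tie case described above. It is not the code from kmedoids.py: `assign_clusters`, `distances`, and `medoids` are hypothetical names, and the sketch assumes a precomputed pairwise distance matrix and distinct medoid indices. It relies on `np.argmin` returning the first minimum, so two medoids at zero distance from each other both land in the earlier medoid's cluster unless each medoid is pinned to its own cluster.

```python
import numpy as np

def assign_clusters(distances, medoids):
    """Assign each point to its nearest medoid, then pin every medoid to its own cluster.

    distances: (n, n) pairwise distance matrix (assumed precomputed).
    medoids:   array of k distinct point indices (hypothetical layout).
    """
    labels = np.argmin(distances[:, medoids], axis=1)
    # np.argmin breaks ties by taking the first minimum, so a medoid that is at
    # zero distance from an earlier medoid would join that medoid's cluster.
    # Overwrite the medoids' own labels to rule that out.
    labels[medoids] = np.arange(len(medoids))
    return labels

# Points 0 and 1 are identical, and both are medoids.
points = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]])
distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoids = np.array([0, 1])

naive = np.argmin(distances[:, medoids], axis=1)  # no pinning: cluster 1 ends up empty
pinned = assign_clusters(distances, medoids)      # each medoid keeps its own cluster
print(naive, pinned)
```

With data that has many equal distances (such as short-text TF-IDF vectors), the unpinned assignment can leave clusters empty, which matches the collapse described earlier in the thread.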
Updated the code. Thanks for commenting.
Yes, I did observe it, all the time actually. I am doing TF-IDF on short text, and the distance between two items is regularly the same (unfortunately). If you want to reproduce it, I prepared a numpy array: https://www.dropbox.com/s/j68f9av6eyithsz/tfidf.txt.zip?dl=0 Try this (with the old code):
This should never converge.
Ah, I see. Thanks!
Let's try it one last time ;)
On line 25 [1] the current elements of the medoid's cluster are passed to the compute_new_medoid function. As far as I can see, k-medoids swaps the current medoid with any non-medoid element, not just one within the current cluster. If that is true, the masking done on line 39 [2] seems wrong.
What I am fuzzy about is what 'swap' means, as I don't yet have access to the actual paper. If an item from another cluster is selected, swapping might have wider implications.
If it just means the new item is assigned as the medoid of the cluster and the old medoid remains in the same cluster, the computation seems easier to handle.
Am I wrong, again? ;)
[1] https://github.com/salspaugh/machine_learning/blob/master/clustering/kmedoids.py#L25
[2] https://github.com/salspaugh/machine_learning/blob/master/clustering/kmedoids.py#L39
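To make the two readings of the update step concrete, here is a rough sketch of both, under stated assumptions. It is not the repository's compute_new_medoid: `total_cost`, `update_within_cluster`, and `best_swap` are hypothetical names, and the swap variant follows my understanding of the PAM-style formulation, where a medoid may be exchanged with any non-medoid, the old medoid becomes an ordinary point, and every point is then re-assigned to its nearest medoid.

```python
import numpy as np

def total_cost(distances, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return distances[:, medoids].min(axis=1).sum()

def update_within_cluster(distances, labels, medoids):
    """Variant 1: the new medoid of a cluster is the member with the
    smallest summed distance to the other members of that cluster."""
    new_medoids = medoids.copy()
    for c in range(len(medoids)):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue  # empty cluster: keep the old medoid
        within = distances[np.ix_(members, members)].sum(axis=1)
        new_medoids[c] = members[np.argmin(within)]
    return new_medoids

def best_swap(distances, medoids):
    """Variant 2 (PAM-style, as assumed here): try replacing each medoid with
    any non-medoid and keep the single swap that lowers the total cost most."""
    best, best_cost = medoids.copy(), total_cost(distances, medoids)
    non_medoids = np.setdiff1d(np.arange(distances.shape[0]), medoids)
    for i in range(len(medoids)):
        for candidate in non_medoids:
            trial = medoids.copy()
            trial[i] = candidate
            cost = total_cost(distances, trial)
            if cost < best_cost:
                best, best_cost = trial, cost
    return best, best_cost

# Toy 1-D data: two obvious groups, with both starting medoids in the first group.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
distances = np.abs(points[:, None] - points[None, :])
medoids = np.array([0, 1])

labels = np.argmin(distances[:, medoids], axis=1)
labels[medoids] = np.arange(len(medoids))  # pin each medoid to its own cluster

within = update_within_cluster(distances, labels, medoids)
swapped, swapped_cost = best_swap(distances, medoids)
print(within, swapped, swapped_cost)
```

In this sketch the swap variant never needs a per-cluster mask, because the cost is recomputed over all points after every trial swap. The price is more work per iteration: every (medoid, non-medoid) pair is evaluated.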