Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature request: SNN clustering #11

Closed
maia opened this issue Aug 27, 2019 · 7 comments
Closed

feature request: SNN clustering #11

maia opened this issue Aug 27, 2019 · 7 comments
Assignees

Comments

@maia
Copy link

maia commented Aug 27, 2019

I'd like to add a feature request for SNN Clustering (Shared Nearest Neighbors) and possibly also HDBSCAN.

The advantage of SNN over DBSCAN is to be able to identify clusters with different densities, and also it does not need to be provided a fixed number of clusters as k-means does. Also its way of identifying similarity between items has advantages over euclidian distance when working with a higher number of dimensions.

HDBSCAN is an interesting improvement over DBSCAN, as it only requires one hyperparameter, and reports a probability of assignment to a cluster (which can be used to optimize the minPts hyperparameter). Towards Data Science: How to cluster in High Dimensions has an interesting overview including possible advantages of HDBSCAN and SNN (in the variant SNN-cliq).

Both algorithms have no Ruby implementation yet, as far to my knowledge.

PS: thanks for all your work on Rumale so far, it's greatly appreciated.

@yoshoku yoshoku self-assigned this Aug 28, 2019
@yoshoku
Copy link
Owner

yoshoku commented Aug 28, 2019

Thank you for your interesting suggestion. I will try to implement SNN Clustering 💪 Because the algorithm of HDBSCAN is complicated, such as extracting MST, it takes time to implement the HDBSCAN. I would like to make it a future issue.

@maia
Copy link
Author

maia commented Aug 28, 2019

Thats great to hear! It seems that SNN should be able to use large parts of the DBSCAN algorithm, just with a different distance metric than euclidian (and in general it might be a good idea to allow switching the distance metric in DBSCAN, some people might prefer manhattan distance, or ranked distance).

As for HDBSCAN, I just found a lua implemetation jariasf/hdbscan that has a very brief MST calculation, but it might be too naive and not performing well. Unfortunately I'm not yet familiar enough with this problem.

@yoshoku
Copy link
Owner

yoshoku commented Sep 1, 2019

I have implemented and released SNN clustering in version 0.13.1. In addition, DBSCAN becomes able to use arbitrary distance metric.

require 'rumale'

# SNN
estimator = Rumale::Clustering::SNN.new(n_neighbors: 10, eps: 6, metric: 'euclidean')
cluster_labels = estimator.fit_predict(x)

# DBSCAN with manhattan distance.
distance_mat = Rumale::PairwiseMetric.manhattan_distance(x)
estimator = Rumale::Clustering::DBSCAN.new(eps: 0.5, metric: 'precomputed')
cluster_labels = estimator.fit_predict(distance_mat)

The HDBSCAN implementation will probably require Ruby Extension. I need time to design the program structure. I would like to consider the implementation of HDBSCAN when implementing hierarchical clustering such as the Ward method in the future.

@maia
Copy link
Author

maia commented Sep 5, 2019

That's great news, thanks a lot! I can confirm SNN works as intended. I just had to figure out what exactly needs to be passed to #fit_predict (or which data types are accepted), and as I think that's not too intuitive (also: #fit, #fit_transform), that can be something new users will be challenged with.

@yoshoku
Copy link
Owner

yoshoku commented Sep 5, 2019

Rumale's method name and interface refer to scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN.fit_predict). Currently, I do not plan to change the Rumale's interface.

The documentation issue is that sometimes rubydoc.info fails to parse yard documents. I uploaded the yard document generated in the local environment earlier, so please refer to it.
https://yoshoku.github.io/rumale/doc/Rumale/Clustering/SNN.html
Moreover, some blog posts about Rumale are useful.
https://dev.to/kojix2/rumale-cheat-sheet-4pg7

I think one of the best solutions is to write a detail user guide, however I do not have enough time to do that.

@maia
Copy link
Author

maia commented Sep 6, 2019

@yoshoku Thanks a lot. I'm aware of how much work a user guide is, so let's hope someone will be able to contribute a blog post in the near future. Rumale is really helpful in instances where one does not want a python setup on an additional server, but to perform quick tasks from within a rails app itself.

@yoshoku
Copy link
Owner

yoshoku commented Sep 22, 2019

@maia I have implemented and released HDBSCAN clustering in version 0.13.4. Thank you for your request. I have learned a lot from implementing that. I will close this issue.

Example code:

require 'rumale'

samples, _ = Rumale::Dataset.make_moons(500, noise: 0.08, random_seed: 1)
outliers = (Rumale::Utils.rand_uniform([50, 2], Random.new(1)) - 0.3) * 3.5 - Numo::DFloat[0.2, 0.5]
x = Numo::NArray.vstack([samples, outliers])

hdbscan = Rumale::Clustering::HDBSCAN.new(min_samples: 5)
cluster_labels = hdbscan.fit_predict(x)

Result:
hdbscan

@yoshoku yoshoku closed this as completed Sep 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants