add graph-based clustering #21570

Open
dayyass opened this issue Nov 6, 2021 · 2 comments · May be fixed by #21571
Labels
module:cluster Needs Decision - Include Feature Requires decision regarding including feature New Feature

Comments

@dayyass

dayyass commented Nov 6, 2021

Describe the workflow you want to enable

Graph-Based Clustering (original repo link)

Graph-Based Clustering using connected components and minimum spanning trees.
Both suggested clustering methods are transductive - meaning they are not designed to be applied to new, unseen data.

ConnectedComponentsClustering

This method computes the pairwise distance matrix of the input data, binarizes it with a user-provided threshold to obtain an undirected graph, and takes the connected components of that graph as the clusters.
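The steps above can be sketched with SciPy as follows. This is only an illustration, not the code from the linked repository: the function name `connected_components_labels` is hypothetical, and treating a distance less than or equal to the threshold as an edge is an assumption (the original may use a strict comparison).

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform


def connected_components_labels(X, threshold, metric="euclidean"):
    """Hypothetical helper: cluster by connected components of a thresholded graph."""
    # Full symmetric matrix of pairwise distances.
    distances = squareform(pdist(X, metric=metric))
    # Binarize: two points share an edge when their distance is within the threshold
    # (assumption: inclusive comparison).
    adjacency = (distances <= threshold).astype(int)
    # Each connected component of the undirected graph becomes one cluster.
    _, labels = connected_components(adjacency, directed=False)
    return labels


X = np.array([[0, 1], [1, 0], [1, 1]])
labels_small = connected_components_labels(X, threshold=0.275)
labels_large = connected_components_labels(X, threshold=1.1)
```

With `threshold=0.275` no pair of points in `X` is close enough to be linked, so every point is its own cluster; with `threshold=1.1` the two unit-length edges join all three points into one cluster.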

SpanTreeConnectedComponentsClustering

This method computes the pairwise distance matrix of the input data, builds a graph from it, finds the minimum spanning tree, and finally splits the tree into n_clusters components (n_clusters is a user-provided parameter) by removing the n_clusters - 1 edges with the highest weights.
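A minimal sketch of this procedure with SciPy is shown below. It is an illustration under stated assumptions, not the implementation from the linked repository: the function name `span_tree_labels` is hypothetical, and the sketch assumes there are no duplicate points, since SciPy's `minimum_spanning_tree` treats zero entries in a dense matrix as missing edges.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def span_tree_labels(X, n_clusters, metric="euclidean"):
    """Hypothetical helper: cut the heaviest MST edges to get n_clusters components."""
    # Complete graph weighted by pairwise distances
    # (assumption: no zero distances between distinct points).
    distances = squareform(pdist(X, metric=metric))
    # Dense copy of the minimum spanning tree; zero entries mean "no edge".
    mst = minimum_spanning_tree(distances).toarray()
    if n_clusters > 1:
        rows, cols = np.nonzero(mst)
        # Positions of the n_clusters - 1 heaviest tree edges.
        heaviest = np.argsort(mst[rows, cols])[-(n_clusters - 1):]
        mst[rows[heaviest], cols[heaviest]] = 0
    # The remaining forest has exactly n_clusters connected components.
    _, labels = connected_components(mst, directed=False)
    return labels


# Two well-separated pairs of points (illustrative data, not from the issue).
X_demo = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
labels_demo = span_tree_labels(X_demo, n_clusters=2)
```

Here the single long edge that bridges the two pairs is the heaviest edge of the tree, so removing it leaves each pair as its own cluster.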

Describe your proposed solution

ConnectedComponentsClustering

Required arguments:

  • threshold - parameter used to binarize the pairwise distance matrix and build the undirected graph

Optional arguments:

Example:

import numpy as np
from graph_based_clustering import ConnectedComponentsClustering

X = np.array([[0, 1], [1, 0], [1, 1]])

clustering = ConnectedComponentsClustering(
    threshold=0.275,
    metric="euclidean",
    n_jobs=-1,
)

clustering.fit(X)
labels_pred = clustering.labels_

# alternative
labels_pred = clustering.fit_predict(X)

SpanTreeConnectedComponentsClustering

Required arguments:

  • n_clusters - the number of clusters to find

Optional arguments:

Example:

import numpy as np
from graph_based_clustering import SpanTreeConnectedComponentsClustering

X = np.array([[0, 1], [1, 0], [1, 1]])

clustering = SpanTreeConnectedComponentsClustering(
    n_clusters=3,
    metric="euclidean",
    n_jobs=-1,
)

clustering.fit(X)
labels_pred = clustering.labels_

# alternative
labels_pred = clustering.fit_predict(X)

Describe alternatives you've considered, if relevant

No response

Additional context

ConnectedComponentsClustering

image

SpanTreeConnectedComponentsClustering

image

@dayyass dayyass linked a pull request Nov 6, 2021 that will close this issue
@glemaitre
Member

Could you provide references for these methods and explain the benefits of these clustering approaches compared to the ones currently implemented in scikit-learn?

@adrinjalali
Member

It also looks to me as if we could introduce this as an improvement to the existing ones.

@cmarmo cmarmo added module:cluster Needs Decision - Include Feature Requires decision regarding including feature labels Sep 14, 2022