add graph-based clustering #21570

dayyass · 2021-11-06T09:47:17Z

Describe the workflow you want to enable

Graph-Based Clustering (original repo link)

Graph-Based Clustering using connected components and minimum spanning trees.
Both suggested clustering methods are transductive - meaning they are not designed to be applied to new, unseen data.

ConnectedComponentsClustering

This method computes the pairwise distance matrix of the input data, binarizes it with a user-provided threshold to build an undirected graph, and takes the connected components of that graph as the clusters.
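A minimal sketch of this idea, not the code from the linked repository: the helper name `connected_components_labels` is hypothetical, and treating a distance less than or equal to the threshold as an edge is an assumption (the description does not say whether the comparison is strict).

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.metrics import pairwise_distances


def connected_components_labels(X, threshold, metric="euclidean"):
    """Hypothetical helper: cluster by thresholding pairwise distances."""
    # Pairwise distance matrix of shape (n_samples, n_samples).
    distances = pairwise_distances(X, metric=metric)
    # Binarize: two samples share an edge when their distance is at most
    # the threshold (assumption: non-strict comparison).
    adjacency = (distances <= threshold).astype(float)
    # Each connected component of the undirected graph is one cluster.
    _, labels = connected_components(adjacency, directed=False)
    return labels


# Two well-separated pairs of points.
X_demo = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
labels_demo = connected_components_labels(X_demo, threshold=0.5)
```

With this toy data the first two points end up in one cluster and the last two in another; the number of clusters is driven entirely by the threshold.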

SpanTreeConnectedComponentsClustering

This method computes the pairwise distance matrix of the input data, builds a graph from it, finds the minimum spanning tree, and finally performs the clustering by splitting the tree into n_clusters (a user-provided parameter) components, removing the n_clusters - 1 edges with the highest weights.
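The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the code from the linked repository: the helper name `span_tree_labels` is hypothetical, and the sketch assumes there are no duplicate points, because SciPy's `minimum_spanning_tree` treats zero entries as missing edges.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from sklearn.metrics import pairwise_distances


def span_tree_labels(X, n_clusters, metric="euclidean"):
    """Hypothetical helper: cluster by cutting the heaviest MST edges."""
    # Complete graph weighted by pairwise distances.
    distances = pairwise_distances(X, metric=metric)
    # Minimum spanning tree as a sparse matrix with n_samples - 1 edges
    # (assumption: no zero distances between distinct samples).
    mst = minimum_spanning_tree(distances).tocoo()
    # Sort edges by weight and drop the n_clusters - 1 heaviest ones.
    order = np.argsort(mst.data)
    keep = order[: len(mst.data) - (n_clusters - 1)]
    pruned = coo_matrix(
        (mst.data[keep], (mst.row[keep], mst.col[keep])), shape=mst.shape
    )
    # The remaining forest has n_clusters connected components.
    _, labels = connected_components(pruned, directed=False)
    return labels


# Two pairs of nearby points plus one isolated point.
X_tree = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [20.0, 0.0]])
labels_tree = span_tree_labels(X_tree, n_clusters=3)
```

Unlike the threshold-based variant, the number of clusters is fixed in advance and the cut-off distance is implied by the data.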

Describe your proposed solution

ConnectedComponentsClustering

Required arguments:

threshold - parameter used to binarize the pairwise distance matrix and build the undirected graph

Optional arguments:

metric - sklearn.metrics.pairwise_distances parameter (default: "euclidean")
n_jobs - sklearn.metrics.pairwise_distances parameter (default: None)

Example:

import numpy as np
from graph_based_clustering import ConnectedComponentsClustering

X = np.array([[0, 1], [1, 0], [1, 1]])

clustering = ConnectedComponentsClustering(
    threshold=0.275,
    metric="euclidean",
    n_jobs=-1,
)

clustering.fit(X)
labels_pred = clustering.labels_

# alternative
labels_pred = clustering.fit_predict(X)

SpanTreeConnectedComponentsClustering

Required arguments:

n_clusters - the number of clusters to find

Optional arguments:

metric - sklearn.metrics.pairwise_distances parameter (default: "euclidean")
n_jobs - sklearn.metrics.pairwise_distances parameter (default: None)

Example:

import numpy as np
from graph_based_clustering import SpanTreeConnectedComponentsClustering

X = np.array([[0, 1], [1, 0], [1, 1]])

clustering = SpanTreeConnectedComponentsClustering(
    n_clusters=3,
    metric="euclidean",
    n_jobs=-1,
)

clustering.fit(X)
labels_pred = clustering.labels_

# alternative
labels_pred = clustering.fit_predict(X)

Describe alternatives you've considered, if relevant

No response

Additional context

ConnectedComponentsClustering

SpanTreeConnectedComponentsClustering

glemaitre · 2021-11-07T17:27:32Z

Could you provide references for these methods and emphasize the benefits of these clustering approaches compared to those currently implemented in scikit-learn?

adrinjalali · 2021-11-09T10:47:15Z

It also looks to me as if we could introduce this as an improvement to the existing ones.

dayyass added the New Feature label Nov 6, 2021

dayyass linked a pull request Nov 6, 2021 that will close this issue

add graph-based clustering #21571

Open

cmarmo added the module:cluster and Needs Decision - Include Feature labels Sep 14, 2022
