Stochastic Cluster Embedding (SCE)

The code implements the algorithm in the SCE paper.

@article{sce,
    title={Stochastic Cluster Embedding},
    author={Zhirong Yang and Yuwei Chen and Denis Sedov and Samuel Kaski and Jukka Corander}
    journal={Statistics and Computing},
    year={2023},
    volume={33},
    number={12},
}

SCE is a data visualization method which shows the clusters in well-clusterable data. The input is a pairwise similarity matrix, which is often obtained by using Entropic Affinity or K-nearest-neighbor graph.

The SCE program is a standalone excutable which has been tested in Ubuntu. We also provide wrappers in Python and Matlab.

Prerequisite

Install GSL (GNU Scientific Library) before compilation by sudo apt-get install libgsl-dev

Compilation

Modify the CUDA architecture in compile.sh to fit your GPU. Here I use Quadro RTX 4000, so it is "-gencode arch=compute_75,code=sm_75". Then run compile.sh.

Demo usage

For Matlab, see demo.m. For Python, see demo_vectors.py or demo_similarities.py.

Demonstration

You can download the data here to repeat the following demos. Note that some similarity matrices may be non-normalized (SCE will normalize them internally).

SHUTTLE data set

58,000 data points. Data source. Pre-computed similarity matrix (entropic affinity with perplexity 30).

SCE (69 seconds)	t-SNE	LargeVis	UMAP

MNIST data set

70,000 data points, Data source, Pre-computed similarity matrix (10-nearst-neighbor graph).

SCE (47 seconds)	t-SNE	LargeVis	UMAP

IJCNN data set

126,701 data points. Data source, see ijcnn1. Pre-computed similarity matrix (entropic affinity with perplexity 30), where class labels stand for ten engines instead of the original binary.

SCE (75 seconds)	t-SNE	LargeVis	UMAP

TOMORADAR data set

120,024 data points. The data was collected via a helicopter-borne microwave profiling radar termed FGI-Tomoradar to investigate the vertical topography structure of forests. The original vectorial data (NB! 5.6GB). Pre-computed similarity matrix (50-nearest-neighbor graph).

SCE (660 seconds)	t-SNE	LargeVis	UMAP

FLOW-CYTOMETRY data set

1,000,000 data points. Data source. Preprocessed vectorial data. Pre-computed similarity matrix (15-nearest-neighbor graph).

SCE (7713 seconds)	t-SNE	LargeVis	UMAP

HIGGS data set

11,000,000 points. Data source. Pre-computed similarity matrix (5-nearest-neighbor graph).

SCE (8969 seconds)	t-SNE	LargeVis	UMAP

FAQs

Q: SCE cannot find clusters for my data. Why?
A: SCE is designed for well-clustered data. SCE may not identify meaningful clusters if your data is not well-clustered.

Q: The clusters found by SCE do not match the ground truth classes in my data. Why?
A: SCE is an unsupervised method. There is no guarantee that the clusters must be aligned with the supervised class labels.

Q: The clusters found by SCE looked too small. How can I improve it?
A: You can tune the parameter alpha to adjust the attraction-repulsion trade-off. By default alpha=0.5. When alpha approaches 0, it behaves similarly to SNE. When alpha approaches 1, the clusters would be more shattered or more compact.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
figures		figures
.gitignore		.gitignore
DisplayVisualization.m		DisplayVisualization.m
LICENSE		LICENSE
README.md		README.md
compile.sh		compile.sh
convert_to_python.py		convert_to_python.py
demo.m		demo.m
demo_similarity.py		demo_similarity.py
demo_vectors.py		demo_vectors.py
distinguishable_colors.m		distinguishable_colors.m
sce.cu		sce.cu
sce.m		sce.m
sce.py		sce.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Stochastic Cluster Embedding (SCE)

Prerequisite

Compilation

Demo usage