The code implements the algorithm in the SCE paper.
@article{sce,
title={Stochastic Cluster Embedding},
author={Zhirong Yang and Yuwei Chen and Denis Sedov and Samuel Kaski and Jukka Corander}
journal={Statistics and Computing},
year={2023},
volume={33},
number={12},
}
SCE is a data visualization method which shows the clusters in well-clusterable data. The input is a pairwise similarity matrix, which is often obtained by using Entropic Affinity or K-nearest-neighbor graph.
The SCE program is a standalone excutable which has been tested in Ubuntu. We also provide wrappers in Python and Matlab.
Install GSL (GNU Scientific Library) before compilation by
sudo apt-get install libgsl-dev
Modify the CUDA architecture in compile.sh to fit your GPU. Here I use Quadro RTX 4000, so it is "-gencode arch=compute_75,code=sm_75". Then run compile.sh.
For Matlab, see demo.m. For Python, see demo_vectors.py or demo_similarities.py.
You can download the data here to repeat the following demos. Note that some similarity matrices may be non-normalized (SCE will normalize them internally).
58,000 data points. Data source. Pre-computed similarity matrix (entropic affinity with perplexity 30).
SCE (69 seconds) | t-SNE | LargeVis | UMAP |
![]() |
![]() |
![]() |
![]() |
70,000 data points, Data source, Pre-computed similarity matrix (10-nearst-neighbor graph).
SCE (47 seconds) | t-SNE | LargeVis | UMAP |
![]() |
![]() |
![]() |
![]() |
126,701 data points. Data source, see ijcnn1. Pre-computed similarity matrix (entropic affinity with perplexity 30), where class labels stand for ten engines instead of the original binary.
SCE (75 seconds) | t-SNE | LargeVis | UMAP |
![]() |
![]() |
![]() |
![]() |
120,024 data points. The data was collected via a helicopter-borne microwave profiling radar termed FGI-Tomoradar to investigate the vertical topography structure of forests. The original vectorial data (NB! 5.6GB). Pre-computed similarity matrix (50-nearest-neighbor graph).
SCE (660 seconds) | t-SNE | LargeVis | UMAP |
![]() |
![]() |
![]() |
![]() |
1,000,000 data points. Data source. Preprocessed vectorial data. Pre-computed similarity matrix (15-nearest-neighbor graph).
SCE (7713 seconds) | t-SNE | LargeVis | UMAP |
![]() |
![]() |
![]() |
![]() |
11,000,000 points. Data source. Pre-computed similarity matrix (5-nearest-neighbor graph).
SCE (8969 seconds) | t-SNE | LargeVis | UMAP |
![]() |
![]() |
![]() |
![]() |
- Q: SCE cannot find clusters for my data. Why?
- A: SCE is designed for well-clustered data. SCE may not identify meaningful clusters if your data is not well-clustered.
- Q: The clusters found by SCE do not match the ground truth classes in my data. Why?
- A: SCE is an unsupervised method. There is no guarantee that the clusters must be aligned with the supervised class labels.
- Q: The clusters found by SCE looked too small. How can I improve it?
- A: You can tune the parameter alpha to adjust the attraction-repulsion trade-off. By default alpha=0.5. When alpha approaches 0, it behaves similarly to SNE. When alpha approaches 1, the clusters would be more shattered or more compact.