Skip to content

rozyangno/sce

Repository files navigation

Stochastic Cluster Embedding (SCE)

The code implements the algorithm in the SCE paper.

@article{sce,
    title={Stochastic Cluster Embedding},
    author={Zhirong Yang and Yuwei Chen and Denis Sedov and Samuel Kaski and Jukka Corander}
    journal={Statistics and Computing},
    year={2023},
    volume={33},
    number={12},
}

SCE is a data visualization method which shows the clusters in well-clusterable data. The input is a pairwise similarity matrix, which is often obtained by using Entropic Affinity or K-nearest-neighbor graph.

The SCE program is a standalone excutable which has been tested in Ubuntu. We also provide wrappers in Python and Matlab.

Prerequisite

Install GSL (GNU Scientific Library) before compilation by sudo apt-get install libgsl-dev

Compilation

Modify the CUDA architecture in compile.sh to fit your GPU. Here I use Quadro RTX 4000, so it is "-gencode arch=compute_75,code=sm_75". Then run compile.sh.

Demo usage

For Matlab, see demo.m. For Python, see demo_vectors.py or demo_similarities.py.

Demonstration

You can download the data here to repeat the following demos. Note that some similarity matrices may be non-normalized (SCE will normalize them internally).

SHUTTLE data set

58,000 data points. Data source. Pre-computed similarity matrix (entropic affinity with perplexity 30).

SCE (69 seconds) t-SNE LargeVis UMAP
shuttle sce shuttle tsne shuttle largevis shuttle umap

MNIST data set

70,000 data points, Data source, Pre-computed similarity matrix (10-nearst-neighbor graph).

SCE (47 seconds) t-SNE LargeVis UMAP
mnist sce mnist tsne mnist largevis mnist umap

IJCNN data set

126,701 data points. Data source, see ijcnn1. Pre-computed similarity matrix (entropic affinity with perplexity 30), where class labels stand for ten engines instead of the original binary.

SCE (75 seconds) t-SNE LargeVis UMAP
ijcnn sce ijcnn tsne ijcnn largevis ijcnn umap

TOMORADAR data set

120,024 data points. The data was collected via a helicopter-borne microwave profiling radar termed FGI-Tomoradar to investigate the vertical topography structure of forests. The original vectorial data (NB! 5.6GB). Pre-computed similarity matrix (50-nearest-neighbor graph).

SCE (660 seconds) t-SNE LargeVis UMAP
tomoradar sce tomoradar tsne tomoradar largevis tomoradar umap

FLOW-CYTOMETRY data set

1,000,000 data points. Data source. Preprocessed vectorial data. Pre-computed similarity matrix (15-nearest-neighbor graph).

SCE (7713 seconds) t-SNE LargeVis UMAP
flow-cytometry sce flow-cytometry tsne flow-cytometry largevis flow-cytometry umap

HIGGS data set

11,000,000 points. Data source. Pre-computed similarity matrix (5-nearest-neighbor graph).

SCE (8969 seconds) t-SNE LargeVis UMAP
higgs sce higgs tsne higgs largevis higgs umap

FAQs

  • Q: SCE cannot find clusters for my data. Why?
  • A: SCE is designed for well-clustered data. SCE may not identify meaningful clusters if your data is not well-clustered.

  • Q: The clusters found by SCE do not match the ground truth classes in my data. Why?
  • A: SCE is an unsupervised method. There is no guarantee that the clusters must be aligned with the supervised class labels.

  • Q: The clusters found by SCE looked too small. How can I improve it?
  • A: You can tune the parameter alpha to adjust the attraction-repulsion trade-off. By default alpha=0.5. When alpha approaches 0, it behaves similarly to SNE. When alpha approaches 1, the clusters would be more shattered or more compact.