[BUG] The UMAP implementation is much worse than on CPU #5707

Open
maciejskorski opened this issue Dec 22, 2023 · 12 comments
Labels: ? - Needs Triage · bug

Comments

maciejskorski commented Dec 22, 2023

Describe the bug

The GPU-accelerated implementation from cuml can give much worse results than the CPU alternative from the umap package on a simple dataset. By visual inspection, the clusters are less separable and there are many outliers. I wonder whether the gap could be bridged by some non-obvious customization that my example is missing. Any help is appreciated 🙏

NOTE: I show a toy example to facilitate debugging. I have also seen a complex NLP pipeline, with UMAP responsible for dimensionality reduction, where switching from umap to cuml cost as much as 8% in terms of the coherence score.

Steps/Code to reproduce bug

from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
%matplotlib inline
import umap
import cuml

X,y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
y = y.astype(int)

umap_model_1 = umap.UMAP(random_state=42, n_components=2, n_neighbors=12, min_dist=0.0).fit(X)
umap_model_2 = cuml.UMAP(random_state=42, n_components=2, n_neighbors=12, min_dist=0.0).fit(X)

embeds_1 = umap_model_1.transform(X)
embeds_2 = umap_model_2.transform(X)

fig,axs = plt.subplots(1,2)
axs[0].scatter(embeds_1[:,0], embeds_1[:,1], c=y, s=0.1, cmap='Spectral')
axs[1].scatter(embeds_2[:,0], embeds_2[:,1], c=y, s=0.1, cmap='Spectral')
plt.show()

[Figure: side-by-side scatter plots of the CPU (umap) and GPU (cuml) embeddings of MNIST, colored by digit label]

Expected behavior

Results should be much closer.

Environment details (please complete the following information):

  • Linux Distro/Architecture: [Ubuntu 20.04 x86_64]
  • GPU Model/Driver: [L4 / GeForce RTX 3090]
  • CUDA: [12.2]
  • Docker by NVIDIA: nvcr.io/nvidia/pytorch:23.09-py3

See also the original discussion on the umap GitHub repo.

dantegd (Member) commented Jan 5, 2024

Thanks for the issue @maciejskorski, and thanks for the great, easy-to-repro example/code :). I can confirm the repro on totally different hardware; we'll be looking into it alongside a few updates we want to make to UMAP. The discrepancies, as well as the stray points between clusters, are larger than I would've expected. Will update the issue as we progress with findings.

Bougeant commented Apr 4, 2024

Can we get an update on this issue? I also faced it while trying to use cuml.UMAP to speed up BERTopic. I've noticed that the problem gets much worse when the number of prediction samples increases relative to the number of training samples.

Building on @maciejskorski's example above:

from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
%matplotlib inline
import umap
import cuml

X,y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
y = y.astype(int)

umap_model_1 = umap.UMAP(random_state=42, n_components=2, n_neighbors=12, min_dist=0.0).fit(X[:10000])
umap_model_2 = cuml.UMAP(random_state=42, n_components=2, n_neighbors=12, min_dist=0.0).fit(X[:10000])

embeds_1_small = umap_model_1.transform(X[10000:20000])
embeds_2_small = umap_model_2.transform(X[10000:20000])
embeds_1_large = umap_model_1.transform(X[10000:70000])
embeds_2_large = umap_model_2.transform(X[10000:70000])

fig,axs = plt.subplots(2,2, figsize=(12, 8))
axs[0, 0].scatter(embeds_1_small[:,0], embeds_1_small[:,1], c=y[10000:20000], s=0.1, cmap='Spectral')
axs[0, 0].set_title("CPU, 10k predictions")
axs[0, 1].scatter(embeds_2_small[:,0], embeds_2_small[:,1], c=y[10000:20000], s=0.1, cmap='Spectral')
axs[0, 1].set_title("GPU, 10k predictions")
axs[1, 0].scatter(embeds_1_large[:,0], embeds_1_large[:,1], c=y[10000:70000], s=0.1, cmap='Spectral')
axs[1, 0].set_title("CPU, 60k predictions")
axs[1, 1].scatter(embeds_2_large[:,0], embeds_2_large[:,1], c=y[10000:70000], s=0.1, cmap='Spectral')
axs[1, 1].set_title("GPU, 60k predictions")
plt.show()

[Figure: 2x2 grid of scatter plots comparing CPU vs GPU embeddings for 10k and 60k prediction samples]

@sean-doody

Echoing @Bougeant, this has been my experience using cuML UMAP with BERTopic as well, to the point that I never use the cuML implementation of UMAP. It simply has never worked well for any of the pre-trained embedding models I've used. I always get the results @Bougeant shows in the bottom-right figure.

ascillitoe commented Apr 15, 2024

Also echoing this: in my experience the results from cuML's UMAP are often unusable, which is a shame as it's so fast! Funnily enough, though, I've seen the opposite behaviour to @Bougeant: when running transform on ~25k instances, the clusters are fairly well separated if fit is run on a random sample of 1000 instances, but increasing this to 5000, or running fit_transform on all 25k, leads to results like the bottom-right figure.
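
A minimal sketch of that workflow, purely for illustration (X here stands in for my ~25k embedding vectors; the sizes and settings are placeholders, not the exact code I ran):

import numpy as np
import cuml

# Assumption: X holds ~25k embedding vectors.
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X), size=1000, replace=False)

# Fit on a 1000-instance random sample, then transform everything:
# this gives fairly well separated clusters for me.
model = cuml.UMAP(n_components=2, random_state=42).fit(X[sample_idx])
embeds = model.transform(X)

# Fitting on 5000 samples instead, or calling fit_transform on all 25k,
# produces smeared results like the bottom-right figure above.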

Fingers crossed on a fix for this one @dantegd! 🤞🏻

@Bougeant

@MaartenGr, as the author of BERTopic, you might be interested in this.

cjnolet (Member) commented May 16, 2024

Hey everyone,

Sorry for being late to this discussion. I think one of the problems here might be the assumption that cuML's UMAP will always yield exactly the same results as the CPU-based reference implementation for exactly the same parameter settings. We did some parallelism magic on the GPU side to speed up the algorithm, and as a result it's possible that some of the parameters (such as the number of iterations for the solver) might need to be tweaked a bit.

In addition, the underlying GPU-accelerated spectral embedding initialization primitive has gotten fairly old by this point and hasn't been updated in quite some time, so it has been accumulating little bugs as CUDA versions increase and the code becomes more stale. I suggest trying the random initialization, along with adjusting the number of neighbors and the number of iterations, to see if that improves the quality of your embeddings.
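
For concreteness, a minimal sketch of that kind of tweaking on the MNIST example above (the specific values are illustrative starting points to experiment with, not recommended settings):

import cuml

# Assumption: X is the MNIST array from the reproduction code above.
umap_gpu = cuml.UMAP(
    init="random",      # skip the aging spectral initialization
    n_neighbors=30,     # try larger neighborhoods than the default
    n_epochs=1000,      # give the solver more iterations
    min_dist=0.0,
    n_components=2,
    random_state=42,
)
embeds_gpu = umap_gpu.fit_transform(X)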

We have an engineer ramping up to fix the spectral clustering initialization, and they will also be working to improve the end-to-end quality of the results. Again, I apologize sincerely for the delay in replying to this thread.

@sean-doody

Thanks for the follow-up @cjnolet — happy to hear this is being worked on!

@Bougeant

Thanks for the update @cjnolet!

It's quite clear to me that we're not going to get exactly the same results from cuML's UMAP and the CPU-based umap.UMAP. However, while in my example above the CPU and GPU clusters for 10k datapoints are probably of similar quality, for 60k datapoints the GPU-generated clusters are clearly worse (even though, when we zoom into the ±15 range, the GPU clusters look decent as well).

cjnolet (Member) commented May 16, 2024

@Bougeant Yes, the GPU should not look worse than the CPU; that's not expected.

What is expected, though, is that the same parameter settings might yield different results, which sometimes means the number of iterations needs to be tweaked.

If you have a moment to try init="random", it would be helpful to know whether that improves anything for you.

minimaxir commented Jun 1, 2024

I ran into this issue myself while trying to reduce 200k text embeddings to 2D. First off, it's impressive that UMAP can run on such a large dataset in only a couple of seconds. :)

But yes, I am also seeing poor results, such as wide x/y ranges (±20) in the reduced embeddings, even when using init="random" and tweaking other parameters such as n_epochs.

ruiheesi commented Jun 4, 2024

Hi all, we are working on single-cell spatial data with millions of cells, and the embedding we see from the GPU is not as clear as the one from the CPU implementation. We have been watching this issue since April and just checked in to see the updates. It's true that we don't expect the exact same result, but we would expect a similar degree of separation in the resulting point clouds. While the GPU is clearly the path for us to ramp up analysis speed, we would like some clarity on this issue. We are also watching #5782, which we think is related to the poor performance observed here, and there has been no update in that ticket either.

filipinascimento commented Jun 7, 2024

The initialization of UMAP is indeed super important for capturing the global structure well (https://www.nature.com/articles/s41587-020-00809-z), and cuML seems to have a bug regarding that (see #5782). According to that paper, random initialization is not a good option, and it does not seem to capture the global structure well, at least with the default parameters.

I've managed to get better UMAPs by astronomically increasing the number of epochs (to 500000!), the number of neighbors, and the negative sample rate (which controls the repulsive updates). By doing that I end up with nice results even for large datasets. However, this completely undermines the usefulness of a GPU-based method, as it can take more time than the CPU implementation to reach similar quality. I think this is a critical limitation of cuML, and a dangerous one for scientific analysis, since libraries for single-cell analysis are now offering cuML as an option for UMAP.
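
Roughly what that brute-force tweaking looks like in code; the values mirror the ones mentioned above and are illustrative only, and these settings largely give up the GPU speed advantage:

import cuml

# Assumption: X is the dataset to embed.
umap_gpu = cuml.UMAP(
    init="random",
    n_epochs=500000,           # astronomically many solver iterations
    n_neighbors=50,            # larger neighborhoods (illustrative value)
    negative_sample_rate=20,   # more repulsive (negative) samples (illustrative value)
    n_components=2,
)
embeds = umap_gpu.fit_transform(X)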

Would it be possible (and maybe easier to implement) to let us provide our own initialization as a parameter?

This would allow us to use PCA, for instance, which can lead to higher quality than the current spectral implementation or random initialization. It would also allow us to run UMAP and later resume it with more iterations.
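
For illustration, this is what the CPU umap-learn package already allows and what such a parameter could enable in cuML (a sketch only; as far as I know, cuml.UMAP does not currently accept an array for init):

from sklearn.decomposition import PCA
import umap

# Assumption: X is the dataset to embed.
# Use a 2D PCA projection as the initial layout instead of spectral/random init.
pca_init = PCA(n_components=2).fit_transform(X)

# umap-learn accepts an (n_samples, n_components) array for init;
# the request is for cuml.UMAP to support the same.
embeds = umap.UMAP(init=pca_init, n_components=2, min_dist=0.0).fit_transform(X)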
