
[QST] Relationship between UMAP.embedding_ and reductions returned by UMAP.transform() #5188

Open
drob-xx opened this issue Feb 1, 2023 · 4 comments
Labels: "? - Needs Triage" (Need team to review and classify), "question" (Further information is requested)

Comments

drob-xx commented Feb 1, 2023

After using umap-learn for some time, I've written code that relies on embedding_ being equal to the reduction returned by transform(). I just found out that without setting hash_input=True this will not be the case with cuML's UMAP. I was a bit surprised. I have since re-read the documentation, and while this difference is noted, it seems to me something of an unfortunate "gotcha". Perhaps I'm missing something, but it seems like the more conservative approach would be to default to the behavior of umap-learn and provide additional tuning parameters for those who want them. At a minimum, it would be nice to have a warning here.

beckernick (Member) commented Mar 14, 2023

@viclafargue @dantegd @cjnolet, what do you think about making hash_input=True the default? It seems pretty reasonable on the surface, but I'm interested in your thoughts.

Victor shared some additional context in a different issue.

EDIT: Looks like we already reached some level of agreement in that issue. This feels like a good first issue for a new contributor, but I'm going to tag it in the other issue.

cjnolet (Member) commented Mar 14, 2023

I'm fine with that. It does introduce additional overhead, which is why we made the default false to begin with. Maybe we could add a quick note to the argument's docs stating that it's true by default but comes with overhead, so users who never expect fit(A).transform(A) to return exactly the same results can disable it.
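The idea under discussion can be sketched in plain Python. This is a toy illustration of the hash-check concept, not cuML's actual implementation; `array_hash` and `HashingEstimator` are hypothetical names invented for the example:

```python
import hashlib

import numpy as np


def array_hash(X: np.ndarray) -> str:
    # Hash the raw bytes of the array; the overhead is one extra pass
    # over the data at fit() and transform() time.
    return hashlib.sha1(np.ascontiguousarray(X).tobytes()).hexdigest()


class HashingEstimator:
    """Toy sketch of the hash_input idea (not cuML's code)."""

    def fit(self, X):
        self._train_hash = array_hash(X)
        # Stand-in for the real UMAP embedding computation.
        self.embedding_ = X.mean(axis=1, keepdims=True)
        return self

    def transform(self, X):
        if array_hash(X) == self._train_hash:
            # Input is byte-identical to the training data:
            # return the stored training embedding directly.
            return self.embedding_
        # Otherwise, run the regular transform path.
        return X.mean(axis=1, keepdims=True)


X = np.arange(12, dtype=np.float32).reshape(4, 3)
est = HashingEstimator().fit(X)
assert est.transform(X) is est.embedding_  # fit(A).transform(A) matches
```

With the check enabled, `fit(A).transform(A)` is guaranteed to reproduce `embedding_` exactly, at the cost of hashing every input.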

antortjim commented Jun 11, 2023

I think it would be great to expand the docs a bit to explain the hash_input attribute. I am running the UMAP implementation in cuML with the MNIST dataset from tensorflow.keras, and I get significantly different results depending on whether hash_input is True or not. I tried generating a UMAP from the embedding_ attribute, from the output of transform() on the training data, and from the output of transform() on test data.

My finding is that the value of hash_input only dramatically affects the output when I run transform() on the training data.

[Figure: transform(training_data) with hash_input=False]
[Figure: transform(training_data) with hash_input=True]

I am really wondering why this weird blob is being produced when hash_input=False and I call transform(training_data). I don't see how anyone would prefer that over the corresponding output when hash_input=True. The fact that the axis limits are much bigger helps me understand why the points seem to converge into the blob, but then I don't get why the limits grow so much (from around -10/10 to -40/40). Not only that, the positions of the points with respect to one another are clearly less distinguishable (example: the cluster of points for digit 1).

I am also really wondering why this blob does not occur with transform(test_data), which is a relief, because it suggests the fitted model will be able to compress other datasets.

To replicate the figures, run the script below (requires tensorflow.keras, numpy, matplotlib, and cuml).

PS: n_components=2, n_neighbors=60, min_dist=0.0, random_state=42 are taken from an existing program that uses the CPU umap.UMAP implementation, and I would like to keep them the same so that I can compare both implementations (unless there is a good reason to change them).

import matplotlib.pyplot as plt
import numpy as np

from tensorflow.keras.datasets import mnist  # 60k training datapoints
from cuml.manifold.umap import UMAP as cuUMAP


def load_tf_mnist():
    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()

    # Flatten each 28x28 image into a 784-dimensional vector
    images = np.reshape(train_images, (len(train_images), -1))
    test_images = np.reshape(test_images, (len(test_images), -1))

    return (images, train_labels), (test_images, test_labels)


def plot_umap(embedding, labels):
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
    plt.gca().set_aspect('equal', 'datalim')
    plt.colorbar(boundaries=np.arange(11) - 0.5).set_ticks(np.arange(10))
    plt.title('UMAP projection of the MNIST dataset', fontsize=24)


def main():
    (data, labels), (test_data, test_labels) = load_tf_mnist()

    # Fit once per hash_input setting, then plot the three views:
    # transform(train), embedding_, and transform(test)
    for hash_input in (True, False):
        model = cuUMAP(n_components=2, n_neighbors=60, min_dist=0.0,
                       random_state=42, hash_input=hash_input).fit(data)

        plot_umap(model.transform(data), labels)
        plt.savefig(f"gpu_umap_{hash_input}_from_transform.png")
        plt.clf()

        plot_umap(model.embedding_, labels)
        plt.savefig(f"gpu_umap_{hash_input}_from_embedding_.png")
        plt.clf()

        plot_umap(model.transform(test_data), test_labels)
        plt.savefig(f"gpu_umap_{hash_input}_test_data.png")
        plt.clf()


if __name__ == "__main__":
    main()

Any help would be very much appreciated, thanks!

antortjim commented Jun 13, 2023

For the sake of completeness, these are the UMAPs when I transform the combined training and test data:

[Figure: gpu_umap_False_all_data]
[Figure: gpu_umap_True_all_data]

Updated script: test_hashinput.zip

Now it's even more confusing, because even with hash_input=True one can still get the garbled output. I don't get why it's fine with the train and test sets separately, but not when they are combined.
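One plausible explanation, assuming the hash check compares the bytes of the entire input array against the training array: a combined train+test array hashes differently from the training array alone, so transform() falls back to the regular transform path even though every training row is present. A minimal sketch of that assumption (`array_hash` is a hypothetical stand-in, not a cuML function):

```python
import hashlib

import numpy as np


def array_hash(X: np.ndarray) -> str:
    # Hash the raw bytes of the whole array: the match is all-or-nothing.
    return hashlib.sha1(np.ascontiguousarray(X).tobytes()).hexdigest()


train = np.zeros((4, 3), dtype=np.float32)
test = np.ones((2, 3), dtype=np.float32)
combined = np.vstack([train, test])

# The combined array contains every training row, but its bytes differ,
# so a whole-array hash check cannot recognize it as "the training data".
print(array_hash(train) == array_hash(combined))  # False
```

If that assumption holds, hash_input only short-circuits transform() for an input that is byte-identical to the fit() input, never for a superset of it.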
