
[QST] Relationship between UMAP.embedding_ and reductions returned by UMAP.transform() #5188

Open
drob-xx opened this issue Feb 1, 2023 · 4 comments
Labels: "? - Needs Triage" (Need team to review and classify), "question" (Further information is requested)

Comments

drob-xx commented Feb 1, 2023

After using umap-learn for some time, I've written code that relies on embedding_ being equal to the reduction returned by transform(). I just found out that without setting hash_input=True this will not be the case with cuML's UMAP. I was a bit surprised. I have since re-read the documentation, and while this difference is noted, it seems to me something of an unfortunate "gotcha". Perhaps I'm missing something, but it seems like the more conservative approach would be to default to the behavior of umap-learn and provide additional tuning parameters for those who want them. At a minimum, it would be nice to have a warning here.

beckernick (Member) commented Mar 14, 2023

@viclafargue @dantegd @cjnolet, what do you think about making hash_input=True the default? It seems pretty reasonable on the surface, but I'm interested in your thoughts.

Victor shared some additional context in a different issue.

EDIT: Looks like we already reached some level of agreement in that issue. This feels like a good first issue for a new contributor, but I'm going to tag it in the other issue.

cjnolet (Member) commented Mar 14, 2023

I'm fine with that. It does introduce additional overhead, which is why we made the default false to begin with. Maybe we could add a quick note to the argument's docs stating that it's true by default but comes with overhead, so users who never expect fit(A).transform(A) to return exactly the same results can disable it.
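The idea under discussion can be sketched in plain Python. This is a toy illustration of the hash-check concept, not cuML's actual implementation; `array_hash` and `HashingEstimator` are hypothetical names invented for the example:

```python
import hashlib

import numpy as np


def array_hash(X: np.ndarray) -> str:
    # Hash the raw bytes of the array; the overhead is one extra pass
    # over the data at fit() and transform() time.
    return hashlib.sha1(np.ascontiguousarray(X).tobytes()).hexdigest()


class HashingEstimator:
    """Toy sketch of the hash_input idea (not cuML's code)."""

    def fit(self, X):
        self._train_hash = array_hash(X)
        # Stand-in for the real UMAP embedding computation.
        self.embedding_ = X.mean(axis=1, keepdims=True)
        return self

    def transform(self, X):
        if array_hash(X) == self._train_hash:
            # Input is byte-identical to the training data:
            # return the stored training embedding directly.
            return self.embedding_
        # Otherwise, run the regular transform path.
        return X.mean(axis=1, keepdims=True)


X = np.arange(12, dtype=np.float32).reshape(4, 3)
est = HashingEstimator().fit(X)
assert est.transform(X) is est.embedding_  # fit(A).transform(A) matches
```

With the check enabled, `fit(A).transform(A)` is guaranteed to reproduce `embedding_` exactly, at the cost of hashing every input.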

antortjim commented Jun 11, 2023

I think it would be great to expand the docs a bit to explain the hash_input attribute. I am running the UMAP implementation in cuML with the MNIST dataset from tensorflow.keras, and I get significantly different results depending on whether hash_input is True or not. I tried generating a UMAP from the embedding_ attribute, from the output of transform() on the training data, and from the output of transform() on test data.

My finding is that the value of hash_input only dramatically affects the output when I run transform() on the training data.

[Figure: transform(training_data) with hash_input=False]
[Figure: transform(training_data) with hash_input=True]

I am really wondering why this weird blob is being produced when hash_input=False and I call transform(training_data). I don't see how anyone would prefer that over the corresponding output when hash_input=True. The fact that the axis limits are much bigger helps me understand why the points seem to converge into the blob, but then I don't get why the limits grow so much (from around -10/10 to -40/40). Not only that, the positions of the points with respect to one another are clearly less distinguishable (example: the cluster of points for digit 1).

I am also really wondering why this blob does not occur with transform(test_data), which is a relief, because it suggests the fitted model will be able to compress other datasets.

To replicate the figures, run the script below (requires tensorflow.keras, numpy, matplotlib, and cuml).

PS: n_components=2, n_neighbors=60, min_dist=0.0, random_state=42 are taken from an existing program that uses the CPU umap.UMAP implementation, and I would like to keep them the same so that I can compare both implementations (unless there is a good reason to change them).

import matplotlib.pyplot as plt
import numpy as np

from tensorflow.keras.datasets import mnist  # 60k training datapoints
from cuml.manifold.umap import UMAP as cuUMAP


def load_tf_mnist():
    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()

    # Flatten each 28x28 image into a 784-dimensional vector
    images = np.reshape(train_images, (len(train_images), -1))
    test_images = np.reshape(test_images, (len(test_images), -1))

    return (images, train_labels), (test_images, test_labels)


def plot_umap(embedding, labels):
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
    plt.gca().set_aspect('equal', 'datalim')
    plt.colorbar(boundaries=np.arange(11) - 0.5).set_ticks(np.arange(10))
    plt.title('UMAP projection of the MNIST dataset', fontsize=24)


def main():
    (data, labels), (test_data, test_labels) = load_tf_mnist()

    # Fit once per hash_input setting, then plot the three views:
    # transform(train), embedding_, and transform(test)
    for hash_input in (True, False):
        model = cuUMAP(n_components=2, n_neighbors=60, min_dist=0.0,
                       random_state=42, hash_input=hash_input).fit(data)

        plot_umap(model.transform(data), labels)
        plt.savefig(f"gpu_umap_{hash_input}_from_transform.png")
        plt.clf()

        plot_umap(model.embedding_, labels)
        plt.savefig(f"gpu_umap_{hash_input}_from_embedding_.png")
        plt.clf()

        plot_umap(model.transform(test_data), test_labels)
        plt.savefig(f"gpu_umap_{hash_input}_test_data.png")
        plt.clf()


if __name__ == "__main__":
    main()

Any help would be very much appreciated, thanks!

antortjim commented Jun 13, 2023

For the sake of completeness, these are the UMAPs when I transform the combined training and test data:

[Figure: gpu_umap_False_all_data]
[Figure: gpu_umap_True_all_data]

Updated script: test_hashinput.zip

Now it's even more confusing, because even with hash_input=True one can still get the garbled output. I don't get why it's fine with the train and test sets separately, but not when they are combined.
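One plausible explanation, assuming the hash check compares the bytes of the entire input array against the training array: a combined train+test array hashes differently from the training array alone, so transform() falls back to the regular transform path even though every training row is present. A minimal sketch of that assumption (`array_hash` is a hypothetical stand-in, not a cuML function):

```python
import hashlib

import numpy as np


def array_hash(X: np.ndarray) -> str:
    # Hash the raw bytes of the whole array: the match is all-or-nothing.
    return hashlib.sha1(np.ascontiguousarray(X).tobytes()).hexdigest()


train = np.zeros((4, 3), dtype=np.float32)
test = np.ones((2, 3), dtype=np.float32)
combined = np.vstack([train, test])

# The combined array contains every training row, but its bytes differ,
# so a whole-array hash check cannot recognize it as "the training data".
print(array_hash(train) == array_hash(combined))  # False
```

If that assumption holds, hash_input only short-circuits transform() for an input that is byte-identical to the fit() input, never for a superset of it.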
