
Embedding projector only loads first 100,000 vectors #773

Open

vitalyli opened this issue Nov 26, 2017 · 17 comments
Labels
plugin:projector stat:contributions welcome theme:performance Performance, scalability, large data sizes, slowness, etc.

Comments

@vitalyli

vitalyli commented Nov 26, 2017

Embedding projector only loads the first 100,000 vectors. In many real-world applications, embedding dictionaries are well over 1 million entries. We need some way to display vectors from larger sets, or at least a way to configure the upper limit.

@vitalyli vitalyli changed the title Embedding projector only loads first at 100,000 Embedding projector only loads first 100,000 Nov 26, 2017
@vitalyli vitalyli changed the title Embedding projector only loads first 100,000 Embedding projector only loads first 100,000 vectors Nov 26, 2017
@vitalyli
Author

It appears that this limit is hardcoded here:
.//tensorboard/plugins/projector/vz_projector/data-provider-server.ts
export const LIMIT_NUM_POINTS = 100000;
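Until that constant is configurable, one workaround (a sketch only, not part of TensorBoard; `downsample_tsv` is a hypothetical helper name) is to downsample the vectors TSV below the 100k cap before handing it to the projector. Reservoir sampling keeps memory bounded even for multi-million-row files:

```python
import random

def downsample_tsv(in_path, out_path, limit=100_000, seed=0):
    """Reservoir-sample at most `limit` rows from a (possibly huge) TSV,
    giving every row an equal chance of surviving."""
    rng = random.Random(seed)
    reservoir = []
    with open(in_path) as f:
        for i, line in enumerate(f):
            if len(reservoir) < limit:
                reservoir.append(line)
            else:
                # Replace a kept row with decreasing probability limit/(i+1)
                j = rng.randrange(i + 1)
                if j < limit:
                    reservoir[j] = line
    with open(out_path, "w") as f:
        f.writelines(reservoir)
```

If vectors and metadata live in separate files, the same sampled indices would have to be applied to both to keep them aligned.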

@jart
Contributor

jart commented Nov 28, 2017

Everything in the projector is done on the client side. There's a limit to how much the browser can handle. I'd be interested in hearing about whether or not things worked out if you changed the limit by hand.

@vitalyli
Author

I tried changing this limit, but the client still said it was showing the first 100k, which made me wonder whether the server dictates that limit, or whether it is cached somewhere in the browser. It would be good to be able to send that limit as a parameter to the server.
Often the task is searching for a vector by label and looking at its closest vectors; if the projector simply takes the first 100k, what can be explored is limited given a 1-million-plus embedding file. Maybe the distance computation can be pushed to the server, removing the need for the client to do the filtering altogether.
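The server-side distance idea could look like the following sketch (nothing like this exists in the projector today; the function name is hypothetical): the server holds the full matrix and returns only the k nearest labels for a query, so the client never needs all the vectors.

```python
import numpy as np

def nearest_neighbors(vectors, labels, query_label, k=5):
    """Cosine nearest-neighbor lookup over the full embedding matrix."""
    idx = labels.index(query_label)
    # Normalize rows so a dot product equals cosine similarity
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    order = np.argsort(-sims)  # most similar first
    return [(labels[i], float(sims[i])) for i in order if i != idx][:k]
```

For truly large dictionaries an approximate index would replace the brute-force dot product, but the request/response shape stays the same.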

@vitalyli
Author

vitalyli commented Dec 18, 2017

If we can't make the client handle more than 100k, what would be really useful is telling the server to sample the data instead of returning the first 100k. Think of data sorted by popularity: always seeing the first 100k out of 1 million is biased towards the more popular items. Ideally the server would return a stratified sample, a random 10k from each 100k, giving a representative sample of the full 1 million.
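The stratified scheme described above is a few lines of index arithmetic (a sketch; `stratified_sample` is a hypothetical helper, not projector code):

```python
import random

def stratified_sample(n_total, stratum=100_000, per_stratum=10_000, seed=0):
    """Pick `per_stratum` random row indices from each consecutive block of
    `stratum` rows, so every popularity band is represented."""
    rng = random.Random(seed)
    picks = []
    for start in range(0, n_total, stratum):
        block = range(start, min(start + stratum, n_total))
        picks.extend(rng.sample(block, min(per_stratum, len(block))))
    return sorted(picks)
```

The server would then stream only the rows at these indices (and the matching metadata rows) to the client.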

@Seanspt

Seanspt commented May 28, 2018

Upvote for sampling instead of returning the first 100k.
Also, it would be great if a group of wanted IDs could be passed in.

@nfelt
Contributor

nfelt commented May 29, 2018

We'd welcome a contribution to implement server-side sampling if someone wants to take this on.

@kapilkd13

Hi @nfelt, I would like to take this. Can you point me to the files corresponding to the embedding projector? Also, any suggestions/ideas?

@rahulkrishnan98

@vitalyli once we run the projector on 100,000+ vectors and metadata, it can limit and sample the vectors, but loading the metadata fails even for the points that were loaded.

@hvout

hvout commented Jul 24, 2019

Hello.
Sorry for bringing this up, but the folder .//tensorboard/plugins/projector/vz_projector/ does not exist in my installation (installed with pip inside a miniconda venv with Python 3.6, latest tensorboard version). Does anyone know where I can find that folder to increase the limit?

@hvout

hvout commented Jul 24, 2019

I'm able to increase it in the projector_plugin.py file under tensorboard/plugins/projector, and it does work. But t-SNE and PCA keep sampling data for "faster results". I believe those limits are set in data.ts, but when installed with pip the vz_projector folder does not exist.

@bileschi bileschi added the theme:performance Performance, scalability, large data sizes, slowness, etc. label Dec 20, 2019
@alexdevmotion

I also have this issue, has anyone found an easy fix?

@RSKothari

Hey guys, any luck on this topic? In my case, it only samples 120 data points. A tip I could offer to speed things up: add a "PCA + t-SNE" option. It could drastically reduce embedding sizes and the load on browser RAM.
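"PCA + t-SNE" here means projecting the embeddings onto their top principal components first, then running t-SNE on the much smaller matrix. A minimal PCA step via SVD might look like this (a sketch with plain NumPy; not projector code):

```python
import numpy as np

def pca_reduce(x, dim=50):
    """Project rows of x onto the top `dim` principal components, so a
    subsequent t-SNE runs on a dim-column matrix instead of the full width."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, sorted by singular value (descending)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T
```

Feeding the reduced matrix to t-SNE cuts both memory and pairwise-distance cost, which is exactly the RAM relief suggested above.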

@nlp4whp

nlp4whp commented Sep 4, 2020

> I'm able to increase it in the projector_plugin.py file under tensorboard/plugins/projector and it does work. But T-SNE and PCA keep sampling data for "faster results" - I believe these limits are set in data.ts but when installed with pip the vz_projector folder does not exist

You are right... it looks like we have to modify something in data.ts for the PCA and t-SNE sampling.

Although the limit is defined in data-provider-server.ts as export const LIMIT_NUM_POINTS = 100000;, it is applied on the back end in projector_plugin.py, where the final tensor is returned
(see _serve_metadata(self, request) or _serve_tensor(self, request)).

@GeorgePearse

@RSKothari @nlp4whp It might even make sense to fork the embedding projector component and remove the in-browser interactive dimensionality reduction (to be replaced with whatever dimensionality-reduction technique a data scientist wants to run ahead of time). The embedding projector has a lot of value on its own as a high-performance 3D visualization tool with convenient access to metadata. Unless people know a better alternative for point clouds with metadata?
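In that "reduce ahead of time" workflow, the pre-reduced vectors just need to land in two tab-separated files that the standalone projector's upload dialog accepts: one row per vector, and one metadata line per point (a single metadata column takes no header row). A sketch (the helper name is hypothetical):

```python
def write_projector_tsvs(vectors, labels, vec_path, meta_path):
    """Write a vectors TSV (one tab-separated row per point) and a
    single-column metadata TSV (one label per line)."""
    with open(vec_path, "w") as f:
        for row in vectors:
            f.write("\t".join(str(v) for v in row) + "\n")
    with open(meta_path, "w") as f:
        for label in labels:
            f.write(str(label) + "\n")
```

The resulting pair can then be loaded into the viewer for pure 3D exploration, with no in-browser t-SNE/PCA needed.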

@wizz92

wizz92 commented Apr 1, 2023

You just need to change qO=1e5 to qO=1e6 in /tensorboard/plugins/projector/tf_projector_plugin/projector_binary.js.
It worked fine for me.

@saikot-paul

Is it possible to change any of these parameters in a colab environment?

@arcra
Member

arcra commented Oct 24, 2023

Given that this currently requires modifications to the source code, there is no way to change this behavior with the supported extension from colab. There might be ways to use a custom version of tensorboard with a "local runtime" in colab, but I'm not knowledgeable enough about colab to provide guidance in that regard.

If a locally modified version of tensorboard would be sufficient (i.e. just running a standalone TB, not in colab), you can take a look at our DEVELOPMENT guide for some pointers on how to run a local instance.

With respect to better supporting this as a feature in a future release, I'm afraid it's unlikely we'll prioritize this, especially because there hasn't been any active development in this area/plugin, there are no people left on the team who are familiar with this part of the code, and it's probably not an easy thing to solve in a generic way (e.g. without affecting performance on some browsers/machines, and/or without some UI support to allow users to configure these visualization parameters, etc.). If anybody is interested in contributing, they can get in touch with us.
