VectorDB to wanDB #5

Open
charlesfrye opened this issue Jan 31, 2023 · 2 comments
Labels
data enhancement New feature or request

Comments

@charlesfrye
Collaborator

https://twitter.com/_ScottCondron/status/1620347174692454400

@charlesfrye
Collaborator Author

charlesfrye commented Jun 7, 2023

started playing around with this. a demo of how this might support EDA and better PRs here, see also #39

limitations:

  • silent, afaict undocumented limit of ~128 dims for vector inputs. Johnson-Lindenstrauss says we can almost definitely do a random projection and still preserve meaningful structure -- especially since we're going to UMAP down to two dims anyway. assuming there's no privileged basis (possibly too strong), i skipped the random matrix and just projected onto the first 128 canonical basis vectors, aka vec[:128]
  • silent (but documented) limit of 10k rows in a table
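
a minimal sketch of the two options for getting under the dimension limit, assuming embeddings arrive as an `(n, d)` NumPy array. the names `random_project`, `truncate`, `MAX_DIMS`, and `MAX_ROWS` are made up for illustration and are not from the repo; the caps are just the numbers mentioned above. the Gaussian projection is the textbook Johnson-Lindenstrauss construction, not necessarily what the demo does:

```python
import numpy as np

MAX_DIMS = 128     # assumed cap, from the ~128-dim limit observed above
MAX_ROWS = 10_000  # table row limit mentioned above


def random_project(vectors: np.ndarray, out_dims: int = MAX_DIMS, seed: int = 0) -> np.ndarray:
    """Gaussian random projection from shape (n, d) to (n, out_dims)."""
    n, d = vectors.shape
    if d <= out_dims:
        return vectors  # already within the limit, nothing to do
    # Fixed seed so every upload reuses the same projection matrix.
    rng = np.random.default_rng(seed)
    # Entries ~ N(0, 1/out_dims), so squared norms are preserved in expectation.
    projection = rng.normal(0.0, 1.0 / np.sqrt(out_dims), size=(d, out_dims))
    return vectors @ projection


def truncate(vectors: np.ndarray, out_dims: int = MAX_DIMS) -> np.ndarray:
    """The shortcut described above: keep the first out_dims coordinates."""
    return vectors[:, :out_dims]


if __name__ == "__main__":
    # Stand-in data; real embeddings would come from the vector index.
    embeddings = np.random.default_rng(1).normal(size=(200, 1536))
    projected = random_project(embeddings)[:MAX_ROWS]  # also respect the row cap
    print(projected.shape)
```

truncation is cheaper and deterministic without a seed, but it only behaves like a random projection if the coordinates really are interchangeable, which is the "no privileged basis" assumption.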

incorporating this into the workflow:

  • could upload a new wandb table every time we refresh the vector index, but that feels excessive. if we upload regularly, we want it to be a small diff
  • desire for small/meaningful diffs interacts poorly with random subsampling. could sort rows by the lexical order of their hashes and take the first N to get a stable pseudo-random sample? but it's unclear whether wandb artifact dedup works at a sub-file level
  • rather than doing it every time, could add a separate command for artifact storage and run it intermittently
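
a rough sketch of the hash-ordering idea, assuming each row has a unique string id. `hash_key` and `stable_sample` are hypothetical names, and SHA-256 is just one reasonable choice of hash:

```python
import hashlib


def hash_key(doc_id: str) -> str:
    """Stable hex digest of a document id; identical across processes and runs."""
    return hashlib.sha256(doc_id.encode("utf-8")).hexdigest()


def stable_sample(doc_ids, k):
    """Keep the k ids whose hashes sort first in lexical order."""
    return sorted(doc_ids, key=hash_key)[:k]


if __name__ == "__main__":
    ids = [f"doc-{i}" for i in range(1000)]  # stand-in ids
    before = stable_sample(ids, 100)
    after = stable_sample(ids + ["doc-new"], 100)
    # Adding one document changes at most one slot in the sample.
    print(len(set(before) ^ set(after)))
```

since the sample is the set of ids below a hash threshold, adding or removing m documents changes at most m members of the sample, so successive tables differ only slightly. whether that turns into a small stored diff still depends on how artifact dedup works, which is the open question above.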

@charlesfrye
Collaborator Author

turbojank implementation from last night is here
