USearch v3: Text Indexes, Faster Views, Cleaner API and Docker #277

Merged
merged 2 commits into from Nov 5, 2023

@ashvardanian ashvardanian commented Sep 25, 2023

The v3 release of USearch brings several new features and provides an excellent opportunity to refactor existing ones!

  • Uniform naming across bindings.
  • Making tqdm and ucall dependencies optional.
  • Co-locating vectors and graph data on disk.
  • Better compaction before serialization to make searches over views faster.
  • TextIndex, TF-IDF, and Levenshtein distances from StringZilla.
  • Docker image with a UCall-based REST API.
  • Tooling for sampling/pooling a hybrid embedding from long documents embedded in chunks.
  • Support batch-capable distance metrics and masked loads from SimSIMD.
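
As a reader's illustration of the chunk-pooling item above, here is a minimal sketch that assumes "pooling" means averaging per-chunk embeddings and L2-normalizing the result. The function name `mean_pool` is hypothetical and is not part of the USearch API; the tooling planned for v3 may work differently.

```python
import math


def mean_pool(chunk_embeddings):
    """Average equal-length chunk vectors and L2-normalize the result.

    Hypothetical helper for illustration only, not a USearch function.
    """
    if not chunk_embeddings:
        raise ValueError("need at least one chunk embedding")
    ndim = len(chunk_embeddings[0])
    if any(len(v) != ndim for v in chunk_embeddings):
        raise ValueError("all chunk embeddings must share one dimensionality")

    # Component-wise mean over all chunks of the document.
    count = len(chunk_embeddings)
    pooled = [sum(v[i] for v in chunk_embeddings) / count for i in range(ndim)]

    # Normalize so the pooled vector is usable with cosine-like metrics.
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled
```

The pooled vector could then be added to an index under the document's key, so that one entry represents the whole document rather than each chunk.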

Candidates:

  • External retrievers and node-level serialization. @Ngalstyan4
  • Batch-add and batch-search interfaces for C, Rust. @var77

@var77
Contributor

var77 commented Sep 27, 2023

Thank you for your work! It would also be great to have batch-add functionality in the Rust bindings.

@monatis

monatis commented Sep 30, 2023

Thanks for the awesome work!

  • Docker image with a UCall

Will this support multiple indexes / collections? Like endpoints for creating a collection (vector index), listing available collections, searching in the given collection etc. all in a single Docker container.

@ashvardanian
Contributor Author

@monatis, yes, that’s the plan 🤗

@monatis

monatis commented Sep 30, 2023

Super news. It's already great for embedded use but a dockerized service mode can cater for new use cases.

@ashvardanian ashvardanian changed the title from "USearch v3: Text Indexes, Faster Views, and Cleaner API" to "USearch v3: Text Indexes, Faster Views, Cleaner API and Docker" on Sep 30, 2023

@ashvardanian
Contributor Author

I've released SimSIMD v2, which will be included in USearch v3. It brings several performance and versatility improvements and may soon feature Intel AMX support and more bitwise metrics.

@philippemnoel

Thank you for your work! Do you have an estimate on when USearch v3 will be ready?

@ashvardanian
Contributor Author

Thank you, @philippemnoel! I plan to release it before the end of October.

Here is the progress so far:

  • A better compaction plan for on-disk views. cc @al13n321.
  • Masked-loads support in SimSIMD shipped this weekend.
  • Levenshtein distance, already present in StringZilla, will ship with its next major release, potentially tomorrow.
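
For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of Levenshtein (edit) distance using the classic two-row dynamic program. It is an illustration only and not StringZilla's implementation; it compares Unicode code points, whereas a byte-oriented library may count differently on non-ASCII input.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to bound memory use.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

This runs in O(len(a) * len(b)) time and O(min(len(a), len(b))) memory.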

Once we fix the build issues in UCall CI, all the pieces will fall into place, and the release will be easier to prepare. Feel free to help there 🤗

@aehlke

aehlke commented Nov 1, 2023

Would be amazing to prioritize langchainJS integration. I can run it on web and mobile and desktop easily with that. You'll get attention more quickly from their many users.

@ashvardanian
Contributor Author

@aehlke

aehlke commented Nov 3, 2023

Ah I meant for this new text search functionality - maybe there's nothing to be added on their side. Looking forward to this PR, thanks for the work.

@ashvardanian ashvardanian merged commit a5bb475 into main Nov 5, 2023
5 checks passed
@ashvardanian
Contributor Author

🎉 This PR is included in version 2.8.3 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@monatis

monatis commented Nov 9, 2023

Hi @ashvardanian, it seems that this PR was merged without the v3 changes. I just wanted to ask whether you are still planning to release them and whether you have a new ETA.

I was thinking of using USearch in an upcoming product but a stable file format might be important.

Not necessarily related to v3, but are you planning to support incrementally adding vectors to memory-mapped files? If not, what is the recommended procedure for adding new vectors after the initial upload?

@aehlke

aehlke commented Nov 23, 2023

Is there an update on the text index and the related changes originally planned? Thanks

@ashvardanian
Contributor Author

Hello @monatis and @aehlke,

I hope you're both doing well. The past few weeks have been incredibly busy as we wrapped up our year-long projects. You might have come across some of these initiatives:

Collaborating with large organizations has its challenges, particularly in terms of pace. This has required us to adjust our timelines slightly. Nonetheless, progress on v3 is steady, along with other updates and integrations, including contributions to SciPy and a potential integration with Sklearn.

Beating existing solutions in vector and text search, clustering, dimension reduction, and external memory access is doable. But achieving that in one package under 10,000 lines of maintainable code, compatible with every OS and hardware architecture, is very tricky. I want the design to persist for years, so I'm not rushing; I'd rather make sure we get it right.

Thank you for your continued support and patience! 🤗

@aehlke

aehlke commented Nov 24, 2023

I appreciate the context, best of luck!
