Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional Lindera tokenizer support (was: Custom tokenizer support) #200

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

cjrh
Copy link
Collaborator

@cjrh cjrh commented Jan 29, 2024

See #25

This is an early draft for discussion purposes. It's not ready to be used. The primary change is that we incorporate the Lindera tantivy tokenizer support under a feature flag. This means that tantivy-py must be built with that feature flag specified. The included tests have an example of this. It is possible to run the tests both with and without the Lindera support. We need to decide how we want this to be set up.

Cargo.toml Outdated Show resolved Hide resolved
session.install(
"--no-build-isolation",
'--config-settings',
'build-args="--features=lindera"',
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took a surprisingly long time to discover the correct way to spell how to get this value sent down to the build backend!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, even with the unified pyproject.toml approach, wiring Python builds is still a mess IMHO. Do you remember where in the Maturin documentation you would have expected this? Maybe an improvement is just a PR away?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maturin wasn't too bad, it is pretty easy to find "feature selection" and that shows up under the build-wheels section in the maturin docs. Much more difficult to find was how to propagate the feature parameter through the pip interface, as above.

I had a quick look, I think this section is probably the best place to mention the build-args parameter: https://www.maturin.rs/index.html?highlight=pip%20install#python-packaging-basics

src/lindera_tokenizer.rs Outdated Show resolved Hide resolved
src/index.rs Outdated Show resolved Hide resolved
@cjrh
Copy link
Collaborator Author

cjrh commented Feb 18, 2024

I've added all the lindera features, and also an interface to specify configurable options on registration. The code is still rough, for example I need to rename quite a few things. I'd love to also figure out how to spell the function signatures correctly, e.g. for the register_lindera_tokenizer() function.

@cjrh cjrh added enhancement New feature or request rust Pull requests that update Rust code labels Apr 21, 2024
@cjrh cjrh changed the title Custom tokenizer support Optional Lindera tokenizer support (was: Custom tokenizer support) May 7, 2024
@cjrh
Copy link
Collaborator Author

cjrh commented May 30, 2024

It would be much nicer to provide this as an additional package that can be installed. The lack of a stable ABI prevents this, unless we wrap up the Lindera stuff inside a C wrapper and then call that. I'm in two minds about proceeding with this PR as-is, that's why I haven't moved forward with it. Still thinking about it.

@cjrh
Copy link
Collaborator Author

cjrh commented May 30, 2024

Just thinking out loud, we could consider adding some kind of new ExternalTokenizer tokenizer, create an instance that encapsulates another tokenizer with a C interface, and that separate tokenizer is wrapped inside the thing with the C interface. Then the latter can be packaged as a wheel and pip installed. Lots of handwaving but basically add a soapy C layer around the tokenizer registration and usage.

@cjrh
Copy link
Collaborator Author

cjrh commented May 30, 2024

Must be something in the water, because this got published just yesterday. Many projects dealing with the plugin problem: https://www.arroyo.dev/blog/rust-plugin-systems

@cjrh
Copy link
Collaborator Author

cjrh commented Jun 27, 2024

Ok so the reason I haven't moved forward with this is because I want to explore different solutions that allow a user to pip install additional tokenizers and register them at runtime. It might be possible to create a C abi interface for tokenizers that will allow this to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request rust Pull requests that update Rust code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants