
Is there any way to use a custom tokenizer? #791

Open
dzcpy opened this issue Jun 3, 2021 · 5 comments


dzcpy commented Jun 3, 2021

Is your feature request related to a problem? Please describe.
https://github.com/tantivy-search/tantivy#features
One of the features tantivy provides is support for custom tokenizers, for example tantivy-jieba. Would it be possible for Toshi to support this feature?
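
For reference, this is roughly how a custom tokenizer gets wired up directly against tantivy (a minimal sketch assuming the tantivy-jieba crate's `JiebaTokenizer`; the exact API can differ between tantivy versions):

```rust
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions};
use tantivy::Index;

fn main() -> tantivy::Result<()> {
    // Declare a text field analyzed by a tokenizer registered under the name "jieba".
    let mut schema_builder = Schema::builder();
    let text_indexing = TextFieldIndexing::default()
        .set_tokenizer("jieba")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    let text_options = TextOptions::default()
        .set_indexing_options(text_indexing)
        .set_stored();
    schema_builder.add_text_field("body", text_options);
    let schema = schema_builder.build();

    // Register the custom tokenizer under that name before indexing any documents.
    let index = Index::create_in_ram(schema);
    index
        .tokenizers()
        .register("jieba", tantivy_jieba::JiebaTokenizer {});
    Ok(())
}
```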

Does another search engine have this functionality? Can you describe its function?

Do you have a specific use case you are trying to solve?

Additional context


hntd187 commented Jun 4, 2021

That would require you to build Toshi with that support, right? I suppose we could start conditionally including tokenizers and have releases that include them. Do you think that would solve your use case?


dzcpy commented Jun 5, 2021

@hntd187 Yes, that would be very helpful. Thanks for your awesome work; this project seems very promising.


hntd187 commented Jun 6, 2021

What tokenizers specifically would you like to see included? I know the one you linked hasn't been updated in some time and is two versions behind on the tantivy version, so I don't know whether it still works.


hntd187 commented Jun 6, 2021

I added in https://github.com/toshi-search/Toshi/blob/master/toshi-server/src/lib.rs#L55 the ability to conditionally register the cang_jie tokenizer if you build Toshi with that feature enabled. If you want, we can add more tokenizers; I'll probably come up with some more general traits to make this easier to implement in the future.
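
For anyone following along, feature-gated registration along those lines could look something like the sketch below. This is illustrative only: the helper name `register_optional_tokenizers` is hypothetical and this is not the actual Toshi code (see the link above for that); it just shows the `#[cfg(feature = ...)]` pattern with the cang_jie tokenizer.

```rust
// Built with `cargo build --features cang_jie`: register the Chinese tokenizer
// under a fixed name so index schemas can refer to it.
#[cfg(feature = "cang_jie")]
pub fn register_optional_tokenizers(index: &tantivy::Index) {
    use std::sync::Arc;

    use cang_jie::{CangJieTokenizer, TokenizerOption};

    index.tokenizers().register(
        "cang_jie",
        CangJieTokenizer {
            worker: Arc::new(jieba_rs::Jieba::new()), // jieba-rs with its default dictionary
            option: TokenizerOption::Unicode,
        },
    );
}

// Built without the feature: nothing extra to register.
#[cfg(not(feature = "cang_jie"))]
pub fn register_optional_tokenizers(_index: &tantivy::Index) {}
```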


dzcpy commented Jun 7, 2021

@hntd187 Thanks very much, I think that's pretty much what I need. In the future people might want to use other tokenizers, like Japanese and Korean ones, but for me Chinese is all I need.
