add a chinese tokenizer #2008
Conversation
Can you provide a simple source-code installation document? I'd like to test this feature.
You can clone the repository and check out the branch for this PR. Remember to build in release mode.
Is this speed adjustable on the chinese-tokenize branch?
@yangshike Yes, when compiled in release mode.
When I run it, the console shows cluster mode starting. Why does cluster mode start? How can I start in stand-alone mode only?
// Reset the text of the reused token and advance its position counter.
self.token.text.clear();
self.token.position = self.token.position.wrapping_add(1);

// Chain the buffered last char (if any) with the remaining char stream.
let mut iter = self.last_char.take().into_iter().chain(&mut self.chars);
char_iter
I'm not sure what you mean by this
Ah, sorry. I meant that iter is not a great name; char_iter would be nicer.
You can just run it. As mentioned by @PSeitz, make sure you run in release mode.
@trinity-1686a Possibly a silly idea. Do you think we should do an is_ascii check on the whole string? If it is all ASCII, just run the regular tokenizer logic, and if not, run this one... Then maybe we could enable this tokenizer as the default if the perf diff is insignificant?
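For illustration, a minimal sketch of that composite idea, assuming hypothetical ascii_tokenize / cjk_tokenize helpers (made-up names, not the actual tantivy/quickwit tokenizer API):

```rust
// Sketch only: check the whole string once, then dispatch to a cheap
// ASCII path or to the per-char CJK path from this PR.
fn composite_tokenize(text: &str) -> Vec<String> {
    if text.is_ascii() {
        // Fast happy path: all-ASCII text goes through plain whitespace splitting.
        ascii_tokenize(text)
    } else {
        // Otherwise fall back to the CJK-aware tokenizer.
        cjk_tokenize(text)
    }
}

// Hypothetical stand-in for the regular tokenizer logic.
fn ascii_tokenize(text: &str) -> Vec<String> {
    text.split_ascii_whitespace().map(str::to_owned).collect()
}

// Hypothetical stand-in for the tokenizer added in this PR:
// one token per non-whitespace char.
fn cjk_tokenize(text: &str) -> Vec<String> {
    text.chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_string())
        .collect()
}
```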
It would deserve a couple more unit tests and maybe nicer naming here and there, but great job overall.
Do whatever fixes you feel are good, and then you can merge.
curl "http://127.0.0.1:7280/api/v1/customer2/search/?query=234" when i run: Has No output and The console does not see the request log, I tested again. It is OK to use the command line, but not http This is my index file: doc_mapping: |
It has been solved. It works if the Chinese characters in the URL are percent-encoded, like: curl "http://127.0.0.1:7280/api/v1/customer2/search/?query=wechat_name:%E5%91%B5%E5%91%B5"
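For reference, %E5%91%B5%E5%91%B5 is just the UTF-8 bytes of 呵呵 percent-encoded. A toy sketch of that encoding, not what curl or quickwit actually uses:

```rust
// Toy percent-encoder: keep unreserved ASCII bytes as-is and escape
// everything else. "呵呵" encodes to "%E5%91%B5%E5%91%B5", matching the
// working curl command above.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for &b in input.as_bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

fn main() {
    assert_eq!(percent_encode("呵呵"), "%E5%91%B5%E5%91%B5");
}
```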
@fulmicoton I don't know, it depends a lot on the performance of this tokenizer vs the default. I'll try to do some benchmarks.
add proptest
rename tokenizer to chinese_compatible
@trinity-1686a By "this tokenizer" you mean the composite one I was describing or the one in this PR? In the composite version, the is_ascii stuff should give us a fast happy path. Once we have identified that the string is ascii, we could have whitespace tokenizer implementaiton that operates on a (Note that the whitespace did not need to decode utf-8 to begin with but that's another story) |
@yangshike thank you for the tests and the follow-up. I am surprised your browser did not do the URL encoding directly!
The curl I used is not a browser.
Ah, that makes sense @yangshike! :)
@fulmicoton I mean both. There are many cases where we can get non-ASCII text that would work fine with the current default tokenizer (a single emoji in a document, …).
pure ascii: 20% slower

Maybe having a composite tokenizer can make sense to avoid a 20% hit on ASCII, but close to 10% on non-ASCII, non-Chinese text is a bit hefty in my opinion, and I'm not sure it can be made the default then. It's easy to get any of …
Description
This adds a simple tokenizer for CJK. Before, something like "你好世界" (hello world) would be a single token because it contains no whitespace. This means searching for "你好" would yield no result.
A more intelligent tokenizer would probably split it into two tokens (hello, world). This tokenizer simply splits at each char, creating 4 tokens. This is much faster at indexing, but requires using a phrase query to match a word written as two or more chars.
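A toy illustration of the difference in behaviour (not the PR's actual code):

```rust
fn main() {
    let text = "你好世界";

    // Whitespace-based tokenization: no whitespace, so one single token.
    let whitespace_tokens: Vec<&str> = text.split_whitespace().collect();
    assert_eq!(whitespace_tokens, vec!["你好世界"]);

    // Per-char tokenization as described above: four tokens, so "你好"
    // can be matched with a phrase query over two consecutive tokens.
    let char_tokens: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    assert_eq!(char_tokens, vec!["你", "好", "世", "界"]);
}
```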
fix #1979
How was this PR tested?
Some tests were added for the tokenizer, plus a manual test: indexing the wiki-articles-10000 dataset with the new tokenizer on the body field, and searching for "毛藝" (name of a Chinese gymnast), "毛" (first half), "藝" (second half), and "藝毛" (wrong order):