
add jieba tokenizer for Chinese #209

Merged

1 commit merged into valeriansaliou:master on Nov 7, 2021

Conversation

@vincascm (Contributor) commented May 8, 2020

add jieba tokenizer for Chinese

@valeriansaliou (Owner)

Hello there! Your code is Chinese-specific, which is not fine by me in terms of code hygiene. Can you also explain the purpose of this PR? I am not familiar with the Jieba tokenizer.

@vincascm (Contributor, Author) commented May 9, 2020

@valeriansaliou

1. Chinese does not have a trivial word segmentation process: https://en.wikipedia.org/wiki/Text_segmentation

2. jieba is a widely used Chinese tokenizer (see the usage sketch after this list): https://github.com/fxsjy/jieba#jieba-1

3. It may also enable Chinese-specific features later.
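
To make point 2 concrete, here is a minimal sketch of jieba-based segmentation from Rust using the jieba-rs crate (the Rust port of jieba that a Rust project like Sonic would pull in); the sample sentence and output formatting are illustrative only, not part of this PR.

```rust
// Minimal sketch: segmenting a Chinese sentence with the jieba-rs crate.
use jieba_rs::Jieba;

fn main() {
    // Loads jieba's built-in dictionary.
    let jieba = Jieba::new();

    // cut() splits the text into words; the second argument toggles the
    // HMM model used for words that are missing from the dictionary.
    let words: Vec<&str> = jieba.cut("中文没有显式的分词边界", false);

    // Prints the segmented words separated by slashes. Without such a
    // segmenter, a whitespace-based tokenizer would see the whole
    // sentence as a single token.
    println!("{}", words.join(" / "));
}
```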

@sftblw commented May 15, 2020

It is common to use a dedicated tokenizer for CJK text, even in Elasticsearch (see search results for "Elasticsearch MeCab tokenizer"). As an example, the NLP library spaCy requires an external tokenizer installed on the system for each of these languages:

| Chinese | Japanese          | Korean                |
| ------- | ----------------- | --------------------- |
| jieba   | MeCab via fugashi | mecab-ko via natto-py |

@dzcpy commented Jun 3, 2021

Have you guys found a solution for Chinese tokenization?
Maybe we can borrow some ideas from https://github.com/tantivy-search/tantivy, such as making the tokenizer configurable in order to support CJK languages?
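
As an aside, the "configurable tokenizer" idea mentioned here usually boils down to a small trait that language-specific segmenters implement. The sketch below is purely illustrative: none of these type or function names exist in Sonic or Tantivy, and the locale codes are only an assumption about how a lexer might dispatch.

```rust
use jieba_rs::Jieba;

// Hypothetical pluggable tokenizer interface (illustration only).
trait Tokenizer {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

// Default path: split on whitespace, which works for most Latin-script text.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

// Chinese path: delegate word segmentation to jieba.
struct JiebaTokenizer {
    engine: Jieba,
}

impl Tokenizer for JiebaTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        // HMM disabled here; enabling it would be another configuration knob.
        self.engine.cut(text, false).into_iter().map(str::to_owned).collect()
    }
}

// Configuration (or a detected locale) then selects an implementation.
fn tokenizer_for(locale: &str) -> Box<dyn Tokenizer> {
    match locale {
        "cmn" | "zho" => Box::new(JiebaTokenizer { engine: Jieba::new() }),
        _ => Box::new(WhitespaceTokenizer),
    }
}
```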

@hajiuxbz

Hoping to see this merged.

@rcy17 commented Nov 6, 2021

One and a half years later... I sincerely hope this gets merged.

valeriansaliou merged commit 8793dae into valeriansaliou:master on Nov 7, 2021
@valeriansaliou (Owner)

Thanks for the PR, and sorry it took so much time. I am currently updating the jieba library to the latest version and refactoring the code to make it an optional feature, since not all Sonic users need Chinese tokenization support (the library adds some size overhead).
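
For readers wondering what "optional feature" means here: in Cargo it is a feature flag that keeps the jieba dependency out of default builds. The sketch below is an assumption about how such gating typically looks; the actual feature name and code paths are whatever ended up in the Sonic repository, not what is shown here.

```rust
// Cargo.toml (illustrative, not Sonic's actual manifest):
//   [dependencies]
//   jieba-rs = { version = "*", optional = true }
//
//   [features]
//   tokenizer-chinese = ["jieba-rs"]

// The jieba code path then only compiles when the feature is enabled.
#[cfg(feature = "tokenizer-chinese")]
fn segment_chinese(text: &str) -> Vec<String> {
    use jieba_rs::Jieba;

    // A real integration would cache the Jieba instance (building one
    // loads its dictionary), e.g. behind a lazily-initialized static.
    let jieba = Jieba::new();
    jieba.cut(text, false).into_iter().map(str::to_owned).collect()
}

#[cfg(not(feature = "tokenizer-chinese"))]
fn segment_chinese(text: &str) -> Vec<String> {
    // Fallback when the feature is off: leave the text unsegmented so the
    // rest of the pipeline behaves as before.
    vec![text.to_owned()]
}
```

Users who build without the feature avoid the dictionary's size overhead, which is the trade-off mentioned above.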
