Not getting the same results as Kuromoji java #23

Closed
Citronelol opened this issue Jan 31, 2018 · 3 comments

@Citronelol

Hi,

I was trying to tokenize the following sentence:

第1条 この法人は、一般社団法人国際銀行協会(以下「本協会」という。)と称し、英文では、 International Bankers Association of Japanと記載する。

and the results differ between the Java version of Kuromoji (with the IPADIC dictionary) and the tokenizer provided by kuromoji.js. In particular, the sequence 協会 is split by kuromoji.js.

I saw a closed issue (#16) stating this could be due to the Viterbi version of the tokenizer. Is there a way to disable it?

Many thanks in advance,

Best
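
For anyone reproducing the report, here is a minimal sketch of how the kuromoji.js output could be inspected. The helper names (`describeTokens`, `lacksExactToken`, `reproduce`) are made up for illustration, and the dictionary path is an assumption that depends on how the package is installed. It relies only on the `kuromoji.builder(...).build(...)` and `tokenizer.tokenize(...)` calls and the `surface_form`, `pos`, and `pos_detail_1` token fields.

```javascript
// Hypothetical helper: one line per token, showing the surface form and
// the first two part-of-speech fields, so two outputs can be diffed.
function describeTokens(tokens) {
  return tokens.map((t) => `${t.surface_form}\t${t.pos},${t.pos_detail_1}`);
}

// Hypothetical helper: true if `word` occurs in the concatenated text
// but no token's surface form equals it exactly.
function lacksExactToken(tokens, word) {
  const text = tokens.map((t) => t.surface_form).join("");
  return text.includes(word) && !tokens.some((t) => t.surface_form === word);
}

// Not called on load. Needs the kuromoji package and its dictionary files;
// `dicPath` is an assumption about where those files live.
function reproduce(dicPath, sentence, callback) {
  const kuromoji = require("kuromoji");
  kuromoji.builder({ dicPath }).build((err, tokenizer) => {
    if (err) return callback(err);
    callback(null, tokenizer.tokenize(sentence));
  });
}
```

Calling `reproduce("node_modules/kuromoji/dict", sentence, (err, tokens) => console.log(describeTokens(tokens).join("\n")))` should print one token per line, and `lacksExactToken(tokens, "協会")` can then flag whether 協会 came back as a single token.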

@DJTB

DJTB commented Mar 16, 2018

It appears to be a preference issue: it's matching both 協 and 会 as 接尾 (suffix) before the whole word.

#16 is matching 人名 (name) for 研 and 究 before the whole word.

Perhaps the matching algorithm needs to favor longer tokens before splitting into finer matches.
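
To illustrate why a split can win over the whole word, here is a toy sketch of minimum-cost segmentation over a word lattice, the general idea behind a Viterbi-style tokenizer. It is not kuromoji.js's actual implementation: the dictionary entries and costs are invented, and the connection cost is a single constant rather than a per-pair matrix.

```javascript
// Toy minimum-cost segmentation (Viterbi over a word lattice).
// All entries and costs are invented for illustration; they are not IPADIC values.
function segment(text, dict, connectionCost) {
  // best[i] = cheapest way found to cover text[0..i): { cost, prev, word }
  const best = new Array(text.length + 1).fill(null);
  best[0] = { cost: 0, prev: -1, word: null };
  for (let end = 1; end <= text.length; end++) {
    for (let start = 0; start < end; start++) {
      if (best[start] === null) continue; // prefix cannot be covered
      const word = text.slice(start, end);
      if (!dict.has(word)) continue;
      // Pay a connection cost for every word after the first.
      const link = start === 0 ? 0 : connectionCost;
      const cost = best[start].cost + dict.get(word) + link;
      if (best[end] === null || cost < best[end].cost) {
        best[end] = { cost, prev: start, word };
      }
    }
  }
  if (best[text.length] === null) return null; // no full segmentation exists
  const words = [];
  for (let i = text.length; i > 0; i = best[i].prev) words.unshift(best[i].word);
  return words;
}

// Cheap single-character entries beat an expensive whole word: 100 + 100 + 50 < 500.
const cheapParts = new Map([["協", 100], ["会", 100], ["協会", 500]]);
console.log(segment("協会", cheapParts, 50)); // [ '協', '会' ]

// Lowering the whole-word cost flips the result: 200 < 250.
const cheapWhole = new Map([["協", 100], ["会", 100], ["協会", 200]]);
console.log(segment("協会", cheapWhole, 50)); // [ '協会' ]
```

In this model, "favoring longer tokens" amounts to adjusting costs so the whole-word path is cheaper, rather than switching the search off.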

@takuyaa
Owner

takuyaa commented Mar 19, 2018

@Citronelol I released a fixed version, 0.1.2, and deployed the demo site: https://takuyaa.github.io/kuromoji.js/demo/tokenize.html
FYI @DJTB

@takuyaa takuyaa closed this as completed Mar 19, 2018
@Citronelol
Author

Thanks a lot!
