Implement an "English only" Tokenizer #86
Comments
Hi, @anselmwang. Thank you for your excellent suggestion! I am sure your suggestion is the best workaround short of implementing a tokenizer for each language 😄
I'd like to add a new tokenizer, named …
Since I have already changed my codebase and confirmed it works as I expected, either option is OK. Please choose whichever you prefer :)
Kind regards. Thanks!
Thanks @tadashi-aikawa for your quick response! Since you have already implemented the feature, please go ahead.
I have released v5.6.0.beta2 🚀 If you use BRAT, please confirm whether it works as you expected. I have created a new …
Thanks Tadashi, it works!
I have released v5.6.0 🚀 Many thanks!
Thanks a lot, I have upgraded to the latest version!
Hi, Tadashi,
Thanks a lot for creating this plugin! It accelerates my input speed a lot, especially on my mobile phone.
However, after a while, I found two inconveniences. I am using the "default" tokenization strategy and I only need English word completion. My note looks like the one below.
The two inconveniences are:
Both problems come down to the fact that various-complements lacks an "English only" tokenization strategy.
I just read through the code and was going to add a new tokenizer, but I quickly found that if we expose TRIM_CHAR_PATTERN as a parameter, my problem can be solved perfectly. TRIM_CHAR_PATTERN serves as the delimiter pattern used to split words, so if I define it as /^[a-zA-Z0-9_-]+$/, "中文abcd" will be split into "中", "文", and "abcd", and "$lmb" will be split into "$" and "lmb". That exactly matches my need.
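To make the intended splitting behavior concrete, here is a minimal TypeScript sketch. This is my own illustration rather than the plugin's actual code: the function name is hypothetical, and I use the unanchored character class as a delimiter rule directly.

```typescript
// Hypothetical illustration of delimiter-based splitting. Runs of
// characters matching [a-zA-Z0-9_-] form words; every other character
// (including each CJK character) becomes its own single-char token.
function splitByDelimiterPattern(text: string): string[] {
  return text.match(/[a-zA-Z0-9_-]+|[^a-zA-Z0-9_-]/g) ?? [];
}

console.log(splitByDelimiterPattern("中文abcd")); // ["中", "文", "abcd"]
console.log(splitByDelimiterPattern("$lmb"));     // ["$", "lmb"]
```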
I also suggest exposing the parameter under the name "delimiter pattern", since that is more intuitive.
If exposing this parameter directly is too advanced for end users, we could instead expose a tokenization strategy called "English only", which would in fact be a DefaultTokenizer with TRIM_CHAR_PATTERN set to /^[a-zA-Z0-9_-]+$/, exactly matching the regex used for the "onlyComplementEnglishOnCurrentFileComplement" option.
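Such a strategy might look roughly like the following self-contained sketch. The class and method names here are assumptions for illustration only, not the plugin's actual internals:

```typescript
// Hypothetical "English only" tokenizer: only tokens consisting entirely
// of English word characters survive, mirroring the regex behind the
// onlyComplementEnglishOnCurrentFileComplement option.
const ENGLISH_PATTERN = /[a-zA-Z0-9_-]+/g; // unanchored form of /^[a-zA-Z0-9_-]+$/

class EnglishOnlyTokenizer {
  tokenize(content: string): string[] {
    // Non-matching characters ("中", "文", "$", spaces, ...) act purely
    // as delimiters and never become completion candidates.
    return content.match(ENGLISH_PATTERN) ?? [];
  }
}

const tokens = new EnglishOnlyTokenizer().tokenize("中文abcd $lmb");
console.log(tokens); // ["abcd", "lmb"]
```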
Does the suggestion look good to you? I can submit a PR if you are fine with the change.
Thanks,
anselmwang