Implement an "English only" Tokenizer #86

Closed · anselmwang opened this issue Feb 12, 2022 · 6 comments
Labels: enhancement (New feature or request)

anselmwang commented Feb 12, 2022

Hi Tadashi,

Thanks a lot for creating this plugin! It has sped up my typing considerably, especially on my mobile phone.
However, after using it for a while, I have run into two inconveniences. I am using the "default" tokenization strategy, and I only need English word completion. My note looks like this:

```
中文abcd
abc
$lmb
```

The two inconveniences are:

  1. After typing "abc", the plugin shows no completion, because in the first line the Chinese characters are combined with "abcd" into a single word.
  2. After typing "lmb" (I define "lmb" to expand to "\mathbf{}" so I can type LaTeX equations quickly on mobile), there is no completion, because "$lmb" is treated as a single word.

Both problems stem from the fact that various-complements lacks an "English only" tokenization strategy.

I have read through the code and would like to add a new tokenizer, but I quickly realized that exposing TRIM_CHAR_PATTERN as a parameter would solve my problem perfectly. TRIM_CHAR_PATTERN serves as the delimiter pattern used to split words, so if I define it as "/^[a-zA-Z0-9_-]+$/", then "中文abcd" is split into "中", "文", and "abcd", and "$lmb" is split into "$" and "lmb". That exactly matches my need.
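
A minimal sketch of that splitting behavior (hypothetical code, not the plugin's; the function name is illustrative):

```typescript
// Hypothetical sketch of the proposed "English only" splitting:
// runs of ASCII letters, digits, "_" and "-" stay whole, and every
// other non-space character becomes its own token.
function tokenizeEnglishOnly(text: string): string[] {
  return text.match(/[a-zA-Z0-9_-]+|\S/g) ?? [];
}

tokenizeEnglishOnly("中文abcd"); // ["中", "文", "abcd"]
tokenizeEnglishOnly("$lmb");     // ["$", "lmb"]
```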

I also suggest exposing the parameter under the name "delimiter pattern", since that is more intuitive.

If exposing this parameter directly is too advanced for end users, we could instead expose an "English only" tokenization strategy, which would in fact be a DefaultTokenizer with TRIM_CHAR_PATTERN set to "/^[a-zA-Z0-9_-]+$/", exactly the regex used for the "onlyComplementEnglishOnCurrentFileComplement" option.

Does the suggestion look good to you? I can submit a PR if you are fine with the change.

Thanks,
anselmwang

@anselmwang changed the title from "Expose the TRIM_CHAR_PATTERN (i.e., delimiter pattern) as a parameter" to "Implement an "English only" Tokenizer" on Feb 12, 2022
@tadashi-aikawa (Owner)

Hi, @anselmwang. Thank you for your excellent suggestion!

I am sure your suggestion is the best workaround short of implementing a tokenizer for each language 😄

> I also suggest exposing the parameter under the name "delimiter pattern", since that is more intuitive. If exposing this parameter directly is too advanced for end users, we could instead expose an "English only" tokenization strategy,

I would like to add a new tokenizer named EnglishOnlyTokenizer. As you said, TRIM_CHAR_PATTERN would be too hard to explain to end users, and there is no guarantee that it acts as a delimiter.

> Does the suggestion look good to you? I can submit a PR if you are fine with the change.

Since I have already made the change in my codebase and confirmed that it works as I expected, either option is fine. Please choose whichever you prefer :)

  • You will create a PR
  • I will implement the feature

Kind regards. Thanks!

@tadashi-aikawa added the enhancement (New feature or request) label on Feb 12, 2022
@anselmwang (Author)

Thanks @tadashi-aikawa for your quick response!

Since you have already implemented the feature, please go ahead.
Looking forward to the continued success of various-complements!

@tadashi-aikawa (Owner) commented Feb 13, 2022

@anselmwang

I have released v5.6.0.beta2 🚀

If you use BRAT, please confirm whether it works as you expected.

I have created a new English-only tokenizer that not only changes TRIM_CHAR_PATTERN but is also tuned for analyzing the current input token :)
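
As a rough, hypothetical illustration of what tuning the current-input-token analysis could mean (this is not the plugin's actual code):

```typescript
// Hypothetical sketch: take only the trailing run of English word
// characters before the cursor as the completion query, so typing
// "$lmb" completes against "lmb" rather than "$lmb".
function currentInputToken(lineBeforeCursor: string): string {
  const match = lineBeforeCursor.match(/[a-zA-Z0-9_-]+$/);
  return match ? match[0] : "";
}

currentInputToken("$lmb");    // "lmb"
currentInputToken("中文abcd"); // "abcd"
```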

@anselmwang (Author)

Thanks Tadashi, it works!

@tadashi-aikawa (Owner)

I have released v5.6.0 🚀

Many thanks!

@anselmwang (Author)

Thanks a lot, I have upgraded to the latest version!
