Implement an "English only" Tokenizer #86

Closed · anselmwang opened this issue Feb 12, 2022 · 6 comments
Labels: enhancement (New feature or request)

anselmwang commented Feb 12, 2022

Hi Tadashi,

Thanks a lot for creating this plugin! It has sped up my typing considerably, especially on my mobile phone.
However, after using it for a while, I have run into two inconveniences. I am using the "default" tokenization strategy, and I only need English word completion. My note looks like this:

```
中文abcd
abc
$lmb
```

The two inconveniences are:

  1. After typing "abc", the plugin shows no completion, because in the first line the Chinese characters are combined with "abcd" into a single word.
  2. After typing "lmb" (I define "lmb" to expand to "\mathbf{}" so I can type LaTeX equations quickly on mobile), there is no completion, because "$lmb" is treated as a single word.

Both problems stem from the fact that various-complements lacks an "English only" tokenization strategy.

I have read through the code and would like to add a new tokenizer, but I quickly realized that exposing TRIM_CHAR_PATTERN as a parameter would solve my problem perfectly. TRIM_CHAR_PATTERN serves as the delimiter pattern used to split words, so if I define it as "/^[a-zA-Z0-9_-]+$/", then "中文abcd" is split into "中", "文", and "abcd", and "$lmb" is split into "$" and "lmb". That exactly matches my need.
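
A minimal sketch of that splitting behavior (hypothetical code, not the plugin's; the function name is illustrative):

```typescript
// Hypothetical sketch of the proposed "English only" splitting:
// runs of ASCII letters, digits, "_" and "-" stay whole, and every
// other non-space character becomes its own token.
function tokenizeEnglishOnly(text: string): string[] {
  return text.match(/[a-zA-Z0-9_-]+|\S/g) ?? [];
}

tokenizeEnglishOnly("中文abcd"); // ["中", "文", "abcd"]
tokenizeEnglishOnly("$lmb");     // ["$", "lmb"]
```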

I also suggest exposing the parameter under the name "delimiter pattern", since that is more intuitive.

If exposing this parameter directly is too advanced for end users, we could instead expose an "English only" tokenization strategy, which would in fact be a DefaultTokenizer with TRIM_CHAR_PATTERN set to "/^[a-zA-Z0-9_-]+$/", exactly the regex used for the "onlyComplementEnglishOnCurrentFileComplement" option.

Does the suggestion look good to you? I can submit a PR if you are fine with the change.

Thanks,
anselmwang

@anselmwang changed the title from "Expose the TRIM_CHAR_PATTERN (i.e., delimiter pattern) as a parameter" to "Implement an "English only" Tokenizer" on Feb 12, 2022
@tadashi-aikawa (Owner)

Hi, @anselmwang. Thank you for your excellent suggestion!

I am sure your suggestion is the best workaround short of implementing a tokenizer for each language 😄

> I also suggest exposing the parameter under the name "delimiter pattern", since that is more intuitive. If exposing this parameter directly is too advanced for end users, we could instead expose an "English only" tokenization strategy,

I would like to add a new tokenizer named EnglishOnlyTokenizer. As you said, TRIM_CHAR_PATTERN would be too hard to explain to end users, and there is no guarantee that it acts as a delimiter.

> Does the suggestion look good to you? I can submit a PR if you are fine with the change.

Since I have already made the change in my codebase and confirmed that it works as I expected, either option is fine. Please choose whichever you prefer :)

  • You will create a PR
  • I will implement the feature

Kind regards. Thanks!

@tadashi-aikawa added the enhancement (New feature or request) label on Feb 12, 2022
@anselmwang (Author)

Thanks @tadashi-aikawa for your quick response!

Since you have already implemented the feature, please go ahead.
Looking forward to the continued success of various-complements!

@tadashi-aikawa (Owner) commented Feb 13, 2022

@anselmwang

I have released v5.6.0.beta2 🚀

If you use BRAT, please confirm whether it works as you expected.

I have created a new English-only tokenizer that not only changes TRIM_CHAR_PATTERN but is also tuned for analyzing the current input token :)
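
As a rough, hypothetical illustration of what tuning the current-input-token analysis could mean (this is not the plugin's actual code):

```typescript
// Hypothetical sketch: take only the trailing run of English word
// characters before the cursor as the completion query, so typing
// "$lmb" completes against "lmb" rather than "$lmb".
function currentInputToken(lineBeforeCursor: string): string {
  const match = lineBeforeCursor.match(/[a-zA-Z0-9_-]+$/);
  return match ? match[0] : "";
}

currentInputToken("$lmb");    // "lmb"
currentInputToken("中文abcd"); // "abcd"
```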

@anselmwang (Author)

Thanks Tadashi, it works!

@tadashi-aikawa (Owner)

I have released v5.6.0 🚀

Many thanks!

@anselmwang (Author)

Thanks a lot, I have upgraded to the latest version!
