Customize token_match for training #13757
Unanswered · ivan-kleshnin asked this question in Help: Coding & Implementations
The documentation at https://spacy.io/usage/training#custom-tokenizer suggests overriding `Tokenizer` properties instead of recreating it. That does work; however,

```python
import re

from spacy.util import registry

def token_match(text: str) -> re.Match | None:
    # just a dummy matching fn
    return None

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        ...
        # add a special case
        nlp.tokenizer.add_special_case("_SPECIAL_", [{"ORTH": "_SPECIAL_"}])
        # the added line:
        nlp.tokenizer.token_match = token_match
    return customize_tokenizer
```
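For context, the linked docs page wires such a callback into the training config roughly like this (section and key names taken from that docs example; adjust to your own config):

```ini
[initialize]

[initialize.before_init]
@callbacks = "customize_tokenizer"
```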
fails with
If I fall back to the supposedly lower-level approach:
it works fine. If I pass
`nlp.tokenizer.token_match = nlp.Defaults.token_match`
instead of a custom function, it also works fine.

The docs say the function should be "A function matching the signature of `re.compile(string).match` to find token matches." The above signature, `def token_match(text: str) -> re.Match | None`, works outside of the training scope, e.g. for an existing model.

How do I provide a completely custom function if I want if-else based, fully controlled logic instead of regular expressions?
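One possible workaround, sketched here under the assumption that spaCy only needs a match-like (truthy) return value per the documented `re.compile(string).match` signature: make all decisions in plain if-else Python, and when the chunk should be kept as one token, return a real `re.Match` produced by a trivial catch-all pattern. The `_SPECIAL_`-style condition below is an illustrative assumption, not spaCy API.

```python
import re
from typing import Optional

# Trivial pattern that matches any string (including newlines), so
# _ANY.match(text) always yields a real re.Match object. We use it
# only to signal "yes, treat this whole chunk as a single token".
_ANY = re.compile(r"(?s).*")

def token_match(text: str) -> Optional[re.Match]:
    # Fully controlled if-else logic instead of a regex rule.
    # Hypothetical condition: keep _WRAPPED_ chunks as one token.
    if text.startswith("_") and text.endswith("_") and len(text) > 2:
        return _ANY.match(text)  # a genuine re.Match, hence truthy
    return None  # fall through to normal tokenization
```

Because the return value is an actual `re.Match` or `None`, it satisfies the documented signature while keeping the decision logic in ordinary Python.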