Customize token_match for training #13757
Unanswered · ivan-kleshnin asked this question in Help: Coding & Implementations
The documentation at https://spacy.io/usage/training#custom-tokenizer suggests overriding `Tokenizer` properties instead of recreating it. That does work; however,

```python
import re

from spacy.util import registry

def token_match(text: str) -> re.Match | None:
    # just a dummy matching fn
    return None

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        ...
        # add a special case
        nlp.tokenizer.add_special_case("_SPECIAL_", [{"ORTH": "_SPECIAL_"}])
        # the added line:
        nlp.tokenizer.token_match = token_match
    return customize_tokenizer
```
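For context, the linked docs page wires such a callback into the training config roughly like this (section and key names taken from that docs example; adjust to your own config):

```ini
[initialize]

[initialize.before_init]
@callbacks = "customize_tokenizer"
```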
fails with
If I fall back to the supposedly lower-level approach:
it works fine. If I pass
`nlp.tokenizer.token_match = nlp.Defaults.token_match`
instead of a custom function, it also works fine.

The docs say the function should be "A function matching the signature of `re.compile(string).match` to find token matches." The above signature, `def token_match(text: str) -> re.Match | None`, works outside of the training scope, e.g. for an existing model.

How do I provide a completely custom function if I want if-else based, fully controlled logic instead of regular expressions?
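One possible workaround, sketched here under the assumption that spaCy only needs a match-like (truthy) return value per the documented `re.compile(string).match` signature: make all decisions in plain if-else Python, and when the chunk should be kept as one token, return a real `re.Match` produced by a trivial catch-all pattern. The `_SPECIAL_`-style condition below is an illustrative assumption, not spaCy API.

```python
import re
from typing import Optional

# Trivial pattern that matches any string (including newlines), so
# _ANY.match(text) always yields a real re.Match object. We use it
# only to signal "yes, treat this whole chunk as a single token".
_ANY = re.compile(r"(?s).*")

def token_match(text: str) -> Optional[re.Match]:
    # Fully controlled if-else logic instead of a regex rule.
    # Hypothetical condition: keep _WRAPPED_ chunks as one token.
    if text.startswith("_") and text.endswith("_") and len(text) > 2:
        return _ANY.match(text)  # a genuine re.Match, hence truthy
    return None  # fall through to normal tokenization
```

Because the return value is an actual `re.Match` or `None`, it satisfies the documented signature while keeping the decision logic in ordinary Python.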