
AngledLuffa
Collaborator

Attaches the charlm to the tokenizer. Seems to reduce the Hebrew MWT error rate (although weirdly it sometimes grabs spaces in tokens now)
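As a rough illustration of what "attaching the charlm" means here, the sketch below (PyTorch, with assumed names such as `TokenizerWithCharLM` and `charlm.encode`; these are not the actual Stanza classes) concatenates the pretrained character LM's hidden states with the tokenizer's own character embeddings before the tagging LSTM:

```python
import torch
import torch.nn as nn

class TokenizerWithCharLM(nn.Module):
    def __init__(self, charlm, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.charlm = charlm                          # pretrained, kept frozen here
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        lstm_input = emb_dim + charlm.hidden_dim      # concatenated feature size (assumed attribute)
        self.lstm = nn.LSTM(lstm_input, hidden_dim, batch_first=True, bidirectional=True)
        self.clf = nn.Linear(2 * hidden_dim, 3)       # e.g. token-end / sentence-end / MWT labels

    def forward(self, char_ids, raw_text):
        emb = self.embedding(char_ids)                # (batch, chars, emb_dim)
        with torch.no_grad():                         # do not fine-tune the charlm
            char_reps = self.charlm.encode(raw_text)  # (batch, chars, charlm_dim), assumed API
        feats, _ = self.lstm(torch.cat([emb, char_reps], dim=-1))
        return self.clf(feats)
```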

…rms of improving sentence splitting scores for certain languages. Also helps with MWT scores for Hebrew; not significantly tested on other languages yet

Save & load the tokenizers without putting a charlm (if relevant) into the model file
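A minimal sketch of that save/load split, assuming a `charlm.` parameter prefix, a `uses_charlm` flag, and hypothetical `load_charlm` / `build_tokenizer` helpers (none of these are the exact Stanza names):

```python
import torch

def save_tokenizer(model, path):
    # keep everything except the charlm's own weights; those stay in the charlm's file
    state = {k: v for k, v in model.state_dict().items()
             if not k.startswith("charlm.")}
    torch.save({"model": state,
                "uses_charlm": model.charlm is not None}, path)

def load_tokenizer(path, charlm_forward_file=None):
    checkpoint = torch.load(path, map_location="cpu")
    charlm = load_charlm(charlm_forward_file) if checkpoint["uses_charlm"] else None
    model = build_tokenizer(charlm=charlm)            # hypothetical constructor
    # strict=False: the charlm parameters were deliberately left out of the checkpoint
    model.load_state_dict(checkpoint["model"], strict=False)
    return model
```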

Pass the charlm_forward_file from the pipeline to the model
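Roughly, the handoff looks like this, reusing the hypothetical `load_tokenizer` sketched above; the config keys and paths are assumptions for illustration, not the exact Pipeline API:

```python
config = {
    "tokenize_model_path": "saved_models/tokenize/he_combined_tokenizer.pt",
    "tokenize_forward_charlm_path": "saved_models/charlm/he_oscar_forward.pt",
}
tokenizer = load_tokenizer(config["tokenize_model_path"],
                           charlm_forward_file=config["tokenize_forward_charlm_path"])
```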

When training a tokenizer, the run_tokenizer script finds a charlm, if possible, and attaches it to the model
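Something like the following sketch, where the `default_charlms` table and the `--charlm_forward_file` flag stand in for whatever the run script actually uses:

```python
import os

# assumed lookup of known charlm packages per language
default_charlms = {"he": "oscar", "en": "1billion"}

def add_charlm_args(lang, args):
    package = default_charlms.get(lang)
    if package is None:
        return args                                   # no charlm known for this language
    path = os.path.join("saved_models", "charlm",
                        "{}_{}_forward.pt".format(lang, package))
    if not os.path.exists(path):
        return args                                   # charlm not trained / not downloaded
    return args + ["--charlm_forward_file", path]
```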

Ignore extra charlm when passed in from the run_ script or as part of the Pipeline if the saved model didn't use charlm
…ebuilding resources.json, if such a thing exists
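The guard for that case could look like this sketch, reusing the assumed `uses_charlm` flag and `load_charlm` helper from above:

```python
def resolve_charlm(checkpoint, charlm_forward_file):
    # if the saved tokenizer was trained without a charlm, silently drop any
    # charlm file the run_ script or the Pipeline tried to pass in
    if not checkpoint.get("uses_charlm", False):
        return None
    return load_charlm(charlm_forward_file)           # hypothetical loader, as above
```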
…ing of the tokenizer... need to account for that when breaking down MWT
… and treats them as separate words, teaching the tokenizer to not treat 'can not' as a single token with the space

This wound up being a problem in Hebrew when using the new charlm-attached tokenizer, although on further reflection there's no reason it couldn't happen in any language or with non-pretrained-charlm tokenizers as well
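A minimal sketch of the data-preparation fix, with invented names: if an MWT's surface form contains a space (e.g. "can not"), the pieces are emitted as separate words so the tokenizer never learns a token that spans a space:

```python
def split_mwt_with_spaces(surface, expanded_words):
    """surface: raw MWT text; expanded_words: its syntactic words."""
    if " " not in surface:
        return [(surface, expanded_words)]            # ordinary MWT, keep as one token
    pieces = surface.split(" ")
    if len(pieces) == len(expanded_words):
        # one surface piece per word: train on them as plain separate tokens
        return [(piece, [word]) for piece, word in zip(pieces, expanded_words)]
    return [(surface, expanded_words)]                # unclear alignment, leave it alone

print(split_mwt_with_spaces("can not", ["can", "not"]))
# -> [('can', ['can']), ('not', ['not'])]
```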
AngledLuffa merged commit 4c4a4be into dev on Sep 18, 2025
1 check passed
AngledLuffa deleted the tokenize_charlm branch on September 18, 2025 at 20:23