
AngledLuffa
Collaborator

Attaches the charlm to the tokenizer. Seems to reduce the Hebrew MWT error rate (although weirdly it sometimes grabs spaces in tokens now)
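As a rough illustration of what "attaching the charlm" means here, the sketch below (PyTorch, with assumed names such as `TokenizerWithCharLM` and `charlm.encode`; these are not the actual Stanza classes) concatenates the pretrained character LM's hidden states with the tokenizer's own character embeddings before the tagging LSTM:

```python
import torch
import torch.nn as nn

class TokenizerWithCharLM(nn.Module):
    def __init__(self, charlm, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.charlm = charlm                          # pretrained, kept frozen here
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        lstm_input = emb_dim + charlm.hidden_dim      # concatenated feature size (assumed attribute)
        self.lstm = nn.LSTM(lstm_input, hidden_dim, batch_first=True, bidirectional=True)
        self.clf = nn.Linear(2 * hidden_dim, 3)       # e.g. token-end / sentence-end / MWT labels

    def forward(self, char_ids, raw_text):
        emb = self.embedding(char_ids)                # (batch, chars, emb_dim)
        with torch.no_grad():                         # do not fine-tune the charlm
            char_reps = self.charlm.encode(raw_text)  # (batch, chars, charlm_dim), assumed API
        feats, _ = self.lstm(torch.cat([emb, char_reps], dim=-1))
        return self.clf(feats)
```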

…rms of improving sentence splitting scores for certain languages. Also helps with MWT scores for Hebrew; not significantly tested on other languages yet

Save & load the tokenizers without putting a charlm (if relevant) into the model file
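A minimal sketch of that save/load split, assuming a `charlm.` parameter prefix, a `uses_charlm` flag, and hypothetical `load_charlm` / `build_tokenizer` helpers (none of these are the exact Stanza names):

```python
import torch

def save_tokenizer(model, path):
    # keep everything except the charlm's own weights; those stay in the charlm's file
    state = {k: v for k, v in model.state_dict().items()
             if not k.startswith("charlm.")}
    torch.save({"model": state,
                "uses_charlm": model.charlm is not None}, path)

def load_tokenizer(path, charlm_forward_file=None):
    checkpoint = torch.load(path, map_location="cpu")
    charlm = load_charlm(charlm_forward_file) if checkpoint["uses_charlm"] else None
    model = build_tokenizer(charlm=charlm)            # hypothetical constructor
    # strict=False: the charlm parameters were deliberately left out of the checkpoint
    model.load_state_dict(checkpoint["model"], strict=False)
    return model
```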

Pass the charlm_forward_file from the pipeline to the model
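Roughly, the handoff looks like this, reusing the hypothetical `load_tokenizer` sketched above; the config keys and paths are assumptions for illustration, not the exact Pipeline API:

```python
config = {
    "tokenize_model_path": "saved_models/tokenize/he_combined_tokenizer.pt",
    "tokenize_forward_charlm_path": "saved_models/charlm/he_oscar_forward.pt",
}
tokenizer = load_tokenizer(config["tokenize_model_path"],
                           charlm_forward_file=config["tokenize_forward_charlm_path"])
```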

When training a tokenizer, the run_tokenizer script finds a charlm, if possible, and attaches it to the model
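Something like the following sketch, where the `default_charlms` table and the `--charlm_forward_file` flag stand in for whatever the run script actually uses:

```python
import os

# assumed lookup of known charlm packages per language
default_charlms = {"he": "oscar", "en": "1billion"}

def add_charlm_args(lang, args):
    package = default_charlms.get(lang)
    if package is None:
        return args                                   # no charlm known for this language
    path = os.path.join("saved_models", "charlm",
                        "{}_{}_forward.pt".format(lang, package))
    if not os.path.exists(path):
        return args                                   # charlm not trained / not downloaded
    return args + ["--charlm_forward_file", path]
```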

Ignore extra charlm when passed in from the run_ script or as part of the Pipeline if the saved model didn't use charlm
…ebuilding resources.json, if such a thing exists
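The guard for that case could look like this sketch, reusing the assumed `uses_charlm` flag and `load_charlm` helper from above:

```python
def resolve_charlm(checkpoint, charlm_forward_file):
    # if the saved tokenizer was trained without a charlm, silently drop any
    # charlm file the run_ script or the Pipeline tried to pass in
    if not checkpoint.get("uses_charlm", False):
        return None
    return load_charlm(charlm_forward_file)           # hypothetical loader, as above
```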
…ing of the tokenizer... need to account for that when breaking down MWT
… and treats them as separate words, teaching the tokenizer to not treat 'can not' as a single token with the space

This wound up being a problem in Hebrew when using the new charlm-attached tokenizer, although on further reflection there's no reason it couldn't happen in any language or with non-pretrained-charlm tokenizers as well
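A minimal sketch of the data-preparation fix, with invented names: if an MWT's surface form contains a space (e.g. "can not"), the pieces are emitted as separate words so the tokenizer never learns a token that spans a space:

```python
def split_mwt_with_spaces(surface, expanded_words):
    """surface: raw MWT text; expanded_words: its syntactic words."""
    if " " not in surface:
        return [(surface, expanded_words)]            # ordinary MWT, keep as one token
    pieces = surface.split(" ")
    if len(pieces) == len(expanded_words):
        # one surface piece per word: train on them as plain separate tokens
        return [(piece, [word]) for piece, word in zip(pieces, expanded_words)]
    return [(surface, expanded_words)]                # unclear alignment, leave it alone

print(split_mwt_with_spaces("can not", ["can", "not"]))
# -> [('can', ['can']), ('not', ['not'])]
```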
AngledLuffa merged commit 4c4a4be into dev on Sep 18, 2025
1 check passed
AngledLuffa deleted the tokenize_charlm branch on September 18, 2025 at 20:23