
how to retain spiece token markers #862

Open
joprice opened this issue Jul 24, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

joprice commented Jul 24, 2024

Question

When evaluating a model that uses SentencePiece with transformers.js, I do not get the marker (the "▁" word-boundary prefix) included in the output as I do when running from Python. I'm using the qanastek/pos-french-camembert model to do POS tagging and have situations where a single word, such as a verb with a tense suffix, is returned as two or more tokens. I'd like to process the group of tokens together and decide how to handle the different labels. I see the pre_tokenizer and decoder fields of the model's tokenizer.json reference the Metaspace type, but I'm unsure whether it can be configured to retain the space-placeholder token.
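Roughly what I'm doing (a minimal sketch; the sample sentence is arbitrary):

```js
import { pipeline } from '@xenova/transformers';

// Load the converted model for POS tagging.
const pos = await pipeline('token-classification', 'qanastek/pos-french-camembert');

// A single word like "chantait" can come back as several sub-word tokens.
// In Python the first sub-token keeps the SentencePiece marker ("▁chant", "ait");
// in transformers.js the marker is stripped, so the word boundary is lost.
const output = await pos('Elle chantait une chanson.');
console.log(output);
```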

joprice added the question label on Jul 24, 2024

joprice commented Jul 24, 2024

I should add that the model I'm using is the result of running `python -m scripts.convert --quantize --model_id qanastek/pos-french-camembert`.


joprice commented Jul 24, 2024

I read through the tokenizer code and found that nulling out the tokenizer's decoder (`tokenizer.decoder = null`) lets the tokens pass through unchanged, due to this check: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js#L2997.

The code itself seems to work correctly: it parses the decoder configuration from tokenizer.json and applies it:

   "decoder": {
      "type": "Metaspace",
      "replacement": "",
      "add_prefix_space": true,
      "prepend_scheme": "always"
    },
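For illustration, this is roughly what a Metaspace decoder does (a hand-written sketch, not the library's implementation; `metaspaceDecode` is a hypothetical helper):

```js
// Hypothetical helper mirroring Metaspace decoding: replace the "▁" marker
// ("\u2581") with a space and, per add_prefix_space, drop the leading one.
function metaspaceDecode(tokens) {
  const text = tokens.map(t => t.replaceAll('\u2581', ' ')).join('');
  return text.startsWith(' ') ? text.slice(1) : text;
}

console.log(metaspaceDecode(['\u2581Elle', '\u2581chant', 'ait'])); // "Elle chantait"
```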

But when playing around with this, it seems to me that the decoder should be applied to the full list of tokens, not to a single token: when I null out the tokenizer's decoder and instead run the Metaspace decoder over the full list of tokens, I recover the original input.
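Concretely, a sketch of the workaround (names taken from the @xenova/transformers API; the sentence is arbitrary):

```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('qanastek/pos-french-camembert');
const ids = tokenizer.encode('Elle chantait');

const metaspace = tokenizer.decoder; // keep the parsed Metaspace decoder around
tokenizer.decoder = null;            // decode() now passes the raw tokens through

// Per-token decoding retains the "▁" marker, so sub-tokens belonging to the
// same word can be grouped before deciding how to handle their labels.
console.log(ids.map(id => tokenizer.decode([id], { skip_special_tokens: true })));

// Restoring the decoder and decoding the full list recovers the original input.
tokenizer.decoder = metaspace;
console.log(tokenizer.decode(ids, { skip_special_tokens: true })); // "Elle chantait"
```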
