
how to retain spiece token markers #862

Open
joprice opened this issue Jul 24, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

joprice commented Jul 24, 2024

Question

When evaluating a model that uses SentencePiece with transformers.js, I do not get the marker (the "▁" word-boundary prefix) included in the output as I do when running from Python. I'm using the qanastek/pos-french-camembert model to do POS tagging and have situations where a single word, such as a verb with a tense suffix, is returned as two or more tokens. I'd like to process the group of tokens together and decide how to handle the different labels. I see the pre_tokenizer and decoder fields of the model's tokenizer.json reference the Metaspace type, but I'm unsure whether it can be configured to retain the space-placeholder token.
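Roughly what I'm doing (a minimal sketch; the sample sentence is arbitrary):

```js
import { pipeline } from '@xenova/transformers';

// Load the converted model for POS tagging.
const pos = await pipeline('token-classification', 'qanastek/pos-french-camembert');

// A single word like "chantait" can come back as several sub-word tokens.
// In Python the first sub-token keeps the SentencePiece marker ("▁chant", "ait");
// in transformers.js the marker is stripped, so the word boundary is lost.
const output = await pos('Elle chantait une chanson.');
console.log(output);
```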

joprice added the question label on Jul 24, 2024

joprice commented Jul 24, 2024

I should add that the model I'm using is the result of running `python -m scripts.convert --quantize --model_id qanastek/pos-french-camembert`.


joprice commented Jul 24, 2024

I read through the tokenizer code and found that nulling out the tokenizer's decoder (`tokenizer.decoder = null`) lets the tokens pass through unchanged, due to this check: https://github.com/xenova/transformers.js/blob/main/src/tokenizers.js#L2997.

The code itself seems to work correctly: it parses the decoder configuration from tokenizer.json and applies it:

   "decoder": {
      "type": "Metaspace",
      "replacement": "",
      "add_prefix_space": true,
      "prepend_scheme": "always"
    },
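For illustration, this is roughly what a Metaspace decoder does (a hand-written sketch, not the library's implementation; `metaspaceDecode` is a hypothetical helper):

```js
// Hypothetical helper mirroring Metaspace decoding: replace the "▁" marker
// ("\u2581") with a space and, per add_prefix_space, drop the leading one.
function metaspaceDecode(tokens) {
  const text = tokens.map(t => t.replaceAll('\u2581', ' ')).join('');
  return text.startsWith(' ') ? text.slice(1) : text;
}

console.log(metaspaceDecode(['\u2581Elle', '\u2581chant', 'ait'])); // "Elle chantait"
```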

But when playing around with this, it seems to me that the decoder should be applied to the full list of tokens, not to a single token: when I null out the tokenizer's decoder and instead run the Metaspace decoder over the full list of tokens, I recover the original input.
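Concretely, a sketch of the workaround (names taken from the @xenova/transformers API; the sentence is arbitrary):

```js
import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('qanastek/pos-french-camembert');
const ids = tokenizer.encode('Elle chantait');

const metaspace = tokenizer.decoder; // keep the parsed Metaspace decoder around
tokenizer.decoder = null;            // decode() now passes the raw tokens through

// Per-token decoding retains the "▁" marker, so sub-tokens belonging to the
// same word can be grouped before deciding how to handle their labels.
console.log(ids.map(id => tokenizer.decode([id], { skip_special_tokens: true })));

// Restoring the decoder and decoding the full list recovers the original input.
tokenizer.decoder = metaspace;
console.log(tokenizer.decode(ids, { skip_special_tokens: true })); // "Elle chantait"
```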
