
Result is wrong when decoding tokens one by one #853

Open
1 of 5 tasks
zcbenz opened this issue Jul 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments

zcbenz commented Jul 18, 2024

System Info

Node.js 22.4.0
@xenova/transformers 2.17.2

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

When decoding tokens that represent a multi-byte string, the result is wrong if the tokens are decoded one by one.

import {StringDecoder} from 'node:string_decoder'
import {AutoTokenizer} from '@xenova/transformers'

const tokenizer = await AutoTokenizer.from_pretrained('Qwen/Qwen2-0.5B')

const tokens = [32, 13, 66521, 243, 28291]
console.log('Correct string:', tokenizer.decode(tokens))
console.log('Correct bytes:', Buffer.from(tokenizer.decode(tokens)))

const decoder = new StringDecoder('utf8')
let allBytes = []
process.stdout.write('\nWrong string: ')
for (const token of tokens) {
  const bytes = Buffer.from(tokenizer.decode([token]))
  allBytes.push(bytes)
  process.stdout.write(decoder.write(bytes))
}
process.stdout.write('\n')
console.log('Wrong bytes:', Buffer.concat(allBytes))

Reproduction

Running the above script with Node produces:

Correct string: A. 单发
Correct bytes: <Buffer 41 2e 20 e5 8d 95 e5 8f 91>

Wrong string: A. ��发
Wrong bytes: <Buffer 41 2e 20 ef bf bd ef bf bd e5 8f 91>

I expect the bytes to be the same whether the tokens are decoded in one call, or decoded one by one.

This is probably the intended result, since a single token may decode to a partial Unicode character. However, this behavior makes it impossible to implement a correct streaming interface for LLMs, which I'm doing in my llm.js module.

@zcbenz zcbenz added the bug Something isn't working label Jul 18, 2024
zcbenz commented Jul 18, 2024

I have found a workaround: detect the replacement character \uFFFD in the decoded string: frost-beta/llm.js@6e816b0.
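The idea behind that workaround can be sketched as follows — buffer tokens while the decoded text still contains U+FFFD, and only flush once the accumulated tokens decode cleanly. This is a hypothetical illustration, not the commit's code; `decode` stands in for `tokenizer.decode`:

```javascript
// Yields decoded text chunks, holding back tokens whose bytes form an
// incomplete UTF-8 sequence (detected via the replacement character).
function* streamDecode(tokens, decode) {
  let pending = [];
  for (const token of tokens) {
    pending.push(token);
    const text = decode(pending);
    // An incomplete multi-byte character decodes to U+FFFD: keep buffering.
    if (text.includes('\uFFFD')) continue;
    pending = [];
    yield text;
  }
  // Flush whatever remains (may still contain U+FFFD if input is truncated).
  if (pending.length > 0) yield decode(pending);
}
```

One caveat: a model could, in principle, legitimately emit U+FFFD itself, in which case this heuristic would buffer until the end of the stream.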
