
Result is wrong when decoding tokens one by one #853

Open
1 of 5 tasks
zcbenz opened this issue Jul 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments

zcbenz commented Jul 18, 2024

System Info

Node.js 22.4.0
@xenova/transformers 2.17.2

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

When decoding tokens that represent a multi-byte string, the result is wrong if the tokens are decoded one by one.

import {StringDecoder} from 'node:string_decoder'
import {AutoTokenizer} from '@xenova/transformers'

const tokenizer = await AutoTokenizer.from_pretrained('Qwen/Qwen2-0.5B')

const tokens = [32, 13, 66521, 243, 28291]
console.log('Correct string:', tokenizer.decode(tokens))
console.log('Correct bytes:', Buffer.from(tokenizer.decode(tokens)))

const decoder = new StringDecoder('utf8')
let allBytes = []
process.stdout.write('\nWrong string: ')
for (const token of tokens) {
  const bytes = Buffer.from(tokenizer.decode([token]))
  allBytes.push(bytes)
  process.stdout.write(decoder.write(bytes))
}
process.stdout.write('\n')
console.log('Wrong bytes:', Buffer.concat(allBytes))

Reproduction

Running the above script with Node produces:

Correct string: A. 单发
Correct bytes: <Buffer 41 2e 20 e5 8d 95 e5 8f 91>

Wrong string: A. ��发
Wrong bytes: <Buffer 41 2e 20 ef bf bd ef bf bd e5 8f 91>

I expect the bytes to be the same whether the tokens are decoded in one call, or decoded one by one.

This is probably the intended result, since a single token may decode to a partial Unicode character. However, this behavior makes it impossible to implement a correct streaming interface for LLMs, which I'm doing in my llm.js module.

@zcbenz zcbenz added the bug Something isn't working label Jul 18, 2024
zcbenz commented Jul 18, 2024

I have found a workaround: detect the replacement character \uFFFD in the decoded string: frost-beta/llm.js@6e816b0.
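The idea behind that workaround can be sketched as follows — buffer tokens while the decoded text still contains U+FFFD, and only flush once the accumulated tokens decode cleanly. This is a hypothetical illustration, not the commit's code; `decode` stands in for `tokenizer.decode`:

```javascript
// Yields decoded text chunks, holding back tokens whose bytes form an
// incomplete UTF-8 sequence (detected via the replacement character).
function* streamDecode(tokens, decode) {
  let pending = [];
  for (const token of tokens) {
    pending.push(token);
    const text = decode(pending);
    // An incomplete multi-byte character decodes to U+FFFD: keep buffering.
    if (text.includes('\uFFFD')) continue;
    pending = [];
    yield text;
  }
  // Flush whatever remains (may still contain U+FFFD if input is truncated).
  if (pending.length > 0) yield decode(pending);
}
```

One caveat: a model could, in principle, legitimately emit U+FFFD itself, in which case this heuristic would buffer until the end of the stream.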
