Running the above script with Node produces the following result:
Correct string: A. 单发
Correct bytes: <Buffer 41 2e 20 e5 8d 95 e5 8f 91>
Wrong string: A. ��发
Wrong bytes: <Buffer 41 2e 20 ef bf bd ef bf bd e5 8f 91>
I expect the bytes to be the same whether the tokens are decoded in one call, or decoded one by one.
This is probably the intended behavior, since a single token may decode to only part of a Unicode character. However, this behavior makes it impossible to implement a correct streaming interface for LLMs, which is what I'm doing in my llm.js module.
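A common workaround is to buffer token ids and hold back output until the decoded text no longer ends in the replacement character U+FFFD. Below is a minimal sketch of that idea, not the library's API: `createStreamDecoder`, `vocab`, and `fakeDecode` are hypothetical names, and `fakeDecode` is a stand-in that maps token ids to raw bytes so the example runs without a real tokenizer. It assumes that decoding a prefix of the ids yields a prefix of the full text.

```javascript
// Hypothetical helper (not part of @xenova/transformers): wraps any
// decode(ids) -> string function and withholds incomplete characters.
function createStreamDecoder(decode) {
  const ids = [];   // every token id seen so far
  let emitted = ""; // text already handed to the caller

  return function push(id) {
    ids.push(id);
    const text = decode(ids);
    // A trailing U+FFFD suggests the last character is still incomplete,
    // so wait for more tokens before emitting anything.
    if (text.endsWith("\uFFFD")) return "";
    const delta = text.slice(emitted.length);
    emitted = text;
    return delta;
  };
}

// Stand-in for a tokenizer (assumption): each "token" is a byte sequence,
// split so that 单 (e5 8d 95) spans two tokens, as in the output above.
const vocab = [[0x41], [0x2e], [0x20], [0xe5, 0x8d], [0x95], [0xe5, 0x8f, 0x91]];
const fakeDecode = (ids) =>
  new TextDecoder().decode(Uint8Array.from(ids.flatMap((i) => vocab[i])));

const push = createStreamDecoder(fakeDecode);
const chunks = [0, 1, 2, 3, 4, 5].map((id) => push(id));
const streamed = chunks.join("");
console.log(chunks);
console.log(streamed);
```

With a real tokenizer, `decode` would be something like `(ids) => tokenizer.decode(ids)`. The sketch re-decodes the whole sequence on every token, which is simple but quadratic, and it would stall if the model genuinely emitted U+FFFD, so treat it as a starting point rather than a complete solution.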
System Info
Node.js 22.4.0
@xenova/transformers 2.17.2
Environment/Platform
Description
When tokens that represent a multi-byte string are decoded one by one, the result is wrong.
Reproduction