Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextDecoderStream: empty Uint8Array should result in an empty string #283

Closed
crowlKats opened this issue Mar 1, 2022 · 4 comments
Closed

Comments

@crowlKats
Copy link

Currently the specification states:

If outputChunk is non-empty, then enqueue outputChunk in decoder’s transform.

Lets say I have a Uint8Array based transform stream that splits by line, and we have an empty line (which means the transformstream will queue an empty Uint8Array), TextDecoderStream will make that empty line disappear into thin air, which is not great.

This argument also goes for TextEncoderStream discarding an empty Uint8Array.

@annevk
Copy link
Member

annevk commented Mar 2, 2022

cc @ricea @MattiasBuelens

@MattiasBuelens
Copy link

MattiasBuelens commented Mar 2, 2022

I think this is intentional. Readable byte streams also don't allow enqueuing empty chunks, and it's likely that we'll want to make TextEncoderStream.readable a proper byte stream in the future:

const rs = new ReadableStream({
  type: "bytes",
  start(controller) {
    controller.enqueue(new Uint8Array(0)); // throws
  }
});

We also don't want to enqueue empty strings if we're in the middle of a multi-byte character:

const { readable, writable } = new TextDecoderStream("utf-8");
const reader = readable.getReader();
const writer = writable.getWriter();
const readPromise = reader.read();
writer.write(new Uint8Array([0xF0, 0x9F, 0x99]));
// readPromise is still pending
writer.write(new Uint8Array([0x82]))
const { done, value } = await readPromise;
// -> value == "🙂"

I'm a bit surprised by Deno's LineStream design. I would expect a transform stream that splits text by line delimiters to accept strings as input and produce strings as output. Instead, it looks like it uses raw byte chunks as both input and output?

That means that LineStream is making an assumption about the text encoding, right? How exactly is that supposed to deal with multi-byte text encodings like utf-16? For example:

new TextDecoder("utf-16").decode(new Uint8Array([0x41, 0x00, 0x0A, 0x00, 0x42, 0x00]));
// -> "A\nB"

I would expect you first run these chunks through a TextDecoderStream, and then split by line delimiters:

const readable = new ReadableStream({
  start(controller) {
    controller.enqueue(new Uint8Array([0x41, 0x00, 0x0A, 0x00, 0x42, 0x00]));
    controller.close();
  }
});

readable
  .pipeThrough(new TextDecoderStream("utf-16"))
  .pipeThrough(new LineStream());
// -> stream with chunks "A" and "B"

Instead, with Deno's current LineStream, you get garbage:

readable
  .pipeThrough(new LineStream())
  .pipeThrough(new TextDecoderStream("utf-16"));
// -> stream with chunks "A", "䈀" and "�"

@crowlKats
Copy link
Author

I agree with the byte stream, that definitively would be great. I didn't take that concept into consideration when opening this issue, and as such I guess this issue is redundant.
One reason for LineStream to be Uint8array based was to align with a utility we have that does the equivalent but with Deno's Deno.Reader & Deno.Writer: so to have a 1-to-1 mapping & people effortlessly change over to the WHATWG streams based one; but looking back on it with this issue in mind, I see that might have been a mistake.

@crowlKats
Copy link
Author

Closing as nothing is wrong with the spec, but rather a badly implemented utility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants