Streaming API for transcoding? #348

kovidgoyal · 2023-11-14T06:14:28Z

As far as I can tell, there is no way to stream transcode. I am interested in transcoding a stream of UTF-8 bytes into UTF-32. I don't see how to handle an incomplete (truncated) character at the end of the stream of UTF-8 bytes. The simdutf::error struct returns the position of the first error and the error type, but it does not return how many output chars had already been written when the error was encountered. This means I need to:

1. Figure out how many incomplete bytes are at the end of the stream
2. Call the transcoding function again with the full buffer minus the trailing incomplete bytes

Step (2) in particular is quite inefficient for large buffer sizes. If the transcoding functions returned both the position of the bad input byte and the number of output characters written, step (2) could be avoided.

Apologies if I am missing something in the API.

lemire · 2023-11-14T16:25:02Z

The functions that return an error are not meant to process truncated inputs; they are meant to help identify genuine errors. If you simply have a truncated input, the simplest strategy is to roll it back to the previous complete character.

Please see #349

I have added convenience functions.
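
The rollback strategy described in this comment can be sketched as follows. This is a minimal illustration, not the code from #349: the function name `complete_prefix_length` is hypothetical, and the sketch assumes the rest of the input is valid UTF-8. (As far as I know, current simdutf releases expose `simdutf::trim_partial_utf8` for this purpose.)

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, for illustration only: returns the length of the
// longest prefix of `buf` that does not end in a partial UTF-8 character.
// At most the last three bytes are inspected.
std::size_t complete_prefix_length(const char* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    for (std::size_t back = 1; back <= 3 && back <= len; ++back) {
        unsigned char b = static_cast<unsigned char>(buf[len - back]);
        if ((b & 0xC0) == 0x80) continue;  // continuation byte: keep walking back
        std::size_t need = 1;              // ASCII (or an invalid lead byte)
        if ((b & 0xE0) == 0xC0) need = 2;
        else if ((b & 0xF0) == 0xE0) need = 3;
        else if ((b & 0xF8) == 0xF0) need = 4;
        // The character starting here is partial if it needs more bytes
        // than remain in the buffer.
        return need > back ? len - back : len;
    }
    // No lead byte among the last three bytes: nothing to trim here
    // (a complete 4-byte character, or invalid input left to the validator).
    return len;
}
```

The bytes past the returned length would be held back and prepended to the next chunk.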

kovidgoyal · 2023-11-14T16:41:24Z

Yes, pre-trimming is also an option. However, in my use case a truncated final character should be rare, so I was hoping to avoid the cost of reading the last 3 bytes on every chunk. Are you saying that I cannot rely on the function returning an error in the case of a truncated last char? My potential algorithm was: run the SIMD transcode function and, if there is a TOO_SHORT error, trim the last char and re-run.

Thanks for the convenience function.

lemire · 2023-11-14T16:53:51Z

> I was hoping to avoid the cost of reading the last 3 bytes on every chunk.

It is very cheap unless you are processing tiny pieces, in which case, you have other performance constraints.

> Are you saying that I cannot rely on the function returning an error in case of truncated last char?

It will return an error. The error should be TOO_SHORT, as you indicated.

> My potential algorithm was run the simd transcode function, if there is a TOO_SHORT error trim the last char and re-run.

That's likely slower than what I propose if you expect that the input is valid.
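
For reference, the retry approach being compared here could look roughly like the sketch below. The `Err` and `Result` types are hypothetical stand-ins for an "error code plus count" result, and `transcode` is any callable with that shape; none of these names are simdutf's. The sketch assumes that on failure `count` holds the input position of the error.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: on failure `count` is the input position of the
// error, on success it is the number of code points written.
enum class Err { Success, TooShort, Other };
struct Result { Err error; std::size_t count; };

// Transcode once; if the input ends in a partial character, transcode the
// complete prefix a second time. `leftover` receives the number of trailing
// bytes the caller should carry over to the next chunk.
template <class Transcode>
Result transcode_with_retry(Transcode transcode, const char* in, std::size_t len,
                            char32_t* out, std::size_t& leftover) {
    leftover = 0;
    Result r = transcode(in, len, out);
    if (r.error == Err::TooShort) {
        assert(r.count <= len);
        // Only an error within the last three bytes can be a truncated tail.
        if (len - r.count <= 3) {
            leftover = len - r.count;
            r = transcode(in, r.count, out);  // second full pass over the prefix
        }
    }
    return r;
}
```

The second call repeats all the work of the first, which is the cost weighed against pre-trimming. Any bytes still left over when the stream ends would have to be reported as an error.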

kovidgoyal · 2023-11-14T16:59:51Z

> That's likely slower than what I propose if you expect that the input is valid.

I don't know if the input is valid; it could be invalid and have a trailing truncated char as well. Therefore, I have to use the function that returns an error anyway. Thus, given that a trailing truncated char is rare, pre-trimming cannot be faster. But anyway, that's easily checked by measuring.

I was just hoping to convince you to add a field to the return struct indicating how much was output before the error was encountered. But I can work with the existing API: given that a truncated trailer is rare for me, transcode, check the error return, strip the trailer, and re-transcode should not be too bad. I will benchmark and see.
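
To illustrate the kind of return value requested here, below is a scalar sketch whose result reports both the input bytes consumed and the code points written. It is not simdutf code: `Status`, `StreamResult`, `decode_utf8`, and `feed` are hypothetical names, and the decoder is a plain loop with no SIMD. In this sketch `TooShort` is reported only for a character cut off at the end of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Status { Ok, TooShort, Invalid };

// Hypothetical result type carrying both positions.
struct StreamResult {
    Status status;
    std::size_t consumed;  // input bytes fully decoded
    std::size_t written;   // code points appended to the output
};

StreamResult decode_utf8(const char* in, std::size_t len, std::vector<char32_t>& out) {
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t start = out.size();
    std::size_t pos = 0;
    while (pos < len) {
        unsigned char lead = static_cast<unsigned char>(in[pos]);
        std::size_t n;
        char32_t cp;
        if (lead < 0x80)                { n = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { n = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { n = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { n = 4; cp = lead & 0x07; }
        else return {Status::Invalid, pos, out.size() - start};
        for (std::size_t i = 1; i < n; ++i) {
            // Input ended in the middle of a character.
            if (pos + i == len) return {Status::TooShort, pos, out.size() - start};
            unsigned char c = static_cast<unsigned char>(in[pos + i]);
            if ((c & 0xC0) != 0x80) return {Status::Invalid, pos, out.size() - start};
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates, and out-of-range values.
        if (cp < min_cp[n] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return {Status::Invalid, pos, out.size() - start};
        out.push_back(cp);
        pos += n;
    }
    return {Status::Ok, pos, out.size() - start};
}

// Feeds one chunk. Bytes of a trailing partial character stay in `carry` and
// are decoded together with the next chunk, so nothing is transcoded twice.
bool feed(std::string& carry, const std::string& chunk, std::vector<char32_t>& out) {
    carry += chunk;
    StreamResult r = decode_utf8(carry.data(), carry.size(), out);
    if (r.status == Status::Invalid) return false;
    assert(carry.size() - r.consumed <= 3);  // a partial character is at most 3 bytes
    carry.erase(0, r.consumed);
    return true;
}
```

With both counts available, the caller keeps the output already produced and carries over only the unconsumed tail, instead of re-running the transcoder on the whole buffer.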

lemire · 2023-11-14T17:36:11Z

@kovidgoyal Yes, you want support for decoding invalid inputs. It is not something simdutf supports at this time. The assumption is that the input is valid. We'll extend the library at some point in the future to include support for invalid inputs.

lemire mentioned this issue Nov 14, 2023

provide convenience functions to trim partial characters at the end of a string. #349

Merged

lemire closed this as completed in #349 Nov 15, 2023
