Streaming API for transcoding? #348

Closed
kovidgoyal opened this issue Nov 14, 2023 · 5 comments · Fixed by #349

Comments

@kovidgoyal

kovidgoyal commented Nov 14, 2023

As far as I can tell, there is no way to stream transcode. I am interested in transcoding a stream of UTF-8 bytes into UTF-32. I don't see how to handle an incomplete character (a truncated UTF-8 sequence) at the end of a chunk of UTF-8 bytes. The simdutf::error struct returns the position of the first error and the error type, but it does not return how many output characters had already been written when the error was encountered. This means I need to:

  1. Figure out how many incomplete bytes are at the end of the stream
  2. Call the transcoding function again with the full buffer minus the trailing incomplete bytes

Step 2 in particular is quite inefficient for large buffers. If the transcoding functions returned both the position of the bad input byte and the number of output characters written, step 2 could be avoided.

Apologies if I am missing something in the API.
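
Step 1 of the workaround above can be sketched as a scan over at most the last three bytes of the buffer. This is a minimal illustration only: `incomplete_utf8_tail` is a hypothetical helper, not a simdutf function, and it assumes the rest of the buffer will be validated separately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of simdutf): returns how many bytes at the
// end of buf[0..len) belong to a multi-byte UTF-8 sequence that was cut off.
// Only the last three bytes are inspected.
inline std::size_t incomplete_utf8_tail(const char* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    for (std::size_t back = 1; back <= 3 && back <= len; ++back) {
        const auto b = static_cast<std::uint8_t>(buf[len - back]);
        if ((b & 0xC0) == 0x80) continue;       // continuation byte, keep walking back
        std::size_t need = 0;                   // sequence length announced by the lead byte
        if ((b & 0xE0) == 0xC0) need = 2;
        else if ((b & 0xF0) == 0xE0) need = 3;
        else if ((b & 0xF8) == 0xF0) need = 4;
        else return 0;                          // ASCII or invalid lead: nothing to trim
        return back < need ? back : 0;          // truncated only if bytes are missing
    }
    return 0;  // empty input or three continuation bytes: leave it to the validator
}
```

The bytes to transcode in step 2 would then be `len - incomplete_utf8_tail(buf, len)`, with the remainder carried over to the next chunk.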

@lemire
Member

lemire commented Nov 14, 2023

The functions that return an error are not meant to process truncated inputs; they are meant to help identify genuine errors. If you simply have a truncated input, the simplest strategy is to roll it back to the previous complete character.

Please see #349

I have added convenience functions.
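
The roll-back strategy described above can be sketched as a small streaming wrapper that transcodes only the complete prefix of each chunk and carries the unfinished tail over to the next one. Everything here is an assumption made for illustration: `complete_prefix_len`, `decode_valid_utf8`, and `Utf8Stream` are hypothetical names, the scalar decoder merely stands in for a bulk transcoder such as simdutf's `convert_utf8_to_utf32`, and the input is assumed to be valid UTF-8 apart from the cut at the chunk boundary. The exact functions added in #349 are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: length of the longest prefix of buf[0..len) that does
// not end in the middle of a multi-byte UTF-8 sequence (valid input assumed).
inline std::size_t complete_prefix_len(const char* buf, std::size_t len) {
    for (std::size_t back = 1; back <= 3 && back <= len; ++back) {
        const auto b = static_cast<std::uint8_t>(buf[len - back]);
        if ((b & 0xC0) == 0x80) continue;  // continuation byte, keep walking back
        const std::size_t need = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : b >= 0xC0 ? 2 : 1;
        return back < need ? len - back : len;
    }
    return len;
}

// Scalar stand-in for a bulk transcoder; assumes valid, complete UTF-8.
inline void decode_valid_utf8(const char* p, std::size_t n, std::u32string& out) {
    std::size_t i = 0;
    while (i < n) {
        const auto b = static_cast<std::uint8_t>(p[i]);
        const std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        char32_t cp = len == 1 ? b : b & (0x3F >> (len - 1));  // payload bits of the lead byte
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<std::uint8_t>(p[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
}

// Hypothetical streaming wrapper: buffers the chunk, decodes the complete
// prefix, and keeps at most three trailing bytes for the next call.
// A production version would avoid copying the whole chunk.
class Utf8Stream {
    std::string pending_;
public:
    void feed(const char* chunk, std::size_t n, std::u32string& out) {
        assert(chunk != nullptr || n == 0);
        pending_.append(chunk, n);
        const std::size_t keep = complete_prefix_len(pending_.data(), pending_.size());
        decode_valid_utf8(pending_.data(), keep, out);
        pending_.erase(0, keep);
    }
    std::size_t pending() const { return pending_.size(); }
};
```

Feeding the three bytes `a`, `0xE2`, `0x82` followed by a second chunk `0xAC`, `b` yields the code points for `a`, `€`, and `b`, with nothing left pending after the second call.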

@kovidgoyal
Author

Yes, pre-trimming is also an option. However, in my use case a truncated final character should be rare, so I was hoping to avoid the cost of reading the last 3 bytes on every chunk. Are you saying that I cannot rely on the function returning an error in case of a truncated last char? My potential algorithm was: run the SIMD transcode function and, if there is a TOO_SHORT error, trim the last char and re-run.

Thanks for the convenience function.

@lemire
Member

lemire commented Nov 14, 2023

> I was hoping to avoid the cost of reading the last 3 bytes on every chunk.

It is very cheap unless you are processing tiny pieces, in which case, you have other performance constraints.

> Are you saying that I cannot rely on the function returning an error in case of truncated last char?

It will return an error. The error should be TOO_SHORT, as you indicated.

> My potential algorithm was run the simd transcode function, if there is a TOO_SHORT error trim the last char and re-run.

That's likely slower than what I propose if you expect that the input is valid.

@kovidgoyal
Author

> That's likely slower than what I propose if you expect that the input is valid.

I don't know if the input is valid; it could be invalid and have a trailing truncated char as well. Therefore, I have to use the function that returns an error anyway. Thus, given that a trailing truncated char is rare, pre-trimming cannot be faster. But anyway, that's easily checked by measuring.

I was just hoping to convince you to add a field to the return struct indicating how much was output before the error was encountered. But I can work with the existing API: given that a truncated trailer is rare for me, transcode, check the error return, strip the trailer, and re-transcode should not be too bad. I will benchmark and see.
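
The return shape requested here could look roughly like the following sketch, in which a scalar transcoder reports both how many input bytes were consumed and how many UTF-32 units were written, so the caller can resume without a second pass. This is not simdutf's API: `Status`, `StreamResult`, and `utf8_to_utf32_partial` are hypothetical names, and the sketch assumes that `TooShort` is reported only when the input ends in the middle of a sequence, which may differ from how simdutf defines TOO_SHORT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Status { Ok, TooShort, Invalid };  // hypothetical error codes

struct StreamResult {       // hypothetical: the two-field shape asked for above
    Status status;
    std::size_t consumed;   // input bytes transcoded (offset of the bad sequence on error)
    std::size_t written;    // char32_t units written to the output
};

// Scalar sketch; `out` must have room for at least `len` code points.
inline StreamResult utf8_to_utf32_partial(const char* in, std::size_t len, char32_t* out) {
    assert(out != nullptr);
    std::size_t i = 0, w = 0;
    while (i < len) {
        const auto b0 = static_cast<std::uint8_t>(in[i]);
        std::size_t n;
        char32_t cp, min;  // min = smallest code point allowed for this length
        if (b0 < 0x80)                { n = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { n = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; min = 0x10000; }
        else return {Status::Invalid, i, w};
        for (std::size_t k = 1; k < n; ++k) {
            if (i + k >= len) return {Status::TooShort, i, w};  // input ends mid-sequence
            const auto b = static_cast<std::uint8_t>(in[i + k]);
            if ((b & 0xC0) != 0x80) return {Status::Invalid, i, w};
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, out-of-range values, and surrogates.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return {Status::Invalid, i, w};
        out[w++] = cp;
        i += n;
    }
    return {Status::Ok, i, w};
}
```

On a `TooShort` result the caller would keep the `written` output as is and prepend the bytes from offset `consumed` onward to the next chunk, which removes the re-transcode step.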

@lemire
Member

lemire commented Nov 14, 2023

@kovidgoyal Yes, you want support for decoding invalid inputs. It is not something simdutf supports at this time. The assumption is that the input is valid. We'll extend the library at some point in the future to include support for invalid inputs.
