Validation with Incremental I/O? #361
Comments
I think I've solved my problem, using The method I came up with requires a slightly janky API (see code comments). Although my immediate issue is solved, it would be nice if this functionality could be integrated directly into the library, perhaps with a slightly cleaner API. I'll leave this issue open for consideration, but of course, feel free to close it if you don't think it's worth integrating. |
You don't need a non-trivial state variable. A simple integer ( You can do it in this manner (untested, but conceptually correct):

    size_t buffer_offset = 0; // bytes 0 to buffer_offset will contain trailing UTF-8 content (between 0 and 3 bytes)
    for (;;) {
        // we try to read between sizeof(buf) and sizeof(buf) - 3 bytes (depending on buffer_offset)
        size_t readlen = fread(buf + buffer_offset, 1, sizeof(buf) - buffer_offset, stdin);
        // our content is what we read (readlen) plus the 0 to 3 bytes we collected before (buffer_offset)
        size_t contentlen = readlen + buffer_offset;
        // it is possible that we end with a partial UTF-8 sequence, so compute a short length
        size_t short_length = simdutf::trim_partial_utf8(buf, contentlen);
        // the sequence from 0 to short_length should be valid UTF-8, followed possibly by a truncated character
        bool validutf8 = simdutf::validate_utf8(buf, short_length);
        // do something with validutf8
        // we move the truncated character at the beginning
        buffer_offset = contentlen - short_length;
        for (size_t i = 0; i < buffer_offset; i++) { buf[i] = buf[i + short_length]; } // copy between 0 and 3 bytes
        if (contentlen < sizeof(buf)) break;
    }
    // we may end with a truncated character...
    if (buffer_offset > 0) {
        std::cerr << "invalid UTF-8" << std::endl;
    }
|
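For readers wondering what the trimming step in the loop above has to compute, here is a minimal scalar sketch. It is not simdutf's implementation: `complete_prefix_length` is a hypothetical name, and the only assumption is that a truncated character can leave at most 3 bytes at the end of a chunk, so only the last 3 bytes need inspecting.

```cpp
#include <cstddef>

// Hypothetical helper (an illustration, not simdutf's code): returns the length
// of the longest prefix of buf that does not end in a truncated UTF-8 character.
// It does NOT validate the prefix; it only finds a safe place to cut.
std::size_t complete_prefix_length(const char* buf, std::size_t len) {
    // Walk backwards over at most 3 bytes looking for the last non-continuation byte.
    for (std::size_t back = 1; back <= 3 && back <= len; back++) {
        unsigned char b = static_cast<unsigned char>(buf[len - back]);
        if ((b & 0xC0) == 0x80) continue;  // continuation byte (10xxxxxx): keep scanning
        // b starts a character; work out how many bytes that character claims.
        std::size_t need = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : b >= 0xC0 ? 2 : 1;
        // If it claims more bytes than are present, cut just before it.
        return need > back ? len - back : len;
    }
    // No lead byte in the last 3 bytes: nothing to hold back here,
    // the validator decides whether the content is well formed.
    return len;
}
```

The bytes from the returned length up to `len` are the ones carried over to the start of the buffer for the next read; everything before that can be handed to the validator as a whole.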
@DavidBuchanan314 Looks like what you wrote is conceptually equivalent to what I proposed. |
Always open to new ideas. My code above is about 4 extra lines compared to your initial wish. And it is not too hard to make them go away (as you did in your gist). If you do end up with a design that you like, that works for you, and that is solid and elegant, feel free to issue a pull request! |
Thank you, I will close this issue for now, but I will consider making a PR if I can think of something particularly elegant. |
@DavidBuchanan314 To elaborate further, I am concerned about making up APIs and interfaces that I like... fooling myself into thinking that they solve the right problems. There is definitely more hard work that will go into simdutf, but I much prefer to rely on actual users as far as the design of the interface goes. That's why it is super minimalistic right now. Just a bunch of functions, little abstraction. It is not that I am opposed to having nice abstractions, it is that I think such things need to be validated in practice. |
I did eventually come up with what I think is a clean API https://gist.github.com/DavidBuchanan314/798db4d18ed264920b9afd1c71b7f8bd It's the same API I initially described, relying on a simple byte-wise state machine validator to paper over the cracks, i.e. handling the boundary conditions. The (relatively) slow state machine is invoked a maximum of 6 times, and always 0 times for ASCII-only input, so all the performance of simdutf remains. However, this code currently exists "in a vacuum" - I haven't written the code that depends on it yet! (the large-file validation was just an example, my real use case is more complex) I appreciate what you're saying about making sure the APIs are useful in practice, so I'll circle back if I find that it's a genuinely useful abstraction. |
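To illustrate the "byte-wise state machine" idea mentioned above, here is a minimal sketch of a validator whose state survives chunk boundaries. This is not the code from the gist: `Utf8StreamValidator`, `feed` and `finish` are hypothetical names, and the sketch runs the slow path on every byte, whereas the design described in the comment would only use it around the chunk edges and leave the bulk to simdutf.

```cpp
#include <cstddef>

// Hypothetical streaming validator (illustration only, not the gist's code).
// State carried between chunks: how many continuation bytes are still owed,
// and the allowed range for the next one.
struct Utf8StreamValidator {
    int remaining = 0;        // continuation bytes still owed by the current character
    unsigned char lo = 0x80;  // lower bound for the next continuation byte
    unsigned char hi = 0xBF;  // upper bound for the next continuation byte
    bool ok = true;           // false once the stream is known to be invalid

    // Feed one chunk; returns false as soon as an invalid byte has been seen.
    bool feed(const char* data, std::size_t len) {
        for (std::size_t i = 0; ok && i < len; i++) {
            unsigned char b = static_cast<unsigned char>(data[i]);
            if (remaining > 0) {
                if (b < lo || b > hi) { ok = false; break; }
                remaining--;
                lo = 0x80; hi = 0xBF;          // only the first continuation byte is restricted
            } else if (b < 0x80) {
                // ASCII: nothing to track
            } else if (b >= 0xC2 && b <= 0xDF) {
                remaining = 1;                 // 2-byte character (0xC0/0xC1 would be overlong)
            } else if (b >= 0xE0 && b <= 0xEF) {
                remaining = 2;                 // 3-byte character
                if (b == 0xE0) lo = 0xA0;      // reject overlong encodings
                if (b == 0xED) hi = 0x9F;      // reject UTF-16 surrogates
            } else if (b >= 0xF0 && b <= 0xF4) {
                remaining = 3;                 // 4-byte character
                if (b == 0xF0) lo = 0x90;      // reject overlong encodings
                if (b == 0xF4) hi = 0x8F;      // reject code points above U+10FFFF
            } else {
                ok = false;                    // stray continuation byte or invalid lead byte
            }
        }
        return ok;
    }

    // Call once after the last chunk: a character left unfinished is an error.
    bool finish() const { return ok && remaining == 0; }
};
```

Usage would be one `feed` call per chunk read from the file, followed by a single `finish` at end of input; a character split across two chunks is accepted because `remaining`, `lo` and `hi` persist between calls.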
I would like to be able to validate large UTF-8 strings, which do not fit in memory.
The API I'm looking for might look like this, conceptually:
It would be used something like this:
Is this something that simdutf can already support? My perusal of the docs indicates "no", but perhaps it's something that can safely be constructed from validate_utf8_with_errors, by checking for the TOO_SHORT condition, looking at the count, and keeping track of the "remainder" bytes, ready to be prepended to the buffer on next iteration?
Edit: It looks like trim_partial_utf8 and/or sutf's find_last_leading_byte might be able to help me.