Validation with Incremental I/O? #361

DavidBuchanan314 · 2024-01-09T16:44:43Z

I would like to be able to validate large UTF-8 strings, which do not fit in memory.

The API I'm looking for might look like this, conceptually:

state_t validate_utf8_incremental(state_t state, uint8_t *buf, size_t len);

It would be used something like this:

state_t state = INITIAL_STATE;
uint8_t buf[0x10000];
for (;;) {
    size_t readlen = fread(buf, 1, sizeof(buf), stdin);
    state = validate_utf8_incremental(state, buf, readlen);
    if (readlen < sizeof(buf)) break;
}
if (feof(stdin) && IS_VALID_STATE(state)) {
    printf("success\n");
}

Is this something that simdutf can already support? My perusal of the docs indicates "no", but perhaps it's something that can safely be constructed from validate_utf8_with_errors, by checking for the TOO_SHORT condition, looking at the count, and keeping track of the "remainder" bytes, ready to be prepended to the buffer on next iteration?

Edit: It looks like trim_partial_utf8 and/or sutf's find_last_leading_byte might be able to help me.

DavidBuchanan314 · 2024-01-09T18:56:51Z

I think I've solved my problem, using trim_partial_utf8: https://gist.github.com/DavidBuchanan314/19941d1c9f7182cf2f5189bf8edbd00c

The method I came up with requires a slightly janky API (see code comments).

Although my immediate issue is solved, it would be nice if this functionality could be integrated directly into the library, perhaps with a slightly cleaner API. I'll leave this issue open for consideration, but of course, feel free to close it if you don't think it's worth integrating.

lemire · 2024-01-09T19:06:54Z

You don't need a non-trivial state variable. A simple integer (buffer_offset below) storing the size of a potential truncated character, combined with simdutf::trim_partial_utf8 is enough.

You can do it in this manner (untested, but conceptually correct):

char buf[0x10000]; // assumed declaration; simdutf's UTF-8 functions take const char*
size_t buffer_offset = 0; // buf[0..buffer_offset) holds a truncated UTF-8 character carried over from the previous chunk (0 to 3 bytes)
for (;;) {
    // we try to read between sizeof(buf) - 3 and sizeof(buf) bytes (depending on buffer_offset)
    size_t readlen = fread(buf + buffer_offset, 1, sizeof(buf) - buffer_offset, stdin);
    // our content is what we read (readlen) plus the 0 to 3 bytes we collected before (buffer_offset)
    size_t contentlen = readlen + buffer_offset;
    // it is possible that we end with a partial UTF-8 sequence, so compute a short length
    size_t short_length = simdutf::trim_partial_utf8(buf, contentlen);
    // the sequence from 0 to short_length should be valid UTF-8, followed possibly by a truncated character
    bool validutf8 = simdutf::validate_utf8(buf, short_length);
    // do something with validutf8
    // we move the truncated character to the beginning of the buffer
    buffer_offset = contentlen - short_length;
    for (size_t i = 0; i < buffer_offset; i++) { buf[i] = buf[i + short_length]; } // copy between 0 and 3 bytes
    // a short chunk means we reached the end of the input
    if (contentlen < sizeof(buf)) break;
}
// we may end with a truncated character...
if (buffer_offset > 0) {
    std::cerr << "invalid UTF-8" << std::endl;
}
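
To make the trimming step concrete, here is a minimal sketch of what a function like simdutf::trim_partial_utf8 has to do. The name trim_partial is hypothetical and this is not the library's implementation; it is an illustration that assumes only the last three bytes need to be inspected, since a UTF-8 character is at most four bytes long. It does not validate anything, it only decides where to cut.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a "trim partial UTF-8" helper (illustrative only).
// Returns the length of the longest prefix that does not end in the middle of
// a multi-byte sequence. Validation of that prefix is a separate step.
size_t trim_partial(const char *input, size_t length) {
  // Read the byte that sits `back` positions before the end of the input.
  auto byte = [&](size_t back) {
    return static_cast<uint8_t>(input[length - back]);
  };
  // Last byte is a leading byte (0xC0 or above): its continuations are missing.
  if (length >= 1 && byte(1) >= 0xC0) return length - 1;
  // Second-to-last byte starts a 3- or 4-byte character: only one continuation present.
  if (length >= 2 && byte(2) >= 0xE0) return length - 2;
  // Third-to-last byte starts a 4-byte character: only two continuations present.
  if (length >= 3 && byte(3) >= 0xF0) return length - 3;
  return length;
}
```

The number of bytes carried over to the next chunk is then the input length minus the returned length, which is the role buffer_offset plays in the loop above.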

lemire · 2024-01-09T19:08:21Z

@DavidBuchanan314 Looks like what you wrote is conceptually equivalent to what I proposed.

lemire · 2024-01-09T19:13:28Z

> it would be nice if this functionality could be integrated directly into the library, perhaps with a slightly cleaner API

Always open to new ideas. My code above is about 4 extra lines compared to your initial wish. And it is not too hard to make it go away (as you did in your gist).

If you do end up with a design that you like and it works for you and it is solid and elegant, feel free to issue a pull request!

DavidBuchanan314 · 2024-01-09T19:14:49Z

Thank you, I will close this issue for now, but I will consider making a PR if I can think of something particularly elegant.

lemire · 2024-01-09T19:19:39Z

@DavidBuchanan314 To elaborate further, I am concerned about making up APIs and interfaces that I like... fooling myself into thinking that they solve the right problems.

There is definitely more hard work that will go into simdutf, but I much prefer to rely on actual users as far as the design of the interface goes. That's why it is super minimalistic right now. Just a bunch of functions, little abstraction.

It is not that I am opposed to having nice abstractions, it is that I think such things need to be validated in practice.

DavidBuchanan314 · 2024-01-09T21:50:03Z

I did eventually come up with what I think is a clean API: https://gist.github.com/DavidBuchanan314/798db4d18ed264920b9afd1c71b7f8bd

It's the same API I initially described, relying on a simple byte-wise state machine validator to paper over the cracks, i.e. handling the boundary conditions. The (relatively) slow state machine is invoked a maximum of 6 times, and always 0 times for ASCII-only input, so all the performance of simdutf remains.
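
For readers who want to see what a byte-wise state machine with this kind of interface could look like, here is a minimal sketch. It is not the code from the gist: the names Utf8State and is_valid_state are hypothetical, and for simplicity it runs the state machine over every byte, whereas the approach described above would hand the bulk of each chunk to simdutf and use the state machine only near chunk boundaries. The accepted byte ranges follow the well-formed UTF-8 byte sequences defined by the Unicode standard.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical state carried between chunks (names are illustrative).
struct Utf8State {
  uint8_t remaining = 0; // continuation bytes still expected
  uint8_t lo = 0x80;     // allowed range for the next continuation byte
  uint8_t hi = 0xBF;
  bool invalid = false;  // sticky error flag
};

// Feed one chunk through the state machine and return the updated state.
Utf8State validate_utf8_incremental(Utf8State s, const uint8_t *buf, size_t len) {
  for (size_t i = 0; i < len && !s.invalid; i++) {
    const uint8_t b = buf[i];
    if (s.remaining > 0) {
      if (b < s.lo || b > s.hi) { s.invalid = true; break; }
      s.remaining--;
      s.lo = 0x80; // only the first continuation byte has a restricted range
      s.hi = 0xBF;
    } else if (b <= 0x7F) {
      // ASCII: nothing to track
    } else if (b >= 0xC2 && b <= 0xDF) {
      s.remaining = 1; // 0xC0 and 0xC1 would be overlong, so they are excluded
    } else if (b >= 0xE0 && b <= 0xEF) {
      s.remaining = 2;
      if (b == 0xE0) s.lo = 0xA0; // reject overlong 3-byte forms
      if (b == 0xED) s.hi = 0x9F; // reject UTF-16 surrogates
    } else if (b >= 0xF0 && b <= 0xF4) {
      s.remaining = 3;
      if (b == 0xF0) s.lo = 0x90; // reject overlong 4-byte forms
      if (b == 0xF4) s.hi = 0x8F; // reject code points above U+10FFFF
    } else {
      s.invalid = true; // stray continuation byte or invalid leading byte
    }
  }
  return s;
}

// The input seen so far is valid only if no error occurred and no character is pending.
bool is_valid_state(const Utf8State &s) { return !s.invalid && s.remaining == 0; }
```

Starting from a default-constructed Utf8State, each chunk is passed through validate_utf8_incremental in order, and is_valid_state is checked once at end of input, mirroring the usage loop in the first comment.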

However, this code currently exists "in a vacuum" - I haven't written the code that depends on it yet! (the large-file validation was just an example, my real use case is more complex) I appreciate what you're saying about making sure the APIs are useful in practice, so I'll circle back if I find that it's a genuinely useful abstraction.

DavidBuchanan314 closed this as completed Jan 9, 2024
