Validation with Incremental I/O? #361

Closed
DavidBuchanan314 opened this issue Jan 9, 2024 · 7 comments

Comments

DavidBuchanan314 commented Jan 9, 2024

I would like to be able to validate large UTF-8 strings that do not fit in memory.

The API I'm looking for might look like this, conceptually:

state_t validate_utf8_incremental(state_t state, uint8_t *buf, size_t len);

It would be used something like this:

state_t state = INITIAL_STATE;
uint8_t buf[0x10000];
for (;;) {
    size_t readlen = fread(buf, 1, sizeof(buf), stdin);
    state = validate_utf8_incremental(state, buf, readlen);
    if (readlen < sizeof(buf)) break;
}
if (feof(stdin) && IS_VALID_STATE(state)) {
    printf("success\n");
}

Is this something that simdutf can already support? My perusal of the docs indicates "no", but perhaps it's something that can safely be constructed from validate_utf8_with_errors, by checking for the TOO_SHORT condition, looking at the count, and keeping track of the "remainder" bytes, ready to be prepended to the buffer on next iteration?

Edit: It looks like trim_partial_utf8 and/or sutf's find_last_leading_byte might be able to help me.
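
To make the "remainder" idea concrete, here is a minimal sketch of a helper that reports how many bytes at the end of a buffer belong to a character that has started but not finished. The name `incomplete_tail_length` is hypothetical: it is not part of simdutf or sutf, and it is not how `trim_partial_utf8` or `find_last_leading_byte` are implemented. It only illustrates the kind of answer such a function gives.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a simdutf or sutf function): returns the number of
// trailing bytes in [data, data + len) that form a started-but-unfinished
// UTF-8 character. These are the bytes to carry over to the next chunk.
inline std::size_t incomplete_tail_length(const char* data, std::size_t len) {
    // An unfinished character has between 1 and 3 bytes, so look back at most 3.
    for (std::size_t back = 1; back <= 3 && back <= len; ++back) {
        auto b = static_cast<std::uint8_t>(data[len - back]);
        if ((b & 0xC0) == 0x80) continue;  // continuation byte: keep looking back
        // Found a non-continuation byte: derive the sequence length it announces.
        std::size_t expected = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : b >= 0xC0 ? 2 : 1;
        // The tail is incomplete only if the announced length runs past the end.
        return expected > back ? back : 0;
    }
    return 0;  // empty input, or only continuation bytes in the last 3 positions
}
```

This helper only decides where to cut. It does not validate anything, so the bytes before the cut still have to go through a real validator, and the carried-over bytes are checked together with the next chunk.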

@DavidBuchanan314
Author

I think I've solved my problem, using trim_partial_utf8: https://gist.github.com/DavidBuchanan314/19941d1c9f7182cf2f5189bf8edbd00c

The method I came up with requires a slightly janky API (see code comments).

Although my immediate issue is solved, it would be nice if this functionality could be integrated directly into the library, perhaps with a slightly cleaner API. I'll leave this issue open for consideration, but of course, feel free to close it if you don't think it's worth integrating.

@lemire
Member

lemire commented Jan 9, 2024

You don't need a non-trivial state variable. A simple integer (buffer_offset below) storing the size of a potential truncated character, combined with simdutf::trim_partial_utf8 is enough.

You can do it in this manner (untested, but conceptually correct):

char buf[0x10000]; // simdutf takes const char*, so we use a char buffer
size_t buffer_offset = 0; // buf[0..buffer_offset) holds a truncated trailing UTF-8 character (0 to 3 bytes)
bool valid = true;
for (;;) {
    // we try to read between sizeof(buf) - 3 and sizeof(buf) bytes (depending on buffer_offset)
    size_t readlen = fread(buf + buffer_offset, 1, sizeof(buf) - buffer_offset, stdin);
    // our content is what we read (readlen) plus the 0 to 3 bytes we collected before (buffer_offset)
    size_t contentlen = readlen + buffer_offset;
    // it is possible that we end with a partial UTF-8 sequence, so compute a short length
    size_t short_length = simdutf::trim_partial_utf8(buf, contentlen);
    // the sequence from 0 to short_length should be valid UTF-8, followed possibly by a truncated character
    if (!simdutf::validate_utf8(buf, short_length)) { valid = false; break; }
    // we move the truncated character to the beginning
    buffer_offset = contentlen - short_length;
    for (size_t i = 0; i < buffer_offset; i++) { buf[i] = buf[i + short_length]; } // copy between 0 and 3 bytes
    if (contentlen < sizeof(buf)) break;
}
// an invalid chunk, or a truncated character left over at the end, means invalid UTF-8
if (!valid || buffer_offset > 0) {
    std::cerr << "invalid UTF-8" << std::endl;
}
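
The carry-over loop can also be packaged so that the caller just feeds chunks, which is close to the API asked for at the top of the thread. The following is a rough sketch under stated assumptions: `ChunkedUtf8Validator`, `trim_partial_utf8_standin` and `validate_utf8_scalar` are hypothetical names introduced here so the example compiles without simdutf. In real code the two helpers would presumably be replaced by `simdutf::trim_partial_utf8` and `simdutf::validate_utf8`.

```cpp
#include <cstddef>
#include <string>

// Stand-in for simdutf::trim_partial_utf8: length of the prefix that does not
// end in a started-but-unfinished character.
inline std::size_t trim_partial_utf8_standin(const char* data, std::size_t len) {
    auto at = [&](std::size_t back) { return static_cast<unsigned char>(data[len - back]); };
    if (len >= 1 && at(1) >= 0xC0) return len - 1;  // lead byte with no continuation yet
    if (len >= 2 && at(2) >= 0xE0) return len - 2;  // 3- or 4-byte lead plus one byte
    if (len >= 3 && at(3) >= 0xF0) return len - 3;  // 4-byte lead plus two bytes
    return len;
}

// Scalar stand-in for simdutf::validate_utf8 (rejects overlong forms,
// surrogates, and code points above U+10FFFF).
inline bool validate_utf8_scalar(const char* data, std::size_t len) {
    const auto* p = reinterpret_cast<const unsigned char*>(data);
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = p[i];
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the first continuation byte
        std::size_t n = 0;                   // number of continuation bytes expected
        if (b < 0x80) { ++i; continue; }
        else if (b >= 0xC2 && b <= 0xDF) { n = 1; }
        else if (b == 0xE0) { n = 2; lo = 0xA0; }
        else if (b == 0xED) { n = 2; hi = 0x9F; }
        else if (b >= 0xE1 && b <= 0xEF) { n = 2; }
        else if (b == 0xF0) { n = 3; lo = 0x90; }
        else if (b >= 0xF1 && b <= 0xF3) { n = 3; }
        else if (b == 0xF4) { n = 3; hi = 0x8F; }
        else return false;
        if (len - i <= n) return false;                       // sequence cut short
        if (p[i + 1] < lo || p[i + 1] > hi) return false;
        for (std::size_t k = 2; k <= n; ++k)
            if ((p[i + k] & 0xC0) != 0x80) return false;
        i += n + 1;
    }
    return true;
}

// Hypothetical wrapper: feed() chunks in order, then ask finish().
struct ChunkedUtf8Validator {
    std::string carry;  // 0 to 3 bytes of a possibly truncated trailing character
    bool ok = true;

    void feed(const char* chunk, std::size_t len) {
        if (!ok) return;
        std::string pending = carry;  // copies the chunk; kept simple for the sketch
        pending.append(chunk, len);
        std::size_t complete = trim_partial_utf8_standin(pending.data(), pending.size());
        ok = validate_utf8_scalar(pending.data(), complete);
        carry.assign(pending, complete, std::string::npos);
    }
    // Valid only if no chunk failed and no truncated character is left over.
    bool finish() const { return ok && carry.empty(); }
};
```

Copying each chunk into `pending` keeps the sketch short. Reading directly into a buffer after the carried-over bytes, as in the loop above, avoids that copy.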

@lemire
Member

lemire commented Jan 9, 2024

@DavidBuchanan314 Looks like what you wrote is conceptually equivalent to what I proposed.

@lemire
Member

lemire commented Jan 9, 2024

"it would be nice if this functionality could be integrated directly into the library, perhaps with a slightly cleaner API"

Always open to new ideas. My code above is about 4 extra lines compared to your initial wish. And it is not too hard to make them go away (as you did in your gist).

If you do end up with a design that you like and it works for you and it is solid and elegant, feel free to issue a pull request!

@DavidBuchanan314
Author

Thank you, I will close this issue for now, but I will consider making a PR if I can think of something particularly elegant.

@lemire
Member

lemire commented Jan 9, 2024

@DavidBuchanan314 To elaborate further, I am concerned about making up APIs and interfaces that I like... fooling myself into thinking that they solve the right problems.

There is definitely more hard work that will go into simdutf, but I much prefer to rely on actual users as far as the design of the interface goes. That's why it is super minimalistic right now. Just a bunch of functions, little abstraction.

It is not that I am opposed to having nice abstractions, it is that I think such things need to be validated in practice.

@DavidBuchanan314
Author

I did eventually come up with what I think is a clean API: https://gist.github.com/DavidBuchanan314/798db4d18ed264920b9afd1c71b7f8bd

It's the same API I initially described, relying on a simple byte-wise state machine validator to paper over the cracks, i.e. handling the boundary conditions. The (relatively) slow state machine is invoked a maximum of 6 times, and always 0 times for ASCII-only input, so all the performance of simdutf remains.
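
For readers who want to see what a byte-wise state machine with this API shape could look like, here is a minimal sketch. It is not the code from the gist: the `Utf8State` layout and the `is_valid_final_state` helper are assumptions made for illustration, and this version runs the state machine over every byte instead of handing the bulk of each buffer to simdutf.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical state carried between calls (field names are illustrative).
struct Utf8State {
    std::uint8_t remaining = 0;  // continuation bytes still expected
    std::uint8_t lo = 0x80;      // allowed range for the next continuation byte
    std::uint8_t hi = 0xBF;
    bool error = false;          // sticky: once set, the stream is invalid
};

// Byte-wise incremental validator: consumes one buffer and returns the new state.
inline Utf8State validate_utf8_incremental(Utf8State s, const std::uint8_t* buf, std::size_t len) {
    for (std::size_t i = 0; i < len && !s.error; ++i) {
        std::uint8_t b = buf[i];
        if (s.remaining > 0) {
            // In the middle of a character: b must be a continuation byte in range.
            if (b < s.lo || b > s.hi) { s.error = true; break; }
            --s.remaining;
            s.lo = 0x80;  // only the first continuation byte has a restricted range
            s.hi = 0xBF;
        } else if (b < 0x80) {
            // ASCII: nothing to track.
        } else if (b >= 0xC2 && b <= 0xDF) { s.remaining = 1; }
        else if (b == 0xE0) { s.remaining = 2; s.lo = 0xA0; }  // reject overlong 3-byte forms
        else if (b == 0xED) { s.remaining = 2; s.hi = 0x9F; }  // reject surrogates
        else if (b >= 0xE1 && b <= 0xEF) { s.remaining = 2; }
        else if (b == 0xF0) { s.remaining = 3; s.lo = 0x90; }  // reject overlong 4-byte forms
        else if (b >= 0xF1 && b <= 0xF3) { s.remaining = 3; }
        else if (b == 0xF4) { s.remaining = 3; s.hi = 0x8F; }  // reject values above U+10FFFF
        else { s.error = true; }  // stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF
    }
    return s;
}

// The stream is valid if no error occurred and no character is left unfinished.
inline bool is_valid_final_state(const Utf8State& s) {
    return !s.error && s.remaining == 0;
}
```

Because the state records how many continuation bytes are still owed, a character split across two buffers needs no copying: the second call simply resumes where the first stopped.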

However, this code currently exists "in a vacuum" - I haven't written the code that depends on it yet! (the large-file validation was just an example, my real use case is more complex) I appreciate what you're saying about making sure the APIs are useful in practice, so I'll circle back if I find that it's a genuinely useful abstraction.
