data_must_be_null_terminated issue #1054
Thanks for bringing up this issue and explaining your use case. By requiring null termination we can greatly reduce the number of checks and branches, which makes parsing around 5% faster. When Glaze can use padding on your buffer (by using a non-const std::string) this performance improves even more. The reduction in parsing time due to null termination and padding can be more than the time it takes to memcpy the buffer into an intermediate. I just say this to explain the design motivation and to note that copying the data in your case could be faster than implementing a version that supports non-null termination.

But I do agree that it is best to avoid copies and allocations, so this is a very valid concern. Plus, I'm sure your code would be simplified by not requiring these copies. I'll have to think about this more.

Could you provide an example structure that you want to parse and how it is being handled for 0-copy? I'm curious as to how TLS is related to JSON. I'm very interested because I'm currently working on HTTP/websocket code and working Glaze into it.
The underlying buffer simply needs to contain a null character within its allocated memory somewhere after the message. So, intermediate string_views do not need null termination.
Some gory details about my environment: the rest of the chain on top of it is 0-copy (except when it gets zipped :) ). Okay, you get the idea: there are filters in the chain that require copying and massive alteration/mutation of the data; it is their nature. However, JSON parsing should be a read-only thing. I cannot add anything to the buffers; all I can do is read from them or copy them. The reason is that the filters are not aware of each other. They just know they have neighbours and they stimulate them with data (inbound and outbound). It makes things super fast and also encourages metaprogramming (no polymorphism in my case). My humble and unsolicited recommendation for the Glaze library is to just accept any kind of input as long as it provides the needed interface; std::string, std::string_view, std::vector<>, std::array<>, etc. all have that, and my custom buffer used inside the chain also has it. It is the most generic and frictionless way of accepting a chunk of memory for processing.
At the heart of walking the buffer sit two logical operations: advancing the pointer and testing for the end.
Currently Glaze probably does a pointer increment and then compares the dereference of that pointer against a terminator character. The suggestion I'm making is to keep the pointer advance as-is, but compare the pointer itself, un-dereferenced, with an end pointer that is computed one time (pseudo code).
If the data is null terminated, then when we need to check for the existence of a character (let's say a quote) a single test per character suffices, because the terminator itself stops the scan. Without null termination we must also compare against an end iterator: notice that we now have two comparisons and another boolean operation for every single character that we need to check. I'm just explaining this so you can see the performance reason for requiring null termination.
I plan to solve this problem in a few different ways for different use cases. One quick question I have for you: would you be able to assume the input JSON is valid? If we can assume the input JSON is valid, we can make a more efficient solution. I do plan on adding support for unknown JSON that could have errors, without null termination, but this will take a bit more time. I'll keep this issue open until I've added the support.
I have the same problem with binary transactions. @stephenberry

#include <cstddef>
#include <glaze/glaze.hpp>
#include <iostream>
int main()
{
// Example Users
std::string filePath = "/Users/meftunca/Desktop/big_projects/son/users.beve";
glz::json_t users = glz::json_t::array_t{glz::json_t{{"id", "58018063-ce5a-4fa7-adfd-327eb2e2d9a5"},
{"email", "devtest@dev.test"},
{"password", "@cWLwgM#Knalxeb"},
{"city", "Sayre ville"},
{"streetAddress", "1716 Harriet Alley"}}};
std::vector<std::byte> bytes;
bytes.shrink_to_fit();
auto ec = glz::write_file_binary(users, filePath, bytes); // Write to file
if (ec) {
std::cerr << "Error: " << glz::format_error(ec) << std::endl;
return 1;
}
glz::json_t user2;
std::vector<std::byte> bytes2;
ec = glz::read_file_binary(user2, filePath, bytes2);
if (ec) {
std::cerr << "Error: " << glz::format_error(ec) << std::endl;
return 1;
}
std::cout << user2[0]["id"].get<std::string>() << std::endl;
std::cout << user2[0]["email"].get<std::string>() << std::endl;
std::cout << user2[0]["password"].get<std::string>() << std::endl;
std::cout << user2[0]["city"].get<std::string>() << std::endl;
std::cout << user2[0]["streetAddress"].get<std::string>() << std::endl;
return 0;
}
Sorry, reading worked fine with std::string.
This is quite a nice example of the dilemma of what is better: making some minor or major optimizations in the library and requiring the user to null terminate, or not forcing the user and just using a generic 'sentinel' approach. The designers of std::ranges/std::views already solved this issue with the generic sentinel_for<> approach.
This is a high priority, I've just been distracted by other issues and development needs for my primary job. The null termination requirement is currently used for a few reasons:
Yes, the solution is to allow sentinel selection in Glaze. This will allow the user to indicate that the sentinel will indeed exist and therefore Glaze can assume safety.
IMHO Glaze shouldn't assume anything; give responsibility to the user via the type provided as the source range to parse: a) Glaze should use the user-provided sentinel so it does not read past the end. People will benefit in cases where allocating a string costs much more than checking the end-of-view condition. I have a lot of cases in production code/services where not allocating a string and using a string_view with an additional end() condition check is much faster than a null-termination assumption plus buffer allocation.
I agree, in some cases, to using type deduction. In many cases Glaze algorithms just care about the additional character, and Glaze doesn't care whether it is a null character or not. But I've just phrased the rule as "requires null termination" in order to make things simpler for users.
Yes, I understand there are massive benefits to this feature. It is coming.
Just for clarity: it is just a coincidence that some subgroup of string_views constructed from std::string or string literals has null termination, namely when they cover the whole C string representation.
Definitely, but what I'm expressing is that a lot of Glaze optimizations can apply to
IMHO it is very dangerous to rely on some '\0' in some longer string to which the string_view is pointing while searching for e.g. the matching } of some opening {. For example:
auto a = "{..) { ...} ";
std::string_view b{a, 4};
In normal circumstances you would parse only the "{..)" string, with a syntax-error result, and return a pointer past ) as the end of parsing.
No: because Glaze maps to C++ objects, you would get an error in parsing well before you hit the null termination character. I need to explain in more detail, but there are a variety of ways we can determine whether an error has occurred versus just looking at a sentinel. But in cases where we do need to look for the null character
Glaze currently does
Glaze will never make logical decisions based on the extended buffer in a
I've opened a pull request here, #1203, that adds an option for this.
In thinking about this more, I don't want to require users to set an option to handle null terminated buffers. I think handling non-null terminated buffers should be the default behavior. And I've realized this is achievable without a significant performance loss by using an approach that I will call contextual sentinels.

Because we know the types of what we are parsing, we can use them to verify that terminating characters match the decoded type. We can then shift the end iterator one place sooner, so that the end iterator always points into allocated memory. We then mark the sentinel along the error path so that we can short-circuit returns, and we check this context error for the sentinel. We handle improper closing due to sub-object termination at the end by not allowing the sentinel context error to be set twice, which would produce an actual error. This approach will give us nearly the same performance as before without requiring null termination.

We'll then have an opt-in option for even faster parsing when we know data is null terminated, and we can turn this on by default for input types that guarantee it.

The only negative side effect I see is that this approach will not support trailing whitespace when decoding; the buffer will need to be null terminated to support trailing whitespace. But I think this is an okay limitation, and in the future a trailing-whitespace version with additional logic can be added.
To add more thoughts for the sake of posterity: the challenge without null termination is error handling. This is particularly true because Glaze does not use exceptions. If Glaze used exceptions we could just jump when an invalid end was reached, but without exceptions we need to walk our way back up the stack while ensuring no parent function tries to dereference the iterator. This means that for non-null terminated buffers we would need an additional iterator check after every single function call in Glaze, on top of every place where we access a raw iterator to check something.

Using contextual sentinels and returning through the error path solves these issues. Because Glaze doesn't use exceptions, the error path is extremely efficient, and we can use it both for errors on termination and also for valid termination. This means that we can utilize the current error-checking mechanisms to get back up the stack without adding end-iterator checks everywhere. It becomes multi-purposed and results in much faster and cleaner code.

The downsides of contextual sentinels are twofold:
This second downside has a bit of complexity to it. If the input JSON is valid, we have no issue. If the input JSON is invalid, then we can easily handle pretty much every error as before. The one error that we no longer handle is partial reading due to sub-object termination: JSON where the buffer ends while a sub-object is still open. The simple solution is to add depth counting, meaning that we keep track of the opening and closing braces and brackets to ensure that everything was properly closed. We already need to do this to avoid stack overflows with nested variants, so I think this is a very reasonable solution.
Work on contextual sentinels is now active here: #1213 |
Hi everyone,
This issue is significantly impacting performance since we are required to create copies of the buffer we want to parse to ensure it is null-terminated.
For instance, I have an ultra-fast 0-copy HTTP(S) client. When data reaches the application layer, I end up with several std::string_view instances corresponding to different parts: HTTP header field names, HTTP header field values, HTTP payload, etc. These are all mapped onto the buffer managed by the previous protocol in the stack, which is TLS.
Due to this restriction, and given that TLS buffers are not null-terminated (as is common in networking), I am now compelled to create large data copies instead of directly parsing the std::string_view, which already has the correct size.
Is there a way to remove this restriction?