Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data_must_be_null_terminated issue #1054

Open
shiretu opened this issue Jun 5, 2024 · 21 comments
Open

data_must_be_null_terminated issue #1054

shiretu opened this issue Jun 5, 2024 · 21 comments
Labels
enhancement New feature or request optimization Making stuff faster

Comments

@shiretu
Copy link

shiretu commented Jun 5, 2024

Hi everyone,

This issue is significantly impacting performance since we are required to create copies of the buffer we want to parse to ensure it is null-terminated.

For instance, I have an ultra-fast 0-copy HTTP(S) client. When data reaches the application layer, I end up with several std::string_view instances corresponding to different parts: HTTP header field names, HTTP header field values, HTTP payload, etc. These are all mapped onto the buffer managed by the previous protocol in the stack, which is TLS.

Due to this restriction, and given that TLS buffers are not null-terminated (as is common in networking), I am now compelled to create large data copies instead of directly parsing the std::string_view, which already has the correct size.

Is there a way to remove this restriction?

@stephenberry
Copy link
Owner

Thanks for bringing up this issue and explaining your use case.

By requiring null termination we can greatly reduce the number of checks and branches, which makes parsing around 5% faster. When Glaze can use padding on your buffer (by using a non-const std::string) this performance is increased even more. The reduction in parsing time due to null termination and padding can be more than the time it takes to memcpy the buffer into an intermediate. I just say this to explain the design motivation and express how copying the data in your case could be faster than implementing a version that supports non-null termination.

But, I do agree that it is best to avoid copies and allocations, so this is a very valid concern. Plus, I'm sure your code would be simplified by not requiring these copies.

I'll have to think about this more. Could you provide an example structure that you want to parse and how it is being handled for 0-copy? I'm curious as to how TLS is related to JSON.

I'm very interested because I'm currently working on HTTP/websocket code and working Glaze into it.

@stephenberry stephenberry added optimization Making stuff faster enhancement New feature or request labels Jun 7, 2024
@stephenberry
Copy link
Owner

The underlying buffer simply needs to contain a null character within its allocated memory somewhere after the message. So, intermediate string_views do not need null termination.
Right now Glaze checks input buffer std::string_views for null termination.
I’ll add an option to turn off this checking by the developer.

@shiretu
Copy link
Author

shiretu commented Jun 12, 2024

Some gore details in my env:
In my case, the TLS is just another filter (in my local terminology) in the chain. This filter is also 0-copy up to it (from the wire) and from it (towards the app layer). Itself, it is not 0 copy, as it has to decrypt/encrypt.

But the rest of the chain on top of it is 0-copy. (except when it gets ziped :) )

Okay, you got the idea, there are filters in the chain that requires copying and massive alteration/mutation of the data. It is their nature. However, JSON parsing should be kind of a read-only thing.

I can not add anything to the buffers, all I can do is read from them or copy them. Reason is that the filters are not aware of each other. They just know they have neighbours and they stimulate them with data (inbound and outbound). It makes things super fast and also encourages metaprogramming (no polymorphism in my case).

My humble and unsolicited recommendation for glaze library is to just accept any kind of input as long as it has std::data() and std::size() overloaded for that type. This way there is no need for custom types to describe the input.

std::string, std::string_view, std::vector<>, std::array<>, etc they all have that overloaded.

My custom buffer used inside the chain also has that. It is the most generic and friction-less way of accepting a chunk of memory for processing.

@shiretu
Copy link
Author

shiretu commented Jun 12, 2024

At the heart of walking sits 2 logical operations:

  1. advance
  2. compare something with something to see if we reached the end

Currently glaze probably does a pointer increment and then compares the dereferencing of that pointer with \0.

The suggestion I'm making is to keep the pointer advance as is, but compare it un-dereferenced with another end pointer which is computed one-time by doing it like this: (pseudo code)

const char *pCursor=std::data(input);
const char *pEnd=pCursor+std::size(input);
while((pCursor++)!=pEnd){
 //do stuff
}

@stephenberry
Copy link
Owner

If data is null terminated then when we need to check for the existence of a character, let's say a {, then we can simply do if (*pCursor == '{'), but if we need to check against an end, then we need to do if (pCursor != pEnd && *pCursor == '{')

Notice that we now have two comparisons and another boolean operation for every single character that we need to check. I'm just explaining this so you can see the performance reason for requiring null termination.

@stephenberry
Copy link
Owner

I plan to solve this problem in a few different ways for different use cases. One quick question I have for you is would you be able to assume the input JSON is valid? If we can assume the input JSON is valid we can make a more efficient solution.

I do plan on adding support for unknown JSON that could have errors without null termination, but this will take a bit more time. But, I'll keep this issue open until I've added the support.

@meftunca
Copy link

meftunca commented Jun 20, 2024

I have the same problem with binary transactions. @stephenberry

#include <cstddef>
#include <glaze/glaze.hpp>
#include <iostream>

int main()
{
   // Example Users
   std::string filePath = "/Users/meftunca/Desktop/big_projects/son/users.beve";
   glz::json_t users = glz::json_t::array_t{glz::json_t{{"id", "58018063-ce5a-4fa7-adfd-327eb2e2d9a5"},
                                                        {"email", "devtest@dev.test"},
                                                        {"password", "@cWLwgM#Knalxeb"},
                                                        {"city", "Sayre ville"},
                                                        {"streetAddress", "1716 Harriet Alley"}}};

   std::vector<std::byte> bytes;
   bytes.shrink_to_fit();
   auto ec = glz::write_file_binary(users, filePath, bytes); // Write to file
   if (ec) {
      std::cerr << "Error: " << glz::format_error(ec) << std::endl;
      return 1;
   }

   glz::json_t user2;
   std::vector<std::byte> bytes2;
   ec = glz::read_file_binary(user2, filePath, bytes2);
   if (ec) {
      std::cerr << "Error: " << glz::format_error(ec) << std::endl;
      return 1;
   }

   std::cout << user2[0]["id"].get<std::string>() << std::endl;
   std::cout << user2[0]["email"].get<std::string>() << std::endl;
   std::cout << user2[0]["password"].get<std::string>() << std::endl;
   std::cout << user2[0]["city"].get<std::string>() << std::endl;
   std::cout << user2[0]["streetAddress"].get<std::string>() << std::endl;

   return 0;
}

Sorry, reading worked fine with std::string.

@arturbac
Copy link
Contributor

arturbac commented Jul 22, 2024

If data is null terminated then when we need to check for the existence of a character, let's say a {, then we can simply do if (*pCursor == '{'), but if we need to check against an end, then we need to do if (pCursor != pEnd && *pCursor == '{')

This is quite nice example of dilemma what is better .., making some minor or major optimizations in library and requiring from user to null terminate or don't force user and just use generic 'sentinel' approach.
I have spent quite a lot of time to write null termination independent stralgo to get rid of that limitation and in most cases 'false' optimization. It is often false as it is requiring to as mentioned here allocate from heap memory and create deep copy of sub strings and null terminate them to get small improvement in some condition.

Designers of std::ranges/std::views already solved this issue by using generic approach of sentinel_for<> ..
It is Up to user how does sentinel looks like in given use case. And source range often it is enough to be just forward range and not required to be contiguous range like string_view ..
In my experience of using high load services most important optimization is to not allocate from heap at a cost of additional condition check, as checking condition is often done on data in cache line of cpu and allocation causes global sync event.
With sentinel approach user can pass sentinel pointing at end of view or assumin always true()
And glaze can then use sentinel composed from user sentinel searching for something ..
So the decision of optimizing null termination can be postponed to user, depending what is he passing forward range, string_view or null terminated std::string

@stephenberry
Copy link
Owner

This is a high priority, I've just been distracted by other issues and development needs for my primary job.

The null termination requirement is currently used for a few reasons:

  • Performance, as noted in avoiding end checks
  • Additional buffer space to optimize reading based on this extra padded byte
  • Performance due to error safety knowing that a null character must exist. This affects core parsing algorithms which take this into account for handling invalid inputs. In the sentinel case these optimizations need to be refactored, which is why this change will take a bit more effort.

Designers of std::ranges/std::views already solved this issue by using generic approach of sentinel_for<>...

Yes, the solution is to allow sentinel selection in Glaze. This will allow the user to indicate that the sentinel will indeed exist and therefore Glaze can assume safety.

@arturbac
Copy link
Contributor

arturbac commented Jul 22, 2024

Designers of std::ranges/std::views already solved this issue by using generic approach of sentinel_for<>...

Yes, the solution is to allow sentinel selection in Glaze. This will allow the user to indicate that the sentinel will indeed exist and therefore Glaze can assume safety.

IMHO Glaze shouldn't assume anything, give responsibility to user by the fact of type provided as source range to parse...

a) Glaze should use user provided sentinel to not read overflow. PPl will benefit in such cases where allocating string costs much more than checking end view condition.
b) Or glaze can determine sentinel based on source range, for input range or string_view use sentinel of view.end() for null terminated std::string assume null termination. Thus You can make composed search sentinel from that in first case it will search for 'X' and use end() for range end check, with string it will just search for 'X' as std::string is null_termianted.

I have a lot of cases in production code/services where not allocating string and using string_view with additional end() condition check is much faster than null termination assumption and buffers allocation.

@stephenberry
Copy link
Owner

I agree in some cases to using type deduction. However, we don't want std::string_view parsing to be slower just because it is a string_view, especially when it is common for string views to have null characters or additional characters at the end. The thing is that we can optimize for dereferencing the end() if it is null terminated (especially true when using SWAR). However, if the string_view is not null terminated then we cannot dereference this pointer.

In many cases Glaze algorithms just care about the additional character and Glaze doesn't care whether it is a null character or not. But, I've just phrased the rule as "requires null termination" in order to make things simpler for users.

I have a lot of cases in production code/services where not allocating string and using string_view with additional end() condition check is much faster than null termination assumption and buffers allocation.

Yes, I understand there are massive benefits to this feature. It is coming.

@arturbac
Copy link
Contributor

especially when it is common for string views to have null characters or additional characters at the end.

Just for clarity:
null termination has nothing to do with string_view which is by default non null terminated sequence of characters ending at data+size.
std::basic_string_view is not C string char const * and it is not convertible to it by design.

It is just coincidence that some sub group of string_views constructed from std::string or string literals has null termination when they cover whole C string representation.

@stephenberry
Copy link
Owner

Definitely, but what I'm expressing is that a lot of Glaze optimizations can apply to std::string_view without a requirement of a null termination, but just by having a single additional character allocated at end, which is very common because string_views are often either an entire string or a subset of the string.

@arturbac
Copy link
Contributor

IMHO this is very dangerous to relay on some '\0' in some longer string to which string_view is pointing while searching for ex matching } of some opening {.
With invalid syntax You will likely start working in buffer overflow area.
simple example:

auto a "{..) { ...} "
string_view b{a, 4};

so in normal circumstances You would parse only "{..)" string with syntax error result and return pointer pass by ) as end of parsing.
with string view You will violate allowed buffer range and You would find matching } in out of buffer area and return pointer pass } , far away after string view end sentinel.

@stephenberry
Copy link
Owner

stephenberry commented Jul 22, 2024

No, because Glaze maps to C++ objects you would get an error in parsing well before you hit the null termination character.

I need to explain in more detail, but there are a variety of way we can determine if an error has occurred versus just looking at a sentinel. But, in cases where we do need to look for the end iterator, Glaze does that. There are just some optimizations that need to be tweaked for when a null character cannot be assumed.

@stephenberry
Copy link
Owner

stephenberry commented Jul 22, 2024

Glaze currently does end iterator checks all over in parsing, so it will use the end iterator of your string_view in most cases. There are just some cases that need to be handled.

@stephenberry
Copy link
Owner

Glaze will never make logical decisions based on the extended buffer in a string_view, it is just that we can do SIMD optimizations if a buffer exists beyond the length of the string view, in some cases with just a single character.

@stephenberry
Copy link
Owner

I've opened a pull request here #1203 that adds an is_null_terminated option that can be set to false for when the input buffer is not null terminated. There's a good bit of work to still do, but you can see the kinds of changes that are necessary.

@stephenberry
Copy link
Owner

In thinking about this more I don't want to require users to set an option to handle null terminated buffers. I think it should be the default behavior to handle non-null terminated buffers. And, I've realized this is achievable without a significant performance loss by using an approach that I will call contextual sentinels. Because we know the types of what we are parsing, we can use these to verify that terminating characters match the decoded type, we can then shift the end iterator one place sooner so that the end iterator is always allocated. We then mark the sentinel along the error path so that we can short circuit returns. We check this context error for the sentinel. We handle improper closing due to sub-object termination at end by not allowing the sentinel context error to be set twice, which would produce an actual error.

This approach will give us nearly the same performance as before without requiring end sentinel checks everywhere, because we will contextually know the terminating character.

We'll then have an opt-in option for even faster parsing when we know data is null terminated, and we can turn this on by default for std::string.

The only negative side effect I see is that this approach will not support trailing whitespace when decoding. The buffer will need to be null terminated to support trailing whitespace. But, I think this is an okay limitation and in the future a trailing whitespace version with additional logic can be added.

@stephenberry
Copy link
Owner

stephenberry commented Jul 24, 2024

To add more thoughts for the sake of posterity:

The challenge without null termination is error handling. This is particularly true because Glaze does not use exceptions. If Glaze used exceptions we could just jump when an invalid end was reached, but without exceptions we need to walk our way back up the stack while ensuring no parent function tries to dereference the iterator. This means that for non-null terminated buffers we would need an additional iterator check after every single function call in Glaze. On top of this, every time we access a raw iterator to check something like *it == '{' we would need an additional end check. And, for every single whitespace character read we would need to check end iterators. This can massively degrade performance and makes the code much uglier.

Using contextual sentinels and returning through the error path solves these issues. Because Glaze doesn't use exceptions, the error path is extremely efficient, and we can use it both for errors on termination and also for valid termination. This means that we can utilize the current error checking mechanisms to get back up the stack without adding end iterator checks everywhere. It becomes multi-purposed and results in much faster and cleaner code.
Another benefit of contextual sentinels is that we are checking the terminating character of the JSON value up front. This means that we can catch termination issues before parsing anything and handle these error cases extremely fast.

The downsides of contextual sentinels are two fold:

  1. They will not allow trailing whitespace, because we need to verify a valid terminating character. This is really a minor loss, because in most cases of using views we only want to parse the content of interest are are concerned with performance that doesn't care about trailing whitespace.
  2. Unnoticed partial reading for invalid JSON with sub-object (or sub-array) termination.

This second downside has a bit of complexity to it. If the input JSON is valid we have no issue. If the input JSON is invalid then we can easily handle pretty much every error as before. The one error that we no longer handle is partial reading due to sub-object termination. This means that we have JSON like: "{{"a":0}". Notice that the final trailing } is missing. This would not be seen as an error without depth-counting. The reason is because we've taken the error route to exit back up the stack, so the checks for the terminating } are skipped.

The simple solution is to add depth-counting. Meaning that we keep track of the opening and closing braces and brackets to ensure that everything was properly closed. We already need to do this to avoid stack overflows with nested variants, so I think this is a very reasonable solution.

@stephenberry
Copy link
Owner

Work on contextual sentinels is now active here: #1213

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request optimization Making stuff faster
Projects
None yet
Development

No branches or pull requests

4 participants