-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VarZeroVec capacity and error handling #1410
Comments
I would rather not make it dynamically configurable, since we lose out on the equality property: two VZVs in such a scheme may be logically equal even if not byte-equal. Static configurability with a default seems fine, iirc const generics don't support defaults yet but we can hardcode a constant and wait for that support to eventually exist to switch over. And yeah, we should be more careful about the fallibility there. |
Why can't we just have an invariant where the index width is always the smallest possible? If, when mutating, we pass a boundary either up or down, we just re-compute the index array. |
Hmmm, tempting; though this does cause two problems:
Writing the mutation code to efficiently handle this will be really annoying, so I suspect we'll just have to use the existing mutation code and then iterate-reallocate. This isn't that big a deal, I guess. I do think that the pathological case will be easier to hit for We can also save the first byte by packing the length and the byte-width in the same number (this does lower our thresholds) I think to start we should make |
I don't think this is a claim we can make until we have numbers. You're still calculating the same number of offsets; you just need to adjust them by a dynamic value loaded into a register rather than a constant in code. I don't see where the additional branching would be.
As you know, I've been skeptical for a long time about a focus on mutation operations. If you want to do a whole lot of work, convert the ZeroMap to a LiteMap/BTreeMap, mutate over there, and then bring it back to the ZeroMap after you're done. Mutation operations within ZeroMap should be rare, and when the occur, they should be small.
Actually, all you need to do is either add an empty byte or remove an empty byte for every entry in the index array. The numerical values within the index array stay the same. It's still a lot of shifting bytes around, of course.
I was quite excited about the prospect of a
Perhaps. This really only works if we start at
I agree that we can start here. |
Note that I don't want to put in work and change APIs to be fallible if we eventually are going to do the auto-sizing approach. So we should keep this issue until we make the decision on that, even if we implement a |
Okay, I'm convinced. Let's start with the baked const (const generic on We can stop on the way at a const generic approach but note that we can't currently do
Ah! I was imagining we'd stuff the width into the first byte of the length. This is probably unnecessary.
That's fair. I do think it's nice for mutation to be convenient but this is an okay position too.
Yeah, I hadn't realized that it only affects the length buffer. This should not be hard to implement since we can "preexpand" or "precontract" the length buffer and then call the code Alyssa wrote. It's also possible but probably not worth it to do it as a one-step thing.
Hmmm fair point. |
The const generic solution can be similar to what we did with FormattedString pub struct VarZeroVecFlex<const N: usize> { ... }
pub type VarZeroVec = VarZeroVecFlex<4>; |
yeah, i was thinking about that, and I was also thinking about how I'd just reduced the number of public VZV types 😄 (note that we'd need one of these for slice, owned, and cow). Could probably be done in an okayish way by only wrapping the cow one |
An algorithm for expanding or contracting the index array can be written in O(N) time, with N = length of the byte buffer.
|
Another question is how exactly we define the boundary. I see two ways to define it:
We can squeeze more bytes out of definition 2, but it is harder to reason about. Some edge cases:
|
Hmm. I wonder if we should have two width parameters: one for the width of the length field, and the other for the width of the index fields. There is no reason that these two need to be coupled. The length field can be a varint, since it is only one field. In the VZV buffer:
|
I don't think it's worth attempting to save the 1-3 bytes we can save by optimizing for the length field having a different length |
Let's answer this question first. If we make the cutoff be on the length of the data array, we can run into cases where there are more elements than the size of the data array (a vector of empty strings). It seems cleaner to decouple the two widths. This isn't about saving a few bytes. For example, if you have a vector of 100,000 empty strings, you could allocate a 3-byte index array, or you could allocate a 1-byte index array, saving 200k of memory. I acknowledge that this is an edge case, but it's food for thought to challenge our assumption of coupling these two numbers. |
Another option would be to keep the number of elements fixed at 2^32 but allow the data array to scale in length. But then we still need to add fallible APIs. |
Yeah this is what I had thought the proposal was which is why I felt that it was saving a few bytes. I'm okay with making things fallible: this was what I was envisioning from the start when you were talking about variable size leng. I don't care that much about the perf characteristics or API pleasantness of mutation. |
I think the fallibility question is the big crux. Vec is not fallible, because it grows to however big it needs, which is as big as In some sense, if VZV is fallible, then we should stick with compile-time sizing, and make developers carefully pick the correct width for their use case. But if we want VZV to be infallible, then it should scale everything automatically. |
Note that 64 bit VZV won't work on WASM anyway, which is why I've been comfy having it be panicky/fallible. But also, the only stuff that becomes fallible that aren''t fallible already are the mutation operations. That seems ... fine to me? I guess it impacts ZeroMap too, which is unfortunate, but I'm okay with ZeroMap mutation panicking at 32 bit boundaries. Or even being fallible, though that sounds like it will grant us little benefit. Basically, even if we make it stretchy, the panics will be replaced by OOMs on WASM. |
If VZV is fallible, then we should make ZeroMap fallible, too. I'm not comfortable making VZV fallible but not ZeroMap. I don't know how concerned I am with OOMs. That's something that can happen on any platform, not just WASM (although it's more likely on WASM since it's 32-bit). A discussion about how we should handle OOMs is the topic of a different thread. Ultimately there are two options I'm okay with:
|
I'm fine with both being fallible.
I don't quite agree: given that the primary use case for these structures is deserialization, being able to create VZVs that will fail to deserialize on some platforms seems like a footgun, and I think that that footgun also exists for ZeroVec. OOMs in general, yes, should be a separate topic, but this is about being able to reliably cause a deserialization failure. Note that in Rust code this is not actually an OOM, a too-large sequence will typically cause the deserializer to return an error, but it's kind of moot anyway because if you have that much data then getting that data into memory will OOM. While we can't prevent all such deserialization OOMs, I think it should be a data model error for a zerovec type that does not have a u32 byte length to be created. This would probably mean all mutation and construction operations become fallible, which is unfortunate, but neither are things we particularly care about the performance or ergonomicness of. To some extent I would even be okay with these panicking instead since ICU4X client code would be unlikely to ever be mutating these, but I understand there is some reluctance to that.
I prefer (1) since it does not incur any additional cost on the reader. It does incur an ergonomic cost on the writer, which is unfortunate, and I'm not super happy about that so I wouldn't be too opposed to (2). (It's worth checking to see what rkyv does) |
I don't understand what you're suggesting about the OOM. If we're deserializing zero-copy, the memory has already been allocated, and we are just pointing to it. There is no memory allocation and therefore no opportunity for an OOM. OOM only happens if you're mutating. |
Yeah I did somewhat intend to note that in the "we can't handle all these cases anyway" but I forgot. I think that's fair; I was thinking about streaming deserialization but that's not a thing here anyway, so I guess you're right about OOMs being a red herring. I still feel like we should strongly avoid imposing costs (even minor ones) on reads when there is a design tradeoff that makes mutations a bit more painful, regardless of the OOM business. |
The way to get zero-cost is to use an array or a struct. If you're using VZV, you are already incurring some cost. If it takes an additional 10% on a VZV read operation, and we get a big benefit as a result (infallible VZV that automatically sizes and scales itself), that's an acceptable tradeoff for me. Here's an example of where a fixed-size VZV might not always work. I used month names as an example: in most cases, they will fit in a |
Sure, but people who are considering VZV will be making a tradeoff between:
The second option is almost always better, but to me it's still worth optimizing.
I don't understand how this is relevant: I'm quite happy with a variable size VZV, what I'm pushing back against is making the length itself an independent variable size; I'd rather it be set as u32. I see it as vanishingly rare that someone will want a VZV with more than u32s worth of elements. |
This is fixed by using FlexZeroVec (#1443) |
FlexZeroVec is implemented, but we still need to try using it in VarZeroVec. |
For 1.0, I want to save FZV-compatible bytes into the postcard files, but we don't need to use FZV yet. This lets us use it with a configurable feature in a compatible way. |
At a high level, there are 3 paths:
|
Decision: pursue option 3. For 1.0, use 16-bit VZVs for most of ICU4X, and 32-bit VZVs for the parts that need it. Extend this to other types of VZVs down the road. |
Followup: #2312 |
Once #2306 lands we should probably split this into followups about:
|
Polish-blockers done in #2306 |
Something else to fix in the context of this issue: When Index16 overflows it says "Attempted to build VarZeroVec out of elements that cumulatively are larger than a u32 in size" which is wrong |
We are currently panicking in potentially unexpected ways when building a VarZeroVec. All code paths to construct a VZV from runtime objects pass through
VarZeroVecOwned::from_elements
, which has the following expectation:In addition, 32 bits is often much more than we need.
I suggest we consider making the length field in VZV configurable, either statically or dynamically. If we make it dynamically configurable, then we can safely remove all the places where we panic. To do this, we can use the first byte of the VZV buffer to signal how many bytes the indices are (1, 2, 4, 8, or 16... or perhaps even allowing odd numbers like 3 and 5), and then use that number when computing all offsets.
If this is not possible, we should at least change all the
From
impls toTryFrom
so that we aren't panicking inside of VZV in a way that clients cannot catch.Needs input from:
The text was updated successfully, but these errors were encountered: