
VarZeroVec capacity and error handling #1410

Open
1 task done
sffc opened this issue Dec 19, 2021 · 39 comments
Labels
C-zerovec Component: Yoke, ZeroVec, DataBake S-large Size: A few weeks (larger feature, major refactoring) T-techdebt Type: ICU4X code health and tech debt

Comments

@sffc
Member

sffc commented Dec 19, 2021

We are currently panicking in potentially unexpected ways when building a VarZeroVec. All code paths to construct a VZV from runtime objects pass through VarZeroVecOwned::from_elements, which has the following expectation:

.expect("Attempted to build VarZeroVec out of elements that \
    cumulatively are larger than a u32 in size")

In addition, 32 bits is often much more than we need.

I suggest we consider making the length field in VZV configurable, either statically or dynamically. If we make it dynamically configurable, then we can safely remove all the places where we panic. To do this, we can use the first byte of the VZV buffer to signal how many bytes the indices are (1, 2, 4, 8, or 16... or perhaps even allowing odd numbers like 3 and 5), and then use that number when computing all offsets.

If this is not possible, we should at least change all the From impls to TryFrom so that we aren't panicking inside of VZV in a way that clients cannot catch.
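To illustrate the fallible flavor, here is a minimal sketch of the kind of checked size computation a TryFrom-style constructor would need; the names are illustrative, not the actual zerovec API:

```rust
/// Illustrative error type; not the actual zerovec error.
#[derive(Debug, PartialEq)]
struct CapacityError;

/// Sum the serialized sizes of the elements, failing instead of
/// panicking if the total would overflow the u32 index space.
fn checked_total_size(elements: &[&str]) -> Result<u32, CapacityError> {
    let mut total: u32 = 0;
    for e in elements {
        let len = u32::try_from(e.len()).map_err(|_| CapacityError)?;
        total = total.checked_add(len).ok_or(CapacityError)?;
    }
    Ok(total)
}
```

A `TryFrom<&[&str]>` impl could then surface `CapacityError` to the client instead of panicking inside VZV.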

Needs input from:

@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters T-techdebt Type: ICU4X code health and tech debt needs-approval One or more stakeholders need to approve proposal labels Dec 19, 2021
@Manishearth
Member

Manishearth commented Dec 19, 2021

I would rather not make it dynamically configurable, since we lose out on the equality property: two VZVs in such a scheme may be logically equal even if not byte-equal. Static configurability with a default seems fine, iirc const generics don't support defaults yet but we can hardcode a constant and wait for that support to eventually exist to switch over.

And yeah, we should be more careful about the fallibility there.

@sffc
Member Author

sffc commented Dec 19, 2021

Why can't we just have an invariant where the index width is always the smallest possible? If, when mutating, we pass a boundary either up or down, we just re-compute the index array.
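The invariant could be as simple as a width lookup keyed off the largest offset; a sketch assuming power-of-two widths (hypothetical helper, not zerovec code):

```rust
/// Smallest byte width (1, 2, 4, or 8) that can represent `max_value`.
/// Sketch of the proposed invariant: the index width is always the
/// smallest width that fits the largest offset in the buffer.
fn smallest_index_width(max_value: u64) -> usize {
    match max_value {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFFFF_FFFF => 4,
        _ => 8,
    }
}
```

On mutation, the vector would compare the new maximum offset's width against the current width and re-encode the index array only when they differ.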

@Manishearth
Member

Manishearth commented Dec 19, 2021

Hmmm, tempting; though this does cause two problems:

  • all operations on VZV get a little bit slower (this might be okay, but I'm very reluctant to introduce branches here)
  • mutation can get really expensive, especially if you're dancing around a boundary. This gives us pathological cases for ZeroMap.

Writing the mutation code to efficiently handle this will be really annoying, so I suspect we'll just have to use the existing mutation code and then iterate-reallocate. This isn't that big a deal, I guess.

I do think that the pathological case will be easier to hit for u8 lengths since it's easy to get to 256 on the entire buffer. What do you think of starting the lengths at u16?

We can also save the first byte by packing the length and the byte-width in the same number (this does lower our thresholds)

I think to start we should make N a parameter on VZVBorrowed/VZVComponents and a constant, and then try to move forward from there.

@sffc
Member Author

sffc commented Dec 19, 2021

Hmmm, tempting; though this does cause two problems:

  • all operations on VZV get a little bit slower (this might be okay, but I'm very reluctant to introduce branches here)

I don't think this is a claim we can make until we have numbers. You're still calculating the same number of offsets; you just need to adjust them by a dynamic value loaded into a register rather than a constant in code. I don't see where the additional branching would be.
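Concretely, reading an index with a runtime width is a short loop whose trip count is the width, with no per-lookup special casing; an illustrative sketch (the real VZV reader works on validated byte slices, this does no bounds recovery):

```rust
/// Read the i-th little-endian index from a packed index array whose
/// per-index byte width is only known at runtime.
fn read_index(indices: &[u8], i: usize, width: usize) -> usize {
    let start = i * width;
    let mut value = 0usize;
    for (k, b) in indices[start..start + width].iter().enumerate() {
        // assemble the little-endian value byte by byte
        value |= (*b as usize) << (8 * k);
    }
    value
}
```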

  • mutation can get really expensive, especially if you're dancing around a boundary. This gives us pathological cases for ZeroMap.

As you know, I've been skeptical for a long time about a focus on mutation operations. If you want to do a whole lot of work, convert the ZeroMap to a LiteMap/BTreeMap, mutate over there, and then bring it back to the ZeroMap after you're done. Mutation operations within ZeroMap should be rare, and when they occur, they should be small.

Writing the mutation code to efficiently handle this will be really annoying, so I suspect we'll just have to use the existing mutation code and then iterate-reallocate. This isn't that big a deal, I guess.

Actually, all you need to do is either add an empty byte or remove an empty byte for every entry in the index array. The numerical values within the index array stay the same. It's still a lot of shifting bytes around, of course.

I do think that the pathological case will be easier to hit for u8 lengths since it's easy to get to 256 on the entire buffer. What do you think of starting the lengths at u16?

I was quite excited about the prospect of a u8 VZV. We have a whole lot of cases that will fit. For example, a VarZeroVec<str> containing 7 weekday names or 12 month names will in most cases fit in a u8 VZV.

We can also save the first byte by packing the length and the byte-width in the same number (this does lower our thresholds)

Perhaps. This really only works if we start at u16 because we need all the space in the first byte with u8.

I think to start we should make N a parameter on VZVBorrowed/VZVComponents and a constant, and then try to move forward from there.

I agree that we can start here.

@sffc sffc added the v1 label Dec 20, 2021
@sffc
Member Author

sffc commented Dec 20, 2021

Note that I don't want to put in work and change APIs to be fallible if we eventually are going to do the auto-sizing approach. So we should keep this issue until we make the decision on that, even if we implement a const usize type parameter in the mean time.

@Manishearth
Member

Okay, I'm convinced. Let's start with the baked const (const generic on VZVComponents, baked const elsewhere), and try the dynamic approach after that.

We can stop on the way at a const generic approach but note that we can't currently do struct VZV<const N: usize = 1> since defaults on consts are unstable, so I'd prefer we go directly to dynamic (using the const generic as a stepping stone for the implementation, perhaps, but never released)

I don't think this is a claim we can make until we have numbers. You're still calculating the same number of offsets; you just need to adjust them by a dynamic value loaded into a register rather than a constant in code. I don't see where the additional branching would be.

Ah! I was imagining we'd stuff the width into the first byte of the length. This is probably unnecessary.

As you know, I've been skeptical for a long time about a focus on mutation operations. If you want to do a whole lot of work, convert the ZeroMap to a LiteMap/BTreeMap, mutate over there, and then bring it back to the ZeroMap after you're done. Mutation operations within ZeroMap should be rare, and when the occur, they should be small.

That's fair. I do think it's nice for mutation to be convenient but this is an okay position too.

Actually, all you need to do is either add an empty byte or remove an empty byte for every entry in the index array. The numerical values within the index array stay the same. It's still a lot of shifting bytes around, of course.

Yeah, I hadn't realized that it only affects the length buffer. This should not be hard to implement since we can "preexpand" or "precontract" the length buffer and then call the code Alyssa wrote. It's also possible but probably not worth it to do it as a one-step thing.

I was quite excited about the prospect of a u8 VZV. We have a whole lot of cases that will fit. For example, a VarZeroVec<str> containing 7 weekday names or 12 month names will in most cases fit in a u8 VZV.

Hmmm fair point.

@sffc
Member Author

sffc commented Dec 20, 2021

We can stop on the way at a const generic approach but note that we can't currently do struct VZV<const N: usize = 1> since defaults on consts are unstable, so I'd prefer we go directly to dynamic (using the const generic as a stepping stone for the implementation, perhaps, but never released)

The const generic solution can be similar to what we did with FormattedString

pub struct VarZeroVecFlex<const N: usize> { ... }

pub type VarZeroVec = VarZeroVecFlex<4>;

@Manishearth
Member

yeah, i was thinking about that, and I was also thinking about how I'd just reduced the number of public VZV types 😄 (note that we'd need one of these for slice, owned, and cow). Could probably be done in an okayish way by only wrapping the cow one

@sffc
Member Author

sffc commented Dec 20, 2021

An algorithm for expanding or contracting the index array can be written in O(N) time, with N = length of the byte buffer.

  1. When contracting, say from 4 bytes to 3 bytes, have two cursors walking forward through the index array: the tortoise walking 3 bytes at a time, and the hare walking 4 bytes at a time. At every interval, memmove 3 bytes from the hare pointer to the tortoise pointer. After you've looped over the entire index array, memmove the entire data section to its new position. Every byte has been read from or written to exactly once.
  2. When expanding, do this backwards: memmove the data section to make room for the wider indexes, and then have two cursors walking backwards, copying each index. One additional step is required: the writing cursor must ensure that the new high bytes of each widened index are zeroed out.
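The contraction direction can be sketched as follows; this is a hypothetical helper operating on just the index array (the real code would also memmove the data section afterwards):

```rust
/// Contract a little-endian index array in place from `from` bytes per
/// index to `to` bytes per index (`to < from`), using a trailing write
/// cursor (tortoise) and a leading read cursor (hare). Since the write
/// cursor never passes the read cursor, each byte is read from or
/// written to at most once. Illustrative sketch only.
fn contract_indices(buf: &mut Vec<u8>, count: usize, from: usize, to: usize) {
    assert!(to < from);
    for i in 0..count {
        let src = i * from; // hare
        let dst = i * to;   // tortoise
        // keep the low `to` bytes of the little-endian index
        buf.copy_within(src..src + to, dst);
    }
    buf.truncate(count * to);
}
```

Note the precondition, which the caller must guarantee: every index value must actually fit in `to` bytes, i.e. the high bytes being dropped are all zero.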

@sffc
Member Author

sffc commented Dec 20, 2021

Another question is how exactly we define the boundary. I see two ways to define it:

  1. When the length of the entire buffer exceeds the capacity of the integer type
  2. When the length of the data section exceeds the capacity of the integer type or when the number of elements in the array exceeds the capacity of the integer type

We can squeeze more bytes out of definition 2, but it is harder to reason about.

Some edge cases:

  • VZV with one or a small number of very long elements
    • In this case, the data array takes the majority of the space, so there is not much difference between definitions 1 and 2
  • VZV with a large number of very short or zero-length elements
    • In this case, the index array takes the majority of the space, so definition 2 could be more advantageous than definition 1
  • VZV with a maximum number of zero-length elements, such as a vec of 255 empty strings
    • Both definitions are sound for this degenerate case

@sffc
Member Author

sffc commented Dec 20, 2021

Hmm. I wonder if we should have two width parameters: one for the width of the length field, and the other for the width of the index fields. There is no reason that these two need to be coupled. The length field can be a varint, since it is only one field.

In the VZV buffer:

  • Byte 0: # of bytes for the length (P)
  • Byte 1: # of bytes for each index (Q)
  • Bytes 2 - (2+P): length (L)
  • Bytes (2+P) - (2+P+L*Q): index section
  • Bytes (2+P+L*Q) - end: data section
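Parsing that header might look like the following sketch; the layout and names follow the proposal above, not any shipped format:

```rust
/// Parse the hypothetical two-width header: byte 0 = P (width of the
/// length field), byte 1 = Q (width of each index), bytes 2..2+P = the
/// little-endian length L. Returns (length, index_start, data_start),
/// or None if the buffer is too short.
fn parse_header(buf: &[u8]) -> Option<(usize, usize, usize)> {
    let p = *buf.get(0)? as usize;
    let q = *buf.get(1)? as usize;
    let len_bytes = buf.get(2..2 + p)?;
    let mut len = 0usize;
    for (i, b) in len_bytes.iter().enumerate() {
        len |= (*b as usize) << (8 * i);
    }
    let index_start = 2 + p;
    let data_start = index_start + len * q;
    Some((len, index_start, data_start))
}
```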

@Manishearth
Member

I don't think it's worth attempting to save the 1-3 bytes we can save by optimizing for the length field having a different length

@sffc
Member Author

sffc commented Dec 20, 2021

I don't think it's worth attempting to save the 1-3 bytes we can save by optimizing for the length field having a different length

Let's answer this question first.

If we make the cutoff be on the length of the data array, we can run into cases where there are more elements than the size of the data array (a vector of empty strings). It seems cleaner to decouple the two widths. This isn't about saving a few bytes.

For example, if you have a vector of 100,000 empty strings, you could allocate a 3-byte index array, or you could allocate a 1-byte index array, saving 200k of memory.

I acknowledge that this is an edge case, but it's food for thought to challenge our assumption of coupling these two numbers.

@sffc
Member Author

sffc commented Dec 21, 2021

Another option would be to keep the number of elements fixed at 2^32 but allow the data array to scale in length. But then we still need to add fallible APIs.

@Manishearth
Member

Another option would be to keep the number of elements fixed at 2^32 but allow the data array to scale in length. But then we still need to add fallible APIs.

Yeah this is what I had thought the proposal was which is why I felt that it was saving a few bytes.

I'm okay with making things fallible: this was what I was envisioning from the start when you were talking about variable-size lengths. I don't care that much about the perf characteristics or API pleasantness of mutation.

@sffc
Member Author

sffc commented Dec 21, 2021

I think the fallibility question is the big crux. Vec is not fallible, because it grows to however big it needs, which is as big as usize supports. In order to make VarZeroVec infallible, we need to make both the length field and the index array variable-width. If we say that VarZeroVec is fallible, then everyone mutating a VarZeroVec needs to be careful.

In some sense, if VZV is fallible, then we should stick with compile-time sizing, and make developers carefully pick the correct width for their use case.

But if we want VZV to be infallible, then it should scale everything automatically.

@Manishearth
Member

Manishearth commented Dec 21, 2021

Note that 64 bit VZV won't work on WASM anyway, which is why I've been comfy having it be panicky/fallible.

But also, the only operations that become fallible that aren't fallible already are the mutation operations.

That seems ... fine to me? I guess it impacts ZeroMap too, which is unfortunate, but I'm okay with ZeroMap mutation panicking at 32 bit boundaries. Or even being fallible, though that sounds like it will grant us little benefit.

Basically, even if we make it stretchy, the panics will be replaced by OOMs on WASM.

@sffc
Member Author

sffc commented Dec 22, 2021

If VZV is fallible, then we should make ZeroMap fallible, too. I'm not comfortable making VZV fallible but not ZeroMap.

I don't know how concerned I am with OOMs. That's something that can happen on any platform, not just WASM (although it's more likely on WASM since it's 32-bit). A discussion about how we should handle OOMs is the topic of a different thread.

Ultimately there are two options I'm okay with:

  1. VZV parameterized by the size of the index array, and all relevant mutation operations are fallible, and ZM becomes fallible
  2. VZV is fully flexible and infallible (OOM aside), and ZM remains infallible

@Manishearth
Member

If VZV is fallible, then we should make ZeroMap fallible, too. I'm not comfortable making VZV fallible but not ZeroMap.

I'm fine with both being fallible.

A discussion about how we should handle OOMs is the topic of a different thread

I don't quite agree: given that the primary use case for these structures is deserialization, being able to create VZVs that will fail to deserialize on some platforms seems like a footgun, and I think that that footgun also exists for ZeroVec. OOMs in general, yes, should be a separate topic, but this is about being able to reliably cause a deserialization failure.

Note that in Rust code this is not actually an OOM, a too-large sequence will typically cause the deserializer to return an error, but it's kind of moot anyway because if you have that much data then getting that data into memory will OOM.

While we can't prevent all such deserialization OOMs, I think it should be a data model error for a zerovec type that does not have a u32 byte length to be created. This would probably mean all mutation and construction operations become fallible, which is unfortunate, but neither are things we particularly care about the performance or ergonomics of. To some extent I would even be okay with these panicking instead since ICU4X client code would be unlikely to ever be mutating these, but I understand there is some reluctance to that.


Ultimately there are two options I'm okay with:

1. VZV parameterized by the size of the index array, and all relevant mutation operations are fallible, and ZM becomes fallible

2. VZV is fully flexible and infallible (OOM aside), and ZM remains infalible

I prefer (1) since it does not incur any additional cost on the reader. It does incur an ergonomic cost on the writer, which is unfortunate, and I'm not super happy about that so I wouldn't be too opposed to (2).

(It's worth checking to see what rkyv does)

@sffc
Member Author

sffc commented Dec 22, 2021

I don't understand what you're suggesting about the OOM. If we're deserializing zero-copy, the memory has already been allocated, and we are just pointing to it. There is no memory allocation and therefore no opportunity for an OOM. OOM only happens if you're mutating.

@Manishearth
Member

Yeah I did somewhat intend to note that in the "we can't handle all these cases anyway" but I forgot. I think that's fair; I was thinking about streaming deserialization but that's not a thing here anyway, so I guess you're right about OOMs being a red herring.

I still feel like we should strongly avoid imposing costs (even minor ones) on reads when there is a design tradeoff that makes mutations a bit more painful, regardless of the OOM business.

@sffc
Member Author

sffc commented Dec 22, 2021

The way to get zero-cost is to use an array or a struct. If you're using VZV, you are already incurring some cost. If it takes an additional 10% on a VZV read operation, and we get a big benefit as a result (infallible VZV that automatically sizes and scales itself), that's an acceptable tradeoff for me.

Here's an example of where a fixed-size VZV might not always work. I used month names as an example: in most cases, they will fit in a u8-backed VZV. But, for some locales, perhaps those that use non-BMP code points, the month names may exceed the capacity of a u8-backed VZV. With a generic parameter, we would need to force all locales to adopt a u16 VZV, costing data size for everyone.

@Manishearth
Member

If you're using VZV, you are already incurring some cost

Sure, but people who are considering VZV will be making a tradeoff between:

  • pay a heavy allocation cost upfront and minimum read-time costs (modulo cache behavior)
  • pay a light validation cost upfront and some light read-time costs

The second option is almost always better, but to me it's still worth optimizing.

Here's an example of where a fixed-size VZV might not always work. I used month names as an example: in most cases, they will fit in a u8-backed VZV. But, for some locales, perhaps those that use non-BMP code points, the month names may exceed the capacity of a u8-backed VZV. With a generic parameter, we would need to force all locales to adopt a u16 VZV, costing data size for everyone.

I don't understand how this is relevant: I'm quite happy with a variable size VZV, what I'm pushing back against is making the length itself an independent variable size; I'd rather it be set as u32. I see it as vanishingly rare that someone will want a VZV with more than u32s worth of elements.

@sffc
Member Author

sffc commented Mar 4, 2022

This is fixed by using FlexZeroVec (#1443)

@Manishearth Manishearth removed the discuss Discuss at a future ICU4X-SC meeting label Mar 4, 2022
@sffc sffc self-assigned this Apr 1, 2022
@sffc sffc added this to the ICU4X 1.0 milestone Apr 1, 2022
@sffc sffc added S-large Size: A few weeks (larger feature, major refactoring) and removed v1 labels Apr 1, 2022
@sffc
Member Author

sffc commented Jun 8, 2022

FlexZeroVec is implemented, but we still need to try using it in VarZeroVec.

@sffc sffc added discuss-priority Discuss at the next ICU4X meeting and removed discuss-priority Discuss at the next ICU4X meeting labels Jul 28, 2022
@sffc
Member Author

sffc commented Jul 29, 2022

For 1.0, I want to save FZV-compatible bytes into the postcard files, but we don't need to use FZV yet. This lets us use it with a configurable feature in a compatible way.

@sffc
Member Author

sffc commented Aug 1, 2022

At a high level, there are 3 paths:

  1. A feature flag somewhere (either zerovec or icu_provider) that changes ALL VZV impls to use fast mode or slow mode, with appropriate metadata added to data requests
  2. Add a metadata byte to VZV so that the same data files can power both the fast and small VZV impls, but size-conscious consumers can build their own small files
  3. Make VZV fallible and generic and decide at each individual call site which one is most appropriate, expecting that in most cases we use a 16-bit version

@sffc
Member Author

sffc commented Aug 1, 2022

Decision: pursue option 3. For 1.0, use 16-bit VZVs for most of ICU4X, and 32-bit VZVs for the parts that need it. Extend this to other types of VZVs down the road.

@Manishearth
Member

Followup: #2312

@Manishearth
Member

Once #2306 lands we should probably split this into followups about:

@Manishearth
Member

Polish-blockers done in #2306

@sffc sffc mentioned this issue Nov 10, 2022
@sffc sffc added C-zerovec Component: Yoke, ZeroVec, DataBake and removed C-data-infra Component: provider, datagen, fallback, adapters labels Dec 22, 2022
@sffc
Member Author

sffc commented Dec 13, 2023

Something else to fix in the context of this issue: When Index16 overflows it says "Attempted to build VarZeroVec out of elements that cumulatively are larger than a u32 in size" which is wrong
