Consider FlexZeroVec #1443

sffc · 2021-12-23T01:21:32Z

In #1410 we've been talking about making the index array flexible in integer length.

Perhaps we should generalize this to be a new type called FlexZeroVec. A FlexZeroVec is a ZeroVec that can only hold unsigned integers, and where the first byte is a marker to signal how many bytes each element in the array consumes. Mutation operations on a FlexZeroVec may result in the bytes being expanded or shrunk.

For example:

// The array [1,2,3,4,5]:
01 01 02 03 04 05

// Byte 0 = all remaining bytes are 1 long
// Bytes 1-5 = data section

Note that we would need both FlexZeroVec and FlexZeroSlice, and also note that, like VarZeroVec and unlike ZeroSlice, FlexZeroSlice would not support subslices.

Open questions:

What should the .get() function return? It could return a usize, or it could return a u32, depending on whether we decide to limit the capacity of zerovec structs to 2^32.

CC @Manishearth

The text was updated successfully, but these errors were encountered:

sffc · 2021-12-23T01:25:08Z

Hmm, what do you think about making ZeroMapKV for usize be FlexZeroVec ?

Manishearth · 2021-12-23T01:25:18Z

This seems interesting. Do we have any vectors that are both large enough and heterogenous enough that this would be worth it? CPT comes to mind but I suspect this isn't that heterogenous.

Manishearth · 2021-12-23T01:26:08Z

Hmm, what do you think about making ZeroMapKV for usize be FlexZeroVec ?

huh, that's interesting. Yeah that would work!

sffc · 2021-12-23T01:28:33Z

This seems interesting. Do we have any vectors that are both large enough and heterogenous enough that this would be worth it? CPT comes to mind but I suspect this isn't that heterogenous.

The main initial use case I had in mind was the index portion of the VZV buffer. We can cast that section of the buffer to a FlexZeroSlice.

Also, in ZeroMap2k, the joiner array would be a FlexZeroSlice.

Manishearth · 2021-12-23T01:33:50Z

Ah! Well, we should probably add that as an internal abstraction and then consider making it public (the way we did with SliceComponents -> VZVBorrowed)

sffc · 2022-02-14T19:02:07Z

CC @robertbastian

Since d90dbf9 added 10% to the postcard data size, I wanted to drill down and see where the impact came from.

I split out the postcard data into individual files (#1606) and then compared the resource time_zone/generic_long@1/es.

Size differences:

Before:
- Full size: 4834 B
- gzip: 2009 B
After:
- Full size: 5801 B
- gzip: 2505 B

I show gzip to demonstrate that the new postcard files are larger even when compressed. I should note that the difference is smaller as a percentage on the full postcard file.

Looking inside the file, here's what I see. Old, first 40 bytes:

9d 01 04 41 63 72 65 0c   ...Acre.
48 6f 72 61 20 64 65 20   Hora de 
41 63 72 65 0b 41 66 67   Acre.Afg
68 61 6e 69 73 74 61 6e   hanistan
13 68 6f 72 61 20 64 65   .hora de

New, first 16 bytes followed by an excerpt in the middle:

a2 10 9d 00 00 00 00 00   ........
00 00 04 00 00 00 0f 00   ........
<snip>
a5 05 00 00 41 63 72 65   ....Acre
41 66 67 68 61 6e 69 73   Afghanis
74 61 6e 41 66 72 69 63   tanAfric

Interpretation:

In both cases, the byte 9d is the length of the vector.
When Postcard encodes the vector directly, it uses 1 byte per element for the length, whereas we use 4 bytes per element.
The highest value stored in the VZV index array is 0x05a5.

Discussion: If we were to use FlexZeroVec, we could represent the index array using u16 rather than u32. This would cut out about 630 bytes, which is more than half the difference between raw Postcard and VZV, at very little runtime cost and relatively little disruption.

However, this does raise questions about whether we could use an even more compact encoding for VZV. As an extreme example, we could use Postcard's own representation and pay O(N) lookup. But perhaps we can brainstorm other data models.

sffc · 2022-02-14T19:24:28Z

Let's say we want to store the following values in a ZV:

Decimal: [5, 12, 31, 41, 55, 130, 290, 351]
Hex: [0x05, 0x0B, 0x1F, 0x29, 0x37, 0x82, 0x122, 0x15F]
Decimal Derivatives: [5, 7, 19, 10, 14, 116, 160, 61]
Hex Derivatives: [0x05, 0x07, 0x13, 0x0A, 0x0E, 0x74, 0xA0, 0x3D]

Right now, we store:

05 00 00 00 0B 00 00 00  # u32 array
1F 00 00 00 29 00 00 00  # u32 array
37 00 00 00 82 00 00 00  # u32 array
22 01 00 00 5F 01 00 00  # u32 array

We can cut that almost in half by using FlexZeroVec, with one additional read per lookup (the item width):

02  # item width
05 00 0B 00 1F 00 29 00  # u16 array
37 00 82 00 22 01 5F 01  # u16 array

Here's a hypothetical structure storing derivatives (lengths of individual items) with constant-time lookup and 1-4 extra reads per lookup that supports item widths of up to 255 bytes. To read from this structure, first you read the item width, then you read the first item in the i<<2th row, then you add the derivatives until you get the value you were looking for:

01  # item width
05 07 13 0A  # item (0) + derivatives (1-3)
37 74 A0 3D  # item (4) + derivatives (5-7)

My conclusion is that we can keep the same VZV structure, and iterate on the representation of ZeroVec<usize> as needed. I think we should start with FlexZeroVec, and explore other, more complex options as needed.

sffc · 2022-03-04T18:29:54Z

@iainireland - This makes sense in small data where small data is more important than speed. I don't think we want to use this for anything hot. If we know the width at compile time, we can get some wins on certain processors.
@Manishearth - I want ZV to be as efficient as possible; everything being FZV is annoying. But due to the way ZeroMap works, if you want a flexible ZeroMap, you could have a ZeroMap<Flex...>. It would be a little annoying. The other thing is that we have the equality constraint that byte equality is semantic equality, so there needs to be a single valid representation. If you mutate a FlexZeroVec, it would need to resize itself in the process to retain this invariant. This is fine but not fun.
@iainireland - I saw this as replacing things only in isolated situations.
@sffc - My position going into this conversation is that we change VZV to always use the FlexZeroVec for the index array.
@iainireland - Why not do a const parameter?
@sffc - Say we want to store Gregorian month names. In English, this would fit in a 1-byte VZV, but in a 4-byte script like Adlam or Chakma, it would probably require a 2-byte VZV.
@Manishearth - Does FZV need to be a public type?
@iainireland - I'm concerned that if we encourage people to use VZV more, we shouldn't make them slower. For storing strings, it makes sense. But when you're doing case folding, fast access matters a lot. How difficult would it be to support both?
@Manishearth - All of this should be gated on benchmarks.
@echeran - Can we have a flag to say I want this in fast mode or slow mode?
@sffc - We could make a specialized VZV type for fast access, doing some tricks to move the lengths into the data section, and use FZV for the normal case.

(discussion continues)

@Manishearth: Iain's case for fast indexing can also be handled by some unsafe methods on VZV, or by directly using a byte slice and VarULE.

sffc · 2022-03-04T18:40:55Z

We agree on using FlexZeroVec as the only index array type for VarZeroVec.

We will start with the "basic" FlexZeroVec where the first byte is the item width.

+1 from @iainireland, @Manishearth, @echeran

sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) labels Dec 23, 2021

sffc added good first issue Good for newcomers help wanted Issue needs an assignee labels Jan 6, 2022

sffc added this to the ICU4X 0.6 milestone Jan 6, 2022

Manishearth mentioned this issue Feb 15, 2022

[tracking] ZeroVec improvements #1082

Closed

29 tasks

sffc referenced this issue Feb 18, 2022

Zero copy timezones (#1601)

d90dbf9

sffc added the discuss Discuss at a future ICU4X-SC meeting label Feb 18, 2022

Manishearth mentioned this issue Mar 3, 2022

Make LanguageIdentifier and/or Locale compatible with [Var]ZeroVec #831

Closed

sffc removed the discuss Discuss at a future ICU4X-SC meeting label Mar 4, 2022

This was referenced Mar 4, 2022

Investigate the possibility of u16 lengths in VarZeroVec #1261

Closed

VarZeroVec capacity and error handling #1410

Open

This was referenced Mar 5, 2022

Use TinyStr16 for era codes #1658

Closed

Fallible operations in zerovec #1081

Open

sffc self-assigned this Apr 9, 2022

sffc removed the help wanted Issue needs an assignee label Apr 9, 2022

sffc mentioned this issue Apr 9, 2022

FlexZeroVec #1790

Merged

sffc modified the milestones: ICU4X 0.6, ICU4X 1.0 (Features) May 25, 2022

sffc closed this as completed in #1790 Jun 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider FlexZeroVec #1443

Consider FlexZeroVec #1443

sffc commented Dec 23, 2021 •

edited

Loading

sffc commented Dec 23, 2021

Manishearth commented Dec 23, 2021

Manishearth commented Dec 23, 2021

sffc commented Dec 23, 2021 •

edited

Loading

Manishearth commented Dec 23, 2021

sffc commented Feb 14, 2022

sffc commented Feb 14, 2022

sffc commented Mar 4, 2022 •

edited by Manishearth

Loading

sffc commented Mar 4, 2022 •

edited

Loading

Consider FlexZeroVec #1443

Consider FlexZeroVec #1443

Comments

sffc commented Dec 23, 2021 • edited Loading

sffc commented Dec 23, 2021

Manishearth commented Dec 23, 2021

Manishearth commented Dec 23, 2021

sffc commented Dec 23, 2021 • edited Loading

Manishearth commented Dec 23, 2021

sffc commented Feb 14, 2022

sffc commented Feb 14, 2022

sffc commented Mar 4, 2022 • edited by Manishearth Loading

sffc commented Mar 4, 2022 • edited Loading

sffc commented Dec 23, 2021 •

edited

Loading

sffc commented Dec 23, 2021 •

edited

Loading

sffc commented Mar 4, 2022 •

edited by Manishearth

Loading

sffc commented Mar 4, 2022 •

edited

Loading