Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider FlexZeroVec #1443

Closed
Tracked by #1082
sffc opened this issue Dec 23, 2021 · 9 comments · Fixed by #1790
Closed
Tracked by #1082

Consider FlexZeroVec #1443

sffc opened this issue Dec 23, 2021 · 9 comments · Fixed by #1790
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters good first issue Good for newcomers S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Dec 23, 2021

In #1410 we've been talking about making the index array flexible in integer length.

Perhaps we should generalize this to be a new type called FlexZeroVec. A FlexZeroVec is a ZeroVec that can only hold unsigned integers, and where the first byte is a marker to signal how many bytes each element in the array consumes. Mutation operations on a FlexZeroVec may result in the bytes being expanded or shrunk.

For example:

// The array [1,2,3,4,5]:
01 01 02 03 04 05

// Byte 0 = all remaining bytes are 1 long
// Bytes 1-5 = data section

Note that we would need both FlexZeroVec and FlexZeroSlice, and also note that, like VarZeroVec and unlike ZeroSlice, FlexZeroSlice would not support subslices.

Open questions:

  • What should the .get() function return? It could return a usize, or it could return a u32, depending on whether we decide to limit the capacity of zerovec structs to 2^32.

CC @Manishearth

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) labels Dec 23, 2021
@sffc
Copy link
Member Author

sffc commented Dec 23, 2021

Hmm, what do you think about making ZeroMapKV for usize be FlexZeroVec ?

@Manishearth
Copy link
Member

This seems interesting. Do we have any vectors that are both large enough and heterogenous enough that this would be worth it? CPT comes to mind but I suspect this isn't that heterogenous.

@Manishearth
Copy link
Member

Hmm, what do you think about making ZeroMapKV for usize be FlexZeroVec ?

huh, that's interesting. Yeah that would work!

@sffc
Copy link
Member Author

sffc commented Dec 23, 2021

This seems interesting. Do we have any vectors that are both large enough and heterogenous enough that this would be worth it? CPT comes to mind but I suspect this isn't that heterogenous.

The main initial use case I had in mind was the index portion of the VZV buffer. We can cast that section of the buffer to a FlexZeroSlice.

Also, in ZeroMap2k, the joiner array would be a FlexZeroSlice.

@Manishearth
Copy link
Member

Ah! Well, we should probably add that as an internal abstraction and then consider making it public (the way we did with SliceComponents -> VZVBorrowed)

@sffc sffc added good first issue Good for newcomers help wanted Issue needs an assignee labels Jan 6, 2022
@sffc sffc added this to the ICU4X 0.6 milestone Jan 6, 2022
@sffc
Copy link
Member Author

sffc commented Feb 14, 2022

CC @robertbastian

Since d90dbf9 added 10% to the postcard data size, I wanted to drill down and see where the impact came from.

I split out the postcard data into individual files (#1606) and then compared the resource time_zone/generic_long@1/es.

Size differences:

  • Before:
    • Full size: 4834 B
    • gzip: 2009 B
  • After:
    • Full size: 5801 B
    • gzip: 2505 B

I show gzip to demonstrate that the new postcard files are larger even when compressed. I should note that the difference is smaller as a percentage on the full postcard file.

Looking inside the file, here's what I see. Old, first 40 bytes:

9d 01 04 41 63 72 65 0c   ...Acre.
48 6f 72 61 20 64 65 20   Hora de 
41 63 72 65 0b 41 66 67   Acre.Afg
68 61 6e 69 73 74 61 6e   hanistan
13 68 6f 72 61 20 64 65   .hora de

New, first 16 bytes followed by an excerpt in the middle:

a2 10 9d 00 00 00 00 00   ........
00 00 04 00 00 00 0f 00   ........
<snip>
a5 05 00 00 41 63 72 65   ....Acre
41 66 67 68 61 6e 69 73   Afghanis
74 61 6e 41 66 72 69 63   tanAfric

Interpretation:

  • In both cases, the byte 9d is the length of the vector.
  • When Postcard encodes the vector directly, it uses 1 byte per element for the length, whereas we use 4 bytes per element.
  • The highest value stored in the VZV index array is 0x05a5.

Discussion: If we were to use FlexZeroVec, we could represent the index array using u16 rather than u32. This would cut out about 630 bytes, which is more than half the difference between raw Postcard and VZV, at very little runtime cost and relatively little disruption.

However, this does raise questions about whether we could use an even more compact encoding for VZV. As an extreme example, we could use Postcard's own representation and pay O(N) lookup. But perhaps we can brainstorm other data models.

@sffc
Copy link
Member Author

sffc commented Feb 14, 2022

Let's say we want to store the following values in a ZV:

  • Decimal: [5, 12, 31, 41, 55, 130, 290, 351]
  • Hex: [0x05, 0x0B, 0x1F, 0x29, 0x37, 0x82, 0x122, 0x15F]
  • Decimal Derivatives: [5, 7, 19, 10, 14, 116, 160, 61]
  • Hex Derivatives: [0x05, 0x07, 0x13, 0x0A, 0x0E, 0x74, 0xA0, 0x3D]

Right now, we store:

05 00 00 00 0B 00 00 00  # u32 array
1F 00 00 00 29 00 00 00  # u32 array
37 00 00 00 82 00 00 00  # u32 array
22 01 00 00 5F 01 00 00  # u32 array

We can cut that almost in half by using FlexZeroVec, with one additional read per lookup (the item width):

02  # item width
05 00 0B 00 1F 00 29 00  # u16 array
37 00 82 00 22 01 5F 01  # u16 array

Here's a hypothetical structure storing derivatives (lengths of individual items) with constant-time lookup and 1-4 extra reads per lookup that supports item widths of up to 255 bytes. To read from this structure, first you read the item width, then you read the first item in the i<<2th row, then you add the derivatives until you get the value you were looking for:

01  # item width
05 07 13 0A  # item (0) + derivatives (1-3)
37 74 A0 3D  # item (4) + derivatives (5-7)

My conclusion is that we can keep the same VZV structure, and iterate on the representation of ZeroVec<usize> as needed. I think we should start with FlexZeroVec, and explore other, more complex options as needed.

@sffc
Copy link
Member Author

sffc commented Mar 4, 2022

  • @iainireland - This makes sense in small data where small data is more important than speed. I don't think we want to use this for anything hot. If we know the width at compile time, we can get some wins on certain processors.
  • @Manishearth - I want ZV to be as efficient as possible; everything being FZV is annoying. But due to the way ZeroMap works, if you want a flexible ZeroMap, you could have a ZeroMap<Flex...>. It would be a little annoying. The other thing is that we have the equality constraint that byte equality is semantic equality, so there needs to be a single valid representation. If you mutate a FlexZeroVec, it would need to resize itself in the process to retain this invariant. This is fine but not fun.
  • @iainireland - I saw this as replacing things only in isolated situations.
  • @sffc - My position going into this conversation is that we change VZV to always use the FlexZeroVec for the index array.
  • @iainireland - Why not do a const parameter?
  • @sffc - Say we want to store Gregorian month names. In English, this would fit in a 1-byte VZV, but in a 4-byte script like Adlam or Chakma, it would probably require a 2-byte VZV.
  • @Manishearth - Does FZV need to be a public type?
  • @iainireland - I'm concerned that if we encourage people to use VZV more, we shouldn't make them slower. For storing strings, it makes sense. But when you're doing case folding, fast access matters a lot. How difficult would it be to support both?
  • @Manishearth - All of this should be gated on benchmarks.
  • @echeran - Can we have a flag to say I want this in fast mode or slow mode?
  • @sffc - We could make a specialized VZV type for fast access, doing some tricks to move the lengths into the data section, and use FZV for the normal case.

(discussion continues)

  • @Manishearth: Iain's case for fast indexing can also be handled by some unsafe methods on VZV, or by directly using a byte slice and VarULE.

@sffc
Copy link
Member Author

sffc commented Mar 4, 2022

We agree on using FlexZeroVec as the only index array type for VarZeroVec.

We will start with the "basic" FlexZeroVec where the first byte is the item width.

+1 from @iainireland, @Manishearth, @echeran

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Mar 4, 2022
@sffc sffc self-assigned this Apr 9, 2022
@sffc sffc removed the help wanted Issue needs an assignee label Apr 9, 2022
@sffc sffc mentioned this issue Apr 9, 2022
@sffc sffc modified the milestones: ICU4X 0.6, ICU4X 1.0 (Features) May 25, 2022
@sffc sffc closed this as completed in #1790 Jun 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters good first issue Good for newcomers S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants