
minimize in-memory overhead of page indexes #80

Open
achille-roussel opened this issue Feb 24, 2022 · 2 comments


achille-roussel commented Feb 24, 2022

Page indexes are split into two data structures: Column Indexes and Offset Indexes. The goal of this ticket is to track work to optimize the in-memory storage of those data structures, since the number of pages in large parquet files can be significant and memory utilization grows linearly with it.

To put the work into perspective, our system currently holds ~75TB of data in the central deployment, mostly in files of ~256GB, so around 300 files, about 10 files per storage node. Assuming a page size between 256KB and 2MB (pages sometimes need to be loaded in memory), we are looking at 100K to 1M pages per file, times 10 files per storage node. At this scale, optimizing page index entries can save on the order of gigabytes of memory per process, making it worth the investment: the infrastructure cost remains inversely proportional to the amount of data we can manage per node.
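
For reference, a quick back-of-the-envelope calculation behind those page counts (the constants below are only illustrative, matching the figures above):

const (
  fileSize    = 256 << 30 // ~256GB per file
  minPageSize = 256 << 10 // 256KB pages
  maxPageSize = 2 << 20   // 2MB pages

  maxPagesPerFile = fileSize / minPageSize // 1,048,576 (~1M pages)
  minPagesPerFile = fileSize / maxPageSize // 131,072 (~100K pages)
)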

Column Index

The column indexes have the following structure:

type ColumnIndex struct {
  NullPages     []bool        `thrift:"1,required"`
  MinValues     [][]byte      `thrift:"2,required"`
  MaxValues     [][]byte      `thrift:"3,required"`
  BoundaryOrder BoundaryOrder `thrift:"4,required"`
  NullCounts    []int64       `thrift:"5,optional"`
}

Here are a few observations about this structure:

The array of booleans wastes 7 bits per item, since there are only two possible values but bool items are one byte each (a bitmap would do, as sketched below). The null pages array could also be omitted entirely for required columns, or for repeated groups containing only required columns, since the values will always be false.

The arrays of byte slices have a 24-byte overhead per item (each slice header is a pointer + length + capacity). Since the values are serialized with the PLAIN encoding, this overhead is significant for columns holding smaller data types like boolean, int32, int64, etc.

For columns that contain no null values (e.g. required columns, or repeated groups with only required fields), the count of null values will be an array containing only zeroes. For columns that may hold null values, the null count should fit in a 32-bit integer, since the number of values and the number of nulls are stored as 32-bit integers in the data page header. The thrift field is optional, indicating that the array could be omitted in some cases (though unfortunately the parquet spec does not say so explicitly, see parquet-format/PageIndex.md at master · apache/parquet-format).

While we wouldn't want to change the definition of the data structure in format/parquet.go (since it needs to match the thrift definition), these observations create opportunities to use more efficient in-memory representations after decoding a column index from a parquet file.
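
As a small illustration of the first observation, the null-page flags could be bit-packed instead of stored as a []bool; this is only a sketch, and the type and function names are made up:

// nullPageBitmap packs one bit per page instead of one byte per page.
type nullPageBitmap []uint64

// NullPage reports whether page i contains only null values.
func (b nullPageBitmap) NullPage(i int) bool {
  return b[uint(i)/64]&(1<<(uint(i)%64)) != 0
}

// makeNullPageBitmap converts the decoded thrift []bool into the packed form.
func makeNullPageBitmap(nullPages []bool) nullPageBitmap {
  bitmap := make(nullPageBitmap, (len(nullPages)+63)/64)
  for i, null := range nullPages {
    if null {
      bitmap[uint(i)/64] |= 1 << (uint(i) % 64)
    }
  }
  return bitmap
}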

In the parquet package, we represent column indexes via the following interface:

type ColumnIndex interface {
    NumPages() int
    NullCount(int) int64
    NullPage(int) bool
    MinValue(int) []byte
    MaxValue(int) []byte
    IsAscending() bool
    IsDescending() bool
}

This allows the underlying in-memory objects to have heterogeneous shapes as long as they respect the contract of the interface. Take as an example a required column of type int64: instead of keeping the thrift structure in memory (which wastes a lot of space), we could use a more efficient representation:

type int64ColumnIndex struct {
  minValues []int64
  maxValues []int64
  order     format.BoundaryOrder
}

func (index *int64ColumnIndex) NumPages() int         { return len(index.minValues) }
func (index *int64ColumnIndex) NullCount(i int) int64 { return 0 }
func (index *int64ColumnIndex) NullPage(i int) bool   { return false }
func (index *int64ColumnIndex) MinValue(i int) []byte { /* see below */ }
func (index *int64ColumnIndex) MaxValue(i int) []byte { /* see below */ }
func (index *int64ColumnIndex) IsAscending() bool     { return index.order == format.Ascending }
func (index *int64ColumnIndex) IsDescending() bool    { return index.order == format.Descending }

The overhead of this representation is close to zero: the arrays of min/max values store the values only, there are no extra indirections, etc.

We would probably want to change the signature of the MinValue/MaxValue methods to this:

    MinValue(int) Value
    MaxValue(int) Value

This would let us box the int64 values into parquet.Value and avoid having to convert them to []byte.
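
For illustration, with that signature the int64 specialization could return boxed values directly; this is only a sketch and assumes a constructor such as ValueOf exists to build a Value from an int64:

// Sketch only: ValueOf (or an equivalent typed constructor) is assumed.
func (index *int64ColumnIndex) MinValue(i int) Value { return ValueOf(index.minValues[i]) }
func (index *int64ColumnIndex) MaxValue(i int) Value { return ValueOf(index.maxValues[i]) }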

Applying similar specialization strategies for each primitive parquet type, as well as having specializations for nullable columns, will provide much more efficient use of memory when dealing with large parquet files holding thousands of pages or more per column.

The use of arrays of plain types over [][]byte also greatly reduces the number of pointers in the program: instead of having 2 slices per page (each holding a pointer to a byte array), there are none. Memory areas containing no pointers are skipped by the garbage collector, reducing the overall compute footprint of memory management; the collector avoids touching large amounts of memory that is unlikely to be in CPU caches and therefore has high load latency, evicting more important data along the way.

Finally, we should take advantage of the fact that null counts are an optional field in the thrift definition, and omit writing them when the array contains only zero values; this will save space both in memory and on disk, and reduce encoding and decoding time.
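
A minimal sketch of that omission logic (the helper name is made up):

// nullCountsOrNil returns nil when the column has no null values, so the
// optional NullCounts field can be left out of the encoded column index.
func nullCountsOrNil(nullCounts []int64) []int64 {
  for _, count := range nullCounts {
    if count != 0 {
      return nullCounts
    }
  }
  return nil
}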

Offset Index

The thrift definition of offset index items is the following:

type PageLocation struct {
    Offset             int64 `thrift:"1,required"`
    CompressedPageSize int32 `thrift:"2,required"`
    FirstRowIndex      int64 `thrift:"3,required"`
}

type OffsetIndex struct {
    PageLocations []PageLocation `thrift:"1,required"`
}

There are fewer immediately obvious ways to optimize the in-memory layout of this data structure, but each item still uses 20 to 24 bytes of memory (depending on alignment rules). However, here are a few observations that we can make:

The page locations are ordered by increasing Offset/FirstRowIndex

Pages of a column should have roughly similar compressed sizes

Instead of storing page locations as an array of this data structure, we could split the fields into three separate arrays holding deltas between consecutive values rather than absolute values:

type offsetIndex struct {
  offsetDeltas             []int32
  compressedPageSizeDeltas []int32
  firstRowIndexDeltas      []int32
}

We now need only 12 bytes of memory per page, a reduction on the order of 40-50% compared to using the thrift data format.

However, we are trading memory for compute time; for example, in order to reconstruct the first row index of a page, we need to sum the values of all deltas up to that page. Considering the use case of binary searching for a specific row index, this turns an O(log(N)) operation into an O(N*log(N)) one, at which point a simple O(N) scan becomes the better option. This may still significantly impact query time when searching rows in columns with a large number of rows.
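
To make that cost concrete, here is a sketch of reconstructing the first row index of page i from the delta representation above (assuming the first delta stores the absolute value of page 0):

// firstRowIndex rebuilds the absolute first row index of page i by summing
// the deltas up to and including page i; this linear pass is the cost paid
// on every lookup.
func (index *offsetIndex) firstRowIndex(i int) int64 {
  rowIndex := int64(0)
  for _, delta := range index.firstRowIndexDeltas[:i+1] {
    rowIndex += int64(delta)
  }
  return rowIndex
}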

An improvement on this model would be to use a Frame-of-Reference data structure: instead of using a single array of deltas, we group deltas into frames and keep the absolute value at the beginning of each frame:

type frameOfRef struct {
  first  int64
  deltas []int32
}

Depending on the magnitude of the deltas, we could optimize further by storing them as 16-bit integers when 2 bytes are enough to represent the largest delta.

Binary search can now be done using the first value of each frame, then a linear scan within a single frame reconstructs the final values. Bounding the frame size to a constant effectively makes this scan an O(1) operation with a high constant cost. The larger the frames, the more efficient memory usage becomes, but the longer the scans; however, we now have a knob on the compute/memory trade-off spectrum that lets us find the right balance.
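
Here is a sketch of such a lookup, building on the frameOfRef type above; the searchPages helper and its framing convention (each frame covers the page holding its first value plus one page per delta, frameSize pages per full frame) are assumptions for illustration:

import "sort"

// searchPages returns the index of the last page whose first row index is
// less than or equal to rowIndex.
func searchPages(frames []frameOfRef, frameSize int, rowIndex int64) int {
  // Binary search on the absolute value stored at the head of each frame.
  f := sort.Search(len(frames), func(i int) bool { return frames[i].first > rowIndex }) - 1
  if f < 0 {
    return 0 // rowIndex precedes the first page
  }
  // Bounded linear scan inside the selected frame: O(frameSize), which is
  // O(1) once the frame size is fixed.
  page, value := f*frameSize, frames[f].first
  for _, delta := range frames[f].deltas {
    value += int64(delta) // first row index of the next page
    if value > rowIndex {
      break
    }
    page++
  }
  return page
}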

A second optimization would be to use SIMD to reduce the constant cost of summing the deltas. Vectors of 16 or 32-bit integers could see 8 to 16x speedups from optimized routines, allowing us to trade more compute for a lower memory footprint without negatively impacting query performance.

achille-roussel added the enhancement label on Feb 24, 2022
kevinburkesegment commented:

Is it worth using https://github.com/orijtech/structslop to optimize the order that we put the fields in each struct?

the parquet spec does not mention it unfortunately

Can we open a ticket? It seems like they should at least say one way or another what the treatment is.

the null count should fit in a 32-bit integer since the number of values and the number of nulls are stored as 32-bit integers in the data page header

I'm confused about this. Wouldn't we still need to store which pages specifically are null?

in order to reconstruct the first row index of a page, we need to sum the values of all deltas up to that page.

This plus the added complexity make me think maybe we should wait until we deploy this and have a better sense of the size savings + access patterns to decide if it's worth it to do this.


achille-roussel commented Feb 25, 2022

Is it worth using https://github.com/orijtech/structslop to optimize the order that we put the fields in each struct?

I don't think we have a lot of room for optimization here: most of the memory is going to be held in the backing arrays of the slices, and the fields are made of types that already align well (no small types interleaved with larger ones, etc...).

Can we open a ticket? It seems like they should at least say one way or another what the treatment is.

That seems like the right thing to do; I'll have to familiarize myself with the issue submission process for parquet-format (it doesn't seem to use GitHub issues).

I'm confused about this. Wouldn't we still need to store which pages specifically are null?

Yes, I think I meant to point out that they are arrays of 64-bit integers in the thrift definition, which seems larger than necessary. I don't know why the field was defined as int64 rather than int32, since the same values are represented with 32-bit integers in other places. Maybe they anticipated using the column index structures for aggregates as well (where more than 32 bits may be needed to represent the null counts).

This plus the added complexity make me think maybe we should wait until we deploy this and have a better sense of the size savings + access patterns to decide if it's worth it to do this.

I don't know if the access pattern is going to play a big part here; the indexes have to be loaded in memory to be effective (if they are kept on disk, we need to issue O(log(N)) I/O operations to binary search through the index). That's why I was trying to estimate the amount of memory needed to represent the indexes in memory: the more compact they are, the larger the dataset can be, and the more memory is left available for other parts of the system.
