Tracking Issue: Compressor Optimizations #7697

@connortsui20

Description

This is a tracking issue for potentially making improvements to our default compressor (which is just all of our encodings/schemes + our sampling compressor).

Design

We recently made several large structural changes to the compressor (tracked in #7216), which made the sampling compressor pluggable and improved observability into what the compressor is doing (perfetto tracing and a clearer sampling algorithm).

Those improvements surfaced several ways we could optimize the compressor further, but it is unclear how much each would help.

Below is a list of ideas, in no particular order, that I think could improve the compressor.

Steps

  • Tracking Issue: Pluggable Compressor #7216
  • Scheme priorities (i.e., trying one scheme before another)
  • More exclusion rules (with more formal proofs on why they are correct)
  • Hardcode ConstantScheme logic into the compressor
  • Hardcode DictScheme logic into the compressor? (likely more controversial...)
  • Fine-grained stats collection. Not all schemes need all stats, so we can be even lazier with stats generation.
  • Some scheme estimation has to compress the entire array to see if it makes sense (like SequenceScheme), but then we throw away the compressed result and recompress later. That can definitely be fixed.
  • Replace hardcoded MAX_CASCADE = 3 with adaptive cascading that leverages the exclusion-bounded search tree (this is a stretch goal as the search space is potentially exponential)
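The "scheme priorities" idea above could look roughly like this. This is a minimal sketch, not the real compressor API: `Scheme`, `estimate_ratio`, `pick_scheme`, and the threshold are all hypothetical, and the toy estimator just hardcodes ratios to illustrate the early-exit control flow.

```rust
// Hypothetical sketch: try candidate schemes in priority order and
// short-circuit once a scheme beats a "good enough" compression ratio.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Scheme {
    Constant,
    Dict,
    Sequence,
    Fsst,
}

/// Toy ratio estimator (smaller is better): pretends constant arrays
/// compress nearly for free and hardcodes ratios for everything else.
fn estimate_ratio(scheme: Scheme, values: &[u64]) -> f64 {
    let all_equal = values.windows(2).all(|w| w[0] == w[1]);
    match scheme {
        Scheme::Constant if all_equal => 0.01, // near-free
        Scheme::Constant => 1.0,               // no benefit
        Scheme::Dict => 0.5,
        Scheme::Sequence => 0.7,
        Scheme::Fsst => 0.9,
    }
}

/// Walk schemes in priority order; stop at the first one whose estimated
/// ratio beats `good_enough`, falling back to the overall best otherwise.
fn pick_scheme(values: &[u64], good_enough: f64) -> Scheme {
    let priorities = [Scheme::Constant, Scheme::Dict, Scheme::Sequence, Scheme::Fsst];
    let mut best = (priorities[0], f64::INFINITY);
    for &scheme in &priorities {
        let ratio = estimate_ratio(scheme, values);
        if ratio <= good_enough {
            return scheme; // early exit: a higher-priority scheme is good enough
        }
        if ratio < best.1 {
            best = (scheme, ratio);
        }
    }
    best.0
}
```

The point of the ordering is that cheap, high-confidence schemes get checked before expensive estimators ever run.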
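The fine-grained lazy stats idea could be sketched as memoized per-statistic computation, so a scheme that only needs min/max never pays for a run count. Again, `LazyStats`, `min_max`, and `run_count` are illustrative names under assumed semantics, not the actual stats API.

```rust
// Hypothetical sketch of fine-grained, lazy stats: each statistic is
// computed only the first time a scheme asks for it, then memoized.

use std::cell::OnceCell;

struct LazyStats<'a> {
    values: &'a [i64],
    min_max: OnceCell<(i64, i64)>,
    run_count: OnceCell<usize>,
}

impl<'a> LazyStats<'a> {
    fn new(values: &'a [i64]) -> Self {
        Self {
            values,
            min_max: OnceCell::new(),
            run_count: OnceCell::new(),
        }
    }

    /// Min/max, computed once on first use (e.g. for a bit-packing scheme).
    fn min_max(&self) -> (i64, i64) {
        *self.min_max.get_or_init(|| {
            let min = self.values.iter().copied().min().unwrap_or(0);
            let max = self.values.iter().copied().max().unwrap_or(0);
            (min, max)
        })
    }

    /// Number of runs, computed once on first use (e.g. for run-length encoding).
    fn run_count(&self) -> usize {
        *self.run_count.get_or_init(|| {
            if self.values.is_empty() {
                0
            } else {
                1 + self.values.windows(2).filter(|w| w[0] != w[1]).count()
            }
        })
    }
}
```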
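The "compress during estimation, then throw it away" problem could be avoided by having the estimate carry its compressed buffer along, so a winning scheme reuses it instead of recompressing. This is a sketch with a toy run-length codec standing in for SequenceScheme; `Estimate`, `estimate_full_compress`, and `compress` are hypothetical.

```rust
// Hypothetical sketch: if estimating a scheme already required compressing
// the whole array, keep that result instead of discarding it and
// recompressing when the scheme is selected.

struct Estimate {
    ratio: f64,
    // Compressed bytes produced as a side effect of estimation,
    // reused directly if this scheme wins.
    compressed: Option<Vec<u8>>,
}

/// Toy "compression": run-length encode the input as (value, count) pairs.
fn estimate_full_compress(values: &[u8]) -> Estimate {
    let mut out = Vec::new();
    let mut iter = values.iter().copied().peekable();
    while let Some(v) = iter.next() {
        let mut count: u8 = 1;
        while iter.peek() == Some(&v) && count < u8::MAX {
            iter.next();
            count += 1;
        }
        out.extend_from_slice(&[v, count]);
    }
    let ratio = if values.is_empty() {
        1.0
    } else {
        out.len() as f64 / values.len() as f64
    };
    Estimate { ratio, compressed: Some(out) }
}

/// Reuse the buffer that estimation already produced.
fn compress(values: &[u8]) -> Vec<u8> {
    let est = estimate_full_compress(values);
    est.compressed.expect("estimation already compressed the array")
}
```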

Unresolved questions

  • Should MAX_CASCADE be replaced with an adaptive search strategy now that the exclusion system bounds the search space?
  • Is there a way we can improve FSST scheme estimation and compression? We currently sample for FSST compression every time, which is super expensive. I tried being smarter with FSST estimation (not sampling every single time and predicting when FSST is good), but that basically only caused regressions (see Potentially better FSST estimation #7611). It is unclear to me whether there is a better way to quickly check if FSST is good for the whole array.
  • Because we sample down to 1024 values, a null-dominated sparse array can yield a sample with 0 non-null values, which makes the sampled compression ratio technically infinite. This is obviously wrong, and it is extra wrong because the sampler will then assume the entire array is all-null, a case we already check for far in advance. It is unclear what the best solution to this is; see Sampling somehow compresses into 0 bytes #7268.
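One conservative fix for the all-null-sample case could be to detect it and report a neutral ratio rather than trusting the degenerate sample. This is only a sketch of that guard; `sampled_ratio` and the fallback value of 1.0 are assumptions, not the sampler's real behavior.

```rust
// Hypothetical sketch: when a fixed-size sample of a null-dominated array
// contains zero non-null values, the naive "compressed bytes / sample
// bytes" ratio degenerates (0 / 0, i.e. "infinite" compression). Detect
// that case and fall back to a neutral ratio instead.

fn sampled_ratio(sample: &[Option<u32>], compressed_bytes: usize) -> f64 {
    let non_null = sample.iter().filter(|v| v.is_some()).count();
    if non_null == 0 {
        // An all-null sample tells us nothing about the non-null values:
        // do NOT report an infinite ratio, and do not conclude the whole
        // array is all-null (that case is already handled much earlier).
        return 1.0;
    }
    let sample_bytes = non_null * std::mem::size_of::<u32>();
    compressed_bytes as f64 / sample_bytes as f64
}
```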

Implementation history

  • Nothing yet.
