Tracking Issue: Compressor Optimizations #7697

@connortsui20

Description

This is a tracking issue for potentially making improvements to our default compressor (which is just all of our encodings/schemes + our sampling compressor).

Design

We recently made several large structural changes to the compressor (tracked in #7216), which made the sampling compressor pluggable and improved observability into what the compressor is doing (perfetto tracing and a clearer sampling algorithm).

Those improvements surfaced several ways we could optimize the compressor further, but it is unclear how much each would help.

Below is a list of ideas, in no particular order, that I think could improve the compressor.

Steps

  • Tracking Issue: Pluggable Compressor #7216
  • Scheme priorities (i.e., trying one scheme before another)
  • More exclusion rules (with more formal proofs on why they are correct)
  • Hardcode ConstantScheme logic into the compressor
  • Hardcode DictScheme logic into the compressor? (likely more controversial...)
  • Fine-grained stats collection. Not all schemes need all stats, so we can be even lazier with stats generation.
  • Some scheme estimation has to compress the entire array to see if it makes sense (like SequenceScheme), but then we throw away the compressed result and recompress later. That can definitely be fixed.
  • Replace hardcoded MAX_CASCADE = 3 with adaptive cascading that leverages the exclusion-bounded search tree (this is a stretch goal as the search space is potentially exponential)
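The "scheme priorities" idea above could look roughly like this. This is a minimal sketch, not the real compressor API: `Scheme`, `estimate_ratio`, `pick_scheme`, and the threshold are all hypothetical, and the toy estimator just hardcodes ratios to illustrate the early-exit control flow.

```rust
// Hypothetical sketch: try candidate schemes in priority order and
// short-circuit once a scheme beats a "good enough" compression ratio.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Scheme {
    Constant,
    Dict,
    Sequence,
    Fsst,
}

/// Toy ratio estimator (smaller is better): pretends constant arrays
/// compress nearly for free and hardcodes ratios for everything else.
fn estimate_ratio(scheme: Scheme, values: &[u64]) -> f64 {
    let all_equal = values.windows(2).all(|w| w[0] == w[1]);
    match scheme {
        Scheme::Constant if all_equal => 0.01, // near-free
        Scheme::Constant => 1.0,               // no benefit
        Scheme::Dict => 0.5,
        Scheme::Sequence => 0.7,
        Scheme::Fsst => 0.9,
    }
}

/// Walk schemes in priority order; stop at the first one whose estimated
/// ratio beats `good_enough`, falling back to the overall best otherwise.
fn pick_scheme(values: &[u64], good_enough: f64) -> Scheme {
    let priorities = [Scheme::Constant, Scheme::Dict, Scheme::Sequence, Scheme::Fsst];
    let mut best = (priorities[0], f64::INFINITY);
    for &scheme in &priorities {
        let ratio = estimate_ratio(scheme, values);
        if ratio <= good_enough {
            return scheme; // early exit: a higher-priority scheme is good enough
        }
        if ratio < best.1 {
            best = (scheme, ratio);
        }
    }
    best.0
}
```

The point of the ordering is that cheap, high-confidence schemes get checked before expensive estimators ever run.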
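The fine-grained lazy stats idea could be sketched as memoized per-statistic computation, so a scheme that only needs min/max never pays for a run count. Again, `LazyStats`, `min_max`, and `run_count` are illustrative names under assumed semantics, not the actual stats API.

```rust
// Hypothetical sketch of fine-grained, lazy stats: each statistic is
// computed only the first time a scheme asks for it, then memoized.

use std::cell::OnceCell;

struct LazyStats<'a> {
    values: &'a [i64],
    min_max: OnceCell<(i64, i64)>,
    run_count: OnceCell<usize>,
}

impl<'a> LazyStats<'a> {
    fn new(values: &'a [i64]) -> Self {
        Self {
            values,
            min_max: OnceCell::new(),
            run_count: OnceCell::new(),
        }
    }

    /// Min/max, computed once on first use (e.g. for a bit-packing scheme).
    fn min_max(&self) -> (i64, i64) {
        *self.min_max.get_or_init(|| {
            let min = self.values.iter().copied().min().unwrap_or(0);
            let max = self.values.iter().copied().max().unwrap_or(0);
            (min, max)
        })
    }

    /// Number of runs, computed once on first use (e.g. for run-length encoding).
    fn run_count(&self) -> usize {
        *self.run_count.get_or_init(|| {
            if self.values.is_empty() {
                0
            } else {
                1 + self.values.windows(2).filter(|w| w[0] != w[1]).count()
            }
        })
    }
}
```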
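The "compress during estimation, then throw it away" problem could be avoided by having the estimate carry its compressed buffer along, so a winning scheme reuses it instead of recompressing. This is a sketch with a toy run-length codec standing in for SequenceScheme; `Estimate`, `estimate_full_compress`, and `compress` are hypothetical.

```rust
// Hypothetical sketch: if estimating a scheme already required compressing
// the whole array, keep that result instead of discarding it and
// recompressing when the scheme is selected.

struct Estimate {
    ratio: f64,
    // Compressed bytes produced as a side effect of estimation,
    // reused directly if this scheme wins.
    compressed: Option<Vec<u8>>,
}

/// Toy "compression": run-length encode the input as (value, count) pairs.
fn estimate_full_compress(values: &[u8]) -> Estimate {
    let mut out = Vec::new();
    let mut iter = values.iter().copied().peekable();
    while let Some(v) = iter.next() {
        let mut count: u8 = 1;
        while iter.peek() == Some(&v) && count < u8::MAX {
            iter.next();
            count += 1;
        }
        out.extend_from_slice(&[v, count]);
    }
    let ratio = if values.is_empty() {
        1.0
    } else {
        out.len() as f64 / values.len() as f64
    };
    Estimate { ratio, compressed: Some(out) }
}

/// Reuse the buffer that estimation already produced.
fn compress(values: &[u8]) -> Vec<u8> {
    let est = estimate_full_compress(values);
    est.compressed.expect("estimation already compressed the array")
}
```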

Unresolved questions

  • Should MAX_CASCADE be replaced with an adaptive search strategy now that the exclusion system bounds the search space?
  • Is there a way we can improve FSST scheme estimation and compression? We currently sample for FSST compression every time, which is super expensive. I tried being smarter with FSST estimation (not sampling every single time and predicting when FSST is good), but that basically only caused regressions (see Potentially better FSST estimation #7611). It is unclear to me whether there is a better way to quickly check if FSST is good for the whole array.
  • Because we sample down to 1024 values, a null-dominated sparse array can yield a sample with 0 non-null values, which makes the sampled compression ratio technically infinite. This is obviously wrong, and it is extra wrong because the sampler will then assume the entire array is all-null, a case we already check for far in advance. It is unclear what the best solution to this is; see Sampling somehow compresses into 0 bytes #7268.
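One conservative fix for the all-null-sample case could be to detect it and report a neutral ratio rather than trusting the degenerate sample. This is only a sketch of that guard; `sampled_ratio` and the fallback value of 1.0 are assumptions, not the sampler's real behavior.

```rust
// Hypothetical sketch: when a fixed-size sample of a null-dominated array
// contains zero non-null values, the naive "compressed bytes / sample
// bytes" ratio degenerates (0 / 0, i.e. "infinite" compression). Detect
// that case and fall back to a neutral ratio instead.

fn sampled_ratio(sample: &[Option<u32>], compressed_bytes: usize) -> f64 {
    let non_null = sample.iter().filter(|v| v.is_some()).count();
    if non_null == 0 {
        // An all-null sample tells us nothing about the non-null values:
        // do NOT report an infinite ratio, and do not conclude the whole
        // array is all-null (that case is already handled much earlier).
        return 1.0;
    }
    let sample_bytes = non_null * std::mem::size_of::<u32>();
    compressed_bytes as f64 / sample_bytes as f64
}
```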

Implementation history

  • Nothing yet.
