Skip to content

AllInvalid arrays have non empty distinct info#7369

Open
robert3005 wants to merge 1 commit intodevelopfrom
rk/fixcompressor
Open

AllInvalid arrays have non empty distinct info#7369
robert3005 wants to merge 1 commit intodevelopfrom
rk/fixcompressor

Conversation

@robert3005
Copy link
Copy Markdown
Contributor

This is required to handle cases where we are attempting to dict compress and
a sample ends up being all null

Signed-off-by: Robert Kruszewski github@robertk.io

Signed-off-by: Robert Kruszewski <github@robertk.io>
@robert3005 robert3005 requested a review from connortsui20 April 9, 2026 16:43
@robert3005 robert3005 enabled auto-merge (squash) April 9, 2026 16:43
@robert3005 robert3005 added the changelog/fix A bug fix label Apr 9, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Apr 9, 2026

Merging this PR will improve performance by 29.94%

⚡ 2 improved benchmarks
✅ 1120 untouched benchmarks
⏩ 1455 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation take_10k_random 251.2 µs 193.3 µs +29.94%
Simulation take_10k_contiguous 315 µs 257 µs +22.56%

Comparing rk/fixcompressor (b485cc0) with develop (ae906c7)

Open in CodSpeed

Footnotes

  1. 1455 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

erased: TypedStats {
distinct: Some(DistinctInfo {
distinct_values: HashSet::with_capacity_and_hasher(0, FxBuildHasher),
distinct_count: 0,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distinct count has to be 0, this is an invariant that other code relies on.

Let me check where

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use it in the constant schemes where if the distinct count is not greater than 1, it must be exactly 1.

I think it is actually fine though since that exact code path wont happen, but I would prefer if we just set distinct count to 1 and then add a big TODO and red flag that this is a hack. Once I finish the vector search stuff I can revisit this and come up with a more robust solution while adding all the things that would be good to have in the compressor (#7216)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why setting any value is meaningful here, the idea is that they will not look at it and we just need to have a present value. This is also internally consistent where all null arrays aren't constant

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its just that I rely on that fact at other callsites, so it might panic there instead. I guess it doesn't really matter either way since both of these fixes are technically incorrect

Copy link
Copy Markdown
Contributor

@connortsui20 connortsui20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will fix this next week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/fix A bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants