fix: canonicalize sliced VarBinArray failure by a10y · Pull Request #5962 · vortex-data/vortex

a10y · 2026-01-14T23:14:15Z

A user report came in with the following stacktrace:

                 at /usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-array-56.2.0/src/builder/generic_bytes_view_builder.rs:176:9
      12: <arrow_array::array::byte_view_array::GenericByteViewArray<V> as core::convert::From<&arrow_array::array::byte_array::GenericByteArray<FROM>>>::from
                 at /usr/local/cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-array-56.2.0/src/array/byte_view_array.rs:913:39
      13: vortex_array::arrays::varbin::vtable::canonical::<impl vortex_array::vtable::canonical::CanonicalVTable<vortex_array::arrays::varbin::vtable::VarBinVTable> for vortex_array::arrays::varbin::vtable::VarBinVTable>::canonicalize
                 at /usr/local/cargo/git/checkouts/vortex-e8ac85adeb77362c/f1b2ae9/vortex-array/src/arrays/varbin/vtable/canonical.rs:38:26
      14: <vortex_array::array::ArrayAdapter<V> as vortex_array::array::Array>::to_canonical
                 at /usr/local/cargo/git/checkouts/vortex-e8ac85adeb77362c/f1b2ae9/vortex-array/src/array/mod.rs:609:25
      15: <A as vortex_array::canonical::ToCanonical>::to_varbinview
                 at /usr/local/cargo/git/checkouts/vortex-e8ac85adeb77362c/f1b2ae9/vortex-array/src/canonical.rs:374:14
      16: <vortex_array::builders::varbinview::VarBinViewBuilder as vortex_array::builders::ArrayBuilder>::extend_from_array_unchecked
                 at /usr/local/cargo/git/checkouts/vortex-e8ac85adeb77362c/f1b2ae9/vortex-array/src/builders/varbinview.rs:278:27
      17: <vortex_array::builders::fixed_size_list::FixedSizeListBuilder as vortex_array::builders::ArrayBuilder>::extend_from_array_unchecked
                 at /usr/local/cargo/git/checkouts/vortex-e8ac85adeb77362c/f1b2ae9/vortex-array/src/builders/fixed_size_list.rs:243:31
      18: vortex_array::stats::array::StatsSetRef::compute_stat
                 at /usr/local/cargo/git/checkouts/vortex-e8ac85adeb77362c/f1b2ae9/vortex-array/src/stats/array.rs:177:29
      19: vortex_array::stats::array::StatsSetRef::compute_all
                 at /usr/local/cargo/git/checkouts/vortex-e8ac85adeb77362c/f1b2ae9/vortex-array/src/stats/array.rs:200:35
      20: <vortex_layout::layouts::compressed::CompressingStrategy as vortex_layout::strategy::LayoutStrategy>::write_stream::{{closure}}::{{closure}}::{{closure}}
                 at /usr/local/cargo/git/checkouts/vortex-e8ac85adeb77362c/f1b2ae9/vortex-layout/src/layouts/compressed.rs:144:26

The failure is inside of arrow-rs, in its conversion from LargeStringArray -> StringViewArray.

Arrow checks that the final offset is < i32::MAX, and then pushes the buffer. This does not work when we convert a large VarBinArray that has been sliced: the final offset of the slice can be far smaller than the length of the underlying buffer, so the offset check passes even though the buffer itself is too large to push.

Upstream arrow-rs should get a fix as well, but in the meantime, this should fix the behavior of compressing large chunks, by rezeroing offsets before we canonicalize VarBin.
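To illustrate what rezeroing means here, the following is a minimal sketch on plain vectors, not the code from this PR. The helper name `rezero` is hypothetical and it works on a bare offsets/bytes pair rather than on a real VarBinArray. It shows that a slice of the offsets still points into the full buffer, and that after rebasing, the final offset equals the length of the trimmed buffer.

```rust
/// Rebase a sliced (offsets, bytes) pair so the offsets start at zero and the
/// byte buffer holds only the bytes the slice references.
/// `rezero` is a hypothetical helper for illustration, not a Vortex API.
fn rezero(offsets: &[u64], bytes: &[u8]) -> (Vec<u64>, Vec<u8>) {
    let (Some(&first), Some(&last)) = (offsets.first(), offsets.last()) else {
        // No offsets at all: return an empty array with a single zero offset.
        return (vec![0], Vec::new());
    };
    let rebased: Vec<u64> = offsets.iter().map(|o| o - first).collect();
    let trimmed: Vec<u8> = bytes[first as usize..last as usize].to_vec();
    (rebased, trimmed)
}

fn main() {
    // Backing buffer for four strings: "aa", "bbb", "c", "dddd".
    let bytes = b"aabbbcdddd".to_vec();
    let offsets: Vec<u64> = vec![0, 2, 5, 6, 10];

    // Slice rows 1..3 ("bbb", "c"): the offsets window is [2, 5, 6], but the
    // byte buffer is still the full 10 bytes.
    let sliced_offsets = &offsets[1..=3];
    assert_eq!(*sliced_offsets.last().unwrap(), 6);
    assert_eq!(bytes.len(), 10);

    let (new_offsets, new_bytes) = rezero(sliced_offsets, &bytes);
    assert_eq!(new_offsets, vec![0, 3, 4]);
    assert_eq!(new_bytes, b"bbbc".to_vec());

    // After rezeroing, the final offset equals the buffer length.
    assert_eq!(*new_offsets.last().unwrap() as usize, new_bytes.len());
}
```

With the buffer trimmed this way, a check on the final offset also bounds the size of the buffer that gets handed to Arrow.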

a10y · 2026-01-14T23:15:27Z

vortex-array/src/arrays/varbin/vtable/canonical.rs

+    fn test_massive() {
+        // Attempt to convert a really large dataset to Arrow.
+        let strings = VarBinArray::from_iter_nonnull(
+            ["1234567890123"].iter().cycle().take(500_000_000),
+            DType::Utf8(Nullability::NonNullable),
+        );
+
+        let sliced = strings.slice(0..5);
+
+        let vbv = sliced.to_varbinview();
+        assert_eq!(vbv.len(), 5);
+    }


i can't check this in because it's too slow, but this was failing before and now it's not (you actually need to update VarBinArray::from_iter_nonnull to use u64 offsets to make it pass)
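For reference, a quick sketch of the arithmetic behind the u64 remark, assuming the point is that the unsliced array's final offset must hold the total byte count. The helper `total_bytes` is hypothetical and only does the multiplication.

```rust
/// Total bytes needed for `count` copies of `value`.
/// Hypothetical helper for illustration only.
fn total_bytes(value: &str, count: u64) -> u64 {
    value.len() as u64 * count
}

fn main() {
    // 13 bytes per string, 500 million strings, as in the test above.
    let total = total_bytes("1234567890123", 500_000_000);
    assert_eq!(total, 6_500_000_000);

    // The final offset equals the total byte length, which does not fit in u32.
    assert!(total > u32::MAX as u64);
    assert!(u32::try_from(total).is_err());
    println!("final offset = {total}");
}
```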

can we not use the slow run on post commit?

codspeed-hq · 2026-01-14T23:23:06Z

CodSpeed Performance Report

Merging this PR will not alter performance

Comparing varbin-arrow-fix (23f6f86) with develop (4bbafe7)

Summary

✅ 1254 untouched benchmarks
⏩ 1254 skipped benchmarks¹

¹ 1254 benchmarks were skipped, so the baseline results were used instead.

codecov · 2026-01-15T22:38:07Z

Codecov Report

❌ Patch coverage is 85.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.88%. Comparing base (4bbafe7) to head (23f6f86).
⚠️ Report is 1 commits behind head on develop.

Files with missing lines	Patch %	Lines
vortex-array/src/arrays/varbin/array.rs	83.33%	3 Missing ⚠️

Signed-off-by: Andrew Duffy <andrew@a10y.dev>

a10y · 2026-01-16T05:10:26Z

vortex-python/test/test_dataset.py

 def test_fragment_to_table(ds: vx.dataset.VortexDataset):
    fragments = list(ds.get_fragments())

-    # The first fragment contains none of the matching records


now that we slice before canonicalizing, we write fewer segments of the string column, which results in fewer splits getting planned, so this is no longer true.

the test is just updated to make sure the sum of the fragment row counts matches the expected row count after the filter

a10y commented Jan 14, 2026

a10y added the changelog/fix A bug fix label Jan 14, 2026

a10y force-pushed the varbin-arrow-fix branch from 9ab1f44 to c5d63c5 Compare January 14, 2026 23:17

a10y force-pushed the varbin-arrow-fix branch from c5d63c5 to ca83938 Compare January 15, 2026 14:36

joseph-isaacs approved these changes Jan 15, 2026

a10y force-pushed the varbin-arrow-fix branch 2 times, most recently from fd32064 to f42a231 Compare January 15, 2026 22:28

a10y enabled auto-merge (squash) January 15, 2026 22:28

fix: canonicalize sliced VarBinArray failure

85b569c

Signed-off-by: Andrew Duffy <andrew@a10y.dev>

a10y force-pushed the varbin-arrow-fix branch from f42a231 to 85b569c Compare January 15, 2026 23:00

fix python tests

23f6f86

Signed-off-by: Andrew Duffy <andrew@a10y.dev>

a10y commented Jan 16, 2026

a10y merged commit c33bc5c into develop Jan 16, 2026
49 of 50 checks passed

a10y deleted the varbin-arrow-fix branch January 16, 2026 05:19

danking pushed a commit that referenced this pull request Feb 6, 2026

fix: canonicalize sliced VarBinArray failure (#5962)

28d1dc5

a10y mentioned this pull request Feb 27, 2026

Eliminate redundant decompressions in VarBinArray canonicalization #6692

Merged
