
pluggable arrow exec#7793

Draft
a10y wants to merge 2 commits into develop from aduffy/pluggable-arrow-exec

Conversation

Contributor

@a10y a10y commented May 5, 2026

Summary

Closes: #000

Testing

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
@a10y a10y added the do not merge Pull requests that are not intended to merge label May 5, 2026

codspeed-hq Bot commented May 5, 2026

Merging this PR will degrade performance by 21.4%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 5 improved benchmarks
❌ 36 regressed benchmarks
✅ 1139 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | chunked_dict_primitive_into_canonical[u32, (1000, 10, 10)] | 117.8 µs | 138.3 µs | -14.83% |
| Simulation | bench_compare_primitive[(10000, 128)] | 103.3 µs | 124.9 µs | -17.29% |
| Simulation | bench_compare_primitive[(10000, 2)] | 102.8 µs | 123.6 µs | -16.79% |
| Simulation | bench_compare_primitive[(10000, 2048)] | 126.3 µs | 148 µs | -14.7% |
| Simulation | bench_compare_primitive[(10000, 32)] | 103.1 µs | 124 µs | -16.88% |
| Simulation | bench_compare_primitive[(10000, 4)] | 103.2 µs | 123.9 µs | -16.69% |
| Simulation | bench_compare_primitive[(10000, 512)] | 111.7 µs | 133.2 µs | -16.15% |
| Simulation | bench_compare_primitive[(10000, 8)] | 103 µs | 123.7 µs | -16.7% |
| Simulation | bench_compare_sliced_dict_primitive[(1000, 10000)] | 76.8 µs | 97.7 µs | -21.4% |
| Simulation | bench_compare_sliced_dict_primitive[(10000, 10000)] | 153 µs | 173.3 µs | -11.69% |
| Simulation | bench_compare_sliced_dict_primitive[(2000, 10000)] | 81.6 µs | 102.3 µs | -20.22% |
| Simulation | bench_compare_sliced_dict_primitive[(2500, 10000)] | 84.4 µs | 105.2 µs | -19.77% |
| Simulation | bench_compare_sliced_dict_primitive[(3333, 10000)] | 88.9 µs | 110.4 µs | -19.48% |
| Simulation | bench_compare_sliced_dict_primitive[(5000, 10000)] | 98.5 µs | 119.2 µs | -17.33% |
| Simulation | bench_compare_sliced_dict_primitive[(7500, 10000)] | 136.4 µs | 156.7 µs | -12.96% |
| Simulation | bench_compare_sliced_dict_primitive[(9999, 10000)] | 152.6 µs | 173.3 µs | -11.94% |
| Simulation | bench_compare_sliced_dict_varbinview[(1000, 10000)] | 107.3 µs | 128.4 µs | -16.41% |
| Simulation | bench_compare_sliced_dict_varbinview[(2000, 10000)] | 133.7 µs | 154.4 µs | -13.42% |
| Simulation | bench_compare_sliced_dict_varbinview[(2500, 10000)] | 147.4 µs | 168.3 µs | -12.44% |
| Simulation | bench_compare_sliced_dict_varbinview[(3333, 10000)] | 170.4 µs | 191.4 µs | -10.98% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing aduffy/pluggable-arrow-exec (fba88e0) with develop (d7c22ba)


Comment thread vortex-array/src/arrow/session.rs Outdated
Comment on lines +81 to +85
canonical_encoder: RwLock<Option<ArrowEncoderRef>>,
/// Fallback decoder used after the user chain has declined.
default_decoder: RwLock<Option<ArrowDecoderRef>>,
/// Fallback dtype reader used after the user chain has declined.
default_dtype_reader: RwLock<Option<ArrowDTypeReaderRef>>,
Contributor

An RwLock still has contention even if all accessors are readers; maybe an ArcSwap? I think we might want a wrapper for this pattern, though it's not needed here!
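For context, a minimal std-only sketch of the pattern under discussion (the types here are hypothetical stand-ins, not the vortex-array API): the read path clones the `Arc` out of the `RwLock` so the guard is held only momentarily; `arc_swap::ArcSwap` would remove even that brief critical section, at the cost of a dependency.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for ArrowEncoderRef.
type EncoderRef = Arc<dyn Fn(i64) -> i64 + Send + Sync>;

// A read-mostly slot: writers replace the Arc, readers snapshot it.
struct Slot {
    inner: RwLock<Option<EncoderRef>>,
}

impl Slot {
    fn new() -> Self {
        Slot { inner: RwLock::new(None) }
    }

    // Replace the registered encoder (rare, write-locked).
    fn store(&self, enc: EncoderRef) {
        *self.inner.write().unwrap() = Some(enc);
    }

    // Clone the Arc under the read lock; the guard drops immediately,
    // and the encoder is invoked outside any critical section.
    fn snapshot(&self) -> Option<EncoderRef> {
        self.inner.read().unwrap().clone()
    }
}

fn main() {
    let slot = Slot::new();
    slot.store(Arc::new(|x: i64| x * 2));
    let enc = slot.snapshot().expect("encoder registered");
    assert_eq!(enc(21), 42);
}
```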

fn to_arrow_array(
&self,
array: ArrayRef,
target: &DataType,
Contributor

I think we want this to be optional, so that if we don't mind the physical type, the conversion can return encoding-specific values.

Suggested change
target: &DataType,
target: Option<&DataType>,

Sorry if I missed a comment regarding this.

Contributor Author

the idea was that you must pass in an explicit target physical type, whether that's derived from the user or by calling preferred_arrow_type() first
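A tiny sketch of the two-step flow described here, using hypothetical stand-in types rather than the real vortex-array trait: the caller either supplies an explicit target `DataType` or derives one by calling `preferred_arrow_type()` first, so `to_arrow_array` itself never has to guess.

```rust
// Hypothetical stand-ins; the real types live in vortex-array / arrow.
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Int32,
    Utf8,
}

struct MyArray {
    preferred: DataType,
}

impl MyArray {
    // First pass: ask the array which physical Arrow type it prefers.
    fn preferred_arrow_type(&self) -> DataType {
        self.preferred.clone()
    }

    // Second pass: convert to an explicit, caller-chosen target type.
    fn to_arrow_array(&self, target: &DataType) -> Result<String, String> {
        match target {
            DataType::Int32 => Ok("int32 array".to_string()),
            DataType::Utf8 => Ok("utf8 array".to_string()),
        }
    }
}

fn main() {
    let arr = MyArray { preferred: DataType::Int32 };
    // With no caller opinion, derive the target from preferred_arrow_type().
    let target = arr.preferred_arrow_type();
    assert_eq!(arr.to_arrow_array(&target).unwrap(), "int32 array");
}
```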

Comment on lines +55 to +62
/// Returning [`Ok(None)`] passes the request to the next reader in the chain.
pub trait ArrowDTypeReader: 'static + Send + Sync + Debug {
/// Try to read a Vortex [`DType`] from an Arrow [`Field`].
///
/// Implementations typically inspect [`Field::metadata`] for the `ARROW:extension:name`
/// key and dispatch on it.
fn try_read_dtype(&self, field: &Field) -> VortexResult<Option<DType>>;
}
Contributor

why have both fields and data_types? I guess nullability?

Contributor Author

yeah, having both is kinda funny. I think if you want to cover extension dtypes then you need the Field, b/c that has the metadata. And also nullability, yeah.
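To illustrate why the `Field` is needed over a bare `DataType`, a sketch with a hypothetical stand-in `Field` (the real one is arrow's): the extension name lives in the field's metadata map under the `ARROW:extension:name` key, and nullability also rides along on the field. Returning `Ok(None)` declines and passes the request to the next reader in the chain.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for arrow's Field: carries the metadata map
// (where ARROW:extension:name lives) plus nullability, neither of
// which a bare DataType has.
struct Field {
    nullable: bool,
    metadata: HashMap<String, String>,
}

// Sketch of a chained dtype reader: Ok(None) means "not mine, ask the
// next reader"; Ok(Some(_)) means this reader recognized the field.
fn try_read_dtype(field: &Field) -> Result<Option<String>, String> {
    match field.metadata.get("ARROW:extension:name") {
        Some(ext) if ext.as_str() == "vortex.example" => {
            // Fold nullability from the Field into the resulting dtype.
            Ok(Some(format!("example(nullable={})", field.nullable)))
        }
        _ => Ok(None), // decline; the chain falls through
    }
}

fn main() {
    let mut md = HashMap::new();
    md.insert("ARROW:extension:name".to_string(), "vortex.example".to_string());
    let field = Field { nullable: true, metadata: md };
    assert_eq!(
        try_read_dtype(&field).unwrap().unwrap(),
        "example(nullable=true)"
    );
}
```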

Comment on lines +38 to +42
fn preferred_arrow_type(
&self,
array: &ArrayRef,
session: &ArrowSession,
) -> VortexResult<Option<DataType>>;
Contributor

two passes might be expensive?

Contributor

we need to make sure this doesn't regress any benchmarks

pub fn register_encoder_for_extension(
&self,
key: impl Into<Id>,
plugin: impl Into<ArrowEncoderRef>,
Contributor

I think we should make these take ArrowEncoderRef directly. I would think we end up using the same encoder for many ids, so we don't want a different instance for each? Using impl Into might lead to that?
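A small sketch of the sharing concern, with hypothetical stand-in types (not the vortex-array registry): registering the same `Arc`'d encoder under several extension ids keeps a single shared instance, rather than materializing a separate one per id.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for ArrowEncoderRef; the payload is a placeholder.
type ArrowEncoderRef = Arc<str>;

struct Registry {
    by_id: HashMap<String, ArrowEncoderRef>,
}

impl Registry {
    fn new() -> Self {
        Registry { by_id: HashMap::new() }
    }

    // Taking the Arc directly lets one encoder instance back many ids;
    // the caller clones the cheap Arc handle, not the encoder itself.
    fn register_encoder_for_extension(&mut self, key: &str, plugin: ArrowEncoderRef) {
        self.by_id.insert(key.to_string(), plugin);
    }
}

fn main() {
    let shared: ArrowEncoderRef = Arc::from("shared-encoder");
    let mut reg = Registry::new();
    reg.register_encoder_for_extension("ext.a", shared.clone());
    reg.register_encoder_for_extension("ext.b", shared.clone());
    // Both ids point at the same instance, not separate copies.
    assert!(Arc::ptr_eq(&reg.by_id["ext.a"], &reg.by_id["ext.b"]));
}
```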

Contributor

palaska commented May 5, 2026

looks like 3 people started doing the same thing in parallel 😄 #7726

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Contributor Author

a10y commented May 5, 2026

It's a hot topic


Labels

do not merge Pull requests that are not intended to merge


3 participants