a10y commented Mar 24, 2025

Spark column vectors return UTF8Strings, which can wrap one of:

  • JVM heap-allocated String
  • JVM heap-allocated byte[]
  • A native pointer + len to native off-heap memory

Previously we've been using the String pathway. This PR changes our Java scans to canonicalize on read (.with_canonicalize(true)) and sends a ptr + len pair back to Java.
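To make the ptr + len pathway concrete, here is a minimal, self-contained sketch of how Java can read UTF-8 bytes that live in native off-heap memory. It is not the code from this PR: the class `OffHeapUtf8Demo` and all of its methods are hypothetical, and `sun.misc.Unsafe` stands in for the native side that would normally own the buffer. In Spark, the `(address, len)` pair would presumably be wrapped directly with `UTF8String.fromAddress(null, address, len)` rather than copied as shown here.

```java
import java.lang.reflect.Field;
import java.nio.charset.StandardCharsets;
import sun.misc.Unsafe;

// Hypothetical demo class; none of these names come from the PR.
final class OffHeapUtf8Demo {
    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            // Unsafe has no public constructor; grab the singleton reflectively.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is unavailable", e);
        }
    }

    /** Stand-in for the native side: copies UTF-8 bytes off-heap, returns the address. */
    static long writeOffHeap(byte[] utf8) {
        // Allocate at least one byte so an empty string still gets a valid address.
        long address = UNSAFE.allocateMemory(Math.max(utf8.length, 1));
        for (int i = 0; i < utf8.length; i++) {
            UNSAFE.putByte(address + i, utf8[i]);
        }
        return address;
    }

    /** Reads len bytes starting at address. The copy is only for demonstration. */
    static byte[] readOffHeap(long address, int len) {
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            out[i] = UNSAFE.getByte(address + i);
        }
        return out;
    }

    /** Releases memory obtained from writeOffHeap. */
    static void free(long address) {
        UNSAFE.freeMemory(address);
    }

    /** Encodes s off-heap, reads it back through (address, len), and decodes it. */
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        long address = writeOffHeap(utf8);
        try {
            return new String(readOffHeap(address, utf8.length), StandardCharsets.UTF_8);
        } finally {
            free(address);
        }
    }
}
```

The String pathway pays for a decode into a heap-allocated `String` on every value, whereas a ptr + len pair lets the consumer point at the bytes in place. The sketch copies the bytes back to the heap only so the result can be checked; it also assumes whoever owns the native buffer keeps it alive for as long as Java holds the address.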

This patch, when ferried down into Iceberg, gives us another 2x speedup on the Citibike scan.

a10y force-pushed the aduffy/col-vec-string branch from 480791d to c3dc857 March 24, 2025 22:06
lwwmanning merged commit e0a78df into aduffy/jni-crate Mar 25, 2025
22 checks passed
lwwmanning deleted the aduffy/col-vec-string branch March 25, 2025 10:07
a10y added a commit that referenced this pull request Mar 25, 2025
Switch from JNA -> JNI. On some microbenchmarks I've seen this result in
a ~3x speedup for simple string-passing.

I wired this into the Iceberg fork for Vortex and am seeing an immediate
~40% speedup on Citibike scan queries

Subsequently #2781 gives us another 2x speedup on Citibike.
joseph-isaacs pushed a commit that referenced this pull request Mar 26, 2025

3 participants