(improvement) Add VectorType support to numpy_parser for 2D array parsing by mykaul · Pull Request #731 · scylladb/python-driver

mykaul · 2026-03-07T10:01:22Z

Summary

Add native VectorType support to the NumPy row parser, producing 2D masked arrays of shape (num_rows, vector_dimension) instead of falling back to object arrays
Enables zero-copy vector data ingestion for ML/AI workloads using the NumPy result path

Details

Changes to cassandra/numpy_parser.pyx:

make_array() detects VectorType columns and creates 2D np.ma.empty((array_size, vector_size), dtype=...) arrays with the correct numeric dtype (float32, float64, int32, int64, int16)
ArrDesc is extended with a mask_stride field so that 2D mask arrays advance by vector_dimension bools per row (not by 1 bool)
unpack_row() uses direct memcpy of the full vector payload (e.g., 3072 bytes for float[768]) into the pre-allocated 2D array buffer
make_native_byteorder() transparently handles the bulk byte-swap for arrays of any dimensionality
Falls back to object arrays for unsupported vector subtypes
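
For readers who want a feel for the flow described above, here is a minimal pure-NumPy sketch of the same steps: pre-allocate a 2D buffer, copy each row's payload into place, mask NULL rows, then convert to native byte order once. It is not the Cython code from this PR. The helper name parse_vector_column, the WIRE_DTYPES table, and the input shape (a list of per-row payloads, with None for NULL) are assumptions made for illustration, and the sketch assumes big-endian values on the wire.

```python
import struct

import numpy as np

# Hypothetical mapping from CQL subtype name to a big-endian wire dtype.
WIRE_DTYPES = {
    "float": ">f4",
    "double": ">f8",
    "int": ">i4",
    "bigint": ">i8",
    "smallint": ">i2",
}


def parse_vector_column(payloads, subtype, vector_size):
    """Build a (num_rows, vector_size) masked array from per-row payloads.

    payloads is a list with one entry per row: raw bytes, or None for NULL.
    """
    wire_dtype = np.dtype(WIRE_DTYPES[subtype])
    num_rows = len(payloads)
    row_bytes = vector_size * wire_dtype.itemsize

    # Pre-allocate the 2D data buffer and a mask of the same shape.
    data = np.empty((num_rows, vector_size), dtype=wire_dtype)
    mask = np.zeros((num_rows, vector_size), dtype=bool)

    # Byte-level view of the same memory, one row of raw bytes per vector.
    raw = data.view(np.uint8).reshape(num_rows, row_bytes)
    for i, payload in enumerate(payloads):
        if payload is None:
            mask[i, :] = True  # the mask advances a whole row per NULL
            raw[i, :] = 0
        else:
            # Copy the full vector payload in one step (the memcpy analogue).
            raw[i, :] = np.frombuffer(payload, dtype=np.uint8)

    # Single bulk conversion to native byte order for the whole 2D array.
    native = data.astype(wire_dtype.newbyteorder("="))
    return np.ma.masked_array(native, mask=mask)


# Usage: two rows of Vector<float, 3>, the second one NULL.
rows = [struct.pack(">3f", 1.0, 2.0, 3.0), None]
result = parse_vector_column(rows, "float", 3)
```

The point of the sketch is that the per-row work is a single block copy, while dtype handling and byte swapping happen once per column rather than once per element.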

Result: For a query returning N rows of Vector<float, 768>, the NumpyParser produces an (N, 768) float32 array directly from wire bytes — the fastest possible path when consuming results as numpy arrays.
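
The payload size quoted above can be checked with a few lines of NumPy. The sketch below also shows how a contiguous run of big-endian float32 payloads can be reinterpreted as an (N, 768) array. The contiguous buffer is an assumption made for illustration: a real result page also carries length prefixes and any other columns.

```python
import numpy as np

dim = 768
wire_dtype = np.dtype(">f4")  # big-endian float32, assumed wire layout

# One Vector<float, 768> payload is dim * 4 bytes.
payload_bytes = dim * wire_dtype.itemsize

# Simulate N back-to-back payloads (illustrative data, not real wire bytes).
n_rows = 4
wire = np.arange(n_rows * dim, dtype=wire_dtype).tobytes()

# Reinterpret the bytes as a 2D array; frombuffer and reshape do not copy.
arr = np.frombuffer(wire, dtype=wire_dtype).reshape(n_rows, dim)

# A native float32 copy, if downstream code needs native byte order.
native = arr.astype(np.float32)
```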

Tests: Comprehensive unit tests in tests/unit/test_numpy_parser.py covering all supported numeric subtypes, NULL handling, mask strides, and unsupported type fallback.

This commit is fully independent — it only modifies numpy_parser.pyx and adds a new test file.

Extend NumpyParser to handle VectorType columns by creating 2D NumPy arrays (rows × vector_dimension) instead of object arrays. This enables zero-copy parsing for vector embeddings in ML/AI workloads.

Features:
- Detects VectorType via vector_size and subtype attributes
- Creates 2D masked arrays for numeric vector subtypes (float, double, int32, int64, int16)
- Falls back to object arrays for unsupported vector subtypes
- Handles endianness conversion for both 1D and 2D arrays
- Pre-allocates result arrays for efficiency

Supported vector types:
- Vector<float> → 2D float32 array
- Vector<double> → 2D float64 array
- Vector<int> → 2D int32 array
- Vector<bigint> → 2D int64 array
- Vector<smallint> → 2D int16 array

Adds comprehensive test coverage for all supported vector types, mixed column queries, and large vector dimensions (384-element embeddings).

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
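
As a rough illustration of the subtype mapping and the object-array fallback listed in the commit message, the allocation decision might look like the sketch below. The SUPPORTED table and the allocate_column helper are hypothetical names, not the identifiers used in numpy_parser.pyx.

```python
import numpy as np

# Hypothetical lookup: CQL vector subtype name -> NumPy element type.
SUPPORTED = {
    "float": np.float32,
    "double": np.float64,
    "int": np.int32,
    "bigint": np.int64,
    "smallint": np.int16,
}


def allocate_column(num_rows, subtype_name, vector_size):
    """Return a pre-allocated masked array for one vector column."""
    np_type = SUPPORTED.get(subtype_name)
    if np_type is None:
        # Unsupported subtype: fall back to a 1D object array.
        return np.ma.empty((num_rows,), dtype=object)
    # Supported numeric subtype: one row per result row, one column per element.
    return np.ma.empty((num_rows, vector_size), dtype=np_type)


embeddings = allocate_column(10, "float", 384)  # shape (10, 384), float32
fallback = allocate_column(10, "text", 384)     # shape (10,), object
```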

Copilot

Pull request overview

This PR adds native VectorType handling to the Cython NumPy row parser so vector columns can be parsed into 2D NumPy (masked) arrays instead of falling back to object arrays, improving performance for vector/embedding workloads.

Changes:

Extend cassandra/numpy_parser.pyx to allocate 2D arrays for VectorType columns and correctly advance/mark masks for 2D shapes.
Add unit tests in tests/unit/test_numpy_parser.py covering several numeric vector subtypes and mixed scalar+vector results.