(improvement) Add VectorType support to numpy_parser for 2D array parsing #731

Draft
mykaul wants to merge 2 commits into scylladb:master from mykaul:vector-numpy-parser-2d

Conversation

@mykaul mykaul commented Mar 7, 2026

Summary

  • Add native VectorType support to the NumPy row parser, producing 2D masked arrays of shape (num_rows, vector_dimension) instead of falling back to object arrays
  • Enables zero-copy vector data ingestion for ML/AI workloads using the NumPy result path

Details

Changes to cassandra/numpy_parser.pyx:

  • make_array() detects VectorType columns and creates 2D np.ma.empty((array_size, vector_size), dtype=...) arrays with the correct numeric dtype (float32, float64, int32, int64, int16)
  • ArrDesc extended with mask_stride field to handle 2D mask arrays where stride = vector_dimension bools (not 1 bool)
  • unpack_row() uses direct memcpy of the full vector payload (e.g., 3072 bytes for float[768]) into the pre-allocated 2D array buffer
  • make_native_byteorder() handles bulk byte-swap on any-dimensional arrays transparently
  • Falls back to object arrays for unsupported vector subtypes
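The allocation and per-row copy described in the list above can be approximated in pure Python. This is a minimal sketch, not the Cython code from this PR: `parse_float_vectors` is a hypothetical helper, the input is assumed to be one big-endian payload per row (or None for NULL), and the data and mask buffers are built separately rather than through np.ma.empty.

```python
import struct
import numpy as np

def parse_float_vectors(payloads, vector_size):
    """Hypothetical helper: build a (num_rows, vector_size) masked float32 array.

    payloads is assumed to be a list of big-endian byte strings, or None for NULL.
    """
    num_rows = len(payloads)
    data = np.zeros((num_rows, vector_size), dtype=np.float32)
    mask = np.zeros((num_rows, vector_size), dtype=bool)
    expected_len = vector_size * data.dtype.itemsize

    for i, payload in enumerate(payloads):
        if payload is None:
            # A NULL vector masks the whole row: vector_size bools, not one.
            mask[i, :] = True
            continue
        if len(payload) != expected_len:
            # Size guard, analogous in spirit to the check before memcpy.
            raise ValueError(f"row {i}: expected {expected_len} bytes, got {len(payload)}")
        # Interpret the wire bytes as big-endian floats; assignment converts to native order.
        data[i, :] = np.frombuffer(payload, dtype=">f4")

    return np.ma.masked_array(data, mask=mask)

rows = [struct.pack(">3f", 1.0, 2.0, 3.0), None, struct.pack(">3f", 4.0, 5.0, 6.0)]
arr = parse_float_vectors(rows, 3)
```

For a C-contiguous boolean mask of shape (num_rows, vector_size), the row stride is vector_size bytes, which is the quantity the mask_stride field is described as tracking.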

Result: For a query returning N rows of Vector<float, 768>, the NumpyParser produces an (N, 768) float32 array directly from wire bytes — the fastest possible path when consuming results as numpy arrays.
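The 3072-byte figure follows from 768 elements × 4 bytes per float32. The sketch below checks that arithmetic and shows one way to convert a big-endian 2D array to native byte order in bulk. `payload_nbytes` and `to_native_byteorder` are illustrative names, not functions from this PR, and the conversion via astype is an assumption about how such a step could look, not a copy of make_native_byteorder().

```python
import numpy as np

def payload_nbytes(dtype, vector_size):
    """Bytes occupied by one vector of `vector_size` elements of `dtype`."""
    return np.dtype(dtype).itemsize * vector_size

def to_native_byteorder(arr):
    """Return `arr` in native byte order; works for any number of dimensions."""
    if arr.dtype.isnative:
        return arr
    # astype() converts the values; newbyteorder('=') only names the target dtype.
    return arr.astype(arr.dtype.newbyteorder("="))

# Big-endian data as it would arrive on the wire, viewed as a 2D array.
wire = np.arange(6, dtype=">f4").reshape(2, 3)
native = to_native_byteorder(wire)
```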

Tests: Comprehensive unit tests in tests/unit/test_numpy_parser.py covering all supported numeric subtypes, NULL handling, mask strides, and unsupported type fallback.

This commit is fully independent — it only modifies numpy_parser.pyx and adds a new test file.

Extend NumpyParser to handle VectorType columns by creating 2D NumPy
arrays (rows × vector_dimension) instead of object arrays. This enables
zero-copy parsing for vector embeddings in ML/AI workloads.

Features:
- Detects VectorType via vector_size and subtype attributes
- Creates 2D masked arrays for numeric vector subtypes (float, double,
  int32, int64, int16)
- Falls back to object arrays for unsupported vector subtypes
- Handles endianness conversion for both 1D and 2D arrays
- Pre-allocates result arrays for efficiency

Supported vector types:
- Vector<float> → 2D float32 array
- Vector<double> → 2D float64 array
- Vector<int> → 2D int32 array
- Vector<bigint> → 2D int64 array
- Vector<smallint> → 2D int16 array
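The mapping above, together with the object-array fallback, can be sketched as a lookup table. The dictionary and function names are hypothetical and keyed by CQL type name for illustration only; the actual parser is described as inspecting the subtype attribute of the VectorType class.

```python
import numpy as np

# Hypothetical lookup table mirroring the list of supported vector subtypes.
VECTOR_SUBTYPE_TO_DTYPE = {
    "float": np.dtype("float32"),
    "double": np.dtype("float64"),
    "int": np.dtype("int32"),
    "bigint": np.dtype("int64"),
    "smallint": np.dtype("int16"),
}

def vector_array_shape_and_dtype(subtype_name, num_rows, vector_size):
    """Return (shape, dtype) for the result array of one vector column."""
    dtype = VECTOR_SUBTYPE_TO_DTYPE.get(subtype_name)
    if dtype is None:
        # Unsupported subtype: fall back to a 1D object array, one entry per row.
        return (num_rows,), np.dtype(object)
    return (num_rows, vector_size), dtype
```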

Adds comprehensive test coverage for all supported vector types,
mixed column queries, and large vector dimensions (384-element embeddings).

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Copilot AI left a comment

Pull request overview

This PR adds native VectorType handling to the Cython NumPy row parser so vector columns can be parsed into 2D NumPy (masked) arrays instead of falling back to object arrays, improving performance for vector/embedding workloads.

Changes:

  • Extend cassandra/numpy_parser.pyx to allocate 2D arrays for VectorType columns and correctly advance/mark masks for 2D shapes.
  • Add unit tests in tests/unit/test_numpy_parser.py covering several numeric vector subtypes and mixed scalar+vector results.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.

File | Description
cassandra/numpy_parser.pyx | Adds VectorType 2D array allocation + 2D mask stride handling; updates NULL-mask write logic.
tests/unit/test_numpy_parser.py | Introduces unit tests for vector parsing into 2D NumPy arrays.

@@ -0,0 +1,305 @@
# Copyright DataStax, Inc.
Comment on lines +32 to +35
@unittest.skipUnless(HAVE_NUMPY, "NumPy not available")
class TestNumpyParserVectorType(unittest.TestCase):
"""Tests for VectorType support in NumpyParser"""

Comment on lines +97 to +98
expected = np.array(vectors, dtype='<f4') # little-endian after conversion
np.testing.assert_array_almost_equal(arr, expected)
Comment on lines +130 to +131
expected = np.array(vectors, dtype='<f8')
np.testing.assert_array_almost_equal(arr, expected)
Comment on lines +195 to +196
expected = np.array(vectors, dtype='<i8')
np.testing.assert_array_equal(arr, expected)

import struct
import unittest
from unittest.mock import Mock
arr = result['features']
self.assertEqual(arr.shape, (2, 128))

expected = np.array(vectors, dtype='<i4')
arr = result['small_vec']
self.assertEqual(arr.shape, (2, 8))

expected = np.array(vectors, dtype='<i2')
Comment on lines +266 to +271
np.testing.assert_array_equal(result['id'], np.array([1, 2], dtype='<i4'))

# Verify vec column (2D array)
self.assertEqual(result['vec'].shape, (2, 3))
expected_vecs = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype='<f4')
np.testing.assert_array_almost_equal(result['vec'], expected_vecs)
Comment on lines +300 to +301
expected = np.array(vectors, dtype='<f4')
np.testing.assert_array_almost_equal(arr, expected)
- Add buffer size guard before memcpy in unpack_row() to prevent overflow
- Remove dead mask_true constant and unused uint8_t cimport
- Fix copyright header (DataStax -> ScyllaDB)
- Replace hard-coded little-endian dtypes with native numpy dtypes
- Remove unused Mock import
- Add test for NULL vector mask handling
- Add test for unsupported subtype fallback to object array
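Two of the items above, native dtypes in expectations and NULL mask checks, can be illustrated with a small sketch. It does not reproduce the tests in tests/unit/test_numpy_parser.py: `check_parsed` is a hypothetical helper, and the parsed array is constructed by hand instead of coming from NumpyParser.

```python
import sys
import numpy as np

def check_parsed(parsed, vectors, null_rows):
    """Hypothetical assertion helper for a parsed int32 vector column."""
    # Native dtype: the same expectation holds on little- and big-endian hosts.
    expected = np.array(vectors, dtype=np.int32)
    assert parsed.dtype == expected.dtype
    mask = np.ma.getmaskarray(parsed)
    for i in range(len(vectors)):
        if i in null_rows:
            # A NULL vector must mask every element of its row.
            assert mask[i].all()
        else:
            assert not mask[i].any()
            np.testing.assert_array_equal(parsed.data[i], expected[i])
    return True

vectors = [[1, 2, 3], [0, 0, 0]]
parsed = np.ma.masked_array(
    np.array(vectors, dtype=np.int32),
    mask=[[False, False, False], [True, True, True]],
)
ok = check_parsed(parsed, vectors, null_rows={1})

# A hard-coded '<i4' matches the native int32 dtype only on little-endian hosts.
little_matches_native = np.dtype("<i4") == np.dtype(np.int32)
```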