(improvement) Add VectorType support to numpy_parser for 2D array parsing#731
Draft
mykaul wants to merge 2 commits into scylladb:master
Extend NumpyParser to handle VectorType columns by creating 2D NumPy arrays (rows × vector_dimension) instead of object arrays. This enables zero-copy parsing for vector embeddings in ML/AI workloads.

Features:
- Detects VectorType via vector_size and subtype attributes
- Creates 2D masked arrays for numeric vector subtypes (float, double, int32, int64, int16)
- Falls back to object arrays for unsupported vector subtypes
- Handles endianness conversion for both 1D and 2D arrays
- Pre-allocates result arrays for efficiency

Supported vector types:
- Vector<float> → 2D float32 array
- Vector<double> → 2D float64 array
- Vector<int> → 2D int32 array
- Vector<bigint> → 2D int64 array
- Vector<smallint> → 2D int16 array

Adds comprehensive test coverage for all supported vector types, mixed column queries, and large vector dimensions (384-element embeddings).

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
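To make the 2D layout and the endianness step concrete, here is a minimal pure-NumPy sketch. It is not the Cython code from this PR: the helper name `vectors_from_wire` and the use of `astype` for the byte-order conversion are assumptions made for illustration. It relies only on the fact, reflected in the tests, that vector elements arrive big-endian on the wire.

```python
import struct

import numpy as np


def vectors_from_wire(payloads, dim, wire_dtype=">f4"):
    """Build an (n_rows, dim) array from per-row big-endian vector payloads.

    Hypothetical helper for illustration; the real parser copies bytes
    in Cython rather than going through np.frombuffer.
    """
    wire = np.dtype(wire_dtype)
    # Pre-allocate the whole result once, in wire (big-endian) byte order.
    out = np.empty((len(payloads), dim), dtype=wire)
    for i, payload in enumerate(payloads):
        # Each payload holds exactly `dim` fixed-width elements.
        out[i] = np.frombuffer(payload, dtype=wire, count=dim)
    # One bulk conversion to the machine's native byte order at the end.
    return out.astype(wire.newbyteorder("="))


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
payloads = [struct.pack(">3f", *row) for row in rows]
arr = vectors_from_wire(payloads, dim=3)
```

Pre-allocating once and converting byte order in a single pass mirrors the "pre-allocates result arrays" and "handles endianness conversion" points above, under the stated assumptions.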
Pull request overview
This PR adds native VectorType handling to the Cython NumPy row parser so vector columns can be parsed into 2D NumPy (masked) arrays instead of falling back to object arrays, improving performance for vector/embedding workloads.
Changes:
- Extend `cassandra/numpy_parser.pyx` to allocate 2D arrays for `VectorType` columns and correctly advance/mark masks for 2D shapes.
- Add unit tests in `tests/unit/test_numpy_parser.py` covering several numeric vector subtypes and mixed scalar+vector results.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| cassandra/numpy_parser.pyx | Adds VectorType 2D array allocation + 2D mask stride handling; updates NULL-mask write logic. |
| tests/unit/test_numpy_parser.py | Introduces unit tests for vector parsing into 2D NumPy arrays. |
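The allocation decision described in the table can be sketched in plain Python as follows. The lookup table `VECTOR_SUBTYPE_DTYPES`, the function `allocate_vector_column`, and keying on CQL type names are all hypothetical; the PR itself inspects the `vector_size` and `subtype` attributes of the column type, and whether its fallback object array is masked is not shown here.

```python
import numpy as np

# Hypothetical mapping from CQL subtype name to NumPy dtype, following the
# supported-types list in the PR description. Native dtypes are used here;
# the real parser would first work with wire-order (big-endian) data.
VECTOR_SUBTYPE_DTYPES = {
    "float": np.float32,
    "double": np.float64,
    "int": np.int32,
    "bigint": np.int64,
    "smallint": np.int16,
}


def allocate_vector_column(subtype_name, n_rows, vector_size):
    """Return a 2D masked array for numeric subtypes, else a 1D object array."""
    dtype = VECTOR_SUBTYPE_DTYPES.get(subtype_name)
    if dtype is None:
        # Unsupported subtype: one Python object per row (assumed fallback shape).
        return np.empty(n_rows, dtype=object)
    # Supported subtype: one row per result row, one column per vector element.
    return np.ma.empty((n_rows, vector_size), dtype=dtype)


embeddings = allocate_vector_column("float", n_rows=4, vector_size=384)
fallback = allocate_vector_column("text", n_rows=4, vector_size=384)
```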
tests/unit/test_numpy_parser.py
Outdated
| @@ -0,0 +1,305 @@ | |||
| # Copyright DataStax, Inc. | |||
Comment on lines +32 to +35
| @unittest.skipUnless(HAVE_NUMPY, "NumPy not available") | ||
| class TestNumpyParserVectorType(unittest.TestCase): | ||
| """Tests for VectorType support in NumpyParser""" | ||
|
|
tests/unit/test_numpy_parser.py
Outdated
Comment on lines +97 to +98
| expected = np.array(vectors, dtype='<f4') # little-endian after conversion | ||
| np.testing.assert_array_almost_equal(arr, expected) |
tests/unit/test_numpy_parser.py
Outdated
Comment on lines +130 to +131
| expected = np.array(vectors, dtype='<f8') | ||
| np.testing.assert_array_almost_equal(arr, expected) |
tests/unit/test_numpy_parser.py
Outdated
Comment on lines +195 to +196
| expected = np.array(vectors, dtype='<i8') | ||
| np.testing.assert_array_equal(arr, expected) |
tests/unit/test_numpy_parser.py
Outdated
|
|
||
| import struct | ||
| import unittest | ||
| from unittest.mock import Mock |
tests/unit/test_numpy_parser.py
Outdated
| arr = result['features'] | ||
| self.assertEqual(arr.shape, (2, 128)) | ||
|
|
||
| expected = np.array(vectors, dtype='<i4') |
tests/unit/test_numpy_parser.py
Outdated
| arr = result['small_vec'] | ||
| self.assertEqual(arr.shape, (2, 8)) | ||
|
|
||
| expected = np.array(vectors, dtype='<i2') |
tests/unit/test_numpy_parser.py
Outdated
Comment on lines +266 to +271
| np.testing.assert_array_equal(result['id'], np.array([1, 2], dtype='<i4')) | ||
|
|
||
| # Verify vec column (2D array) | ||
| self.assertEqual(result['vec'].shape, (2, 3)) | ||
| expected_vecs = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype='<f4') | ||
| np.testing.assert_array_almost_equal(result['vec'], expected_vecs) |
tests/unit/test_numpy_parser.py
Outdated
Comment on lines +300 to +301
| expected = np.array(vectors, dtype='<f4') | ||
| np.testing.assert_array_almost_equal(arr, expected) |
- Add buffer size guard before memcpy in unpack_row() to prevent overflow
- Remove dead mask_true constant and unused uint8_t cimport
- Fix copyright header (DataStax -> ScyllaDB)
- Replace hard-coded little-endian dtypes with native numpy dtypes
- Remove unused Mock import
- Add test for NULL vector mask handling
- Add test for unsupported subtype fallback to object array
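
The idea behind a size guard before a raw copy can be illustrated in Python. This is a sketch under assumptions, not the guard added in `unpack_row()`: the function name `copy_vector_row` and the strict equality check are hypothetical, and the actual Cython condition may differ.

```python
import struct

import numpy as np


def copy_vector_row(dest, row, payload):
    """Copy one vector payload into dest[row], rejecting size mismatches.

    Hypothetical Python analogue of guarding a memcpy: the payload must
    contain exactly one row's worth of bytes before anything is written.
    """
    expected = dest.shape[1] * dest.dtype.itemsize
    if len(payload) != expected:
        raise ValueError(
            f"vector payload is {len(payload)} bytes, expected {expected}"
        )
    dest[row] = np.frombuffer(payload, dtype=dest.dtype)


dest = np.zeros((2, 3), dtype=">f4")
copy_vector_row(dest, 0, struct.pack(">3f", 1.0, 2.0, 3.0))

try:
    # Only two floats: too short for a 3-element row, so nothing is copied.
    copy_vector_row(dest, 1, struct.pack(">2f", 4.0, 5.0))
    rejected = False
except ValueError:
    rejected = True
```

Checking the length before writing means a malformed payload leaves the destination row untouched instead of reading or writing past a buffer boundary.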
Summary

- Parse VectorType columns into 2D NumPy arrays of shape `(num_rows, vector_dimension)` instead of falling back to object arrays

Details

Changes to `cassandra/numpy_parser.pyx`:

- `make_array()` detects VectorType columns and creates 2D `np.ma.empty((array_size, vector_size), dtype=...)` arrays with the correct numeric dtype (float32, float64, int32, int64, int16)
- `ArrDesc` is extended with a `mask_stride` field to handle 2D mask arrays, where the stride is `vector_dimension` bools (not 1 bool)
- `unpack_row()` uses a direct `memcpy` of the full vector payload (e.g., 3072 bytes for float[768]) into the pre-allocated 2D array buffer
- `make_native_byteorder()` handles bulk byte-swap on any-dimensional arrays transparently

Result: For a query returning N rows of `Vector<float, 768>`, the NumpyParser produces an `(N, 768)` float32 array directly from wire bytes, the fastest possible path when consuming results as numpy arrays.

Tests: Comprehensive unit tests in `tests/unit/test_numpy_parser.py` covering all supported numeric subtypes, NULL handling, mask strides, and unsupported type fallback.

This commit is fully independent: it only modifies `numpy_parser.pyx` and adds a new test file.
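
The mask-stride point can be seen from the NumPy side with a small sketch. The helper `fill_rows` is hypothetical and works on Python lists rather than wire bytes; it only illustrates that a NULL vector masks a whole row, so the mask advances by `vector_dimension` bools per row.

```python
import numpy as np


def fill_rows(rows, dim, dtype=np.float32):
    """Fill an (n_rows, dim) masked array; a None row is treated as NULL."""
    out = np.ma.masked_array(
        np.zeros((len(rows), dim), dtype=dtype),
        mask=np.zeros((len(rows), dim), dtype=bool),
    )
    for i, row in enumerate(rows):
        if row is None:
            # A NULL vector masks every element of the row, not one flag.
            out[i] = np.ma.masked
        else:
            out[i] = row
    return out


arr = fill_rows([[1.0, 2.0, 3.0], None], dim=3)
# For a C-contiguous bool mask, the step between rows is `dim` bytes.
row_step_in_bools = arr.mask.strides[0]
```

For a 1D scalar column the mask advances one bool per row; for a vector column it advances `dim` bools, which is what a per-column stride field has to account for.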