This update to the BitMagic library adds the sparse_vector_float container, designed to store both sparse and dense floating-point data in a succinct format that reduces memory overhead, with additional compression when serialized to disk. Alongside this new container, the update adds support for serialization and deserialization of this new container, as well as optimized search capabilities integrated directly into the existing sparse_vector_scanner class. Some use cases for a sparse_vector_float include storing geospatial data such as elevations or temperatures, machine learning, sparse graphs, and storing time-series such as financial data.
New Classes:
In this update the following classes have been added
- sparse_vector_float
- sparse_vector_float_serial_layout
- sparse_vector_float_serializer
- sparse_vector_float_deserializer
These are wrapper classes for the sparse vector class type, and allow for the storage of 32-bit floats.
As floats do not have an unsigned version, they cannot be stored in a normal sparse vector. The sparse_vector_float series of classes allows for the storage of these floats by splitting received floats into their sign, exponent, and mantissa. The exponent and mantissa are stored in two sparse vectors, the type of which is given by the user, and the sign is stored in a bit vector of the same type as the given sparse vectors.
This allows for the storage of floats in a single sparse vector object, and instead of the user having to split a float and store it in 3 separate containers themselves, they can instead store it in a single sparse_vector_float object and have operations on that sparse_vector_float affect all 3 inner containers simultaneously.
The sparse_vector_float_serial_layout, sparse_vector_float_serializer, and the sparse_vector_float_deserializer are used for serialization and deserialization of sparse_vector_float objects, and are largely used in the same way as the sparse vector version of the classes.
New Functions:
The following functions have been added:
sparse_vector_float_serialize()
sparse_vector_float_deserialize()
These functions are the same as their sparse vector counterparts but use the sparse_vector_float serializer/deserializer’s
New Methods:
The following methods have been added:
- sparse_vector_scanner::find_gt_float
- sparse_vector_scanner::find_lt_float
- sparse_vector_scanner::find_ge_float
- sparse_vector_scanner::find_le_float
- sparse_vector_scanner::find_range_float
These methods allow for a scanner to find the greater than, less than, greater than or equal to, less than or equal to, and a specific range of values in a sparse_vector_float object, allowing easier parsing of sparse_vector_floats.
New Samples:
The following samples have been created on how to use the sparse_vector_float class
- svfsample01 - Basic operations(import, set, get, optimize, etc.)
- svfsample02 - Serialization and Const iterator operations
- svfsample03 - Range methods(clear, compare, equal, join, merge)
- svfsample04 - Extracting from a sparse_vector_float(decode, extract, gather, extract_range)
Comparing a std::vector to sparse_vector_float objects:
Memory Size Comparison between
Calculated using Bitmagic’s calc_stat::memory_used
- std::vector
- SVF - sparse_vector_float using a sparse_vector
- With Linear data going from -N/2 * .00123 to N/2 * .00123
- With Random data going from -1,000,000 to 1,000,000
- RSC SVF - sparse_vector_float using an rsc_sparse_vector with 35% NULL’s
- With Linear data going from -N/2 * .00123 to N/2 * .00123
- With Random data going from -1,000,000 to 1,000,000
| Container | 100,000 Elements(bytes) | 1,000,000 Elements(bytes) | 10,000,000 Elements(bytes) |
|---|---|---|---|
| std::vector | 400,024 | 4,000,024 | 40,000,024 |
| SVF Linear | 338,176 | 1,819,648 | 12,703,488 |
| SVF Random | 539,232 | 3,870,464 | 36,430,080 |
| RSC SVF Linear | 266,256 | 1,558,288 | 11,536,280 |
| RSC SVF Random | 351,344 | 2,765,584 | 26,786,576 |
Serialized Memory Size Comparison
| Container | 100,000 Elements(bytes) | 1,000,000 Elements(bytes) | 10,000,000 Elements(bytes) |
|---|---|---|---|
| std::vector | 400,024 | 4,000,024 | 40,000,024 |
| SVF Linear | 191,310 | 956,513 | 6,890,859 |
| SVF Random | 375,352 | 3,470,964 | 34,693,378 |
| RSC SVF Linear | 149,633 | 1,166,626 | 9,122,900 |
| RSC SVF Random | 254,964 | 2,546,823 | 25,480,286 |
Search Comparison between Data structures with 10,000,000 elements
- Linear Data
- Data in Data Structures
- 10,000,000 linear elements
- Range of -6150 to 6150 or ( -N/2 * .00123 to N/2 * .00123)
- Search Info
- 1000 searches performed on 1000 ranges
- Ranges are 2 random numbers from -6150 to 6150
- Legend
- Linearly search on an std::vector
- sparse_vector_scanner on a sparse_vector_float using a sparse_vector
- Using the Const Iterator of a sparse_vector_float using a sparse_vector
- sparse_vector_scanner on a sparse_vector_float using an rsc_sparse_vector
- Using the Const Iterator of a sparse_vector_float using an rsc_sparse_vector
- Data in Data Structures
- Random data
- Data in Data structures
- 10,000,000 random numbers
- Range of -1,000,000 to 1,000,000
- Search Info
- 1000 searches performed on 1000 ranges
- Ranges are 2 random numbers from -1,000,000 to 1,000,000
- Legend
- Same as Linear Data
- Data in Data structures
| Query Method / Dataset | Linear Data(sec) | Random Data(sec) |
|---|---|---|
| std::vector | 18.6000 s | 41.1800 s |
| sparse_vector_scanner | 0.8433 s | 5.1880 s |
| Const Iterator | 112.8000 s | 239.2800 s |
| std::vector(35% NaN) | 24.0400 s | 32.9500 s |
| RSC sparse_vector_scanner(35% Null) | 106.6200 s | 230.5200 s |
| RSC Const Iterator(35% Null) | 172.3200 s | 268.3200 s |
Fixes:
General:
- Fixed the use of chrono_taker<> in the perf.cpp and t.cpp as it was used without the <>, causing errors.
- Fixed compilation warnings in bmaggregator.h, bmblocks.h, bmfunc.h, bmserial.h, perf.cpp, and t.cpp
- Updated CMakeLists.txt to not use -march=native by default for non-x86 builds, and to instead use -O2
sparse_vector_scanner
- Fixed an error where in the find_gt_horizontal() method a duplicate of a given bvector was created, but if the given bvector was read only it would copy the read only, causing an error. Changed how it was copying the bvector to using bit_or()
t.cpp:
- Fixed an error in the TestCompressSparseVector() test where in the Compressed load() stress test cmp 3 was testing csv2 instead of csv1
- Fixed an error where in DetailedCheckCompressedDecode() where an assert would fail if given an empty rsc_sparse_vector, despite some tests being an empty rsc_sparse_vector