Open
Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
The existing pd.arrays.BooleanArray serves a good purpose in allowing True/False with missing values, but the current implementation is horribly inefficient. It stores one byte per value plus one byte per mask entry, so compared with a plain NumPy boolean array it uses twice as much memory. Compared to PyArrow, which bit-packs both the values and the validity mask, the memory usage is 8x as much, and computational algorithms can be up to 64x slower.
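To make the memory ratios concrete, here is a back-of-the-envelope sketch. The helper names are hypothetical and exist only for this illustration. It assumes one byte per value and one byte per mask entry for the current layout, and one bit each for a bit-packed layout, ignoring any buffer padding or alignment an Arrow implementation may add.

```python
def bytemask_nbytes(n: int) -> int:
    # Current layout (assumed): one byte per value + one byte per mask entry.
    return 2 * n


def numpy_bool_nbytes(n: int) -> int:
    # Plain NumPy bool array: one byte per value, no mask.
    return n


def bitpacked_nbytes(n: int) -> int:
    # Bit-packed layout: a values bitmap plus a validity bitmap,
    # each ceil(n / 8) bytes. Padding/alignment is ignored here.
    return 2 * ((n + 7) // 8)


n = 1_000_000
print(bytemask_nbytes(n) / numpy_bool_nbytes(n))  # 2.0
print(bytemask_nbytes(n) / bitpacked_nbytes(n))   # 8.0
```

The 64x figure is not derived here; one plausible reading is that a single 64-bit word operation covers 64 bit-packed elements, versus one element at a time in a naive byte-wise loop.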
Feature Description
The pd.arrays.BooleanArray could use nanoarrow behind the scenes for its implementation, rather than the existing NumPy approach.
I think the main technical challenges for this would be:
- Build system integration. nanoarrow is already available in the Meson WrapDB, and progress is underway with nanobind; it is probably worth waiting for the latter, but once that is complete this is less of a concern
- 2D support, if ever needed. You could try to simulate 2D indexing operations with a bitmask, but something like transposition (which is trivial with a bytemask) is a concept that does not translate well when moving from bytes to bits. I don't know that this is a huge issue since the existing BooleanArray does not support 2D, but @jbrockmendel probably knows best on any plans for that
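To illustrate the 2D concern: with a bytemask, a transpose can be a stride swap over the same buffer, since every element sits at its own byte address. With a bitmask the elements of a column are not byte-aligned, so a transpose has to move individual bits into a new buffer. The sketch below is pure Python with hypothetical helper names, assumes row-major storage and least-significant-bit-first ordering (the order Arrow uses for bitmaps), and is not how nanoarrow or pandas would implement it.

```python
def pack_bits(flags):
    """Pack a flat list of booleans into bytes, least-significant bit first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def get_bit(buf, i):
    """Read bit i from a packed buffer."""
    return bool((buf[i // 8] >> (i % 8)) & 1)


def transpose_bitmask(buf, nrows, ncols):
    """Transpose a row-major (nrows x ncols) bitmask by repacking every bit."""
    out = bytearray((nrows * ncols + 7) // 8)
    for r in range(nrows):
        for c in range(ncols):
            if get_bit(buf, r * ncols + c):
                dst = c * nrows + r  # position in the (ncols x nrows) result
                out[dst // 8] |= 1 << (dst % 8)
    return bytes(out)


mask = [[True, False, False],
        [True, True, False]]
packed = pack_bits([v for row in mask for v in row])
transposed = transpose_bitmask(packed, nrows=2, ncols=3)

# Unpack the result back into a 3 x 2 nested list to check it.
result = [[get_bit(transposed, c * 2 + r) for r in range(2)] for c in range(3)]
print(result)  # [[True, True], [False, True], [False, False]]
```

Every bit is touched and rewritten, whereas the bytemask equivalent would not need to copy any data at all.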
Alternative Solutions
status quo
Additional Context
No response