ENH: Back pd.BooleanArray with nanoarrow #59115

Open
@WillAyd

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The existing pd.arrays.BooleanArray serves a good purpose in allowing True/False with missing values, but the current implementation is horribly inefficient. It stores one byte per value plus one byte per mask entry, so coming from the historical NumPy perspective it uses twice as much memory as a plain boolean array. PyArrow packs both the values and the validity mask into bits, so compared to PyArrow the memory usage is 8x as much, and computational algorithms can be up to 64x slower.
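To make the 8x figure concrete, here is a rough pure-Python sketch of the two layouts. It is not how pandas, PyArrow, or nanoarrow are implemented; `pack_bits`, `get_bit`, `bytemask_size` and `bitmask_size` are hypothetical helpers written only for this illustration, and the least-significant-bit-first ordering is assumed to match Arrow's bitmap convention.

```python
def pack_bits(flags):
    """Pack an iterable of truthy/falsy flags into bytes, least-significant bit first."""
    flags = list(flags)
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def get_bit(buf, i):
    """Read bit i from a packed buffer."""
    return bool((buf[i // 8] >> (i % 8)) & 1)


def bytemask_size(n):
    """One byte per value plus one byte per mask entry."""
    return 2 * n


def bitmask_size(n):
    """One bit per value plus one bit per validity entry, rounded up to whole bytes."""
    return 2 * ((n + 7) // 8)


values = [True, False, None, True, None, False, True, True, False, True]
n = len(values)

# Bit-packed layout: a validity bitmap plus a data bitmap.
validity = pack_bits(v is not None for v in values)
data = pack_bits(bool(v) for v in values)
bitmask_nbytes = len(validity) + len(data)

# Decode again to check that nothing was lost in the packing.
decoded = [get_bit(data, i) if get_bit(validity, i) else None for i in range(n)]

print(bytemask_size(n), bitmask_nbytes)
print(bytemask_size(1_000_000) / bitmask_size(1_000_000))
```

The ratio only approaches 8x for arrays large enough that the padding to whole bytes stops mattering; for the ten-element example above it is 20 bytes against 4.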

Feature Description

The pd.arrays.BooleanArray could use nanoarrow behind the scenes for its implementation, rather than the existing NumPy approach.

I think the main technical challenges for this would be:

  1. Build system integration. nanoarrow is already available in the Meson WrapDB, and progress is underway with nanobind. It is probably worth waiting for the latter, but once that is complete this is less of a concern.
  2. 2D support, if ever needed. You could try to simulate 2D indexing operations with a bitmask, but something like transposition (which is trivial with a bytemask) is a concept that does not translate well when moving from bytes to bits. I don't know that this is a huge issue since the existing BooleanArray does not support 2D, but @jbrockmendel probably knows best on any plans for that.
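As a rough illustration of the 2D point: with one byte per element, a transpose can be a stride swap that copies nothing, whereas with a packed bitmask the rows generally do not start on byte boundaries, so a naive transpose has to relocate every bit individually. The sketch below is pure Python and purely hypothetical (`pack_bits`, `get_bit`, `get_2d` and `transpose_bits` are made-up helpers, not pandas or nanoarrow API), and it assumes a row-major layout.

```python
def pack_bits(flags):
    """Pack an iterable of truthy/falsy flags into bytes, least-significant bit first."""
    flags = list(flags)
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def get_bit(buf, i):
    """Read bit i from a packed buffer."""
    return bool((buf[i // 8] >> (i % 8)) & 1)


def get_2d(buf, ncols, row, col):
    """Simulate 2D indexing on a flat, row-major bitmask."""
    return get_bit(buf, row * ncols + col)


def transpose_bits(buf, nrows, ncols):
    """Naive transpose: every set bit is moved to its new flat position."""
    out = bytearray((nrows * ncols + 7) // 8)
    for r in range(nrows):
        for c in range(ncols):
            if get_bit(buf, r * ncols + c):
                j = c * nrows + r  # flat index in the (ncols x nrows) result
                out[j // 8] |= 1 << (j % 8)
    return bytes(out)


grid = [
    [True, False, True],
    [False, False, True],
]
nrows, ncols = 2, 3

packed = pack_bits(v for row in grid for v in row)
transposed = transpose_bits(packed, nrows, ncols)

# The transposed bitmask has ncols rows and nrows columns.
result = [[get_2d(transposed, nrows, c, r) for r in range(nrows)] for c in range(ncols)]
print(result)
```

Indexing itself is cheap to simulate, as `get_2d` shows; it is the reshuffling in `transpose_bits` that has no zero-copy equivalent in this simple layout.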

Alternative Solutions

status quo

Additional Context

No response
