Should vec_chop() materialize ALTREP vectors? #1450

@DavisVaughan

We typically use vec_chop() in two ways:

  1. Like as.list(), turning a vector into a list in which each element holds one element of the original vector
  2. For chunking groups in dplyr::group_by() / summarise()

In both of these cases, we are guaranteed to touch every element of the vector.

It is possible to not touch every element, like vec_chop(1:5, list(1, 2)), but I don't think I've ever used that.
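
For reference, here is a small sketch of the three call patterns described above. The input vector and the index lists are made up for illustration, and `indices` is passed by name, on the assumption that this is the safest spelling across vctrs versions.

```r
library(vctrs)

# Illustrative input; any vector type would do.
x <- c(10L, 20L, 30L, 40L, 50L)

# 1. as.list()-style: one list element per element of x.
as_list <- vec_chop(x)

# 2. Group-style chunking: the indices cover every element of x,
#    as they would for dplyr groups.
groups <- vec_chop(x, indices = list(1:2, 3:5))

# 3. Partial chop: only the first two elements are touched.
partial <- vec_chop(x, indices = list(1L, 2L))
```

In the first two patterns every element of `x` ends up in some chunk, which is the situation this issue is concerned with.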

Currently, vec_chop() internally uses vec_slice(), which slices ALTREP vectors with their Extract_subset method when one exists. That method can return another ALTREP vector (as it does with vroom). Performance can degrade significantly if we end up with many small ALTREP chunks, because we will eventually have to call DATAPTR() on each chunk and materialize it separately, which could be slow, like in tidyverse/dplyr#6015.
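
To make the "many small chunks" point concrete, here is a rough R-level model of what a chop amounts to: one vec_slice() call per chunk. This is only a conceptual sketch; the real implementation lives in C, and `chop_by_slicing()` is a hypothetical helper written for this illustration, not part of vctrs.

```r
library(vctrs)

# Hypothetical helper: models a chop as one vec_slice() call per chunk.
# By default it produces one chunk per element, like vec_chop(x).
chop_by_slicing <- function(x, indices = as.list(seq_len(vec_size(x)))) {
  lapply(indices, function(i) vec_slice(x, i))
}

x <- c(1.5, 2.5, 3.5)
manual <- chop_by_slicing(x)
```

If `x` were an ALTREP vector whose Extract_subset method returns ALTREP results, each of those per-chunk slices could itself be a small ALTREP object that has to be materialized on its own later.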

General advice seems to be that you should go ahead and materialize the original ALTREP vector if you know you are going to touch every element of it (tidyverse/dplyr#6015 (comment)).

So to avoid these performance issues, I think we should consider having vec_chop() always avoid the special ALTREP vec_slice() path, and just go straight into the standard path that would call DATAPTR() and materialize the full vector (forcing downstream chunks to also be materialized).
