Write more about dplyr memory usage #198

Closed
@hadley

Description

Starting from the following email. It needs updating to use the latest changes, and it should become a vignette.


We'll start by making a local copy of the internal dplyr function dfloc(). This function is very useful for helping us understand how the memory in a data frame works.

library(dplyr)
dfloc <- dplyr:::dfloc

(dfloc will eventually be exported from dplyr once we've thought it through a bit more.)

dfloc() tells us the address of each vector in the data frame.

dfloc(iris)

If these addresses change between operations, then we know R has made a copy. It's important to think about data frames as collections of columns rather than monolithic objects, because many operations can reuse existing columns and so use no extra memory.
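The address comparison described above can be sketched in base R. This is only an illustration, not how dfloc() is implemented: col_addrs() and shared_cols() are hypothetical helpers, and the sketch assumes R was built with memory profiling (check capabilities("profmem")), because it reads each address from the string returned by tracemem().

```r
# Hypothetical helper: return the memory address of each column of a data frame.
# Assumes capabilities("profmem") is TRUE, since tracemem() needs memory profiling.
col_addrs <- function(df) {
  out <- character(length(df))
  for (i in seq_along(df)) {
    col <- .subset2(df, i)      # extract the column without method dispatch
    out[i] <- tracemem(col)     # tracemem() returns a string holding the address
    untracemem(col)             # switch tracing off again straight away
  }
  setNames(out, names(df))
}

# Hypothetical helper: for columns present in both data frames,
# report whether they live at the same address (TRUE means no copy was made).
shared_cols <- function(before, after) {
  common <- intersect(names(before), names(after))
  col_addrs(before)[common] == col_addrs(after)[common]
}

df  <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
df2 <- df
df2$y <- df2$y * 2   # y is replaced by a newly allocated vector

# y must differ; whether x is still shared depends on the R version.
shared_cols(df, df2)
```

If the lobstr package is installed, lobstr::obj_addrs() reports the same kind of information without relying on tracemem().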

In base R, a surprising number of operations make copies of the individual vectors. For example, when you extract two columns from a data frame, their contents are actually copied. There's no reason to do this!

# Copies the first two columns
dfloc(iris[1:2])

dfloc(iris)
# Copies all the columns!
iris$blah <- 1
dfloc(iris)

(This may improve in R 3.1.0 thanks to some work by Michael Lawrence.)

The goal of dplyr is to avoid making copies when not needed:

dfloc(iris)
dfloc(group_by(iris, Species))
dfloc(mutate(iris, area = Sepal.Length * Sepal.Width))
dfloc(select(iris, 1:3))

Currently, group_by() doesn't make a copy, but mutate() and select() do, so we'll fix that for the next version. Once we've done that, any sequence of mutate(), select() and group_by() will only need to occupy a little extra memory (i.e. for the indices and new variables). Saving interim results will not have any effect on memory usage.

Obviously there's no way around summarise() making a copy, but it's usually not a big deal since you're reducing the size of the data so much. arrange() also has to make a copy, but you can generally avoid using it, since row order doesn't affect any statistical operation. If ordering is important (e.g. for computing a cumulative mean), dplyr provides ways to avoid copying the whole data frame and instead reorder only the columns you need (see the windowing vignette for more details).
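To make the idea of reordering only the columns you need concrete, here is a minimal base R sketch. It is not dplyr's windowing implementation; cummean_by() is a hypothetical helper, and the small data frame is made up for illustration. Only the two vectors involved are reordered, and the rows of the data frame stay where they are.

```r
# Hypothetical helper: cumulative mean of `value`, taken in the order given by
# `key`, with the results returned in the original row order.
cummean_by <- function(value, key) {
  o <- order(key)                               # permutation that sorts by key
  running <- cumsum(value[o]) / seq_along(o)    # cumulative mean in sorted order
  out <- numeric(length(value))
  out[o] <- running                             # scatter back to original positions
  out
}

# Illustrative data: rows are deliberately not sorted by year.
df <- data.frame(year = c(2003, 2001, 2002), value = c(30, 10, 20))

# The data frame itself is never sorted; only `value` is reordered internally.
df$running_mean <- cummean_by(df$value, df$year)
df
```

The extra memory is a few vectors the length of one column, rather than a sorted copy of every column.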

Altogether, this means that dplyr lets you work with data frames with very little extra overhead. Eventually, dplyr will never create a complete copy of the data frame unless you're sorting it, and will provide tools so that you never need to sort it. This should mean that you can keep using data frames, and don't need to switch to a more complex object with reference semantics (like a data table).
