Write more about dplyr memory usage #198

Closed
hadley opened this issue Jan 20, 2014 · 1 comment
@hadley hadley commented Jan 20, 2014

Starting from the following email. It needs updating to reflect recent changes, and should become a vignette.


We'll start by making a local copy of the internal dplyr function dfloc(). This function is very useful for helping us understand how the memory in a data frame works.

library(dplyr)
dfloc <- dplyr:::dfloc

(dfloc will eventually be exported from dplyr once we've thought it through a bit more.)

dfloc() tells us the address of each vector in the data frame.

dfloc(iris)

If these addresses change between operations then we know R has made a copy. It's important to think about data frames as collections of columns rather than monolithic objects, because for many operations we can reuse existing columns and not use any extra memory.
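Since dfloc() is internal and may not be available in every dplyr version, here is a minimal sketch of the same idea using lobstr::obj_addrs(). It assumes the lobstr package is installed; col_addrs() and shared_cols() are hypothetical helpers written for illustration, not part of dplyr.

```r
# Assumption: lobstr is installed. col_addrs() and shared_cols() are
# hypothetical helpers, not dplyr functions.
library(lobstr)

# Address of each column vector; a data frame is a list of columns
col_addrs <- function(df) {
  stopifnot(is.data.frame(df))
  addrs <- obj_addrs(df)
  names(addrs) <- names(df)
  addrs
}

# TRUE for every column (matched by name) that lives at the same address
# in both data frames, i.e. that was reused rather than copied
shared_cols <- function(before, after) {
  common <- intersect(names(before), names(after))
  col_addrs(before)[common] == col_addrs(after)[common]
}

df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
col_addrs(df)
shared_cols(df, df)  # an object trivially shares every column with itself
```

Comparing by column name rather than by position keeps the check meaningful when an operation drops or reorders columns.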

In base R, a surprising number of operations make copies of the individual vectors. For example, when you extract two columns from a data frame, their contents are actually copied. There's no reason to do this!

# Copies the first two columns
dfloc(iris[1:2])

dfloc(iris)
# Copies all the columns!
iris$blah <- 1
dfloc(iris)

(This may improve in R 3.1.0, thanks to work by Michael Lawrence.)
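The behaviour described here depends on the R version, so it is worth checking on your own installation. A rough sketch, assuming the lobstr package is installed and using a small made-up data frame rather than iris:

```r
# Assumption: lobstr is installed; the data frame is made up for illustration.
library(lobstr)

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
before <- obj_addrs(df)

# Extract two columns: TRUE means the column was reused, FALSE means copied
sub <- df[1:2]
obj_addrs(sub) == before[1:2]

# Add a column: compare the three original columns with their old addresses
df$blah <- 1
obj_addrs(df)[1:3] == before
```

No expected output is shown on purpose: the result is exactly what differs between R versions.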

The goal of dplyr is to avoid making copies when not needed:

dfloc(iris)
dfloc(group_by(iris, Species))
dfloc(mutate(iris, area = Sepal.Length * Sepal.Width))
dfloc(select(iris, 1:3))

Currently, group_by() doesn't make a copy, but mutate() and select() do, so we'll fix that for the next version. Once we've done that, any sequence of mutate(), select() and group_by() will only need to occupy a little extra memory (i.e. for the indices and new variables). Saving interim results will not have any effect on memory usage.
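To see which verbs reuse columns in the dplyr version you have installed, one option is to compare column addresses by name. This is only a sketch: it assumes the lobstr package is installed, and shares_with() is a hypothetical helper, not a dplyr function.

```r
# Assumption: lobstr is installed. shares_with() is a hypothetical helper.
library(dplyr)
library(lobstr)

df <- data.frame(
  g = c("a", "a", "b"),
  w = c(1.5, 2.5, 3.5),
  h = c(2, 4, 6)
)

# TRUE where a column of `after` lives at the same address as the column
# of the same name in `before` (i.e. no copy was made)
shares_with <- function(before, after) {
  common <- intersect(names(before), names(after))
  b <- setNames(obj_addrs(before), names(before))
  a <- setNames(obj_addrs(after), names(after))
  a[common] == b[common]
}

shares_with(df, group_by(df, g))
shares_with(df, mutate(df, area = w * h))
shares_with(df, select(df, 1:2))
```

The new column area is not in the comparison because it has no counterpart in the original data frame; only pre-existing columns can be shared.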

Obviously there's no way around summarise() making a copy, but it's usually not a big deal since you're reducing the size of the data so much. arrange() also has to make a copy, but you can generally avoid using it, since row order doesn't affect most statistical operations. If ordering is important (e.g. for computing a cumulative mean), dplyr provides ways to avoid copying the whole data frame and instead reorder only the columns you need (see the windowing vignette for more details).
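As one possible illustration of reordering only the column you need, dplyr's with_order() applies a function to a single vector in a given order and returns the result in the original row order. Whether this is the exact mechanism meant above is an assumption; cum_mean() is a hypothetical helper and the data are made up.

```r
library(dplyr)

# Made-up data, deliberately not sorted by year
df <- data.frame(
  year  = c(2003, 2001, 2002),
  value = c(30, 10, 20)
)

# Hypothetical helper: running mean of a vector in its current order
cum_mean <- function(x) cumsum(x) / seq_along(x)

# Only `value` is reordered (by year) inside with_order(); the rows of the
# data frame itself are never sorted
out <- mutate(df, running_mean = with_order(year, cum_mean, value))
out$running_mean
# 20 10 15: the running means by year are 10 (2001), 15 (2002), 20 (2003),
# reported in the original row order 2003, 2001, 2002
```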

Altogether, this means that dplyr lets you work with data frames with very little extra overhead. Eventually, dplyr will never create a complete copy of the data frame unless you're sorting it, and will provide tools so that you never need to sort it. This should mean that you can keep using data frames, and don't need to switch to a more complex object with reference semantics (like a data table).

@lock lock bot commented Sep 16, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 16, 2018