Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simpler subsetting for [.tbl_df #610

Closed
hadley opened this issue Sep 19, 2014 · 12 comments
Closed

Simpler subsetting for [.tbl_df #610

hadley opened this issue Sep 19, 2014 · 12 comments
Assignees
Labels
Milestone

Comments

@hadley
Copy link
Member

@hadley hadley commented Sep 19, 2014

`[.my_df` <- function (x, i, j) {
  if (missing(i) && missing(j)) return(x)

  # First, subset columns
  if (!missing(j)) {
    x <- .subset(x, j)
  }

  # Next, subset rows
  if (!missing(i)) {
    x <- lapply(x, `[`, i)
  }

  # TODO: handle 0 column case
  structure(x,
    class = "data.frame",
    row.names = .set_row_names(length(x[[1]]))
  )
}

mtcars2 <- mtcars
class(mtcars2) <- c("my_df", "data.frame")

mtcars2[]
mtcars2[,]

mtcars2[0, ]
mtcars2[, 0]

mtcars2[1:5, ]
mtcars2[, 1:5]

mtcars2[, c("mpg", "disp", "hp")]
mtcars2[mtcars2$mpg < 20, ]


library(microbenchmark)

microbenchmark(
  mtcars[1:5, ],
  mtcars2[1:5, ],
  mtcars[, 1:5],
  mtcars2[, 1:5],
  mtcars[],
  mtcars2[]
)
@jimhester
Copy link
Member

@jimhester jimhester commented Sep 19, 2014

The amount of speedup from using this simplified subsetting was pretty surprising to me! Around ~3x faster on my machine for most of the operations. What major functionality is lost by doing this? Mainly just partial matching of dimnames and dropping unused dimensions right?

@hadley
Copy link
Member Author

@hadley hadley commented Sep 19, 2014

Yeah, and a whole lot of special cases that no one ever uses.

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Sep 19, 2014

Main thing missing: df[1:5] grabs the first 5 rows, not columns. Would have to dispatch based on nargs() as [.data.frame does (but gross), since that's the only way to differentiate between df[1:5] and df[1:5, ].

@hadley
Copy link
Member Author

@hadley hadley commented Sep 19, 2014

@kevinushey but now df[1:5] is equivalent to df[, 1:5] since we never drop, so it's not so important.

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Sep 19, 2014

I don't see that:

> mtcars2[1:5]
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

> head(mtcars2[, 1:5])
   mpg cyl disp  hp drat
1 21.0   6  160 110 3.90
2 21.0   6  160 110 3.90
3 22.8   4  108  93 3.85
4 21.4   6  258 110 3.08
5 18.7   8  360 175 3.15
6 18.1   6  225 105 2.76

@hadley
Copy link
Member Author

@hadley hadley commented Sep 19, 2014

Oh yeah, I meant in general: mtcars[1] vs mtcars[, 1] - list style never drops

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Sep 19, 2014

Here's a version with an escape hatch for the single element (column subsetting) case:

`[.my_df` <- function (x, i, j) {
  if (missing(i) && missing(j)) return(x)

  # Escape early if nargs() == 2L; ie, column subsetting
  if (nargs() == 2L) {
    result <- .subset(x, i)
    class(result) <- "data.frame"
    attr(result, "row.names") <- .set_row_names(length(x[[1]]))
    return(result)
  }

  # First, subset columns
  if (!missing(j)) {
    x <- .subset(x, j)
  }

  # Next, subset rows
  if (!missing(i)) {
    x <- lapply(x, `[`, i)
  }

  # TODO: handle 0 column case
  structure(x,
            class = "data.frame",
            row.names = .set_row_names(length(x[[1]]))
  )
}

BTW, I am not sure if it's still true in newer R but I remember @wch remarking that structure would force a copy (whereas assigning directly to attributes can avoid that).

@hadley
Copy link
Member Author

@hadley hadley commented Sep 19, 2014

I don't think that case is important enough that it deserves special handling. But you're right that we should check on structure vs setters.

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Sep 19, 2014

I would disagree :/ I think column subsetting with the df[i] notation is fairly common, and I imagine users expect [.tbl_df and [.data.frame(..., drop = FALSE) to be roughly identical (barring an understanding that there are no row names, and subset of types accepted). Or at least common enough that you would be getting new issues posted on GitHub quite soon after a release and you'd either have to WONTFIX or hack it back in.

Whether it's best practice is another story, I guess.

@hadley
Copy link
Member Author

@hadley hadley commented Sep 19, 2014

@kevinushey interesting - I thought it was actually pretty rare. I've seen hardly anyone use it apart from me, but if you think it's fairly common, the fix isn't too awful.

I wonder if it's worth it to have a C function that definitely coerces a list to a data frame in place.

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Sep 19, 2014

Yeah, definitely -- and it wouldn't be hard (relatively easy with just C, very easy with Rcpp) -- I have an (ugly) example at https://github.com/kevinushey/Kmisc/blob/master/src/list2df.c; would be happy to contribute a cleaned-up version of that.

@hadley
Copy link
Member Author

@hadley hadley commented Sep 19, 2014

That looks good, and Rcpp would be fine since dplyr already has a dependency. I think it'll need an optional argument for the number of rows, since that might be available elsewhere, and will be needed for the 0 column case.

@hadley hadley added the feature label Sep 22, 2014
@hadley hadley added this to the 0.3 milestone Sep 22, 2014
@hadley hadley self-assigned this Sep 22, 2014
@hadley hadley closed this in 6e78f57 Sep 22, 2014
krlmlr pushed a commit to krlmlr/dplyr that referenced this issue Mar 2, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants