New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tbl_df[, i] shouldn't coerce to vector #587

Closed
hadley opened this Issue Sep 8, 2014 · 17 comments

Comments

Projects
None yet
5 participants
@hadley
Member

hadley commented Sep 8, 2014

No description provided.

@kevinushey

This comment has been minimized.

Show comment
Hide comment
@kevinushey

kevinushey Sep 8, 2014

Contributor

Would it be sufficient to just delegate to [.data.frame w/ drop = FALSE and coerce back to tbl_df at the end? E.g.

`[.tbl_df` <- function(x, i, j, drop = FALSE) {
  as.tbl(`[.data.frame`(x, i, j, drop = drop))
}
Contributor

kevinushey commented Sep 8, 2014

Would it be sufficient to just delegate to [.data.frame w/ drop = FALSE and coerce back to tbl_df at the end? E.g.

`[.tbl_df` <- function(x, i, j, drop = FALSE) {
  as.tbl(`[.data.frame`(x, i, j, drop = drop))
}
@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Sep 9, 2014

Member

Yes exactly. I think we might already has a method for grouped objects.

Member

hadley commented Sep 9, 2014

Yes exactly. I think we might already has a method for grouped objects.

@hadley hadley added the feature label Sep 11, 2014

@hadley hadley added this to the 0.3 milestone Sep 11, 2014

@hadley hadley self-assigned this Sep 11, 2014

@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Sep 12, 2014

Member

Ugh, I give up for now. Here are some test cases

test_that("tbl_df doesn't drop by default", {
  df <- data_frame(x = 1:4)

  expect_equal(df[], df)
  expect_equal(df[1], df)
  expect_equal(df[, 1], df)
  expect_equal(df[1:2, ], df)
})

the obvious stuff like @kevinushey's idea doesn't work because of the special weirdness of [.

Member

hadley commented Sep 12, 2014

Ugh, I give up for now. Here are some test cases

test_that("tbl_df doesn't drop by default", {
  df <- data_frame(x = 1:4)

  expect_equal(df[], df)
  expect_equal(df[1], df)
  expect_equal(df[, 1], df)
  expect_equal(df[1:2, ], df)
})

the obvious stuff like @kevinushey's idea doesn't work because of the special weirdness of [.

@kevinushey

This comment has been minimized.

Show comment
Hide comment
@kevinushey

kevinushey Sep 12, 2014

Contributor

Yuck. This is pretty much because of these being different:

df <- data.frame(x = 1:4)
`[.data.frame`(df, 1)
`[.data.frame`(df, 1, )

Is there any way to differentiate between those two cases in a sensible way?

Contributor

kevinushey commented Sep 12, 2014

Yuck. This is pretty much because of these being different:

df <- data.frame(x = 1:4)
`[.data.frame`(df, 1)
`[.data.frame`(df, 1, )

Is there any way to differentiate between those two cases in a sensible way?

@kevinushey

This comment has been minimized.

Show comment
Hide comment
@kevinushey

kevinushey Sep 12, 2014

Contributor

Here's something gross:

#' @export
`[.tbl_df` <- function(x, i, j, drop = FALSE) {
  ## Needed to distinguish between x[i] and x[i, ]
  num_args <- nargs()

  ## If we have 3 or more arguments, can use regular dispatch
  if (num_args >= 3L) {
    return(as.tbl(`[.data.frame`(x, i, j, drop)))
  }

  ## Otherwise, dispatch depending on missingness
  if (missing(j)) {
    if (missing(i)) {
      return(x)
    } else {
      return(as.tbl(`[.data.frame`(x, , i, drop)))
    }
  }
  as.tbl(`[.data.frame`(x, i, j, drop))
}

It passes these tests:

library(dplyr)
library(testthat)
tbl <- data_frame(x = 1:4)
df <- data.frame(x = 1:4)

check <- function(x, y) {
  expect_equal(as.data.frame(x), y)
}

check(tbl[], df[])
check(tbl[1], df[1])
check(tbl[, 1], df[, 1, drop = FALSE])
check(tbl[1:2, ], df[1:2, , drop = FALSE])

Part of me thinks we should just roll our own subsetting methods though, rather than delegating to [.data.frame.

Contributor

kevinushey commented Sep 12, 2014

Here's something gross:

#' @export
`[.tbl_df` <- function(x, i, j, drop = FALSE) {
  ## Needed to distinguish between x[i] and x[i, ]
  num_args <- nargs()

  ## If we have 3 or more arguments, can use regular dispatch
  if (num_args >= 3L) {
    return(as.tbl(`[.data.frame`(x, i, j, drop)))
  }

  ## Otherwise, dispatch depending on missingness
  if (missing(j)) {
    if (missing(i)) {
      return(x)
    } else {
      return(as.tbl(`[.data.frame`(x, , i, drop)))
    }
  }
  as.tbl(`[.data.frame`(x, i, j, drop))
}

It passes these tests:

library(dplyr)
library(testthat)
tbl <- data_frame(x = 1:4)
df <- data.frame(x = 1:4)

check <- function(x, y) {
  expect_equal(as.data.frame(x), y)
}

check(tbl[], df[])
check(tbl[1], df[1])
check(tbl[, 1], df[, 1, drop = FALSE])
check(tbl[1:2, ], df[1:2, , drop = FALSE])

Part of me thinks we should just roll our own subsetting methods though, rather than delegating to [.data.frame.

@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Sep 12, 2014

Member

Does your code work with an explicit drop?

Reimplementing subsetting for data frames from scratch would be a lot of work :(

Member

hadley commented Sep 12, 2014

Does your code work with an explicit drop?

Reimplementing subsetting for data frames from scratch would be a lot of work :(

@kevinushey

This comment has been minimized.

Show comment
Hide comment
@kevinushey

kevinushey Sep 12, 2014

Contributor

Iteration two (maybe I'll put this in a PR if this seems like the route we want to go):

maybe_tbl_df <- function(object) {
  if (is.data.frame(object)) {
    tbl_df(object)
  } else {
    object
  }
}

#' @export
`[.tbl_df` <- function(x, i, j, drop = FALSE) {

  num_args <- nargs() - !missing(drop)
  if (num_args >= 3L) {
    return(maybe_tbl_df(`[.data.frame`(x, i, j, drop)))
  }

  if (missing(j)) {
    if (missing(i)) {
      return(x)
    } else {
      return(maybe_tbl_df(`[.data.frame`(x, , i, drop)))
    }
  }
  maybe_tbl_df(`[.data.frame`(x, i, j, drop))
}

passes:

library(dplyr)
library(testthat)
tbl <- data_frame(x = 1:4)
df <- data.frame(x = 1:4)

check <- function(x, y) {
  expect_equal(as.data.frame(x), y)
}

check(tbl, df)

check(tbl[], df[])
check(tbl[1], df[1])
check(tbl[, 1], df[, 1, drop = FALSE])
check(tbl[1:2, ], df[1:2, , drop = FALSE])

check(tbl[drop = FALSE], df[drop = FALSE])
check(tbl[1, drop = FALSE], df[1, drop = FALSE])
check(tbl[, 1, drop = FALSE], df[, 1, drop = FALSE])
check(tbl[1:2, , drop = FALSE], df[1:2, , drop = FALSE])

check(tbl[drop = TRUE], df[drop = TRUE])
check(tbl[1, drop = TRUE], df[1, drop = TRUE])
expect_equal(tbl[, 1, drop = TRUE], df[, 1, drop = TRUE]) ## because we don't want as.df
expect_equal(tbl[1:2, , drop = TRUE], df[1:2, , drop = TRUE]) ## don't want as.df
Contributor

kevinushey commented Sep 12, 2014

Iteration two (maybe I'll put this in a PR if this seems like the route we want to go):

maybe_tbl_df <- function(object) {
  if (is.data.frame(object)) {
    tbl_df(object)
  } else {
    object
  }
}

#' @export
`[.tbl_df` <- function(x, i, j, drop = FALSE) {

  num_args <- nargs() - !missing(drop)
  if (num_args >= 3L) {
    return(maybe_tbl_df(`[.data.frame`(x, i, j, drop)))
  }

  if (missing(j)) {
    if (missing(i)) {
      return(x)
    } else {
      return(maybe_tbl_df(`[.data.frame`(x, , i, drop)))
    }
  }
  maybe_tbl_df(`[.data.frame`(x, i, j, drop))
}

passes:

library(dplyr)
library(testthat)
tbl <- data_frame(x = 1:4)
df <- data.frame(x = 1:4)

check <- function(x, y) {
  expect_equal(as.data.frame(x), y)
}

check(tbl, df)

check(tbl[], df[])
check(tbl[1], df[1])
check(tbl[, 1], df[, 1, drop = FALSE])
check(tbl[1:2, ], df[1:2, , drop = FALSE])

check(tbl[drop = FALSE], df[drop = FALSE])
check(tbl[1, drop = FALSE], df[1, drop = FALSE])
check(tbl[, 1, drop = FALSE], df[, 1, drop = FALSE])
check(tbl[1:2, , drop = FALSE], df[1:2, , drop = FALSE])

check(tbl[drop = TRUE], df[drop = TRUE])
check(tbl[1, drop = TRUE], df[1, drop = TRUE])
expect_equal(tbl[, 1, drop = TRUE], df[, 1, drop = TRUE]) ## because we don't want as.df
expect_equal(tbl[1:2, , drop = TRUE], df[1:2, , drop = TRUE]) ## don't want as.df
@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Sep 12, 2014

Member

One more test case: tbl[,] - it's horrifying how complicated this code has to be :/

Member

hadley commented Sep 12, 2014

One more test case: tbl[,] - it's horrifying how complicated this code has to be :/

@kevinushey

This comment has been minimized.

Show comment
Hide comment
@kevinushey

kevinushey Sep 12, 2014

Contributor

Works, but with the different default for drop, ie:

> tbl[,]
Source: local data frame [4 x 1]

  x
1 1
2 2
3 3
4 4
> df[,]
[1] 1 2 3 4
> df[, drop = FALSE]
  x
1 1
2 2
3 3
4 4
Warning message:
In `[.data.frame`(df, , drop = FALSE) : 'drop' argument will be ignored
> df[,, drop = FALSE]
  x
1 1
2 2
3 3
4 4

I was forced to laugh a bit at the warning...

Contributor

kevinushey commented Sep 12, 2014

Works, but with the different default for drop, ie:

> tbl[,]
Source: local data frame [4 x 1]

  x
1 1
2 2
3 3
4 4
> df[,]
[1] 1 2 3 4
> df[, drop = FALSE]
  x
1 1
2 2
3 3
4 4
Warning message:
In `[.data.frame`(df, , drop = FALSE) : 'drop' argument will be ignored
> df[,, drop = FALSE]
  x
1 1
2 2
3 3
4 4

I was forced to laugh a bit at the warning...

@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Sep 12, 2014

Member

The last one really should be df[,,drop = FALSE].

Member

hadley commented Sep 12, 2014

The last one really should be df[,,drop = FALSE].

@hadley hadley closed this in 6e78f57 Sep 22, 2014

@trinker

This comment has been minimized.

Show comment
Hide comment
@trinker

trinker Oct 1, 2014

May I ask why this is a desirable behavior? If tbl_df is meant to be used in all the same ways data.frame is used then it may be confusing and cause errors in functions that rely on the behavior of df[, index] rather than df[[index]].

Here's an example that used to be fine with tbl_df because indexing a single column resulted in a vector.

FUN <- function(x, y) {
    if (is.data.frame(x)) stop("supply vector")
    cor(x, y)
}

x <- tbl_df(mtcars)
FUN(x[, "mpg"], x[, "hp"])

Now to get the same effect the user would have to use FUN(x[["mpg"]], x[["hp"]]).

It's likely people use a similar check of a formal throughout many R add on packages. This means dplyr''s tbl_df would cause an error with those functions.

paste comes to mind as a place this would cause issue:

x <- tbl_df(mtcars)
paste(x[, 1], collapse="|")
## vs.
paste(mtcars[, 1], collapse="|")

Similarly this means S3 classed functions that are serching for a vector of some sort, if you index as df[, index] then the incorrect method may be called:

x <- tbl_df(mtcars)

FUN2 <- function(x, ...) {
    UseMethod("FUN2")
}

FUN2.numeric <- function(x, ...) {
    message("A vector can't do `cor`")
}

FUN2.data.frame <- function(x, ...) {
    message(if(ncol(x) == 1) warning("You need 2 vectors for a correlation"))
    cor(x)
}

FUN2(x[, 1])
## You need 2 vectors for a correlation
##     mpg
## mpg   1
## Warning message:
## In message(if (ncol(x) == 1) warning("You need 2 vectors for a correlation")) :
##   You need 2 vectors for a correlation

FUN2(mtcars[, 1])
## A vector can't do `cor`

trinker commented Oct 1, 2014

May I ask why this is a desirable behavior? If tbl_df is meant to be used in all the same ways data.frame is used then it may be confusing and cause errors in functions that rely on the behavior of df[, index] rather than df[[index]].

Here's an example that used to be fine with tbl_df because indexing a single column resulted in a vector.

FUN <- function(x, y) {
    if (is.data.frame(x)) stop("supply vector")
    cor(x, y)
}

x <- tbl_df(mtcars)
FUN(x[, "mpg"], x[, "hp"])

Now to get the same effect the user would have to use FUN(x[["mpg"]], x[["hp"]]).

It's likely people use a similar check of a formal throughout many R add on packages. This means dplyr''s tbl_df would cause an error with those functions.

paste comes to mind as a place this would cause issue:

x <- tbl_df(mtcars)
paste(x[, 1], collapse="|")
## vs.
paste(mtcars[, 1], collapse="|")

Similarly this means S3 classed functions that are serching for a vector of some sort, if you index as df[, index] then the incorrect method may be called:

x <- tbl_df(mtcars)

FUN2 <- function(x, ...) {
    UseMethod("FUN2")
}

FUN2.numeric <- function(x, ...) {
    message("A vector can't do `cor`")
}

FUN2.data.frame <- function(x, ...) {
    message(if(ncol(x) == 1) warning("You need 2 vectors for a correlation"))
    cor(x)
}

FUN2(x[, 1])
## You need 2 vectors for a correlation
##     mpg
## mpg   1
## Warning message:
## In message(if (ncol(x) == 1) warning("You need 2 vectors for a correlation")) :
##   You need 2 vectors for a correlation

FUN2(mtcars[, 1])
## A vector can't do `cor`
@gvfarns

This comment has been minimized.

Show comment
Hide comment
@gvfarns

gvfarns Oct 10, 2014

This feature is very problematic. It breaks existing code of mine that depends on the drop=TRUE default of data frames. When I convert code over to dplyr, I unexpectedly get errors later in the code. In fact, [.tbl_df does not appear to have the option of drop=TRUE so I can't make the direct change to make things work as expected. As a result, I end up stripping "tbl_df" from the class of my dataframes/tbl_df's all over the place.

With respect, one of the biggest advantages dplyr has over other approaches, such as those of data.table, is that it doesn't overload basic operators like [ in such a way that existing R code breaks.

Unless there is a very critical reason to depart from the default behavior of data frames, I would respectfully encourage you to reconsider this decision.

gvfarns commented Oct 10, 2014

This feature is very problematic. It breaks existing code of mine that depends on the drop=TRUE default of data frames. When I convert code over to dplyr, I unexpectedly get errors later in the code. In fact, [.tbl_df does not appear to have the option of drop=TRUE so I can't make the direct change to make things work as expected. As a result, I end up stripping "tbl_df" from the class of my dataframes/tbl_df's all over the place.

With respect, one of the biggest advantages dplyr has over other approaches, such as those of data.table, is that it doesn't overload basic operators like [ in such a way that existing R code breaks.

Unless there is a very critical reason to depart from the default behavior of data frames, I would respectfully encourage you to reconsider this decision.

@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Oct 10, 2014

Member

@gvfarns regardless of whether or not we rollback this change you can always use [[

Member

hadley commented Oct 10, 2014

@gvfarns regardless of whether or not we rollback this change you can always use [[

@kevinushey

This comment has been minimized.

Show comment
Hide comment
@kevinushey

kevinushey Oct 13, 2014

Contributor

If enough people are uncomfortable with this it could be worth rolling back.

Honestly, I think df[, i] should be discouraged for data.frames (it is of course okay if you're living in a data.table world). But in general:

Subset a data.frame by columns: df[c(...)],
Get a particular column: df[[i]]
Get rows: df[i, ]

and the df[, i] syntax should be mostly avoided.

Contributor

kevinushey commented Oct 13, 2014

If enough people are uncomfortable with this it could be worth rolling back.

Honestly, I think df[, i] should be discouraged for data.frames (it is of course okay if you're living in a data.table world). But in general:

Subset a data.frame by columns: df[c(...)],
Get a particular column: df[[i]]
Get rows: df[i, ]

and the df[, i] syntax should be mostly avoided.

@kevinushey

This comment has been minimized.

Show comment
Hide comment
@kevinushey

kevinushey Oct 13, 2014

Contributor

Put another way: dplyr shouldn't touch the [ function; it should discourage its use (and maybe provide a couple more verbs for row-wise or column-wise subsetting by index)

Contributor

kevinushey commented Oct 13, 2014

Put another way: dplyr shouldn't touch the [ function; it should discourage its use (and maybe provide a couple more verbs for row-wise or column-wise subsetting by index)

@Fablepongiste

This comment has been minimized.

Show comment
Hide comment
@Fablepongiste

Fablepongiste Feb 25, 2015

So is this definitive or a rollback still studied ?

will this code always give different results :

df <- data.frame(a = c(1:3), b = c(2:4))
df[,'a'] -> returns vector numeric
# 
tbl_df(df)[,'a']  -> returns a local data frame with one column a ?

And if yes, you advise to use df[['a']] and tbl_df(df)[['a']] for consistent behaviour between both ?

Fablepongiste commented Feb 25, 2015

So is this definitive or a rollback still studied ?

will this code always give different results :

df <- data.frame(a = c(1:3), b = c(2:4))
df[,'a'] -> returns vector numeric
# 
tbl_df(df)[,'a']  -> returns a local data frame with one column a ?

And if yes, you advise to use df[['a']] and tbl_df(df)[['a']] for consistent behaviour between both ?

@hadley

This comment has been minimized.

Show comment
Hide comment
@hadley

hadley Feb 25, 2015

Member

I'm pretty sure this behaviour is going to stay. Use [ to return a data frame, use [[ to return a column.

Member

hadley commented Feb 25, 2015

I'm pretty sure this behaviour is going to stay. Use [ to return a data frame, use [[ to return a column.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.