filter, arrange, slice dropping custom attributes of base vectors #4219

Closed
mwh3780 opened this issue Feb 23, 2019 · 11 comments

@mwh3780

mwh3780 commented Feb 23, 2019

filter, arrange, and slice all seem to drop custom attributes of base vectors in version 0.8.0+. I've found similar posts about this behavior, but they all seem to concern custom / user-defined classes, not base vectors (#4079, #3923, #3429).

This wasn't documented as a breaking change, so I am hoping it is a bug; otherwise it's a pretty substantial breaking change.


library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(data.table)
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last

d <- tibble(
  factor  = structure(1:2, .Label = c("0", "1"), class = "factor"   , label = "foo", attr_1 = "some_value"), 
  logical = structure(0:1,                       class = "logical"  , label = "foo", attr_2 = "some_value"), 
  numeric = structure(0:1,                       class = "numeric"  , label = "foo", attr_3 = "some_value"), 
  integer = structure(0:1,                       class = "integer"  , label = "foo", attr_4 = "some_value"), 
  char    = structure(c("0", "1"),               class = "character", label = "foo", attr_5 = "some_value")
)

### dplyr 0.8.0+ no longer preserves custom attributes as it once did
d %>% str
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 1 2
#>   ..- attr(*, "label")= chr "foo"
#>   ..- attr(*, "attr_1")= chr "some_value"
#>  $ logical:Class 'logical'  atomic [1:2] 0 1
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_2")= chr "some_value"
#>  $ numeric:Class 'numeric'  atomic [1:2] 0 1
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_3")= chr "some_value"
#>  $ integer:Class 'integer'  atomic [1:2] 0 1
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_4")= chr "some_value"
#>  $ char   :Class 'character'  atomic [1:2] 0 1
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_5")= chr "some_value"
d %>% filter (1 == 1     ) %>% str
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 1 2
#>  $ logical: int  0 1
#>  $ numeric: int  0 1
#>  $ integer: int  0 1
#>  $ char   : chr  "0" "1"
d %>% slice  (1:2        ) %>% str
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 1 2
#>  $ logical: int  0 1
#>  $ numeric: int  0 1
#>  $ integer: int  0 1
#>  $ char   : chr  "0" "1"
d %>% arrange(factor     ) %>% str
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 1 2
#>  $ logical: int  0 1
#>  $ numeric: int  0 1
#>  $ integer: int  0 1
#>  $ char   : chr  "0" "1"


### While [ doesn't preserve attributes, it's worth noting that other ecosystems like data.table do preserve custom attributes
dt <- as.data.table(d)

dt[numeric == 0] %>% str
#> Classes 'data.table' and 'data.frame':   1 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 1
#>   ..- attr(*, "label")= chr "foo"
#>   ..- attr(*, "attr_1")= chr "some_value"
#>  $ logical:Class 'logical'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_2")= chr "some_value"
#>  $ numeric:Class 'numeric'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_3")= chr "some_value"
#>  $ integer:Class 'integer'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_4")= chr "some_value"
#>  $ char   :Class 'character'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_5")= chr "some_value"
#>  - attr(*, ".internal.selfref")=<externalptr>
dt[1] %>% str
#> Classes 'data.table' and 'data.frame':   1 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 1
#>   ..- attr(*, "label")= chr "foo"
#>   ..- attr(*, "attr_1")= chr "some_value"
#>  $ logical:Class 'logical'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_2")= chr "some_value"
#>  $ numeric:Class 'numeric'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_3")= chr "some_value"
#>  $ integer:Class 'integer'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_4")= chr "some_value"
#>  $ char   :Class 'character'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_5")= chr "some_value"
#>  - attr(*, ".internal.selfref")=<externalptr>
dt[order(-numeric)] %>% str
#> Classes 'data.table' and 'data.frame':   2 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 2 1
#>   ..- attr(*, "label")= chr "foo"
#>   ..- attr(*, "attr_1")= chr "some_value"
#>  $ logical:Class 'logical'  atomic [1:2] 1 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_2")= chr "some_value"
#>  $ numeric:Class 'numeric'  atomic [1:2] 1 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_3")= chr "some_value"
#>  $ integer:Class 'integer'  atomic [1:2] 1 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_4")= chr "some_value"
#>  $ char   :Class 'character'  atomic [1:2] 1 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_5")= chr "some_value"
#>  - attr(*, ".internal.selfref")=<externalptr>

Created on 2019-03-08 by the reprex package (v0.2.1)

@mwh3780 mwh3780 changed the title filter, arrange, slice dropping custom attributes of base classes filter, arrange, slice dropping custom attributes of base vectors Feb 23, 2019
@romainfrancois
Member

Actually, when a column has a class, dplyr since 0.8.0 falls back to R's [ operator, which drops the attributes:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

d <- tibble(
  x = structure(0:1, label = "foo"), 
  y = 0:1
)

# filter() keeps the attributes in this case (the column has no class),
# but I actually believe it should not
filter(d, 1 == 1) %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ x: int  0 1
#>   ..- attr(*, "label")= chr "foo"
#>  $ y: int  0 1

# but it should perhaps not
str(d$x[1:2])
#>  int [1:2] 0 1
str(d$y[1:2])
#>  int [1:2] 0 1

# vctrs::vec_slice() agrees
str(vctrs::vec_slice(d$x, 1:2))
#>  int [1:2] 0 1

# when the object has a class, dplyr falls back 
# to calling R which drops the attributes
d <- tibble(
  x = structure(0:1, label = "foo", class = "numeric")
)
filter(d, 1 == 1) %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  1 variable:
#>  $ x: int  0 1

This is the correct behaviour, because neither dplyr nor base R knows what to do with the attributes. You'd need to define a custom [ method to handle them specifically and control their meaning.

str(structure(0:1, label = "foo")[1:2])
#>  int [1:2] 0 1
str(structure(0:1, label = "foo", class = "numeric")[1:2])
#>  int [1:2] 0 1
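
As a side note to the outputs above, here is a minimal base R sketch (no dplyr involved) of what [ keeps and what it drops. The object names are illustrative only. It also suggests why the factor column in the original reprex kept its levels while losing label: factors have their own [ method that restores levels and class, but not arbitrary attributes.

```r
# Base R only; the object names here are illustrative.
x <- structure(c(a = 0L, b = 1L), label = "foo", attr_1 = "some_value")

# Subsetting a bare vector keeps names but drops the custom attributes.
names(x[1:2])           # names survive
attr(x[1:2], "label")   # NULL
attr(x[1:2], "attr_1")  # NULL

# Factors have a dedicated `[` method that re-attaches levels and class,
# but it does not know about custom attributes such as "label".
f <- structure(factor(c("0", "1")), label = "foo")
levels(f[1:2])          # levels survive
attr(f[1:2], "label")   # NULL
```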

@romainfrancois
Member

Follow-up:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

d <- tibble(
  y = 12:1,
  x = structure(1:12, label = "foo", class = "myclass")
)

d %>% 
  filter(y > 10) %>% 
  str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ y: int  12 11
#>  $ x: int  1 2

`[.myclass` <- function(x, ...) {
  structure(unclass(x)[...], class = "myclass", label = "foo")
}
d %>% 
  filter(y > 10) %>% 
  str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  2 variables:
#>  $ y: int  12 11
#>  $ x: 'myclass' int  1 2
#>   ..- attr(*, "label")= chr "foo"

Created on 2019-03-04 by the reprex package (v0.2.1.9000)
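
The [.myclass method above hard-codes label = "foo". A slightly more general variant might re-attach whatever attributes the input carried. This is only a sketch under stated assumptions: the class name keep_attrs is hypothetical (it is not part of dplyr or vctrs), the vector is assumed to be a plain one-dimensional vector, and every custom attribute is assumed to be independent of the vector's length.

```r
# Hypothetical class "keep_attrs"; not provided by dplyr, vctrs or base R.
`[.keep_attrs` <- function(x, i, ...) {
  # Subset the underlying vector; this drops everything except names.
  out <- unclass(x)[i]

  # Re-attach all original attributes except names, which `[` has
  # already subset correctly. Assumes no dim/dimnames are present.
  keep <- attributes(x)
  keep$names <- NULL
  attributes(out) <- c(attributes(out), keep)
  out
}

x <- structure(1:12, class = "keep_attrs", label = "foo", attr_1 = "some_value")
y <- x[11:12]

attr(y, "label")    # "foo"
attr(y, "attr_1")   # "some_value"
class(y)            # "keep_attrs"
```

Because it relies on S3 dispatch of [, this only helps where the calling code actually subsets columns with [, as the dplyr version discussed in this thread does for classed columns. It also does not address the case raised in the issue, where the column must remain a plain factor.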

@romainfrancois
Member

This might change when we use vctrs::vec_slice() instead of [.

@mwh3780
Author

mwh3780 commented Mar 7, 2019

I'm confused. If this is the correct behavior, then why isn't it documented as a breaking change? It seems odd to switch to using [ without notifying the community of the breaking behavior.

What are the odds that vctrs::vec_slice() will support preserving attributes? The most common case we deal with is needing to append attributes to base vectors, such as factors which are required by many / most algorithms, so defining a custom class isn't an option.

@romainfrancois romainfrancois reopened this Mar 7, 2019
@romainfrancois
Member

Reopening this, but I guess the discussion should move to vctrs::vec_slice().

I’ll add some more content here in the morning.

@mwh3780
Author

mwh3780 commented Mar 7, 2019

Thank you.

@strengejacke
Contributor

As a workaround for the moment, an intermediate call to sjlabelled::copy_labels() can be used.

library(dplyr)
library(sjlabelled)

d <- data.frame(
  factor  = structure(1:2, .Label = c("0", "1"), class = "factor"   , label = "foo"), 
  logical = structure(0:1,                       class = "logical"  , label = "foo"), 
  numeric = structure(0:1,                       class = "numeric"  , label = "foo"), 
  integer = structure(0:1,                       class = "integer"  , label = "foo"), 
  char    = structure(c("0", "1"),               class = "character", label = "foo")
)

d %>% sapply(attr, "label")
#> $factor
#> [1] "foo"
#> 
#> $logical
#> [1] "foo"
#> 
#> $numeric
#> [1] "foo"
#> 
#> $integer
#> [1] "foo"
#> 
#> $char
#> NULL

d %>% 
  filter (1 == 1) %>% 
  sjlabelled::copy_labels(d) %>% 
  sapply(attr, "label")
#> $factor
#> [1] "foo"
#> 
#> $logical
#> [1] "foo"
#> 
#> $numeric
#> [1] "foo"
#> 
#> $integer
#> [1] "foo"
#> 
#> $char
#> NULL

Created on 2019-03-08 by the reprex package (v0.2.1)

@strengejacke
Contributor

strengejacke commented Mar 8, 2019

(Value labels, i.e. the labels attribute, are also preserved when using copy_labels().)

@mwh3780
Author

mwh3780 commented Mar 8, 2019

Yes and no. While the label and labels attributes are common attributes that are used, often other attributes are also needed. I already have some other tools that are like sjlabelled::copy_labels() but preserve all custom attributes for functions and processes that drop attributes. I'll update my original reprex to show other attributes besides label and labels to avoid this confusion.

To me the key issue at hand is whether the attribute-preserving behavior should continue to be exhibited by dplyr functions as it has in the past. Given the number of other related posts about dplyr's new breaking behavior of dropping attributes, it appears there is a desire in the community for attributes to be preserved.

Also, while base R tools like [ don't preserve attributes, I think it's important to note that other ecosystems such as data.table do preserve custom attributes. I've updated my original post to also reflect this.

library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tibble)
library(data.table)
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last

d <- tibble(
  factor  = structure(1:2, .Label = c("0", "1"), class = "factor"   , label = "foo", attr_1 = "some_value"), 
  logical = structure(0:1,                       class = "logical"  , label = "foo", attr_2 = "some_value"), 
  numeric = structure(0:1,                       class = "numeric"  , label = "foo", attr_3 = "some_value"), 
  integer = structure(0:1,                       class = "integer"  , label = "foo", attr_4 = "some_value"), 
  char    = structure(c("0", "1"),               class = "character", label = "foo", attr_5 = "some_value")
)

### While [ doesn't preserve attributes, it's worth noting that other ecosystems like data.table do preserve custom attributes
dt <- as.data.table(d)

dt[numeric == 0] %>% str
#> Classes 'data.table' and 'data.frame':   1 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 1
#>   ..- attr(*, "label")= chr "foo"
#>   ..- attr(*, "attr_1")= chr "some_value"
#>  $ logical:Class 'logical'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_2")= chr "some_value"
#>  $ numeric:Class 'numeric'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_3")= chr "some_value"
#>  $ integer:Class 'integer'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_4")= chr "some_value"
#>  $ char   :Class 'character'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_5")= chr "some_value"
#>  - attr(*, ".internal.selfref")=<externalptr>
dt[1] %>% str
#> Classes 'data.table' and 'data.frame':   1 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 1
#>   ..- attr(*, "label")= chr "foo"
#>   ..- attr(*, "attr_1")= chr "some_value"
#>  $ logical:Class 'logical'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_2")= chr "some_value"
#>  $ numeric:Class 'numeric'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_3")= chr "some_value"
#>  $ integer:Class 'integer'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_4")= chr "some_value"
#>  $ char   :Class 'character'  atomic [1:1] 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_5")= chr "some_value"
#>  - attr(*, ".internal.selfref")=<externalptr>
dt[order(-numeric)] %>% str
#> Classes 'data.table' and 'data.frame':   2 obs. of  5 variables:
#>  $ factor : Factor w/ 2 levels "0","1": 2 1
#>   ..- attr(*, "label")= chr "foo"
#>   ..- attr(*, "attr_1")= chr "some_value"
#>  $ logical:Class 'logical'  atomic [1:2] 1 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_2")= chr "some_value"
#>  $ numeric:Class 'numeric'  atomic [1:2] 1 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_3")= chr "some_value"
#>  $ integer:Class 'integer'  atomic [1:2] 1 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_4")= chr "some_value"
#>  $ char   :Class 'character'  atomic [1:2] 1 0
#>   .. ..- attr(*, "label")= chr "foo"
#>   .. ..- attr(*, "attr_5")= chr "some_value"
#>  - attr(*, ".internal.selfref")=<externalptr>

Created on 2019-03-08 by the reprex package (v0.2.1)
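
For readers wondering what a tool "like sjlabelled::copy_labels() but for all custom attributes" might look like, here is a minimal base R sketch. The helper name restore_attributes() is hypothetical and does not come from any package; it is not the tool mentioned above. It assumes the custom attributes do not depend on the number of rows, and it leaves alone any attribute the new column already has (for example levels and class on a factor).

```r
# Hypothetical helper; restore_attributes() is not from any package.
restore_attributes <- function(new, old) {
  for (col in intersect(names(new), names(old))) {
    old_attrs <- attributes(old[[col]])
    # Skip length-dependent attributes and anything the result already has.
    skip <- c("names", "dim", "dimnames", names(attributes(new[[col]])))
    for (a in setdiff(names(old_attrs), skip)) {
      attr(new[[col]], a) <- old_attrs[[a]]
    }
  }
  new
}

old <- data.frame(x = 1:3, y = c(10, 20, 30))
attr(old$x, "label")  <- "foo"
attr(old$x, "attr_1") <- "some_value"

# Stand-in for any attribute-dropping step: `[` on the column drops them.
new <- data.frame(x = old$x[2:3], y = old$y[2:3])
attr(new$x, "label")     # NULL

fixed <- restore_attributes(new, old)
attr(fixed$x, "label")   # "foo"
attr(fixed$x, "attr_1")  # "some_value"
```

In a pipeline this would sit right after the step that drops the attributes, in the same position as copy_labels() in the earlier workaround.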

@hadley
Member

hadley commented May 27, 2019

Duplicate of #3923

@hadley hadley marked this as a duplicate of #3923 May 27, 2019
@hadley hadley closed this as completed May 27, 2019
@lock

lock bot commented Nov 23, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 23, 2019